#### Visualizing Embeddings 
Embeddings are a powerful tool for many downstream machine learning applications. Visualizing them helps us understand how they behave and makes it easier to debug them.

In [1]:
import vertexai

vertexai.init()

In [2]:
# Import the text embedding model class from the Vertex AI SDK
from vertexai.language_models import TextEmbeddingModel

In [3]:
in_1 = "Missing flamingo discovered at swimming pool"

in_2 = "Sea otter spotted on surfboard by beach"

in_3 = "Baby panda enjoys boat ride"


in_4 = "Breakfast themed food truck beloved by all!"

in_5 = "New curry restaurant aims to please!"


in_6 = "Python developers are wonderful people"

in_7 = "TypeScript, C++ or Java? All are great!" 


input_text_lst_news = [in_1, in_2, in_3, in_4, in_5, in_6, in_7]

In [4]:
import numpy as np
from vertexai.language_models import TextEmbeddingModel

embedding_model = TextEmbeddingModel.from_pretrained(
    "textembedding-gecko@001")

GoogleAuthError: 
Unable to authenticate your request.
Depending on your runtime environment, you can complete authentication by:
- if in local JupyterLab instance: `!gcloud auth login` 
- if in Colab:
    -`from google.colab import auth`
    -`auth.authenticate_user()`
- if in service account or other: please follow guidance in https://cloud.google.com/docs/authentication

In [None]:
embeddings = []
for text in input_text_lst_news:
    embedding = embedding_model.get_embeddings([text])
    embeddings.append(embedding[0].values)
    
embeddings_array = np.array(embeddings)

In [None]:
print("Shape: " + str(embeddings_array.shape))
print(embeddings_array)

#### Reduce embeddings from 768 to 2 dimensions for visualization
- We'll use principal component analysis (PCA).
- You can learn more about PCA in [this video](https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning/lecture/73zWO/reducing-the-number-of-features-optional) from the Machine Learning Specialization. 

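We use algorithms like PCA to compress the embedding dimension while keeping as much of the variance in the data as possible. Some information is always lost in the projection, so the 2D plot is an approximation of how the embeddings relate to each other in the original 768-dimensional space.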
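
To make the idea of "keeping most of the variance" concrete, here is a minimal sketch of PCA written with NumPy's SVD. The helper name `pca_project` and the random stand-in data are assumptions made for illustration; they are not part of the notebook or of scikit-learn, and the numbers it prints say nothing about real embeddings.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project the rows of `data` onto the top principal components via SVD.

    Returns the projected points and the fraction of the total variance
    captured by each kept component. (Hypothetical helper for illustration.)
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2  # proportional to the variance along each direction
    explained_ratio = variance / variance.sum()
    projected = centered @ vt[:n_components].T
    return projected, explained_ratio[:n_components]

# Random stand-in for 7 embeddings with 768 dimensions (not real embeddings)
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(7, 768))

points_2d, ratio = pca_project(fake_embeddings, n_components=2)
print("Shape:", points_2d.shape)
print("Fraction of variance kept by 2 components:", ratio.sum())
```

With real data, a fitted scikit-learn `PCA` object exposes the same quantity as `explained_variance_ratio_`, which is a quick way to check how much of the original structure the 2D picture retains.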

In [None]:
from sklearn.decomposition import PCA

# Fit PCA on the embeddings and project them onto the first two principal components
pca_model = PCA(n_components=2)
new_values = pca_model.fit_transform(embeddings_array)

In [None]:
print("Shape: " + str(new_values.shape))
print(new_values)

In [None]:
import matplotlib.pyplot as plt
import mplcursors
%matplotlib ipympl

from utils import plot_2D
plot_2D(new_values[:,0], new_values[:,1], input_text_lst_news)

#### Embeddings and Similarity
- Plot a heat map to compare the embeddings of sentences that are similar and sentences that are dissimilar.

In [None]:
in_1 = """He couldn’t desert 
          his post at the power plant."""

in_2 = """The power plant needed 
          him at the time."""

in_3 = """Cacti are able to 
          withstand dry environments.""" 

in_4 = """Desert plants can 
          survive droughts.""" 

input_text_lst_sim = [in_1, in_2, in_3, in_4]

In [None]:
embeddings = []
for input_text in input_text_lst_sim:
    emb = embedding_model.get_embeddings([input_text])[0].values
    embeddings.append(emb)
    
embeddings_array = np.array(embeddings)

In [None]:
from utils import plot_heatmap

y_labels = input_text_lst_sim

# Plot the heatmap
plot_heatmap(embeddings_array, y_labels=y_labels, title="Embeddings Heatmap")

Note: the heat map won't show everything because there are 768 columns to show. To adjust the heat map with your mouse:
- Hover your mouse over the heat map. Buttons will appear on the left of the heat map. Click the button with the vertical and horizontal double arrows (they look like axes).
- Left-click and drag to move the heat map left and right.
- Right-click and drag up to zoom in.
- Right-click and drag down to zoom out.

#### Compute cosine similarity
- The `cosine_similarity` function expects a 2D array, which is why we'll wrap each embedding list inside another list.
- You can verify that sentences 1 and 2 have a higher similarity than sentences 1 and 4, even though sentences 1 and 4 both contain the words "desert" and "plant".
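
For intuition about what the score measures, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction and not on magnitude. The sketch below computes it by hand in plain Python; the function name `cosine` and the three-dimensional toy vectors are made up for illustration and are not embeddings from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration (not real embeddings)
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # perpendicular to a (their dot product is zero)

print(cosine(a, b))  # close to 1: same direction
print(cosine(a, c))  # 0: no shared direction
```

Because only direction matters, two sentences can score as similar even when their embedding vectors differ in length.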

In [5]:
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
def compare(embedding, idx1, idx2):
    # Wrap each vector in a list so cosine_similarity receives 2D inputs
    return cosine_similarity([embedding[idx1]], [embedding[idx2]])

In [None]:
print(in_1)
print(in_2)
print(compare(embeddings,0,1))

In [None]:
print(in_1)
print(in_4)
print(compare(embeddings,0,3))