# Advanced Embeddings

In [None]:
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())

# Similarity

There are several benefits to having the embedding of a word. A primary one is that it lets us compare how close two words are in meaning. One simple way of doing so is by taking the **normalized dot product**, that is, the dot product of the two embeddings after each has been scaled to unit length.

For embeddings that already have unit length, the similarity is simply their dot product:

$$
\text{Similarity} = \vec{v} \cdot \vec{w}
$$
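
To make the normalization step concrete, here is a minimal sketch of a normalized dot product (also known as cosine similarity). The vectors are made-up 3-dimensional examples rather than real embeddings, and `cosine_similarity` is a hypothetical helper, not part of `langchain_dartmouth`.

```python
import math


def cosine_similarity(v, w):
    """Normalized dot product of two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


# Toy vectors chosen for illustration only
v = [1.0, 2.0, 2.0]
w = [2.0, 4.0, 4.0]   # same direction as v, twice as long
u = [2.0, -1.0, 0.0]  # orthogonal to v

print(f"v vs. w: {cosine_similarity(v, w):.2f}")
print(f"v vs. u: {cosine_similarity(v, u):.2f}")
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score close to 0.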

First, let's embed two words:

In [None]:
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_core.output_parsers import JsonOutputParser

embeddings = DartmouthEmbeddings()
text_1 = "Japan"
text_2 = "Pizza"

embed_1 = embeddings.embed_query(text_1)
embed_2 = embeddings.embed_query(text_2)


Now let's compute the dot product by summing the element-wise products over all $N$ dimensions:

$$
\vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i
$$

In [None]:

# Multiply the two embeddings element by element
multiplied_list = []
length_of_vector = len(embed_1)

for i in range(length_of_vector):
    multiplied_list.append(embed_1[i] * embed_2[i])

# Sum the element-wise products to get the dot product
dot_product = sum(multiplied_list)
print(f"Similarity between '{text_1}' and '{text_2}': {dot_product:.2f}")
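
The same computation can be written more compactly with `zip`, and reused to rank several words against a query. This is only a sketch: the vectors are invented 3-dimensional stand-ins rather than output from `DartmouthEmbeddings`, and `dot` is a hypothetical helper.

```python
def dot(v, w):
    """Dot product of two equal-length vectors (hypothetical helper)."""
    return sum(a * b for a, b in zip(v, w))


# Invented toy vectors standing in for real embeddings
toy_embeddings = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "stock": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # a made-up "food-like" direction

# Sort the words from most to least similar to the query
ranked = sorted(
    toy_embeddings,
    key=lambda word: dot(query, toy_embeddings[word]),
    reverse=True,
)
print(ranked)
```

With real embeddings, the dictionary values would be replaced by the vectors returned from the embedding model.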

## Visualizing an embedding
A better way to understand an embedding is to visualize it. Let's generate some words from different domains and find their embeddings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import umap
from langchain_dartmouth.llms import ChatDartmouth


llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()

chain = llm | parser

response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. The food domain should contain tomato. "
    "Return the words in JSON format, using the domain as the key, and the words as values. "
)

In [None]:
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")

embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
# embed_documents expects a list of strings, so convert the column to a list
words["embedding"] = embeddings.embed_documents(words["word"].to_list())


<div class="alert alert-info">

**Note:** It is difficult to visualize a 1024-dimensional vector, as we're not 1024-dimensional humans! One way to get around this is to use [UMAP](https://umap-learn.readthedocs.io/en/latest/) (Uniform Manifold Approximation and Projection) to project this large vector down to a 2-dimensional one.
</div>



In [None]:
import umap

embeddings_list = words["embedding"].to_list()
mapper = umap.UMAP().fit(embeddings_list)
umap_embeddings = pd.DataFrame(mapper.transform(embeddings_list), columns=["UMAP_x", "UMAP_y"])

words = pd.concat([words, umap_embeddings], axis=1)

words.sample(3)

In [None]:

# Create a scatter plot using matplotlib
for i in words["domain"].unique():
    subset = words[words["domain"] == i]
    plt.scatter(subset["UMAP_x"], subset["UMAP_y"], label=i)
    
    # Add the text labels
    for j in range(len(subset)):
        plt.text(
            subset["UMAP_x"].iloc[j],
            subset["UMAP_y"].iloc[j] + 0.12,
            subset["word"].iloc[j],
            horizontalalignment="center",
            verticalalignment="center",
            fontsize=6,
        )

# Add legend, axis labels, and title
plt.legend()
plt.xlabel("X")
plt.ylabel("Y")
plt.title("UMAP Projection of Words")

# Adjust the layout once all elements are in place, then show the plot
plt.tight_layout()
plt.show()

We can see that words related to food, animals, and finance form groups, and that words within the same group lie somewhat close to each other. This lets us judge the similarity between different words at a glance.
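
The visual impression of grouping can also be checked numerically by comparing the average similarity within a domain to the average similarity across domains. The sketch below assumes invented 2D points in place of real embeddings, and `cosine_similarity` is a hypothetical helper. With real data, it is probably safer to run this on the original embeddings rather than on the UMAP coordinates, since the projection does not preserve all distances.

```python
import math
from itertools import combinations


def cosine_similarity(v, w):
    """Normalized dot product of two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.hypot(*v) * math.hypot(*w))


# Invented 2D points standing in for embedded words
points = [
    ("food", [1.0, 0.1]),
    ("food", [0.9, 0.2]),
    ("finance", [0.1, 1.0]),
    ("finance", [0.2, 0.9]),
]

# Collect pairwise similarities, split by whether the domains match
within, between = [], []
for (domain_a, vec_a), (domain_b, vec_b) in combinations(points, 2):
    similarity = cosine_similarity(vec_a, vec_b)
    if domain_a == domain_b:
        within.append(similarity)
    else:
        between.append(similarity)

mean_within = sum(within) / len(within)
mean_between = sum(between) / len(between)
print(f"within-domain: {mean_within:.2f}, between-domain: {mean_between:.2f}")
```

A clearly higher within-domain average is consistent with the clusters seen in the plot.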

# Summary

This recipe showed how to find the similarity between two embeddings. Visualizing embeddings can be a good way to represent this similarity. UMAP can be used to project high-dimensional embeddings onto a 2D plane, where they are easy to inspect.