# Similarity Search

Having an embedding for a word offers several benefits. A primary one is the ability to compare how close two words are in meaning. A simple way to do so is to take the **dot product** of their embeddings:

$$
\text{Similarity} = \vec{v} \cdot \vec{w}
$$

First, let's embed some words, as we learned to do in the previous recipe:

In [None]:
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())

In [None]:
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_core.output_parsers import JsonOutputParser

embeddings = DartmouthEmbeddings()
text_1 = "Japan"
text_2 = "Sushi"
text_3 = "Italy"
text_4 = "Pizza"

embed_1 = embeddings.embed_query(text_1)
embed_2 = embeddings.embed_query(text_2)
embed_3 = embeddings.embed_query(text_3)
embed_4 = embeddings.embed_query(text_4)

Now let's compute the dot product, which sums the element-wise products of the two vectors:
$$
\vec{v} \cdot \vec{w} = \sum_{i = 1}^N v_i w_i
$$

In [None]:
N = len(embed_1)
similarity_1 = []

for i in range(N):
    v = embed_1[i]
    w = embed_2[i]
    similarity_1.append(v * w)    
    
print(f'Similarity between {text_1} and {text_2} is {sum(similarity_1)}')
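
The explicit loop above mirrors the formula term by term. As a minimal sketch, the same sum can be computed in one call with NumPy's `np.dot`. The two short vectors below are made up for illustration and are not real embeddings:

```python
import numpy as np

# Toy 3-dimensional vectors, made up for illustration
# (real embeddings have many more dimensions)
v = [1.0, 2.0, 3.0]
w = [4.0, 5.0, 6.0]

# Explicit sum of element-wise products, as in the formula
loop_result = sum(v_i * w_i for v_i, w_i in zip(v, w))

# The same computation with NumPy
numpy_result = float(np.dot(v, w))

print(loop_result, numpy_result)
```

With real embeddings, `np.dot(embed_1, embed_2)` should give the same value as the loop, up to floating-point rounding.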

Let's repeat this for the other word pairs:

In [None]:
similarity_2 = []
similarity_3 = []
similarity_4 = []
similarity_5 = []
similarity_6 = []

for i in range(N): 
    similarity_2.append(embed_1[i] * embed_3[i])
    similarity_3.append(embed_1[i] * embed_4[i])
    similarity_4.append(embed_2[i] * embed_3[i])
    similarity_5.append(embed_2[i] * embed_4[i])
    similarity_6.append(embed_3[i] * embed_4[i])

print(f'Similarity between {text_1} and {text_3} is {sum(similarity_2)}')
print(f'Similarity between {text_1} and {text_4} is {sum(similarity_3)}')
print(f'Similarity between {text_2} and {text_3} is {sum(similarity_4)}')
print(f'Similarity between {text_2} and {text_4} is {sum(similarity_5)}')
print(f'Similarity between {text_3} and {text_4} is {sum(similarity_6)}')
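
Keeping one list per pair gets unwieldy as the number of words grows. One possible alternative, sketched here under the assumption that all vectors have the same length, is to stack them into a matrix and multiply it by its transpose, which yields every pairwise dot product at once. The labels and 2-dimensional vectors below are hypothetical stand-ins, not real embeddings:

```python
import numpy as np

# Hypothetical labels and toy 2-dimensional vectors, for illustration only
labels = ["a", "b", "c", "d"]
toy_vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [0.6, 0.8],
])

# Entry [i, j] is the dot product of vector i and vector j
similarity_matrix = toy_vectors @ toy_vectors.T

# Print each unordered pair once (upper triangle, excluding the diagonal)
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"Similarity between {labels[i]} and {labels[j]} is {similarity_matrix[i, j]:.2f}")
```

To apply this to the embeddings from this recipe, `toy_vectors` could be replaced with `np.array([embed_1, embed_2, embed_3, embed_4])` and `labels` with the corresponding texts.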


From this, we observe that **Japan** and *Sushi* share a similarity comparable to that of **Italy** and *Pizza*. Likewise, **Italy** and *Sushi* as well as **Japan** and *Pizza* exhibit similar levels of association. Interestingly, **Japan** and **Italy** also demonstrate a high degree of similarity, likely because both are countries.

<div class="alert alert-info">

**Note:** This is one way *bias* enters machine learning models. These results do not mean that you can't get good sushi in Italy or good pizza in Japan. They simply mean that, in the training data for this embedding model, these words generally appeared close to one another.
</div>



## Visualizing Similarity
A better way to understand embeddings is to visualize them. Let's generate some words from different domains and compute their embeddings. The [building chains](./08-building-chains.ipynb) notebook introduced the idea of a pipeline. We use one here to have the LLM generate a set of test words and to parse its output.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import umap
from langchain_dartmouth.llms import ChatDartmouth


llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()

chain = llm | parser

response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. The food one should contain tomato. "
    "Return the words in JSON format, using the domain as the key, and the words as values. "
)

In [None]:
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")

embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
words["embedding"] = embeddings.embed_documents(words["word"].to_list())
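
To make the reshaping step concrete, here is a small sketch of what `from_dict(...).melt(...)` does. The dictionary `toy_response` is a made-up stand-in for the parsed LLM output, assuming every domain maps to a list of the same length:

```python
import pandas as pd

# Hypothetical parsed response: one key per domain, equal-length word lists
toy_response = {
    "animals": ["cat", "dog"],
    "finance": ["bond", "loan"],
    "food": ["tomato", "bread"],
}

# from_dict creates one column per domain;
# melt turns that wide table into one row per (domain, word) pair
toy_words = pd.DataFrame.from_dict(toy_response).melt(
    var_name="domain", value_name="word"
)

print(toy_words)
```

If the model returns lists of different lengths, `pd.DataFrame.from_dict` raises a `ValueError`, so it may be worth inspecting `response` before reshaping it.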


<div class="alert alert-info">

**Note:** It is difficult to visualize a 1024-dimensional vector, as we're not 1024-dimensional humans! One way around this is to use [UMAP](https://umap-learn.readthedocs.io/en/latest/) (Uniform Manifold Approximation and Projection) to represent each high-dimensional vector as a 2-dimensional one.
</div>



In [None]:
import umap

embeddings_list = words["embedding"].to_list()
mapper = umap.UMAP().fit(embeddings_list)
umap_embeddings = pd.DataFrame(mapper.transform(embeddings_list), columns=["UMAP_x", "UMAP_y"])

words = pd.concat([words, umap_embeddings], axis=1)

words.sample(3)

In [None]:

# Create a scatter plot using matplotlib
for i in words["domain"].unique():
    subset = words[words["domain"] == i]
    plt.scatter(subset["UMAP_x"], subset["UMAP_y"], label=i)
    
    # Add the text labels
    for j in range(len(subset)):
        plt.text(
            subset["UMAP_x"].iloc[j],
            subset["UMAP_y"].iloc[j] + 0.12,
            subset["word"].iloc[j],
            horizontalalignment="center",
            verticalalignment="center",
            fontsize=6,
        )

# Add legend and show plot
plt.legend()

plt.xlabel("X")
plt.ylabel("Y")
plt.title("UMAP Projection of Words")
# Adjust the layout after labels and title are set so they are accounted for
plt.tight_layout()
plt.show()

We can see that words from the same domain (food, animals, or finance) tend to sit somewhat close to each other. This lets us visualize the similarities we saw before.

# Summary

This recipe showed how to find the similarity between two embeddings. Visualizing embeddings can be a good way to represent this similarity. UMAP can be used to represent high-dimensional embeddings in a 2D plane, so we can easily visualize embeddings, and see their similarities.