# Similarity Search

In [None]:
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())

Having the embedding of a word brings several benefits. A primary one is the ability to compare how close two words are in meaning. A simple way to do so is to take the **dot product** of their embeddings.

We define the similarity between two embeddings $\vec{v}$ and $\vec{w}$ as their dot product:

$$
\text{Similarity} = \vec{v} \cdot \vec{w}
$$

First, let's embed some words.

In [9]:
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_core.output_parsers import JsonOutputParser

embeddings = DartmouthEmbeddings()
text_1 = "Japan"
text_2 = "Sushi"
text_3 = "Italy"
text_4 = "Pizza"

embed_1 = embeddings.embed_query(text_1)
embed_2 = embeddings.embed_query(text_2)
embed_3 = embeddings.embed_query(text_3)
embed_4 = embeddings.embed_query(text_4)

Now let's compute the dot product, which is the sum of the element-wise products of the two vectors:
$$
\vec{v} \cdot \vec{w} = \sum_{i = 1}^{N} v_i w_i
$$

In [15]:
N = len(embed_1)
similarity_1_2 = []

for i in range(N):
    v = embed_1[i]
    w = embed_2[i]
    similarity_1_2.append(v * w)    
    
print(f'Similarity between {text_1} and {text_2} is {sum(similarity_1_2)}')

Similarity between Japan and Sushi is 0.6915366834825497
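
The loop above spells out the sum term by term. As a minimal sketch, the same quantity can be computed in a single call with NumPy's `np.dot`. The short vectors below are made-up stand-ins for real embeddings (an assumption for illustration), so the example runs without access to the embedding model.

```python
import numpy as np

# Toy vectors standing in for real embeddings (hypothetical values)
v = [0.5, 0.1, 0.4]
w = [0.2, 0.3, 0.6]

# Explicit sum of element-wise products, as in the loop above
loop_result = sum(v_i * w_i for v_i, w_i in zip(v, w))

# Equivalent vectorized computation
numpy_result = float(np.dot(v, w))

print(loop_result, numpy_result)
```

With real embeddings, `np.dot(embed_1, embed_2)` should give the same value as the loop, up to floating-point rounding.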


Let's repeat this for the other pairs of words.

In [17]:
similarity_1_3 = []
similarity_1_4 = []
similarity_2_3 = []
similarity_2_4 = []
similarity_3_4 = []

for i in range(N):  # N is the embedding length defined in the previous cell
    similarity_1_3.append(embed_1[i] * embed_3[i])
    similarity_1_4.append(embed_1[i] * embed_4[i])
    similarity_2_3.append(embed_2[i] * embed_3[i])
    similarity_2_4.append(embed_2[i] * embed_4[i])
    similarity_3_4.append(embed_3[i] * embed_4[i])

print(f'Similarity between {text_1} and {text_3} is {sum(similarity_1_3)}')
print(f'Similarity between {text_1} and {text_4} is {sum(similarity_1_4)}')
print(f'Similarity between {text_2} and {text_3} is {sum(similarity_2_3)}')
print(f'Similarity between {text_2} and {text_4} is {sum(similarity_2_4)}')
print(f'Similarity between {text_3} and {text_4} is {sum(similarity_3_4)}')


Similarity between Japan and Italy is 0.7488653578639409
Similarity between Japan and Pizza is 0.5895998995258541
Similarity between Sushi and Italy is 0.6092699421928527
Similarity between Sushi and Pizza is 0.780035362735349
Similarity between Italy and Pizza is 0.7009643386868145
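
Keeping one list per pair gets unwieldy as more words are added. One possible alternative, sketched here with made-up 3-dimensional vectors and placeholder labels rather than real embeddings, stacks the vectors into a matrix and computes all pairwise dot products at once.

```python
import numpy as np

# Made-up unit-length vectors standing in for embeddings (hypothetical values)
labels = ["a", "b", "c", "d"]
E = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.8, 0.0, 0.6],
    [0.0, 0.6, 0.8],
])

# Entry (i, j) of S is the dot product of row i and row j
S = E @ E.T

# Print each unordered pair once (upper triangle, excluding the diagonal)
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"Similarity between {labels[i]} and {labels[j]} is {S[i, j]:.2f}")
```

For the words in this recipe, the matrix could be built as `np.array([embed_1, embed_2, embed_3, embed_4])`.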


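From these results, we can see that Japan and Sushi (0.69) are about as similar as Italy and Pizza (0.70). The cross pairs, Sushi and Italy (0.61) and Japan and Pizza (0.59), score lower and are roughly equal to each other. Japan and Italy are very similar (0.75), perhaps because they are both countries, and Sushi and Pizza score highest of all (0.78), perhaps because they are both foods.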
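
One caveat when comparing scores like these: the dot product depends on the length of the vectors as well as on their direction. Whether the embedding model used here returns unit-length vectors is not something this recipe establishes, so the following is only a sketch of how the length could be factored out. It uses cosine similarity, which divides the dot product by the product of the two vector norms. The helper name `cosine_similarity` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(v, w):
    """Dot product of v and w divided by the product of their lengths."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

# Two toy vectors pointing in the same direction, one twice as long
v = [3.0, 4.0]
w = [6.0, 8.0]

dot = float(np.dot(v, w))
cos = cosine_similarity(v, w)
print(dot, cos)
```

Here the dot product is 50 while the cosine similarity is 1, because the two vectors point in exactly the same direction. For vectors that already have unit length, the two measures coincide.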

<div class="alert alert-info">

**Note:** This is one place where *bias* can enter machine learning models. These results do not mean that you can't get good sushi in Italy or good pizza in Japan. They simply mean that, in the training data for this embedding model, these words generally appeared close to one another.
</div>



## Visualizing an embedding
A better way to understand embeddings is to visualize them. Let's generate some words related to different domains and find their embeddings.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import umap
from langchain_dartmouth.llms import ChatDartmouth


llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()

chain = llm | parser

response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. The food domain should contain tomato. "
    "Return the words in JSON format, using the domain as the key, and the words as values. "
)

In [None]:
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")

embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
words["embedding"] = embeddings.embed_documents(words["word"])
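
To make the reshaping step concrete, here is a small sketch of what `from_dict` followed by `melt` does. The dictionary below is a hypothetical, hand-written stand-in for the parsed model response (the real output will differ), with one key per domain and a list of words as each value.

```python
import pandas as pd

# Hypothetical parser output: one key per domain, equally long word lists
response = {
    "animals": ["cat", "dog"],
    "finance": ["bank", "loan"],
    "food": ["tomato", "bread"],
}

# Wide table (one column per domain) -> long table (one row per word)
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")
print(words)
```

Note that this approach assumes every domain has the same number of words. If the lists differ in length, `pd.DataFrame.from_dict` raises a `ValueError`, so the model response may need checking first.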


<div class="alert alert-info">

**Note:** It is difficult to visualize a 1024-dimensional vector, as we're not 1024-dimensional humans! One way around this is to use [UMAP](https://umap-learn.readthedocs.io/en/latest/) (Uniform Manifold Approximation and Projection) to represent each high-dimensional vector as a 2-dimensional one.
</div>



In [18]:
import umap

embeddings_list = words["embedding"].to_list()
mapper = umap.UMAP().fit(embeddings_list)
umap_embeddings = pd.DataFrame(mapper.transform(embeddings_list), columns=["UMAP_x", "UMAP_y"])

words = pd.concat([words, umap_embeddings], axis=1)

words.sample(3)


In [None]:

# Create a scatter plot using matplotlib
for i in words["domain"].unique():
    subset = words[words["domain"] == i]
    plt.scatter(subset["UMAP_x"], subset["UMAP_y"], label=i)
    
    # Add the text labels
    for j in range(len(subset)):
        plt.text(
            subset["UMAP_x"].iloc[j],
            subset["UMAP_y"].iloc[j] + 0.12,
            subset["word"].iloc[j],
            horizontalalignment="center",
            verticalalignment="center",
            fontsize=6,
        )

# Add legend and show plot
plt.legend()

plt.xlabel("X")
plt.ylabel("Y")
plt.title("UMAP Projection of Words")
plt.tight_layout()  # adjust the layout after labels and title have been added
plt.show()

We can see that words related to food, animals, and finance each form groups whose members are somewhat close to each other. This lets us judge the similarity between different words at a glance.
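
The same idea can be turned into a simple search: score every stored vector against a query vector and keep the best matches. The sketch below assumes made-up 2-dimensional vectors and placeholder labels instead of real embeddings, and the helper `most_similar` is a hypothetical name introduced for illustration.

```python
import numpy as np

def most_similar(query, vectors, labels, k=2):
    """Return the k labels whose vectors have the largest dot product with query."""
    scores = np.asarray(vectors, dtype=float) @ np.asarray(query, dtype=float)
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(labels[i], float(scores[i])) for i in order]

# Toy data standing in for word embeddings (hypothetical values)
labels = ["w0", "w1", "w2", "w3"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query = [0.6, 0.8]

result = most_similar(query, vectors, labels)
print(result)
```

With the DataFrame from this recipe, `vectors` could be `words["embedding"].to_list()` and `labels` could be `words["word"].to_list()`, with the query coming from `embeddings.embed_query`.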

## Summary

This recipe showed how to find the similarity between two embeddings using the dot product. Visualizing embeddings can be a good way to represent this similarity. UMAP can project high-dimensional embeddings onto a 2D plane, where we can inspect them easily.