# Advanced Embeddings

In [None]:
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())

# Similarity

Having the embedding of a word brings several benefits. A primary one is the ability to compare how close two words are in meaning. A simple way of doing so is to take the **normalized dot product** of their embeddings.

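We start from the dot product of the two embeddings and then divide it by the product of the two vectors' lengths. That way, the result depends only on the direction of the vectors, so a vector that happens to be long (for example, because the word is really common) won't skew the similarity. This quantity is also known as the cosine similarity.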

$$
\text{Similarity} = \frac{v\cdot w}{|v||w|}
$$
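
As a quick check of the formula, here is a worked example with two made-up three-dimensional vectors (real embeddings have many more dimensions). Take $v = (1, 2, 2)$ and $w = (2, 1, 2)$:

$$
v\cdot w = 1\cdot 2 + 2\cdot 1 + 2\cdot 2 = 8, \qquad |v| = \sqrt{1 + 4 + 4} = 3, \qquad |w| = \sqrt{4 + 1 + 4} = 3
$$

$$
\text{Similarity} = \frac{8}{3 \times 3} \approx 0.89
$$

Scaling $w$ by 10 multiplies both the dot product and $|w|$ by 10, so the similarity is unchanged: $\frac{80}{3 \times 30} \approx 0.89$.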

First, let's embed two words:

In [None]:
from langchain_dartmouth.embeddings import DartmouthEmbeddings

embeddings = DartmouthEmbeddings()
text_1 = "Japan"
text_2 = "Pizza"

embed_1 = embeddings.embed_query(text_1)
embed_2 = embeddings.embed_query(text_2)


Next, let's compute the dot product:
$$
v\cdot w = \sum_{i = 1}^N(v_i \times w_i)
$$

In [None]:

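# Both embeddings come from the same model, so they share one length
length_of_vector = len(embed_1)

multiplied_list = []
for i in range(length_of_vector):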
    multiplied_list.append(embed_1[i] * embed_2[i])

dot_product = sum(multiplied_list)
print(f'The dot product of the two vectors is: {dot_product:.2f}')

In [None]:
embed_1_size = []
embed_2_size = []

for j in range(len(embed_2)): 
    embed_1_size.append(embed_1[j] * embed_1[j])
    embed_2_size.append(embed_2[j] * embed_2[j])

sum_embed_1_size = sum(embed_1_size)**0.5
sum_embed_2_size = sum(embed_2_size)**0.5

normalized_size = sum_embed_1_size * sum_embed_2_size

print(f'The product of the two vector lengths is: {normalized_size:.2f}')

Putting this together, we obtain the similarity:

In [None]:
similarity = dot_product / normalized_size
print(f"Similarity between '{text_1}' and '{text_2}': {similarity:.2f}")
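
The steps above can also be folded into a single helper. The following is a minimal sketch using only the Python standard library. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of `langchain_dartmouth`.

```python
import math


def cosine_similarity(v, w):
    """Return the normalized dot product of two equal-length vectors."""
    if len(v) != len(w):
        raise ValueError("Vectors must have the same length")
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)


# Toy vectors (made up for illustration, not real embeddings)
parallel = cosine_similarity([3.0, 4.0], [6.0, 8.0])     # same direction
orthogonal = cosine_similarity([3.0, 4.0], [4.0, -3.0])  # perpendicular
print(f"parallel: {parallel:.2f}, orthogonal: {orthogonal:.2f}")
```

With real embeddings, the same helper could be called as `cosine_similarity(embed_1, embed_2)`.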

## Visualizing an embedding
A better way to understand an embedding is to visualize it. Let's generate some words from a few different domains and compute their embeddings.

In [None]:
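from langchain_core.output_parsers import JsonOutputParser
from langchain_dartmouth.llms import ChatDartmouth

llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()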

chain = llm | parser

response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. The food domain should contain tomato. "
    "Return the words in JSON format, using the domain as the key, and the words as values."
)

In [None]:
import pandas as pd

# Reshape the response into one row per (domain, word) pair
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")

embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
words["embedding"] = embeddings.embed_documents(words["word"].to_list())

It is difficult to visualize a 1024-dimensional vector, as we're not 1024-dimensional humans! One way to get around this is to use [UMAP](https://umap-learn.readthedocs.io/en/latest/) (Uniform Manifold Approximation and Projection) to represent each large vector as a 2-dimensional one, which can then be plotted as follows.

In [None]:
import numpy as np
import umap

# Stack the embeddings into an (n_words, n_dimensions) array and fit UMAP
embedding_matrix = np.array(words["embedding"].to_list())
mapper = umap.UMAP().fit(embedding_matrix)
umap_embeddings = pd.DataFrame(mapper.transform(embedding_matrix), columns=["UMAP_x", "UMAP_y"])

# merge with the words
words = pd.concat([words, umap_embeddings], axis=1)
words.head(1)

In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot using matplotlib
for i in words["domain"].unique():
    subset = words[words["domain"] == i]
    plt.scatter(subset["UMAP_x"], subset["UMAP_y"], label=i)
    
    # Add the text labels
    for j in range(len(subset)):
        plt.text(
            subset["UMAP_x"].iloc[j],
            subset["UMAP_y"].iloc[j] + 0.12,
            subset["word"].iloc[j],
            horizontalalignment="center",
            verticalalignment="center",
            fontsize=6,
        )

# Add legend, axis labels, and title, then show the plot
plt.legend()
plt.xlabel("X")
plt.ylabel("Y")
plt.title("UMAP Projection of Words")
# Call tight_layout last so it accounts for the labels and title
plt.tight_layout()
plt.show()

We can see that words from the same domain (food, animals, or finance) tend to land close to each other. This is what lets us use embeddings to measure the similarity between different words.
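
One way to use this property is to rank words by their similarity to a query word. The sketch below does so with hand-made three-dimensional vectors. The values in `toy_embeddings` and the helper name `most_similar` are hypothetical and chosen only for illustration, so real embeddings will give different scores.

```python
import math


def cosine_similarity(v, w):
    """Normalized dot product of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))


# Hypothetical toy vectors, NOT produced by an embedding model
toy_embeddings = {
    "tomato": [0.9, 0.1, 0.0],
    "pizza": [0.8, 0.2, 0.1],
    "tiger": [0.1, 0.9, 0.0],
    "stock": [0.0, 0.1, 0.9],
}


def most_similar(query, embeddings_by_word, top_k=2):
    """Return the top_k (word, score) pairs most similar to the query word."""
    query_vec = embeddings_by_word[query]
    scores = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings_by_word.items()
        if word != query  # skip the query itself
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


neighbors = most_similar("tomato", toy_embeddings)
for word, score in neighbors:
    print(f"{word}: {score:.2f}")
```

In these toy vectors, "pizza" points in almost the same direction as "tomato", so it ranks first.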

# Embed 
`embed_documents` lets us take advantage of LangChain's [document loaders](https://python.langchain.com/docs/integrations/document_loaders/), which read various data formats into a `Document` class whose text content can then be embedded.

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document

file_path = './rag_documents/asteroids.txt'

text_loader = TextLoader(file_path, encoding='utf-8')
loaded_documents = text_loader.load()
document = loaded_documents[0]

words = document.page_content.split()
print('Number of words in the document: ', len(words))