# Embeddings

At its heart, a large language model predicts the most likely string of words to follow an input query. Machine learning models struggle to work with data that is not in a numeric format, so our first task is to represent words in a way that computers can work with. However, words are complicated. One way we can simplify this is to find words that mean similar things. For example, "stupendous" and "nice" mean similar things (a positive reaction) but at different intensities.

Imagine we have two knobs we can turn. One represents "niceness" and one represents "intensity". The "niceness" setting for "nice" and "stupendous" might be very similar, but the "intensity" setting might be different. Similarly, we can imagine that the settings for words like "horrible" and "terrible" are similar to each other. With this intuition, we can represent any word as a specific configuration of knobs, twisting and turning each one to get the best match. The examples in this section use embeddings with 1024 dimensions (or 1024 different knobs).
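
To make the knob picture concrete, here is a minimal sketch with two hand-picked knobs. The numbers are invented for illustration and do not come from any real embedding model; the `knobs` dictionary and `distance` helper are hypothetical names used only here.

```python
import numpy as np

# Hypothetical two-knob settings: [niceness, intensity], each between -1 and 1.
# These values are made up for illustration only.
knobs = {
    "nice":       np.array([0.8, 0.2]),
    "stupendous": np.array([0.9, 0.9]),
    "horrible":   np.array([-0.8, 0.8]),
    "terrible":   np.array([-0.8, 0.7]),
}

def distance(a, b):
    """Euclidean distance between the knob settings of two words."""
    return float(np.linalg.norm(knobs[a] - knobs[b]))

print("horrible vs terrible:", round(distance("horrible", "terrible"), 2))
print("nice vs stupendous:  ", round(distance("nice", "stupendous"), 2))
print("nice vs horrible:    ", round(distance("nice", "horrible"), 2))
```

With these toy settings, "horrible" and "terrible" sit almost on top of each other, "nice" and "stupendous" differ mainly on the intensity knob, and "nice" and "horrible" are far apart.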

Another approach is to take a word, such as "Queen", and calculate how probable it is for another word to appear near it. For example, "Elizabeth", "King", and "Buckingham" may all be more likely to appear around the word "Queen" than something like "bulldozer". As humans, we understand this intuitively. For computers, it is a lot more difficult. One way to deal with this issue is to use "embeddings", which are a way of representing text as numbers.

In [None]:
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())

# Creating an Embedding

Let's find an embedding for a word of our choosing. We will be looking into static embeddings, which are embeddings that have already been assigned to words ahead of time. Through different algorithms that analyze the probability of a word given its semantic meaning and the context around it, each word is given a specific set of numbers, otherwise known as a "vector".

In [None]:
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_dartmouth.llms import ChatDartmouth

from langchain_core.output_parsers import JsonOutputParser, ListOutputParser

import numpy as np
import pandas as pd
import umap
import matplotlib.pyplot as plt

In [None]:
embeddings = DartmouthEmbeddings()
tiger = embeddings.embed_query("tiger")
print(tiger)
print("Length of tiger: ", len(tiger))

We see that the word "tiger" is represented by 1024 numbers. This means that the numeric representation of the word "tiger" consists of 1024 dimensions for this particular embedding model. Other models may use fewer or more numbers to represent a word. 

There are several benefits to having the embedding of a word. A primary one is that it lets us compare how close two words are in meaning. One simple way of doing so is to take the dot product of the two embeddings, where a larger value suggests more similar words. For example:

In [None]:
lion = embeddings.embed_query("lion")
eggs = embeddings.embed_query("eggs")

print("Similarity between tiger and lion: ", np.dot(tiger, lion).round(2))
print("Similarity between tiger and eggs:", np.dot(tiger, eggs).round(2))
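
The dot product grows with the length of the vectors as well as with how closely they point in the same direction. If an embedding model does not return unit-length vectors (we have not checked this for the model used here), cosine similarity is a common alternative that measures direction only. A minimal sketch, using made-up 3-dimensional vectors rather than real embeddings; `cosine_similarity` is a helper defined here, not part of `langchain_dartmouth`:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b after scaling both to unit length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (not real embeddings): v2 points the same way as v1 but is ten times longer.
v1 = np.array([1.0, 2.0, 3.0])
v2 = 10 * v1
v3 = np.array([3.0, -1.0, 0.0])

print("dot(v1, v2):   ", np.dot(v1, v2))
print("cosine(v1, v2):", round(cosine_similarity(v1, v2), 2))
print("cosine(v1, v3):", round(cosine_similarity(v1, v3), 2))
```

The same function could be applied to `tiger`, `lion`, and `eggs` from the cell above in place of `np.dot`.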

## Embedding a query
A better way to understand an embedding is to visualize it. Let's generate some random words related to different domains and find their embeddings.

In [None]:
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()

chain = llm | parser

response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. The food domain should contain tomato. "
    "Return the words in JSON format, using the domain as the key, and the words as values. "
)

In [None]:
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")

embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
words["embedding"] = embeddings.embed_documents(words["word"])

print(len(words["embedding"][0]))

It is difficult to visualize a 1024-dimensional vector, as we're not 1024-dimensional humans! One way to get around this is to use UMAP (Uniform Manifold Approximation and Projection) to represent each large vector as a 2-dimensional one. This can then be plotted as follows.

In [None]:
mapper = umap.UMAP().fit(np.array(words["embedding"].to_list()))

umap_embeddings = pd.DataFrame(mapper.transform(np.array(words["embedding"].to_list())), columns=["UMAP_x", "UMAP_y"])
# merge with the words
words = pd.concat([words, umap_embeddings], axis=1)
words.head(1)
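
UMAP is a nonlinear method, and its output can vary between runs unless a random seed is fixed. For comparison, the sketch below performs the same kind of reduction to two dimensions with plain PCA (principal component analysis) computed through NumPy's SVD. This is a different, linear technique and not what the notebook uses; the `pca_2d` helper and the random stand-in data are assumptions made for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their two directions of largest variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance explained.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
# Stand-in data: 30 random 1024-dimensional vectors instead of real embeddings.
fake_embeddings = rng.normal(size=(30, 1024))
coords = pca_2d(fake_embeddings)
print(coords.shape)
```

With real data, `np.array(words["embedding"].to_list())` could be passed to `pca_2d` in place of `fake_embeddings`.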

In [None]:
import seaborn as sns

# One scatter series per domain, so each domain gets its own color and legend entry
for domain in words["domain"].unique():
    sns.scatterplot(data=words[words["domain"] == domain], x="UMAP_x", y="UMAP_y", label=domain)

# Add the text labels once, slightly above each point
for j in range(len(words)):
    plt.text(
        words["UMAP_x"].iloc[j],
        words["UMAP_y"].iloc[j] + 0.13,
        words["word"].iloc[j],
        horizontalalignment="center",
        verticalalignment="center",
        fontsize=7,
    )

We can see that words related to food, animals, and finance each form groups whose members are somewhat close to each other. This lets us find the similarity between different words.
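
One practical use of this closeness is looking up the nearest neighbors of a word. The sketch below ranks a small vocabulary by cosine similarity to a query word. The 2-dimensional vectors are hypothetical values chosen by hand, and `most_similar` is a helper written for this example rather than a library function.

```python
import numpy as np

def most_similar(query, vocab, vectors, k=2):
    """Return the k words whose vectors have the highest cosine similarity to `query`."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[vocab.index(query)]
    order = np.argsort(-scores)  # highest score first
    ranked = [(vocab[i], round(float(scores[i]), 2)) for i in order if vocab[i] != query]
    return ranked[:k]

# Toy 2-D "embeddings" (hypothetical values, not model output).
vocab = ["tiger", "lion", "tomato", "bread"]
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

print(most_similar("tiger", vocab, vectors))
```

With the DataFrame built earlier, `words["word"].to_list()` and `words["embedding"].to_list()` could take the place of `vocab` and `vectors`.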

## Embedding a document

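We can also embed an entire document. The approach sketched below embeds the words of a text in batches and then averages the resulting vectors into a single "centroid" that stands in for the whole document.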

In [None]:
from langchain_dartmouth.llms import DartmouthLLM

llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
response1 = llm.invoke("Generate a 100 word text about Dartmouth College and its history and area")

llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=45, temperature=0.8)
response2 = llm.invoke("Generate a 100 word text about Dartmouth College and its history and area")

llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=10, temperature=0.0)
response3 = llm.invoke("Create 5 words of gibberish")


In [None]:
print(response1.content)
print(response2.content)
print(response3.content)

In [None]:
import numpy as np

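def get_embeddings(response, upper_limit=400):
    # Split on any whitespace and keep at most `upper_limit` words
    words = response.split()[:upper_limit]
    # Create the embedding client once and reuse it for every batch
    embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
    embedding_list = []
    # Embed the words in batches of 32; each word gets its own vector.
    # Stepping up to len(words) avoids empty batches for short texts.
    for i in range(0, len(words), 32):
        embedding_list.append(embeddings.embed_documents(words[i:i + 32]))
    # Stack the batches into one array with one row per word
    return np.concatenate(embedding_list)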
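
The batching step inside `get_embeddings` can be tried in isolation, without calling the embedding service. The sketch below uses a hypothetical helper, `chunk_words`, that is not part of the notebook; it only shows how a text is cut into groups of at most 32 words, including what happens when the text is shorter than the limit.

```python
def chunk_words(text, chunk_size=32, upper_limit=400):
    """Split text on whitespace, keep at most `upper_limit` words,
    and group them into lists of at most `chunk_size` words."""
    words = text.split()[:upper_limit]
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

sample = " ".join(f"w{i}" for i in range(70))  # 70 placeholder words
chunks = chunk_words(sample)
print([len(c) for c in chunks])
```

Stepping through `range(0, len(words), chunk_size)` rather than a fixed upper bound means a short text never produces empty chunks.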


In [None]:
import wikipediaapi

def get_wikipedia_page_text(page_title, language="en"):
    # Create a user-agent string
    user_agent = "CoolBot/0.0 (https://example.org; coolbot@example.org)"
    
    # Initialize Wikipedia API with the user agent
    wiki_wiki = wikipediaapi.Wikipedia(
        language=language,
        user_agent=user_agent
    )
    
    # Fetch the page
    page = wiki_wiki.page(page_title)
    
    if page.exists():
        return page.text
    else:
        return "Page not found"

# Example usage
page_title = "Dartmouth College"
dartmouth_text = get_wikipedia_page_text(page_title)
french_text = get_wikipedia_page_text("Claude Cohen-Tannoudji", "fr")
Ivy_league_text = get_wikipedia_page_text("Ivy League")

In [None]:
dartmouth_embedding = get_embeddings(dartmouth_text)

# get embedding of something random 


In [None]:
french_embedding = get_embeddings(french_text)

In [None]:
Ivy_league_embedding = get_embeddings(Ivy_league_text)

In [None]:
# get the centroid of the embeddings
dartmouth_centroid = np.mean(dartmouth_embedding, axis=0)
french_centroid = np.mean(french_embedding, axis=0)

# find the similarity between the two document centroids
similarity = np.dot(dartmouth_centroid, french_centroid)
print("Similarity between Dartmouth College and Claude Cohen-Tannoudji: ", similarity.round(2))

In [None]:
# get the centroid of the embeddings
dartmouth_centroid = np.mean(dartmouth_embedding, axis=0)
ivy_league_centroid = np.mean(Ivy_league_embedding, axis=0)

# cosine similarity: dot product divided by the product of the vector lengths
similarity = np.dot(dartmouth_centroid, ivy_league_centroid) / (np.linalg.norm(dartmouth_centroid) * np.linalg.norm(ivy_league_centroid))
print("Similarity between Dartmouth College and Ivy League: ", similarity.round(2))
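
The centroid step is easier to follow on small arrays. The sketch below uses hypothetical 4-dimensional vectors in place of real embeddings: each document is a stack of vectors (one per row), the centroid is the column-wise mean, and two centroids are compared with cosine similarity. Documents do not need the same number of rows, because every centroid ends up with the same length.

```python
import numpy as np

# Toy embeddings (hypothetical values): one row per embedded piece of text.
doc_a = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.8, 0.2, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0]])
doc_b = np.array([[0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.1, 0.9, 0.0]])

# Column-wise mean collapses each document to a single vector.
centroid_a = doc_a.mean(axis=0)
centroid_b = doc_b.mean(axis=0)

cosine = np.dot(centroid_a, centroid_b) / (
    np.linalg.norm(centroid_a) * np.linalg.norm(centroid_b)
)
print("centroid_a:", centroid_a)
print("centroid_b:", centroid_b)
print("cosine similarity:", round(float(cosine), 3))
```

Averaging discards word order and gives every row equal weight, so a centroid is a coarse summary of a document rather than a full representation of it.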


In [None]:
# given these embeddings we can now use them to find the similarity between the two documents
# we can also use them to find the similarity between the two documents and a random word

# get the centroid of the embeddings
dartmouth_centroid = np.mean(dartmouth_embedding, axis=0)
french_centroid = np.mean(french_embedding, axis=0)

# find the similarity between the two document centroids
similarity = np.dot(dartmouth_centroid, french_centroid)
print("Similarity between Dartmouth College and Claude Cohen-Tannoudji: ", similarity.round(2))

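# Requires `dartmouth_board_embedding`, which is not defined in the cells above;
# build it with get_embeddings() on the text of interest before running this line.
dartmouth_board_centroid = np.mean(dartmouth_board_embedding, axis=0)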

similarity = np.dot(dartmouth_centroid, dartmouth_board_centroid)
print("Similarity between Dartmouth College and Dartmouth Board: ", similarity.round(2))

# Uses