# Embeddings

At its core, a large language model predicts the most likely sequence of words based on an input query. Since machine learning models operate best with numerical data, our first task is to represent words numerically. Words are complex, but we can simplify this by identifying words with similar meanings. For instance, "stupendous" and "nice" convey similar feelings of positivity, though at different intensities. These are aspects of **sentiment analysis**, which is a significant area in natural language processing.

To illustrate, consider two conceptual "sliders": one for "niceness" and another for "intensity". The "niceness" settings for "nice" and "stupendous" might be similar, while their "intensity" settings differ. Similarly, words like "horrible" and "terrible" would have similar settings on both sliders. We can represent any word as a specific configuration of such sliders, adjusting them to find the best match. In practice, these configurations can be represented with embeddings, which typically have many dimensions, such as 1024.
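
As a minimal sketch of the slider idea, the snippet below gives each word two hand-picked numbers, one for "niceness" and one for "intensity". The values and the `sliders` dictionary are invented for illustration and do not come from any real model; a real embedding has many more dimensions whose meaning is learned rather than chosen by hand.

```python
import math

# Hypothetical slider settings: (niceness, intensity), each between -1 and 1.
# These numbers are made up for illustration, not produced by a model.
sliders = {
    "nice": (0.8, 0.2),
    "stupendous": (0.9, 0.9),
    "horrible": (-0.9, 0.8),
    "terrible": (-0.9, 0.7),
}

def distance(word_a, word_b):
    """Euclidean distance between the slider settings of two words."""
    return math.dist(sliders[word_a], sliders[word_b])

# Words with similar settings on both sliders end up close together
for pair in [("horrible", "terrible"), ("nice", "stupendous"), ("nice", "horrible")]:
    print(pair, round(distance(*pair), 2))
```

With these made-up settings, "horrible" and "terrible" are the closest pair, "nice" and "stupendous" differ mainly in intensity, and "nice" and "horrible" are the farthest apart.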

<!-- # How they work 

There are many different ways to create an embedding. Many of them rely on the likelihood of words appearing near each other. For example, words like "Elizabeth", "King", and "Buckingham" are more likely to appear around "Queen" than a word like "bulldozer". While humans grasp this context naturally, it’s more challenging for computers. Embeddings help tackle this by representing text as numerical values that capture such contextual relationships. -->

TO DO

In [None]:
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())

# Creating an Embedding

Let's find an embedding for a word of our choosing. We will be looking into static embeddings, which are fixed representations of words as vectors from a pre-trained model. 

In [None]:
from langchain_dartmouth.embeddings import DartmouthEmbeddings
from langchain_dartmouth.llms import ChatDartmouth
from langchain_core.output_parsers import JsonOutputParser, ListOutputParser

import numpy as np
import pandas as pd
import umap
import matplotlib.pyplot as plt

The embedding model is a different class from the ones we have used before. Here is how it is used:

In [None]:
embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")

tiger = embeddings.embed_query("tiger")
print(tiger)
print("Length of tiger: ", len(tiger))

<div class="alert alert-info">

**Note:** We see that the word "tiger" is represented by a list of 1024 numbers. This means that the numeric representation of the word "tiger" consists of 1024 dimensions (or sliders) for this particular embedding model. Other models may use fewer or more numbers to represent text. You can read more about the model we are using [on its Hugging Face model card](https://huggingface.co/BAAI/bge-large-en-v1.5).
</div>



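Having the embedding of a word brings several benefits. A primary one is the ability to compare how close two words are in meaning. A simple way to do so is to take the **normalized dot product** of the two embeddings, also known as cosine similarity: the dot product divided by the product of the two vector lengths. Values closer to 1 indicate more similar meanings.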

In [None]:
lion = embeddings.embed_query("lion")
eggs = embeddings.embed_query("eggs")

print("Similarity between 'tiger' and 'lion': {:.2f}".format(np.dot(tiger,lion)/np.linalg.norm(tiger)/np.linalg.norm(lion)))
print("Similarity between 'tiger' and 'eggs': {:.2f}".format( np.dot(tiger, eggs)/np.linalg.norm(tiger)/np.linalg.norm(eggs)))
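
The normalized dot product used above can be wrapped in a small helper so the formula is written only once. The following is a sketch using toy vectors rather than real embeddings; the function name `cosine_similarity` is our own choice, not part of `langchain_dartmouth`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (not real embeddings) that show the range of possible values
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine_similarity([1, 2], [-1, -2])            # opposite directions

print(same_direction, orthogonal, opposite)
```

Parallel vectors score close to 1, perpendicular vectors score 0, and vectors pointing in opposite directions score close to -1. Because the result depends only on direction, the length of each vector does not affect it.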

## Visualizing an embedding
A better way to understand embeddings is to visualize them. Let's ask a language model for words from a few different domains and compute their embeddings.

In [None]:
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
parser = JsonOutputParser()

chain = llm | parser

response = chain.invoke(
    "Generate 30 different words that are well-suited to showcase how word embeddings work. "
    "Draw the words from domains like animals, finance, and food. The food domain should contain tomato. "
    "Return the words in JSON format, using the domain as the key, and the words as values."
)

In [None]:
words = pd.DataFrame.from_dict(response).melt(var_name="domain", value_name="word")

embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
# embed_documents expects a list of strings, so convert the column to a plain list
words["embedding"] = embeddings.embed_documents(words["word"].to_list())

It is difficult to visualize a 1024-dimensional vector, as we're not 1024-dimensional humans! One way around this is to use [UMAP](https://umap-learn.readthedocs.io/en/latest/) (Uniform Manifold Approximation and Projection) to project each high-dimensional vector down to a 2-dimensional one, which can then be plotted as follows.

In [None]:
# fitting using UMAP
mapper = umap.UMAP().fit(np.array(words["embedding"].to_list()))
umap_embeddings = pd.DataFrame(mapper.transform(np.array(words["embedding"].to_list())), columns=["UMAP_x", "UMAP_y"])

# merge with the words
words = pd.concat([words, umap_embeddings], axis=1)
words.head(1)

In [None]:
import matplotlib.pyplot as plt

# Create a scatter plot using matplotlib
for i in words["domain"].unique():
    subset = words[words["domain"] == i]
    plt.scatter(subset["UMAP_x"], subset["UMAP_y"], label=i)
    
    # Add the text labels
    for j in range(len(subset)):
        plt.text(
            subset["UMAP_x"].iloc[j],
            subset["UMAP_y"].iloc[j] + 0.12,
            subset["word"].iloc[j],
            horizontalalignment="center",
            verticalalignment="center",
            fontsize=6,
        )

# Add legend, axis labels, and title, then show the plot
plt.legend()
plt.xlabel("X")
plt.ylabel("Y")
plt.title("UMAP Projection of Words")
# Call tight_layout last so it accounts for the labels and title
plt.tight_layout()
plt.show()

We can see that words from the same domain (food, animals, or finance) tend to land somewhat close to each other. This is what lets us use embeddings to find the similarity between different words.

## Embedding a document

If we want embeddings for several texts at once, we can use the `embed_documents` method instead, which takes a list of strings and returns one embedding per string.
# TOCOMPLETE

In [None]:
from langchain_dartmouth.llms import ChatDartmouth

# Two texts on the same topic: one deterministic, one sampled at a higher temperature
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=42, temperature=0.0)
response1 = llm.invoke("Generate a 100-word text about Dartmouth College, its history, and the surrounding area")

llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=45, temperature=0.8)
response2 = llm.invoke("Generate a 100-word text about Dartmouth College, its history, and the surrounding area")

# An unrelated text for comparison
llm = ChatDartmouth(model_name="llama-3-1-8b-instruct", seed=10, temperature=0.0)
response3 = llm.invoke("Create 5 words of gibberish")


In [None]:
print(response1.content)
print(response2.content)
print(response3.content)

In [None]:
import numpy as np

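def get_embeddings(text, upper_limit=400, batch_size=32):
    """Embed the first `upper_limit` words of `text`, one vector per word."""
    # split() without arguments also splits on newlines and drops empty strings
    words = text.split()[:upper_limit]
    # Create the embedding client once instead of once per batch
    embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")
    embedding_list = []
    # Stop at len(words) so that short texts never produce empty batches
    for i in range(0, len(words), batch_size):
        embedding_list.append(embeddings.embed_documents(words[i:i + batch_size]))
    return np.concatenate(embedding_list)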
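
To see what the batching in `get_embeddings` does without calling the embedding service, here is a sketch that swaps in a stand-in embedder. The function `fake_embed_documents` is hypothetical and only mimics the shape of the real output (one vector per input string); its numbers carry no meaning.

```python
import numpy as np

def fake_embed_documents(texts, dim=4):
    """Hypothetical stand-in for an embedding call: one `dim`-long vector per text."""
    return [[float(len(t))] * dim for t in texts]

def embed_in_batches(words, batch_size=32, upper_limit=400):
    """Embed at most `upper_limit` words, sending `batch_size` words per call."""
    words = words[:upper_limit]
    # Slice the word list into consecutive batches; the last one may be shorter
    batches = [words[i:i + batch_size] for i in range(0, len(words), batch_size)]
    vectors = [fake_embed_documents(batch) for batch in batches]
    # Stack the per-batch results into a single (n_words, dim) array
    return np.concatenate(vectors)

words = ("the quick brown fox " * 20).split()  # 80 words
result = embed_in_batches(words)
print(result.shape)
```

The 80 words are sent as batches of 32, 32, and 16, and the result has one row per word. Stopping the slicing at the end of the word list means a short text never produces an empty batch.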


Using the `wikipediaapi` package, we can fetch some articles from Wikipedia and run a semantic comparison on them.

In [None]:
import wikipediaapi

def get_wikipedia_page_text(page_title, language="en"):
    user_agent = "CoolBot/0.0 (https://example.org; coolbot@example.org)"
    
    wiki_wiki = wikipediaapi.Wikipedia(
        language=language,
        user_agent=user_agent
    )
    
    # Fetch the page
    page = wiki_wiki.page(page_title)
    if page.exists():
        return page.text
    else:
        return "Page not found"

dartmouth_text = get_wikipedia_page_text("Dartmouth College")
french_text = get_wikipedia_page_text("Claude Cohen-Tannoudji", "fr")
Ivy_league_text = get_wikipedia_page_text("Ivy League")

In [None]:
dartmouth_embedding = get_embeddings(dartmouth_text)

# get embedding of something random 


In [None]:
french_embedding = get_embeddings(french_text)

In [None]:
Ivy_league_embedding = get_embeddings(Ivy_league_text)

In [None]:
# get the centroid (mean vector) of each article's word embeddings
dartmouth_centroid = np.mean(dartmouth_embedding, axis=0)
french_centroid = np.mean(french_embedding, axis=0)

# centroids are not unit length, so normalize the dot product to get the cosine similarity
similarity = np.dot(dartmouth_centroid, french_centroid) / (np.linalg.norm(dartmouth_centroid) * np.linalg.norm(french_centroid))
print("Similarity between Dartmouth College and Claude Cohen-Tannoudji: ", similarity.round(2))

In [None]:
# get the centroid (mean vector) of each article's word embeddings
dartmouth_centroid = np.mean(dartmouth_embedding, axis=0)
ivy_league_centroid = np.mean(Ivy_league_embedding, axis=0)

# normalized dot product (cosine similarity) between the two centroids
similarity = np.dot(dartmouth_centroid, ivy_league_centroid) / (np.linalg.norm(dartmouth_centroid) * np.linalg.norm(ivy_league_centroid))
print("Similarity between Dartmouth College and Ivy League: ", similarity.round(2))
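
The centroid comparison above can be tried on toy data to check that it behaves as expected. In this sketch each row stands in for one word embedding; the 2-dimensional values and the names `doc_a`, `doc_b`, and `doc_c` are made up for illustration.

```python
import numpy as np

# Toy "documents": each row is a made-up 2-D stand-in for one word embedding
doc_a = np.array([[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]])
doc_b = np.array([[0.9, 0.1], [1.0, 0.2]])
doc_c = np.array([[0.0, 1.0], [0.1, 0.9]])

def centroid_similarity(x, y):
    """Cosine similarity between the mean vectors of two sets of embeddings."""
    cx, cy = x.mean(axis=0), y.mean(axis=0)
    return float(np.dot(cx, cy) / (np.linalg.norm(cx) * np.linalg.norm(cy)))

sim_ab = centroid_similarity(doc_a, doc_b)  # rows point in similar directions
sim_ac = centroid_similarity(doc_a, doc_c)  # rows point in different directions
print(round(sim_ab, 2), round(sim_ac, 2))
```

Averaging the rows discards word order and gives every word equal weight, so the centroid is only a rough summary of a document. It is still enough to separate `doc_b`, whose rows resemble those of `doc_a`, from `doc_c`, whose rows do not.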


# Uses