# Embeddings

Before a large language model (LLM) can understand and predict words, it first needs to convert them into numbers through a process called "embedding." This is like representing each word as a collection of sliders—imagine a graphic equalizer for sound—where each slider setting captures some aspect of the word's meaning. 

For example, words like "nice" and "stupendous" might have similar settings on a "positivity" slider but differ on an "intensity" slider. These sliders help the model figure out how words relate to each other. 

A word's embedding involves many of these sliders—possibly thousands—but we don't actually know what each individual slider represents in terms of meaning. The large number of sliders helps the model better understand complex relationships between words, even if we can't clearly label each one. 
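
To make the slider analogy concrete, here is a minimal sketch using three hand-picked sliders. The slider names and every value below are invented for illustration; they do not come from a real embedding model. The sketch compares words with cosine similarity, a common way to measure how closely two vectors point in the same direction.

```python
import math

# Hypothetical "slider" settings, invented for illustration only.
# Slider order: (positivity, intensity, formality)
toy_embeddings = {
    "nice":       [0.8, 0.2, 0.5],
    "stupendous": [0.9, 0.9, 0.4],
    "awful":      [-0.8, 0.6, 0.3],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_nice_stupendous = cosine_similarity(
    toy_embeddings["nice"], toy_embeddings["stupendous"]
)
sim_nice_awful = cosine_similarity(
    toy_embeddings["nice"], toy_embeddings["awful"]
)

print("nice vs. stupendous:", round(sim_nice_stupendous, 3))
print("nice vs. awful:     ", round(sim_nice_awful, 3))
```

With these made-up values, "nice" and "stupendous" score as more similar than "nice" and "awful", even though "nice" and "stupendous" differ strongly on the intensity slider.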

[This short video](https://www.youtube.com/shorts/FJtFZwbvkI4) explains the concept in an intuitive way.

# Creating an Embedding

Let's find an embedding for a word of our choosing. We will be looking into static embeddings, which are fixed representations of words as vectors from a pre-trained model. 

In [None]:
from dotenv import find_dotenv, load_dotenv
from langchain_dartmouth.embeddings import DartmouthEmbeddings

load_dotenv(find_dotenv())

The embedding model is a different class from the ones that we have used before. We can see how it's used below:

In [None]:
embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")

word = embeddings.embed_query("tiger")
print(word)
print("Length of embedding: ", len(word))

<div class="alert alert-info">

**Note:** We see that the word "tiger" is represented by a list of 1024 numbers. This means that the numeric representation of the word "tiger" consists of 1024 dimensions (or sliders) for this particular embedding model. Other models may use fewer or more numbers to represent text. You can read more about the model we are using [here](https://huggingface.co/BAAI/bge-large-en-v1.5).
</div>


If we want to embed more than one word, we can use the `embed_documents` function, which takes a list of "tokens" or words as input:

In [None]:
example_text = "Dartmouth College is a private Ivy League research university in Hanover, New Hampshire, United States. Established in 1769 by Eleazar Wheelock, Dartmouth is one of the nine colonial colleges chartered before the"

words = example_text.split()
responses = embeddings.embed_documents(words)

# Printing the first 5 responses
for response in responses[:5]: 
    print(response)


<div class="alert alert-block alert-warning">

**Note:** Notice that we split the example text into individual words before embedding it. Embeddings can convert not just individual words but entire sentences into numbers. However, the specific meanings of words are averaged together to capture the sentence's overall meaning. This helps the model understand context but might blur individual word details.
</div>
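
The blurring effect of averaging can be seen in a toy sketch. The three-dimensional word vectors below are made up, and real models combine word information inside the network in ways that vary from model to model, so this is only an illustration of the averaging idea, not a description of how the model used here works internally.

```python
# Made-up 3-dimensional word vectors; real embeddings have far more dimensions.
word_vectors = {
    "tigers": [0.9, 0.1, 0.0],
    "are":    [0.1, 0.1, 0.1],
    "fierce": [0.2, 0.9, 0.3],
}


def average_vectors(vectors):
    """Average a list of equal-length vectors, dimension by dimension."""
    count = len(vectors)
    return [sum(dim_values) / count for dim_values in zip(*vectors)]


sentence_vector = average_vectors(list(word_vectors.values()))
print(sentence_vector)
```

The averaged vector sits between the individual word vectors: the strong first value of "tigers" and the strong second value of "fierce" are both diluted in the result.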


An important note about `embed_documents` is that it can only embed up to 32 words at a time. If our list contains more than 32 words, we get a batch-size error:

In [None]:
# Increasing the number of words to embed to 33
words = words + ["extra"]

try:
    responses = embeddings.embed_documents(words)
except Exception as e:
    print(e)

One way around this is to break the payload into chunks of at most 32 words, embed each chunk separately, and collect the results in a single list.

In [None]:
embeddings_list = []

# Embed the words in chunks of at most 32 and collect all vectors in one list
for i in range(0, len(words), 32):
    chunk = words[i:i+32]
    embedded_chunk = embeddings.embed_documents(chunk)
    embeddings_list += embedded_chunk

In [None]:
print('Number of embeddings in the document: ', len(embeddings_list))
print('First 5 embeddings:')
for i in range(5):
    print(f'{words[i]:<10}: {embeddings_list[i]}')
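
The chunking loop can also be wrapped in a small reusable helper. The sketch below is one possible way to do that: `embed_in_batches` and `fake_embed_documents` are hypothetical names introduced here, and the fake embedder only mimics the 32-item limit so that the example runs without access to the real model.

```python
def embed_in_batches(embed_fn, texts, batch_size=32):
    """Embed `texts` by calling `embed_fn` on slices of at most `batch_size` items."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(embed_fn(batch))
    return results


# Stand-in embedder for demonstration only: it rejects oversized batches and
# returns a one-number "embedding" (the text length) for each input.
def fake_embed_documents(batch):
    if len(batch) > 32:
        raise ValueError("batch too large")
    return [[float(len(text))] for text in batch]


sample_words = [f"word{i}" for i in range(70)]
sample_embeddings = embed_in_batches(fake_embed_documents, sample_words)
print("Number of embeddings:", len(sample_embeddings))
```

With the real model, the same helper could be called as `embed_in_batches(embeddings.embed_documents, words)`, assuming the 32-item limit described above still applies.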

# Summary 

Embeddings are representations of words or phrases, or more accurately, "tokens", as numbers. Using the `embed_query` and `embed_documents` functions, we can get the embeddings of words or phrases. Once text is in numeric form, we can compare embeddings to explore how different words are related to each other.

The embedding model accepts at most 32 words per `embed_documents` call. We can work around this limit by feeding in the words 32 at a time.