# Embeddings

Before a large language model (LLM) can understand and predict words, it first needs to convert them into numbers through a process called "embedding." This is like representing each word as a collection of sliders—imagine a graphic equalizer for sound—where each slider setting captures some aspect of the word's meaning. 

For example, words like "nice" and "stupendous" might have similar settings on a "positivity" slider but differ on an "intensity" slider. These sliders help the model figure out how words relate to each other. 

A word's embedding involves many of these sliders—possibly thousands—but we don't actually know what each individual slider represents in terms of meaning. The large number of sliders helps the model better understand complex relationships between words, even if we can't clearly label each one. 
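
The slider picture can be made concrete with a tiny toy example. The sketch below is not a real embedding: the slider names, the words' values, and the helper `slider_gap` are all invented for illustration, and real embedding dimensions carry no such labels.

```python
# Hypothetical slider settings, invented purely for illustration.
# Real embeddings have many more dimensions, and none of them are labeled.
SLIDERS = ("positivity", "intensity")

toy_embeddings = {
    "nice": [0.8, 0.3],
    "stupendous": [0.9, 0.95],
    "awful": [-0.8, 0.6],
}


def slider_gap(word_a, word_b):
    """Return the absolute difference between two words on each named slider."""
    a, b = toy_embeddings[word_a], toy_embeddings[word_b]
    return {name: round(abs(x - y), 2) for name, x, y in zip(SLIDERS, a, b)}


# "nice" and "stupendous" sit close together on positivity, far apart on intensity.
print(slider_gap("nice", "stupendous"))
# "nice" and "awful" are far apart on positivity.
print(slider_gap("nice", "awful"))
```

With these made-up values, the two positive words differ mainly on the "intensity" slider, while "nice" and "awful" differ mainly on "positivity".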

[This](https://www.youtube.com/shorts/FJtFZwbvkI4) short video explains this concept in an intuitive way.

# Creating an Embedding

Let's find an embedding for a word of our choosing. We will be looking into static embeddings, which are fixed representations of words as vectors from a pre-trained model. 

In [None]:
from dotenv import find_dotenv, load_dotenv
from langchain_dartmouth.embeddings import DartmouthEmbeddings

load_dotenv(find_dotenv())

The embedding model is a different class from the ones that we have used before. We can see how it's used below:

In [None]:
embeddings = DartmouthEmbeddings()

word = embeddings.embed_query("tiger")
print(word)
print("Length of embedding: ", len(word))

<div class="alert alert-info">

**Note:** The word "tiger" is represented by a list of 1024 numbers. This means that the numeric representation of the word "tiger" has 1024 dimensions (or sliders) for this particular embedding model. Other models may use fewer or more numbers to represent text. You can read more about the model we are using [here](https://huggingface.co/BAAI/bge-large-en-v1.5).
</div>
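
One common way to compare two embeddings is cosine similarity, which measures how closely two vectors point in the same direction. The sketch below is a minimal, dependency-free version. The three short vectors are invented stand-ins, not real model output; with a real model you would pass in the lists returned by `embed_query` instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Invented 3-dimensional stand-ins for real 1024-dimensional embeddings.
tiger = [0.9, 0.1, 0.3]
lion = [0.8, 0.2, 0.4]
teapot = [-0.2, 0.9, -0.5]

print("tiger vs. lion:  ", cosine_similarity(tiger, lion))
print("tiger vs. teapot:", cosine_similarity(tiger, teapot))
```

For these stand-in vectors, the "tiger" and "lion" pair scores higher than the "tiger" and "teapot" pair, which is the pattern we would hope to see for related and unrelated words.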


### TextLoaders
The `embed_documents` function embeds multiple strings passed in as a list. To embed the contents of a file, we can first load it as a LangChain `Document` object and then pass its text to `embed_documents`. `TextLoader` does this for plain-text files, and LangChain offers specialized loaders for other file types. Learn more about these loaders [here](https://python.langchain.com/docs/integrations/document_loaders/).

In [None]:
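# Document loaders live in the langchain_community package in recent LangChain versions
from langchain_community.document_loaders import TextLoader

path_to_file = './rag_documents/asteroids.txt'
text_loader = TextLoader(path_to_file)
# load() returns a list of Document objects (here, a single one for the whole file)
document = text_loader.load()
print(document)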

Now the text in the document can be **tokenized** (split into pieces) in any way we like. A conceptually easy way to do so is to embed each word separately. However, documents generally contain many words, and `embed_documents` only accepts a limited number of strings per call. The cell below triggers the resulting error, where the maximum number of strings is 512.

In [None]:
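# Split the document text into words; split() without arguments also
# handles newlines and repeated spaces, so no empty strings are produced
words = document[0].page_content.split()

# Try to embed all words in a single call; this fails if there are too many
try:
    responses = embeddings.embed_documents(words)
except Exception as e:
    print(e)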

To get around this issue, we can feed in 512 words at a time and repeat until all the words have been embedded.

In [None]:
embeddings_list = []

# Embed the words in batches of at most 512 strings
for i in range(0, len(words), 512):
    chunk = words[i:i+512]
    embedded_chunk = embeddings.embed_documents(chunk)
    embeddings_list.extend(embedded_chunk)

In [None]:
print('Number of embeddings in the document: ', len(embeddings_list))
print('First 5 embeddings:')
for i in range(5):
    print(f'{words[i]:<15}: {embeddings_list[i]}')
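
The batching loop above can also be wrapped in a small reusable helper. The sketch below is one possible way to do that, not part of LangChain or `langchain_dartmouth`: the names `embed_in_batches` and `fake_embed_documents` are hypothetical, and the fake embedder only imitates a 512-string limit so that the example runs without a model.

```python
def embed_in_batches(embed_fn, texts, batch_size=512):
    """Call embed_fn on successive slices of texts and collect all results in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors


def fake_embed_documents(texts):
    """Hypothetical stand-in for a real embedder; rejects more than 512 strings."""
    if len(texts) > 512:
        raise ValueError("Too many strings in one call")
    # One-dimensional dummy "embedding": the length of each string
    return [[float(len(text))] for text in texts]


sample_words = [f"word{i}" for i in range(1300)]
vectors = embed_in_batches(fake_embed_documents, sample_words)
print(len(vectors))  # 1300: one vector per input string
```

With a real model, the call would look like `embed_in_batches(embeddings.embed_documents, words)`, assuming `embeddings` and `words` are defined as in the cells above.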

# Summary 

Embeddings are representations of strings as lists of numbers. Using the `embed_query` and `embed_documents` functions, we can get the embeddings of words or phrases. These numeric representations let us measure how different words are related to each other.

By loading files as LangChain `Document` objects, we can pass their text to `embed_documents` and embed entire files.

The default embedding model accepts at most 512 strings per call to `embed_documents`. We can work around this limit by feeding in the strings in batches of 512. The limit depends on the model being used.