# Embeddings

Before a large language model (LLM) can understand and predict words, it first needs to convert them into numbers through a process called "embedding." This is like representing each word as a collection of sliders—imagine a graphic equalizer for sound—where each slider setting captures some aspect of the word's meaning. 

For example, words like "nice" and "stupendous" might have similar settings on a "positivity" slider but differ on an "intensity" slider. These sliders help the model figure out how words relate to each other. 

A word's embedding involves many of these sliders—possibly thousands—but we don't actually know what each individual slider represents in terms of meaning. The large number of sliders helps the model better understand complex relationships between words, even if we can't clearly label each one. 
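
To make the slider picture concrete, here is a minimal sketch with two hand-labeled sliders. The slider names and every number in it are invented for illustration; real embeddings have many more dimensions, and those dimensions are not labeled.

```python
# Hypothetical two-slider "embeddings"; the values are made up for illustration
# and do not come from any real model.
toy_embeddings = {
    "nice": {"positivity": 0.8, "intensity": 0.3},
    "stupendous": {"positivity": 0.9, "intensity": 0.95},
}


def slider_differences(word_a, word_b, table=toy_embeddings):
    """Return the absolute difference on each slider between two words."""
    a, b = table[word_a], table[word_b]
    return {name: abs(a[name] - b[name]) for name in a}


diffs = slider_differences("nice", "stupendous")
for name, diff in diffs.items():
    print(f"{name:<10}: {diff:.2f}")
```

With these made-up values, the two words sit close together on the "positivity" slider and far apart on the "intensity" slider, which is the pattern described above.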

[This short video](https://www.youtube.com/shorts/FJtFZwbvkI4) explains the concept in an intuitive way.

# Creating an Embedding

Let's find an embedding for a word of our choosing. We will be looking into static embeddings, which are fixed representations of words as vectors from a pre-trained model. 

In [1]:
from dotenv import find_dotenv, load_dotenv
from langchain_dartmouth.embeddings import DartmouthEmbeddings

load_dotenv(find_dotenv())

True

The embedding model is a different class from the ones we have used before. We can see how it is used below:

In [2]:
embeddings = DartmouthEmbeddings(model_name="bge-large-en-v1-5")

word = embeddings.embed_query("tiger")
print(word)
print("Length of embedding: ", len(word))

[-0.0098512, 0.021645533, 0.020528702, 0.000515087, -0.03606561, -0.0034401882, -0.019277748, 0.031079523, -0.0088551855, -0.013496476, 0.02126853, 0.0049624527, -0.038428802, 0.013413767, -0.004831469, 0.014840598, -0.019473404, -0.037833326, -0.037558023, -0.00017545157, 0.0143051855, 0.043533113, -0.08317123, -0.023391493, 0.02023354, 0.03892471, 0.008833535, 0.015522147, 0.060258042, 0.009829163, -0.023471642, -0.019778078, 0.04882958, -0.035276916, -0.0027987345, -0.021148305, -0.003931096, -0.0033566123, -0.03245597, -0.0015825863, 0.021915514, -0.06235675, 0.05735998, -0.017271146, -0.018306775, -0.027426377, 0.016969977, -0.045100108, 0.030155057, -0.06374126, 0.008559248, 0.008048475, 0.00875124, -0.023909586, -0.049112137, -0.00404275, -0.019802958, 0.012755762, 0.005984073, 0.016819803, 0.024077311, -0.003208673, 0.039493624, -0.033909094, -0.028675787, -0.016938612, 0.021367928, 0.009030608, 0.01552536, 0.029666582, 0.0012243559, 0.023820473, 0.0014541324, -0.0020722183, -0

<div class="alert alert-info">

**Note:** We see that the word "tiger" is represented by a list of 1024 numbers. This means that the numeric representation of the word "tiger" consists of 1024 dimensions (or sliders) for this particular embedding model. Other models may use fewer or more numbers to represent text. You can read more about the model we are using [here](https://huggingface.co/BAAI/bge-large-en-v1.5).
</div>


If we want to embed more than one word, we can use the `embed_documents` function, which takes a list of strings (for example, words or "tokens") as input. More importantly, `embed_documents` lets us embed the text of LangChain `Document` objects. Using LangChain's `TextLoader`, we can load a text file as a list of such documents. LangChain provides loaders for many different file types. You can learn more about them [here](https://python.langchain.com/docs/integrations/document_loaders/).

In [17]:
from langchain_community.document_loaders import TextLoader

path_to_file = './rag_documents/asteroids.txt'
text_loader = TextLoader(path_to_file)
# load() returns a list of Document objects (here, a single document)
document = text_loader.load()
print(document)

[Document(metadata={'source': './rag_documents/asteroids.txt'}, page_content="**Asteroids: The Mysterious and Ancient Building Blocks of Our Solar System**\n\nAsteroids, also known as minor planets or planetoids, are small, rocky objects that orbit the Sun. They are remnants from the early days of our solar system, and their study has provided valuable insights into the formation and evolution of the cosmos. These mysterious bodies have captivated the imagination of scientists and researchers for centuries, and their exploration continues to uncover new secrets about the universe.\n\n**Composition and Types of Asteroids**\n\nAsteroids are typically small, with diameters ranging from a few meters to hundreds of kilometers. They are composed of rock, metal, and ice, and are thought to be the remnants of the early solar system. There are two main types of asteroids: stony asteroids, which are composed mostly of silicate minerals, and metal asteroids, which are rich in iron and nickel.\n\n

Now the text in the document can be **tokenized** in any way we like. A conceptually easy way to do so is to embed each word separately. However, documents generally contain many words, and `embed_documents` accepts at most 32 separate strings per call, so we have to send the words in batches. The cell below shows the error raised when we exceed this limit.

In [26]:
# Try to embed 33 words in one call (one more than the limit of 32)
words = document[0].page_content.split(' ')

try:
    responses = embeddings.embed_documents(words[:33])
except Exception as e:
    print(e)

413 Client Error: Payload Too Large for url: https://ai-api.dartmouth.edu/tei/bge-large-en-v1-5/

batch size 33 > maximum allowed batch size 32


In [27]:
embeddings_list = []

# Embed the words in batches of at most 32, the maximum accepted per call
for i in range(0, len(words), 32):
    chunk = words[i:i+32]
    embedded_chunk = embeddings.embed_documents(chunk)
    embeddings_list += embedded_chunk
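
The loop above can also be wrapped in a small reusable helper. The sketch below is one possible way to do it. The names `embed_in_batches` and `fake_embed` are hypothetical, and `fake_embed` is only a stand-in that mimics the 32-string limit so the example runs without network access.

```python
MAX_BATCH_SIZE = 32  # limit reported in the error message above


def embed_in_batches(texts, embed_fn, batch_size=MAX_BATCH_SIZE):
    """Call embed_fn on consecutive slices of texts and collect all vectors."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    """Stand-in embedder (an assumption, not the real API): rejects oversized
    batches and returns a one-number 'embedding' per string."""
    if len(batch) > MAX_BATCH_SIZE:
        raise ValueError(f"batch size {len(batch)} > maximum allowed batch size {MAX_BATCH_SIZE}")
    return [[float(len(text))] for text in batch]


sample_words = [f"word{i}" for i in range(70)]
vectors = embed_in_batches(sample_words, fake_embed)
print(len(vectors))
```

In the notebook, the same helper would presumably be called as `embed_in_batches(words, embeddings.embed_documents)`, since `embed_documents` takes a list of strings and returns one vector per string.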

In [28]:
print('Number of embeddings in the document: ', len(embeddings_list))
print('First 5 embeddings:')
for i in range(5):
    print(f'{words[i]:<10}: {embeddings_list[i]}')

Number of embeddings in the document:  589
First 5 embeddings:
**Asteroids:: [-0.0013865127, 0.048595, -0.00616314, 0.03327783, -0.03426471, 0.0018010071, 3.0388974e-06, 0.01772578, 0.040013604, 0.0024980635, 0.0061349086, 0.040155906, 0.02858962, -0.0150268935, -0.04761392, 0.007947801, -0.015128773, -0.043546356, -0.04938369, -0.011828519, -0.018640265, -0.011944743, -0.04364425, 0.020006755, 0.008378321, 0.009664619, 0.036639687, 0.052313082, 0.03623276, 0.03847547, -0.025293412, -0.016672196, 0.012919154, -0.043451056, 0.012029128, -0.038350794, 0.0050261198, -0.017362282, 0.004714033, -0.032903317, 0.041372925, -0.026330214, 0.051650483, 0.0150903715, -0.036029633, -0.002127812, 0.02534947, -0.05835694, -0.034236107, -0.03359526, 0.028596444, 0.03482527, 0.0030820225, -0.040154386, -0.004329293, 0.0094767865, -0.011067421, 0.0012937655, -0.0033151035, -0.036669012, 0.011441893, 0.031958304, -0.0017730083, -0.03853861, -0.01748169, 0.03184121, 0.00093986746, -0.022917993, -0.001791
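
Notice that the first "word" above is `**Asteroids:`. Splitting on spaces leaves Markdown markers, punctuation, and newlines attached to the words. If cleaner tokens are wanted, one option is to extract the words with a regular expression first. The sketch below is an assumption about what "clean" should mean here, not part of the original workflow, and its simple pattern also splits hyphenated words and drops non-ASCII letters.

```python
import re


def clean_words(text):
    """Return runs of ASCII letters, digits, and apostrophes found in text.

    Everything else (spaces, newlines, punctuation, Markdown markers)
    is treated as a separator.
    """
    return re.findall(r"[A-Za-z0-9']+", text)


sample = (
    "**Asteroids: The Mysterious and Ancient Building Blocks of Our Solar System**"
    "\n\nAsteroids, also known as minor planets"
)

print(sample.split(' ')[:3])   # tokens from a plain split on spaces
print(clean_words(sample)[:3])  # tokens after regex-based cleanup
```

The cleaned list could then be passed to the batching loop in place of `words`.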

# Summary 

Embeddings are representations of words or phrases (or, more accurately, "tokens") as lists of numbers. Using the `embed_query` and `embed_documents` functions, we can get the embeddings of words or phrases. Once text is in numeric form, we can compare embeddings to measure how closely different words are related to each other.

The embedding endpoint accepts at most 32 strings per call. To embed more than that, send the strings in batches of up to 32.
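
As a minimal sketch of how embeddings can be used to relate words, the example below ranks a few words by cosine similarity to a query vector. Cosine similarity is a common choice for comparing embeddings, but it is an assumption here rather than something this notebook prescribes. The three-number vectors are invented so the example runs offline; real vectors would come from `embed_query` or `embed_documents`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (name, similarity) pairs, most similar to query_vec first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "lion": [0.9, 0.8, 0.1],
    "kitten": [0.6, 0.2, 0.3],
    "asteroid": [-0.1, 0.1, 0.9],
}
tiger = [0.95, 0.9, 0.05]

ranking = rank_by_similarity(tiger, toy_vectors)
for name, score in ranking:
    print(f"{name:<10}: {score:.3f}")
```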