# **Embedding**

Embeddings convert text into numeric vectors (lists of numbers) that capture semantic meaning, enabling AI systems to understand and compare text based on meaning rather than exact word matches.

## **What Are Embeddings?**
An embedding is a vector (array of numbers) that represents the semantic meaning of text. Instead of storing text as strings, we store it as numbers that a machine learning model can understand and compare.
```python
# Text
text = "The quick brown fox jumps over the lazy dog"

# After embedding (example - not real values)
embedding = [0.234, -0.456, 0.789, 0.123, ..., 0.567]  # e.g. 1536 dimensions for OpenAI's text-embedding-3-small
```
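
To make "compare text based on meaning" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three vectors are made up for illustration and do not come from any model; the helper name `cosine_similarity` is also just an illustrative choice.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, for illustration only
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar direction
print(cosine_similarity(cat, invoice))  # close to 0: nearly unrelated
```

A score near 1 means the vectors point the same way, which for real embeddings tends to indicate similar meaning.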

## **Text Embedding Models**
LangChain's `Embeddings` class is designed for interfacing with text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.), and this class provides a standard interface for all of them.

Embeddings create a vector representation of a piece of text. This lets us work with text in a vector space and do things like semantic search, where we look for the pieces of text that are closest to a query in that space.

The base Embeddings class in LangChain provides two methods:
1. `embed_documents`, for embedding documents: takes multiple texts as input
2. `embed_query`, for embedding a query: takes a single text as input

The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
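
The two-method split can be sketched with a toy class. `ToyEmbeddings` is hypothetical and calls no model: it builds a small deterministic vector from character codes so the example runs offline. The `"passage: "` and `"query: "` prefixes are an assumption, used only to show how a provider could treat the two paths differently.

```python
from typing import List

class ToyEmbeddings:
    """Hypothetical stand-in that mirrors the two-method interface."""

    def __init__(self, size: int = 8):
        self.size = size

    def _embed(self, text: str) -> List[float]:
        # Not a real embedding: just spreads character codes over `size` slots
        vec = [0.0] * self.size
        for i, ch in enumerate(text):
            vec[i % self.size] += ord(ch) / 1000.0
        return vec

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Many texts in, one vector per text out
        return [self._embed("passage: " + t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        # One text in, one vector out
        return self._embed("query: " + text)


model = ToyEmbeddings()
doc_vectors = model.embed_documents(["Hi there!", "Oh, hello!"])
query_vector = model.embed_query("Hi there!")

print(len(doc_vectors), len(doc_vectors[0]))  # 2 8
print(len(query_vector))                      # 8
print(query_vector == doc_vectors[0])         # False: the two paths differ
```

Even for the same input text, the query path and the document path can produce different vectors, which is why the interface keeps them separate.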

## **Initialize OpenAIEmbeddings**

In [1]:
from langchain_openai import OpenAIEmbeddings

# Read the API key from a local file and strip the trailing newline
with open('keys/.openai_api_key.txt') as f:
    OPENAI_API_KEY = f.read().strip()

embeddings_model = OpenAIEmbeddings(api_key=OPENAI_API_KEY, 
                                    model="text-embedding-3-large")



## **Embed Single Query - `embed_query`**

In [2]:
# Embed single query

embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")

print("Dimensionality of embedded vector:", len(embedded_query))
print()
print(embedded_query[:15])

Dimensionality of embedded vector: 3072

[0.005890291649848223, 0.041957274079322815, -0.026762796565890312, 0.007915631867945194, -0.030353575944900513, 0.005187171045690775, 0.012921495363116264, -0.011126106604933739, 0.009251118637621403, 0.024321774020791054, 0.0004958325298503041, 0.038313429802656174, -0.008198649622499943, 0.038030412048101425, 0.029681408777832985]


## **Embed List of Texts - `embed_documents`**

In [3]:
# Embed list of texts

embeddings = embeddings_model.embed_documents(
                                [
                                    "Hi there!",
                                    "Oh, hello!",
                                    "What's your name?",
                                    "My friends call me World",
                                    "Hello World!"
                                ]
)

print("Number of Embeddings:", len(embeddings))

print("Dimensionality of Embeddings:", len(embeddings[0]))

Number of Embeddings: 5
Dimensionality of Embeddings: 3072
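
With a query vector from `embed_query` and document vectors from `embed_documents`, semantic search comes down to ranking documents by similarity to the query. The sketch below assumes cosine similarity as the measure and uses made-up 3-dimensional vectors in place of real 3072-dimensional ones; `rank_by_similarity` is a hypothetical helper, not part of LangChain.

```python
import math
from typing import List, Tuple

def rank_by_similarity(query_vec: List[float],
                       doc_vecs: List[List[float]],
                       texts: List[str]) -> List[Tuple[float, str]]:
    """Return (score, text) pairs sorted from most to least similar."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    scored = [(cosine(query_vec, vec), text) for vec, text in zip(doc_vecs, texts)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Made-up vectors standing in for real embeddings
texts = ["Hi there!", "What's your name?", "My friends call me World"]
doc_vecs = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.2], [0.2, 0.7, 0.6]]
query_vec = [0.1, 0.8, 0.3]  # pretend this came from embed_query(...)

for score, text in rank_by_similarity(query_vec, doc_vecs, texts):
    print(f"{score:.3f}  {text}")
```

In practice the same function could be fed `embedded_query` and `embeddings` from the cells above, as long as both come from the same model.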


## **Create Embeddings for Subtitles Data**

In [5]:
# Load all .srt files
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

docs = loader.load()

print("Number of Documents:", len(docs))

100%|█████████████████████████████████████████| 10/10 [00:00<00:00, 6523.02it/s]

Number of Documents: 10





In [6]:
print(type(docs))

print(type(docs[0]))

<class 'list'>
<class 'langchain_core.documents.base.Document'>


In [7]:
# Read the first document's text via .page_content

print(docs[0].page_content[:100])

1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:0


In [8]:
# # Reading the content of all the .srt files

# [srt_file.page_content for srt_file in docs]

**Important Note: Be careful running the following code. It calls the paid OpenAI embeddings API and will incur a cost.**

In [9]:
# # Creating embeddings for all the loaded .srt files

# embedded_docs = embeddings_model.embed_documents([srt_file.page_content for srt_file in docs])

# print("Type of variable:", type(embedded_docs))

# print("Number of embeddings:", len(embedded_docs))

# print("Dimensionality of each embedding:", len(embedded_docs[0]))
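
Before running a paid call over whole files, a rough size estimate can be useful. The sketch below is only a heuristic: the 4-characters-per-token ratio is a common rule of thumb for English text, not an exact tokenizer count, and `price_per_million_tokens=0.10` is a placeholder rather than a real price. The function name and sample inputs are hypothetical.

```python
def estimate_embedding_cost(texts, price_per_million_tokens, chars_per_token=4.0):
    """Rough estimate; both the ratio and the price are assumptions."""
    total_chars = sum(len(t) for t in texts)
    estimated_tokens = total_chars / chars_per_token
    estimated_cost = estimated_tokens / 1_000_000 * price_per_million_tokens
    return estimated_tokens, estimated_cost

# Placeholder inputs: substitute the real page_content strings and the
# current price from the provider's pricing page.
sample_texts = ["a" * 4000, "b" * 6000]
tokens, cost = estimate_embedding_cost(sample_texts, price_per_million_tokens=0.10)
print(f"~{tokens:,.0f} tokens, ~${cost:.6f}")
```

For the subtitles, the input would be `[srt_file.page_content for srt_file in docs]`.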

## **HuggingFace Embeddings**

In [16]:
# ! pip install sentence-transformers
# ! pip install langchain-huggingface

In [17]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"

embeddings_model = HuggingFaceEmbeddings(model_name=model_name)

In [18]:
embedded_docs = embeddings_model.embed_documents([
                    "Hi there!",
                    "Oh, hello!",
                    "What's your name?",
                    "My friends call me World",
                    "Hello World!"
                    ])

In [19]:
print("Type of variable:", type(embedded_docs))

print("Number of embeddings:", len(embedded_docs))

print("Dimensionality of each embedding:", len(embedded_docs[0]))

Type of variable: <class 'list'>
Number of embeddings: 5
Dimensionality of each embedding: 768


## **Conclusion**

We just learned how to take documents and embed them into vectors.

These vectors are stored in memory as a Python list. Whenever we restart the program, these lists are lost.

How do we make sure these embeddings persist to some permanent storage?

**Important Note: If generating embeddings costs money, why generate them on every run? Store the embeddings in a database during the first execution and read them from the database on later runs for better cost management.**
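
As a minimal sketch of that idea, the helper below caches vectors in a JSON file keyed by a hash of each text, so only unseen texts are sent to the model. It is a stand-in for a real database, not a replacement for one. `embed_with_cache`, the file name, and `fake_embed` are all hypothetical; in real use `embed_fn` could be `embeddings_model.embed_documents`.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path

def embed_with_cache(texts, embed_fn, cache_path="embeddings_cache.json"):
    """Embed texts, reusing vectors already saved in a JSON file."""
    path = Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}

    # Key each text by a hash of its content
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = {k: t for k, t in zip(keys, texts) if k not in cache}

    if missing:
        # Only texts not seen before are sent to the model
        new_vectors = embed_fn(list(missing.values()))
        cache.update(zip(missing.keys(), new_vectors))
        path.write_text(json.dumps(cache))

    return [cache[k] for k in keys]

# Demo with a fake embedding function that records how often it is called
calls = []
def fake_embed(batch):
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]

cache_file = os.path.join(tempfile.mkdtemp(), "cache.json")
first = embed_with_cache(["Hi there!", "Hello World!"], fake_embed, cache_file)
second = embed_with_cache(["Hi there!", "Hello World!"], fake_embed, cache_file)
print(first == second, len(calls))  # True 1
```

The second call returns the same vectors without invoking the embedding function again, which is the cost saving the note describes.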