# **Embeddings**

## **What's Covered?**
1. Introduction to Embeddings
    - What Are Embeddings?
    - Text Embedding Models
    - Top LangChain Integrations
2. Generating Embeddings via API using GoogleGenAI and OpenAI
    - Installing the libraries
    - Setting up the API Key
    - Instantiating the Embedding Models
    - Embed Single Query using embed_query()
    - Embed list of Documents using embed_documents()
3. Generating Embeddings Locally using HuggingFaceEmbeddings
    - Installing the libraries
    - Instantiating the Embedding Models
    - Embed Single Query using embed_query()
    - Embed list of Documents using embed_documents()
4. End-to-End Embedding Pipeline
    - Step 1: Load the documents
    - Step 2: Apply Chunking
    - Step 3: Convert the Chunks into Embeddings
5. What's next?

## **Introduction to Embeddings**
Embeddings convert text into numeric vectors (lists of numbers) that capture semantic meaning, enabling AI systems to understand and compare text based on meaning rather than exact word matches.

### **What Are Embeddings?**
An embedding is a vector (array of numbers) that represents the semantic meaning of text. Instead of storing text as strings, we store it as numbers that a machine learning model can understand and compare.
```python
# Text
text = "The quick brown fox jumps over the lazy dog"

# After embedding (example - not real values)
embedding = [0.234, -0.456, 0.789, 0.123, ..., 0.567]  # e.g. 1536 dimensions for OpenAI's text-embedding-3-small
```

### **Text Embedding Models**
The `Embeddings` class is LangChain's interface to text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.), and this class provides a standard interface for all of them.

Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.
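To make "most similar in the vector space" concrete, here is a minimal sketch of semantic search using cosine similarity. The three-dimensional vectors and the helper names (`cosine_similarity`, `most_similar`) are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def most_similar(query_vector, vectors):
    """Return the key whose vector has the highest cosine similarity to the query."""
    return max(vectors, key=lambda name: cosine_similarity(query_vector, vectors[name]))

# Hypothetical toy "embeddings" (invented values, not produced by any model)
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.6, 0.6, 0.1],
    "car": [0.0, 0.2, 0.9],
}

query = [0.8, 0.2, 0.1]  # hypothetical embedding for the query "kitten"
print(most_similar(query, toy_vectors))  # prints "cat"
```

The same idea scales to real embeddings: embed the query, score it against every stored vector, and return the best matches.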

### **Top LangChain Integrations**

| Model                          | Package                   |
|--------------------------------|----------------------------|
| OpenAIEmbeddings               | langchain-openai           |
| AzureOpenAIEmbeddings          | langchain-openai           |
| GoogleGenerativeAIEmbeddings   | langchain-google-genai     |
| OllamaEmbeddings               | langchain-ollama           |
| TogetherEmbeddings             | langchain-together         |
| FireworksEmbeddings            | langchain-fireworks        |
| MistralAIEmbeddings            | langchain-mistralai        |
| CohereEmbeddings               | langchain-cohere           |
| NomicEmbeddings                | langchain-nomic            |
| FakeEmbeddings                 | langchain-core             |
| DatabricksEmbeddings           | databricks-langchain       |
| WatsonxEmbeddings              | langchain-ibm              |
| NVIDIAEmbeddings               | langchain-nvidia-ai-endpoints |
| AimlapiEmbeddings              | langchain-aimlapi          |


## **Generating Embeddings via API using GoogleGenAI and OpenAI**

### **Installing the libraries**

```python
! pip install --upgrade --quiet langchain-google-genai
! pip install --upgrade --quiet langchain-openai
```

### **Setting up the API Key**

In [9]:
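# Set up the Google API key (read from a local file kept out of version control)

with open('keys/.gemini.txt') as f:
    # strip() removes the trailing newline most editors add to the file
    GOOGLE_API_KEY = f.read().strip()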

In [10]:
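# Set up the OpenAI API key (read from a local file kept out of version control)

with open('keys/.openai_api_key.txt') as f:
    # strip() removes the trailing newline most editors add to the file
    OPENAI_API_KEY = f.read().strip()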
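As an alternative to hard-coding file reads, the sketch below looks for the key in an environment variable first and falls back to a local file. The helper name `load_api_key` is hypothetical and not part of LangChain; the file paths in the commented usage are the ones used in this notebook.

```python
import os
from pathlib import Path

def load_api_key(env_var, fallback_path):
    """Return the key from the environment if set, otherwise from a local file."""
    value = os.environ.get(env_var)
    if value and value.strip():
        return value.strip()
    # read_text raises FileNotFoundError if the fallback file is missing
    return Path(fallback_path).read_text(encoding="utf-8").strip()

# Example usage (left commented out because it needs your own key or file):
# GOOGLE_API_KEY = load_api_key("GOOGLE_API_KEY", "keys/.gemini.txt")
# OPENAI_API_KEY = load_api_key("OPENAI_API_KEY", "keys/.openai_api_key.txt")
```

Keeping keys in environment variables or untracked files helps avoid committing them to version control by accident.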

### **Instantiating the Embedding Models**

The base Embeddings class in **LangChain provides two methods**:
1. **For embedding documents** (`embed_documents()`) - takes multiple texts as input
2. **For embedding a query** (`embed_query()`) - takes a single text as input

The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
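The two-method shape can be sketched with a toy stand-in. `ToyEmbeddings` below is hypothetical and is not a real model or a LangChain class: it hashes the text to produce meaningless numbers, and it mixes a role label into the hash only to illustrate how a provider could encode queries and documents differently.

```python
import hashlib

class ToyEmbeddings:
    """Hypothetical stand-in that mirrors the two-method interface; not a real model."""

    def __init__(self, size=8):
        if not 1 <= size <= 32:
            raise ValueError("size must be between 1 and 32 (SHA-256 yields 32 bytes)")
        self.size = size

    def _embed(self, text, role):
        # Hash the role together with the text so that queries and documents
        # get different vectors. The resulting numbers carry no semantic meaning.
        digest = hashlib.sha256(f"{role}:{text}".encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.size]]

    def embed_documents(self, texts):
        """Take a list of texts and return one vector per text."""
        return [self._embed(text, "document") for text in texts]

    def embed_query(self, text):
        """Take a single text and return a single vector."""
        return self._embed(text, "query")

toy_model = ToyEmbeddings()
doc_vectors = toy_model.embed_documents(["Hi there!", "Hello World!"])
query_vector = toy_model.embed_query("Hi there!")
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))  # prints "2 8 8"
```

Note the return types: `embed_documents()` returns a list of vectors, while `embed_query()` returns one vector.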


In [19]:
# Import Embedding Models
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_openai import OpenAIEmbeddings

# Pass the standard parameters during initialization
google_embd_model = GoogleGenerativeAIEmbeddings(google_api_key=GOOGLE_API_KEY, 
                                                 model="models/gemini-embedding-001")

openai_embd_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, 
                                     model="text-embedding-3-small")

### **Embed Single Query using `embed_query()`**

In [20]:
# Embed single query

embedded_query = google_embd_model.embed_query("What was the name mentioned in the conversation?")

print("Dimensionality of embedded vector:", len(embedded_query))
print()
print(embedded_query[:15])

Dimensionality of embedded vector: 3072

[-0.030619625002145767, 0.0032838324550539255, 0.0156903937458992, -0.08700630068778992, -0.008714504539966583, 0.004057515878230333, -0.011523551307618618, 0.006958039943128824, -0.022802716121077538, 0.010272241197526455, -0.006420494522899389, -0.007641482166945934, -0.0015980988973751664, 0.007948610931634903, 0.11037036031484604]


In [21]:
# Embed single query

embedded_query = openai_embd_model.embed_query("What was the name mentioned in the conversation?")

print("Dimensionality of embedded vector:", len(embedded_query))
print()
print(embedded_query[:15])

Dimensionality of embedded vector: 1536

[-0.010680568404495716, -0.01018487848341465, -0.0019450020045042038, 0.023096051067113876, -0.02682921662926674, 0.013724414631724358, -0.03931440785527229, 0.04200971871614456, -0.009875072166323662, -0.07416760176420212, 0.016946399584412575, -0.038973618298769, -0.02641097828745842, -0.043775614351034164, -0.013639218173921108]


### **Embed list of Documents using `embed_documents()`**

In [22]:
docs = [
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!"
]

In [24]:
# Embed list of docs

embeddings = google_embd_model.embed_documents(docs)

print("Number of Embeddings:", len(embeddings))

print("Dimensionality of Embeddings:", len(embeddings[0]))

Number of Embeddings: 5
Dimensionality of Embeddings: 3072


In [25]:
# Embed list of docs

embeddings = openai_embd_model.embed_documents(docs)

print("Number of Embeddings:", len(embeddings))

print("Dimensionality of Embeddings:", len(embeddings[0]))

Number of Embeddings: 5
Dimensionality of Embeddings: 1536


## **Generating Embeddings Locally using HuggingFaceEmbeddings**

### **Installing the libraries**
```python
! pip install sentence-transformers
! pip install langchain-huggingface
```

### **Instantiating the Embedding Models**

In [27]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"

hf_embd_model = HuggingFaceEmbeddings(model_name=model_name)

### **Embed Single Query using `embed_query()`**

In [28]:
embedded_query = hf_embd_model.embed_query("What was the name mentioned in the conversation?")

print("Dimensionality of embedded vector:", len(embedded_query))
print()
print(embedded_query[:15])

Dimensionality of embedded vector: 768

[0.09514585137367249, 9.883820894174278e-05, -0.016573389992117882, 0.04484795406460762, 0.043236978352069855, -0.008534776046872139, 0.08940394222736359, 0.002871787641197443, -0.031132318079471588, 0.052375730127096176, -0.0026622936129570007, -0.00033354779588989913, 0.023539021611213684, -0.09407063573598862, -0.014772295951843262]


### **Embed list of Documents using `embed_documents()`**

In [29]:
docs = [
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!"
]

In [31]:
embeddings = hf_embd_model.embed_documents(docs)

print("Number of Embeddings:", len(embeddings))

print("Dimensionality of Embeddings:", len(embeddings[0]))

Number of Embeddings: 5
Dimensionality of Embeddings: 768


## **End-to-End Embedding Pipeline**

### **Step 1: Load the documents**

In [5]:
# Load all .srt files
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('data/subtitles', glob="*.srt", show_progress=True, loader_cls=TextLoader)

docs = loader.load()

print("Number of Documents:", len(docs))

100%|█████████████████████████████████████████| 10/10 [00:00<00:00, 6523.02it/s]

Number of Documents: 10





In [6]:
print(type(docs))

print(type(docs[0]))

<class 'list'>
<class 'langchain_core.documents.base.Document'>


In [7]:
# To read the first document (index 0), use .page_content

print(docs[0].page_content[:100])

1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:0


In [8]:
# # Reading the content of all the .srt files

# [srt_file.page_content for srt_file in docs]

### **Step 2: Apply Chunking**
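
Chunking splits each long document into smaller pieces before embedding. The sketch below is a plain-Python, fixed-size character splitter with overlap. It is a simplified stand-in written for illustration, not LangChain's own text splitters, and the function name `chunk_text` and its default sizes are assumptions.

```python
def chunk_text(text, chunk_size=200, chunk_overlap=50):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

With the documents loaded in Step 1, something like `[c for d in docs for c in chunk_text(d.page_content)]` would produce a flat list of chunks. The overlap keeps some shared context between neighbouring chunks so that a sentence cut at a boundary still appears whole in at least one of them.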

### **Step 3: Convert the Chunks into Embeddings**

**Important Note: Be careful when running the following code. Calling a paid embedding API incurs a cost.**

In [9]:
# # Creating embeddings for all the loaded .srt files

# embedded_docs = openai_embd_model.embed_documents([srt_file.page_content for srt_file in docs])

# print("Type of variable:", type(embedded_docs))

# print("Number of embeddings:", len(embedded_docs))

# print("Dimensionality of each embedding:", len(embedded_docs[0]))

## **What's next?**

We just learned how to take documents and embed them into vectors.

These vectors are **stored in memory as a Python list**. Whenever we **restart the program**, these lists are lost.

How do we make sure these embeddings persist to some permanent storage?

**Important Note: If generating embeddings has a cost, why regenerate them every time? Store the embeddings in a database on the first run and read them from the database on later runs for better cost management.**
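
As a rough illustration of that idea, the sketch below caches embeddings in SQLite, keyed by a hash of the text, so the embedding function is called only on a cache miss. The table layout, the helper name `get_or_create_embedding`, and the `fake_embed` function are all assumptions made for this example. It is not a vector database and does no similarity search.

```python
import hashlib
import json
import sqlite3

def get_or_create_embedding(conn, text, embed_fn):
    """Return the cached embedding for text, calling embed_fn only on a cache miss."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE text_hash = ?", (key,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])

    vector = embed_fn(text)
    conn.execute(
        "INSERT INTO embeddings (text_hash, vector) VALUES (?, ?)",
        (key, json.dumps(vector)),
    )
    conn.commit()
    return vector

# ":memory:" keeps the demo self-contained; pass a file path to persist across restarts
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS embeddings (text_hash TEXT PRIMARY KEY, vector TEXT)"
)

calls = []

def fake_embed(text):
    """Hypothetical stand-in for a paid embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text)), 0.5]

first = get_or_create_embedding(conn, "Hello World!", fake_embed)
second = get_or_create_embedding(conn, "Hello World!", fake_embed)
print(len(calls))  # prints 1: the second lookup was served from the database
```

With a real model, `embed_fn` could be a method such as `openai_embd_model.embed_query`, so each distinct text is paid for only once.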