# **Text Embedding Models**

The `Embeddings` class is LangChain's interface to text embedding models. Many providers offer such models (OpenAI, Cohere, Hugging Face, etc.), and this class gives all of them a standard interface.

Embeddings create a vector representation of a piece of text. This lets us reason about text in vector space and do things like semantic search, where we look for the pieces of text that are most similar in that space.
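To make "most similar in the vector space" concrete, here is a minimal sketch that ranks texts by cosine similarity. Cosine similarity is one common choice of similarity measure, not the only one. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the names `cosine_similarity`, `toy_vectors`, and `toy_query` are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional vectors standing in for real embeddings (illustrative only)
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.2, 0.9],
}
toy_query = [0.85, 0.15, 0.05]

# Sort the texts from most to least similar to the query vector
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(toy_query, toy_vectors[text]),
    reverse=True,
)
print(ranked)
```

With real embeddings the same ranking step applies; only the vectors change, and they have hundreds or thousands of dimensions instead of three.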

The base `Embeddings` class in LangChain provides two methods: `embed_documents` for embedding documents and `embed_query` for embedding a query. The former takes multiple texts as input, while the latter takes a single text. They are separate methods because some embedding providers use different embedding methods for documents (the texts to be searched over) and queries (the search query itself).
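The shape of this two-method interface can be sketched with a toy class. `ToyEmbeddings` is a hypothetical stand-in written for this illustration; it is not a LangChain class and does not inherit from one. It counts characters into buckets rather than running a model, so its vectors carry no semantic meaning.

```python
from typing import List

class ToyEmbeddings:
    """Hypothetical stand-in that mirrors the two-method interface; not a real model."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        # Bucket character codes into a fixed-size count vector
        vec = [0.0] * self.dim
        for ch in text.lower():
            vec[ord(ch) % self.dim] += 1.0
        return vec

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Many texts in, one vector per text out
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        # One text in, one vector out
        return self._embed(text)

toy_model = ToyEmbeddings(dim=8)
toy_doc_vectors = toy_model.embed_documents(["Hi there!", "Hello World!"])
toy_query_vector = toy_model.embed_query("Hi there!")
print(len(toy_doc_vectors), len(toy_doc_vectors[0]), len(toy_query_vector))
```

In this toy both methods share the same `_embed` helper. A provider that treats documents and queries differently would diverge at exactly that point.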

In [1]:
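# Read the API key; the context manager closes the file,
# and strip() drops any trailing newline from the key
with open('keys/.openai_api_key.txt') as f:
    OPENAI_API_KEY = f.read().strip()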

In [2]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

### **Embed Single Query - `embed_query`**

In [3]:
# Embed single query

embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")

print("Dimensionality of embedded vector:", len(embedded_query))

print(embedded_query[:15])

Dimensionality of embedded vector: 1536
[0.005384807424727807, -0.0005522561790177147, 0.03896066510130955, -0.002939867294003909, -0.008987877434176603, 0.021116891679065407, -0.017197068620393528, -0.0017239962310982204, -0.0029712125847749776, -0.010499054468180394, 0.022383905738602872, 0.009245239112047377, 0.004035306125525385, -0.009291432711317806, -0.010023924428684487]


### **Embed list of Texts - `embed_documents`**

In [4]:
# Embed list of texts

embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)

print("Number of Embeddings:", len(embeddings))

print("Dimensionality of Embeddings:", len(embeddings[0]))

Number of Embeddings: 5
Dimensionality of Embeddings: 1536


### **Create Embeddings for Subtitles Data**

In [5]:
# Load all .srt files
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('data/subtitles_data', glob="*.srt", show_progress=True, loader_cls=TextLoader)

docs = loader.load()

print("Number of Documents:", len(docs))

100%|████████████████████████████████████████████████████████████████████████████████| 23/23 [00:00<00:00, 2356.18it/s]

Number of Documents: 23





In [6]:
print(type(docs))

print(type(docs[0]))

<class 'list'>
<class 'langchain_core.documents.base.Document'>


In [7]:
# To read the text of the first document (index 0), use .page_content

print(docs[0].page_content[:100])

1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.

2
00:00:04,395 --> 00:0


In [8]:
# # Reading the content of all the .srt files

# [srt_file.page_content for srt_file in docs]

**Important Note: Be careful running the following code. It will incur some cost.**

In [9]:
# # Creating embeddings for all the 23 .srt files

# embedded_docs = embeddings_model.embed_documents([srt_file.page_content for srt_file in docs])

# print("Type of variable:", type(embedded_docs))

# print("Number of embeddings:", len(embedded_docs))

# print("Dimensionality of each embedding:", len(embedded_docs[0]))

## **HuggingFace Embeddings**

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"

embeddings_model = HuggingFaceEmbeddings(model_name=model_name)

In [5]:
embedded_docs = embeddings_model.embed_documents([
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!"
])

In [6]:
print("Type of variable:", type(embedded_docs))

print("Number of embeddings:", len(embedded_docs))

print("Dimensionality of each embedding:", len(embedded_docs[0]))

Type of variable: <class 'list'>
Number of embeddings: 5
Dimensionality of each embedding: 768


## **Conclusion**

We just learned how to take documents and embed them into vectors.

These vectors are held in memory as Python lists. Whenever we restart the program, those lists are lost.

How do we make sure these embeddings persist to some permanent storage?

**Important Note: If generating embeddings has a cost associated with it, why generate them every time? Why not store the embeddings in a database during the first execution and read them from the database from then on, for better cost management?**
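As a rough sketch of that idea, the code below computes embeddings once, writes them to disk, and reads them back on later runs. A plain JSON file stands in for a database here. The helper `load_or_create_embeddings`, the placeholder `fake_embed`, and the temporary cache path are all assumptions made for illustration; in practice `fake_embed` would be replaced by a real call such as `embeddings_model.embed_documents`.

```python
import json
import os
import tempfile

def load_or_create_embeddings(texts, embed_fn, cache_path):
    """Return cached vectors if the cache file exists; otherwise compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    vectors = embed_fn(texts)
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(vectors, fh)
    return vectors

# Counter used to show how often the (pretend) paid call actually runs
calls = {"count": 0}

def fake_embed(texts):
    # Placeholder for a paid embedding call; returns meaningless 2-dimensional vectors
    calls["count"] += 1
    return [[float(len(t)), 0.0] for t in texts]

# Temporary location for the demo; a real project would use a stable path
cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
sample_texts = ["Hi there!", "Hello World!"]

first = load_or_create_embeddings(sample_texts, fake_embed, cache_path)
second = load_or_create_embeddings(sample_texts, fake_embed, cache_path)
print("Embedding calls made:", calls["count"])
```

Note that this sketch never checks whether the texts have changed since the cache was written, so a stale file would be returned as-is. It also stores only the vectors, not the texts they belong to.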