
# Embedding models
Embedding models transform text into numerical vectors, allowing machines to represent and compare the semantic meaning of text in a compact format. These vectors, called embeddings, make it possible for retrieval systems to find relevant documents based on meaning rather than just keywords.

# Key Concepts
# 1. Embedding as a Vector:

* Embeddings convert text into numerical vector representations, essentially capturing the “essence” of the text. These vectors allow systems to represent, compare, and process text efficiently.
* Each text is represented as a fixed-length vector, which acts as a unique “fingerprint” of its semantic content.

# 2. Similarity Measurement:

Once text is embedded, similar pieces of text will have similar vectors. Simple mathematical operations, such as cosine similarity or Euclidean distance, can then measure similarity between these vectors, allowing systems to compare the meaning of different texts.
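As a rough illustration of the idea above, the sketch below compares a few hand-written 3-dimensional vectors. The vectors are invented for this example (real embeddings come from a model and have hundreds or thousands of dimensions), and the helper names are assumptions rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-written toy vectors, invented for illustration only.
toy_vectors = {
    "Hi there!": [0.9, 0.1, 0.0],
    "Oh, hello!": [0.8, 0.2, 0.1],
    "What's your name?": [0.1, 0.9, 0.2],
}

query = toy_vectors["Hi there!"]
for text, vec in toy_vectors.items():
    print(
        f"{text!r}: cosine={cosine_similarity(query, vec):.3f}, "
        f"euclidean={euclidean_distance(query, vec):.3f}"
    )
```

With these toy values, the two greetings score a higher cosine similarity and a smaller Euclidean distance with each other than either does with the question.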

# Historical Context
* BERT (2018): Google's BERT brought transformer models to text embeddings, producing powerful representations of text. BERT embeddings performed well on many NLP tasks, but BERT wasn't optimized for sentence embeddings.

* SBERT: SBERT (Sentence-BERT) was developed to adapt BERT for sentence-level embeddings, making it computationally efficient for tasks like sentence similarity by generating embeddings more optimized for comparisons.

# Using Embedding Models with LangChain
LangChain offers a streamlined interface for embedding models with two main methods:

* embed_documents: Embeds multiple texts, typically documents in a corpus.
* embed_query: Embeds a single query to be compared against document embeddings for retrieval tasks.
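To show how the two methods fit together in a retrieval flow, here is a minimal sketch that runs without an API key. `ToyHashEmbeddings` is a hypothetical stand-in written for this example: it only mimics the two method names and counts hashed words, so it captures word overlap, not meaning. A real embedding model would be swapped in for actual use.

```python
import hashlib
import math

class ToyHashEmbeddings:
    """Hypothetical stand-in with the same two method names as a LangChain embedding model."""

    def __init__(self, size=64):
        self.size = size

    def _embed(self, text):
        # Count each word in a bucket chosen by hashing it, then normalize to unit length.
        vec = [0.0] * self.size
        for word in text.lower().split():
            word = word.strip(".,!?")
            if not word:
                continue
            digest = hashlib.md5(word.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self.size] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def embed_documents(self, texts):
        """Embed a list of texts; returns one vector per text."""
        return [self._embed(text) for text in texts]

    def embed_query(self, text):
        """Embed a single query; returns one vector."""
        return self._embed(text)

docs = [
    "The cat sat on the mat",
    "Stock prices fell sharply today",
    "Embeddings map text to vectors",
]

model = ToyHashEmbeddings()
doc_vectors = model.embed_documents(docs)
query_vector = model.embed_query("map text to vectors")

# Vectors are unit length, so the dot product equals the cosine similarity.
scores = [sum(q * d for q, d in zip(query_vector, vec)) for vec in doc_vectors]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```

The document vectors are computed once up front, while the query is embedded at search time and compared against them.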

# Practical Example with LangChain
Here's an example of using LangChain's OpenAIEmbeddings to embed multiple texts:

In [1]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)

# Check the length of embeddings (5 texts, each with 1536 dimensions)
print(len(embeddings), len(embeddings[0]))  # prints "5 1536"


5 1536


# For a single query, you can use:

In [2]:
query_embedding = embeddings_model.embed_query("What is the meaning of life?")

In [3]:
query_embedding

[0.004381184931844473,
 -0.029710490256547928,
 -0.008193866349756718,
 -0.003302882192656398,
 -0.02610172890126705,
 -0.01874825917184353,
 -0.01915609836578369,
 0.011431865394115448,
 -0.02126944810152054,
 -0.0013911344576627016,
 0.0018847129540517926,
 0.016845008358359337,
 0.013990131206810474,
 -0.006611943710595369,
 0.014731657691299915,
 -0.003157666651532054,
 0.040536776185035706,
 -0.007823103107511997,
 0.0035098916850984097,
 -0.01256269309669733,
 0.0031236798968166113,
 0.00378796411678195,
 -0.002312635537236929,
 -0.008410145528614521,
 -0.012476181611418724,
 0.005005303304642439,
 0.018019091337919235,
 -0.03045201674103737,
 0.02340751700103283,
 -0.024371501058340073,
 0.026447774842381477,
 -0.015843946486711502,
 -0.020490845665335655,
 -0.006188655737787485,
 -0.013125017285346985,
 -0.0004901026259176433,
 0.01174701377749443,
 0.0011091999476775527,
 -0.0063709476962685585,
 -0.005094904452562332,
 0.025335485115647316,
 0.02023131214082241,
 0.0008149066

# Similarity Metrics
The embedding space enables us to calculate the similarity between texts, with metrics like:

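* Cosine Similarity: Measures the cosine of the angle between two vectors, ignoring their magnitudes. It is a common default for text embeddings.
* Euclidean Distance: Measures the straight-line distance between vectors; smaller values mean more similar texts.
* Dot Product: Sums the element-wise products of two vectors, so it reflects both their alignment and their magnitudes.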

In [4]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity = cosine_similarity(query_embedding, embeddings[0])
print("Cosine Similarity:", similarity)


Cosine Similarity: 0.7522752852421101
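The three metrics are closely related. As a short worked derivation, assume both vectors are normalized to unit length (an assumption that holds for many, but not all, embedding models), so that $\lVert \mathbf{a} \rVert = \lVert \mathbf{b} \rVert = 1$:

```latex
\begin{aligned}
\cos(\mathbf{a}, \mathbf{b}) &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \mathbf{a} \cdot \mathbf{b}, \\
\lVert \mathbf{a} - \mathbf{b} \rVert^2 &= \lVert \mathbf{a} \rVert^2 + \lVert \mathbf{b} \rVert^2 - 2\, \mathbf{a} \cdot \mathbf{b} = 2 - 2 \cos(\mathbf{a}, \mathbf{b}).
\end{aligned}
```

Under that assumption, cosine similarity equals the dot product, and a larger cosine similarity always means a smaller Euclidean distance, so all three metrics rank documents in the same order.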


# Benefits of Embeddings
Embedding models allow for semantic search and clustering by enabling retrieval of similar texts based on meaning rather than literal keyword matching. They are foundational in applications such as recommendation systems, chatbots, and document retrieval systems.

# Types

# 1. Text embedding models

The Embeddings class in LangChain provides a simple way to work with text embedding models from different providers (like OpenAI, Cohere, Hugging Face).

# How it Works:

It converts text into vectors, which are numerical representations capturing the meaning of the text.
This helps with tasks like semantic search by finding texts with similar meanings based on their positions in this "vector space."

# Key Methods:

* embed_documents: Embeds multiple texts (e.g., documents) and returns a list of vectors, one per text.
* embed_query: Embeds a single query (e.g., a search term) and returns a single vector.

These methods are separate because some providers handle document and query embeddings differently.
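One way a provider can treat the two differently is by adding a distinct prefix to each kind of input before embedding it, which some models are trained to expect. The sketch below is only an assumption-laden illustration: `PrefixedEmbeddings`, the prefix strings, and `recording_embed` are hypothetical names made up for this example, not a LangChain API.

```python
class PrefixedEmbeddings:
    """Hypothetical wrapper that prefixes documents and queries differently."""

    def __init__(self, embed_fn, document_prefix="passage: ", query_prefix="query: "):
        self.embed_fn = embed_fn  # callable: list of str -> list of vectors
        self.document_prefix = document_prefix
        self.query_prefix = query_prefix

    def embed_documents(self, texts):
        return self.embed_fn([self.document_prefix + text for text in texts])

    def embed_query(self, text):
        return self.embed_fn([self.query_prefix + text])[0]

seen = []

def recording_embed(texts):
    """Fake embedding function that records its inputs and returns dummy vectors."""
    seen.extend(texts)
    return [[float(len(text))] for text in texts]

model = PrefixedEmbeddings(recording_embed)
model.embed_documents(["Hello World!"])
model.embed_query("Hello World!")
print(seen)  # ['passage: Hello World!', 'query: Hello World!']
```

The same input text reaches the underlying model in two different forms, which is why callers need to say whether they are embedding a document or a query.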

In [5]:
pip install langchain-huggingface

Note: you may need to restart the kernel to use updated packages.





In [1]:
# .embed_documents to embed a list of strings, recovering a list of embeddings:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



In [None]:
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings), len(embeddings[0])

In [7]:
# Use .embed_query to embed a single piece of text (e.g., for the purpose of comparing to other embedded pieces of texts).
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]

[0.09514587372541428,
 9.884173778118566e-05,
 -0.016573386266827583,
 0.0448479987680912,
 0.043236974626779556]

# 2. Caching
Embeddings can be cached with CacheBackedEmbeddings. Caching avoids recomputing embeddings for text that has already been embedded, which speeds up your application.

# Why Cache Embeddings?
Embedding computations can be time-consuming and costly. By caching embeddings in a key-value store, you don't have to recompute them every time the same text appears, saving both time and resources.

# How to Use CacheBackedEmbeddings
1. CacheBackedEmbeddings is a wrapper around your embedding model (e.g., OpenAI, Hugging Face). It caches the embeddings as they're computed, using a hashed key for each piece of text.

2. To set it up, use CacheBackedEmbeddings.from_bytes_store() with the following key parameters:

    * underlying_embedder: The embedding model you're using (e.g., OpenAIEmbeddings).
    * document_embedding_cache: Where you want to store your cached embeddings (e.g., a local file store).
    * namespace: Optional, but recommended so that the same text embedded with different models does not collide in the cache.
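The caching idea described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: `ToyCachedEmbedder`, `fake_embed`, the SHA-256 key scheme, and the dict used as a store are all made up for this example.

```python
import hashlib

class ToyCachedEmbedder:
    """Illustrative only; not LangChain's CacheBackedEmbeddings implementation."""

    def __init__(self, embed_fn, store, namespace=""):
        self.embed_fn = embed_fn    # callable: list of str -> list of vectors
        self.store = store          # any dict-like key-value store
        self.namespace = namespace  # prefix that keeps different models apart

    def _key(self, text):
        # Namespace plus a hash of the text identifies one cached embedding.
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts):
        keys = [self._key(text) for text in texts]
        # Only texts without a cached vector are sent to the underlying model.
        missing = list(dict.fromkeys(t for t, k in zip(texts, keys) if k not in self.store))
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.store[self._key(text)] = vector
        return [self.store[key] for key in keys]

calls = []

def fake_embed(texts):
    """Fake embedding function that records each batch it is asked to embed."""
    calls.append(list(texts))
    return [[float(len(text)), float(text.count(" "))] for text in texts]

cached = ToyCachedEmbedder(fake_embed, store={}, namespace="toy-model-")
cached.embed_documents(["Hi there!", "Hello World!"])
cached.embed_documents(["Hi there!", "Something new"])
print(calls)  # [['Hi there!', 'Hello World!'], ['Something new']]
```

The second call only sends "Something new" to the underlying function, because "Hi there!" is already in the store.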

# Example Setup with Local Cache:

In [8]:
pip install langchain-openai faiss-cpu



In [9]:
from langchain.embeddings import CacheBackedEmbeddings

# Using with a Vector Store
First, let's look at an example that stores embeddings on the local file system and uses a FAISS vector store for retrieval.

In [10]:
from langchain.storage import LocalFileStore
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

underlying_embeddings = OpenAIEmbeddings()

store = LocalFileStore("./cache/")

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)

In [11]:
# The cache is empty prior to embedding:
list(store.yield_keys())

[]

Load the document, split it into chunks, embed each chunk and load it into the vector store.

In [12]:
# Import from the current packages rather than the deprecated langchain.* paths
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

# Load the raw document
raw_documents = TextLoader(r"C:\Users\Admin\Desktop\10-20-2024\data\state_of_the_union.txt", encoding="utf-8").load()

# Set up the text splitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)


Create the vector store:

In [13]:
%%time
db = FAISS.from_documents(documents, cached_embedder)

CPU times: total: 328 ms
Wall time: 7.86 s


If we try to create the vector store again, it'll be much faster since it does not need to re-compute any embeddings.

In [14]:
%%time
db2 = FAISS.from_documents(documents, cached_embedder)

CPU times: total: 46.9 ms
Wall time: 1.01 s


And here are some of the cache keys that were created. Each key is the namespace (the model name) followed by a hash of the embedded text:

In [15]:
list(store.yield_keys())[:5]

['text-embedding-ada-00201dbc21f-5e4c-5fb5-8d13-517dbe7a32d4',
 'text-embedding-ada-002059eb9ff-c4c8-5ceb-8bf9-0d3d02a92b44',
 'text-embedding-ada-0020fc1ede2-407a-5e14-8f8f-5642214263f5',
 'text-embedding-ada-0021297d37a-2bc1-5e19-bf13-6c950f075062',
 'text-embedding-ada-00217a6727d-8916-54eb-b196-ec9c9d6ca472']

# Swapping the ByteStore
To use a different ByteStore, pass it when creating your CacheBackedEmbeddings. Below, we create an equivalent cached embeddings object, except using the non-persistent InMemoryByteStore instead:

In [16]:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import InMemoryByteStore

store = InMemoryByteStore()

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)