- Table of contents
  - [Interface](#interface)
  - [Measure similarity](#measure-similarity)
  - [Embedding Model](#embedding-model)
  - [Text Embedding Models](#text-embedding-models)
  - [Caching](#caching)

## Interface

1. **`embed_documents`:** Converts multiple texts into vector representations, one vector per text.
2. **`embed_query`:** Converts a single text (the query) into a vector representation.
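
To make the contract of the two methods concrete, here is a minimal sketch that does not call any API. `ToyEmbeddings` and its hashing scheme are hypothetical stand-ins, not part of LangChain or OpenAI; only the method names and the shapes of the return values mirror the interface described above.

```python
import hashlib
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in that mimics the two-method interface."""

    def __init__(self, size: int = 8) -> None:
        # Number of dimensions per vector (at most 32, the SHA-256 digest length).
        self.size = size

    def _embed(self, text: str) -> List[float]:
        # Derive deterministic pseudo-values from a SHA-256 digest.
        # These numbers carry no semantic meaning.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.size]]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Many texts in, one vector per text out.
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # One text in, one vector out.
        return self._embed(text)


toy = ToyEmbeddings(size=8)
doc_vectors = toy.embed_documents(["Hi there!", "Oh, hello!", "Hello World!"])
query_vector = toy.embed_query("Hi there!")

print(len(doc_vectors), len(doc_vectors[0]))  # number of texts, vector size
print(query_vector == doc_vectors[0])  # the same text yields the same toy vector
```

A real embedding model returns vectors whose geometry reflects meaning; the toy version only demonstrates the input and output types.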

In [1]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")
embedding = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)

In [4]:
len(embedding), len(embedding[0])

(5, 1536)

In [2]:
query_embedding = embeddings_model.embed_query("What is the meaning of life?")
doc_embedding = embeddings_model.embed_documents([
    "What is the meaning of life?"
])

In [13]:
print(len(query_embedding))
print(len(doc_embedding[0]))

1536
1536


## Measure similarity

In [5]:
import numpy as np


def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)


# Sanity check: a vector compared with itself should score approximately 1.0.
similarity = cosine_similarity(query_embedding, query_embedding)
print("Cosine Similarity:", similarity)

Cosine Similarity: 0.9999999999999998
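
Comparing a vector with itself only confirms that the function returns roughly 1. The sketch below uses small hand-written vectors (illustrative values, not real embeddings) to show the range of cosine similarity and how it can rank documents against a query. It uses only the standard library; `cosine_similarity_pure` and `rank_by_similarity` are hypothetical helper names.

```python
import math
from typing import List, Sequence


def cosine_similarity_pure(vec1: Sequence[float], vec2: Sequence[float]) -> float:
    # Same formula as the NumPy version: dot product divided by the norms.
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    norm_vec1 = math.sqrt(sum(a * a for a in vec1))
    norm_vec2 = math.sqrt(sum(b * b for b in vec2))
    return dot_product / (norm_vec1 * norm_vec2)


def rank_by_similarity(
    query: Sequence[float], docs: Sequence[Sequence[float]]
) -> List[int]:
    # Return document indices ordered from most to least similar.
    scores = [cosine_similarity_pure(query, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]

print(cosine_similarity_pure(toy_query, [2.0, 0.0]))   # same direction -> 1.0
print(cosine_similarity_pure(toy_query, [0.0, 1.0]))   # orthogonal -> 0.0
print(cosine_similarity_pure(toy_query, [-1.0, 0.0]))  # opposite -> -1.0
print(rank_by_similarity(toy_query, toy_docs))         # -> [2, 0, 1]
```

Because the score depends only on direction, scaling a vector (as in the first call) does not change the result.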


## Embedding Model


- An embedding model creates a vector representation of a piece of text.

In [2]:
from langchain_openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [3]:
embeddings.embed_query("Hello, world")

[-0.005234135314822197,
 -0.032529741525650024,
 0.012238061055541039,
 0.008411875925958157,
 -0.032443758100271225,
 -0.02863190323114395,
 -0.03012225218117237,
 0.042560938745737076,
 -0.022469881922006607,
 -0.018457403406500816,
 0.01693839393556118,
 -0.01015300489962101,
 -0.019130926579236984,
 -0.033246252685785294,
 0.010934005491435528,
 0.015820631757378578,
 -0.05393918231129646,
 0.04485378414392471,
 0.012617813423275948,
 0.0370294488966465,
 0.04760519787669182,
 -0.006702989339828491,
 -0.008433370850980282,
 0.0141869792714715,
 0.04069800302386284,
 0.0003701243258547038,
 -0.0034643455874174833,
 0.03290232643485069,
 -0.002957412041723728,
 -0.04723260924220085,
 0.043449416756629944,
 -0.03447865694761276,
 -0.016365181654691696,
 -0.003539579687640071,
 0.013957695104181767,
 0.021595735102891922,
 0.019632486626505852,
 -0.005793016403913498,
 0.00014083980931900442,
 -0.019675476476550102,
 0.008591003715991974,
 -0.007544893305748701,
 0.019947752356529236,
 ...]


## Text embedding models

In [4]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

### `embed_documents`

- Use **`embed_documents`** to embed a list of strings, converting each one into a vector.

In [8]:
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)

len(embeddings), embeddings[0]

(5,
 [-0.019144710153341293,
  -0.03818102926015854,
  -0.03100006841123104,
  -0.004633751232177019,
  -0.035362839698791504,
  -0.004027434624731541,
  0.012959600426256657,
  0.05102546140551567,
  -0.005832836497575045,
  -0.037259697914123535,
  -0.010805312544107437,
  -0.0021763050463050604,
  0.027274098247289658,
  -0.0022660670801997185,
  0.005883644800633192,
  0.03400794044137001,
  -0.01651620678603649,
  -0.010107539594173431,
  -0.0317046158015728,
  0.07647044956684113,
  0.059886496514081955,
  -0.018711142241954803,
  0.002972307614982128,
  0.01899567060172558,
  0.03975271061062813,
  0.04582265391945839,
  0.020838333293795586,
  0.0065339962020516396,
  0.013352520763874054,
  -0.004782790318131447,
  0.029889052733778954,
  -0.022342270240187645,
  0.006984499748796225,
  -0.02399524487555027,
  -0.015039368532598019,
  -0.0035837055183947086,
  -0.007546782493591309,
  0.01815563440322876,
  -0.009673972614109516,
  -0.04251670092344284,
  0.012099239975214005,
  ...])

### `embed_query`

- Use **`embed_query`** to embed a single text, converting it into a vector representation.

In [9]:
embedded_query = embeddings_model.embed_query(
    "What was the name mentioned in the conversation?")
embedded_query[:5]

[-0.010684116743505001,
 -0.010173137299716473,
 -0.0019674645736813545,
 0.023056013509631157,
 -0.02686513401567936]

## Caching

### Using with a Vector Store

- First, let's look at an example that stores embeddings on the local file system and uses a FAISS vector store for retrieval.



In [18]:
from langchain.storage import LocalFileStore
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.embeddings import CacheBackedEmbeddings

underlying_model = OpenAIEmbeddings(model="text-embedding-3-small")
store = LocalFileStore("./cache/")

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_model, store, namespace=underlying_model.model
)

In [15]:
list(store.yield_keys())  # cache is empty prior to embedding

[]

- Load the document, split it into chunks, embed each chunk, and load the chunks into the vector store.

In [26]:
pdf_path = "../sample_files/progit.pdf"


raw_docs = []

pdf_loader = PyPDFLoader(pdf_path)

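# Load pages lazily from the PDF. Top-level `async for` works in a notebook;
# in a plain script, wrap this loop in an async function.
async for doc in pdf_loader.alazy_load():
    current_page_number = doc.metadata["page_label"]

    # Skip pages whose label is not purely numeric.
    if not current_page_number.isdigit():
        continue

    current_page_number = int(current_page_number)

    # Keep only pages 8 through 491.
    if current_page_number < 8:
        continue
    if current_page_number >= 492:
        break

    raw_docs.append(doc)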

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
documents = text_splitter.split_documents(
    raw_docs
)
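
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-width windows that share characters with their neighbors. This is an assumption-laden simplification: `CharacterTextSplitter` splits on a separator and merges the pieces, so its chunks will generally differ. `sliding_window_chunks` is a hypothetical helper.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


toy_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(toy_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role `chunk_overlap=200` plays at a larger scale in the cell above.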

- Create the vector store

In [33]:
%%time
db = FAISS.from_documents(documents, cached_embedder)

CPU times: total: 516 ms
Wall time: 502 ms


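- If we create the vector store again, it is much faster because the embeddings are read from the cache instead of being recomputed. Note that the timings shown here appear to come from cells run out of order (execution counts `In [33]` and `In [32]`): the 5.43 s run likely populated the cache, and the 502 ms run reused it.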

In [32]:
%%time
db2 = FAISS.from_documents(documents, cached_embedder)

CPU times: total: 547 ms
Wall time: 5.43 s


- And here are some of the embeddings that got created:



In [34]:
list(store.yield_keys())[:5]

['text-embedding-3-small007d8505-8360-510b-b44f-f6ebdadc10e5',
 'text-embedding-3-small01ba9ea9-e36a-5711-8cb5-bb48544ca8aa',
 'text-embedding-3-small025716d1-e1da-589e-95c0-c307c51776b0',
 'text-embedding-3-small031ffd7a-e23c-5771-8a7a-89b4585a6e1c',
 'text-embedding-3-small0366fcfb-f15d-59e9-9b12-ba92816c69e0']
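
The keys above consist of the namespace followed by a UUID-like hash. The sketch below shows the general idea of a cache-backed embedder under stated assumptions: `ToyCachedEmbedder`, `fake_embed`, and the UUIDv5 key scheme are hypothetical and are not LangChain's implementation. The point is only that texts already in the store never reach the underlying model.

```python
import uuid
from typing import Callable, Dict, List

Vector = List[float]


class ToyCachedEmbedder:
    """Hypothetical cache wrapper; not LangChain's CacheBackedEmbeddings."""

    def __init__(
        self,
        embed_fn: Callable[[List[str]], List[Vector]],
        store: Dict[str, Vector],
        namespace: str,
    ) -> None:
        self.embed_fn = embed_fn
        self.store = store
        self.namespace = namespace

    def _key(self, text: str) -> str:
        # Assumed key scheme: namespace followed by a UUIDv5 of the text.
        return self.namespace + str(uuid.uuid5(uuid.NAMESPACE_DNS, text))

    def embed_documents(self, texts: List[str]) -> List[Vector]:
        keys = [self._key(text) for text in texts]
        # Only texts without a cached vector are sent to the underlying model.
        missing = [text for text, key in zip(texts, keys) if key not in self.store]
        if missing:
            for text, vector in zip(missing, self.embed_fn(missing)):
                self.store[self._key(text)] = vector
        return [self.store[key] for key in keys]


calls: List[List[str]] = []


def fake_embed(texts: List[str]) -> List[Vector]:
    # Stand-in for a slow or paid embedding call; records what it was asked for.
    calls.append(list(texts))
    return [[float(len(text)), float(text.count(" "))] for text in texts]


toy_store: Dict[str, Vector] = {}
toy_embedder = ToyCachedEmbedder(fake_embed, toy_store, namespace="toy-model")

first = toy_embedder.embed_documents(["Hi there!", "Hello World!"])
second = toy_embedder.embed_documents(["Hi there!", "Hello World!", "New text"])

print(calls)  # the second call only embeds "New text"
```

Including the model name in the namespace keeps vectors from different models from colliding when the same text is embedded by more than one model.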

### Swapping the ByteStore

- To use a different `ByteStore`, pass it when creating your `CacheBackedEmbeddings`. Below, we create an equivalent cached embeddings object that uses the non-persistent `InMemoryByteStore` instead:

In [35]:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import InMemoryByteStore

store = InMemoryByteStore()

# Reuse `underlying_model`, the embeddings object defined earlier in this section.
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_model, store, namespace=underlying_model.model
)
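
What makes the store swappable is that the embedder only relies on a small key-value interface over bytes. The sketch below illustrates that idea with two hypothetical stores, one in memory and one on disk, that expose the same `mget`, `mset`, and `yield_keys` methods. The class names, helper functions, and JSON serialization are assumptions for illustration, modeled loosely on the store interface rather than copied from LangChain.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List, Optional, Sequence, Tuple


class ToyMemoryStore:
    """Hypothetical non-persistent store backed by a dict."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def mget(self, keys: Sequence[str]) -> List[Optional[bytes]]:
        return [self._data.get(key) for key in keys]

    def mset(self, pairs: Sequence[Tuple[str, bytes]]) -> None:
        for key, value in pairs:
            self._data[key] = value

    def yield_keys(self) -> Iterator[str]:
        yield from self._data


class ToyFileStore:
    """Hypothetical persistent store that writes one file per key."""

    def __init__(self, root: Path) -> None:
        self._root = root
        self._root.mkdir(parents=True, exist_ok=True)

    def mget(self, keys: Sequence[str]) -> List[Optional[bytes]]:
        paths = [self._root / key for key in keys]
        return [path.read_bytes() if path.exists() else None for path in paths]

    def mset(self, pairs: Sequence[Tuple[str, bytes]]) -> None:
        for key, value in pairs:
            (self._root / key).write_bytes(value)

    def yield_keys(self) -> Iterator[str]:
        for path in self._root.iterdir():
            yield path.name


def cache_vector(store, key: str, vector: List[float]) -> None:
    # Serialize the vector to bytes before handing it to the store.
    store.mset([(key, json.dumps(vector).encode("utf-8"))])


def load_vector(store, key: str) -> Optional[List[float]]:
    raw = store.mget([key])[0]
    return None if raw is None else json.loads(raw.decode("utf-8"))


results = []
with tempfile.TemporaryDirectory() as tmp_dir:
    for toy_byte_store in (ToyMemoryStore(), ToyFileStore(Path(tmp_dir) / "cache")):
        cache_vector(toy_byte_store, "toy-key", [0.1, 0.2, 0.3])
        hit = load_vector(toy_byte_store, "toy-key")
        miss = load_vector(toy_byte_store, "missing")
        results.append((hit, miss, list(toy_byte_store.yield_keys())))

print(results[0] == results[1])  # both stores behave the same to the caller
```

The calling code is identical for both stores, which is why switching between a file-backed and an in-memory cache only requires changing the object that is passed in.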