# Embedding Model Experiments in LangChain
* Notebook by Adam Lang
* Date: 7/5/2024
* In this notebook we will perform some experiments using various embedding models with LangChain.

## Install Dependencies

In [1]:
!pip install langchain==0.2.0
!pip install langchain-openai==0.1.7
!pip install langchain-community==0.2.0
!pip install langchain-huggingface==0.0.1


## OpenAI and HuggingFace API Tokens

In [2]:
from getpass import getpass

OPENAI_KEY = getpass('Enter your OPENAI API Key: ')

Enter your OPENAI API Key: ··········


In [3]:
from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass('Enter your HuggingFace Auth Token: ')

Enter your HuggingFace Auth Token: ··········


## Environment Variables

In [5]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY
os.environ['HUGGINGFACEHUB_API_TOKEN'] = HUGGINGFACEHUB_API_TOKEN
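
When the notebook is re-run in a session where the keys are already set, prompting again is unnecessary. A minimal sketch, assuming a hypothetical helper named `ensure_secret` (it is not part of LangChain or of the original notebook), that prompts only when the environment variable is missing:

```python
import os
from getpass import getpass


def ensure_secret(env_name, prompt, ask=getpass):
    """Return the secret stored in env_name, prompting only if it is unset.

    `ask` is injectable so the helper can be exercised without a terminal.
    """
    value = os.environ.get(env_name)
    if not value:
        # Not set yet: ask the user and store it for later cells.
        value = ask(prompt)
        os.environ[env_name] = value
    return value
```

A call such as `ensure_secret('OPENAI_API_KEY', 'Enter your OpenAI API key: ')` would then replace the separate prompt and assignment steps.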

# Embedding Models
* The LangChain `Embeddings` class is the interface for working with text embedding models.
* Many providers are supported (e.g. OpenAI, Cohere, HuggingFace), so the same interface works with just about any embedding provider.
* Embedding models map text to vectors so that the semantic similarity of words and sentences can be compared in the same vector space.
* LangChain provides a base `Embeddings` class with 2 methods:
1. Embedding documents (`embed_documents`)
  * input is multiple texts
2. Embedding queries (`embed_query`)
  * input is a single text

## Why are there 2 methods for embedding text?
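* Some embedding providers use a different method for embedding documents (the texts to be searched) than for embedding search queries, while others use the same method for both. Keeping the two methods separate lets the interface cover either case.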
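
To make the distinction concrete, here is a minimal sketch of a toy embedder that treats queries and documents differently. Everything in it is an assumption for illustration: the class name `ToyPrefixEmbedder`, the `QUERY_PREFIX` string, and the hash-based vectors are hypothetical and do not represent a real model. Only the two method names mirror the LangChain interface described above.

```python
import hashlib

QUERY_PREFIX = "query: "  # hypothetical instruction prefix, for illustration only


class ToyPrefixEmbedder:
    """Duck-typed stand-in exposing embed_documents and embed_query."""

    def __init__(self, dim=8):
        self.dim = dim

    def _embed(self, text):
        # Deterministic pseudo-embedding: hash the text and scale bytes to [0, 1].
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255 for byte in digest[: self.dim]]

    def embed_documents(self, texts):
        # Documents are embedded as-is: list of texts -> list of vectors.
        return [self._embed(text) for text in texts]

    def embed_query(self, text):
        # Queries get a prefix first, so the same text embeds differently.
        return self._embed(QUERY_PREFIX + text)


embedder = ToyPrefixEmbedder()
doc_vectors = embedder.embed_documents(["cars drive fast", "dogs eat and bark"])
query_vector = embedder.embed_query("cars drive fast")
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))
```

Because the query path adds a prefix, embedding the same sentence through `embed_query` and `embed_documents` gives different vectors, which is the situation the two-method interface is designed to allow.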

In [6]:
# let's set up some text documents
docs = [
    "cats eat and spleep",
    "dogs eat and bark",
    "cars drive fast",
    "vehicles include trucks and cars"
]

## Technical Notes
* `.embed_documents`
   * input is a list of texts
   * returns a list of lists of floats (one embedding per text)
* `.embed_query`
   * input is a single text
   * returns a list of floats (one embedding)
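
These return shapes can be checked mechanically. A small sketch, assuming a hypothetical helper `check_embedder_contract` and a trivial stand-in embedder `LengthEmbedder` (neither is part of LangChain), so that the example runs without any API key:

```python
from typing import List, Protocol


class EmbedderLike(Protocol):
    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...
    def embed_query(self, text: str) -> List[float]: ...


class LengthEmbedder:
    """Trivial stand-in: embeds a text as [character count, word count]."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.embed_query(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        return [float(len(text)), float(len(text.split()))]


def check_embedder_contract(embedder: EmbedderLike, texts: List[str]) -> int:
    """Verify the return shapes and return the embedding dimension."""
    doc_vectors = embedder.embed_documents(texts)
    query_vector = embedder.embed_query(texts[0])
    # One vector per input text.
    assert len(doc_vectors) == len(texts)
    # Every vector is a flat list of floats with the same length.
    for vector in doc_vectors + [query_vector]:
        assert all(isinstance(x, float) for x in vector)
        assert len(vector) == len(query_vector)
    return len(query_vector)


dimension = check_embedder_contract(
    LengthEmbedder(), ["cars drive fast", "dogs eat and bark"]
)
print(dimension)
```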

## OpenAI Embedding Models
* Through the `langchain-openai` package, LangChain gives access to OpenAI's text embedding models, such as:
1. `text-embedding-3-small`
2. `text-embedding-3-large`

In [7]:
from langchain_openai import OpenAIEmbeddings

# load model from openai
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

In [8]:
## embed documents
embeddings = openai_embed_model.embed_documents(docs)

In [9]:
# number of embeddings (one per document)
len(embeddings)

4

In [11]:
# dimensionality of the first embedding
len(embeddings[0])

1536

In [12]:
# print the first embedding vector
print(embeddings[0])

[-0.004183967570865675, -0.012147431183688781, -0.010854445328023384, 0.002973866164178819, -0.0009208375645234025, 0.022676972526454307, 0.027636730865661166, 0.04760838269361552, -0.008215429208474042, -0.012160692285743417, 0.06954271301591367, -0.01174958974014657, -0.02883025612761037, 0.04774099557680702, 0.01921575131571127, 0.021470187093773036, 0.018194623174440046, 0.011431315840254745, -0.020634718106556996, 0.029307666977448105, 0.03219864869264838, -0.0007244032993875411, 0.024347906775596107, -0.02031644606931031, -0.03599140397401531, 0.04206512096284355, -0.0013211656975315005, 0.021549753706100853, 0.016576735196161926, -0.02463965847137866, 0.014136639332722406, -0.04248948367917247, -0.008699470143677812, -0.07277849269776018, 0.0022958777697203955, 0.03339217209195245, -0.008328150904244873, -0.014786447303238123, -0.03294128530886912, -0.020197093356850875, -0.017160234862436754, 0.004180652528182658, 0.009780273709856183, 0.002882694108492735, -0.00921666429967945

In [13]:
# docs we just embedded
docs

['cats eat and spleep',
 'dogs eat and bark',
 'cars drive fast',
 'vehicles include trucks and cars']

In [14]:
## now we can compare the cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# compute cosine similarity
sim_matrix = cosine_similarity(embeddings)
sim_matrix

array([[1.        , 0.57917666, 0.18223288, 0.15377351],
       [0.57917666, 1.        , 0.26200263, 0.20038633],
       [0.18223288, 0.26200263, 1.        , 0.39913189],
       [0.15377351, 0.20038633, 0.39913189, 1.        ]])

Summary:
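* The matrix holds the pairwise cosine similarity of the documents: entry (i, j) compares document i with document j, and the diagonal is 1 because each document is identical to itself.
* The two animal sentences (0.58) and the two vehicle sentences (0.40) score higher with each other than with sentences from the other topic (0.15 to 0.26).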
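
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with the standard library only, as an illustration of what `sklearn.metrics.pairwise.cosine_similarity` calculates for every pair of rows. The 2-D vectors are made up for the example, and the helper names `cosine` and `cosine_matrix` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|). Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_matrix(vectors):
    """Pairwise similarity matrix, one row and one column per vector."""
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Made-up 2-D vectors: the first two point in nearly the same direction.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
for row in cosine_matrix(toy_vectors):
    print([round(score, 3) for score in row])
```

Vectors pointing in the same direction score close to 1, and perpendicular vectors score 0, regardless of their lengths.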

## HuggingFace Embedding Models - Open Source
* You can use either the `langchain-community` package or the newer `langchain-huggingface` package.
* `langchain-huggingface` integrates with LangChain and lets us use just about any HuggingFace embedding model in the LangChain ecosystem.
* `HuggingFaceEmbeddings` uses `sentence-transformers` under the hood.
  * Embeddings are computed locally: the model is downloaded from HuggingFace and runs on your own machine, using its memory and compute resources.

In [15]:
# import the library
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# model we will use: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
model_name = 'mixedbread-ai/mxbai-embed-large-v1'


# instantiate with huggingface embeddings from langchain
hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
)


In [16]:
# embed your documents
embeddings = hf_embeddings.embed_documents(docs)

In [17]:
# length of embeddings
len(embeddings)

4

In [18]:
# dimensionality of the first embedding
len(embeddings[0])

1024

In [20]:
# documents we are comparing
docs

['cats eat and spleep',
 'dogs eat and bark',
 'cars drive fast',
 'vehicles include trucks and cars']

In [19]:
# compare cosine similarity
sim_matrix = cosine_similarity(embeddings)
sim_matrix

array([[1.        , 0.64495838, 0.36495506, 0.33347202],
       [0.64495838, 1.        , 0.31728608, 0.33082586],
       [0.36495506, 0.31728608, 1.        , 0.72253332],
       [0.33347202, 0.33082586, 0.72253332, 1.        ]])

Summary:
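* The pattern matches the OpenAI embeddings (the animal sentences pair up and the vehicle sentences pair up), but the scores differ: this model rates the animal pair at 0.64 and the vehicle pair at 0.72, versus 0.58 and 0.40 for OpenAI.
* The vectors also have a different dimensionality: 1024 here versus 1536 for `text-embedding-3-small`.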
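
One way to put a number on this comparison is to measure how well each model separates the two topics (animals vs. vehicles). The sketch below copies the two similarity matrices printed above and subtracts the mean cross-topic score from the mean within-topic score. The helper name `topic_separation` and the choice of this metric are assumptions for illustration, not part of the original notebook.

```python
# Similarity matrices copied from the notebook outputs above.
openai_sim = [
    [1.0, 0.57917666, 0.18223288, 0.15377351],
    [0.57917666, 1.0, 0.26200263, 0.20038633],
    [0.18223288, 0.26200263, 1.0, 0.39913189],
    [0.15377351, 0.20038633, 0.39913189, 1.0],
]
hf_sim = [
    [1.0, 0.64495838, 0.36495506, 0.33347202],
    [0.64495838, 1.0, 0.31728608, 0.33082586],
    [0.36495506, 0.31728608, 1.0, 0.72253332],
    [0.33347202, 0.33082586, 0.72253332, 1.0],
]

# Docs 0-1 are about animals, docs 2-3 are about vehicles.
topics = [0, 0, 1, 1]


def topic_separation(sim, topics):
    """Mean within-topic similarity minus mean cross-topic similarity."""
    within, cross = [], []
    for i in range(len(sim)):
        for j in range(i + 1, len(sim)):  # upper triangle, skip the diagonal
            if topics[i] == topics[j]:
                within.append(sim[i][j])
            else:
                cross.append(sim[i][j])
    return sum(within) / len(within) - sum(cross) / len(cross)


openai_gap = topic_separation(openai_sim, topics)
hf_gap = topic_separation(hf_sim, topics)
print(round(openai_gap, 3), round(hf_gap, 3))
```

Both gaps are positive, so both models keep the topics apart. Absolute cosine scores from different models are not directly comparable, and four sentences are far too few to rank the models, so this is only a sanity check.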

# Demo Project: Building a Small Search Engine

## Load Knowledge Base Documents

In [21]:
documents = [
    'Quantum mechanics describes the behavior of very small particles.',
    'Photosynthesis is the process by which green plants make food using sunlight.',
    "Shakespeare's plays are a testament to English literature.",
    'Artificial Intelligence aims to create machines that can think and learn.',
    'The pyramids of Egypt are historical monuments that have stood for thousands of years.',
    'Biology is the study of living organisms and their interactions with the environment.',
    'Music therapy can aid in the mental well-being of individuals.',
    'The Milky Way is just one of billions of galaxies in the universe.',
    'Economic theories help understand the distribution of resources in society.',
    'Yoga is an ancient practice that involves physical postures and meditation.'
]

In [22]:
# length of documents
len(documents)

10

## Get document embeddings
* Using OpenAI embeddings (`text-embedding-3-small`)

In [23]:
# embed the documents
document_embeddings = openai_embed_model.embed_documents(documents)

## Let's now try to find the most similar document for one search query

In [24]:
new_query = "What is AI?"
new_query

'What is AI?'

In [25]:
# embed the new_query
query_embedding = openai_embed_model.embed_query(new_query)

In [26]:
# compare cosine similarities: query vs. document embeddings
cosine_similarities = cosine_similarity([query_embedding], document_embeddings)
cosine_similarities

array([[ 0.10207296,  0.09261155, -0.00532004,  0.6313305 ,  0.02341739,
         0.09321042,  0.1076584 ,  0.07002011,  0.05820363,  0.06614016]])

### Now we need to get the MOST similar embedding
* The scores above are listed in document order, not sorted by similarity, so we use `np.argmax` to find the index of the highest score and look up the matching document.

In [28]:
import numpy as np

# look up the document with the highest similarity score
documents[np.argmax(cosine_similarities)]

'Artificial Intelligence aims to create machines that can think and learn.'
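
`np.argmax` returns only the single best match. To return several ranked results, the scores can be sorted instead. A minimal standard-library sketch using the scores printed above; the helper name `top_k` is hypothetical (with NumPy, `np.argsort` would be the usual tool):

```python
# Scores copied from the notebook output above (query: "What is AI?").
scores = [0.10207296, 0.09261155, -0.00532004, 0.6313305, 0.02341739,
          0.09321042, 0.1076584, 0.07002011, 0.05820363, 0.06614016]


def top_k(scores, k=3):
    """Return (index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


for index, score in top_k(scores):
    print(index, round(score, 3))
```

Index 3 is the Artificial Intelligence document. The runners-up score only about 0.1, which suggests that just one document in this knowledge base is meaningfully related to the query.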

## Simplifying this: Create a Search engine function

In [29]:
# create a function that does all of the above
def semantic_search_engine(query, embedder_model):
  """Return the document most similar to the query.

  Relies on the global `documents` and `document_embeddings` defined above,
  so `embedder_model` must be the model that produced `document_embeddings`.
  """
  # 1. Embed the query
  query_embedding = embedder_model.embed_query(query)
  # 2. Calculate cosine similarity of the query vs. each document embedding
  cos_scores = cosine_similarity([query_embedding], document_embeddings)[0]
  # 3. Get the index of the highest cosine score
  top_result_id = np.argmax(cos_scores)

  return documents[top_result_id]
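
The function above reads `documents` and `document_embeddings` from the notebook's global scope, so the documents must have been embedded with the same model that is passed in as `embedder_model`. Below is a sketch of one way to make that coupling explicit. It assumes a hypothetical `SemanticSearchEngine` class and a toy word-count embedder, `ToyBagOfWordsEmbedder`, which stands in for a real model so the example runs offline. The two documents are shortened versions of the knowledge base entries.

```python
import math


class ToyBagOfWordsEmbedder:
    """Stand-in embedder: counts occurrences of a fixed vocabulary."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def embed_query(self, text):
        words = text.lower().replace("?", "").replace(".", "").split()
        return [float(words.count(term)) for term in self.vocabulary]

    def embed_documents(self, texts):
        return [self.embed_query(text) for text in texts]


class SemanticSearchEngine:
    """Embeds the documents once and reuses the same embedder for queries."""

    def __init__(self, documents, embedder):
        self.documents = documents
        self.embedder = embedder
        self.document_embeddings = embedder.embed_documents(documents)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0  # zero vectors match nothing

    def search(self, query):
        query_embedding = self.embedder.embed_query(query)
        scores = [self._cosine(query_embedding, d) for d in self.document_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.documents[best]


toy_documents = [
    "The pyramids of Egypt are historical monuments.",
    "Yoga is an ancient practice.",
]
engine = SemanticSearchEngine(
    toy_documents, ToyBagOfWordsEmbedder(["pyramids", "egypt", "yoga"])
)
print(engine.search("What is Yoga?"))
```

Bundling the documents, their embeddings, and the embedder in one object means a query can never be compared against vectors produced by a different model.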

## Demonstrate function

In [30]:
# new query
new_query_sentence = "Tell me about AI"

# using the function --> query + embed model we want to use
semantic_search_engine(new_query_sentence, openai_embed_model)


'Artificial Intelligence aims to create machines that can think and learn.'

In [31]:
## another new query
new_query_sentence = "What do you know about the pyramids?"
semantic_search_engine(new_query_sentence, openai_embed_model)

'The pyramids of Egypt are historical monuments that have stood for thousands of years.'

In [32]:
## another query
new_query_sentence = "How do plants survive?"
semantic_search_engine(new_query_sentence, openai_embed_model)

'Photosynthesis is the process by which green plants make food using sunlight.'

In [34]:
## another query
new_query_sentence = "What is Yoga?"
semantic_search_engine(new_query_sentence, openai_embed_model)

'Yoga is an ancient practice that involves physical postures and meditation.'

Summary:
* This does not generate answers the way a RAG system would. Instead, it performs semantic search, returning the stored document whose embedding has the highest cosine similarity to the query embedding.