# Research on Embedding Models

Vector embeddings are a powerful technique for transforming complex data into a numerical form that machine learning algorithms can process and analyze. They let us represent virtually any data type as vectors.

But it isn't as simple as just turning data into vectors. We want to perform tasks on the transformed data without losing its original meaning. For example, when comparing two sentences, we don't want to compare just the words they contain, but whether they mean the same thing.

To preserve the data's meaning, we need vectors whose relationships to one another make sense. This is the job of an embedding model: a pre-trained machine learning model that produces a more compact representation of the data while preserving what is meaningful about it.

The goal of embeddings is to capture the semantic meaning of, and relationships between, data points so that similar items are close together in the vector space and dissimilar items are far apart. For example, consider the words "king" and "queen". An embedding might map these words to vectors such that the difference between the "king" and "queen" vectors is similar to the difference between the "man" and "woman" vectors. This reflects the underlying semantic relationship.
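
The "king"/"queen" relationship can be illustrated with a minimal sketch. The 3-dimensional vectors below are hypothetical values picked by hand so that the offsets match; they do not come from any real embedding model, and `toy_vectors` and `subtract` are names invented for this example.

```python
# Hypothetical 3-dimensional vectors chosen by hand for illustration only.
# Real embedding models produce hundreds of learned dimensions.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
}


def subtract(a, b):
    """Return the element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


# Offset that turns "queen" into "king", and offset that turns "woman" into "man".
royal_offset = subtract(toy_vectors["king"], toy_vectors["queen"])
gender_offset = subtract(toy_vectors["man"], toy_vectors["woman"])

print(royal_offset)
print(gender_offset)
```

Because the toy values were constructed that way, both offsets come out identical. With a real model the two offsets would only be approximately parallel.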

Key characteristics of embeddings:
- Dimensionality: The number of elements in the vector. Higher dimensions can capture more complex relationships but are computationally more expensive.
- Similarity: Measured using metrics like cosine similarity or Euclidean distance, which indicate how close or far apart two vectors are.
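
As a rough sketch of the two metrics named above, the following standard-library code computes both for hand-picked vectors. The helper names `cosine` and `euclidean_distance` and the sample vectors are assumptions made for this illustration; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]    # same direction as v, 10x the magnitude
opposite = [-1.0, -2.0, -3.0]  # same magnitude as v, opposite direction

print(f"Dimensionality: {len(v)}")
print(f"cosine(v, scaled):     {cosine(v, scaled):.4f}")
print(f"euclidean(v, scaled):  {euclidean_distance(v, scaled):.4f}")
print(f"cosine(v, opposite):   {cosine(v, opposite):.4f}")
```

Scaling a vector leaves its cosine similarity to the original at approximately 1 but moves it far away in Euclidean terms, which is why cosine similarity is the usual choice when only direction matters.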

In [None]:
pip install python-dotenv==1.0.1 langchain==0.2.11 langchain-community==0.2.10 scikit-learn==1.5.0

In [1]:
import os
from dotenv import load_dotenv
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
load_dotenv()

True

In [3]:
# Calculate cosine similarity between vectors.
# The value ranges from -1 to 1: values close to 1 indicate vectors pointing in nearly the same direction,
# values close to 0 indicate unrelated (orthogonal) vectors, and negative values indicate opposing directions.
# Cosine similarity is particularly useful when the magnitude of the vectors is not as important as their direction.
# It is often used in text analysis and information retrieval where the orientation (semantic meaning) of vectors
# matters more than their magnitude.
def compare_embeddings(vector_1, vector_2):
    similarity = cosine_similarity([vector_1], [vector_2])
    print(f"Cosine similarity: {similarity[0][0]}")

In [4]:
# Some sentences to test different embedding models
sentences_list = [
    # Include the same words but have different semantics. Similarity should be low.
    ["I want a new watch for my birthday", "I like to watch the TV on weekends"],      
    ["The bank will close at 5 PM", "The river bank is a great place to relax"],
    # Convey the same meaning without sharing words. Similarity should be high.
    ["I have to go to the mechanic", "My car is broken"], 
    ["She went to the store to buy some groceries", "She went shopping for food"],
    # Completely unrelated in terms of their content and context. Similarity should be low.
    ["The stock market experienced a significant decline last week", "She enjoys painting landscapes in her free time"],
    ["The sun rises in the east every morning", "Pizza is a popular food in Italy"]
]

In [5]:
# Test an embedding model by embedding each sentence pair and printing the cosine similarity of the two vectors
def test_embeddings(model):
    for sentences_pair in sentences_list:
        vector_1 = model.embed_query(sentences_pair[0])
        print(f"\"{sentences_pair[0]}\"")
        print(f" Dimensionality: {len(vector_1)}")
        print(f" Sample: {vector_1[:5]}")
        print()
        
        vector_2 = model.embed_query(sentences_pair[1])
        print(f"\"{sentences_pair[1]}\"")
        print(f" Dimensionality: {len(vector_2)}")
        print(f" Sample: {vector_2[:5]}")
        print()

        compare_embeddings(vector_1, vector_2)
        print("---")
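
`test_embeddings` only relies on the model exposing an `embed_query` method, so the same loop can be exercised offline and made to return scores instead of printing them. The sketch below is one possible variant under that assumption: `ToyEmbeddings`, `cosine`, and `collect_similarities` are hypothetical names introduced here, and the toy model hashes words into a count vector, so it reflects word overlap only, not meaning.

```python
import math
import zlib


class ToyEmbeddings:
    """Hypothetical offline stand-in that exposes embed_query like the real models."""

    def __init__(self, size=16):
        self.size = size

    def embed_query(self, text):
        # Count words into fixed buckets; crc32 keeps the bucketing deterministic.
        vector = [0.0] * self.size
        for word in text.lower().split():
            vector[zlib.crc32(word.encode("utf-8")) % self.size] += 1.0
        return vector


def cosine(a, b):
    """Cosine similarity in plain Python; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def collect_similarities(model, pairs):
    """Return one (sentence_1, sentence_2, similarity) tuple per sentence pair."""
    rows = []
    for first, second in pairs:
        score = cosine(model.embed_query(first), model.embed_query(second))
        rows.append((first, second, round(score, 2)))
    return rows


pairs = [
    ["I have to go to the mechanic", "My car is broken"],
    ["My car is broken", "My car is broken"],
]
rows = collect_similarities(ToyEmbeddings(), pairs)
for row in rows:
    print(row)
```

Returning rows rather than printing makes it easier to build a comparison table across models. Any object with a compatible `embed_query` method should work in place of `ToyEmbeddings`.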

## Google AI

In [None]:
pip install langchain-google-genai

In [6]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

google_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

test_embeddings(google_embeddings)


"I want a new watch for my birthday"
 Dimensionality: 768
 Sample: [-0.006526446435600519, -0.05270149186253548, -0.06381335854530334, -0.04306812211871147, 0.04528359696269035]

"I like to watch the TV on weekends"
 Dimensionality: 768
 Sample: [-0.021920520812273026, -0.02803944982588291, -0.05818081274628639, 0.006759436335414648, 0.01261313445866108]

Cosine similarity: 0.6767452984415877
---
"The bank will close at 5 PM"
 Dimensionality: 768
 Sample: [0.019774820655584335, 0.0191058199852705, -0.009512552060186863, -0.06061512976884842, 0.03158271312713623]

"The river bank is a great place to relax"
 Dimensionality: 768
 Sample: [0.028460154309868813, -0.003968124743551016, -0.009560493752360344, -0.02726844884455204, 0.044375017285346985]

Cosine similarity: 0.6493495151799284
---
"I have to go to the mechanic"
 Dimensionality: 768
 Sample: [0.019688233733177185, -0.04954982548952103, -0.04260384291410446, -0.020591052249073982, -0.012055622413754463]

"My car is broken"
 Dimens

## Hugging Face

In [None]:
pip install langchain_huggingface==0.0.3 huggingface_hub==0.23.4

In [7]:
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings

hface_embeddings = HuggingFaceEndpointEmbeddings()

test_embeddings(hface_embeddings)

"I want a new watch for my birthday"
 Dimensionality: 768
 Sample: [-0.043364789336919785, 0.04182625934481621, -0.0038558896631002426, 0.0431031733751297, 0.01737719029188156]

"I like to watch the TV on weekends"
 Dimensionality: 768
 Sample: [-0.04785402491688728, 0.0326734222471714, 0.0009588559623807669, -0.0176838506013155, -0.03530512750148773]

Cosine similarity: 0.176958964124023
---
"The bank will close at 5 PM"
 Dimensionality: 768
 Sample: [-0.04836975410580635, 0.0051988218910992146, -0.029005667194724083, 0.00420030765235424, 0.03131011128425598]

"The river bank is a great place to relax"
 Dimensionality: 768
 Sample: [-0.08330544829368591, -0.014301057904958725, -0.029551850631833076, -0.00013575299817603081, -0.01434556394815445]

Cosine similarity: 0.23080635655292628
---
"I have to go to the mechanic"
 Dimensionality: 768
 Sample: [-0.023156633600592613, -0.02565646730363369, -0.012464101426303387, 0.03722125664353371, -0.024508243426680565]

"My car is broken"
 Dime

## Azure OpenAI

In [None]:
pip install langchain-openai==0.1.22

In [15]:
from langchain_openai import AzureOpenAIEmbeddings

openai_embeddings = AzureOpenAIEmbeddings(
    model="ada-002",
    openai_api_version="2024-06-01"
)

test_embeddings(openai_embeddings)

"I want a new watch for my birthday"
 Dimensionality: 1536
 Sample: [-0.030059291049838066, 0.0008972808718681335, 7.123553223209456e-05, -0.03169933333992958, -0.027586987242102623]

"I like to watch the TV on weekends"
 Dimensionality: 1536
 Sample: [-0.008290333673357964, -0.013153238222002983, 0.011332020163536072, -0.024978503584861755, -0.02726767398416996]

Cosine similarity: 0.785458414824596
---
"The bank will close at 5 PM"
 Dimensionality: 1536
 Sample: [-0.030440565198659897, -0.0058855642564594746, 0.014171946793794632, -0.009836236946284771, -0.0347115620970726]

"The river bank is a great place to relax"
 Dimensionality: 1536
 Sample: [0.010515332221984863, 0.00823382381349802, 0.013548845425248146, 0.0017079447861760855, -0.03066653199493885]

Cosine similarity: 0.7972961835284542
---
"I have to go to the mechanic"
 Dimensionality: 1536
 Sample: [-0.013156797736883163, 0.009770764969289303, 0.0030002621933817863, -0.014506213366985321, -0.03778362646698952]

"My car is 

## Results

Embedding model | Sentence 1 | Sentence 2 | Cosine similarity 
:---: | :---: | :---: | :---:
Google | "I want a new watch for my birthday" | "I like to watch the TV on weekends" | 0.67
Google | "The bank will close at 5 PM" | "The river bank is a great place to relax" | 0.65
Google | "I have to go to the mechanic" | "My car is broken" | 0.84
Google | "She went to the store to buy some groceries" | "She went shopping for food" | 0.95
Google | "The stock market experienced a significant decline last week" | "She enjoys painting landscapes in her free time" | 0.52
Google | "The sun rises in the east every morning" | "Pizza is a popular food in Italy" | 0.59
Hugging Face | "I want a new watch for my birthday" | "I like to watch the TV on weekends" | 0.17
Hugging Face | "The bank will close at 5 PM" | "The river bank is a great place to relax" | 0.23
Hugging Face | "I have to go to the mechanic" | "My car is broken" | 0.60
Hugging Face | "She went to the store to buy some groceries" | "She went shopping for food" | 0.88
Hugging Face | "The stock market experienced a significant decline last week" | "She enjoys painting landscapes in her free time" | -0.09
Hugging Face | "The sun rises in the east every morning" | "Pizza is a popular food in Italy" | 0.17
OpenAI | "I want a new watch for my birthday" | "I like to watch the TV on weekends" | 0.78
OpenAI | "The bank will close at 5 PM" | "The river bank is a great place to relax" | 0.79
OpenAI | "I have to go to the mechanic" | "My car is broken" | 0.88
OpenAI | "She went to the store to buy some groceries" | "She went shopping for food" | 0.96
OpenAI | "The stock market experienced a significant decline last week" | "She enjoys painting landscapes in her free time" | 0.72
OpenAI | "The sun rises in the east every morning" | "Pizza is a popular food in Italy" | 0.75
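
One rough way to summarize the table is to compare, per model, the mean score of the pairs expected to be similar with the mean score of the pairs expected to be dissimilar. The sketch below does this with the values copied from the table. The gap metric and the names `scores`, `EXPECTED_HIGH`, and `summarize` are assumptions of this example, not the criterion used in the original analysis.

```python
# Scores copied from the Results table, in the order of sentences_list:
# two shared-word pairs, two paraphrase pairs, two unrelated pairs.
scores = {
    "Google":       [0.67, 0.65, 0.84, 0.95, 0.52, 0.59],
    "Hugging Face": [0.17, 0.23, 0.60, 0.88, -0.09, 0.17],
    "OpenAI":       [0.78, 0.79, 0.88, 0.96, 0.72, 0.75],
}

# Positions of the pairs whose similarity should be high (the paraphrase pairs).
EXPECTED_HIGH = {2, 3}


def mean(values):
    return sum(values) / len(values)


def summarize(model_scores):
    """Return (mean of expected-high pairs, mean of expected-low pairs, gap)."""
    high = [s for i, s in enumerate(model_scores) if i in EXPECTED_HIGH]
    low = [s for i, s in enumerate(model_scores) if i not in EXPECTED_HIGH]
    return mean(high), mean(low), mean(high) - mean(low)


for name, model_scores in scores.items():
    high, low, gap = summarize(model_scores)
    print(f"{name}: expected-high mean={high:.3f}, expected-low mean={low:.3f}, gap={gap:.3f}")
```

A larger gap means the model separates the two groups more clearly on these six pairs. With so few sentences this is only indicative.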

## Conclusions

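- The OpenAI embedding model appears to be the most effective at capturing semantic similarity overall: it gives the highest scores to both paraphrase pairs (0.88 and 0.96), making it a strong candidate for the RAG application. Its scores for the other pairs are also high (0.72 to 0.79), so any similarity threshold should be chosen with that narrow range in mind.
- The Google embedding model is a solid alternative: it also scores both paraphrase pairs highest (0.84 and 0.95), with the remaining pairs between 0.52 and 0.67.
- The Hugging Face embedding model may not be the best fit for our application: it scores the paraphrase pair "I have to go to the mechanic" / "My car is broken" at only 0.60, the lowest of the three models.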