# Embeddings

Embeddings are the first thing to understand when learning about the Retrieval-Augmented Generation (RAG) architecture.

There are a few high-level points to know:

* embedding models are separate AI models
* there are closed source embedding models and open source embedding models
* embeddings *conceptually encode* text as vectors of numbers
* embeddings can be compared to each other to determine how conceptually related the underlying pieces of text are

In [1]:
# We'll use HuggingFace embeddings, which are free and open source:
# https://huggingface.co/blog/getting-started-with-embeddings
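# Note: newer LangChain releases deprecate this import path; there the class
# is provided by the separate `langchain_huggingface` package instead.
from langchain.embeddings import HuggingFaceEmbeddings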

# There are many embedding providers and models:
# https://integrations.langchain.com/embeddings

In [2]:
# Use a small but capable embedding model
# See the leaderboard on HF for many more:
# https://huggingface.co/spaces/mteb/leaderboard
embedder = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L12-v2'
)
# Note that the first time you run this, it will download the referenced model. Don't panic.

Once we have an embedder, we can encode a piece of text with a simple method call.

In [3]:
a = embedder.embed_query('Hello embedder')

In [4]:
b = embedder.embed_query('Goodbye embedder')

In [5]:
a[:50]

[-0.059417031705379486,
 -0.01565878838300705,
 0.10515066981315613,
 -0.017390284687280655,
 -0.034024570137262344,
 -0.10761065781116486,
 -0.003911233972758055,
 -0.04314994812011719,
 -0.023459119722247124,
 0.04211834818124771,
 0.02837151288986206,
 -0.040017738938331604,
 0.06348534673452377,
 -0.040965303778648376,
 0.03680306673049927,
 0.015729669481515884,
 0.04204173386096954,
 0.10421767830848694,
 -0.032858461141586304,
 -0.04491071403026581,
 -0.05512953922152519,
 -0.040489766746759415,
 -0.012975454330444336,
 -0.08303125947713852,
 -0.012239343486726284,
 -0.029760343953967094,
 -0.0036864359863102436,
 0.02339751459658146,
 0.07509695738554001,
 -0.048225440084934235,
 0.08054011315107346,
 -0.03907801955938339,
 0.0801326259970665,
 0.06917019933462143,
 -0.07532914727926254,
 0.06141740083694458,
 0.03781856968998909,
 -0.05798576772212982,
 -0.08692601323127747,
 0.007979047484695911,
 0.0025746095925569534,
 -0.0007543520187027752,
 -0.061774276196956635,
 -0.026

## Cosine Similarity

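To determine how conceptually related two pieces of text are, one common method is "cosine similarity", which measures the cosine of the angle between two vectors. A score near 1 means the vectors point in nearly the same direction (closely related text), while a score near 0 means they are roughly perpendicular (unrelated text).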
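To make the measure concrete, here is a minimal sketch of cosine similarity in plain Python, using only the standard library. The function name `cosine_similarity` and the toy 2-dimensional vectors are assumptions made up for illustration; they are not real embeddings and this is not how `sentence_transformers` implements it internally.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration (not real embeddings)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # very close to 1
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])    # 0: unrelated
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1: opposite
```

Note that the score depends only on the direction of the vectors, not their length: `[1.0, 2.0]` and `[2.0, 4.0]` score as essentially identical.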

In [6]:
# https://www.sbert.net/docs/usage/semantic_textual_similarity.html#semantic-textual-similarity
from sentence_transformers import util as st_utils

In [7]:
st_utils.cos_sim(a, b)

tensor([[0.7861]])

In [8]:
def compare_strings(a, b):
    """Embed two strings and return their cosine similarity as a 1x1 tensor."""
    embedded_a = embedder.embed_query(a)
    embedded_b = embedder.embed_query(b)
    return st_utils.cos_sim(embedded_a, embedded_b)

In [9]:
compare_strings('car', 'automobile')

tensor([[0.8555]])

In [10]:
compare_strings('car', 'train')

tensor([[0.3786]])

In [11]:
compare_strings('car', 'banana')

tensor([[0.2699]])

In [12]:
compare_strings('car', 'tire')

tensor([[0.3743]])

In [13]:
compare_strings('car', 'engine')

tensor([[0.5234]])

In [14]:
compare_strings('car', 'Ford')

tensor([[0.5644]])

In [15]:
compare_strings(
    'I like to take the train to work',
    'I like to fly in airplanes to get to work'
)

tensor([[0.5462]])

In [16]:
compare_strings(
    'I like to take the train to work',
    'I am currently sleeping'
)

tensor([[0.1823]])

In [17]:
compare_strings(
    'I like to take the train to work',
    'I am at work right now'
)

tensor([[0.3665]])

In [18]:
compare_strings(
    'I like to take the train to work',
    'Chinese food is good'
)

tensor([[0.1978]])

In [19]:
compare_strings(
    'I like to take the train to work',
    'I am a subway guy, myself'
)

tensor([[0.4118]])

In [20]:
compare_strings(
    'I like to take the train to work',
    'I work from home and do not commute'
)

tensor([[0.3576]])

In [21]:
compare_strings(
    'I like to take the train to work',
    'I drive a locomotive to work'
)

tensor([[0.7113]])

In [22]:
compare_strings(
    'I like to take the train to work',
    'I drive a locomotive for fun'
)

tensor([[0.6729]])
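The comparisons above each score one pair of strings. The same idea extends to ranking several candidates against a single query, which is roughly what the retrieval step of a RAG pipeline does. The sketch below assumes hypothetical names (`rank_by_similarity`, `candidates`) and hand-written 3-dimensional vectors standing in for real embeddings, so that it runs without downloading a model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vector, candidates):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(query_vector, vector))
        for label, vector in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical stand-ins for embeddings (real ones have hundreds of dimensions)
query = [1.0, 0.0, 0.0]
candidates = {
    "unrelated": [0.0, 0.0, 1.0],
    "close": [0.9, 0.1, 0.0],
    "middling": [0.5, 0.5, 0.0],
}

ranking = rank_by_similarity(query, candidates)
for label, score in ranking:
    print(f"{label}: {score:.4f}")
```

With real embeddings, each vector in `candidates` would come from a call such as `embedder.embed_query(text)`, and the query vector from embedding the query string in the same way.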