# Embeddings and Dense Vector Search: A Quick Primer

If you come from an NLP background, you are probably intimately familiar with embeddings. Otherwise, you might find the topic a bit... dense. (This attempt at a joke will make more sense later.)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook is language/NLP-centric, embeddings have uses beyond just text!

#### Why Do We Even Need Embeddings?

To fully understand what embeddings are, we first need to understand why we have them!

Machine Learning algorithms, ranging from the very big to the very small, all have one thing in common:

They need numeric inputs.

So we need a process by which to translate the domain we live in, dominated by images, audio, language, and more, into the domain of the machine: Numbers.

Another thing we want to be able to do is capture "semantic information" about words/phrases so that we can use algorithmic approaches to determine if words are closely related or not!

So, we need to come up with a process that does these two things well:

- Convert non-numeric data into numeric data
- Capture potential semantic relationships between individual pieces of data
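
To see why the second requirement matters, here is a minimal sketch of one-hot encoding, a simple scheme that meets the first requirement but not the second. The three-word vocabulary is made up purely for illustration.

```python
import numpy as np

# Hypothetical three-word vocabulary, used only for illustration.
vocab = ["puppy", "dog", "car"]

# One-hot encoding: each word becomes one row of the identity matrix.
one_hot = {word: row for word, row in zip(vocab, np.eye(len(vocab)))}

# Every pair of distinct one-hot vectors has a dot product of 0,
# so "puppy" looks exactly as unrelated to "dog" as it does to "car".
puppy_dog = float(np.dot(one_hot["puppy"], one_hot["dog"]))
puppy_car = float(np.dot(one_hot["puppy"], one_hot["car"]))
print(puppy_dog, puppy_car)
```

Both scores come out as 0, so the encoding is numeric but carries no information about which words are related. Embeddings aim to fix exactly that.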

#### How Do Embeddings Capture Semantic Relationships?

In a simplified sense, embeddings map a word or phrase into n-dimensional space with a dense continuous vector, where each dimension in the vector represents some "latent feature" of the data.

This is best represented in a classic example:

![image](https://i.imgur.com/K5eQtmH.png)

In this extremely simplified example, the X_1 axis represents age and the X_2 axis represents hair.

The relationship of "puppy -> dog" reflects the same relationship as "baby -> adult", but dogs are (typically) hairier than humans. However, adults typically have more hair than babies - so they are shifted slightly closer to dogs on the X_2 axis!

Now, this is a simplified and contrived example - but it is *essentially* the mechanism by which embeddings capture semantic information.

In reality, the dimensions don't literally represent concrete concepts like "age" or "hair", but it's a useful way to think about how the semantic relationships are captured.
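
The toy picture can be sketched in a few lines of NumPy. The coordinates below are invented for illustration (they are not read off the figure); the point is only that the "puppy -> dog" offset matches the "baby -> adult" offset.

```python
import numpy as np

# Hypothetical 2-D coordinates in the form [age, hair]. Values are invented.
toy = {
    "baby":  np.array([1.0, 1.0]),
    "adult": np.array([5.0, 2.0]),
    "puppy": np.array([1.0, 6.0]),
    "dog":   np.array([5.0, 7.0]),
}

# The "growing up" relationship is the offset between the two points.
baby_to_adult = toy["adult"] - toy["baby"]
puppy_to_dog = toy["dog"] - toy["puppy"]

# In this toy setup the two offsets are identical.
same_offset = bool(np.allclose(baby_to_adult, puppy_to_dog))
print(baby_to_adult, puppy_to_dog, same_offset)
```

Real embeddings have hundreds of dimensions and the offsets only match approximately, but the intuition carries over.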

Alright, with some background behind us, let's examine how embeddings might help us choose relevant context.

Let's begin with a simple example: looking at how close the embedding vectors for two phrases are.

When we use the term "close" in this notebook, we're referring to a similarity measure called "cosine similarity".

We discussed above that if two embeddings are close, they are semantically similar. Cosine similarity gives us a quick way to measure how similar two vectors are!

Cosine similarity ranges from -1 to 1: a value of 1 means the vectors point in the same direction (very similar), 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions, which loosely corresponds to opposite meaning.
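
For reference, this is the standard definition, written for two generic vectors. The symbols $\mathbf{a}$, $\mathbf{b}$, and $\theta$ (the angle between them) are notation introduced here, not names used elsewhere in the notebook.

$$
\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

Because both vectors are divided by their lengths, the score depends only on the direction they point in, not on their magnitude.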

Let's implement it with NumPy below.

In [1]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
  return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))
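
As a quick sanity check, here is a small sketch that runs the function on three hand-picked vector pairs. The vectors are arbitrary illustrations, and the function is repeated so the snippet runs on its own.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
    # Same definition as the cell above, repeated so this snippet is self-contained.
    return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))

a = np.array([1.0, 2.0])

same_direction = cosine_similarity(a, 3 * a)              # scaled copy of a
orthogonal = cosine_similarity(a, np.array([-2.0, 1.0]))  # perpendicular to a
opposite = cosine_similarity(a, -a)                       # a flipped around

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three results are 1, 0, and -1. Note that an all-zero vector makes the denominator zero, so the score is undefined in that case.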

Now let's use the `all-MiniLM-L6-v2` embedding model from [sentence-transformers](https://www.sbert.net/) to embed two sentences. This is a lightweight model that runs locally - no API key needed!

In [2]:
from sentence_transformers import SentenceTransformer

# Load a pretrained Sentence Transformer model (runs locally)
model = SentenceTransformer("all-MiniLM-L6-v2")

Let's define our two sentences:

In [3]:
puppy_sentence = "i love puppies"
dog_sentence = "i love dogs"

Now we can convert those into embedding vectors using our local model!

In [4]:
puppy_vector = model.encode(puppy_sentence)
dog_vector = model.encode(dog_sentence)

Now we can determine how closely they are related using our similarity measure!

In [5]:
cosine_similarity(puppy_vector, dog_vector)

np.float32(0.7885634)

Remember, with cosine similarity, a value close to 1 means the vectors are very close!

Let's see what happens if we use a different set of sentences.

In [6]:
puppy_sentence = "I love puppies!"
cat_sentence = "I dislike cats!"

puppy_vector = model.encode(puppy_sentence)
cat_vector = model.encode(cat_sentence)

cosine_similarity(puppy_vector, cat_vector)

np.float32(0.50499433)

As you can see, the similarity is lower this time (about 0.50 versus 0.79), so these vectors are further apart, as expected!
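
The same comparison is the core of dense vector search: score every candidate against a query vector and keep the most similar ones. Below is a minimal sketch of that idea. The 3-dimensional vectors are invented stand-ins for real model output (in practice they would come from something like `model.encode`), and the sentence `"stock prices fell"` is a made-up example.

```python
import numpy as np

def cosine_similarity(vec_1, vec_2):
    return np.dot(vec_1, vec_2) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))

# Hypothetical 3-D "embeddings" standing in for real model output.
documents = {
    "i love dogs":       np.array([0.9, 0.1, 0.0]),
    "i dislike cats":    np.array([0.2, 0.8, 0.1]),
    "stock prices fell": np.array([0.0, 0.1, 0.9]),
}
# Pretend this vector encodes the query "i love puppies".
query_vector = np.array([1.0, 0.2, 0.0])

# Score every document against the query, then sort from most to least similar.
scores = {text: float(cosine_similarity(query_vector, vec))
          for text, vec in documents.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these toy values, "i love dogs" ranks first and "stock prices fell" last. A brute-force loop like this is fine for a handful of vectors; larger collections typically rely on a dedicated vector index instead.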

### Embedding Vector Calculations

One way to explore embedding vectors, and a fun "proof" that they work the way we expect, is through "vector calculations".

That is to say: If we take the vector for "King", and subtract the vector for "man", and add the vector for "woman" - we should have a vector that is similar to "Queen".

Let's try this out below!

In [7]:
king_vector = model.encode("King")
man_vector = model.encode("man")
woman_vector = model.encode("woman")

vector_calculation_result = king_vector - man_vector + woman_vector

queen_vector = model.encode("Queen")

cosine_similarity(vector_calculation_result, queen_vector)

np.float32(0.5794788)

As you can see, the resulting vector is reasonably close to the "Queen" vector (a similarity of about 0.58), though far from a perfect match!

> NOTE: The similarity falls short of 1 because the vectors don't *literally* encode information along axes as simple as "man" or "woman".
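
To make the idealized case concrete, here is a toy sketch in which the axes *are* that simple. The 2-D coordinates (a "royalty" axis and a "gender" axis) and the extra word "apple" are invented for illustration, so the arithmetic lands exactly on "queen", which a real model will not do.

```python
import numpy as np

def cosine_similarity(vec_1, vec_2):
    return np.dot(vec_1, vec_2) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))

# Hypothetical 2-D coordinates in the form [royalty, gender]. Values are invented.
toy = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

# king - man + woman, computed exactly as in the cell above.
result = toy["king"] - toy["man"] + toy["woman"]

# Pick the candidate (excluding the words used in the arithmetic)
# that is most similar to the result.
candidates = ["queen", "apple"]
best = max(candidates, key=lambda word: cosine_similarity(result, toy[word]))
print(result, best)
```

With a real model, a more forgiving test than demanding a similarity near 1 is to check whether "queen" ranks highest among a set of candidate words, as this sketch does with two candidates.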

### Conclusion

As you can see, embeddings help us convert text into a numeric, machine-readable format that also captures semantic relationships, which we can leverage for a number of purposes.