# Beginner-Friendly Guide to Vector Similarity Search and Facebook AI Similarity Search (FAISS)

## 1. Embeddings: How Machines Understand Meaning

> Computers don't understand **zero** and **one** but they understand **0** and **1**.

When we type *“king”* into a computer, it doesn’t see royalty. It sees a string of characters such as `['k', 'i', 'n', 'g']`. To a machine that’s meaningless because it doesn’t know that *“king”* and *“queen”* are related or that *“king”* and *“banana”* are not.

So if machines can’t understand meaning directly, then
**how do modern AI models like ChatGPT or search engines make sense of language?**<br>
The answer is that they turn words and sentences into **numbers** that *capture meaning*. These numerical representations are called **embeddings**.


### How Meanings Live in Space

Imagine we have a huge map, but instead of cities we place **words** on it, arranged so that:

* Words that mean similar things are close together.
* Words that mean different things are far apart.

On this map we notice that

> *“king” and “queen” live close together.*
> *“apple” and “banana” live close together.*
> *But “king” and “banana”? They live far apart.*

That’s the intuition behind **embeddings**.
They are coordinates of words, phrases or sentences in a high-dimensional **semantic space**.

Every point (vector) represents the meaning of the text.

### From Words to Vectors

In the early days, we used something called **one-hot encoding**:
Each word was a long vector of 0s with a single 1 marking its position.

For example:

| Word   | One-hot vector (simplified) |
| ------ | --------------------------- |
| king   | [1, 0, 0, 0]                |
| queen  | [0, 1, 0, 0]                |
| apple  | [0, 0, 1, 0]                |
| banana | [0, 0, 0, 1]                |

This approach had two big problems:
The vectors have to be as long as the vocabulary, which can run to millions of words, and every pair of words is equally distant. There’s no sense of similarity between *“king”* and *“queen”*.
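To make the "equally distant" point concrete, here is a minimal sketch using NumPy. The four-word vocabulary comes from the table above, while the `cosine` helper is an illustrative function written for this example, not part of any library.

```python
import numpy as np

vocab = ["king", "queen", "apple", "banana"]
# Row i of the identity matrix is the one-hot vector for vocab[i].
one_hot = np.eye(len(vocab))
king, queen, apple, banana = one_hot

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Any two different one-hot vectors share no non-zero position,
# so their similarity is always 0 and their distance is always sqrt(2).
print(cosine(king, queen))   # 0.0
print(cosine(king, banana))  # 0.0
print(np.linalg.norm(king - queen), np.linalg.norm(king - banana))
```

Whichever pair you pick, the numbers come out the same, so one-hot vectors cannot tell us that "king" is nearer to "queen" than to "banana".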

### Enter Embeddings: Learning Meaning from Context

Modern models (like Word2Vec, GloVe, and Transformer-based encoders) learn *dense* representations of words automatically by observing **context**.

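> The word *“bank”* in “river bank” vs. “bank account” appears in different neighborhoods.
> Contextual models such as Transformer-based encoders pick up on those differences and can give the word a different vector in each sentence, whereas Word2Vec and GloVe store a single vector per word.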

So instead of arbitrary one-hot vectors, we get something like:

| Word   | Embedding (simplified)      |
| ------ | --------------------------- |
| king   | [0.61, 0.43, 0.72, 0.13, …] |
| queen  | [0.59, 0.44, 0.70, 0.15, …] |
| banana | [0.10, 0.87, 0.03, 0.91, …] |

These numbers represent coordinates in a space where **distance = meaning difference**.
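As a rough illustration, we can plug the first four coordinates from the table into two common measures, cosine similarity and Euclidean distance. This is only a sketch: the values are the simplified, truncated numbers shown above (not real model output), and the helper names are made up for this example.

```python
import numpy as np

# First four coordinates from the simplified table (illustrative values only).
king = np.array([0.61, 0.43, 0.72, 0.13])
queen = np.array([0.59, 0.44, 0.70, 0.15])
banana = np.array([0.10, 0.87, 0.03, 0.91])

def cosine_similarity(u, v):
    """Close to 1 when the vectors point the same way, lower otherwise."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Straight-line distance between the two points."""
    return float(np.linalg.norm(u - v))

sim_kq = cosine_similarity(king, queen)
sim_kb = cosine_similarity(king, banana)
dist_kq = euclidean_distance(king, queen)
dist_kb = euclidean_distance(king, banana)

print(f"cosine(king, queen)  = {sim_kq:.3f}, distance = {dist_kq:.3f}")
print(f"cosine(king, banana) = {sim_kb:.3f}, distance = {dist_kb:.3f}")
```

With these toy numbers, "king" and "queen" come out with a cosine similarity very close to 1 and a small distance, while "king" and "banana" score much lower and sit much farther apart.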

### Embeddings in Action

In [2]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
sentences = ["king", "queen", "banana"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute similarity between words
sim_king_queen = util.cos_sim(embeddings[0], embeddings[1])
sim_king_banana = util.cos_sim(embeddings[0], embeddings[2])

print(f"Similarity(king, queen): {sim_king_queen.item():.3f}")
print(f"Similarity(king, banana): {sim_king_banana.item():.3f}")

Similarity(king, queen): 0.681
Similarity(king, banana): 0.395


These numbers tell us that “king” and “queen” are close in meaning (their cosine similarity is nearer to 1), whereas “king” and “banana” are not.

### Why Embeddings Matter

Embeddings are the unsung backbone of modern NLP. They are useful in:
* **Semantic Search:** finding meaning-based matches, not just keyword matches.
* **Recommendation Systems:** suggesting content similar in theme.
* **Chatbots & RAG:** retrieving relevant documents before answering.


## 2. Vector Similarity Search for Nearby Meaning

### Vector Based Methods for Similarity Search

Earlier we learned that embeddings place words and sentences into a vector space where distance represents meaning. The next question is: **how do we find the vectors closest to a query?** In other words, **how do we search for similar text?**
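The most direct answer is a brute-force scan: compare the query with every stored vector and keep the best matches. Below is a minimal NumPy sketch of that idea. The 4-dimensional vectors and the query are made-up toy values, and `top_k` is a hypothetical helper written for this example.

```python
import numpy as np

def top_k(query, vectors, k=2):
    """Return (indices, scores) of the k rows of `vectors` most similar to `query`."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Normalise to unit length so a dot product equals cosine similarity.
    v_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = v_norm @ q_norm
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

labels = ["king", "queen", "banana"]
vectors = [
    [0.61, 0.43, 0.72, 0.13],
    [0.59, 0.44, 0.70, 0.15],
    [0.10, 0.87, 0.03, 0.91],
]
query = [0.60, 0.45, 0.71, 0.12]  # toy query vector lying near "king" and "queen"

idx, scores = top_k(query, vectors, k=2)
print([labels[i] for i in idx])
```

Here the two best matches are "king" and "queen", with "banana" left out. Scanning every vector is fine for a handful of items, but the cost grows with the size of the collection.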

#### TF-IDF
Imagine you have three documents:

1. **Doc A:** The dog saw the cat.
2. **Doc B:** The cat sat on the mat.
3. **Doc C:** A fast brown fox jumps over the lazy dog.

Now, if you want to find a document about a **dog**, how would you do it?

If you are dumb like me then your first instinct might be to just count the words. This is called a "Bag of Words" approach.
- Doc A: `{"the": 2, "dog": 1, "saw": 1, "cat": 1}`
- Doc C: `{"a": 1, "fast": 1, "brown": 1, "fox": 1, "jumps": 1, "over": 1, "the": 1, "lazy": 1, "dog": 1}`

A query for "the dog" would give Doc A a score of 3 (2 for "the", 1 for "dog") and Doc C a score of 2 (1 for "the", 1 for "dog"). The problem is that the word **the** tells us nothing about the topic, yet it dominates our scores. This is where TF-IDF comes into the picture. <br>
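The counting approach can be sketched in a few lines of Python. The `tokenize` and `bow_score` helpers are illustrative names, and lowercasing plus dropping the period is an assumed preprocessing step.

```python
from collections import Counter

docs = {
    "A": "The dog saw the cat.",
    "B": "The cat sat on the mat.",
    "C": "A fast brown fox jumps over the lazy dog.",
}

def tokenize(text):
    # Lowercase and drop periods so "The" matches "the" and "dog." matches "dog".
    return text.lower().replace(".", "").split()

# One bag of words (word -> count) per document.
bags = {name: Counter(tokenize(text)) for name, text in docs.items()}

def bow_score(query, bag):
    # Add up the raw counts of every query word found in the document.
    return sum(bag[word] for word in tokenize(query))

scores = {name: bow_score("the dog", bag) for name, bag in bags.items()}
print(scores)  # {'A': 3, 'B': 2, 'C': 2}
```

Notice that Doc B ties with Doc C even though it never mentions a dog. Its whole score comes from the word "the".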

TF-IDF transforms documents into vectors based on term frequency (TF) and inverse document frequency (IDF). In layman's terms, words that appear often in one document but rarely in others get a higher weight because they are more meaningful.<br>
Start with TF: a word that appears many times in one article is probably important to that article. If we are reading an article and the word Python appears 30 times, it's a safe bet the article is about Python, whether that means the programming language or the snake.<br>

$\mathrm{TF}(\text{word}, \text{document}) = \dfrac{\text{count of the word in the document}}{\text{total words in the document}}$

So for Doc A, "The dog saw the cat," the TF for "dog" is 1/5 = 0.2 and the TF for "the" is 2/5 = 0.4.
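The same arithmetic as a small Python sketch (`term_frequency` is an illustrative helper name, and the document is assumed to be already lowercased and free of punctuation):

```python
def term_frequency(word, tokens):
    """Share of the document's tokens that are exactly `word`."""
    return tokens.count(word) / len(tokens)

doc_a = "the dog saw the cat".split()

print(term_frequency("dog", doc_a))  # 0.2
print(term_frequency("the", doc_a))  # 0.4
```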

But notice that "the" is still winning by far. We've only solved half the problem. Now for the secret sauce.

#### Inverse Document Frequency (IDF): How Special is this Word?

Inverse Document Frequency asks a simple question: how common is this word across all our documents? Rare words are special. They are strong signifiers of a topic, so a word like "quantum" or "bioinformatics" makes a fantastic keyword. Common words are noise. Words like "it," "and," or "is" appear in almost every document and tell us nothing. IDF gives a high score to rare words and a low score to common words.<br>
It's calculated like this, using a base-10 logarithm:<br>
$\mathrm{IDF}(\text{word}) = \log_{10}\left(\dfrac{\text{total number of documents}}{\text{number of documents containing the word}}\right)$

Now looking at our example:<br>
Total documents = 3
- IDF("dog"): Appears in 2 documents (A, C) -> `log10(3 / 2)` ≈ 0.18
- IDF("cat"): Appears in 2 documents (A, B) -> `log10(3 / 2)` ≈ 0.18
- IDF("fox"): Appears in 1 document (C) -> `log10(3 / 1)` ≈ 0.48
- IDF("the"): Appears in 3 documents (A, B, C) -> `log10(3 / 3)` = `log10(1)` = **0**

Because log(1) = 0, the word "the" is silenced completely, while "fox", which appears only in Doc C, gets the highest score.
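These IDF values can be checked with a short sketch. It assumes a base-10 logarithm, lowercased text with punctuation removed, and an illustrative helper named `inverse_document_frequency`.

```python
import math

docs = [
    "the dog saw the cat".split(),                       # Doc A
    "the cat sat on the mat".split(),                    # Doc B
    "a fast brown fox jumps over the lazy dog".split(),  # Doc C
]

def inverse_document_frequency(word, docs):
    """log10(total documents / documents that contain the word)."""
    containing = sum(1 for tokens in docs if word in tokens)
    if containing == 0:
        raise ValueError(f"{word!r} does not appear in any document")
    return math.log10(len(docs) / containing)

for word in ["dog", "cat", "fox", "the"]:
    print(word, round(inverse_document_frequency(word, docs), 2))
# dog 0.18
# cat 0.18
# fox 0.48
# the 0.0
```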

Now we just multiply them together to get the final TF-IDF score for every word in every document.<br>
$\text{TF-IDF} = \text{TF} \times \text{IDF}$

The resulting score is high only when the word is frequent in this document (high TF) and rare across the rest of the collection (high IDF).

#### Implementing TF-IDF

In [3]:
# The three example documents, lowercased and with punctuation removed
a = "the dog saw the cat".split()
b = "the cat sat on the mat".split()
c = "a fast brown fox jumps over the lazy dog".split()

In [4]:
import numpy as np

def tfidf(word):
    docs = [a, b, c]
    tf = []
    count_n = 0
    for sentence in docs:
        # TF: occurrences of the word divided by the document length
        t_count = len([x for x in sentence if x == word])
        tf.append(t_count / len(sentence))
        # count the documents that contain the word (for IDF)
        count_n += 1 if word in sentence else 0
    if count_n == 0:
        return [0.0 for _ in docs]  # the word appears in no document
    idf = np.log10(len(docs) / count_n)
    return [round(_tf * idf, 2) for _tf in tf]

In [None]:
tfidf_a, tfidf_b, tfidf_c = tfidf('dog')
print(f"TF-IDF a: {tfidf_a}\nTF-IDF b: {tfidf_b}\nTF-IDF c: {tfidf_c}")

TF-IDF a: 0.04
TF-IDF b: 0.0
TF-IDF c: 0.02


In [9]:
tfidf_a, tfidf_b, tfidf_c = tfidf('fox')  # changed word from 'dog' to 'fox'
print(f"TF-IDF a: {tfidf_a}\nTF-IDF b: {tfidf_b}\nTF-IDF c: {tfidf_c}")

TF-IDF a: 0.0
TF-IDF b: 0.0
TF-IDF c: 0.05
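To connect this back to similarity search, each document can be turned into a full TF-IDF vector (one entry per vocabulary word) and compared with a query vector using cosine similarity. The following is a minimal pure-Python sketch under a few assumptions: base-10 logarithms, lowercased documents without punctuation, and illustrative helper names (`idf`, `tfidf_vector`, `cosine`).

```python
import math

docs = {
    "A": "the dog saw the cat",
    "B": "the cat sat on the mat",
    "C": "a fast brown fox jumps over the lazy dog",
}
tokenized = {name: text.split() for name, text in docs.items()}
# Fixed, sorted vocabulary so every vector has the same layout.
vocab = sorted({word for tokens in tokenized.values() for word in tokens})

def idf(word):
    n_containing = sum(1 for tokens in tokenized.values() if word in tokens)
    return math.log10(len(tokenized) / n_containing)

def tfidf_vector(tokens):
    # One entry per vocabulary word: TF (share of tokens) times IDF.
    # Words outside the vocabulary are ignored.
    return [tokens.count(word) / len(tokens) * idf(word) for word in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

doc_vectors = {name: tfidf_vector(tokens) for name, tokens in tokenized.items()}
query_vector = tfidf_vector("the fox".split())

# Rank documents from most to least similar to the query.
ranking = sorted(
    doc_vectors,
    key=lambda name: cosine(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranking[0])  # C
```

Because "the" has an IDF of 0, only "fox" contributes to the query vector, so Doc C is the only document with a non-zero similarity.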
