# Beginner-Friendly Guide to Vector Similarity Search and Facebook AI Similarity Search (FAISS)

## 1. Embeddings: How Machines Understand Meaning

> Computers don't understand **zero** and **one** but they understand **0** and **1**.

When we type *“king”* into a computer, it doesn’t see royalty. Instead it sees a string of characters such as `['k', 'i', 'n', 'g']`. To a machine that’s meaningless because it doesn’t know that *“king”* and *“queen”* are related or that *“king”* and *“banana”* are not.

So if machines can’t understand meaning directly, then
**how do modern AI models like ChatGPT or search engines make sense of language?**<br>
The answer is that they turn words and sentences into **numbers** that *capture meaning*. These numerical representations are called **embeddings**.


### How Meanings Live in Space

Imagine we have a huge map, but instead of cities we place **words** on it so that:

* Words that mean similar things are close together.
* Words that mean different things are far apart.

On this map we notice that

> *“king” and “queen” live close together.*
> *“apple” and “banana” live close together.*
> *But “king” and “banana”? They live far apart.*

That’s the intuition behind **embeddings**.
They are coordinates of words, phrases or sentences in a high-dimensional **semantic space**.

Every point (vector) represents the meaning of the text.

### From Words to Vectors

In the early days, we used something called **one-hot encoding**:
Each word was a long vector of 0s with a single 1 marking its position.

For example:

| Word   | One-hot vector (simplified) |
| ------ | --------------------------- |
| king   | [1, 0, 0, 0]                |
| queen  | [0, 1, 0, 0]                |
| apple  | [0, 0, 1, 0]                |
| banana | [0, 0, 0, 1]                |

This approach had a big problem:
With millions of words, every vector needs millions of dimensions, and all words are equally distant from each other. There’s no sense of similarity between *“king”* and *“queen”*.
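A quick way to see the problem is to measure the distance between every pair of one-hot vectors from the table. This is a minimal sketch using only the four toy vectors above; the variable names are illustrative.

```python
import math
from itertools import combinations

# The four one-hot vectors from the table above
one_hot = {
    "king":   [1, 0, 0, 0],
    "queen":  [0, 1, 0, 0],
    "apple":  [0, 0, 1, 0],
    "banana": [0, 0, 0, 1],
}

# Euclidean distance between every pair of words
distances = {
    (w1, w2): math.dist(one_hot[w1], one_hot[w2])
    for w1, w2 in combinations(one_hot, 2)
}

for pair, dist in distances.items():
    print(pair, round(dist, 3))
```

Every pair comes out at the same distance (the square root of 2), so one-hot vectors cannot express that “king” is nearer to “queen” than to “banana”.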

### Enter Embeddings: Learning Meaning from Context

Modern models (like Word2Vec, GloVe, and Transformer-based encoders) learn *dense* representations of words automatically by observing **context**.

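> The word *“bank”* in “river bank” vs. “bank account” appears in different neighborhoods.
> Contextual models such as Transformer-based encoders learn those subtle differences, whereas static embeddings like Word2Vec and GloVe assign a single vector per word.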

So instead of arbitrary one-hot vectors, we get something like:

| Word   | Embedding (simplified)      |
| ------ | --------------------------- |
| king   | [0.61, 0.43, 0.72, 0.13, …] |
| queen  | [0.59, 0.44, 0.70, 0.15, …] |
| banana | [0.10, 0.87, 0.03, 0.91, …] |

These numbers represent coordinates in a space where **distance = meaning difference**.
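To make this concrete, here is a small sketch that compares the toy vectors from the table with cosine similarity (higher means closer in meaning). It assumes the first four components shown in the table are the whole vector, which is only for illustration since real embeddings have hundreds of dimensions. The helper `cosine_similarity` is written by hand here rather than taken from a library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# First four components from the table above (illustrative only)
toy = {
    "king":   [0.61, 0.43, 0.72, 0.13],
    "queen":  [0.59, 0.44, 0.70, 0.15],
    "banana": [0.10, 0.87, 0.03, 0.91],
}

sim_king_queen = cosine_similarity(toy["king"], toy["queen"])
sim_king_banana = cosine_similarity(toy["king"], toy["banana"])

print(f"king vs queen:  {sim_king_queen:.3f}")
print(f"king vs banana: {sim_king_banana:.3f}")
```

With these toy numbers the king/queen score lands very close to 1, while king/banana is well below it.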

### Embeddings in Action

In [1]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
sentences = ["king", "queen", "banana"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute similarity between words
sim_king_queen = util.cos_sim(embeddings[0], embeddings[1])
sim_king_banana = util.cos_sim(embeddings[0], embeddings[2])

print(f"Similarity(king, queen): {sim_king_queen.item():.3f}")
print(f"Similarity(king, banana): {sim_king_banana.item():.3f}")

Similarity(king, queen): 0.681
Similarity(king, banana): 0.395


These numbers tell us that “king” and “queen” are closer in meaning than “king” and “banana”.

### Why Embeddings Matter

Embeddings are the unsung backbone of modern NLP. They are useful in:
* **Semantic Search:** finding meaning-based matches, not just keyword matches.
* **Recommendation Systems:** suggesting content similar in theme.
* **Chatbots & RAG:** retrieving relevant documents before answering.


## 2. Vector Similarity Search for Nearby Meaning

### Vector Based Methods for Similarity Search

Earlier we learned that embeddings place words and sentences into a vector space where distance represents meaning. The next question is **how do we find the closest vectors to a query?** In other words, **how do we search for similar text?**

#### TF-IDF
Imagine you have three documents:
1. **Doc A:** The dog saw the cat.
2. **Doc B:** The cat sat on the mat.
3. **Doc C:** A fast brown fox jumps over the lazy dog.

Now if you want to find a document about a **dog**, how would you do it?

If you are dumb like me then your first instinct might be to just count the words. This is called a "Bag of Words" approach.
- Doc A: `{"the": 2, "dog": 1, "saw": 1, "cat": 1}`
- Doc C: `{"a": 1, "fast": 1, "brown": 1, "fox": 1, "jumps": 1, "over": 1, "the": 1, "lazy": 1, "dog": 1}`

A query for "the dog" would give Doc A a score of 3 (2 for "the", 1 for "dog") and Doc C a score of 2 (1 for "the", 1 for "dog"). The problem here is that the word **the** is completely useless, yet it is dominating our scores. This is where IDF comes into the picture.<br>
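The counting approach can be sketched in a few lines of Python. The helpers `bag_of_words` and `overlap_score` are hypothetical names for this illustration, and the tokenizer (lowercase, strip punctuation, split on spaces) is an assumption made to match the counts above.

```python
import string
from collections import Counter

documents = {
    "A": "The dog saw the cat.",
    "B": "The cat sat on the mat.",
    "C": "A fast brown fox jumps over the lazy dog.",
}

def bag_of_words(text):
    """Lowercase, drop punctuation, and count each word."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())

def overlap_score(query, document):
    """Sum the document's counts for every distinct query word."""
    counts = bag_of_words(document)
    return sum(counts[word] for word in bag_of_words(query))

for name, text in documents.items():
    print(name, overlap_score("the dog", text))
```

Doc A scores 3 and Doc C scores 2, as described. Doc B also scores 2 even though it never mentions a dog, purely because it contains "the" twice.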

TF-IDF transforms documents into vectors based on term frequency (TF) and inverse document frequency (IDF). In layman's terms, words that appear often in one document but rarely in others get a higher weight because they are more meaningful.<br>
Start with TF: a word that appears many times in one article is probably important to that article. If we are reading an article about Python and the word Python appears 30 times, it's a safe bet the article is about the programming language (or possibly the snake).<br>

$\text{TF}(\text{word}, \text{document}) = \dfrac{\text{count of the word in the document}}{\text{total words in the document}}$

So in our example above, "The dog saw the cat," the TF for "dog" is 1/5 = 0.2 and the TF for "the" is 2/5 = 0.4.
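The same arithmetic as a small function. This is a sketch only: `term_frequency` is an illustrative name, and the tokenizer simply lowercases the text and drops the trailing period.

```python
def term_frequency(word, document):
    """Occurrences of `word` divided by the total number of words."""
    tokens = document.lower().strip(".").split()
    return tokens.count(word) / len(tokens)

doc_a = "The dog saw the cat."
print(term_frequency("dog", doc_a))  # 1 of 5 words
print(term_frequency("the", doc_a))  # 2 of 5 words
```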

But as we notice "the" is still winning by far. We've only solved half the problem. Now for the secret sauce.

#### Inverse Document Frequency (IDF): How Special is this Word?

Inverse Document Frequency asks a simple question: how common is this word across all our documents? Rare words are special because they are strong signifiers of a topic; "quantum" or "bioinformatics" is a fantastic keyword. Common words are noise; "it," "and," or "is" appear in almost every document and tell us nothing. IDF gives a high score to rare words and a low score to common words.<br>
It's calculated like this:<br>
$\text{IDF}(\text{word}, \text{all documents}) = \log_{10}\left(\dfrac{\text{total number of documents}}{\text{number of documents containing the word}}\right)$

Now looking at our example (using base-10 logarithms):<br>
Total documents = 3
- IDF("dog"): appears in 2 documents (A, C) -> `log(3 / 2)` ≈ 0.18
- IDF("cat"): appears in 2 documents (A, B) -> `log(3 / 2)` ≈ 0.18
- IDF("fox"): appears in 1 document (C) -> `log(3 / 1)` ≈ 0.48
- IDF("the"): appears in 3 documents (A, B, C) -> `log(3 / 3)` = `log(1)` = **0**

Because log(1) is 0, the word "the" is completely silenced, while "fox", which is unique to Doc C, gets the highest score.
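These IDF values can be checked with a short script. It is a minimal sketch that assumes a base-10 logarithm and a very simple tokenizer, and it assumes the word occurs in at least one document (otherwise the division fails). The names `tokenize` and `idf` are illustrative.

```python
import math

documents = {
    "A": "The dog saw the cat.",
    "B": "The cat sat on the mat.",
    "C": "A fast brown fox jumps over the lazy dog.",
}

def tokenize(text):
    """Lowercase, drop the trailing period, split on whitespace."""
    return text.lower().strip(".").split()

def idf(word, docs):
    """log10(total documents / documents containing the word)."""
    containing = sum(1 for text in docs.values() if word in tokenize(text))
    return math.log10(len(docs) / containing)

for word in ["dog", "cat", "fox", "the"]:
    print(f"IDF({word!r}) = {idf(word, documents):.3f}")
```

To three decimals this gives about 0.176 for "dog" and "cat", 0.477 for "fox", and exactly 0 for "the".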

Now we just multiply them together to get the final TF-IDF score for every word in every document.<br>
$\text{TF-IDF} = \text{TF} \times \text{IDF}$

This score is high only when the word is common in this document (high TF) and rare across the other documents (high IDF).

#### Implementing TF-IDF

In [2]:
# The three example documents from above, lowercased and without punctuation
a = "the dog saw the cat".split()
b = "the cat sat on the mat".split()
c = "a fast brown fox jumps over the lazy dog".split()

In [3]:
import numpy as np
def tfidf(word):
    docs = [a, b, c]
    tf = []
    count_n = 0
    for sentence in docs:
        # TF: occurrences of the word divided by total words in the document
        t_count = len([x for x in sentence if x == word])
        tf.append(t_count/len(sentence))
        # count number of docs containing the word (for IDF)
        count_n += 1 if word in sentence else 0
    # IDF with a base-10 log; assumes the word occurs in at least one document
    idf = np.log10(len(docs) / count_n)
    return [round(_tf*idf, 2) for _tf in tf]

In [4]:
tfidf_a, tfidf_b, tfidf_c = tfidf('dog')
print(f"TF-IDF a: {tfidf_a}\nTF-IDF b: {tfidf_b}\nTF-IDF c: {tfidf_c}")

TF-IDF a: 0.04
TF-IDF b: 0.0
TF-IDF c: 0.02


In [5]:
tfidf_a, tfidf_b, tfidf_c = tfidf('fox')
print(f"TF-IDF a: {tfidf_a}\nTF-IDF b: {tfidf_b}\nTF-IDF c: {tfidf_c}")# changed word from 'dog' to 'fox'

TF-IDF a: 0.0
TF-IDF b: 0.0
TF-IDF c: 0.05


## 3. Facebook AI Similarity Search: How FAISS Finds a Needle in a Trillion Vector Haystack

In the modern world, everything is a vector. Our texts, our photos, the songs we hear: we've gotten really good at turning them all into long lists of numbers called **embeddings**.<br>
These vectors are powerful because they capture the meaning and context of the data. The classic example comes from Word2Vec, where "King" - "Man" + "Woman" ≈ "Queen", but the idea applies to everything.<br>
- **Pictures:** Vectors for "golden retriever in a park" are "close" to vectors for "dog on the grass."
- **Text:** Vectors for "How much is a flight to Tokyo?" are "close" to "Price of tickets to Japan."

But this creates a new type of problem, which we will call the **billion-vector problem**.<br>
Say a user uploads a photo and we want to find the 10 most similar images. What is the most obvious (and nightmarish) approach?

- Take the user's new image vector.
- Compare it one by one to all the one billion vectors in our database (using a metric like Euclidean Distance or Cosine Similarity).
- Sort the billion results by distance.
- Return the top 10. 

This is called a brute-force or exhaustive search. It is completely accurate, but at this scale it is also impossibly slow. You'd be waiting for minutes, not milliseconds, and if you are running a business, your users will leave and your servers will melt.<br>
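The four steps above fit in a few lines of plain Python. This is a toy sketch on 2-D points rather than a billion image vectors, and `brute_force_search` is a hypothetical helper name, but the structure (score everything, sort, keep the top `k`) is the same.

```python
import math

def brute_force_search(query, database, k):
    """Compare the query to every stored vector and return the k closest ids."""
    # Steps 1-2: distance from the query to every vector in the database
    scored = [(math.dist(query, vec), idx) for idx, vec in enumerate(database)]
    # Step 3: sort all results by distance
    scored.sort()
    # Step 4: return the ids of the top k
    return [idx for _, idx in scored[:k]]

database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
print(brute_force_search((1.0, 1.0), database, k=2))
```

The work grows linearly with the size of `database`, which is exactly what makes this approach impractical for a billion vectors.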

So what is the solution? Here comes **FAISS**.


### 3.1 What is FAISS?

FAISS (Facebook AI Similarity Search) is an open-source library from Meta AI. It's not a database but an ultra-fast toolbox written in C++ (with a Python wrapper) built for one job: finding the nearest neighbors in a massive set of vectors. It can do this exactly, but its real strength is **approximate nearest neighbor** search. The most important word here is **approximate**. FAISS is built on the idea: *what if we give up a tiny bit of perfect accuracy to gain an enormous amount of speed?*<br>
Most applications don't need the absolute, mathematically perfect 10 closest vectors; they just need 10 really good matches. FAISS gives us those matches, and it does it so fast that it feels like magic.

#### How FAISS Performs SMART Indexing: `IndexFlatL2`

This is the dumb and slow method we talked about. FAISS has it, and it's called `IndexFlatL2` (a flat index using L2/Euclidean distance). It compares our query vector to every single stored vector.
- Speed: Very slow on large datasets.
- Accuracy: 100% (exact results).
- Use Case: Only for small datasets (e.g., under 100,000 vectors) or for benchmarking other indexes.


In [8]:
import faiss
import requests
import numpy as np
import pandas as pd
from io import StringIO

In [10]:
# dataset initialization
res = requests.get('https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/sick2014/SICK_train.txt')
text = res.text
text[:100]# show beginning of the text

'pair_ID\tsentence_A\tsentence_B\trelatedness_score\tentailment_judgment\n1\tA group of kids is playing in '

In [11]:
data = pd.read_csv(StringIO(text), sep='\t')# load data into dataframe
data.head()# show first 5 rows

|   | pair_ID | sentence_A | sentence_B | relatedness_score | entailment_judgment |
| - | ------- | ---------- | ---------- | ----------------- | ------------------- |
| 0 | 1 | A group of kids is playing in a yard and an ol... | A group of boys in a yard is playing and a man... | 4.5 | NEUTRAL |
| 1 | 2 | A group of children is playing in the house an... | A group of kids is playing in a yard and an ol... | 3.2 | NEUTRAL |
| 2 | 3 | The young boys are playing outdoors and the ma... | The kids are playing outdoors near a man with ... | 4.7 | ENTAILMENT |
| 3 | 5 | The kids are playing outdoors near a man with ... | A group of kids is playing in a yard and an ol... | 3.4 | NEUTRAL |
| 4 | 9 | The young boys are playing outdoors and the ma... | A group of kids is playing in a yard and an ol... | 3.7 | NEUTRAL |


In [12]:
sentences = data['sentence_A'].tolist()# extract sentences
sentences[:5]# show first 5 sentences

['A group of kids is playing in a yard and an old man is standing in the background',
 'A group of children is playing in the house and there is no man standing in the background',
 'The young boys are playing outdoors and the man is smiling nearby',
 'The kids are playing outdoors near a man with a smile',
 'The young boys are playing outdoors and the man is smiling nearby']

In [13]:
sentence_b = data['sentence_B'].tolist()# extract sentence B
sentences.extend(sentence_b)# add to sentences list
len(set(sentences))# unique sentences count

4802

In [14]:
urls = [
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.train.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2013/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/images.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/images.test.tsv'
]# additional datasets

In [16]:
for url in urls:
    res = requests.get(url)
    # extract to dataframe
    data = pd.read_csv(StringIO(res.text), sep='\t',
                       header=None, on_bad_lines='skip')
    # add to columns 1 and 2 to sentences list
    sentences.extend(data[1].tolist())
    sentences.extend(data[2].tolist())

In [17]:
len(set(sentences))

14505

In [18]:
# deduplicate, drop non-string entries and strip newlines before saving a backup
sentences = [
    sentence.replace('\n', '') for sentence in list(set(sentences)) if type(sentence) is str
]

In [19]:
with open('sentences.txt', 'w') as fp:
    fp.write('\n'.join(sentences))

In [20]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-base-nli-mean-tokens')

sentence_embeddings = model.encode(sentences, convert_to_numpy=True, show_progress_bar=True)
sentence_embeddings.shape

(14504, 768)

In [22]:
sentence_embeddings.shape[0]# number of sentences

14504

In [25]:
d = sentence_embeddings.shape[1]
d# dimension of embeddings

768

#### Initialising `IndexFlatL2`

Now we initialise the flat L2 distance index `IndexFlatL2` with the vector dimension `d`, which we found above to be 768.

In [69]:
# Reset the index to remove duplicates
index = faiss.IndexFlatL2(d)
index.add(sentence_embeddings)  # Add vectors only once
print(f"Index now has {index.ntotal} vectors")  # Should show 14504

Index now has 14504 vectors


Most indexes need to be trained on our data before use, but `IndexFlatL2` needs no training: it simply stores the vectors and computes the distance to each one when we search with a query vector `xq`. We can confirm this with the `is_trained` attribute.

In [70]:
index.is_trained

True

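Now we will add vectors to the index again. Note that `add` appends rather than replaces, so passing the same embeddings a second time stores every sentence twice and doubles `ntotal`.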

In [71]:
index.add(sentence_embeddings)# adding new vectors

In [72]:
index.ntotal

29008

Next we define our search query and the number of nearest neighbours (`k`) we want returned.

In [77]:
k = 4# nearest neighbours to search
xq = model.encode("The fox jumped over the dog.")  # Returns shape (768,) search query

In [78]:
%%time
# Reshape query to 2D array with shape (1, 768)
xq= np.asarray(xq, dtype=np.float32).reshape(1, -1)

D, I = index.search(xq, k)  # actual search: returns (distances, indices)
print(I)  # indexes of nearest neighbours

[[ 8121 22625   305 14809]]
CPU times: user 3.92 ms, sys: 18.3 ms, total: 22.2 ms
Wall time: 28.6 ms


In [79]:
# Safely display nearest neighbour sentences.
# If I is not defined (previous search not run) perform the search.
if 'I' not in globals():
	# ensure we have a query vector and k
	k = globals().get('k', 4)
	q = globals().get('xq2', globals().get('xq', None))
	if q is None:
		raise RuntimeError("No query vector found. Define 'xq' or 'xq2' and run the search.")
	q = np.asarray(q, dtype=np.float32).reshape(1, -1)
	D, I = index.search(q, k)

# Build safe result list (IDs >= len(sentences) point at the duplicate vectors added by the second index.add call)
results = []
for idx in I[0]:
	if idx < 0 or idx >= len(sentences):
		results.append(f'{idx}: [index out of range]')
	else:
		results.append(f'{idx}: {sentences[idx]}')

results

['8121: Two dogs running together and one has a duck toy in its mouth.',
 '22625: [index out of range]',
 '305: A greyhound jumps over a chain.',
 '14809: [index out of range]']

In [80]:
sentences[305]

'A greyhound jumps over a chain.'
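Two of the IDs returned above fall outside `sentences` because the index holds 29008 vectors: two copies of the same 14504 embeddings. The sketch below assumes the second copy was appended in the same order as the first, so ID `i + 14504` is a copy of ID `i`. The helper `to_sentence_id` is a hypothetical name, not part of FAISS.

```python
n_sentences = 14504                        # unique sentences that were embedded
returned_ids = [8121, 22625, 305, 14809]   # IDs from the search above

def to_sentence_id(faiss_id, n):
    """Map an ID from the doubled index back to a position in `sentences`."""
    return faiss_id % n

mapped = [to_sentence_id(i, n_sentences) for i in returned_ids]
print(mapped)
```

Under that assumption, 22625 maps back to 8121 and 14809 maps back to 305, so the four results are really two sentences returned twice each.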

We can see some reasonable matches that share context with our search query. Now we will extract the stored numerical vectors back out of FAISS.

In [81]:
vecs = np.zeros((k, d))# initialize array to hold vectors
for i, val in enumerate(I[0].tolist()):# iterate over indices
    vecs[i, :] = index.reconstruct(val)# extract vectors

In [82]:
vecs.shape

(4, 768)

In [84]:
vecs[0][:10]### first 10 dimensions of first vector

array([-0.44295901,  0.0148984 ,  0.13510251,  0.17391756,  0.13312227,
        0.38461047, -0.59615934, -0.74605459, -0.56400001, -0.40464091])

### Scaling Vector Search: Adding Partitioning to the Index

Imagine we are running a semantic search engine with 100 million documents. A user types in a query and our system needs to find the most relevant documents. Using the flat index approach we learned above (`IndexFlatL2`), our system would need to:
1. Convert the query to a 768-dimensional vector.
2. Calculate the distance between this query vector and all 100 million document vectors.
3. Sort these distances to find the closest matches.
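A back-of-the-envelope estimate shows what those steps cost per query. The sketch assumes vectors are stored as 4-byte floats (the `float32` format FAISS works with) and counts one multiply-add per dimension per comparison; the variable names are illustrative.

```python
n_vectors = 100_000_000   # documents in the index
dim = 768                 # dimensions per embedding
bytes_per_float32 = 4

# One multiply-add per dimension for each stored vector
multiply_adds_per_query = n_vectors * dim

# Raw storage for the vectors alone, in decimal gigabytes
memory_gb = n_vectors * dim * bytes_per_float32 / 1e9

print(f"{multiply_adds_per_query:,} multiply-adds per query")
print(f"{memory_gb:.1f} GB of raw vectors")
```

That is 76.8 billion multiply-adds and roughly 307 GB of vectors to scan for every single query.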

That's 100 million distance calculations for every single search query. Even with optimized hardware, this becomes painfully slow as our dataset grows. The core challenge is: how do we search through millions (or billions) of vectors without comparing against every single one?

This is where a clever mathematical concept called **Voronoi cells** comes in.

#### What Are Voronoi Cells?

Imagine you're a city planner deciding where people should shop. You have three grocery stores in your city:
- Store A at coordinates (2, 3).
- Store B at coordinates (8, 7).
- Store C at coordinates (5, 1).

The only rule is that every resident shops at whichever store is closest to their home.
If we colour-code the city map by which store each location is closest to, we get three regions. These regions are Voronoi cells.

```text
      Store B (8,7)
            ●
           /|\
          / | \
         /  |  \
        /   |   \
       /    |    \
   Store A  |  Store C
     (2,3)  |   (5,1)
       ●    |      ●
        \   |     /
         \  |    /
          \ |   /
           \|  /
            \/
```

Each cell shares these properties:
- One center point (the store location).
- A boundary where we are equidistant from two stores.
- Everything inside is closer to that store than any other.
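The "closest store wins" rule is easy to express in code. This small sketch uses the three store coordinates from the example; the function name `nearest_store` and the sample homes are made up for illustration.

```python
import math

# Store locations from the example above
stores = {"A": (2, 3), "B": (8, 7), "C": (5, 1)}

def nearest_store(home):
    """Return the name of the store whose Voronoi cell contains `home`."""
    return min(stores, key=lambda name: math.dist(home, stores[name]))

for home in [(3, 3), (7, 6), (5, 2)]:
    print(home, "->", nearest_store(home))
```

Each home is assigned to exactly one store, so the three stores split the whole map into three cells.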

Now let's map this analogy onto vector search:
- Instead of grocery stores, we have centroid vectors (representative points in our vector space).
- Instead of residents' homes, we have document embeddings.
- Instead of asking "which store?", we ask "which cell does this vector belong to?"

In high dimensional space (like the 768 dimensions in BERT embeddings) Voronoi cells work the same way. Each cell contains all vectors that are closer to its centroid than to any other centroid.


#### How FAISS Uses Voronoi Partitioning

In FAISS we take our query vector `xq`, identify which cell it belongs to, and then run an `IndexFlatL2`-style exhaustive search between `xq` and only the indexed vectors belonging to that cell. We can also include vectors from other nearby cells.
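Before using the real index, the idea can be sketched in plain Python. This is not how FAISS is implemented: the centroids are hand-picked (FAISS learns them during training), the vectors are 2-D toys, and `build_inverted_lists` and `ivf_search` are hypothetical helper names. The `nprobe` argument mirrors the FAISS parameter of the same name, which controls how many nearby cells are searched.

```python
import math

def build_inverted_lists(vectors, centroids):
    """Assign every vector id to the cell of its nearest centroid."""
    lists = {cell: [] for cell in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        cell = min(range(len(centroids)),
                   key=lambda c: math.dist(vec, centroids[c]))
        lists[cell].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k=2, nprobe=1):
    """Exhaustively search only the nprobe cells closest to the query."""
    cells = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in cells for vec_id in lists[c]]
    candidates.sort(key=lambda vec_id: math.dist(query, vectors[vec_id]))
    return candidates[:k]

centroids = [(2.0, 3.0), (8.0, 7.0), (5.0, 1.0)]
vectors = [(1.0, 3.0), (2.5, 2.5), (8.0, 8.0),
           (7.0, 6.0), (5.0, 0.0), (4.2, 1.8)]
lists = build_inverted_lists(vectors, centroids)

query = (3.4, 2.2)
print(ivf_search(query, vectors, centroids, lists, k=2, nprobe=1))
print(ivf_search(query, vectors, centroids, lists, k=2, nprobe=2))
```

With `nprobe=1` the search only looks inside the query's own cell and returns vectors 1 and 0, missing vector 5, which is the true nearest neighbour but sits just across the boundary in the next cell. With `nprobe=2` the neighbouring cell is included and the result becomes 5 and 1. That is the speed-versus-accuracy trade-off in miniature.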

We initialize our new partitioned index by first adding our previous `IndexFlatL2` operation as a quantization step and feeding this into the new `IndexIVFFlat`: