# FAISS Intro + Similarity Search Algorithms

## Objective
Learn about the configuration parameters for the various algorithms:

* Flat
* LSH
* IVF
* PQ
* IVF PQ
* HNSW PQ
* HNSW Scalar

### Installation
https://github.com/facebookresearch/faiss/wiki/Installing-Faiss

https://github.com/matsui528/faiss_tips

#### Note
* A small dataset is used so that the behavior of each algorithm is easy to follow.
* The time measurements are NOT representative: the dataset is too small for most of these algorithms to show their efficiency gains.


In [38]:
# !pip install faiss-cpu
# !pip install faiss-gpu

## 1. Setup environment

In [39]:
from dotenv import load_dotenv
import os

import warnings

warnings.filterwarnings("ignore")

# Load the file that contains the API keys
# CHANGE THIS TO YOUR ENV FILE LOCATION
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

True

## 2. Generate the embeddings

#### Note
Cohere is used here, but any embedding model can be substituted.

In [40]:
# from langchain_community.embeddings import CohereEmbeddings
from langchain_cohere import CohereEmbeddings
import numpy as np

# Alternative: lighter Cohere model, embedding dimension = 384
# model_name = "embed-english-light-v3.0"

# Cohere model in use, embedding dimension = 1024
model_name = "embed-english-v3.0"

embeddings_model = CohereEmbeddings(model=model_name)

corpus = [
  "A man is eating food.", "A man is eating a piece of bread.",
  "The chef is preparing a delicious meal in the kitchen.", "A chef is tossing vegetables in a sizzling pan.",
  "A man is riding a horse.", "A man is riding a white horse on an enclosed ground.",
  "A woman is playing violin.", "A musician is tuning his guitar before the concert.",
  "The girl is carrying a baby.", "The baby is giggling while playing with her toys.",
  "The family is having a picnic under the shady oak tree.", "A group of friends is hiking up the mountain trail.",
  "The mechanic is repairing a broken-down car in the garage.", "The old man is feeding breadcrumbs to the ducks at the pond.",
  "The artist is sketching a beautiful landscape at sunset.", "A man is painting a colorful mural on the city wall.",
  "A team of scientists is conducting experiments in the laboratory.", "A group of students is studying together in the library.",
  "The birds are chirping happily in the morning sun.", "The dog is chasing its tail around the backyard.",
  "A group of children are playing soccer in the park.", "A monkey is playing drums.",
  "A boy is flying a kite in the open field.", "Two men pushed carts through the woods.",
  "A woman is walking her dog along the beach.", "A young girl is reading a book under a shady tree.",
  "The dancer is gracefully performing on stage.", "The farmer is harvesting ripe tomatoes from the vine."
]


# A list of embeddings
corpus_embeddings = embeddings_model.embed_documents(corpus)

# convert list of embeddings to numpy
corpus_embeddings_numpy = np.array(corpus_embeddings).astype(np.float32)


### Check the embedding dimension

#### Note
The quality of the embeddings, and therefore of the search results, depends in part on the embedding dimension.

In [41]:
embedding_dimension = len(corpus_embeddings[0])

print("Embedding dimension = ", embedding_dimension)


Embedding dimension =  1024


## 3. IndexFlatL2

* Some indexes must be trained before vectors can be added. IndexFlatL2 needs no training because it performs a brute-force (exhaustive) search over every stored vector.

In [42]:
import faiss


# Create index 
index_flatl2 = faiss.IndexFlatL2(embedding_dimension)

# Is trained
print("Is trained : ",index_flatl2.is_trained)



# Add embeddings
index_flatl2.add(corpus_embeddings_numpy)

print("Size of index : ",index_flatl2.ntotal)


Is trained :  True
Size of index :  28


#### Query

#### index.search

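Takes a 2-D float32 ndarray of query embeddings with shape (number of queries, embedding dimension), and the number of neighbors k.

Returns two ndarrays of shape (number of queries, k): the distances and the indexes of the nearest stored vectors, ordered from closest to farthest. For IndexFlatL2 the distances are squared L2 distances.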
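To make the input and output shapes concrete, here is a minimal pure-Python sketch of what a flat (exhaustive) L2 search computes. It is an illustration only and not how FAISS is implemented; the function name `flat_l2_search` and the toy vectors are made up for this example.

```python
def flat_l2_search(database, queries, k):
    """Return (distances, indexes); each has one row of length k per query."""
    all_distances, all_indexes = [], []
    for q in queries:
        # Squared L2 distance from the query to every stored vector
        scored = [
            (sum((a - b) ** 2 for a, b in zip(q, vec)), i)
            for i, vec in enumerate(database)
        ]
        scored.sort()          # ascending: closest vector first
        top = scored[:k]
        all_distances.append([d for d, _ in top])
        all_indexes.append([i for _, i in top])
    return all_distances, all_indexes


# Toy data (hypothetical): four 2-D vectors and a single query
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
queries = [[1.0, 1.0]]

distances, indexes = flat_l2_search(database, queries, k=2)
print("Distances : ", distances)
print("Indexes : ", indexes)
```

As in the notebook cells, the results of the first query are read from row 0 of each returned structure (`indexes[0]`, `distances[0]`).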


In [43]:
test_docs = [
    'I am a foodie',
    'My sister loves to play string instruments',
    'Musical instruments'
]

embed_query = embeddings_model.embed_documents([test_docs[0]])


In [44]:
%%time

k=3

# FAISS works on float32 arrays, so set the dtype explicitly
distances, indexes = index_flatl2.search(np.array(embed_query, dtype=np.float32), k)

print("Distances : ", distances)
print("Indexes : ", indexes)

print("------")
for i, corpus_index in enumerate(indexes[0]):
    print(corpus[corpus_index],"  (",  distances[0][i],")")

Distances :  [[0.8526868 1.0069261 1.16745  ]]
Indexes :  [[0 2 1]]
------
A man is eating food.   ( 0.8526868 )
The chef is preparing a delicious meal in the kitchen.   ( 1.0069261 )
A man is eating a piece of bread.   ( 1.16745 )
CPU times: total: 0 ns
Wall time: 11.3 ms


## 4. Index LSH

In [45]:
## IndexLSH: more bits generally improves recall, but it lowers QPS and may increase latency
## nbits is often chosen as a multiple of d (the embedding dimension)

## For testing, try low values such as 8, 16, 32, 64 and observe the impact on the results
## Low values translate into poor recall
nbits = 16

# initialize the index using our vectors dimensionality and nbits
index_lsh = faiss.IndexLSH(embedding_dimension, nbits)

# then add the data
index_lsh.add(corpus_embeddings_numpy)

In [46]:
%%time

# Test docs
test_docs = [
    'I am a foodie',
    'My sister loves to play string instruments',
    'Musical instruments'
]

# Create query vector
embed_query = embeddings_model.embed_documents([test_docs[0]])

k=3

distances, indexes = index_lsh.search(np.array(embed_query, dtype=np.float32), k)

print("Distances : ", distances)
print("Indexes : ", indexes)

print("------")
for i, corpus_index in enumerate(indexes[0]):
    print(corpus[corpus_index],"  (",  distances[0][i],")")

Distances :  [[2. 3. 3.]]
Indexes :  [[0 1 2]]
------
A man is eating food.   ( 2.0 )
A man is eating a piece of bread.   ( 3.0 )
The chef is preparing a delicious meal in the kitchen.   ( 3.0 )
CPU times: total: 2.11 s
Wall time: 604 ms
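
The distances printed above are small whole numbers, which is consistent with Hamming distances between binary codes rather than L2 distances. The sketch below illustrates the general idea of sign-based (random hyperplane) LSH in pure Python. It is a conceptual sketch under that assumption, not FAISS's implementation; the helper names and the toy vectors are hypothetical.

```python
import random


def make_hyperplanes(dim, nbits, seed=0):
    """Draw nbits random hyperplanes (one normal vector each) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(nbits)]


def lsh_code(vec, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming(code_a, code_b):
    """Number of bit positions in which the two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


nbits = 16
planes = make_hyperplanes(dim=4, nbits=nbits)

v = [0.2, -1.0, 0.5, 0.3]
near = [0.21, -0.98, 0.52, 0.29]     # almost the same direction as v
opposite = [-x for x in v]           # points the other way

print("near     :", hamming(lsh_code(v, planes), lsh_code(near, planes)))
print("opposite :", hamming(lsh_code(v, planes), lsh_code(opposite, planes)))
```

With more bits, the Hamming distance separates directions more finely, which is why low nbits values tend to give poor recall.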


## 5. IVF + PQ

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVFPQ.html

##### Note:
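The dataset is too small to demonstrate the real value of combining IVF with PQ. With only 28 vectors and very coarse codes, the results in this section are noticeably worse than those of the flat index.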

In [47]:
# Number of IVF cells (clusters); changing it affects both performance and recall
nlist_ivfpq = 4

# Number of sub-quantizers (M): each vector is split into this many sub-vectors
number_subquantizers = 8

# Number of bits per sub-vector code; each sub-quantizer has 2**bits centroids
number_bits_per_code = 2

quantizer = faiss.IndexFlatL2(embedding_dimension)  # coarse quantizer: a flat L2 index that assigns vectors to cells
index_ivfpq = faiss.IndexIVFPQ(quantizer, embedding_dimension, nlist_ivfpq, number_subquantizers, number_bits_per_code)

print("Is trained: ", index_ivfpq.is_trained)

Is trained:  False


### Train index

In [48]:
%%time

# Train the index first, then add the embeddings
index_ivfpq.train(corpus_embeddings_numpy)
index_ivfpq.add(corpus_embeddings_numpy)

print("Is trained: ", index_ivfpq.is_trained)
print("ntotal : ", index_ivfpq.ntotal)


Is trained:  True
ntotal :  28
CPU times: total: 172 ms
Wall time: 34.4 ms


### Run test

In [49]:
%%time

# Test docs
test_docs = [
    'I am a foodie',
    'My sister loves to play string instruments',
    'Musical instruments'
]

# Create query vector
embed_query = embeddings_model.embed_documents([test_docs[0]])

k=3

distances, indexes = index_ivfpq.search(np.array(embed_query, dtype=np.float32), k)

print("Distances : ", distances)
print("Indexes : ", indexes)
print("------")
for i, corpus_index in enumerate(indexes[0]):
    print(corpus[corpus_index],"  (",  distances[0][i],")")

Distances :  [[0.79832053 0.8417079  0.8417079 ]]
Indexes :  [[ 6  9 16]]
------
A woman is playing violin.   ( 0.79832053 )
The baby is giggling while playing with her toys.   ( 0.8417079 )
A team of scientists is conducting experiments in the laboratory.   ( 0.8417079 )
CPU times: total: 2.69 s
Wall time: 546 ms


## 6. Product Quantization (PQ)

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexPQ.html#

IndexPQ(int d, size_t M, size_t nbits, MetricType metric = METRIC_L2)

- d – dimensionality of the input vectors
- M – number of subquantizers
- nbits – number of bits per sub-vector index
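
The arithmetic behind these parameters can be worked through with the values used in this notebook (d = 1024, M = 8, nbits = 4, 28 training vectors). This is a small sketch; the variable names are chosen for this example, and the limit on nbits is inferred from the training error quoted in the next cell.

```python
d = 1024        # embedding dimension used in this notebook
M = 8           # number of sub-quantizers (sub-spaces)
nbits = 4       # bits per sub-vector index
n_train = 28    # vectors available for training

# Each vector is split into M sub-vectors of equal length
sub_vector_dim = d // M

# Each sub-quantizer learns 2**nbits centroids
centroids_per_subspace = 2 ** nbits

# A stored vector becomes M indexes of nbits bits each, rounded up to whole bytes
code_size_bytes = (M * nbits + 7) // 8

# The same vector stored as raw float32 values
raw_size_bytes = d * 4

# Largest nbits for which 2**nbits does not exceed the number of training vectors
max_nbits = n_train.bit_length() - 1

print("Sub-vector dimension      :", sub_vector_dim)
print("Centroids per sub-space   :", centroids_per_subspace)
print("PQ code size (bytes)      :", code_size_bytes)
print("Raw float32 size (bytes)  :", raw_size_bytes)
print("Max nbits for 28 vectors  :", max_nbits)
```

The compression is large (a few bytes instead of 4096 per vector), which is also why distances computed from PQ codes are only approximations.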

In [50]:
# Number of sub-spaces (sub-quantizers)
m = 8

# Number of bits per sub-vector index. Each sub-quantizer has 2**nbits centroids
# If this is set too high, training fails with an error : "Number of training points (28) should be at least as large as number of clusters (...)"
# With the 28 example vectors the maximum is nbits = 4 (2**4 = 16 <= 28 < 2**5 = 32)
nbits = 4

# The embedding dimension must be divisible by the number of sub-spaces
assert embedding_dimension % m == 0

# Create the index
index_pq = faiss.IndexPQ(embedding_dimension, m, nbits)

index_pq.is_trained

False

### Train index

In [51]:
%%time

# Train and add embeddings
index_pq.train(corpus_embeddings_numpy)
index_pq.add(corpus_embeddings_numpy)

print("is_trained: ", index_pq.is_trained)
print("ntotal : ", index_pq.ntotal)

is_trained:  True
ntotal :  28
CPU times: total: 172 ms
Wall time: 22 ms


### Test query

In [52]:
%%time

# Test docs
test_docs = [
    'I am a foodie',
    'My sister loves to play string instruments',
    'Musical instruments'
]

# Create query vector
embed_query = embeddings_model.embed_documents([test_docs[0]])

k=3

distances, indexes = index_pq.search(np.array(embed_query, dtype=np.float32), k)

print("Distances : ", distances)
print("Indexes : ", indexes)

print("------")
for i, corpus_index in enumerate(indexes[0]):
    print(corpus[corpus_index],"  (",  distances[0][i],")")

Distances :  [[0.86941785 0.92359406 0.92359406]]
Indexes :  [[0 2 3]]
------
A man is eating food.   ( 0.86941785 )
The chef is preparing a delicious meal in the kitchen.   ( 0.92359406 )
A chef is tossing vegetables in a sizzling pan.   ( 0.92359406 )
CPU times: total: 2.3 s
Wall time: 545 ms
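
In the output above, two different sentences share the distance 0.92359406. A plausible explanation (an assumption, not verified against the index) is that both vectors received the same PQ code, so they are reconstructed to the same point. The pure-Python sketch below shows how that can happen. The codebooks and vectors are hand-made toy values, whereas a real index learns its codebooks during training.

```python
def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """Replace each sub-vector by the id of its nearest centroid."""
    sub_dim = len(vec) // len(codebooks)
    code = []
    for m, centroids in enumerate(codebooks):
        sub = vec[m * sub_dim:(m + 1) * sub_dim]
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(sub, centroids[c]))
        code.append(nearest)
    return code


def pq_decode(code, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    out = []
    for m, c in enumerate(code):
        out.extend(codebooks[m][c])
    return out


# Hypothetical codebooks: 2 sub-spaces, 2 centroids each, sub-vector dimension 2
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # sub-space 0
    [[0.0, 1.0], [1.0, 0.0]],   # sub-space 1
]

a = [0.9, 1.1, 0.1, 0.8]
b = [1.2, 0.7, -0.1, 1.1]
query = [1.0, 1.0, 0.0, 0.0]

code_a, code_b = pq_encode(a, codebooks), pq_encode(b, codebooks)

# The query stays uncompressed; stored vectors are compared via their reconstruction
dist_a = sq_dist(query, pq_decode(code_a, codebooks))
dist_b = sq_dist(query, pq_decode(code_b, codebooks))

print("Codes     :", code_a, code_b)
print("Distances :", dist_a, dist_b)
```

Here `a` and `b` are different vectors, yet they collapse to the same code and therefore to the same approximate distance.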


## 7. IndexIVFFlat

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVFFlat.html

#### Note

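An index of -1 can appear in the result set. FAISS returns -1 when the probed cells contain fewer than k vectors; increasing nprobe makes the search visit more cells.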
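
The negative index can be reproduced with a small pure-Python sketch of an inverted-file search: vectors are grouped into cells, only the `nprobe` cells closest to the query are scanned, and missing results are padded with -1. This is an illustration under simplifying assumptions, not FAISS's implementation; the function name, the toy data, and the use of infinity as the padding distance are choices made for this example.

```python
def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ivf_search(query, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cell_order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = []
    for c in cell_order[:nprobe]:
        for vec_id, vec in cells[c]:
            candidates.append((sq_dist(query, vec), vec_id))
    candidates.sort()
    top = candidates[:k]
    dists = [d for d, _ in top]
    ids = [i for _, i in top]
    # Pad when the probed cells held fewer than k vectors
    missing = k - len(ids)
    ids += [-1] * missing
    dists += [float("inf")] * missing   # stand-in for "no result"
    return dists, ids


# Toy index (hypothetical): 2 cells, 5 vectors
centroids = [[0.0, 0.0], [10.0, 10.0]]
cells = {
    0: [(0, [0.0, 1.0]), (1, [1.0, 0.0])],
    1: [(2, [9.0, 10.0]), (3, [10.0, 9.0]), (4, [11.0, 11.0])],
}
query = [0.0, 0.0]

_, ids_one_probe = ivf_search(query, centroids, cells, k=3, nprobe=1)
_, ids_two_probes = ivf_search(query, centroids, cells, k=3, nprobe=2)

print("nprobe=1 :", ids_one_probe)
print("nprobe=2 :", ids_two_probes)
```

With one probe the closest cell holds only two vectors, so the third slot is -1; probing both cells fills all three slots.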



In [53]:
# how many cells
nlist = 4

# Quantizer
quantizer = faiss.IndexFlatL2(embedding_dimension)

# Index creation
ivfflat_index = faiss.IndexIVFFlat(quantizer, embedding_dimension, nlist)

# Check the training status before and after training
print("Is trained : ", ivfflat_index.is_trained)

ivfflat_index.train(corpus_embeddings_numpy)
ivfflat_index.add(corpus_embeddings_numpy)

print("Is trained : ", ivfflat_index.is_trained)
print("ntotal : ", ivfflat_index.ntotal)

Is trained :  False
Is trained :  True
ntotal :  28


In [54]:
%%time

k=3

# Number of cells to visit at query time (out of nlist)
nprobes = 3
ivfflat_index.nprobe = nprobes

distances, indexes = ivfflat_index.search(np.array(embed_query, dtype=np.float32), k)

print("Distances : ", distances)
print("Indexes : ", indexes)

print("------")
for i, corpus_index in enumerate(indexes[0]):
    print(corpus[corpus_index],"  (",  distances[0][i],")")

Distances :  [[0.8513175 1.0079684 1.1668954]]
Indexes :  [[0 2 1]]
------
A man is eating food.   ( 0.8513175 )
The chef is preparing a delicious meal in the kitchen.   ( 1.0079684 )
A man is eating a piece of bread.   ( 1.1668954 )
CPU times: total: 15.6 ms
Wall time: 1.01 ms


## 8. HNSW with Flat Storage

https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexHNSWFlat.html

#### IndexHNSWFlat.hnsw

https://faiss.ai/cpp_api/file/HNSW_8h.html#_CPPv4N5faiss4HNSWE

https://github.com/facebookresearch/faiss/blob/main/benchs/bench_hnsw.py

#### Note:
* This sample is provided to help you get familiar with HNSW.
* With such a small dataset, adjusting efConstruction/efSearch will not produce a noticeable gain.
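
To give a feel for what efSearch controls, the sketch below runs a best-first search on a single hand-built graph layer and keeps at most `ef` results. It is a simplification: a real HNSW index has several layers and FAISS's implementation differs in detail. The graph, the 1-D positions, and the function name `search_layer` are hypothetical.

```python
import heapq


def search_layer(graph, dist_to_query, entry, ef):
    """Best-first search over one graph layer, keeping at most ef results."""
    visited = {entry}
    d_entry = dist_to_query(entry)
    candidates = [(d_entry, entry)]   # min-heap: closest unexpanded node first
    results = [(-d_entry, entry)]     # max-heap (negated): worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # the closest candidate is worse than the worst kept result
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            d_nb = dist_to_query(neighbor)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, neighbor))
                heapq.heappush(results, (-d_nb, neighbor))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-neg_d, node) for neg_d, node in results)


# Toy graph: node 3 is the true nearest neighbor, reachable only through node 2,
# which itself is farther from the query than the entry point
positions = {0: 0.0, 1: 3.0, 2: -2.0, 3: 5.0}
graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
query = 5.0


def dist(node):
    return abs(positions[node] - query)


narrow = search_layer(graph, dist, entry=0, ef=1)
wide = search_layer(graph, dist, entry=0, ef=3)

print("ef=1 :", narrow)
print("ef=3 :", wide)
```

With ef=1 the search stops at a local minimum (node 1), while ef=3 keeps enough candidates to reach node 3. On a dataset as small as the one in this notebook the graph is so well connected that this effect rarely shows.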


In [64]:
# Number of neighbors (links) per node in the graph
m = 3

index_hnsw_flat = faiss.IndexHNSWFlat(embedding_dimension, m)

# Size of the candidate list explored while building the graph (set before adding vectors)
efConstruction = 1

# Size of the candidate list explored during search
efSearch = 1

index_hnsw_flat.hnsw.efConstruction = efConstruction
index_hnsw_flat.hnsw.efSearch = efSearch


print("Is trained: ", index_hnsw_flat.is_trained)

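# Training is not required here (is_trained is already True); the call is kept for a uniform workflow
index_hnsw_flat.train(corpus_embeddings_numpy)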
index_hnsw_flat.add(corpus_embeddings_numpy)

Is trained:  True


In [66]:
%%time

k=3

distances, indexes = index_hnsw_flat.search(np.array(embed_query, dtype=np.float32), k)

print("Distances : ", distances)
print("Indexes : ", indexes)

print("------")
for i, corpus_index in enumerate(indexes[0]):
    print(corpus[corpus_index],"  (",  distances[0][i],")")

Distances :  [[0.8513175 1.0079685 1.1668953]]
Indexes :  [[0 2 1]]
------
A man is eating food.   ( 0.8513175 )
The chef is preparing a delicious meal in the kitchen.   ( 1.0079685 )
A man is eating a piece of bread.   ( 1.1668953 )
CPU times: total: 0 ns
Wall time: 6.5 ms


## 9. Write & Read index

#### Note
Change the path below to a writable location on your machine. faiss.write_index expects a file path, not a directory.

In [None]:
faiss_index_cache = "c:\\temp\\faiss"

faiss.write_index(index_flatl2, faiss_index_cache)

In [None]:
index_flatl2 = faiss.read_index(faiss_index_cache)