## Faiss - Facebook AI similarity search
Faiss is a library for efficient, fast similarity search over dense vectors. It is designed for very large datasets, including ones that only partly fit in RAM. https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

1. Generate embedding vectors for your data and for your query.

2. Choose an index. An index is a data structure for efficient vector storage and fast similarity search. By choosing an index such as `IndexFlatL2` you fix the similarity function (Euclidean distance, L2), the search method (exact brute-force search: the distance between the query vector and every stored vector is computed and the k nearest neighbours are returned), and the storage format (uncompressed float32 components, so one vector of dimension d takes 4*d bytes).

3. Add the embedding vectors to the index.
4. Search the index with your query vector to get its nearest neighbours.
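
To make the four steps concrete before installing anything, here is a minimal NumPy sketch of what an exact, brute-force L2 search computes. It is an illustration only, not how Faiss implements `IndexFlatL2` internally; the function name `flat_l2_search` and the toy 2-D vectors are assumptions made up for this example.

```python
import numpy as np

def flat_l2_search(data, queries, k):
    """Exact k-NN by brute force: compare every query with every stored vector."""
    # Squared L2 distance between each query (rows) and each stored vector (columns).
    diffs = queries[:, None, :] - data[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Positions of the k smallest distances per query, closest first.
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, nearest, axis=1), nearest

# Step 1: toy stand-ins for embeddings (real ones would come from a model).
data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1]], dtype=np.float32)

# Steps 2-4: "index" the data by keeping it in an array, then search it.
distances, indices = flat_l2_search(data, queries, k=2)
print(indices)  # [[1 0]]
```

The result has one row per query and k columns, which is the same layout the Faiss search call later in this notebook returns.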

In [1]:
!pip install sentence-transformers -q


In [17]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.8.0


## 1. Generate embeddings

In [5]:
import pandas as pd
from sentence_transformers import SentenceTransformer


In [12]:
# A small quotes dataset is used here for demonstration, even though Faiss targets much larger data.
# Faiss expects the embeddings as float32 arrays.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
df = pd.read_csv('sample_data/quotes.csv')
embeddings = embedding_model.encode(df['quote'])
query_vector = embedding_model.encode('What is Love')

In [20]:
dimensions = embeddings[0].shape[0] # each embedding vector has 384 dimensions
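
Since Faiss works on float32 data, it can help to normalise the input format explicitly before handing arrays to an index. The sketch below uses only NumPy; `as_faiss_input` is a hypothetical helper name, and the random array is a float64 stand-in for real embeddings (the encoder used above may already return float32, in which case the conversion changes nothing).

```python
import numpy as np

def as_faiss_input(vectors):
    """Return a 2-D, C-contiguous float32 array (hypothetical helper)."""
    arr = np.ascontiguousarray(vectors, dtype=np.float32)
    if arr.ndim == 1:
        arr = arr.reshape(1, -1)  # a single vector becomes one row
    return arr

# Float64 stand-in for a batch of three 384-dimensional embeddings.
raw = np.random.default_rng(0).random((3, 384))

prepared = as_faiss_input(raw)
print(prepared.dtype, prepared.shape)  # float32 (3, 384)

# A single query vector is reshaped to (1, d), one row per query.
single = as_faiss_input(raw[0])
print(single.shape)  # (1, 384)
```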

## 2. Choose index

In [18]:
import faiss

In [21]:
index = faiss.IndexFlatL2(dimensions)
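
Because a flat index stores every vector uncompressed at 4 bytes per component, its memory footprint is easy to estimate. The helper name `flat_index_bytes` and the one-million-vector figure below are illustrative assumptions, not values from this notebook; only the 384 dimensions come from the model used above.

```python
def flat_index_bytes(n_vectors, dimensions, bytes_per_component=4):
    """Rough storage estimate for uncompressed float32 vectors."""
    return n_vectors * dimensions * bytes_per_component

# One 384-dimensional vector.
per_vector = flat_index_bytes(1, 384)
print(per_vector)  # 1536

# A hypothetical collection of one million such vectors, in gigabytes.
one_million = flat_index_bytes(1_000_000, 384)
print(one_million / 1e9)  # 1.536
```

This linear growth is the reason compressed or partitioned index types exist for datasets that do not fit in RAM.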

## 3. Add embedding vectors to index

In [22]:
index.add(embeddings)

## 4. Query index for k nearest neighbours

In [36]:
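import numpy as np
k = 2  # number of nearest neighbours to return
# search expects a 2-D array with one row per query: shape (1, 384) here
query = np.array([query_vector])
# both results have shape (n_queries, k), ordered from closest to farthest
distance, indices = index.search(query, k=k)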


In [37]:
distance

array([[0.71335363, 0.8702464 ]], dtype=float32)
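
The values reported by `IndexFlatL2` are squared Euclidean distances. If the embeddings are unit-length (an assumption that this notebook does not verify), a squared L2 distance d maps to a cosine similarity of 1 - d/2, which can be easier to interpret. The sketch below checks that identity on random unit vectors and then applies it to the two distances shown above.

```python
import numpy as np

# Two random unit vectors as stand-ins for normalised embeddings.
rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b)
squared_l2 = float(np.sum((a - b) ** 2))
cosine = float(a @ b)
print(np.isclose(squared_l2, 2 - 2 * cosine))  # True

# Applied to the distances above, valid only under the unit-length assumption.
similarities = [round(1 - d / 2, 3) for d in (0.71335363, 0.8702464)]
print(similarities)  # [0.643, 0.565]
```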

In [38]:
indices

array([[78, 98]])

In [41]:
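# indices has one row per query; each row holds the row numbers of the k nearest quotes
for neighbour_ids in indices:
  print(df['quote'][neighbour_ids])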

78    Love is not only something you feel, it is som...
98    Love is an irresistible desire to be irresisti...
Name: quote, dtype: object


In [42]:
# Repeat the search with a query that has no obvious match in the quotes
query_vector = embedding_model.encode('What is Green')
query = np.array([query_vector])
distance, indices = index.search(query, k=k)

for neighbour_ids in indices:
  print(df['quote'][neighbour_ids])

74    Liberty, when it begins to take root, is a pla...
94    Freedom is nothing else but a chance to be bet...
Name: quote, dtype: object
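
A k-nearest-neighbour search returns k results whenever the index holds at least k vectors, even if none of them is a good match, which is presumably why the second query yields quotes about liberty and freedom. One simple way to handle this is to drop neighbours beyond a distance cut-off. The sketch below is plain Python; `filter_by_distance` is a hypothetical helper, the sample values are copied from the 'What is Love' output above, and the cut-off of 0.8 is made up and would need tuning on real data.

```python
def filter_by_distance(distances, indices, max_distance):
    """Keep only (index, distance) pairs at or below max_distance, per query."""
    results = []
    for dist_row, idx_row in zip(distances, indices):
        kept = [(int(i), float(d))
                for d, i in zip(dist_row, idx_row)
                if d <= max_distance]
        results.append(kept)
    return results

# Values from the 'What is Love' search; the threshold is an arbitrary example.
sample_distances = [[0.71335363, 0.8702464]]
sample_indices = [[78, 98]]

filtered = filter_by_distance(sample_distances, sample_indices, max_distance=0.8)
print(filtered)  # [[(78, 0.71335363)]]
```

The same function accepts the NumPy arrays returned by the search call, since it only iterates over rows.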
