## Exploring the Faiss Library

[Faiss](https://faiss.ai/) (Facebook AI Similarity Search) was developed by Facebook Engineering. It is a library for building an index of dense vectors and searching it at scale.

The following is a sentence embedding exercise exploring the library, courtesy of the folks at [Pinecone](https://www.pinecone.io/product/), a commercial provider of fully managed vector databases.

[Sentence embedding](https://en.wikipedia.org/wiki/Sentence_embedding) is a way to represent sentences numerically, as vectors. Its applications in natural language processing include knowledge bases that can be queried through vector indexes built for search. [LangChain](https://github.com/hwchase17/langchain) is one good example of such an application; it was released as an open source project in Q4 2022.

Similarity search itself is a complex topic; however, there is a rich set of writings online. Here is a [link to a good starter series](https://towardsdatascience.com/similarity-search-knn-inverted-file-index-7cab80cc0e79) for those curious.

In the exercise, the embeddings are built with a BERT model from the [sentence-transformers library](https://pypi.org/project/sentence-transformers/).

In [2]:
import requests
from io import StringIO
import pandas as pd


In [3]:
urls = [
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/sick2014/SICK_train.txt',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.train.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2013/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/images.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/images.test.tsv'
]

sentences = []

In [4]:
## Ingest data
for index, url in enumerate(urls):
  res = requests.get(url)
  if index == 0:
    data = pd.read_csv(StringIO(res.text), sep='\t')
    sentences = data['sentence_A'].tolist()
    sentences.extend(data['sentence_B'].tolist())
  else:
    data = pd.read_csv(StringIO(res.text), sep='\t', header=None, on_bad_lines='skip')
    sentences.extend(data[1].tolist())
    sentences.extend(data[2].tolist())

# Deduplicate and drop non-string values (e.g. NaN from empty cells)
sentences = [s for s in set(sentences) if isinstance(s, str)]
len(sentences)


14504
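
One caveat with the cell above: iterating over a `set` of strings does not give a stable order across Python sessions (string hashing is randomized by default), so the positions returned by later searches, such as `9295`, may differ between runs. A minimal sketch of an order-preserving alternative follows. The `raw_sentences` list and the `dedupe_keep_order` helper are made up for illustration and stand in for the downloaded data.

```python
# Hypothetical stand-in for the raw sentence list; NaN mimics an empty cell read by pandas.
raw_sentences = [
    "A man is running",
    "A dog barks",
    "A man is running",   # duplicate
    float("nan"),         # non-string value to be dropped
    "A dog barks",        # duplicate
    "Two kids play football",
]

def dedupe_keep_order(items):
    """Drop non-strings and duplicates while keeping first-seen order."""
    strings = (item for item in items if isinstance(item, str))
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(strings))

unique_sentences = dedupe_keep_order(raw_sentences)
print(unique_sentences)
```

With a stable order, the same index position refers to the same sentence on every run, which makes search results easier to compare.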

In [5]:
## Build dense vectors with a BERT model from the sentence-transformers library
!pip install sentence-transformers


Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence-transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)

In [7]:
from sentence_transformers import SentenceTransformer

In [8]:
# Initialize sentence transformer model
model = SentenceTransformer('bert-base-nli-mean-tokens')
# Create sentence embeddings
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(14504, 768)
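
The shape above means 14,504 vectors of 768 dimensions each. As a rough back-of-the-envelope check, assuming the embeddings are stored as 32-bit floats (an assumption here, though it is the type Faiss works with), the raw vectors take up about 42 MiB. For a flat, uncompressed index this is the dominant memory cost.

```python
n_vectors, dim = 14504, 768   # shape reported above
bytes_per_float32 = 4         # assumption: float32 storage

total_bytes = n_vectors * dim * bytes_per_float32
total_mib = total_bytes / (1024 ** 2)
print(f"{total_bytes:,} bytes ~ {total_mib:.1f} MiB")  # 44,556,288 bytes ~ 42.5 MiB
```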

In [17]:
# A simple example: a flat L2 index measures the L2 (Euclidean) distance between the query vector and every vector loaded into the index.
# Install Faiss
!apt install libomp-dev
!pip install faiss-cpu --no-cache

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libomp-dev is already the newest version (1:14.0-55~exp2).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.


In [21]:
import faiss

In [22]:
# Initialize the L2 index with the embedding dimension (768 here)
index = faiss.IndexFlatL2(sentence_embeddings.shape[1])

In [23]:
# Load the embeddings into the index
index.add(sentence_embeddings)
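
Conceptually, a flat L2 index performs an exhaustive search: it compares the query with every stored vector and keeps the `k` closest. The pure-Python sketch below illustrates that idea on toy 2-D vectors. It is not how Faiss is implemented internally, and `brute_force_l2_search` is a hypothetical helper name. Like `IndexFlatL2`, it reports squared L2 distances (no square root is taken).

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_l2_search(vectors, query, k):
    """Return (distances, ids) of the k stored vectors nearest to query."""
    scored = ((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    nearest = heapq.nsmallest(k, scored)  # sorted by distance, ties broken by id
    distances = [dist for dist, _ in nearest]
    ids = [idx for _, idx in nearest]
    return distances, ids

# Toy 2-D "embeddings" (made-up values for illustration)
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
toy_query = [1.0, 0.25]

toy_d, toy_i = brute_force_l2_search(toy_vectors, toy_query, k=2)
print(toy_i)  # [3, 1]
print(toy_d)  # [0.0625, 0.5625]
```

The cost grows linearly with the number of stored vectors, which is why approximate index types exist for larger collections.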

In [28]:
%time

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.34 µs


In [29]:
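# Run a test query, asking for the k = 5 nearest neighbors
k = 5
xq = model.encode(["Someone runs with a football"])
# Note: %time on its own line times an empty statement, so the figures below do not measure the search.
# To time the search itself, put both on one line: %time d, i = index.search(xq, k)
%time
d, i = index.search(xq, k)  # d: distances, i: positions of the nearest sentences
print(i)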

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.72 µs
[[ 9295 12864  7969  4554  7222]]
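
To time the search call itself, in or outside a notebook, the standard library is enough. This is a minimal sketch: `timed` is a hypothetical helper, and `sum(range(...))` stands in for `index.search(xq, k)` so that the snippet runs without Faiss installed.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be: timed(index.search, xq, k)
result, elapsed = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed * 1e3:.3f} ms")
```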


In [42]:
sentences[9295]

'Two groups of people are playing football'

In [45]:
sentences[12864]

'A football player kicks the ball.'

In [46]:
sentences[7969]

'A group of people playing football is running in the field'

In [47]:
sentences[4554]

'A group of football players is running in the field'

In [48]:
sentences[7222]

'A person playing football is running past an official carrying a football'
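
Rather than looking up `sentences` one result at a time, the ids and distances returned by a search can be paired up in a loop. A small sketch follows, using a made-up corpus and made-up results laid out as one row of `k` values per query, like the `i` printed above. Nested lists stand in for NumPy arrays, and `format_hits` is a hypothetical helper.

```python
# Hypothetical stand-ins so the snippet runs on its own
corpus = ["A dog barks", "Kids play football", "A man cooks", "A player kicks the ball"]
ids = [[1, 3]]             # like `i`: one row of neighbor ids per query
distances = [[0.5, 1.25]]  # like `d`: the matching distances

def format_hits(corpus, ids_row, distances_row):
    """Pair each neighbor id with its rank, distance, and sentence text."""
    return [
        (rank, idx, dist, corpus[idx])
        for rank, (idx, dist) in enumerate(zip(ids_row, distances_row), start=1)
    ]

hits = format_hits(corpus, ids[0], distances[0])
for rank, idx, dist, text in hits:
    print(f"{rank}. [{idx}] d={dist:.2f}  {text}")
```

In the notebook, the equivalent call would pass `sentences`, `i[0]`, and `d[0]`.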