In [1]:
%pip install faiss-gpu chromadb==0.3.21

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from datasets import load_dataset

In [3]:
import pandas as pd

pdf = pd.read_csv("data/labelled_newscatcher_dataset.csv", sep=";")
pdf['id'] = pdf.index
display(pdf)

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4
...,...,...,...,...,...,...,...
108769,NATION,https://www.vanguardngr.com/2020/08/pdp-govern...,vanguardngr.com,2020-08-08 02:40:00,PDP governors’ forum urges security agencies t...,en,108769
108770,BUSINESS,https://www.patentlyapple.com/patently-apple/2...,patentlyapple.com,2020-08-08 01:27:12,"In Q2-20, Apple Dominated the Premium Smartpho...",en,108770
108771,HEALTH,https://www.belfastlive.co.uk/news/health/coro...,belfastlive.co.uk,2020-08-12 17:01:00,Coronavirus Northern Ireland: Full breakdown s...,en,108771
108772,ENTERTAINMENT,https://www.thenews.com.pk/latest/696364-paul-...,thenews.com.pk,2020-08-05 04:59:00,Paul McCartney details post-Beatles distress a...,en,108772


### Vector Library: FAISS

Vector libraries are often sufficient for small, static datasets. Because a vector library is not a full-fledged database, it does not offer full CRUD (Create, Read, Update, Delete) support. Once the index has been built, adding, removing, or editing vectors generally means rebuilding the index from scratch.

That said, vector libraries are easy to use, lightweight, and fast. Examples of vector libraries are [FAISS](https://faiss.ai/), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), [ANNOY](https://github.com/spotify/annoy), and [HNSW](https://arxiv.org/abs/1603.09320).

FAISS supports several similarity measures, including L2 (Euclidean) distance and cosine similarity (computed as an inner product over normalized vectors). You can read more about the implementation on the FAISS [GitHub](https://github.com/facebookresearch/faiss/wiki/Getting-started#searching) page or in the [blog post](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/). The FAISS authors also published their own [best practice guide](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index) for choosing an index.
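
To make the relationship between these two measures concrete, here is a minimal pure-Python sketch. It does not call FAISS; the helper names (`l2_distance`, `inner_product`, `normalize`, `cosine_similarity`) and the two toy vectors are illustrative assumptions. It shows that cosine similarity is an inner product over unit-length vectors, and that for unit-length vectors the squared L2 distance equals `2 - 2 * cosine`, so both measures produce the same ranking once embeddings are normalized.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine similarity, computed as an inner product of unit vectors."""
    return inner_product(normalize(a), normalize(b))

# Two toy "embeddings" (hypothetical values, not model output).
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.5]

cos = cosine_similarity(a, b)
squared_l2_unit = l2_distance(normalize(a), normalize(b)) ** 2

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(cos, squared_l2_unit, 2 - 2 * cos)
```

In practice this is why embeddings are often normalized before being added to an index: an inner-product or L2 search over unit vectors then ranks results by cosine similarity.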

To read more about how vector libraries and vector databases compare, see [this blog post](https://weaviate.io/blog/vector-library-vs-vector-database#feature-comparison---library-versus-database).


**The overall workflow of FAISS is captured in the diagram below.**

<img width="100%" src="https://miro.medium.com/v2/resize:fit:1400/0*ouf0eyQskPeGWIGm">

Source: [How to use FAISS to build your first similarity search by Asna Shafiq](https://medium.com/loopio-tech/how-to-use-faiss-to-build-your-first-similarity-search-bf0f708aa772).
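
The build-then-query pattern behind this kind of workflow can be sketched without any dependencies. The snippet below is a toy, exhaustive (flat) L2 index in plain Python, under stated assumptions: the class `FlatL2Index`, its methods, and the hand-written 2-dimensional vectors are hypothetical and are not FAISS's API. In a real pipeline the vectors would come from an embedding model and the index from the library.

```python
import heapq
import math

class FlatL2Index:
    """Toy stand-in for an exhaustive (flat) L2 index; not the FAISS API."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vectors):
        """Store vectors; a vector's position in the list acts as its id."""
        for v in vectors:
            if len(v) != self.dim:
                raise ValueError(f"expected dimension {self.dim}, got {len(v)}")
            self.vectors.append(list(v))

    def search(self, query, k):
        """Return the k closest stored vectors as (distance, position) pairs."""
        scored = ((math.dist(query, v), i) for i, v in enumerate(self.vectors))
        return heapq.nsmallest(k, scored)

# Step 1: obtain vectors (hand-written here instead of model embeddings).
corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

# Step 2: build the index once from the corpus vectors.
index = FlatL2Index(dim=2)
index.add(corpus)

# Step 3: embed the query the same way and search for its nearest neighbours.
results = index.search([0.9, 0.1], k=2)
for distance, position in results:
    print(position, round(distance, 3))
```

A flat index compares the query against every stored vector, which is exact but scales linearly with corpus size; approximate index types trade some accuracy for speed.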


In [4]:
# sentence_transformers is a Python framework designed to make it easy to generate
# high-quality sentence embeddings, especially for
# NLP (Natural Language Processing) tasks.

from sentence_transformers import InputExample

pdf_subset = pdf.head(1000)

def example_create_fn(doc1: str) -> InputExample:
    """
    Helper function that wraps a single text (here, an article title) in a
    sentence_transformers InputExample. Only `texts` is set; `guid` and
    `label` keep their defaults.
    """
    return InputExample(texts=[doc1])

In [5]:
faiss_train_examples = pdf_subset.apply(lambda x: example_create_fn(x['title']), axis=1).tolist()
print(faiss_train_examples)

[<sentence_transformers.readers.InputExample.InputExample object at 0x7f88f1dd2c50>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f27414e0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f27415a0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f2743d00>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f2741570>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f2743a30>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f2743970>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f2743af0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f2743ac0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f1dd2cb0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f1dd2d10>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f88f1