In [1]:
%pip install faiss-gpu chromadb==0.3.21
cache_dir='cache_dir'

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from datasets import load_dataset
import torch

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
num_gpus = torch.cuda.device_count()

if num_gpus > 0:
    print(f"GPUs available: {num_gpus}")
    for i in range(num_gpus):
        gpu_name = torch.cuda.get_device_name(i)
        print(f"GPU {i}: {gpu_name}")
else:
    print("There is no GPU available. Using CPU")

GPUs available: 1
GPU 0: NVIDIA RTX A4000


In [4]:
pdf = pd.read_csv("data/labelled_newscatcher_dataset.csv", sep=";")
pdf['id'] = pdf.index
display(pdf)

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4
...,...,...,...,...,...,...,...
108769,NATION,https://www.vanguardngr.com/2020/08/pdp-govern...,vanguardngr.com,2020-08-08 02:40:00,PDP governors’ forum urges security agencies t...,en,108769
108770,BUSINESS,https://www.patentlyapple.com/patently-apple/2...,patentlyapple.com,2020-08-08 01:27:12,"In Q2-20, Apple Dominated the Premium Smartpho...",en,108770
108771,HEALTH,https://www.belfastlive.co.uk/news/health/coro...,belfastlive.co.uk,2020-08-12 17:01:00,Coronavirus Northern Ireland: Full breakdown s...,en,108771
108772,ENTERTAINMENT,https://www.thenews.com.pk/latest/696364-paul-...,thenews.com.pk,2020-08-05 04:59:00,Paul McCartney details post-Beatles distress a...,en,108772


### Vector Library: FAISS

Vector libraries are often sufficient for small, static data. Because a vector library is not a full-fledged database, it lacks CRUD (Create, Read, Update, Delete) support. Once the index has been built, adding, removing, or editing vectors generally means rebuilding the index from scratch.

That said, vector libraries are easy, lightweight, and fast to use. Examples of vector libraries are [FAISS](https://faiss.ai/), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), [ANNOY](https://github.com/spotify/annoy), and [HNSW](https://arxiv.org/abs/1603.09320).

FAISS supports several similarity measures, including L2 (Euclidean) distance and cosine similarity. You can read more about the implementation on the [GitHub](https://github.com/facebookresearch/faiss/wiki/Getting-started#searching) page or in the [blog post](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/). The FAISS authors have also published their own [best practice guide here](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index).
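To make the difference between the two measures concrete, here is a minimal pure-Python sketch (not FAISS code; the vectors and helper names are made up for illustration). It shows that L2 distance and cosine similarity can rank the same documents differently, and that once vectors are scaled to unit length the two agree, because the squared L2 distance equals `2 - 2 * cosine`.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical 2-d vectors, chosen only to illustrate the two measures
query = [1.0, 0.0]
doc_same_direction = [3.0, 0.0]  # same direction as the query, larger magnitude
doc_nearby = [1.0, 1.0]          # closer in space, but a different direction

l2_same = l2_distance(query, doc_same_direction)
l2_near = l2_distance(query, doc_nearby)
cos_same = cosine_similarity(query, doc_same_direction)
cos_near = cosine_similarity(query, doc_nearby)

# L2 prefers doc_nearby, while cosine similarity prefers doc_same_direction
print(l2_near < l2_same, cos_same > cos_near)

# On unit-length vectors, squared L2 distance equals 2 - 2 * cosine similarity
unit_l2_squared = l2_distance(normalize(query), normalize(doc_nearby)) ** 2
print(abs(unit_l2_squared - (2 - 2 * cos_near)) < 1e-9)
```

This identity is why normalizing vectors and then searching by inner product, as this notebook does later, amounts to a cosine-similarity search.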

If you'd like to read up more on the comparisons between vector libraries and databases, [here is a good blog post](https://weaviate.io/blog/vector-library-vs-vector-database#feature-comparison---library-versus-database).


**The overall workflow of FAISS is captured in the diagram below.**

<img width="100%" src="https://miro.medium.com/v2/resize:fit:1400/0*ouf0eyQskPeGWIGm">

Source: [How to use FAISS to build your first similarity search by Asna Shafiq](https://medium.com/loopio-tech/how-to-use-faiss-to-build-your-first-similarity-search-bf0f708aa772).


In [5]:
# sentence_transformers is a Python framework designed to make it easy to generate
# high-quality sentence embeddings, especially for
# NLP (Natural Language Processing) tasks.

from sentence_transformers import InputExample

pdf_subset = pdf.head(10000)

def example_create_fn(doc1: str) -> InputExample:
    """
        Helper function that wraps a single text (here, a news title) in a
        sentence_transformers InputExample; guid and label keep their defaults
    """
    return InputExample(texts=[doc1])

In [6]:
faiss_train_examples = pdf_subset.apply(lambda x: example_create_fn(x['title']), axis=1).tolist()
print(faiss_train_examples)

[<sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbf310>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbf760>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbf7c0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbf820>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbf880>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbf8e0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbf940>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbf9a0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbfa00>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbfa60>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7dbfac0>, <sentence_transformers.readers.InputExample.InputExample object at 0x7f1ab7

In [7]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    'all-MiniLM-L6-v2',
    cache_folder=cache_dir
)

faiss_title_embedding = model.encode(pdf_subset.title.values.tolist())
print(faiss_title_embedding)

[[-0.11270548  0.04076545  0.02181419 ... -0.01874594 -0.03136871
   0.06824829]
 [-0.0218716  -0.03349996  0.07321802 ...  0.0336232  -0.00563894
  -0.00630979]
 [ 0.0160838   0.00279444 -0.01504422 ... -0.00706241  0.00905898
  -0.02835049]
 ...
 [-0.02578137  0.05926263  0.031239   ... -0.01920929  0.0108579
   0.10288586]
 [-0.0632398  -0.03073819  0.09070415 ... -0.11329682 -0.01014124
   0.09564894]
 [ 0.10612507  0.02338426 -0.00669216 ... -0.02930272 -0.02617294
   0.0091322 ]]


In [8]:
len(faiss_title_embedding), len(faiss_title_embedding[0])

(10000, 384)

### Step 3: Saving embedding vectors to FAISS index
Below, we create the FAISS index object based on our embedding vectors, normalize vectors, and add these vectors to the FAISS index. 
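Before the actual FAISS cells, the toy sketch below illustrates the idea behind a flat inner-product index wrapped in an ID map. It is pure Python and is not how FAISS is implemented; `ToyFlatIPIndex`, its data, and the IDs are hypothetical. It stores normalized vectors next to caller-supplied IDs and answers a query by scoring every stored vector.

```python
import math

def normalize(v):
    """Scale v to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class ToyFlatIPIndex:
    """Brute-force inner-product index with caller-supplied IDs (illustration only)."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []
        self.ids = []

    def add_with_ids(self, vectors, ids):
        """Store each vector together with the ID the caller assigned to it."""
        for vec, vec_id in zip(vectors, ids):
            if len(vec) != self.dim:
                raise ValueError(f"expected dimension {self.dim}, got {len(vec)}")
            self.vectors.append(list(vec))
            self.ids.append(vec_id)

    def search(self, query, k):
        """Score every stored vector against the query and return the top k."""
        scored = [
            (sum(q * x for q, x in zip(query, vec)), vec_id)
            for vec, vec_id in zip(self.vectors, self.ids)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        top = scored[:k]
        return [score for score, _ in top], [vec_id for _, vec_id in top]

# Hypothetical vectors and IDs
toy_index = ToyFlatIPIndex(dim=2)
toy_index.add_with_ids(
    [normalize([1.0, 0.0]), normalize([1.0, 1.0]), normalize([0.0, 1.0])],
    [100, 200, 300],
)
scores, found_ids = toy_index.search(normalize([2.0, 0.1]), k=2)
print(found_ids, scores)
```

Unlike this toy, the FAISS `search` call used later takes a batch of queries and returns two arrays (similarities and IDs), one row per query.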


In [9]:
import numpy as np
import faiss

pdf_to_index = pdf_subset.set_index(['id'], drop=False)
id_index = np.array(pdf_to_index.id.values).flatten().astype('int')
print(id_index)

[   0    1    2 ... 9997 9998 9999]


In [10]:
content_encoded_normalized = faiss_title_embedding.copy()
faiss.normalize_L2(content_encoded_normalized)

In [11]:
# IndexIDMap translates search results to IDs: https://faiss.ai/cpp_api/file/IndexIDMap_8h.html#_CPPv4I0EN5faiss18IndexIDMapTemplateE
# The IndexFlatIP below builds a flat (exhaustive) inner-product index

index_content = faiss.IndexIDMap(faiss.IndexFlatIP(len(faiss_title_embedding[0])))
index_content.add_with_ids(content_encoded_normalized, id_index)

### Step 4: Search for relevant documents
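We define a search function below that first vectorizes the query text and then searches for the most similar vectors. Because both the indexed vectors and the query vector are L2-normalized, the inner product used by the index equals cosine similarity, so a higher score means a closer match.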

In [12]:
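def search_content(query, pdf_to_index, k=5):
    # Embed the query and normalize it so inner product equals cosine similarity
    query_vector = model.encode([query])
    faiss.normalize_L2(query_vector)

    # search returns (similarities, ids), each with one row per query
    top_k = index_content.search(query_vector, k)
    ids = top_k[1][0].tolist()
    similarities = top_k[0][0].tolist()
    # Copy so that adding a column does not touch pdf_to_index
    results = pdf_to_index.loc[ids].copy()
    results['similarities'] = similarities

    return results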

display(search_content('bad', pdf_to_index))

id,topic,link,domain,published_date,title,lang,id,similarities
4289,TECHNOLOGY,https://www.thurrott.com/mobile/239182/expired,thurrott.com,2020-08-17 15:12:39,Expired,en,4289,0.356567
4354,TECHNOLOGY,https://www.futuregamereleases.com/2020/08/spi...,futuregamereleases.com,2020-08-04 13:19:35,Spider-Man Marvel’s Avengers PlayStation Exclu...,en,4354,0.294088
3510,HEALTH,https://www.westernadvocate.com.au/story/68592...,westernadvocate.com.au,2020-08-04 23:00:00,Sweet cause for concern,en,3510,0.272047
9622,HEALTH,https://timesofindia.indiatimes.com/life-style...,timesofindia.indiatimes.com,2020-08-10 03:30:00,7 dos and don'ts of eating papaya,en,9622,0.259334
6697,TECHNOLOGY,https://www.kotaku.com.au/2020/08/how-much-do-...,kotaku.com.au,2020-08-11 03:42:00,How Much Do You Think The Xbox Series X Will C...,en,6697,0.25719


### Vector Database: Chroma

Chroma is an open-source embedding database. The company raised its [seed funding in April 2023](https://www.trychroma.com/blog/seed), and the database is quickly becoming a popular choice for supporting LLM-based applications.

In [13]:
import chromadb
from chromadb.config import Settings

chroma_db_directory = 'db'


chroma_client = chromadb.Client(
    Settings(
        chroma_db_impl="duckdb+parquet",
        persist_directory=chroma_db_directory,  # this is an optional argument. If you don't supply this, the data will be ephemeral
    )
)

Using embedded DuckDB with persistence: data will be stored in: db


#### Chroma Concept: Collection

A Chroma `collection` is akin to an index that stores one set of your documents.

According to the [docs](https://docs.trychroma.com/getting-started): 
> Collections are where you will store your embeddings, documents, and additional metadata

A convenient feature of Chroma is that, if you don't supply an embedding function to vectorize text, it automatically loads a default one, i.e. `SentenceTransformerEmbeddingFunction`, which handles tokenization, embedding, and indexing for you. If you would like to change the embedding model, read [here on how to do that](https://docs.trychroma.com/embeddings). TLDR: you can add an optional `model_name` argument.

You can read [the documentation here](https://docs.trychroma.com/usage-guide#using-collections) on rules for collection names.
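As a conceptual aid, the sketch below mimics what a collection bundles together: IDs, documents, metadata, and embeddings, with a fallback embedding function when none is supplied. It is a pure-Python toy and not Chroma's implementation; `ToyCollection` and `toy_default_embedding` are hypothetical names, and the "embedding" is a placeholder (text length and word count) rather than a real model.

```python
def toy_default_embedding(texts):
    """Placeholder embedding: [character count, word count]. Not a real model."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]

class ToyCollection:
    """Stores documents, metadata, and embeddings under caller-supplied IDs."""

    def __init__(self, name, embedding_function=None):
        self.name = name
        # Fall back to a default embedding function when none is supplied
        self.embedding_function = embedding_function or toy_default_embedding
        self.records = {}

    def add(self, documents, metadatas, ids):
        """Embed the documents and store everything keyed by ID."""
        embeddings = self.embedding_function(documents)
        for doc_id, doc, meta, emb in zip(ids, documents, metadatas, embeddings):
            if doc_id in self.records:
                raise ValueError(f"duplicate id: {doc_id}")
            self.records[doc_id] = {
                "document": doc,
                "metadata": meta,
                "embedding": emb,
            }

# Default embedding function
toy_collection = ToyCollection("demo")
toy_collection.add(
    documents=["space news", "health"],
    metadatas=[{"topics": "SCIENCE"}, {"topics": "HEALTH"}],
    ids=["id0", "id1"],
)

# Swapping in a different (equally hypothetical) embedding function
custom_collection = ToyCollection(
    "demo_custom",
    embedding_function=lambda texts: [[float(len(t))] for t in texts],
)
custom_collection.add(["space news"], [{"topics": "SCIENCE"}], ["id0"])

print(toy_collection.records["id0"])
print(custom_collection.records["id0"]["embedding"])
```

The design point is that the collection, not the caller, is responsible for turning text into vectors, which is why changing the model only means changing the embedding function.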


In [14]:
collection_name = "my_news"

# If the collection already exists, delete it before creating it again
if collection_name in [c.name for c in chroma_client.list_collections()]:
    chroma_client.delete_collection(name=collection_name)

print(f"Creating collection: '{collection_name}'")
collection = chroma_client.create_collection(name=collection_name)

No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


Creating collection: 'my_news'


In [15]:
display(pdf_subset)

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could...,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-met...,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how ...,en,4
...,...,...,...,...,...,...,...
9995,TECHNOLOGY,https://www.cnet.com/news/best-google-assistan...,cnet.com,2020-08-11 14:35:15,The best Google Assistant and Nest devices of ...,en,9995
9996,TECHNOLOGY,https://www.somagnews.com/scientists-discover-...,somagnews.com,2020-08-16 11:20:00,Scientists Discover a Liquid that Acts Like a ...,en,9996
9997,TECHNOLOGY,https://www.forbes.com/sites/jaymcgregor/2020/...,forbes.com,2020-08-10 12:00:00,Samsung Note 20 Ultra Gains Gaming Advantage O...,en,9997
9998,TECHNOLOGY,https://au.finance.yahoo.com/news/spacex-attem...,au.finance.yahoo.com,2020-08-17 12:32:00,SpaceX will attempt to break a rocket reusabil...,en,9998


In [16]:
qnt = 100
collection.add(
    documents=pdf_subset['title'][:qnt].tolist(),
    metadatas=[{'topics': topic} for topic in pdf_subset['topic'][:qnt].tolist()],
    ids=[f"id{x}" for x in range(qnt)]
)

#### Step 2: Query for 10 relevant documents on "space"

We will return the 10 most relevant documents. You can think of `10` as the 10 nearest neighbors. You can change the number of results returned through `n_results`.
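The query call returns a dictionary of nested lists, with one inner list per query text. A small helper can flatten one query's hits into rows that are easier to read. This is a sketch: `flatten_query_results` is a hypothetical helper, the sample dictionary reuses the first two hits from this notebook's query output, and the `distances` key is an assumption, so the helper tolerates its absence.

```python
def flatten_query_results(results, query_index=0):
    """Turn the nested-list result for one query into a list of row dicts."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    # "distances" is assumed to be optional; fill with None when it is missing
    distances = results.get("distances")
    row_distances = distances[query_index] if distances else [None] * len(ids)
    return [
        {"id": doc_id, "document": doc, "distance": dist}
        for doc_id, doc, dist in zip(ids, documents, row_distances)
    ]

# Sample shaped like the query output (first two hits only)
sample_results = {
    "ids": [["id30", "id72"]],
    "embeddings": None,
    "documents": [[
        "NASA drops \"insensitive\" nicknames for cosmic objects",
        "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
    ]],
}

rows = flatten_query_results(sample_results)
for row in rows:
    print(row["id"], "|", row["document"])
```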

In [17]:
import json

In [18]:
results = collection.query(
    query_texts=['Galaxy space'], n_results=10
)

print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id30",
            "id72",
            "id23",
            "id7",
            "id76",
            "id75",
            "id26",
            "id13",
            "id40",
            "id10"
        ]
    ],
    "embeddings": null,
    "documents": [
        [
            "NASA drops \"insensitive\" nicknames for cosmic objects",
            "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
            "Hubble Uses Moon As \u201cMirror\u201d to Study Earth\u2019s Atmosphere \u2013 Proxy in Search of Potentially Habitable Planets Around Other Stars",
            "Orbital space tourism set for rebirth in 2021",
            "Australia's small yet crucial part in the mission to find life on Mars",
            "Alien base on Mercury: ET hunters claim to find huge UFO",
            "\u2018It came alive:\u2019 NASA astronauts describe experiencing splashdown in SpaceX Dragon",
            "Martian Night Sky Pulses in Ultraviolet Light",

### Prompt engineering for question answering 

Now that we have identified documents about space from the news dataset, we can pass these documents as additional context for a language model to generate a response based on them! 

We first need to pick a `text-generation` model. Below, we use a Hugging Face model. You can also use OpenAI, but you will need to get an OpenAI token and [pay based on the number of tokens](https://openai.com/pricing).


In [19]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
lm_model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_dir)

pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device=device
)

Here's where prompt engineering, the practice of developing and refining prompts, comes in. We pass the context in through our `prompt_template`, but there are numerous ways to write a prompt. Some prompts generate better results than others, and it takes some experimentation to figure out how best to talk to the model. Each language model responds differently to prompts.

Our prompt template below is inspired by a [2023 paper on program-aided language models](https://arxiv.org/pdf/2211.10435.pdf). The authors have provided their sample prompt template [here](https://github.com/reasoning-machines/pal/blob/main/pal/prompt/date_understanding_prompt.py).

The following links also provide some helpful guidance on prompt engineering: 
- [Prompt engineering with OpenAI](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
- [GitHub repo that compiles best practices to interact with ChatGPT](https://github.com/f/awesome-chatgpt-prompts)
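Separately from the notebook's own template in the next cell, one way to make prompt experiments easier is to wrap the template in a function. The sketch below is an assumption-laden variation, not the method from the paper: `build_prompt`, the numbered-context layout, the trailing `Answer:` cue, and the `max_context_chars` budget are all hypothetical choices. The character budget is only a crude stand-in for a token budget, which matters because GPT-2's context window is limited to 1024 tokens.

```python
def build_prompt(question, documents, max_context_chars=2000):
    """Assemble a context-then-question prompt.

    Documents are added in order and numbered; adding stops as soon as the
    next document would push the context past max_context_chars.
    """
    context_lines = []
    used = 0
    for number, doc in enumerate(documents, start=1):
        line = f"[{number}] {doc}"
        if used + len(line) > max_context_chars:
            break
        context_lines.append(line)
        used += len(line) + 1  # +1 accounts for the joining newline
    context = "\n".join(context_lines)
    return (
        f"Relevant context:\n{context}\n\n"
        f"The user's question: {question}\n\n"
        "Answer:"
    )

# Two titles taken from the query results shown earlier in this notebook
example_prompt = build_prompt(
    "What's the latest news on space development?",
    [
        "Orbital space tourism set for rebirth in 2021",
        "Martian Night Sky Pulses in Ultraviolet Light",
    ],
)
print(example_prompt)
```

Keeping the template in one function makes it straightforward to compare variants, for example with and without the `Answer:` cue, against the same retrieved documents.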


In [20]:
question = "What's the latest news on space development?"
context = " ".join([f"#{doc}" for doc in results["documents"][0]])
prompt_template = f"Relevant context: {context}\n\n The user's question: {question} \n\n"

lm_response = pipe(prompt_template)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [21]:
print(lm_response[0]["generated_text"])

Relevant context: #NASA drops "insensitive" nicknames for cosmic objects #Beck teams up with NASA and AI for 'Hyperspace' visual album experience #Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars #Orbital space tourism set for rebirth in 2021 #Australia's small yet crucial part in the mission to find life on Mars #Alien base on Mercury: ET hunters claim to find huge UFO #‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon #Martian Night Sky Pulses in Ultraviolet Light #SpaceX's Starship spacecraft saw 150 meters high #Astronomers Detect Electromagnetic Signal Caused by Unequal Neutron-Star Collision

 The user's question: What's the latest news on space development? 


This entry was posted on Monday, February 19th, 2013 at 06:33 AM and is filed under Space. There were no more comments within the listed time. It has been ten days since the post was posted.
