# | NLP | LLM | Vector | Embeddings DB Search |

## Natural Language Processing (NLP), Large Language Models (LLM), and the Power of Vector Embeddings and Databases

![Learning](https://t3.ftcdn.net/jpg/06/14/01/52/360_F_614015247_EWZHvC6AAOsaIOepakhyJvMqUu5tpLfY.jpg)

# <b>1 <span style='color:#78D118'>|</span> Overview</b>

### Embeddings, Vector Databases, and Advanced Search

Converting text into embedding vectors is the first step of a semantic search pipeline. As the amount of text grows, it becomes useful to save these embedding vectors in a dedicated vector index or library, so that the embeddings do not have to be recomputed and retrieval is faster. We can then search for documents relevant to a query and pass them to a language model (LM) as additional context. Supplying this context is also referred to as giving the LM "state" or "memory". The LM then generates a response based on the additional context it receives.

In this notebook, we will implement the full workflow of text vectorization, vector search, and question answering. We use [FAISS](https://faiss.ai/) (a vector library), [ChromaDB](https://docs.trychroma.com/) (a vector database), and a Hugging Face model, but you can swap these out for your preferred tools or models.

<img src="https://storage.googleapis.com/kagglesdsdata/datasets/4232024/7296128/vector.png?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=databundle-worker-v2%40kaggle-161607.iam.gserviceaccount.com%2F20231228%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20231228T094528Z&X-Goog-Expires=345600&X-Goog-SignedHeaders=host&X-Goog-Signature=2358155e44397b47c5cd178ae645e0ad4a656e3f30b773ce9ce29dc596a7c59319465e6fa9f2c8019257a0be627bbff2e3452aa2cb9b8e117813fdf62b82d6864f2b6a31f00e4bdab75eafff8b634426b6d0781331179283408fa357eaf43ce17dfcafea30461d305fb647abdfbe4309ca0921b5bd816451e084e12df550ebad38f020e39b1ebc2df4821c02cc861d5b44a6eb581bfada85c074e6913d64554777cab3c035f2e946146fc743dadbeee62019fa5649219573616c341566ffc3a122b04070cb72152f4bc4810c4be4bd45f28442a7d1e1b8e20c279c511b506274a31208f7aa8b61b91af0dbaab42b3aa6c52dc5024dd98d1c2538d08d47f9d6ce" width=1000 target="_blank" > 

### Learning Objectives
1. Implement the workflow of reading text, converting text to embeddings, saving them to FAISS and ChromaDB 
2. Query for similar documents using FAISS and ChromaDB 
3. Apply a Hugging Face language model for question answering.

## Setup

### Libraries

In [1]:
pip install faiss-cpu==1.7.4 chromadb==0.3.21

In [9]:
cache_dir = "./cache"

In [3]:
import pandas as pd  # imported here because the display options below need it

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_seq_items', None)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.expand_frame_repr', True)

# <b>2 <span style='color:#78D118'>|</span> Reading data</b>

In this section, we are going to use the data on <a href="https://newscatcherapi.com/" target="_blank">news topics collected by the NewsCatcher team</a>, who collect and index news articles and release them to the open-source community. The dataset can be downloaded from <a href="https://www.kaggle.com/kotartemiy/topic-labeled-news-dataset" target="_blank">Kaggle</a>.


In [5]:
import pandas as pd

pdf = pd.read_csv("/kaggle/input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv", sep=";")
pdf["id"] = pdf.index
display(pdf.head(10))


Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-08/dbnl-acl080620.php,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel potential,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistible-scent-makes-locusts-swarm-study-finds/jy784jw,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, study finds",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607/artificial-intelligence-warning-machine-learning-algorithm-social-media-data,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know us better than we know ourselves,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could-have-sculpted-mars-valleys-study-2273648,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-meteor-shower-tonight-time-uk-see/,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how to see the huge bright FIREBALLS over UK again tonight,en,4
5,SCIENCE,https://interestingengineering.com/nasa-releases-in-depth-map-of-beirut-explosion-damage,interestingengineering.com,2020-08-08 11:05:45,NASA Releases In-Depth Map of Beirut Explosion Damage,en,5
6,SCIENCE,https://www.thequint.com/tech-and-auto/spacex-nasa-demo-2-rocket-launch-set-for-saturday-how-to-watch,thequint.com,2020-05-28 09:09:46,"SpaceX, NASA Demo-2 Rocket Launch Set for Saturday: How to Watch",en,6
7,SCIENCE,https://www.thespacereview.com/article/4003/1,thespacereview.com,2020-08-10 22:48:23,Orbital space tourism set for rebirth in 2021,en,7
8,SCIENCE,https://www.businessinsider.com/greenland-melting-ice-sheet-past-tipping-point-2020-8,businessinsider.com,2020-08-16 00:28:54,Greenland's melting ice sheet has 'passed the point of no return',en,8
9,SCIENCE,https://www.thehindubusinessline.com/news/science/nasa-invites-engineering-students-to-help-harvest-water-on-mars-moon/article32352915.ece,thehindubusinessline.com,2020-08-14 07:43:25,"NASA invites engineering students to help harvest water on Mars, Moon",en,9


# <b>3 <span style='color:#78D118'>|</span> Vector Library: FAISS</b>

Vector libraries are often sufficient for small, static data. Since a vector library is not a full-fledged database solution, it typically lacks CRUD (Create, Read, Update, Delete) support. Once the index has been built, adding, removing, or editing vectors often means rebuilding the index from scratch.

That said, vector libraries are easy, lightweight, and fast to use. Examples of vector libraries are [FAISS](https://faiss.ai/), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), [ANNOY](https://github.com/spotify/annoy), and [HNSW](https://arxiv.org/abs/1603.09320).

FAISS supports several similarity measures, including L2 (Euclidean) distance and inner product; cosine similarity is obtained by L2-normalizing the vectors and then using the inner product. You can read more about the implementation on the [GitHub](https://github.com/facebookresearch/faiss/wiki/Getting-started#searching) page or in the [blog post](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/). The FAISS team also published a [best practice guide here](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index).
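
To make the relationship between cosine similarity and the inner product concrete, here is a minimal sketch in plain Python. It does not use FAISS; the helper names (`l2_normalize`, `inner_product`, `cosine_similarity`) and the toy vectors are assumptions made for illustration only.

```python
import math

def l2_normalize(vec):
    """Return a copy of vec scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

# Toy vectors (hypothetical, not taken from the dataset)
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine_similarity(a, b)
ip_normalized = inner_product(l2_normalize(a), l2_normalize(b))

# The two quantities agree up to floating-point rounding
print(math.isclose(cos_ab, ip_normalized))
```

This is why an inner-product index can be used for cosine similarity, provided both the stored vectors and the query are normalized first.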

If you'd like to read up more on the comparisons between vector libraries and databases, [here is a good blog post](https://weaviate.io/blog/vector-library-vs-vector-database#feature-comparison---library-versus-database).

#### Feature Comparison - Library versus Database
The table below summarizes the differences between vector libraries and databases. This is by no means an exhaustive list of features, and not every library or database has the same features.

<table><thead><tr><th><strong>Feature</strong></th><th><strong>Vector Library</strong></th><th><strong>Vector Database</strong> (Weaviate as an example)</th></tr></thead><tbody><tr><td>Filtering (in combination with Vector Search)</td><td>No</td><td>Yes</td></tr><tr><td>Updatability (CRUD)</td><td>No (some do, e.g. hnswlib)</td><td>Yes</td></tr><tr><td>Incremental importing, concurrent reading while importing</td><td>No (some do, e.g. hnswlib)</td><td>Yes</td></tr><tr><td>Stores objects and vectors</td><td>No</td><td>Yes</td></tr><tr><td>Speed</td><td>Typically faster than full-blown database</td><td>Typically slower than pure library</td></tr><tr><td>Performance optimized for</td><td>In-memory similarity search</td><td>End2end-callstack, including: vector search, object retrieval from persistent storage, optional inverted index filtering, network requests, etc.</td></tr><tr><td>Durability, crash recovery</td><td>No</td><td>Yes</td></tr><tr><td>Persistence</td><td>Only at explicit snapshot</td><td>Immediate (after each insert, update or delete)</td></tr><tr><td>Sharding (i.e. distribute dataset among more than one machine, automatically merge/aggregate results at query time across machines)</td><td>No (there are some 3rd-party projects that wrap sharding around existing libraries)</td><td>Yes</td></tr><tr><td>Replication</td><td>No</td><td>Yes (from v1.17)</td></tr><tr><td>Automated backups (i.e. 
to and from cloud storage)</td><td>No</td><td>Yes</td></tr><tr><td>Deployment ecosystem (Docker, K8s, Helm, SaaS)</td><td>No (you have to build it yourself)</td><td>Yes</td></tr><tr><td>SDKs / language clients</td><td>No (although most have python bindings)</td><td>Yes (Python, Java, JS, Golang)</td></tr><tr><td>Execution</td><td>Embedded (can be turned into standalone service if you build a simple wrapper app around it)</td><td>Standalone service</td></tr><tr><td>Communication with app</td><td>Foreign Function Interface (FFI) / Language Bindings</td><td>Network calls (HTTP, gRPC, etc.)</td></tr><tr><td>Multi-tenancy</td><td>No</td><td>Yes</td></tr><tr><td>Arbitrary aggregations of stored objects</td><td>No</td><td>Yes</td></tr><tr><td>Module ecosystem (e.g. vectorization, QnA)</td><td>No (have to build it yourself)</td><td>Yes</td></tr><tr><td>Hybrid BM25+dense vector search</td><td>No</td><td>Yes (from v1.17)</td></tr></tbody></table>

The overall workflow of FAISS is captured in the diagram below. 

<img src="https://miro.medium.com/v2/resize:fit:1400/0*ouf0eyQskPeGWIGm" width=1000>

Source: [How to use FAISS to build your first similarity search by Asna Shafiq](https://medium.com/loopio-tech/how-to-use-faiss-to-build-your-first-similarity-search-bf0f708aa772).


In [6]:
from sentence_transformers import InputExample

pdf_subset = pdf.head(1000)

def example_create_fn(doc1: str) -> InputExample:
    """
    Helper function that wraps a single title string in a sentence_transformers InputExample
    """
    return InputExample(texts=[doc1])

faiss_train_examples = pdf_subset.apply(
    lambda x: example_create_fn(x["title"]), axis=1
).tolist()



In [12]:
faiss_train_examples[0]

<sentence_transformers.readers.InputExample.InputExample at 0x79bca3b54a30>

## Step 1: Vectorize text into embedding vectors

We will be using the [Sentence-Transformers](https://www.sbert.net/) library to load a language model that vectorizes our text into embeddings. The library hosts some of the most popular transformers on the [Hugging Face Model Hub](https://huggingface.co/sentence-transformers).

Here, we load the `all-MiniLM-L6-v2` model with `SentenceTransformer("all-MiniLM-L6-v2")` to generate embeddings.


In [10]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "all-MiniLM-L6-v2", 
    cache_folder=cache_dir
)  # Downloaded to cache_dir on first use, then loaded from the cache
faiss_title_embedding = model.encode(pdf_subset.title.values.tolist())
len(faiss_title_embedding), len(faiss_title_embedding[0])

(1000, 384)

## Step 2: Saving embedding vectors to FAISS index

Below, we create the FAISS index object based on our embedding vectors, normalize vectors, and add these vectors to the FAISS index. 


In [15]:
# Import necessary libraries
import numpy as np
import faiss

# Set 'id' column as the index for the DataFrame 'pdf_subset'
pdf_to_index = pdf_subset.set_index(["id"], drop=False)

# Extract the 'id' values from the DataFrame and convert to a flattened numpy array
id_index = np.array(pdf_to_index.id.values).flatten().astype("int64")  # FAISS expects 64-bit integer IDs

# Create a normalized copy of the title embeddings using the L2 normalization
content_encoded_normalized = faiss_title_embedding.copy()
faiss.normalize_L2(content_encoded_normalized)

# Initialize an IndexIDMap, which translates search results to IDs
# Reference: https://faiss.ai/cpp_api/file/IndexIDMap_8h.html#_CPPv4I0EN5faiss18IndexIDMapTemplateE
# The IndexFlatIP below builds an index with Inner Product (IP) metric

# Explanation: We are creating a FAISS index that is capable of performing similarity searches.
# The 'content_encoded_normalized' array represents the normalized title embeddings of our data.
# Each row in this array corresponds to the embedding of a document, and 'id_index' holds the IDs of these documents.

# The normalization step is crucial for ensuring that vector magnitudes do not affect similarity calculations.

# The index is constructed using the Inner Product (IP) metric, which measures the similarity between vectors.

# The IndexIDMap is used to associate each document's embedding with its corresponding 'id' for easy retrieval.

index_content = faiss.IndexIDMap(faiss.IndexFlatIP(len(faiss_title_embedding[0])))

# Add the normalized embeddings to the index, associating each embedding with its corresponding 'id'
index_content.add_with_ids(content_encoded_normalized, id_index)


## Step 3: Search for relevant documents

We define a search function below that first vectorizes our query text and then searches the index for the vectors most similar to it.


In [16]:
def search_content(query, pdf_to_index, k=3):
    query_vector = model.encode([query])
    faiss.normalize_L2(query_vector)

    # We set k to limit the number of vectors we want to return
    top_k = index_content.search(query_vector, k)
    ids = top_k[1][0].tolist()
    similarities = top_k[0][0].tolist()
    results = pdf_to_index.loc[ids]
    results["similarities"] = similarities
    return results

Now you can query for similar content. Notice that you did not have to configure any database networks beforehand or pass in any credentials. FAISS works locally with your code.
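
A flat inner-product index compares the query against every stored vector. The sketch below mimics that exhaustive search in plain Python so the ranking logic is visible. It is an illustration under stated assumptions, not how FAISS is implemented: `brute_force_search`, `toy_vectors`, and `toy_ids` are hypothetical names, and the vectors are already unit length.

```python
import heapq

def brute_force_search(query, vectors, ids, k=3):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), doc_id)
        for vec, doc_id in zip(vectors, ids)
    )
    # nlargest keeps the k highest inner products, best first
    top = heapq.nlargest(k, scored)
    return [(doc_id, score) for score, doc_id in top]

# Toy unit-length vectors and their document IDs (hypothetical)
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
toy_ids = [10, 11, 12, 13]

hits = brute_force_search([1.0, 0.0], toy_vectors, toy_ids, k=2)
print(hits)
```

The cost of this approach grows linearly with the number of stored vectors, which is fine for 1,000 titles but is the reason approximate indexes exist for larger collections.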


In [17]:
display(search_content("animal", pdf_to_index))

id,topic,link,domain,published_date,title,lang,id,similarities
176,TECHNOLOGY,https://www.pushsquare.com/news/2020/08/random_you_can_pick_up_and_pet_cats_in_assassins_creed_valhalla,pushsquare.com,2020-08-03 16:30:00,Random: You Can Pick Up and Pet Cats in Assassin's Creed Valhalla,en,176,0.391902
975,HEALTH,https://www.news-medical.net/news/20200813/Researchers-explore-the-social-behavior-of-animals-towards-emerging-infectious-diseases.aspx,news-medical.net,2020-08-13 05:18:00,Researchers explore social behavior of animals toward emerging infectious diseases,en,975,0.376784
99,TECHNOLOGY,https://www.gematsu.com/2020/08/ghostwire-tokyo-confirms-dog-petting,gematsu.com,2020-08-07 16:43:13,Ghostwire: Tokyo confirms dog petting,en,99,0.344058


So far, we have not done the last step: question answering with a language model. We will demonstrate this with Chroma, a vector database.

# <b>4 <span style='color:#78D118'>|</span> Vector Database: Chroma</b>

Chroma is an open-source embedding database. The company raised its [seed funding in April 2023](https://www.trychroma.com/blog/seed), and Chroma has quickly become popular for supporting LLM-based applications.


In [19]:
import chromadb
from chromadb.config import Settings

chroma_client = chromadb.Client(
    Settings(
        chroma_db_impl="duckdb+parquet",
        persist_directory=cache_dir,  # this is an optional argument. If you don't supply this, the data will be ephemeral
    )
)

### Chroma Concept: Collection

A Chroma `collection` is akin to an index that stores one set of your documents.

According to the [docs](https://docs.trychroma.com/getting-started): 
> Collections are where you will store your embeddings, documents, and additional metadata

The nice thing about ChromaDB is that if you don't supply a model to vectorize text into embeddings, it will automatically load a default embedding function, i.e. `SentenceTransformerEmbeddingFunction`. It can handle tokenization, embedding, and indexing automatically for you. If you would like to change the embedding model, read [here on how to do that](https://docs.trychroma.com/embeddings). TLDR: you can add an optional `model_name` argument. 

You can read [the documentation here](https://docs.trychroma.com/usage-guide#using-collections) on rules for collection names.


In [20]:
collection_name = "my_news"

# If the collection already exists, delete it before re-creating it
if collection_name in [c.name for c in chroma_client.list_collections()]:
    chroma_client.delete_collection(name=collection_name)

print(f"Creating collection: '{collection_name}'")
collection = chroma_client.create_collection(name=collection_name)

Creating collection: 'my_news'

## Step 1: Add data to collection

Since we are re-using the same data, we can skip the step of reading data. As mentioned above, Chroma can take care of text vectorization for us, so we can add text directly to the collection and Chroma will convert it into embeddings behind the scenes.


In [22]:
display(pdf_subset.head(10))

Unnamed: 0,topic,link,domain,published_date,title,lang,id
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-08/dbnl-acl080620.php,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel potential,en,0
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistible-scent-makes-locusts-swarm-study-finds/jy784jw,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, study finds",en,1
2,SCIENCE,https://www.express.co.uk/news/science/1322607/artificial-intelligence-warning-machine-learning-algorithm-social-media-data,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know us better than we know ourselves,en,2
3,SCIENCE,https://www.ndtv.com/world-news/glaciers-could-have-sculpted-mars-valleys-study-2273648,ndtv.com,2020-08-03 22:18:26,Glaciers Could Have Sculpted Mars Valleys: Study,en,3
4,SCIENCE,https://www.thesun.ie/tech/5742187/perseid-meteor-shower-tonight-time-uk-see/,thesun.ie,2020-08-12 19:54:36,Perseid meteor shower 2020: What time and how to see the huge bright FIREBALLS over UK again tonight,en,4
5,SCIENCE,https://interestingengineering.com/nasa-releases-in-depth-map-of-beirut-explosion-damage,interestingengineering.com,2020-08-08 11:05:45,NASA Releases In-Depth Map of Beirut Explosion Damage,en,5
6,SCIENCE,https://www.thequint.com/tech-and-auto/spacex-nasa-demo-2-rocket-launch-set-for-saturday-how-to-watch,thequint.com,2020-05-28 09:09:46,"SpaceX, NASA Demo-2 Rocket Launch Set for Saturday: How to Watch",en,6
7,SCIENCE,https://www.thespacereview.com/article/4003/1,thespacereview.com,2020-08-10 22:48:23,Orbital space tourism set for rebirth in 2021,en,7
8,SCIENCE,https://www.businessinsider.com/greenland-melting-ice-sheet-past-tipping-point-2020-8,businessinsider.com,2020-08-16 00:28:54,Greenland's melting ice sheet has 'passed the point of no return',en,8
9,SCIENCE,https://www.thehindubusinessline.com/news/science/nasa-invites-engineering-students-to-help-harvest-water-on-mars-moon/article32352915.ece,thehindubusinessline.com,2020-08-14 07:43:25,"NASA invites engineering students to help harvest water on Mars, Moon",en,9


Each document must have a unique `id` associated with it and it is up to you to check that there are no duplicate ids. 
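
One way to check for duplicate ids before adding documents is sketched below. The helper `find_duplicate_ids` and the sample list are hypothetical and independent of Chroma; the sketch only assumes that ids are hashable values such as strings.

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return a sorted list of ids that appear more than once."""
    counts = Counter(ids)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)

# Hypothetical id list in which "id3" was accidentally generated twice
candidate_ids = [f"id{x}" for x in range(5)] + ["id3"]

duplicates = find_duplicate_ids(candidate_ids)
print(duplicates)
```

An empty result means the list is safe to pass as the `ids` argument; otherwise the offending ids can be fixed before any data is written.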

Adding data to a collection takes some time to run, especially when there is a lot of data. In the cell below, we intentionally write only a subset of the data to the collection to speed things up.


In [23]:
# Add documents to a collection
collection.add(
    # Extract the first 100 titles and convert to a list
    documents=pdf_subset["title"][:100].tolist(),
    
    # Associate each document with its corresponding topic as metadata
    metadatas=[{"topic": topic} for topic in pdf_subset["topic"][:100].tolist()],
    
    # Generate unique IDs for each document in the collection
    ids=[f"id{x}" for x in range(100)],
)


## Step 2: Query for 10 relevant documents on "space"

We return the 10 most relevant documents, i.e. the 10 nearest neighbors of the query embedding. You can change the number of results with `n_results`.


In [24]:
import json

results = collection.query(query_texts=["space"], n_results=10)

print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id72",
            "id7",
            "id30",
            "id26",
            "id23",
            "id76",
            "id69",
            "id40",
            "id47",
            "id75"
        ]
    ],
    "embeddings": null,
    "documents": [
        [
            "Beck teams up with NASA and AI for 'Hyperspace' visual album experience",
            "Orbital space tourism set for rebirth in 2021",
            "NASA drops \"insensitive\" nicknames for cosmic objects",
            "\u2018It came alive:\u2019 NASA astronauts describe experiencing splashdown in SpaceX Dragon",
            "Hubble Uses Moon As \u201cMirror\u201d to Study Earth\u2019s Atmosphere \u2013 Proxy in Search of Potentially Habitable Planets Around Other Stars",
            "Australia's small yet crucial part in the mission to find life on Mars",
            "NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico",
            "SpaceX's Starship spacecraft saw 150 mete

## Bonus: Add filter statement

In addition to conducting relevancy search, we can also add filter statements. Refer to the [documentation](https://docs.trychroma.com/usage-guide#using-where-filters) for more information.
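
Conceptually, a filtered query keeps only the records whose metadata satisfies the filter and ranks what remains. The plain-Python sketch below illustrates that idea with an equality filter. It is not Chroma's implementation: `filtered_top_k`, the toy records, and the toy scores are assumptions, and the sketch ranks by a similarity score where higher is better.

```python
def filtered_top_k(records, scores, where, k):
    """Keep records whose metadata matches every key in `where`, then return the k highest-scoring ids."""
    matches = [
        (score, rec["id"])
        for rec, score in zip(records, scores)
        if all(rec["metadata"].get(key) == value for key, value in where.items())
    ]
    matches.sort(reverse=True)  # highest score first
    return [doc_id for _, doc_id in matches[:k]]

# Hypothetical records with a "topic" metadata field and made-up similarity scores
toy_records = [
    {"id": "a", "metadata": {"topic": "SCIENCE"}},
    {"id": "b", "metadata": {"topic": "TECHNOLOGY"}},
    {"id": "c", "metadata": {"topic": "SCIENCE"}},
]
toy_scores = [0.2, 0.9, 0.7]

filtered = filtered_top_k(toy_records, toy_scores, {"topic": "SCIENCE"}, k=2)
print(filtered)
```

Record `b` has the best score but is excluded by the filter, so a filtered query can return different neighbors than an unfiltered one.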


In [26]:
results = collection.query(query_texts=["space"], where={"topic": "SCIENCE"}, n_results=10)

print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id7",
            "id30",
            "id26",
            "id23",
            "id76",
            "id69",
            "id40",
            "id47",
            "id75",
            "id52"
        ]
    ],
    "embeddings": null,
    "documents": [
        [
            "Orbital space tourism set for rebirth in 2021",
            "NASA drops \"insensitive\" nicknames for cosmic objects",
            "\u2018It came alive:\u2019 NASA astronauts describe experiencing splashdown in SpaceX Dragon",
            "Hubble Uses Moon As \u201cMirror\u201d to Study Earth\u2019s Atmosphere \u2013 Proxy in Search of Potentially Habitable Planets Around Other Stars",
            "Australia's small yet crucial part in the mission to find life on Mars",
            "NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico",
            "SpaceX's Starship spacecraft saw 150 meters high",
            "NASA\u2019s InSight lander shows what\u2019s beneath Mars\u2019 

## Bonus: Update data in a collection

Unlike a vector library, a vector database supports changes to the data, so we can update or delete records in a Chroma collection.


In [27]:
collection.delete(ids=["id0"])

[UUID('7f32105a-861b-4006-83c8-7acbee29303b')]

The record with id `id0` is no longer present.

In [28]:
collection.get(
    ids=["id0"],
)

{'ids': [], 'embeddings': None, 'documents': [], 'metadatas': []}

We can also update a specific data point.

In [29]:
collection.get(
    ids=["id2"],
)

{'ids': ['id2'],
 'embeddings': None,
 'metadatas': [{'topic': 'SCIENCE'}]}

In [30]:
collection.update(
    ids=["id2"],
    metadatas=[{"topic": "TECHNOLOGY"}],
)

We can confirm the update by fetching the record again.

In [31]:
collection.get(
    ids=["id2"],
)

{'ids': ['id2'],
 'embeddings': None,
 'metadatas': [{'topic': 'TECHNOLOGY'}]}

# <b>5 <span style='color:#78D118'>|</span> Prompt engineering for question answering</b>

Now that we have identified documents about space from the news dataset, we can pass these documents as additional context for a language model to generate a response based on them.

We first need to pick a `text-generation` model. Below, we use a Hugging Face model. You can also use OpenAI, but you will need an OpenAI API key and will [pay based on the number of tokens](https://openai.com/pricing).


In [33]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
lm_model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_dir)

pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    device_map="auto",
)

Here is where prompt engineering, the practice of designing prompts, comes in. We pass the context in through our `prompt_template`, but there are numerous ways to write a prompt. Some prompts generate better results than others, and it takes experimentation to figure out how best to talk to a model, since each language model responds to prompts differently.
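
As one example of an alternative layout to experiment with, the sketch below formats the retrieved documents as a numbered list and ends with an explicit answer cue. The function `build_prompt` and this particular layout are assumptions for illustration, not a recommended or verified best practice; the two titles are taken from the search results in this notebook.

```python
def build_prompt(documents, question):
    """Format retrieved documents as a numbered context block followed by the question."""
    context_lines = [f"{i}. {doc}" for i, doc in enumerate(documents, start=1)]
    context = "\n".join(context_lines)
    return f"Relevant context:\n{context}\n\nThe user's question: {question}\nAnswer:"

toy_prompt = build_prompt(
    [
        "Orbital space tourism set for rebirth in 2021",
        "NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico",
    ],
    "What's the latest news on space development?",
)
print(toy_prompt)
```

Keeping prompt construction in a function makes it easy to compare several layouts against the same question and retrieved documents.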

Our prompt template below is inspired by a [2023 paper on program-aided language models](https://arxiv.org/pdf/2211.10435.pdf). The authors have provided their sample prompt template [here](https://github.com/reasoning-machines/pal/blob/main/pal/prompt/date_understanding_prompt.py).

The following links also provide some helpful guidance on prompt engineering: 
- [Prompt engineering with OpenAI](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)
- [GitHub repo that compiles best practices to interact with ChatGPT](https://github.com/f/awesome-chatgpt-prompts)


In [51]:
question = "What's the latest news on space development?"
context = " ".join([f"#{doc}" for doc in results["documents"][0]])  # prefix each retrieved title with '#'

prompt_template = f"Relevant context: {context}\n\n The user's question: {question}"
print("Results:")
print("\n")
display(results)

print("\n")
print("Results of documents 0:")
print("\n")
display(results["documents"][0])

print("\n")
print("Context:")
print("\n")

display(context)

print("\n")
print("Question:")
print("\n")
display(question)

print("\n")
print("Prompt Template:")
print("\n")
display(prompt_template)

Results:




{'ids': [['id7',
   'id30',
   'id26',
   'id23',
   'id76',
   'id69',
   'id40',
   'id47',
   'id75',
   'id52']],
 'embeddings': None,
 'documents': [['Orbital space tourism set for rebirth in 2021',
   'NASA drops "insensitive" nicknames for cosmic objects',
   '‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon',
   'Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars',
   "Australia's small yet crucial part in the mission to find life on Mars",
   'NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico',
   "SpaceX's Starship spacecraft saw 150 meters high",
   'NASA’s InSight lander shows what’s beneath Mars’ surface',
   'Alien base on Mercury: ET hunters claim to find huge UFO',
   'SpaceX Crew-1 mission with NASA, first fully operational crewed mission to space to launch in October']],
 'metadatas': [[{'topic': 'SCIENCE'},
   {'topic': 'SCIENCE'},
   {'topic': '



Results of documents 0:




['Orbital space tourism set for rebirth in 2021',
 'NASA drops "insensitive" nicknames for cosmic objects',
 '‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon',
 'Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars',
 "Australia's small yet crucial part in the mission to find life on Mars",
 'NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico',
 "SpaceX's Starship spacecraft saw 150 meters high",
 'NASA’s InSight lander shows what’s beneath Mars’ surface',
 'Alien base on Mercury: ET hunters claim to find huge UFO',
 'SpaceX Crew-1 mission with NASA, first fully operational crewed mission to space to launch in October']



Context:




'#Orbital space tourism set for rebirth in 2021 #NASA drops "insensitive" nicknames for cosmic objects #‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon #Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars #Australia\'s small yet crucial part in the mission to find life on Mars #NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico #SpaceX\'s Starship spacecraft saw 150 meters high #NASA’s InSight lander shows what’s beneath Mars’ surface #Alien base on Mercury: ET hunters claim to find huge UFO #SpaceX Crew-1 mission with NASA, first fully operational crewed mission to space to launch in October'



Question:




"What's the latest news on space development?"



Prompt Template:




'Relevant context: #Orbital space tourism set for rebirth in 2021 #NASA drops "insensitive" nicknames for cosmic objects #‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon #Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars #Australia\'s small yet crucial part in the mission to find life on Mars #NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico #SpaceX\'s Starship spacecraft saw 150 meters high #NASA’s InSight lander shows what’s beneath Mars’ surface #Alien base on Mercury: ET hunters claim to find huge UFO #SpaceX Crew-1 mission with NASA, first fully operational crewed mission to space to launch in October\n\n The user\'s question: What\'s the latest news on space development?'

In [53]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Relevant context: #Orbital space tourism set for rebirth in 2021 #NASA drops "insensitive" nicknames for cosmic objects #‘It came alive:’ NASA astronauts describe experiencing splashdown in SpaceX Dragon #Hubble Uses Moon As “Mirror” to Study Earth’s Atmosphere – Proxy in Search of Potentially Habitable Planets Around Other Stars #Australia's small yet crucial part in the mission to find life on Mars #NASA Astronauts in SpaceX Capsule Splashdown in Gulf Of Mexico #SpaceX's Starship spacecraft saw 150 meters high #NASA’s InSight lander shows what’s beneath Mars’ surface #Alien base on Mercury: ET hunters claim to find huge UFO #SpaceX Crew-1 mission with NASA, first fully operational crewed mission to space to launch in October

 The user's question: What's the latest news on space development? As such, we've listed a timeline so you may make it part of your experience with the space shuttle mission. We take your question seriously. If you don't consider it a threat to the future of the
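
The generated text above starts with the prompt itself, followed by the model's continuation. A small post-processing step can separate the two. The sketch below is plain Python with a hypothetical helper `strip_prompt` and made-up strings; it assumes the output begins with the exact prompt text.

```python
def strip_prompt(generated_text, prompt):
    """Return only the continuation when the output begins with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Fall back to the full text if the prompt is not a prefix
    return generated_text

# Hypothetical prompt and model output
toy_prompt_text = "The user's question: What's new?"
toy_output = toy_prompt_text + " As such, we've listed a timeline."

answer_only = strip_prompt(toy_output, toy_prompt_text)
print(answer_only)
```

The `transformers` text-generation pipeline also accepts a `return_full_text=False` argument that should have a similar effect; check the documentation for your installed version before relying on it.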

You have just completed the implementation of your first text vectorization, search, and question answering workflow (which requires prompt engineering)!
