<a href="https://colab.research.google.com/github/essteer/data-science/blob/main/src/nlp/chromadb_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# chromadb retriever and RAG

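This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline built with chromadb: the pages of a PDF are embedded and stored in a chromadb collection, the pages closest to a query are retrieved, and the retrieved text is passed as context to an OpenAI model.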

Docs for chromadb: https://docs.trychroma.com/.

chromadb has four core API commands, as illustrated in the code snippet below taken from https://www.trychroma.com/:

1) Create the client  
2) Create a collection  
3) Add docs to the collection  
4) Query the collection

```python
# python can also run in-memory with no server running: chromadb.PersistentClient()

import chromadb
client = chromadb.HttpClient()
collection = client.create_collection("sample_collection")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=["This is document1", "This is document2"], # we embed for you, or bring your own
    metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on arbitrary metadata!
    ids=["doc1", "doc2"], # must be unique for each doc
)

results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)
```

In [None]:
# Import Hugging Face token from Colab secrets
from google.colab import userdata
userdata.get("HF_TOKEN")

In [None]:
!pip install chromadb pypdf sentence_transformers

In [None]:
import chromadb
from chromadb.utils import embedding_functions
from pypdf import PdfReader
import unicodedata
from tqdm import tqdm

## Prepare document path access

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
# Customise the path after "My Drive/" to match the location of this file
%cd /content/gdrive/My Drive/Colab Notebooks/chromadb

/content/gdrive/My Drive/Colab Notebooks/chromadb


In [None]:
# The output of this cell should match the "%cd" command cell above
!pwd

/content/gdrive/My Drive/Colab Notebooks/chromadb


In [None]:
# Set the filepath of the PDF document
pdf_path = "./Scott JC - The Art of Not Being Governed - An Anarchist History of Upland Southeast Asia.pdf"
reader = PdfReader(pdf_path)

In [None]:
# Choose a shorthand name for the article / book / paper
short_name = "sjc_art"

### Read document with PyPDF

In [None]:
doc_list = []
metadata_list = []
id_list = []

Docs for unicodedata library:
https://docs.python.org/3/library/unicodedata.html.

In [None]:
for pageno, page in tqdm(enumerate(reader.pages)):
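    # NFKD normalisation replaces compatibility characters (e.g. ligatures) with plain equivalents
    new_str = unicodedata.normalize("NFKD", page.extract_text())
    doc_list.append(new_str)
    # Use short_name so that each reference identifies the source document and page
    metadata_list.append({"reference": f"{short_name}_{pageno + 1}"})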
    id_list.append(str(pageno))

465it [00:14, 31.12it/s]
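
To make the effect of the normalisation step concrete, here is a small sketch that uses only the standard library. The sample string is made up for illustration and is not taken from the PDF. It shows how NFKD replaces compatibility characters such as ligatures and non-breaking spaces with plain equivalents, and splits an accented letter into a base letter plus a combining mark.

```python
import unicodedata

# Hypothetical sample containing characters that PDF extraction often produces:
# the "fi" ligature (U+FB01), a non-breaking space (U+00A0) and a precomposed e-acute (U+00E9).
sample = "\ufb01eld\u00a0caf\u00e9"

nfkd = unicodedata.normalize("NFKD", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# NFKD decomposes: the ligature becomes "f" + "i", the non-breaking space becomes a
# regular space, and e-acute becomes "e" followed by a combining acute accent (U+0301).
print([hex(ord(ch)) for ch in nfkd])

# NFKC applies the same compatibility mappings, then recomposes "e" + accent.
print(len(sample), len(nfkd), len(nfkc))
```

If the combining marks left behind by NFKD are unwanted, NFKC is one alternative; which form works best for a given embedding model is a judgment call.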


## chromadb Retriever

The content of this section replicates that from the `chromadb_retriever.ipynb` notebook.

### Select model

In [None]:
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="paraphrase-albert-small-v2")

### chromadb: initialise client

To use chromadb, we start by initialising a "client". The simplest way to do this is to run `chroma_client = chromadb.Client()` - this version runs in-memory.

We can also run `chromadb.PersistentClient(path="path/to/data")` to permit saving and loading to disk, so that data persists between sessions - we will use this version here.

There is also `chromadb.HttpClient(host="localhost", port=8000)`, which connects to a chromadb server running as a separate backend process.

In [None]:
client = chromadb.PersistentClient(path="./docs_cache/")

### chromadb: create collection

Once a client has been initialised, we can create a collection.

As the docs explain: "Collections are where you'll store your embeddings, documents, and any additional metadata."

In [None]:
collection = client.get_or_create_collection(name="pdf_books", embedding_function=sentence_transformer_ef)

### chromadb: add documents to collection

In [None]:
print(f"Adding {len(doc_list)} to the collection.")
collection.add(documents=doc_list,
               metadatas=metadata_list,
               ids=id_list
)

print(f"There are {collection.count()} documents in the collection.")

Adding 278 to the collection.
There are 278 documents in the collection.
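
For longer documents it may be preferable to add pages in smaller batches, for example to report progress between calls. The sketch below is one way to slice the three parallel lists consistently. The helper name `iter_batches`, the default `batch_size` and the demo data are all assumptions made for illustration, not part of chromadb.

```python
from typing import Iterator, Sequence


def iter_batches(
    docs: Sequence[str],
    metadatas: Sequence[dict],
    ids: Sequence[str],
    batch_size: int = 100,  # arbitrary illustrative default
) -> Iterator[tuple[list[str], list[dict], list[str]]]:
    """Yield aligned (documents, metadatas, ids) slices of at most batch_size items."""
    if not (len(docs) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas and ids must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(docs), batch_size):
        end = start + batch_size
        yield list(docs[start:end]), list(metadatas[start:end]), list(ids[start:end])


# Stand-in data shaped like doc_list, metadata_list and id_list.
demo_docs = [f"page text {i}" for i in range(5)]
demo_meta = [{"reference": f"demo_{i + 1}"} for i in range(5)]
demo_ids = [str(i) for i in range(5)]

batches = list(iter_batches(demo_docs, demo_meta, demo_ids, batch_size=2))
print([len(batch[0]) for batch in batches])

# In the notebook, each batch could then be passed on, e.g.:
# for docs, metas, ids in iter_batches(doc_list, metadata_list, id_list):
#     collection.add(documents=docs, metadatas=metas, ids=ids)
```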


### chromadb: query collection

In [None]:
query = "How does the Cossack experience relate to nomadic states?"

In [None]:
# Adjust num_results (number of results) as desired
num_results = 3
fetched_results = collection.query(query_texts=[query], n_results=num_results)

The returned result is a dictionary whose values are nested lists: the outer list has one entry per query, and each inner list holds the matches for that query.

We fetch the distances and the documents, and select the matches for the first query (index 0). Since we only sent one query, the outer list has length 1.

Then we iterate through the 3 results.
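
The nested-list layout can be easier to work with once it is flattened into one record per match. The following is a minimal sketch under stated assumptions: `mock_results` is a hand-written stand-in with the same shape as the query result, using the ids, distances and references from this notebook's output and shortened placeholder texts, and `flatten_results` is a hypothetical helper, not a chromadb function.

```python
# Hand-written stand-in for the dictionary returned by collection.query.
mock_results = {
    "ids": [["280", "378", "163"]],
    "distances": [[125.52056884765625, 150.05520629882812, 162.97003173828125]],
    "metadatas": [[
        {"reference": "sjc_art_281"},
        {"reference": "sjc_art_379"},
        {"reference": "sjc_art_164"},
    ]],
    "embeddings": None,
    "documents": [["page text A", "page text B", "page text C"]],  # placeholders
}


def flatten_results(results: dict, query_index: int = 0) -> list[dict]:
    """Turn the parallel lists for one query into a list of per-match records."""
    return [
        {"id": id_, "distance": dist, "reference": meta["reference"], "document": doc}
        for id_, dist, meta, doc in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


records = flatten_results(mock_results)
for record in records:
    print(record["id"], record["reference"], round(record["distance"], 2))
```

With several queries in `query_texts`, the same helper can be called once per `query_index`.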

In [None]:
fetched_results

{'ids': [['280', '378', '163']],
 'distances': [[125.52056884765625, 150.05520629882812, 162.97003173828125]],
 'metadatas': [[{'reference': 'sjc_art_281'},
   {'reference': 'sjc_art_379'},
   {'reference': 'sjc_art_164'}]],
 'embeddings': None,
 'documents': [['\x18\x180 etHNoge Nesis\nout of thin air, so far as origins are concerned, is particularly instructive for \nunderstanding ethnogenesis in Southeast Asia. The people who became the \nCossacks were runaway serfs and fugitives from all over European Russia. \nMost of them fled in the sixteenth century to the Don River steppelands “to escape or avoid the social and political ills of Muscovite Russia.”50 They had nothing in common but servitude and flight. On the vast Russian hinterland, \nthey were geographically fragmented into as many as twenty-two Cossack \n“hosts” all the way from Siberia and the Amur River to the Don River basin and the Azov Sea.\n They became a “people” at the frontier for reasons having largely to \ndo with

In [None]:
for dist, doc in zip(fetched_results['distances'][0], fetched_results['documents'][0]):
    print(f"DISTANCE: {dist}\n{doc}")
    print('=' * 25)

DISTANCE: 125.52056884765625
0 etHNoge Nesis
out of thin air, so far as origins are concerned, is particularly instructive for 
understanding ethnogenesis in Southeast Asia. The people who became the 
Cossacks were runaway serfs and fugitives from all over European Russia. 
Most of them fled in the sixteenth century to the Don River steppelands “to escape or avoid the social and political ills of Muscovite Russia.”50 They had nothing in common but servitude and flight. On the vast Russian hinterland, 
they were geographically fragmented into as many as twenty-two Cossack 
“hosts” all the way from Siberia and the Amur River to the Don River basin and the Azov Sea.
 They became a “people” at the frontier for reasons having largely to 
do with their new ecological setting and subsistence routines. Depending on their location, they settled among the T atars, Circassians (whose dress they 
adopted), and Kalmyks, whose horseback habits and settlement patterns they 
copied. The abundant lan
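
The distances are easier to interpret with a toy example. The sketch below assumes the collection uses squared Euclidean (L2) distance, which is our understanding of chromadb's default when no other metric is configured; under that assumption a smaller value means a closer match. The vectors and document names are invented for illustration.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Invented 3-dimensional "embeddings"; real sentence embeddings have many more dimensions.
query_vec = [1.0, 0.0, 2.0]
doc_vecs = {
    "doc_a": [1.0, 1.0, 2.0],
    "doc_b": [4.0, 0.0, 6.0],
    "doc_c": [0.0, 0.0, 0.0],
}

# Rank documents from closest (smallest distance) to furthest.
ranked = sorted((squared_l2(query_vec, vec), name) for name, vec in doc_vecs.items())
for dist, name in ranked:
    print(f"{name}: {dist}")
```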

## RAG using OpenAI

In [None]:
!pip install openai

In [None]:
import openai
import os

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

In [None]:
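from openai import OpenAI
# Note: this rebinds `client`, which until now referred to the chromadb client.
# The `collection` object created earlier remains usable.
client = OpenAI()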

In [None]:
context = " ".join(fetched_results['documents'][0])
print(context)

0 etHNoge Nesis
out of thin air, so far as origins are concerned, is particularly instructive for 
understanding ethnogenesis in Southeast Asia. The people who became the 
Cossacks were runaway serfs and fugitives from all over European Russia. 
Most of them fled in the sixteenth century to the Don River steppelands “to escape or avoid the social and political ills of Muscovite Russia.”50 They had nothing in common but servitude and flight. On the vast Russian hinterland, 
they were geographically fragmented into as many as twenty-two Cossack 
“hosts” all the way from Siberia and the Amur River to the Don River basin and the Azov Sea.
 They became a “people” at the frontier for reasons having largely to 
do with their new ecological setting and subsistence routines. Depending on their location, they settled among the T atars, Circassians (whose dress they 
adopted), and Kalmyks, whose horseback habits and settlement patterns they 
copied. The abundant land available for both pasture 

In [None]:
context = " ".join(fetched_results["documents"][0])

question = "How was Cossack society organised?"

# Make a request to the OpenAI completions API
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # a completions-style model; substitute another if needed
    prompt=context + "\n\n" + question,
    max_tokens=150,  # upper bound on the length of the generated answer; adjust as needed
)


In [None]:
response

In [None]:
response.choices[0].text
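
The prompt above simply concatenates the context and the question. A slightly more explicit layout can make it clearer to the model which part is context and which is the question. The following is a minimal sketch, not a recommended template: the helper name `build_prompt`, the instruction wording and the `max_chars` budget are all assumptions chosen for illustration.

```python
def build_prompt(chunks: list[str], question: str, max_chars: int = 6000) -> str:
    """Assemble a prompt from retrieved chunks, trimmed to a character budget."""
    # Join the retrieved chunks and cut the context off at max_chars characters.
    context = "\n\n".join(chunk.strip() for chunk in chunks)[:max_chars]
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for fetched_results["documents"][0].
demo_chunks = ["  First retrieved page.  ", "Second retrieved page."]
prompt = build_prompt(demo_chunks, "How was Cossack society organised?")
print(prompt)
```

In the notebook, the resulting string could be passed as the `prompt` argument in place of `context + "\n\n" + question`.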