# chromadb Retriever

This notebook demonstrates how to query document contents using chromadb's sentence transformer embedding functions.

Docs for chromadb: https://docs.trychroma.com/.

chromadb has four core API commands, illustrated in the code snippet below (taken from https://www.trychroma.com/):

1) Create the client  
2) Create a collection  
3) Add docs to the collection  
4) Query the collection

```python
# python can also run in-memory with no server running: chromadb.PersistentClient()

import chromadb
client = chromadb.HttpClient()
collection = client.create_collection("sample_collection")

# Add docs to the collection. Can also update and delete. Row-based API coming soon!
collection.add(
    documents=["This is document1", "This is document2"], # we embed for you, or bring your own
    metadatas=[{"source": "notion"}, {"source": "google-docs"}], # filter on arbitrary metadata!
    ids=["doc1", "doc2"], # must be unique for each doc
)

results = collection.query(
    query_texts=["This is a query document"],
    n_results=2,
    # where={"metadata_field": "is_equal_to_this"}, # optional filter
    # where_document={"$contains":"search_string"}  # optional filter
)
```

In [None]:
# Import Hugging Face token from Colab secrets
from google.colab import userdata
userdata.get("HF_TOKEN")

In [3]:
# This will take a while...
!pip install -Uqq chromadb pypdf sentence_transformers

In [4]:
import chromadb
from chromadb.utils import embedding_functions
from pypdf import PdfReader
import unicodedata
from tqdm import tqdm

## Prepare document name and path

In [2]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [7]:
# Customise the path after "My Drive/" to match the location of this file
%cd /content/gdrive/My Drive/Colab Notebooks/chromadb

/content/gdrive/My Drive/Colab Notebooks/chromadb


In [8]:
# The output of this cell should match the "%cd" command cell above
!pwd

/content/gdrive/My Drive/Colab Notebooks/chromadb


In [9]:
# Set the filepath of the PDF document
pdf_path = "./Scott JC - The Art of Not Being Governed - An Anarchist History of Upland Southeast Asia.pdf"
reader = PdfReader(pdf_path)

In [10]:
# Choose a shorthand name for the article / book / paper
short_name = "sjc_art"

## Read document with PyPDF

In [14]:
doc_list = []
metadata_list = []
id_list = []

Docs for unicodedata library:
https://docs.python.org/3/library/unicodedata.html.

In [15]:
# Read document with PyPDF
for page_num, page in tqdm(enumerate(reader.pages)):
    # NFKD normalization turns ligatures and other compatibility characters from the PDF into plain characters
    new_str = unicodedata.normalize("NFKD", page.extract_text())
    doc_list.append(new_str)
    metadata_list.append({"reference": f"{short_name}_{page_num + 1:03}"})
    id_list.append(str(page_num))

465it [00:14, 33.10it/s]
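To see what the loop above produces without opening a PDF, here is a minimal sketch that applies the same normalization and naming scheme to plain strings. The helper name `build_records` and the two sample strings are hypothetical stand-ins, not part of the original notebook.

```python
import unicodedata


def build_records(page_texts, short_name):
    """Build parallel document, metadata, and id lists from raw page strings."""
    docs, metadatas, ids = [], [], []
    for page_num, text in enumerate(page_texts):
        # NFKD splits compatibility characters such as the "fi" ligature (U+FB01) into plain letters
        docs.append(unicodedata.normalize("NFKD", text))
        # 1-based, zero-padded page reference, e.g. "sjc_art_001"
        metadatas.append({"reference": f"{short_name}_{page_num + 1:03}"})
        # ids stay 0-based strings, as in the loop above
        ids.append(str(page_num))
    return docs, metadatas, ids


# Hypothetical stand-in for two extracted pages; the first contains the U+FB01 ligature
sample_pages = ["The \ufb01rst page", "The second page"]
docs, metadatas, ids = build_records(sample_pages, "sjc_art")
print(docs[0])
print(metadatas)
print(ids)
```

Note that the ids are 0-based while the references are 1-based, which is why id `'160'` is paired with `'sjc_art_161'` in the query output later in this notebook.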


## chromadb

### Select model

In [11]:
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="paraphrase-albert-small-v2"
)

### chromadb: initialise client

To use chromadb, we start by initialising a "client". The simplest way is to run `chroma_client = chromadb.Client()`, which runs in-memory, so the data is lost when the session ends.

We can also run `chromadb.PersistentClient(path="path/to/data")`, which saves to and loads from disk so that data persists between sessions. We will use this version here.

There is also `chroma_client = chromadb.HttpClient(host="localhost", port=8000)`, which connects to a Chroma server running as a separate backend process.

In [12]:
client = chromadb.PersistentClient(path="./docs_cache/")

### chromadb: create collection

Once a client has been initialised, we can create a collection.

As the docs explain: "Collections are where you'll store your embeddings, documents, and any additional metadata."

In [13]:
collection = client.get_or_create_collection(name="pdf_books", embedding_function=sentence_transformer_ef)

### chromadb: add documents to collection

In [16]:
print(f"Adding {len(doc_list)} to the collection.")

collection.add(
    documents=doc_list,
    metadatas=metadata_list,
    ids=id_list
)

print(f"There are {collection.count()} documents in the collection.")

Adding 465 to the collection.
There are 465 documents in the collection.


### chromadb: query collection

In [17]:
query = "How do nomadic societies avoid state control?"

In [18]:
# Adjust num_results (number of results) as desired
num_results = 3
fetched_results = collection.query(query_texts=[query], n_results=num_results)

The returned result is a dictionary whose populated fields are lists of lists, with one inner list per query text.

We fetch the distances and the documents, then select the matches for the first query with index `[0]` (since we only passed 1 query, the outer list is of length 1).

Finally, we iterate through the 3 results.
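To make the nesting concrete, here is a minimal sketch using a hand-written dictionary that mimics the shape of a query result for one query and two matches. The values are invented for illustration and are not real output, and the helper name `iter_matches` is hypothetical.

```python
def iter_matches(results, query_index=0):
    """Return an iterator of (id, distance, metadata, document) tuples for one query."""
    return zip(
        results["ids"][query_index],
        results["distances"][query_index],
        results["metadatas"][query_index],
        results["documents"][query_index],
    )


# Invented values that only mimic the structure of a query result
mock_results = {
    "ids": [["7", "2"]],
    "distances": [[0.25, 0.75]],
    "metadatas": [[{"reference": "demo_008"}, {"reference": "demo_003"}]],
    "embeddings": None,
    "documents": [["text of page 8", "text of page 3"]],
}

for doc_id, dist, meta, doc in iter_matches(mock_results):
    print(f"{meta['reference']} (id={doc_id}, distance={dist}): {doc}")
```

Passing a different `query_index` would select the matches for another query if several query texts had been submitted at once. In the output shown in this notebook the distances are in ascending order, so the first match is the closest one.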

In [19]:
fetched_results

{'ids': [['160', '205', '153']],
 'distances': [[131.33511352539062, 142.31483459472656, 146.12371826171875]],
 'metadatas': [[{'reference': 'sjc_art_161'},
   {'reference': 'sjc_art_206'},
   {'reference': 'sjc_art_154'}]],
 'embeddings': None,
 'documents': [['1\x180 KeePiNg tHe state at a dista NCe\nagainst state expansion that the populations seeking to evade incorporation \nhave been driven. Having, over time, adapted to a hilly environment and, as we shall see, developed a social structure and subsistence routines to avoid \nincorporation, they are now seen by their lowland neighbors as impoverished, \nbackward, tribal populations that lacked the talent for civilization. But, as \nWiens explains, “There is no doubt that the early predecessors of the present day ‘hill-tribes’ occupied lowland plains as well. . . . It was not until much later that there developed a strict differentiation of the Miao and Yao as hill-\ndwellers. This development was not so much a matter of preference

In [20]:
for dist, doc in zip(fetched_results["distances"][0], fetched_results["documents"][0]):
    print(f"DISTANCE: {dist}\n{doc}")
    print("=" * 25)

DISTANCE: 131.33511352539062
10 KeePiNg tHe state at a dista NCe
against state expansion that the populations seeking to evade incorporation 
have been driven. Having, over time, adapted to a hilly environment and, as we shall see, developed a social structure and subsistence routines to avoid 
incorporation, they are now seen by their lowland neighbors as impoverished, 
backward, tribal populations that lacked the talent for civilization. But, as 
Wiens explains, “There is no doubt that the early predecessors of the present day ‘hill-tribes’ occupied lowland plains as well. . . . It was not until much later that there developed a strict differentiation of the Miao and Yao as hill-
dwellers. This development was not so much a matter of preference as of ne-
cessity for those tribesmen wishing to escape domination or annihilation.”28
 An
y attempt to craft a historically deep and accurate narrative of mi -
gration for any particular people is fraught with difficulty, in part because 
th