## Start ChromaDB Container

* Clone the ChromaDB repository:
  ```sh
  git clone git@github.com:chroma-core/chroma.git
  ```

* Edit the `docker-compose.yml` file: add `ALLOW_RESET=TRUE` under `environment`, and change the port from `8000` to `8002` so that it does not conflict with `airbyte`:
  ```yaml
      ...
      command: uvicorn chromadb.app:app --reload --workers 1 --host 0.0.0.0 --port 8002 --log-config log_config.yml
      environment:
        - IS_PERSISTENT=TRUE
        - ALLOW_RESET=TRUE
      ports:
        - 8002:8002
      ...
  ```

* Run `docker-compose up -d --build` to start ChromaDB
* To shut it down, run `docker-compose down`
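
Before running the notebook cells, it can help to confirm that something is listening on the remapped port. The sketch below uses only the standard library. The helper name `is_port_open` and the host `localhost` are assumptions for illustration, and a successful TCP connection only shows that the port is bound, not that ChromaDB itself is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    # 8002 is the port configured in docker-compose.yml above.
    print("ChromaDB port reachable:", is_port_open("localhost", 8002))
```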

## Load all files under ../data into ChromaDB

In [1]:
from tqdm import tqdm
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import DirectoryLoader

In [2]:
# create the chroma client
import chromadb
import uuid
from chromadb.config import Settings

client = chromadb.HttpClient(port=8002, settings=Settings(allow_reset=True))

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [24]:
client.list_collections()
# To delete a collection:
#client.delete_collection("public_records")

[Collection(name=pypdf), Collection(name=test), Collection(name=default)]

In [23]:
collection_name = "default"
try:
    collection = client.get_collection(collection_name)
    print(f"Retrieved collection {collection_name}")
except Exception:  # get_collection raises when the collection does not exist
    collection = client.create_collection(collection_name)
    print(f"Created collection {collection_name}")

Retrieved collection default


In [22]:
# Collect the unique file names stored in the collection
fname_set = set()

# Get the documents
offset = 0
while True:
    result = collection.get(limit=1000, offset=offset, include=["metadatas"])
    result_size = len(result["metadatas"])
    if result_size == 0:
        break
    offset += result_size
    for metadata in result["metadatas"]:
        if "file_name" not in metadata:
            # Report entries without a file name and skip them to avoid a KeyError
            print(metadata)
            continue
        fname_set.add(metadata["file_name"])

fname_set

{'2021_stm_final_warrant_to_print_10.14.21_date_insterted.pdf',
 'FY2024_Brown_Book.pdf'}
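
The paging loop above can be wrapped in a generator so the same pattern is reusable. This is a minimal sketch: `iter_metadatas` is a hypothetical helper, not part of chromadb, and it assumes only that the collection's `get` accepts `limit`, `offset`, and `include` as in the cell above. `FakeCollection` is a stand-in so the example runs without a server.

```python
from typing import Any, Dict, Iterator, List


def iter_metadatas(collection: Any, page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield every metadata dict in the collection, one page at a time."""
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset, include=["metadatas"])
        metadatas = page["metadatas"]
        if not metadatas:
            return  # an empty page means everything has been read
        yield from metadatas
        offset += len(metadatas)


class FakeCollection:
    """In-memory stand-in that mimics only the paging behaviour of `get`."""

    def __init__(self, metadatas: List[Dict[str, Any]]):
        self._metadatas = metadatas

    def get(self, limit: int, offset: int, include: List[str]) -> Dict[str, Any]:
        return {"metadatas": self._metadatas[offset:offset + limit]}


fake = FakeCollection(
    [{"file_name": "a.pdf"}, {"file_name": "b.pdf"}, {"file_name": "a.pdf"}, {}]
)
file_names = {m["file_name"] for m in iter_metadatas(fake, page_size=2) if "file_name" in m}
print(sorted(file_names))  # ['a.pdf', 'b.pdf']
```

With the real client, `iter_metadatas(collection)` would take the place of the `while True` loop.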

In [13]:
# load every PDF in the data directory
loader = DirectoryLoader('/home/andrei/build/data', glob='*.pdf', show_progress=True, use_multithreading=True)
docs = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [02:08<00:00,  7.11s/it]
Created a chunk of size 1079, which is longer than the specified 1000
Created a chunk of size 1205, which is longer than the specified 1000
Created a chunk of size 2448, which is longer than the specified 1000
Created a chunk of size 1251, which is longer than the specified 1000
Created a chunk of size 1908, which is longer than the specified 1000
Created a chunk of size 1818, which is longer than the specified 1000
Created a chunk of size 1446, which is longer than the specified 1000
Created a chunk of size 1259, which is longer than the specified 1000
Created a chunk of size 1208, which is longer than the specified 1000
Created a chunk of size 1087, which is longer than the specified 1000
Created a chunk of size 1243, which is longer than the specified 1000
Created a chunk of size 1249, which is l
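
The "longer than the specified 1000" warnings appear because `CharacterTextSplitter` splits on a single separator (a blank line, `"\n\n"`, by default) and does not break up a piece that is itself longer than `chunk_size`. The sketch below is a simplified illustration of that behaviour, not LangChain's implementation; `split_on_separator` is a hypothetical name and chunk overlap is left out.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is kept whole, which is why
    oversized chunks can appear. Overlap is omitted for brevity.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short one\n\nshort two\n\n" + "x" * 50 + "\n\nshort three"
for chunk in split_on_separator(text, chunk_size=25):
    note = " (longer than chunk_size)" if len(chunk) > 25 else ""
    print(f"{len(chunk)}{note}")
```

If a hard size limit matters, LangChain's `RecursiveCharacterTextSplitter`, which falls back to progressively finer separators, may be worth trying instead.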

In [15]:
# client.reset()  # resetting the database deletes all collections

collection_name = "default"
try:
    collection = client.get_collection(collection_name)
    print(f"Retrieved collection {collection_name}")
except Exception:  # get_collection raises when the collection does not exist
    collection = client.create_collection(collection_name)
    print(f"Created collection {collection_name}")

for doc in tqdm(split_docs):
    collection.add(
        ids=[str(uuid.uuid1())], 
        documents=doc.page_content,
        metadatas=doc.metadata
    )

Created collection public_records


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3012/3012 [02:26<00:00, 20.60it/s]
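
The loop above issues one `collection.add` call per chunk. Since `add` also accepts lists for `ids`, `documents`, and `metadatas`, the chunks can be sent in batches to cut down on HTTP round trips. The helpers `batched` and `add_in_batches` below are hypothetical names, and the default batch size of 100 is an arbitrary assumption rather than a tuned value.

```python
import uuid
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def add_in_batches(collection, docs, batch_size: int = 100) -> int:
    """Add LangChain-style documents to a collection in batches; return the count added."""
    added = 0
    for batch in batched(docs, batch_size):
        collection.add(
            ids=[str(uuid.uuid1()) for _ in batch],
            documents=[doc.page_content for doc in batch],
            metadatas=[doc.metadata for doc in batch],
        )
        added += len(batch)
    return added
```

In the notebook this would be called as `add_in_batches(collection, split_docs)`.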


In [5]:
# tell LangChain to use our client and collection name
db4 = Chroma(client=client, collection_name=collection_name, embedding_function=embedding_function)
query = "What is the Police Dept budget this year?"
docs = db4.similarity_search(query)
for doc in docs:
    print(doc.metadata)
    #print(doc.page_content)

{'file_name': 'FY2023_Brown_Book.pdf', 'type': 'Brown Book', 'uuid': 'd0c08154-525c-4d4d-abce-1b4542aefc9c', 'year': '2022'}
{'file_name': 'FY2024_Brown_Book.pdf', 'type': 'Brown Book', 'uuid': 'a2cc48e6-1ef3-44a4-9621-e2ef1bda78a1', 'year': '2023'}
{'file_name': 'Art_2_report_Appropriation_committee_posted_3.21.23.pdf', 'type': 'Appropriation Committee Report', 'uuid': 'b2f39a07-1f8b-40f7-b939-3fac6a19ae5b', 'year': '2023'}
{'file_name': 'Capital_expenditures_committee_report_to_2022_atm_stms_2022-1_-2_final.pdf', 'type': 'Capital Expenditures Committee Report', 'uuid': '1e130103-7358-42bb-8845-8e62b5182907', 'year': '2022'}
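
`similarity_search` embeds the query with the same embedding function and returns the stored chunks whose vectors are closest to it (four results in the output above). The toy below illustrates only the ranking step, using hand-made 3-dimensional vectors and cosine similarity. Both the vectors and the choice of cosine are assumptions for illustration; the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones from all-MiniLM-L6-v2 have 384 dimensions.
toy_vectors = {
    "police budget chunk": [0.9, 0.1, 0.0],
    "school budget chunk": [0.6, 0.6, 0.1],
    "zoning bylaw chunk": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

for name, score in top_k(query_vector, toy_vectors, k=2):
    print(f"{score:.3f}  {name}")
```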


In [17]:
# get an existing collection
collection = client.get_collection(collection_name)
collection.count()

3012

## Q&A With Chroma

In [18]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=db4.as_retriever())
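
With `chain_type="stuff"`, every retrieved chunk is placed ("stuffed") into a single prompt that is sent to the LLM in one call. The sketch below shows the idea only: `build_stuff_prompt` and its prompt wording are hypothetical and are not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, chunks: List[str]) -> str:
    """Concatenate every retrieved chunk into one prompt for a single LLM call."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = ["Chunk one text.", "Chunk two text."]  # placeholder chunks
print(build_stuff_prompt("What is the Police Dept budget this year?", retrieved))
```

Because everything goes into one prompt, the combined length of the retrieved chunks has to fit within the model's context window.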

In [21]:
qa.run("What is the trend in the Police Dept budget in FY2023 versus 2024?")

' The FY2023 Police Dept budget is $8,265,377, which is a 1.49% increase from the FY2022 budget. The FY2024 recommended Police Dept budget is $9,042,530 which is a 9.40% increase from the FY2023 budget.'

In [22]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.llms import OpenAI

qa2 = RetrievalQAWithSourcesChain.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=db4.as_retriever())

In [24]:
qa2(
    {"question": "What is the number of employees in the Police Dept? Give a full description"},
    return_only_outputs=True,
)

{'answer': ' The Police Department currently has 73 FTEs including 8 command officers, 34 patrol sergeants and officers, 8 detectives, 10 civilian dispatchers, school crossing guards, and administration staff.\n',
 'sources': '/home/andrei/build/data/Acreport2022atm-20220321-v2_1.pdf, /home/andrei/build/data/FY2023_Brown_Book.pdf, /home/andrei/build/data/FY2024_Brown_Book.pdf'}
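
The `sources` value comes back as one comma-separated string of full paths, whereas the metadata stored earlier is keyed by `file_name`. A minimal sketch for turning that string into a list of file names follows; `parse_sources` is a hypothetical helper, and it assumes the paths themselves contain no commas.

```python
import os
from typing import List


def parse_sources(sources: str) -> List[str]:
    """Split a comma-separated sources string and keep only the file names."""
    paths = [part.strip() for part in sources.split(",")]
    return [os.path.basename(path) for path in paths if path]


sources = (
    "/home/andrei/build/data/Acreport2022atm-20220321-v2_1.pdf, "
    "/home/andrei/build/data/FY2023_Brown_Book.pdf, "
    "/home/andrei/build/data/FY2024_Brown_Book.pdf"
)
print(parse_sources(sources))
# ['Acreport2022atm-20220321-v2_1.pdf', 'FY2023_Brown_Book.pdf', 'FY2024_Brown_Book.pdf']
```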