## Chroma Intro

Chroma is an open-source vector database. It makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.



By default, Chroma uses Sentence Transformers to create embeddings. Sentence Transformers is a library for creating sentence and document embeddings that can be used for a wide variety of tasks. It is based on the Transformers library from Hugging Face. This embedding function runs locally on your machine and may require you to download the model files (this happens automatically).



In [1]:
pip install chromadb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chromadb
  Downloading chromadb-0.3.22-py3-none-any.whl (69 kB)
Collecting requests>=2.28 (from chromadb)
  Downloading requests-2.30.0-py3-none-any.whl (62 kB)
Collecting hnswlib>=0.7 (from chromadb)
  Downloading hnswlib-0.7.0.tar.gz (33 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting clickhouse-connect>=0.5.7 (from chromadb)
  Downloading clickhouse_connect-0.5.24-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (922 kB)

In [2]:
import chromadb

client = chromadb.Client()

# Collections are where you'll store your embeddings, documents, and any additional metadata. You can create a collection with a name:
collection = client.create_collection("test")




## Quick Test Data

In [3]:
# Here we pass precomputed embeddings, so Chroma only stores and indexes them.
# If you pass documents without embeddings, Chroma handles tokenization and embedding automatically.
collection.add(
    embeddings=[
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
    ],
    metadatas=[
        {"uri": "img1.png", "style": "style1"},
        {"uri": "img2.png", "style": "style2"},
        {"uri": "img3.png", "style": "style1"},
        {"uri": "img4.png", "style": "style1"},
        {"uri": "img5.png", "style": "style1"},
        {"uri": "img6.png", "style": "style1"},
        {"uri": "img7.png", "style": "style1"},
        {"uri": "img8.png", "style": "style1"},
    ],
    documents=["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8"],
    ids=["id1", "id2", "id3", "id4", "id5", "id6", "id7", "id8"],
)

# Query the data
query_result = collection.query(
    query_embeddings=[[1.1, 2.3, 3.2], [5.1, 4.3, 2.2]],
    n_results=2,
)

print(query_result)


{'ids': [['id1', 'id5'], ['id2', 'id4']], 'embeddings': None, 'documents': [['doc1', 'doc5'], ['doc2', 'doc4']], 'metadatas': [[{'uri': 'img1.png', 'style': 'style1'}, {'uri': 'img5.png', 'style': 'style1'}], [{'uri': 'img2.png', 'style': 'style2'}, {'uri': 'img4.png', 'style': 'style1'}]], 'distances': [[0.0, 0.0], [11.960000038146973, 11.960000038146973]]}
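The distances in the output above are consistent with squared Euclidean (L2) distance: for the second query, (5.1 - 4.5)² + (4.3 - 6.9)² + (2.2 - 4.4)² = 11.96. Below is a minimal pure-Python sketch of that calculation using a brute-force search. It is only an illustration, not how Chroma works internally (Chroma uses an approximate HNSW index), and the helper names `squared_l2` and `brute_force_query` are made up for this example. When several stored vectors are equally close, the order of the returned ids may differ from what Chroma prints.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_query(stored, query, n_results):
    """Return the n_results closest (id, distance) pairs by scanning every stored vector."""
    scored = [(item_id, squared_l2(vec, query)) for item_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Same toy data as the cell above: odd ids hold one vector, even ids the other.
stored = {
    f"id{i + 1}": ([1.1, 2.3, 3.2] if i % 2 == 0 else [4.5, 6.9, 4.4])
    for i in range(8)
}

for query in ([1.1, 2.3, 3.2], [5.1, 4.3, 2.2]):
    print(brute_force_query(stored, query, n_results=2))
```

The trailing digits in Chroma's `11.960000038146973` are probably just the value being stored as a 32-bit float.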


## LangChain + Chroma

Question answering with LangChain and Chroma

In [5]:
pip install langchain

Installing collected packages: mypy-extensions, multidict, marshmallow, frozenlist, async-timeout, yarl, typing-inspect, openapi-schema-pydantic, marshmallow-enum, aiosignal, dataclasses-json, aiohttp, langchain
Successfully installed aiohttp-3.8.4 aiosignal-1.3.1 async-timeout-4.0.2 dataclasses-json-0.5.7 frozenlist-1.3.3 langchain-0.0.167 marshmallow-3.19.0 marshmallow-enum-1.5.1 multidict-6.0.4 mypy-extensions-1.0.0 openapi-schema-pydantic-1.2.4 typing-inspect-0.8.0 yarl-1.9.2


In [8]:
# import langchain and vector db libs

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.llms import OpenAI
from langchain.chains import VectorDBQA
from langchain.document_loaders import TextLoader

# in case we need Hugging Face :)
import os
os.environ['HUGGINGFACEHUB_API_TOKEN'] = '???'  # put your own token here; never publish a real one

In [None]:
# Upload file
from google.colab import files
files.upload()

In [11]:
# load your text corpora
loader = TextLoader('amendments.txt')
documents = loader.load()

In [69]:
# split the text into chunks before feeding it to the LLM
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
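To get a feel for how `chunk_size` affects the result, here is a simplified pure-Python sketch of the idea: split on blank lines, then greedily pack pieces into chunks up to `chunk_size` characters. This is not LangChain's actual implementation (the real `RecursiveCharacterTextSplitter` falls back to finer separators and supports overlap); `split_into_chunks` is a hypothetical helper written just for this illustration.

```python
def split_into_chunks(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    Simplification: a single piece longer than chunk_size is kept whole.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Tiny stand-in for the amendments file
sample = "Amendment I\naaa\n\nAmendment II\nbbb\n\nAmendment III\nccc"

print(len(split_into_chunks(sample, chunk_size=20)))    # small chunks: one amendment each
print(len(split_into_chunks(sample, chunk_size=1000)))  # large chunks: everything together
```

A smaller `chunk_size` keeps each amendment in its own chunk, while a larger one groups several amendments together.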

In [70]:
# With chunk_size=100, almost all amendments end up in separate chunks;
# with chunk_size=1000 (used here), several amendments share one chunk, as the first chunk below shows
texts[:30]

[Document(page_content='Amendment I\nCongress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.\n\nAmendment II\nA well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed.\n\nAmendment III\nNo Soldier shall, in time of peace be quartered in any house, without the consent of the Owner, nor in time of war, but in a manner to be prescribed by law.\n\nAmendment IV\nThe right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.', 

## How to dislike OpenAI

In [34]:
# To use OpenAI embeddings we need the openai and tiktoken packages
# (tiktoken provides the byte pair encoding used for tokenization), so run:
# pip install openai tiktoken
# before you proceed
embeddings = OpenAIEmbeddings(openai_api_key="???")

# Make sure you are on a paid OpenAI plan, even if you are not building a commercial product :) Otherwise the API will throw an error because of your plan

In [71]:
# initialize Chroma
vectordb = Chroma.from_documents(texts)



Quick note here:

When we initialize Chroma, we can also pass the embeddings we chose. In this imaginary example, you go ahead and pay for the OpenAI API to get access to their embeddings. The code would look like this:
```
# initialize Chroma with OpenAI embeddings
embeddings = OpenAIEmbeddings(openai_api_key="???")
vectordb = Chroma.from_documents(texts, embeddings)

```
Since I did not set the embeddings in the code above, Chroma uses its default open-source SentenceTransformer model, which is pretty damn good anyway.

Thus we don't have to manually import it from Hugging Face anymore. If you want to use any other embeddings from Hugging Face, I'm not sure how to set that up though :)
```
!pip install -qU huggingface_hub
import os

os.environ['HUGGINGFACEHUB_API_TOKEN'] = '???'

from sentence_transformers import SentenceTransformer, util
```

## Create Chains

In [86]:
# if not using OpenAI, use an LLM from the Hugging Face Hub instead
from langchain import PromptTemplate, HuggingFaceHub, LLMChain

# initialize HF LLM
flan_t5 = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature":0.7, "max_length": 512}
)


In [87]:
# using Hugging Face Flan-T5
qa = VectorDBQA.from_chain_type(llm=flan_t5, chain_type="stuff", vectorstore=vectordb)

Another quick note:
If you want to use OpenAI embeddings, pass their LLM in the params as well:
```
qa = VectorDBQA.from_chain_type(llm = OpenAI(), chain_type = "stuff", vectorstore=vectordb)
```
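For intuition, the "stuff" chain type roughly means: take the chunks retrieved from the vector store and stuff them all into a single prompt together with the question. Here is a minimal sketch of that idea in plain Python. The prompt wording and the `build_stuffed_prompt` helper are assumptions made up for this illustration, not LangChain's actual template or API.

```python
def build_stuffed_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, shortened for readability
chunks = [
    "Amendment I\nCongress shall make no law respecting an establishment of religion...",
    "Amendment II\nA well regulated Militia, being necessary to the security of a free State...",
]
prompt = build_stuffed_prompt("What is first amendment?", chunks)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the prompt grows with `chunk_size` and the number of retrieved chunks, which is worth keeping in mind with models that have a short input limit.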


In [None]:
query = "What is first amendment?"

# flan-small works, but as expected the output is not great. For more reasonable results, use the t5-xl model.
# Flan-XXL works, but its max_length of 512 makes it hard to pass a few-shot template, and without one the answers aren't that good.

qa.run(query)