## **ChromaDB Collection Operations**

The goal of this notebook is to demonstrate how to use `ChromaDB`, a vector database for managing and querying embeddings (numeric representations of text or other data).

This notebook explores how to work with Chroma collections, including:
- Creating, inspecting, and deleting collections;
- Adding, querying, updating, and deleting documents;
- Filtering by metadata and document contents.

In [1]:
# !pip install chromadb

In [2]:
# Create an in-memory Chroma client
import chromadb
chroma_client = chromadb.Client()

This initializes a Chroma client that allows you to interact with an in-memory instance of ChromaDB. The `Client` is used to manage collections and perform operations like adding data or querying.

In [3]:
# Create a collection
collection = chroma_client.create_collection(name="my_collection")

In ChromaDB, a collection is a logical grouping of related data within the database. It serves as a container for storing and managing embeddings and their associated metadata, allowing you to organize and query data efficiently.

In [4]:
# Add some text documents to the collection
collection.add(
    documents=[
        "This is a document about pineapple. This is a document about oranges",
        " Welcome to Natural Language Processing  It is one of the most exciting research areas as of today  We will see how Python can be used to work with text files. "
    ],
    ids=["id1", "id2"]
)

- **Documents**: Text data to be stored in the collection.
- **IDs**: Unique identifiers for the documents.

In [5]:
# Query the collection
results = collection.query(
    query_texts=["This is a query document about hawaii"], # Chroma will embed this for you
    n_results=2 # how many results to return
)
print(results)

{'ids': [['id1', 'id2']], 'embeddings': None, 'documents': [['This is a document about pineapple. This is a document about oranges', ' Welcome to Natural Language Processing  It is one of the most exciting research areas as of today  We will see how Python can be used to work with text files. ']], 'uris': None, 'data': None, 'metadatas': [[None, None]], 'distances': [[1.082511067390442, 1.848189115524292]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}


This performs a similarity search using the query text `"This is a query document about hawaii"`. ChromaDB embeds the query text and compares it with the document embeddings in the collection.

- `n_results=2`: Requests the top 2 most similar documents.
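
To make the ranking step concrete, the following is a minimal sketch of a nearest-neighbour search over toy vectors. It assumes squared Euclidean (L2) distance as the metric, which may differ from how a given collection is configured. The three-dimensional vectors and the `squared_l2` and `top_k` helpers are made up for illustration; real embeddings are much longer.

```python
# Toy model of a similarity search: rank stored vectors by distance to a query.
# Assumption: squared L2 distance; the vectors and helper names are illustrative only.

def squared_l2(a, b):
    """Sum of squared element-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, stored, k):
    """Return the k (id, distance) pairs closest to the query, nearest first."""
    scored = [(doc_id, squared_l2(query, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

stored = {
    "id1": [1.0, 0.0, 0.0],
    "id2": [0.0, 1.0, 0.0],
    "id3": [0.9, 0.1, 0.0],
}
nearest = top_k([1.0, 0.0, 0.0], stored, k=2)
print(nearest)
```

A smaller distance means a closer match, which is why `id1` comes before `id2` in the `distances` output above.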

In [7]:
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings

In [8]:
DEFAULT_TENANT, DEFAULT_DATABASE

('default_tenant', 'default_database')

In [None]:
client = chromadb.PersistentClient(
    path="../VectorDB",
    settings=Settings(
        is_persistent=True,
        persist_directory="../VectorDB",
        allow_reset=True,
        anonymized_telemetry=False,
    ),
    tenant=DEFAULT_TENANT,
    database=DEFAULT_DATABASE,
)

# path - parameter must be a local path on the machine where Chroma is running. If the path does not exist, it will be created. 
# The path can be relative or absolute. If the path is not specified, the default is ./chroma in the current working directory.
# settings - Chroma settings object / behavior.
# tenant - the tenant to use. Default is default_tenant.
# database - the database to use. Default is default_database.

The `PersistentClient` saves ChromaDB data to disk, allowing you to maintain state across sessions. The `path` parameter specifies the directory where the data will be stored.

### **Creating, inspecting, and deleting Collections**

Chroma uses collection names in the URL, so there are a few restrictions on naming them:

**Collection Naming Rules**
- Names must be between 3 and 63 characters.
- Names must start and end with a lowercase letter or a digit.
- Allowed characters include dots (.), dashes (-), and underscores (_), but no consecutive dots.
- Names must not resemble an IP address.
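
The rules above can be checked before calling `create_collection`. The sketch below is an illustrative validator written from the list, not Chroma's own validation code. The function name `is_valid_collection_name` is hypothetical, and it assumes that the characters between the first and last are limited to lowercase letters, digits, dots, dashes, and underscores.

```python
import ipaddress
import re

# Illustrative check of the naming rules listed above; not Chroma's own validator.
# Assumption: inner characters are lowercase letters, digits, '.', '-' or '_'.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]*[a-z0-9]")

def is_valid_collection_name(name: str) -> bool:
    if not 3 <= len(name) <= 63:
        return False  # wrong length
    if _NAME_PATTERN.fullmatch(name) is None:
        return False  # bad first/last character or disallowed character
    if ".." in name:
        return False  # consecutive dots
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return True   # does not parse as an IP address, so the name passes
    return False      # parses as an IP address

for candidate in ["my_collection", "ab", "My_collection", "a..b", "192.168.0.1"]:
    print(candidate, is_valid_collection_name(candidate))
```

Chroma raises its own error for an invalid name, so a check like this is only useful for failing early with a clearer message.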

**Creating a Collection**:
A collection is initialized with a name and an optional embedding function. If you supply an embedding function, you must supply it every time you get the collection.

In [10]:
from chromadb.utils import embedding_functions


In [13]:
model_name = "all-MiniLM-L6-v2"
emb_fun = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

In [20]:
emb_fun(["foo"])

[array([ 1.62334405e-02, -7.66238105e-03,  1.86064355e-02,  3.19690108e-02,
        -3.10037173e-02,  8.77797510e-03,  1.59455374e-01, -9.52165574e-03,
         2.02003960e-02, -4.54581492e-02,  1.39858145e-02, -1.76749974e-02,
        -3.61696668e-02, -2.19433811e-02,  2.13876776e-02,  6.45927489e-02,
        -3.65953557e-02, -1.21336570e-02, -4.36662212e-02, -3.51500399e-02,
        -3.26299071e-02,  7.83412531e-02, -2.10416876e-02,  3.37276980e-02,
        -2.41579358e-02, -1.07671050e-02, -4.28647995e-02,  1.35396337e-02,
         5.03973290e-02, -9.19567645e-02,  3.54946516e-02,  1.80297494e-01,
         1.57636590e-02, -4.94916439e-02, -3.97645682e-03,  3.21058877e-04,
         2.18496323e-02,  3.53683755e-02,  4.18541990e-02,  4.89937104e-02,
        -2.66513415e-02, -5.65088317e-02, -3.27685401e-02, -2.07234658e-02,
        -1.12308590e-02,  2.79816091e-02, -1.05389841e-02,  3.03177647e-02,
         1.76971294e-02,  3.63379484e-03, -8.70853849e-03, -4.94684279e-02,
        -2.9

In [None]:
len(emb_fun(["foo"])[0])  # length of one embedding: all-MiniLM-L6-v2 yields 384 dimensions

384

In [None]:
# Create collection
collection = client.create_collection(name="my_collection2", embedding_function=emb_fun)

In [22]:
# Get collection
collection = client.get_collection(name="my_collection2", embedding_function=emb_fun)

In [23]:
# Delete collection
client.delete_collection(name="my_collection2")

In [24]:
# Use this when unsure whether the collection already exists or needs to be created
collection = client.get_or_create_collection(name="some_collection")

In [25]:
collection

Collection(name=some_collection)

In [26]:
# Rename a collection
collection.modify(name="new_name")

In [27]:
collection

Collection(name=new_name)

In [28]:
# Add documents
collection.add(
    documents=["som2", "doc2", "doc3"],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
    ids=["id1", "id2", "id3"]
)

If Chroma is passed a list of documents, it will automatically tokenize and embed them with the collection's embedding function (the default will be used if none was supplied at collection creation). Chroma will also store the documents themselves. If the documents are too large to embed using the chosen embedding function, an exception will be raised.

Each document must have a unique associated ID. Trying to `.add` the same ID twice will result in only the initial value being stored. An optional list of metadata dictionaries, one per document, can be supplied to store additional information and enable filtering.
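
Because a repeated ID does not replace the stored value, it can help to check a batch for duplicates before adding it. The helper below is a small sketch in plain Python; `find_duplicate_ids` is a hypothetical name and is not part of the Chroma API.

```python
from collections import Counter

def find_duplicate_ids(ids):
    """Return the IDs that appear more than once, in first-seen order."""
    counts = Counter(ids)
    return [doc_id for doc_id in counts if counts[doc_id] > 1]

candidate_ids = ["id1", "id2", "id3", "id2"]
duplicates = find_duplicate_ids(candidate_ids)
if duplicates:
    print("Duplicate IDs, fix before calling add():", duplicates)
```

This only catches duplicates within one batch. To compare against IDs already stored, one option is to look them up first with `collection.get(ids=...)`.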

Alternatively, you can supply a list of document-associated embeddings directly, and Chroma will store the associated documents without embedding them itself.

In [29]:
collection.add(
    documents=["doc1", "doc2", "doc3"],
#     embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
    ids=["id11", "id21", "id31"]
)

In [None]:
# batch adding
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')

In [31]:
doc_lists=[]
metda_list=[]
for i in newsgroups_train['data'][:100]:
    doc_lists.append(i)
    metda_list.append({'len_of_doc':len(i)})

In [32]:
collection.add(
    documents=doc_lists,
#     embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
    metadatas=metda_list,
    ids=['id_{}'.format(i) for i in range(100)]
)
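
The 100 documents above fit comfortably in one call, but larger datasets may exceed the number of records a single `add` call accepts; the exact limit depends on the installation. A minimal sketch of splitting the input into batches follows. The `iter_batches` helper and the batch size of 4 are assumptions chosen for illustration.

```python
def iter_batches(items, batch_size):
    """Yield (start_index, slice) pairs with at most `batch_size` items per slice."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

documents = [f"document {i}" for i in range(10)]  # stand-in data
all_ids = []
for start, batch in iter_batches(documents, batch_size=4):
    # Derive IDs from the global position so they stay unique across batches.
    ids = [f"id_{start + offset}" for offset in range(len(batch))]
    # In the notebook, collection.add(documents=batch, ids=ids) would go here.
    all_ids.extend(ids)
    print(ids)
```

Deriving each ID from the global index, rather than the position inside the batch, keeps the IDs unique across calls.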

In [33]:
# Preview the first 2 records
collection.peek(2)  # peek(limit) returns the first `limit` items; the default is 10

{'ids': ['id1', 'id2'],
 'embeddings': array([[-7.27678509e-03, -3.58610414e-02, -5.85470051e-02,
          4.30398919e-02,  2.56231036e-02, -2.53969394e-02,
          3.37159522e-02,  3.43396189e-03, -2.63475925e-02,
         -4.61397991e-02,  8.30679685e-02, -6.42239768e-03,
         -3.55973691e-02, -3.96921635e-02,  4.66601700e-02,
          2.61252634e-02,  8.53319243e-02, -1.68907885e-02,
         -1.70462276e-03, -1.81527901e-02,  6.35295585e-02,
         -3.71037200e-02,  5.74298874e-02,  2.37502102e-02,
         -4.89793792e-02,  3.40397581e-02,  4.02286369e-03,
          6.53342456e-02,  6.25966191e-02, -1.67348072e-01,
          7.33717456e-02,  9.65449214e-02, -2.70988718e-02,
          6.61829999e-03, -8.19215104e-02, -2.82600895e-02,
          3.99508812e-02, -9.23850313e-02, -9.49947312e-02,
         -1.11495787e-02,  3.19334329e-03,  3.37751233e-03,
         -1.56492591e-02, -4.09619361e-02,  3.89017239e-02,
          1.65557954e-02,  4.70884219e-02, -7.38915289e-03,
  

In [None]:
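# Query with precomputed embeddings instead of text.
# The vectors below are placeholders: each one must have the same
# dimensionality as the embeddings stored in the collection.
collection.query(
    query_embeddings=[[11.1, 12.1, 13.1], [1.1, 2.3, 3.2], ...],
    include=["documents"]
)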

### **Using Where filters**

Chroma supports filtering queries by metadata and document contents. The `where` filter is used to filter by metadata, and the `where_document` filter is used to filter by document contents.

#### **Filtering by metadata**

To filter on metadata, supply a `where` filter dictionary to the query. The dictionary must have the following structure:

{ "metadata_field": { OPERATOR: VALUE } }

Here `OPERATOR` is one of the comparison operators `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, or `$lte`, and `VALUE` is the value to compare against. The logical operators `$and` and `$or` combine multiple filters. The `$contains` operator applies to document contents and is used with `where_document`.

#### **Filtering for a search_string**

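To filter by document contents, supply a `where_document` dictionary that names the string to search for:

{ "$contains": "search_string" }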

#### **Using logical operators**

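You can also use the logical operators `$and` and `$or` to combine multiple filters.

An `$and` operator will return results that match all of the filters in the list.

{ "$and": [ { "metadata_field": { OPERATOR: VALUE } }, { "metadata_field": { OPERATOR: VALUE } } ] }

An `$or` operator will return results that match any of the filters in the list.

{ "$or": [ { "metadata_field": { OPERATOR: VALUE } }, { "metadata_field": { OPERATOR: VALUE } } ] }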
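
To show how these filters combine, here is a toy evaluator in plain Python that applies a `where`-style dictionary to metadata records. It is an illustrative model of the semantics described above, not how Chroma evaluates filters internally. The `matches` function is hypothetical, and treating a missing field as a non-match is an assumption.

```python
import operator

# Toy evaluator mirroring the filter semantics described above.
# Illustrative only; this is not Chroma's implementation.
COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}

def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    for field, condition in where.items():
        if field not in metadata:
            return False  # assumption: a missing field never matches
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # {"field": value} is shorthand for $eq
        for op_name, expected in condition.items():
            if not COMPARATORS[op_name](metadata[field], expected):
                return False
    return True

records = [
    {"chapter": "3", "verse": "16"},
    {"chapter": "3", "verse": "5"},
    {"chapter": "29", "verse": "11"},
]
both = {"$and": [{"chapter": {"$eq": "3"}}, {"verse": {"$ne": "16"}}]}
either = {"$or": [{"chapter": {"$eq": "29"}}, {"verse": {"$eq": "16"}}]}
print([r for r in records if matches(r, both)])
print([r for r in records if matches(r, either)])
```

In the notebook itself, the same dictionaries would be passed as the `where` argument, for example `collection.query(query_texts=[...], where=both)`.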

In [None]:
# Update
collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

In [None]:
# Upsert: update items whose IDs already exist and add the ones that do not
collection.upsert(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)
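
The difference between the two calls can be modelled with a plain dictionary. The sketch below is not Chroma code. It assumes, following the Chroma documentation as understood here, that `update` ignores IDs that are not in the collection while `upsert` adds them. The `store` dict and both helper functions are hypothetical stand-ins.

```python
# Toy in-memory model of update versus upsert.
# A plain dict stands in for a collection; this is not Chroma code.

def update(store, ids, documents):
    """Overwrite existing IDs only; return the IDs that were skipped."""
    skipped = []
    for doc_id, doc in zip(ids, documents):
        if doc_id in store:
            store[doc_id] = doc
        else:
            skipped.append(doc_id)  # assumption: unknown IDs are ignored
    return skipped

def upsert(store, ids, documents):
    """Overwrite existing IDs and insert the ones that are missing."""
    for doc_id, doc in zip(ids, documents):
        store[doc_id] = doc

store = {"id1": "old text"}
skipped = update(store, ["id1", "id9"], ["new text", "ignored"])
print(store, skipped)
upsert(store, ["id9"], ["added by upsert"])
print(store)
```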

In [None]:
# Deleting data from a collection
collection.delete(
    ids=["id1", "id2", "id3", ...],
    where={"chapter": "20"}
)

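ChromaDB supports deleting items from a collection by ID using `.delete`. The embeddings, documents, and metadata associated with each item will be deleted. `.delete` also accepts a `where` filter, as in the cell above; if no IDs are supplied, it deletes every item in the collection that matches the filter.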

⚠️ This is a destructive operation, and cannot be undone.