# **Chroma Hands-on**

## **What's covered?**
1. Vector Stores
    - Why Vector Stores?
    - Vector Store Key Attributes
    - Popular VectorDBs
2. Introduction to Chroma
    - Features
    - Installing Chroma
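    - Initiating a Persistent Client
    - Create a Collection
    - Get An Already Existing Collection
    - Adding data to Collection
3. Search Embeddings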

## **Vector Stores**

### **Why Vector Stores?**
One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded as well, and the stored vectors that are 'most similar' to the embedded query are retrieved. A vector store takes care of storing the embedded data and performing the vector search for you.

<img src="images/langchain_rag.jpg" style="width: 70%; height: auto;">
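
The embed, store, and retrieve loop described above can be sketched in plain Python. This is a toy illustration only: `embed` is a hypothetical bag-of-words function over a small made-up vocabulary, standing in for a real embedding model, and the "store" is just a list.

```python
import math

# Hypothetical vocabulary; a real embedding model produces dense learned vectors.
VOCAB = ["fiber", "digestion", "potassium", "vitamin", "heart"]

def embed(text: str) -> list[float]:
    """Toy embedding: count how often each vocabulary word appears."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

documents = [
    "Apples - High in fiber, support digestion, and promote heart health.",
    "Bananas - Rich in potassium, help regulate blood pressure.",
    "Oranges - Packed with vitamin C, boost immunity.",
]

# "Store": keep each document next to its embedding.
store = [(doc, embed(doc)) for doc in documents]

# "Query": embed the query and rank the stored vectors by similarity.
query_vector = embed("fiber and digestion")
ranked = sorted(
    store,
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
best_match = ranked[0][0]
print(best_match)
```

A vector store packages exactly these two steps, so application code only has to supply documents and queries.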

### **Vector Store Key Attributes**
- Can store large d-dimensional vectors
- Can map each embedded vector directly to its associated text document.
- Can be queried, allowing a similarity search (for example, cosine similarity) between a new vector and the stored vectors.
- Can easily add, update, or delete vectors.
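
To make these attributes concrete, here is a minimal in-memory sketch. `TinyVectorStore` and its method names are hypothetical and are not part of any library. It scans every stored vector on each query, whereas real vector stores typically rely on an approximate nearest-neighbor index.

```python
import math

class TinyVectorStore:
    """In-memory sketch: maps an id to a (vector, document) pair."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[list[float], str]] = {}

    def upsert(self, item_id: str, vector: list[float], document: str) -> None:
        # Adding and updating are the same operation on a dict.
        self._items[item_id] = (vector, document)

    def delete(self, item_id: str) -> None:
        self._items.pop(item_id, None)

    def query(self, vector: list[float], n_results: int = 1) -> list[str]:
        def similarity(other: list[float]) -> float:
            dot = sum(x * y for x, y in zip(vector, other))
            norms = math.hypot(*vector) * math.hypot(*other)
            return dot / norms if norms else 0.0

        ranked = sorted(
            self._items.values(),
            key=lambda item: similarity(item[0]),
            reverse=True,
        )
        return [document for _, document in ranked[:n_results]]

store = TinyVectorStore()
store.upsert("a", [1.0, 0.0], "about apples")
store.upsert("b", [0.0, 1.0], "about bananas")

first = store.query([0.9, 0.1])   # closest stored vector belongs to "a"
store.delete("a")
second = store.query([0.9, 0.1])  # only "b" is left
print(first, second)
```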

### **Popular VectorDBs**
There are many great vector store options. The ones below are free, open-source, and run entirely on your local machine; the full integrations list also covers many great hosted offerings.
- Chroma
- FAISS (Facebook AI Similarity Search)
- Milvus
- PGVector

Other vector database options include:
- Pinecone
- Qdrant
- Astra DB
- Azure Cosmos DB

[Click here](https://docs.langchain.com/oss/python/integrations/vectorstores) to check out all LangChain vector store integration options.

## **Introduction to Chroma**

<img src="images/chromadb_1.png" style="width: 70%; height: auto;">


### **Features**
1. Stores document embeddings and their metadata
2. **Has everything we need for retrieval**
    - Similarity Search
    - Full-text Search (regex supported)
    - Sparse Vector Search (BM25)
    - Metadata filtering
    - Multi-modal retrieval
3. **Free and Open source**
4. **Integrations** 
    - Works with HuggingFace, OpenAI, Google, LangChain and more.

**[Click here](https://www.trychroma.com/) to visit the official website.**

### **Installing Chroma**

```python
!pip install chromadb
!pip install langchain-chroma
```

**[Click here](https://docs.trychroma.com/guides) to read the complete chromadb guide.**

In [1]:
# ! pip install chromadb

### **Initiating a Persistent Client**

You can configure Chroma to save and load the database from your local machine, using the PersistentClient.

Data will be persisted automatically and loaded on start (if it exists).

**Syntax**
```python
client = chromadb.PersistentClient(path=PATH)
```

The path is where Chroma will store its database files on disk, and load them on start. If you don't provide a path, the default is `./chroma`.

In [2]:
import chromadb

client = chromadb.PersistentClient(path="vector_store")

In [3]:
# returns a nanosecond heartbeat. Useful for making sure the client remains connected.

client.heartbeat()

1765289453677826000
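
The heartbeat value is easier to read as a date. The sketch below assumes the number is a Unix timestamp in nanoseconds (an assumption based on the "nanosecond heartbeat" comment above) and converts the value from the output using only the standard library.

```python
from datetime import datetime, timezone

heartbeat_ns = 1765289453677826000  # value returned by client.heartbeat() above

# Split into whole seconds and leftover nanoseconds to avoid float rounding.
seconds, nanoseconds = divmod(heartbeat_ns, 1_000_000_000)
heartbeat_time = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(heartbeat_time.isoformat())
```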

In [4]:
client.list_collections()

[]

In [5]:
# # Empties and completely resets the database. ⚠️ This is destructive and not reversible.

# client.reset() 

# # Note: reset() only works after you change your configuration. Either
# # set `allow_reset` to `True` in your Settings(), or
# # include `ALLOW_RESET=TRUE` in your environment variables

### **Create a Collection**

Collections are the fundamental unit of storage and querying in Chroma. 

Chroma lets you manage collections of embeddings, using the collection primitive.

Chroma collections are created with a **name**. Collection names are used in the URL, so there are a few restrictions on them:
- The length of the name must be between 3 and 512 characters.
- The name must start and end with a lowercase letter or a digit, and it can contain dots, dashes, and underscores in between.
- The name must not contain two consecutive dots.
- The name must not be a valid IP address.
- Note that collection names must be unique inside a Chroma database. If you try to create a collection with a name of an existing one, you will see an exception.
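
The naming rules above can be checked before calling Chroma. The helper below is a hypothetical sketch that mirrors the rules exactly as listed here; it is not Chroma's own validation code, and the library's actual check may differ in details. Uniqueness is not covered, since that depends on what is already in the database.

```python
import ipaddress
import re

# Lowercase letter or digit at both ends; dots, dashes, underscores allowed in between.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]*[a-z0-9]")

def is_valid_collection_name(name: str) -> bool:
    """Hypothetical helper mirroring the rules listed above."""
    if not 3 <= len(name) <= 512:
        return False
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if ".." in name:
        return False
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return True   # not an IP address, so the name passes
    return False      # parsed as an IP address

for candidate in ["my_first_collection", "ab", "_private", "a..b", "192.168.1.1"]:
    print(candidate, is_valid_collection_name(candidate))
```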

**Parameters**
- **name**
- **embedding_function**: By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings.
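- **metadata**: Optional key-value pairs describing the collection. Setting `{"hnsw:space": "cosine"}` selects cosine distance; by default, Chroma uses L2 distance.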
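
To see how the distance options differ, the sketch below computes them by hand. It assumes the definitions given in Chroma's documentation (`l2` as squared Euclidean distance, `cosine` as one minus cosine similarity, `ip` as one minus the inner product). The function names are illustrative and not part of the library.

```python
import math

def squared_l2(a: list[float], b: list[float]) -> float:
    # Sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a: list[float], b: list[float]) -> float:
    # 1 - cosine similarity; ignores vector length, looks only at direction.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def inner_product_distance(a: list[float], b: list[float]) -> float:
    # 1 - dot product; sensitive to vector length.
    return 1.0 - sum(x * y for x, y in zip(a, b))

a = [1.0, 0.0]
b = [0.0, 1.0]   # perpendicular to a
c = [2.0, 0.0]   # same direction as a, twice as long

print(squared_l2(a, b), cosine_distance(a, b), inner_product_distance(a, b))
print(squared_l2(a, c), cosine_distance(a, c), inner_product_distance(a, c))
```

Whichever metric is chosen, smaller values mean closer matches, but the numbers from different metrics are not comparable with each other.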


In [7]:
from datetime import datetime

collection = client.create_collection(
    name="my_first_collection", 
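    embedding_function=None,  # None: this handle will not embed documents automatically; omit the argument to use the default model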
    metadata={
        "description": "my first Chroma collection",
        "created": str(datetime.now()),
        "hnsw:space": "cosine"
    }
)

# Alternatively you can also use client.get_or_create_collection()

In [8]:
# returns the number of items in the collection

collection.count()

0

In [9]:
# returns a list of the first 10 items in the collection

collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': []}

In [10]:
client.list_collections()

[Collection(name=my_first_collection)]

In [12]:
# Rename the collection and update its metadata

collection.modify(
    name="my_first_collection_modified", 
    metadata={"description": "new description"}
)

In [13]:
client.list_collections()

[Collection(name=my_first_collection_modified)]

In [11]:
# client.delete_collection(name="collection_name")

### **Get An Already Existing Collection**

In [15]:
import chromadb

client = chromadb.PersistentClient(path="vector_store")

collection = client.get_collection(name="my_first_collection_modified")

# Alternatively you can also use client.get_or_create_collection()

### **Adding data to Collection**

Add data to a Chroma collection with the `.add` method. It takes a list of unique string ids and a list of documents. Chroma embeds these documents for you using the collection's embedding function and also stores the documents themselves. You can optionally provide a metadata dictionary for each document you add.

In [16]:
import uuid

# documents to add; Chroma computes their embeddings
data = [
    "Apples - High in fiber, support digestion, and promote heart health.",
    "Bananas - Rich in potassium, help regulate blood pressure and muscle function.",
    "Oranges - Packed with vitamin C, boost immunity, and promote skin health.",
    "Blueberries - High in antioxidants, improve brain function and reduce inflammation.",
    "Strawberries - Support heart health and contain anti-aging antioxidants.",
    "Watermelon - Hydrating fruit with lycopene, good for heart and skin health.",
    "Pineapple - Contains bromelain, aids digestion, and reduces inflammation.",
    "Avocado - Loaded with healthy fats, supports brain and heart health.",
    "Papaya - Rich in enzymes for digestion and boosts skin health.",
    "Pomegranate - Full of antioxidants, improves blood circulation, and heart health.",
    "Carrots - High in beta-carotene, improve eye health and skin glow.",
    "Spinach - Rich in iron, good for blood health and energy levels.",
    "Broccoli - Contains sulforaphane, which has anti-cancer properties.",
    "Tomatoes - Packed with lycopene, supports heart health and skin protection.",
    "Bell Peppers - High in vitamin C, boosts immunity, and reduces inflammation.",
    "Cucumber - Hydrating vegetable, aids in digestion, and supports skin health.",
    "Garlic - Has antibacterial properties, supports heart health and immunity.",
    "Ginger - Anti-inflammatory, helps with digestion and nausea relief.",
    "Beets - Improve blood flow, support endurance, and detox the liver.",
    "Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion."
    ]

collection.add(
    documents=data,
    ids=[str(uuid.uuid4()) for _ in data],  # one unique id per document
    # metadatas=[{"key_1": "abc_1", "key_2": "abc_2"}, {"key_1": "xyz_1", "key_2": "xyz_2"}]  # optional; must contain one dict per document
)
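
Because `uuid.uuid4()` is random, re-running the cell above generates fresh ids and stores the same documents again. One possible alternative, sketched below under the assumption that each document's text is unique, is to derive the id from the text with `uuid.uuid5`. The same sketch builds a `metadatas` list with one dictionary per document. The `name` key is an illustrative choice, not something Chroma requires.

```python
import uuid

sample = [
    "Apples - High in fiber, support digestion, and promote heart health.",
    "Carrots - High in beta-carotene, improve eye health and skin glow.",
]

# uuid5 is derived from the text, so the same document always maps to the same id.
ids = [str(uuid.uuid5(uuid.NAMESPACE_URL, doc)) for doc in sample]

# One metadata dict per document, in the same order as `sample`.
metadatas = [{"name": doc.split(" - ", 1)[0]} for doc in sample]

print(ids)
print(metadatas)

# These lists could then be passed together:
# collection.add(documents=sample, ids=ids, metadatas=metadatas)
```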

In [17]:
collection.count()

20

In [19]:
# # By default, peek() returns the first 10 items; pass a number to change the limit

# collection.peek(2)

## **Search Embeddings**

In [13]:
results = collection.query(
    query_texts=["digestion, fiber rich"], 
    n_results=3
)

results

{'ids': [['3514895a-60f3-4766-9a13-a8f9b2edb563',
   '9d466f7d-f301-4629-b973-61fb7c5b2fa4',
   '96bea589-7651-4510-aad6-3a6ddba303ab']],
 'embeddings': None,
 'documents': [['Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion.',
   'Apples - High in fiber, support digestion, and promote heart health.',
   'Spinach - Rich in iron, good for blood health and energy levels.']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None, None, None]],
 'distances': [[0.926364541053772, 0.9831753373146057, 1.1355160474777222]]}
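
Every field in the result is a list of lists: the outer list has one entry per query text, and the inner list holds the hits for that query. The sketch below uses a trimmed copy of the output above (so it runs without a database) and pairs each document with its distance.

```python
# Trimmed copy of the `results` dictionary shown above.
results = {
    "documents": [[
        "Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion.",
        "Apples - High in fiber, support digestion, and promote heart health.",
        "Spinach - Rich in iron, good for blood health and energy levels.",
    ]],
    "distances": [[0.926364541053772, 0.9831753373146057, 1.1355160474777222]],
}

# Index 0 selects the hits for the first (and only) query text.
hits = list(zip(results["documents"][0], results["distances"][0]))

for document, distance in hits:
    print(f"{distance:.3f}  {document}")
```

Hits are ordered from smallest to largest distance, so the first entry is the closest match.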

In [14]:
results = collection.query(
    query_texts=["vitamin rich"], 
    n_results=3
)

results["documents"]

[['Spinach - Rich in iron, good for blood health and energy levels.',
  'Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion.',
  'Oranges - Packed with vitamin C, boost immunity, and promote skin health.']]