# **Introduction to Vector Databases - ChromaDB**

Chroma is the open-source AI application database.

[Click here](https://www.trychroma.com/) to visit the official website.

<img src="images/chromadb_1.png">

### **Features**
1. **Has everything we need for retrieval**
    - Store document embeddings and their metadata
    - Search embeddings
    - Full-text search
    - Metadata filtering
    - Multi-modal retrieval
2. **Free and open source**
3. **Integrations** 
    - Works with HuggingFace, OpenAI, Google, LangChain and more.
4. **Simple to Get Started**
    - ```pip install chromadb```
  
### **Syntax**
```python
import chromadb

# Initiating a Persistent Chroma Client
client = chromadb.PersistentClient(path="/path/to/save/to")

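# Embedding function for the collection (here, Chroma's default model)
from chromadb.utils import embedding_functions
emb_fn = embedding_functions.DefaultEmbeddingFunction()

# Create a new collection, or get it if it already exists
collection = client.get_or_create_collection(
    name="my_collection",
    embedding_function=emb_fn,
    metadata={"hnsw:space": "cosine"},
)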

# add embeddings and documents
collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ], 
    metadatas=[{"key_1": "value_1", "key_2": "value_2"}, {"key_1": "value_1", "key_2": "value_2"}],
    ids=["id1", "id2"]
)

# get back similar embeddings
# Note that Chroma will embed query_texts for you and return n_results
results = collection.query(
    query_texts=["This is a query document about hawaii"], 
    n_results=2
)

# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    ids=["id1", "id2"]
)

```

[Click here](https://docs.trychroma.com/guides) to read the complete chromadb guide.

In [1]:
# ! pip install chromadb

## **Initiating a Persistent Client**

In [2]:
import chromadb

client = chromadb.PersistentClient(path="vector_store")



In [3]:
# Returns the current time in nanoseconds. Useful for checking that the client is still connected.

client.heartbeat()

1741185893226106000

In [4]:
# # Empties and completely resets the database. ⚠️ This is destructive and not reversible.

# client.reset() 

# # Note: reset is disabled by default. To enable it, either
# # set `allow_reset` to `True` in your Settings(), or
# # set `ALLOW_RESET=TRUE` in your environment variables.

## **Create a Collection**

**Parameters**
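- **name**: The name that identifies the collection.
- **embedding_function**: The function used to embed documents and queries. By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model.
- **metadata={"hnsw:space": "cosine"}**: Sets the distance metric used for similarity search. The default is L2 (squared Euclidean) distance.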
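To make the difference between the two metrics concrete, here is a minimal pure-Python sketch. The helper names `squared_l2` and `cosine_distance` are illustrative and not part of the Chroma API; the sketch assumes "L2" means the squared Euclidean distance and "cosine" means one minus the cosine similarity.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: compares direction only, not vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

u = [1.0, 0.0]
v = [2.0, 0.0]  # same direction as u, twice the length
w = [0.0, 1.0]  # orthogonal to u

print(squared_l2(u, v), cosine_distance(u, v))  # nonzero L2, zero cosine distance
print(squared_l2(u, w), cosine_distance(u, w))
```

Cosine distance treats `u` and `v` as identical because they point the same way, while L2 still separates them by length. That is one reason cosine is a common choice for text embeddings.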

In [5]:
collection = client.create_collection(name="my_first_collection")

# Alternatively, use client.get_or_create_collection(), which does not fail if the collection already exists

In [6]:
# returns the number of items in the collection

collection.count()

0

In [7]:
# returns the first 10 items in the collection, as a dict of lists

collection.peek()

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['embeddings', 'metadatas', 'documents']}

In [8]:
# Rename the collection

collection.modify(name="name_modified")

## **Get An Already Existing Collection**

In [9]:
collection = client.get_collection(name="name_modified")

# Alternatively, use client.get_or_create_collection()

## **Adding data to Collection**

In [11]:
import uuid

# Add documents; Chroma embeds them with the collection's embedding function

data = [
    "Apples - High in fiber, support digestion, and promote heart health.",
    "Bananas - Rich in potassium, help regulate blood pressure and muscle function.",
    "Oranges - Packed with vitamin C, boost immunity, and promote skin health.",
    "Blueberries - High in antioxidants, improve brain function and reduce inflammation.",
    "Strawberries - Support heart health and contain anti-aging antioxidants.",
    "Watermelon - Hydrating fruit with lycopene, good for heart and skin health.",
    "Pineapple - Contains bromelain, aids digestion, and reduces inflammation.",
    "Avocado - Loaded with healthy fats, supports brain and heart health.",
    "Papaya - Rich in enzymes for digestion and boosts skin health.",
    "Pomegranate - Full of antioxidants, improves blood circulation, and heart health.",
    "Carrots - High in beta-carotene, improve eye health and skin glow.",
    "Spinach - Rich in iron, good for blood health and energy levels.",
    "Broccoli - Contains sulforaphane, which has anti-cancer properties.",
    "Tomatoes - Packed with lycopene, supports heart health and skin protection.",
    "Bell Peppers - High in vitamin C, boosts immunity, and reduces inflammation.",
    "Cucumber - Hydrating vegetable, aids in digestion, and supports skin health.",
    "Garlic - Has antibacterial properties, supports heart health and immunity.",
    "Ginger - Anti-inflammatory, helps with digestion and nausea relief.",
    "Beets - Improve blood flow, support endurance, and detox the liver.",
    "Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion."
    ]

collection.add(
    documents=data,
    ids=[str(uuid.uuid4()) for _ in range(len(data))]  # one random ID per document (requires `import uuid`)
    # metadatas=[{"key_1": "abc_1", "key_2": "abc_2"}, ...]  # optional; one dict per document
)
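Random `uuid4` IDs change on every run, so re-executing the cell above adds the same documents again under new IDs. One possible alternative, sketched below, is to derive each ID from the document text with `uuid.uuid5`. The `stable_id` helper is hypothetical and the choice of namespace is arbitrary; any fixed namespace would work.

```python
import uuid

def stable_id(text: str) -> str:
    """Derive a repeatable ID from the document text (hypothetical helper)."""
    return str(uuid.uuid5(uuid.NAMESPACE_OID, text))

docs = [
    "Apples - High in fiber, support digestion, and promote heart health.",
    "Bananas - Rich in potassium, help regulate blood pressure and muscle function.",
]
ids = [stable_id(d) for d in docs]

# The same text always maps to the same ID, and distinct texts get distinct IDs.
print(ids == [stable_id(d) for d in docs])
print(len(set(ids)) == len(docs))
```

With IDs like these, `collection.upsert(documents=docs, ids=ids)` would overwrite the existing entries on a re-run rather than grow the collection.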

In [12]:
collection.count()

20

In [13]:
# # You can peek at the first 10 items, including their embeddings

# collection.peek()

## **Search Embeddings**

In [14]:
# results = collection.query(
#     query_texts=["digestion, fiber rich"], 
#     n_results=3
# )

# results

In [15]:
# results = collection.query(
#     query_texts=["vitamin rich"], 
#     n_results=3
# )

# results["documents"]
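`collection.query` returns a dict whose values are nested lists, with one inner list per query text, so `results["documents"][0]` holds the hits for the first query. The sketch below pairs up the fields for one query. `flatten_query_results` is a hypothetical helper, and `mock_results` is a hand-written stand-in with placeholder values, not real output from the collection above.

```python
def flatten_query_results(results, query_index=0):
    """Pair each hit's id, document, and distance for one query text."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    return [
        {"id": i, "document": d, "distance": dist}
        for i, d, dist in zip(ids, docs, dists)
    ]

# Hand-written placeholder mimicking the nesting of a query result
# for a single query text with n_results=2 (values are made up).
mock_results = {
    "ids": [["id-a", "id-b"]],
    "documents": [["doc A", "doc B"]],
    "distances": [[0.25, 0.5]],
}

for hit in flatten_query_results(mock_results):
    print(hit["id"], hit["distance"], hit["document"])
```

Hits are returned in order of increasing distance, so the first entry is the closest match.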