# **Introduction to Vector Databases - ChromaDB**

Chroma is the open-source AI application database.

[Click here](https://www.trychroma.com/) to visit the official website.

<img src="images/chromadb_1.png">

### **Features**
1. **Has everything we need for retrieval**
    - Store document embeddings and their metadata
    - Search embeddings
    - Full-text search
    - Metadata filtering
    - Multi-modal retrieval
2. **Free and Open source**
3. **Integrations** 
    - Works with HuggingFace, OpenAI, Google, LangChain and more.
4. **Simple to Get Started**
    - ```pip install chromadb```
  
### **Syntax**
```python
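import chromadb
from chromadb.utils import embedding_functions

# Initiating a Persistent Chroma Client
client = chromadb.PersistentClient(path="/path/to/save/to")

# Embedding function for the collection (here, Chroma's default model)
emb_fn = embedding_functions.DefaultEmbeddingFunction()

# Create a new collection, or get it if it already exists
collection = client.get_or_create_collection(
    name="my_collection",
    embedding_function=emb_fn,
    metadata={"hnsw:space": "cosine"},
)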

# add embeddings and documents
collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ], 
    metadatas=[{"key_1": "value_1", "key_2": "value_2"}, {"key_1": "value_1", "key_2": "value_2"}],
    ids=["id1", "id2"]
)

# get back similar embeddings
# Note that Chroma will embed query_texts for you and return n_results
results = collection.query(
    query_texts=["This is a query document about hawaii"], 
    n_results=2
)

# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    ids=["id1", "id2"]
)

```

[Click here](https://docs.trychroma.com/guides) to read the complete chromadb guide.

In [15]:
# ! pip install chromadb

## **Initiating a Persistent Client**

In [2]:
import chromadb

client = chromadb.PersistentClient(path="vector_store")



In [3]:
# Returns the current time in nanoseconds. Useful for checking that the client is still connected.

client.heartbeat()

1722593038787657000
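The raw heartbeat is hard to read at a glance. The sketch below converts it to a UTC timestamp, assuming the value is nanoseconds since the Unix epoch. `heartbeat_to_datetime` is a hypothetical helper written for this example, not part of Chroma.

```python
from datetime import datetime, timezone


def heartbeat_to_datetime(heartbeat_ns: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to a UTC datetime.

    Assumption: the heartbeat is an epoch timestamp in nanoseconds.
    Precision is truncated to microseconds, the finest datetime supports.
    """
    seconds, nanos = divmod(heartbeat_ns, 1_000_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )


# Value copied from the heartbeat output above
beat = heartbeat_to_datetime(1722593038787657000)
print(beat.isoformat())
```

Integer arithmetic is used instead of dividing by `1e9` so that the sub-second part is not lost to floating-point rounding.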

In [6]:
# # Empties and completely resets the database. ⚠️ This is destructive and not reversible.

# client.reset() 

## **Create a Collection**

**Parameters**
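- **name**: The unique name of the collection.
- **embedding_function**: The function used to embed documents and queries. By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings.
- **metadata**: Optional collection settings. For example, `{"hnsw:space": "cosine"}` selects cosine distance. The default is `l2` (squared L2 distance).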
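To make the `hnsw:space` options concrete, here is a plain-Python sketch of the three distance definitions as I understand them from the Chroma documentation (`l2`, `cosine`, `ip`). The function names and sample vectors are illustrative only. Chroma computes these internally, so treat this as a reference for interpreting the `distances` field rather than as Chroma's implementation.

```python
import math


def squared_l2(a, b):
    """'l2' space: sum of squared differences (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """'cosine' space: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def inner_product_distance(a, b):
    """'ip' space: 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


# Two unit-length example vectors (made up for illustration)
u = [1.0, 0.0]
v = [0.6, 0.8]

print("squared L2:   ", squared_l2(u, v))
print("cosine:       ", cosine_distance(u, v))
print("inner product:", inner_product_distance(u, v))
```

For unit-length vectors, the squared L2 distance equals twice the cosine distance, so both metrics rank neighbours in the same order. The choice matters mainly when embeddings are not normalized.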

In [7]:
collection = client.create_collection(name="my_first_collection")

# Alternatively, use client.get_or_create_collection(), which does not fail if the collection already exists

In [10]:
# returns the number of items in the collection

collection.count()

0

In [11]:
# returns the first 10 items in the collection

collection.peek()

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['embeddings', 'metadatas', 'documents']}

In [12]:
# Rename the collection

collection.modify(name="new_name")

## **Get An Already Existing Collection**

In [13]:
collection = client.get_collection(name="new_name")

## **Adding Data to a Collection**

In [16]:
# add embeddings and documents
collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ], 
    metadatas=[{"key_1": "abc_1", "key_2": "abc_2"}, {"key_1": "xyz_1", "key_2": "xyz_2"}],
    ids=["id1", "id2"]
)


In [19]:
# # You can peek to check the first 10 documents and their embeddings

# collection.peek()

## **Search Embeddings**

In [20]:
results = collection.query(
    query_texts=["apple"], 
    n_results=1
)

In [21]:
results

{'ids': [['id1']],
 'distances': [[1.6024470811180438]],
 'metadatas': [[{'key_1': 'abc_1', 'key_2': 'abc_2'}]],
 'embeddings': None,
 'documents': [['This is a document about pineapple']],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents', 'distances']}
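Each field in the result is a list of lists: the outer list has one entry per query text, and the inner list has one entry per returned match. A minimal sketch of flattening that structure into one row per match, using the output above as input (`flatten_query_results` is a hypothetical helper, not a Chroma function):

```python
# Result copied from the output above
results = {
    "ids": [["id1"]],
    "distances": [[1.6024470811180438]],
    "metadatas": [[{"key_1": "abc_1", "key_2": "abc_2"}]],
    "embeddings": None,
    "documents": [["This is a document about pineapple"]],
    "uris": None,
    "data": None,
    "included": ["metadatas", "documents", "distances"],
}


def flatten_query_results(results):
    """Turn the nested query result into a flat list of per-match rows."""
    rows = []
    for query_index, ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(ids):
            rows.append(
                {
                    "query_index": query_index,
                    "rank": rank,
                    "id": doc_id,
                    "distance": results["distances"][query_index][rank],
                    "document": results["documents"][query_index][rank],
                    "metadata": results["metadatas"][query_index][rank],
                }
            )
    return rows


rows = flatten_query_results(results)
for row in rows:
    print(f'{row["id"]}  {row["distance"]:.4f}  {row["document"]}')
```

Smaller distances mean closer matches, so within each query the matches are already ordered from most to least similar.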