# **Introduction to Vector Databases**

## **What's covered?**
1. Vector Stores
    - Why Vector Stores?
    - Vector Store Key Attributes
    - Popular VectorDBs
2. Introduction to Chroma
    - Features
    - Installing Chroma
    - Instantiate LangChain Chroma
    - Adding Documents
3. COMING SOON

## **Vector Stores**

### **Why Vector Stores?**
One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.

<img src="images/langchain_rag.jpg" style="width: 70%; height: auto;">
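
The embed-then-retrieve loop described above can be sketched in plain Python. This is only an illustration: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, and the tiny vocabulary is made up for the example.

```python
import math

# Hypothetical toy vocabulary; a real embedding model produces dense learned vectors.
VOCAB = ["fiber", "digestion", "potassium", "heart", "skin"]

def embed(text):
    """Toy stand-in for an embedding model: count vocabulary words in the text."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

documents = [
    "Apples are high in fiber and support digestion.",
    "Bananas are rich in potassium.",
    "Oranges promote skin health.",
]

# Index time: embed every document once and keep the vectors.
index = [(doc, embed(doc)) for doc in documents]

# Query time: embed the query and rank documents by similarity to it.
query_vector = embed("fiber for digestion")
ranked = sorted(
    index,
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
best_document = ranked[0][0]
print(best_document)
```

A vector store packages the "index time" and "query time" halves of this sketch behind a single interface.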

### **Vector Store Key Attributes**
- Can store large d-dimensional vectors.
- Maps each embedded vector directly to its associated text document.
- Can be queried, running a similarity search (for example, cosine similarity) between a new vector and the stored vectors.
- Can easily add, update, or delete vectors.
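
To make these attributes concrete, here is a minimal in-memory sketch. The class `ToyVectorStore` and its method names are hypothetical and exist only for this illustration; production stores such as Chroma typically use approximate nearest-neighbour indexes rather than the brute-force scan shown here.

```python
import math

class ToyVectorStore:
    """Minimal in-memory sketch: maps an id to a (vector, text) pair."""

    def __init__(self):
        self._items = {}

    def add(self, item_id, vector, text):
        """Store a vector together with the text it was computed from."""
        self._items[item_id] = (list(vector), text)

    # Updating is the same operation as adding under an existing id.
    update = add

    def delete(self, item_id):
        """Remove an entry; unknown ids are ignored."""
        self._items.pop(item_id, None)

    def query(self, vector, k=1):
        """Return the k stored texts whose vectors are most cosine-similar."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norms if norms else 0.0

        scored = [(cosine(vector, vec), text) for vec, text in self._items.values()]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

store = ToyVectorStore()
store.add("a", [1.0, 0.0], "about apples")
store.add("b", [0.0, 1.0], "about bananas")
top_before = store.query([0.9, 0.1], k=1)   # closest to the "a" vector

store.delete("a")
top_after = store.query([0.9, 0.1], k=1)    # only "b" is left
```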

### **Popular VectorDBs**
There are many great vector store options. Review the full integrations list for hosted offerings; the following are free, open-source, and run entirely on your local machine:
- Chroma
- FAISS (Facebook AI Similarity Search)
- Milvus
- PGVector
- Qdrant

Other vector database options include:
- Pinecone
- Astra DB
- Azure Cosmos DB

[Click here](https://docs.langchain.com/oss/python/integrations/vectorstores) to check out all LangChain vector store integration options.

## **Introduction to Chroma**

<img src="images/chromadb_1.png" style="width: 70%; height: auto;">


### **Features**
1. Stores document embeddings and their metadata
2. **Has everything we need for retrieval**
    - Similarity Search
    - Full-text Search (regex supported)
    - Sparse Vector Search (BM25)
    - Metadata filtering
    - Multi-modal retrieval
3. **Free and Open source**
4. **Integrations** 
    - Works with HuggingFace, OpenAI, Google, LangChain and more.

**[Click here](https://www.trychroma.com/) to visit the official website.**

### **Installing Chroma**

```python
! pip install chromadb
! pip install langchain-chroma
```

**[Click here](https://docs.trychroma.com/guides) to read the complete chromadb guide.**

In [1]:
# ! pip install chromadb
# ! pip install langchain-chroma

### **Instantiate LangChain Chroma**

In [2]:
# Initialize a ChromaDB Connection
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

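# Read the API key; the context manager closes the file and strip() drops any trailing newline
with open("keys/.openai_api_key.txt") as f:
    OPENAI_API_KEY = f.read().strip()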
openai_embd_model = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY, 
    model="text-embedding-3-small"
)

# Initialize the database connection.
# If a database already exists in persist_directory, this connects to collection_name in it;
# otherwise a new collection is created.
vectordb = Chroma(
    collection_name="vector_database",
    embedding_function=openai_embd_model,
    persist_directory="./chroma_db_",
)


In [3]:
vectordb.get()

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents'],
 'data': None,
 'metadatas': []}
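
Reading the key from a fixed file path works for a notebook. Outside one, a loader that also checks an environment variable can be convenient. The sketch below is one possible approach: the helper name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is assumed to be the environment variable name you use for the key.

```python
import os
from pathlib import Path

def load_api_key(path="keys/.openai_api_key.txt", env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from the file at `path`."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    # read_text opens and closes the file; strip() drops the trailing newline.
    return Path(path).read_text(encoding="utf-8").strip()
```

In the notebook this would be used as `OPENAI_API_KEY = load_api_key()` before constructing `OpenAIEmbeddings`.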

### **Adding Documents**

In [4]:
# creating the documents
from langchain_core.documents import Document

docs = [
    Document(page_content="Apples - High in fiber, support digestion, and promote heart health."),
    Document(page_content="Bananas - Rich in potassium, help regulate blood pressure and muscle function."),
    Document(page_content="Oranges - Packed with vitamin C, boost immunity, and promote skin health."),
    Document(page_content="Blueberries - High in antioxidants, improve brain function and reduce inflammation."),
    Document(page_content="Strawberries - Support heart health and contain anti-aging antioxidants."),
    Document(page_content="Watermelon - Hydrating fruit with lycopene, good for heart and skin health."),
    Document(page_content="Pineapple - Contains bromelain, aids digestion, and reduces inflammation."),
    Document(page_content="Avocado - Loaded with healthy fats, supports brain and heart health."),
    Document(page_content="Papaya - Rich in enzymes for digestion and boosts skin health."),
    Document(page_content="Pomegranate - Full of antioxidants, improves blood circulation, and heart health."),
    Document(page_content="Carrots - High in beta-carotene, improve eye health and skin glow."),
    Document(page_content="Spinach - Rich in iron, good for blood health and energy levels."),
    Document(page_content="Broccoli - Contains sulforaphane, which has anti-cancer properties."),
    Document(page_content="Tomatoes - Packed with lycopene, supports heart health and skin protection."),
    Document(page_content="Bell Peppers - High in vitamin C, boosts immunity, and reduces inflammation."),
    Document(page_content="Cucumber - Hydrating vegetable, aids in digestion, and supports skin health."),
    Document(page_content="Garlic - Has antibacterial properties, supports heart health and immunity."),
    Document(page_content="Ginger - Anti-inflammatory, helps with digestion and nausea relief."),
    Document(page_content="Beets - Improve blood flow, support endurance, and detox the liver."),
    Document(page_content="Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion.")
]

In [5]:
import uuid

vectordb.add_documents(
    documents=docs, 
    ids=[str(uuid.uuid4()) for _ in docs]  # one random UUID per document
)

['58a4def2-129b-4a50-bf4e-4a534bb7b51a',
 'd3265d79-f8c1-4505-8885-626a658330b5',
 '03f489a9-b7e5-4b7b-aa1d-4470b5a98090',
 '93bfdcf2-7990-4459-b0a2-7d486c109680',
 'e36a06d9-9785-49e1-860d-a5e0f9907f8e',
 '95be451a-e552-492a-9d0e-45986c709c85',
 'c08c11f6-432e-4baf-8952-b208d4c878f7',
 '62f0e01d-d72e-4045-9eeb-ef9f5cd3030f',
 '9655e6a0-9113-4b0d-8a84-076abc8ecf38',
 'e40173e6-e529-4730-95c4-66c22262a157',
 '4f00fd4c-068e-4ec4-8341-1bce7795ce6c',
 '0723a583-6a96-493c-893f-5cbcb6cd9332',
 '03a735bb-9b95-4570-889b-cc747f712876',
 '7a7c5e95-fc59-41f9-a839-08a2b0a41af6',
 'da40361a-a39e-4dab-84d9-fab9a72d300b',
 '43988b8d-9d46-4530-b059-fcb3d0bcd40d',
 'f17c07ce-94be-4153-a20a-b4da66160393',
 'c68734ae-529a-4993-91be-7e3532c226a4',
 '9025e46f-f023-48bf-826b-cbe0d2cbb764',
 '3267d975-e76c-4530-b233-9a051013b9f0']
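
Because `uuid.uuid4()` is random, re-running the cell above would presumably store the same 20 documents again under fresh ids. One possible alternative, sketched below, derives each id from the document text with `uuid.uuid5`, so identical text always maps to the same id. The helper name `content_id` and the choice of `uuid.NAMESPACE_URL` as the namespace are assumptions made for this illustration, not something the notebook uses.

```python
import uuid

def content_id(text, namespace=uuid.NAMESPACE_URL):
    """Deterministic id: the same text always yields the same UUID string."""
    return str(uuid.uuid5(namespace, text))

texts = [
    "Apples - High in fiber, support digestion, and promote heart health.",
    "Bananas - Rich in potassium, help regulate blood pressure and muscle function.",
]

# Two separate "runs" over the same texts produce identical ids.
ids_first_run = [content_id(t) for t in texts]
ids_second_run = [content_id(t) for t in texts]
```

With the notebook's `docs` list, the ids could be built as `[content_id(d.page_content) for d in docs]` and passed to `add_documents`. Whether a repeated id overwrites or is rejected depends on the store, so it is worth checking the document count after a re-run.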

### **Verifying the number of embeddings in vectordb**

In [6]:
print(vectordb.get().keys())

dict_keys(['ids', 'embeddings', 'documents', 'uris', 'included', 'data', 'metadatas'])


In [7]:
print(len(vectordb.get()["ids"]))

20


### **Verifying the Documents stored in VectorDB**


In [8]:
vectordb.get()["documents"]

['Apples - High in fiber, support digestion, and promote heart health.',
 'Bananas - Rich in potassium, help regulate blood pressure and muscle function.',
 'Oranges - Packed with vitamin C, boost immunity, and promote skin health.',
 'Blueberries - High in antioxidants, improve brain function and reduce inflammation.',
 'Strawberries - Support heart health and contain anti-aging antioxidants.',
 'Watermelon - Hydrating fruit with lycopene, good for heart and skin health.',
 'Pineapple - Contains bromelain, aids digestion, and reduces inflammation.',
 'Avocado - Loaded with healthy fats, supports brain and heart health.',
 'Papaya - Rich in enzymes for digestion and boosts skin health.',
 'Pomegranate - Full of antioxidants, improves blood circulation, and heart health.',
 'Carrots - High in beta-carotene, improve eye health and skin glow.',
 'Spinach - Rich in iron, good for blood health and energy levels.',
 'Broccoli - Contains sulforaphane, which has anti-cancer properties.',
 'Tom

### **Verifying the Document Embeddings stored in VectorDB**


In [11]:
print(vectordb.get()["embeddings"])

None


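**`vectordb.get()["embeddings"]` returns `None` because `get()` includes only metadatas and documents by default (see the `included` key in the earlier output). To retrieve embeddings, pass `include=["embeddings"]` to `get()`, or use `vectordb._collection.peek()`.**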

In [12]:
vectordb._collection.peek(1)

{'ids': ['58a4def2-129b-4a50-bf4e-4a534bb7b51a'],
 'embeddings': array([[-0.01265854, -0.01608257, -0.00302629, ..., -0.00368919,
         -0.01505652,  0.00234898]], shape=(1, 1536)),
 'documents': ['Apples - High in fiber, support digestion, and promote heart health.'],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': [None]}

### **Updating and Deleting the Documents**

In [None]:
# vectordb.update_documents(
#     ids=["1"], 
#     documents=[updated_document]
# )

In [None]:
# vectordb.delete(
#     ids=["3"]
# )
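
The commented-out cells above use placeholder ids (`"1"`, `"3"`), whereas the ids stored earlier are UUID strings. One way to find the real id of a document is to search the dictionary returned by `vectordb.get()`. The helper below is a hypothetical sketch that works on any dictionary with parallel `ids` and `documents` lists, which is the shape shown in the outputs above. The sample dictionary is hand-written, with ids copied from the earlier output.

```python
def find_ids(get_result, prefix):
    """Return ids of stored documents whose text starts with `prefix`."""
    return [
        doc_id
        for doc_id, text in zip(get_result["ids"], get_result["documents"])
        if text.startswith(prefix)
    ]

# Hand-written sample in the same shape as the vectordb.get() output.
sample = {
    "ids": [
        "58a4def2-129b-4a50-bf4e-4a534bb7b51a",
        "d3265d79-f8c1-4505-8885-626a658330b5",
    ],
    "documents": [
        "Apples - High in fiber, support digestion, and promote heart health.",
        "Bananas - Rich in potassium, help regulate blood pressure and muscle function.",
    ],
}
banana_ids = find_ids(sample, "Bananas")
```

With the real store, `find_ids(vectordb.get(), "Bananas")` would give the id list to pass to `vectordb.delete(ids=...)` or `vectordb.update_documents(ids=..., documents=[...])`.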

### **Similarity Search**

In [31]:
results = vectordb.similarity_search(
    query="digestion, fiber rich", 
    k=3
)

for doc in results:
    print(f"* {doc.page_content}")

* Apples - High in fiber, support digestion, and promote heart health.
* Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion.
* Papaya - Rich in enzymes for digestion and boosts skin health.


In [33]:
results = vectordb.similarity_search_with_score(
    query="digestion, fiber rich", 
    k=10
)

# Note: the returned score is a distance, so a lower value means a closer match.
for doc, score in results:
    print(f"* [SIMILARITY SCORE={score:.6f}] {doc.page_content}")

* [SIMILARITY SCORE=0.850605] Apples - High in fiber, support digestion, and promote heart health.
* [SIMILARITY SCORE=0.947660] Sweet Potatoes - Rich in fiber and vitamin A, supports vision and digestion.
* [SIMILARITY SCORE=1.141120] Papaya - Rich in enzymes for digestion and boosts skin health.
* [SIMILARITY SCORE=1.213186] Pineapple - Contains bromelain, aids digestion, and reduces inflammation.
* [SIMILARITY SCORE=1.267470] Cucumber - Hydrating vegetable, aids in digestion, and supports skin health.
* [SIMILARITY SCORE=1.297430] Spinach - Rich in iron, good for blood health and energy levels.
* [SIMILARITY SCORE=1.304829] Ginger - Anti-inflammatory, helps with digestion and nausea relief.
* [SIMILARITY SCORE=1.309168] Beets - Improve blood flow, support endurance, and detox the liver.
* [SIMILARITY SCORE=1.351108] Avocado - Loaded with healthy fats, supports brain and heart health.
* [SIMILARITY SCORE=1.355935] Carrots - High in beta-carotene, improve eye health and skin glow.
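
The scores printed above grow as the matches get worse, so they behave like distances rather than similarities. Assuming the collection uses a squared-L2 distance and the embeddings are unit-length (both are assumptions here, not something the notebook verifies), a distance `d` converts to cosine similarity as `1 - d/2`. The function name below is hypothetical.

```python
import math

def squared_l2_to_cosine(distance):
    """For unit-length vectors: ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d/2."""
    return 1.0 - distance / 2.0

# Sanity check of the identity on two hand-made unit vectors.
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]   # 60 degrees away from a
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))

# Top score from the output above, converted under the stated assumptions.
apples_cosine = squared_l2_to_cosine(0.850605)
```

If the collection is configured with a different distance metric, this conversion does not apply.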


Search with filter:

```python
results = vector_store.similarity_search(
    query="thud", k=1, filter={"baz": "bar"}
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
```
```python
*foo[{"baz": "bar"}]
```

Search with score:

```python
results = vector_store.similarity_search_with_score(query="qux", k=1)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")
```
```python
* [SIM=0.000000] qux [{'bar': 'baz', 'baz': 'bar'}]
```