# Vector Stores: Utilizing Chroma for Persistent Vector Storage

This Jupyter Notebook explores **Chroma**, an AI-native open-source vector database designed for developer productivity. Chroma offers a convenient way to store, manage, and query vector embeddings, making it an excellent choice for building RAG applications and semantic search systems.

We will demonstrate how to:
* Build an in-memory Chroma vector database.
* Query the database for similarity.
* Persist the Chroma database to disk for later use.
* Perform similarity searches with relevance scores.
* Integrate Chroma as a LangChain `Retriever`.

## What You Will Learn:

1.  **Chroma Basics**: What Chroma is and its role in AI applications.
2.  **Building a Vector DB**: Loading, splitting, and embedding documents into Chroma using Ollama embeddings.
3.  **In-Memory vs. Persistent Chroma**: Differentiating between temporary and disk-saved databases.
4.  **Querying Chroma**: Performing similarity searches to retrieve relevant documents.
5.  **Similarity Search with Scores**: Understanding how to get a relevance score along with retrieved documents.
6.  **Chroma as a LangChain Retriever**: Integrating Chroma into the standard LangChain `Retriever` interface for streamlined RAG pipelines.

## Key Concepts:

* **Chroma**: An open-source vector database that stores embeddings, documents, and metadata. It runs in memory by default and can optionally persist its data to disk.
* **Vector Database**: A specialized database optimized for storing and querying high-dimensional vectors (embeddings), enabling fast similarity searches.
* **`persist_directory`**: A parameter in Chroma that specifies a local path to save the database files, ensuring data persistence across sessions.
* **`embedding_function`**: The embeddings object (e.g., an `OllamaEmbeddings` instance) that the LangChain Chroma wrapper uses to convert text into vector embeddings.
* **Similarity Search**: Finding documents in the vector database that are semantically similar to a given query.
* **Similarity Score**: A numerical value indicating how close a retrieved document is to the query. Its meaning depends on the metric. For cosine similarity, higher values indicate more similarity. For distances (L2 or cosine distance), lower values indicate more similarity. `similarity_search_with_score` returns a distance, so lower is better. Unless a collection is configured otherwise, Chroma uses squared L2 distance, which has no upper bound; this is consistent with the large scores (e.g., 20637.17) in the output later in this notebook. Cosine distance, by contrast, ranges from 0 (same direction) to 2 (opposite).
* **Retriever**: A LangChain component that retrieves relevant documents based on a query, abstracting the underlying vector store logic.
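
To make the two distance metrics mentioned above concrete, here is a minimal sketch in plain Python. It does not call Chroma; the vectors are made-up two-dimensional examples, and the function names are hypothetical helpers. It assumes that Chroma's `l2` space (as far as we know, the default) is a squared Euclidean distance.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: 0 means identical, no upper bound."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 2 = opposite direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up vectors for illustration only.
query = [1.0, 0.0]
same_direction_but_longer = [10.0, 0.0]
opposite = [-1.0, 0.0]

print(squared_l2(query, same_direction_but_longer))       # 81.0
print(cosine_distance(query, same_direction_but_longer))  # 0.0
print(squared_l2(query, opposite))                        # 4.0
print(cosine_distance(query, opposite))                   # 2.0
```

Note how the longer vector is "far" under squared L2 but "identical" under cosine distance. Embeddings that are not length-normalized can therefore produce very large L2 scores even for relevant chunks.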

## Prerequisites:
Before running the code in this notebook, ensure you have:
1.  **Ollama installed and the `llama3` model pulled**: If you haven't already, run `ollama pull llama3` in your terminal. The code below calls `OllamaEmbeddings(model='llama3')`, so that model must be available locally.
2.  **Required Libraries Installed**:
    ```bash
    pip install langchain-chroma langchain-community langchain-text-splitters
    ```

#### Chroma
Chroma is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

https://python.langchain.com/v0.2/docs/integrations/vectorstores/

In [2]:
# Import necessary classes for building a vector database with Chroma
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
import os # Import os for directory checks

# --- 1. Load Documents ---
# Initialize TextLoader to load content from 'speech.txt'
loader = TextLoader("speech.txt")
data = loader.load()

# Display the raw loaded Document objects (each page/file is a Document)
print("--- Loaded Documents ---")
print(data)

# --- 2. Split Documents ---
# Initialize RecursiveCharacterTextSplitter to break down documents into smaller chunks.
# chunk_size: maximum characters allowed in each chunk.
# chunk_overlap: number of characters that overlap between consecutive chunks (set to 0 here).
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

# Display the split document chunks.
print("\n--- Split Document Chunks ---")
print(splits)

# --- 3. Initialize Embeddings ---
# Initialize OllamaEmbeddings. Ensure Ollama is running and the model named below
# ('llama3') has been pulled via `ollama pull llama3`.
embedding = OllamaEmbeddings(model='llama3')

# --- 4. Build an In-Memory Chroma Vector Database ---
# Create an in-memory Chroma vector database from the split documents and the embeddings.
# This database will exist only during the current session.
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

# Display the Chroma database object.
print("\n--- In-Memory Chroma DB Created ---")
print(vectordb)


--- Loaded Documents ---
[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit

### Querying the In-Memory Chroma Database

Once the documents are embedded and stored in Chroma, we can perform similarity searches to retrieve relevant documents based on a query.
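
Conceptually, a similarity search embeds the query, measures its distance to every stored vector, and returns the `k` nearest chunks (LangChain's `similarity_search` uses `k=4` unless told otherwise). The sketch below illustrates that idea in plain Python. It is not how Chroma is implemented: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and the search is brute force rather than index-based.

```python
from collections import Counter

def toy_embed(text, vocabulary):
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_similarity_search(query, chunks, vocabulary, k=4):
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vec = toy_embed(query, vocabulary)
    scored = [(squared_l2(query_vec, toy_embed(c, vocabulary)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return [chunk for _, chunk in scored[:k]]

# Made-up vocabulary and chunks for illustration only.
vocabulary = ["war", "peace", "democracy", "rights"]
chunks = [
    "peace and democracy",
    "the rights of mankind and peace",
    "war without rancor",
]

top = toy_similarity_search("safe for democracy and peace", chunks, vocabulary, k=2)
print(top)  # ['peace and democracy', 'the rights of mankind and peace']
```

A real vector store replaces the word counts with learned embeddings and the linear scan with an approximate nearest-neighbor index, but the input and output have the same shape: a query goes in and the closest chunks come out.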

In [3]:
# Define a query string to search for similar content
query = "What does the speaker believe is the main reason the United States should enter the war?"

# Perform a similarity search on the in-memory Chroma database.
# This embeds the query and finds the most similar document chunks.
docs = vectordb.similarity_search(query)

# Print the page content of the first retrieved document.
print("--- Result from In-Memory DB Similarity Search ---")
print(docs[0].page_content)


--- Result from In-Memory DB Similarity Search ---
The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.


### Saving and Loading Chroma to Disk (Persistence)

A key feature of Chroma is its ability to persist the vector database to disk. This means you can save your indexed embeddings and associated documents, then load them back later without having to re-embed all your data. This is essential for building production-ready RAG applications.
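
One practical consequence is that a notebook can check whether a persisted database already exists before embedding anything. The helper below is a minimal sketch of such a check using only the standard library. The function name `has_persisted_data` and the placeholder file are hypothetical, and the sketch makes no assumption about which files Chroma actually writes.

```python
import os
import tempfile

def has_persisted_data(persist_directory):
    """Return True if the directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and len(os.listdir(persist_directory)) > 0

# Demonstrate the check on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    empty_before = has_persisted_data(tmp)
    # Hypothetical placeholder standing in for real database files.
    with open(os.path.join(tmp, "placeholder.bin"), "w") as f:
        f.write("stand-in for database files")
    filled_after = has_persisted_data(tmp)

print(empty_before, filled_after)  # False True
```

In this notebook, a `True` result would suggest loading with `Chroma(persist_directory=..., embedding_function=...)`, while `False` would suggest building with `Chroma.from_documents(...)`. This avoids re-embedding the same chunks on every run.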

In [15]:
## Saving to the disk
# Define the directory where the Chroma database will be persisted.
persist_directory = "./chroma_db_tmp"

# --- 1. Save to Disk ---
# Create a Chroma vector database, this time specifying a `persist_directory`.
# Chroma will automatically save its data to this location.
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory,
)

print(f"\n--- Chroma DB Saved to Disk at '{persist_directory}' ---")
print(vectordb)

# --- 2. Load from Disk ---
# Load the Chroma database back from the specified `persist_directory`.
# You must provide the same `embedding_function` that was used to create it.
db2 = Chroma(persist_directory=persist_directory, embedding_function=embedding)

# Verify the loaded database by performing a similarity search.
docs_loaded = db2.similarity_search(query)

# Print the page content of the first retrieved document from the loaded database.
print("\n--- Result from Loaded DB Similarity Search ---")
print(docs_loaded[0].page_content)


--- Chroma DB Saved to Disk at './chroma_db_tmp' ---
<langchain_chroma.vectorstores.Chroma object at 0x10c6d98d0>

--- Result from Loaded DB Similarity Search ---
The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.


### Similarity Search with Score

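Chroma allows you to retrieve documents along with a score that indicates how closely each document matches your query. The score is a distance, so **a lower score indicates higher similarity (0 for identical vectors)**. Unless the collection is configured to use cosine distance, Chroma uses squared L2 distance, which is unbounded; that is why the scores in the output below are large rather than falling between 0 and 2.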

In [16]:
# Perform a similarity search and return the documents along with their scores.
# `vectordb` refers to the most recently created store. After the previous cell,
# that is the persistent one saved to `persist_directory`.
docs_with_score = vectordb.similarity_search_with_score(query)

# Print the retrieved documents and their scores.
# Each item in the list is a tuple: (Document object, similarity score).
print("--- Results from similarity_search_with_score ---")
print(docs_with_score)

--- Results from similarity_search_with_score ---
[(Document(id='f37895a9-8695-4224-8975-9c0bff2e3724', metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'), 20637.166015625), (Document(id='cce077dc-17f7-47ab-ad76-2078d64cb13d', metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation f
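
The output above lists the same passage more than once under different ids. One plausible cause (an assumption, not confirmed by the notebook) is that the cell calling `Chroma.from_documents` was run more than once against the same `persist_directory`, adding the chunks again each time. The sketch below shows one way to post-process scored results: keep the lowest distance per distinct text and sort ascending. It uses hypothetical `(text, distance)` pairs in place of real `(Document, score)` tuples; with real results, `doc.page_content` would take the place of `text`.

```python
def dedupe_and_rank(results):
    """Keep the best (lowest-distance) entry per distinct text, sorted ascending."""
    best = {}
    for text, distance in results:
        if text not in best or distance < best[text]:
            best[text] = distance
    return sorted(best.items(), key=lambda item: item[1])

# Hypothetical (text, distance) pairs; the values are made up.
results = [
    ("chunk A", 0.42),
    ("chunk A", 0.42),
    ("chunk B", 0.90),
    ("chunk C", 0.57),
]

ranked = dedupe_and_rank(results)
print(ranked)  # [('chunk A', 0.42), ('chunk C', 0.57), ('chunk B', 0.9)]
```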

### Retriever Option

LangChain's `Retriever` interface provides a standardized way to fetch documents. You can convert any LangChain vector store, including Chroma, into a `Retriever` for easy integration into more complex LangChain chains and agents.
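
The value of the interface is that callers only need one method, `invoke(query)`, regardless of what does the searching. The toy class below sketches that idea in plain Python. `ToyRetriever`, `keyword_search`, and the corpus are hypothetical illustrations and not LangChain's actual implementation.

```python
class ToyRetriever:
    """Hypothetical minimal retriever: hides any search function behind invoke()."""

    def __init__(self, search_fn, k=4):
        self.search_fn = search_fn
        self.k = k

    def invoke(self, query):
        """Return up to k results for the query, whatever the backend is."""
        return self.search_fn(query, self.k)

# Made-up corpus and a naive keyword-overlap search used as the backend.
corpus = ["peace and liberty", "rights of mankind", "no conquest no dominion"]

def keyword_search(query, k):
    query_words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda text: -len(query_words & set(text.split())))
    return ranked[:k]

retriever = ToyRetriever(keyword_search, k=1)
print(retriever.invoke("the rights of mankind"))  # ['rights of mankind']
```

With the real Chroma store, the number of returned documents can be set in a similar spirit, for example `vectordb.as_retriever(search_kwargs={"k": 2})`.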

In [17]:
# Convert the Chroma vector database into a LangChain Retriever.
# This makes it compatible with LangChain's standard retrieval methods.
retriever = vectordb.as_retriever()

# Use the `invoke` method of the retriever with the query.
# This returns a list of Document objects.
docs_from_retriever = retriever.invoke(query)

# Print the page content of the first document retrieved by the retriever.
print("--- Result from Retriever.invoke ---")
print(docs_from_retriever[0].page_content)

--- Result from Retriever.invoke ---
The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.
