# Practical Guide to Using ChromaDB for RAG and Semantic Search

This notebook demonstrates how to create and query ChromaDB collections with both SentenceTransformers and OpenAI embeddings.
It covers adding documents, using embedding functions, running similarity queries, and using an OpenAI model to generate answers from retrieved context.



## Setup
Install the minimal dependencies used in this notebook. Run the following cells in order to prepare the environment.

In [None]:
#pip install sentence-transformers openai python-dotenv
#pip install chromadb

## Initialize Chroma Client
The next cells show how to create a Chroma client, both in-memory and persistent. Use `PersistentClient` to keep a local copy of the database on disk.

In [None]:
import chromadb



In [None]:
# Option 1: Using default settings (in-memory)
client = chromadb.Client()

In [None]:
# Option 2: Using persistent local storage (Recommended for real applications)
client = chromadb.PersistentClient(path="./chroma_db")

In [None]:
# Create or get a collection
collection = client.get_or_create_collection(name="my_docs")

## Add Documents to a Collection
The cell below shows a small sample dataset and how to add documents, metadata, and ids to a Chroma collection.
Use meaningful metadata keys to enable filtered searches (the `where` parameter in queries).
Tip: For larger datasets, load documents from files or a database rather than hard-coding them in the notebook.

In [16]:
documents = [
    "COVID-19 vaccines have been shown to reduce severe illness and hospitalization rates.",
    "Diabetes management requires regular monitoring of blood glucose levels and lifestyle adjustments.",
    "Hypertension is a leading risk factor for heart disease and stroke, requiring timely intervention.",
    "MRI scans provide detailed images of soft tissues, helping in the diagnosis of neurological disorders.",
    "Antibiotic resistance is a growing concern in the treatment of bacterial infections.",
    "Telemedicine allows patients to consult doctors remotely, improving access to healthcare services.",
    "Cancer immunotherapy leverages the body's immune system to target and destroy malignant cells.",
    "Mental health disorders, such as depression and anxiety, affect millions and require comprehensive care.",
    "Wearable devices can track heart rate, sleep patterns, and activity levels for personalized health insights.",
    "Genetic testing can help identify individuals at risk for hereditary diseases and guide preventive strategies."
]

# Metadata for filtering/searching
metadatas = [
    {"topic": "vaccines", "source": "WHO"},
    {"topic": "diabetes", "source": "CDC"},
    {"topic": "cardiology", "source": "MayoClinic"},
    {"topic": "diagnostics", "source": "JohnsHopkins"},
    {"topic": "infectious_disease", "source": "CDC"},
    {"topic": "telemedicine", "source": "NIH"},
    {"topic": "oncology", "source": "NCI"},
    {"topic": "mental_health", "source": "WHO"},
    {"topic": "wearables", "source": "Stanford"},
    {"topic": "genetics", "source": "GenomicsInstitute"}
]

# Unique IDs for each document
ids = [f"doc{i+1}" for i in range(len(documents))]

# Example: Adding to a Chroma collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
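
The tip about loading documents from files could look like the sketch below. It assumes one `.txt` file per document in a single folder; the helper name `load_text_documents` and the `source_file` metadata key are hypothetical and not part of Chroma.

```python
from pathlib import Path


def load_text_documents(folder):
    """Read every .txt file in `folder` and build the three parallel lists
    that collection.add() expects. Hypothetical helper, not part of Chroma."""
    documents, metadatas, ids = [], [], []
    # Sort so that ids are assigned in a stable order across runs
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip empty files; there is nothing to embed
        documents.append(text)
        metadatas.append({"source_file": path.name})
        ids.append(path.stem)  # file name without extension serves as the id
    return documents, metadatas, ids
```

The returned lists can then be passed to `collection.add(documents=..., metadatas=..., ids=...)` in the same way as the hard-coded lists above.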


In [17]:
query = "How does cancer immunotherapy work?"

results = collection.query(
    query_texts=[query],
    n_results=2  # top-k results
)

print(results)

{'ids': [['doc7', 'doc1']], 'embeddings': None, 'documents': [["Cancer immunotherapy leverages the body's immune system to target and destroy malignant cells.", 'COVID-19 vaccines have been shown to reduce severe illness and hospitalization rates.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'source': 'NCI', 'topic': 'oncology'}, {'source': 'WHO', 'topic': 'vaccines'}]], 'distances': [[0.29751405119895935, 1.3732755184173584]]}
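
The raw result is a dictionary of nested lists, with one inner list per query. A small helper can make it easier to read. This is a sketch only: `flatten_results` is a hypothetical name, and the sample input is copied from the output printed above.

```python
def flatten_results(results, query_index=0):
    """Turn the nested lists returned by collection.query() into one dict per hit.
    Hypothetical helper; key names mirror the printed result."""
    rows = zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["metadatas"][query_index],
        results["distances"][query_index],
    )
    return [
        {"id": doc_id, "document": doc, "metadata": meta, "distance": dist}
        for doc_id, doc, meta, dist in rows
    ]


# Sample input copied from the query output above
sample = {
    "ids": [["doc7", "doc1"]],
    "documents": [[
        "Cancer immunotherapy leverages the body's immune system to target and destroy malignant cells.",
        "COVID-19 vaccines have been shown to reduce severe illness and hospitalization rates.",
    ]],
    "metadatas": [[
        {"source": "NCI", "topic": "oncology"},
        {"source": "WHO", "topic": "vaccines"},
    ]],
    "distances": [[0.29751405119895935, 1.3732755184173584]],
}

for hit in flatten_results(sample):
    print(f"{hit['id']}  {hit['distance']:.3f}  {hit['metadata']['topic']}")
# doc7  0.298  oncology
# doc1  1.373  vaccines
```

A lower distance means a closer match, which is why the immunotherapy document ranks first.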


In [19]:
query = "How can wearable devices track heart health?"

results = collection.query(
    query_texts=[query],
    n_results=10,
    where={"topic": "cardiology"}  # filter by metadata
)

print(results)

{'ids': [['doc3']], 'embeddings': None, 'documents': [['Hypertension is a leading risk factor for heart disease and stroke, requiring timely intervention.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'topic': 'cardiology', 'source': 'MayoClinic'}]], 'distances': [[1.4739487171173096]]}
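
The `where` filter above matches a single key exactly. Chroma also accepts operators such as `$in` and `$and` for combining conditions. The sketch below builds such a filter from optional criteria; `build_where` is a hypothetical helper, and the `topic` and `source` keys are the metadata keys used in this notebook.

```python
def build_where(topics=None, source=None):
    """Build a Chroma `where` filter from optional criteria.
    Hypothetical helper; returns None when no filter is requested."""
    clauses = []
    if topics:
        # $in matches any of the listed values
        clauses.append({"topic": {"$in": list(topics)}})
    if source:
        # Plain key/value is the shorthand for an equality check
        clauses.append({"source": source})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    # $and expects at least two sub-filters, hence the single-clause branch above
    return {"$and": clauses}


where = build_where(topics=["cardiology", "wearables"])
print(where)
# {'topic': {'$in': ['cardiology', 'wearables']}}
```

The result can be passed straight to a query, for example `collection.query(query_texts=[query], n_results=10, where=where)`.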


## Embeddings with SentenceTransformers
This section demonstrates using a SentenceTransformer embedding function with Chroma.
We use `all-MiniLM-L6-v2` for a balance of speed and accuracy. If you need higher retrieval quality, swap in a larger model such as `all-mpnet-base-v2`, at the cost of slower encoding.

In [20]:
# Import Chroma and embedding function
import chromadb
from chromadb.utils import embedding_functions

# ✅ Define the embedding model (Sentence Transformer)
# all-MiniLM-L6-v2 is a lightweight, fast, general-purpose model (not tuned specifically for medical text)
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# ✅ Initialize a persistent Chroma client (data stored on disk)
client = chromadb.PersistentClient(path="./medical_chroma_db")

# ✅ Create (or get) a collection with embedding support
collection = client.get_or_create_collection(
    name="medical_documents",
    embedding_function=sentence_transformer_ef
)

# ✅ Medical domain documents
documents = [
    "COVID-19 vaccines have been shown to reduce severe illness and hospitalization rates.",
    "Diabetes management requires regular monitoring of blood glucose levels and lifestyle adjustments.",
    "Hypertension is a leading risk factor for heart disease and stroke, requiring timely intervention.",
    "MRI scans provide detailed images of soft tissues, helping in the diagnosis of neurological disorders.",
    "Antibiotic resistance is a growing concern in the treatment of bacterial infections.",
    "Telemedicine allows patients to consult doctors remotely, improving access to healthcare services.",
    "Cancer immunotherapy leverages the body's immune system to target and destroy malignant cells.",
    "Mental health disorders, such as depression and anxiety, affect millions and require comprehensive care.",
    "Wearable devices can track heart rate, sleep patterns, and activity levels for personalized health insights.",
    "Genetic testing can help identify individuals at risk for hereditary diseases and guide preventive strategies."
]

# ✅ Corresponding metadata for filtering/searching
metadatas = [
    {"topic": "vaccines", "source": "WHO"},
    {"topic": "diabetes", "source": "CDC"},
    {"topic": "cardiology", "source": "MayoClinic"},
    {"topic": "diagnostics", "source": "JohnsHopkins"},
    {"topic": "infectious_disease", "source": "CDC"},
    {"topic": "telemedicine", "source": "NIH"},
    {"topic": "oncology", "source": "NCI"},
    {"topic": "mental_health", "source": "WHO"},
    {"topic": "wearables", "source": "Stanford"},
    {"topic": "genetics", "source": "GenomicsInstitute"}
]

# ✅ Unique IDs for each document
ids = [f"doc{i+1}" for i in range(len(documents))]

# ✅ Add data to the Chroma collection
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

print("✅ Medical documents successfully added to Chroma!")


✅ Medical documents successfully added to Chroma!


In [22]:
query = "How can wearable devices help with heart health?"

results = collection.query(
    query_texts=[query],
    n_results=2
)

print("🔍 Query Results:")
print(results)

🔍 Query Results:
{'ids': [['doc9', 'doc3']], 'embeddings': None, 'documents': [['Wearable devices can track heart rate, sleep patterns, and activity levels for personalized health insights.', 'Hypertension is a leading risk factor for heart disease and stroke, requiring timely intervention.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'source': 'Stanford', 'topic': 'wearables'}, {'source': 'MayoClinic', 'topic': 'cardiology'}]], 'distances': [[0.30843448638916016, 0.640539288520813]]}


In [23]:
from sentence_transformers import SentenceTransformer

# Use the same model used for collection embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How can wearable devices help with heart health?"
# encode returns a numpy array for a list of inputs; convert to list of lists
query_embedding = model.encode([query]).tolist()

results = collection.query(
    query_embeddings=query_embedding,  # pass embeddings directly
    n_results=2
)

print("🔍 Query Results (manual embedding):")
print(results)


🔍 Query Results (manual embedding):
{'ids': [['doc9', 'doc3']], 'embeddings': None, 'documents': [['Wearable devices can track heart rate, sleep patterns, and activity levels for personalized health insights.', 'Hypertension is a leading risk factor for heart disease and stroke, requiring timely intervention.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'source': 'Stanford', 'topic': 'wearables'}, {'source': 'MayoClinic', 'topic': 'cardiology'}]], 'distances': [[0.30843454599380493, 0.6405394077301025]]}
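
The distances from the manual embedding match the earlier output to about seven decimal places, as expected when the same model produces both sets of vectors. To interpret the numbers, the sketch below assumes two things that are not stated in this notebook: that the collection uses Chroma's default squared-L2 distance, and that the embeddings are unit length. Under those assumptions the distance `d` and the cosine similarity are related by `d = 2 - 2 * cos`. The vectors here are toy values, not real embeddings.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy unit-length vectors standing in for real embeddings (assumption)
u = [1.0, 0.0]
v = [0.6, 0.8]

d = squared_l2(u, v)
cos = cosine_similarity(u, v)

# For unit-length vectors the two quantities agree up to rounding
print(d, 2 - 2 * cos)
```

If both assumptions hold, the top distance of about 0.308 above would correspond to a cosine similarity of roughly 0.846.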


## Using OpenAI embeddings (manual)
The following examples use OpenAI's embeddings API. Before running them, set your API key in the environment (for example in a `.env` file) and never commit secrets to version control.
Example: run `export OPEN_AI_KEY="your_key_here"` in your shell, or add `OPEN_AI_KEY=your_key_here` to a `.env` file and load it with python-dotenv. The variable name must match the one read by `os.getenv` in the code.
If you publish this notebook, replace any real keys with placeholders and add instructions for obtaining keys.
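
A missing key otherwise surfaces only as an authentication error on the first API call. One way to fail earlier is a small check like the sketch below; `require_env` is a hypothetical helper, not part of python-dotenv or the OpenAI client.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error.
    Hypothetical helper for failing early when the key is missing."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it or add it to your .env file before calling the API."
        )
    return value
```

It would be called after `load_dotenv()`, for example `openai_api_key = require_env("OPEN_AI_KEY")`.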

In [25]:
# Save medical documents to Chroma using OpenAI embeddings manually
import chromadb
import os
from dotenv import load_dotenv
import openai

# ✅ Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPEN_AI_KEY")

# ✅ Set up OpenAI API key
openai.api_key = openai_api_key

# ✅ Custom function to generate embeddings using OpenAI API
def generate_embedding(text: str, model: str = "text-embedding-3-small") -> list:
    """
    Generate a vector embedding for a given text using OpenAI embeddings API.
    """
    response = openai.embeddings.create(
        input=text,
        model=model
    )
    embedding_vector = response.data[0].embedding
    return embedding_vector

# ✅ Initialize a persistent Chroma client
client = chromadb.PersistentClient(path="./medical_chroma_openai_db")

# ✅ Create or get collection (embeddings are supplied manually, so no embedding function is passed)
collection = client.get_or_create_collection(
    name="medical_documents_custom"
)

# ✅ Medical domain documents
documents = [
    "COVID-19 vaccines have been shown to reduce severe illness and hospitalization rates.",
    "Diabetes management requires regular monitoring of blood glucose levels and lifestyle adjustments.",
    "Hypertension is a leading risk factor for heart disease and stroke, requiring timely intervention.",
    "MRI scans provide detailed images of soft tissues, helping in the diagnosis of neurological disorders.",
    "Antibiotic resistance is a growing concern in the treatment of bacterial infections.",
    "Telemedicine allows patients to consult doctors remotely, improving access to healthcare services.",
    "Cancer immunotherapy leverages the body's immune system to target and destroy malignant cells.",
    "Mental health disorders, such as depression and anxiety, affect millions and require comprehensive care.",
    "Wearable devices can track heart rate, sleep patterns, and activity levels for personalized health insights.",
    "Genetic testing can help identify individuals at risk for hereditary diseases and guide preventive strategies."
]

# ✅ Corresponding metadata
metadatas = [
    {"topic": "vaccines", "source": "WHO"},
    {"topic": "diabetes", "source": "CDC"},
    {"topic": "cardiology", "source": "MayoClinic"},
    {"topic": "diagnostics", "source": "JohnsHopkins"},
    {"topic": "infectious_disease", "source": "CDC"},
    {"topic": "telemedicine", "source": "NIH"},
    {"topic": "oncology", "source": "NCI"},
    {"topic": "mental_health", "source": "WHO"},
    {"topic": "wearables", "source": "Stanford"},
    {"topic": "genetics", "source": "GenomicsInstitute"}
]

# ✅ Unique IDs
ids = [f"doc{i+1}" for i in range(len(documents))]

# ✅ Generate embeddings manually
embeddings = [generate_embedding(doc) for doc in documents]

# ✅ Add data to Chroma collection with precomputed embeddings
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids,
    embeddings=embeddings
)

print("✅ Medical documents successfully added to Chroma (custom embeddings)!")

# ✅ Example query embedding
query = "How can wearable devices help with heart health?"
query_embedding = generate_embedding(query)

# ✅ Perform similarity search using the query embedding
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=2
)

print("🔍 Query Results:")
print(results)


✅ Medical documents successfully added to Chroma (custom embeddings)!
🔍 Query Results:
{'ids': [['doc9', 'doc3']], 'embeddings': None, 'documents': [['Wearable devices can track heart rate, sleep patterns, and activity levels for personalized health insights.', 'Hypertension is a leading risk factor for heart disease and stroke, requiring timely intervention.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'topic': 'wearables', 'source': 'Stanford'}, {'topic': 'cardiology', 'source': 'MayoClinic'}]], 'distances': [[0.5157197117805481, 1.3174545764923096]]}
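
The cell above makes one API request per document. The embeddings endpoint also accepts a list of strings as `input`, so larger datasets can be embedded in batches. The sketch below is one possible shape: `embed_in_batches` and `openai_embed_batch` are hypothetical names, and the batch size of 100 is an arbitrary illustrative value, not a documented limit.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` by calling `embed_batch` on slices of at most `batch_size` items.
    `embed_batch` takes a list of strings and returns one vector per string."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        vectors.extend(embed_batch(chunk))
    return vectors


def openai_embed_batch(chunk, model="text-embedding-3-small"):
    """Send one API request for the whole chunk (requires openai.api_key to be set)."""
    import openai  # imported lazily so embed_in_batches works without the package

    response = openai.embeddings.create(input=chunk, model=model)
    # Sort by index so the vectors line up with the input order
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

With these helpers, `embeddings = embed_in_batches(documents, openai_embed_batch)` would replace the per-document list comprehension. Passing the embedding call in as a parameter also makes the batching logic easy to test with a stub.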


## Using ChromaDB with an LLM as Part of a RAG System

In [27]:
from openai import OpenAI
client_llm = OpenAI(api_key=openai_api_key)

query = "How can wearable devices help with heart health?"

query_embedding = generate_embedding(query)

results = collection.query(query_embeddings=[query_embedding], n_results=3)
context = "\n".join(results["documents"][0])

prompt = f"""
Answer the following question using the context below.

Context:
{context}

Question: {query}
"""

response = client_llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)

Wearable devices can help with heart health by tracking key metrics such as heart rate and activity levels, which provide personalized health insights. This data allows individuals to monitor their cardiovascular health closely, enabling timely interventions for issues like hypertension, a major risk factor for heart disease and stroke. Additionally, these devices can help encourage a more active lifestyle and better sleep patterns, contributing to overall heart health management.
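
The prompt above joins the retrieved passages with newlines. A slightly more structured variant numbers each passage and tells the model what to do when the context is insufficient. This is a sketch under assumptions: `build_rag_prompt` is a hypothetical helper and the instruction wording is only one possible choice.

```python
def build_rag_prompt(question, passages):
    """Format retrieved passages as a numbered context block followed by the question.
    Hypothetical helper; the instruction wording is only one possible choice."""
    if not passages:
        context = "(no relevant passages found)"
    else:
        context = "\n".join(
            f"[{i}] {text}" for i, text in enumerate(passages, start=1)
        )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )
```

It can be used as `prompt = build_rag_prompt(query, results["documents"][0])` before the chat completion call. Numbering the passages makes it easier to trace which document supported a given part of the answer.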
