# Building a Vector Database with ChromaDB

This notebook builds a vector database with ChromaDB to store and retrieve the embeddings we generated in the previous step. ChromaDB is a lightweight, embedded vector database that suits retrieval-augmented generation (RAG) applications and doesn't require any external services.

In [1]:
# Import required libraries
import os
import json
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm.notebook import tqdm

# Import ChromaDB
import chromadb
from chromadb.utils import embedding_functions

# For visualization and testing
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

## Define Paths

First, let's define the paths for input embeddings and the vector database.

In [2]:
# Define directories
EMBEDDINGS_DIR = "../data/embeddings"  # Directory with stored embeddings
CHROMA_DIR = "../data/chroma_db"  # Directory to store ChromaDB files

# Paths to the embedding files
EMBEDDINGS_JSON = os.path.join(EMBEDDINGS_DIR, "chunks_with_embeddings.json")
EMBEDDINGS_PKL = os.path.join(EMBEDDINGS_DIR, "chunks_with_embeddings.pkl")
EMBEDDINGS_NPY = os.path.join(EMBEDDINGS_DIR, "embeddings.npy")
METADATA_JSON = os.path.join(EMBEDDINGS_DIR, "metadata.json")

# Create ChromaDB directory if it doesn't exist
os.makedirs(CHROMA_DIR, exist_ok=True)

## Load Embeddings and Metadata

Now we'll load the embeddings and metadata we generated in the previous notebook.

In [3]:
def load_embeddings_and_metadata():
    """
    Load the pre-generated embeddings and metadata from files.
    
    Returns:
        tuple: (embeddings_array, metadata_list, documents_list)
    """
    try:
        # Load the embeddings NumPy array
        embeddings_array = np.load(EMBEDDINGS_NPY)
        print(f"Loaded embeddings array with shape: {embeddings_array.shape}")
        
        # Load the metadata JSON
        with open(METADATA_JSON, "r", encoding="utf-8") as f:
            metadata_list = json.load(f)
        print(f"Loaded metadata for {len(metadata_list)} chunks")
        
        # Extract documents (text content) and clean metadata for ChromaDB
        documents_list = [item["content"] for item in metadata_list]
        
        # ChromaDB metadata must be simple types (string, int, float, bool)
        clean_metadata = []
        for item in metadata_list:
            # Create a clean metadata dict with only simple types
            clean_item = {
                "chunk_id": str(item["chunk_id"]),
                "source": item.get("filename", ""),
                "title": item.get("title", ""),
                "category": item.get("category", ""),
                "section": item.get("section", "")
            }
            clean_metadata.append(clean_item)
            
        return embeddings_array, clean_metadata, documents_list
    
    except FileNotFoundError as e:
        print(f"Error: File not found. {e}")
        print("Please run the embedding generation notebook first.")
        return None, None, None
    except Exception as e:
        print(f"Error loading embeddings: {e}")
        return None, None, None

# Load the embeddings and metadata
embeddings_array, metadata_list, documents_list = load_embeddings_and_metadata()

# Show a sample of the data
if embeddings_array is not None:
    print("\nSample metadata item:")
    print(json.dumps(metadata_list[0], indent=2))
    
    print("\nSample document content (truncated):")
    print(documents_list[0][:200] + "..." if len(documents_list[0]) > 200 else documents_list[0])

Loaded embeddings array with shape: (389, 384)
Loaded metadata for 389 chunks

Sample metadata item:
{
  "chunk_id": "0",
  "source": "education_masters-programs_ms-in-applied-data-science_course-progressions.md",
  "title": "Course Progressions \u2013 DSI",
  "category": "education",
  "section": ""
}

Sample document content (truncated):
## MS in Applied Data Science facet-arrow-down
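
The loader above keeps a fixed set of string fields because, as its comment notes, ChromaDB metadata values must be simple types. If the metadata file ever carried lists or missing values, a more general clean-up step could look like the sketch below. The helper name `sanitize_metadata` and the `tags` field are hypothetical and only serve the illustration.

```python
# Simple types that the loader's comment lists as valid metadata values
ALLOWED_TYPES = (str, int, float, bool)

def sanitize_metadata(item):
    """Return a copy of a metadata dict with only flat, simple-typed values."""
    clean = {}
    for key, value in item.items():
        if value is None:
            clean[key] = ""  # replace missing values with an empty string
        elif isinstance(value, ALLOWED_TYPES):
            clean[key] = value  # already a simple type, keep as-is
        elif isinstance(value, (list, tuple)):
            clean[key] = ", ".join(str(v) for v in value)  # flatten sequences to one string
        else:
            clean[key] = str(value)  # fall back to the string representation
    return clean

# Hypothetical input: an integer id, a list-valued field, and a missing section
example = sanitize_metadata({"chunk_id": 7, "tags": ["core", "online"], "section": None})
print(example)
```

Flattening a list into a comma-separated string loses the ability to filter on individual entries, so this is a trade-off rather than a general recommendation.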


## Initialize ChromaDB

Now let's initialize ChromaDB and create a collection for our embeddings.

In [4]:
def initialize_chroma_db():
    """
    Initialize ChromaDB client and create a collection.
    
    Returns:
        tuple: (chroma_client, chroma_collection)
    """
    try:
        # Create a persistent client
        client = chromadb.PersistentClient(path=CHROMA_DIR)
        print(f"Initialized ChromaDB client with persistent storage at {CHROMA_DIR}")
        
        # Check if our collection already exists and recreate it
        collection_name = "ms_applied_data_science"
        try:
            # Try to get existing collection
            client.get_collection(collection_name)
            # If it exists, delete it to start fresh
            client.delete_collection(collection_name)
            print(f"Deleted existing collection '{collection_name}' to start fresh")
        except Exception:
            # Collection doesn't exist yet
            pass
            
        # Create a new collection with our custom embeddings
        collection = client.create_collection(
            name=collection_name,
            metadata={"description": "University of Chicago MS in Applied Data Science program content"}
        )
        print(f"Created collection '{collection_name}'")
        
        return client, collection
    
    except Exception as e:
        print(f"Error initializing ChromaDB: {e}")
        return None, None

# Initialize ChromaDB
chroma_client, chroma_collection = initialize_chroma_db()

Initialized ChromaDB client with persistent storage at ../data/chroma_db
Created collection 'ms_applied_data_science'


## Add Embeddings to ChromaDB

Now let's add our pre-computed embeddings to the ChromaDB collection.

In [5]:
def add_embeddings_to_chroma(collection, embeddings, metadata, documents):
    """
    Add pre-computed embeddings to ChromaDB collection.
    
    Args:
        collection: ChromaDB collection
        embeddings: NumPy array of embeddings
        metadata: List of metadata dictionaries
        documents: List of text documents (chunk content)
        
    Returns:
        bool: Success status
    """
    if collection is None or embeddings is None:
        return False
    
    try:
        # Create IDs for each document
        ids = [f"chunk_{i}" for i in range(len(documents))]
        
        # Add in batches to avoid memory issues with large datasets
        batch_size = 100
        total_batches = (len(documents) + batch_size - 1) // batch_size
        
        for i in tqdm(range(0, len(documents), batch_size), desc="Adding to ChromaDB", total=total_batches):
            # Get the current batch
            end_idx = min(i + batch_size, len(documents))
            batch_ids = ids[i:end_idx]
            batch_embeddings = embeddings[i:end_idx].tolist()
            batch_documents = documents[i:end_idx]
            batch_metadata = metadata[i:end_idx]
            
            # Add the batch to the collection
            collection.add(
                ids=batch_ids,
                embeddings=batch_embeddings,
                documents=batch_documents,
                metadatas=batch_metadata
            )
        
        print(f"Successfully added {len(documents)} documents with embeddings to ChromaDB")
        return True
    
    except Exception as e:
        print(f"Error adding embeddings to ChromaDB: {e}")
        return False

# Add embeddings to ChromaDB
success = add_embeddings_to_chroma(
    chroma_collection, 
    embeddings_array, 
    metadata_list, 
    documents_list
)

Successfully added 389 documents with embeddings to ChromaDB
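
The batching arithmetic in `add_embeddings_to_chroma` can be checked in isolation. The sketch below reproduces only the slicing logic (no ChromaDB calls) for the 389 chunks and batch size of 100 used above; `batch_ranges` is a hypothetical helper name.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        # The last batch is clipped so it never runs past the end of the data
        yield start, min(start + batch_size, total)

ranges = list(batch_ranges(389, 100))
# Ceiling division, the same formula used for total_batches above
num_batches = (389 + 100 - 1) // 100

print(ranges)
print(num_batches)
```

Each `(start, end)` pair corresponds to one `collection.add` call, so the final batch here holds the remaining 89 chunks.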


## Query the Vector Database

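Now let's create functions to query our vector database. The query text must be embedded with the same model that produced the stored embeddings (384-dimensional here); otherwise the distances between query and chunks are not meaningful.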

In [6]:
def initialize_embedding_model(model_name="all-MiniLM-L6-v2"):
    """
    Initialize the Sentence Transformer model for query embedding.
    
    Args:
        model_name (str): Name of the model to use
        
    Returns:
        SentenceTransformer: Loaded model
    """
    try:
        model = SentenceTransformer(model_name)
        print(f"Loaded Sentence Transformer model: {model_name}")
        return model
    except Exception as e:
        print(f"Error loading model: {e}")
        return None

# Initialize the embedding model for queries
embedding_model = initialize_embedding_model()

Loaded Sentence Transformer model: all-MiniLM-L6-v2


In [7]:
def query_vector_database(collection, query_text, embedding_model, top_k=5, filter_dict=None):
    """
    Query the vector database for similar documents.
    
    Args:
        collection: ChromaDB collection
        query_text (str): The query text
        embedding_model: Model to create query embedding
        top_k (int): Number of results to return
        filter_dict (dict): Optional metadata filters
        
    Returns:
        dict: Query results
    """
    if collection is None or embedding_model is None:
        return None
    
    try:
        # Generate embedding for the query
        query_embedding = embedding_model.encode(query_text).tolist()
        
        # Query the collection
        results = collection.query(
            query_embeddings=[query_embedding],  # A list containing one embedding per query
            n_results=top_k,
            where=filter_dict  # Optional filtering by metadata
        )
        
        return results
    
    except Exception as e:
        print(f"Error querying database: {e}")
        return None

def display_query_results(results, query):
    """
    Display the query results in a readable format.
    
    Args:
        results (dict): Results from ChromaDB query
        query (str): The original query
    """
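    # results["ids"] holds one inner list per query, so check the inner list for emptiness
    if results is None or not results["ids"] or not results["ids"][0]: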
        print("No results found.")
        return
    
    print(f"Query: {query}")
    print("-" * 80)
    
    for i in range(len(results["ids"][0])):
        doc_id = results["ids"][0][i]
        document = results["documents"][0][i]
        metadata = results["metadatas"][0][i]
        # The key may be present but empty if distances were not requested
        distance = results["distances"][0][i] if results.get("distances") else None
        
        print(f"Result #{i+1} - ID: {doc_id}")
        if distance is not None:
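            # 1 - distance is a true cosine similarity only if the collection uses cosine distance.
            # This collection was created without a distance setting, so treat the value as a
            # rough score where higher means closer.
            print(f"Relevance: {1 - distance:.4f}")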
        
        # Display metadata
        source = metadata.get("source", "Unknown")
        title = metadata.get("title", "Unknown")
        category = metadata.get("category", "")
        print(f"Source: {source}")
        print(f"Title: {title}")
        if category:
            print(f"Category: {category}")
        
        # Display content (truncated if long)
        content_preview = document[:300] + "..." if len(document) > 300 else document
        print(f"Content:\n{content_preview}")
        print("-" * 80)
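
The relevance value printed by `display_query_results` is `1 - distance`. How that number relates to cosine similarity depends on the collection's distance metric. The sketch below works through the arithmetic under two assumptions that are not confirmed by this notebook: that the collection uses squared L2 distance (Chroma's default, as far as I know) and that the embeddings are unit-length. The helper names are hypothetical.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_from_squared_l2(distance):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos, so cos = 1 - d/2."""
    return 1.0 - distance / 2.0

# Two unit-length toy vectors standing in for a query and a chunk embedding
a = [0.6, 0.8]
b = [1.0, 0.0]

d = squared_l2(a, b)
cos = cosine_similarity(a, b)

# Under these assumptions the identity recovers the cosine from the distance
assert math.isclose(cosine_from_squared_l2(d), cos)
# ... while 1 - d equals 2*cos - 1, which is lower than the cosine itself
assert math.isclose(1 - d, 2 * cos - 1)
```

If both assumptions hold, `1 - distance` understates the cosine similarity, which would explain relevance values that look low even for good matches.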

## Test Queries

Let's test our vector database with some sample queries.

In [9]:
# List of sample queries to test
test_queries = [
    "What courses are required for the MS in Applied Data Science?",
    #"Who are the faculty members in the program?",
    #"How long does it take to complete the degree?",
    #"What are the prerequisites for the program?",
    #"Tell me about the capstone project requirements"
]

# Test each query
for query in test_queries:
    print(f"\nTesting query: {query}")
    results = query_vector_database(chroma_collection, query, embedding_model, top_k=3)
    display_query_results(results, query)
    print("\n" + "=" * 100 + "\n")


Testing query: What courses are required for the MS in Applied Data Science?
Query: What courses are required for the MS in Applied Data Science?
--------------------------------------------------------------------------------
Result #1 - ID: chunk_276
Relevance: 0.3835
Source: education_masters-programs_ms-in-applied-data-science_online-program.md
Title: Online Program – DSI
Category: education
Content:
You have the flexibility to pursue the [Master’s in Applied Data Science](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/) degree on a part- or full-time schedule. Part-time students enroll in two courses each quarter and take their courses in the evenings or ...
--------------------------------------------------------------------------------
Result #2 - ID: chunk_283
Relevance: 0.3594
Source: education_masters-programs_ms-in-applied-data-science_online-program.md
Title: Online Program – DSI
Category: education
Content:
### Core Courses (6)

You

## Filtering by Metadata

One advantage of ChromaDB is that a similarity search can be restricted to chunks whose metadata matches a filter. Let's try some filtered queries.

In [10]:
# Get unique values for some metadata fields
if chroma_collection is not None:
    # Get all metadata
    all_metadata = chroma_collection.get(include=["metadatas"])["metadatas"]
    
    # Extract unique categories and sources
    categories = set()
    sources = set()
    
    for meta in all_metadata:
        if "category" in meta and meta["category"]:
            categories.add(meta["category"])
        if "source" in meta and meta["source"]:
            sources.add(meta["source"])
    
    print("Available categories:")
    for category in sorted(categories):
        print(f"- {category}")
        
    print("\nAvailable sources (sample):")
    for source in sorted(sources)[:5]:  # Sort first, then show just a few sources
        print(f"- {source}")
    if len(sources) > 5:
        print(f"... and {len(sources) - 5} more sources")

Available categories:
- education

Available sources (sample):
- education_masters-programs_ms-in-applied-data-science_course-progressions.md
- education_masters-programs_ms-in-applied-data-science_instructors-staff.md
- education_masters-programs_ms-in-applied-data-science_online-program.md
- education_masters-programs_ms-in-applied-data-science_tuition-fees-aid.md
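
With these values in hand, a filter can combine several fields. To my understanding, Chroma's `where` argument accepts a single `{field: value}` condition at the top level and expects multiple conditions to be wrapped in an `$and` list of at least two entries. The sketch below builds such a dictionary from keyword arguments; `build_where` is a hypothetical helper, and its output is meant to be passed as `filter_dict` to `query_vector_database`.

```python
def build_where(**fields):
    """Build a metadata filter dict from keyword arguments, skipping empty values."""
    conditions = [
        {key: value}
        for key, value in fields.items()
        if value not in (None, "")
    ]
    if not conditions:
        return None  # no filter at all
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no wrapper
    return {"$and": conditions}  # several conditions must all match

no_filter = build_where()
one_field = build_where(category="education")
two_fields = build_where(
    category="education",
    source="education_masters-programs_ms-in-applied-data-science_online-program.md",
)

print(no_filter)
print(one_field)
print(two_fields)
```

Returning `None` for an empty filter matches the default of `filter_dict=None` in the query function above.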


In [11]:
# Try some filtered queries
filtered_queries = [
    ("What are the core courses?", {"category": "education"}),
    ("Who are the instructors?", {"category": "education"})
]

for query, filter_dict in filtered_queries:
    print(f"\nFiltered query: '{query}' with filter: {filter_dict}")
    results = query_vector_database(
        chroma_collection, 
        query, 
        embedding_model, 
        top_k=3, 
        filter_dict=filter_dict
    )
    display_query_results(results, query)
    print("\n" + "=" * 100 + "\n")


Filtered query: 'What are the core courses?' with filter: {'category': 'education'}
Query: What are the core courses?
--------------------------------------------------------------------------------
Result #1 - ID: chunk_283
Relevance: 0.3576
Source: education_masters-programs_ms-in-applied-data-science_online-program.md
Title: Online Program – DSI
Category: education
Content:
### Core Courses (6)

You will complete 6 core courses toward your [Master’s in Applied Data Science](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/instructors-staff/) degree. Core courses allow you to build your theoretical data science knowledge and practic...
--------------------------------------------------------------------------------
Result #2 - ID: chunk_275
Relevance: 0.1000
Source: education_masters-programs_ms-in-applied-data-science_online-program.md
Title: Online Program – DSI
Category: education
Content:
Read More +
Show Less -

## Curriculum

You will earn

## Create a Simple Retrieval Function

Finally, let's create a simple retrieval function that can be used in a RAG system.

In [13]:
def retrieve_context(query, top_k=5, filter_dict=None):
    """
    Retrieve relevant context for a given query.
    This function can be used as part of a RAG system.
    
    Args:
        query (str): User query
        top_k (int): Number of results to retrieve
        filter_dict (dict): Optional metadata filters
        
    Returns:
        list: List of context strings with source information
    """
    # Query the vector database
    results = query_vector_database(
        chroma_collection,
        query,
        embedding_model,
        top_k=top_k,
        filter_dict=filter_dict
    )
    
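    # results["ids"] holds one inner list per query, so check the inner list for emptiness
    if results is None or not results["ids"] or not results["ids"][0]: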
        return ["No relevant information found."]
    
    # Format the retrieved context
    context_list = []
    for i in range(len(results["ids"][0])):
        document = results["documents"][0][i]
        metadata = results["metadatas"][0][i]
        
        # Add source information
        source = metadata.get("source", "Unknown source")
        title = metadata.get("title", "")
        source_info = f"[Source: {source}" + (f", {title}" if title else "") + "]"
        
        # Add the formatted context
        context_list.append(f"{document}\n{source_info}")
    
    return context_list

# Example of how to use the retrieve_context function
#query = "What are the course requirements for the MS in Applied Data Science program?"
query = "What are the core courses?"
retrieved_context = retrieve_context(query, top_k=3)

print(f"Query: {query}")
print("\nRetrieved context:")
for i, context in enumerate(retrieved_context):
    print(f"\nContext {i+1}:")
    print("-" * 80)
    print(context)
    print("-" * 80)

Query: What are the core courses?

Retrieved context:

Context 1:
--------------------------------------------------------------------------------
### Core Courses (6)

You will complete 6 core courses toward your [Master’s in Applied Data Science](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/instructors-staff/) degree. Core courses allow you to build your theoretical data science knowledge and practice applying this theory to examine real-world business problems.

### Elective Courses (4)
[Source: education_masters-programs_ms-in-applied-data-science_online-program.md, Online Program – DSI]
--------------------------------------------------------------------------------

Context 2:
--------------------------------------------------------------------------------
Read More +
Show Less -

## Curriculum

You will earn UChicago’s Master’s in Applied Data Science by successfully completing 12 courses (6 core, 4 elective, 2 Capstone) and our tailored
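

As a usage illustration, the list returned by `retrieve_context` could be assembled into a prompt for a language model. The sketch below is one possible way to do that, not part of the original pipeline: the function name `build_prompt`, the prompt wording, and the `max_chars` budget are all assumptions.

```python
def build_prompt(query, contexts, max_chars=4000):
    """Join numbered context blocks and the question into a single prompt string."""
    kept, used = [], 0
    for i, ctx in enumerate(contexts, start=1):
        block = f"[{i}] {ctx}"
        # Stop adding context once the (hypothetical) character budget is exceeded
        if used + len(block) > max_chars:
            break
        kept.append(block)
        used += len(block)
    context_text = "\n\n".join(kept)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_text}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Toy context strings in the same shape retrieve_context produces
sample_contexts = ["A\n[Source: a.md]", "B\n[Source: b.md]"]
prompt = build_prompt("What are the core courses?", sample_contexts)
print(prompt)
```

In the notebook, `sample_contexts` would be replaced by `retrieve_context(query, top_k=3)`. A character budget is a crude stand-in for a token limit and would need adjusting for a specific model.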