# Building a Vector Database with ChromaDB (v3)

This notebook builds a persistent ChromaDB vector database from the header-based chunks and embeddings produced by `generate_embeddings_v3.ipynb`. It loads the pre-computed embeddings from the configured directory, adds them to a ChromaDB collection in batches, and runs sample queries to check semantic search and retrieval.

In [1]:
# Import required libraries
import os
import json
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm.notebook import tqdm # For progress bars in Jupyter
import logging 
import time # For timing the main execution
from datetime import datetime

In [2]:
# Import ChromaDB
import chromadb

In [3]:
# For query embedding and testing
from sentence_transformers import SentenceTransformer
import torch # For checking CUDA availability for SentenceTransformer

In [4]:
# Configuration

# Directory with the latest stored embeddings
EMBEDDINGS_BASE_DIR = "../data/embeddings"
EMBEDDINGS_SUB_DIR = "header_chunks_all_MiniLM_L6_v2_20250512_231425"  # Update this to the directory name you want to use
EMBEDDINGS_DIR = os.path.join(EMBEDDINGS_BASE_DIR, EMBEDDINGS_SUB_DIR)

# File containing chunks with their pre-computed embeddings and metadata
CHUNKS_WITH_EMBEDDINGS_FILE = os.path.join(EMBEDDINGS_DIR, "chunks_with_embeddings.json") 

# Directory to store ChromaDB persistent files
CHROMA_DB_DIR = "../data/chroma_db/header_chunks"  

# Output directory for log files
LOGS_DIR = "../logs" 

# Name for the ChromaDB collection
COLLECTION_NAME = "uchicago_ms_applied_ds_header_chunks"

# Sentence Transformer model name (must be the same as used for generating document embeddings)
QUERY_EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"

# Batch size for adding documents to ChromaDB
CHROMA_ADD_BATCH_SIZE = 100 # Adjust based on your system's memory
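
The query model has to match the model that produced the stored embeddings, otherwise query and document vectors are not comparable. One cheap sanity check is to compare vector lengths before building the collection. The sketch below is illustrative only: `check_embedding_dims` is a hypothetical helper that is not part of this notebook, and the default of 384 assumes the output size of `all-MiniLM-L6-v2`.

```python
def check_embedding_dims(embeddings, expected_dim=384):
    """Return the indices of vectors whose length differs from expected_dim.

    expected_dim=384 assumes all-MiniLM-L6-v2; change it for other models.
    """
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dim]


# Toy data standing in for loaded embeddings (real vectors come from the JSON file)
toy_embeddings = [[0.0] * 384, [0.0] * 384, [0.0] * 383]
bad_indices = check_embedding_dims(toy_embeddings)
print(bad_indices)  # [2]
```

An empty result means every vector has the expected length; a matching length does not prove the same model was used, so it only catches the obvious mismatches.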

In [5]:
# Create output directories if they don't exist
Path(CHROMA_DB_DIR).mkdir(parents=True, exist_ok=True)
Path(LOGS_DIR).mkdir(parents=True, exist_ok=True)

In [6]:
# Setup logging
log_file = Path(LOGS_DIR) / f"vector_database_v3_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(module)s - %(funcName)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file, encoding='utf-8'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

In [7]:
# load_data_for_chroma below is the fixed version of the loader from vector_database_v3.py.
# The key fix is that None values in metadata are handled properly before the data reaches ChromaDB.

def load_data_for_chroma(chunks_with_embeddings_filepath_str):
    """
    Load chunks, their pre-computed embeddings, and metadata from the JSON file.
    Prepares data in the format required by ChromaDB.
    
    FIXED: This version properly handles None values in metadata
    """
    chunks_file = Path(chunks_with_embeddings_filepath_str)
    if not chunks_file.exists():
        logger.error(f"Data file not found: {chunks_file}. Please run the embedding generation script first.")
        return None, None, None, None

    try:
        with open(chunks_file, "r", encoding="utf-8") as f:
            loaded_data = json.load(f)
        
        logger.info(f"Loaded {len(loaded_data)} items from {chunks_file}")
        
        ids_list = []
        embeddings_list = []
        documents_list = []
        metadatas_list = []
        
        for i, item in enumerate(tqdm(loaded_data, desc="Preparing data for ChromaDB")):
            # Using page_content as the primary field, with fallback to content or text
            doc_content = item.get('page_content', item.get('content', item.get('text', '')))
            embedding_vector = item.get('embedding')
            
            # Initialize metadata_dict - start with an empty dict
            metadata_dict = {}
            
            # If metadata exists, copy it first
            if 'metadata' in item and isinstance(item['metadata'], dict):
                for k, v in item['metadata'].items():
                    if v is None:
                        # Convert None to string "None"
                        metadata_dict[k] = "None"
                    elif isinstance(v, (str, int, float, bool)):
                        metadata_dict[k] = v
                    else:
                        # Convert any other types to string
                        metadata_dict[k] = str(v)
            
            # Add other fields from the item to metadata, except embedding and content
            for key, value in item.items():
                if key not in ['embedding', 'page_content', 'content', 'text', 'metadata']:
                    if value is None:
                        metadata_dict[key] = "None"  # Convert None to string
                    elif isinstance(value, (str, int, float, bool)):
                        metadata_dict[key] = value
                    else:
                        metadata_dict[key] = str(value)  # Convert other types to string
            
            if not doc_content or embedding_vector is None:
                logger.warning(f"Skipping item at index {i} due to missing content or embedding.")
                continue
            
            # Use chunk_id (or id) as the document ID when available, cast to str because ChromaDB expects string IDs
            doc_id = str(item.get('chunk_id', item.get('id', f"doc_{i}")))
            
            ids_list.append(doc_id)
            embeddings_list.append(embedding_vector)
            documents_list.append(doc_content)
            metadatas_list.append(metadata_dict)
        
        if not ids_list:
            logger.error("No valid data to load for ChromaDB.")
            return None, None, None, None

        logger.info(f"Prepared {len(ids_list)} items for ChromaDB.")
        logger.info(f"Sample ID: {ids_list[0]}")
        logger.info(f"Sample Document (start): {documents_list[0][:100]}...")
        logger.info(f"Sample Metadata: {metadatas_list[0]}")
        
        return ids_list, embeddings_list, documents_list, metadatas_list
    except json.JSONDecodeError as e:
        logger.error(f"Error parsing JSON file {chunks_file}: {e}")
        return None, None, None, None
    except Exception as e:
        logger.error(f"Unexpected error loading data for ChromaDB: {e}", exc_info=True)
        return None, None, None, None
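
The loader coerces every metadata value to a string, number, or boolean before handing it to ChromaDB. The same rule can be isolated in a small function to see what it does to a single record. This is a minimal sketch: `sanitize_metadata` is a hypothetical name introduced here for illustration, and the example record is made up.

```python
def sanitize_metadata(raw):
    """Coerce a metadata dict to scalar values (str, int, float, bool)."""
    clean = {}
    for key, value in raw.items():
        if value is None:
            # Missing values become the literal string "None"
            clean[key] = "None"
        elif isinstance(value, (str, int, float, bool)):
            clean[key] = value
        else:
            # Lists, dicts, and other objects are stored as their string form
            clean[key] = str(value)
    return clean


# Made-up record, not taken from the real dataset
example = {"title": "Sample page", "subsection": None, "header_level": 2, "tags": ["a", "b"]}
cleaned = sanitize_metadata(example)
print(cleaned)
```

Note that a missing value and the literal text "None" become indistinguishable after this step, so downstream code that checks `if subsection:` will treat the placeholder as a real value.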


In [8]:
# ChromaDB client and collection initialization
def initialize_chroma_client_and_collection(db_path_str, collection_name_str):
    """
    Initialize ChromaDB client and create/get a collection.
    
    Args:
        db_path_str (str): Path to the ChromaDB persistent directory
        collection_name_str (str): Name of the collection to create or get
        
    Returns:
        tuple: (client, collection) instances or (None, None) if error
    """
    try:
        logger.info(f"Initializing ChromaDB client with persistence path: {db_path_str}")
        client = chromadb.PersistentClient(path=db_path_str)
        
        # Get the collection if it already exists, otherwise create it.
        # get_or_create_collection avoids a broad except that could hide unrelated errors.
        collection = client.get_or_create_collection(
            name=collection_name_str,
            metadata={"description": "UChicago MS in Applied Data Science documents"}
        )
        logger.info(f"Collection '{collection_name_str}' ready with {collection.count()} documents")
        
        return client, collection
    except Exception as e:
        logger.error(f"Error initializing ChromaDB client or collection: {e}", exc_info=True)
        return None, None

In [9]:
# Add data to ChromaDB collection
def add_data_to_chroma_collection(collection_instance, ids_list, embeddings_list, documents_list, metadatas_list, batch_size=CHROMA_ADD_BATCH_SIZE):
    """
    Add data to the ChromaDB collection in batches.
    
    Args:
        collection_instance: ChromaDB collection instance
        ids_list: List of document IDs
        embeddings_list: List of embedding vectors
        documents_list: List of document texts
        metadatas_list: List of document metadata dictionaries
        batch_size: Number of documents to add in each batch
        
    Returns:
        bool: True if successful, False otherwise
    """
    try:
        n_items = len(ids_list)
        logger.info(f"Adding {n_items} documents to collection in batches of {batch_size}")
        
        for i in tqdm(range(0, n_items, batch_size), desc="Adding batches to ChromaDB"):
            batch_end_idx = min(i + batch_size, n_items)
            
            current_batch_ids = ids_list[i:batch_end_idx]
            current_batch_embeddings = embeddings_list[i:batch_end_idx]
            current_batch_documents = documents_list[i:batch_end_idx]
            current_batch_metadatas = metadatas_list[i:batch_end_idx]
            
            collection_instance.add(
                ids=current_batch_ids,
                embeddings=current_batch_embeddings,
                documents=current_batch_documents,
                metadatas=current_batch_metadatas
            )
            
        logger.info(f"Successfully added all {n_items} documents to ChromaDB collection")
        return True
    except Exception as e:
        logger.error(f"Error adding data to ChromaDB: {e}", exc_info=True)
        return False
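
The batching above is plain index slicing. A minimal sketch of the index arithmetic, using the 203 items and batch size of 100 from this run (`batch_ranges` is a hypothetical helper, not part of the notebook):

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


ranges = list(batch_ranges(203, 100))
print(ranges)  # [(0, 100), (100, 200), (200, 203)]
```

Three batches result, the last holding only three items, which is why clamping the end index with `min` matters less for slicing (Python slices tolerate overshoot) than for logging accurate batch sizes.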

In [10]:
# Initialize query embedding model
def initialize_query_embedding_model(model_name=QUERY_EMBEDDING_MODEL_NAME):
    """
    Initialize the Sentence Transformer model for encoding query texts.
    
    Args:
        model_name (str): Name of the pre-trained model to use
        
    Returns:
        SentenceTransformer: Initialized model or None if error
    """
    try:
        logger.info(f"Loading query embedding model: {model_name}")
        model = SentenceTransformer(model_name)
        
        # Check if we can use GPU
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        model.to(device)
        logger.info(f"Query embedding model loaded and using device: {device}")
        
        return model
    except Exception as e:
        logger.error(f"Error loading query embedding model '{model_name}': {e}", exc_info=True)
        return None

In [11]:
# Query function for the ChromaDB collection
def query_chroma_collection(collection_instance, query_text, embedding_model, top_n_results=5, metadata_filter_dict=None):
    """
    Query the ChromaDB collection using the given query text.
    
    Args:
        collection_instance: ChromaDB collection instance
        query_text (str): The query text to search for
        embedding_model: SentenceTransformer model for encoding the query
        top_n_results (int): Number of results to return
        metadata_filter_dict (dict, optional): Dictionary for filtering results by metadata fields
        
    Returns:
        dict: Query results from ChromaDB
    """
    try:
        # Generate embedding for the query text and convert the NumPy array to a plain list for ChromaDB
        query_embedding = embedding_model.encode(query_text).tolist()
        
        # Query the collection
        results = collection_instance.query(
            query_embeddings=[query_embedding],
            n_results=top_n_results,
            where=metadata_filter_dict,  # Optional metadata filter
            include=["documents", "metadatas", "distances"]
        )
        
        return results
    except Exception as e:
        logger.error(f"Error querying ChromaDB: {e}", exc_info=True)
        return None
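
A ChromaDB query response holds one inner list per query embedding, so a single-query call still returns nested lists such as `results["documents"][0]`. The sketch below flattens the hits for the first query into records. It runs against a hand-written dictionary shaped like such a response, not a real collection, and `first_query_hits` is a hypothetical helper.

```python
def first_query_hits(results):
    """Flatten the hits for the first query into a list of dicts."""
    if not results:
        return []
    # Each field is a list of lists; fall back to [[]] when a field is absent or empty
    docs = (results.get("documents") or [[]])[0]
    metas = (results.get("metadatas") or [[]])[0]
    dists = (results.get("distances") or [[]])[0]
    return [
        {"document": doc, "metadata": meta or {}, "distance": dist}
        for doc, meta, dist in zip(docs, metas, dists)
    ]


# Hand-written stand-in shaped like a query response (values are made up)
mock_results = {
    "ids": [["chunk_1", "chunk_2"]],
    "documents": [["first text", "second text"]],
    "metadatas": [[{"title": "A"}, None]],
    "distances": [[0.12, 0.34]],
}
hits = first_query_hits(mock_results)
print(len(hits))  # 2
```

A query with no matches yields `{"documents": [[]], ...}`, whose outer list is non-empty and therefore truthy, so emptiness has to be tested on the inner list.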

In [12]:
# Display query results
def display_query_search_results(results, query_text):
    """
    Display the results of a ChromaDB query in a readable format.
    
    Args:
        results (dict): Results from a ChromaDB query
        query_text (str): The original query text
    """
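    # An empty result set comes back as [[]], which is truthy, so check the inner list as well
    if not results or not results.get('documents') or not results['documents'][0]: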
        logger.info(f"No results found for query: '{query_text}'")
        return
    
    docs = results.get('documents', [[]])[0]
    metadatas = results.get('metadatas', [[]])[0]
    distances = results.get('distances', [[]])[0]
    
    logger.info(f"\n--- Results for query: '{query_text}' ---")
    
    for i, (doc, meta, dist) in enumerate(zip(docs, metadatas, distances)):
        logger.info(f"\nResult #{i+1} (Distance: {dist:.4f}):")
        
        # Display metadata
        if meta:
            title = meta.get('document_title', meta.get('title', 'Unknown'))
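            # The loader stores missing values as the string "None"; treat those as empty
            section = meta.get('section', '')
            subsection = meta.get('subsection', '')
            section = '' if section == 'None' else section
            subsection = '' if subsection == 'None' else subsection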
            logger.info(f"Document: {title}")
            if section:
                logger.info(f"Section: {section}")
            if subsection:
                logger.info(f"Subsection: {subsection}")
        
        # Display document snippet
        doc_snippet = doc[:300] + "..." if len(doc) > 300 else doc
        logger.info(f"Content: {doc_snippet}")
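
The distances logged here are easier to interpret once related to cosine similarity. The sketch below rests on two assumptions that the notebook does not state: that the collection uses squared Euclidean distance (ChromaDB's default `l2` space), and that the stored and query embeddings are unit-normalized. Under those assumptions, distance `d` and cosine similarity are related by `d = 2 - 2 * cos`. All function names are hypothetical.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_squared_l2(distance):
    """Valid only for unit-length vectors, where d = 2 - 2 * cos."""
    return 1.0 - distance / 2.0


# Two unit vectors chosen so the arithmetic is easy to follow
a = [1.0, 0.0]
b = [0.6, 0.8]
d = squared_l2(a, b)
recovered = cosine_from_squared_l2(d)
direct = cosine_similarity(a, b)
print(round(d, 6), round(recovered, 6), round(direct, 6))
```

If either assumption fails, the conversion does not hold and the raw distance should only be used for ranking results against each other.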

In [13]:
# Function to build a context for retrieval-augmented generation
def build_retrieval_context(query_text, top_k_docs, collection_inst, query_embed_model_inst):
    """
    Build a retrieval context by querying the ChromaDB collection.
    This context can be used for RAG applications.
    
    Args:
        query_text (str): The query text
        top_k_docs (int): Number of documents to retrieve
        collection_inst: ChromaDB collection instance
        query_embed_model_inst: SentenceTransformer model instance
        
    Returns:
        list: List of retrieved context strings
    """
    results = query_chroma_collection(
        collection_inst, 
        query_text, 
        query_embed_model_inst, 
        top_n_results=top_k_docs
    )
    
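    # An empty result set comes back as [[]], which is truthy, so check the inner list as well
    if not results or not results.get('documents') or not results['documents'][0]: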
        logger.warning(f"No documents retrieved for query: '{query_text}'")
        return []
    
    # Extract documents and their metadata
    retrieved_contexts = []
    docs = results.get('documents', [[]])[0]
    metadatas = results.get('metadatas', [[]])[0]
    
    for i, (doc, meta) in enumerate(zip(docs, metadatas)):
        # Create a formatted context string with metadata and content
        title = meta.get('document_title', meta.get('title', 'Unknown'))
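        # The loader stores missing values as the string "None"; treat those as empty
        section = meta.get('section', '')
        subsection = meta.get('subsection', '')
        section = '' if section == 'None' else section
        subsection = '' if subsection == 'None' else subsection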
        
        context_header = f"Source {i+1}: {title}"
        if section:
            context_header += f" | Section: {section}"
        if subsection:
            context_header += f" | Subsection: {subsection}"
            
        context_str = f"{context_header}\n\n{doc}\n"
        retrieved_contexts.append(context_str)
    
    return retrieved_contexts
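
The context strings returned by `build_retrieval_context` are meant to be placed in a prompt for a language model. One possible way to assemble such a prompt is sketched below. The function name, separator, and prompt wording are all assumptions made for illustration, and the example contexts are made up.

```python
def build_rag_prompt(question, contexts):
    """Join retrieved context strings and a question into a single prompt."""
    if contexts:
        context_block = "\n---\n".join(contexts)
    else:
        # Make the absence of sources explicit instead of sending an empty block
        context_block = "(no context retrieved)"
    return (
        "Answer the question using only the sources below.\n\n"
        f"{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up contexts in the same "Source N: title" format used above
example_contexts = [
    "Source 1: Page A | Section: Intro\n\nFirst passage.\n",
    "Source 2: Page B\n\nSecond passage.\n",
]
prompt = build_rag_prompt("What are the core courses?", example_contexts)
print(prompt)
```

Keeping the `Source N` headers in the prompt lets the model's answer be traced back to specific chunks.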

In [14]:
# Main execution
db_creation_start_time = time.time()

logger.info("Starting vector database creation process")

# 1. Load the data
ids, embeddings, documents, metadatas = load_data_for_chroma(CHUNKS_WITH_EMBEDDINGS_FILE)

if ids:
    # 2. Initialize ChromaDB
    chroma_client, chroma_collection_instance = initialize_chroma_client_and_collection(
        CHROMA_DB_DIR, COLLECTION_NAME
    )
    
    if chroma_collection_instance:
        # 3. Add data to ChromaDB
        success = add_data_to_chroma_collection(
            chroma_collection_instance, ids, embeddings, documents, metadatas
        )
        
        if success:
            logger.info(f"ChromaDB collection '{COLLECTION_NAME}' created successfully with {len(ids)} documents")
            
            # 4. Initialize the embedding model for queries
            query_model = initialize_query_embedding_model(QUERY_EMBEDDING_MODEL_NAME)
            
            if query_model:
                # 5. Test some queries
                sample_test_queries = [
                    "What are the core courses for the MS in Applied Data Science?",
                    "Tell me about the faculty specializing in machine learning.",
                    "What are the admission requirements?",
                    "How is the capstone project structured?"
                ]
                for test_q in sample_test_queries:
                    results = query_chroma_collection(chroma_collection_instance, test_q, query_model, top_n_results=3)
                    display_query_search_results(results, test_q)
                
                # Test with metadata filter (example)
                if metadatas: # Check if metadatas list is not empty
                    sample_category = metadatas[0].get('category')
                    if sample_category:
                        logger.info(f"\n--- Testing Query with Metadata Filter (category: {sample_category}) ---")
                        filtered_query = "capstone project"
                        filter_criteria = {"category": sample_category} 
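                        # Example of a more complex filter: {"$and": [{"category": "education"}, {"section_level": "main"}]}
                        # Metadata filters support operators such as $eq, $ne, and $in; $contains applies to where_document, not where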
                        
                        filtered_results = query_chroma_collection(
                            chroma_collection_instance, filtered_query, query_model, 
                            top_n_results=2, metadata_filter_dict=filter_criteria
                        )
                        display_query_search_results(filtered_results, f"{filtered_query} (filtered by category: {sample_category})")
                    else:
                        logger.info("Skipping metadata filter test as no sample category found in the first chunk.")
                
                # 6. Test the context retrieval function
                logger.info("\n--- Testing Context Retrieval Function ---")
                retrieved_contexts = build_retrieval_context(
                    "What are the core courses?", 
                    top_k_docs=2, 
                    collection_inst=chroma_collection_instance, # Pass instances
                    query_embed_model_inst=query_model           # Pass instances
                )
                for i, ctx in enumerate(retrieved_contexts):
                    logger.info(f"\nRetrieved Context #{i+1}:\n{ctx}")
            else:
                logger.error("Query embedding model could not be initialized. Query testing skipped.")
        else:
            logger.error("Failed to add data to ChromaDB collection. Process halted.")
    else:
        logger.error("ChromaDB collection could not be initialized. Process halted.")
else:
    logger.error(f"Failed to load data from {CHUNKS_WITH_EMBEDDINGS_FILE}. Process halted.")

db_creation_end_time = time.time()
elapsed_processing_time = db_creation_end_time - db_creation_start_time
logger.info(f"--- Vector Database Process Completed in {elapsed_processing_time:.2f} seconds ---")

2025-05-12 23:21:58,385 - INFO - 914924386 - <module> - Starting vector database creation process
2025-05-12 23:21:58,436 - INFO - 2989996848 - load_data_for_chroma - Loaded 203 items from ..\data\embeddings\header_chunks_all_MiniLM_L6_v2_20250512_231425\chunks_with_embeddings.json


2025-05-12 23:21:58,448 - INFO - 2989996848 - load_data_for_chroma - Prepared 203 items for ChromaDB.
2025-05-12 23:21:58,449 - INFO - 2989996848 - load_data_for_chroma - Sample ID: chunk_0
2025-05-12 23:21:58,450 - INFO - 2989996848 - load_data_for_chroma - Sample Document (start): ## Your Career Success

Take the next step to advance your career with UChicago’s MS in Applied Data...
2025-05-12 23:21:58,452 - INFO - 2989996848 - load_data_for_chroma - Sample Metadata: {'title': 'In-Person Program – DSI', 'original_url': 'https://datascience.uchicago.edu/education/masters-programs/in-person-program', 'category': 'education', 'date': '2025-05-04', 'source_file': 'C:\\Users\\alen.pavlovic\\Documents\\GitLab\\gen-ai-midterm-project\\data\\markdown_clean_final\\education_masters-programs_in-person-program.md', 'filename': 'education_masters-programs_in-person-program.md', 'section_level': 'main', 'section': 'Your Career Success', 'subsection': 'None', 'header_level': 2, 'header_text': 'You

2025-05-12 23:21:59,079 - INFO - 1570821046 - add_data_to_chroma_collection - Successfully added all 203 documents to ChromaDB collection
2025-05-12 23:21:59,081 - INFO - 914924386 - <module> - ChromaDB collection 'uchicago_ms_applied_ds_header_chunks' created successfully with 203 documents
2025-05-12 23:21:59,084 - INFO - 2479983000 - initialize_query_embedding_model - Loading query embedding model: all-MiniLM-L6-v2
2025-05-12 23:21:59,086 - INFO - SentenceTransformer - __init__ - Use pytorch device_name: cpu
2025-05-12 23:21:59,087 - INFO - SentenceTransformer - __init__ - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
2025-05-12 23:22:02,151 - INFO - 2479983000 - initialize_query_embedding_model - Query embedding model loaded and using device: cpu


2025-05-12 23:22:02,320 - INFO - 4042634767 - display_query_search_results - 
--- Results for query: 'What are the core courses for the MS in Applied Data Science?' ---
2025-05-12 23:22:02,322 - INFO - 4042634767 - display_query_search_results - 
Result #1 (Distance: 0.4343):
2025-05-12 23:22:02,323 - INFO - 4042634767 - display_query_search_results - Document: In-Person Program – DSI
2025-05-12 23:22:02,325 - INFO - 4042634767 - display_query_search_results - Section: By and For Data Science Innovators
2025-05-12 23:22:02,327 - INFO - 4042634767 - display_query_search_results - Subsection: Core Courses (6)
2025-05-12 23:22:02,328 - INFO - 4042634767 - display_query_search_results - Content: ### Core Courses (6)

You will complete six core courses toward your [Master’s in Applied Data Science](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/) degree. Core courses allow you to build your theoretical data science knowledge and practice applying this

2025-05-12 23:22:02,390 - INFO - 4042634767 - display_query_search_results - 
--- Results for query: 'Tell me about the faculty specializing in machine learning.' ---
2025-05-12 23:22:02,391 - INFO - 4042634767 - display_query_search_results - 
Result #1 (Distance: 0.7109):
2025-05-12 23:22:02,392 - INFO - 4042634767 - display_query_search_results - Document: Igor Yakushin, PhD
2025-05-12 23:22:02,393 - INFO - 4042634767 - display_query_search_results - Content: People / MS Instructors

# Igor Yakushin, PhD

Instructor; Applied Scientist

PhD in Theoretical Physics, MS in Computer Science. Currently – Applied Scientist at Amazon. Previous jobs: Computational Scientist at the Argonne National Laboratory, Scientist at LIGO project of California Institute of T...
2025-05-12 23:22:02,394 - INFO - 4042634767 - display_query_search_results - 
Result #2 (Distance: 0.7915):
2025-05-12 23:22:02,395 - INFO - 4042634767 - display_query_search_results - Document: Faculty, Instructors, Staff – DSI


2025-05-12 23:22:02,431 - INFO - 4042634767 - display_query_search_results - 
--- Results for query: 'What are the admission requirements?' ---
2025-05-12 23:22:02,432 - INFO - 4042634767 - display_query_search_results - 
Result #1 (Distance: 0.7672):
2025-05-12 23:22:02,433 - INFO - 4042634767 - display_query_search_results - Document: How to Apply – DSI
2025-05-12 23:22:02,433 - INFO - 4042634767 - display_query_search_results - Section: Master’s in Applied Data Science Application Requirements
2025-05-12 23:22:02,434 - INFO - 4042634767 - display_query_search_results - Subsection: English Language Requirement
2025-05-12 23:22:02,434 - INFO - 4042634767 - display_query_search_results - Content: ### English Language Requirement

Applicants to the Master’s in Applied Data Science program who do not meet the English Language Proficiency criteria must submit proof of English language proficiency.

Minimum scores for the Master’s in Applied Data Science program: TOEFL, 102 (no subscore re

2025-05-12 23:22:02,463 - INFO - 4042634767 - display_query_search_results - 
--- Results for query: 'How is the capstone project structured?' ---
2025-05-12 23:22:02,463 - INFO - 4042634767 - display_query_search_results - 
Result #1 (Distance: 0.6060):
2025-05-12 23:22:02,464 - INFO - 4042634767 - display_query_search_results - Document: In-Person Program – DSI
2025-05-12 23:22:02,465 - INFO - 4042634767 - display_query_search_results - Section: By and For Data Science Innovators
2025-05-12 23:22:02,466 - INFO - 4042634767 - display_query_search_results - Subsection: Capstone (2)
2025-05-12 23:22:02,466 - INFO - 4042634767 - display_query_search_results - Content: ### Capstone (2)

The required [Capstone Project](https://datascience.uchicago.edu/capstone-projects/) is completed over two quarters and covers research design, implementation, and writing. Full-time students start their Capstone Project in their third quarter. Part-time students generally begin th...
2025-05-12 23:22:02,4

2025-05-12 23:22:02,542 - INFO - 4042634767 - display_query_search_results - 
--- Results for query: 'capstone project (filtered by category: education)' ---
2025-05-12 23:22:02,543 - INFO - 4042634767 - display_query_search_results - 
Result #1 (Distance: 0.7351):
2025-05-12 23:22:02,544 - INFO - 4042634767 - display_query_search_results - Document: In-Person Program – DSI
2025-05-12 23:22:02,545 - INFO - 4042634767 - display_query_search_results - Section: By and For Data Science Innovators
2025-05-12 23:22:02,546 - INFO - 4042634767 - display_query_search_results - Subsection: Capstone (2)
2025-05-12 23:22:02,546 - INFO - 4042634767 - display_query_search_results - Content: ### Capstone (2)

The required [Capstone Project](https://datascience.uchicago.edu/capstone-projects/) is completed over two quarters and covers research design, implementation, and writing. Full-time students start their Capstone Project in their third quarter. Part-time students generally begin th...
2025-05-12

2025-05-12 23:22:02,576 - INFO - 914924386 - <module> - 
Retrieved Context #1:
Source 1: Online Program – DSI | Section: Curriculum | Subsection: Core Courses (6)

### Core Courses (6)

You will complete 6 core courses toward your [Master’s in Applied Data Science](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/instructors-staff/) degree. Core courses allow you to build your theoretical data science knowledge and practice applying this theory to examine real-world business problems.

2025-05-12 23:22:02,577 - INFO - 914924386 - <module> - 
Retrieved Context #2:
Source 2: In-Person Program – DSI | Section: By and For Data Science Innovators | Subsection: Core Courses (6)

### Core Courses (6)

You will complete six core courses toward your [Master’s in Applied Data Science](https://datascience.uchicago.edu/education/masters-programs/ms-in-applied-data-science/) degree. Core courses allow you to build your theoretical data science knowledge and pra