# Evaluating Chunking Strategies for Code RAG

You've built a RAG system over your codebase. How do you know whether your chunking strategy is working well, and how would you measure whether a different approach would work better?

This notebook demonstrates a systematic approach to evaluating chunking strategies using MLflow's genai evaluation framework. We compare three approaches:

1. **Naive**: Fixed-size character chunks with overlap
2. **Language-aware**: LangChain's `RecursiveCharacterTextSplitter.from_language()` with language-specific separators
3. **AST-based**: Tree-sitter parsing via the `astchunk` library with metadata headers showing file path and class/function hierarchy

The methodology matters more than the specific results. Once you have a repeatable evaluation setup, you can apply it to any RAG decision: chunking strategy, embedding model, k value, reranking approach, and so on.

## 1. Setup & Configuration

In [None]:
%pip install -U -qqq .
%restart_python

In [None]:
import os
import json
import re
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Optional, Literal

import pandas as pd
import numpy as np
import lancedb
import mlflow
from mlflow.entities import Document, SpanType, Feedback
from mlflow.genai.scorers import scorer  # Custom scorer decorator

# DeepEval scorers (RAG-specific metrics)
from mlflow.genai.scorers.deepeval import AnswerRelevancy, Faithfulness

# MLflow built-in scorers
from mlflow.genai.scorers import RetrievalSufficiency, Correctness

from langchain_text_splitters import (
    Language,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)
from astchunk import ASTChunkBuilder

# Configuration
# NOTE: Must resolve() FIRST, then get parents - Path(".").parent returns "." not ".."
CODEBASE_ROOT = Path(".").resolve().parent.parent  # /caspers-kitchens/
EMBEDDING_MODEL = "databricks-gte-large-en"
LLM_MODEL = "databricks-gemini-3-flash"

JUDGE_MODEL = "databricks:/databricks-claude-opus-4-5"  # Format for scorers
JUDGE_MODEL_NAME = "databricks-claude-opus-4-5"  # For direct API calls
K_CHUNKS = 10

print(f"Codebase root: {CODEBASE_ROOT}")
print(f"MLflow version: {mlflow.__version__}")
print(f"Judge model: {JUDGE_MODEL}")

We use the Databricks SDK's `WorkspaceClient` for authentication, which provides an OpenAI-compatible client for calling Foundation Model APIs. This handles token refresh automatically and works both in Databricks notebooks (automatic auth) and locally (via config profile or environment variables).

In [None]:
# Set up Databricks authentication and MLflow tracking
from databricks.sdk import WorkspaceClient

# =============================================================================
# CONFIGURATION - Auto-detects environment
# =============================================================================
ON_DATABRICKS = "DATABRICKS_RUNTIME_VERSION" in os.environ
DATABRICKS_PROFILE = None  # None = auto-detect (notebook auth on DBX, env vars locally)
USE_DATABRICKS_MLFLOW = ON_DATABRICKS  # Log to Databricks workspace when running there
# =============================================================================

# Initialize Databricks client
if DATABRICKS_PROFILE:
    os.environ["DATABRICKS_CONFIG_PROFILE"] = DATABRICKS_PROFILE
    w = WorkspaceClient(profile=DATABRICKS_PROFILE)
else:
    w = WorkspaceClient()

# Set DATABRICKS_HOST for MLflow to resolve "databricks:/model-name" endpoints
os.environ["DATABRICKS_HOST"] = w.config.host

# Create a personal access token for MLflow scorers (2 hour lifetime)
# MLflow's internal HTTP client for "databricks:/model-name" endpoints requires
# DATABRICKS_TOKEN in the environment, even when using OAuth profiles.
token_response = w.tokens.create(
    comment="MLflow chunking evaluation",
    lifetime_seconds=7200  # 2 hours
)
os.environ["DATABRICKS_TOKEN"] = token_response.token_value
print("Created Databricks token (expires in 2 hours)")
print(f"Databricks host: {w.config.host}")

# Disable OpenAI autolog to prevent traces from custom scorer calls
mlflow.openai.autolog(disable=True)

# Configure MLflow tracking location
if USE_DATABRICKS_MLFLOW:
    mlflow.set_tracking_uri("databricks")
    mlflow.set_experiment("/Shared/chunking-strategy-comparison")
    print("MLflow tracking: Databricks workspace")
else:
    # Local MLflow server (run `mlflow server` or `mlflow ui` first)
    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("chunking-strategy-comparison")
    print("MLflow tracking: http://localhost:5000")

# Get OpenAI-compatible client for model serving
client = w.serving_endpoints.get_open_ai_client()

# Cap MLflow evaluation concurrency to avoid 429 (rate limit) errors
os.environ["MLFLOW_GENAI_EVAL_MAX_WORKERS"] = "2"
os.environ["MLFLOW_GENAI_EVAL_MAX_SCORER_WORKERS"] = "2"
print("MLflow rate limiting configured")

## 2. Load Codebase Files

We index Python, Markdown, YAML, and Jupyter notebook files while excluding generated artifacts, dependencies, and documentation that would add noise without helping answer code questions.

In [None]:
# File patterns to include
INCLUDE_EXTENSIONS = {".py", ".md", ".yaml", ".yml", ".ipynb"}

# Directories/files to exclude (glob patterns)
EXCLUDE_PATTERNS = [
    "docs/vendor/*",
    ".dbx-runs/*",
    ".beads/*",
    ".git/*",
    "__pycache__/*",
    "demos/*",
    ".databricks/*",
    ".venv/*",
    "*.egg-info/*",
    ".pytest_cache/*",
    "*/.terraform/*",
    "node_modules/*",
    "dist/*",
    "build/*",
    "dbx_execution/*",
    "dbx_ai_docs/*",
    ".claude/*",
    "CLAUDE.md",
    "*/CLAUDE.md",
]

import fnmatch

def should_exclude(file_path: Path, root: Path) -> bool:
    """Check if a file should be excluded based on glob patterns."""
    # as_posix() keeps "/" separators on Windows so the patterns match everywhere
    rel_path = file_path.relative_to(root).as_posix()
    parts = rel_path.split("/")
    # Every leading prefix of the path, e.g. "a", "a/b", "a/b/c.py"
    prefixes = ["/".join(parts[:i + 1]) for i in range(len(parts))]
    for pattern in EXCLUDE_PATTERNS:
        if fnmatch.fnmatch(rel_path, pattern):
            return True
        # Also match the directory itself ("demos" for the pattern "demos/*")
        dir_pattern = pattern.removesuffix("/*")
        if any(fnmatch.fnmatch(prefix, dir_pattern) for prefix in prefixes):
            return True
    return False

def discover_files(root: Path) -> list[Path]:
    """Discover all files to process."""
    files = []
    for ext in INCLUDE_EXTENSIONS:
        for file_path in root.rglob(f"*{ext}"):
            if not should_exclude(file_path, root):
                files.append(file_path)
    return sorted(files)

# Discover files
files = discover_files(CODEBASE_ROOT)
print(f"Found {len(files)} files to process")

# Show breakdown by extension
ext_counts = {}
for f in files:
    ext = f.suffix.lower()
    ext_counts[ext] = ext_counts.get(ext, 0) + 1
print("\nFiles by extension:")
for ext, count in sorted(ext_counts.items()):
    print(f"  {ext}: {count}")

Jupyter notebooks pose a challenge for code chunking because the raw `.ipynb` file is JSON: code is stored as escaped strings alongside cell metadata, which neither text splitters nor AST parsers handle well. We convert notebooks to Python format with cell markers (`# %% [code]`), which lets all three chunking strategies process notebook code the same way they handle regular Python files.
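
To make the input and output of that conversion concrete, here is a minimal stdlib-only sketch applied to a tiny, hypothetical notebook payload. It mirrors the idea of the conversion using `json` instead of `nbformat`; the function name `to_percent_format` and the sample cells are assumptions for illustration.

```python
import ast
import json

# A minimal, hypothetical notebook payload: one markdown cell, one code cell.
raw = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Loader\n", "Reads orders."]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": ["def load():\n", "    return 1"]},
    ],
})


def to_percent_format(notebook_json: str) -> str:
    """Stdlib-only stand-in for the conversion (cell source may be list or str)."""
    lines = []
    for cell in json.loads(notebook_json)["cells"]:
        source = cell["source"]
        text = ("".join(source) if isinstance(source, list) else source).strip()
        if not text:
            continue
        if cell["cell_type"] == "markdown":
            lines.append("# %% [markdown]")
            lines.extend(f"# {line}" for line in text.split("\n"))
        else:
            lines.append("# %% [code]")
            lines.append(text)
        lines.append("")  # Blank line between cells
    return "\n".join(lines)


converted = to_percent_format(raw)
print(converted)
ast.parse(converted)  # The result is plain Python: markdown survives as comments
```

Because markdown becomes comments, the converted text stays valid Python as long as the code cells themselves are valid (cells containing notebook magics such as `%pip` would not be).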

In [None]:
# Convert notebooks to .py format for better chunking
import nbformat

def notebook_to_python(content: str) -> str:
    """Convert .ipynb to .py format with cell markers.
    
    - Code cells: # %% [code] followed by code
    - Markdown cells: # %% [markdown] with each line commented
    
    This allows AST chunking to parse notebook code cells properly.
    """
    try:
        nb = nbformat.reads(content, as_version=4)
    except Exception as e:
        print(f"Failed to parse notebook: {e}")
        return content  # Return raw content if parsing fails
    
    lines = []
    for cell in nb.cells:
        source = cell.source.strip()
        if not source:
            continue
            
        if cell.cell_type == 'markdown':
            lines.append('# %% [markdown]')
            for line in source.split('\n'):
                lines.append(f'# {line}')
        else:
            lines.append('# %% [code]')
            lines.append(source)
        lines.append('')  # Blank line between cells
    
    return '\n'.join(lines)

# Load file contents, converting notebooks to .py format
file_contents = {}
for file_path in files:
    try:
        content = file_path.read_text(encoding="utf-8")
        if content.strip():
            # Convert notebooks to Python format
            if file_path.suffix.lower() == '.ipynb':
                content = notebook_to_python(content)
            file_contents[file_path] = content
    except Exception as e:
        print(f"Warning: Could not read {file_path}: {e}")

print(f"Loaded {len(file_contents)} files with content")
print("  (Notebooks converted to .py format with cell markers)")

## 3. Implement Chunking Strategies

We compare three chunking strategies that represent increasing levels of code awareness:

1. **Naive** splits text at fixed character intervals, ignoring code structure entirely
2. **Language-aware** uses heuristic separators (class/function boundaries) but enforces strict size limits
3. **AST-based** parses code into a syntax tree and chunks at semantic boundaries, with metadata headers

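All strategies use the same target chunk size (1000 characters) for a fair comparison. The naive and language-aware strategies overlap consecutive chunks by 200 characters; the AST strategy measures overlap in AST nodes (one node) rather than characters.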
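
For the fixed-size case, the chunk size and overlap determine how many windows a file produces. The sketch below works through the arithmetic for a hypothetical 5000-character file; the helper name `window_starts` is an assumption for this example.

```python
import math

chunk_size, overlap = 1000, 200
stride = chunk_size - overlap  # each new window starts 800 characters later


def window_starts(n_chars: int) -> list[int]:
    """Start offsets of fixed-size windows over a text of n_chars characters."""
    return list(range(0, n_chars, stride))


n = 5000  # hypothetical file length in characters
starts = window_starts(n)
print(starts)
print(len(starts), math.ceil(n / stride))  # both count the windows
```

The last window here starts at offset 4800 and covers only 200 characters, all of which already appear in the window that starts at 4000. Small trailing chunks like this are a side effect of a fixed stride.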

### 3.1 Naive Chunking (Baseline)

The naive approach treats code as plain text, splitting at fixed character intervals with overlap. This is the simplest strategy but often cuts through function names, string literals, or logical blocks, losing semantic context.
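
One way to see the cost of ignoring structure is to check whether each chunk still parses as Python. The sketch below uses a deliberately small window (60 characters with 15 of overlap) on a hypothetical snippet so the effect is visible; `fixed_windows` and `parses` are names made up for this example.

```python
import ast

source = '''def add(a, b):
    """Return the sum of a and b."""
    return a + b


def greet(name):
    message = f"Hello, {name}!"
    return message
'''


def fixed_windows(text: str, size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size windows that overlap by `overlap` characters."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def parses(snippet: str) -> bool:
    """True if the snippet is syntactically valid Python on its own."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


chunks = fixed_windows(source, size=60, overlap=15)
for i, chunk in enumerate(chunks):
    print(i, "valid" if parses(chunk) else "broken", repr(chunk[:30]))
```

A chunk that happens to parse is not necessarily meaningful (a fragment such as a bare name is valid Python), so this check is a rough signal rather than a quality metric.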

In [None]:
chunk_size = 1000
overlap = 200

@dataclass
class Chunk:
    """A single chunk with content and metadata."""
    content: str
    file_path: str
    strategy: str
    chunk_index: int
    file_type: str
    char_start: int = 0
    char_end: int = 0

def chunk_naive(content: str, file_path: str, chunk_size: int = chunk_size, overlap: int = overlap) -> list[Chunk]:
    """Fixed-size character chunks with no language awareness.
    
    Uses same chunk_size and overlap as middleground for fair comparison.
    """
    chunks = []
    file_type = Path(file_path).suffix.lower().lstrip(".")
    
    for i in range(0, len(content), chunk_size - overlap):
        chunk_text = content[i:i + chunk_size]
        if chunk_text.strip():
            chunks.append(Chunk(
                content=chunk_text,
                file_path=file_path,
                strategy="naive",
                chunk_index=len(chunks),
                file_type=file_type,
                char_start=i,
                char_end=min(i + chunk_size, len(content))
            ))
    return chunks

# Test naive chunking
test_file = list(file_contents.keys())[0]
test_chunks = chunk_naive(file_contents[test_file], str(test_file.relative_to(CODEBASE_ROOT)))
print(f"Naive chunking test: {len(test_chunks)} chunks from {test_file.name}")

### 3.2 Middleground Chunking (LangChain Language-Aware)

LangChain's `RecursiveCharacterTextSplitter.from_language()` uses language-specific separators (like `\nclass ` and `\ndef ` for Python) to prefer splitting at logical boundaries. Unlike naive chunking, it tries to keep functions and classes intact. However, it still enforces strict size limits, so large functions get split mid-body.
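
The idea of recursive separator splitting can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: it ignores overlap and regex separators, and the function name `recursive_split` and the sample source are assumptions. It tries the coarsest separator first and only falls back to finer ones for pieces that are still too large.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator that applies, recursing into big pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Nothing left to split on: a hard cut is the only way to honor the limit.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, finer, chunk_size)

    parts = text.split(sep)
    # Keep the separator attached to the piece it introduces (e.g. "\ndef ").
    pieces = [parts[0]] + [sep + p for p in parts[1:]]

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        if len(current) + len(piece) <= chunk_size:
            current += piece  # merge small neighbors up to the size limit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "import os\n"
    "\ndef small():\n    return 1\n"
    "\ndef large():\n"
    + "".join(f"    x{i} = {i}\n" for i in range(12))
    + "    return x0\n"
)
chunks = recursive_split(sample, ["\nclass ", "\ndef ", "\n\n", "\n", " "], chunk_size=80)
for c in chunks:
    print(len(c), repr(c[:40]))
```

With an 80-character limit, `small()` stays in one chunk, while `large()` exceeds the limit and is split between statements, which is the mid-body splitting described above.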

In [None]:
# Language-aware splitters using LangChain's from_language()

chunk_size = 1000
overlap = 200

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=chunk_size,
    chunk_overlap=overlap,
)

markdown_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=chunk_size,
    chunk_overlap=overlap,
)

# Generic splitter for YAML and other formats
generic_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=overlap,
    separators=["\n\n", "\n", " ", ""],
)

def chunk_middleground(content: str, file_path: str) -> list[Chunk]:
    """Language-aware chunking using LangChain's from_language() splitters."""
    chunks = []
    suffix = Path(file_path).suffix.lower()
    
    # Notebooks are pre-converted to .py format, treat as Python
    if suffix in {".py", ".ipynb"}:
        split_texts = python_splitter.split_text(content)
        file_type = "python" if suffix == ".py" else "notebook"
    elif suffix == ".md":
        split_texts = markdown_splitter.split_text(content)
        file_type = "markdown"
    elif suffix in {".yaml", ".yml"}:
        split_texts = generic_splitter.split_text(content)
        file_type = "yaml"
    else:
        split_texts = generic_splitter.split_text(content)
        file_type = "unknown"
    
    for i, text in enumerate(split_texts):
        if text.strip():
            chunks.append(Chunk(
                content=text,
                file_path=file_path,
                strategy="middleground",
                chunk_index=i,
                file_type=file_type,
            ))
    return chunks

### 3.3 AST-Based Chunking (ASTChunk)

AST chunking uses tree-sitter to parse code into a syntax tree, then chunks at semantic boundaries (complete functions, classes, or statement blocks). Unlike the LangChain approach, which relies on regex-based separator heuristics, AST parsing works from the actual code structure.

The key difference is **chunk expansion**: each chunk gets a metadata header prepended showing the file path and class/function hierarchy. This context becomes part of the embedding, helping retrieval match queries like "UCState.clear_all()" to the right chunk even if those exact terms don't appear in the chunk body.

```python
'''
utils/calculator.py
class Calculator:
'''
    def add(self, a: int, b: int) -> int:
        return a + b
```

AST chunking may also exceed the target chunk size when necessary to keep a complete function together, trading uniform chunk sizes for semantic coherence.
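
To illustrate the header idea without tree-sitter, the sketch below uses Python's built-in `ast` module as a stand-in. It is not how `astchunk` works internally: it emits one chunk per function or method, ignores size limits and overlap, and the function name `chunk_with_headers` is an assumption for this example.

```python
import ast


def chunk_with_headers(source: str, file_path: str) -> list[str]:
    """Emit one chunk per function or method, prefixed with a context header."""
    chunks = []

    def visit(node: ast.AST, ancestors: list[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                # Descend into classes, remembering the enclosing class name.
                visit(child, ancestors + [f"class {child.name}:"])
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # padded=True keeps the original indentation of the first line.
                body = ast.get_source_segment(source, child, padded=True)
                header = "\n".join([file_path, *ancestors])
                chunks.append(f"'''\n{header}\n'''\n{body}")

    visit(ast.parse(source), [])
    return chunks


source = '''class Calculator:
    def add(self, a: int, b: int) -> int:
        return a + b

    def sub(self, a: int, b: int) -> int:
        return a - b


def main():
    print(Calculator().add(1, 2))
'''

chunks = chunk_with_headers(source, "utils/calculator.py")
print(chunks[0])
```

The first chunk reproduces the shape of the example above: the method body never mentions `Calculator`, but the header carries that context into the text that gets embedded.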

In [None]:
# Language mapping for ASTChunk
# NOTE: .ipynb files were converted to .py format when loaded (Section 2), so treat them as Python
AST_SUPPORTED = {".py": "python", ".ipynb": "python", ".ts": "typescript", ".tsx": "typescript", ".js": "javascript"}

# AST chunking configuration
AST_CHUNK_SIZE = 1000
AST_OVERLAP = 1           # Number of AST nodes to overlap between chunks
AST_EXPANSION = True      # Add filepath + class/function path headers to chunks

def chunk_ast(content: str, file_path: str, max_chunk_size: int = AST_CHUNK_SIZE) -> list[Chunk]:
    """AST-aware chunking using tree-sitter.
    
    Uses chunk_overlap for bidirectional AST node overlap between chunks.
    Uses chunk_expansion=True to prepend metadata headers with filepath and 
    ancestor path (class/function hierarchy) to each chunk for better retrieval.
    
    NOTE: Notebooks are pre-converted to .py format, so they get full AST treatment.
    """
    suffix = Path(file_path).suffix.lower()
    language = AST_SUPPORTED.get(suffix)
    
    if language is None:
        # Fall back to middleground for unsupported languages
        chunks = chunk_middleground(content, file_path)
        for c in chunks:
            c.strategy = "ast_fallback"
        return chunks
    
    try:
        builder = ASTChunkBuilder(
            max_chunk_size=max_chunk_size,
            language=language,
            metadata_template="default",
        )
        
        # Pass overlap, expansion, and filepath metadata to chunkify()
        raw_chunks = builder.chunkify(
            content,
            chunk_overlap=AST_OVERLAP,
            chunk_expansion=AST_EXPANSION,
            repo_level_metadata={"filepath": file_path}
        )
        
        chunks = []
        for i, c in enumerate(raw_chunks):
            chunk_content = c["content"] if isinstance(c, dict) else c.content
            if chunk_content.strip():
                chunks.append(Chunk(
                    content=chunk_content,
                    file_path=file_path,
                    strategy="ast",
                    chunk_index=i,
                    file_type="python" if language == "python" else "code",
                ))
        return chunks
    except Exception as e:
        print(f"AST parsing failed for {file_path}: {e}")
        # Fall back to middleground (not naive) for consistency
        chunks = chunk_middleground(content, file_path)
        for c in chunks:
            c.strategy = "ast_fallback"
        return chunks

### Generate All Chunks

In [None]:
def generate_all_chunks(file_contents: dict, chunker_fn, strategy_name: str) -> list[Chunk]:
    """Generate chunks for all files using a given chunking function."""
    all_chunks = []
    for file_path, content in file_contents.items():
        rel_path = str(file_path.relative_to(CODEBASE_ROOT))
        chunks = chunker_fn(content, rel_path)
        all_chunks.extend(chunks)
    return all_chunks

# Generate chunks for all three strategies
print("Generating naive chunks...")
naive_chunks = generate_all_chunks(file_contents, chunk_naive, "naive")
print(f"  {len(naive_chunks)} chunks")

print("Generating middleground chunks...")
middleground_chunks = generate_all_chunks(file_contents, chunk_middleground, "middleground")
print(f"  {len(middleground_chunks)} chunks")

print("Generating AST chunks...")
ast_chunks = generate_all_chunks(file_contents, chunk_ast, "ast")
print(f"  {len(ast_chunks)} chunks")

In [None]:
# Compare chunk statistics
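def chunk_stats(chunks: list[Chunk], name: str):
    """Print count and size statistics for one strategy's chunks."""
    sizes = [len(c.content) for c in chunks]
    if not sizes:
        # min()/max() raise on an empty list, so report and stop early
        print(f"\n{name}: no chunks")
        return
    print(f"\n{name}:")
    print(f"  Total chunks: {len(chunks)}")
    print(f"  Avg size: {np.mean(sizes):.0f} chars")
    print(f"  Min/Max size: {min(sizes)}/{max(sizes)} chars")
    print(f"  Total chars: {sum(sizes):,}")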

chunk_stats(naive_chunks, "Naive")
chunk_stats(middleground_chunks, "Middleground")
chunk_stats(ast_chunks, "AST-based")

## 4. Create Vector Indexes

To compare chunking strategies fairly, we embed each strategy's chunks using the same embedding model (`databricks-gte-large-en`) and store them in separate vector indexes. We delete and recreate indexes on each run to ensure no stale data affects results.

In [None]:
def get_embeddings(texts: list[str], batch_size: int = 20) -> list[list[float]]:
    """Get embeddings using databricks-gte-large-en."""
    all_embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = w.serving_endpoints.query(
            name=EMBEDDING_MODEL,
            input=batch
        )
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
    
    return all_embeddings

# Test embedding to verify connection
test_emb = get_embeddings(["Hello world"])
print(f"✓ Embedding test passed, dimension: {len(test_emb[0])}")

We use LanceDB as a lightweight local vector store. It requires no server setup and stores data in the local directory, making the notebook self-contained and easy to run anywhere.

In [None]:
import shutil

# LanceDB path: /tmp on Databricks (Workspace FS doesn't support atomic rename), local dir otherwise
LANCEDB_PATH = "/tmp/lancedb" if ON_DATABRICKS else "./lancedb"

def create_lancedb_index(chunks: list[Chunk], strategy_name: str) -> lancedb.table.Table:
    """Create LanceDB index for a chunking strategy.
    
    Always overwrites existing tables to ensure fresh data.
    """
    print(f"Creating index for {strategy_name}...")
    
    # Get embeddings
    texts = [c.content for c in chunks]
    print(f"  Embedding {len(texts)} chunks...")
    embeddings = get_embeddings(texts)
    
    # Prepare data for LanceDB
    data = [
        {
            "content": c.content,
            "file_path": c.file_path,
            "strategy": c.strategy,
            "chunk_index": c.chunk_index,
            "file_type": c.file_type,
            "vector": emb
        }
        for c, emb in zip(chunks, embeddings)
    ]
    
    # Create table (mode="overwrite" ensures fresh data)
    db = lancedb.connect(LANCEDB_PATH)
    table = db.create_table(f"chunks_{strategy_name}", data, mode="overwrite")
    print(f"  Created table with {len(data)} rows")
    
    return table

# Delete existing LanceDB to ensure fresh indexes
if Path(LANCEDB_PATH).exists():
    shutil.rmtree(LANCEDB_PATH)
    print(f"Deleted existing LanceDB at {LANCEDB_PATH}")

In [None]:
# Create indexes for all three strategies
naive_index = create_lancedb_index(naive_chunks, "naive")
middleground_index = create_lancedb_index(middleground_chunks, "middleground")
ast_index = create_lancedb_index(ast_chunks, "ast")

## 5. Define RAG Pipeline

The RAG pipeline has two steps: (1) embed the query and retrieve the top-k most similar chunks from the vector index, (2) pass those chunks as context to an LLM to generate an answer. We use MLflow tracing to capture the full pipeline execution for debugging.
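
The two steps can be sketched without any external services. In the toy example below, the three-dimensional vectors are hand-written stand-ins for real embeddings, cosine similarity is an illustrative choice of metric (the actual index uses whatever metric the vector store is configured with), and `top_k` and `build_prompt` are names made up for this sketch.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int):
    """Step 1: return the k (score, text) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(question: str, retrieved) -> str:
    """Step 2 (input side): join retrieved chunks into the LLM context."""
    context = "\n\n---\n\n".join(text for _, text in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


# Toy index: (chunk text, hand-written stand-in embedding)
index = [
    ("def clear_all(self): ...", [0.9, 0.1, 0.0]),
    ("def add_resource(self, r): ...", [0.6, 0.4, 0.1]),
    ("# README: deployment steps", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded question

hits = top_k(query_vec, index, k=2)
print(build_prompt("What does clear_all do?", hits))
```

Only the chunks that survive step 1 reach the model, which is why chunk boundaries and chunk content have such a direct effect on answer quality.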

In [None]:
def make_rag_pipeline(index: lancedb.table.Table):
    """Create a RAG pipeline function bound to a specific index."""
    
    @mlflow.trace(span_type=SpanType.RETRIEVER)
    def retrieve(query: str) -> list[Document]:
        """Retrieve relevant chunks and return as MLflow Document objects."""
        query_emb = get_embeddings([query])[0]
        results = index.search(query_emb).limit(K_CHUNKS).to_list()
        return [
            Document(
                page_content=r["content"],
                metadata={"doc_uri": r["file_path"], "chunk_id": r["chunk_index"]}
            )
            for r in results
        ]
    
    @mlflow.trace(name="generate", span_type=SpanType.LLM)
    def generate(context: str, question: str) -> str:
        """Generate answer from retrieved context."""
        response = client.chat.completions.create(
            model=LLM_MODEL,
            messages=[{
                "role": "user",
                "content": f"Based on this code context, answer concisely:\n\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"
            }],
        )
        content = response.choices[0].message.content
        # Handle Gemini's structured response format
        if isinstance(content, list):
            return "".join(block.get("text", "") for block in content if isinstance(block, dict))
        return content
    
    @mlflow.trace(span_type=SpanType.CHAIN)
    def predict(question: str) -> str:
        """Full RAG pipeline: retrieve then generate."""
        docs = retrieve(question)
        context = "\n\n---\n\n".join([
            f"File: {d.metadata['doc_uri']}\n{d.page_content}" 
            for d in docs
        ])
        return generate(context, question)
    
    return predict

print("✓ RAG pipeline factory defined")

In [None]:
# Test RAG pipeline
test_predict = make_rag_pipeline(middleground_index)
test_answer = test_predict("What is the purpose of the complaint agent?")
print("Test answer:")
print(test_answer)

## 6. Evaluation

To measure which chunking strategy works best, we need two things: a set of questions with expected answers, and scorers that assess answer quality. MLflow's genai evaluation framework handles the orchestration, running each question through the RAG pipeline and applying scorers to the results.
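
Conceptually, that orchestration is a loop over questions and scorers. The sketch below is a simplified picture under stated assumptions, not MLflow's implementation: it omits tracing, parallelism, and result logging, and `run_eval`, `toy_predict`, and `mentions_expected` are hypothetical names. It assumes, as in this notebook's dataset, that each record's `inputs` dict is passed to the prediction function as keyword arguments.

```python
from statistics import mean
from typing import Callable


def run_eval(data: list[dict], predict_fn: Callable, scorers: dict[str, Callable]) -> dict[str, float]:
    """Run every question through predict_fn and average each scorer's result."""
    results: dict[str, list[float]] = {name: [] for name in scorers}
    for row in data:
        output = predict_fn(**row["inputs"])
        for name, score in scorers.items():
            passed = score(row["inputs"], output, row["expectations"])
            results[name].append(float(passed))
    return {name: mean(values) for name, values in results.items()}


toy_data = [
    {"inputs": {"question": "What framework does the agent use?"},
     "expectations": {"expected_response": "DSPy"}},
    {"inputs": {"question": "What is the default complaint rate?"},
     "expectations": {"expected_response": "0.15"}},
]


def toy_predict(question: str) -> str:
    # Hypothetical stand-in for a RAG pipeline's predict function.
    return "The agent uses DSPy." if "framework" in question else "I don't know."


def mentions_expected(inputs: dict, output: str, expectations: dict) -> bool:
    # Crude string check standing in for an LLM-judged scorer.
    return expectations["expected_response"].lower() in output.lower()


summary = run_eval(toy_data, toy_predict, {"mentions_expected": mentions_expected})
print(summary)
```

Running the same loop once per chunking strategy, with only the prediction function swapped, is what makes the resulting scores comparable.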

### 6.1 Evaluation Dataset

We created 24 questions across different categories (conceptual, code examples, architecture, how-to, troubleshooting, factual) and difficulty levels. The code example questions are the most demanding since they require retrieving specific implementation details.
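
Since every scorer depends on the records having the same shape, a quick sanity check before running an evaluation can save a wasted run. This is a minimal sketch; the helper name `validate_records` and the sample records are assumptions for illustration.

```python
def validate_records(records: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the records look usable."""
    problems = []
    seen_questions = set()
    for i, record in enumerate(records):
        question = record.get("inputs", {}).get("question", "")
        expected = record.get("expectations", {}).get("expected_response", "")
        if not question.strip():
            problems.append(f"record {i}: missing inputs.question")
        elif question in seen_questions:
            problems.append(f"record {i}: duplicate question")
        if not expected.strip():
            problems.append(f"record {i}: missing expectations.expected_response")
        seen_questions.add(question)
    return problems


# Hypothetical records: the second repeats a question, the third lacks an answer.
sample_records = [
    {"inputs": {"question": "How do I validate the bundle?"},
     "expectations": {"expected_response": "Run 'databricks bundle validate'."}},
    {"inputs": {"question": "How do I validate the bundle?"},
     "expectations": {"expected_response": "Use the validate command."}},
    {"inputs": {"question": "What is the default complaint rate?"},
     "expectations": {}},
]

for problem in validate_records(sample_records):
    print(problem)
```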

In [None]:
# Evaluation questions about Casper's Kitchens codebase
# Categories: Business User, Developer, Demo Presenter
# Complexity: Easy (factual), Medium (architecture/how-to), Hard (code examples)

eval_data = [
    # === BUSINESS USER - CONCEPTUAL (4 questions) ===
    
    {"inputs": {"question": "What is Casper's Kitchens and what problem does it solve?"},
     "expectations": {"expected_response": "Casper's Kitchens is a demonstration platform simulating a ghost kitchen food delivery service. It showcases end-to-end Databricks capabilities including data pipelines, AI agents, streaming, and apps."}},
    
    {"inputs": {"question": "What AI agents are included in the project and what business decisions does each one help with?"},
     "expectations": {"expected_response": "Two AI agents: (1) Refund Recommender Agent - recommends whether to approve refund requests, (2) Complaint Agent - triages customer complaints and decides whether to suggest credit or escalate to human review."}},
    
    {"inputs": {"question": "Walk me through the end-to-end flow of how a customer complaint gets processed."},
     "expectations": {"expected_response": "A scheduled job generates synthetic complaints from delivered orders. These complaints are processed by a DSPy ReAct agent that retrieves order context using Unity Catalog functions, then makes a triage decision to either suggest a credit amount or escalate to human review."}},
    
    {"inputs": {"question": "What types of data does the project generate and process?"},
     "expectations": {"expected_response": "The project generates simulated order events (order_created, delivery tracking), refund requests, and customer complaints. Data flows through bronze/silver/gold medallion layers with Spark transformations."}},
    
    # === CODE EXAMPLES - HARD (6 questions) ===
    
    {"inputs": {"question": "Show the deletion order used by UCState.clear_all() and explain why this ordering matters."},
     "expectations": {"expected_response": "deletion_order = ['experiments', 'jobs', 'pipelines', 'endpoints', 'apps', 'warehouses', 'databasecatalogs', 'catalogs', 'databaseinstances']. This order matters because resources may depend on each other - you must delete dependent resources before their dependencies."}},
    
    {"inputs": {"question": "How does the resolve_with_parents function calculate depth for topological sorting of task dependencies?"},
     "expectations": {"expected_response": "Uses a recursive depth() function with memoization: d = 0 if not parents else 1 + max(depth(p) for p in parents). Tasks are sorted by (depth(n), n)."}},
    
    {"inputs": {"question": "Show the JSON recovery logic in parse_agent_response that handles trailing junk after valid JSON."},
     "expectations": {"expected_response": "If direct json.loads fails and '}' is in the string, tries: obj = json.loads(s[: s.rfind('}') + 1]) to trim to the last closing brace."}},
    
    {"inputs": {"question": "What Spark transformations are applied in silver_order_items to normalize order events into item-level rows?"},
     "expectations": {"expected_response": "filter for order_created → from_json to parse body → explode items array → withColumn extended_price = price * qty → to_date for order_day partitioning"}},
    
    {"inputs": {"question": "How does CaspersDataSource.read() differentiate between initial historical catchup and subsequent streaming runs?"},
     "expectations": {"expected_response": "Checks is_initial flag from offset. If True, outputs all data from day 0 to START_DAY + current time without speed multiplier. Otherwise, uses elapsed_real_seconds * speed_multiplier."}},
    
    {"inputs": {"question": "Show the task dependencies for Complaint_Agent_Stream in the databricks.yml complaints target."},
     "expectations": {"expected_response": "depends_on: [Complaint_Agent, Complaint_Generator_Stream]. Runs after both the agent is deployed and the complaint generator stream is running."}},
    
    # === ARCHITECTURE - MEDIUM (5 questions) ===
    
    {"inputs": {"question": "What is the overall structure of the Casper's Kitchens data pipeline?"},
     "expectations": {"expected_response": "The pipeline includes Raw_Data ingestion, Spark_Declarative_Pipeline for Lakeflow processing, Refund_Recommender_Agent, streaming refund processing, Complaint_Agent, complaint streaming, Lakebase integration, Reverse ETL, and Databricks App deployment."}},
    
    {"inputs": {"question": "What is the purpose of the UCState module and how does it track resources?"},
     "expectations": {"expected_response": "UCState manages Databricks resources by storing their state in a Unity Catalog table. It tracks experiments, jobs, pipelines, endpoints, apps, warehouses, and catalogs."}},
    
    {"inputs": {"question": "How does the DLT pipeline transform raw events into gold-level aggregates?"},
     "expectations": {"expected_response": "Raw events flow through bronze (raw storage) → silver (filtered, parsed, exploded to item level with extended_price) → gold (aggregated per-order metrics like revenue, total_qty, total_items)."}},
    
    {"inputs": {"question": "What is the role of the Databricks App in this project?"},
     "expectations": {"expected_response": "The Databricks App (refund-manager) provides a web interface for managing refund requests and viewing complaint data."}},
    
    {"inputs": {"question": "How does the streaming simulation work with time multipliers?"},
     "expectations": {"expected_response": "CaspersDataSource uses a speed_multiplier to accelerate simulation time. Initial run outputs historical data, subsequent runs calculate elapsed_sim_seconds = elapsed_real_seconds * speed_multiplier."}},
    
    # === HOW-TO / DEPLOYMENT - MEDIUM (4 questions) ===
    
    {"inputs": {"question": "How do I deploy the Casper's Kitchens job using the Databricks CLI?"},
     "expectations": {"expected_response": "Use 'databricks bundle deploy' to deploy the bundle, then 'databricks bundle run caspers' to run the main job."}},
    
    {"inputs": {"question": "How do I validate the Databricks bundle configuration?"},
     "expectations": {"expected_response": "Run 'databricks bundle validate' from the project root directory to check the databricks.yml configuration."}},
    
    {"inputs": {"question": "What are the different deployment targets available and when would I use each one?"},
     "expectations": {"expected_response": "Targets include: default (full pipeline with all components), complaints (complaint handling flow only), and free (Databricks Free Edition compatible). Use complaints target for focused demos, free target for free-tier workspaces."}},
    
    {"inputs": {"question": "I want to add a new AI agent to the pipeline. What files would I need to create and how do I wire it into databricks.yml?"},
     "expectations": {"expected_response": "Create a notebook in stages/ with your agent logic, then add a new task in databricks.yml under the tasks section with notebook_task pointing to your notebook and appropriate depends_on for task ordering."}},
    
    # === TROUBLESHOOTING - MEDIUM/HARD (2 questions) ===
    
    {"inputs": {"question": "My job failed with a task dependency error saying 'Complaint_Agent_Stream' couldn't start. What tasks does it depend on and what should I check?"},
     "expectations": {"expected_response": "Complaint_Agent_Stream depends on Complaint_Agent and Complaint_Generator_Stream. Check that both predecessor tasks completed successfully."}},
    
    {"inputs": {"question": "UCState.clear_all() failed partway through cleanup. Looking at the deletion order, what resources might be left behind?"},
     "expectations": {"expected_response": "Deletion order is experiments, jobs, pipelines, endpoints, apps, warehouses, databasecatalogs, catalogs, databaseinstances. Resources after the failure point in this order would remain."}},
    
    # === FACTUAL - EASY (3 questions) ===
    
    {"inputs": {"question": "What framework is used for the complaint agent and what pattern does it implement?"},
     "expectations": {"expected_response": "DSPy framework is used with a ReAct (Reasoning and Acting) pattern."}},
    
    {"inputs": {"question": "What is the default complaint rate parameter for the complaint generator?"},
     "expectations": {"expected_response": "0.15 (15% of orders generate complaints)"}},
    
    {"inputs": {"question": "Can I run this project on a free-tier Databricks workspace?"},
     "expectations": {"expected_response": "Yes, use the 'free' target when deploying the bundle."}},
]

print(f"Evaluation dataset: {len(eval_data)} questions")

### 6.2 Scorers

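We focus on three scorers to assess different aspects of RAG quality. The evaluation run also includes DeepEval's `AnswerRelevancy` and `Faithfulness` as secondary signals.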

- **RetrievalSufficiency** (built-in): Do the retrieved chunks contain enough information to answer the question? This is the key metric for comparing chunking strategies since it measures retrieval quality independent of generation.

- **Correctness** (built-in): Does the answer match the expected response? This is strict and penalizes correct answers that use different phrasing.

- **sufficient_answer** (custom): A lenient scorer we wrote that asks "did the response answer the core question correctly?" without requiring exact phrasing. This gives more actionable signal than strict correctness.

In [None]:
@scorer
def sufficient_answer(
    *,
    inputs: dict,
    outputs: str,
    expectations: dict,
) -> Feedback:
    """Lenient correctness: Is the answer sufficient and grounded?
    
    Unlike the built-in Correctness scorer, which tends to penalize answers
    phrased differently from the expected response, this scorer asks:
    "Did the response answer the core question correctly?"
    """
    question = inputs.get("question", "")
    expected = expectations.get("expected_response", "")

    prompt = f"""Evaluate if the response sufficiently answers the question.
Question: {question}
Expected key points: {expected}
Actual response: {outputs}

Criteria: 
1. Does it address the core question?
2. Are the key facts correct?
3. Is it grounded in the expected information?

Respond with ONLY "yes" or "no" on the first line, then your rationale."""

    response = client.chat.completions.create(
        model=JUDGE_MODEL_NAME,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )

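    result_text = (response.choices[0].message.content or "").strip()
    first_line, _, rationale = result_text.partition("\n")

    # Normalize the verdict so variants like "Yes." or "**yes**" still count as "yes".
    # Anything that does not start with "yes" is scored "no".
    verdict = "yes" if re.sub(r"[^a-z]", "", first_line.lower()).startswith("yes") else "no"

    return Feedback(
        value=verdict,
        rationale=rationale.strip(),
    )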

print("✓ Custom sufficient_answer scorer defined")

In [None]:
def evaluate_strategy(name: str, index: lancedb.table.Table) -> dict:
    """Run MLflow genai evaluation with RAG-focused scorers."""
    print(f"\nEvaluating {name}...")
    
    predict_fn = make_rag_pipeline(index)
    
    with mlflow.start_run(run_name=f"eval_{name}"):
        mlflow.log_param("strategy", name)
        mlflow.log_param("num_chunks", index.count_rows())  # avoids loading the whole table into pandas
        mlflow.log_param("judge_model", JUDGE_MODEL)
        mlflow.log_param("k_chunks", K_CHUNKS)
        
        # Scorers:
        # - RetrievalSufficiency: Do retrieved chunks contain enough info? (KEY METRIC)
        # - Correctness: Is the answer factually correct vs ground truth? (strict)
        # - AnswerRelevancy / Faithfulness: DeepEval metrics for relevance and grounding
        # - sufficient_answer: Lenient correctness - core question answered? (custom)
        results = mlflow.genai.evaluate(
            data=eval_data,
            predict_fn=predict_fn,
            scorers=[
                RetrievalSufficiency(model=JUDGE_MODEL),
                Correctness(model=JUDGE_MODEL),
                AnswerRelevancy(model=JUDGE_MODEL, threshold=0.9),
                Faithfulness(model=JUDGE_MODEL, threshold=0.9),
                sufficient_answer,  # Custom lenient scorer
            ],
        )
        
        print(f"  ✓ Metrics: {results.metrics}")
        
        return {
            "strategy": name,
            "metrics": results.metrics,
            "table": results.tables.get("eval_results"),
        }

print("✓ Evaluation function defined")

In [None]:
# Run evaluations for all three strategies
naive_results = evaluate_strategy("naive", naive_index)
middleground_results = evaluate_strategy("middleground", middleground_index)
ast_results = evaluate_strategy("ast", ast_index)

print("\n✓ All evaluations complete")

## 8. Results Analysis

In [None]:
# Build comparison DataFrame
metrics_to_show = [
    "retrieval_sufficiency/mean",  # MLflow - KEY METRIC
    "correctness/mean",            # MLflow (strict)
    "sufficient_answer/mean",      # Custom (lenient)
    "AnswerRelevancy/mean",        # DeepEval
    "Faithfulness/mean",           # DeepEval
]

comparison_data = []
for result in [naive_results, middleground_results, ast_results]:
    row = {"strategy": result["strategy"]}
    for m in metrics_to_show:
        key = m.split("/")[0]
        row[key] = result["metrics"].get(m, "N/A")
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))
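
Raw scores are easier to read as gaps against a baseline. The sketch below is one way to express each strategy's metrics as percentage-point deltas over naive. The helper name `delta_vs_baseline` and the values in `toy_df` are illustrative assumptions, not part of the original notebook and not our actual results; in practice you would pass `comparison_df` instead.

```python
import pandas as pd

# Placeholder scores for illustration only; these are NOT the notebook's results.
toy_df = pd.DataFrame(
    {
        "strategy": ["naive", "middleground", "ast"],
        "retrieval_sufficiency": [0.50, 0.50, 0.75],
        "correctness": [0.25, 0.25, 0.50],
    }
)


def delta_vs_baseline(df: pd.DataFrame, baseline: str = "naive") -> pd.DataFrame:
    """Return percentage-point deltas of every metric relative to `baseline`."""
    # Missing metrics stored as "N/A" become NaN instead of raising.
    scores = df.set_index("strategy").apply(pd.to_numeric, errors="coerce")
    # Subtracting a row (Series) aligns on column names, so each metric is
    # compared against the baseline's value for that same metric.
    return ((scores - scores.loc[baseline]) * 100).round(1)


deltas = delta_vs_baseline(toy_df)
print(deltas)
```

The baseline row is all zeros by construction, which makes it easy to spot which strategies move which metrics.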

In [None]:
# Summary table
print("RESULTS SUMMARY")
print("=" * 70)
print(f"\n{'Strategy':<15} {'Chunks':<8} {'Retrieval':<12} {'Correct':<10} {'Sufficient':<10}")
print(f"{'':15} {'':8} {'Sufficiency':<12} {'(strict)':<10} {'(lenient)':<10}")
print("-" * 70)

for result, chunks in [(naive_results, naive_chunks), 
                       (middleground_results, middleground_chunks), 
                       (ast_results, ast_chunks)]:
    m = result["metrics"]
    print(f"{result['strategy']:<15} {len(chunks):<8} "
          f"{m.get('retrieval_sufficiency/mean', 0):<12.1%} "
          f"{m.get('correctness/mean', 0):<10.1%} "
          f"{m.get('sufficient_answer/mean', 0):<10.1%}")

print("\n" + "=" * 70)
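# Compute the gap from the results rather than hard-coding it, so the summary stays accurate on re-runs.
delta = (ast_results["metrics"].get("retrieval_sufficiency/mean", 0)
         - naive_results["metrics"].get("retrieval_sufficiency/mean", 0))
print(f"AST-based chunking shows {delta * 100:+.0f} pts retrieval sufficiency over naive.")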

## 9. Conclusion

### What the Results Show

**AST chunking won on the metric that matters most.** AST scored about 13 points higher than naive on Retrieval Sufficiency, which suggests that semantic boundaries and structural metadata make a real difference for code retrieval.

**Language-aware didn't outperform naive.** We expected heuristic separators to help, but strict size limits still split large functions awkwardly. Without metadata headers, chunks don't carry the structural context that helps retrieval.

**Define the metrics that matter to you.** Look at the gap between Correctness and Sufficient Answer. Many responses were semantically correct but failed strict matching. The built-in Correctness scorer was too pedantic for our use case. The custom scorer gave us actionable signal about what we actually cared about: did the response answer the question?
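
To see where the two scorers disagree, it helps to pull out the questions that fail strict Correctness but pass the lenient scorer. The sketch below does this on a toy table. The column names (`correctness`, `sufficient_answer`), the helper `strict_only_failures`, and the verdicts are assumptions for illustration; the per-row table returned by MLflow may name its columns differently, so adjust the arguments to match.

```python
import pandas as pd

# Toy per-question verdicts; column names and values are assumptions, not real results.
toy_rows = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3", "q4"],
        "correctness": ["yes", "no", "no", "yes"],
        "sufficient_answer": ["yes", "yes", "no", "yes"],
    }
)


def strict_only_failures(
    rows: pd.DataFrame,
    strict_col: str = "correctness",
    lenient_col: str = "sufficient_answer",
) -> pd.DataFrame:
    """Rows the strict scorer rejected but the lenient scorer accepted."""
    mask = (rows[strict_col] == "no") & (rows[lenient_col] == "yes")
    return rows.loc[mask]


disagreements = strict_only_failures(toy_rows)
print(disagreements["question"].tolist())
```

Reading the judge rationales for these rows is a quick way to check whether the strict scorer is being pedantic or the lenient one is being too forgiving.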

### Caveats and Open Questions

- **Is it just chunk size?** AST chunks are larger on average. Maybe bigger chunks do better regardless of strategy. A follow-up experiment with naive chunking at 1500 characters could isolate this.
- **Would code-specific embeddings help the other strategies?** We used `databricks-gte-large-en`. Code-specific models exist, but we were constrained to what integrates with production. Better embeddings might narrow the gap between strategies.
- **Small evaluation set**: 24 questions may not capture all query patterns.
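
One way to gauge how much the small evaluation set matters is a bootstrap confidence interval on a pass rate. The sketch below resamples per-question outcomes with replacement. The function `bootstrap_ci` and the `toy_outcomes` values are hypothetical and stand in for the real per-question verdicts; they are not results from this notebook.

```python
import numpy as np


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of binary (0/1) outcomes."""
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(outcomes)
    # Each row of `idx` is one resample of the n questions, drawn with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = outcomes[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(outcomes.mean()), float(lo), float(hi)


# Hypothetical pass/fail outcomes for 24 questions (1 = pass); not real results.
toy_outcomes = [1] * 16 + [0] * 8
mean, lo, hi = bootstrap_ci(toy_outcomes)
print(f"pass rate {mean:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

With 24 questions, a 13-point gap corresponds to roughly three questions, so intervals like this tend to be wide. Comparing strategies on the same resampled questions (a paired bootstrap) would be a natural next step.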

### Takeaways

**MLflow GenAI evaluation lets you measure what matters.** Without systematic evaluation, we would have guessed that language-aware chunking was good enough. The numbers showed otherwise.

**Custom scorers are essential.** Built-in metrics are starting points. Your use case will have specific requirements that need custom evaluation logic.

**The methodology is the main thing.** Once you can systematically evaluate RAG quality, you can make confident decisions about every component of your pipeline: chunking strategy, embedding model, k value, reranking approach, and more.