# Tutorial 1.2: Experiment Tracking for LLMs

## Tracking GenAI Experiments with MLflow

Welcome to the second notebook! Now that your environment is set up, you'll learn how to track LLM experiments systematically.

### What You'll Learn
- Create and organize GenAI experiments
- Log LLM parameters (model, temperature, max_tokens, etc.)
- Track important metrics (latency, token usage, cost)
- Store artifacts (prompts, responses, model configs)
- Compare different LLM configurations
- Apply best practices for experiment organization

### Prerequisites
- Completed Notebook 1.1 (Setup)
- MLflow UI running (optional but recommended)

### Estimated Time: 25-30 minutes

---
## Step 1: Environment Setup

Let's load our environment and imports.

In [76]:
import os
import time
from datetime import datetime

import mlflow
from utils import get_databricks_client, get_openai_client
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

use_databricks = os.getenv("USE_DATABRICKS_CLIENT") == "True"
print(f"use_databricks: {use_databricks}")

# Select which client to use
if use_databricks:
    client = get_databricks_client()
else:
    client = get_openai_client()

# Verify OpenAI key
if not use_databricks and not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found. Please check your .env file.")

print("✅ Environment configured successfully")
print(f"   MLflow Tracking URI: {mlflow.get_tracking_uri()}")

use_databricks: True
✅ Environment configured successfully
   MLflow Tracking URI: http://localhost:5000
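
The setup cell compares the environment variable against the exact string `"True"`, so a value such as `true` or `1` in `.env` would silently select the OpenAI client. A more forgiving parser could look like the sketch below. The helper name `env_flag` and the set of accepted spellings are assumptions for illustration, not part of the tutorial's `utils` module.

```python
import os

# Spellings treated as "on"; this set is an assumption, adjust to your team's convention.
TRUTHY = {"1", "true", "yes", "on"}

def env_flag(name, default=False, environ=os.environ):
    """Read a boolean flag from a mapping, ignoring case and surrounding whitespace."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY

# Demonstration with an explicit mapping so the real environment is left untouched.
print(env_flag("USE_DATABRICKS_CLIENT", environ={"USE_DATABRICKS_CLIENT": "true"}))
print(env_flag("USE_DATABRICKS_CLIENT", environ={}))
```

In the notebook, `use_databricks = env_flag("USE_DATABRICKS_CLIENT")` would be a drop-in replacement for the string comparison.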


---
## Step 2: Understanding Experiment Tracking

### What is Experiment Tracking?

Experiment tracking captures the inputs, outputs, and context of your LLM experiments:

```
┌──────────────────────────────────────────────────┐
│              EXPERIMENT                          │
│  Name: "sentiment-analysis"                      │
├──────────────────────────────────────────────────┤
│                                                  │
│  RUN 1: gpt-5-2, temp=1.0                        │
│  ├─ Parameters: {model, temperature, ...}        │
│  ├─ Metrics: {accuracy, latency, cost}           │
│  └─ Artifacts: {prompt.txt, config.json}         │
│                                                  │
│  RUN 2: gpt-5-2, temp=1.5                        │
│  ├─ Parameters: {model, temperature, ...}        │
│  ├─ Metrics: {accuracy, latency, cost}           │
│  └─ Artifacts: {prompt.txt, config.json}         │
│                                                  │
│  RUN 3: gpt-5-2, temp=2.0                        │
│  ...                                             │
└──────────────────────────────────────────────────┘
```

### Key Concepts

- **Parameters**: Configuration values (model name, temperature, max_tokens)
- **Metrics**: Numerical measurements (accuracy, latency, token count)
- **Artifacts**: Files (prompts, responses, model configs)
- **Tags**: Metadata for organizing and filtering runs
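
One practical difference between parameters and metrics is how they come back when you query a run: MLflow stores parameters as strings and metrics as floating-point numbers. The sketch below imitates that conversion in plain Python so you know what to expect later in the notebook. The function name `as_stored` is hypothetical and the code does not call MLflow.

```python
def as_stored(params: dict, metrics: dict) -> tuple:
    """Approximate how values look when read back: params as str, metrics as float."""
    stored_params = {key: str(value) for key, value in params.items()}
    stored_metrics = {key: float(value) for key, value in metrics.items()}
    return stored_params, stored_metrics

params, metrics = as_stored(
    {"model": "gpt-5-mini", "temperature": 1.5, "max_tokens": 50},
    {"prompt_tokens": 20, "completion_tokens": 14},
)
print(params["temperature"] == "1.5")  # True: the number is now a string
print(metrics["prompt_tokens"])        # 20.0: the integer is now a float
```

This is why a logged token count of `20` appears as `20.0` in query results, and why a parameter such as `temperature` needs a `float(...)` conversion before you do arithmetic on it.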

---
## Step 3: Your First Tracked LLM Call

Let's start with a simple example: making an LLM call and tracking everything about it.

In [77]:
# Create an experiment
experiment_name = "01-basic-llm-calls"
mlflow.set_experiment(experiment_name)

print(f"📊 Experiment: {experiment_name}")
print("   View in UI: http://localhost:5000")

2026/01/27 19:47:58 INFO mlflow.tracking.fluent: Experiment with name '01-basic-llm-calls' does not exist. Creating a new experiment.


📊 Experiment: 01-basic-llm-calls
   View in UI: http://localhost:5000


In [78]:
# Make a tracked LLM call
with mlflow.start_run(run_name="first-tracked-call") as run:
    
    # 1. Define inputs
    if use_databricks:
        model_name = "databricks-gpt-5-mini"
    else:
        model_name = "gpt-5-mini"
    
    # some default values
    temperature = 1.0
    max_tokens = 100
    prompt = "Explain MLflow in 2-3 sentences."
    
    # 2. Log parameters
    mlflow.log_param("model", model_name)
    mlflow.log_param("temperature", temperature)
    mlflow.log_param("max_tokens", max_tokens)
    
    # 3. Make the LLM call (with timing)
    start_time = time.time()
    
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens
    )
    
    latency = time.time() - start_time
    
    # 4. Extract response details
    answer = response.choices[0].message.content
    prompt_tokens = response.usage.prompt_tokens
    completion_tokens = response.usage.completion_tokens
    total_tokens = response.usage.total_tokens
    
    # 5. Log metrics
    mlflow.log_metric("latency_seconds", latency)
    mlflow.log_metric("prompt_tokens", prompt_tokens)
    mlflow.log_metric("completion_tokens", completion_tokens)
    mlflow.log_metric("total_tokens", total_tokens)
    mlflow.log_metric("response_length_chars", len(answer))
    
    # 6. Log artifacts
    mlflow.log_text(prompt, "prompt.txt")
    mlflow.log_text(answer, "response.txt")
    
    # 7. Log additional metadata as tags
    mlflow.set_tag("task", "explanation")
    mlflow.set_tag("framework", "openai")
    
    # Display results
    print("\n" + "="*60)
    print("RUN COMPLETED")
    print("="*60)
    print(f"\n📝 Prompt: {prompt}")
    print(f"\n🤖 Response: {answer}")
    print("\n📊 Metrics:")
    print(f"   Latency: {latency:.2f}s")
    print(f"   Tokens: {total_tokens} (prompt: {prompt_tokens}, completion: {completion_tokens})")
    print(f"\n🔗 Run ID: {run.info.run_id}")
    print(f"   View in UI: http://localhost:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")


RUN COMPLETED

📝 Prompt: Explain MLflow in 2-3 sentences.

🤖 Response: MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, including experiment tracking, packaging models, and deploying them. It provides four main components - Tracking (log and compare experiments), Projects (reproducible runs), Models (standardized model format and model registry), and Model Serving/Registry - to help teams reproduce, share, and deploy ML work.

📊 Metrics:
   Latency: 2.48s
   Tokens: 101 (prompt: 16, completion: 85)

🔗 Run ID: d09489850e854a1fbfaf2198e717f072
   View in UI: http://localhost:5000/#/experiments/2/runs/d09489850e854a1fbfaf2198e717f072
🏃 View run first-tracked-call at: http://localhost:5000/#/experiments/2/runs/d09489850e854a1fbfaf2198e717f072
🧪 View experiment at: http://localhost:5000/#/experiments/2


### 🎯 What Just Happened?

1. **Created a Run**: MLflow created a run with a unique ID
2. **Logged Parameters**: Stored configuration (model, temperature, max_tokens)
3. **Logged Metrics**: Tracked performance (latency, tokens)
4. **Logged Artifacts**: Saved prompt and response as files
5. **Added Tags**: Metadata for organizing runs

All this data is now stored in SQLite `mlflow.db` and visible in the UI!
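
The tracked call above measures latency by subtracting two `time.time()` readings. For short intervals, `time.perf_counter()` is usually a better fit because it is a monotonic, high-resolution clock. A minimal sketch of a reusable timer follows. The context manager `timed` is a hypothetical helper, and `time.sleep` stands in for the LLM call.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(metrics: dict, key: str = "latency_seconds"):
    """Store the elapsed time of the enclosed block in metrics[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed calls still get a latency value.
        metrics[key] = time.perf_counter() - start

metrics = {}
with timed(metrics):
    time.sleep(0.05)  # stand-in for client.chat.completions.create(...)
print(f"{metrics['latency_seconds']:.2f}s")
```

The resulting dictionary can be passed to `mlflow.log_metrics(metrics)` inside an active run, the same call the helper functions in Step 4 use.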

---
## Step 4: Comparing Multiple Configurations

Let's run experiments with different LLM configurations to see how they compare.

In [79]:
# Helper function for tracked LLM calls
def tracked_llm_call(prompt, model="gpt-5-mini", temperature=1.0, max_tokens=100, run_name=None):
    """
    Make an LLM call with full MLflow tracking.
    
    Returns: (response_text, metrics_dict)
    """
    with mlflow.start_run(run_name=run_name, nested=True):
        # Log parameters
        mlflow.log_params({
            "model": model,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "prompt_length": len(prompt)
        })
        
        # Make call with timing
        start_time = time.time()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=max_tokens
        )
        latency = time.time() - start_time
        
        # Extract response
        answer = response.choices[0].message.content
        
        # Calculate metrics
        metrics = {
            "latency_seconds": latency,
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens,
            "response_length": len(answer)
        }
        
        # Log metrics
        mlflow.log_metrics(metrics)
        
        # Log artifacts
        mlflow.log_text(prompt, "prompt.txt")
        mlflow.log_text(answer, "response.txt")
        
        return answer, metrics

print("✅ Helper function defined")

✅ Helper function defined


In [80]:
# Create a new experiment for comparison
mlflow.set_experiment("02-temperature-comparison")

# Test prompt
test_prompt = "Write a creative tagline for an AI observability with MLflow platform."

# Test different temperatures
temperatures = [1.0, 1.5, 2.0]

print("\n🔬 Running experiments with different temperatures...\n")

results = []
for temp in temperatures:
    print(f"Testing temperature={temp}...")
    response, metrics = tracked_llm_call(
        prompt=test_prompt,
        model="databricks-gpt-5-2" if use_databricks else "gpt-5-2",
        temperature=temp,
        max_tokens=50,
        run_name=f"temp_{temp}"
    )
    results.append((temp, response, metrics))
    print(f"  ✓ Completed in {metrics['latency_seconds']:.2f}s\n")

print("\n" + "="*60)
print("RESULTS COMPARISON")
print("="*60 + "\n")

for temp, response, metrics in results:
    print(f"Temperature: {temp}")
    print(f"Response: {response}")
    print(f"Tokens: {metrics['total_tokens']}")
    print("-" * 60 + "\n")

2026/01/27 19:53:06 INFO mlflow.tracking.fluent: Experiment with name '02-temperature-comparison' does not exist. Creating a new experiment.



🔬 Running experiments with different temperatures...

Testing temperature=1.0...
🏃 View run temp_1.0 at: http://localhost:5000/#/experiments/3/runs/c2dc809986ae46a6bfa886c955f46b45
🧪 View experiment at: http://localhost:5000/#/experiments/3
  ✓ Completed in 1.36s

Testing temperature=1.5...
🏃 View run temp_1.5 at: http://localhost:5000/#/experiments/3/runs/7443af6e9e5a475da8a0f9884c168fb5
🧪 View experiment at: http://localhost:5000/#/experiments/3
  ✓ Completed in 0.94s

Testing temperature=2.0...
🏃 View run temp_2.0 at: http://localhost:5000/#/experiments/3/runs/cb114dd8a56844ab824b647daf7c0544
🧪 View experiment at: http://localhost:5000/#/experiments/3
  ✓ Completed in 1.23s


RESULTS COMPARISON

Temperature: 1.0
Response: **“See every run. Trust every model.”**
Tokens: 34
------------------------------------------------------------

Temperature: 1.5
Response: **“See every run. Trust every model.”**
Tokens: 34
--------------------------------------

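### 🔍 Analysis

Temperature controls how sharply the model samples from its next-token distribution:
- **Creativity**: Higher temperatures usually produce more varied wording
- **Consistency**: Lower temperatures make repeated calls more alike
- **Token usage**: Response length can shift with the sampled wording

A single call per setting is a small sample. In the run above, temperatures 1.0 and 1.5 returned the same tagline (34 tokens each), while 2.0 produced a longer response (40 tokens).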
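
To make the effect of temperature concrete, the sketch below applies the standard softmax-with-temperature formula to three made-up token scores. It illustrates the general sampling idea only. The logits are hypothetical, and hosted models may apply further processing that this toy example does not capture.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top-scoring token toward the alternatives, which is why higher settings tend to produce more varied text across repeated calls.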

**💡 Go to the MLflow UI to visualize these differences!**
1. Select the "02-temperature-comparison" experiment
2. Select all three runs
3. Click the "Compare" button
4. View side-by-side metrics and artifacts

---
## Step 5: Tracking Cost Estimates

Let's add cost tracking to our experiments. This is crucial for production applications!

In [81]:
# OpenAI pricing in USD per token (as of 2026 - verify current rates).
# Keys must match the model names passed to the API. Models missing from this
# table (for example, the Gemini configs in Step 6) get 0.0 from calculate_cost.
PRICING = {
    "gpt-5-2": {
        "input": 1.75 / 1_000_000,    # per token
        "output": 14.00 / 1_000_000   # per token
    },
    "gpt-5-mini": {
        "input": 0.250 / 1_000_000,   # per token
        "output": 2.000 / 1_000_000   # per token
    },
    "databricks-gpt-5-2": {
        "input": 1.75 / 1_000_000,    # per token
        "output": 14.00 / 1_000_000   # per token
    },
    "databricks-gpt-5-mini": {
        "input": 0.250 / 1_000_000,   # per token
        "output": 2.000 / 1_000_000   # per token
    },
}

def calculate_cost(model, prompt_tokens, completion_tokens):
    """
    Calculate estimated cost for an LLM call.
    """
    if model not in PRICING:
        return 0.0
    
    input_cost = prompt_tokens * PRICING[model]["input"]
    output_cost = completion_tokens * PRICING[model]["output"]
    total_cost = input_cost + output_cost
    
    return total_cost

print("✅ Cost calculation function defined")

✅ Cost calculation function defined
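
Returning `0.0` for an unknown model keeps the notebook running, but it also makes a missing price look like a free call once the metric is logged. One alternative, sketched below, is to return `None` so the caller can decide whether to skip the metric or warn. The names `RATES` and `estimate_cost` are hypothetical, and the rates mirror the tutorial's figures rather than verified current prices.

```python
from typing import Optional

# Per-token rates in USD, mirroring the figures used in this tutorial; verify before use.
RATES = {
    "gpt-5-2": {"input": 1.75 / 1_000_000, "output": 14.00 / 1_000_000},
    "gpt-5-mini": {"input": 0.25 / 1_000_000, "output": 2.00 / 1_000_000},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Optional[float]:
    """Return the estimated cost in USD, or None when the model has no configured price."""
    rates = RATES.get(model)
    if rates is None:
        return None
    return prompt_tokens * rates["input"] + completion_tokens * rates["output"]

# Worked example: 20 input tokens and 150 output tokens on gpt-5-2
# = 20 * 1.75e-6 + 150 * 14e-6 = 0.000035 + 0.0021 USD
print(estimate_cost("gpt-5-2", 20, 150))
print(estimate_cost("gemini-2-5-flash", 20, 150))  # None: no price configured
```

With this variant, the logging code would only record `estimated_cost_usd` when the result is not `None`, so dashboards never show a misleading zero.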


In [82]:
# Enhanced helper with cost tracking
def tracked_llm_call_with_cost(prompt, model="gpt-5-mini", temperature=0.7, max_tokens=100, run_name=None):
    """
    Make an LLM call with full tracking including cost estimation.
    """
    with mlflow.start_run(run_name=run_name):
        # Log parameters
        mlflow.log_params({
            "model": model,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "prompt_length": len(prompt)
        })
        
        # Make call
        start_time = time.time()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=max_tokens
        )
        latency = time.time() - start_time
        
        # Extract details
        answer = response.choices[0].message.content
        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        total_tokens = response.usage.total_tokens
        
        # Calculate cost
        cost = calculate_cost(model, prompt_tokens, completion_tokens)
        
        # Log metrics
        mlflow.log_metrics({
            "latency_seconds": latency,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": total_tokens,
            "estimated_cost_usd": cost,
            "cost_per_1k_tokens": (cost / total_tokens * 1000) if total_tokens > 0 else 0
        })
        
        # Log artifacts
        mlflow.log_text(prompt, "prompt.txt")
        mlflow.log_text(answer, "response.txt")
        
        # Cost summary artifact; models missing from PRICING show a rate of 0.00
        # instead of raising a KeyError
        rates = PRICING.get(model, {"input": 0.0, "output": 0.0})
        cost_summary = f"""
Cost Breakdown
==============
Model: {model}
Input tokens: {prompt_tokens} @ ${rates['input']*1_000_000:.2f}/M
Output tokens: {completion_tokens} @ ${rates['output']*1_000_000:.2f}/M
Total cost: ${cost:.6f}
"""
        mlflow.log_text(cost_summary, "cost_breakdown.txt")
        
        return answer, cost

print("✅ Enhanced tracking function with cost estimation defined")

✅ Enhanced tracking function with cost estimation defined


In [72]:
# Compare costs across different models
mlflow.set_experiment("03-model-cost-comparison")

prompt = "Summarize the benefits of experiment tracking in 3 bullet points."

models_to_test = ["databricks-gpt-5-2", "databricks-gpt-5-mini"] if use_databricks else ["gpt-5-mini", "gpt-5-2"]

print("\n💰 Comparing costs across models...\n")

for model in models_to_test:
    print(f"Testing {model}...")
    response, cost = tracked_llm_call_with_cost(
        prompt=prompt,
        model=model,
        temperature=1.0,
        max_tokens=150,
        run_name=f"model_{model}"
    )
    print(f"  Cost: ${cost:.6f}")
    print(f"  Response length: {len(response)} chars\n")

print("✅ Cost comparison complete!")
print("   View detailed breakdown in MLflow UI")

2026/01/27 19:37:46 INFO mlflow.tracking.fluent: Experiment with name '03-model-cost-comparison' does not exist. Creating a new experiment.



💰 Comparing costs across models...

Testing databricks-gpt-5-2...
🏃 View run model_databricks-gpt-5-2 at: http://localhost:5000/#/experiments/3/runs/2064981454bb4ccf91c4a42df5ddeea9
🧪 View experiment at: http://localhost:5000/#/experiments/3
  Cost: $0.001547
  Response length: 560 chars

Testing databricks-gpt-5-mini...
🏃 View run model_databricks-gpt-5-mini at: http://localhost:5000/#/experiments/3/runs/fd19f7a1a67f4656867b1e11f6e7e174
🧪 View experiment at: http://localhost:5000/#/experiments/3
  Cost: $0.000211
  Response length: 535 chars

✅ Cost comparison complete!
   View detailed breakdown in MLflow UI


### 💡 Cost Analysis Insights

By tracking costs, you can:
1. **Budget effectively** for production deployments
2. **Optimize model selection** (for example, gpt-5-mini vs gpt-5-2)
3. **Identify expensive prompts** that need optimization
4. **Track spending trends** over time

**Pro Tip**: Set up cost alerts in production based on these metrics!
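
A cost alert can start as something very small: a running total compared against a limit. The sketch below is one hypothetical shape for that check. The class name `CostBudget` and the limit value are illustrative assumptions, and the two costs are the figures printed in the model comparison above.

```python
class CostBudget:
    """Accumulate estimated costs and report when a spending limit is crossed."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def add(self, cost_usd: float) -> bool:
        """Record one call's cost; return True while spending stays within the limit."""
        self.spent_usd += cost_usd
        return self.spent_usd <= self.limit_usd

budget = CostBudget(limit_usd=0.0017)  # hypothetical limit chosen to trigger on the second call
for cost in (0.001547, 0.000211):      # costs printed in the comparison above
    within = budget.add(cost)
    print(f"spent=${budget.spent_usd:.6f} within_budget={within}")
```

In a production setting the `False` result would feed whatever alerting channel your team already uses; that part is deliberately left out here.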

---
## Step 6: Organizing Experiments with Tags and Metadata

As your experiments grow, organization becomes critical. Tags let you search, filter, and group runs by the metadata you attach to them.

In [83]:
# Example: Systematic experiment with rich metadata
mlflow.set_experiment("04-production-candidate-testing")

# Test configurations
open_configs = [ 
    {
        "name": "baseline",
        "model": "gpt-5-mini",
        "temperature": 1.0,
        "system_prompt": "You are a helpful assistant."
    },
    {
        "name": "creative",
        "model": "gpt-5-2",
        "temperature": 1.5,
        "system_prompt": "You are a creative writing assistant."
    },
    {
        "name": "precise",
        "model": "gemini-2-5-flash",
        "temperature": 1.0,
        "system_prompt": "You are a precise, technical assistant."
    }
]

# Databricks-hosted foundation models, if you want to test them
databricks_config = [
    {
        "name": "baseline",
        "model": "databricks-gpt-5-mini",
        "temperature": 1.0,
        "system_prompt": "You are a helpful assistant."
    },
    {
        "name": "creative",
        "model": "databricks-gpt-5-2",
        "temperature": 1.5,
        "system_prompt": "You are a creative writing assistant."
    },
    {
        "name": "precise",
        "model": "databricks-gemini-2-5-flash",
        "temperature": 1.0,
        "system_prompt": "You are a precise, technical assistant."
    }
]
model_configs = databricks_config if use_databricks else open_configs
test_prompt = "Explain the concept of LLM temperature."

print("\n🏷️  Running experiments with comprehensive tagging...\n")

for config in model_configs:
    with mlflow.start_run(run_name=config["name"]):
        # Log parameters
        mlflow.log_params({
            "model": config["model"],
            "temperature": config["temperature"],
            "system_prompt": config["system_prompt"]
        })
        
        # Make call
        start_time = time.time()
        response = client.chat.completions.create(
            model=config["model"],
            messages=[
                {"role": "system", "content": config["system_prompt"]},
                {"role": "user", "content": test_prompt}
            ],
            temperature=config["temperature"],
            max_tokens=200
        )
        latency = time.time() - start_time
        
        answer = response.choices[0].message.content
        
        # Log metrics
        mlflow.log_metrics({
            "latency_seconds": latency,
            "total_tokens": response.usage.total_tokens,
            "estimated_cost_usd": calculate_cost(
                config["model"],
                response.usage.prompt_tokens,
                response.usage.completion_tokens
            )
        })
        
        # Rich tagging
        mlflow.set_tags({
            "config_name": config["name"],
            "task": "explanation",
            "stage": "testing",
            "team": "ai-research",
            "version": "v1.0",
            "timestamp": datetime.now().isoformat(),
            "production_candidate": "true" if config["name"] == "baseline" else "false"
        })
        
        # Log artifacts
        mlflow.log_text(test_prompt, "prompt.txt")
        mlflow.log_text(answer, "response.txt")
        
        # Save full config
        mlflow.log_dict(config, "config.json")
        
        print(f"✓ {config['name']} - {latency:.2f}s")

print("\n✅ All runs completed with comprehensive tagging!")

2026/01/27 19:55:17 INFO mlflow.tracking.fluent: Experiment with name '04-production-candidate-testing' does not exist. Creating a new experiment.



🏷️  Running experiments with comprehensive tagging...

✓ baseline - 3.17s
🏃 View run baseline at: http://localhost:5000/#/experiments/4/runs/ae24e0c3a74e4b8ba8c9733b43233a26
🧪 View experiment at: http://localhost:5000/#/experiments/4
✓ creative - 5.48s
🏃 View run creative at: http://localhost:5000/#/experiments/4/runs/cb114792e3a648758c9b4bb8bb461258
🧪 View experiment at: http://localhost:5000/#/experiments/4
✓ precise - 1.71s
🏃 View run precise at: http://localhost:5000/#/experiments/4/runs/6638470f20d74afda86278dd88995e68
🧪 View experiment at: http://localhost:5000/#/experiments/4

✅ All runs completed with comprehensive tagging!


### 🏷️ Tagging Best Practices

Use tags for:
1. **Environment**: `stage: development/testing/production`
2. **Ownership**: `team: ai-research`, `owner: jules`
3. **Purpose**: `task: summarization`, `use_case: customer-support`
4. **Status**: `production_candidate: true`, `approved: false`
5. **Version**: `version: v1.0`, `prompt_version: v2.1`

**💡 You can filter and search runs by tags in the MLflow UI!**
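
Tag conventions only help if every run follows them, so it can be worth checking a tag dictionary before passing it to `mlflow.set_tags`. The sketch below shows one way to do that in plain Python. The required keys and allowed stage values are assumptions drawn from the examples in this section, and `validate_tags` is a hypothetical helper.

```python
REQUIRED_TAGS = {"stage", "team", "task"}  # hypothetical team convention
ALLOWED_STAGES = {"development", "testing", "production"}

def validate_tags(tags: dict) -> list:
    """Return a list of problems; an empty list means the tags follow the convention."""
    problems = []
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        problems.append(f"missing tags: {sorted(missing)}")
    stage = tags.get("stage")
    if stage is not None and stage not in ALLOWED_STAGES:
        problems.append(f"unknown stage: {stage!r}")
    # Tag values are stored as strings, so flag anything else early.
    non_strings = [key for key, value in tags.items() if not isinstance(value, str)]
    if non_strings:
        problems.append(f"non-string values: {sorted(non_strings)}")
    return problems

print(validate_tags({"stage": "testing", "team": "ai-research", "task": "explanation"}))
print(validate_tags({"stage": "staging", "team": "ai-research"}))
```

The first call returns an empty list. The second reports the missing `task` tag and the unrecognized `staging` value, which would otherwise make the run invisible to a `stage`-based filter.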

---
## Step 7: Advanced - Parent-Child Runs

For complex workflows with multiple LLM calls, use nested runs to maintain hierarchy.

In [84]:
# Example: Multi-step workflow with nested runs
mlflow.set_experiment("05-multi-step-workflow")

with mlflow.start_run(run_name="question-answering-pipeline") as parent_run:
    
    # Tag parent run
    mlflow.set_tag("workflow", "qa-pipeline")
    mlflow.set_tag("num_steps", "3")
    
    user_question = "What is machine learning?"
    
    # Step 1: Question preprocessing (nested run)
    with mlflow.start_run(run_name="step-1-preprocess", nested=True):
        mlflow.set_tag("step", "preprocessing")
        
        response = client.chat.completions.create(
            model="databricks-gpt-5-mini" if use_databricks else "gpt-5-mini",
            messages=[{
                "role": "user",
                "content": f"Rephrase this question to be more specific: {user_question}"
            }],
            temperature=1.0,
            max_tokens=50
        )
        
        refined_question = response.choices[0].message.content
        mlflow.log_param("original_question", user_question)
        mlflow.log_param("refined_question", refined_question)
        mlflow.log_metric("tokens_used", response.usage.total_tokens)
    
    # Step 2: Answer generation (nested run)
    with mlflow.start_run(run_name="step-2-generate", nested=True):
        mlflow.set_tag("step", "generation")
        
        response = client.chat.completions.create(
            model="databricks-gpt-5-mini" if use_databricks else "gpt-5-mini",
            messages=[{
                "role": "user",
                "content": f"Provide a detailed answer: {refined_question}"
            }],
            temperature=1.0,
            max_tokens=200
        )
        
        answer = response.choices[0].message.content
        mlflow.log_text(answer, "answer.txt")
        mlflow.log_metric("tokens_used", response.usage.total_tokens)
        mlflow.log_metric("answer_length", len(answer))
    
    # Step 3: Quality check (nested run)
    with mlflow.start_run(run_name="step-3-quality-check", nested=True):
        mlflow.set_tag("step", "quality-check")
        
        response = client.chat.completions.create(
            model="databricks-gpt-5-mini" if use_databricks else "gpt-5-mini",
            messages=[{
                "role": "user",
                "content": f"Rate the quality of this answer (1-10): {answer}"
            }],
            temperature=1.0,
            max_tokens=10
        )
        
        quality_rating = response.choices[0].message.content
        mlflow.log_param("quality_rating", quality_rating)
        mlflow.log_metric("tokens_used", response.usage.total_tokens)
    
    # Log parent run summary
    mlflow.log_text(f"""
Pipeline Summary
================
Original Question: {user_question}
Refined Question: {refined_question}
Answer Length: {len(answer)} chars
Quality Rating: {quality_rating}
""", "pipeline_summary.txt")
    
    print("\n✅ Multi-step pipeline completed!")
    print(f"   Parent Run ID: {parent_run.info.run_id}")
    print("    View hierarchy in UI")

2026/01/27 19:55:33 INFO mlflow.tracking.fluent: Experiment with name '05-multi-step-workflow' does not exist. Creating a new experiment.


🏃 View run step-1-preprocess at: http://localhost:5000/#/experiments/5/runs/24669fabcd574a1499441a582eb85a3f
🧪 View experiment at: http://localhost:5000/#/experiments/5
🏃 View run step-2-generate at: http://localhost:5000/#/experiments/5/runs/7de9f2e57cc14fdbb3ec9f1487ccc807
🧪 View experiment at: http://localhost:5000/#/experiments/5
🏃 View run step-3-quality-check at: http://localhost:5000/#/experiments/5/runs/80ce3f129ce9498fa9724ccdf930da64
🧪 View experiment at: http://localhost:5000/#/experiments/5

✅ Multi-step pipeline completed!
   Parent Run ID: 3b0c9b18189042d1999abf726414e717
    View hierarchy in UI
🏃 View run question-answering-pipeline at: http://localhost:5000/#/experiments/5/runs/3b0c9b18189042d1999abf726414e717
🧪 View experiment at: http://localhost:5000/#/experiments/5


### 🌳 Parent-Child Run Benefits

```
Parent Run: question-answering-pipeline
├── Child Run 1: step-1-preprocess
├── Child Run 2: step-2-generate
└── Child Run 3: step-3-quality-check
```

**Advantages:**
- Logical organization of complex workflows
- Individual step metrics without losing the big picture
- Easy to debug specific steps
- Aggregate metrics across all steps
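
The pipeline above logs `tokens_used` on each child run but does not roll the values up to the parent. A minimal sketch of that aggregation step follows, using plain dictionaries in place of run objects. The helper name `aggregate_child_metrics` and the token counts are hypothetical; real values would come from each `response.usage.total_tokens`.

```python
def aggregate_child_metrics(children: list, key: str = "tokens_used") -> dict:
    """Sum one metric across child-run records and count the steps that reported it."""
    values = [child["metrics"][key] for child in children if key in child["metrics"]]
    return {f"total_{key}": sum(values), "num_steps": len(values)}

# Hypothetical per-step token counts for the three pipeline steps.
children = [
    {"name": "step-1-preprocess", "metrics": {"tokens_used": 42}},
    {"name": "step-2-generate", "metrics": {"tokens_used": 230}},
    {"name": "step-3-quality-check", "metrics": {"tokens_used": 215}},
]
summary = aggregate_child_metrics(children)
print(summary)  # {'total_tokens_used': 487, 'num_steps': 3}
```

Inside the parent run's `with` block, after the last child finishes, `mlflow.log_metrics(summary)` would attach these totals to the parent so the pipeline-level cost is visible without opening each step.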

---
## Step 8: Querying Experiments Programmatically

Let's learn how to retrieve and analyze experiment data using the MLflow API.

In [85]:
from mlflow.tracking import MlflowClient

client_mlflow = MlflowClient()

# Get experiment by name
experiment = client_mlflow.get_experiment_by_name("02-temperature-comparison")

if experiment:
    print(f"\n📊 Experiment: {experiment.name}")
    print(f"   ID: {experiment.experiment_id}")
    print(f"   Artifact Location: {experiment.artifact_location}")
    
    # Search runs in this experiment
    runs = client_mlflow.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["metrics.latency_seconds ASC"],
        max_results=5
    )
    
    print(f"\n   Found {len(runs)} runs:")
    print("\n" + "="*60)
    
    for run in runs:
        print(f"\n   Run: {run.info.run_name}")
        print("   Parameters:")
        for key, value in run.data.params.items():
            print(f"      {key}: {value}")
        print("   Metrics:")
        for key, value in run.data.metrics.items():
            print(f"      {key}: {value}")
else:
    print("Experiment not found. Make sure you ran the temperature comparison section.")


📊 Experiment: 02-temperature-comparison
   ID: 3
   Artifact Location: mlflow-artifacts:/3

   Found 3 runs:


   Run: temp_1.5
   Parameters:
      model: databricks-gpt-5-2
      temperature: 1.5
      max_tokens: 50
      prompt_length: 70
   Metrics:
      latency_seconds: 0.9443578720092773
      prompt_tokens: 20.0
      completion_tokens: 14.0
      total_tokens: 34.0
      response_length: 39.0

   Run: temp_2.0
   Parameters:
      model: databricks-gpt-5-2
      temperature: 2.0
      max_tokens: 50
      prompt_length: 70
   Metrics:
      latency_seconds: 1.2332289218902588
      prompt_tokens: 20.0
      completion_tokens: 20.0
      total_tokens: 40.0
      response_length: 77.0

   Run: temp_1.0
   Parameters:
      model: databricks-gpt-5-2
      temperature: 1.0
      max_tokens: 50
      prompt_length: 70
   Metrics:
      latency_seconds: 1.3623478412628174
      prompt_tokens: 20.0
      completion_tokens: 14.0
      total_tokens: 34.0
      response_length: 39.

In [86]:
# Find the best run based on a metric
if experiment:
    # Find fastest run
    fastest_run = client_mlflow.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["metrics.latency_seconds ASC"],
        max_results=1
    )[0]
    
    print("\n🏆 Fastest Run:")
    print(f"   Name: {fastest_run.info.run_name}")
    print(f"   Latency: {fastest_run.data.metrics['latency_seconds']:.3f}s")
    print(f"   Temperature: {fastest_run.data.params.get('temperature', 'N/A')}")
    print(f"   Run ID: {fastest_run.info.run_id}")


🏆 Fastest Run:
   Name: temp_1.5
   Latency: 0.944s
   Temperature: 1.5
   Run ID: 7443af6e9e5a475da8a0f9884c168fb5


### 🔍 Advanced Queries

The MLflow API supports powerful filtering:

```python
# Assumes `client_mlflow` and `experiment` from the cells above

# Filter by metric threshold
fast_runs = client_mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="metrics.latency_seconds < 1.0"
)

# Filter by parameter (use the model name you actually logged)
model_runs = client_mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="params.model = 'gpt-5-2'"
)

# Filter by tag
prod_candidates = client_mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="tags.production_candidate = 'true'"
)
```
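
Filter strings can also be combined with `AND`. When the conditions come from variables, building the string in one place avoids quoting mistakes. The sketch below is a hypothetical helper, `build_filter`, that handles only the three simple forms shown above. It assumes keys are plain identifiers and values contain no single quotes, so it does no escaping.

```python
def build_filter(metrics_below=None, params=None, tags=None) -> str:
    """Join simple conditions with AND into an MLflow-style filter string."""
    clauses = []
    for name, limit in (metrics_below or {}).items():
        clauses.append(f"metrics.{name} < {limit}")
    for name, value in (params or {}).items():
        clauses.append(f"params.{name} = '{value}'")
    for name, value in (tags or {}).items():
        clauses.append(f"tags.{name} = '{value}'")
    return " AND ".join(clauses)

print(build_filter(
    metrics_below={"latency_seconds": 1.0},
    tags={"production_candidate": "true"},
))
# metrics.latency_seconds < 1.0 AND tags.production_candidate = 'true'
```

The returned string can be passed as `filter_string` to `search_runs`, for example to find fast runs that are also tagged as production candidates.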

---
## Summary

In this notebook, you learned:

1. ✅ Core concepts of experiment tracking (parameters, metrics, artifacts)
2. ✅ How to log LLM calls with full context
3. ✅ Comparing multiple model configurations systematically
4. ✅ Tracking costs for budget management
5. ✅ Organizing experiments with tags and metadata
6. ✅ Using parent-child runs for complex workflows
7. ✅ Querying experiment data programmatically

### Best Practices Recap

- ✅ **Be Consistent**: Use the same parameter/metric names across runs
- ✅ **Tag Everything**: Make runs searchable with meaningful tags
- ✅ **Track Costs**: Essential for production budgeting
- ✅ **Use Nested Runs**: For multi-step workflows
- ✅ **Name Runs Meaningfully**: `baseline-v1`, `high-temp-creative`

### Next Steps

Ready to dive deep into observability?

**📓 Notebook 1.3: Introduction to Tracing**
- Learn automatic tracing with autologging
- Understand the trace data model
- Visualize LLM execution flows
- Integrate with multiple frameworks