# Tutorial 1.2: Experiment Tracking for LLMs

![](images/3_Notebook-12-LLM-Experiment-Tracking.png)


## Tracking GenAI Experiments with MLflow

Welcome to the second notebook! Now that your environment is set up, you'll learn how to track LLM experiments systematically.

### What You'll Learn
- How `mlflow.openai.autolog()` captures model params, tokens, latency, and I/O automatically
- When you still need explicit `mlflow.log_*` calls (tags, custom artifacts)
- How to create and organize GenAI experiments
- How to compare different LLM configurations
- Best practices for experiment organization

### Prerequisites
- Completed Notebook 1.1 (Setup)
- MLflow >= 3.10.0
- MLflow UI running (recommended)

### Estimated Time: 25-30 minutes

---
## Step 1: Environment Setup

Let's load our environment and enable autologging for OpenAI.

In [None]:
import os

import mlflow
from utils.clnt_utils import get_databricks_ai_gateway_client, get_openai_client, get_ai_gateway_model_names, is_databricks_ai_gateway_client
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configure MLflow
mlflow.set_tracking_uri("http://localhost:5000")

use_ai_gateway = is_databricks_ai_gateway_client()

# Choose the client and default model
if use_ai_gateway:
    client = get_databricks_ai_gateway_client()
    model_name = get_ai_gateway_model_names()[0]
else:
    # Fail fast if the OpenAI key is missing, before the client is created
    if not os.getenv("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY not found. Please check your .env file.")
    client = get_openai_client()
    model_name = "gpt-5.2"

# Enable autologging for OpenAI.
# This automatically creates MLflow Traces that capture:
#   - model, temperature, max_tokens (all API params as span attributes)
#   - full input messages and response content (span I/O)
#   - token counts: prompt_tokens, completion_tokens, total_tokens
#   - latency (via span start/end timestamps)
mlflow.openai.autolog()

print("✅ Environment configured successfully")
print(f"   MLflow Tracking URI: {mlflow.get_tracking_uri()}")
print(f"   Using model: {model_name}")
print("   Autolog: ENABLED")

---
## Step 2: Understanding Experiment Tracking

### What is Experiment Tracking?

Experiment tracking captures the inputs, outputs, and context of your LLM experiments:

```
┌──────────────────────────────────────────────────┐
│              EXPERIMENT                          │
│  Name: "sentiment-analysis"                      │
├──────────────────────────────────────────────────┤
│                                                  │
│  RUN 1: gpt-5.2, temp=1.0                        │
│  ├─ Parameters: {model, temperature, ...}        │
│  ├─ Metrics: {accuracy, latency, cost}           │
│  └─ Artifacts: {prompt.txt, config.json}         │
│                                                  │
│  RUN 2: gpt-5.2, temp=1.5                        │
│  ├─ Parameters: {model, temperature, ...}        │
│  ├─ Metrics: {accuracy, latency, cost}           │
│  └─ Artifacts: {prompt.txt, config.json}         │
│                                                  │
│  RUN 3: gpt-5.2, temp=2.0                        │
│  ...                                             │
└──────────────────────────────────────────────────┘
```

### Key Concepts

- **Parameters**: Configuration values (model name, temperature, max_tokens)
- **Metrics**: Numerical measurements (accuracy, latency, token count)
- **Artifacts**: Files (prompts, responses, model configs)
- **Tags**: Metadata for organizing and filtering runs
- **Traces**: Automatic records of LLM calls (created by autolog)

### What Does `mlflow.openai.autolog()` Capture?

Since we enabled autolog in Step 1, every OpenAI call is automatically traced. Here is what you get for free vs. what still needs manual logging:

| Captured Automatically (Traces)             | Requires Explicit `log_*` Calls          |
|---------------------------------------------|------------------------------------------|
| model, temperature, max_tokens (span attrs) | Semantic tags (task, stage, team, etc.)  |
| Full input messages (span I/O)              | Structured config artifacts (`log_dict`) |
| Full response content (span I/O)            | Custom business metrics                  |
| Token counts: prompt, completion, total     | Cost figures based on your own pricing   |
| Token costs: input and output (MLflow 3.10) |                                          |
| Latency (span start/end timestamps)         |                                          |

**The rest of this notebook demonstrates each category, so you'll see when a `log_*` call earns its keep.**

---
## Step 3: Your First Tracked LLM Call: The Autolog Way

Let's see what autolog captures with zero manual instrumentation, then add only what it cannot provide.

In [None]:
# Create an experiment
experiment_name = "02-basic-llm-calls"
mlflow.set_experiment(experiment_name)

print(f"📊 Experiment: {experiment_name}")
print("   View in UI: http://localhost:5000")

In [None]:
# Make a tracked LLM call; autolog does the heavy lifting.
#
# What we do NOT need to log manually (autolog captures all of this):
#   - mlflow.log_param("model", ...)        -> span attribute
#   - mlflow.log_param("temperature", ...)   -> span attribute
#   - mlflow.log_param("max_tokens", ...)    -> span attribute
#   - mlflow.log_metric("latency_seconds")   -> span timestamps
#   - mlflow.log_metric("prompt_tokens")     -> mlflow.chat.tokenUsage
#   - mlflow.log_metric("completion_tokens") -> mlflow.chat.tokenUsage
#   - mlflow.log_metric("total_tokens")      -> mlflow.chat.tokenUsage
#   - mlflow.log_text(prompt, "prompt.txt")  -> span input
#   - mlflow.log_text(answer, "response.txt")-> span output

prompt = "Explain MLflow GenAI Platform in 3-4 sentences."

with mlflow.start_run(run_name="first-llm-tracked-call") as run:

    # The only explicit call: a semantic tag that autolog cannot infer.
    mlflow.set_tag("task", "explanation")

    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        max_completion_tokens=1000
    )

    answer = response.choices[0].message.content

print(f"\n📝 Prompt: {prompt}")
print(f"\n🤖 Response: {answer}")
print(f"\n🔗 Run ID: {run.info.run_id}")
print(f"   View in UI: http://localhost:5000/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")

### 🎯 What Just Happened?

1. **One explicit tag.** We call `set_tag("task", ...)` because it is a semantic label only *we* know; autolog has no way to infer it.

2. **Autolog created a Trace automatically.** In the MLflow UI, go to the **Traces** tab of the `02-basic-llm-calls` experiment. You will see:
   - `model`, `temperature`, `max_tokens` as span attributes
   - The full prompt and response captured as span I/O
   - Token counts under `mlflow.chat.tokenUsage`
   - Token costs for input and output
   - Latency derived from span start/end timestamps

3. **The Trace is linked to the Run** via `mlflow.sourceRun`. You can navigate from the Trace back to the Run and vice versa.

**Try it:** Open the MLflow UI. Find the run. Click into its Trace. Compare what autolog captured vs. what we logged manually.
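
To make the token-count portion of a Trace more concrete, here is a minimal sketch that sanity-checks and summarizes usage numbers from a plain dictionary. The dictionary keys follow the `usage` field of an OpenAI chat completion (`prompt_tokens`, `completion_tokens`, `total_tokens`); the `summarize_usage` helper and the sample numbers are hypothetical and are not part of MLflow.

```python
def summarize_usage(usage: dict) -> dict:
    """Validate token counts and compute the share of tokens spent on the completion."""
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]
    # Fall back to the sum when the total is not reported.
    total = usage.get("total_tokens", prompt + completion)
    if total != prompt + completion:
        raise ValueError(f"Inconsistent token counts: {usage}")
    share = round(completion / total, 3) if total else 0.0
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": total,
        "completion_share": share,
    }


# Hypothetical numbers, shaped like the usage block of a chat completion.
sample_usage = {"prompt_tokens": 18, "completion_tokens": 82, "total_tokens": 100}
summary = summarize_usage(sample_usage)
print(summary)
```

A summary like this is handy when you copy numbers out of the Traces tab and want to compare how much of each call's budget goes to the response.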

---
## Step 4: Comparing Multiple Configurations

With autolog active, the comparison helper only needs to make the API call.

In [None]:
# Simplified helper: autolog captures params, tokens, latency, and I/O automatically.
def simple_llm_call(prompt, model=model_name, temperature=1.0, max_completion_tokens=1000, run_name=None):
    """
    Make an LLM call inside a nested run.
    autolog captures model params, token counts, latency, and full I/O as a Trace.
    The run exists only to group and name the experiment.
    """
    with mlflow.start_run(run_name=run_name, nested=True):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_completion_tokens=max_completion_tokens
        )
        return response.choices[0].message.content

print("✅ Helper function defined!")

In [None]:
# Create a new experiment for comparison
mlflow.set_experiment("02-temperature-comparison")

test_prompt = "Write a creative tagline for AI observability with the MLflow GenAI platform."
temperatures = [1.0, 1.5, 2.0]

print("🔬 Running temperature comparison...\n")

# A parent run groups all nested calls together in the UI.
with mlflow.start_run(run_name="temperature-sweep"):
    mlflow.set_tag("sweep_variable", "temperature")

    for temp in temperatures:
        print(f"  temperature={temp} ...")
        response = simple_llm_call(
            prompt=test_prompt,
            model=model_name,
            temperature=temp,
            max_completion_tokens=1000,
            run_name=f"temp_{temp}"
        )
        print(f"    -> {response}\n")

print("✅ Done. Compare traces side-by-side in the MLflow UI.")

### 🔍 Analysis

Notice how temperature affects:
- **Creativity**: Higher temperature = more creative responses
- **Consistency**: Lower temperature = more deterministic
- **Token usage**: May vary with creativity level

**💡 Comparing in the MLflow UI:**
1. Select the "02-temperature-comparison" experiment
2. Select the nested runs and click "Compare"
3. Because autolog captured `temperature` as a span attribute on every Trace, you can also filter by temperature directly in the **Traces** tab
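
If you want to sweep more than one variable, one option is to build the grid of configurations up front and give every combination a readable run name. The sketch below uses only the standard library; `build_sweep` and its run-name format are illustrative assumptions, and each resulting dictionary could be unpacked into a helper such as `simple_llm_call`.

```python
from itertools import product


def build_sweep(models, temperatures):
    """Return one config dict per (model, temperature) pair, with a readable run name."""
    return [
        {"model": model, "temperature": temp, "run_name": f"{model}_temp_{temp}"}
        for model, temp in product(models, temperatures)
    ]


# Model names are taken from this notebook; swap in whatever your client supports.
sweep = build_sweep(["gpt-5-mini", "gpt-5.2"], [1.0, 1.5])
for cfg in sweep:
    print(cfg["run_name"])
```

Naming runs after the swept values keeps the nested runs easy to tell apart when you select them for comparison.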

---
## Step 5: Tracking Cost Estimates

As of MLflow 3.10, each trace records its cost, broken down into input cost and output cost.
This lets you compare the overall cost of different models, whether they come from the same
provider or from different providers.


In [None]:
def llm_call_with_cost(prompt, model=model_name, temperature=1.0, max_completion_tokens=1000, run_name=None):
    """
    Make an LLM call with cost tracking.
    autolog captures: model, temperature, token counts, latency, I/O.
    """
    with mlflow.start_run(run_name=run_name):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_completion_tokens=max_completion_tokens
        )

        answer = response.choices[0].message.content

        return answer

print("✅ Cost-aware helper defined")
print("   autolog captures: model, temperature, token counts, latency, I/O")

In [None]:
# Compare costs across different models
mlflow.set_experiment("03-model-cost-comparison")

prompt = "Summarize the benefits of experiment tracking in 3 bullet points."
models_to_test = ["jsd-gpt-5-2", "jsd-gpt-5-mini"] if use_ai_gateway else ["gpt-5-mini", "gpt-5.2"]

print("💰 Comparing costs across models...\n")

for model in models_to_test:
    print(f"Testing {model}...")
    response = llm_call_with_cost(
        prompt=prompt,
        model=model,
        temperature=1.0,
        max_completion_tokens=1000,
        run_name=f"model_{model}_run"
    )
    print(f"  Response: {response}...\n")

print("✅ Cost comparison complete! View in MLflow UI.")

### 💡 Cost Analysis Insights

By tracking costs, you can:
1. **Budget effectively** for production deployments
2. **Optimize model selection** (GPT-5-mini vs GPT-5.2)
3. **Identify expensive prompts** that need optimization
4. **Track spending trends** over time

**Note:** Token counts come from autolog Traces (`mlflow.chat.tokenUsage`), and the trace cost breakdown is derived from them. A manually logged metric such as `estimated_cost_usd` is only worth adding when you need a dollar figure based on pricing data only you can supply. The pattern to follow: **log exactly what observability infrastructure cannot derive on its own.**
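
For the case where you do want your own dollar figure, the arithmetic is simple: tokens divided by one million, times the per-million rate, computed separately for input and output. The sketch below is a minimal illustration; the model names and prices in `PRICES_PER_MILLION` are made-up placeholders, not real provider rates, and `estimate_cost_usd` is a hypothetical helper.

```python
# Hypothetical prices in USD per 1M tokens; replace with your own rates.
PRICES_PER_MILLION = {
    "example-small": {"input": 0.25, "output": 2.00},
    "example-large": {"input": 1.75, "output": 14.00},
}


def estimate_cost_usd(model, prompt_tokens, completion_tokens, prices=PRICES_PER_MILLION):
    """Estimate input, output, and total cost for one call from its token counts."""
    rates = prices[model]
    input_cost = prompt_tokens / 1_000_000 * rates["input"]
    output_cost = completion_tokens / 1_000_000 * rates["output"]
    return {
        "input_cost_usd": input_cost,
        "output_cost_usd": output_cost,
        "total_cost_usd": input_cost + output_cost,
    }


cost = estimate_cost_usd("example-small", prompt_tokens=1_000, completion_tokens=500)
print(f"total: ${cost['total_cost_usd']:.5f}")
```

Inside an active run, the total could then be recorded with `mlflow.log_metric("estimated_cost_usd", cost["total_cost_usd"])`, which makes it searchable as a run metric.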

---
## Step 6: Organizing Experiments with Tags and Metadata

Tags and structured configs are another area where explicit logging adds real value: autolog has no way to know your team structure, production candidacy, or versioning scheme.

In [None]:
# Systematic experiment with rich metadata: tags and config artifacts.
mlflow.set_experiment("04-production-candidate-testing")

# Test configurations
open_configs = [
    {
        "name": "baseline",
        "model": "gpt-5-mini",
        "temperature": 1.0,
        "system_prompt": "You are a helpful assistant."
    },
    {
        "name": "creative",
        "model": "gpt-5.2",
        "temperature": 2.0,
        "system_prompt": "You are a creative writing assistant."
    },
]

# Databricks-hosted foundation models, used when the AI Gateway client is active
databricks_config = [
    {
        "name": "baseline",
        "model": "jsd-gpt-5-mini",
        "temperature": 1.0,
        "system_prompt": "You are a helpful assistant."
    },
    {
        "name": "creative",
        "model": "jsd-gpt-5.2",
        "temperature": 1.5,
        "system_prompt": "You are a creative writing assistant."
    },
]
model_configs = databricks_config if use_ai_gateway else open_configs
test_prompt = "Explain the concept of LLM temperature."

print("🏷️  Running experiments with semantic tags...\n")

for config in model_configs:
    with mlflow.start_run(run_name=config["name"]):

        # Make the call; autolog captures model, temperature, tokens, I/O, latency.
        response = client.chat.completions.create(
            model=config["model"],
            messages=[
                {"role": "system", "content": config["system_prompt"]},
                {"role": "user", "content": test_prompt}
            ],
            temperature=config["temperature"],
            max_completion_tokens=1000
        )

        # Log only what autolog cannot provide: semantic tags and structured config.
        mlflow.set_tags({
            "config_name": config["name"],
            "task": "explanation",
            "stage": "testing",
            "team": "ai-research",
            "version": "v1.0",
            "production_candidate": str(config["name"] == "baseline").lower(),
        })

        # Save full config as a structured artifact
        mlflow.log_dict(config, "config.json")

        print(f"  ✓ {config['name']} done")

print("\n✅ All runs completed! Filter by tag 'production_candidate=true' in the UI.")

### 🏷️ Tagging Best Practices

Use tags for:
1. **Environment**: `stage: development/testing/production`
2. **Ownership**: `team: ai-research`, `owner: jules`
3. **Purpose**: `task: summarization`, `use_case: customer-support`
4. **Status**: `production_candidate: true`, `approved: false`
5. **Version**: `version: v1.0`, `prompt_version: v2.1`
6. **Do not duplicate autolog data.** Tags like `model_used: gpt-5.2` or `total_tokens: 342` are already in the Trace. Reserve tags for information that is not derivable from the API call itself.

**💡 You can filter and search runs by tags in the MLflow UI!**
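
One way to enforce these practices is a small helper that builds the tag dictionary before it is passed to `mlflow.set_tags`. The sketch below is an assumption-laden illustration: `build_run_tags` and the `AUTOLOGGED_KEYS` set are hypothetical names, and the set simply mirrors the fields this notebook lists as captured by autolog.

```python
# Keys this notebook lists as already captured on the Trace by autolog.
AUTOLOGGED_KEYS = {
    "model", "temperature", "max_tokens",
    "prompt_tokens", "completion_tokens", "total_tokens", "latency",
}


def build_run_tags(**tags):
    """Return a tag dict with string values, rejecting keys that duplicate autolog data."""
    duplicates = sorted(AUTOLOGGED_KEYS.intersection(tags))
    if duplicates:
        raise ValueError(f"Already captured by autolog, do not tag: {duplicates}")
    # Booleans become lowercase strings so a filter like = 'true' matches them.
    return {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in tags.items()
    }


tags = build_run_tags(
    stage="testing",
    team="ai-research",
    version="v1.0",
    production_candidate=True,
)
print(tags)
```

Normalizing booleans to lowercase strings matches the `str(...).lower()` pattern used in the cell above, so tag filters behave the same regardless of who wrote the run.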

---
## Step 7: Querying Experiments Programmatically

Let's learn how to retrieve and analyze experiment data using the MLflow API.

In [None]:
from mlflow.tracking import MlflowClient

# Use the MlflowClient to query experiments and runs
mlflow_client = MlflowClient()

# Get experiment by name
experiment = mlflow_client.get_experiment_by_name("04-production-candidate-testing")

if experiment:
    print(f"📊 Experiment: {experiment.name}")
    print(f"   ID: {experiment.experiment_id}")

    # Search runs, sorted by start_time because autolog stores latency on Traces, not run metrics.
    runs = mlflow_client.search_runs(
        experiment_ids=[experiment.experiment_id],
        order_by=["start_time DESC"],
        max_results=5
    )

    print(f"\n   Found {len(runs)} runs:\n" + "="*60)

    for run in runs:
        print(f"\n   Run: {run.info.run_name}")
        if run.data.params:
            print("   Parameters:")
            for key, value in run.data.params.items():
                print(f"      {key}: {value}")
        if run.data.metrics:
            print("   Metrics:")
            for key, value in run.data.metrics.items():
                print(f"      {key}: {value}")
        if run.data.tags.get("config_name"):
            print(f"   Tag config_name: {run.data.tags['config_name']}")
else:
    print("Experiment not found. Make sure you ran the production candidate testing section.")

In [None]:
# Find production candidates using tag filters
if experiment:
    prod_runs = mlflow_client.search_runs(
        experiment_ids=[experiment.experiment_id],
        filter_string="tags.production_candidate = 'true'",
        max_results=5
    )

    print("🏆 Production Candidates:")
    for run in prod_runs:
        print(f"   Name: {run.info.run_name}")
        print(f"   Config: {run.data.tags.get('config_name', 'N/A')}")
        print(f"   Run ID: {run.info.run_id}")

### 🔍 Homework: Advanced Queries

**Querying Runs** (explicit `log_*` data lives here):

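```python
# Reuse the experiment fetched earlier in this step
experiment_id = experiment.experiment_id

# Filter by metric threshold (matches only runs where you logged
# `estimated_cost_usd` yourself with mlflow.log_metric)
cheap_runs = mlflow_client.search_runs(
    experiment_ids=[experiment_id],
    filter_string="metrics.estimated_cost_usd < 0.001"
)

# Filter by tag
prod_candidates = mlflow_client.search_runs(
    experiment_ids=[experiment_id],
    filter_string="tags.production_candidate = 'true'"
)
```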

**Querying Traces** (autolog data lives here):

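```python
# Reuse the experiment and the last run from the cells above
experiment_id = experiment.experiment_id
run_id = run.info.run_id

# Search traces for an experiment
traces = mlflow.search_traces(
    experiment_ids=[experiment_id],
)

# Get all traces linked to a specific run
run_traces = mlflow.search_traces(
    run_id=run_id
)
```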

Since autolog stores model params, token counts, and latency on Traces rather than run metrics, use `mlflow.search_traces()` when you need to query that data.
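
When you filter runs on several tags at once, the clauses are joined with `AND`. The sketch below builds such a filter string from keyword arguments; `tag_filter` is a hypothetical helper, and it assumes simple tag keys (letters, digits, underscores) and values without single quotes.

```python
def tag_filter(**tags):
    """Build a run-search filter string that requires every given tag to match."""
    clauses = []
    for key, value in tags.items():
        if "'" in str(value):
            # Quoting rules for embedded quotes are out of scope for this sketch.
            raise ValueError(f"Unsupported tag value for {key!r}: {value!r}")
        clauses.append(f"tags.{key} = '{value}'")
    return " AND ".join(clauses)


filter_string = tag_filter(production_candidate="true", stage="testing")
print(filter_string)
# tags.production_candidate = 'true' AND tags.stage = 'testing'
```

The resulting string can be passed as `filter_string` to `mlflow_client.search_runs(...)` in place of a hand-written filter.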

---
## Summary

In this notebook, you learned:

1. ✅ How `mlflow.openai.autolog()` captures model params, tokens, latency, and I/O as Traces
2. ✅ When to use explicit `mlflow.log_*` calls (tags, config artifacts, semantic outputs, cost figures from your own pricing)
3. ✅ How to compare multiple model configurations with minimal boilerplate
4. ✅ How MLflow 3.10 traces track input and output costs automatically
5. ✅ How to organize experiments with tags and metadata
6. ✅ How to query runs and traces programmatically

### What to Log and What to Skip

| Category                          | Autolog?                  | Explicit `log_*`? |
|-----------------------------------|:-------------------------:|:-----------------:|
| Model, temperature, max_tokens    | ✅ auto                   | No, redundant     |
| Token counts (prompt/completion)  | ✅ auto                   | No, redundant     |
| Latency                           | ✅ auto (span timestamps) | No, redundant     |
| Full prompt & response            | ✅ auto (span I/O)        | No, redundant     |
| Semantic tags (task, stage, team) | ❌                        | ✅ YES            |
| Structured config artifacts       | ❌                        | ✅ YES            |
| Cross-step summaries              | ❌                        | ✅ YES            |
| Semantic output params            | ❌                        | ✅ YES            |
### Next Steps

Ready to dive deep into observability?

**📓 Notebook 1.3: Introduction to Tracing**
- Learn automatic tracing with autologging
- Understand the trace data model
- Visualize LLM execution flows
- Integrate with multiple frameworks