# Part 8.3: ML Systems & Experiment Tracking — The Formula 1 Edition

Machine learning in production is mostly engineering. The model is just one component in a larger system of data pipelines, experiment tracking, model registries, and deployment infrastructure. Without that system in place, ML teams drown in untracked experiments, irreproducible results, and models that silently degrade.

**F1 analogy:** The car on the grid is what everyone sees, but it's the *factory* that wins championships. Behind every F1 car is a training infrastructure (the team's simulation farm running thousands of virtual races), data pipelines (continuous telemetry ingestion from car to cloud), feature stores (pre-computed track characteristics, driver profiles, tire degradation curves), and model serving (deploying the strategy model to the pit wall for live races). An F1 team without proper systems is like an ML team without experiment tracking — they might get lucky once, but they can't systematically improve.

This notebook builds the core infrastructure patterns that every ML team needs — from the ground up.

## Learning Objectives

- [ ] Build an experiment tracking system from scratch (MLflow-style)
- [ ] Implement systematic hyperparameter search (grid, random, Bayesian)
- [ ] Understand and build a model registry with versioning
- [ ] Create reproducible training pipelines with configuration management
- [ ] Implement a feature store for consistent feature engineering
- [ ] Build artifact logging for datasets, models, and metrics
- [ ] Design experiment comparison and visualization dashboards

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import defaultdict, OrderedDict
import json
import hashlib
import time
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

np.random.seed(42)
torch.manual_seed(42)

print("Part 8.3: ML Systems & Experiment Tracking — The Formula 1 Edition")
print("=" * 65)

---

## 1. The ML Infrastructure Stack

A production ML system has many components beyond the model:

| Layer | Component | Purpose | F1 Parallel |
|-------|-----------|--------|-------------|
| **Data** | Feature store, data versioning | Consistent, reproducible features | Telemetry ingestion pipeline — car sensors to cloud, consistent lap-by-lap data |
| **Training** | Experiment tracker, config management | Track what was tried, reproduce results | The simulation farm log — which virtual setups were tested, which produced fastest laps |
| **Model** | Model registry, artifact store | Version and stage models | Car spec management — track which aero package version is on which chassis |
| **Serving** | Model server, API gateway | Serve predictions | Deploying the strategy model to the pit wall for live race decisions |
| **Monitoring** | Metrics, alerts, drift detection | Keep things working | Real-time telemetry health checks — detecting sensor failures, model drift mid-race |

In [None]:
# Visualize the ML infrastructure stack
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')
ax.set_title('ML Infrastructure Stack', fontsize=15, fontweight='bold')

layers = [
    (7, 1.2, 12, 1.5, 'Data Layer', '#3498db',
     'Feature Store  |  Data Versioning  |  Data Validation  |  Pipelines'),
    (7, 3.0, 12, 1.5, 'Training Layer', '#2ecc71',
     'Experiment Tracking  |  Hyperparameter Search  |  Config Management'),
    (7, 4.8, 12, 1.5, 'Model Layer', '#f39c12',
     'Model Registry  |  Artifact Store  |  Model Versioning  |  Staging'),
    (7, 6.6, 12, 1.5, 'Serving Layer', '#e74c3c',
     'Model Server  |  API Gateway  |  Caching  |  Load Balancing'),
    (7, 8.4, 12, 1.5, 'Monitoring Layer', '#9b59b6',
     'Metrics  |  Alerts  |  Drift Detection  |  Logging'),
]

for x, y, w, h, label, color, desc in layers:
    box = mpatches.FancyBboxPatch((x - w/2, y - h/2), w, h,
                                   boxstyle="round,pad=0.15", facecolor=color,
                                   edgecolor='black', linewidth=2, alpha=0.85)
    ax.add_patch(box)
    ax.text(x, y + 0.2, label, ha='center', va='center', fontsize=12,
            fontweight='bold', color='white')
    ax.text(x, y - 0.3, desc, ha='center', va='center', fontsize=8, color='white')

# Arrows connecting layers
for i in range(len(layers) - 1):
    y_from = layers[i][1] + layers[i][3]/2
    y_to = layers[i+1][1] - layers[i+1][3]/2
    ax.annotate('', xy=(7, y_to), xytext=(7, y_from),
               arrowprops=dict(arrowstyle='->', lw=1.5, color='gray'))

plt.tight_layout()
plt.show()

---

## 2. Experiment Tracking

Without experiment tracking, ML development is chaos: "Which hyperparameters gave the best result?" "Can I reproduce last week's experiment?" "What changed between v1 and v2?"

An experiment tracker logs:
- **Parameters**: hyperparameters, config, code version
- **Metrics**: loss, accuracy, custom metrics (per step and final)
- **Artifacts**: model weights, plots, data samples
- **Metadata**: timestamps, hardware, git commit

**F1 analogy:** Experiment tracking is the simulation farm's logbook. Without it, an F1 team would be running thousands of CFD and simulator sessions without recording which wing angle, ride height, and spring stiffness produced each result. Imagine the chief engineer asking "What setup gave us the best Sector 2 time at Barcelona last Tuesday?" and nobody can answer because nobody wrote it down. That's ML without experiment tracking. Every F1 team meticulously logs every simulation run's parameters (setup), metrics (lap time, tire wear), and artifacts (telemetry traces) — and ML teams must do the same.
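
The metadata bullet above (timestamps, hardware, git commit) is the easiest one to forget, so here is a minimal sketch of how it might be collected. The helper name `collect_run_metadata` and the chosen fields are assumptions for illustration, not part of any tracking library. It falls back to `"unknown"` when git is unavailable or the code is not in a repository.

```python
import platform
import subprocess
import sys
import time


def collect_run_metadata():
    """Gather environment details worth storing alongside a run (hypothetical helper)."""
    try:
        # Ask git for the current commit; this fails outside a repo or without git
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            stderr=subprocess.DEVNULL,
            text=True,
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"

    return {
        "git_commit": commit,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "argv": list(sys.argv),
        "started_at": time.time(),
    }


metadata = collect_run_metadata()
print(metadata["git_commit"], metadata["python_version"])
```

A dictionary like this could be passed as the `tags` argument when a run is started, so that every logged result carries the commit and environment it came from.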

In [None]:
class ExperimentTracker:
    """MLflow-style experiment tracking from scratch."""
    
    def __init__(self, project_name='default'):
        self.project_name = project_name
        self.runs = {}  # run_id -> run data
        self.active_run = None
    
    def start_run(self, run_name=None, tags=None):
        """Start a new experiment run."""
        run_id = hashlib.md5(f"{time.time()}{run_name}".encode()).hexdigest()[:8]
        
        self.runs[run_id] = {
            'run_id': run_id,
            'name': run_name or f'run_{run_id}',
            'status': 'running',
            'start_time': time.time(),
            'params': {},
            'metrics': {},         # metric_name -> final value
            'metric_history': {},   # metric_name -> [(step, value), ...]
            'artifacts': {},
            'tags': tags or {},
        }
        self.active_run = run_id
        return run_id
    
    def log_param(self, key, value):
        """Log a hyperparameter."""
        self.runs[self.active_run]['params'][key] = value
    
    def log_params(self, params_dict):
        """Log multiple parameters."""
        self.runs[self.active_run]['params'].update(params_dict)
    
    def log_metric(self, key, value, step=None):
        """Log a metric value; step defaults to the next index in the history."""
        run = self.runs[self.active_run]
        run['metrics'][key] = value  # the latest value doubles as the "final" value
        
        history = run['metric_history'].setdefault(key, [])
        if step is None:
            step = len(history)
        history.append((step, value))
    
    def log_artifact(self, name, data):
        """Record artifact metadata (the payload itself is not stored)."""
        self.runs[self.active_run]['artifacts'][name] = {
            'type': type(data).__name__,
            'size': len(str(data)),
            'logged_at': time.time()
        }
    
    def end_run(self, status='completed'):
        """End the current run."""
        run = self.runs[self.active_run]
        run['status'] = status
        run['end_time'] = time.time()
        run['duration'] = run['end_time'] - run['start_time']
        self.active_run = None
    
    def get_best_run(self, metric, maximize=True):
        """Find the run with the best value for a metric."""
        best_run = None
        best_val = float('-inf') if maximize else float('inf')
        
        for run_id, run in self.runs.items():
            if metric in run['metrics']:
                val = run['metrics'][metric]
                if (maximize and val > best_val) or (not maximize and val < best_val):
                    best_val = val
                    best_run = run
        
        return best_run
    
    def compare_runs(self, metric_names=None):
        """Compare all runs in a table format."""
        rows = []
        for run_id, run in self.runs.items():
            row = {'run_id': run_id, 'name': run['name'], 'status': run['status']}
            row.update(run['params'])
            
            if metric_names:
                for m in metric_names:
                    row[m] = run['metrics'].get(m, None)
            else:
                row.update(run['metrics'])
            
            rows.append(row)
        return rows


# Simulate an ML experiment workflow
tracker = ExperimentTracker('classification_experiments')

# Run multiple experiments with different hyperparameters
configs = [
    {'lr': 0.001, 'hidden_size': 64, 'dropout': 0.1, 'optimizer': 'adam'},
    {'lr': 0.01, 'hidden_size': 128, 'dropout': 0.2, 'optimizer': 'adam'},
    {'lr': 0.001, 'hidden_size': 128, 'dropout': 0.1, 'optimizer': 'sgd'},
    {'lr': 0.005, 'hidden_size': 256, 'dropout': 0.3, 'optimizer': 'adam'},
    {'lr': 0.001, 'hidden_size': 64, 'dropout': 0.0, 'optimizer': 'adam'},
]

np.random.seed(42)
for i, config in enumerate(configs):
    run_id = tracker.start_run(run_name=f'exp_{i+1}', tags={'version': 'v1'})
    tracker.log_params(config)
    
    # Simulate training with metrics
    base_acc = 0.7 + np.random.normal(0, 0.05)
    for epoch in range(20):
        # Simulated improvement curve
        acc = base_acc + 0.015 * epoch * (1 + config['hidden_size'] / 256) + np.random.normal(0, 0.01)
        loss = 2.0 - acc + np.random.normal(0, 0.05)
        tracker.log_metric('accuracy', min(0.99, acc), step=epoch)
        tracker.log_metric('loss', max(0.1, loss), step=epoch)
    
    tracker.log_artifact('model_weights', f'model_{i}.pt')
    tracker.end_run()

# Show results
print("Experiment Comparison\n")
comparison = tracker.compare_runs(['accuracy', 'loss'])
print(f"{'Name':>8} {'LR':>8} {'Hidden':>8} {'Dropout':>8} {'Opt':>6} {'Accuracy':>10} {'Loss':>8}")
print("-" * 68)
for row in comparison:
    print(f"{row['name']:>8} {row.get('lr', ''):>8} {row.get('hidden_size', ''):>8} "
          f"{row.get('dropout', ''):>8} {row.get('optimizer', ''):>6} "
          f"{row.get('accuracy', 0):>10.4f} {row.get('loss', 0):>8.4f}")

best = tracker.get_best_run('accuracy', maximize=True)
print(f"\nBest run: {best['name']} (accuracy={best['metrics']['accuracy']:.4f})")

In [None]:
# Visualize experiment results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training curves
ax = axes[0]
colors = plt.cm.viridis(np.linspace(0.2, 0.9, len(tracker.runs)))
for (run_id, run), color in zip(tracker.runs.items(), colors):
    history = run['metric_history']['accuracy']
    steps = [h[0] for h in history]
    values = [h[1] for h in history]
    ax.plot(steps, values, linewidth=2, color=color, label=run['name'], alpha=0.8)

ax.set_xlabel('Epoch', fontsize=11)
ax.set_ylabel('Accuracy', fontsize=11)
ax.set_title('Training Curves Across Experiments', fontsize=13, fontweight='bold')
ax.legend(fontsize=9, loc='lower right')
ax.grid(True, alpha=0.3)

# Hyperparameter scatter: hidden size vs. final accuracy, colored by learning rate
ax = axes[1]
final_accs = [run['metrics']['accuracy'] for run in tracker.runs.values()]
lrs = [run['params']['lr'] for run in tracker.runs.values()]
hiddens = [run['params']['hidden_size'] for run in tracker.runs.values()]
dropouts = [run['params']['dropout'] for run in tracker.runs.values()]

scatter = ax.scatter(hiddens, final_accs, c=lrs, cmap='coolwarm', s=100,
                     edgecolors='black', linewidth=1)
plt.colorbar(scatter, ax=ax, label='Learning Rate')

for h, acc, d in zip(hiddens, final_accs, dropouts):
    ax.annotate(f'd={d}', (h, acc), textcoords='offset points',
               xytext=(5, 5), fontsize=8)

ax.set_xlabel('Hidden Size', fontsize=11)
ax.set_ylabel('Final Accuracy', fontsize=11)
ax.set_title('Hyperparameter Space Exploration', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## 3. Hyperparameter Search

Manually tuning hyperparameters doesn't scale. Systematic search methods find better configurations more efficiently.

| Method | Strategy | Efficiency | Best For | F1 Parallel |
|--------|----------|-----------|----------|-------------|
| **Grid Search** | Try all combinations | Low (exponential) | Few params, small ranges | Testing every combination of wing angle and ride height on a fixed grid — thorough but slow |
| **Random Search** | Sample randomly | Medium | Medium search spaces | Random sampling of the setup space — surprisingly effective at finding good configurations |
| **Bayesian** | Model the objective, choose wisely | High | Expensive evaluations | Using past simulation results to intelligently choose the next setup to test — each CFD run costs compute, so choose wisely |
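
To make the "exponential" entry in the table concrete, the sketch below counts grid configurations as the number of hyperparameters grows, then enumerates a small grid explicitly. The choice of 4 values per parameter in the first loop is an arbitrary assumption. The explicit grid uses the same values as the grid search example in this section.

```python
import itertools
import math

# Hypothetical setting: 4 candidate values for every hyperparameter
values_per_param = 4

for n_params in (2, 3, 5, 8):
    print(f"{n_params} params -> {values_per_param ** n_params} configurations")

# The same count, enumerated explicitly for a small grid
param_grid = {
    "lr": [0.0001, 0.001, 0.01, 0.1],
    "hidden_size": [64, 128, 256],
    "dropout": [0.0, 0.1, 0.2, 0.3],
}
all_configs = [
    dict(zip(param_grid, combo))
    for combo in itertools.product(*param_grid.values())
]
expected = math.prod(len(v) for v in param_grid.values())
print(len(all_configs), expected)
```

Three parameters already produce 48 configurations here. Adding a fourth parameter with four values would raise that to 192, which is why grid search is usually reserved for small spaces.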

**F1 analogy:** An F1 team can't test every possible car setup: there are too many combinations of wing angles, spring rates, damper settings, and differential maps. Grid search would try every combination on a fixed grid (wing angles of 5, 10, and 15 degrees × ride heights of 20, 25, and 30 mm), which is exhaustive but expensive. Random search samples the setup space at random and often finds good setups faster. Bayesian optimization is the most sample-efficient of the three: it looks at all previous simulator runs, builds a model of the performance landscape, and picks the next setup most likely to be either very fast or very informative. The same loop applies whenever each evaluation is expensive.
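
The phrase "either very fast or very informative" is what an acquisition function encodes. One common choice is expected improvement, which for a Gaussian prediction with mean μ and standard deviation σ is (μ − f*)Φ(z) + σφ(z), where f* is the best score so far and z = (μ − f*)/σ. The sketch below is a standalone illustration of that formula. It assumes a surrogate that can supply μ and σ, and the numbers in the example are made up.

```python
import math


def expected_improvement(mu, sigma, best_so_far):
    """Expected improvement for maximization under a Gaussian prediction.

    mu, sigma: surrogate's predicted mean and standard deviation at a candidate.
    best_so_far: best objective value observed so far.
    """
    if sigma <= 0:
        # No uncertainty: the improvement is deterministic
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mu - best_so_far) * cdf + sigma * pdf


# Two hypothetical candidates with the same predicted mean, slightly below the best
confident = expected_improvement(mu=0.84, sigma=0.01, best_so_far=0.85)
uncertain = expected_improvement(mu=0.84, sigma=0.05, best_so_far=0.85)
print(confident < uncertain)  # the more uncertain candidate scores higher
```

With equal predicted means, the candidate with more uncertainty gets the higher score, which is how the search is pulled toward unexplored regions.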

In [None]:
class HyperparameterSearch:
    """Hyperparameter search strategies."""
    
    @staticmethod
    def grid_search(param_grid):
        """Generate all combinations from parameter grid.
        
        param_grid: {'param_name': [value1, value2, ...]}
        """
        keys = list(param_grid.keys())
        values = list(param_grid.values())
        
        configs = []
        def _recurse(idx, current):
            if idx == len(keys):
                configs.append(dict(current))
                return
            for val in values[idx]:
                current[keys[idx]] = val
                _recurse(idx + 1, current)
        
        _recurse(0, {})
        return configs
    
    @staticmethod
    def random_search(param_distributions, n_trials=20):
        """Sample random configurations.
        
        param_distributions: {'param': {'type': 'uniform/loguniform/choice', ...}}
        """
        configs = []
        for _ in range(n_trials):
            config = {}
            for param, dist in param_distributions.items():
                if dist['type'] == 'uniform':
                    config[param] = np.random.uniform(dist['low'], dist['high'])
                elif dist['type'] == 'loguniform':
                    log_val = np.random.uniform(np.log(dist['low']), np.log(dist['high']))
                    config[param] = np.exp(log_val)
                elif dist['type'] == 'choice':
                    config[param] = np.random.choice(dist['options'])
                elif dist['type'] == 'int':
                    config[param] = int(np.random.randint(dist['low'], dist['high'] + 1))
            configs.append(config)
        return configs
    
    @staticmethod
    def bayesian_search(param_distributions, objective_fn, n_trials=20, n_initial=5):
        """Simplified Bayesian-style optimization.
        
        The surrogate is an inverse-distance-weighted average of past scores
        (a stand-in for a Gaussian process), not a fitted probabilistic model.
        """
        # Initial random exploration
        configs = HyperparameterSearch.random_search(param_distributions, n_initial)
        results = [(c, objective_fn(c)) for c in configs]
        
        for trial in range(n_initial, n_trials):
            # Generate candidates
            candidates = HyperparameterSearch.random_search(param_distributions, 100)
            
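            # Score candidates with a simple surrogate plus an exploration bonus
            # (closer in spirit to an upper-confidence-bound rule than to
            # expected improvement). Distances use raw, unnormalized values,
            # so wide-ranged parameters such as hidden_size dominate.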
            
            best_candidate = None
            best_acquisition = -float('inf')
            
            for cand in candidates:
                # Predict score as weighted average of nearby evaluated points
                predicted = 0
                total_weight = 0
                for prev_config, prev_score in results:
                    # Distance between configs
                    dist = sum((cand.get(k, 0) - prev_config.get(k, 0))**2
                              for k in cand if isinstance(cand[k], (int, float)))
                    weight = 1 / (dist + 0.01)
                    predicted += weight * prev_score
                    total_weight += weight
                
                predicted /= total_weight
                # Exploration bonus: prefer unexplored regions
                min_dist = min(sum((cand.get(k, 0) - pc.get(k, 0))**2
                                  for k in cand if isinstance(cand[k], (int, float)))
                              for pc, _ in results)
                acquisition = predicted + 0.1 * math.sqrt(min_dist)
                
                if acquisition > best_acquisition:
                    best_acquisition = acquisition
                    best_candidate = cand
            
            score = objective_fn(best_candidate)
            results.append((best_candidate, score))
        
        return results


# Define a simulated objective function
def simulated_objective(config):
    """Simulated model training that returns accuracy."""
    lr = config.get('lr', 0.001)
    hidden = config.get('hidden_size', 64)
    dropout = config.get('dropout', 0.1)
    
    # Simulated accuracy surface (peaked around lr=0.003, hidden=128, dropout=0.15)
    lr_score = -50 * (np.log10(lr) - np.log10(0.003))**2
    hidden_score = -0.0001 * (hidden - 128)**2
    dropout_score = -5 * (dropout - 0.15)**2
    
    accuracy = 0.85 + lr_score + hidden_score + dropout_score + np.random.normal(0, 0.01)
    return min(0.99, max(0.5, accuracy))


# Compare search strategies
param_distributions = {
    'lr': {'type': 'loguniform', 'low': 1e-4, 'high': 1e-1},
    'hidden_size': {'type': 'int', 'low': 32, 'high': 512},
    'dropout': {'type': 'uniform', 'low': 0.0, 'high': 0.5},
}

search = HyperparameterSearch()

# Grid search (limited to specific values)
grid = search.grid_search({
    'lr': [0.0001, 0.001, 0.01, 0.1],
    'hidden_size': [64, 128, 256],
    'dropout': [0.0, 0.1, 0.2, 0.3]
})
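# Only the first 20 of 48 grid configs are evaluated to match the trial budget;
# given the enumeration order, that covers lr=0.0001 and part of lr=0.001 only.
grid_results = [(c, simulated_objective(c)) for c in grid[:20]]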

# Random search
random_configs = search.random_search(param_distributions, n_trials=20)
random_results = [(c, simulated_objective(c)) for c in random_configs]

# Bayesian search
bayesian_results = search.bayesian_search(param_distributions, simulated_objective, n_trials=20)

print("Search Strategy Comparison (20 trials each)\n")
for name, results in [('Grid', grid_results), ('Random', random_results), ('Bayesian', bayesian_results)]:
    scores = [r[1] for r in results]
    best = max(results, key=lambda x: x[1])
    print(f"  {name:>10}: best={max(scores):.4f}, mean={np.mean(scores):.4f}, "
          f"std={np.std(scores):.4f}")
    print(f"             best config: lr={best[0]['lr']:.4f}, "
          f"hidden={best[0].get('hidden_size', '?')}, dropout={best[0]['dropout']:.2f}")

In [None]:
# Visualize search strategies
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for ax, (name, results), color in zip(axes,
    [('Grid Search', grid_results), ('Random Search', random_results), ('Bayesian', bayesian_results)],
    ['#e74c3c', '#3498db', '#2ecc71']):
    
    # Best score found over time
    scores = [r[1] for r in results]
    best_so_far = [max(scores[:i+1]) for i in range(len(scores))]
    
    ax.plot(range(1, len(scores)+1), scores, 'o', alpha=0.4, color=color, markersize=6)
    ax.plot(range(1, len(best_so_far)+1), best_so_far, '-', color=color, linewidth=2,
           label='Best so far')
    ax.set_xlabel('Trial', fontsize=11)
    ax.set_ylabel('Accuracy', fontsize=11)
    ax.set_title(f'{name}\n(best={max(scores):.4f})', fontsize=12, fontweight='bold')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_ylim(0.6, 0.95)

plt.suptitle('Hyperparameter Search Convergence', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

---

## 4. Model Registry

A **model registry** is a central store for versioned models with staging capabilities:
- **Version tracking**: Every model gets a version number
- **Staging**: Models move through stages (development -> staging -> production)
- **Metadata**: Each version stores metrics, params, and lineage
- **Rollback**: Easy to revert to a previous version

**F1 analogy:** The model registry is like an F1 team's car specification management system. Every car has a precisely documented spec: aero package v3.2, floor v2.1, suspension v4.0. When a team brings an upgrade to a race weekend, it goes through stages: development (designed in CFD), staging (tested in the wind tunnel and simulator), production (fitted to the race car). If the upgrade doesn't work on track, they can roll back to the previous spec, but only because every version was meticulously documented. Without a registry, a team would lose track of which front wing is on which car and which version actually produced results.
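
Rollback is listed above as a core capability, so here is a minimal sketch of what it could look like. It assumes a per-model record shaped like the one the `ModelRegistry` class below keeps (a `versions` list plus a `production_version` pointer). The function name `rollback_production` and the sample record are hypothetical.

```python
def rollback_production(model_record, to_version):
    """Return `to_version` to production and archive the current one.

    model_record: {'versions': [{'version': int, 'stage': str, ...}, ...],
                   'production_version': int or None}
    """
    versions = model_record["versions"]
    if not 1 <= to_version <= len(versions):
        raise ValueError(f"Unknown version: {to_version}")

    current = model_record["production_version"]
    if current is not None and current != to_version:
        versions[current - 1]["stage"] = "archived"

    versions[to_version - 1]["stage"] = "production"
    model_record["production_version"] = to_version
    return versions[to_version - 1]


# Hypothetical record: v3 is live, v2 was archived earlier
record = {
    "versions": [
        {"version": 1, "stage": "archived"},
        {"version": 2, "stage": "archived"},
        {"version": 3, "stage": "production"},
    ],
    "production_version": 3,
}
rollback_production(record, to_version=2)
print([v["stage"] for v in record["versions"]])
```

Because archived versions keep their metadata, reverting is a pointer change rather than a retraining job.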

In [None]:
class ModelRegistry:
    """Model registry with versioning and staging."""
    
    STAGES = ['development', 'staging', 'production', 'archived']
    
    def __init__(self):
        self.models = {}  # model_name -> {versions: [...], production_version: version or None}
    
    def register_model(self, name, version_info):
        """Register a new model version.
        
        version_info: dict with metrics, params, artifacts, etc.
        """
        if name not in self.models:
            self.models[name] = {'versions': [], 'production_version': None}
        
        version = len(self.models[name]['versions']) + 1
        entry = {
            'version': version,
            'stage': 'development',
            'registered_at': time.time(),
            **version_info
        }
        self.models[name]['versions'].append(entry)
        return version
    
    def promote(self, name, version, to_stage):
        """Move a model version to a new stage."""
        if to_stage not in self.STAGES:
            raise ValueError(f"Invalid stage: {to_stage}")
        
        model = self.models[name]
        if not 1 <= version <= len(model['versions']):
            raise ValueError(f"Unknown version: {version}")
        entry = model['versions'][version - 1]
        current_prod = model['production_version']
        
        if to_stage == 'production':
            # Archive the version being replaced, if it is a different one
            if current_prod is not None and current_prod != version:
                model['versions'][current_prod - 1]['stage'] = 'archived'
            model['production_version'] = version
        elif current_prod == version:
            # The live version is leaving production, so clear the pointer
            model['production_version'] = None
        
        entry['stage'] = to_stage
        return entry
    
    def get_production_model(self, name):
        """Get the current production model version."""
        model = self.models.get(name)
        if model and model['production_version'] is not None:
            return model['versions'][model['production_version'] - 1]
        return None
    
    def list_versions(self, name):
        """List all versions of a model."""
        return self.models.get(name, {}).get('versions', [])
    
    def compare_versions(self, name, v1, v2):
        """Compare two model versions."""
        versions = self.models[name]['versions']
        entry1 = versions[v1 - 1]
        entry2 = versions[v2 - 1]
        
        comparison = {'version': (v1, v2)}
        # Compare metrics
        all_metrics = set(entry1.get('metrics', {}).keys()) | set(entry2.get('metrics', {}).keys())
        for metric in all_metrics:
            val1 = entry1.get('metrics', {}).get(metric)
            val2 = entry2.get('metrics', {}).get(metric)
            comparison[metric] = {
                'v1': val1, 'v2': val2,
                'delta': (val2 - val1) if val1 is not None and val2 is not None else None
            }
        return comparison


# Simulate model lifecycle
registry = ModelRegistry()

# Register several model versions
model_versions = [
    {'metrics': {'accuracy': 0.82, 'latency_ms': 45, 'f1': 0.80}, 'params': {'lr': 0.01, 'epochs': 10}},
    {'metrics': {'accuracy': 0.87, 'latency_ms': 42, 'f1': 0.85}, 'params': {'lr': 0.005, 'epochs': 20}},
    {'metrics': {'accuracy': 0.91, 'latency_ms': 50, 'f1': 0.89}, 'params': {'lr': 0.003, 'epochs': 30}},
    {'metrics': {'accuracy': 0.89, 'latency_ms': 38, 'f1': 0.87}, 'params': {'lr': 0.003, 'epochs': 30, 'distilled': True}},
    {'metrics': {'accuracy': 0.93, 'latency_ms': 48, 'f1': 0.91}, 'params': {'lr': 0.002, 'epochs': 50}},
]

for info in model_versions:
    v = registry.register_model('text_classifier', info)

# Promote versions through stages
registry.promote('text_classifier', 2, 'staging')
registry.promote('text_classifier', 3, 'production')
registry.promote('text_classifier', 5, 'staging')  # Testing v5

print("Model Registry: text_classifier\n")
stage_marker = {'development': '', 'staging': '*', 'production': '>>>', 'archived': '(old)'}
for v in registry.list_versions('text_classifier'):
    marker = stage_marker.get(v['stage'], '')
    print(f"  v{v['version']} [{v['stage']:>12}] {marker:>4} "
          f"acc={v['metrics']['accuracy']:.2f}, f1={v['metrics']['f1']:.2f}, "
          f"latency={v['metrics']['latency_ms']}ms")

prod = registry.get_production_model('text_classifier')
print(f"\nCurrent production: v{prod['version']} (accuracy={prod['metrics']['accuracy']:.2f})")

# Compare v3 (current prod) vs v5 (staging)
comp = registry.compare_versions('text_classifier', 3, 5)
print(f"\nv3 vs v5 comparison:")
for metric in ['accuracy', 'f1', 'latency_ms']:
    if metric in comp:
        d = comp[metric]
        print(f"  {metric}: {d['v1']} -> {d['v2']} (delta={d['delta']:+.2f})")

---

## 5. Feature Store

A **feature store** ensures that features computed for training are identical to those computed at serving time. It solves the "training-serving skew" problem.

### Key Properties
- **Consistent**: Same feature logic for training and serving
- **Reusable**: Features computed once, used by many models
- **Versioned**: Feature definitions tracked over time
- **Fast**: Precomputed features for low-latency serving

**F1 analogy:** The feature store is the pre-computed track characteristics database every F1 team maintains. Before arriving at a circuit, the team pre-computes features for each track: corner radii, elevation changes, surface grip levels, historical weather patterns, overtaking difficulty scores. These features are used by the setup model (to choose car configuration), the strategy model (to predict pit stop windows), and the tire model (to predict degradation) — all consuming the *same* pre-computed track features. Without a feature store, each model might compute "track grip" differently, leading to contradictory recommendations. The same features used in pre-race simulation must be identical to those used on the live pit wall — that's the training-serving consistency guarantee.
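
Training-serving skew is easiest to see with a small example. In the sketch below, two teams implement "average order value" separately and disagree on an edge case, while a single shared definition cannot disagree with itself. All function names and the sample row are hypothetical.

```python
def avg_order_value_training(row):
    # Batch pipeline version: guards against zero purchases
    return row["total_spend"] / max(row["purchases"], 1)


def avg_order_value_serving(row):
    # Separately written serving version: handles the same edge case differently
    return row["total_spend"] / row["purchases"] if row["purchases"] else -1.0


def avg_order_value_shared(row):
    # Single definition that both the training and serving paths would import
    return row["total_spend"] / max(row["purchases"], 1)


new_user = {"total_spend": 0.0, "purchases": 0}

skewed = (avg_order_value_training(new_user), avg_order_value_serving(new_user))
consistent = (avg_order_value_shared(new_user), avg_order_value_shared(new_user))
print(skewed, consistent)
```

The model trained on the first definition never saw a value of `-1.0`, so its predictions for brand-new users at serving time would rest on an input it was never fitted to. Registering the computation once removes that class of bug.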

In [None]:
class FeatureStore:
    """Simple feature store from scratch."""
    
    def __init__(self):
        self.feature_definitions = {}  # name -> {fn, description, version}
        self.offline_store = {}        # (entity_id, feature_name) -> value
        self.feature_sets = {}         # feature_set_name -> [feature_names]
    
    def register_feature(self, name, compute_fn, description='', version=1):
        """Register a feature definition."""
        self.feature_definitions[name] = {
            'fn': compute_fn,
            'description': description,
            'version': version,
            'created_at': time.time()
        }
    
    def register_feature_set(self, name, feature_names):
        """Group features into a feature set for a model."""
        self.feature_sets[name] = feature_names
    
    def compute_and_store(self, entity_id, raw_data):
        """Compute all features for an entity and store them."""
        features = {}
        for name, defn in self.feature_definitions.items():
            value = defn['fn'](raw_data)
            self.offline_store[(entity_id, name)] = value
            features[name] = value
        return features
    
    def get_features(self, entity_id, feature_set=None):
        """Retrieve features for an entity."""
        if feature_set:
            names = self.feature_sets.get(feature_set, [])
        else:
            names = list(self.feature_definitions.keys())
        
        return {name: self.offline_store.get((entity_id, name)) for name in names}
    
    def get_training_data(self, entity_ids, feature_set):
        """Get feature matrix for training."""
        names = self.feature_sets.get(feature_set, [])
        rows = []
        for eid in entity_ids:
            row = [self.offline_store.get((eid, name), 0) for name in names]
            rows.append(row)
        return np.array(rows), names


# Define features for a user behavior model
store = FeatureStore()

# Register feature definitions
store.register_feature('total_purchases', lambda d: d.get('purchases', 0),
                       'Total number of purchases')
store.register_feature('avg_order_value', 
                       lambda d: d.get('total_spend', 0) / max(d.get('purchases', 1), 1),
                       'Average order value')
store.register_feature('days_since_last', lambda d: d.get('days_since_last', 999),
                       'Days since last activity')
store.register_feature('session_count', lambda d: d.get('sessions', 0),
                       'Number of sessions in last 30 days')
store.register_feature('is_premium', lambda d: 1 if d.get('plan') == 'premium' else 0,
                       'Premium user flag')

store.register_feature_set('churn_model', 
                           ['total_purchases', 'avg_order_value', 'days_since_last',
                            'session_count', 'is_premium'])

# Simulate user data
np.random.seed(42)
users = []
for i in range(100):
    user_data = {
        'purchases': np.random.randint(0, 50),
        'total_spend': np.random.uniform(0, 5000),
        'days_since_last': np.random.randint(0, 365),
        'sessions': np.random.randint(0, 100),
        'plan': np.random.choice(['free', 'premium'], p=[0.7, 0.3]),
    }
    store.compute_and_store(f'user_{i}', user_data)
    users.append(f'user_{i}')

# Get training data
X, feature_names = store.get_training_data(users, 'churn_model')

print("Feature Store Summary\n")
print(f"  Registered features: {len(store.feature_definitions)}")
print(f"  Feature sets: {list(store.feature_sets.keys())}")
print(f"  Entities stored: {len(users)}")
print(f"  Training matrix: {X.shape}")
print(f"\n  Features in 'churn_model':")
for name in feature_names:
    vals = X[:, feature_names.index(name)]
    print(f"    {name:>20}: mean={np.mean(vals):.2f}, std={np.std(vals):.2f}")

# Show consistency: same features at training and serving time
print(f"\n  Serving example (user_0):")
serving_features = store.get_features('user_0', 'churn_model')
for k, v in serving_features.items():
    print(f"    {k}: {v}")

---

## 6. Configuration Management & Reproducibility

Reproducibility requires capturing **everything** about an experiment: code, data, config, environment, and random seeds.

**F1 analogy:** Reproducibility in F1 is life-or-death serious. If a car passes crash testing at a specific spec, every component must be traceable to that configuration. The team must be able to answer: "What exact carbon fiber layup, what cure temperature, what adhesive batch was used on the monocoque that passed FIA test #3847?" Configuration management captures all of this. In ML terms: every experiment must record not just the hyperparameters but the exact code version, data snapshot, random seed, and environment — so that any result can be precisely reproduced. A frozen config, like a homologated car spec, prevents accidental changes mid-experiment.
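
Two small mechanisms carry most of the weight here: a stable fingerprint of the configuration and a fixed random seed. The sketch below illustrates both using only the standard library. The helper names `config_fingerprint` and `seeded_draws` are assumptions for this example, and a real training run would also need to seed NumPy and PyTorch.

```python
import hashlib
import json
import random


def config_fingerprint(config):
    """Short, order-independent hash of a JSON-serializable config."""
    canonical = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def seeded_draws(seed, n=3):
    """Draw n pseudo-random numbers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


a = {"lr": 0.003, "seed": 42, "hidden_size": 128}
b = {"hidden_size": 128, "seed": 42, "lr": 0.003}  # same values, different key order
c = {**a, "lr": 0.01}                               # one value changed

print(config_fingerprint(a) == config_fingerprint(b))  # True: key order is ignored
print(config_fingerprint(a) == config_fingerprint(c))  # False: a value differs
print(seeded_draws(42) == seeded_draws(42))            # True: same seed, same draws
```

Sorting keys before hashing means two configs with identical content always map to the same fingerprint, which makes the hash usable for deduplicating runs or caching results.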

In [None]:
class ExperimentConfig:
    """Configuration management for reproducible experiments."""
    
    def __init__(self, config_dict=None):
        # Copy so later changes to the caller's dict cannot bypass freeze()
        self._config = dict(config_dict or {})
        self._frozen = False
    
    def set(self, key, value):
        if self._frozen:
            raise RuntimeError("Config is frozen after experiment starts")
        self._config[key] = value
    
    def get(self, key, default=None):
        return self._config.get(key, default)
    
    def freeze(self):
        """Freeze config to prevent accidental changes during training."""
        self._frozen = True
    
    def hash(self):
        """Unique hash for this configuration (for caching/dedup)."""
        config_str = json.dumps(self._config, sort_keys=True, default=str)
        return hashlib.sha256(config_str.encode()).hexdigest()[:12]
    
    def diff(self, other):
        """Compare with another config."""
        all_keys = set(self._config.keys()) | set(other._config.keys())
        diffs = {}
        for key in sorted(all_keys):
            v1 = self._config.get(key)
            v2 = other._config.get(key)
            if v1 != v2:
                diffs[key] = {'old': v1, 'new': v2}
        return diffs
    
    def to_dict(self):
        return dict(self._config)


class ReproducibleExperiment:
    """Run an experiment with fixed seeds, a frozen config, and tracked params."""
    
    def __init__(self, config, tracker):
        self.config = config
        self.tracker = tracker
    
    def run(self, train_fn):
        """Execute training with full tracking."""
        # Set seeds for reproducibility (NumPy and PyTorch only; code that uses
        # Python's `random` module needs its own seed)
        seed = self.config.get('seed', 42)
        np.random.seed(seed)
        torch.manual_seed(seed)
        
        # Freeze config
        self.config.freeze()
        config_hash = self.config.hash()
        
        # Start tracked run
        run_id = self.tracker.start_run(
            run_name=f'run_{config_hash}',
            tags={'config_hash': config_hash}
        )
        self.tracker.log_params(self.config.to_dict())
        
        # Execute training; on failure, mark the run as failed and re-raise
        try:
            result = train_fn(self.config, self.tracker)
        except Exception:
            self.tracker.log_metric('error', 1)
            self.tracker.end_run('failed')
            raise
        else:
            self.tracker.end_run('completed')
        
        return result


# Example: reproducible experiment
config_v1 = ExperimentConfig({
    'model': 'mlp', 'hidden_size': 128, 'lr': 0.003,
    'dropout': 0.1, 'epochs': 20, 'batch_size': 32, 'seed': 42
})

config_v2 = ExperimentConfig({
    'model': 'mlp', 'hidden_size': 256, 'lr': 0.001,
    'dropout': 0.2, 'epochs': 30, 'batch_size': 32, 'seed': 42
})

print("Configuration Management\n")
print(f"  Config v1 hash: {config_v1.hash()}")
print(f"  Config v2 hash: {config_v2.hash()}")
print(f"\n  Diff v1 -> v2:")
for key, diff in config_v1.diff(config_v2).items():
    print(f"    {key}: {diff['old']} -> {diff['new']}")

# Run reproducible experiments
tracker2 = ExperimentTracker('reproducibility_demo')

def simple_train(config, tracker):
    """Simulated training function."""
    acc = 0.7
    for epoch in range(config.get('epochs')):
        acc += config.get('lr') * (1 + config.get('hidden_size') / 500)
        acc = min(0.99, acc + np.random.normal(0, 0.005))
        tracker.log_metric('accuracy', acc, step=epoch)
    return {'final_accuracy': acc}

for cfg in [config_v1, config_v2]:
    # Run on a copy so the original configs stay unfrozen (freeze is one-way)
    fresh_cfg = ExperimentConfig(cfg.to_dict())
    exp = ReproducibleExperiment(fresh_cfg, tracker2)
    result = exp.run(simple_train)
    print(f"\n  {fresh_cfg.hash()}: accuracy = {result['final_accuracy']:.4f}")

---

## 7. Putting It All Together: ML Pipeline

Let's build a complete ML pipeline that ties all the pieces together: feature store -> experiment tracking -> model registry.

**F1 analogy:** This is the complete race weekend pipeline: pull pre-computed track features from the feature store (track characteristics database) -> run the simulation with tracking (simulator sessions logged with full parameters and metrics) -> register the best setup in the model registry (commit the final car spec for the race). Every step is tracked, versioned, and reproducible — so when the team arrives at the same track next year, they know exactly what worked.

In [None]:
class MLPipeline:
    """End-to-end ML pipeline."""
    
    def __init__(self, feature_store, tracker, registry):
        self.feature_store = feature_store
        self.tracker = tracker
        self.registry = registry
    
    def train_and_register(self, config, entity_ids, feature_set_name, labels):
        """Full pipeline: get features -> train -> track -> register."""
        # Step 1: Get features from feature store
        X, feature_names = self.feature_store.get_training_data(
            entity_ids, feature_set_name
        )
        y = np.array(labels)
        
        # Step 2: Start experiment tracking
        run_id = self.tracker.start_run(run_name=f"pipeline_{config.get('model_name', 'model')}")
        self.tracker.log_params(config)
        self.tracker.log_param('n_features', len(feature_names))
        self.tracker.log_param('n_samples', len(entity_ids))
        
        # Step 3: Train a linear softmax classifier with full-batch updates
        # (for two classes this is equivalent to logistic regression)
        X_tensor = torch.tensor(X, dtype=torch.float32)
        y_tensor = torch.tensor(y, dtype=torch.long)
        
        # Normalize features
        X_mean = X_tensor.mean(dim=0)
        X_std = X_tensor.std(dim=0).clamp(min=1e-8)
        X_norm = (X_tensor - X_mean) / X_std
        
        model = nn.Linear(X_norm.shape[1], 2)
        optimizer = torch.optim.Adam(model.parameters(), lr=config.get('lr', 0.01))
        
        for epoch in range(config.get('epochs', 50)):
            logits = model(X_norm)
            loss = F.cross_entropy(logits, y_tensor)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            acc = (logits.argmax(dim=1) == y_tensor).float().mean().item()
            self.tracker.log_metric('loss', loss.item(), step=epoch)
            self.tracker.log_metric('accuracy', acc, step=epoch)
        
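        # Step 4: Final metrics. The in-loop values were computed before the
        # last optimizer step, so re-evaluate the trained model once.
        # Note: this is training-set accuracy; no held-out split is used here.
        with torch.no_grad():
            final_logits = model(X_norm)
            final_loss = F.cross_entropy(final_logits, y_tensor).item()
            final_acc = (final_logits.argmax(dim=1) == y_tensor).float().mean().item()
        self.tracker.log_metric('final_accuracy', final_acc)
        self.tracker.end_run()
        
        # Step 5: Register model
        version = self.registry.register_model(config.get('model_name', 'model'), {
            'metrics': {'accuracy': final_acc, 'loss': final_loss},
            'params': config,
            'feature_set': feature_set_name,
            'n_samples': len(entity_ids),
        })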
        
        return {
            'run_id': run_id,
            'version': version,
            'accuracy': final_acc,
        }


# Run the full pipeline
pipeline = MLPipeline(store, ExperimentTracker('pipeline'), ModelRegistry())

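# Generate labels (simulated churn: 1 = churned, roughly 30% of users).
# The labels are random and independent of the features, so expect accuracy
# close to the majority-class rate rather than a meaningful signal.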
np.random.seed(42)
labels = (np.random.random(100) > 0.7).astype(int)

configs = [
    {'model_name': 'churn_predictor', 'lr': 0.01, 'epochs': 50},
    {'model_name': 'churn_predictor', 'lr': 0.005, 'epochs': 100},
    {'model_name': 'churn_predictor', 'lr': 0.001, 'epochs': 100},
]

print("ML Pipeline Results\n")
for config in configs:
    result = pipeline.train_and_register(config, users, 'churn_model', labels)
    print(f"  v{result['version']}: lr={config['lr']}, epochs={config['epochs']}, "
          f"accuracy={result['accuracy']:.4f}")

# Show registry
print("\nModel Registry:")
for v in pipeline.registry.list_versions('churn_predictor'):
    print(f"  v{v['version']}: accuracy={v['metrics']['accuracy']:.4f}, "
          f"stage={v['stage']}")

In [None]:
# Visualize pipeline results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Version accuracy progression
ax = axes[0]
versions = pipeline.registry.list_versions('churn_predictor')
v_nums = [v['version'] for v in versions]
v_accs = [v['metrics']['accuracy'] for v in versions]
v_lrs = [v['params']['lr'] for v in versions]

bars = ax.bar(v_nums, v_accs, color=['#3498db', '#2ecc71', '#f39c12'],
             edgecolor='black', alpha=0.8)
for bar, acc, lr in zip(bars, v_accs, v_lrs):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005,
            f'{acc:.3f}\nlr={lr}', ha='center', fontsize=9)

ax.set_xlabel('Model Version', fontsize=11)
ax.set_ylabel('Accuracy', fontsize=11)
ax.set_title('Model Accuracy by Version', fontsize=13, fontweight='bold')
ax.set_ylim(0, 1.05)
ax.grid(True, alpha=0.3, axis='y')

# Pipeline overview diagram
ax = axes[1]
ax.set_xlim(0, 10.5)  # the last box ends at x=10 plus 0.15 padding; avoid clipping it
ax.set_ylim(0, 6)
ax.axis('off')
ax.set_title('ML Pipeline Flow', fontsize=13, fontweight='bold')

steps = [
    (1.5, 3, 'Feature\nStore', '#3498db'),
    (4, 3, 'Train +\nTrack', '#2ecc71'),
    (6.5, 3, 'Model\nRegistry', '#f39c12'),
    (9, 3, 'Deploy', '#e74c3c'),
]

for x, y, label, color in steps:
    box = mpatches.FancyBboxPatch((x - 1, y - 0.8), 2, 1.6, boxstyle="round,pad=0.15",
                                   facecolor=color, edgecolor='black', linewidth=2, alpha=0.85)
    ax.add_patch(box)
    ax.text(x, y, label, ha='center', va='center', fontsize=10,
            fontweight='bold', color='white')

for i in range(len(steps) - 1):
    ax.annotate('', xy=(steps[i+1][0] - 1, steps[i+1][1]),
               xytext=(steps[i][0] + 1, steps[i][1]),
               arrowprops=dict(arrowstyle='->', lw=2, color='gray'))

plt.tight_layout()
plt.show()

---

## Exercises

### Exercise 1: Early Stopping with Tracking

Add early stopping to the ExperimentTracker. Implement a `should_stop(metric, patience)` method that returns True if the metric hasn't improved for `patience` epochs. Use it in a training loop.

**F1 scenario:** During a simulator session, the engineer monitors lap time improvement. If the optimization algorithm hasn't found a faster lap in the last 10 iterations (patience=10), it should stop and move on to testing a different parameter — no point burning CFD hours on a dead end. Implement this early stopping logic.
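
One possible starting point for the patience logic is sketched below. It deliberately takes a plain list of metric values instead of relying on how `ExperimentTracker` stores its history, so the function signature and the `higher_is_better` flag are assumptions for this example. Ties count as "no improvement". The lap times are made-up numbers.

```python
def should_stop(history, patience, higher_is_better=True):
    """Return True if the last `patience` values brought no new best.

    `history` is the list of metric values logged so far, oldest first.
    """
    if len(history) <= patience:
        return False  # not enough steps to judge yet
    pick_best = max if higher_is_better else min
    best_before = pick_best(history[:-patience])
    best_recent = pick_best(history[-patience:])
    if higher_is_better:
        return best_recent <= best_before
    return best_recent >= best_before


# Toy lap times in seconds (lower is better); improvement stalls after 91.8
lap_times = []
stopped_at = None
for step, lap in enumerate([93.1, 92.4, 91.8, 91.9, 92.0, 91.8, 91.9]):
    lap_times.append(lap)
    if should_stop(lap_times, patience=3, higher_is_better=False):
        stopped_at = step
        break

print(f"stopped at step {stopped_at}, best lap {min(lap_times)}")
```

In the exercise itself, the same check would read the metric history from the tracker and be called once per epoch inside the training loop.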

In [None]:
# Exercise 1: Your code here
# Hint: Track the best value and number of epochs since improvement.


### Exercise 2: A/B Test Integration

Extend the ModelRegistry with an `ab_test(name, v1, v2, traffic_split)` method that sets up an A/B test between two model versions. Simulate serving requests and collecting metrics for both versions.

**F1 scenario:** The team has two strategy models: v3 (current production, conservative) and v5 (new, more aggressive). Set up an A/B test where 70% of simulated race scenarios use v3 and 30% use v5. After enough "races," compare which strategy model produces better outcomes. This is how a team would validate a new pit wall model before trusting it with a real Grand Prix.
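
The traffic split itself can be sketched independently of `ModelRegistry`. The example below routes each simulated race to a version by hashing its id, so the same id always lands on the same version. The function name `assign_version`, the per-version success rates, and the number of simulated races are all assumptions chosen for illustration.

```python
import hashlib
import random
from collections import defaultdict


def assign_version(request_id, v1, v2, traffic_split):
    """Route a request to v1 with probability `traffic_split`, else v2.

    Hashing the id (instead of drawing a random number) keeps the
    assignment stable: the same request id always gets the same version.
    """
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return v1 if bucket < traffic_split else v2


# Simulated outcomes: made-up success rates, purely for illustration
rng = random.Random(0)
true_rate = {3: 0.60, 5: 0.65}
outcomes = defaultdict(list)

for i in range(2000):
    version = assign_version(f"race_{i}", v1=3, v2=5, traffic_split=0.7)
    outcomes[version].append(rng.random() < true_rate[version])

for version in sorted(outcomes):
    results = outcomes[version]
    print(f"v{version}: n={len(results)}, success rate={sum(results) / len(results):.3f}")
```

When comparing the two rates, keep the sample sizes in mind: the smaller arm has a noisier estimate, so a small gap between versions may not be meaningful.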

In [None]:
# Exercise 2: Your code here


### Exercise 3: Feature Store with Time Travel

Add "point-in-time" feature retrieval to the FeatureStore. Store features with timestamps, and add a `get_features_at(entity_id, timestamp)` method that returns the feature values as they were at a given point in time. This prevents data leakage in time-series ML.

**F1 scenario:** When building a model to predict tire degradation at lap 30, you must only use features available *before* lap 30 — not the updated track temperature from lap 35 (that's the future!). Implement time-travel feature retrieval so the feature store returns track conditions, tire state, and fuel load exactly as they were at any given lap number. This prevents the classic "leaking future data into training" mistake.
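
A minimal sketch of point-in-time lookup is shown below, using lap numbers as timestamps. `PointInTimeStore` is a hypothetical stand-in rather than the notebook's `FeatureStore`, and it assumes that writes for each entity arrive in non-decreasing timestamp order. The feature values are made up.

```python
import bisect
from collections import defaultdict


class PointInTimeStore:
    """Keep every feature write and look values up 'as of' a timestamp."""

    def __init__(self):
        self._times = defaultdict(list)   # entity_id -> [timestamps]
        self._values = defaultdict(list)  # entity_id -> [feature dicts]

    def write(self, entity_id, timestamp, features):
        self._times[entity_id].append(timestamp)
        self._values[entity_id].append(dict(features))

    def get_features_at(self, entity_id, timestamp):
        """Latest features written at or before `timestamp`, else None."""
        times = self._times.get(entity_id, [])
        idx = bisect.bisect_right(times, timestamp)
        if idx == 0:
            return None  # nothing was known yet at that time
        return self._values[entity_id][idx - 1]


pit_store = PointInTimeStore()
pit_store.write("car_1", 10, {"track_temp": 38.0, "tire_age": 10})
pit_store.write("car_1", 25, {"track_temp": 41.5, "tire_age": 25})
pit_store.write("car_1", 35, {"track_temp": 44.0, "tire_age": 3})

print(pit_store.get_features_at("car_1", 30))  # lap-25 values, not lap-35
print(pit_store.get_features_at("car_1", 5))   # None: no data before lap 10
```

Whether the boundary is inclusive is a design choice: `bisect_right` returns values written at or before the timestamp, while `bisect_left` would return only values written strictly before it.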

In [None]:
# Exercise 3: Your code here


---

## Summary

### Key Concepts

| Concept | What It Does | F1 Parallel |
|---------|-------------|-------------|
| **Experiment tracking** | Logs params, metrics, and artifacts for every run | The simulation farm logbook — every virtual test session recorded with full setup details and results |
| **Hyperparameter search** | Systematically explores the configuration space | Setup optimization — grid, random, or Bayesian search over wing angles, ride heights, and spring rates |
| **Model registry** | Versions models and manages stage transitions (dev -> staging -> production -> archived) | Car spec management: tracking which aero package version is homologated and race-ready |
| **Feature store** | Ensures training-serving consistency and enables feature reuse | Pre-computed track characteristics database shared by setup, strategy, and tire models |
| **Configuration management** | Freezes and hashes configs so runs can be reproduced, compared, and deduplicated | FIA homologation: every component traceable to its exact specification |
| **ML pipelines** | Ties all components into a systematic workflow | The complete race weekend pipeline from pre-race simulation to pit wall deployment |

### The Systems Mindset

The difference between ML as a hobby and ML as engineering is infrastructure — just as the difference between a hobby racer and an F1 team is the factory. Without experiment tracking, you can't learn from past experiments — like running simulator sessions without saving the results. Without a model registry, you can't safely deploy — like fitting an untested front wing without knowing its spec. Without a feature store, your training and serving features will diverge — like using different track data in simulation vs. on the pit wall. These systems are what make ML teams productive and their models reliable, just as the factory infrastructure is what makes F1 teams consistently competitive.

---

## Next Steps

We've covered the full ML lifecycle from training infrastructure to deployment — the entire factory behind the race team. The final notebook in our curriculum explores an exciting frontier: **Notebook 28: Multimodal AI** — models that understand images, text, and their connections. In F1 terms, this is about building systems that can simultaneously process onboard camera footage, team radio messages, and telemetry numbers to build a unified understanding of the race.