# Checkpoint 10: Training at Scale - Full Pipeline

In this final checkpoint, we bring everything together into a **production-ready training pipeline**. You'll learn how to train RL agents at scale with proper monitoring, checkpointing, and evaluation.

## Learning Objectives
- Configure and run large-scale PPO training
- Monitor training with TensorBoard
- Implement custom callbacks for detailed logging
- Evaluate and visualize trained policies
- Understand common failure modes and debugging strategies
- Apply the pipeline to Mario Kart

In [None]:
# Install required packages (tqdm and rich are needed for progress_bar=True in model.learn)
!pip install "gymnasium[classic-control,box2d]" stable-baselines3 tensorboard torch numpy matplotlib tqdm rich

In [None]:
# GPU Check with Memory Info

import torch

print("=" * 50)
print("Hardware Configuration")
print("=" * 50)

if torch.cuda.is_available():
    print(f"CUDA Available: True")
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"Device Name: {torch.cuda.get_device_name(0)}")
    
    # Memory info
    props = torch.cuda.get_device_properties(0)
    total_memory = props.total_memory / 1e9
    print(f"Total GPU Memory: {total_memory:.2f} GB")
    
    # Current memory usage
    allocated = torch.cuda.memory_allocated(0) / 1e9
    cached = torch.cuda.memory_reserved(0) / 1e9
    print(f"Currently Allocated: {allocated:.2f} GB")
    print(f"Currently Cached: {cached:.2f} GB")
    
    device = "cuda"
elif torch.backends.mps.is_available():
    print(f"MPS (Apple Silicon) Available: True")
    device = "mps"
else:
    print(f"CUDA Available: False")
    print(f"Training will use CPU (slower but works)")
    device = "cpu"

print(f"\nSelected Device: {device}")
print("=" * 50)

## TensorBoard Metrics to Monitor

TensorBoard is essential for understanding training progress. Here are the key metrics:

| Metric | Description | What to Look For | Warning Signs |
|--------|-------------|------------------|---------------|
| `rollout/ep_rew_mean` | Mean episode reward | Steady increase over time | Flat, declining, or oscillating |
| `rollout/ep_len_mean` | Mean episode length | Task-dependent | Sudden changes may indicate problems |
| `train/policy_gradient_loss` | PPO clipped surrogate (policy) loss | Stable, relatively small | Large spikes, divergence |
| `train/value_loss` | Value network loss | Decreasing over time | Increasing or exploding |
| `train/entropy_loss` | Negative policy entropy (SB3 logs `-entropy`) | Gradual rise toward 0 as the policy becomes more deterministic | Reaches ~0 too fast = premature convergence |
| `train/approx_kl` | KL divergence | Small (< 0.02) | Large values = unstable updates |
| `train/clip_fraction` | PPO clipping rate | 0.1 - 0.3 typical | Very high = aggressive updates |
| `train/learning_rate` | Current LR | As scheduled | N/A |
| `time/fps` | Training speed | Stable | Decreasing = memory issues |
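
The rules of thumb in the table can be turned into a quick automated check. The sketch below is a hypothetical helper (`flag_warnings` is not part of Stable-Baselines3): it takes a dictionary holding the latest scalar values, however you obtained them, and returns the matching warning signs. The cut-offs are the table's guidelines, not hard limits.

```python
def flag_warnings(metrics: dict[str, float]) -> list[str]:
    """Return human-readable warnings for one snapshot of PPO training metrics.

    Hypothetical helper: the keys mirror the TensorBoard tags in the table,
    and the thresholds are the table's rules of thumb.
    """
    warnings_found = []

    approx_kl = metrics.get("train/approx_kl")
    if approx_kl is not None and approx_kl > 0.02:
        warnings_found.append(
            f"approx_kl={approx_kl:.3f} > 0.02: updates may be unstable"
        )

    clip_fraction = metrics.get("train/clip_fraction")
    if clip_fraction is not None and clip_fraction > 0.3:
        warnings_found.append(
            f"clip_fraction={clip_fraction:.2f} > 0.3: updates may be too aggressive"
        )

    return warnings_found


# Made-up snapshots for illustration
healthy = {"train/approx_kl": 0.008, "train/clip_fraction": 0.15}
unstable = {"train/approx_kl": 0.05, "train/clip_fraction": 0.45}

print(flag_warnings(healthy))
for message in flag_warnings(unstable):
    print(message)
```

A single noisy update can trip these checks, so it is usually more informative to look at several consecutive updates before changing hyperparameters.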

## Common Failure Modes

Understanding what can go wrong helps you debug faster.

### 1. Reward Hacking
**Symptoms**: High reward but poor qualitative behavior  
**Causes**: Reward function doesn't capture true objective  
**Solutions**: Redesign reward, add constraints, use auxiliary metrics

### 2. Catastrophic Forgetting
**Symptoms**: Performance drops after initial improvement  
**Causes**: Learning rate too high, environment distribution shift  
**Solutions**: Lower or anneal the learning rate, use a replay buffer (off-policy algorithms only; PPO has none), curriculum learning

### 3. Policy Collapse
**Symptoms**: Agent always takes same action, entropy drops to zero  
**Causes**: Entropy coefficient too low, reward too sparse  
**Solutions**: Increase entropy bonus, add reward shaping
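
To make "entropy drops to zero" concrete, the sketch below computes the Shannon entropy of a discrete action distribution. The probabilities are made up for illustration; four actions are used because LunarLander has four discrete actions. A uniform policy has the maximum entropy, ln 4 (about 1.386 nats), while a policy that almost always picks one action has entropy close to zero.

```python
import math


def categorical_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]        # exploring: every action equally likely
collapsed = [0.997, 0.001, 0.001, 0.001]  # collapsed: one action dominates

print(f"uniform:   {categorical_entropy(uniform):.3f} nats")
print(f"collapsed: {categorical_entropy(collapsed):.3f} nats")
```

Stable-Baselines3 logs the negative of the mean entropy as `train/entropy_loss`, so a collapsing policy shows up there as a value climbing toward zero.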

### 4. Exploration Failure
**Symptoms**: Reward plateaus early, agent stuck in local optimum  
**Causes**: Insufficient exploration, environment too hard  
**Solutions**: Increase entropy, use intrinsic motivation, curriculum

### 5. Hyperparameter Sensitivity
**Symptoms**: Results vary wildly between runs  
**Causes**: Unstable learning dynamics, poor hyperparameters  
**Solutions**: Use multiple seeds, hyperparameter search, robust defaults

## Debugging Strategies

When training isn't working, try these approaches:

### 1. Visualize Episodes
- Record videos of the agent's behavior
- Look for unexpected patterns
- Compare early vs late training

### 2. Check Reward Distribution
- Plot reward histograms
- Identify reward spikes or anomalies
- Verify reward scale is appropriate
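
As a complement to plotting, a numeric summary can flag anomalous episodes automatically. The following is a minimal sketch using only the standard library; `summarize_rewards` and its `z_threshold` default are illustrative choices, and the reward values are made up.

```python
import statistics


def summarize_rewards(rewards: list[float], z_threshold: float = 3.0) -> dict:
    """Summarize episode rewards and list outliers by z-score.

    `z_threshold` is an illustrative cut-off, not a standard value.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    outliers = [
        r for r in rewards
        if std > 0 and abs(r - mean) / std > z_threshold
    ]
    return {
        "mean": mean,
        "std": std,
        "min": min(rewards),
        "max": max(rewards),
        "outliers": outliers,
    }


# Made-up episode rewards containing one anomalous spike
rewards = [-120.0, -95.0, -110.0, -101.0, -98.0, -105.0, -99.0, -102.0,
           -97.0, -108.0, -100.0, -104.0, -96.0, -103.0, -107.0, -94.0,
           -109.0, -106.0, -93.0, 900.0]

summary = summarize_rewards(rewards)
print(f"mean={summary['mean']:.1f}, std={summary['std']:.1f}")
print(f"outliers: {summary['outliers']}")
```

An episode flagged this way is worth replaying: a spike may be a lucky episode, a bug in the reward function, or the first sign of reward hacking.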

### 3. Monitor Gradients
- Watch for gradient explosions/vanishing
- Log gradient norms to TensorBoard (SB3 does not record them by default) and check them for spikes

### 4. Compare to Baselines
- Random policy performance
- Simple heuristic policies
- Published results (if available)
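
A random-policy baseline takes only a few lines. The sketch below assumes an environment that follows the Gymnasium API (`reset`, `step`, `action_space.sample`); the function name `random_policy_baseline` is a hypothetical helper, not a library function.

```python
def random_policy_baseline(env, n_episodes: int = 10, seed: int = 0) -> float:
    """Mean episode return of a uniformly random policy.

    Works with any environment that follows the Gymnasium API.
    """
    env.action_space.seed(seed)  # make the sampled actions reproducible
    returns = []
    for episode in range(n_episodes):
        obs, info = env.reset(seed=seed + episode)
        done = False
        total = 0.0
        while not done:
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return sum(returns) / len(returns)


# Example usage (assumes gymnasium with Box2D is installed):
#   import gymnasium as gym
#   env = gym.make("LunarLander-v3")
#   print(random_policy_baseline(env, n_episodes=20))
#   env.close()
```

If a trained agent does not clearly beat this number, the problem is more likely in the environment, reward, or observations than in the hyperparameters.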

### 5. Ablation Studies
- Remove components one at a time
- Identify which parts are essential
- Simplify until something works

### 6. Sanity Checks
- Can the agent solve a trivial version?
- Are observations/actions normalized correctly?
- Is the environment deterministic for debugging?
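
The determinism check can be automated by rolling out the same seeded action sequence twice and comparing the results. This is a sketch under two assumptions: the environment follows the Gymnasium API, and observations are flat numeric vectors (as in LunarLander). Both function names are hypothetical.

```python
def rollout_fingerprint(make_env, seed: int, n_steps: int = 50) -> list:
    """Roll out seeded random actions and record (observation, reward) pairs.

    `make_env` is any zero-argument callable returning a Gymnasium-style env.
    Assumes flat numeric observations.
    """
    env = make_env()
    env.action_space.seed(seed)
    obs, info = env.reset(seed=seed)
    trace = []
    for _ in range(n_steps):
        action = env.action_space.sample()
        obs, reward, terminated, truncated, info = env.step(action)
        trace.append((tuple(float(x) for x in obs), float(reward)))
        if terminated or truncated:
            obs, info = env.reset()
    env.close()
    return trace


def is_deterministic(make_env, seed: int = 0, n_steps: int = 50) -> bool:
    """True if two rollouts with the same seed produce identical traces."""
    first = rollout_fingerprint(make_env, seed, n_steps)
    second = rollout_fingerprint(make_env, seed, n_steps)
    return first == second


# Example usage (assumes gymnasium with Box2D is installed):
#   import gymnasium as gym
#   print(is_deterministic(lambda: gym.make("LunarLander-v3")))
```

If this returns `False`, some source of randomness ignores the seed, and bugs will be harder to reproduce until it is found.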

In [None]:
# Configuration Dictionary

CONFIG = {
    # Environment
    'env_name': 'LunarLander-v3',
    
    # Training
    'total_timesteps': 500_000,
    'n_envs': 8,  # Number of parallel environments
    
    # Evaluation
    'eval_freq': 10_000,  # Evaluate every N timesteps
    'n_eval_episodes': 10,
    
    # Checkpointing
    'save_freq': 50_000,  # Save checkpoint every N timesteps
    
    # Directories
    'log_dir': './logs/ppo_lunarlander',
    'model_dir': './models/ppo_lunarlander',
    'tensorboard_log': './tensorboard/ppo_lunarlander',
    
    # PPO Hyperparameters
    'learning_rate': 3e-4,
    'n_steps': 2048,
    'batch_size': 64,
    'n_epochs': 10,
    'gamma': 0.99,
    'gae_lambda': 0.95,
    'clip_range': 0.2,
    'ent_coef': 0.01,
}

print("Training Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
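
It helps to work out what these numbers imply before training. PPO in Stable-Baselines3 collects `n_steps` transitions from each environment, then runs `n_epochs` passes over that rollout in minibatches of `batch_size`. The arithmetic below copies the values from `CONFIG` so it runs on its own; because training proceeds in whole rollouts, the run is expected to slightly overshoot `total_timesteps`.

```python
import math

# Values copied from CONFIG above
total_timesteps = 500_000
n_envs = 8
n_steps = 2048
batch_size = 64
n_epochs = 10

rollout_size = n_steps * n_envs                     # transitions collected per update
minibatches_per_epoch = rollout_size // batch_size  # gradient steps per pass over the rollout
gradient_steps_per_update = minibatches_per_epoch * n_epochs
n_updates = math.ceil(total_timesteps / rollout_size)
actual_timesteps = n_updates * rollout_size         # training ends on a full rollout

print(f"Rollout size:              {rollout_size:,}")               # 16,384
print(f"Minibatches per epoch:     {minibatches_per_epoch:,}")      # 256
print(f"Gradient steps per update: {gradient_steps_per_update:,}")  # 2,560
print(f"Policy updates:            {n_updates:,}")                  # 31
print(f"Timesteps actually run:    {actual_timesteps:,}")           # 507,904
```

With only about 31 policy updates in the whole run, TensorBoard curves for the `train/` metrics will have few points; lowering `n_steps` gives more frequent updates at the cost of shorter rollouts.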

In [None]:
# Imports

import os
import time
import warnings
warnings.filterwarnings('ignore')

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv, VecMonitor
from stable_baselines3.common.callbacks import (
    BaseCallback,
    CheckpointCallback,
    EvalCallback,
    CallbackList
)
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

print("All imports successful!")
print(f"Gymnasium version: {gym.__version__}")

In [None]:
# DetailedLoggingCallback: Custom Callback for Episode Tracking

class DetailedLoggingCallback(BaseCallback):
    """
    Custom callback for detailed episode logging.
    
    Tracks:
    - Episode rewards and lengths
    - Running statistics
    - Periodic progress reports
    """
    
    def __init__(self, verbose: int = 0, log_freq: int = 100):
        """
        Args:
            verbose: Verbosity level
            log_freq: Print stats every N episodes
        """
        super().__init__(verbose)
        self.log_freq = log_freq
        
        # Episode tracking
        self.episode_rewards = []
        self.episode_lengths = []
        self.episode_count = 0
        
        # Timing
        self.start_time = None
    
    def _on_training_start(self) -> None:
        """Called at the start of training."""
        self.start_time = time.time()
        print(f"\nTraining started at {time.strftime('%H:%M:%S')}")
        print(f"Target: {self.locals['total_timesteps']:,} timesteps")
        print("-" * 50)
    
    def _on_step(self) -> bool:
        """
        Called after each environment step.
        
        Returns:
            True to continue training, False to stop
        """
        # Check for completed episodes
        for info in self.locals.get('infos', []):
            if 'episode' in info:
                ep_reward = info['episode']['r']
                ep_length = info['episode']['l']
                
                self.episode_rewards.append(ep_reward)
                self.episode_lengths.append(ep_length)
                self.episode_count += 1
                
                # Periodic logging
                if self.episode_count % self.log_freq == 0:
                    self._log_progress()
        
        return True
    
    def _log_progress(self) -> None:
        """Print progress statistics."""
        recent_rewards = self.episode_rewards[-self.log_freq:]
        recent_lengths = self.episode_lengths[-self.log_freq:]
        
        avg_reward = np.mean(recent_rewards)
        std_reward = np.std(recent_rewards)
        avg_length = np.mean(recent_lengths)
        
        elapsed = time.time() - self.start_time
        eps_per_sec = self.episode_count / elapsed
        
        print(f"Episode {self.episode_count:>6d} | "
              f"Reward: {avg_reward:>8.2f} +/- {std_reward:<6.2f} | "
              f"Length: {avg_length:>6.1f} | "
              f"EPS: {eps_per_sec:.1f}")
    
    def _on_training_end(self) -> None:
        """Called at the end of training."""
        elapsed = time.time() - self.start_time
        print("-" * 50)
        print(f"Training completed!")
        print(f"Total episodes: {self.episode_count:,}")
        print(f"Total time: {elapsed/60:.2f} minutes")
        if self.episode_rewards:  # avoid a NaN mean if no episode finished
            print(f"Final avg reward (last 100): {np.mean(self.episode_rewards[-100:]):.2f}")

print("DetailedLoggingCallback defined.")

In [None]:
# Create Vectorized Training Environment with VecMonitor

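def make_env(env_name: str, rank: int = 0, seed: int = 0):
    """
    Create a function that returns an environment.
    
    Episode statistics are added later by wrapping the vectorized env in VecMonitor.
    
    Args:
        env_name: Gymnasium environment ID
        rank: Unique identifier for parallel envs
        seed: Base seed; each env is seeded with seed + rank
    
    Returns:
        Callable that creates the environment
    """
    def _init():
        env = gym.make(env_name)
        # Give each copy its own seed so runs are reproducible
        # and the parallel envs do not share a random stream
        env.reset(seed=seed + rank)
        return env
    return _init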

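# Create vectorized environment
# Note: SubprocVecEnv runs each env in its own process (more memory). It pays off when
# env.step() is expensive; for lightweight envs the inter-process overhead can make it slower.
# DummyVecEnv steps all envs sequentially in a single process (simpler for debugging).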

print(f"Creating {CONFIG['n_envs']} parallel environments...")

# Use SubprocVecEnv for speed (comment out and use DummyVecEnv for debugging)
try:
    train_envs = SubprocVecEnv(
        [make_env(CONFIG['env_name'], i) for i in range(CONFIG['n_envs'])]
    )
    print("Using SubprocVecEnv (multiprocessing)")
except Exception as e:
    print(f"SubprocVecEnv failed ({e}), falling back to DummyVecEnv")
    train_envs = DummyVecEnv(
        [make_env(CONFIG['env_name'], i) for i in range(CONFIG['n_envs'])]
    )

# Wrap with VecMonitor for automatic episode statistics
train_envs = VecMonitor(train_envs)

print(f"Training environment ready!")
print(f"  Observation space: {train_envs.observation_space}")
print(f"  Action space: {train_envs.action_space}")

In [None]:
# Create Evaluation Environment

# Separate environment for evaluation (single env with rendering)
eval_env = gym.make(CONFIG['env_name'], render_mode='rgb_array')
eval_env = Monitor(eval_env)

print(f"Evaluation environment created: {CONFIG['env_name']}")
print(f"  Render mode: rgb_array (for visualization)")

In [None]:
# Setup Callbacks

# Create directories
os.makedirs(CONFIG['log_dir'], exist_ok=True)
os.makedirs(CONFIG['model_dir'], exist_ok=True)
os.makedirs(CONFIG['tensorboard_log'], exist_ok=True)

print("Setting up callbacks...")

# 1. Checkpoint Callback: Save model periodically
checkpoint_callback = CheckpointCallback(
    save_freq=CONFIG['save_freq'] // CONFIG['n_envs'],  # Adjust for parallel envs
    save_path=CONFIG['model_dir'],
    name_prefix='ppo_checkpoint',
    verbose=1
)
print(f"  - Checkpoint: Every {CONFIG['save_freq']:,} steps -> {CONFIG['model_dir']}")

# 2. Evaluation Callback: Evaluate and save best model
eval_callback = EvalCallback(
    eval_env,
    best_model_save_path=CONFIG['model_dir'],
    log_path=CONFIG['log_dir'],
    eval_freq=CONFIG['eval_freq'] // CONFIG['n_envs'],
    n_eval_episodes=CONFIG['n_eval_episodes'],
    deterministic=True,
    verbose=1
)
print(f"  - Evaluation: Every {CONFIG['eval_freq']:,} steps ({CONFIG['n_eval_episodes']} episodes)")

# 3. Custom Logging Callback
logging_callback = DetailedLoggingCallback(verbose=1, log_freq=100)
print(f"  - Custom logging: Every 100 episodes")

# Combine all callbacks
callbacks = CallbackList([checkpoint_callback, eval_callback, logging_callback])

print("\nCallbacks ready!")
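
The `// CONFIG['n_envs']` adjustment above deserves a second look. `CheckpointCallback` and `EvalCallback` count calls to the callback, and each call corresponds to one step in every parallel environment, that is, `n_envs` timesteps. The short calculation below (with a hypothetical helper name) shows the conversion and what happens if it is forgotten.

```python
def per_call_frequency(freq_timesteps: int, n_envs: int) -> int:
    """Convert a frequency in total timesteps into a number of callback calls."""
    return max(freq_timesteps // n_envs, 1)


n_envs = 8
save_calls = per_call_frequency(50_000, n_envs)
eval_calls = per_call_frequency(10_000, n_envs)

print(f"save_freq={save_calls} calls -> every {save_calls * n_envs:,} timesteps")  # 6250 -> 50,000
print(f"eval_freq={eval_calls} calls -> every {eval_calls * n_envs:,} timesteps")  # 1250 -> 10,000

# Without the adjustment, save_freq=50_000 calls would mean:
print(f"unadjusted: a checkpoint every {50_000 * n_envs:,} timesteps")             # 400,000
```

With 500,000 total timesteps, the unadjusted value would produce a single checkpoint instead of ten.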

In [None]:
# Create PPO Model with TensorBoard Logging

model = PPO(
    policy='MlpPolicy',
    env=train_envs,
    verbose=1,
    tensorboard_log=CONFIG['tensorboard_log'],
    
    # Hyperparameters
    learning_rate=CONFIG['learning_rate'],
    n_steps=CONFIG['n_steps'],
    batch_size=CONFIG['batch_size'],
    n_epochs=CONFIG['n_epochs'],
    gamma=CONFIG['gamma'],
    gae_lambda=CONFIG['gae_lambda'],
    clip_range=CONFIG['clip_range'],
    ent_coef=CONFIG['ent_coef'],
    
    # Device
    device='auto',  # Automatically select GPU if available
)

# Model summary
total_params = sum(p.numel() for p in model.policy.parameters())
trainable_params = sum(p.numel() for p in model.policy.parameters() if p.requires_grad)

print("\nPPO Model Created")
print("=" * 50)
print(f"Policy: MlpPolicy")
print(f"Total Parameters: {total_params:,}")
print(f"Trainable Parameters: {trainable_params:,}")
print(f"Device: {model.device}")
print("=" * 50)

In [None]:
# TensorBoard Integration
# Run this cell to start TensorBoard inline

%load_ext tensorboard
%tensorboard --logdir {CONFIG['tensorboard_log']}

In [None]:
# Training Cell: 500k Timesteps with Progress Bar

print("\n" + "=" * 60)
print("STARTING TRAINING")
print(f"Target: {CONFIG['total_timesteps']:,} timesteps")
print(f"Estimated time: 5-15 minutes (depending on hardware)")
print("=" * 60 + "\n")

start_time = time.time()

# Train the model
model.learn(
    total_timesteps=CONFIG['total_timesteps'],
    callback=callbacks,
    progress_bar=True,
    tb_log_name='PPO'
)

training_time = time.time() - start_time

print("\n" + "=" * 60)
print("TRAINING COMPLETE")
print(f"Total time: {training_time/60:.2f} minutes")
print(f"Steps per second: {CONFIG['total_timesteps']/training_time:.0f}")
print("=" * 60)

# Save final model
final_model_path = f"{CONFIG['model_dir']}/ppo_final"
model.save(final_model_path)
print(f"\nFinal model saved to: {final_model_path}.zip")

In [None]:
# Training Analysis Plots

# Get episode data from our custom callback
episode_rewards = logging_callback.episode_rewards
episode_lengths = logging_callback.episode_lengths

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Episode Rewards Over Time
ax1 = axes[0, 0]
ax1.plot(episode_rewards, alpha=0.3, color='blue', label='Episode Reward')

# Rolling average
window = min(100, len(episode_rewards) // 10) or 1
if len(episode_rewards) >= window:
    rolling_avg = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
    ax1.plot(range(window-1, len(episode_rewards)), rolling_avg, 
             color='red', linewidth=2, label=f'Rolling Avg ({window})')

ax1.set_xlabel('Episode')
ax1.set_ylabel('Reward')
ax1.set_title('Episode Rewards During Training')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Reward Distribution Histogram
ax2 = axes[0, 1]
ax2.hist(episode_rewards, bins=50, color='green', alpha=0.7, edgecolor='black')
ax2.axvline(x=np.mean(episode_rewards), color='red', linestyle='--', 
            label=f'Mean: {np.mean(episode_rewards):.1f}')
ax2.axvline(x=np.median(episode_rewards), color='orange', linestyle='--',
            label=f'Median: {np.median(episode_rewards):.1f}')
ax2.set_xlabel('Reward')
ax2.set_ylabel('Frequency')
ax2.set_title('Reward Distribution')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Episode Lengths
ax3 = axes[1, 0]
ax3.plot(episode_lengths, alpha=0.3, color='purple', label='Episode Length')

if len(episode_lengths) >= window:
    rolling_len = np.convolve(episode_lengths, np.ones(window)/window, mode='valid')
    ax3.plot(range(window-1, len(episode_lengths)), rolling_len,
             color='darkred', linewidth=2, label=f'Rolling Avg ({window})')

ax3.set_xlabel('Episode')
ax3.set_ylabel('Steps')
ax3.set_title('Episode Lengths During Training')
ax3.legend()
ax3.grid(True, alpha=0.3)

# Plot 4: Learning Curve (by training phase)
ax4 = axes[1, 1]
n_phases = 5
phase_size = len(episode_rewards) // n_phases
if phase_size > 0:
    phase_means = []
    phase_stds = []
    phases = []
    
    for i in range(n_phases):
        start_idx = i * phase_size
        end_idx = start_idx + phase_size
        phase_data = episode_rewards[start_idx:end_idx]
        phase_means.append(np.mean(phase_data))
        phase_stds.append(np.std(phase_data))
        phases.append(f"Phase {i+1}\n({start_idx}-{end_idx})")
    
    x_pos = np.arange(len(phases))
    bars = ax4.bar(x_pos, phase_means, yerr=phase_stds, capsize=5,
                   color=['#ff9999', '#ffcc99', '#ffff99', '#99ff99', '#99ccff'],
                   edgecolor='black')
    ax4.set_xticks(x_pos)
    ax4.set_xticklabels(phases, fontsize=9)
    ax4.set_ylabel('Mean Reward')
    ax4.set_title('Learning Progress by Phase')
    ax4.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(f"{CONFIG['log_dir']}/training_analysis.png", dpi=150)
plt.show()

# Print summary statistics
print("\nTraining Statistics:")
print(f"  Total Episodes: {len(episode_rewards)}")
print(f"  Mean Reward: {np.mean(episode_rewards):.2f} +/- {np.std(episode_rewards):.2f}")
print(f"  Max Reward: {np.max(episode_rewards):.2f}")
print(f"  Min Reward: {np.min(episode_rewards):.2f}")
print(f"  Mean Episode Length: {np.mean(episode_lengths):.1f}")

In [None]:
# Final Evaluation: 20 Episodes with Best Tracking

print("Loading best model for evaluation...")

# Load the best model (saved by EvalCallback)
best_model_path = f"{CONFIG['model_dir']}/best_model"
if os.path.exists(f"{best_model_path}.zip"):
    best_model = PPO.load(best_model_path)
    print(f"Loaded: {best_model_path}")
else:
    print("Best model not found, using final model")
    best_model = model

# Evaluation parameters
n_eval_episodes = 20
eval_rewards = []
eval_lengths = []
best_reward = -float('inf')
best_frames = None
best_episode = None

print(f"\nRunning {n_eval_episodes} evaluation episodes...")
print("-" * 50)

for ep in range(n_eval_episodes):
    obs, info = eval_env.reset()
    episode_reward = 0
    episode_length = 0
    frames = []
    done = False
    
    while not done:
        # Get action from policy (deterministic for evaluation)
        action, _ = best_model.predict(obs, deterministic=True)
        
        # Step environment
        obs, reward, terminated, truncated, info = eval_env.step(action)
        
        episode_reward += reward
        episode_length += 1
        
        # Capture frame for visualization
        frame = eval_env.render()
        if frame is not None:
            frames.append(frame)
        
        done = terminated or truncated
    
    eval_rewards.append(episode_reward)
    eval_lengths.append(episode_length)
    
    # Track best episode. Only its frames are kept: storing the frames
    # of every episode is never used later and can exhaust memory.
    if episode_reward > best_reward:
        best_reward = episode_reward
        best_frames = frames
        best_episode = ep + 1
    
    print(f"Episode {ep+1:2d}: Reward = {episode_reward:>8.2f}, Length = {episode_length:>4d}")

print("-" * 50)
print(f"\nFinal Evaluation Results:")
print(f"  Mean Reward: {np.mean(eval_rewards):.2f} +/- {np.std(eval_rewards):.2f}")
print(f"  Max Reward: {np.max(eval_rewards):.2f} (Episode {np.argmax(eval_rewards)+1})")
print(f"  Min Reward: {np.min(eval_rewards):.2f}")
print(f"  Mean Length: {np.mean(eval_lengths):.1f}")
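
With only 20 episodes, the mean reward is itself a noisy estimate. The sketch below adds a standard error, a rough 95% interval (normal approximation, which is only approximate for small samples), and a success rate. The threshold of 200 is LunarLander's "solved" score; `summarize_evaluation` and the sample rewards are hypothetical.

```python
import math
import statistics


def summarize_evaluation(rewards: list[float], solved_threshold: float = 200.0) -> dict:
    """Mean, spread, rough 95% interval, and success rate for evaluation episodes."""
    n = len(rewards)
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards) if n > 1 else 0.0  # sample standard deviation
    sem = std / math.sqrt(n)                           # standard error of the mean
    return {
        "mean": mean,
        "std": std,
        "sem": sem,
        "ci95": (mean - 1.96 * sem, mean + 1.96 * sem),
        "success_rate": sum(r >= solved_threshold for r in rewards) / n,
    }


# Made-up evaluation rewards for illustration
sample_rewards = [250.0, 180.0, 270.0, 210.0]
result = summarize_evaluation(sample_rewards)

print(f"Mean reward:  {result['mean']:.1f} (SEM {result['sem']:.1f})")
print(f"Rough 95% CI: {result['ci95'][0]:.1f} to {result['ci95'][1]:.1f}")
print(f"Success rate: {result['success_rate']:.0%}")
```

In the notebook, the same function could be applied to `eval_rewards`. Two models whose intervals overlap heavily should not be ranked on the basis of their means alone.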

In [None]:
# Visualize Best Episode Frames

if best_frames and len(best_frames) > 0:
    print(f"Visualizing best episode (Episode {best_episode}, Reward: {best_reward:.2f})")
    
    # Select frames to display (10 evenly spaced)
    n_frames_to_show = 10
    step_indices = np.linspace(0, len(best_frames)-1, n_frames_to_show, dtype=int)
    
    fig, axes = plt.subplots(2, 5, figsize=(20, 8))
    
    for ax, idx in zip(axes.flat, step_indices):
        ax.imshow(best_frames[idx])
        ax.set_title(f"Step {idx}", fontsize=10)
        ax.axis('off')
    
    plt.suptitle(f"Best Episode Visualization (Reward: {best_reward:.2f})", fontsize=14)
    plt.tight_layout()
    plt.savefig(f"{CONFIG['log_dir']}/best_episode_frames.png", dpi=150)
    plt.show()
else:
    print("No frames captured for visualization.")

In [None]:
# Optional: Model Checkpoint Comparison

print("Comparing model checkpoints...")

# Find all checkpoints
checkpoint_files = [
    f for f in os.listdir(CONFIG['model_dir']) 
    if f.startswith('ppo_checkpoint') and f.endswith('.zip')
]
checkpoint_files.sort(key=lambda x: int(x.split('_')[-2]))  # Sort by step number

if len(checkpoint_files) > 0:
    print(f"Found {len(checkpoint_files)} checkpoints")
    
    # Evaluate at most 5 checkpoints, evenly spaced across training (for speed)
    selected = np.unique(np.linspace(0, len(checkpoint_files) - 1, 5, dtype=int))
    checkpoints_to_eval = [checkpoint_files[i] for i in selected]
    checkpoint_results = {}
    
    for cp_file in checkpoints_to_eval:
        cp_path = os.path.join(CONFIG['model_dir'], cp_file.replace('.zip', ''))
        cp_model = PPO.load(cp_path)
        
        # Quick evaluation (5 episodes)
        mean_reward, std_reward = evaluate_policy(
            cp_model, eval_env, n_eval_episodes=5, deterministic=True
        )
        
        checkpoint_results[cp_file] = (mean_reward, std_reward)
        print(f"  {cp_file}: {mean_reward:.2f} +/- {std_reward:.2f}")
    
    # Also evaluate final model
    mean_reward, std_reward = evaluate_policy(
        model, eval_env, n_eval_episodes=5, deterministic=True
    )
    checkpoint_results['ppo_final.zip'] = (mean_reward, std_reward)
    print(f"  ppo_final.zip: {mean_reward:.2f} +/- {std_reward:.2f}")
    
    # Plot checkpoint comparison
    if checkpoint_results:
        fig, ax = plt.subplots(figsize=(12, 6))
        
        names = list(checkpoint_results.keys())
        means = [v[0] for v in checkpoint_results.values()]
        stds = [v[1] for v in checkpoint_results.values()]
        
        colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(names)))
        bars = ax.bar(range(len(names)), means, yerr=stds, capsize=5,
                      color=colors, edgecolor='black')
        
        ax.set_xticks(range(len(names)))
        ax.set_xticklabels([n.replace('ppo_checkpoint_', 'CP ').replace('_steps.zip', '') 
                           for n in names], rotation=45, ha='right')
        ax.set_ylabel('Mean Reward')
        ax.set_title('Model Performance Across Training')
        ax.grid(True, alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.savefig(f"{CONFIG['log_dir']}/checkpoint_comparison.png", dpi=150)
        plt.show()
else:
    print("No checkpoints found for comparison.")
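
The cell above sorts checkpoints by splitting filenames on underscores, which relies on the `<prefix>_<steps>_steps.zip` naming and would raise an error on any other file that matches the prefix. A regular expression is a slightly more defensive alternative. This is a sketch with made-up filenames; `checkpoint_step` is a hypothetical helper.

```python
import re

# Matches the trailing "_<digits>_steps.zip" part of a checkpoint filename
_STEP_PATTERN = re.compile(r"_(\d+)_steps\.zip$")


def checkpoint_step(filename: str):
    """Return the timestep encoded in a checkpoint filename, or None if absent."""
    match = _STEP_PATTERN.search(filename)
    return int(match.group(1)) if match else None


# Made-up directory listing for illustration
files = [
    "ppo_checkpoint_100000_steps.zip",
    "best_model.zip",
    "ppo_checkpoint_50000_steps.zip",
    "ppo_final.zip",
]

# Keep only files that encode a step count, sorted numerically
checkpoints = sorted(
    (f for f in files if checkpoint_step(f) is not None),
    key=checkpoint_step,
)
print(checkpoints)
```

Sorting on the parsed integer matters: a plain string sort would place `100000` before `50000`.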

## Next Steps: Applying to Mario Kart

Now that you have a working training pipeline, here's how to adapt it for Mario Kart 64:

### 1. Swap the Environment

Replace `LunarLander-v3` with your custom Mario Kart Gymnasium wrapper:

```python
CONFIG['env_name'] = 'MarioKart64-v0'  # Your registered environment
```

### 2. Use CnnPolicy for Image Observations

Mario Kart uses screen frames as observations, so switch to CNN:

```python
model = PPO(
    'CnnPolicy',  # Instead of 'MlpPolicy'
    env=train_envs,
    ...
)
```

### 3. Add Reward Wrapper

Use the reward functions from Notebook 9:

```python
from reward_functions import MarioKartRewardV3
from gymnasium import Wrapper

class RewardWrapper(Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.reward_fn = MarioKartRewardV3()
    
    def _extract_state(self, info):
        # Placeholder (assumption): pass `info` through unchanged. Adapt this to
        # build whatever state MarioKartRewardV3.compute() expects.
        return info
    
    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        state = self._extract_state(info)
        custom_reward = self.reward_fn.compute(state)  # replaces the env's own reward
        return obs, custom_reward, terminated, truncated, info
    
    def reset(self, **kwargs):
        self.reward_fn.reset()
        return self.env.reset(**kwargs)
```

### 4. Adjust Hyperparameters

Mario Kart likely needs different settings:

```python
MARIO_KART_CONFIG = {
    'env_name': 'MarioKart64-v0',
    'policy': 'CnnPolicy',
    'total_timesteps': 5_000_000,  # More training needed
    'n_envs': 4,  # Fewer envs (higher memory per env)
    'learning_rate': 2.5e-4,
    'n_steps': 1024,  # Shorter rollouts per env (smaller rollout buffer)
    'ent_coef': 0.01,  # Encourage exploration
}
```
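
One memory cost that is easy to estimate in advance is the storage for image observations in the rollout buffer: `n_steps * n_envs` observations are held at once. The sketch below assumes a hypothetical observation shape of four stacked 84x84 grayscale frames (the real shape depends on your wrapper) and shows both 1-byte and 4-byte storage, since the element type used internally may vary.

```python
import math


def rollout_observation_bytes(n_steps: int, n_envs: int,
                              obs_shape: tuple, bytes_per_value: int) -> int:
    """Bytes needed to hold one rollout's worth of observations."""
    return n_steps * n_envs * math.prod(obs_shape) * bytes_per_value


obs_shape = (4, 84, 84)  # hypothetical: 4 stacked 84x84 grayscale frames

for label, itemsize in [("uint8", 1), ("float32", 4)]:
    size = rollout_observation_bytes(1024, 4, obs_shape, itemsize)
    print(f"{label:>8}: {size / 1e9:.2f} GB")
```

This covers only the observation buffer. Emulator processes, the CNN itself, and its activations add to the total, which is why the configuration above uses fewer environments than the LunarLander run.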

### 5. Increase Training Time

Complex visual environments need more experience:
- LunarLander: ~500K steps
- Atari games: ~10M steps
- Mario Kart: ~1-10M steps (estimate)

### 6. Monitor Carefully

Use TensorBoard to watch for:
- Reward hacking (high reward but bad driving)
- Policy collapse (entropy dropping to zero)
- Slow learning (try different reward shaping)

## Quiz

Test your understanding of training at scale:

### Question 1
Why use vectorized environments (multiple parallel envs)?

<details>
<summary>Click for answer</summary>

Vectorized environments provide several benefits:
- **Faster data collection**: Collect more experience per unit time
- **Better gradient estimates**: More diverse samples reduce variance
- **Hardware utilization**: Use multiple CPU cores efficiently
- **Batch processing**: Neural network forward passes are batched
</details>

### Question 2
What does VecMonitor track and why is it important?

<details>
<summary>Click for answer</summary>

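VecMonitor automatically tracks, for every finished episode:
- Total reward (`r`) and length (`l`)
- Elapsed wall-clock time (`t`)

It reports these in the `info['episode']` entry of the step on which the episode ends.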

It's important because this data is needed for:
- TensorBoard logging
- Evaluation callbacks
- Progress monitoring
</details>

### Question 3
When should you use SubprocVecEnv vs DummyVecEnv?

<details>
<summary>Click for answer</summary>

**SubprocVecEnv** (multiprocessing):
- Faster for CPU-bound environments
- Uses more memory (separate process per env)
- Best for production training

**DummyVecEnv** (single process):
- Easier to debug (shared memory space)
- Lower memory usage
- Often faster for lightweight environments, where inter-process overhead dominates
- Better for development and testing
</details>

### Question 4
What TensorBoard metrics might indicate reward hacking?

<details>
<summary>Click for answer</summary>

Signs of reward hacking:
- High `ep_rew_mean` but poor qualitative behavior (need to visualize)
- Very short episode lengths despite high rewards
- Reward increases but auxiliary metrics (e.g., actual game score) don't improve
- Sudden reward spikes without corresponding policy improvement
</details>

### Question 5
Why save checkpoints during training?

<details>
<summary>Click for answer</summary>

Checkpoints are valuable because:
- **Recovery**: Resume training after crashes or interruptions
- **Model selection**: Choose the best model (not always the final one)
- **Analysis**: Study how the policy evolves over training
- **Debugging**: Identify when problems started
- **Reproducibility**: Return to any point in training
</details>

In [None]:
# Cleanup: Close Environments and Print Summary

print("Cleaning up...")

# Close environments
train_envs.close()
eval_env.close()

print("Environments closed.")

# Summary of saved files
print("\n" + "=" * 60)
print("TRAINING COMPLETE - SUMMARY")
print("=" * 60)

print("\nSaved Files:")
print(f"  Models:")
print(f"    - Final model: {CONFIG['model_dir']}/ppo_final.zip")
print(f"    - Best model:  {CONFIG['model_dir']}/best_model.zip")

# List checkpoints
checkpoints = [f for f in os.listdir(CONFIG['model_dir']) if 'checkpoint' in f]
print(f"    - Checkpoints: {len(checkpoints)} files")

print(f"\n  Logs:")
print(f"    - TensorBoard: {CONFIG['tensorboard_log']}/")
print(f"    - Evaluations: {CONFIG['log_dir']}/evaluations.npz")
print(f"    - Plots:       {CONFIG['log_dir']}/*.png")

print("\nTo view TensorBoard logs later:")
print(f"  tensorboard --logdir {CONFIG['tensorboard_log']}")

print("\nTo load and use the trained model:")
print(f"  model = PPO.load('{CONFIG['model_dir']}/best_model')")
print(f"  action, _ = model.predict(observation, deterministic=True)")

print("\n" + "=" * 60)
print("Congratulations! You've completed the full training pipeline.")
print("=" * 60)

## Summary

In this notebook, you learned:

### 1. Training Infrastructure
- Configuration management with dictionaries
- Vectorized environments for parallel training
- VecMonitor for automatic statistics

### 2. Monitoring and Logging
- TensorBoard integration
- Custom callbacks for detailed tracking
- Understanding key metrics

### 3. Evaluation and Visualization
- Checkpoint callbacks for model saving
- Evaluation callbacks for best model selection
- Episode visualization

### 4. Debugging and Diagnosis
- Common failure modes
- Debugging strategies
- Checkpoint comparison

### 5. Next Steps
- Adapting for Mario Kart (CnnPolicy, reward wrappers)
- Scaling up training time
- Hyperparameter tuning

You now have all the tools to train RL agents at scale!