# PPO with Stable-Baselines3: Production-Ready RL

Now that you understand PPO, let's use a production-ready implementation!

## What You'll Learn

By the end of this notebook, you'll understand:
- The "chef's kitchen" analogy: when to use a library vs. writing from scratch
- Stable-Baselines3 architecture and design
- Training PPO with SB3 on various environments
- Customizing hyperparameters
- Saving, loading, and evaluating models
- Monitoring training with callbacks

**Prerequisites:** Notebook 2 (PPO From Scratch)

**Time:** ~25 minutes

---
## The Big Picture: The Chef's Kitchen Analogy

```
    ┌────────────────────────────────────────────────────────────────┐
    │          THE CHEF'S KITCHEN ANALOGY                            │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  FROM SCRATCH (Previous notebook):                            │
    │    Like cooking at home with raw ingredients                  │
    │    • You understand every step                               │
    │    • Full control over everything                            │
    │    • Time-consuming                                          │
    │    • May have subtle bugs                                    │
    │                                                                │
    │  STABLE-BASELINES3 (This notebook):                          │
    │    Like a professional kitchen with prep done                │
    │    • Battle-tested implementation                            │
    │    • Optimized for performance                               │
    │    • Rich features (logging, callbacks, saving)              │
    │    • Used by researchers and industry                        │
    │                                                                │
    │  WHEN TO USE WHAT:                                            │
    │    From scratch: Learning, custom algorithms, research       │
    │    SB3: Production, reproducibility, standard benchmarks     │
    │                                                                │
    │  ANALOGY:                                                     │
    │    Knowing how to cook helps you use a restaurant kitchen   │
    │    better - you understand what the tools are doing!        │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Circle

try:
    import gymnasium as gym
except ImportError:
    import gym

# Check if Stable-Baselines3 is available
try:
    from stable_baselines3 import PPO, A2C
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.evaluation import evaluate_policy
    from stable_baselines3.common.callbacks import EvalCallback, BaseCallback
    from stable_baselines3.common.monitor import Monitor
    from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
    import torch
    SB3_AVAILABLE = True
    print("✓ Stable-Baselines3 is installed!")
    print(f"  Version info: Using PyTorch {torch.__version__}")
except ImportError:
    SB3_AVAILABLE = False
    print("✗ Stable-Baselines3 not installed.")
    print("\nTo install, run:")
    print('  pip install "stable-baselines3[extra]"')
    print("\nThis will install SB3 with extra dependencies including tensorboard.")

In [None]:
# Visualize Stable-Baselines3 architecture

fig, ax = plt.subplots(figsize=(14, 10))
ax.set_xlim(0, 14)
ax.set_ylim(0, 12)
ax.axis('off')
ax.set_title('Stable-Baselines3 Architecture', fontsize=16, fontweight='bold')

# User code
user_box = FancyBboxPatch((1, 9.5), 12, 1.5, boxstyle="round,pad=0.1",
                           facecolor='#e3f2fd', edgecolor='#1976d2', linewidth=3)
ax.add_patch(user_box)
ax.text(7, 10.25, 'YOUR CODE', ha='center', fontsize=12, fontweight='bold', color='#1976d2')
ax.text(7, 9.8, 'model = PPO("MlpPolicy", env) → model.learn() → model.predict()', 
        ha='center', fontsize=10)

# SB3 components
components = [
    ('Algorithms\n(PPO, A2C, SAC...)', 2, 7, '#c8e6c9', '#388e3c'),
    ('Policies\n(MLP, CNN)', 6, 7, '#fff3e0', '#f57c00'),
    ('Vectorized Envs\n(DummyVec, Subproc)', 10, 7, '#e1bee7', '#7b1fa2'),
]

for text, x, y, fcolor, ecolor in components:
    box = FancyBboxPatch((x, y), 3, 1.8, boxstyle="round,pad=0.1",
                          facecolor=fcolor, edgecolor=ecolor, linewidth=2)
    ax.add_patch(box)
    ax.text(x + 1.5, y + 0.9, text, ha='center', va='center', fontsize=9)

# Lower components
lower_components = [
    ('Rollout Buffer\n(stores transitions)', 2, 4.5, '#bbdefb', '#1976d2'),
    ('Callbacks\n(logging, eval)', 6, 4.5, '#ffcdd2', '#d32f2f'),
    ('Utils\n(save, load, evaluate)', 10, 4.5, '#dcedc8', '#689f38'),
]

for text, x, y, fcolor, ecolor in lower_components:
    box = FancyBboxPatch((x, y), 3, 1.8, boxstyle="round,pad=0.1",
                          facecolor=fcolor, edgecolor=ecolor, linewidth=2)
    ax.add_patch(box)
    ax.text(x + 1.5, y + 0.9, text, ha='center', va='center', fontsize=9)

# PyTorch + Gym foundation
foundation_box = FancyBboxPatch((1, 2), 12, 1.5, boxstyle="round,pad=0.1",
                                 facecolor='#fafafa', edgecolor='#666', linewidth=2)
ax.add_patch(foundation_box)
ax.text(7, 2.75, 'Built on: PyTorch + Gymnasium', ha='center', fontsize=11)

# Arrows
ax.annotate('', xy=(7, 9.4), xytext=(7, 8.9),
            arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

plt.tight_layout()
plt.show()

print("\nSTABLE-BASELINES3 FEATURES:")
print("  • Clean, modular implementation")
print("  • Support for PPO, A2C, SAC, TD3, DQN...")
print("  • Vectorized environments for parallelism")
print("  • TensorBoard integration")
print("  • Easy save/load functionality")

---
## Quick Start: Training PPO in 3 Lines

```
    ┌────────────────────────────────────────────────────────────────┐
    │              PPO IN 3 LINES!                                   │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  from stable_baselines3 import PPO                           │
    │                                                                │
    │  model = PPO('MlpPolicy', 'CartPole-v1')  # Create           │
    │  model.learn(total_timesteps=10000)       # Train            │
    │  model.save('ppo_cartpole')               # Save             │
    │                                                                │
    │  That's it! Production-ready PPO.                            │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
if SB3_AVAILABLE:
    print("QUICK START: PPO IN 3 LINES")
    print("="*60)
    
    # Line 1: Create the model
    model = PPO('MlpPolicy', 'CartPole-v1', verbose=0)
    print("\n1. Created PPO model with MlpPolicy")
    
    # Line 2: Train
    print("\n2. Training for 10,000 timesteps...")
    model.learn(total_timesteps=10_000)
    print("   Done!")
    
    # Line 3: Save (we'll skip actual saving for demo)
    print("\n3. Model ready to save with: model.save('ppo_cartpole')")
    
    # Bonus: Evaluate
    # Wrap in Monitor so evaluate_policy reads true episode returns and does not warn
    env = Monitor(gym.make('CartPole-v1'))
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"\nEvaluation: {mean_reward:.1f} ± {std_reward:.1f} reward")
    env.close()
    
    print("\n" + "="*60)
else:
    print("Install Stable-Baselines3 to run this example.")

---
## Understanding the PPO Hyperparameters

```
    ┌────────────────────────────────────────────────────────────────┐
    │              PPO HYPERPARAMETERS EXPLAINED                     │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  CORE PPO PARAMETERS:                                         │
    │    clip_range=0.2      Clipping ε (how much policy can change)│
    │    n_steps=2048        Steps before update (rollout length)   │
    │    batch_size=64       Minibatch size for SGD                 │
    │    n_epochs=10         Epochs per update                      │
    │                                                                │
    │  LEARNING PARAMETERS:                                         │
    │    learning_rate=3e-4  Adam learning rate                     │
    │    gamma=0.99          Discount factor                        │
    │    gae_lambda=0.95     GAE lambda (bias-variance tradeoff)    │
    │                                                                │
    │  LOSS COEFFICIENTS:                                           │
    │    ent_coef=0.0        Entropy bonus (exploration)            │
    │    vf_coef=0.5         Value function loss weight             │
    │                                                                │
    │  ENVIRONMENT:                                                 │
    │    n_envs              Parallel envs (make_vec_env argument)  │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
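To see how `n_steps`, `batch_size`, and `n_epochs` interact, it helps to count the work done in one PPO update. The sketch below is plain arithmetic, not an SB3 API: `ppo_update_budget` is a hypothetical helper name, and it assumes each epoch makes one full pass over the rollout buffer in minibatches of `batch_size`.

```python
import math

def ppo_update_budget(n_steps, n_envs, batch_size, n_epochs):
    """Count samples and gradient steps in one PPO update (illustrative helper)."""
    rollout_size = n_steps * n_envs                      # transitions collected before each update
    minibatches = math.ceil(rollout_size / batch_size)   # minibatches per pass (last one may be smaller)
    gradient_steps = minibatches * n_epochs              # optimizer steps per update
    return rollout_size, minibatches, gradient_steps

# Defaults from the box above, with a single environment
rollout_size, minibatches, gradient_steps = ppo_update_budget(
    n_steps=2048, n_envs=1, batch_size=64, n_epochs=10
)
print(rollout_size, minibatches, gradient_steps)  # 2048 32 320
```

With `n_steps=512` and four environments the rollout is again 2048 transitions, which is why the vectorized example later in this notebook lowers `n_steps`.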

In [None]:
if SB3_AVAILABLE:
    print("PPO HYPERPARAMETERS IN STABLE-BASELINES3")
    print("="*60)
    
    # Create PPO with custom hyperparameters
    custom_model = PPO(
        policy='MlpPolicy',
        env='CartPole-v1',
        
        # Core PPO
        learning_rate=3e-4,      # Adam learning rate
        n_steps=2048,            # Steps per rollout
        batch_size=64,           # Minibatch size
        n_epochs=10,             # Epochs per update
        clip_range=0.2,          # PPO clipping epsilon
        
        # Advantage estimation
        gamma=0.99,              # Discount factor
        gae_lambda=0.95,         # GAE lambda
        
        # Loss coefficients
        ent_coef=0.01,           # Entropy coefficient
        vf_coef=0.5,             # Value function coefficient
        max_grad_norm=0.5,       # Gradient clipping
        
        # Misc
        verbose=0,               # 0: none, 1: training info
        seed=42,                 # Random seed
    )
    
    print("\nModel created with custom hyperparameters:")
    print(f"  learning_rate: {custom_model.learning_rate}")
    print(f"  n_steps: {custom_model.n_steps}")
    print(f"  batch_size: {custom_model.batch_size}")
    print(f"  n_epochs: {custom_model.n_epochs}")
    print(f"  clip_range: {custom_model.clip_range}")
    print(f"  gamma: {custom_model.gamma}")
    print(f"  gae_lambda: {custom_model.gae_lambda}")
    print(f"  ent_coef: {custom_model.ent_coef}")
    print(f"  vf_coef: {custom_model.vf_coef}")
    
    print("\n" + "="*60)
else:
    print("Install Stable-Baselines3 to run this example.")

In [None]:
# Visualize hyperparameter effects

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Top left: Clip range effect
ax1 = axes[0, 0]
clip_ranges = [0.1, 0.2, 0.3, 0.4]
stability = [0.95, 0.9, 0.7, 0.4]  # Made up for illustration
learning_speed = [0.5, 0.8, 0.9, 0.95]

x = np.arange(len(clip_ranges))
width = 0.35
ax1.bar(x - width/2, stability, width, label='Stability', color='#4caf50')
ax1.bar(x + width/2, learning_speed, width, label='Learning Speed', color='#2196f3')
ax1.set_xticks(x)
ax1.set_xticklabels([f'ε={c}' for c in clip_ranges])
ax1.set_ylabel('Score (normalized)', fontsize=10)
ax1.set_title('Clip Range (ε) Effect', fontsize=12, fontweight='bold')
ax1.legend()
ax1.axvline(x=1, color='red', linestyle='--', alpha=0.5)
ax1.text(1, 1.05, 'Default', ha='center', fontsize=9, color='red')
ax1.grid(True, alpha=0.3, axis='y')

# Top right: n_steps effect
ax2 = axes[0, 1]
n_steps_vals = [128, 256, 512, 1024, 2048, 4096]
variance = [0.9, 0.75, 0.6, 0.45, 0.35, 0.3]
memory = [0.1, 0.15, 0.25, 0.4, 0.6, 0.85]

ax2.plot(n_steps_vals, variance, 'o-', linewidth=2, color='#f44336', label='Gradient Variance')
ax2.plot(n_steps_vals, memory, 's-', linewidth=2, color='#ff9800', label='Memory Usage')
ax2.axvline(x=2048, color='green', linestyle='--', alpha=0.5)
ax2.text(2048, 1.0, 'Default', ha='center', fontsize=9, color='green')
ax2.set_xlabel('n_steps', fontsize=10)
ax2.set_ylabel('Score (normalized)', fontsize=10)
ax2.set_title('Rollout Length (n_steps) Effect', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Bottom left: Entropy coefficient
ax3 = axes[1, 0]
ent_coefs = [0.0, 0.001, 0.01, 0.05, 0.1]
exploration = [0.2, 0.4, 0.7, 0.85, 0.95]
exploitation = [0.95, 0.9, 0.75, 0.5, 0.3]

ax3.fill_between(ent_coefs, 0, exploration, alpha=0.3, color='#2196f3', label='Exploration')
ax3.fill_between(ent_coefs, 0, exploitation, alpha=0.3, color='#4caf50', label='Exploitation')
ax3.plot(ent_coefs, exploration, 'b-', linewidth=2)
ax3.plot(ent_coefs, exploitation, 'g-', linewidth=2)
ax3.axvline(x=0.01, color='red', linestyle='--', alpha=0.5)
ax3.text(0.01, 1.0, 'Good\nbalance', ha='center', fontsize=9, color='red')
ax3.set_xlabel('Entropy Coefficient', fontsize=10)
ax3.set_ylabel('Behavior', fontsize=10)
ax3.set_title('Entropy Coefficient (ent_coef) Effect', fontsize=12, fontweight='bold')
ax3.legend()
ax3.grid(True, alpha=0.3)

# Bottom right: Learning rate
ax4 = axes[1, 1]
epochs = np.arange(100)
np.random.seed(42)

# Simulated learning curves
lr_low = 100 * (1 - np.exp(-epochs/80)) + np.random.randn(100) * 5
lr_good = 100 * (1 - np.exp(-epochs/30)) + np.random.randn(100) * 3
lr_high = 100 * (1 - np.exp(-epochs/10)) * np.exp(-epochs/50) + np.random.randn(100) * 10

ax4.plot(epochs, lr_low, alpha=0.7, linewidth=2, label='lr=1e-5 (too slow)', color='#2196f3')
ax4.plot(epochs, lr_good, alpha=0.7, linewidth=2, label='lr=3e-4 (good)', color='#4caf50')
ax4.plot(epochs, lr_high, alpha=0.7, linewidth=2, label='lr=1e-2 (unstable)', color='#f44336')
ax4.set_xlabel('Epoch', fontsize=10)
ax4.set_ylabel('Performance', fontsize=10)
ax4.set_title('Learning Rate Effect', fontsize=12, fontweight='bold')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nHYPERPARAMETER TUNING TIPS:")
print("  • Start with defaults - they're well-tuned!")
print("  • clip_range: 0.1-0.3 (smaller = more stable)")
print("  • n_steps: 2048 works well for most tasks")
print("  • ent_coef: Increase if agent gets stuck")
print("  • learning_rate: 3e-4 is a good starting point")

---
## Training with Vectorized Environments

```
    ┌────────────────────────────────────────────────────────────────┐
    │              VECTORIZED ENVIRONMENTS                           │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  WHY VECTORIZE?                                               │
    │    • Collect data from N envs in parallel                    │
    │    • Up to N× faster data collection (SubprocVecEnv)         │
    │    • Better GPU utilization                                  │
    │    • More diverse data per update                            │
    │                                                                │
    │  SB3 OPTIONS:                                                 │
    │                                                                │
    │  DummyVecEnv:                                                 │
    │    • Single process, sequential                              │
    │    • Simple, no overhead                                     │
    │    • Good for fast environments                              │
    │                                                                │
    │  SubprocVecEnv:                                               │
    │    • Multiple processes, true parallelism                    │
    │    • Good for slow environments                              │
    │    • Some communication overhead                             │
    │                                                                │
    │  make_vec_env(): Easy wrapper for either!                    │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
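The backend can be chosen explicitly through the `vec_env_cls` argument of `make_vec_env`. The following is a minimal sketch: `build_cartpole_envs` is a hypothetical helper name, and the example only instantiates the `DummyVecEnv` variant so that it runs anywhere.

```python
try:
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
    SB3_OK = True
except ImportError:
    SB3_OK = False

def build_cartpole_envs(n_envs, use_subprocesses=False):
    """Hypothetical helper: pick the vectorization backend explicitly."""
    vec_env_cls = SubprocVecEnv if use_subprocesses else DummyVecEnv
    return make_vec_env("CartPole-v1", n_envs=n_envs, vec_env_cls=vec_env_cls)

num_envs = None
if SB3_OK:
    vec_env = build_cartpole_envs(n_envs=4)  # sequential, single process
    obs = vec_env.reset()                    # VecEnv.reset() returns only the observations
    num_envs = vec_env.num_envs
    print(obs.shape)                         # (4, 4): one row of 4 features per environment
    vec_env.close()
```

Passing `use_subprocesses=True` would start one worker process per environment. In a standalone script that usually means putting the training code under `if __name__ == "__main__":`, and for an environment as cheap as CartPole the inter-process overhead can outweigh the gain.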

In [None]:
if SB3_AVAILABLE:
    print("TRAINING WITH VECTORIZED ENVIRONMENTS")
    print("="*60)
    
    # Create 4 parallel environments
    n_envs = 4
    vec_env = make_vec_env('CartPole-v1', n_envs=n_envs)
    
    print(f"\nCreated {n_envs} parallel environments")
    print(f"  Environment: CartPole-v1")
    print(f"  Observation space: {vec_env.observation_space}")
    print(f"  Action space: {vec_env.action_space}")
    
    # Create PPO with vectorized env
    model = PPO(
        'MlpPolicy',
        vec_env,
        verbose=1,
        n_steps=512,      # Steps per env before update
        batch_size=64,
        n_epochs=10,
    )
    
    # Total steps per update = n_steps × n_envs = 512 × 4 = 2048
    print(f"\nSteps per update: {model.n_steps} × {n_envs} = {model.n_steps * n_envs}")
    
    # Train
    print("\nTraining for 20,000 timesteps...")
    model.learn(total_timesteps=20_000)
    
    # Evaluate
    mean_reward, std_reward = evaluate_policy(model, vec_env, n_eval_episodes=10)
    print(f"\nEvaluation: {mean_reward:.1f} ± {std_reward:.1f} reward")
    
    vec_env.close()
    print("\n" + "="*60)
else:
    print("Install Stable-Baselines3 to run this example.")

---
## Monitoring Training with Callbacks

```
    ┌────────────────────────────────────────────────────────────────┐
    │              CALLBACKS: MONITOR AND CONTROL TRAINING           │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  CALLBACKS LET YOU:                                           │
    │    • Log custom metrics                                       │
    │    • Save best models automatically                          │
    │    • Early stopping                                          │
    │    • Visualize during training                               │
    │                                                                │
    │  BUILT-IN CALLBACKS:                                          │
    │    EvalCallback:       Periodic evaluation + save best       │
    │    CheckpointCallback: Save model every N steps              │
    │    StopTrainingOnRewardThreshold: Early stopping             │
    │                                                                │
    │  CUSTOM CALLBACKS:                                            │
    │    Inherit from BaseCallback and override:                   │
    │    • _on_step(): Called every step                           │
    │    • _on_rollout_start/end(): Rollout boundaries            │
    │    • _on_training_start/end(): Training boundaries          │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
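Before writing a custom callback, here is a hedged sketch of the built-in route: `EvalCallback` evaluates periodically, and `StopTrainingOnRewardThreshold` (attached through `callback_on_new_best`) ends training once a target is reached. The threshold of 475, the evaluation frequency, and the short training budget are illustrative choices, not tuned values.

```python
try:
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import (
        EvalCallback,
        StopTrainingOnRewardThreshold,
    )
    from stable_baselines3.common.env_util import make_vec_env
    SB3_OK = True
except ImportError:
    SB3_OK = False

best_mean_reward = None
if SB3_OK:
    train_env = make_vec_env("CartPole-v1", n_envs=4)
    eval_env = make_vec_env("CartPole-v1", n_envs=1)  # separate env so evaluation does not disturb training

    # Stop once an evaluation reaches the (illustrative) target mean reward
    stop_on_target = StopTrainingOnRewardThreshold(reward_threshold=475, verbose=0)
    eval_callback = EvalCallback(
        eval_env,
        callback_on_new_best=stop_on_target,  # triggered whenever a new best mean reward is found
        eval_freq=500,                        # counted in callback calls, i.e. 500 × 4 env steps here
        n_eval_episodes=5,
        verbose=0,
    )

    model = PPO("MlpPolicy", train_env, n_steps=256, verbose=0, seed=0)
    model.learn(total_timesteps=8_000, callback=eval_callback)

    best_mean_reward = eval_callback.best_mean_reward
    print(f"Best evaluation mean reward: {best_mean_reward:.1f}")
    train_env.close()
    eval_env.close()
```

Note that `eval_freq` counts calls to the callback, so with several parallel environments the spacing in environment steps is `eval_freq × n_envs`.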

In [None]:
if SB3_AVAILABLE:
    # Custom callback example
    class RewardLoggerCallback(BaseCallback):
        """
        Custom callback to track rewards during training.
        """
        
        def __init__(self, verbose=0):
            super().__init__(verbose)
            self.episode_rewards = []
            self.current_rewards = None
        
        def _on_training_start(self):
            """Called at the start of training."""
            self.current_rewards = np.zeros(self.training_env.num_envs)
        
        def _on_step(self) -> bool:
            """
            Called after every step.
            
            Returns:
                bool: If False, stop training.
            """
            # Track rewards
            self.current_rewards += self.locals['rewards']
            
            # Check for episode ends
            for i, done in enumerate(self.locals['dones']):
                if done:
                    self.episode_rewards.append(self.current_rewards[i])
                    self.current_rewards[i] = 0
            
            return True  # Continue training
        
        def _on_training_end(self):
            """Called at the end of training."""
            if self.verbose > 0:
                print(f"\nTraining complete! {len(self.episode_rewards)} episodes")
    
    print("TRAINING WITH CUSTOM CALLBACK")
    print("="*60)
    
    # Create environment and callback
    env = make_vec_env('CartPole-v1', n_envs=4)
    reward_callback = RewardLoggerCallback(verbose=1)
    
    # Create and train model with callback
    model = PPO('MlpPolicy', env, verbose=0)
    model.learn(total_timesteps=20_000, callback=reward_callback)
    
    # Plot rewards from callback
    if len(reward_callback.episode_rewards) > 0:
        fig, ax = plt.subplots(figsize=(10, 5))
        ax.plot(reward_callback.episode_rewards, alpha=0.3, color='blue', label='Episode Reward')
        
        # Smoothed
        window = min(20, len(reward_callback.episode_rewards) // 3)
        if window > 1:
            smoothed = np.convolve(reward_callback.episode_rewards, 
                                   np.ones(window)/window, mode='valid')
            ax.plot(range(window-1, len(reward_callback.episode_rewards)), 
                    smoothed, 'r-', linewidth=2, label='Smoothed')
        
        ax.set_xlabel('Episode', fontsize=11)
        ax.set_ylabel('Reward', fontsize=11)
        ax.set_title('Training Progress (via Custom Callback)', fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
    
    env.close()
    print("\n" + "="*60)
else:
    print("Install Stable-Baselines3 to run this example.")

---
## Saving and Loading Models

```
    ┌────────────────────────────────────────────────────────────────┐
    │              SAVING AND LOADING MODELS                         │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  SAVE MODEL:                                                  │
    │    model.save("ppo_cartpole")                                │
    │    # Creates: ppo_cartpole.zip                               │
    │                                                                │
    │  LOAD MODEL:                                                  │
    │    model = PPO.load("ppo_cartpole")                          │
    │    # Ready to use!                                           │
    │                                                                │
    │  CONTINUE TRAINING:                                           │
    │    model = PPO.load("ppo_cartpole", env=env)                 │
    │    model.learn(total_timesteps=more_steps)                   │
    │                                                                │
    │  WHAT'S SAVED:                                                │
    │    • Policy network weights                                  │
    │    • Value network weights                                   │
    │    • Optimizer state                                         │
    │    • Hyperparameters                                         │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
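"Ready to use" means the loaded model can act without an environment attached. A minimal sketch of that inference loop, assuming Gymnasium's five-value `step()` API; the very short training run is only there so that there is something to save, and is far too brief to produce a good policy.

```python
import os
import tempfile

try:
    import gymnasium as gym
    from stable_baselines3 import PPO
    SB3_OK = True
except ImportError:
    SB3_OK = False

episode_return = None
if SB3_OK:
    # Train briefly so there is a model to save
    model = PPO("MlpPolicy", "CartPole-v1", n_steps=256, verbose=0, seed=0)
    model.learn(total_timesteps=2_048)

    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "ppo_cartpole")
        model.save(path)         # writes ppo_cartpole.zip
        loaded = PPO.load(path)  # no env needed for inference

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)
    episode_return, done = 0.0, False
    while not done:
        # deterministic=True picks the most likely action instead of sampling
        action, _state = loaded.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(int(action))
        episode_return += float(reward)
        done = terminated or truncated
    env.close()
    print(f"Episode return: {episode_return:.0f}")
```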

In [None]:
if SB3_AVAILABLE:
    import tempfile
    import os
    
    print("SAVING AND LOADING MODELS")
    print("="*60)
    
    # Create and train a model
    env = make_vec_env('CartPole-v1', n_envs=4)
    model = PPO('MlpPolicy', env, verbose=0)
    
    print("\n1. Training original model...")
    model.learn(total_timesteps=10_000)
    
    # Evaluate
    mean_reward1, _ = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"   Original model reward: {mean_reward1:.1f}")
    
    # Save to temp directory
    with tempfile.TemporaryDirectory() as tmpdir:
        save_path = os.path.join(tmpdir, "ppo_cartpole")
        
        print(f"\n2. Saving model to: {save_path}")
        model.save(save_path)
        
        # Check file exists
        print(f"   File created: {os.path.exists(save_path + '.zip')}")
        
        # Load model
        print("\n3. Loading model...")
        loaded_model = PPO.load(save_path)
        
        # Evaluate loaded model
        mean_reward2, _ = evaluate_policy(loaded_model, env, n_eval_episodes=10)
        print(f"   Loaded model reward: {mean_reward2:.1f}")
        
        # Continue training
        print("\n4. Continue training loaded model...")
        loaded_model.set_env(env)
        loaded_model.learn(total_timesteps=10_000)
        
        mean_reward3, _ = evaluate_policy(loaded_model, env, n_eval_episodes=10)
        print(f"   After more training: {mean_reward3:.1f}")
    
    env.close()
    print("\n" + "="*60)
else:
    print("Install Stable-Baselines3 to run this example.")

---
## Complete Training Example

Let's put it all together with a proper training setup!

In [None]:
if SB3_AVAILABLE:
    print("COMPLETE PPO TRAINING EXAMPLE")
    print("="*60)
    
    # ========================================
    # 1. Environment Setup
    # ========================================
    n_envs = 4
    env = make_vec_env('CartPole-v1', n_envs=n_envs)
    
    print(f"\n1. Created {n_envs} parallel environments")
    
    # ========================================
    # 2. Model Creation with Good Hyperparameters
    # ========================================
    model = PPO(
        policy='MlpPolicy',
        env=env,
        learning_rate=3e-4,
        n_steps=512,
        batch_size=64,
        n_epochs=10,
        gamma=0.99,
        gae_lambda=0.95,
        clip_range=0.2,
        ent_coef=0.01,
        vf_coef=0.5,
        verbose=0,
        seed=42,
    )
    
    print("\n2. Created PPO model with explicit hyperparameters")
    
    # ========================================
    # 3. Training with Progress Tracking
    # ========================================
    reward_callback = RewardLoggerCallback()
    
    print("\n3. Training for 50,000 timesteps...")
    model.learn(
        total_timesteps=50_000,
        callback=reward_callback,
        progress_bar=True  # Requires tqdm and rich (included in stable-baselines3[extra])
    )
    
    # ========================================
    # 4. Evaluation
    # ========================================
    print("\n4. Evaluating trained model...")
    mean_reward, std_reward = evaluate_policy(
        model, env, n_eval_episodes=20, deterministic=True
    )
    print(f"   Result: {mean_reward:.1f} ± {std_reward:.1f}")
    
    # ========================================
    # 5. Visualization
    # ========================================
    if len(reward_callback.episode_rewards) > 10:
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        
        # Learning curve
        ax1 = axes[0]
        rewards = reward_callback.episode_rewards
        ax1.plot(rewards, alpha=0.3, color='blue', label='Episode Reward')
        
        window = min(30, len(rewards) // 3)
        if window > 1:
            smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
            ax1.plot(range(window-1, len(rewards)), smoothed, 
                     'r-', linewidth=2, label='Smoothed')
        
        ax1.axhline(y=500, color='green', linestyle='--', label='Max Score')
        ax1.set_xlabel('Episode', fontsize=11)
        ax1.set_ylabel('Reward', fontsize=11)
        ax1.set_title('Training Progress', fontsize=12, fontweight='bold')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        
        # Reward distribution
        ax2 = axes[1]
        ax2.hist(rewards[-100:] if len(rewards) > 100 else rewards, 
                 bins=20, color='#64b5f6', edgecolor='black')
        ax2.axvline(x=mean_reward, color='red', linewidth=2, 
                    label=f'Mean: {mean_reward:.1f}')
        ax2.set_xlabel('Reward', fontsize=11)
        ax2.set_ylabel('Count', fontsize=11)
        ax2.set_title('Final Reward Distribution', fontsize=12, fontweight='bold')
        ax2.legend()
        ax2.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    
    env.close()
    
    print("\n" + "="*60)
    print("TRAINING COMPLETE!")
    print(f"  Episodes completed: {len(reward_callback.episode_rewards)}")
    print(f"  Final performance: {mean_reward:.1f} ± {std_reward:.1f}")
    print("="*60)
else:
    print("Install Stable-Baselines3 to run this example.")

---
## Summary: Key Takeaways

### SB3 Quick Reference

| Task | Code |
|------|------|
| Create model | `model = PPO('MlpPolicy', env)` |
| Train | `model.learn(total_timesteps=10000)` |
| Evaluate | `evaluate_policy(model, env)` |
| Save | `model.save('my_model')` |
| Load | `model = PPO.load('my_model')` |

### Key Hyperparameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `learning_rate` | 3e-4 | Adam learning rate |
| `n_steps` | 2048 | Steps per rollout |
| `batch_size` | 64 | Minibatch size |
| `n_epochs` | 10 | Epochs per update |
| `clip_range` | 0.2 | PPO clipping |

### Vectorized Environments

```python
env = make_vec_env('CartPole-v1', n_envs=4)
```

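Benefits: each update draws on experience from 4 independent episodes, and collection can be up to 4× faster when the environments run in separate processes (`SubprocVecEnv`). The default `DummyVecEnv` steps them one after another.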
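Because `n_steps` is counted per environment, experience arrives in whole rollouts of `n_steps × n_envs` transitions. A small arithmetic sketch, assuming SB3's on-policy loop keeps collecting full rollouts until `total_timesteps` is reached; `planned_rollouts` is a hypothetical helper, not part of SB3.

```python
import math

def planned_rollouts(total_timesteps, n_steps, n_envs):
    """Return (number of rollouts, timesteps actually collected). Illustrative helper."""
    rollout_size = n_steps * n_envs                         # transitions per update
    n_rollouts = math.ceil(total_timesteps / rollout_size)  # partial rollouts are rounded up
    return n_rollouts, n_rollouts * rollout_size

# The vectorized training cell: 20,000 requested steps, 512 steps × 4 envs per rollout
print(planned_rollouts(20_000, n_steps=512, n_envs=4))  # (10, 20480)
```

Under that assumption, a request for 20,000 timesteps ends up collecting 20,480, which is worth remembering when comparing runs with different `n_envs`.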

---
## Test Your Understanding

**1. Why use Stable-Baselines3 instead of implementing from scratch?**
<details>
<summary>Click to reveal answer</summary>
SB3 provides:
- Battle-tested, widely reviewed implementations
- Optimized performance
- Rich features (logging, callbacks, saving)
- Good defaults that work well
- Easy to reproduce results

Use from-scratch for learning and research; SB3 for production and benchmarks.
</details>

**2. What does n_envs=4 do?**
<details>
<summary>Click to reveal answer</summary>
It creates 4 parallel copies of the environment. Benefits:
- Up to 4× faster data collection (with `SubprocVecEnv`; `DummyVecEnv` steps the copies sequentially)
- More diverse experiences per update
- Better GPU utilization
- Total steps per update = n_steps × n_envs
</details>

**3. What's the purpose of callbacks?**
<details>
<summary>Click to reveal answer</summary>
Callbacks let you:
- Log custom metrics during training
- Save the best model automatically
- Implement early stopping
- Visualize training progress
- Execute custom code at specific training events
</details>

**4. How do you continue training a saved model?**
<details>
<summary>Click to reveal answer</summary>
```python
# Load the model
model = PPO.load("my_model")

# Set the environment
model.set_env(env)

# Continue training
model.learn(total_timesteps=more_steps)
```
</details>

**5. Which hyperparameter should you adjust first if training is unstable?**
<details>
<summary>Click to reveal answer</summary>
Try reducing the learning rate first (e.g., from 3e-4 to 1e-4). Other options:
- Reduce clip_range (e.g., 0.2 → 0.1)
- Increase n_steps for more stable gradients
- Increase batch_size for less noisy gradient estimates
</details>

---
## What's Next?

You've mastered PPO with Stable-Baselines3! In the next notebook, we'll explore **SAC (Soft Actor-Critic)** for continuous control tasks.

**Continue to:** [Notebook 4: SAC for Continuous Control](04_sac_continuous_control.ipynb)

---

*SB3: "Professional tools for professional results!"*