# Training and Evaluation: Monitoring Learning Curves, Rewards, and Stability

## 📚 Learning Objectives

By completing this notebook, you will:
- Monitor learning curves during Deep RL training
- Track rewards and evaluate model performance
- Assess training stability
- Visualize training progress
- Identify convergence and performance issues

## 🔗 Prerequisites

- ✅ Understanding of Deep RL algorithms (DQN, Actor-Critic)
- ✅ Understanding of training loops
- ✅ Python knowledge (matplotlib, numpy)
- ✅ Experience with OpenAI Gym

---

## Official Structure Reference

This notebook covers practical activities from **Course 09, Unit 3**:
- Training and evaluation: monitoring learning curves, rewards, and stability to evaluate model performance
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 3 Practical Content

---

## Introduction

**Monitoring and evaluation** are crucial for Deep RL training. Learning curves, reward tracking, and stability metrics help assess training progress and identify issues early.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import gym

print("✅ Libraries imported!")
print("\nTraining and Evaluation: Monitoring Learning Curves")
print("=" * 60)

## Part 1: Monitoring Learning Curves


In [None]:
print("=" * 60)
print("Part 1: Monitoring Learning Curves")
print("=" * 60)

class TrainingMonitor:
    """Monitor training progress for Deep RL."""
    
    def __init__(self, window_size=100):
        self.episode_rewards = []
        self.episode_lengths = []
        self.window_size = window_size
        # Fixed-size buffer holding only the most recent rewards
        self.reward_window = deque(maxlen=window_size)
    
    def record_episode(self, reward, length):
        """Record an episode's results."""
        self.episode_rewards.append(reward)
        self.episode_lengths.append(length)
        self.reward_window.append(reward)
    
    def get_average_reward(self):
        """Get average reward over window."""
        return np.mean(self.reward_window) if self.reward_window else 0.0
    
    def plot_learning_curve(self):
        """Plot learning curves."""
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
        
        # Plot 1: Episode Rewards
        axes[0].plot(self.episode_rewards, alpha=0.3, label='Raw rewards', color='blue')
        if len(self.episode_rewards) >= self.window_size:
            # Trailing mean over at most window_size episodes ending at episode i
            smoothed = [np.mean(self.episode_rewards[max(0, i - self.window_size + 1):i + 1])
                        for i in range(len(self.episode_rewards))]
            axes[0].plot(smoothed, label=f'Smoothed (window={self.window_size})', 
                        color='red', linewidth=2)
        axes[0].set_xlabel('Episode')
        axes[0].set_ylabel('Reward')
        axes[0].set_title('Learning Curve: Episode Rewards')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Plot 2: Episode Lengths
        axes[1].plot(self.episode_lengths, alpha=0.3, label='Episode lengths', color='green')
        if len(self.episode_lengths) >= self.window_size:
            # Trailing mean over at most window_size episodes ending at episode i
            smoothed_lengths = [np.mean(self.episode_lengths[max(0, i - self.window_size + 1):i + 1])
                                for i in range(len(self.episode_lengths))]
            axes[1].plot(smoothed_lengths, label=f'Smoothed (window={self.window_size})', 
                        color='orange', linewidth=2)
        axes[1].set_xlabel('Episode')
        axes[1].set_ylabel('Episode Length')
        axes[1].set_title('Learning Curve: Episode Lengths')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()

# Simulate training data
monitor = TrainingMonitor(window_size=50)

# Simulate training progress (improving over time)
np.random.seed(42)
for episode in range(200):
    # Simulate improving performance
    base_reward = 20 + episode * 0.5 + np.random.normal(0, 5)
    reward = max(0, base_reward)
    length = int(50 + episode * 0.3 + np.random.normal(0, 10))
    length = max(1, length)
    monitor.record_episode(reward, length)

print(f"\nTraining Statistics:")
print(f"  Total episodes: {len(monitor.episode_rewards)}")
print(f"  Average reward (final 50 episodes): {monitor.get_average_reward():.2f}")
print(f"  Best episode reward: {max(monitor.episode_rewards):.2f}")
print(f"  Average episode length: {np.mean(monitor.episode_lengths):.2f}")

monitor.plot_learning_curve()

print("\n✅ Learning curves monitored!")
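
The list-comprehension smoothing in `plot_learning_curve` recomputes a mean for every episode, which is fine for a few hundred episodes. For longer runs, a vectorized moving average is one alternative. The sketch below is not part of the original notebook: `moving_average` is a hypothetical helper, and because it uses `np.convolve` with `mode="valid"` it returns only full-window averages, not the partial early windows plotted above.

```python
import numpy as np

def moving_average(values, window):
    """Trailing moving average; returns one value per full window."""
    values = np.asarray(values, dtype=float)
    if window < 1:
        raise ValueError("window must be >= 1")
    if len(values) < window:
        # Not enough data for even one full window
        return np.array([])
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fits entirely,
    # so the output has len(values) - window + 1 entries.
    return np.convolve(values, kernel, mode="valid")

rewards = [1.0, 2.0, 3.0, 4.0, 5.0]
window = 3
smoothed = moving_average(rewards, window)

# smoothed[k] averages rewards[k : k + window], so plot it against
# episodes window - 1 onward to line it up with the raw curve.
x = np.arange(window - 1, len(rewards))
print(list(zip(x.tolist(), smoothed.tolist())))
```

Plotting `smoothed` against `x` instead of against `range(len(smoothed))` keeps the smoothed curve aligned with the raw rewards.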

## Part 2: Evaluating Stability


In [None]:
print("\n" + "=" * 60)
print("Part 2: Evaluating Stability")
print("=" * 60)

def evaluate_stability(rewards, window_size=100):
    """Evaluate training stability metrics."""
    metrics = {}
    
    # Calculate standard deviation over windows
    if len(rewards) >= window_size:
        windows = [rewards[i:i+window_size] for i in range(0, len(rewards)-window_size+1, window_size//2)]
        stds = [np.std(window) for window in windows]
        metrics['average_std'] = np.mean(stds)
        metrics['max_std'] = np.max(stds)
    
    # Calculate reward variance
    metrics['overall_variance'] = np.var(rewards)
    metrics['overall_std'] = np.std(rewards)
    
    # Check for convergence (stability in later episodes)
    if len(rewards) >= 2 * window_size:
        early_rewards = rewards[:window_size]
        late_rewards = rewards[-window_size:]
        metrics['early_mean'] = np.mean(early_rewards)
        metrics['late_mean'] = np.mean(late_rewards)
        metrics['improvement'] = metrics['late_mean'] - metrics['early_mean']
        metrics['early_std'] = np.std(early_rewards)
        metrics['late_std'] = np.std(late_rewards)
        metrics['stability_improvement'] = metrics['early_std'] - metrics['late_std']
    
    return metrics

stability_metrics = evaluate_stability(monitor.episode_rewards)

print("\nStability Metrics:")
for key, value in stability_metrics.items():
    print(f"  {key}: {value:.4f}")

# Visualize stability
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
if len(monitor.episode_rewards) >= 100:
    window_std = [np.std(monitor.episode_rewards[max(0, i-50):i+1])
                  for i in range(len(monitor.episode_rewards))]
    plt.plot(window_std, label='Rolling Std Dev', color='purple')
    plt.axhline(y=stability_metrics.get('overall_std', 0), color='r',
                linestyle='--', label=f'Overall Std: {stability_metrics.get("overall_std", 0):.2f}')
    plt.xlabel('Episode')
    plt.ylabel('Standard Deviation')
    plt.title('Training Stability: Rolling Reward Std Dev')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
if len(monitor.episode_rewards) >= 200:
    early = monitor.episode_rewards[:100]
    late = monitor.episode_rewards[-100:]
    plt.hist(early, bins=20, alpha=0.7, label='Early episodes', density=True)
    plt.hist(late, bins=20, alpha=0.7, label='Late episodes', density=True)
    plt.xlabel('Reward')
    plt.ylabel('Density')
    plt.title('Reward Distribution: Early vs Late')
    plt.legend()
    plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\n✅ Stability evaluated!")
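
One caveat when reading these numbers: if rewards are still trending upward, as in the simulated data here, the standard deviation of a window mixes the trend with episode-to-episode noise. A rough way to separate the two, assuming the trend is roughly linear, is to fit a line and measure the spread of the residuals. The helper `detrended_std` below is a hypothetical addition, not part of the original notebook, and the data is synthetic.

```python
import numpy as np

def detrended_std(rewards):
    """Std of residuals after removing a least-squares linear trend.

    Returns (residual_std, slope). Assumes a roughly linear trend.
    """
    rewards = np.asarray(rewards, dtype=float)
    episodes = np.arange(len(rewards))
    # np.polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(episodes, rewards, deg=1)
    residuals = rewards - (slope * episodes + intercept)
    return float(np.std(residuals)), float(slope)

# Synthetic rewards shaped like the simulation above: linear gain plus noise
rng = np.random.default_rng(0)
episodes = np.arange(200)
rewards = 20 + 0.5 * episodes + rng.normal(0, 5, size=200)

raw_std = float(np.std(rewards))
noise_std, slope = detrended_std(rewards)
print(f"Raw std: {raw_std:.2f}")
print(f"Detrended std: {noise_std:.2f} (fitted slope {slope:.3f} per episode)")
```

The raw value is dominated by the upward trend, while the detrended value is closer to the noise level that was injected, which is the quantity that says something about stability.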

## Summary

### Key Metrics:
1. **Learning Curves**: Episode rewards and lengths over time
2. **Smoothed Curves**: Moving averages to reduce noise
3. **Stability Metrics**: Variance, standard deviation, convergence
4. **Performance Tracking**: Best rewards, average performance

### Best Practices:
- Monitor rewards, episode lengths, and loss (if applicable)
- Use smoothing to identify trends
- Track stability metrics (variance reduction over time)
- Compare early vs late performance
- Set up early stopping based on convergence
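
The early-stopping item above can be sketched as a simple plateau check: stop once the mean reward of the most recent window differs from the mean of the window before it by less than a tolerance. The function name `has_converged`, the window of 50, and the tolerance of 1.0 are illustrative assumptions, and the reward trace is synthetic.

```python
import numpy as np

def has_converged(rewards, window=50, tolerance=1.0):
    """True when the last two windows have mean rewards within `tolerance`."""
    if len(rewards) < 2 * window:
        # Need two full windows before a comparison is meaningful
        return False
    previous = np.mean(rewards[-2 * window:-window])
    recent = np.mean(rewards[-window:])
    return bool(abs(recent - previous) < tolerance)

# Hypothetical reward trace: rises for 100 episodes, then plateaus at 50
rewards = [min(50.0, 0.5 * ep) for ep in range(300)]

stop_episode = None
for n_episodes in range(1, len(rewards) + 1):
    # Check convergence using only the episodes seen so far
    if has_converged(rewards[:n_episodes]):
        stop_episode = n_episodes
        break

print(f"Stopping criterion met after {stop_episode} episodes")
```

A check like this only detects a plateau; it cannot tell a good plateau from a bad one, so in practice it is usually paired with a minimum reward threshold.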

### Evaluation Checklist:
- ✅ Learning curves showing improvement
- ✅ Stable or declining variance over time
- ✅ Convergence to good performance
- ✅ Consistent behavior in late training

### Applications:
- All Deep RL algorithms (DQN, A2C, PPO, etc.)
- Hyperparameter tuning
- Algorithm comparison
- Debugging training issues
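
For algorithm comparison, a single run can be misleading because Deep RL results vary with the random seed. A minimal sketch, assuming each algorithm is run several times: average the learning curves across seeds and compare the means alongside their spread. `simulate_run` is a hypothetical stand-in for a real training run, and `summarize_runs` is an illustrative helper; neither appears in the original notebook.

```python
import numpy as np

def summarize_runs(runs):
    """Per-episode mean and std across runs; `runs` is (n_seeds, n_episodes)."""
    runs = np.asarray(runs, dtype=float)
    return runs.mean(axis=0), runs.std(axis=0)

def simulate_run(seed, slope, n_episodes=200):
    """Hypothetical stand-in for the episode rewards of one training run."""
    rng = np.random.default_rng(seed)
    return 20 + slope * np.arange(n_episodes) + rng.normal(0, 5, n_episodes)

# Five seeds per (simulated) algorithm
algo_a = [simulate_run(seed, slope=0.5) for seed in range(5)]
algo_b = [simulate_run(seed, slope=0.3) for seed in range(5)]

mean_a, std_a = summarize_runs(algo_a)
mean_b, std_b = summarize_runs(algo_b)

# Compare average performance over the final 50 episodes
print(f"Algorithm A: {mean_a[-50:].mean():.2f} (std across seeds {std_a[-50:].mean():.2f})")
print(f"Algorithm B: {mean_b[-50:].mean():.2f} (std across seeds {std_b[-50:].mean():.2f})")
```

The per-episode `mean` and `std` arrays can be plotted with `plt.plot` and `plt.fill_between` to show each algorithm as a curve with a shaded band.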

**Reference:** Course 09, Unit 3: "Deep Reinforcement Learning" - Training and evaluation practical content