# Playing Atari Games: Where Deep RL Began

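This is where deep RL made history! In 2013, DeepMind's DQN learned to play seven Atari games directly from raw pixels, beating a human expert on three of them. The 2015 Nature follow-up scaled the same approach to 49 games and matched or exceeded human-level scores on many of them.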

## What You'll Learn

By the end of this notebook, you'll understand:
- The historical significance of DQN on Atari
- Why images need special preprocessing
- Frame stacking: giving the agent "motion vision"
- CNN architectures for image-based RL
- Using Stable-Baselines3 for Atari
- Tips for training and evaluation

**Prerequisites:** Notebooks 1-5 (Deep RL fundamentals)

**Time:** ~25 minutes

---
## The Big Picture: A Historic Achievement

```
    ┌────────────────────────────────────────────────────────────────┐
    │          DQN ON ATARI: WHY IT MATTERED                        │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  BEFORE DQN (2013):                                           │
    │    • RL worked on small, hand-crafted state spaces            │
    │    • Needed domain experts to design features                 │
    │    • Couldn't handle raw sensory input (images, sounds)       │
    │                                                                │
    │  THE DQN BREAKTHROUGH:                                        │
    │    • Input: Raw pixels (210×160×3 = 100,800 numbers!)        │
    │    • Output: Which button to press                           │
    │    • No game-specific engineering                            │
    │    • Same algorithm for ALL 49 Atari games!                  │
    │                                                                │
    │  RESULTS:                                                     │
    │    • Superhuman on Breakout, Pong, Space Invaders, etc.      │
    │    • Learned strategies humans didn't teach it               │
    │    • Famous Breakout "tunnel" strategy discovered by AI      │
    │                                                                │
    │  IMPACT:                                                      │
    │    • Proved deep learning + RL could scale                   │
    │    • Helped pave the way for AlphaGo and RLHF (ChatGPT)      │
    │    • Started the modern deep RL revolution                   │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, Rectangle, Circle

# Visualize the Atari challenge
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Left: Raw pixel input
ax1 = axes[0]
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.axis('off')
ax1.set_title('The Challenge:\nRaw Pixels → Actions', fontsize=14, fontweight='bold')

# Pixel grid representation
np.random.seed(42)
pixel_img = np.random.rand(8, 8, 3)
pixel_img[3:5, 3:5] = [1, 0, 0]  # Red object
pixel_img[6:7, 2:6] = [0, 1, 0]  # Green paddle

img_box = FancyBboxPatch((1, 4), 4, 4, boxstyle="round,pad=0.1",
                          facecolor='#e0e0e0', edgecolor='#333', linewidth=2)
ax1.add_patch(img_box)
ax1.imshow(pixel_img, extent=(1.2, 4.8, 4.2, 7.8))
ax1.text(3, 3.5, '210×160×3 pixels', ha='center', fontsize=10)
ax1.text(3, 2.8, '= 100,800 numbers!', ha='center', fontsize=9, color='#666')

# Arrow
ax1.annotate('', xy=(7, 5.5), xytext=(5.5, 5.5),
             arrowprops=dict(arrowstyle='->', lw=3, color='#1976d2'))
ax1.text(6.25, 6.2, 'DQN', ha='center', fontsize=12, fontweight='bold', color='#1976d2')

# Output: Actions
action_box = FancyBboxPatch((7, 4.5), 2.5, 2, boxstyle="round,pad=0.1",
                             facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=2)
ax1.add_patch(action_box)
ax1.text(8.25, 5.8, 'Action', ha='center', fontsize=11, fontweight='bold')
ax1.text(8.25, 5.2, '←  →  fire', ha='center', fontsize=10)

# Middle: Timeline
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('The Deep RL Revolution', fontsize=14, fontweight='bold')

milestones = [
    ('2013', 'DQN Paper', '#bbdefb'),
    ('2015', 'Nature Paper', '#c8e6c9'),
    ('2016', 'AlphaGo', '#fff3e0'),
    ('2017', 'Rainbow', '#e1bee7'),
    ('2022', 'ChatGPT (RLHF)', '#ffcdd2'),
]

for i, (year, event, color) in enumerate(milestones):
    y = 8 - i * 1.5
    box = FancyBboxPatch((2, y-0.4), 6, 0.8, boxstyle="round,pad=0.1",
                          facecolor=color, edgecolor='#333', linewidth=1)
    ax2.add_patch(box)
    ax2.text(5, y, f'{year}: {event}', ha='center', va='center', fontsize=10)

ax2.text(5, 0.5, 'DQN started it all!', ha='center', fontsize=11, 
         style='italic', color='#666')

# Right: Performance chart
ax3 = axes[2]
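games = ['Breakout', 'Pong', 'Q*bert', 'Seaquest', 'Enduro']
# Scores as reported in the 2015 Nature paper (human tester vs. DQN)
human = [31.8, 9.3, 13455, 20182, 309.6]
dqn = [401.2, 18.9, 10596, 5286, 301.8]
# Simple ratio for illustration; the paper's normalized score also subtracts random play
dqn_pct = [d/h*100 if h > 0 else 100 for d, h in zip(dqn, human)]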

colors = ['#4caf50' if p > 100 else '#ff9800' for p in dqn_pct]
ax3.barh(games, dqn_pct, color=colors, edgecolor='black', linewidth=1)
ax3.axvline(x=100, color='#333', linestyle='--', linewidth=2, label='Human Level')
ax3.set_xlabel('DQN Score (% of Human)', fontsize=11)
ax3.set_title('DQN vs Human Performance', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3, axis='x')
ax3.legend()

plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("THE DQN ACHIEVEMENT")
print("="*70)
print("""
DQN proved that a single algorithm could:
  1. Take raw pixels as input (no hand-crafted features)
  2. Learn to play 49 different games
  3. Achieve superhuman performance on many of them
  4. Discover novel strategies humans didn't teach it

This was the "ImageNet moment" for reinforcement learning!
""")
print("="*70)

---
## The Preprocessing Pipeline

Raw Atari frames carry far more data than the agent needs, so they are shrunk before they reach the network!

```
    ┌────────────────────────────────────────────────────────────────┐
    │              ATARI PREPROCESSING PIPELINE                      │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  RAW FRAME                                                    │
    │  ┌─────────────┐                                              │
    │  │             │  210 × 160 × 3 (RGB)                        │
    │  │   COLOR     │  = 100,800 values                           │
    │  │   IMAGE     │                                              │
    │  └─────────────┘                                              │
    │        ↓                                                      │
    │  STEP 1: Convert to Grayscale                                │
    │  ┌─────────────┐                                              │
    │  │             │  210 × 160 × 1                               │
    │  │   GRAY      │  = 33,600 values                            │
    │  │   IMAGE     │  (Color doesn't help gameplay!)             │
    │  └─────────────┘                                              │
    │        ↓                                                      │
    │  STEP 2: Resize to 84×84                                     │
    │  ┌─────┐                                                      │
    │  │     │  84 × 84 × 1                                        │
    │  │     │  = 7,056 values                                     │
    │  └─────┘  (Still has all important info!)                    │
    │        ↓                                                      │
    │  STEP 3: Stack 4 Frames (for motion)                         │
    │  ┌─────┬─────┬─────┬─────┐                                   │
    │  │ t-3 │ t-2 │ t-1 │  t  │  84 × 84 × 4                      │
    │  └─────┴─────┴─────┴─────┘  = 28,224 values                  │
    │                                                                │
    │  This is what the DQN network sees!                          │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
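
The three steps in the diagram can be sketched in plain NumPy. This is a minimal illustration, not the wrapper any library actually ships: `to_grayscale`, `resize_nearest`, and `FrameStacker` are hypothetical names, nearest-neighbour sampling stands in for the smoother interpolation real pipelines typically use, and the stack is stored channel-first as (4, 84, 84), the layout PyTorch expects.

```python
from collections import deque

import numpy as np


def to_grayscale(frame_rgb):
    """Collapse an (H, W, 3) RGB frame to (H, W) using standard luma weights."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return frame_rgb.astype(np.float32) @ weights


def resize_nearest(gray, out_h=84, out_w=84):
    """Nearest-neighbour downsample (an assumption; chosen only for simplicity)."""
    rows = np.arange(out_h) * gray.shape[0] // out_h
    cols = np.arange(out_w) * gray.shape[1] // out_w
    return gray[rows[:, None], cols[None, :]]


class FrameStacker:
    """Keeps the last n preprocessed frames and returns them as one (n, 84, 84) array."""

    def __init__(self, n=4):
        self.frames = deque(maxlen=n)

    def reset(self, frame_rgb):
        first = self._preprocess(frame_rgb)
        # Fill the stack with copies of the first frame so the shape is fixed from step 0.
        for _ in range(self.frames.maxlen):
            self.frames.append(first)
        return self.observation()

    def step(self, frame_rgb):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(self._preprocess(frame_rgb))
        return self.observation()

    def observation(self):
        return np.stack(self.frames, axis=0)

    @staticmethod
    def _preprocess(frame_rgb):
        return resize_nearest(to_grayscale(frame_rgb)).astype(np.uint8)


rng = np.random.default_rng(0)
stacker = FrameStacker(n=4)
obs = stacker.reset(rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8))
obs = stacker.step(rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8))
print(obs.shape, obs.dtype, obs.size)  # (4, 84, 84) uint8 28224
```

Keeping the frames as `uint8` and dividing by 255 only inside the network keeps each stored observation at 28,224 bytes rather than four times that in `float32`.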

In [None]:
# Simulate the preprocessing pipeline
np.random.seed(42)

fig, axes = plt.subplots(2, 4, figsize=(14, 7))

# Top row: Preprocessing steps
ax1, ax2, ax3, ax4 = axes[0]

# Raw frame (simulate with random colors)
raw_frame = np.random.rand(210, 160, 3)
# Add some structure
raw_frame[180:200, 70:90] = [0.2, 0.8, 0.2]  # Green paddle
raw_frame[50:60, 80:90] = [1, 0.2, 0.2]       # Red ball
raw_frame[10:20, :] = [0.3, 0.3, 1]           # Blue bricks

ax1.imshow(raw_frame)
ax1.set_title('1. Raw Frame\n210×160×3 = 100,800', fontsize=11)
ax1.axis('off')

# Grayscale
gray_frame = 0.299*raw_frame[:,:,0] + 0.587*raw_frame[:,:,1] + 0.114*raw_frame[:,:,2]
ax2.imshow(gray_frame, cmap='gray')
ax2.set_title('2. Grayscale\n210×160×1 = 33,600', fontsize=11)
ax2.axis('off')

# Resized (simulate)
from scipy.ndimage import zoom
resized_frame = zoom(gray_frame, (84/210, 84/160))
ax3.imshow(resized_frame, cmap='gray')
ax3.set_title('3. Resized\n84×84×1 = 7,056', fontsize=11)
ax3.axis('off')

# Frame stack visualization
ax4.set_xlim(0, 10)
ax4.set_ylim(0, 10)
ax4.axis('off')
ax4.set_title('4. Frame Stack\n84×84×4 = 28,224', fontsize=11)

# Draw stacked frames
for i, offset in enumerate([0.3, 0.2, 0.1, 0]):
    alpha = 0.4 + 0.2 * i
    color = f'{0.3 + 0.2*i}'
    rect = Rectangle((2 + offset*3, 2 + offset*3), 5, 5, 
                     facecolor=color, edgecolor='black', linewidth=2, alpha=alpha)
    ax4.add_patch(rect)
ax4.text(5, 1, 't-3, t-2, t-1, t', ha='center', fontsize=10)
ax4.text(5, 0.3, '(captures motion!)', ha='center', fontsize=9, color='#666')

# Bottom row: Why frame stacking matters
for ax in axes[1]:
    ax.axis('off')

axes[1, 0].set_title('Why Stack Frames?', fontsize=12, fontweight='bold', loc='left')

# Single frame problem
ax5 = axes[1, 0]
ax5.text(0.5, 0.9, 'Single Frame Problem:', transform=ax5.transAxes, fontsize=10, fontweight='bold')
ax5.text(0.5, 0.6, '• Ball position: (50, 80)', transform=ax5.transAxes, fontsize=9)
ax5.text(0.5, 0.4, '• But is it going ↑ or ↓?', transform=ax5.transAxes, fontsize=9)
ax5.text(0.5, 0.2, '• Cannot tell from ONE frame!', transform=ax5.transAxes, fontsize=9, color='#d32f2f')

# Stacked frames solution
ax6 = axes[1, 1]
ax6.text(0.5, 0.9, 'Stacked Frames Solution:', transform=ax6.transAxes, fontsize=10, fontweight='bold')
ax6.text(0.5, 0.65, '• Frame t-3: ball at (50, 75)', transform=ax6.transAxes, fontsize=9)
ax6.text(0.5, 0.45, '• Frame t-2: ball at (50, 77)', transform=ax6.transAxes, fontsize=9)
ax6.text(0.5, 0.25, '• Frame t-1: ball at (50, 79)', transform=ax6.transAxes, fontsize=9)
ax6.text(0.5, 0.05, '• Ball is moving DOWN! ↓', transform=ax6.transAxes, fontsize=9, color='#388e3c', fontweight='bold')

# Analogy
ax7 = axes[1, 2]
ax7.text(0.5, 0.9, 'Analogy:', transform=ax7.transAxes, fontsize=10, fontweight='bold')
ax7.text(0.5, 0.6, 'Single frame = Photo', transform=ax7.transAxes, fontsize=9)
ax7.text(0.5, 0.4, 'Frame stack = Short Video', transform=ax7.transAxes, fontsize=9)
ax7.text(0.5, 0.15, 'Video shows MOTION!', transform=ax7.transAxes, fontsize=9, color='#1976d2', fontweight='bold')

plt.tight_layout()
plt.show()

print("\nPREPROCESSING SUMMARY:")
print("  1. Grayscale: Color is rarely needed, and dropping it cuts the input 3x")
print("  2. Resize: 84×84 is enough detail, faster to process")
print("  3. Frame Stack: 4 frames capture velocity/motion")
print("\n  Final input: 84×84×4 = 28,224 values per observation")

---
## The CNN Architecture

DQN uses a Convolutional Neural Network to process images!

```
    ┌────────────────────────────────────────────────────────────────┐
    │                    DQN CNN ARCHITECTURE                        │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  INPUT: 84 × 84 × 4 (4 stacked grayscale frames)              │
    │         ↓                                                      │
    │  CONV1: 32 filters, 8×8, stride 4 → 20×20×32                  │
    │         ReLU                                                   │
    │         ↓                                                      │
    │  CONV2: 64 filters, 4×4, stride 2 → 9×9×64                    │
    │         ReLU                                                   │
    │         ↓                                                      │
    │  CONV3: 64 filters, 3×3, stride 1 → 7×7×64                    │
    │         ReLU                                                   │
    │         ↓                                                      │
    │  FLATTEN: 7×7×64 = 3,136                                      │
    │         ↓                                                      │
    │  FC1: 512 units, ReLU                                         │
    │         ↓                                                      │
    │  OUTPUT: num_actions (e.g., 4 for Breakout)                   │
    │                                                                │
    │  WHY CNNs?                                                    │
    │    • Detect patterns regardless of position (translation inv) │
    │    • Hierarchical features: edges → shapes → objects         │
    │    • Parameter efficient: shared weights across image         │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
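
The layer sizes in the diagram follow from the usual no-padding convolution formula, `out = (in - kernel) // stride + 1`. The short sketch below re-derives them and also counts the weights; `conv_out` and `conv_params` are helper names made up for this example, and `num_actions = 4` assumes Breakout.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output width/height of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel for a square kernel."""
    return in_ch * out_ch * kernel * kernel + out_ch


# (in_channels, out_channels, kernel, stride) for Conv1..Conv3
layers = [(4, 32, 8, 4), (32, 64, 4, 2), (64, 64, 3, 1)]

size = 84
sizes = []
n_params = 0
for in_ch, out_ch, kernel, stride in layers:
    size = conv_out(size, kernel, stride)
    sizes.append(size)
    n_params += conv_params(in_ch, out_ch, kernel)

flat = layers[-1][1] * size * size           # 64 channels x 7 x 7
num_actions = 4                              # assumption: Breakout
n_params += flat * 512 + 512                 # FC1
n_params += 512 * num_actions + num_actions  # output layer

print(sizes, flat, n_params)  # [20, 9, 7] 3136 1686180
```

Almost all of the roughly 1.7M parameters sit in the first fully connected layer (3,136 × 512), not in the convolutions.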

In [None]:
class AtariDQN(nn.Module):
    """
    DQN Architecture for Atari Games.
    
    This is the exact architecture from the Nature paper!
    Uses CNNs to process image input.
    """
    
    def __init__(self, num_actions):
        super().__init__()
        
        # ========================================
        # CONVOLUTIONAL LAYERS (feature extraction)
        # ========================================
        self.conv = nn.Sequential(
            # Conv1: 84×84×4 → 20×20×32
            # 8×8 kernel with stride 4 captures large features
            nn.Conv2d(4, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            
            # Conv2: 20×20×32 → 9×9×64
            # 4×4 kernel captures medium features
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            
            # Conv3: 9×9×64 → 7×7×64
            # 3×3 kernel captures fine details
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )
        
        # ========================================
        # FULLY CONNECTED LAYERS (decision making)
        # ========================================
        self.fc = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512),  # 3,136 → 512
            nn.ReLU(),
            nn.Linear(512, num_actions)   # 512 → actions
        )
    
    def forward(self, x):
        """
        Forward pass: image → Q-values.
        
        Args:
            x: Tensor of shape (batch, 4, 84, 84)
               4 stacked frames, each 84×84 grayscale
        
        Returns:
            Q-values for each action
        """
        # Normalize pixel values to [0, 1]
        x = x / 255.0
        
        # Extract features with convolutions
        features = self.conv(x)
        
        # Flatten: (batch, 64, 7, 7) → (batch, 3136)
        features = features.view(features.size(0), -1)
        
        # Compute Q-values
        q_values = self.fc(features)
        
        return q_values


# Examine the network
print("ATARI DQN ARCHITECTURE")
print("="*60)

num_actions = 4  # e.g., Breakout: noop, fire, right, left
model = AtariDQN(num_actions)

print("\nNetwork Structure:")
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTotal Parameters: {total_params:,}")

# Test forward pass
# Random 0-255 "pixels", since forward() divides by 255 (batch of 1, 4 frames, 84×84)
dummy_input = torch.randint(0, 256, (1, 4, 84, 84)).float()
with torch.no_grad():
    output = model(dummy_input)
print(f"\nInput shape: {dummy_input.shape}")
print(f"Output shape: {output.shape} (Q-values for {num_actions} actions)")
print(f"Output: {output.numpy()[0].round(3)}")
print("="*60)

In [None]:
# Visualize what each conv layer does

fig, axes = plt.subplots(1, 4, figsize=(15, 4))

# Input
ax1 = axes[0]
ax1.set_xlim(0, 10)
ax1.set_ylim(0, 10)
ax1.axis('off')
ax1.set_title('Input\n84×84×4', fontsize=12, fontweight='bold')

input_box = FancyBboxPatch((2, 2), 6, 6, boxstyle="round,pad=0.1",
                            facecolor='#bbdefb', edgecolor='#1976d2', linewidth=3)
ax1.add_patch(input_box)
ax1.text(5, 5, '4 stacked\nframes', ha='center', va='center', fontsize=10)
ax1.text(5, 1.5, '28,224 values', ha='center', fontsize=9, color='#666')

# Conv1
ax2 = axes[1]
ax2.set_xlim(0, 10)
ax2.set_ylim(0, 10)
ax2.axis('off')
ax2.set_title('After Conv1\n20×20×32', fontsize=12, fontweight='bold')

conv1_box = FancyBboxPatch((3, 3), 4, 4, boxstyle="round,pad=0.1",
                            facecolor='#c8e6c9', edgecolor='#388e3c', linewidth=3)
ax2.add_patch(conv1_box)
ax2.text(5, 5, '32 feature\nmaps', ha='center', va='center', fontsize=10)
ax2.text(5, 2, 'Detects edges,\nbasic shapes', ha='center', fontsize=9, color='#666')

# Conv2
ax3 = axes[2]
ax3.set_xlim(0, 10)
ax3.set_ylim(0, 10)
ax3.axis('off')
ax3.set_title('After Conv2\n9×9×64', fontsize=12, fontweight='bold')

conv2_box = FancyBboxPatch((3.5, 3.5), 3, 3, boxstyle="round,pad=0.1",
                            facecolor='#fff3e0', edgecolor='#f57c00', linewidth=3)
ax3.add_patch(conv2_box)
ax3.text(5, 5, '64 feature\nmaps', ha='center', va='center', fontsize=10)
ax3.text(5, 2.5, 'Detects objects,\nmotion patterns', ha='center', fontsize=9, color='#666')

# Conv3 + FC
ax4 = axes[3]
ax4.set_xlim(0, 10)
ax4.set_ylim(0, 10)
ax4.axis('off')
ax4.set_title('After Conv3 → FC\n→ Q-values', fontsize=12, fontweight='bold')

conv3_box = FancyBboxPatch((4, 6), 2, 2, boxstyle="round,pad=0.1",
                            facecolor='#e1bee7', edgecolor='#7b1fa2', linewidth=3)
ax4.add_patch(conv3_box)
ax4.text(5, 7, '7×7×64', ha='center', va='center', fontsize=9)

ax4.annotate('', xy=(5, 5.2), xytext=(5, 5.9),
             arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

fc_box = FancyBboxPatch((3.5, 3), 3, 1.5, boxstyle="round,pad=0.1",
                         facecolor='#ffcdd2', edgecolor='#d32f2f', linewidth=3)
ax4.add_patch(fc_box)
ax4.text(5, 3.75, 'Q-values', ha='center', va='center', fontsize=10)

ax4.text(5, 1.5, 'High-level decisions:\n"Ball going left → move left"', 
         ha='center', fontsize=9, color='#666')

# Arrows between stages
for ax in axes[:-1]:
    ax.annotate('', xy=(9.5, 5), xytext=(8.5, 5),
                arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

plt.tight_layout()
plt.show()

print("\nWHAT EACH LAYER LEARNS:")
print("  Conv1: Low-level features (edges, corners, frame-to-frame changes)")
print("  Conv2: Mid-level features (shapes, textures, simple objects)")
print("  Conv3: High-level features (game objects, spatial relationships)")
print("  FC: Decision making (which action maximizes future reward)")

---
## Using Stable-Baselines3 for Atari

For production use, don't implement from scratch - use a well-tested library!

In [None]:
# Check if Stable-Baselines3 is available
try:
    from stable_baselines3 import DQN
    from stable_baselines3.common.env_util import make_atari_env
    from stable_baselines3.common.vec_env import VecFrameStack
    from stable_baselines3.common.evaluation import evaluate_policy
    SB3_AVAILABLE = True
    print("✓ Stable-Baselines3 is installed!")
except ImportError:
    SB3_AVAILABLE = False
    print("Stable-Baselines3 not installed.")
    print("\nTo install, run:")
    print("  pip install stable-baselines3[extra] gymnasium[atari] ale-py")
    print("\nShowing conceptual code only.")

In [None]:
# Create Atari environment with preprocessing
if SB3_AVAILABLE:
    try:
        # Create the environment with all preprocessing built-in!
        # make_atari_env handles:
        # - Grayscale conversion
        # - Resizing to 84×84
        # - Reward clipping
        # - Frame skipping
        env = make_atari_env(
            "BreakoutNoFrameskip-v4",  # Base env without frame skip; the Atari wrapper adds it
            n_envs=1,                   # Number of parallel envs
            seed=42
        )
        
        # Add frame stacking (4 frames)
        env = VecFrameStack(env, n_stack=4)
        
        print("ATARI ENVIRONMENT CREATED")
        print("="*60)
        print(f"\nObservation space: {env.observation_space}")
        print(f"Action space: {env.action_space}")
        
        # Show actions
        print("\nAvailable actions:")
        print("  0: NOOP (no operation)")
        print("  1: FIRE (start game/launch ball)")
        print("  2: RIGHT (move paddle right)")
        print("  3: LEFT (move paddle left)")
        print("="*60)
        
        env.close()
        
    except Exception as e:
        print(f"Error creating environment: {e}")
        print("\nYou may need to install ROM files. Try:")
        print("  pip install 'gymnasium[accept-rom-license]'")
        SB3_AVAILABLE = False
else:
    print("\n(Skipping environment creation - SB3 not installed)")

---
## Training DQN on Atari

Here's how you would train a DQN agent on Breakout:

In [None]:
# Training code (conceptual - don't run unless you have hours!)

training_code = '''
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Create environment
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=1, seed=42)
env = VecFrameStack(env, n_stack=4)

# Create DQN agent with Atari-specific hyperparameters
model = DQN(
    "CnnPolicy",              # Use CNN for images
    env,
    
    # Core hyperparameters
    learning_rate=1e-4,       # Adam learning rate
    buffer_size=100_000,      # Replay buffer size (below the table's 1M to save RAM)
    batch_size=32,            # Minibatch size
    gamma=0.99,               # Discount factor
    
    # Target network
    target_update_interval=1000,  # Sync target net every 1,000 steps (table: 10K)
    
    # Training schedule
    learning_starts=10_000,   # Steps collected before gradient updates begin
    train_freq=4,             # Train every 4 steps
    
    # Exploration (epsilon-greedy)
    exploration_fraction=0.1,     # Anneal ε over the first 10% of training (100K steps here)
    exploration_final_eps=0.01,   # Final ε value
    
    verbose=1
)

# Train! (This takes HOURS)
model.learn(total_timesteps=1_000_000)

# Save the trained model
model.save("dqn_breakout")

# Later, load and use:
# model = DQN.load("dqn_breakout")
'''

print("TRAINING CODE FOR DQN ON ATARI")
print("="*70)
print(training_code)
print("="*70)
print("\nNOTE: Training takes 2-10 hours depending on hardware!")
print("For production, consider using pretrained models or more compute.")

---
## Key Hyperparameters for Atari

```
    ┌────────────────────────────────────────────────────────────────┐
    │              RECOMMENDED HYPERPARAMETERS                       │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  PARAMETER              VALUE      WHY                        │
    │  ─────────────────────────────────────────────────────────────│
    │  Buffer size            1M         Need lots of diverse exp  │
    │  Batch size             32         Original DQN paper        │
    │  Learning rate          1e-4       Adam optimizer            │
    │  Gamma                  0.99       Long-term planning        │
    │  Target update          10K        Stability                 │
    │  Frame skip             4          Speed up training         │
    │  Frame stack            4          Capture motion            │
    │  ε start                1.0        Full exploration          │
    │  ε end                  0.01       Mostly greedy             │
    │  ε decay steps          1M         Gradual transition        │
    │                                                                │
    │  TRAINING TIME:                                               │
    │  • 1M steps ≈ 2-4 hours (GPU)                                │
    │  • 10M steps ≈ 1-2 days (GPU)                                │
    │  • For best results: 50M+ steps                              │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
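
Two rows of the table are easier to judge with numbers attached: the ε schedule and the buffer size. The sketch below assumes a plain linear anneal and a naive replay buffer that stores every stacked observation as `uint8` together with a separate copy of the next observation. Both function names are hypothetical, and real implementations can be considerably more compact.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=1_000_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def replay_buffer_gib(buffer_size, obs_shape=(4, 84, 84), store_next_obs=True):
    """Rough memory for observations only, at one byte per pixel (uint8)."""
    bytes_per_obs = 1
    for dim in obs_shape:
        bytes_per_obs *= dim
    copies = 2 if store_next_obs else 1
    return buffer_size * bytes_per_obs * copies / 2**30


for step in (0, 500_000, 1_000_000, 2_000_000):
    print(f"step {step:>9,}: epsilon = {linear_epsilon(step):.3f}")

for size in (100_000, 1_000_000):
    print(f"buffer {size:>9,}: about {replay_buffer_gib(size):.1f} GiB")
```

Under these assumptions a 1M-transition buffer needs about 52.6 GiB for observations alone, while 100K needs about 5.3 GiB, which is one practical reason to start with a smaller buffer on a single machine.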

In [None]:
# Visualize training time and performance

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Training time
ax1 = axes[0]
# Rough, illustrative estimates (not measurements); real times vary widely with hardware
steps = [100_000, 1_000_000, 10_000_000, 50_000_000]
hours_gpu = [0.25, 2.5, 25, 125]
hours_cpu = [2, 20, 200, 1000]

x = np.arange(len(steps))
width = 0.35

ax1.bar(x - width/2, hours_gpu, width, label='GPU', color='#4caf50', edgecolor='black')
ax1.bar(x + width/2, hours_cpu, width, label='CPU', color='#ff9800', edgecolor='black')

ax1.set_yscale('log')
ax1.set_xticks(x)
ax1.set_xticklabels(['100K', '1M', '10M', '50M'])
ax1.set_xlabel('Training Steps', fontsize=11)
ax1.set_ylabel('Training Time (hours, log scale)', fontsize=11)
ax1.set_title('DQN Training Time on Atari', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')
ax1.axhline(y=24, color='red', linestyle='--', alpha=0.5)
ax1.text(3.5, 30, '1 day', fontsize=9, color='red')

# Right: Typical learning curve
ax2 = axes[1]

# Simulate an illustrative learning curve (synthetic data, not a real training run)
np.random.seed(42)
steps_plot = np.linspace(0, 10_000_000, 1000)
# Sigmoid centred at 5M steps: starts near random-play scores, plateaus near 400
base_reward = 400 / (1 + np.exp(-(steps_plot - 5_000_000) / 1_000_000))
noise = np.random.randn(1000) * 30
rewards = base_reward + noise
rewards = np.clip(rewards, 0, None)

ax2.plot(steps_plot / 1e6, rewards, 'b-', alpha=0.3, linewidth=0.5)
# Smoothed
window = 50
smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
ax2.plot(steps_plot[window-1:] / 1e6, smoothed, 'b-', linewidth=2, label='Score')

ax2.axhline(y=31.8, color='red', linestyle='--', linewidth=2, label='Human Level')
ax2.set_xlabel('Million Steps', fontsize=11)
ax2.set_ylabel('Score (Breakout)', fontsize=11)
ax2.set_title('Illustrative DQN Learning Curve (Simulated)', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nTRAINING TIPS:")
print("  1. Start with 1M steps to verify everything works")
print("  2. Use GPU if available (often around 10x faster)")
print("  3. Monitor learning curves to catch problems early")
print("  4. For production: train for 10M+ steps")

---
## Summary: Key Takeaways

### The DQN Achievement

| Aspect | Details |
|--------|--------|
| **Input** | Raw pixels (84×84×4 after preprocessing) |
| **Output** | Q-values for each action |
| **Games** | 49 Atari games with same architecture |
| **Performance** | Superhuman on many games |

### Preprocessing Pipeline

| Step | What | Why |
|------|------|-----|
| Grayscale | Remove color | Not needed for gameplay |
| Resize | 84×84 | Reduce computation |
| Frame Stack | 4 frames | Capture motion |

### Architecture

| Layer | Size | Purpose |
|-------|------|--------|
| Conv1 | 32 filters, 8×8, stride 4 | Detect edges, basic features |
| Conv2 | 64 filters, 4×4, stride 2 | Detect objects, patterns |
| Conv3 | 64 filters, 3×3, stride 1 | High-level features |
| FC | 512 units | Decision making |

---
## Test Your Understanding

**1. Why do we stack 4 frames instead of using a single frame?**
<details>
<summary>Click to reveal answer</summary>
A single frame is like a photograph - it shows position but not motion. We can't tell if the ball is moving up or down. By stacking 4 consecutive frames, the network can see how objects have moved over time, like a short video. This captures velocity and trajectory information essential for playing games.
</details>

**2. Why convert to grayscale?**
<details>
<summary>Click to reveal answer</summary>
Color rarely provides useful information for Atari games - the important features (ball position, paddle position, etc.) are visible in grayscale. Removing color reduces the input size by 3x, making training faster and the network simpler without losing important information.
</details>

**3. Why use CNNs instead of fully-connected networks for images?**
<details>
<summary>Click to reveal answer</summary>
CNNs have several advantages for images:
1. Translation equivariance: The same filters detect a ball whether it is in the corner or in the center
2. Parameter efficiency: Shared weights across the image means fewer parameters
3. Hierarchical features: Layers naturally build from edges → shapes → objects
</details>

**4. Why does training take so long (millions of steps)?**
<details>
<summary>Click to reveal answer</summary>
Several reasons:
1. Sparse rewards: In Breakout, you only get reward when hitting a brick
2. Credit assignment: The action that caused the reward may be hundreds of steps earlier
3. Exploration: Need to try many random actions to discover good strategies
4. Image complexity: Each observation has about 28K inputs, so useful visual patterns take many samples to learn
</details>

**5. What made DQN a breakthrough compared to earlier methods?**
<details>
<summary>Click to reveal answer</summary>
DQN was among the first to:
1. Train deep neural networks stably with RL at this scale (earlier attempts were often unstable)
2. Learn directly from raw pixels with no hand-crafted features
3. Use the same algorithm and hyperparameters across many different games
4. Reach human-level or better performance on many complex visual tasks

The key innovations (experience replay + target networks) solved the instability problems that made earlier attempts fail.
</details>
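
To make the target-network idea in the last answer concrete, here is a small NumPy sketch of the standard DQN regression target, y = r + γ · max over a' of Q_target(s', a'), with bootstrapping switched off on terminal steps. The function name `td_targets` and the toy numbers are made up for illustration. In a real agent `next_q_target` would come from a frozen copy of the network, evaluated on a minibatch sampled from the replay buffer.

```python
import numpy as np


def td_targets(rewards, next_q_target, dones, gamma=0.99):
    """Q-learning targets computed from the target network's Q-values.

    rewards:        (batch,) rewards r
    next_q_target:  (batch, num_actions) Q-values of s' from the frozen target network
    dones:          (batch,) 1.0 where the episode ended at this step, else 0.0
    """
    max_next_q = next_q_target.max(axis=1)
    # (1 - done) removes the bootstrap term for terminal transitions.
    return rewards + gamma * (1.0 - dones) * max_next_q


rewards = np.array([1.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 1.0])
next_q_target = np.array([[0.5, 2.0],
                          [1.0, 0.0],
                          [3.0, 4.0]])

# Targets are approximately 2.98, 0.99 and 1.0 (the last transition is terminal).
print(td_targets(rewards, next_q_target, dones))
```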

---
## Congratulations!

You've completed the **Deep RL** section! You now understand:

- ✅ Why function approximation is needed
- ✅ How DQN combines deep learning with RL
- ✅ Experience replay and target networks
- ✅ DQN improvements (Double, Dueling, Rainbow)
- ✅ Training DQN on Atari games

**Next Steps:**

Move on to **[Policy Gradient Methods](../policy-gradient/)** to learn about:
- REINFORCE algorithm
- Actor-Critic methods
- A2C and A3C

These methods directly learn policies instead of value functions!

---

*"DQN proved that a single algorithm could learn superhuman performance from pixels alone. The deep RL revolution had begun."*