# REINFORCE: The Simplest Policy Gradient Algorithm

REINFORCE is where policy gradients begin! It's simple, elegant, and the foundation for everything that follows.

## What You'll Learn

By the end of this notebook, you'll understand:
- The REINFORCE algorithm step-by-step (with a talent show analogy!)
- Why it's called "Monte Carlo Policy Gradient"
- Implementing REINFORCE from scratch in PyTorch
- Training on CartPole and analyzing results
- The high variance problem (and why it matters)

**Prerequisites:** Notebook 1 (Policy Gradient Intuition)

**Time:** ~30 minutes

---
## The Big Picture: The Talent Show Analogy

```
    ┌────────────────────────────────────────────────────────────────┐
    │          THE TALENT SHOW ANALOGY                               │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Imagine you're learning to perform in talent shows...        │
    │                                                                │
    │  REINFORCE is like this:                                      │
    │    1. Go on stage with your current act (follow your policy) │
    │    2. Perform the WHOLE act (complete episode)               │
    │    3. See the judges' total score (return)                   │
    │    4. Update your act based on the score:                    │
    │       - High score? "Do MORE of those moves!"               │
    │       - Low score? "Do LESS of those moves!"                │
    │    5. Repeat at the next show                                │
    │                                                                │
    │  KEY INSIGHT:                                                 │
    │    You only get feedback AFTER the whole performance!        │
    │    No coaching during the act.                               │
    │    This is "MONTE CARLO" - wait for the episode to end!     │
    │                                                                │
    │  THE PROBLEM:                                                 │
    │    Some nights the audience is tough, some nights easy.      │
    │    Same act → different scores! (High variance)              │
    │    This makes learning noisy and slow.                       │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

try:
    import gymnasium as gym
except ImportError:
    import gym  # Fallback; the code below assumes the gym >= 0.26 reset/step API

# Visualize the REINFORCE algorithm
fig, ax = plt.subplots(figsize=(14, 8))
ax.set_xlim(0, 14)
ax.set_ylim(0, 10)
ax.axis('off')

ax.text(7, 9.5, 'REINFORCE Algorithm Flow', ha='center', fontsize=16, fontweight='bold')

# Step boxes
steps = [
    ('1. COLLECT EPISODE', 'Follow policy π(a|s;θ)', '#bbdefb', '#1976d2'),
    ('2. COMPUTE RETURNS', 'G_t = Σ γᵏ r_{t+k}', '#c8e6c9', '#388e3c'),
    ('3. COMPUTE LOSS', '-log π(a|s) × G', '#fff3e0', '#f57c00'),
    ('4. UPDATE POLICY', 'θ ← θ - α∇L', '#e1bee7', '#7b1fa2'),
]

for i, (title, desc, color, edge) in enumerate(steps):
    x = 1 + i * 3.2
    box = FancyBboxPatch((x, 5), 2.8, 2.5, boxstyle="round,pad=0.1",
                          facecolor=color, edgecolor=edge, linewidth=3)
    ax.add_patch(box)
    ax.text(x + 1.4, 6.8, title, ha='center', fontsize=10, fontweight='bold')
    ax.text(x + 1.4, 5.8, desc, ha='center', fontsize=9)
    
    if i < 3:
        ax.annotate('', xy=(x + 3, 6.25), xytext=(x + 2.9, 6.25),
                    arrowprops=dict(arrowstyle='->', lw=2, color='#666'))

# Loop back arrow
ax.annotate('', xy=(1.5, 5), xytext=(12.5, 5),
            arrowprops=dict(arrowstyle='->', lw=2, color='#666',
                           connectionstyle='arc3,rad=0.4'))
ax.text(7, 3.5, 'Repeat for many episodes', ha='center', fontsize=10, style='italic')

# Key properties
ax.text(7, 2, 'KEY PROPERTIES:', ha='center', fontsize=12, fontweight='bold')
ax.text(7, 1.3, '• Monte Carlo: Wait for episode to end • On-policy: Use current policy to collect data', ha='center', fontsize=10)
ax.text(7, 0.6, '• Unbiased gradient estimate • High variance (can be noisy)', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

print("\n" + "="*70)
print("THE REINFORCE UPDATE")
print("="*70)
print("""
For each timestep t in the episode:
    1. Compute return: G_t = r_t + γr_{t+1} + γ²r_{t+2} + ...
    2. Compute gradient: ∇_θ log π(a_t|s_t; θ) × G_t
    3. Update: θ ← θ + α × gradient

This increases the probability of actions that led to high returns!
""")
print("="*70)
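
To see the sign and size of the update concretely, here is a minimal sketch using only the standard library. It assumes a hypothetical two-action softmax policy with no state input; `softmax`, `reinforce_step`, and the numbers are made up for illustration and are not part of the notebook's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, G, lr):
    """One gradient-ascent step on log pi(action) * G for a softmax policy.

    For a softmax over logits, the derivative of log pi(action) with
    respect to logit i is (1 if i == action else 0) - pi(i).
    """
    probs = softmax(logits)
    grad_log_prob = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    return [x + lr * G * g for x, g in zip(logits, grad_log_prob)]

logits = [0.0, 0.0]  # uniform policy: each action has probability 0.5

# Same sampled action, three different (assumed) returns
p_high = softmax(reinforce_step(logits, action=1, G=10.0, lr=0.1))[1]
p_low = softmax(reinforce_step(logits, action=1, G=1.0, lr=0.1))[1]
p_neg = softmax(reinforce_step(logits, action=1, G=-10.0, lr=0.1))[1]

print(f"P(action=1) after update with G=+10: {p_high:.3f}")
print(f"P(action=1) after update with G=+1:  {p_low:.3f}")
print(f"P(action=1) after update with G=-10: {p_neg:.3f}")
```

A larger return produces a larger increase, a small positive return still nudges the probability up, and only a negative return pushes it down.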

---
## The REINFORCE Algorithm

```
    ┌────────────────────────────────────────────────────────────────┐
    │              REINFORCE PSEUDOCODE                              │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  Initialize policy network π(a|s; θ) with random weights      │
    │                                                                │
    │  FOR each episode:                                            │
    │      # Step 1: Collect trajectory                            │
    │      trajectory = []                                          │
    │      state = env.reset()                                      │
    │      WHILE not done:                                          │
    │          action ~ π(·|state; θ)  # Sample from policy        │
    │          next_state, reward, done = env.step(action)         │
    │          trajectory.append((state, action, reward))          │
    │          state = next_state                                   │
    │                                                                │
    │      # Step 2: Compute returns (backwards)                   │
    │      returns = []                                             │
    │      G = 0                                                    │
    │      FOR (s, a, r) in reversed(trajectory):                  │
    │          G = r + γ × G                                       │
    │          returns.insert(0, G)                                 │
    │                                                                │
    │      # Step 3: Update policy                                 │
    │      loss = 0                                                 │
    │      FOR (s, a, r), G in zip(trajectory, returns):           │
    │          loss -= log π(a|s; θ) × G                           │
    │      θ ← θ - α × ∇_θ loss                                    │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```

In [None]:
class PolicyNetwork(nn.Module):
    """
    Simple policy network for discrete actions.
    
    Architecture:
        State → Hidden (128) → Hidden (128) → Action probabilities
    """
    
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )
    
    def forward(self, state):
        """Returns action probabilities."""
        if not isinstance(state, torch.Tensor):
            state = torch.as_tensor(state, dtype=torch.float32)
        return self.network(state)
    
    def get_action(self, state):
        """
        Sample action and return both action and log probability.
        
        Returns:
            action: The sampled action
            log_prob: Log probability of the action (for gradient computation)
        """
        probs = self.forward(state)
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


# Create policy for CartPole
policy = PolicyNetwork(state_dim=4, action_dim=2)
print("POLICY NETWORK")
print("="*60)
print(policy)
print(f"\nTotal parameters: {sum(p.numel() for p in policy.parameters()):,}")
print("="*60)
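
The sampling step inside `get_action` can be checked in isolation. The sketch below assumes made-up action probabilities instead of a network output and an arbitrary seed; it only illustrates that `Categorical.log_prob` returns the log of the probability of the sampled action.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# Made-up action probabilities standing in for the network output
probs = torch.tensor([0.7, 0.3])
dist = torch.distributions.Categorical(probs=probs)

action = dist.sample()            # 0-dim integer tensor holding 0 or 1
log_prob = dist.log_prob(action)  # log-probability of the sampled action

# The same value computed by hand from the probability vector
manual = torch.log(probs[action])

print(f"action={action.item()}, log_prob={log_prob.item():.4f}, manual={manual.item():.4f}")
```

When `probs` comes from the policy network, `log_prob` stays attached to the computation graph, which is what lets the loss backpropagate into the network weights.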

In [None]:
def compute_returns(rewards, gamma=0.99):
    """
    Compute discounted returns for each timestep.
    
    G_t = r_t + γr_{t+1} + γ²r_{t+2} + ...
    
    We compute this BACKWARDS for efficiency:
        G_T = r_T                    (last step)
        G_{T-1} = r_{T-1} + γ × G_T  (second-to-last)
        ...and so on
    
    Args:
        rewards: List of rewards from the episode
        gamma: Discount factor
    
    Returns:
        List of returns, one for each timestep
    """
    returns = []
    G = 0
    
    # Work backwards through the episode
    for reward in reversed(rewards):
        G = reward + gamma * G
        returns.append(G)  # Collected in reverse order (last timestep first)

    returns.reverse()  # Restore chronological order
    return returns


# Demonstrate return computation
print("COMPUTING RETURNS")
print("="*60)

# Example: 5-step episode
rewards = [1, 1, 1, 1, 1]  # Reward of 1 each step
gamma = 0.99

returns = compute_returns(rewards, gamma)

print(f"\nRewards: {rewards}")
print(f"Gamma: {gamma}")
print(f"\nReturns (discounted sum of future rewards):")
for t, (r, G) in enumerate(zip(rewards, returns)):
    print(f"  t={t}: r={r}, G_t={G:.4f}")

print("\n" + "-"*60)
print("Notice: Earlier timesteps have higher returns here!")
print("With positive rewards, they have more future rewards left to collect.")
print("="*60)
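
As a sanity check on the backward recursion, the constant-reward example has a closed form. The sketch below is standalone; `discounted_returns` is a hypothetical helper name used so the snippet does not depend on the cell above.

```python
def discounted_returns(rewards, gamma):
    """Reward-to-go G_t for every timestep, computed backwards."""
    out = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    out.reverse()  # back to chronological order
    return out

gamma = 0.99
rewards = [1.0] * 5
returns = discounted_returns(rewards, gamma)

# With a constant reward of 1, G_t is a geometric series with T - t terms:
#   G_t = (1 - gamma**(T - t)) / (1 - gamma)
T = len(rewards)
closed_form = [(1 - gamma ** (T - t)) / (1 - gamma) for t in range(T)]

for t, (g, c) in enumerate(zip(returns, closed_form)):
    print(f"t={t}: recursive={g:.6f}  closed-form={c:.6f}")
```

Both columns should agree up to floating-point error, which confirms that the recursion `G = r + γ × G` reproduces the discounted sum.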

In [None]:
def reinforce(env_name='CartPole-v1', n_episodes=500, gamma=0.99, lr=1e-2,
              normalize_returns=True, print_every=50):
    """
    REINFORCE algorithm: Monte Carlo Policy Gradient.
    
    The simplest policy gradient algorithm!
    
    Args:
        env_name: Gymnasium environment name
        n_episodes: Number of episodes to train
        gamma: Discount factor
        lr: Learning rate
        normalize_returns: Whether to normalize returns (reduces variance)
        print_every: Print progress every N episodes
    
    Returns:
        policy: Trained policy network
        rewards_history: List of episode rewards
    """
    # ========================================
    # SETUP
    # ========================================
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n
    
    policy = PolicyNetwork(state_dim, action_dim)
    optimizer = optim.Adam(policy.parameters(), lr=lr)
    
    rewards_history = []
    
    # ========================================
    # TRAINING LOOP
    # ========================================
    for episode in range(n_episodes):
        # ----------------------------------------
        # STEP 1: Collect episode trajectory
        # ----------------------------------------
        state, _ = env.reset()
        log_probs = []  # Store log π(a|s) for each step
        rewards = []    # Store rewards for each step
        
        # Run episode until done
        for _ in range(500):  # Max steps
            # Sample action from policy
            action, log_prob = policy.get_action(state)
            log_probs.append(log_prob)
            
            # Take action in environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            rewards.append(reward)
            
            state = next_state
            if terminated or truncated:
                break
        
        # ----------------------------------------
        # STEP 2: Compute returns
        # ----------------------------------------
        returns = compute_returns(rewards, gamma)
        returns = torch.tensor(returns, dtype=torch.float32)
        
        # Optional: Normalize returns (reduces variance!)
        if normalize_returns and len(returns) > 1:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        # ----------------------------------------
        # STEP 3: Compute policy gradient loss
        # ----------------------------------------
        # loss = -Σ log π(a|s) × G
        # Negative because we want to MAXIMIZE expected return
        # (gradient descent minimizes, so we negate)
        loss = -(torch.stack(log_probs) * returns).sum()
        
        # ----------------------------------------
        # STEP 4: Update policy
        # ----------------------------------------
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Track progress
        episode_reward = sum(rewards)
        rewards_history.append(episode_reward)
        
        if (episode + 1) % print_every == 0:
            avg_reward = np.mean(rewards_history[-print_every:])
            print(f"Episode {episode+1:4d} | Avg Reward: {avg_reward:6.1f} | Episode: {episode_reward:3.0f}")
    
    env.close()
    return policy, rewards_history

In [None]:
# Train REINFORCE on CartPole!
print("TRAINING REINFORCE ON CARTPOLE")
print("="*70)
print("\nThis may take a minute...\n")

policy, rewards_history = reinforce(
    env_name='CartPole-v1',
    n_episodes=500,
    gamma=0.99,
    lr=1e-2,
    normalize_returns=True,
    print_every=50
)

print("\n" + "="*70)
print(f"Final average (last 50): {np.mean(rewards_history[-50:]):.1f}")
print(f"Best episode: {max(rewards_history):.0f}")
print("="*70)

In [None]:
# Visualize training progress
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Raw rewards and smoothed
ax1 = axes[0]
ax1.plot(rewards_history, alpha=0.3, color='blue', label='Episode Reward')

# Smoothed curve
window = 50
smoothed = np.convolve(rewards_history, np.ones(window)/window, mode='valid')
ax1.plot(range(window-1, len(rewards_history)), smoothed, color='red', linewidth=2, label=f'{window}-Episode Average')

ax1.axhline(y=500, color='green', linestyle='--', linewidth=2, label='Max Score (500)')
ax1.set_xlabel('Episode', fontsize=11)
ax1.set_ylabel('Reward', fontsize=11)
ax1.set_title('REINFORCE Training on CartPole', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right: Reward distribution over time
ax2 = axes[1]

# Split into quarters
quarters = np.array_split(rewards_history, 4)
labels = ['0-25%', '25-50%', '50-75%', '75-100%']
positions = [1, 2, 3, 4]

bp = ax2.boxplot(quarters, positions=positions, patch_artist=True)
colors = ['#ffcdd2', '#fff3e0', '#c8e6c9', '#81c784']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

ax2.set_xticklabels(labels)
ax2.set_xlabel('Training Progress', fontsize=11)
ax2.set_ylabel('Episode Reward', fontsize=11)
ax2.set_title('Reward Distribution Over Training', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nOBSERVATIONS:")
print("  • The raw rewards are VERY noisy (high variance!)")
print("  • The smoothed curve shows learning progress")
print("  • Later training episodes have higher and more consistent rewards")

---
## The High Variance Problem

```
    ┌────────────────────────────────────────────────────────────────┐
    │              THE HIGH VARIANCE PROBLEM                         │
    ├────────────────────────────────────────────────────────────────┤
    │                                                                │
    │  REINFORCE multiplies: log π(a|s) × G                         │
    │                                                                │
    │  PROBLEM: G can vary WILDLY between episodes!                 │
    │                                                                │
    │  Episode 1: G = 50 (bad luck)                                 │
    │  Episode 2: G = 300 (good luck)                               │
    │  Episode 3: G = 100 (ok)                                      │
    │                                                                │
    │  Same action → wildly different gradients!                    │
    │                                                                │
    │  SOURCES OF VARIANCE:                                         │
    │  1. Stochastic policy: Different actions each episode        │
    │  2. Stochastic environment: Randomness in transitions        │
    │  3. Monte Carlo: G sums many random rewards                  │
    │  4. Credit assignment: Early actions credit for late rewards │
    │                                                                │
    │  WHY THIS MATTERS:                                            │
    │  • Slow learning (noisy gradients can point the wrong way)   │
    │  • Need more samples to converge                             │
    │  • Unstable training                                         │
    │                                                                │
    └────────────────────────────────────────────────────────────────┘
```
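
The noise in single-sample gradient estimates can be measured directly on a toy problem. The sketch below assumes a hypothetical two-armed bandit with Gaussian rewards and a fixed 50/50 softmax policy; all names and numbers are made up for illustration, and the seed is arbitrary.

```python
import random
import statistics

random.seed(0)  # arbitrary seed for repeatability

# Hypothetical bandit: arm 1 is better on average, but rewards are noisy.
MEAN_REWARD = [1.0, 2.0]
REWARD_STD = 1.0
P_ARM1 = 0.5  # current policy picks each arm with probability 0.5

def sample_gradient():
    """Single-sample REINFORCE estimate of dJ/d(logit of arm 1)."""
    arm = 1 if random.random() < P_ARM1 else 0
    reward = random.gauss(MEAN_REWARD[arm], REWARD_STD)
    # Derivative of log pi(arm) with respect to the logit of arm 1
    grad_log_prob = (1.0 if arm == 1 else 0.0) - P_ARM1
    return grad_log_prob * reward

# Exact gradient for a two-action softmax policy: p1 * (1 - p1) * (mu1 - mu0)
true_grad = P_ARM1 * (1 - P_ARM1) * (MEAN_REWARD[1] - MEAN_REWARD[0])

samples = [sample_gradient() for _ in range(20_000)]
mean_est = statistics.fmean(samples)
std_est = statistics.pstdev(samples)
wrong_sign = sum(g < 0 for g in samples) / len(samples)

print(f"True gradient:              {true_grad:.3f}")
print(f"Mean of sample estimates:   {mean_est:.3f}")
print(f"Std of sample estimates:    {std_est:.3f}")
print(f"Fraction with wrong sign:   {wrong_sign:.2f}")
```

The average of the estimates lands close to the true gradient (the estimator is unbiased), but the spread of individual estimates is several times larger than the gradient itself. Under these assumed numbers, roughly four in ten single-sample estimates point the wrong way, which is why many samples are needed.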

In [None]:
# Demonstrate the variance problem

# Run multiple episodes with trained policy to see variance
env = gym.make('CartPole-v1')
episode_rewards = []
episode_lengths = []

for _ in range(100):
    state, _ = env.reset()
    total_reward = 0
    
    for step in range(500):
        with torch.no_grad():
            action, _ = policy.get_action(state)
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    
    episode_rewards.append(total_reward)
    episode_lengths.append(step + 1)

env.close()

# Visualize variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Distribution of returns
ax1 = axes[0]
ax1.hist(episode_rewards, bins=20, color='#64b5f6', edgecolor='black', alpha=0.7)
ax1.axvline(x=np.mean(episode_rewards), color='red', linewidth=2, linestyle='--', 
            label=f'Mean: {np.mean(episode_rewards):.1f}')
ax1.axvline(x=np.mean(episode_rewards) + np.std(episode_rewards), color='orange', linewidth=2, linestyle=':',
            label=f'Std: ±{np.std(episode_rewards):.1f}')
ax1.axvline(x=np.mean(episode_rewards) - np.std(episode_rewards), color='orange', linewidth=2, linestyle=':')
ax1.set_xlabel('Episode Reward', fontsize=11)
ax1.set_ylabel('Count', fontsize=11)
ax1.set_title('Variance in Episode Returns\n(Same Policy, Different Outcomes)', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Right: Returns over episodes
ax2 = axes[1]
ax2.plot(episode_rewards, 'o-', alpha=0.7, markersize=3)
ax2.axhline(y=np.mean(episode_rewards), color='red', linewidth=2, linestyle='--', label='Mean')
ax2.fill_between(range(100), 
                 np.mean(episode_rewards) - np.std(episode_rewards),
                 np.mean(episode_rewards) + np.std(episode_rewards),
                 alpha=0.2, color='red', label='±1 Std')
ax2.set_xlabel('Episode', fontsize=11)
ax2.set_ylabel('Reward', fontsize=11)
ax2.set_title('Episode-to-Episode Variance\n(Even with trained policy!)', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nVARIANCE STATISTICS:")
print(f"  Mean return: {np.mean(episode_rewards):.1f}")
print(f"  Std return: {np.std(episode_rewards):.1f}")
print(f"  Min: {min(episode_rewards):.0f}, Max: {max(episode_rewards):.0f}")
print(f"\n  The spread can be large: the same policy can give very different outcomes.")
print(f"  During training, this makes gradient estimates very noisy.")

In [None]:
# Show effect of return normalization

def reinforce_comparison(env_name='CartPole-v1', n_episodes=300, gamma=0.99, lr=1e-2):
    """Compare REINFORCE with and without return normalization."""
    
    results = {}
    
    for normalize in [False, True]:
        name = "With Normalization" if normalize else "Without Normalization"
        print(f"\nTraining {name}...")
        
        _, rewards = reinforce(
            env_name=env_name,
            n_episodes=n_episodes,
            gamma=gamma,
            lr=lr,
            normalize_returns=normalize,
            print_every=100
        )
        results[name] = rewards
    
    return results

print("COMPARING WITH AND WITHOUT NORMALIZATION")
print("="*60)
results = reinforce_comparison(n_episodes=300)

In [None]:
# Plot comparison
fig, ax = plt.subplots(figsize=(12, 6))

window = 30
for name, rewards in results.items():
    smoothed = np.convolve(rewards, np.ones(window)/window, mode='valid')
    color = '#4caf50' if 'With' in name else '#f44336'
    ax.plot(range(window-1, len(rewards)), smoothed, linewidth=2, label=name, color=color)

ax.set_xlabel('Episode', fontsize=11)
ax.set_ylabel('Reward (smoothed)', fontsize=11)
ax.set_title('Effect of Return Normalization on REINFORCE', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nNormalization typically helps by:")
print("  • Centering returns around 0 (negative for bad, positive for good)")
print("  • Scaling to similar magnitude")
print("  • Reducing variance in gradient estimates")
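
What normalization does to the per-step weights is easy to see on a tiny example. This is a standalone sketch on the 5-step constant-reward episode used earlier; it uses the sample standard deviation (n - 1 denominator), which is what `torch.Tensor.std()` computes by default.

```python
import statistics

gamma = 0.99
rewards = [1.0] * 5

# Reward-to-go returns, computed backwards
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()

mean = statistics.fmean(returns)
std = statistics.stdev(returns)  # sample std (n - 1 denominator)
normalized = [(g - mean) / (std + 1e-8) for g in returns]

for t, (g, n) in enumerate(zip(returns, normalized)):
    print(f"t={t}: raw G_t={g:.4f}  normalized={n:+.4f}")
```

Every raw return is positive, so without normalization every sampled action is pushed up. After normalization the early steps keep a positive weight while the steps just before the episode ended get a negative weight, so those actions are made less likely.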

---
## Summary: Key Takeaways

### REINFORCE Algorithm

```
For each episode:
    1. Collect trajectory: (s₀, a₀, r₀), (s₁, a₁, r₁), ...
    2. Compute returns: G_t = Σ γᵏ r_{t+k}
    3. Update policy: θ ← θ + α × Σ_t ∇log π(a_t|s_t) × G_t
```

### Key Properties

| Property | Description |
|----------|-------------|
| **Monte Carlo** | Wait for episode to end |
| **On-policy** | Use current policy to collect data |
| **Unbiased** | Gradient estimate is unbiased |
| **High Variance** | Returns can vary wildly |

### Pros and Cons

| Pros | Cons |
|------|------|
| Simple to implement | High variance |
| Unbiased gradients | Needs complete episodes |
| Works with continuous actions | Sample inefficient |
| Foundation for advanced methods | Slow to converge |

---
## Test Your Understanding

**1. Why is REINFORCE called "Monte Carlo" policy gradient?**
<details>
<summary>Click to reveal answer</summary>
Because it uses complete episode returns (G_t) rather than bootstrapped estimates. Monte Carlo methods wait until the episode ends and use the actual sampled return, rather than estimating it from a value function. This makes the gradient estimate unbiased but high-variance.
</details>

**2. What does the loss `-log π(a|s) × G` mean intuitively?**
<details>
<summary>Click to reveal answer</summary>
- `-log π(a|s)`: How "surprising" was this action (lower prob = more surprising = higher loss)
- `× G`: Scale by how good the outcome was

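If G is high and π(a|s) was low, the update raises π(a|s) a lot (making this good action more likely).
If G is small but still positive, the action is reinforced too, just more weakly. Its probability is pushed down only when G is negative, for example after return normalization.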
</details>

**3. Why do we normalize returns?**
<details>
<summary>Click to reveal answer</summary>
Normalization reduces variance by:
1. Centering returns around 0 (negative = below average, positive = above average)
2. Scaling to consistent magnitude across episodes

This makes gradient updates more stable, even though it technically adds some bias.
</details>

**4. Why is REINFORCE high variance?**
<details>
<summary>Click to reveal answer</summary>
Several sources of variance:
1. Stochastic policy samples different actions
2. Environment may have randomness
3. Returns include ALL future rewards (credit assignment problem)
4. Single episode can have very different outcomes

The same action might lead to return 50 in one episode and 300 in another!
</details>

**5. What's the credit assignment problem?**
<details>
<summary>Click to reveal answer</summary>
Each action's gradient is weighted by G_t, which includes ALL rewards from step t onward. But early actions shouldn't get full credit for rewards that came from good later actions! For example, in a 100-step episode, the first action's gradient uses all 100 rewards, even though many of them may have had little to do with that first action.
</details>

---
## What's Next?

REINFORCE works but has high variance. In the next notebook, we'll learn **variance reduction techniques** that make policy gradients practical!

**Continue to:** [Notebook 3: Variance Reduction](03_variance_reduction.ipynb)

---

*REINFORCE: Simple, elegant, but needs help with variance!*