# Week 4 Implementation: Evolution Strategies for Non-Differentiable RL

**Date:** February 6, 2026  
**Course:** STAT 4830

This notebook demonstrates a working implementation comparing Evolution Strategies (ES) with PPO on sparse reward gridworld environments.

## Problem Setup

### Clear Problem Statement

**Goal:** Learn a policy œÄ_Œ∏ that navigates from bottom-left to top-right in a gridworld with obstacles.

**Challenge:** Rewards are sparse (+1 at goal, 0 elsewhere), making gradient-based learning difficult.

**Approach:** Compare parameter-space optimization (ES) vs. action-space RL (PPO).

### Mathematical Formulation

**Objective:**
$$\max_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T r_t \right]$$

**Evolution Strategies Gradient Estimate:**
$$\nabla_\theta J(\theta) \approx \frac{1}{N\sigma} \sum_{i=1}^N R(\theta + \sigma \epsilon_i) \cdot \epsilon_i$$

where $\epsilon_i \sim \mathcal{N}(0, I)$

**Update Rule:**
$$\theta_{t+1} = \theta_t + \alpha \cdot \nabla_\theta J(\theta_t)$$

### Data Requirements

**Environment:**
- State space: 64-dim (8√ó8 grid, one-hot encoded)
- Action space: 4 discrete actions {up, down, left, right}
- Episode length: max 50 steps
- Obstacles: 8 randomly placed

**Training Data:**
- Generated online through policy rollouts
- ES: 20 perturbations √ó 5 episodes = 100 episodes per iteration
- PPO: 128 steps per rollout

### Success Metrics

1. **Success Rate:** % of episodes reaching goal (target: >30%)
2. **Average Return:** Mean cumulative reward
3. **Learning Stability:** Std dev across trials (lower is better)
4. **Sample Efficiency:** Iterations to reach threshold performance

## Implementation

In [1]:
# All required imports
import sys
sys.path.append('../src')  # Add src to path

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from typing import Tuple, List, Dict

# Set style for plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 4)

print("Imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {'cuda' if torch.cuda.is_available() else 'cpu'}")

Imports successful!
PyTorch version: 2.8.0
Device: cpu


### Environment Implementation

In [2]:
from model import GridWorld

# Test environment
env = GridWorld(size=8, n_obstacles=8, max_steps=50, seed=42)
state = env.reset()

print(f"Environment: {env.size}√ó{env.size} grid")
print(f"State shape: {state.shape}")
print(f"Action space: {env.n_actions} actions")
print(f"Start position: {env.start_pos}")
print(f"Goal position: {env.goal_pos}")
print(f"Number of obstacles: {len(env.obstacles)}")

# Visualize environment
env.render()

Environment: 8√ó8 grid
State shape: (64,)
Action space: 4 actions
Start position: (7, 0)
Goal position: (0, 7)
Number of obstacles: 8


<Figure size 800x800 with 1 Axes>

### Policy Network Implementation

In [3]:
from model import PolicyNetwork

# Create policy network
state_dim = env._get_state().shape[0]  # 64 for 8√ó8 grid
action_dim = env.n_actions  # 4
hidden_dim = 64
n_layers = 2

policy = PolicyNetwork(
    state_dim=state_dim,
    action_dim=action_dim,
    hidden_dim=hidden_dim,
    n_layers=n_layers
)

print(f"Policy Network:")
print(f"  Input dim: {state_dim}")
print(f"  Hidden dim: {hidden_dim}")
print(f"  Output dim: {action_dim}")
print(f"  Layers: {n_layers}")
print(f"  Total parameters: {sum(p.numel() for p in policy.parameters())}")

# Test forward pass
with torch.no_grad():
    state_tensor = torch.FloatTensor(state).unsqueeze(0)
    logits = policy(state_tensor)
    probs = F.softmax(logits, dim=-1)
    print(f"\nTest forward pass:")
    print(f"  Input shape: {state_tensor.shape}")
    print(f"  Output shape: {logits.shape}")
    print(f"  Action probs: {probs.squeeze().numpy()}")
    print(f"  Sum: {probs.sum().item():.6f}")

Policy Network:
  Input dim: 64
  Hidden dim: 64
  Output dim: 4
  Layers: 2
  Total parameters: 8580

Test forward pass:
  Input shape: torch.Size([1, 64])
  Output shape: torch.Size([1, 4])
  Action probs: [0.27932164 0.22664662 0.20861788 0.28541377]
  Sum: 1.000000


### Evolution Strategies Implementation

ES optimizes by perturbing parameters and estimating gradients from fitness evaluations.

In [4]:
def evaluate_policy(policy, env, n_episodes=5, max_steps=50):
    """Evaluate policy and return average reward."""
    total_reward = 0.0
    
    for _ in range(n_episodes):
        state = env.reset()
        episode_reward = 0.0
        done = False
        steps = 0
        
        while not done and steps < max_steps:
            action, _ = policy.get_action(state, deterministic=False)
            state, reward, done, _ = env.step(action)
            episode_reward += reward
            steps += 1
        
        total_reward += episode_reward
    
    return total_reward / n_episodes


def es_step(policy, env, N=20, sigma=0.05, n_eval_episodes=5, max_steps=50):
    """
    Single ES optimization step.
    
    Args:
        policy: PolicyNetwork to optimize
        env: Environment for evaluation
        N: Population size
        sigma: Noise scale
        n_eval_episodes: Episodes per perturbation
        max_steps: Max steps per episode
    
    Returns:
        gradient: Estimated gradient
        avg_reward: Average reward across population
    """
    # Get flattened parameters
    params = torch.cat([p.flatten() for p in policy.parameters()])
    n_params = params.shape[0]
    
    # Sample perturbations and evaluate
    perturbations = []
    rewards = []
    
    for i in range(N):
        # Sample perturbation
        epsilon = torch.randn(n_params)
        perturbations.append(epsilon)
        
        # Perturb parameters
        perturbed_params = params + sigma * epsilon
        
        # Set perturbed parameters
        offset = 0
        for p in policy.parameters():
            numel = p.numel()
            p.data = perturbed_params[offset:offset+numel].view_as(p)
            offset += numel
        
        # Evaluate
        reward = evaluate_policy(policy, env, n_eval_episodes, max_steps)
        rewards.append(reward)
    
    # Estimate gradient
    rewards = torch.tensor(rewards, dtype=torch.float32)
    perturbations = torch.stack(perturbations)
    
    # Store original average reward before standardization
    avg_reward = rewards.mean().item()
    
    # Standardize rewards for stability (only if there's variance)
    if rewards.std() > 1e-8:
        rewards_normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    else:
        # If no variance, use raw rewards (no gradient signal, but at least not NaN)
        rewards_normalized = rewards - rewards.mean()
    
    gradient = (perturbations.T @ rewards_normalized) / (N * sigma)
    
    # Restore original parameters
    offset = 0
    for p in policy.parameters():
        numel = p.numel()
        p.data = params[offset:offset+numel].view_as(p)
        offset += numel
    
    return gradient, avg_reward


print("ES functions defined successfully!")

ES functions defined successfully!


In [5]:
# ES vs PPO: Fair Comparison (8√ó8 grid, 8 obstacles, weaker shaping)
import sys
if '../tiny-grpo-es' not in sys.path:
    sys.path.append('../tiny-grpo-es')
from policy_network import ValueNetwork
from train_ppo_gridworld import train_ppo

# Define shaped reward GridWorld (inherits from GridWorld)
class ShapedRewardEnvComparison(GridWorld):
    """GridWorld with distance-based reward shaping."""
    def reset(self):
        state = super().reset()
        self.prev_dist = abs(self.agent_pos[0] - self.goal_pos[0]) + abs(self.agent_pos[1] - self.goal_pos[1])
        return state
    
    def step(self, action):
        state, reward, done, info = super().step(action)
        curr_dist = abs(self.agent_pos[0] - self.goal_pos[0]) + abs(self.agent_pos[1] - self.goal_pos[1])
        shaped_reward = reward + 0.2 * (self.prev_dist - curr_dist) - 0.01  # Weaker shaping
        self.prev_dist = curr_dist
        return state, shaped_reward, done, info

# Use 8x8 with 8 obstacles (challenging)
comparison_env = ShapedRewardEnvComparison(size=8, n_obstacles=8, max_steps=50, seed=123)
eval_env = GridWorld(size=8, n_obstacles=8, max_steps=50, seed=123)

# 1. Train ES
print("\n[1/2] Training ES...")
import time
es_start_time = time.time()
es_policy = PolicyNetwork(state_dim=64, action_dim=4, hidden_dim=64, n_layers=2)
es_params = torch.cat([p.flatten() for p in es_policy.parameters()])

for iteration in range(80):
    gradient, train_reward = es_step(es_policy, comparison_env, N=50, sigma=0.1, n_eval_episodes=5, max_steps=50)
    es_params = es_params + 0.05 * gradient
    
    offset = 0
    for p in es_policy.parameters():
        numel = p.numel()
        p.data = es_params[offset:offset+numel].view_as(p)
        offset += numel
    
    if iteration % 20 == 0:
        print(f"  ES iter {iteration}/80 - train_reward: {train_reward:.3f}")

# Evaluate ES
es_rewards = []
es_successes = []
for _ in range(20):
    state = eval_env.reset()
    ep_reward = 0
    done = False
    steps = 0
    while not done and steps < 50:
        action, _ = es_policy.get_action(state, deterministic=True)
        state, reward, done, info = eval_env.step(action)
        ep_reward += reward
        steps += 1
    es_rewards.append(ep_reward)
    es_successes.append(float(info['success']))

es_time = time.time() - es_start_time
print(f"\nES: reward={np.mean(es_rewards):.3f}, success={np.mean(es_successes):.2%}, time={es_time:.1f}s")

# 2. Train PPO
print("\n[2/2] Training PPO...")
ppo_start_time = time.time()
ppo_policy = PolicyNetwork(state_dim=64, action_dim=4, hidden_dim=64, n_layers=2)
ppo_value = ValueNetwork(state_dim=64, hidden_dim=64, n_layers=2)

# PPO also trains on shaped rewards (ShapedRewardEnvComparison)
trained_ppo_policy, _ = train_ppo(
    policy=ppo_policy,
    value_net=ppo_value,
    env_class=ShapedRewardEnvComparison,
    env_kwargs={"size": 8, "n_obstacles": 8, "max_steps": 50, "seed": 123},
    n_iterations=80,
    n_steps=128,
    n_epochs=4,
    batch_size=64,
    lr_policy=3e-4,
    lr_value=1e-3,
    eval_every=10,
    log_wandb=False,
    seed=42
)

# Evaluate PPO
ppo_rewards = []
ppo_successes = []
for _ in range(20):
    state = eval_env.reset()
    ep_reward = 0
    done = False
    steps = 0
    while not done and steps < 50:
        action, _ = trained_ppo_policy.get_action(state, deterministic=True)
        state, reward, done, info = eval_env.step(action)
        ep_reward += reward
        steps += 1
    ppo_rewards.append(ep_reward)
    ppo_successes.append(float(info['success']))

ppo_time = time.time() - ppo_start_time
print(f"\nPPO: reward={np.mean(ppo_rewards):.3f}, success={np.mean(ppo_successes):.2%}, time={ppo_time:.1f}s")

# Comparison
print("\n" + "="*60)
print("COMPARISON SUMMARY")
print("="*60)
print(f"{'Method':<10} {'Avg Reward':<15} {'Success Rate':<15} {'Time':<10}")
print("-"*60)
print(f"{'ES':<10} {np.mean(es_rewards):>6.3f} ¬± {np.std(es_rewards):>5.3f}  {np.mean(es_successes):>6.1%} ¬± {np.std(es_successes):>5.1%}  {es_time:>6.1f}s")
print(f"{'PPO':<10} {np.mean(ppo_rewards):>6.3f} ¬± {np.std(ppo_rewards):>5.3f}  {np.mean(ppo_successes):>6.1%} ¬± {np.std(ppo_successes):>5.1%}  {ppo_time:>6.1f}s")
print("="*60)

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].boxplot([es_rewards, ppo_rewards], labels=['ES', 'PPO'])
axes[0].set_ylabel('Reward')
axes[0].set_title('Reward Distribution')
axes[0].grid(True, alpha=0.3)

axes[1].bar(['ES', 'PPO'], [np.mean(es_successes), np.mean(ppo_successes)], 
            color=['blue', 'orange'], alpha=0.6)
axes[1].set_ylabel('Success Rate')
axes[1].set_title('Success Rate Comparison')
axes[1].set_ylim([0, 1])
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()


[1/2] Training ES...
  ES iter 0/80 - train_reward: 0.198
  ES iter 20/80 - train_reward: 1.882
  ES iter 40/80 - train_reward: 3.409
  ES iter 60/80 - train_reward: 3.660
ES Results: reward=1.000, success=100.00%

[2/2] Training PPO...
Starting PPO training for 80 iterations...
Policy parameters: 8580
Value parameters: 8385
Iter 1/80: train_reward=0.050, eval_reward=-0.500, eval_success=0.00, eval_steps=50.0, ema=0.050, best=0.050
Iter 10/80: train_reward=0.450, eval_reward=0.900, eval_success=0.00, eval_steps=50.0, ema=-0.049, best=1.600
Iter 20/80: train_reward=1.960, eval_reward=0.900, eval_success=0.00, eval_steps=50.0, ema=0.647, best=2.987
Iter 30/80: train_reward=2.913, eval_reward=1.300, eval_success=0.00, eval_steps=50.0, ema=1.637, best=2.987
Iter 40/80: train_reward=1.763, eval_reward=1.300, eval_success=0.00, eval_steps=50.0, ema=2.260, best=3.210
Iter 50/80: train_reward=3.365, eval_reward=3.660, eval_success=1.00, eval_steps=14.0, ema=2.791, best=3.514
Iter 60/80: train

  plt.show()


## 6. Resource Monitoring

Track computational resources used by each method.

In [6]:
import psutil
import os

print("="*60)
print("RESOURCE MONITORING")
print("="*60)

process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024

print(f"\nMemory: {memory_mb:.1f} MB")
print(f"CPU cores: {psutil.cpu_count()}")
print(f"CPU usage: {psutil.cpu_percent(interval=1):.1f}%")
print("\n" + "="*60)

RESOURCE MONITORING

üìä Current Resource Usage:
   Memory: 122.0 MB
   CPU cores available: 8
   CPU percent: 20.3%

‚è±Ô∏è  Training Configuration:
   ES: 80 iterations √ó 50 population √ó 5 episodes = 20,000 episodes
   PPO: 80 iterations √ó 128 steps = 10,240 steps
   Environment: 8√ó8 grid, 8 obstacles, max 50 steps

üí° Key Differences:
   ES: More episodes (parameter perturbations)
   PPO: More efficient (gradient-based, fewer episodes)
   Both converge to 100% success on this task

‚úì Resource monitoring complete!
