# Optimization: Experience Replay, Reward Shaping, and Hyperparameter Tuning

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand and implement experience replay
- Apply reward shaping techniques
- Perform hyperparameter tuning for Deep RL
- Compare optimization techniques
- Improve learning efficiency

## 🔗 Prerequisites

- ✅ Understanding of Deep RL algorithms
- ✅ Understanding of neural networks
- ✅ Python knowledge
- ✅ NumPy, collections knowledge

---

## Official Structure Reference

This notebook covers practical activities from **Course 09, Unit 3**:
- Optimization: experimenting with techniques like experience replay, reward shaping, and hyperparameter tuning to improve learning efficiency
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 3 Practical Content

---

## Introduction

**Optimization techniques** such as experience replay, reward shaping, and hyperparameter tuning improve the training efficiency and final performance of Deep RL agents. This notebook implements each in turn.

In [1]:
import numpy as np
from collections import deque
import random

print("✅ Libraries imported!")
print("\nOptimization: Experience Replay, Reward Shaping, Hyperparameter Tuning")
print("=" * 60)

✅ Libraries imported!

Optimization: Experience Replay, Reward Shaping, Hyperparameter Tuning


## Part 1: Experience Replay


In [2]:
print("=" * 60)
print("Part 1: Experience Replay")
print("=" * 60)

class ReplayBuffer:
    """Experience replay buffer for storing and sampling transitions."""
    
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
        self.capacity = capacity
    
    def push(self, state, action, reward, next_state, done):
        """Store a transition."""
        self.buffer.append((state, action, reward, next_state, done))
    
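    def sample(self, batch_size):
        """Sample a batch of transitions uniformly at random, without replacement."""
        # Clamp so sampling still works before the buffer holds batch_size items
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        # Transpose the list of transitions into one tuple per field
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones))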
    
    def __len__(self):
        return len(self.buffer)

# Example usage
buffer = ReplayBuffer(capacity=1000)

# Simulate storing experiences
for i in range(100):
    state = np.random.randn(4)
    action = random.randint(0, 1)
    reward = random.random()
    next_state = np.random.randn(4)
    done = False
    buffer.push(state, action, reward, next_state, done)

print(f"\nReplay Buffer:")
print(f"  Capacity: {buffer.capacity}")
print(f"  Current size: {len(buffer)}")

# Sample a batch
batch_states, batch_actions, batch_rewards, batch_next_states, batch_dones = buffer.sample(32)
print(f"  Sampled batch size: {len(batch_states)}")

print("\nExperience Replay Benefits:")
print("  - Breaks correlation between consecutive samples")
print("  - Reuses past experiences (sample efficiency)")
print("  - Stabilizes training")
print("  - Enables off-policy learning")

print("\n✅ Experience replay implemented!")

Part 1: Experience Replay

Replay Buffer:
  Capacity: 1000
  Current size: 100
  Sampled batch size: 32

Experience Replay Benefits:
  - Breaks correlation between consecutive samples
  - Reuses past experiences (sample efficiency)
  - Stabilizes training
  - Enables off-policy learning

✅ Experience replay implemented!
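
A sampled batch is usually consumed by a Q-learning style update. The sketch below shows one common way to turn a batch into one-step TD targets. The helper `td_targets` and the array `next_q` are hypothetical stand-ins introduced for illustration (in a real agent, `next_q` would come from a Q-network or target network), and the small hand-written batch only mimics the shape of what `ReplayBuffer.sample` returns.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step Q-learning targets: r + gamma * max_a' Q(s', a') for non-terminal s'."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    best_next = np.max(next_q_values, axis=1)
    # Terminal transitions have no future return, so the bootstrap term is masked out
    return rewards + gamma * best_next * (~dones)

# Stand-in batch shaped like the output of ReplayBuffer.sample(3)
batch_rewards = np.array([1.0, 0.0, 0.5])
batch_dones = np.array([False, True, False])
# Hypothetical Q-value estimates for the next states (3 transitions, 2 actions)
next_q = np.array([[0.2, 0.8], [0.5, 0.1], [0.0, 0.4]])

targets = td_targets(batch_rewards, next_q, batch_dones, gamma=0.9)
print(targets)
```

Masking with `dones` matters because transitions stored in the buffer may end an episode; bootstrapping from the state after a terminal one would leak value across episode boundaries.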


## Part 2: Reward Shaping


In [3]:
print("\n" + "=" * 60)
print("Part 2: Reward Shaping")
print("=" * 60)

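def shaped_reward(original_reward, state, next_state, goal_state, gamma=1.0):
    """Apply potential-based reward shaping: r + c * (gamma * Phi(s') - Phi(s))."""
    # Potential Phi(s) = negative Euclidean distance to the goal
    potential = -np.linalg.norm(state - goal_state)
    next_potential = -np.linalg.norm(next_state - goal_state)
    # Policy invariance requires the agent's discount factor on the next potential;
    # gamma=1.0 (the default) reproduces the undiscounted form
    shaping = 0.1 * (gamma * next_potential - potential)  # Small shaping coefficient
    return original_reward + shaping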

# Example
state = np.array([0.0, 0.0])
next_state = np.array([0.5, 0.5])
goal = np.array([1.0, 1.0])
original_reward = 0.0

shaped = shaped_reward(original_reward, state, next_state, goal)
print(f"\nReward Shaping Example:")
print(f"  Original reward: {original_reward}")
print(f"  Shaped reward: {shaped:.4f}")
print(f"  Shaping term: {shaped - original_reward:.4f}")

print("\nReward Shaping Benefits:")
print("  - Guides agent toward goals")
print("  - Provides intermediate feedback")
print("  - Speeds up learning")
print("  - Maintains policy invariance (potential-based)")

print("\n✅ Reward shaping implemented!")


Part 2: Reward Shaping

Reward Shaping Example:
  Original reward: 0.0
  Shaped reward: 0.0707
  Shaping term: 0.0707

Reward Shaping Benefits:
  - Guides agent toward goals
  - Provides intermediate feedback
  - Speeds up learning
  - Maintains policy invariance (potential-based)

✅ Reward shaping implemented!
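
The policy-invariance claim for potential-based shaping rests on a telescoping sum: along any trajectory, the discounted shaping terms add up to $\gamma^T \Phi(s_T) - \Phi(s_0)$, which depends only on the endpoints and not on the path taken. The sketch below checks this numerically. The helper names (`potential`, `discounted_shaping_sum`) and the two example paths are assumptions made for illustration; the potential has the same form as in `shaped_reward`, with the 0.1 coefficient folded into $\Phi$.

```python
import numpy as np

def potential(state, goal, scale=0.1):
    """Phi(s): scaled negative distance to the goal (same form as shaped_reward)."""
    return -scale * np.linalg.norm(state - goal)

def discounted_shaping_sum(trajectory, goal, gamma):
    """Sum of gamma^t * (gamma * Phi(s_{t+1}) - Phi(s_t)) along a trajectory."""
    total = 0.0
    for t in range(len(trajectory) - 1):
        f = gamma * potential(trajectory[t + 1], goal) - potential(trajectory[t], goal)
        total += (gamma ** t) * f
    return total

goal = np.array([1.0, 1.0])
gamma = 0.9
start, end = np.array([0.0, 0.0]), np.array([1.0, 1.0])

# Two different 4-step paths between the same start and end states
direct = [start, np.array([0.3, 0.3]), np.array([0.6, 0.6]), np.array([0.9, 0.9]), end]
detour = [start, np.array([-1.0, 0.5]), np.array([2.0, -1.0]), np.array([0.5, 2.0]), end]

sum_direct = discounted_shaping_sum(direct, goal, gamma)
sum_detour = discounted_shaping_sum(detour, goal, gamma)

# Telescoping prediction: gamma^T * Phi(s_T) - Phi(s_0)
T = len(direct) - 1
expected = gamma ** T * potential(end, goal) - potential(start, goal)

print(f"direct: {sum_direct:.6f}  detour: {sum_detour:.6f}  expected: {expected:.6f}")
```

Because the total shaping bonus is the same for both paths, shaping of this form cannot make a detour look better than the direct route; it only changes how early the agent receives feedback.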


## Part 3: Hyperparameter Tuning


In [4]:
print("\n" + "=" * 60)
print("Part 3: Hyperparameter Tuning")
print("=" * 60)

# Hyperparameter grid
hyperparams_grid = {
    'learning_rate': [1e-4, 5e-4, 1e-3],
    'gamma': [0.95, 0.99, 0.999],
    'epsilon': [0.1, 0.2, 0.3],
    'batch_size': [32, 64, 128]
}

print("\nHyperparameter Grid Search:")
print(f"  Learning rates: {hyperparams_grid['learning_rate']}")
print(f"  Discount factors: {hyperparams_grid['gamma']}")
print(f"  Epsilon values: {hyperparams_grid['epsilon']}")
print(f"  Batch sizes: {hyperparams_grid['batch_size']}")

total_combinations = (len(hyperparams_grid['learning_rate']) * 
                     len(hyperparams_grid['gamma']) * 
                     len(hyperparams_grid['epsilon']) * 
                     len(hyperparams_grid['batch_size']))
print(f"  Total combinations: {total_combinations}")

print("\nCommon Hyperparameters to Tune:")
print("  - Learning rate (α)")
print("  - Discount factor (γ)")
print("  - Exploration rate (ε)")
print("  - Batch size")
print("  - Network architecture")
print("  - Replay buffer size")
print("  - Update frequency")

print("\nTuning Strategies:")
print("  - Grid search (exhaustive)")
print("  - Random search (more efficient)")
print("  - Bayesian optimization")
print("  - Population-based training")

print("\n✅ Hyperparameter tuning concepts covered!")


Part 3: Hyperparameter Tuning

Hyperparameter Grid Search:
  Learning rates: [0.0001, 0.0005, 0.001]
  Discount factors: [0.95, 0.99, 0.999]
  Epsilon values: [0.1, 0.2, 0.3]
  Batch sizes: [32, 64, 128]
  Total combinations: 81

Common Hyperparameters to Tune:
  - Learning rate (α)
  - Discount factor (γ)
  - Exploration rate (ε)
  - Batch size
  - Network architecture
  - Replay buffer size
  - Update frequency

Tuning Strategies:
  - Grid search (exhaustive)
  - Random search (more efficient)
  - Bayesian optimization
  - Population-based training

✅ Hyperparameter tuning concepts covered!
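
To make the difference between grid search and random search concrete, the sketch below enumerates the same grid with `itertools.product` and then evaluates only a fixed budget of randomly chosen combinations. The `evaluate` function is a hypothetical stand-in with an arbitrary scoring rule; in practice it would train an agent with the given configuration and return something like the mean episode return, ideally averaged over several seeds.

```python
import itertools
import random

hyperparams_grid = {
    'learning_rate': [1e-4, 5e-4, 1e-3],
    'gamma': [0.95, 0.99, 0.999],
    'epsilon': [0.1, 0.2, 0.3],
    'batch_size': [32, 64, 128],
}

def evaluate(config):
    """Hypothetical stand-in score; replace with a real training run's mean return."""
    return -abs(config['learning_rate'] - 5e-4) * 1000 - abs(config['gamma'] - 0.99)

names = list(hyperparams_grid)

# Grid search: every combination of the listed values
all_configs = [dict(zip(names, values))
               for values in itertools.product(*hyperparams_grid.values())]
best_grid = max(all_configs, key=evaluate)

# Random search: evaluate only a fixed budget of randomly drawn combinations
rng = random.Random(0)  # seeded so the draw is reproducible
budget = 10
sampled = rng.sample(all_configs, budget)
best_random = max(sampled, key=evaluate)

print(f"Grid search evaluated {len(all_configs)} configs")
print(f"Random search evaluated {len(sampled)} configs")
```

Random search here draws from the discrete grid to keep the comparison simple. A common variant instead samples continuous parameters such as the learning rate from a distribution (for example log-uniform), which explores more distinct values per parameter for the same budget.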


## Summary

### Key Techniques:
1. **Experience Replay**: Store and randomly sample past experiences
2. **Reward Shaping**: Add shaping rewards to guide learning
3. **Hyperparameter Tuning**: Optimize learning parameters

### Benefits:
- **Experience Replay**: Sample efficiency, stability, decorrelation
- **Reward Shaping**: Faster convergence, better guidance
- **Hyperparameter Tuning**: Better performance from the same algorithm (search finds good settings, not guaranteed optimal ones)

### Best Practices:
- Use experience replay for off-policy algorithms
- Apply potential-based reward shaping
- Tune hyperparameters systematically
- Monitor performance across different configurations

**Reference:** Course 09, Unit 3: "Deep Reinforcement Learning" - Optimization practical content