# Deep Reinforcement Learning: Algorithm Comparison

This notebook trains and evaluates one of four actor-critic algorithms for continuous control, selected in the configuration cell, so that their learning curves can be compared across runs:
- **DDPG** - Deep Deterministic Policy Gradient (original)
- **OurDDPG** - Re-tuned DDPG with 256-256 architecture
- **TD3** - Twin Delayed Deep Deterministic Policy Gradient
- **QRTD3** - Quantile Regression TD3

All implementations are imported from the Python files in this repository.


## Configuration

Set the environment and seed here. You can also choose which algorithm to run.


In [None]:
# Configuration - Set these at the beginning
ENV_NAME = "Reacher-v5"  # Options: Reacher-v5, Ant-v5, HalfCheetah-v5, Hopper-v5, Walker2d-v5, etc.
SEED = 0
POLICY = "TD3"  # Options: "TD3", "DDPG", "OurDDPG", "QRTD3"

# Training hyperparameters
MAX_TIMESTEPS = int(1e6)  # 1M timesteps
START_TIMESTEPS = int(25e3)  # Initial random exploration
EVAL_FREQ = int(5e3)  # Evaluation frequency
BATCH_SIZE = 256
DISCOUNT = 0.99
TAU = 0.005
POLICY_NOISE = 0.2
NOISE_CLIP = 0.5
POLICY_FREQ = 2
EXPL_NOISE = 0.1

print(f"Configuration:")
print(f"  Environment: {ENV_NAME}")
print(f"  Seed: {SEED}")
print(f"  Policy: {POLICY}")
print(f"  Max timesteps: {MAX_TIMESTEPS:,}")


Configuration:
  Environment: Reacher-v5
  Seed: 0
  Policy: TD3
  Max timesteps: 1,000,000


## Imports

Import the necessary libraries and algorithm implementations from the Python files.


In [None]:
import numpy as np
import torch
import gymnasium as gym
from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import os

# Import algorithm implementations
import utils
import TD3
import DDPG
import OurDDPG
import QR_TD3

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")


Using device: cpu
PyTorch version: 2.9.1


## Algorithm Overview

### DDPG (Deep Deterministic Policy Gradient)
- **Paper**: [Continuous control with deep reinforcement learning](https://arxiv.org/abs/1509.02971) (Lillicrap et al., 2015)
- Single critic network (400-300 architecture)
- Actor network (400-300 architecture)
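- Updates the policy at every training step
- Actor learning rate: 1e-4; the critic optimizer uses `weight_decay=1e-2`
- Tau: 0.001 (this notebook overrides it, because `TAU = 0.005` is passed to every algorithm)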

### OurDDPG (Re-tuned DDPG)
- Same as DDPG but with 256-256 architecture
- Learning rate: 3e-4
- Tau: 0.005
- Batch size: 256
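
The `Tau` values listed for both DDPG variants set how quickly the target networks track the learned networks. As a rough illustration, here is a minimal NumPy sketch of that soft (Polyak) update, assuming parameters are stored as plain arrays. `soft_update` is a hypothetical helper written for this example, not a function from this repository.

```python
import numpy as np

def soft_update(target_params, source_params, tau):
    """Move each target array a fraction tau toward its source array, in place."""
    for target, source in zip(target_params, source_params):
        target *= (1.0 - tau)
        target += tau * source

# Toy example: the target weights start at 0, the learned weights sit at 1.
target = [np.zeros(3)]
source = [np.ones(3)]
for _ in range(1000):
    soft_update(target, source, tau=0.005)

# After k updates the gap has shrunk by a factor of (1 - tau) ** k.
print(target[0])
```

A smaller tau (0.001) makes the targets change more slowly than the 0.005 used by OurDDPG and TD3.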

### TD3 (Twin Delayed Deep Deterministic Policy Gradient)
- **Paper**: [Addressing Function Approximation Error in Actor-Critic Methods](https://arxiv.org/abs/1802.09477) (Fujimoto et al., 2018)
- **Key improvements over DDPG**:
  1. **Clipped Double Q-Learning**: Uses two critic networks and takes the minimum of their target estimates to reduce overestimation bias
  2. **Delayed Policy Updates**: Updates the policy once every 2 critic updates
  3. **Target Policy Smoothing**: Adds clipped Gaussian noise to the target action when computing the critic target
- Twin critics (256-256 architecture each)
- Actor network (256-256 architecture)
- Learning rate: 3e-4
- Tau: 0.005
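
Two of the three TD3 improvements above act on the critic target. The following NumPy sketch shows how they could fit together, under stated assumptions: `td3_target` is a hypothetical function, the two target critics are passed in as plain callables, and `policy_noise` and `noise_clip` are taken as already scaled by `max_action`. It is not the code in `TD3.py`.

```python
import numpy as np

def td3_target(reward, not_done, next_action, q1_target, q2_target,
               discount=0.99, policy_noise=0.2, noise_clip=0.5,
               max_action=1.0, rng=None):
    """Compute a TD3-style critic target for a batch of transitions."""
    rng = np.random.default_rng() if rng is None else rng
    # Target policy smoothing: perturb the target action with clipped Gaussian noise.
    noise = np.clip(rng.normal(0.0, policy_noise, size=next_action.shape),
                    -noise_clip, noise_clip)
    smoothed = np.clip(next_action + noise, -max_action, max_action)
    # Clipped double Q-learning: keep the smaller of the two target estimates.
    target_q = np.minimum(q1_target(smoothed), q2_target(smoothed))
    # Bootstrap only when the episode did not terminate.
    return reward + not_done * discount * target_q

# Toy critics standing in for the target networks (assumptions for illustration).
q1 = lambda a: a.sum(axis=1, keepdims=True)
q2 = lambda a: a.sum(axis=1, keepdims=True) + 1.0

reward = np.array([[1.0], [2.0]])
not_done = np.array([[1.0], [0.0]])
next_action = np.array([[0.5, 0.25], [1.0, 1.0]])
print(td3_target(reward, not_done, next_action, q1, q2,
                 rng=np.random.default_rng(0)))
```

The second row has `not_done = 0`, so its target is just the reward regardless of what the critics predict.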

### QRTD3 (Quantile Regression TD3)
- Extension of TD3 using quantile regression for the critic
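- Trains the critics with a quantile regression loss instead of an MSE loss
- Each critic estimates quantiles of the return rather than a single Q-value, so it captures the spread of returns as well as their mean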
- Same architecture as TD3 but with quantile-based critics
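
`QR_TD3.py` is not shown in this notebook, so the exact loss it uses is not confirmed here. For orientation, this is a NumPy sketch of the quantile Huber loss commonly used for quantile regression critics (as in QR-DQN). The function name, argument shapes, and the choice of quantile midpoints are assumptions made for this example.

```python
import numpy as np

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile Huber loss.

    pred_quantiles: (batch, N) predicted quantile values.
    target_samples: (batch, M) target return samples or quantiles.
    """
    n = pred_quantiles.shape[1]
    # Quantile midpoints tau_i = (2i + 1) / (2N).
    taus = (np.arange(n) + 0.5) / n
    # Pairwise errors u[b, i, j] = target_j - prediction_i.
    u = target_samples[:, None, :] - pred_quantiles[:, :, None]
    abs_u = np.abs(u)
    huber = np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))
    # Asymmetric weight |tau_i - 1{u < 0}|: low quantiles punish overestimation more.
    weight = np.abs(taus[None, :, None] - (u < 0.0).astype(float))
    # Average over targets, sum over predicted quantiles, average over the batch.
    return (weight * huber / kappa).mean(axis=2).sum(axis=1).mean()

preds = np.array([[0.0, 0.0]])   # two predicted quantiles, both too low
targets = np.array([[1.0]])      # a single target sample
print(quantile_huber_loss(preds, targets))  # 0.5 for this toy input
```

Unlike MSE, the penalty depends on the sign of the error and on which quantile made it, which is what pushes each output toward a different part of the return distribution.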


## Evaluation Function

This function evaluates the policy over multiple episodes without exploration noise.


In [None]:
def eval_policy(policy, env_name, seed, eval_episodes=10):
    """Return the average episode reward over `eval_episodes` noise-free episodes."""
    eval_env = gym.make(env_name)
    avg_reward = 0.
    for episode in tqdm(range(eval_episodes), desc="Evaluating", leave=False):
        # Seed only the first reset. Reseeding on every reset would replay the
        # same initial state in all episodes.
        state = eval_env.reset(seed=seed + 100 if episode == 0 else None)[0]
        done = False
        while not done:
            action = policy.select_action(np.array(state))
            state, reward, terminated, truncated, _ = eval_env.step(action)
            done = terminated or truncated
            avg_reward += reward

    eval_env.close()
    avg_reward /= eval_episodes
    print(f"Evaluation over {eval_episodes} episodes: {avg_reward:.3f}")
    return avg_reward


## Training Setup

Initialize the environment and policy based on the configuration.


In [None]:
# Create environment
env = gym.make(ENV_NAME)

# Set seeds
env.action_space.seed(SEED)
torch.manual_seed(SEED)
np.random.seed(SEED)

# Get environment dimensions
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
max_action = float(env.action_space.high[0])

print(f"State dimension: {state_dim}")
print(f"Action dimension: {action_dim}")
print(f"Max action: {max_action}")

# Initialize policy
kwargs = {
    "state_dim": state_dim,
    "action_dim": action_dim,
    "max_action": max_action,
    "discount": DISCOUNT,
    "tau": TAU,
}

if POLICY == "TD3":
    kwargs["policy_noise"] = POLICY_NOISE * max_action
    kwargs["noise_clip"] = NOISE_CLIP * max_action
    kwargs["policy_freq"] = POLICY_FREQ
    policy = TD3.TD3(**kwargs)
elif POLICY == "QRTD3":
    kwargs["policy_noise"] = POLICY_NOISE * max_action
    kwargs["noise_clip"] = NOISE_CLIP * max_action
    kwargs["policy_freq"] = POLICY_FREQ
    policy = QR_TD3.QRTD3(**kwargs)
elif POLICY == "OurDDPG":
    policy = OurDDPG.DDPG(**kwargs)
elif POLICY == "DDPG":
    policy = DDPG.DDPG(**kwargs)
else:
    raise ValueError(f"Unknown policy: {POLICY}")

print(f"\nPolicy {POLICY} initialized successfully!")

# Initialize replay buffer
replay_buffer = utils.ReplayBuffer(state_dim, action_dim)
print(f"Replay buffer initialized")


State dimension: 10
Action dimension: 2
Max action: 1.0

Policy TD3 initialized successfully!
Replay buffer initialized


## Training Loop

The training loop follows the standard off-policy actor-critic procedure:
1. Collect experience by interacting with the environment
2. Store transitions in the replay buffer
3. Sample batches and update the policy
4. Periodically evaluate the policy without exploration noise
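
Steps 2 and 3 rely on a replay buffer. The notebook only shows that `utils.ReplayBuffer` is constructed from `(state_dim, action_dim)` and has an `add` method, so the class below is a hypothetical stand-in that sketches one common design: a fixed-size ring buffer with uniform sampling. The `sample` method, `max_size`, and `seed` arguments are assumptions.

```python
import numpy as np

class SimpleReplayBuffer:
    """Fixed-size ring buffer that overwrites the oldest transitions when full."""

    def __init__(self, state_dim, action_dim, max_size=int(1e6), seed=None):
        self.max_size = max_size
        self.ptr = 0    # index of the next slot to write
        self.size = 0   # number of valid transitions stored
        self.state = np.zeros((max_size, state_dim), dtype=np.float32)
        self.action = np.zeros((max_size, action_dim), dtype=np.float32)
        self.next_state = np.zeros((max_size, state_dim), dtype=np.float32)
        self.reward = np.zeros((max_size, 1), dtype=np.float32)
        self.not_done = np.zeros((max_size, 1), dtype=np.float32)
        self.rng = np.random.default_rng(seed)

    def add(self, state, action, next_state, reward, done):
        i = self.ptr
        self.state[i] = state
        self.action[i] = action
        self.next_state[i] = next_state
        self.reward[i] = reward
        self.not_done[i] = 1.0 - done
        self.ptr = (self.ptr + 1) % self.max_size
        self.size = min(self.size + 1, self.max_size)

    def sample(self, batch_size):
        """Draw a uniform random batch (with replacement) from the stored transitions."""
        idx = self.rng.integers(0, self.size, size=batch_size)
        return (self.state[idx], self.action[idx], self.next_state[idx],
                self.reward[idx], self.not_done[idx])

# Usage: a tiny buffer that wraps around after four transitions.
buf = SimpleReplayBuffer(state_dim=3, action_dim=2, max_size=4, seed=0)
for i in range(6):
    buf.add(np.full(3, i), np.zeros(2), np.full(3, i + 1), float(i), 0.0)
states, actions, next_states, rewards, not_dones = buf.sample(5)
print(buf.size, states.shape)
```

Because sampling is uniform over the whole buffer, each update mixes old and recent experience, which is what makes the procedure off-policy.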


In [None]:
# Evaluate untrained policy
print("Evaluating untrained policy...")
evaluations = [eval_policy(policy, ENV_NAME, SEED)]

# Initialize training variables
# Seed the training environment once; SEED + 100 is reserved for evaluation
state = env.reset(seed=SEED)[0]
done = False
episode_reward = 0
episode_timesteps = 0
episode_num = 0

start_time = time.time()

print(f"\nStarting training for {MAX_TIMESTEPS:,} timesteps...")
print(f"Random exploration for first {START_TIMESTEPS:,} steps")
print(f"Evaluation frequency: every {EVAL_FREQ:,} steps\n")

# Main training loop
for t in tqdm(range(int(MAX_TIMESTEPS)), desc="Training"):
    episode_timesteps += 1
    
    # Select action randomly or according to policy with exploration noise
    if t < START_TIMESTEPS:
        action = env.action_space.sample()
    else:
        action = (
            policy.select_action(np.array(state))
            + np.random.normal(0, max_action * EXPL_NOISE, size=action_dim)
        ).clip(-max_action, max_action)
    
    # Perform action in environment
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    # Store only true terminations: a time-limit truncation must not stop bootstrapping
    done_bool = float(terminated)
    
    # Store transition in replay buffer
    replay_buffer.add(state, action, next_state, reward, done_bool)
    
    state = next_state
    episode_reward += reward
    
    # Train agent after collecting sufficient data
    if t >= START_TIMESTEPS:
        policy.train(replay_buffer, BATCH_SIZE)
    
    if done:
        # Reset environment
        # Do not reseed here: passing the same seed would restart every episode
        # from the same initial state
        state = env.reset()[0]
        done = False
        episode_reward = 0
        episode_timesteps = 0
        episode_num += 1
    
    # Evaluate episode
    if (t + 1) % EVAL_FREQ == 0:
        eval_reward = eval_policy(policy, ENV_NAME, SEED)
        evaluations.append(eval_reward)

end_time = time.time()
duration = end_time - start_time
hours = int(duration // 3600)
minutes = int((duration % 3600) // 60)
seconds = int(duration % 60)

print("\n" + "="*70)
print("Training completed!")
print(f"Training time: {hours:02d}:{minutes:02d}:{seconds:02d}")
print(f"Final evaluation reward: {evaluations[-1]:.3f}")
print("="*70)


Evaluating untrained policy...
Evaluation over 10 episodes: -5.469

Starting training for 1,000,000 timesteps...
Random exploration for first 25,000 steps
Evaluation frequency: every 5,000 steps

Evaluation over 10 episodes: -5.469
Evaluation over 10 episodes: -5.469
Evaluation over 10 episodes: -5.469
Evaluation over 10 episodes: -5.469
Evaluation over 10 episodes: -5.469
Evaluation over 10 episodes: -0.914
Evaluation over 10 episodes: -0.955
Evaluation over 10 episodes: -0.970
Evaluation over 10 episodes: -3.703
Evaluation over 10 episodes: -1.222
Evaluation over 10 episodes: -1.011
Evaluation over 10 episodes: -0.660
Evaluation over 10 episodes: -0.677
Evaluation over 10 episodes: -0.829
Evaluation over 10 episodes: -0.832
Evaluation over 10 episodes: -0.963
Evaluation over 10 episodes: -0.965
Evaluation over 10 episodes: -1.041
Evaluation over 10 episodes: -0.916
Evaluation over 10 episodes: -0.954
Evaluation over 10 episodes: -0.938
Evaluation over 10 episodes: -0.889
Evaluation over 10 episodes: -0.934
Evaluation over 10 episodes: -0.988
Evaluation over 10 episodes: -0.921
Evaluation over 10 episodes: -0.915
Evaluation over 10 episodes: -0.920
Evaluation over 10 episodes: -0.915
Evaluation over 10 episodes: -0.927
Evaluation over 10 episodes: -0.915
Evaluation over 10 episodes: -0.920
Evaluation over 10 episodes: -0.871
Evaluation over 10 episodes: -0.897
Evaluation over 10 episodes: -0.950
Evaluation over 10 episodes: -0.957
Evaluation over 10 episodes: -0.927
Evaluation over 10 episodes: -0.865
Evaluation over 10 episodes: -0.895
Evaluation over 10 episodes: -0.911
Evaluation over 10 episodes: -0.905
Evaluation over 10 episodes: -0.939
Evaluation over 10 episodes: -0.987
Evaluation over 10 episodes: -0.974
Evaluation over 10 episodes: -0.972
Evaluation over 10 episodes: -0.890
Evaluation over 10 episodes: -0.956
Evaluation over 10 episodes: -0.937
Evaluation over 10 episodes: -0.925
Evaluation over 10 episodes: -0.935
Evaluation over 10 episodes: -0.932
Evaluation over 10 episodes: -0.924
Evaluation over 10 episodes: -0.949
Evaluation over 10 episodes: -0.951
Evaluation over 10 episodes: -0.926
Evaluation over 10 episodes: -0.982
Evaluation over 10 episodes: -0.916
Evaluation over 10 episodes: -0.945
Evaluation over 10 episodes: -0.926
Evaluation over 10 episodes: -0.924
Evaluation over 10 episodes: -0.912
Evaluation over 10 episodes: -0.951
Evaluation over 10 episodes: -0.913
Evaluation over 10 episodes: -0.940
Evaluation over 10 episodes: -0.979
Evaluation over 10 episodes: -0.946
Evaluation over 10 episodes: -0.914
Evaluation over 10 episodes: -0.918
Evaluation over 10 episodes: -0.983
Evaluation over 10 episodes: -0.915
Evaluation over 10 episodes: -0.948
Evaluation over 10 episodes: -0.925
Evaluation over 10 episodes: -0.915
Evaluation over 10 episodes: -0.934
Evaluation over 10 episodes: -0.924
Evaluation over 10 episodes: -0.940
Evaluation over 10 episodes: -0.921
Evaluation over 10 episodes: -0.903
Evaluation over 10 episodes: -0.918
Evaluation over 10 episodes: -0.923
Evaluation over 10 episodes: -0.951
Evaluation over 10 episodes: -0.953
Evaluation over 10 episodes: -0.928
Evaluation over 10 episodes: -0.848

Training:  42%|████▏     | 419073/1000000 [52:16<5:32:34, 29.11it/s]

## Results Visualization

Plot the learning curve: the average evaluation reward at each evaluation point, against the number of training timesteps.


In [None]:
# Plot learning curve
# One evaluation before training (timestep 0), then one every EVAL_FREQ steps
timesteps = np.arange(len(evaluations)) * EVAL_FREQ

plt.figure(figsize=(12, 6))
plt.plot(timesteps, evaluations, marker='o', linewidth=2, markersize=4, label=POLICY)
plt.xlabel('Timestep', fontsize=14)
plt.ylabel('Average Reward', fontsize=14)
plt.title(f'{POLICY} Learning Curve on {ENV_NAME} (Seed {SEED})', fontsize=16, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nPerformance Summary:")
print(f"  Initial reward: {evaluations[0]:.3f}")
print(f"  Final reward: {evaluations[-1]:.3f}")
print(f"  Improvement: {evaluations[-1] - evaluations[0]:.3f}")
print(f"  Total evaluations: {len(evaluations)}")


## Summary

This notebook demonstrated how to:
1. Import algorithm implementations from Python files
2. Configure environment and hyperparameters
3. Train an agent using the off-policy actor-critic framework
4. Visualize learning curves

To compare algorithms, change the `POLICY` variable in the configuration cell and re-run the setup and training cells. All four classes expose the same interface (`select_action`, `train`), so the rest of the notebook runs unchanged.
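
Once the `evaluations` list from each run has been saved, the curves can be summarized side by side. The sketch below assumes they are collected in a dictionary keyed by algorithm name. `summarize_curves` is a hypothetical helper, and the numbers in the example are toy placeholders, not measured results.

```python
def summarize_curves(curves, eval_freq, last_k=10):
    """Summarize evaluation curves given as {name: [reward at eval 0, 1, ...]}."""
    summary = {}
    for name, rewards in curves.items():
        tail = rewards[-last_k:]
        best_index = max(range(len(rewards)), key=rewards.__getitem__)
        summary[name] = {
            "final": rewards[-1],
            "best": rewards[best_index],
            # Evaluation i happens after i * eval_freq training steps.
            "best_timestep": best_index * eval_freq,
            # Averaging the last few evaluations is less noisy than the final value alone.
            "tail_mean": sum(tail) / len(tail),
        }
    return summary

# Toy placeholder curves (NOT real results), one evaluation every 5,000 steps.
toy_curves = {
    "AlgoA": [-5.0, -2.0, -1.0, -1.5],
    "AlgoB": [-5.0, -3.0, -2.0, -1.0],
}
for name, stats in summarize_curves(toy_curves, eval_freq=5000, last_k=2).items():
    print(name, stats)
```

Comparing a tail average over several seeds gives a fairer picture than a single final evaluation from one seed.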

### Key Differences Between Algorithms

| Algorithm | Critic Networks | Policy Updates | Special Features |
|-----------|----------------|----------------|------------------|
| DDPG | Single (400-300) | Every step | - |
| OurDDPG | Single (256-256) | Every step | Re-tuned hyperparameters |
| TD3 | Twin (256-256) | Every 2 steps | Clipped Double Q, Target Smoothing |
| QRTD3 | Twin Quantile (256-256) | Every 2 steps | Quantile Regression |

For more details, see the individual algorithm files: `DDPG.py`, `OurDDPG.py`, `TD3.py`, `QR_TD3.py`
