## 🚗 Deep Double Q-Learning on MountainCar-v0

This project implements a clean and minimal version of **Deep Double Q-Learning (Double DQN)** to solve the `MountainCar-v0` environment from OpenAI Gym. The goal is to teach an agent to drive a car up a steep hill using reinforcement learning.

---

### 🧠 Background: Q-Learning vs. Double Q-Learning

In **standard Q-learning**, the agent learns the value of state-action pairs $$Q(s, a)$$ using the Bellman update:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

However, this update tends to **overestimate** action values: the $$\max$$ operator uses the same Q-function to both select and evaluate the next action, so any upward noise in the estimates is picked up preferentially.
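
The overestimation effect can be seen without any neural network. The sketch below is a toy setup, not part of this project: every true action value is assumed to be 0, and the estimates are corrupted by Gaussian noise. The function name `estimate_bias` and its parameters are hypothetical.

```python
import random
import statistics

def estimate_bias(n_actions=3, noise_std=1.0, trials=20_000, seed=0):
    """Compare single and double estimators when every true action value is 0."""
    rng = random.Random(seed)
    single, double = [], []
    for _ in range(trials):
        # Two independent noisy estimates of the same (all-zero) action values.
        q_a = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        q_b = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        # Single estimator: select and evaluate with the same estimates.
        single.append(max(q_a))
        # Double estimator: select with q_a, evaluate with q_b.
        best = max(range(n_actions), key=q_a.__getitem__)
        double.append(q_b[best])
    return statistics.mean(single), statistics.mean(double)

single_bias, double_bias = estimate_bias()
print(f"single estimator bias: {single_bias:+.3f}")
print(f"double estimator bias: {double_bias:+.3f}")
```

Under these assumptions the single estimator should come out clearly positive even though the true maximum is 0, while the double estimator should stay near 0.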

---

### ❗ Double Q-Learning Fixes This

**Double Q-Learning** introduces two Q-networks:
- A **policy network** $$Q_\theta$$ to select actions.
- A **target network** $$Q_{\theta^-}$$ to evaluate the selected actions.

The target becomes (for a terminal transition the bootstrap term is dropped, so $$y = r$$):

$$
y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a'))
$$

The loss for a single transition is then the squared error against this target:

$$
\mathcal{L} = \left( Q_\theta(s, a) - y \right)^2
$$

This reduces overestimation by **decoupling** selection and evaluation.
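
As a worked example of the decoupling, the sketch below computes both targets for one transition using hand-picked Q-values. The numbers and the helper names `dqn_target` and `double_dqn_target` are illustrative assumptions. The "standard" variant here is the common DQN form in which the target network both selects and evaluates.

```python
GAMMA = 0.99

def dqn_target(reward, q_target_next, done, gamma=GAMMA):
    """Standard DQN: the target network both selects and evaluates."""
    bootstrap = 0.0 if done else max(q_target_next)
    return reward + gamma * bootstrap

def double_dqn_target(reward, q_policy_next, q_target_next, done, gamma=GAMMA):
    """Double DQN: the policy network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_policy_next)), key=q_policy_next.__getitem__)
    return reward + gamma * q_target_next[best_action]

# Hypothetical Q-values for the next state s' (one entry per action).
q_policy_next = [-40.0, -38.0, -39.0]   # Q_theta(s', .): action 1 looks best
q_target_next = [-37.0, -41.0, -36.0]   # Q_theta^-(s', .): action 2 looks best

y_dqn = dqn_target(-1.0, q_target_next, done=False)
y_ddqn = double_dqn_target(-1.0, q_policy_next, q_target_next, done=False)
print(f"DQN target:        {y_dqn:.2f}")
print(f"Double DQN target: {y_ddqn:.2f}")
```

Because the Double DQN target evaluates an action that is not necessarily the target network's own maximizer, it can never exceed the standard DQN target computed from the same values.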

---

### 📐 Environment: `MountainCar-v0`

- **State**: A continuous 2D vector $$(\text{position}, \text{velocity})$$
- **Action Space**: {0: Push Left, 1: No Push, 2: Push Right}
- **Reward**: -1 per step until the car reaches the goal at position $$\geq 0.5$$

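This environment is challenging due to:
- **Uninformative rewards** (every step costs -1 regardless of progress, so the agent gets no useful signal until it reaches the goal; Gym also cuts each episode off after 200 steps)
- **Delayed credit assignment** (the agent must first move backward to gain momentum)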
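
To illustrate why momentum matters, here is a small standalone re-implementation of the car's dynamics. The constants are written from memory of Gym's `MountainCar-v0` source and should be treated as assumptions, as should the fixed start at position -0.5 (Gym samples the start randomly). The two hand-written policies are hypothetical baselines, not part of this project.

```python
import math

# Assumed constants (recalled from Gym's MountainCar-v0; verify before relying on them).
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL_POS = -1.2, 0.6, 0.07, 0.5
MAX_STEPS = 200

def step(position, velocity, action):
    """Advance the car one step; action is 0 (left), 1 (no push), or 2 (right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # The left wall stops the car.
    return position, velocity

def rollout(policy, position=-0.5, velocity=0.0):
    """Return the number of steps needed to reach the goal, or None if cut off."""
    for t in range(1, MAX_STEPS + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POS:
            return t
    return None

def always_right(position, velocity):
    return 2

def follow_velocity(position, velocity):
    # Push in the direction the car is already moving to build momentum.
    return 2 if velocity >= 0 else 0

steps_right = rollout(always_right)
steps_momentum = rollout(follow_velocity)
print("always push right:", steps_right)
print("push along velocity:", steps_momentum)
```

Under these assumed dynamics, pushing right on every step never reaches the flag within the step limit, while swinging back and forth does.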

---

### 🧮 Q-Network Architecture

Each Q-network is a simple MLP:

$$
\begin{aligned}
\text{Input: } & s = [\text{position}, \text{velocity}] \in \mathbb{R}^2 \\
\text{Network: } & \text{Linear}(2 \to 128) \rightarrow \text{ReLU} \rightarrow \text{Linear}(128 \to 3)
\end{aligned}
$$

The output is a vector $$Q(s, \cdot) \in \mathbb{R}^3$$ with one value per action.
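
To make the shapes concrete, the sketch below counts the parameters of this architecture and runs one forward pass in plain Python. It deliberately avoids PyTorch so the arithmetic is visible. The helper names and the uniform weight initialization are illustrative assumptions, not what `nn.Linear` does internally.

```python
import random

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def init_linear(n_in, n_out, rng):
    """Create (weights, biases) with small random weights (illustrative init)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

state_dim, hidden, n_actions = 2, 128, 3
total_params = linear_params(state_dim, hidden) + linear_params(hidden, n_actions)

rng = random.Random(0)
layer1 = init_linear(state_dim, hidden, rng)
layer2 = init_linear(hidden, n_actions, rng)

state = [-0.5, 0.0]  # [position, velocity]
q_values = linear(relu(linear(state, layer1)), layer2)

print("parameters:", total_params)          # (2*128 + 128) + (128*3 + 3)
print("Q-values per action:", len(q_values))
```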

---

### 🔁 Training Loop Summary

1. Initialize environment and agent.
2. At each step:
   - Choose action using **epsilon-greedy** strategy.
   - Store the transition $$(s, a, r, s', \text{done})$$ in a **replay buffer**.
3. Once the replay buffer holds at least one batch of transitions, sample a batch after every step and compute the target:
   - Use `policy_net` to select the next action.
   - Use `target_net` to evaluate the value of that action.
4. Compute the loss and update the `policy_net`.
5. Periodically update `target_net` using `policy_net` weights.
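
Steps 2 and 3 rely on two small building blocks that can be sketched independently of the networks. The function `select_action` and the class `ReplayBuffer` below are hypothetical helpers written for illustration. The code cell in this notebook inlines the same logic instead.

```python
import random
from collections import deque

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

class ReplayBuffer:
    """Fixed-size FIFO store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)  # Oldest entries are evicted first.

    def push(self, state, action, reward, next_state, done):
        self.data.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(state=t, action=0, reward=-1.0, next_state=t + 1, done=False)

batch = buffer.sample(2)
greedy = select_action([0.1, 0.9, 0.3], epsilon=0.0)
print(len(buffer), [transition[0] for transition in buffer.data], greedy)
```

With a capacity of 3, only the three most recent transitions survive the five pushes, which is the same eviction behavior as `deque(maxlen=memory_size)`.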

---

### 🔧 Core Components

| Component         | Description |
|------------------|-------------|
| `policy_net`      | Learns the Q-function. Used for action selection. |
| `target_net`      | Provides stable Q-targets. Updated slowly. |
| Replay Buffer     | Stores experience tuples for training. |
| Epsilon-Greedy    | Balances exploration (random) and exploitation (greedy). |
| Target Update     | Copies `policy_net` weights into `target_net` every `target_update_freq` episodes. |
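
It is worth checking how far the exploration rate actually decays. The sketch below uses the hyperparameter values from the code cell in this notebook (initial epsilon 1.0, decay 0.995 per episode, floor 0.01, 500 episodes). The variable names `eps_after_training` and `episodes_to_floor` are introduced here for illustration only.

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
episodes = 500

# Epsilon at the end of training under multiplicative per-episode decay.
eps_after_training = max(epsilon * epsilon_decay ** episodes, epsilon_min)

# Number of episodes needed before the floor epsilon_min takes effect.
eps, episodes_to_floor = epsilon, 0
while eps > epsilon_min:
    eps *= epsilon_decay
    episodes_to_floor += 1

print(f"epsilon after {episodes} episodes: {eps_after_training:.4f}")
print(f"episodes until the floor is reached: {episodes_to_floor}")
```

With these settings the floor is not reached within the 500 training episodes, so the agent still takes a random action on roughly 8% of its steps at the end of training.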

---

### 📉 Loss Function

For a batch of transitions:

$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( Q_\theta(s_i, a_i) - \left[ r_i + \gamma Q_{\theta^-}(s'_i, \arg\max_{a'} Q_\theta(s'_i, a')) \right] \right)^2
$$
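
A small numeric example of this batch loss, written in plain Python. The two transitions and their Q-values are made up for illustration, and the helper `double_dqn_loss` is hypothetical. It also applies the `(1 - done)` mask that the code cell uses to drop the bootstrap term on terminal transitions.

```python
def double_dqn_loss(q_taken, rewards, dones, q_target_selected, gamma=0.99):
    """Mean squared error between Q_theta(s_i, a_i) and the Double DQN targets.

    q_target_selected[i] is Q_theta^-(s'_i, argmax_a' Q_theta(s'_i, a')),
    i.e. the target network's value for the action chosen by the policy network.
    """
    targets = [
        r + gamma * q_next * (1.0 - d)
        for r, q_next, d in zip(rewards, q_target_selected, dones)
    ]
    squared_errors = [(q - y) ** 2 for q, y in zip(q_taken, targets)]
    return sum(squared_errors) / len(squared_errors), targets

# Two hypothetical transitions: the second one is terminal.
q_taken = [-10.0, -5.0]
rewards = [-1.0, -1.0]
dones = [0.0, 1.0]
q_target_selected = [-10.0, -7.0]

loss, targets = double_dqn_loss(q_taken, rewards, dones, q_target_selected)
print("targets:", targets)
print(f"loss: {loss:.3f}")
```

The first target is $$-1 + 0.99 \cdot (-10) = -10.9$$ and the second is just the reward $$-1$$, giving squared errors of $$0.81$$ and $$16$$ and a mean of $$8.405$$.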

---


In [1]:
# ------------------------------------------------------------
# Deep Double Q-Learning for OpenAI Gym's MountainCar-v0
# ------------------------------------------------------------
# This is a self-contained, fully annotated implementation that:
#   - Uses two neural networks for Double DQN (policy and target)
#   - Trains using experience replay
#   - Applies epsilon-greedy exploration
#   - Reduces overestimation by decoupling action selection & evaluation
# ------------------------------------------------------------

import gym
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import matplotlib.pyplot as plt

# -------------------------------
# SETUP
# -------------------------------
# Fix seeds for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

# Use GPU if available for faster training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

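# Initialize environment (this script targets the classic Gym API, gym < 0.26)
env = gym.make('MountainCar-v0')
env.seed(seed)               # Seed the environment's initial states
env.action_space.seed(seed)  # Seed env.action_space.sample(), used for exploration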
state_dim = env.observation_space.shape[0]  # 2D state: [position, velocity]
n_actions = env.action_space.n              # 3 possible actions: left, neutral, right

# -------------------------------
# Q-NETWORK DEFINITION
# -------------------------------
class QNetwork(nn.Module):
    """
    A simple feedforward neural network for approximating Q-values.
    Given a state, it outputs Q-values for all possible actions.
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),  # Input layer: 2 -> 128
            nn.ReLU(),
            nn.Linear(128, n_actions)   # Output layer: 128 -> 3 (Q-values)
        )

    def forward(self, x):
        return self.net(x)

# -------------------------------
# HYPERPARAMETERS
# -------------------------------
episodes = 500               # Total training episodes
gamma = 0.99                 # Discount factor
epsilon = 1.0                # Initial exploration rate
epsilon_min = 0.01           # Minimum epsilon after decay
epsilon_decay = 0.995        # Decay rate per episode
lr = 1e-3                    # Learning rate
batch_size = 64              # Mini-batch size for updates
memory_size = 10_000         # Max replay buffer size
target_update_freq = 10      # Update target network every N episodes

# -------------------------------
# MEMORY (Replay Buffer)
# -------------------------------
# Stores experience tuples: (state, action, reward, next_state, done)
replay_buffer = deque(maxlen=memory_size)

# -------------------------------
# INITIALIZE NETWORKS
# -------------------------------
policy_net = QNetwork().to(device)         # Main Q-network
target_net = QNetwork().to(device)         # Target network
target_net.load_state_dict(policy_net.state_dict())  # Copy weights initially
target_net.eval()  # Inference mode; target_net is only updated by copying policy_net weights

# Optimizer and loss function
optimizer = optim.Adam(policy_net.parameters(), lr=lr)
loss_fn = nn.MSELoss()

# For tracking episode rewards
reward_log = []

# -------------------------------
# TRAINING LOOP
# -------------------------------
for ep in range(episodes):
    state = env.reset()
    total_reward = 0
    done = False

    while not done:
        # Convert state to tensor
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(device)

        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            with torch.no_grad():
                q_vals = policy_net(state_tensor)
                action = q_vals.argmax().item()  # Exploit

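        # Take action in the environment (classic Gym API: 4-tuple return)
        next_state, reward, done, info = env.step(action)
        total_reward += reward

        # A time-limit cutoff ends the episode but is not a true terminal state,
        # so the target should still bootstrap from next_state in that case
        terminal = done and not info.get("TimeLimit.truncated", False)

        # Store experience in replay buffer
        replay_buffer.append((state, action, reward, next_state, terminal))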
        state = next_state  # Move to next state

        # Start training when we have enough samples
        if len(replay_buffer) >= batch_size:
            # Sample a random minibatch
            batch = random.sample(replay_buffer, batch_size)
            s, a, r, s2, d = zip(*batch)

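            # Convert to tensors (stack into NumPy arrays first; building a
            # tensor directly from a tuple of arrays is slow)
            s = torch.as_tensor(np.array(s), dtype=torch.float32, device=device)
            a = torch.as_tensor(np.array(a), dtype=torch.int64, device=device).unsqueeze(1)
            r = torch.as_tensor(np.array(r), dtype=torch.float32, device=device).unsqueeze(1)
            s2 = torch.as_tensor(np.array(s2), dtype=torch.float32, device=device)
            d = torch.as_tensor(np.array(d), dtype=torch.float32, device=device).unsqueeze(1)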

            # -------------------------------
            # DOUBLE DQN TARGET CALCULATION
            # -------------------------------
            with torch.no_grad():
                # Action selection: use policy_net to get best actions
                best_actions = policy_net(s2).argmax(dim=1, keepdim=True)
                # Action evaluation: use target_net to evaluate chosen actions
                target_q = target_net(s2).gather(1, best_actions)
                # Compute target: r + γ * Q_target(s', a*), with the bootstrap
                # term zeroed for terminal transitions via (1 - d)
                y = r + gamma * target_q * (1 - d)

            # Current Q-values for taken actions
            q = policy_net(s).gather(1, a)

            # Compute MSE loss and backpropagate
            loss = loss_fn(q, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Track rewards for visualization
    reward_log.append(total_reward)

    # Decay epsilon for less exploration over time
    epsilon = max(epsilon * epsilon_decay, epsilon_min)

    # Sync target network every few episodes
    if (ep + 1) % target_update_freq == 0:
        target_net.load_state_dict(policy_net.state_dict())

    # Log training progress every 10 episodes
    if (ep + 1) % 10 == 0:
        avg = np.mean(reward_log[-10:])
        print(f"Episode {ep+1}, Avg Reward: {avg:.2f}, Epsilon: {epsilon:.3f}")

# -------------------------------
# VISUALIZE RESULTS
# -------------------------------
plt.plot(reward_log)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("Deep Double Q-Learning on MountainCar-v0")
plt.grid(True)
plt.show()

env.close()
