# Day 58: Deep Q-Networks (DQN)

## What is DQN?

Deep Q-Networks (DQN) combine Q-learning with deep neural networks. A Q-table needs one entry per state-action pair, so it does not scale to large or continuous state spaces. DQN instead uses a neural network to approximate the Q-function.
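
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It assumes CartPole-v1 (4 observation components, 2 actions) and a hypothetical discretization of each component into a fixed number of bins; the helper names `q_table_entries` and `mlp_parameters` are made up for illustration. The layer sizes match the 4-24-24-2 network defined in this notebook's code.

```python
def q_table_entries(bins_per_dim, state_dims, num_actions):
    """Entries in a Q-table over a state space discretized into bins."""
    return (bins_per_dim ** state_dims) * num_actions


def mlp_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


STATE_DIMS, NUM_ACTIONS = 4, 2  # CartPole-v1

print(q_table_entries(10, STATE_DIMS, NUM_ACTIONS))    # 20000
print(q_table_entries(100, STATE_DIMS, NUM_ACTIONS))   # 200000000
print(mlp_parameters([STATE_DIMS, 24, 24, NUM_ACTIONS]))  # 770
```

The table grows exponentially with the number of state dimensions, while the network's parameter count stays fixed regardless of how finely the state is resolved.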

## Key Concepts:

- **Q-function approximation**: The neural network takes the state as input and outputs Q-values for all possible actions.
- **Experience Replay**: Stores past experiences (state, action, reward, next state, done flag) in a buffer and samples random mini-batches from it, which breaks the correlation between consecutive transitions and improves learning stability.
- **Target Network**: A separate network used to generate target Q-values, updated periodically to stabilize training.
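
Experience replay can be sketched with the standard library alone. The `ReplayBuffer` class below is a hypothetical wrapper, not part of any library; it assumes a fixed capacity with oldest-first eviction and uniform random sampling.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation
        batch = random.sample(self.buffer, batch_size)
        # Transpose a list of transitions into per-field tuples
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


demo = ReplayBuffer(capacity=3)
for t in range(5):
    demo.push(t, 0, 1.0, t + 1, False)  # integers stand in for real states
print(len(demo))  # 3: the two oldest transitions were evicted
```

Sampling is only meaningful once the buffer holds at least `batch_size` transitions, which is why training code typically checks the buffer length first.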

## Components:

1. **Replay Buffer**: Stores experiences.
2. **Policy Network**: Approximates the Q-function.
3. **Target Network**: Helps compute stable target Q-values.
4. **Loss Function**: Mean Squared Error between predicted Q-values and target Q-values.
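
The relationship between the policy network and the target network can be illustrated without any deep learning library. This toy sketch uses a plain dictionary as a stand-in for network weights and a hypothetical `SYNC_EVERY` interval; in PyTorch the copy step would be `target_net.load_state_dict(policy_net.state_dict())`.

```python
import copy

policy_weights = {"w": 0.0}                     # stand-in for policy parameters
target_weights = copy.deepcopy(policy_weights)  # target starts as an exact copy
SYNC_EVERY = 3                                  # hypothetical sync interval

history = []
for step in range(1, 8):
    policy_weights["w"] += 1.0                  # pretend gradient update
    if step % SYNC_EVERY == 0:
        # Hard update: overwrite the target with the current policy weights
        target_weights = copy.deepcopy(policy_weights)
    history.append((step, policy_weights["w"], target_weights["w"]))

for row in history:
    print(row)
```

The target weights stay frozen between syncs (0.0 for steps 1 and 2, 3.0 for steps 3 to 5, 6.0 afterwards), so the regression targets do not move on every gradient step.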

## Bellman Equation in DQN:

\[ Q(s,a) = r + \gamma \cdot \max_{a'} Q_{\text{target}}(s',a') \]
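
As a small worked example of the target above, the sketch below computes targets for a two-transition batch in plain Python and then the mean squared error against some predicted Q-values. All numbers are made up for illustration, and the helper names `td_targets` and `mse` are hypothetical. It also assumes the usual convention that terminal transitions get no bootstrap term.

```python
GAMMA = 0.95


def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal steps."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


def mse(predicted, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(targets)


rewards = [1.0, 1.0]
next_q_values = [[0.5, 2.0], [3.0, 1.0]]  # target-network outputs for s'
dones = [False, True]                     # second transition ends the episode

targets = td_targets(rewards, next_q_values, dones)
predicted = [2.5, 1.5]                    # policy-network Q(s, a) for taken actions
loss = mse(predicted, targets)
print(targets, loss)
```

The first target is roughly 1 + 0.95 * 2.0 = 2.9, the second is just the reward 1.0 because the episode ended, and the loss is roughly (0.4² + 0.5²) / 2 = 0.205.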

## Applications:

- Video games (e.g., Atari)
- Robotics
- Any domain with a large or continuous state space and a discrete action space


In [1]:
pip install gym




In [4]:
!pip install gym[box2d]  # For Box2D-based environments (optional)
!pip install gym[classic_control]  # For CartPole

Collecting box2d-py==2.3.5 (from gym[box2d])
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
  Preparing metadata (setup.py) ... done
Collecting pygame==2.1.0 (from gym[box2d])
  Downloading pygame-2.1.0.tar.gz (5.8 MB)
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-

In [13]:
import gym
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque

# Neural Network for Q-value approximation
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.out = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.out(x)

# Hyperparameters
EPISODES = 500
GAMMA = 0.95
EPSILON = 1.0
EPSILON_MIN = 0.01
EPSILON_DECAY = 0.995
LR = 0.001
BATCH_SIZE = 32
MEMORY_SIZE = 10000
TARGET_UPDATE_FREQ = 10

# Environment
env = gym.make("CartPole-v1", disable_env_checker=True)
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# DQN and Target Networks
policy_net = DQN(state_size, action_size)
target_net = DQN(state_size, action_size)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.Adam(policy_net.parameters(), lr=LR)
memory = deque(maxlen=MEMORY_SIZE)

# Choose action using epsilon-greedy
def choose_action(state, epsilon):
    if random.random() <= epsilon:
        return random.randrange(action_size)
    state = torch.FloatTensor(state).unsqueeze(0)
    with torch.no_grad():
        return torch.argmax(policy_net(state)).item()

# Replay and train
def replay():
    if len(memory) < BATCH_SIZE:
        return
    batch = random.sample(memory, BATCH_SIZE)
    states, actions, rewards, next_states, dones = zip(*batch)

    # Stack into NumPy arrays first; building a tensor from a tuple of arrays is slow
    states = torch.FloatTensor(np.array(states))
    actions = torch.LongTensor(actions).unsqueeze(1)
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(np.array(next_states))
    # Convert dones to a boolean tensor (entries may be Python or NumPy bools)
    dones = torch.BoolTensor([bool(d) for d in dones])

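    # Q(s, a) for the actions actually taken, from the policy network
    q_values = policy_net(states).gather(1, actions).squeeze(1)
    # max_a' Q_target(s', a') from the target network
    next_q_values = target_net(next_states).max(1)[0]
    # ~dones zeroes the bootstrap term for terminal transitions
    target_q = rewards + (GAMMA * next_q_values * ~dones)

    # detach() keeps gradients from flowing into the target network
    loss = nn.MSELoss()(q_values, target_q.detach())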
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

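# Training loop (uses the pre-0.26 gym API: reset() returns only the observation,
# and step() returns (obs, reward, done, info))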
for episode in range(EPISODES):
    state = env.reset()
    done = False
    total_reward = 0
    while not done:
        action = choose_action(state, EPSILON)
        next_state, reward, done, _ = env.step(action)
        memory.append((state, action, reward, next_state, done))
        state = next_state
        total_reward += reward
        replay()

    EPSILON = max(EPSILON_MIN, EPSILON * EPSILON_DECAY)

    if episode % TARGET_UPDATE_FREQ == 0:
        target_net.load_state_dict(policy_net.state_dict())

    print(f"Episode {episode}, Total Reward: {total_reward}, Epsilon: {EPSILON:.3f}")

env.close()

Episode 0, Total Reward: 20.0, Epsilon: 0.995
Episode 1, Total Reward: 49.0, Epsilon: 0.990
Episode 2, Total Reward: 13.0, Epsilon: 0.985
Episode 3, Total Reward: 26.0, Epsilon: 0.980
Episode 4, Total Reward: 38.0, Epsilon: 0.975
Episode 5, Total Reward: 16.0, Epsilon: 0.970
Episode 6, Total Reward: 11.0, Epsilon: 0.966
Episode 7, Total Reward: 13.0, Epsilon: 0.961
Episode 8, Total Reward: 24.0, Epsilon: 0.956
Episode 9, Total Reward: 16.0, Epsilon: 0.951
Episode 10, Total Reward: 13.0, Epsilon: 0.946
Episode 11, Total Reward: 29.0, Epsilon: 0.942
Episode 12, Total Reward: 14.0, Epsilon: 0.937
Episode 13, Total Reward: 23.0, Epsilon: 0.932
Episode 14, Total Reward: 12.0, Epsilon: 0.928
Episode 15, Total Reward: 14.0, Epsilon: 0.923
Episode 16, Total Reward: 18.0, Epsilon: 0.918
Episode 17, Total Reward: 35.0, Epsilon: 0.914
Episode 18, Total Reward: 18.0, Epsilon: 0.909
Episode 19, Total Reward: 10.0, Epsilon: 0.905
Episode 20, Total Reward: 16.0, Epsilon: 0.900
Episode 21, Total Rewar