# Reinforcement Learning Assignment  
**Points:** 120

---

## **Deep Q-Network (DQN) on CartPole v1**  
**Homework 4**  
**Deep Learning Course**  
**Instructor:** Dr. Beigy  
**Term:** Fall 2024  

---

### **Student Information**  
- **Full Name:** _[Your Full Name]_  
- **Student ID (SID):** _[Your SID]_  

---


# Implementing Deep Q-Network (DQN) on CartPole v1  

---

In this notebook, we are going to implement **Deep Q-Network (DQN)** on the [CartPole v1 environment](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) using the **PyTorch** framework.  

We will:  

1. **Understand the Problem**:
   - Explore the CartPole v1 environment provided by OpenAI Gym.
   - Understand the state space, the action space, and the reward structure of the environment.

2. **Implement the DQN Algorithm**:
   - Define the neural network architecture to approximate the Q-value function.
   - Implement the experience replay buffer to store and sample transitions.
   - Use the ε-greedy policy for balancing exploration and exploitation.
   - Implement the target network to stabilize training.

3. **Train the Agent**:
   - Define the training loop where the agent interacts with the environment.
   - Update the Q-network based on the Bellman equation.
   - Periodically update the target network.

4. **Evaluate Performance**:
   - Track and visualize the agent's performance during training.
   - Analyze the rewards and stability of the learned policy.

---
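
For reference, the Bellman-based update mentioned in step 3 is commonly written as below. The notation is introduced here and is not taken from the notebook: $\theta$ denotes the policy-network weights, $\theta^-$ the target-network weights, $d \in \{0, 1\}$ the `done` flag, and $B$ a minibatch sampled from the replay buffer. The squared error is one common choice of loss; the Huber loss is a frequent alternative.

$$
y = r + \gamma \,(1 - d)\, \max_{a'} Q_{\theta^-}(s', a')
$$

$$
L(\theta) = \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} \bigl( Q_{\theta}(s, a) - y \bigr)^2
$$

As a quick check with $\gamma = 0.95$ and CartPole's per-step reward $r = 1$: if the target network's best next-state value is $10$, a non-terminal transition gives $y = 1 + 0.95 \cdot 10 = 10.5$, while a terminal transition ($d = 1$) gives $y = 1$.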

By the end of this notebook, you should understand how to implement DQN in practice and how to apply reinforcement learning to control tasks.  


In [28]:
!pip install gym


In [29]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gym
from collections import deque
import random
import time

In [30]:
# Network Architecture (15 points)
# Define the DQN network architecture
class DQN(nn.Module):
    def __init__(self, input_size, output_size):
        super(DQN, self).__init__()

        # TODO: Define the network architecture
        pass

    def forward(self, x):
        # TODO: Implement the forward pass
        pass
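
As a point of comparison for the architecture TODO, here is a minimal sketch of one possible Q-network. It is not the expected solution: the class name `TinyQNetwork`, the two hidden layers, and the width of 64 are all assumptions made for illustration. It also shows the usual way to copy weights from a policy network into a target network.

```python
import torch
import torch.nn as nn


class TinyQNetwork(nn.Module):
    """Hypothetical example: maps a state vector to one Q-value per action."""

    def __init__(self, input_size: int, output_size: int, hidden_size: int = 64):
        super().__init__()
        # Two hidden layers with ReLU; the width of 64 is an arbitrary assumption.
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size),  # raw Q-values, no activation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


# CartPole-v1 has a 4-dimensional observation and 2 discrete actions.
policy_net = TinyQNetwork(input_size=4, output_size=2)
target_net = TinyQNetwork(input_size=4, output_size=2)

# Hard update: overwrite the target weights with the policy weights.
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

batch = torch.zeros(8, 4)  # a dummy batch of 8 states
with torch.no_grad():
    q_values = policy_net(batch)  # one row of Q-values per state
    same = torch.equal(q_values, target_net(batch))
```

The output layer has no activation because Q-values are unbounded return estimates, not probabilities.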

In [31]:
# DQNAgent Initialization (15 points)
# Define the DQN Agent
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Hyperparameters
        # TODO: Adjust these hyperparameters if needed
        self.gamma = 0.95    # discount factor
        self.epsilon = 1.0   # initial exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.batch_size = 64

        # TODO: Initialize the replay memory

        # TODO: Create policy_net and target_net using DQN, and move them to self.device

        # TODO: Copy policy_net weights to target_net

        # TODO: Create optimizer (e.g., Adam) for policy_net parameters
        pass

    # Memory Handling (10 points)
    def remember(self, state, action, reward, next_state, done):
        # TODO: Store the experience (state, action, reward, next_state, done) in the replay memory
        pass

    # Action Selection (10 points)
    def act(self, state, evaluate=False):
        # TODO: Implement epsilon-greedy action selection:
        # If not in evaluate mode and random < epsilon, choose a random action
        # Otherwise, choose the best action from the policy network
        pass

    # Experience Replay and Training Step (20 points)
    def replay(self):
        # TODO: Sample a minibatch from replay memory
        # Compute target Q-values using target_net
        # Compute predicted Q-values using policy_net
        # Compute the loss and perform a gradient update step
        # Decrease epsilon if above epsilon_min
        pass

    # Target Network Updates (5 points)
    def update_target_network(self):
        # TODO: Update target_net parameters with policy_net parameters
        pass
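
The replay-memory and action-selection TODOs above can be illustrated without PyTorch. The sketch below is one way to do it under stated assumptions: `ReplayMemory`, `epsilon_greedy`, and `decay_epsilon` are hypothetical names, Q-values are passed in as a plain list, and the decay constants simply mirror the notebook's defaults.

```python
import random
from collections import deque


class ReplayMemory:
    """Hypothetical fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Uniform sampling without replacement breaks the temporal correlation
        # between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        # Transpose a list of transitions into tuples of states, actions, ...
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon: float, evaluate: bool = False) -> int:
    """Pick a random action with probability epsilon, else the argmax action."""
    if not evaluate and random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon: float, epsilon_min: float = 0.01, decay: float = 0.995) -> float:
    """Multiplicative decay, floored at epsilon_min."""
    return max(epsilon_min, epsilon * decay)


memory = ReplayMemory(capacity=3)
for t in range(5):  # push 5 transitions into a buffer of size 3
    memory.push([float(t)], t % 2, 1.0, [float(t + 1)], False)

states, actions, rewards, next_states, dones = memory.sample(2)
greedy_action = epsilon_greedy([0.1, 0.7], epsilon=1.0, evaluate=True)
```

One design point worth checking in your own implementation: with a decay factor of 0.995 applied on every `replay()` call, epsilon falls from 1.0 to the 0.01 floor after roughly 920 calls (since ln 0.01 / ln 0.995 ≈ 919). Decaying once per episode instead keeps exploration going much longer.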

In [31]:
# Training Loop (20 points)
def train_cartpole():
    # Create the environment
    env = gym.make('CartPole-v1')
    state_size = env.observation_space.shape[0]
    action_size = env.action_space.n

    # TODO: Create an instance of DQNAgent

    episodes = 500
    target_update_frequency = 10
    scores = []

    for episode in range(episodes):
        # TODO: Reset the environment

        score = 0
        done = False

        while not done:
            # TODO: Use agent.act to get an action

            # TODO: Take the action in the environment, observe next_state, reward, and done

            # TODO: Store the experience in the agent's memory (agent.remember)

            # TODO: Call agent.replay() to update the policy_net

            # TODO: Update state and accumulate reward
            break  # placeholder so the empty loop body parses; remove once the loop is implemented

        # TODO: Update target network periodically

        scores.append(score)
        mean_score = np.mean(scores[-100:])

        print(f"Episode: {episode + 1}, Score: {score}, Average Score: {mean_score:.2f}, Epsilon: {0.0}")  # TODO: Print agent.epsilon instead of 0.0

        # TODO: Early stopping condition if solved

    env.close()
    # return agent, scores
    pass
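
How the reset and step TODOs are written depends on the installed Gym version. Older `gym` releases return just the observation from `reset()` and a 4-tuple `(obs, reward, done, info)` from `step()`. `gym` 0.26 and later, as well as Gymnasium, return `(obs, info)` from `reset()` and a 5-tuple `(obs, reward, terminated, truncated, info)` from `step()`. The sketch below shows one way to handle both. The helper names `reset_env` and `step_env` are hypothetical, and `StubEnv` is a made-up stand-in so the example runs without Gym installed.

```python
def reset_env(env):
    """Return only the observation, whichever reset() signature the env uses."""
    result = env.reset()
    # Newer API: (observation, info). Older API: observation only.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result


def step_env(env, action):
    """Return (next_state, reward, done) for both 4-tuple and 5-tuple step() results."""
    result = env.step(action)
    if len(result) == 5:
        next_state, reward, terminated, truncated, _info = result
        return next_state, reward, terminated or truncated
    next_state, reward, done, _info = result
    return next_state, reward, done


class StubEnv:
    """Hypothetical stand-in using the 5-tuple API; the episode ends after 3 steps."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= 3
        return [float(self.t)] * 4, 1.0, terminated, False, {}


env = StubEnv()
state = reset_env(env)
score, done = 0.0, False
while not done:
    action = 0  # a fixed action; an agent would choose one here
    next_state, reward, done = step_env(env, action)
    state = next_state
    score += reward
```

Ending the episode loop on either `terminated` or `truncated` is the usual choice. Whether a truncated transition should also zero out the bootstrap term in the TD target is a separate design decision. In the newer API, rendering is selected through the `render_mode` argument of `gym.make` and not through a later `env.render(mode=...)` call.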

In [32]:
# Visualization (5 points)
def visualize_agent(agent, num_episodes=5):
    """
    Visualize the trained agent in the environment.
    """
    # TODO: Implement evaluation loop without exploration (evaluate=True).
    # For each episode:
    # 1. Reset environment
    # 2. Render and step through with the agent's policy actions
    # 3. Print the score at the end of each episode

    pass

In [33]:
print("Training the agent...")
# TODO: Train the agent by calling train_cartpole and store the trained agent
# agent, scores = train_cartpole()

print("\nStarting visualization...")
# TODO: Visualize the trained agent by calling visualize_agent(agent)

Training the agent...

Starting visualization...


### ❓ **Question**

#### *(20 points)*  
> **Why is it important to maintain both a policy network and a target network in Deep Q-Network training, and how does the use of a target network help stabilize learning?**  
