# **Dueling DQN**

#### **This notebook is based on [Nimish Sanghi's book on Deep RL](https://link.springer.com/book/10.1007/979-8-8688-0273-7).**

This notebook builds on the basic Deep Q-Networks (DQNs) notebook: https://www.kaggle.com/code/aryamanbansal/basic-dqn

Feel free to check out this notebook on Kaggle: https://www.kaggle.com/code/aryamanbansal/dueling-dqn

### **Motivation**

In many environments, the value of a state is not solely dependent on the chosen action - some states are inherently good or bad regardless of which action is taken. In standard DQN, the Q-value is computed directly for every (state, action) pair. However, this approach can be inefficient when the value of the state itself is a dominant factor.

**Dueling DQN** addresses this by decomposing the Q-value function into two separate estimators:

- **State Value Function**, $V(s)$: Represents how good it is to be in a given state regardless of the action taken.
- **Advantage Function**, $A(s,a)$: Represents the relative benefit of taking a specific action compared to the other actions in that state.

This separation helps the network learn the value of states and the relative advantage of actions more efficiently, especially in states where actions have little effect on the overall outcome.

### **Comparative Study: Dueling DQN vs. Basic DQN**

- **Basic DQN:**
    - Computes a single stream that directly estimates $Q(s,a)$ for each action.
    - May struggle to learn an effective state value when the advantage of each action is subtle.
- **Dueling DQN:**
    - Decomposes $Q(s,a)$ into $V(s)$ and $A(s,a)$, which can help the network focus on learning the inherent value of the state.
    - Often achieves improved performance and more stable training, especially in environments where the choice of action has less impact on the overall state value.
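    - Can be more sample-efficient, because the value stream $V(s)$ receives a gradient from every transition regardless of which action was taken, whereas a single-stream network only updates the output of the action that was taken.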

In [None]:
import gymnasium as gym
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import numpy as np
import matplotlib.pyplot as plt
import random
from scipy.signal import convolve
from scipy.signal.windows import gaussian

from base64 import b64encode
from IPython.display import HTML, clear_output

from tqdm import trange

print("imports done!")

imports done!


In [None]:
# set a seed
seed = 132
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)


<torch._C.Generator at 0x786050e12d90>

In [None]:
# Assuming a global device setting (CPU or CUDA)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

### **Neural Network for Dueling DQN**

1. **Decomposition:**

- In basic DQN: $$Q(s,a) \approx \text{some function } f(s,a)$$

- The core idea of Dueling DQN is to express the Q-value function as: $$Q(s,a) = V(s) + \left( A(s,a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a') \right)$$

where:
- $V(s)$ is the state-value function, i.e., the value of being in state $s$.
- $A(s,a)$ is the advantage function, i.e., the relative advantage of taking action $a$ in state $s$ compared to other actions.
- The term $\frac{1}{|\mathcal{A}|} \sum_{a'} A(s,a')$ is the mean advantage over all actions. Subtracting it makes the decomposition identifiable: without it, adding a constant to $V(s)$ and subtracting the same constant from every $A(s,a)$ would leave $Q(s,a)$ unchanged, so the network could not pin down $V(s)$.

&nbsp;  

2. **Why use a Dueling Architecture?**

- **Efficient Learning**: In many states, the choice of action might have little impact on the value of that state. By separately estimating $V(s)$ and $A(s,a)$, the network can learn $V(s)$ even when the advantage values are noisy.
- **Stability**: This separation can lead to more stable estimates and can help reduce overfitting of Q-values to specific actions.

&nbsp;  

3. **Implementation Details:**

- The network typically starts with shared convolutional or fully connected layers that produce a common feature representation.
- After the shared layers, the network splits into two streams: one for estimating $V(s)$ (usually ending in a single output) and one for estimating $A(s,a)$ (ending in as many outputs as actions).
- The final Q-values are then computed as described above.
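
The aggregation formula above can be checked on a few made-up numbers. The sketch below is only illustrative: the tensors `v` and `adv` are hypothetical stream outputs, not values from a trained network. It shows that subtracting the mean advantage keeps the ranking of actions and makes the mean of $Q(s,\cdot)$ over actions equal to $V(s)$.

```python
import torch

# Hypothetical stream outputs for a batch of 2 states and 3 actions.
v = torch.tensor([[1.0], [-2.0]])              # V(s), shape [2, 1]
adv = torch.tensor([[0.5, 1.5, 1.0],
                    [3.0, 0.0, 0.0]])          # A(s, a), shape [2, 3]

# Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a'))
adv_mean = adv.mean(dim=1, keepdim=True)       # shape [2, 1]
q = v + adv - adv_mean                         # broadcasts to shape [2, 3]

print(q)                 # Q-values per state and action
print(q.mean(dim=1))     # equals V(s) for each state
print(q.argmax(dim=1))   # same greedy actions as adv.argmax(dim=1)
```

Because the same quantity is subtracted from every action in a state, the greedy action is unaffected; only the split between the two streams is constrained.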

In [None]:
class DuelingDQNAgent(nn.Module):
    """
    Dueling Deep Q-Network (Dueling DQN) Agent.

    This agent implements the dueling architecture which separates the estimation of
    the state value function V(s) and the advantage function A(s,a) to compute Q(s,a).
    The final Q-values are obtained by combining these two streams:
    
        Q(s,a) = V(s) + (A(s,a) - mean(A(s,·)))

    Attributes:
        epsilon (float): Exploration rate (can be used for epsilon-greedy policies).
        n_actions (int): Number of possible actions.
        state_shape (tuple): Shape of the input state.
        fc1 (nn.Linear): First fully connected layer.
        fc2 (nn.Linear): Second fully connected layer.
        fc_value (nn.Linear): Fully connected layer for the value stream.
        fc_adv (nn.Linear): Fully connected layer for the advantage stream.
        value (nn.Linear): Final layer that outputs the state value V(s).
        adv (nn.Linear): Final layer that outputs the advantages A(s,a) for each action.
    """
    
    def __init__(self, state_shape, n_actions, epsilon=0):
        """
        Initialize the DuelingDQNAgent.

        Args:
            state_shape (tuple): Shape of the input state (e.g., (state_dim,)).
            n_actions (int): Number of possible actions.
            epsilon (float, optional): Exploration rate for epsilon-greedy policies. Default is 0.
        """
        super().__init__()
        self.epsilon = epsilon
        self.n_actions = n_actions
        self.state_shape = state_shape
        
        # Define the shared fully-connected layers for feature extraction from the input state.
        self.fc1 = nn.Linear(state_shape[0], 64)
        self.fc2 = nn.Linear(64, 128)
        
        # Define the separate streams for the value and advantage functions.
        # The value stream processes features to estimate V(s).
        self.fc_value = nn.Linear(128, 32)
        self.value = nn.Linear(32, 1)
        
        # The advantage stream processes features to estimate A(s,a) for each action.
        self.fc_adv = nn.Linear(128, 32)
        self.adv = nn.Linear(32, n_actions)
        

    def forward(self, state_t):
        """
        Forward pass through the dueling network to compute Q-values.

        Args:
            state_t (torch.Tensor): Input tensor representing the state, 
                                    of shape [batch_size, state_dim].

        Returns:
            torch.Tensor: The computed Q-values for each action, of shape [batch_size, n_actions].
        """
        # Pass the input state through the first fully-connected layer with ReLU activation.
        x = F.relu(self.fc1(state_t))
        # Further process the features with a second fully-connected layer.
        x = F.relu(self.fc2(x))
        
        # --- Value Stream ---
        # Process features to estimate the state value V(s).
        v = F.relu(self.fc_value(x))
        v = self.value(v)  # Output shape: [batch_size, 1]
        
        # --- Advantage Stream ---
        # Process features to estimate the advantages A(s,a) for each action.
        adv = F.relu(self.fc_adv(x))
        adv = self.adv(adv)  # Output shape: [batch_size, n_actions]
        
        # Normalize the advantage values by subtracting the mean advantage for each sample.
        adv_avg = torch.mean(adv, dim=1, keepdim=True)
        
        # Combine the value and normalized advantage streams to obtain Q-values.
        # Q(s,a) = V(s) + (A(s,a) - mean(A(s,·)))
        qvalues = v + adv - adv_avg
        
        return qvalues

    
    def get_qvalues(self, states):
        """
        Compute Q-values for a batch of states provided as arrays.
        
        Args:
            states (array-like): Batch of states.
            
        Returns:
            np.array: Q-values for each state.
        """
        states = torch.tensor(np.array(states), device=device, dtype=torch.float32)
        with torch.no_grad():  # inference only; no gradient graph is needed here
            qvalues = self.forward(states)
        return qvalues.cpu().numpy()

    
    def get_action(self, states):
        """
        Returns the best (greedy) actions for a batch of states.
        
        Args:
            states (array-like): Batch of states.
            
        Returns:
            np.array: Best actions.
        """
        states = torch.tensor(np.array(states), device=device, dtype=torch.float32)
        qvalues = self.forward(states)
        best_actions = qvalues.argmax(axis=-1)
        return best_actions


    def sample_actions(self, qvalues):
        """
        Implements the epsilon-greedy policy on a batch of Q-values
        
        Args:
            qvalues (np.array): Q-values for a batch of states.
            
        Returns:
            np.array: Actions selected (random with probability epsilon, otherwise best action).
        """
        epsilon = self.epsilon
        batch_size, n_actions = qvalues.shape
        # Randomly choose actions for exploration
        random_actions = np.random.choice(n_actions, size=batch_size)
        # Greedy actions based on Q-values
        best_actions = qvalues.argmax(axis=-1)
        # Create an array of flags: 1 (explore) with probability epsilon, 0 (exploit) otherwise
        should_explore = np.random.choice([0, 1], batch_size, p=[1 - epsilon, epsilon])
        # Choose random actions where the flag is 1, otherwise choose the best actions
        return np.where(should_explore, random_actions, best_actions)


    def save(self, path):
        """
        Saves the model parameters to a file.
        
        Args:
            path (str): Path to save the model.
        """
        print("Saving model to:", path)
        # The agent itself is the nn.Module, so save its own state_dict.
        torch.save(self.state_dict(), f"{path}.zip")
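
As a sanity check on the epsilon-greedy rule used by `sample_actions`, the sketch below re-implements it as a standalone function. Note the assumptions: `epsilon_greedy` is a hypothetical helper name, and it uses NumPy's `default_rng` generator with a uniform draw compared against epsilon instead of the `np.random.choice` call in the class. The selection rule itself is the same: a random action with probability epsilon, the greedy action otherwise.

```python
import numpy as np

def epsilon_greedy(qvalues, epsilon, rng):
    """Illustrative standalone epsilon-greedy selection over a batch of Q-values."""
    batch_size, n_actions = qvalues.shape
    random_actions = rng.integers(n_actions, size=batch_size)  # uniform random actions
    best_actions = qvalues.argmax(axis=-1)                     # greedy actions
    explore = rng.random(batch_size) < epsilon                 # True with probability epsilon
    return np.where(explore, random_actions, best_actions)

rng = np.random.default_rng(0)
qvalues = np.array([[0.1, 0.9],
                    [0.7, 0.2]])

greedy = epsilon_greedy(qvalues, epsilon=0.0, rng=rng)    # never explores
explored = epsilon_greedy(qvalues, epsilon=1.0, rng=rng)  # always explores
print(greedy, explored)
```

With `epsilon=0.0` the result is always the argmax of each row, which is how the agent is evaluated in greedy mode.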

### **Replay Buffer**

In [None]:
class ReplayBuffer:
    def __init__(self, size):
        """
        Initialize the ReplayBuffer.

        Args:
            size (int): Maximum number of experiences to store.
        """
        self.size = size  # Maximum buffer size
        self.buffer = []  # List to store experiences
        self.next_id = 0  # Index pointer for cyclic buffer replacement


    def __len__(self):
        """
        Return the current number of experiences in the buffer.
        """
        return len(self.buffer)


    def add(self, state, action, reward, next_state, done):
        """
        Add a new experience to the buffer.

        Args:
            state (object): The current state.
            action (int): The action taken.
            reward (float): The reward received.
            next_state (object): The next state after taking the action.
            done (bool): Flag indicating whether the episode has terminated.
        """
        # Pack the experience into a tuple
        item = (state, action, reward, next_state, done)
        
        # If the buffer isn't full, append the experience
        if len(self.buffer) < self.size:
            self.buffer.append(item)
        else:
            # If the buffer is full, overwrite the oldest experience
            self.buffer[self.next_id] = item
        
        # Update the index in a cyclic manner
        self.next_id = (self.next_id + 1) % self.size


    def sample(self, batch_size):
        """
        Sample a batch of experiences from the buffer.

        Args:
            batch_size (int): Number of experiences to sample.

        Returns:
            A tuple of numpy arrays: (states, actions, rewards, next_states, done_flags)
        """
        # Randomly select indices for the batch
        idxs = np.random.choice(len(self.buffer), batch_size, replace=False)
        # Retrieve the experiences at the selected indices
        samples = [self.buffer[i] for i in idxs]
        # Unzip the list of tuples into separate components
        states, actions, rewards, next_states, done_flags = list(zip(*samples))
        # Convert each component into a numpy array and return
        return (np.array(states),
                np.array(actions),
                np.array(rewards),
                np.array(next_states),
                np.array(done_flags))
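
To see the cyclic overwrite in action, here is a condensed sketch of the same `add` logic. `TinyReplayBuffer` is a hypothetical name used only for this illustration, and it stores plain integers instead of transition tuples.

```python
class TinyReplayBuffer:
    """Condensed, illustrative copy of the cyclic insertion logic."""

    def __init__(self, size):
        self.size = size
        self.buffer = []
        self.next_id = 0

    def add(self, item):
        if len(self.buffer) < self.size:
            self.buffer.append(item)            # still filling up
        else:
            self.buffer[self.next_id] = item    # overwrite the oldest entry
        self.next_id = (self.next_id + 1) % self.size

buf = TinyReplayBuffer(size=3)
for t in range(5):      # add 5 items to a buffer of capacity 3
    buf.add(t)

print(buf.buffer)       # [3, 4, 2]
```

Items 0 and 1 were the oldest entries, so they are the ones replaced by 3 and 4. The list is therefore not kept in chronological order, which does not matter for uniform random sampling.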

### **TD Loss**

In [None]:
def compute_td_loss(agent, target_network, states, actions, rewards, 
                    next_states, done_flags, gamma=0.99, device=device):
    """
    Compute the TD loss for a batch of experiences.

    Args:
        agent (nn.Module): The current Q-network.
        target_network (nn.Module): The target Q-network.
        states (np.array): Batch of current states.
        actions (np.array): Batch of actions taken.
        rewards (np.array): Batch of rewards received.
        next_states (np.array): Batch of next states.
        done_flags (np.array): Batch of done flags (True/False).
        gamma (float): Discount factor.
        device: Device to run the computations on (CPU/GPU).

    Returns:
        torch.Tensor: The computed TD loss.
    """
    # Convert numpy arrays to torch tensors and move them to the appropriate device
    states = torch.tensor(states, device=device, dtype=torch.float)
    actions = torch.tensor(actions, device=device, dtype=torch.long)
    rewards = torch.tensor(rewards, device=device, dtype=torch.float)
    next_states = torch.tensor(next_states, device=device, dtype=torch.float)
    done_flags = torch.tensor(done_flags.astype('float32'), device=device, dtype=torch.float)

    # Compute Q-values for all actions for the current states using the agent network.
    predicted_qvalues = agent(states)  # shape: [batch_size, n_actions]

    # Compute Q-values for all actions for the next states using the target network.
    predicted_next_qvalues = target_network(next_states)  # shape: [batch_size, n_actions]

    # For each experience in the batch, select the Q-value for the taken action.
    predicted_qvalues_for_actions = predicted_qvalues[range(len(actions)), actions]

    # Compute the maximum Q-value for the next states (greedy action selection)
    next_state_values, _ = torch.max(predicted_next_qvalues, dim=1)

    # Compute the target Q-values using the TD target equation:
    # target = reward + gamma * max_a' Q(s', a') * (1 - done)
    # (1 - done) ensures that if the next state is terminal, 
    # we don't add the discounted future reward.
    target_qvalues_for_actions = rewards + gamma * next_state_values * (1 - done_flags)

    # Compute the Mean Squared Error (MSE) loss between the predicted Q-values and the target Q-values.
    loss = torch.mean((predicted_qvalues_for_actions - target_qvalues_for_actions.detach()) ** 2)
    return loss
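
The TD target can be traced by hand on a tiny batch. The numbers below are made up for illustration; they mirror the steps of `compute_td_loss` without calling any network.

```python
import torch

gamma = 0.99
rewards = torch.tensor([1.0, 1.0])
done_flags = torch.tensor([0.0, 1.0])          # the second transition is terminal
next_q = torch.tensor([[2.0, 3.0],             # hypothetical target-network Q(s', .)
                       [5.0, 4.0]])
q_taken = torch.tensor([3.5, 1.5])             # hypothetical online Q(s, a) for the taken actions

# max over a' of Q_target(s', a')
next_values, _ = torch.max(next_q, dim=1)      # [3.0, 5.0]

# target = r + gamma * max_a' Q(s', a') * (1 - done)
targets = rewards + gamma * next_values * (1 - done_flags)   # [3.97, 1.0]

loss = torch.mean((q_taken - targets.detach()) ** 2)
print(targets, loss.item())
```

For the terminal transition the bootstrap term is masked out, so its target is just the reward of 1.0 even though the target network predicts a next-state value of 5.0.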


### **Recording video of trained agents**



In [None]:
def record_video(env_id, video_folder, video_length, agent):
    """
    Record a video of the agent interacting with the environment.

    Args:
        env_id (str): Environment ID (e.g., 'CartPole-v1').
        video_folder (str): Folder where the video will be saved.
        video_length (int): Number of timesteps to record.
        agent: Trained agent with a get_action() method.
    
    Returns:
        str: The file path to the saved video.
    """
    # Create a dummy vectorized environment with rendering enabled.
    vec_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])
    
    # Wrap the environment with VecVideoRecorder.
    vec_env = VecVideoRecorder(
        vec_env, video_folder,
        record_video_trigger=lambda x: x == 0,
        video_length=video_length,
        name_prefix=f"{type(agent).__name__}-{env_id}"
    )

    # Reset environment to start recording.
    obs = vec_env.reset()
    for _ in range(video_length + 1):
        # Get action from the agent and step the environment.
        action = agent.get_action(obs).detach().cpu().numpy()
        obs, _, _, _ = vec_env.step(action)
    
    # Construct the file path of the recorded video.
    file_path = "./" + video_folder + vec_env.video_recorder.path.split("/")[-1]
    vec_env.close()
    return file_path


In [None]:
def play_video(file_path):
    """
    Display a video file in a Jupyter Notebook.

    Args:
        file_path (str): Path to the video file.

    Returns:
        HTML: HTML object that can display the video.
    """
    # Read video file in binary mode (the context manager closes the file afterwards).
    with open(file_path, 'rb') as f:
        mp4 = f.read()
    # Encode the video file in base64.
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    # Create HTML snippet with a video player.
    return HTML(f"""
        <video width=400 controls>
            <source src="{data_url}" type="video/mp4">
        </video>
        """)


### **Setting up the training parameters**



In [None]:
def play_and_record(start_state, agent, env, exp_replay, n_steps=1):
    """
    Run the agent in the environment for a fixed number of steps and record the transitions.

    This function allows the agent to interact with the environment for `n_steps` timesteps,
    collects the transitions (state, action, reward, next state, done flag), and stores them
    in the experience replay buffer. It also accumulates the total reward obtained during these steps.

    Args:
        start_state: The initial state from which the agent starts.
        agent: The DQN agent that provides action selection via its get_qvalues and sample_actions methods.
        env: The environment in which the agent is acting (should follow the Gymnasium API).
        exp_replay: The experience replay buffer (an instance of ReplayBuffer) to store transitions.
        n_steps (int, optional): The number of steps to run the agent in the environment. Defaults to 1.

    Returns:
        tuple: A tuple (sum_rewards, s) where:
            - sum_rewards (float): The total reward accumulated over the n_steps.
            - s: The state at the end of the n_steps, which can be used as the starting state for subsequent calls.
    """
    s = start_state          # Initialize the current state.
    sum_rewards = 0          # Initialize the reward accumulator.

    # Run the agent for n_steps steps.
    for _ in range(n_steps):
        # Obtain Q-values for the current state.
        qvalues = agent.get_qvalues([s])
        
        # Select an action using the agent's epsilon-greedy policy.
        a = agent.sample_actions(qvalues)[0]
        
        # Execute the action in the environment.
        next_s, r, terminated, truncated, _ = env.step(a)
        
        # Accumulate the reward.
        sum_rewards += r
        
        # Check if the episode has ended.
        done = terminated or truncated
        
        # Record the transition in the replay buffer.
        exp_replay.add(s, a, r, next_s, done)
        
        # Update the current state:
        # If the episode ended, reset the environment; otherwise, continue with the next state.
        if done:
            s, _ = env.reset()
        else:
            s = next_s

    return sum_rewards, s

In [None]:
# Setup the environment and agent networks
env_name = 'CartPole-v1'
env = gym.make(env_name, render_mode="rgb_array", max_episode_steps=4000)    # Create the environment
state_dim = env.observation_space.shape     # e.g., (4,) for CartPole
n_actions = env.action_space.n              # e.g., 2 for CartPole

In [None]:
# Reset environment and set seed for reproducibility
state, _ = env.reset(seed=seed)

# Initialize DQN agent with initial high exploration (epsilon=1)
agent = DuelingDQNAgent(state_dim, n_actions, epsilon=1).to(device)
target_network = DuelingDQNAgent(state_dim, n_actions, epsilon=1).to(device)
target_network.load_state_dict(agent.state_dict())  # Synchronize target network

# Populate the experience replay buffer with initial random experiences
exp_replay = ReplayBuffer(10**4)  # Replay buffer with capacity 10,000
for i in range(100):
    # Keep the returned state so the next call resumes from the environment's actual current state.
    _, state = play_and_record(state, agent, env, exp_replay, n_steps=10**2)
    if len(exp_replay) == 10**4:
        break

# Set up training hyperparameters
timesteps_per_epoch = 1        # Environment steps collected per training step
batch_size = 32                # Mini-batch size for training updates
total_steps = 50000            # Total training steps

# Initialize the optimizer (Adam) for updating the agent's parameters
opt = torch.optim.Adam(agent.parameters(), lr=1e-4)

# Define the exploration schedule (epsilon decay)
start_epsilon = 1              # Starting exploration rate
end_epsilon = 0.05             # Minimum exploration rate
eps_decay_final_step = 2 * 10**4  # Steps over which epsilon decays to end_epsilon

# Define frequencies for logging and updating the target network
loss_freq = 20                      # Log the loss every 20 steps
refresh_target_network_freq = 100   # Update target network every 100 steps
eval_freq = 1000                    # Evaluate the agent every 1000 steps

# Set gradient clipping threshold to stabilize training
max_grad_norm = 5000

In [None]:
mean_rw_history = []
td_loss_history = []


### **The main training loop**



In [None]:
def epsilon_schedule(start_eps, end_eps, step, final_step):
    """
    Compute the exploration epsilon for the current step using a linear decay schedule.

    Args:
        start_eps (float): The initial epsilon (e.g., 1.0).
        end_eps (float): The final epsilon after decay (e.g., 0.05).
        step (int): The current training step.
        final_step (int): The step at which epsilon decays to end_eps.

    Returns:
        float: The computed epsilon value for the current step.
    """
    # Ensure the step does not exceed final_step for correct interpolation.
    return start_eps + (end_eps - start_eps) * min(step, final_step) / final_step
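
A quick way to check the schedule is to evaluate it at a few steps using the hyperparameters set earlier (start 1.0, end 0.05, decay over 20,000 steps). The function is repeated under the hypothetical name `linear_epsilon` so the snippet runs on its own.

```python
def linear_epsilon(start_eps, end_eps, step, final_step):
    """Linear interpolation from start_eps to end_eps, held constant after final_step."""
    return start_eps + (end_eps - start_eps) * min(step, final_step) / final_step

# Evaluate at the start, halfway through the decay, at its end, and at the end of training.
values = {step: round(linear_epsilon(1.0, 0.05, step, 20_000), 3)
          for step in (0, 10_000, 20_000, 50_000)}
print(values)   # {0: 1.0, 10000: 0.525, 20000: 0.05, 50000: 0.05}
```

Epsilon therefore stays at its floor of 0.05 for the last 30,000 of the 50,000 training steps.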

In [None]:
def smoothen(values):
    """
    Smooths out the given values using a Gaussian filter.

    Args:
        values (list or np.array): The sequence of values to smooth.

    Returns:
        np.array: The smoothed values.
    """
    kernel = gaussian(100, std=100)
    kernel = kernel / np.sum(kernel)
    return convolve(values, kernel, 'valid')

In [None]:
def evaluate(env, agent, n_games=1, greedy=False, t_max=10000):
    """
    Evaluate the agent's performance by running it for a specified number of games.

    Args:
        env (gym.Env): The environment to evaluate in.
        agent (DuelingDQNAgent): The Dueling DQN agent.
        n_games (int): Number of games (episodes) to run.
        greedy (bool): If True, use the greedy policy (argmax); otherwise use epsilon-greedy.
        t_max (int): Maximum timesteps per episode.

    Returns:
        float: The average total reward over the evaluated games.
    """
    rewards = []
    for _ in range(n_games):
        s, _ = env.reset()
        total_reward = 0
        for _ in range(t_max):
            # Get Q-values from the agent.
            qvalues = agent.get_qvalues([s])
            # Choose action: greedy (argmax) if specified, otherwise use agent's sampling.
            action = qvalues.argmax(axis=-1)[0] if greedy else agent.sample_actions(qvalues)[0]
            s, r, terminated, truncated, _ = env.step(action)
            total_reward += r
            # Stop on either a terminal state or a time-limit truncation.
            if terminated or truncated:
                break
        rewards.append(total_reward)
    return np.mean(rewards)

In [None]:
def train_dqn(total_steps, timesteps_per_epoch, batch_size, 
              start_epsilon, end_epsilon, eps_decay_final_step,
              loss_freq, refresh_target_network_freq, eval_freq,
              max_grad_norm, agent, target_network, env, exp_replay,
              opt, td_loss_history, mean_rw_history, env_name, device):
    """
    Main training loop for the DQN agent.

    The function updates the agent by:
      - Decaying the exploration rate.
      - Collecting experiences and storing them in the replay buffer.
      - Sampling mini-batches from the replay buffer.
      - Computing the TD loss and performing backpropagation.
      - Periodically updating the target network.
      - Evaluating and logging the agent's performance.

    Args:
        total_steps (int): Total number of training steps.
        timesteps_per_epoch (int): Number of environment steps per training epoch.
        batch_size (int): Mini-batch size for training.
        start_epsilon (float): Initial exploration rate.
        end_epsilon (float): Final exploration rate after decay.
        eps_decay_final_step (int): The step at which epsilon should reach end_epsilon.
        loss_freq (int): Frequency (in steps) to log TD loss.
        refresh_target_network_freq (int): Frequency (in steps) to update the target network.
        eval_freq (int): Frequency (in steps) to evaluate the agent.
        max_grad_norm (float): Maximum gradient norm for clipping.
        agent (DuelingDQNAgent): The online agent network.
        target_network (DuelingDQNAgent): The target network.
        env (gym.Env): The environment for interaction.
        exp_replay (ReplayBuffer): The experience replay buffer.
        opt (torch.optim.Optimizer): The optimizer for training.
        td_loss_history (list): List to record TD loss history.
        mean_rw_history (list): List to record mean reward history.
        env_name (str): Environment name (used for creating a new env during evaluation).
        device (torch.device): Device to perform computations on.

    Returns:
        None
    """
    # Reset the environment to get the initial state.
    state, _ = env.reset()

    # Main training loop.
    for step in trange(total_steps + 1):
        # 1. Update exploration rate (epsilon) based on schedule.
        agent.epsilon = epsilon_schedule(start_epsilon, end_epsilon, step, eps_decay_final_step)

        # 2. Interact with the environment and record experiences.
        #    play_and_record() should update the replay buffer.
        _, state = play_and_record(state, agent, env, exp_replay, timesteps_per_epoch)

        # 3. Sample a mini-batch from the replay buffer.
        states, actions, rewards, next_states, done_flags = exp_replay.sample(batch_size)

        # 4. Compute the TD loss using the agent and target networks.
        loss = compute_td_loss(agent, target_network,
                               states, actions, rewards, next_states, done_flags,
                               gamma=0.99, device=device)

        # 5. Perform backpropagation and update the network.
        loss.backward()
        # Clip gradients to stabilize training.
        nn.utils.clip_grad_norm_(agent.parameters(), max_grad_norm)
        opt.step()
        opt.zero_grad()

        # 6. Log the TD loss at specified intervals.
        if step % loss_freq == 0:
            td_loss_history.append(loss.data.cpu().item())

        # 7. Update the target network periodically.
        if step % refresh_target_network_freq == 0:
            target_network.load_state_dict(agent.state_dict())

        # 8. Evaluate the agent and update logs/plots.
        if step % eval_freq == 0:
            # Create a fresh environment for evaluation.
            eval_env = gym.make(env_name, render_mode="rgb_array", max_episode_steps=4000)
            mean_reward = evaluate(eval_env, agent, n_games=3, greedy=True, t_max=1000)
            mean_rw_history.append(mean_reward)

            clear_output(wait=True)
            print("Buffer size = %i, Epsilon = %.5f" % (len(exp_replay), agent.epsilon))

            # Plot the mean return and smoothened TD loss.
            plt.figure(figsize=[16, 5])
            plt.subplot(1, 2, 1)
            plt.title("Mean return per episode")
            plt.plot(mean_rw_history)
            plt.grid()

            plt.subplot(1, 2, 2)
            plt.title("TD loss history (smoothened)")
            plt.plot(smoothen(td_loss_history))
            plt.grid()

            plt.show()

### **Applying Dueling DQN to CartPole**



In [None]:
train_dqn(total_steps, timesteps_per_epoch, batch_size,
          start_epsilon, end_epsilon, eps_decay_final_step,
          loss_freq, refresh_target_network_freq, eval_freq,
          max_grad_norm, agent, target_network, env, exp_replay,
          opt, td_loss_history, mean_rw_history, env_name, device)

In [None]:
final_score = evaluate(
  gym.make(env_name, render_mode="rgb_array", max_episode_steps=4000),
  agent, n_games=30, greedy=True, t_max=1000
)
print('final score:', final_score)

In [None]:
# video_folder = ""  # enter folder location
# video_length = 500

# video_file = record_video(env_name, video_folder, video_length, agent)

# play_video(video_file)