<a href="https://colab.research.google.com/github/bmcgregor22/reinforcement_learning_online_msds/blob/work/mywork/module7/lab_dueling_q_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Lab: Navigating the Lunar Lander with a Dueling Deep Q Network

### University of Virginia
### Reinforcement Learning
#### Last updated: February 27, 2024

---

## Bruce McGregor

#### Instructions:

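You will work with the `LunarLander-v2` environment from `gymnasium` in this lab (registered as `LunarLander-v3` in Gymnasium 1.0 and later, which is the ID used in the code cells of this notebook).  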

An overview of the environment can be found [here](https://gymnasium.farama.org/).  
If you're curious about the source code, see [here](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/box2d/lunar_lander.py).

Your mission is to implement a dueling deep Q-network using PyTorch.  
You may run this on Colab.

There are a few specific tasks outlined below for you to solve.

The bigger tasks will be to:

- Show that the algorithm works to train the agent in the environment
- Run episodes and show the results

**Submission**  
As you will likely have several files including this notebook, you can zip all files and submit.

---

![lunar](https://github.com/bmcgregor22/reinforcement_learning_online_msds/blob/work/mywork/module7/lunar_lander1.png?raw=1)

#### TOTAL POINTS: 12

---

**Hint:** Modules you may need to install include:

swig  
gym[box2d]  
gymnasium

#### 1) What is the penalty for crashing?  
**(POINTS: 1)**


The penalty for crashing is -100 points.

#### 2) Set up the environment and run 2 steps by taking random actions.
**(POINTS: 1)**

In [1]:
!apt-get -y install swig
!pip install -U gymnasium[box2d] torch



In [11]:
import gymnasium as gym
import numpy as np

# Create the environment and reset it with a fixed seed.
env = gym.make("LunarLander-v3")
obs, info = env.reset(seed=42)

# Take two random actions and report the outcome of each step.
# Note: reset(seed=...) seeds the environment only, not action_space.sample(),
# so the sampled actions can differ between runs.
for step in (1, 2):
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)

    print(f"\nStep {step}:")
    print("  Action taken:", action)
    print("  Reward:", reward)
    print("  Terminated:", terminated, "Truncated:", truncated)
    print("  Next obs sample:", np.round(next_obs, 3))

env.close()


Step 1:
  Action taken: 0
  Reward: 1.1449803922348565
  Terminated: False Truncated: False
  Next obs sample: [ 0.005  1.425  0.232  0.295 -0.005 -0.052  0.     0.   ]

Step 2:
  Action taken: 1
  Reward: 1.9085491278836753
  Terminated: False Truncated: False
  Next obs sample: [ 0.007  1.431  0.223  0.269 -0.006 -0.015  0.     0.   ]


#### 3) Briefly discuss your approach to solving the problem  
**(POINTS: 2)**

My approach to solving this problem is to create modular programs for the agent, the network, the repla

#### 4) Create supporting code files (`.py` format) to create the agent, train, and run episodes
**(POINTS: 6)**

Your code should include:

- **(POINTS: 1)** A class for the dueling DQN agent
- **(POINTS: 1)** An architecture with separate Value and Advantage streams
- **(POINTS: 1)** A method called `forward()` for the forward pass of the algorithm
- **(POINTS: 1)** A replay buffer
- **(POINTS: 1)** A training function
- **(POINTS: 1)** A function to run episodes
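
As one illustration of the replay-buffer item in the list above, here is a minimal sketch of a fixed-size buffer with uniform sampling. The names `Transition` and `ReplayBuffer`, the method names, and the default seed are assumptions made for this example; they are not taken from the lab's own files.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Sequence


class Transition(NamedTuple):
    """One environment step (hypothetical field names)."""
    state: Sequence[float]
    action: int
    reward: float
    next_state: Sequence[float]
    done: bool


class ReplayBuffer:
    """Fixed-size FIFO buffer with uniform random sampling."""

    def __init__(self, capacity: int, seed: int = 42):
        # deque(maxlen=...) silently drops the oldest item once full.
        self.buffer: Deque[Transition] = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done) -> None:
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement reduces the temporal
        # correlation between the transitions in a batch.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


# Usage: push five transitions into a buffer that holds three.
buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push([float(t)], 0, 1.0, [float(t + 1)], False)
batch = buf.sample(2)
```

In a PyTorch training loop the sampled transitions would typically be stacked into tensors before the loss is computed; that step is left out here to keep the sketch dependency-free.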

In [13]:
%%writefile config.py

from dataclasses import dataclass

@dataclass
class Config:
    env_id: str = "LunarLander-v3"
    total_steps: int = 300_000
    start_learning: int = 10_000
    buffer_size: int = 200_000
    batch_size: int = 128
    gamma: float = 0.99
    lr: float = 1e-3
    train_freq: int = 1
    target_update_freq: int = 1_000  # hard copy every N steps
    tau: float = 1.0                 # 1.0 = hard; <1.0 = Polyak
    eps_start: float = 1.0
    eps_end: float = 0.05
    eps_decay_steps: int = 200_000
    grad_clip: float = 10.0
    reward_clip: float | None = None  # set to 1.0 to clip rewards to [-1,1] if desired

SEED = 42


Writing config.py
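
The `eps_start`, `eps_end`, and `eps_decay_steps` fields suggest an epsilon-greedy schedule. The sketch below assumes a linear decay, which is one common choice; the function name `linear_epsilon` is hypothetical, and the schedule the training code actually uses may differ.

```python
def linear_epsilon(step: int,
                   eps_start: float = 1.0,
                   eps_end: float = 0.05,
                   eps_decay_steps: int = 200_000) -> float:
    """Linearly anneal epsilon from eps_start to eps_end, then hold it.

    Assumption: linear decay over eps_decay_steps environment steps.
    """
    # Fraction of the decay completed, clamped to [0, 1].
    frac = min(max(step, 0) / eps_decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


# Epsilon at the start, halfway through the decay, and after it has finished.
schedule = [linear_epsilon(s) for s in (0, 100_000, 200_000, 300_000)]
```

With the defaults from `Config`, exploration stays at `eps_end` for the last 100,000 of the 300,000 total steps.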


In [14]:
%%writefile network.py

import torch
import torch.nn as nn
from typing import Tuple

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: Tuple[int, int] = (256, 256)):
        super().__init__()
        self.feature = nn.Sequential(
            nn.Linear(state_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
        )
        self.value = nn.Sequential(
            nn.Linear(hidden[1], 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
        self.advantage = nn.Sequential(
            nn.Linear(hidden[1], 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        z = self.feature(x)
        v = self.value(z)                        # [B, 1]
        a = self.advantage(z)                    # [B, A]
        a = a - a.mean(dim=1, keepdim=True)      # center A for identifiability
        return v + a                              # [B, A]


Writing network.py
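
The aggregation in `forward()` can be written out as follows, where $V(s)$ is the output of the value stream, $A(s, a)$ the output of the advantage stream, and $|\mathcal{A}|$ the number of actions (the symbols are chosen here for illustration):

$$
Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)
$$

Because the centered advantages sum to zero, averaging $Q(s, a)$ over actions recovers $V(s)$. A small worked case with four actions: for $V(s) = 2$ and $A(s, \cdot) = (1, 0, -1, 4)$, the mean advantage is $1$, so

$$
Q(s, \cdot) = 2 + (0, -1, -2, 3) = (2, 1, 0, 5),
$$

and the mean of these Q-values is $8 / 4 = 2 = V(s)$.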


#### 5) Run the training and show evidence that the agent is learning.  

For example, its average reward (score) should increase with more episodes.

**(POINTS: 1 if successful)**
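
One way to show that the score increases is to smooth the per-episode returns with a trailing moving average before plotting or printing them. The helper below is a minimal sketch; the name `moving_average` and the default window of 100 episodes are assumptions for this example.

```python
from typing import List, Sequence


def moving_average(returns: Sequence[float], window: int = 100) -> List[float]:
    """Trailing mean of the last `window` episode returns.

    Early entries average over however many episodes exist so far.
    """
    averages: List[float] = []
    running = 0.0
    for i, r in enumerate(returns):
        running += r
        if i >= window:
            # Drop the return that just fell out of the window.
            running -= returns[i - window]
        averages.append(running / min(i + 1, window))
    return averages


# Usage with made-up returns (not results from this lab).
smoothed = moving_average([1, 2, 3, 4], window=2)
```

A smoothed curve that trends upward over training is easier to read than the raw returns, which vary considerably from episode to episode.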

#### 6) Run a few episodes and show results
**(POINTS: 1 if successful)**
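
A function for running an evaluation episode might look like the sketch below. It relies only on the Gymnasium-style interface, where `reset()` returns `(obs, info)` and `step()` returns `(obs, reward, terminated, truncated, info)`. The names `run_episode` and `CountdownEnv` are hypothetical; `CountdownEnv` is a stand-in so the example runs without Box2D, and a real run would pass `gym.make("LunarLander-v3")` together with the trained agent's greedy policy.

```python
from typing import Any, Callable, Optional, Tuple


def run_episode(env: Any,
                policy: Callable[[Any], int],
                seed: Optional[int] = None,
                max_steps: int = 1_000) -> Tuple[float, int]:
    """Run one episode and return (total reward, number of steps)."""
    obs, _info = env.reset(seed=seed)
    total, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total += float(reward)
        steps += 1
        if terminated or truncated:
            break
    return total, steps


class CountdownEnv:
    """Hypothetical stand-in env: reward 1.0 per step, ends after `horizon` steps."""

    def __init__(self, horizon: int = 3):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        return self.t, 1.0, terminated, False, {}


# Usage: a constant policy on the stand-in environment.
score, length = run_episode(CountdownEnv(horizon=3), policy=lambda obs: 0)
```

Calling `run_episode` several times and reporting each score gives the "few episodes" of results this task asks for.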