Problem: Applying a Deep Reinforcement Learning (RL) methodology to train an agent capable of solving the Lunar Lander problem in OpenAI Gym’s "LunarLander-v2" environment. The goal is for the agent to learn to land a lunar module safely between two flags.

Overview: This project applies a DQN algorithm to solve the "LunarLander-v2" environment. The DQN model approximates Q-values to guide the agent's actions, using experience replay and target network updates. Through training, the agent learns to maximize rewards by controlling the lunar module’s descent and achieving successful landings.


Step 1: Installing AI Gym, the instructions can be found at OpenAI-Lunar-Lander Links to an external site.


In [1]:
!sudo apt-get update
!sudo apt-get install python3.10-venv


Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:6 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Get:7 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,083 kB]
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Get:10 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages [2,118 kB]
Hit:11 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:13 http://security.ubuntu.com/ubuntu jammy-security/main 

In [2]:
!python3 -m venv env
!source env/bin/activate


In [3]:
!sudo apt-get install swig libpython3.10-dev


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libpython3.10-dev is already the newest version (3.10.12-1~22.04.3).
libpython3.10-dev set to manually installed.
Suggested packages:
  swig-doc swig-examples swig4.0-examples swig4.0-doc
The following NEW packages will be installed:
  swig swig4.0
0 upgraded, 2 newly installed, 0 to remove and 45 not upgraded.
Need to get 1,116 kB of archives.
After this operation, 5,542 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig4.0 amd64 4.0.2-1ubuntu1 [1,110 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 swig all 4.0.2-1ubuntu1 [5,632 B]
Fetched 1,116 kB in 2s (494 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 2.)
debconf: falling back to frontend: Readline


In [4]:
!pip install box2d-py


Collecting box2d-py
  Downloading box2d-py-2.3.8.tar.gz (374 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/374.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m112.6/374.5 kB[0m [31m3.2 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m374.5/374.5 kB[0m [31m5.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py) ... [?25l[?25hdone
  Created wheel for box2d-py: filename=box2d_py-2.3.8-cp310-cp310-linux_x86_64.whl size=2349139 sha256=948fb4ea83773d3087cb3865f0df26a84fa875bfc3e784cb0fc05221adcaae8e
  Stored in directory: /root/.cache/pip/wheels/47/01/d2/6a780da77ccb98b1d2facdd520a8d10838a03b590f6f8d50c0
Successfully built box2d-py
Installing collected packages: box2d-py
Successfully installed box2d-py-2.3.8


In [5]:
!pip install gym[box2d]==0.25.2


Collecting box2d-py==2.3.5 (from gym[box2d]==0.25.2)
  Downloading box2d-py-2.3.5.tar.gz (374 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m374.4/374.4 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pygame==2.1.0 (from gym[box2d]==0.25.2)
  Downloading pygame-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.3/18.3 MB[0m [31m33.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting swig==4.* (from gym[box2d]==0.25.2)
  Downloading swig-4.2.1-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m57.3 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: box2d-py
  Building wheel for box2d-py (setup.py) ... [?25l[?25hdone
  Created wheel for box2d-py: filename=box2d_py-2.3.5-cp310-cp310-linux_x8

Step 2. Setting up the environment

In [6]:
import copy
import os
import random

import numpy as np
import torch
from gym import make
from torch import nn
from torch.optim import Adam
from tqdm.notebook import tqdm

SEED = 42
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
GAMMA = 0.99
TAU = 1e-3
INITIAL_STEPS = 1024
TRANSITIONS = 500_000
STEPS_PER_UPDATE = 4
STEPS_PER_TARGET_UPDATE = STEPS_PER_UPDATE * 1000
BATCH_SIZE = 128
LEARNING_RATE = 5e-4
HID_DIM = 64
ENV_NAME = "LunarLander-v2"

In [7]:
def set_seed(seed: int = 42) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)


def evaluate_policy(agent, episodes=5, verbose=False):
    env = make(ENV_NAME)
    returns = []
    if verbose:
        pbar = tqdm(total=episodes)
    for _ in range(episodes):
        done = False
        state = env.reset()
        total_reward = 0.0

        while not done:
            state, reward, done, *_ = env.step(agent.act(state))
            total_reward += reward
        returns.append(total_reward)

        if verbose:
            pbar.update(1)

    return returns

  and should_run_async(code)


In [8]:
class ExperienceBuffer:
    "Buffer for DeepQNetwork"

    def __init__(self, capacity=10_000, device=DEVICE):
        self.capacity = capacity
        self.n_stored = 0
        self.next_idx = 0
        self.device = device

        self.state = None
        self.action = None
        self.next_state = None
        self.reward = None
        self.done = None

    def is_samplable(self, replay_size):
        return replay_size <= self.n_stored

    def add(
        self,
        state: list,
        action: int,
        next_state: list,
        reward: float,
        is_done: bool,
    ):
        state = torch.tensor(state)
        next_state = torch.tensor(next_state)

        if self.state is None:
            self.state = torch.empty(
                [self.capacity] + list(state.shape),
                dtype=torch.float32,
                device=self.device,
            )
            self.action = torch.empty(
                self.capacity, dtype=torch.long, device=self.device
            )
            self.next_state = torch.empty(
                [self.capacity] + list(state.shape),
                dtype=torch.float32,
                device=self.device,
            )
            self.reward = torch.empty(
                self.capacity, dtype=torch.float32, device=self.device
            )
            self.done = torch.empty(self.capacity, dtype=torch.long, device=self.device)
        self.state[self.next_idx] = state
        self.action[self.next_idx] = action
        self.next_state[self.next_idx] = next_state
        self.reward[self.next_idx] = reward
        self.done[self.next_idx] = is_done

        self.next_idx = (self.next_idx + 1) % self.capacity
        self.n_stored = min(self.capacity, self.n_stored + 1)

    def get_batch(self, replay_size=BATCH_SIZE):
        idxes = torch.randperm(self.n_stored)[:replay_size]
        return (
            self.state[idxes],
            self.action[idxes].view(-1, 1),
            self.next_state[idxes],
            self.reward[idxes].view(-1, 1),
            self.done[idxes].view(-1, 1),
        )

In [9]:
class DeepQNetworkModel(torch.nn.Module):
    "Classic DQN"

    def __init__(self, state_dim, action_dim, hid_dim=HID_DIM):
        super().__init__()
        self.hid_dim = hid_dim
        self.activation = torch.nn.ReLU()
        self.fc1 = nn.Linear(state_dim, hid_dim)
        self.fc2 = nn.Linear(hid_dim, hid_dim)
        self.fc3 = nn.Linear(hid_dim, action_dim)

    def forward(self, state):
        h = self.activation(self.fc1(state))
        h = self.activation(self.fc2(h))
        out = self.fc3(h)
        return out

In [10]:
class DQN_Agent:
    def __init__(self, state_dim, action_dim, hid_dim=64):
        self.steps = 0
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.hid_dim = hid_dim
        self._buffer = ExperienceBuffer(10**5)
        self.local_model = DeepQNetworkModel(state_dim, action_dim, hid_dim).to(DEVICE)
        self.target_model = DeepQNetworkModel(state_dim, action_dim, hid_dim).to(DEVICE)
        self.target_model.eval()
        self.optimizer = Adam(self.local_model.parameters())
        self.criterion = nn.MSELoss()

    def consume_transition(self, transition):
        self._buffer.add(*transition)

    def sample_batch(self):
        return self._buffer.get_batch()

    def train_step(self, batch):
        # Use batch to update DQN's network.
        states, actions, next_states, rewards, dones = batch

        q_pred = self.local_model(states).gather(1, actions)
        with torch.no_grad():
            q_next = self.target_model(next_states).max(1)[0].unsqueeze(1)
        q_target = rewards + GAMMA * q_next * (1 - dones)

        self.optimizer.zero_grad()
        loss = self.criterion(q_pred, q_target)
        loss.backward()
        self.optimizer.step()

        self._soft_update_target_network()

    def _soft_update_target_network(self):
        for target_param, local_param in zip(
            self.target_model.parameters(), self.local_model.parameters()
        ):
            target_param.data.copy_(
                TAU * local_param.data + (1.0 - TAU) * target_param.data
            )

    def update_target_network(self):
        self.target_model = copy.deepcopy(self.local_model)

    def act(self, state, target=False):
        state = torch.from_numpy(state).float().unsqueeze(0).to(DEVICE)

        self.local_model.eval()
        with torch.no_grad():
            action = np.argmax(self.local_model(state).cpu().numpy())
        self.local_model.train()

        return action

    def update(self, transition):
        self.consume_transition(transition)
        if self.steps % STEPS_PER_UPDATE == 0:
            batch = self.sample_batch()
            self.train_step(batch)
        if self.steps % STEPS_PER_TARGET_UPDATE == 0:
            self.update_target_network()
        self.steps += 1

    def save(self):
        torch.save(self.local_model.state_dict(), "agent.pth")

Step 3. Training the model

In [11]:
set_seed(SEED)
env = make("LunarLander-v2")
dqn = DQN_Agent(state_dim=env.observation_space.shape[0], action_dim=env.action_space.n, hid_dim=HID_DIM)
eps = 0.1
state = env.reset()

for _ in range(INITIAL_STEPS):
    action = env.action_space.sample()

    next_state, reward, done, *_ = env.step(action)
    dqn.consume_transition((state, action, next_state, reward, done))

    state = next_state if not done else env.reset()

best_avg_rewards = -np.inf
# pbar = tqdm(total=TRANSITIONS)
for i in range(TRANSITIONS):
    # Epsilon-greedy policy
    if random.random() < eps:
        action = env.action_space.sample()
    else:
        action = dqn.act(state)

    next_state, reward, done, *_ = env.step(action)
    dqn.update((state, action, next_state, reward, done))

    state = next_state if not done else env.reset()

#     pbar.update(1)

    if (i + 1) % (TRANSITIONS // 100) == 0:
        rewards = evaluate_policy(dqn, 5)
        avg_reward = np.mean(rewards)
#         pbar.set_description(
#             f"Best reward mean: {best_avg_rewards:.2f}, Reward mean: {avg_reward:.2f}, Reward std: {np.std(rewards):.2f}"
#         )
        print(f"Step: {i + 1}/{TRANSITIONS}, Best reward mean: {best_avg_rewards:.2f}, Reward mean: {avg_reward:.2f}, Reward std: {np.std(rewards):.2f}")
        if avg_reward > best_avg_rewards:
            best_avg_rewards = avg_reward
            dqn.save()

  deprecation(
  deprecation(
  if not isinstance(terminated, (bool, np.bool8)):
  deprecation(
  deprecation(


Step: 5000/500000, Best reward mean: -inf, Reward mean: -97.10, Reward std: 12.24
Step: 10000/500000, Best reward mean: -97.10, Reward mean: -221.78, Reward std: 85.28
Step: 15000/500000, Best reward mean: -97.10, Reward mean: -122.09, Reward std: 22.70
Step: 20000/500000, Best reward mean: -97.10, Reward mean: -54.35, Reward std: 111.13
Step: 25000/500000, Best reward mean: -54.35, Reward mean: -263.92, Reward std: 74.99
Step: 30000/500000, Best reward mean: -54.35, Reward mean: -251.72, Reward std: 20.79
Step: 35000/500000, Best reward mean: -54.35, Reward mean: -12.44, Reward std: 98.63
Step: 40000/500000, Best reward mean: -12.44, Reward mean: -156.08, Reward std: 41.81
Step: 45000/500000, Best reward mean: -12.44, Reward mean: -176.78, Reward std: 43.26
Step: 50000/500000, Best reward mean: -12.44, Reward mean: 67.96, Reward std: 82.26
Step: 55000/500000, Best reward mean: 67.96, Reward mean: -72.92, Reward std: 20.99
Step: 60000/500000, Best reward mean: 67.96, Reward mean: -50.8

Step 4. Playing an episode of the problem using the agent.

In [14]:
set_seed(SEED)

In [15]:
class Agent:
    def __init__(self, weights="agent.pth"):
        self.model = DeepQNetworkModel(8, 4, 64)
        weights = torch.load(weights, map_location=DEVICE)
        self.model.load_state_dict(weights)
        self.model.to(DEVICE)
        self.model.eval()

    def act(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0).to(DEVICE)
        with torch.no_grad():
            action = np.argmax(self.model(state).cpu().numpy())
        return action

In [25]:
AGENT_WEIGHTS_PATH = "agent.pth"
agent = Agent(AGENT_WEIGHTS_PATH)
rewards = evaluate_policy(agent, 5, True)
print("Average reward on 5 episodes:", np.mean(rewards))

  0%|          | 0/5 [00:00<?, ?it/s]

Average reward on 5 episodes: 224.8539570615434


In [26]:
import glob
import io
import base64
from gym.wrappers.monitoring import video_recorder
from IPython import display

def show_video(env_name, video_dir="."):
    mp4list = glob.glob(f'{video_dir}/*.mp4')
    if len(mp4list) > 0:
        mp4 = f'{video_dir}/{env_name}.mp4'
        video = io.open(mp4, 'rb').read()
        encoded = base64.b64encode(video)
        display.display(display.HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
         display.display(FileLink(mp4, result_html_prefix="Click here to download: "))
    else:
        print("Could not find video")

def render_video_of_model(agent, env_name):
    env = make(env_name)
    vid = video_recorder.VideoRecorder(env, path=f"{env_name}.mp4")
    state = env.reset()
    done = False
    while not done:
        frame = env.render(mode='rgb_array')
        vid.capture_frame()

        action = agent.act(state)

        state, reward, done, _ = env.step(action)
    env.close()


render_video_of_model(agent, ENV_NAME)
show_video(ENV_NAME)
#show_video(ENV_NAME, video_dir="rl-agents")

  logger.deprecation(
  logger.deprecation(
  self.pid = _posixsubprocess.fork_exec(
  if not isinstance(terminated, (bool, np.bool8)):
See here for more information: https://www.gymlibrary.ml/content/api/[0m
  deprecation(


Step 6. Discussions

1. The agent was evaluated over 5 episodes to measure its performance consistency.
2. The agent achieved an average reward of approximately 224.85 across 5 episodes.
3. The success threshold for a successful landing is set at 200 points in "LunarLander-v2". The agent exceeded this threshold in all episodes.
4. Success Rate: 100%, as all tested episodes resulted in rewards above the success threshold.
5. The consistently high rewards suggest that the agent has effectively learned a successful strategy for the Lunar Lander task.