# **Homework 12 - Reinforcement Learning**

If you have any problems, e-mail us at mlta-2023-spring@googlegroups.com



## Preliminary work

First, we need to install all necessary packages.
One of them, gym, built by OpenAI, is a toolkit for developing reinforcement learning algorithms. The other packages are for visualization in Colab.

In [1]:
!apt update
!apt install python-opengl xvfb -y
!pip install -q swig
!pip install box2d==2.3.2 gym[box2d]==0.25.2 box2d-py pyvirtualdisplay tqdm numpy==1.22.4
!pip install box2d==2.3.2 box2d-kengz
!pip freeze > requirements.txt


[33m0% [Working][0m            Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
[33m0% [Connecting to archive.ubuntu.com (185.125.190.36)] [Waiting for headers] [W[0m                                                                               Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]


Next, set up a virtual display and import all necessary packages.

In [2]:
%%capture
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

%matplotlib inline
import matplotlib.pyplot as plt

from IPython import display

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
from tqdm.notebook import tqdm

# Warning! Do not change the random seed!
# Otherwise, your submission on JudgeBoi will not reproduce your result!
Fix the random seed so that your HW result is reproducible.


In [3]:
seed = 2023 # Do not change this
def fix(env, seed):
  env.seed(seed)
  env.action_space.seed(seed)
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False
  np.random.seed(seed)
  torch.manual_seed(seed)
  if torch.cuda.is_available():
      torch.cuda.manual_seed_all(seed)
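
The idea behind `fix` can be illustrated with a minimal sketch. It uses the standard-library `random` module instead of the gym, NumPy, and PyTorch generators seeded above, and `sample_actions` is a hypothetical helper, not part of the homework code. Re-seeding a generator with the same value replays the same sequence of random choices.

```python
import random

def sample_actions(seed, n=5, num_actions=4):
    """Draw n random action indices from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, does not touch global state
    return [rng.randrange(num_actions) for _ in range(n)]

first = sample_actions(2023)
second = sample_actions(2023)

# Same seed, same sequence: this is what makes a run reproducible.
print(first == second)
```

The same reasoning applies to every source of randomness in the notebook, which is presumably why `fix` seeds the environment, its action space, NumPy, and PyTorch together.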

Last, import gym and build a [Lunar Lander](https://gym.openai.com/envs/LunarLander-v2/) environment.

In [4]:
%%capture
import gym
import random
env = gym.make('LunarLander-v2')
fix(env, seed) # fix the environment Do not revise this !!!

## D3QN

REF: https://github.com/DanielPalaio/LunarLander-v2_DeepRL/tree/main

In [5]:
class ReplayBuffer():
    def __init__(self, size, input_shape):
        self.size = size
        self.counter = 0
        self.state_buffer = np.zeros((self.size, *input_shape), dtype=np.float32)
        self.action_buffer = np.zeros(self.size, dtype=np.int32)
        self.reward_buffer = np.zeros(self.size, dtype=np.float32)
        self.new_state_buffer = np.zeros((self.size, *input_shape), dtype=np.float32)
        self.terminal_buffer = np.zeros(self.size, dtype=np.bool_)

    def store_tuples(self, state, action, reward, new_state, done):
        idx = self.counter % self.size
        self.state_buffer[idx] = state
        self.action_buffer[idx] = action
        self.reward_buffer[idx] = reward
        self.new_state_buffer[idx] = new_state
        self.terminal_buffer[idx] = done
        self.counter += 1

    def sample_buffer(self, batch_size):
        max_buffer = min(self.counter, self.size)
        batch = np.random.choice(max_buffer, batch_size, replace=False)
        state_batch = self.state_buffer[batch]
        action_batch = self.action_buffer[batch]
        reward_batch = self.reward_buffer[batch]
        new_state_batch = self.new_state_buffer[batch]
        done_batch = self.terminal_buffer[batch]

        return state_batch, action_batch, reward_batch, new_state_batch, done_batch
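
The index arithmetic in `store_tuples` can be seen in isolation with a minimal sketch. It uses plain Python lists instead of the NumPy arrays above, and `capacity` and `slots` are illustrative names only. Once the buffer is full, `counter % size` wraps around and the oldest transition is overwritten.

```python
capacity = 3
slots = [None] * capacity
counter = 0

for transition in ["t0", "t1", "t2", "t3", "t4"]:
    idx = counter % capacity   # wraps to 0 once the buffer is full
    slots[idx] = transition    # overwrites the oldest stored entry
    counter += 1

# Number of slots that hold real data (mirrors max_buffer in sample_buffer)
valid = min(counter, capacity)

print(slots, valid)  # slots is now ["t3", "t4", "t2"], valid is 3
```

Note that `sample_buffer` draws without replacement, so it needs at least `batch_size` stored transitions; the agent's `train` method guards against this by returning early while `counter < batch_size`.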



In [6]:
class DuelingDoubleDeepQNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Sequential(
          nn.Linear(8, 128),
          nn.ReLU(),
        )
        self.fc2 = nn.Sequential(
          nn.Linear(128, 128),
          nn.ReLU(),
        )
        self.V = nn.Linear(128, 1)
        self.A = nn.Linear(128, 4)
        self.optimizer = optim.Adam(self.parameters(), lr=0.00075)

    def forward(self, state):
        x = self.fc1(state)
        x = self.fc2(x)
        V = self.V(x)
        A = self.A(x)
        avg_A = torch.mean(A, dim=-1, keepdim=True)
        Q = (V + (A - avg_A))

        return Q, A
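
The aggregation in `forward` can be checked by hand with a small worked example. This is a sketch in plain Python with made-up numbers, and `dueling_q` is a hypothetical helper rather than part of the network class. Because every action receives the same shift `V - mean(A)`, the greedy action is the same whether it is read from `Q` or from `A`.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages: Q = V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]

value = 2.0                          # illustrative V(s)
advantages = [0.5, -1.0, 3.0, 1.5]   # illustrative A(s, a) for 4 actions

q = dueling_q(value, advantages)
best_from_q = max(range(len(q)), key=q.__getitem__)
best_from_a = max(range(len(advantages)), key=advantages.__getitem__)

print(q)                         # [1.5, 0.0, 4.0, 2.5]
print(best_from_q, best_from_a)  # 2 2
```

Subtracting the mean advantage also makes the mean of the Q-values equal to `V`, which keeps the two heads identifiable.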

In [7]:
class D3QNAgent():
  def __init__(self, discount_factor=0.99, num_actions=4, epsilon=1.0, batch_size=64, input_dim=(8,)):
      self.action_space = [i for i in range(num_actions)]
      self.discount_factor = discount_factor
      self.epsilon = epsilon
      self.batch_size = batch_size
      self.epsilon_decay = 0.001
      self.epsilon_final = 0.01
      self.update_rate = 120
      self.step_counter = 0
      self.buffer = ReplayBuffer(100000, input_dim)
      self.q_net = DuelingDoubleDeepQNetwork()
      self.q_target_net = DuelingDoubleDeepQNetwork()
      self.max_reward = 0

  def store_tuple(self, state, action, reward, new_state, done):
      self.buffer.store_tuples(state, action, reward, new_state, done)

  def policy(self, observation):
      if np.random.random() < self.epsilon:
          action = np.random.choice(self.action_space)
      else:
          state = np.array([observation])
          _, actions = self.q_net(torch.from_numpy(state))
          action = np.argmax(actions.detach().numpy(), axis=1)[0]

      return action
  def train(self):
    if self.buffer.counter < self.batch_size:
        return
    if self.step_counter % self.update_rate == 0:
        for q_target_params, q_params in zip(self.q_target_net.parameters(), self.q_net.parameters()):
            q_target_params.data.copy_(q_params)

    self.q_net.train()
    states, actions, rewards, next_states, terminals = self.buffer.sample_buffer(self.batch_size)
    batch_idx = torch.arange(self.batch_size, dtype=torch.long)
    states_tensor = torch.tensor(states, dtype=torch.float)
    actions_tensor = torch.tensor(actions, dtype=torch.long)
    rewards_tensor = torch.tensor(rewards, dtype=torch.float)
    next_states_tensor = torch.tensor(next_states, dtype=torch.float)
    terminals_tensor = torch.tensor(terminals)

    with torch.no_grad():
        q_, _ = self.q_target_net.forward(next_states_tensor)
        q2, _ = self.q_net.forward(next_states_tensor)
        max_actions = torch.argmax(q2, dim=-1)
        q_[terminals_tensor] = 0.0
        target = rewards_tensor + self.discount_factor * q_[batch_idx, max_actions]
    q, _ = self.q_net.forward(states_tensor)

    loss = F.mse_loss(q[batch_idx, actions_tensor], target.detach())
    self.q_net.optimizer.zero_grad()
    loss.backward()
    self.q_net.optimizer.step()

    self.epsilon = self.epsilon - self.epsilon_decay if self.epsilon > self.epsilon_final else self.epsilon_final
    self.step_counter += 1

  def test_try(self):
    fix(env, seed)
    self.q_net.eval()  # set the network into evaluation mode
    NUM_OF_TEST = 5 # Do not revise this !!!
    test_total_reward = []
    action_list = []
    for i in range(NUM_OF_TEST):
      actions = []
      state = env.reset()

      #img = plt.imshow(env.render(mode='rgb_array'))

      total_reward = 0

      done = False
      while not done:
          action = self.policy(state)
          actions.append(action)
          state, reward, done, _ = env.step(action)

          total_reward += reward

          #img.set_data(env.render(mode='rgb_array'))
          #display.display(plt.gcf())
          #display.clear_output(wait=True)

      #print(total_reward)
      test_total_reward.append(total_reward)

      action_list.append(actions) # save the result of testing

    #print(np.mean(test_total_reward))
    if np.mean(test_total_reward) > self.max_reward:
      self.max_reward = np.mean(test_total_reward)
      print('new record:', self.max_reward)
      distribution = {}
      for actions in action_list:
        for action in actions:
          if action not in distribution.keys():
            distribution[action] = 1
          else:
            distribution[action] += 1
      print(distribution)

      PATH = "Action_List.npy" # Can be modified into the name or path you want
      # Episodes differ in length, so store the ragged list as an object array
      np.save(PATH, np.array(action_list, dtype=object))
    else:
      print(np.mean(test_total_reward))


  def train_model(self, env, num_episodes):
    scores, episodes, avg_scores, obj = [], [], [], []
    goal = 200
    f = 0
    txt = open("saved_networks.txt", "w")

    for i in range(num_episodes):
        done = False
        score = 0.0
        state = env.reset()
        while not done:
            action = self.policy(state)
            new_state, reward, done, _ = env.step(action)
            score += reward
            self.store_tuple(state, action, reward, new_state, done)
            state = new_state
            self.train()
        scores.append(score)
        obj.append(goal)
        episodes.append(i)
        avg_score = np.mean(scores[-100:])
        avg_scores.append(avg_score)
        print("Episode {0}/{1}, Score: {2} ({3}), AVG Score: {4}".format(i, num_episodes, score, self.epsilon, avg_score))
        if avg_score >= 180.0 and score >= 230:
            self.test_try()
    plt.plot(avg_scores)
    plt.title("Average Rewards")
    plt.show()
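
The target computed in `train` follows the Double DQN pattern: the online network chooses the next action, and the target network evaluates it. The sketch below shows this for a single transition in plain Python. `double_dqn_target` and the numbers are illustrative assumptions, not taken from the agent above.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Bootstrapped target for one transition.

    q_online_next: Q-values of the next state from the online network
    q_target_next: Q-values of the next state from the target network
    """
    # The online network selects the action...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ...and the target network evaluates it; terminal states do not bootstrap.
    bootstrap = 0.0 if done else q_target_next[best_action]
    return reward + gamma * bootstrap

q_online_next = [0.2, 0.9, 0.1, 0.4]   # online net prefers action 1
q_target_next = [1.0, 0.5, 2.0, 0.3]   # target net's largest value is action 2

non_terminal = double_dqn_target(1.0, False, q_online_next, q_target_next)
terminal = double_dqn_target(1.0, True, q_online_next, q_target_next)

print(non_terminal)  # about 1.495 (uses 0.5, not the target net's max of 2.0)
print(terminal)      # 1.0
```

A plain DQN target would take the maximum of `q_target_next` directly, which tends to overestimate values; decoupling selection from evaluation is meant to reduce that bias.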



In [None]:
env = gym.make("LunarLander-v2")
spec = gym.spec("LunarLander-v2")

num_episodes = 800

d3qn_agent = D3QNAgent(discount_factor=0.99, num_actions=4, epsilon=1.0, batch_size=64, input_dim=[8])

d3qn_agent.train_model(env, num_episodes)

Episode 0/800, Score: -100.78185180717264 (0.973), AVG Score: -100.78185180717264
Episode 1/800, Score: -265.0908979523888 (0.8999999999999999), AVG Score: -182.93637487978071
Episode 2/800, Score: -87.11924857016923 (0.8159999999999998), AVG Score: -150.99733277657688
Episode 3/800, Score: -248.67310463223433 (0.6949999999999997), AVG Score: -175.41627574049124
Episode 4/800, Score: -484.6653833796269 (0.5439999999999996), AVG Score: -237.2660972683184
Episode 5/800, Score: -449.213579209247 (0.4289999999999995), AVG Score: -272.59067759180647
Episode 6/800, Score: -111.70910277690714 (0.2989999999999994), AVG Score: -249.60759547539232
Episode 7/800, Score: -297.588531905625 (0.14399999999999924), AVG Score: -255.60521252917138
Episode 8/800, Score: -448.54353326855374 (0.01), AVG Score: -277.0428037224361
Episode 9/800, Score: -464.3633629798325 (0.01), AVG Score: -295.7748596481757
Episode 10/800, Score: -160.04142351569968 (0.01), AVG Score: -283.43545636340514
Episode 11/800, Sco
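
The epsilon values in the log above fall to 0.01 within the first nine episodes because `train` lowers epsilon once per environment step, not once per episode. A minimal sketch that replays the same update rule with the agent's default constants shows how many training steps that takes:

```python
epsilon, epsilon_decay, epsilon_final = 1.0, 0.001, 0.01

steps = 0
while epsilon != epsilon_final:
    # Same update rule as in D3QNAgent.train
    epsilon = epsilon - epsilon_decay if epsilon > epsilon_final else epsilon_final
    steps += 1

print(steps)  # roughly 990 training steps
```

With only about a thousand exploratory steps, the agent acts almost greedily for nearly all of the 800 episodes; a smaller `epsilon_decay` would be one way to explore for longer, at the cost of slower early improvement.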

In [12]:
from google.colab import files
files.download("Action_List.npy")


## Reference

Below are some useful references to help you get a high score.

- [DRL Lecture 1: Policy Gradient (Review)](https://youtu.be/z95ZYgPgXOY)
- [ML Lecture 23-3: Reinforcement Learning (including Q-learning) start at 30:00](https://youtu.be/2-JNBzCq77c?t=1800)
- [Lecture 7: Policy Gradient, David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf)
