# **Homework 12 - Reinforcement Learning**

If you have any problems, e-mail us at ntu-ml-2022spring-ta@googlegroups.com



In [None]:
import torch
GPU_name = torch.cuda.get_device_name()
print("Your GPU is {}!".format(GPU_name))

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

## Preliminary work

First, we need to install all necessary packages.
One of them, gym, built by OpenAI, is a toolkit for developing reinforcement learning algorithms. The other packages are for visualization in Colab.

In [None]:
!apt update
!apt install python-opengl xvfb -y
!pip install gym[box2d]==0.18.3 pyvirtualdisplay tqdm numpy==1.19.5 torch==1.8.1


Next, set up the virtual display and import all necessary packages.

In [None]:
%%capture
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

%matplotlib inline
import matplotlib.pyplot as plt

from IPython import display

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
from tqdm.notebook import tqdm

# Warning! Do not change the random seed!!!
# Otherwise your submission on JudgeBoi will not reproduce your result!!!
Make sure your homework result is reproducible.
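
As a minimal illustration of why a fixed seed matters, the sketch below reseeds Python's built-in `random` module and draws the same numbers twice. The helper name `draw` is hypothetical and is not part of the homework code; the notebook's `fix()` function applies the same idea to gym, NumPy, and PyTorch.

```python
import random

SEED = 543  # same value as the notebook's seed; any fixed integer behaves the same way


def draw(n, seed):
    """Return n pseudo-random floats drawn right after seeding Python's RNG."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first_run = draw(5, SEED)
second_run = draw(5, SEED)
other_seed = draw(5, SEED + 1)

print(first_run == second_run)  # compares two runs that used the same seed
print(first_run == other_seed)  # compares runs that used different seeds
```

Changing the seed changes every random draw that follows, which is why a submission generated with a different seed would not replay the same way on the judge.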


In [None]:
seed = 543 # Do not change this
def fix(env, seed):
  env.seed(seed)
  env.action_space.seed(seed)
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.cuda.manual_seed_all(seed)
  np.random.seed(seed)
  random.seed(seed)
#   torch.set_deterministic(True)
  torch.backends.cudnn.benchmark = False
  torch.backends.cudnn.deterministic = True

Last, call gym and build a [Lunar Lander](https://gym.openai.com/envs/LunarLander-v2/) environment.

In [None]:
%%capture
import gym
import random
env = gym.make('LunarLander-v2')
fix(env, seed) # fix the environment Do not revise this !!!

## What is Lunar Lander?

"LunarLander-v2" simulates a craft landing on the surface of the moon.

The task is to land the craft "safely" on the pad between the two yellow flags.
> Landing pad is always at coordinates (0,0).
> Coordinates are the first two numbers in state vector.

![](https://gym.openai.com/assets/docs/aeloop-138c89d44114492fd02822303e6b4b07213010bb14ca5856d2d49d6b62d88e53.svg)

"LunarLander-v2" actually includes "Agent" and "Environment". 

In this homework, we will utilize the function `step()` to control the action of "Agent". 

Then `step()` will return the observation/state and reward given by the "Environment".
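
The agent/environment loop described above can be sketched without gym. `CountdownEnv` and `run_episode` below are hypothetical stand-ins written for this illustration; they only mimic the `reset()` / `step()` interface and the four return values that this notebook uses with LunarLander.

```python
class CountdownEnv:
    """Hypothetical toy environment with a Gym-like reset()/step() interface."""

    def __init__(self, start=5):
        self.start = start
        self.state = start

    def reset(self):
        # Return the initial state, like env.reset() in this notebook.
        self.state = self.start
        return self.state

    def step(self, action):
        # Action 1 moves one step toward zero; action 0 does nothing.
        if action == 1:
            self.state -= 1
        reward = 1.0 if action == 1 else -0.1
        done = self.state == 0
        return self.state, reward, done, {}


def run_episode(env, policy, max_steps=100):
    """Let `policy` (a function state -> action) interact with `env` for one episode."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return total, state


total_reward, final_state = run_episode(CountdownEnv(), lambda s: 1)
print(total_reward, final_state)
```

The same loop shape (reset, choose an action, step, accumulate reward, stop on `done`) is what the training and testing cells in this notebook use.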

### Observation / State

First, we can take a look at what an Observation / State looks like.

In [None]:
print(env.observation_space)


`Box(8,)` means that the observation is an 8-dimensional vector.

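
To make the 8 numbers easier to read, one option is to give them names. The field names below are our own labels, and the order is an assumption based on the Gym LunarLander-v2 source (position, velocity, angle, angular velocity, then the two leg-contact flags); the example observation is made up.

```python
from collections import namedtuple

# Field names are illustrative labels; the order is assumed from the Gym source.
LanderState = namedtuple(
    "LanderState",
    [
        "x", "y",                 # position; the landing pad is at (0, 0)
        "vel_x", "vel_y",         # linear velocity
        "angle", "angular_vel",   # orientation and its rate of change
        "left_leg_contact", "right_leg_contact",  # 1.0 if the leg touches the ground
    ],
)

# A made-up observation, used only to show the unpacking.
example_observation = [0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0]
state = LanderState(*example_observation)
print(state.x, state.y)
```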
### Action

The actions the agent can take look like this:

In [None]:
print(env.action_space)
ACTION_NUM = 4

`Discrete(4)` means that the agent can take four kinds of actions.
- 0: do nothing
- 2: fire the main engine
- 1, 3: fire the left and right orientation engines, respectively
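
When inspecting what an agent did, readable labels can be more convenient than raw indices. The mapping below is a small sketch; the labels are ours and the index order is assumed from the Gym LunarLander-v2 documentation. `describe_actions` is a hypothetical helper, not part of the homework code.

```python
# Labels are illustrative; indices are assumed to follow the Gym LunarLander-v2 order.
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


def describe_actions(actions):
    """Translate a list of action indices into readable labels."""
    return [ACTION_NAMES[a] for a in actions]


print(describe_actions([0, 2, 2, 1]))
```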

Next, we will make the agent interact with the environment.
Before taking any action, we recommend calling the `reset()` function to reset the environment. This function also returns the initial state of the environment.

In [None]:
initial_state = env.reset()
print(initial_state)
STATE_DIM = initial_state.shape[0]

Then, we try to get a random action from the agent's action space.

In [None]:
random_action = env.action_space.sample()
print(random_action)

Next, we can use `step()` to make the agent act according to the randomly selected `random_action`.
The `step()` function returns four values:
- observation / state
- reward
- done (True / False)
- other information

In [None]:
observation, reward, done, info = env.step(random_action)

In [None]:
print(done)

### Reward


> Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. 
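
The fixed terms in the description above can be added up by hand. The sketch below does only that: it ignores the position- and speed-dependent part of the reward (the "100..140 points" for approaching the pad), so it is not the environment's actual reward function. The function name and its parameters are hypothetical.

```python
def fixed_reward_terms(main_engine_frames, leg_contacts, outcome):
    """Sum the fixed reward terms quoted above (approach reward excluded).

    outcome: "crash", "rest", or None if the episode was cut off early.
    """
    reward = -0.3 * main_engine_frames  # main engine costs 0.3 per frame
    reward += 10 * leg_contacts         # +10 for each leg ground contact
    if outcome == "crash":
        reward -= 100
    elif outcome == "rest":
        reward += 100
    return reward


# Example: 100 frames of main engine, both legs down, lander comes to rest.
print(fixed_reward_terms(main_engine_frames=100, leg_contacts=2, outcome="rest"))
```

Under these assumptions, a clean landing earns its 200 points mostly from the approach reward and the final +100, while long main-engine burns eat into the total.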

In [None]:
print(reward)

### Random Agent
Finally, before we start training, we can see whether a random agent can successfully land on the moon.

In [None]:
env.reset()

img = plt.imshow(env.render(mode='rgb_array'))

done = False
while not done:
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)

    img.set_data(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)

## Deep Q-learning with Experience Replay
Reference: https://github.com/mlefkovitz/Lunar-Lander/blob/master/DQN%20Lunar%20Lander.py

DQN introduction: https://medium.com/雞雞與兔兔的工程世界/機器學習-ml-note-reinforcement-learning-強化學習-dqn-實作atari-game-7f9185f833b0
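
For orientation, the update implemented by `PolicyAgent.learn` and `PolicyAgent.soft_update` in this section can be summarized as follows. The notation is ours: $\theta$ denotes the parameters of `local_qnetwork`, $\theta^-$ those of `target_qnetwork`, $d_i$ the `done` flag, $B$ the batch size, $\gamma$ is `GAMMA`, and $\tau$ is `TAU`.

$$
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q_{\theta^-}(s'_i, a')
$$

$$
L(\theta) = \frac{1}{B} \sum_{i=1}^{B} \bigl( Q_{\theta}(s_i, a_i) - y_i \bigr)^2
$$

$$
\theta^- \leftarrow \tau\,\theta + (1 - \tau)\,\theta^-
$$

The first line is the target for each sampled transition (no bootstrapping when the episode has ended), the second is the mean-squared error minimized by the optimizer, and the third is the soft update that slowly moves the target network toward the local network.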

In [None]:
import numpy as np
from collections import namedtuple, deque
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 64         # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR = 5e-4               # learning rate
UPDATE_EVERY = 4        # learn once every UPDATE_EVERY environment steps

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
class QNetwork(nn.Module):
    def __init__(self, fc1_dim=64, fc2_dim=64):
        super(QNetwork, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(STATE_DIM, fc1_dim),
            nn.ReLU(),
            nn.Linear(fc1_dim, fc2_dim),
            nn.ReLU(),
            nn.Linear(fc2_dim, ACTION_NUM)
        )
        
    def forward(self, state):
        return self.network(state)

In [None]:
class ReplayBuffer():
    def __init__(self, buffer_size=BUFFER_SIZE, batch_size=BATCH_SIZE):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.experience = namedtuple('Experience', 'state, action, reward, next_state, done')

    def add(self, state, action, reward, next_state, done):
        exp = self.experience(state, action, reward, next_state, done)
        self.memory.append(exp)

    def sample(self):
        exp_batch = random.sample(self.memory, k=self.batch_size)

        state_batch  = torch.FloatTensor(np.vstack([exp.state for exp in exp_batch  if exp is not None])).to(device)
        action_batch  = torch.LongTensor(np.vstack([exp.action for exp in exp_batch  if exp is not None])).to(device)
        reward_batch  = torch.FloatTensor(np.vstack([exp.reward for exp in exp_batch  if exp is not None])).to(device)
        next_state_batch  = torch.FloatTensor(np.vstack([exp.next_state for exp in exp_batch  if exp is not None])).to(device)
        done_batch  = torch.FloatTensor(np.vstack([exp.done for exp in exp_batch  if exp is not None]).astype(np.uint8)).to(device)

        return state_batch, action_batch, reward_batch, next_state_batch, done_batch 

    def __len__(self):
        return len(self.memory)


Line 28 of the cell below (`if random.random() > epsilon:` in `act()`) draws a random number between 0 and 1.

If the number is greater than epsilon,
the agent follows the decision of the policy (the action with the highest Q-value); otherwise it takes a random action.

In the early stage of the training process, epsilon is close to 1
because the `QNetwork` is not yet well trained,
so the agent tends to pick a random action rather than follow the decision of the policy.
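
The selection rule described above is usually called epsilon-greedy. Here is a minimal, framework-free sketch of it; `epsilon_greedy` and the Q-values are made up for illustration, while the agent in this notebook applies the same comparison to the output of its `QNetwork`.

```python
import random


def epsilon_greedy(action_values, epsilon):
    """Return the greedy action with probability 1 - epsilon, else a random one."""
    if random.random() > epsilon:
        # Exploit: index of the largest estimated action value.
        return max(range(len(action_values)), key=lambda a: action_values[a])
    # Explore: any action, chosen uniformly (may coincide with the greedy one).
    return random.randrange(len(action_values))


q_values = [0.1, 0.5, -0.2, 0.3]  # made-up Q-value estimates for 4 actions
random.seed(0)

greedy_action = epsilon_greedy(q_values, epsilon=0.0)
explored = [epsilon_greedy(q_values, epsilon=1.0) for _ in range(1000)]

print(greedy_action)          # index of the largest entry in q_values
print(sorted(set(explored)))  # distinct actions seen while exploring
```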



In [None]:
class PolicyAgent():
    def __init__(self):
        self.local_qnetwork = QNetwork().to(device)
        self.target_qnetwork = QNetwork().to(device)
        self.optimizer = optim.Adam(self.local_qnetwork.parameters(), lr=LR)

        self.memory = ReplayBuffer()

        self.t_step = 0

    def step(self, state, action, reward, next_state, done):
        self.memory.add(state, action, reward, next_state, done)

        self.t_step = self.t_step + 1
        self.t_step = self.t_step % UPDATE_EVERY
        if self.t_step == 0:
            if len(self.memory) > BATCH_SIZE:
                exp_batch = self.memory.sample()
                self.learn(exp_batch, GAMMA)

    def act(self, state, epsilon=0.):
        state = torch.FloatTensor(state).unsqueeze(0).to(device) # (state_dim) -> (batch_size=1, state_dim)
        self.local_qnetwork.eval()
        with torch.no_grad():
            action_values = self.local_qnetwork(state)
        self.local_qnetwork.train()

        if random.random() > epsilon: 
            return torch.argmax(action_values.cpu()).item()
        else:
            return random.choice(np.arange(ACTION_NUM))
    
    def learn(self, exp_batch, gamma):
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = exp_batch
        Q_next_targets = self.target_qnetwork(next_state_batch).detach().max(1)[0].unsqueeze(1)
        Q_targets = reward_batch + (gamma * Q_next_targets * (1 - done_batch))
        Q_expected = self.local_qnetwork(state_batch).gather(1, action_batch)

        loss = F.mse_loss(Q_expected, Q_targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        self.soft_update(self.local_qnetwork, self.target_qnetwork, TAU)

    def soft_update(self, local_qnetwork, target_qnetwork, tau):
        for local_param, target_param in zip(local_qnetwork.parameters(), target_qnetwork.parameters()):
            target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)

In [None]:
agent = PolicyAgent()

## Training Agent

Now let's start training our agent.
By using all the interactions between the agent and the environment as training data, the agent's network can learn from all these attempts.
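
The training loop below also shrinks epsilon after every episode. To get a feel for how fast exploration fades, the sketch here reproduces that schedule on its own, using the same `eps_start`, `eps_end`, and `eps_decay` values as the training cell. `epsilon_schedule` is a hypothetical helper introduced only for this illustration.

```python
def epsilon_schedule(n_episodes, eps_start=1.0, eps_end=0.01, eps_decay=0.996):
    """Return the epsilon used in each episode: multiplicative decay with a floor."""
    schedule = []
    eps = eps_start
    for _ in range(n_episodes):
        schedule.append(eps)
        eps = max(eps_end, eps * eps_decay)
    return schedule


schedule = epsilon_schedule(2000)

# First episode in which epsilon has reached its floor value.
first_floor_episode = next(i for i, e in enumerate(schedule) if e == 0.01)

print(schedule[0], schedule[100], schedule[500])
print(first_floor_episode)
```

With these settings epsilon reaches its floor after roughly 1,150 episodes (since ln(0.01) / ln(0.996) is about 1149), so most of the 8000 training episodes run with almost purely greedy actions.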

In [None]:
n_episodes = 8000   
max_time = 1000

eps_start=1.0
eps_end=0.01
eps_decay=0.996
eps = eps_start
total_rewards_window = deque(maxlen=100)
total_rewards = []
best_adv_reward = 0

prg_bar = tqdm(range(n_episodes))
for episode in prg_bar:
    state = env.reset()
    total_reward = 0

    for time in range(max_time):
        action = agent.act(state, eps) 
        next_state, reward, done, _ = env.step(action)
        agent.step(state, action, reward, next_state, done)

        state = next_state
        total_reward += reward

        if done:
            break

    total_rewards_window.append(total_reward)
    total_rewards.append(total_reward)

    eps = max(eps_end, eps * eps_decay)

    print(f'\rEpisode {episode}\tAverage Score: {np.mean(total_rewards_window):.2f}', end="")
    if np.mean(total_rewards_window) >= best_adv_reward:
        best_adv_reward = np.mean(total_rewards_window)
        print(f'\nBest model saved in {episode - 100} episodes!\tAverage Score: {best_adv_reward:.2f}')
        torch.save(agent.local_qnetwork.state_dict(), './gdrive/MyDrive/ML2022/ML2022_hw12/best_checkpoint.pth')

print(f'\nTraining completed!\tAverage Score: {np.mean(total_rewards_window):.2f}')
torch.save(agent.local_qnetwork.state_dict(), './gdrive/MyDrive/ML2022/ML2022_hw12/last_checkpoint.pth')

### Training Result
 


In [None]:
plt.plot(total_rewards)
plt.title("Total Rewards")
plt.show()

## Testing
The testing result is the average reward over 5 test episodes.

In [None]:
fix(env, seed)
agent.local_qnetwork.load_state_dict(torch.load('./gdrive/MyDrive/ML2022/ML2022_hw12/best_checkpoint.pth'))
agent.local_qnetwork.eval()
NUM_OF_TEST = 5 # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
  actions = []
  state = env.reset()

  img = plt.imshow(env.render(mode='rgb_array'))

  total_reward = 0

  done = False
  while not done:
      action = agent.act(state)
      actions.append(action)
      state, reward, done, _ = env.step(action)

      total_reward += reward

      img.set_data(env.render(mode='rgb_array'))
      display.display(plt.gcf())
      display.clear_output(wait=True)
      
  print(total_reward)
  test_total_reward.append(total_reward)

  action_list.append(actions) # save the result of testing 


In [None]:
print(np.mean(test_total_reward))

Action list

In [None]:
print("Action list looks like ", action_list)
print("Action list's shape looks like ", np.shape(action_list))

Analysis of actions taken by agent

In [None]:
distribution = {}
for actions in action_list:
  for action in actions:
    if action not in distribution.keys():
      distribution[action] = 1
    else:
      distribution[action] += 1
print(distribution)

Saving the result of Model Testing


In [None]:
PATH = "./gdrive/MyDrive/ML2022/ML2022_hw12/Action_List.npy" # Can be modified into the name or path you want
np.save(PATH, np.array(action_list, dtype=object))  # episodes differ in length, so store them as an object array

### This is the file you need to submit !!!
Download the testing result to your device



In [None]:
from google.colab import files
files.download(PATH)

# Server 
The code below simulates the environment on the judge server. You can use it for testing.
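
Before uploading, it can help to sanity-check the format of the action list. The helper below is a hypothetical check of our own, not the judge's actual code; it assumes 5 episodes and action indices 0 to 3, as used in this notebook.

```python
def validate_action_list(action_list, num_episodes=5, num_actions=4):
    """Return a list of problems found; an empty list means the format looks fine."""
    problems = []
    if len(action_list) != num_episodes:
        problems.append(f"expected {num_episodes} episodes, got {len(action_list)}")
    for i, actions in enumerate(action_list):
        if len(actions) == 0:
            problems.append(f"episode {i} has no actions")
        # An action is invalid if it is not an integer in [0, num_actions).
        bad = [a for a in actions if int(a) != a or not 0 <= a < num_actions]
        if bad:
            problems.append(f"episode {i} has invalid actions: {bad[:5]}")
    return problems


good = [[0, 2, 2, 1], [3, 3], [2], [0], [1, 2]]  # made-up example data
bad = [[0, 4], [2]]                              # too few episodes, one invalid action

print(validate_action_list(good))
print(validate_action_list(bad))
```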

In [None]:
action_list = np.load(PATH,allow_pickle=True) # The action list you upload
seed = 543 # Do not revise this
fix(env, seed)

agent.local_qnetwork.eval()  # set network to evaluation mode

test_total_reward = []
if len(action_list) != 5:
  print("Wrong format of file !!!")
  exit(0)
for actions in action_list:
  state = env.reset()
  img = plt.imshow(env.render(mode='rgb_array'))

  total_reward = 0

  done = False

  for action in actions:
  
      state, reward, done, _ = env.step(action)
      total_reward += reward
      if done:
        break

  print(f"Your reward is : {total_reward:.2f}")
  test_total_reward.append(total_reward)

# Your score

In [None]:
print(f"Your final reward is : {np.mean(test_total_reward):.2f}")

## Reference

Below are some useful references to help you get a high score.

- [DRL Lecture 1: Policy Gradient (Review)](https://youtu.be/z95ZYgPgXOY)
- [ML Lecture 23-3: Reinforcement Learning (including Q-learning) start at 30:00](https://youtu.be/2-JNBzCq77c?t=1800)
- [Lecture 7: Policy Gradient, David Silver](http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf)
