# Reinforcement Learning (Open AI: Lunar Lander v2)

- Member 1: Anonymised
- Member 2: Yee Hang (2112675)

# Defining Objectives

1. Develop and evaluate reinforcement learning algorithms to land an agent successfully on the lunar lander gym environment
2. Investigate applications of reinforcement learning algorithms

## !! About notebook !!
1. This notebook will go through the process of developing an A2C RL algorithm
2. Since our introduction has been covered in earlier ipynb, we will not cover it here

# Project Initialization Setup

In [None]:
!pip install Box2D
!pip install box2d
!pip install box2d-py
!pip install gym[all]
!pip install gym[Box_2D]2
!pip install gym
!pip install wandb tqdm tensorflow_addons

> ### Installing necessary dependencies for OpenAI Gym

- `wandb`: Explained later
- `tqdm`: Progress bar

### Install X11 system and other dependencies

- Install X11 to render display and other dependencies to make sure we can run OpenAI environments in Google Colab.

In [None]:
!apt-get install -y xvfb x11-utils
!pip install pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.*

### Create virtual display in background

- Create a new virtual display in the background that the environment can connect for rendering.
- `echo` to ensure there is any running background display

In [3]:
import pyvirtualdisplay
_display = pyvirtualdisplay.Display(visible=False,
                                    size=(1024, 768) )
_display.start()
!echo $DISPLAY

:1001


In [4]:
# Mount to google colab, to use their CPU/GPU
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%cd /content/drive/MyDrive

Mounted at /content/drive
/content/drive/MyDrive


## General imports

In [5]:
from collections import *
from typing import List, Optional, Tuple
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from matplotlib import pyplot as plt
from tqdm import tqdm
import seaborn as sns
import plotly.express as px
import pytz
from copy import deepcopy
import os, time, math, datetime, warnings,glob,random,wandb,sys,functools, plotly
from IPython.display import display, HTML
from matplotlib import animation, rc

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Ignore warnings
%matplotlib inline

In [6]:
plotly.offline.init_notebook_mode()

In [7]:
import absl.logging
absl.logging.set_verbosity(absl.logging.ERROR)
# ignore warning
import logging
logging.getLogger('tensorflow').disabled = True

# Ignore GPU when i'm not using colab because my GPU is not very good
if 'google.colab'  not in sys.modules:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'


In [8]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Flatten, LeakyReLU, ReLU, Conv2D
from tensorflow.keras.models import load_model, Model, model_from_json

from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow_addons.layers import NoisyDense
from tensorflow.keras import backend as K


# OpenAI Gym 

It is a toolkit for building, evaluating and comparing RL algorithms. It is compatible with algorithms written in any frameworks like TensoFlow. It is simple and easy to comprehend. It makes no assumption about the structure of our agent and provides an interface to all RL tasks.

In [9]:
import gym
from gym import RewardWrapper, ObservationWrapper,Wrapper, logger

In [11]:
warnings.filterwarnings("ignore") # ignore warning

# random seed for reproducibility
seed = 42
tf.random.set_seed(seed)
np.random.seed(seed)
random.seed(seed)


### Weights & Biases
- A MLOps platform with a novelty feature of experiment tracking to see machine learning model's performances of different versionings.
- Build better models faster with experiment tracking, dataset versioning, and model management



In [12]:
wandb.login(key='e6ba4b45952ca612073be61bfda87ceca39072c5')
wandb.init()

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mlhurr[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Create environment

In [16]:
env = gym.make("LunarLander-v2")
env.reset(seed=seed)

array([ 0.00229702,  1.4181306 ,  0.2326471 ,  0.3204666 , -0.00265488,
       -0.05269805,  0.        ,  0.        ], dtype=float32)

## Helper functions

In [18]:
def reward_plot(df, cols = ['Average Score', 'Solved Requirement']):
    fig = px.line(df, x='x', y=cols, markers=True, title='Score Analysis')
    fig.update_traces(patch={"line": {"width": 4, "dash": 'dash'}})
    fig.add_traces(go.Scatter(x= df['x'], y=df['Score'], mode='markers+lines', name='Score')).update_traces(patch={"line": {"width": 4}})
    fig.update_layout(legend_title="Legend")
    return fig

# Training A2C (Advantage Actor Critic) model

A2C or Advantage Actor Critic is an reinforcement learning algorithm that combines value optimization and policy optimization approaches. We have the following: the actor and the critic
> Actor: a policy gradient algorithm that decides on an action to take

> Critic: Q-learning algorithm that critiques the action that the Actor selected, providing feedback on how to adjust. It can take advantage of efficiency tricks in Q-learning, such as memory replay. 

In the beginning, the agent does is not familiar with the environment, so actions are taken randomly, the Critic observes the action and provides feedback. 

The Q value can be learned by parameterizing the Q function with a neural network, leading us to actor critic methods, where the critic estimates the value function. This could be the action-value (the Q value) or state-value (the V value). The actor updates the policy distribution in the direction suggested by the Critic (such as with policy gradients).

This model will serve as a baseline for our future experiments, in the Lunar Lander environment

![](https://miro.medium.com/max/828/1*GjirmHTNdxHgo1Z8iQjDbg.webp)

Source: https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f

Code modified from [here](https://github.com/germain-hug/Deep-RL-Keras/blob/master/A2C/a2c.py)

> **Huber loss**
1. We will be using huber loss instead of MSE for our critic. It combines the benefits of MSE and MAE losses. 
1. Using mean squared error (mse) as loss function might lead to very high errors when training. These high errors can cause the model to over adjust and fail to converge to a good solution. Huber loss on the other hand is more robust and is less sensitive to outliers. 
2. Outliers are penalized, and the network is encouraged to reduce the loss to 0 when loss is very small.


## Basic hyperparameters
1. We will set the learning rate to 0.0001, and discount factor gamma to 0.99, which are standard values

In [32]:
class A2C:
    def __init__(self,env,lr,gamma= 0.99,entropy_beta = 0.01,reward_steps = 4,clip_grad = 0.1):
        self.env = env
        self.action_space = env.action_space
        self.observation_space = env.observation_space

        self.lr = lr
        self.gamma = gamma
        self.entropy_beta = entropy_beta
        self.reward_steps = reward_steps
        self.clip_grad = clip_grad

        self.state_shape = self.observation_space.shape[0]
        self.num_actions = self.action_space.n

        self.a2c_model = self.algo_init("A2C")
        self.optimizer = Adam(learning_rate=lr)
        self.huber_loss = tf.keras.losses.Huber()

        self.reward_history = []  
        self.d = {'x': [], 'Score':[], 'Average Score': [], 'Solved Requirement': []}

    def algo_init(self, name):
        inputs = Input(shape=(self.state_shape))  
        x = Dense(16, activation="relu")(inputs)
        x = Dense(16, activation="relu")(x)

        actor = Dense(32, activation="relu")(x)
        actor = Dense(64, activation="relu")(actor)
        actor = Dense(self.num_actions, activation="softmax")(actor)

        critic = Dense(32, activation="relu")(x)
        critic = Dense(64, activation="relu")(critic)
        critic = Dense(1)(critic)  

        model = Model(inputs=inputs, outputs=[actor, critic], name=name)
        # print(model.summary())
        return model

    def train(self, num_episodes = 1000, mean_stopping = True):

        tqdm_e = tqdm(
            range(num_episodes), total=num_episodes, desc="Score", unit=" episodes"
        )
        for episode in tqdm_e:
            state = self.env.reset()
            state_values = [] 
            actions, rewards = [], []
            cumulative_reward = 0

            with tf.GradientTape() as tape:
                for t in range(1, self.reward_steps):
                    action_probs, critic_value = self.forward(state)

                    action = np.random.choice(
                        self.num_actions, p=np.squeeze(action_probs)
                    )

                    actions.append(tf.math.log(action_probs[0, action]))
                    state_values.append(critic_value[0, 0])

                    state, reward, done, _ = self.env.step(action)
                    cumulative_reward += reward
                    rewards.append(reward)
                    if done:
                        break

                self.reward_history.append(cumulative_reward)
                last_rewards_mean = np.mean(
                    self.reward_history[-100:])  
                if last_rewards_mean > 280 and mean_stopping:
                    print("Training Complete")
                    break

                # Calculate expected return
                returns = []
                expected_return = 0
                for r in reversed(rewards):
                    expected_return = r + self.gamma * expected_return
                    returns.insert(0, expected_return)
                returns = np.array(returns)

                eps = 1e-8 # Small numerical constant to prevent division by zero
                advantages = returns - np.mean(returns) / (np.std(returns) + eps)
                advantages = advantages.tolist()

                # Calculate Loss for Gradient Updates
                history = zip(actions, state_values, advantages)
                actor_losses = []
                entropy_losses = []
                critic_losses = []

                for (
                    log_prob,
                    value,
                    advantage,
                ) in history:
                    # Actor Loss
                    actor_loss = (
                        advantage - value
                    )  
                    actor_loss *= log_prob  
                    actor_loss *= (
                        -1
                    )  
                    entropy_loss = -(
                        tf.math.exp(log_prob) * log_prob
                    )  # Entropy is used to encourage exploration of the environment by encouraging a uniform distribution of the policy


                    critic_loss = self.huber_loss(
                        tf.expand_dims(value, 0), tf.expand_dims(advantage, 0)
                    )  
                    actor_losses.append(actor_loss)
                    entropy_losses.append(entropy_loss)
                    critic_losses.append(critic_loss)

                loss = (
                    sum(actor_losses)+ sum(critic_losses) + self.entropy_beta * sum(entropy_losses)
                )

                gradients = tape.gradient(loss, self.a2c_model.trainable_variables)
                
                if self.clip_grad is not None:
                    gradients = [
                        tf.clip_by_norm(gradient, self.clip_grad)
                        for gradient in gradients
                    ]

                self.optimizer.apply_gradients(
                    zip(gradients, self.a2c_model.trainable_variables)
                )
            wandb.log(
                {
                    "Episode": episode,
                    "Reward": cumulative_reward,
                    "Avg-Reward": last_rewards_mean})
            self.d['x'].append(episode)
            self.d['Score'].append(cumulative_reward)
            self.d['Average Score'].append(last_rewards_mean)
            self.d['Solved Requirement'].append(200)

            tqdm_e.set_description(f"Score: {cumulative_reward}")
            tqdm_e.refresh()

        self.save("saved-models/a2c/")
        wandb.log_artifact("saved-models/a2c/", name="A2C", type="model")
        return pd.DataFrame(self.d)

    @tf.function
    def forward(self, x: np.ndarray):
        x = tf.convert_to_tensor(x)
        x = tf.expand_dims(x, 0)
        action_probs, critic_value = self.a2c_model(x)
        return action_probs, critic_value

    def save(self, path: str):
        self.a2c_model.save(path)

    def load(self, path: str):
        self.a2c_model.load_weights(path)


In [33]:
model = A2C(env, 0.0001, 0.99, 0.01, 50, 1)

Model: "A2C"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_5 (InputLayer)           [(None, 8)]          0           []                               
                                                                                                  
 dense_32 (Dense)               (None, 16)           144         ['input_5[0][0]']                
                                                                                                  
 dense_33 (Dense)               (None, 16)           272         ['dense_32[0][0]']               
                                                                                                  
 dense_34 (Dense)               (None, 32)           544         ['dense_33[0][0]']               
                                                                                                

In [34]:
current_time = datetime.datetime.now(pytz.timezone("Singapore")).strftime("%d:%m:%Y_%H:%M")
a2c_run = wandb.init(
    project="LunarLander-v2",
    name=f"A2C-{[current_time]}",
    dir=os.getcwd()
)

df = model.train(20000, True)

VBox(children=(Label(value='0.001 MB of 0.009 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=0.091208…

Score: -190.28635524884896: 100%|██████████| 20000/20000 [2:33:33<00:00,  2.17 episodes/s]
[34m[1mwandb[0m: Adding directory to artifact (./saved-models/a2c)... Done. 0.0s


## Analysis
1. We see that A2C does not land at all, with negative rewards consistently.
2. From the reward analysis plot, we see that the agent is not able to learn properly, even after training for long periods. It still consistently crashes. It does not solve the environment as well.
3. We can conclude that A2C is not very suited for this environment.

In [37]:
fig = reward_plot(df)
fig.show()

In [2]:
!jupyter nbconvert --to html "2_A2C_Train".ipynb

[NbConvertApp] Converting notebook 2_A2C_Train.ipynb to html
[NbConvertApp] Writing 5548134 bytes to 2_A2C_Train.html
