# Deep Reinforcement Learning for Control

## Lab Session

### 3rd International Summer School on Artificial Intelligence AI-DLDA 2020





Notebook written by Matteo Dunnhofer `matteo.dunnhofer@uniud.it`

Machine Learning and Perception Lab

University of Udine

In this notebook we see how to solve a toy control problem with **reinforcement learning** (RL) techniques.

## Problem

![alt text](https://raw.githubusercontent.com/cpow-89/Extended-Deep-Q-Learning-For-Open-AI-Gym-Environments/master/images/Lunar_Lander_v2.gif)

The **goal** is to control a lunar module to land on the moon, without having access to any ground-truth or prior information. 

To solve this problem, we will train an **artificial agent** by trial and error. The agent will control the module and learn from the experience acquired through the **interaction** between the module and the lunar environment. The interaction will happen through the **observation** of some features of the lander module, the subsequent execution of **actions**, and the **rewarding** of those actions based on their quality. The agent will be implemented as a **neural network**, and the state-of-the-art **Soft Actor-Critic** (SAC) algorithm will be employed to optimize the network weights.

We will use the most popular tools in the RL landscape to implement and solve this problem.

## Utilities


Let's start by fixing the seed for the random number generators.

In [1]:
# Fix the seed for random number generators
SEED = 123
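
Defining `SEED` does not seed anything by itself: the constant has to be handed to each random number generator in use. In this notebook it is later passed to `env.seed(SEED)` and to `sac_pytorch(..., seed=SEED)`. The following is a minimal sketch of the idea for Python's built-in generator only; the helper name `seed_everything` is hypothetical, and NumPy and PyTorch offer analogous calls (`np.random.seed`, `torch.manual_seed`).

```python
import random

SEED = 123

def seed_everything(seed):
    """Seed Python's built-in RNG (hypothetical helper, stdlib only)."""
    random.seed(seed)

# Draw three numbers after seeding
seed_everything(SEED)
first = [random.random() for _ in range(3)]

# Re-seed with the same value and draw again
seed_everything(SEED)
second = [random.random() for _ in range(3)]

# The two sequences are identical because the generator state was reset
print(first == second)
```

Re-seeding with the same value replays the same sequence, which is what makes experiments repeatable.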

The following cells implement utilities to visualize graphically (as a video) the interaction between the agent and the environment. You do not need this part if you run the notebook as a script.

In [2]:
!pip install pyvirtualdisplay > /dev/null 2>&1
!apt-get install x11-utils > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [3]:
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import glob
import io
import os
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
import time

In [4]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7fafd42dd4a8>

In [5]:
"""
Utility functions to record a video of a gym environment and display it.
To enable video, just do "env = wrap_env(env)"
"""
def show_video():
    mp4list = glob.glob('videos/*/*.mp4')
    mp4list.sort(key=os.path.getmtime)
    if len(mp4list) > 0:
        mp4 = mp4list[-1]
        video = io.open(mp4, 'rb').read()
        encoded = base64.b64encode(video)
        ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                </video>'''.format(encoded.decode('ascii'))))
    else: 
        print("Could not find video")
    

def wrap_env(env):
    env = Monitor(env, './videos/' + str(time.time()) + '/')  # Monitor objects are used to save interactions as videos
    return env

## Environment

Let's define the environment that the agent will interact with to pursue its goals. We will take advantage of OpenAI's [``gym``](https://gym.openai.com), which provides a large number of predefined environments for RL experiments.

In particular, we will use the ``LunarLanderContinuous-v2`` environment. This environment provides 8 features as **states** (position of the lander, velocity, angles, terrain contact sensors) and has a continuous action space of 2 **actions** (amount of fire for the left/right engine, amount of fire for the main engine). **Rewards** are given based on the quality of the descent and the landing position, and on the usage of fuel. Interactions end if the lander crashes or when it lands successfully.

Let's install `gym`.

In [6]:
# Install gym and the Box2D packages required by LunarLanderContinuous-v2
!pip install box2d-py > /dev/null 2>&1
!pip install gym > /dev/null 2>&1
!pip install gym[box2d] > /dev/null 2>&1

And now let's instantiate the environment.

In [7]:
import gym

# Create the environment 'LunarLanderContinuous-v2'
env = gym.make('LunarLanderContinuous-v2')
env.seed(SEED)

# Print the observation and action spaces of the env
print('State space: {}'.format(env.observation_space))
print('Action space: {} - low: {} high: {}'.format(env.action_space, env.action_space.low[0], env.action_space.high[0]))

State space: Box(8,)
Action space: Box(2,) - low: -1.0 high: 1.0


We can look into the environment definition [here](https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py).
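
As a rough illustration of what the two action components mean, the sketch below maps an action to engine commands. It assumes the thresholds described in the linked `lunar_lander.py` (main engine off for non-positive values and throttled between 50% and 100% otherwise; a side engine fires only when the second component exceeds 0.5 in absolute value). The function name `interpret_action` is hypothetical and the code is not part of `gym`.

```python
def interpret_action(action):
    """Map a 2-element action in [-1, 1] to engine commands.

    Assumption: thresholds follow the description in gym's lunar_lander.py.
    """
    main, lateral = action
    # Clip both components to the action-space bounds
    main = max(-1.0, min(1.0, main))
    lateral = max(-1.0, min(1.0, lateral))

    # Main engine: off for main <= 0, otherwise throttle from 50% to 100%
    main_power = (main + 1.0) * 0.5 if main > 0.0 else 0.0

    # Side engines: off inside the dead zone, otherwise the sign picks the side
    if abs(lateral) > 0.5:
        side = 'left' if lateral < 0 else 'right'
        side_power = abs(lateral)
    else:
        side, side_power = None, 0.0

    return {'main_power': main_power, 'side': side, 'side_power': side_power}


print(interpret_action([1.0, 0.0]))
print(interpret_action([-0.3, -0.8]))
```

The dead zones mean that many different actions have the same physical effect, which is one reason exploration matters in this task.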

## Agent

Let's now move to the agent definition! The agent will be the controller of the lander: it will contain the policy and be responsible for its learning and execution.

Again, we will take advantage of an OpenAI library, this time [Spinning Up in Deep RL!](https://spinningup.openai.com/en/latest/)

It's an excellent resource to start working with RL, and
it includes implementations of basic and advanced RL algorithms, with very detailed explanations.
Algorithms and models are implemented in both the PyTorch and TensorFlow frameworks. In our case, we will use the PyTorch version.

Let's install it.

In [None]:
!git clone https://github.com/openai/spinningup.git > /dev/null 2>&1
!pip install -e spinningup > /dev/null 2>&1

Cloning into 'spinningup'...
remote: Enumerating objects: 1263, done.
remote: Total 1263 (delta 0), reused 0 (delta 0), pack-reused 1263
Receiving objects: 100% (1263/1263), 31.02 MiB | 25.72 MiB/s, done.
Resolving deltas: 100% (590/590), done.
Obtaining file:///content/spinningup
Collecting cloudpickle==1.2.1
  Downloading https://files.pythonhosted.org/packages/09/f4/4a080c349c1680a2086196fcf0286a65931708156f39568ed7051e42ff6a/cloudpickle-1.2.1-py2.py3-none-any.whl
Collecting gym[atari,box2d,classic_control]~=0.15.3
  Downloading https://files.pythonhosted.org/packages/e0/01/8771e8f914a627022296dab694092a11a7d417b6c8364f0a44a8debca734/gym-0.15.7.tar.gz (1.6MB)
Collecting matplotlib==3.1.1
  Downloading https://files.pythonhosted.org/packages/57/4f/dd381ecf6c6ab9bcdaa8ea912e866dedc6e696756156d8ecc087e20817e2/matplotlib-3.1.1-cp36-cp36m-manylinux1_x86_64.whl (13.1MB)

Let's see how to exploit the library to define our agent.
As we are going to use the SAC algorithm, we need to implement an object that fits the `actor_critic` parameter required by the [`spinup.sac_pytorch()`](https://spinningup.openai.com/en/latest/algorithms/sac.html#documentation-pytorch-version) function (which implements the SAC learning).

We can do this by defining a custom `torch.nn.Module` that respects such requirements, or just use the `MLPActorCritic` implementation given by OpenAI.



In [None]:
from spinup.algos.pytorch.sac.core import MLPActorCritic

# Instantiate the agent as an actor-critic agent composed of multi-layer perceptrons
agent = MLPActorCritic(env.observation_space, env.action_space)

print(agent)

This is just a `torch.nn.Module` that contains an MLP for the stochastic policy `pi` (the actor), and two MLPs for the Q-value functions `q1` and `q2` (the critics). We can have a look at the detailed implementation [here](https://github.com/openai/spinningup/blob/master/spinup/algos/pytorch/sac/core.py).
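
To give an intuition of what a stochastic policy for a bounded continuous action space does, here is a minimal sketch assuming a tanh-squashed Gaussian, which is the kind of actor found in the linked `core.py`. It handles a single action component with plain Python instead of PyTorch; the function name `squashed_gaussian_action` and the example values of `mu` and `log_std` are made up for illustration (in the real agent they are outputs of the policy MLP).

```python
import math
import random

def squashed_gaussian_action(mu, log_std, act_limit=1.0,
                             deterministic=False, rng=random):
    """Sample one action component: Gaussian draw, tanh squash, then scale."""
    std = math.exp(log_std)
    # Deterministic mode uses the mean, which is common at test time
    u = mu if deterministic else mu + std * rng.gauss(0.0, 1.0)
    # tanh maps any real number into (-1, 1); act_limit rescales the range
    return act_limit * math.tanh(u)


rng = random.Random(0)
samples = [squashed_gaussian_action(0.5, -0.5, rng=rng) for _ in range(1000)]
print(min(samples), max(samples))
print(squashed_gaussian_action(0.5, -0.5, deterministic=True))
```

Whatever the Gaussian draw is, the squashed action always stays within the bounds of the action space, so no clipping is needed before calling `env.step`.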

## Interaction

Let's move to the implementation of the finite-horizon interaction procedure that must happen between agent and environment.
This is done by means of the ``run_episode`` function. Following the [general structure of `gym`'s environments](https://gym.openai.com/docs/#observations), a state will be obtained from the environment and given as input to the agent, which will produce its action. Then, the action will be passed to the environment, which will return the reward and the next state. This procedure will run until the stop conditions are met.

We will use this function for quantitative and qualitative evaluations, but a similar procedure is implemented by the SAC algorithm during training.


In [None]:
import torch

def run_episode(agent, env, render=False):
    """
    Given agent and env, runs an episode and returns the obtained rewards

    Args:
        agent: an object respecting the spinup.sac_pytorch actor_critic parameter
        env: a gym environment
        render: if True, record the episode as a video and display it

    Returns:
        rewards: list of scalar rewards
    """
    # Empty list to collect the rewards of the episode
    rewards = []

    if render:
        # Do not need this if you locally run this notebook as a script
        env = wrap_env(env)

    # Reset environment to the first state
    state = env.reset() 
    done = False            # signal from environment that episode is over

    # Run until the episode is finished
    while not done:

        # Render environment to screen
        if render:
            env.render()

        # Get the action from the agent for the current state
        action = agent.act(torch.tensor(state, dtype=torch.float32))
        
        # Perform action and receive reward and new state
        state, reward, done, _ = env.step(action)

        # Save reward
        rewards.append(reward)

    if render:
        # Do not need this if you locally run this notebook as a script
        env.close()
        show_video()

    return rewards


Let's use `run_episode` to see how the agent performs without training.

In [None]:
rewards = run_episode(agent, env, render=True)

print('R(tau) = {}'.format(sum(rewards)))
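
The quantity printed above, `R(tau)`, is the undiscounted return of the episode, i.e. the plain sum of its rewards. During training, RL algorithms usually work with the discounted return, where a reward received `t` steps in the future is weighted by `gamma ** t`. A minimal sketch of both follows; the function name `episode_return` is hypothetical and the reward lists are toy values, not rewards from the lander.

```python
def episode_return(rewards, gamma=1.0):
    """Return sum_t gamma**t * rewards[t], computed backwards in one pass."""
    ret = 0.0
    for r in reversed(rewards):
        # Each step adds its reward to the discounted return of what follows
        ret = r + gamma * ret
    return ret


# With gamma = 1 this is the same as sum(rewards)
print(episode_return([1.0, 2.0, 3.0]))
# With gamma < 1 later rewards count less
print(episode_return([1.0, 1.0, 1.0], gamma=0.5))
```

With `gamma=1.0` the function matches `sum(rewards)`, so it can be used as a drop-in replacement when comparing episodes.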

## Training

Let's get to the actual learning phase!

As we said, we will use the Soft Actor-Critic algorithm, which is a state-of-the-art off-policy RL algorithm particularly designed for robotics and control problems. 

Here are some features:


*   Stochastic policy optimization in an off-policy way
    - It bridges the sample efficiency of off-policy methods with the stability of policy optimization

*   Entropy regularization
    - Maximization of the policy entropy, alongside the reward, for better exploration

*   Double critic trick
    - The minimum of two Q-value estimates is used in the learning targets, to reduce overestimation bias and make learning faster
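
The entropy regularization and the double critic trick both show up in the target used to train the Q-functions. The sketch below computes that target for a single transition with plain Python scalars. It is an illustration under stated assumptions rather than the `spinup` code: the function name `sac_target` is hypothetical, the default values of `gamma` and `alpha` are taken to match common SAC settings, and in the real algorithm the Q-values come from target networks and the log-probability from the current policy.

```python
def sac_target(reward, done, q1_next, q2_next, logp_next,
               gamma=0.99, alpha=0.2):
    """Soft Bellman target: r + gamma * (1 - d) * (min(Q1, Q2) - alpha * log pi)."""
    # Double critic trick: take the more pessimistic of the two estimates
    min_q = min(q1_next, q2_next)
    # Entropy bonus: a low log-probability (high entropy) raises the value
    soft_value = min_q - alpha * logp_next
    # No bootstrapping from the next state when the episode has ended
    return reward + gamma * (1.0 - float(done)) * soft_value


print(sac_target(1.0, False, 10.0, 8.0, -2.0))
print(sac_target(1.0, True, 10.0, 8.0, -2.0))
```

The coefficient `alpha` trades off reward against entropy: larger values push the policy towards more random, exploratory behaviour.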



For the details of the algorithm, Spinning Up in Deep RL! gives a [very good tutorial](https://spinningup.openai.com/en/latest/algorithms/sac.html). Here we can have a look at the pseudocode.

![alt text](https://imgur.com/oNsh1a8.png)


With the `spinup` implementation, training is just a matter of a single statement. If you are interested in how it is defined inside, check [here](https://github.com/openai/spinningup/blob/master/spinup/algos/pytorch/sac/sac.py).

Now let's train!

In [None]:
from spinup import sac_pytorch

sac_pytorch(lambda: gym.make('LunarLanderContinuous-v2'),
            actor_critic=MLPActorCritic,
            seed=SEED,
            epochs=50,
            logger_kwargs={'output_dir' : './experiments'})
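
While training runs, the logger writes its statistics to the directory given in `output_dir`. The sketch below shows one way such a log could be inspected afterwards. It rests on assumptions: that the log is a tab-separated file named `progress.txt` with a header row, and that it has columns such as `Epoch` and `AverageEpRet`. The function name `read_progress` is hypothetical, and the demo file contains placeholder numbers so that the sketch runs without a training run; they are not real results.

```python
import csv
import os
import tempfile

def read_progress(path):
    """Read a tab-separated log with a header row into a list of dicts."""
    with open(path, newline='') as f:
        return list(csv.DictReader(f, delimiter='\t'))


# Stand-in file with placeholder numbers (NOT real training results)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'progress.txt')
with open(demo_path, 'w') as f:
    f.write('Epoch\tAverageEpRet\n1\t-250.0\n2\t-120.5\n')

rows = read_progress(demo_path)
# Values are read as strings, so convert the column of interest to float
returns = [float(row['AverageEpRet']) for row in rows]
print(returns)
```

With a real run, `demo_path` would be replaced by the path of the log inside `./experiments`, and the resulting list could be plotted to follow the learning curve.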

## Evaluating

Let's see how our trained agent performs!

In [None]:
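# Load the trained agent from a saved PyTorch checkpoint.
# Assumption: spinup's logger is expected to write the model of the run above
# to ./experiments/pyt_save/model.pt; here a checkpoint named
# 'agent-trained.pt' is loaded instead.
agent = torch.load('agent-trained.pt')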

# Qualitatively evaluate the performance of the agent on 5 episodes
for e in range(5):
    rewards = run_episode(agent, env, render=True)

    R = sum(rewards)

    print('Test episode {} - R(tau) = {}'.format(e, R))

Not bad!
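
Five rendered episodes give a qualitative impression. For a quantitative evaluation, the returns of many episodes can be averaged. The following is a minimal sketch, assuming a threshold of 200 points, which is the score the `gym` environment definition cites for considering the task solved. The function name `summarize_returns` is hypothetical, and the list passed to it here holds toy values rather than measured results.

```python
import statistics

SOLVED_THRESHOLD = 200.0  # assumption: score cited by gym as "solved"

def summarize_returns(returns, threshold=SOLVED_THRESHOLD):
    """Compute mean and standard deviation of a list of episode returns."""
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns)  # population standard deviation
    return {'mean': mean, 'std': std, 'solved': mean >= threshold}


# Toy values only; in the notebook they would come from run_episode
summary = summarize_returns([200.0, 220.0, 240.0])
print(summary)
```

In the notebook, the input could be built with something like `[sum(run_episode(agent, env)) for _ in range(100)]`, with `render` left at `False` to keep the evaluation fast.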