<a href="https://colab.research.google.com/github/berthine/Reinforcement-Learnin/blob/master/AIMS_(Reinforce)_Exercice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install, import and utilities

In [0]:
!pip install gym > /dev/null 2>&1

In [0]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [0]:
!apt-get update > /dev/null 2>&1
!apt-get install cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1

In [0]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only

import torch
import torch.nn as nn
import torch.nn.functional as F 
from torch import optim
import numpy as np
import pandas as pd

import seaborn as sns
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay
from IPython.display import clear_output
from pathlib import Path

import random, os.path, math, glob, csv, base64, itertools, sys
from pprint import pprint

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import io
from IPython.display import HTML


In [0]:
# The following code will be used to visualize the environments.

def show_video(directory):
    html = []
    for mp4 in Path(directory).glob("*.mp4"):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append('''<video alt="{}" autoplay 
                      loop controls style="height: 400px;">
                      <source src="data:video/mp4;base64,{}" type="video/mp4" />
                 </video>'''.format(mp4, video_b64.decode('ascii')))
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))
    
def make_seed(seed):
    np.random.seed(seed=seed)
    torch.manual_seed(seed=seed)
  
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

## Reminder of the RL setting

As always, we consider an MDP $M = (\mathcal{S}, \mathcal{A}, p, r, \gamma)$ with:
* $\mathcal{S}$ the state space,
* $\mathcal{A}$ the action space,
* $p(s^\prime \mid s, a)$ the transition probability,
* $r(s, a, s^\prime)$ the reward of the transition $(s, a, s^\prime)$,
* $\gamma \in [0,1)$ the discount factor.

A policy $\pi$ maps each state in $\mathcal{S}$ to a probability distribution over the actions in $\mathcal{A}$.

The action-value function of a policy is the expected return obtained by starting from a state-action pair and following the policy afterwards: $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \mid s_0=s, a_0=a\big]$, where $\tau$ is an episode $(s_0, a_0, r_0, s_1, a_1, r_1, s_2, ..., s_T, a_T, r_T)$ with the actions drawn from $\pi(\cdot \mid s_t)$, and $R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$ is the random variable defined as the cumulative sum of the discounted rewards.

The goal is to find a policy that maximizes the agent's expected return:

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[R(\tau) \big]$$
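
As a concrete illustration of $R(\tau)$, the sketch below computes the cumulative discounted reward of a single episode from its list of rewards. The function name `discounted_return` and the example rewards are assumptions made for this illustration; they are not part of Gym or of the lab code.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate backwards so each step needs a single multiplication by gamma
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


episode_rewards = [0, 1, 1, 3]
undiscounted = discounted_return(episode_rewards, gamma=1.0)
discounted = discounted_return(episode_rewards, gamma=0.5)
print(undiscounted, discounted)  # 5.0 1.125
```

$J(\pi)$ is the average of this quantity over episodes sampled with $\pi$.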

## Gym Environment

In this lab and the next one we are going to use [OpenAI's Gym library](https://gym.openai.com/envs/). This library provides a large number of environments for testing RL algorithms.

We will focus on the **CartPole-v1** environment in this lab but we encourage you to also test your code on:
* **Acrobot-v1**
* **MountainCar-v0**

| Env Info          	| CartPole-v1 	| Acrobot-v1                	| MountainCar-v0 	|
|-------------------	|-------------	|---------------------------	|----------------	|
| **Observation Space** 	| Box(4)      	| Box(6)                    	| Box(2)         	|
| **Action Space**      	| Discrete(2) 	| Discrete(3)               	| Discrete(3)    	|
| **Rewards**           	| 1 per step  	| -1 if not terminal else 0 	| -1 per step    	|

A Gym environment is loaded with the command `env = gym.make(env_id)`. Once the environment is created, you need to reset it with `observation = env.reset()`, and then you can interact with it using the `step` method: `observation, reward, done, info = env.step(action)`.

### CartPole-v1

In [0]:
# We load CartPole-v1
env = gym.make('CartPole-v1')
# We wrap it in order to save our experiment on a file.
env = Monitor(env, "./gym-results", force=True, video_callable=lambda episode: True)

In [0]:
done = False
obs = env.reset()
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video("./gym-results")

### Acrobot-v1

In [0]:
# We load Acrobot-v1
env = gym.make('Acrobot-v1')
# We wrap it in order to save our experiment on a file.
env = Monitor(env, "./gym-results", force=True, video_callable=lambda episode: True)

In [0]:
done = False
obs = env.reset()
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video("./gym-results")

### MountainCar-v0

In [0]:
# We load MountainCar-v0
env = gym.make('MountainCar-v0')
# We wrap it in order to save our experiment on a file.
env = Monitor(env, "./gym-results", force=True, video_callable=lambda episode: True)

In [0]:
done = False
obs = env.reset()
while not done:
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
env.close()
show_video("./gym-results")

## REINFORCE

### Introduction

REINFORCE is an actor-based, **on-policy** method. The policy $\pi_{\theta}$ is parametrized by a function approximator (e.g. a neural network).

Recall: $$ J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[ \sum_{t} \gamma^t r_t \mid s_0, \pi \big].$$

To update the parameters $\theta$ of the policy, one has to do gradient ascent: $\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta}J(\pi_{\theta})|_{\theta_{k}}$.


### Policy Gradient Theorem

$$ \nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau)}\right]$$
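
A short derivation sketch, assuming episodes of finite length and writing $\rho(s_0)$ for the initial-state distribution (notation introduced here for convenience; it is not used elsewhere in this lab). The probability of an episode under $\pi_{\theta}$ factorizes as

$$ P(\tau \mid \theta) = \rho(s_0) \prod_{t=0}^{T} \pi_{\theta}(a_t \mid s_t) \prod_{t=0}^{T-1} p(s_{t+1} \mid s_t, a_t). $$

Using the identity $\nabla_{\theta} P = P \, \nabla_{\theta} \log P$,

$$ \nabla_{\theta} J(\pi_{\theta}) = \nabla_{\theta} \int P(\tau \mid \theta) R(\tau) \, d\tau = \int P(\tau \mid \theta) \, \nabla_{\theta} \log P(\tau \mid \theta) \, R(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ \nabla_{\theta} \log P(\tau \mid \theta) \, R(\tau) \right]. $$

Neither $\rho$ nor $p$ depends on $\theta$, so only the policy terms survive in the gradient of the log-probability:

$$ \nabla_{\theta} \log P(\tau \mid \theta) = \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t), $$

which gives the expression above.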


The policy gradient can be approximated from a set $\mathcal{D}$ of episodes sampled with $\pi_{\theta}$:
$$ \hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) R(\tau) $$
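
To make the estimator concrete, here is a minimal sketch on a toy problem rather than a Gym environment: a one-step task with two actions and a softmax policy $\pi_{\theta}(a) \propto e^{\theta_a}$, for which $\nabla_{\theta} \log \pi_{\theta}(a) = e_a - \pi_{\theta}$ is available in closed form. The task, the reward values, the step size, and the names `softmax` and `estimate_gradient` are all assumptions made for illustration; the lab itself uses a neural network and automatic differentiation.

```python
import math
import random


def softmax(theta):
    """Turn a list of logits into action probabilities."""
    shift = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in theta]
    norm = sum(exps)
    return [e / norm for e in exps]


def estimate_gradient(theta, rewards, batch_size, rng):
    """Monte Carlo estimate: mean over episodes of grad log pi(a) * R."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(batch_size):
        # One-step episode: sample a single action and observe its return
        action = rng.choices(range(len(theta)), weights=probs)[0]
        ret = rewards[action]
        for k in range(len(theta)):
            # d/d theta_k of log softmax(theta)[action] = 1[k == action] - probs[k]
            indicator = 1.0 if k == action else 0.0
            grad[k] += (indicator - probs[k]) * ret / batch_size
    return grad


rng = random.Random(0)
rewards = [0.0, 1.0]  # hypothetical returns: action 1 is the better one
theta = [0.0, 0.0]    # start from the uniform policy
alpha = 0.5           # step size (arbitrary choice for this toy problem)

for _ in range(200):
    g_hat = estimate_gradient(theta, rewards, batch_size=32, rng=rng)
    # Gradient ascent: theta <- theta + alpha * g_hat
    theta = [t + alpha * g for t, g in zip(theta, g_hat)]

probs = softmax(theta)
print(probs)
```

With these settings the probability of action 1 should end up close to 1, since it is the only rewarded action.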

### Implementation of the REINFORCE algorithm

In [0]:
# This is your neural network model
# You do not need to update it!

class Model(nn.Module):
    def __init__(self, dim_observation, n_actions):
        super(Model, self).__init__()
        
        self.n_actions = n_actions
        self.dim_observation = dim_observation
        
        self.net = nn.Sequential(
            nn.Linear(in_features=self.dim_observation, out_features=16),
            nn.ReLU(),
            nn.Linear(in_features=16, out_features=8),
            nn.ReLU(),
            nn.Linear(in_features=8, out_features=self.n_actions),
            nn.Softmax(dim=-1)  # normalize over the action dimension (also valid for batched inputs)
        )
        
    def policy(self, state):
        # Accept NumPy arrays, lists, or tensors; as_tensor avoids re-wrapping an existing tensor
        state = torch.as_tensor(state, dtype=torch.float)
        return self.net(state)
    
    def sample_action(self, state):
        # policy() already converts the state to a tensor
        action = torch.multinomial(self.policy(state), 1)
        return action.item()


It is always nice to visualize the different layers of our model.

In [0]:
# You can select your environment here
env_id = 'MountainCar-v0'  #@param ["CartPole-v1", "Acrobot-v1", "MountainCar-v0"]
env = gym.make(env_id)


In [0]:
# Define your network
model = Model(env.observation_space.shape[0], env.action_space.n)
print(model)

# Define your optimizer
optimizer = torch.optim.Adam(model.net.parameters(), lr=0.01)


num_steps = 50   # Number of gradient steps to perform
batch_size = 64  # Number of trajectories used to estimate each gradient
Tmax = 200       # Maximum length of a trajectory
gamma = 1


for step in range(num_steps):

  # Initialize batch storage
  batch_losses = torch.zeros(batch_size)
  batch_returns = np.zeros(batch_size)

  # Generate batch
  for i in range(batch_size):

    # Initialize environment
    state = env.reset()

    # Collect trajectory
    for t in range(Tmax):   
      ...

    # Compute the trajectory of discounted rewards
    # Ex: [0, 1, 1, 3] -> [5, 5, 4, 3] with gamma=1
    ...

    # Update the discounted return with baseline
    ...

    # Compute loss over one trajectory
    policy_loss = torch.zeros(1)
    ...

    # Store batch data
    batch_losses[i] = policy_loss
    batch_returns[i] = ...

  loss = batch_losses.mean()

  # Update the agent
  optimizer.zero_grad()  
  loss.backward()
  optimizer.step()

  print('Step {}/{} \t reward: {:.2f} +/- {:.2f}'.format(
        step, num_steps, np.mean(batch_returns), np.std(batch_returns)))
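
The comments in the training loop above ask for the trajectory of discounted rewards, for example `[0, 1, 1, 3] -> [5, 5, 4, 3]` with `gamma=1`. One possible way to compute it is a single backward pass over the rewards. This is a sketch only: the helper name `rewards_to_go` is an assumption, and it is not necessarily the intended solution to the exercise.

```python
def rewards_to_go(rewards, gamma):
    """For each step t, return sum_{k >= t} gamma**(k - t) * rewards[k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the one computed after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


example = rewards_to_go([0, 1, 1, 3], gamma=1.0)
print(example)  # [5.0, 5.0, 4.0, 3.0]
```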


In [0]:
# This block displays your policy in a video
video_env = Monitor(env, "./gym-results", force=True, video_callable=lambda episode: True)

done = False
reward_episode = 0
state = video_env.reset()  # the loop below reads `state`, so store the first observation there
while not done:
    action = model.sample_action(state)
    next_state, reward, done, info = video_env.step(action)
    reward_episode += reward
    state = next_state

video_env.close()
show_video("./gym-results")

print(f'Reward: {reward_episode}')