<a href="https://colab.research.google.com/github/decoderkurt/HUF_RL_2022/blob/main/19/rl-baselines-zoo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RL Baselines3 Zoo: Training in Colab



GitHub Repo: [https://github.com/DLR-RM/rl-baselines3-zoo](https://github.com/DLR-RM/rl-baselines3-zoo)

Stable-Baselines3 Repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


# Install Dependencies



In [2]:
!apt-get install swig cmake ffmpeg freeglut3-dev xvfb

Reading package lists... Done
Building dependency tree       
Reading state information... Done
freeglut3-dev is already the newest version (2.8.1-3).
swig is already the newest version (3.0.12-1).
cmake is already the newest version (3.10.2-1ubuntu2.18.04.2).
ffmpeg is already the newest version (7:3.4.8-0ubuntu0.2).
xvfb is already the newest version (2:1.19.6-1ubuntu4.10).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.


## Clone RL Baselines3 Zoo Repo

In [4]:
!git clone --recursive https://github.com/DLR-RM/rl-baselines3-zoo

fatal: destination path 'rl-baselines3-zoo' already exists and is not an empty directory.


In [5]:
%cd /content/rl-baselines3-zoo/

/content/rl-baselines3-zoo


### Install pip dependencies

In [6]:
!pip install -r requirements.txt



## Train an RL Agent


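The trained agent is saved in the `logs/` folder.

Here we train A2C on the CartPole-v1 environment. A full run uses 100 000 steps (`--n-timesteps 100000`); the cell below passes only `--n-timesteps 10`, which is enough to check the setup but not to learn a good policy.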


To train it on Pong (Atari), you just have to pass `--env PongNoFrameskip-v4`
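If you want to launch several such runs from one notebook cell, the command can be assembled in Python. This is a minimal sketch: `build_train_command` is a hypothetical helper, not part of the zoo, and it only uses the `--algo`, `--env` and `--n-timesteps` flags that appear in this notebook.

```python
import shlex


def build_train_command(algo, env_id, n_timesteps=None):
    """Hypothetical helper: assemble a train.py invocation as an argv list."""
    cmd = ["python", "train.py", "--algo", algo, "--env", env_id]
    if n_timesteps is not None:
        # Overrides the n_timesteps default read from the hyperparameter file
        cmd += ["--n-timesteps", str(n_timesteps)]
    return cmd


cmd = build_train_command("a2c", "PongNoFrameskip-v4")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.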

Note: To support a new environment, add an entry for it to the hyperparameter file of the algorithm, `hyperparams/<algo>.yml` (for example `hyperparams/a2c.yml`). You can open the file from the side panel of Google Colab (see https://stackoverflow.com/questions/46986398/import-data-into-google-colaboratory).

In [7]:
!python train.py --algo a2c --env CartPole-v1 --n-timesteps 10

Seed: 3430206127
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('ent_coef', 0.0),
             ('n_envs', 8),
             ('n_timesteps', 500000.0),
             ('policy', 'MlpPolicy')])
Using 8 environments
Overwriting n_timesteps with n=10
Creating test environment
Using cpu device
Log path: logs/a2c/CartPole-v1_1
Saving to logs/a2c/CartPole-v1_1
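The defaults printed in the log are read from the hyperparameter file of the algorithm, which holds one entry per environment. As a rough sketch of such an entry, the snippet below renders the logged values as YAML text. `format_hyperparams_entry` is a hypothetical helper that only handles simple scalar values, and the exact layout of the file in your checkout may differ.

```python
def format_hyperparams_entry(env_id, params):
    """Hypothetical helper: render one environment entry as YAML text."""
    lines = [f"{env_id}:"]
    for key, value in params.items():
        # repr() quotes strings ('MlpPolicy') and leaves numbers bare
        lines.append(f"  {key}: {value!r}")
    return "\n".join(lines)


# Values copied from the training log above
entry = format_hyperparams_entry(
    "CartPole-v1",
    {"n_envs": 8, "n_timesteps": 500000.0, "policy": "MlpPolicy", "ent_coef": 0.0},
)
print(entry)
```

A new environment would get its own block of the same shape under a different top-level key.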


#### Evaluate trained agent


You can remove `--folder logs/` to evaluate a pretrained agent instead.

In [8]:
!python enjoy.py --algo a2c --env CartPole-v1 --no-render --n-timesteps 5000 --folder logs/

Loading latest experiment, id=1
Loading logs/a2c/CartPole-v1_1/CartPole-v1.zip
Episode Reward: 12.00
Episode Length 12
Episode Reward: 12.00
Episode Length 12
Episode Reward: 10.00
Episode Length 10
Episode Reward: 12.00
Episode Length 12
Episode Reward: 9.00
Episode Length 9
Episode Reward: 16.00
Episode Length 16
Episode Reward: 12.00
Episode Length 12
Episode Reward: 14.00
Episode Length 14
Episode Reward: 10.00
Episode Length 10
Episode Reward: 13.00
Episode Length 13
Episode Reward: 9.00
Episode Length 9
Episode Reward: 12.00
Episode Length 12
Episode Reward: 10.00
Episode Length 10
Episode Reward: 14.00
Episode Length 14
Episode Reward: 12.00
Episode Length 12
Episode Reward: 13.00
Episode Length 13
Episode Reward: 16.00
Episode Length 16
Episode Reward: 9.00
Episode Length 9
Episode Reward: 15.00
Episode Length 15
Episode Reward: 11.00
Episode Length 11
Episode Reward: 10.00
Episode Length 10
Episode Reward: 11.00
Episode Length 11
Episode Reward: 14.00
Episode Length 14
Episode
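A long list of per-episode lines is easier to judge as a single number. The sketch below extracts the `Episode Reward:` values from output in the format shown above and averages them. `parse_episode_rewards` is a hypothetical helper, and the sample text is the first three episodes of the log.

```python
import re
from statistics import mean

REWARD_RE = re.compile(r"^Episode Reward: (-?\d+(?:\.\d+)?)$")


def parse_episode_rewards(log_text):
    """Extract the values of all 'Episode Reward: X' lines from the output text."""
    rewards = []
    for line in log_text.splitlines():
        match = REWARD_RE.match(line.strip())
        if match:
            rewards.append(float(match.group(1)))
    return rewards


sample = """Episode Reward: 12.00
Episode Length 12
Episode Reward: 12.00
Episode Length 12
Episode Reward: 10.00
Episode Length 10
"""
rewards = parse_episode_rewards(sample)
print(len(rewards), "episodes, mean reward", round(mean(rewards), 2))
```

In a notebook, the output of a shell command can be captured with `output = !python enjoy.py ...` and passed in as `"\n".join(output)`.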

#### Tune Hyperparameters

We use [Optuna](https://optuna.org/) for optimizing the hyperparameters.

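Tune the hyperparameters for PPO on MountainCar-v0 using a TPE sampler and a median pruner with 2 parallel jobs. A full search would use a budget of 1000 trials and a maximum of 50000 steps (`--n-trials 1000 -n 50000`); the cell below uses `--n-trials 1` and `-n 50`.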
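To give an idea of what a median pruner does, here is a simplified, standard-library sketch of the rule; it is not Optuna's implementation. It assumes that higher values are better (for example mean reward) and that `previous_values` holds the intermediate values earlier trials reported at the same step. The name `should_prune` is hypothetical; `n_startup_trials` mirrors the Optuna parameter of the same name.

```python
from statistics import median


def should_prune(current_value, previous_values, n_startup_trials=5):
    """Simplified median-pruning rule for a maximization study.

    A trial is stopped early when its intermediate value is below the
    median of the values earlier trials reported at the same step.
    """
    if len(previous_values) < n_startup_trials:
        # Too little history to compare against: never prune
        return False
    return current_value < median(previous_values)


history = [-150.0, -120.0, -180.0, -110.0, -160.0]
print(should_prune(-200.0, history))  # below the median of -150
print(should_prune(-100.0, history))  # above the median
```

With `--n-trials 1` as in the cell below there is no history, so nothing would be pruned.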

In [10]:
!python train.py --algo ppo --env MountainCar-v0 -n 50 -optimize --n-trials 1 --n-jobs 2 --sampler tpe --pruner median

Seed: 2632319663
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('ent_coef', 0.0),
             ('gae_lambda', 0.98),
             ('gamma', 0.99),
             ('n_envs', 16),
             ('n_epochs', 4),
             ('n_steps', 16),
             ('n_timesteps', 1000000.0),
             ('normalize', True),
             ('policy', 'MlpPolicy')])
Using 16 environments
Overwriting n_timesteps with n=50
Normalization activated: {'gamma': 0.99}
Optimizing hyperparameters
Sampler: tpe - Pruner: median
[I 2022-01-18 00:46:41,638] A new study created in memory with name: no-name-32e2ee8b-3225-45b6-82c8-6dcfcb061db1

`n_jobs` argument has been deprecated in v2.7.0. This feature will be removed in v4.0.0. See https://github.com/optuna/optuna/releases/tag/v2.7.0.

Normalization activated: {'gamma': 0.99}
Normalization activated: {'gamma': 0.99, 'norm_reward': False}
[I 2022-01-18 00:46:48,479] Trial 0 finished with value: -200

### Record a Video

In [17]:
# Set up display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [18]:
!python -m utils.record_video --algo a2c --env CartPole-v1 --exp-id 0 -f logs/ -n 1000

Loading latest experiment, id=1
Saving video to /content/rl-baselines3-zoo/logs/a2c/CartPole-v1_1/videos/final-model-a2c-CartPole-v1-step-0-to-step-1000.mp4


### Display the video

In [19]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay

def show_videos(video_path='', prefix=''):
  """
  Taken from https://github.com/eleurent/highway-env

  :param video_path: (str) Path to the folder containing videos
  :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
  """
  html = []
  for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
    video_b64 = base64.b64encode(mp4.read_bytes())
    html.append('''<video alt="{}" autoplay
                  loop controls style="height: 400px;">
                  <source src="data:video/mp4;base64,{}" type="video/mp4" />
              </video>'''.format(mp4, video_b64.decode('ascii')))
  ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

In [22]:
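# The path and prefix follow the "Saving video to ..." line printed by record_video above
show_videos(video_path='logs/a2c/CartPole-v1_1/videos/', prefix='final-model')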
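If no video shows up, the folder passed to `show_videos` probably does not match the location printed by the recording step. A recursive search below `logs/` lists every recorded file together with its folder. `find_videos` is a hypothetical helper written for this notebook.

```python
from pathlib import Path


def find_videos(log_root, pattern="*.mp4"):
    """Hypothetical helper: list videos anywhere below log_root, sorted by path."""
    return sorted(Path(log_root).rglob(pattern))


for video in find_videos("logs"):
    # The parent folder is what show_videos expects as video_path
    print(video.parent, video.name)
```

The printed parent folder and the start of the file name can then be used as `video_path` and `prefix`.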

### Continue Training

Here we continue training the previous model, passing its saved `.zip` file with `-i`.
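Each training run is written to a numbered folder such as `logs/a2c/CartPole-v1_1`, so the path of the model to resume depends on how many runs exist. The sketch below picks the folder with the highest number. `latest_experiment_dir` is a hypothetical helper that assumes the `<env_id>_<N>` naming seen in the training log.

```python
import re
from pathlib import Path


def latest_experiment_dir(log_root, algo, env_id):
    """Hypothetical helper: return the <env_id>_<N> run folder with the highest N, or None."""
    pattern = re.compile(rf"^{re.escape(env_id)}_(\d+)$")
    algo_dir = Path(log_root) / algo
    if not algo_dir.is_dir():
        return None
    best_id, best_dir = -1, None
    for child in algo_dir.iterdir():
        match = pattern.match(child.name)
        # Compare run ids numerically so that _10 ranks above _2
        if child.is_dir() and match and int(match.group(1)) > best_id:
            best_id, best_dir = int(match.group(1)), child
    return best_dir


run_dir = latest_experiment_dir("logs", "a2c", "CartPole-v1")
if run_dir is not None:
    print(run_dir / "CartPole-v1.zip")
```

The printed path is the value to pass after `-i`.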

In [14]:
!python train.py --algo a2c --env CartPole-v1 --n-timesteps 50000 -i logs/a2c/CartPole-v1_1/CartPole-v1.zip

Traceback (most recent call last):
  File "train.py", line 15, in <module>



In [None]:
!python enjoy.py --algo a2c --env CartPole-v1 --no-render --n-timesteps 1000 --folder logs/