<a href="https://colab.research.google.com/github/abcardoso/ifes_ai/blob/main/IA_GYM_SB_CarRacing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://github.com/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_getting_started.ipynb

# Stable Baselines3 Tutorial - Getting Started

Github repo: https://github.com/araffin/rl-tutorial-jnrr19

Stable-Baselines3: https://github.com/DLR-RM/stable-baselines3

Documentation: https://stable-baselines3.readthedocs.io/en/master/

RL Baselines3 zoo: https://github.com/DLR-RM/rl-baselines3-zoo


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a training framework for Reinforcement Learning (RL), using Stable Baselines3.

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

## Introduction

Name Ana Cardoso - IFES 2024.2 - IA - Professor Sergio Nery


In this notebook, you will learn the basics of using the Stable Baselines3 library: how to create an RL model, train it, and evaluate it. Because all algorithms share the same interface, we will see how simple it is to switch from one algorithm to another.

We will also apply reinforcement learning to train agents to solve challenges in the following environments:
- Car Racing: https://gymnasium.farama.org/environments/box2d/car_racing/

- Mountain Car: https://gymnasium.farama.org/environments/classic_control/mountain_car/


## Install Dependencies and Stable Baselines3 Using Pip

The full list of dependencies can be found in the [README](https://github.com/DLR-RM/stable-baselines3).


```
pip install stable-baselines3[extra]
```

In [None]:
# for autoformatting
# %load_ext jupyter_black

In [1]:
!apt-get update && apt-get install -y ffmpeg freeglut3-dev xvfb  # For visualization
!pip install "stable-baselines3[extra]>=2.0.0a4"



In [2]:
import stable_baselines3

print(f"{stable_baselines3.__version__=}")

stable_baselines3.__version__='2.3.2'


## Imports

Stable-Baselines3 works on environments that follow the [gym interface](https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html).
You can find a list of available environments [here](https://gymnasium.farama.org/environments/classic_control/).
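
As a rough illustration of that interface, the sketch below mimics the `reset`/`step` contract in plain Python. `CountdownEnv` and `run_episode` are hypothetical names, and the class deliberately does not subclass `gymnasium.Env` or declare observation and action spaces, so it only shows the shape of the calls. A real environment would need both before Stable-Baselines3 could use it.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment that mimics the Gymnasium call signatures."""

    def __init__(self, start=5):
        self.start = start
        self.remaining = start

    def reset(self, seed=None):
        # reset() returns the first observation and an info dict
        self.remaining = self.start
        return self.remaining, {}

    def step(self, action):
        # Action 1 decrements the counter, action 0 does nothing
        self.remaining -= action
        terminated = self.remaining <= 0  # the task itself is finished
        truncated = False  # this toy has no time limit
        reward = 1.0 if terminated else -0.1
        return self.remaining, reward, terminated, truncated, {}


def run_episode(env, max_steps=100, seed=0):
    """Roll out random actions until the episode ends or max_steps is hit."""
    rng = random.Random(seed)
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = rng.choice([0, 1])
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return obs, total_reward


final_obs, episode_return = run_episode(CountdownEnv())
print(final_obs, round(episode_return, 2))
```

The five values returned by `step` (observation, reward, terminated, truncated, info) are the part worth remembering when reading environment documentation.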

It is also worth checking the [source code](https://github.com/Farama-Foundation/Gymnasium) and the environment pages to learn more about the observation and action space of each env.
Not all algorithms can work with all action spaces; you can find more in this [recap table](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html).
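
To make the action-space constraint concrete, here is a small lookup sketch. The dictionary is transcribed by hand from the recap table for the algorithms named in this notebook, so treat it as an assumption and check the table for your installed version. `compatible_algorithms` is a hypothetical helper, not part of Stable-Baselines3.

```python
# Action-space support per algorithm, copied by hand from the SB3 recap table
# (an assumption to double-check against the docs for your version).
SUPPORTED_ACTION_SPACES = {
    "A2C": {"Box", "Discrete", "MultiDiscrete", "MultiBinary"},
    "PPO": {"Box", "Discrete", "MultiDiscrete", "MultiBinary"},
    "DQN": {"Discrete"},
    "SAC": {"Box"},
    "TD3": {"Box"},
}


def compatible_algorithms(action_space_type):
    """Return the algorithms (sorted by name) that accept this action space type."""
    return sorted(
        name
        for name, spaces in SUPPORTED_ACTION_SPACES.items()
        if action_space_type in spaces
    )


# Car Racing uses a continuous (Box) action space by default
print(compatible_algorithms("Box"))  # ['A2C', 'PPO', 'SAC', 'TD3']
# Mountain Car (MountainCar-v0) uses a Discrete action space
print(compatible_algorithms("Discrete"))  # ['A2C', 'DQN', 'PPO']
```

This is why PPO is a convenient first choice here: it accepts both the continuous Car Racing actions and the discrete Mountain Car actions.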

In [3]:
import gymnasium as gym
import numpy as np

print(f"{gym.__version__=}")

gym.__version__='0.29.1'


The first thing you need to import is the RL model. Check the documentation to know what you can use on which problem.

In [4]:
from stable_baselines3 import PPO

The next thing you need to import is the policy class that will be used to create the networks (for the policy/value functions).
This step is optional as you can directly use strings in the constructor:

```PPO('MlpPolicy', env)``` instead of ```PPO(MlpPolicy, env)```

Note that some algorithms like `SAC` have their own `MlpPolicy`, which is why using a string for the policy is the recommended option.
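
The sketch below illustrates why a string is the safer choice, using stand-in classes only. `ToyPPO`, `ToySAC`, the two network classes and `resolve_policy` are all hypothetical. The per-algorithm alias table is loosely modelled on how a library could resolve the name and is not Stable-Baselines3's actual code.

```python
class ActorCriticNet:
    """Stand-in for an on-policy actor-critic network (hypothetical)."""


class SoftActorCriticNet:
    """Stand-in for a SAC-style network (hypothetical)."""


class ToyPPO:
    # Each algorithm keeps its own mapping from policy name to class
    policy_aliases = {"MlpPolicy": ActorCriticNet}


class ToySAC:
    policy_aliases = {"MlpPolicy": SoftActorCriticNet}


def resolve_policy(algo_cls, policy):
    """Turn a policy name into the class registered by that algorithm."""
    if isinstance(policy, str):
        try:
            return algo_cls.policy_aliases[policy]
        except KeyError:
            raise ValueError(f"Unknown policy {policy!r} for {algo_cls.__name__}")
    return policy  # already a class: used as given, right or wrong


# The same string resolves to a different class for each algorithm
print(resolve_policy(ToyPPO, "MlpPolicy").__name__)  # ActorCriticNet
print(resolve_policy(ToySAC, "MlpPolicy").__name__)  # SoftActorCriticNet
```

With a string, the algorithm picks its own matching class. With an imported class, it is up to you to import the one that belongs to the algorithm you are using.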

In [5]:
from stable_baselines3.ppo import CnnPolicy

## Create the Gym env and instantiate the agent

Here we are using the [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) (PPO) algorithm, which is an actor-critic method: it uses a value function to reduce the variance of the policy gradient estimate.

It combines ideas from [A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html) (having multiple workers and using an entropy bonus for exploration) and [TRPO](https://sb3-contrib.readthedocs.io/en/master/modules/trpo.html) (it uses a trust region to improve stability and avoid catastrophic drops in performance).

PPO is an on-policy algorithm, which means that the trajectories used to update the networks must be collected using the latest policy.
It is usually less sample efficient than off-policy algorithms like [DQN](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html), [SAC](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) or [TD3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html), but is much faster regarding wall-clock time.
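
PPO approximates the trust region by clipping the probability ratio between the new and the old policy. The sketch below shows that clipped objective for a single sample in plain Python. `clipped_surrogate` is a hypothetical helper written for illustration, not the library's implementation, and the default of 0.2 matches the `clip_range` shown in the training log of this notebook.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for a single sample (to be maximised).

    ratio: new_policy_prob / old_policy_prob for the action taken
    advantage: estimate of how much better the action was than average
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum keeps the more pessimistic of the two estimates
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain stops growing once ratio exceeds 1 + clip_range
print(clipped_surrogate(1.5, advantage=2.0))  # 1.2 * 2.0 = 2.4
# Negative advantage: the unclipped (more pessimistic) term is kept
print(clipped_surrogate(1.5, advantage=-2.0))  # 1.5 * -2.0 = -3.0
# Inside the clip range the objective is unchanged
print(clipped_surrogate(1.1, advantage=2.0))  # 1.1 * 2.0 = 2.2
```

Because the objective stops rewarding ratios far from 1, a single update cannot profit from moving the policy too far from the one that collected the data.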

```python
env = gym.make("CarRacing-v2", render_mode="rgb_array")

model = PPO('CnnPolicy', env, verbose=1)
```

Stable-Baselines3 provides an `evaluate_policy` helper for evaluating an agent:

In [6]:
from stable_baselines3.common.evaluation import evaluate_policy

Create the Gym env (for learning) and instantiate the agent

In [None]:
#del model

In [7]:
!apt-get update
!apt-get install -y build-essential
!apt-get install -y swig

!pip install gymnasium
!pip install gymnasium[box2d]
!pip install gymnasium[classic_control]


In [11]:
env = gym.make("CarRacing-v2", render_mode="rgb_array")

model = PPO('CnnPolicy', env, verbose=1)
model.get_parameters()


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.


{'policy': OrderedDict([('log_std', tensor([0., 0., 0.], device='cuda:0')),
              ('features_extractor.cnn.0.weight',
               tensor([[[[-7.7313e-02,  9.9915e-02, -9.8600e-02,  ...,  1.2514e-02,
                           9.6267e-02,  2.0418e-01],
                         ...
(output truncated)

Let's evaluate the untrained agent. At this point it should behave like a random agent.

In [12]:
# Use a separate environment for evaluation
eval_env = gym.make("CarRacing-v2", render_mode="rgb_array")

# Random Agent, before training
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")



mean_reward:-93.09 +/- 0.77
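
To clarify what the two numbers summarise, the sketch below aggregates a list of per-episode returns into a mean and a standard deviation. The returns are made-up values, not the ones from this notebook. The sketch assumes the population standard deviation (dividing by N), which is NumPy's default for `std`. Passing `return_episode_rewards=True` to `evaluate_policy` should give you the per-episode values if you want to inspect them yourself.

```python
import statistics

# Hypothetical per-episode returns; these are NOT the values from this notebook
episode_returns = [-93.5, -92.1, -94.0, -92.8]

mean_reward = statistics.fmean(episode_returns)
# Population standard deviation (divides by N, not N - 1)
std_reward = statistics.pstdev(episode_returns)

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")
```

A small spread, as in the untrained run above, means the episodes all ended with nearly the same return.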


## Train the agent and evaluate it

In [15]:
# Train the agent for 100_000 steps and then 50_000 (2 sessions)
model.learn(total_timesteps=50_000)  # also tried 1_000 and 10_000; 30_000 steps takes about 26 min


---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | 248      |
| time/              |          |
|    fps             | 79       |
|    iterations      | 1        |
|    time_elapsed    | 25       |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 1e+03       |
|    ep_rew_mean          | 202         |
| time/                   |             |
|    fps                  | 75          |
|    iterations           | 2           |
|    time_elapsed         | 53          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.042320095 |
|    clip_fraction        | 0.329       |
|    clip_range           | 0.2         |
|    entropy_loss         | -3.17       |
|    explained_variance   | 0.981       |
|    learning_rate        | 0.

<stable_baselines3.ppo.ppo.PPO at 0x7ea03ee07b50>
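
The log above reports `total_timesteps` of 2048 after one iteration and 4096 after two, which matches PPO collecting 2048 steps per iteration from a single environment. Under that assumption (`n_steps=2048`, one environment), the arithmetic below estimates how many iterations a 50_000-step budget needs. The variable names are chosen for this sketch.

```python
import math

n_steps = 2048  # steps collected per iteration (assumed from the log above)
n_envs = 1  # this notebook trains on a single environment
total_timesteps = 50_000

steps_per_iteration = n_steps * n_envs
# Training proceeds in whole iterations, so the budget is rounded up
iterations = math.ceil(total_timesteps / steps_per_iteration)
steps_actually_run = iterations * steps_per_iteration

print(iterations, steps_actually_run)  # 25 51200
```

So the final `total_timesteps` in the log can be slightly larger than the value passed to `learn`.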

In [16]:
# Evaluate the trained agent
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=40)  # 40 episodes take about 15 min

print(f"mean_reward:{mean_reward:.2f} +/- {std_reward:.2f}")



mean_reward:662.48 +/- 298.83


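The training appears to have gone well: the mean reward rose from -93.09 before training to 662.48 afterwards, although the large standard deviation (298.83) shows that performance still varies a lot between episodes.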
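
One rough way to check that the gain is larger than the evaluation noise is to compare it with the standard error of the mean. The sketch below uses the numbers reported by `evaluate_policy` in this notebook and assumes the episodes are independent. `standard_error` is a hypothetical helper.

```python
import math


def standard_error(std, n_episodes):
    """Standard error of the mean, assuming independent episodes."""
    return std / math.sqrt(n_episodes)


# Values reported by evaluate_policy in this notebook
before_mean, before_std, before_n = -93.09, 0.77, 10
after_mean, after_std, after_n = 662.48, 298.83, 40

improvement = after_mean - before_mean
sem_after = standard_error(after_std, after_n)

print(f"improvement: {improvement:.2f}")  # improvement: 755.57
print(f"standard error after training: {sem_after:.2f}")  # standard error after training: 47.25
```

The improvement is many times the standard error, so it is unlikely to be an artefact of which 40 episodes happened to be sampled.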

### Prepare video recording

In [17]:
# Set up fake display; otherwise rendering will fail
import os
os.system("Xvfb :1 -screen 0 1024x768x24 &")
os.environ['DISPLAY'] = ':1'

In [18]:
import base64
from pathlib import Path

from IPython import display as ipythondisplay


def show_videos(video_path="", prefix=""):
    """
    Taken from https://github.com/eleurent/highway-env

    :param video_path: (str) Path to the folder containing videos
    :param prefix: (str) Filter the videos, showing only the ones starting with this prefix
    """
    html = []
    for mp4 in Path(video_path).glob("{}*.mp4".format(prefix)):
        video_b64 = base64.b64encode(mp4.read_bytes())
        html.append(
            """<video alt="{}" autoplay
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{}" type="video/mp4" />
                </video>""".format(
                mp4, video_b64.decode("ascii")
            )
        )
    ipythondisplay.display(ipythondisplay.HTML(data="<br>".join(html)))

We will record a video using the [VecVideoRecorder](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecvideorecorder) wrapper. You will learn about those wrappers in the next notebook.

In [19]:
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv


def record_video1(env_id, model, video_length=1000, prefix="", video_folder="videos/"):
    """
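    :param env_id: (str) Gymnasium environment id, e.g. "CarRacing-v2"
    :param model: (RL model) Agent used to choose the actions
    :param video_length: (int) Number of steps to record
    :param prefix: (str) Prefix for the video file name
    :param video_folder: (str) Folder where the video is saved
    """
    # Build the env from env_id rather than a hard-coded environment name
    eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode="rgb_array")])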
    # Start the video at step=0 and record <video_length> steps
    eval_env = VecVideoRecorder(
        eval_env,
        video_folder=video_folder,
        record_video_trigger=lambda step: step == 0,
        video_length=video_length,
        name_prefix=prefix,
    )

    obs = eval_env.reset()
    for _ in range(video_length):
        action, _ = model.predict(obs)
        obs, _, _, _ = eval_env.step(action)

    # Close the video recorder
    eval_env.close()

### Visualize trained agent



In [22]:
record_video1("CarRacing-v2", model, video_length=1000, prefix="ppo-carracing")

Saving video to /content/videos/ppo-carracing-step-0-to-step-1000.mp4
Moviepy - Building video /content/videos/ppo-carracing-step-0-to-step-1000.mp4.
Moviepy - Writing video /content/videos/ppo-carracing-step-0-to-step-1000.mp4





Moviepy - Done !
Moviepy - video ready /content/videos/ppo-carracing-step-0-to-step-1000.mp4


In [23]:
show_videos("videos", prefix="ppo-carracing")

## Conclusion

In this notebook we have seen:
- how to define and train an RL model using Stable Baselines3: it takes only one line of code ;)
