<a href="https://colab.research.google.com/github/ankitabuntolia/DRL/blob/main/02_StableBaselines3_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Reinforcement Learning - Stable Baselines3 Demo

GitHub repo: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)


[RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines3.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.

Documentation is available online: [https://stable-baselines3.readthedocs.io/](https://stable-baselines3.readthedocs.io/)

PyBullet source code: https://github.com/bulletphysics/bullet3/tree/master/examples/pybullet/



## Install Dependencies and Stable-Baselines3 Using Pip

In [None]:
!apt update
!apt-get install -y xvfb x11-utils ffmpeg
!pip install gym pyvirtualdisplay==0.2.* PyOpenGL==3.1.* PyOpenGL-accelerate==3.1.*
!pip install "stable-baselines3[extra]" pybullet

## Import policy, RL agent, Wrappers

In [None]:
import os, shutil
import glob, io, base64

import pybullet_envs
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize, VecVideoRecorder
from stable_baselines3.common.evaluation import evaluate_policy

# Plotting and notebook imports
from IPython.display import HTML, clear_output
from IPython import display

# start virtual display
from pyvirtualdisplay import Display
pydisplay = Display(visible=0, size=(640, 480))
pydisplay.start()

## Define helper functions

In [None]:
def concatenate_videos(video_dir):
    """
    Merge all mp4 videos in video_dir into a single file.
    Returns the path of the merged video, or None if there is nothing to show.
    """
    outfile = os.path.join(video_dir, 'merged_video.mp4')
    # sort for a deterministic order and skip the output of a previous run
    mp4list = sorted(f for f in glob.glob(os.path.join(video_dir, '*.mp4')) if f != outfile)
    if not mp4list:
        return outfile if os.path.exists(outfile) else None
    tmpfiles = []
    # remux each mp4 into an MPEG-TS temp file that the concat protocol can join
    for i, f in enumerate(mp4list, start=1):
        file = os.path.join(video_dir, f"temp{i}.ts")
        os.system(f'ffmpeg -y -i "{f}" -c copy -bsf:v h264_mp4toannexb -f mpegts "{file}"')
        tmpfiles.append(file)
    # execute ffmpeg command to combine videos
    inputs = "|".join(tmpfiles)
    os.system(f'ffmpeg -y -i "concat:{inputs}" -c copy -bsf:a aac_adtstoasc "{outfile}"')
    # cleanup: remove temp files and the original clips
    for f in tmpfiles + mp4list:
        os.remove(f)
    return outfile

def show_video(video_dir):
    """
    Show video in the output of a code cell.
    """
    # merge all videos
    mp4 = concatenate_videos(video_dir)
    # the merged file may be missing if no video was recorded or ffmpeg failed
    if mp4 and os.path.exists(mp4):
        with open(mp4, 'rb') as f:
            video = f.read()
        encoded = base64.b64encode(video)
        display.display(HTML(data='''<video alt="test" autoplay
                    loop controls style="height: 400px;">
                    <source src="data:video/mp4;base64,{0}" type="video/mp4" />
                </video>'''.format(encoded.decode('ascii'))))
    else:
        print("Could not find video")

## Define global variables

In [None]:
env_id = "HalfCheetahBulletEnv-v0"
n_envs = 1
video_length = 500
log_dir = "logs/"
video_folder = f"{log_dir}videos/"
model_path = os.path.join(log_dir, "ppo_halfcheetah")
stats_path = os.path.join(log_dir, "vec_normalize.pkl")

## Create and wrap the environments

Normalizing input features can be essential to successfully training an RL agent (by default, only image observations are scaled; other input types are not), for instance when training on [PyBullet](https://github.com/bulletphysics/bullet3/) environments. The `VecNormalize` wrapper serves this purpose: it computes a running mean and standard deviation of the input features, and it can do the same for rewards.

More information about `VecNormalize`:
- [Documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#stable_baselines3.common.vec_env.VecNormalize)
- [Discussion](https://github.com/hill-a/stable-baselines/issues/698)
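
The running statistics described above can be illustrated with a minimal pure-Python sketch. This is not the Stable-Baselines3 implementation: the class `RunningNorm` and its parameters are hypothetical names chosen for illustration, and it updates on one observation at a time with Welford's algorithm, whereas a vectorized wrapper would typically work on batches. The `clip` value mirrors the `clip_obs=10.` setting used in this notebook.

```python
import math


class RunningNorm:
    """Per-feature running mean and variance (hypothetical helper, Welford's algorithm)."""

    def __init__(self, n_features, clip=10.0, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps  # avoids division by zero for constant features

    def update(self, obs):
        """Fold one observation (a list of floats) into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Standardize an observation with the current statistics, then clip it."""
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.count if self.count > 0 else 1.0
            z = (x - self.mean[i]) / math.sqrt(var + self.eps)
            out.append(max(-self.clip, min(self.clip, z)))
        return out


norm = RunningNorm(n_features=2)
for observation in ([1.0, 10.0], [3.0, 30.0]):
    norm.update(observation)
print(norm.normalize([3.0, 30.0]))
```

Both features end up on a comparable scale even though their raw magnitudes differ by a factor of ten, which is the effect normalization is meant to have on the policy's inputs.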

---

To observe the agent's behavior during environment rollouts, one can record the rendered frames by applying the `VecVideoRecorder` environment wrapper.

More information about `VecVideoRecorder`:
- [Documentation](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#record-a-video)

To learn more about vectorized environments follow this link:
- [Documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html)
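
As a rough intuition for what a vectorized environment does, the toy sketch below steps several independent environments in lockstep and returns batched results. `CounterEnv` and `ToyVecEnv` are hypothetical classes written for this illustration only; they do not reproduce the Stable-Baselines3 API, apart from the idea that a sub-environment is reset automatically once its episode ends.

```python
class CounterEnv:
    """Toy environment (hypothetical): the observation is the step count."""

    def __init__(self, horizon):
        self.horizon = horizon  # episode length in steps
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        reward = float(action)  # reward simply echoes the action
        return self.t, reward, done


class ToyVecEnv:
    """Steps a list of environments together and batches their outputs."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            o, r, d = env.step(action)
            if d:
                # auto-reset: the returned observation starts the next episode
                o = env.reset()
            obs.append(o)
            rewards.append(r)
            dones.append(d)
        return obs, rewards, dones


vec_env = ToyVecEnv([CounterEnv(horizon=2), CounterEnv(horizon=3)])
print(vec_env.reset())
print(vec_env.step([1, 2]))
print(vec_env.step([1, 2]))
```

Because every call returns one entry per sub-environment, wrappers such as `VecNormalize` and `VecVideoRecorder` can be stacked on top without knowing how many environments run underneath.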

In [None]:
env = make_vec_env(env_id, n_envs)
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
env = VecVideoRecorder(env, video_folder,
                       record_video_trigger=lambda x: x == 0, video_length=video_length,
                       name_prefix="random-agent-{}".format(env_id))

### Train the agent

For training, Stable-Baselines3 offers a set of commonly used baseline methods in reinforcement learning:

* A2C ([Advantage Actor-Critic](https://arxiv.org/abs/1602.01783), a synchronous variant of A3C)
* DDPG ([Deep Deterministic Policy Gradient](https://arxiv.org/abs/1509.02971))
* DQN ([Deep Q-Networks](https://arxiv.org/abs/1312.5602))
* HER ([Hindsight Experience Replay](https://arxiv.org/abs/1707.01495))
* PPO ([Proximal Policy Optimization](https://arxiv.org/abs/1707.06347))
* SAC ([Soft Actor-Critic](https://arxiv.org/abs/1801.01290))
* TD3 ([Twin Delayed Deep Deterministic Policy Gradient](https://arxiv.org/abs/1802.09477))

For a more detailed description follow the documentation:
* [Documentation](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html)

As the neural network architecture we simply use a multilayer perceptron (`MlpPolicy`). Stable-Baselines3 offers several policy types, including:
* MlpPolicy
* CnnPolicy

In [None]:
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=2000)

### Save the agent and the normalization

In [None]:
# Save the model and the normalization statistics
model.save(model_path)
env.save(stats_path)

### Test model: load the saved agent and normalization

In [None]:
# Load the agent
model = PPO.load(model_path)

# Load the saved statistics
env = make_vec_env(env_id, n_envs=n_envs)
env = VecNormalize.load(stats_path, env)
# do not update the running statistics at test time
env.training = False
# reward normalization is not needed at test time
env.norm_reward = False

In [None]:
mean_reward, std_reward = evaluate_policy(model, env)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
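
The two values returned by `evaluate_policy` summarize the per-episode returns. The sketch below shows only that summary step, assuming the returns have already been collected; `summarize_returns` and the sample numbers are hypothetical, and the population standard deviation is used on the assumption that it matches NumPy's default `np.std`.

```python
import statistics


def summarize_returns(episode_returns):
    """Return (mean, population std) of a list of episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std


# made-up returns from three evaluation episodes
mean_return, std_return = summarize_returns([10.0, 20.0, 30.0])
print(f"Mean reward = {mean_return:.2f} +/- {std_return:.2f}")
```

A large standard deviation relative to the mean suggests evaluating over more episodes before comparing agents.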

## Show video

In [None]:
show_video(video_folder)