# Unit 1 Special Content: Optuna Guide

In this notebook, we shall see how to use Optuna to perform hyperparameter tuning of Unit 1's <a href="https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html" target="_blank">`PPO`</a> model (created using Stable-Baselines3 for the `"LunarLander-v2"` Gym environment).

Optuna is an open-source, automatic hyperparameter optimization framework. You can read more about it <a href="https://tech.preferred.jp/en/blog/optuna-release/" target="_blank">here</a>.

**Prerequisite:** Before going through this notebook, you should have completed the <a href="https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb" target="_blank">Unit 1 hands-on</a>.

## Virtual Display Setup

We'll need to generate a replay video. To do so in Colab, we need to have a virtual display to be able to render the environment (and thus record the frames).

The following cell will install virtual display libraries.

In [1]:
!apt install python-opengl
!apt install ffmpeg
!apt install xvfb
!pip install pyvirtualdisplay


Now, let's create & start a virtual display.

In [2]:
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7f8ec17433d0>

## Dependencies, Imports & Gym Environments

Let's install all the other dependencies we'll need.

In [3]:
!pip install gym[box2d]
!pip install stable-baselines3[extra]
!pip install pyglet
!pip install ale-py==0.7.4 # To overcome an issue with Gym (https://github.com/DLR-RM/stable-baselines3/issues/875)
!pip install optuna
!pip install huggingface_sb3


Next, let's perform all the necessary imports.

In [4]:
import gym

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.monitor import Monitor
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv

import optuna
from optuna.samplers import TPESampler

from huggingface_hub import notebook_login
from huggingface_sb3 import package_to_hub

Finally, let's create our Gym environments. The training environment is a vectorized environment:

In [5]:
env = make_vec_env("LunarLander-v2", n_envs=16)
env

<stable_baselines3.common.vec_env.dummy_vec_env.DummyVecEnv at 0x7f8d7af53390>

And the evaluation environment is a separate environment:

In [6]:
eval_env = Monitor(gym.make("LunarLander-v2"))

We are now ready to dive into hyperparameter tuning!

## Hyperparameter Tuning

First, let's define a `run_training()` function that trains a single model with a particular combination of hyperparameter values, and returns the trained model together with a score.

The score tells us how good that particular combination of hyperparameters is. (In our case, the score is `mean_reward - std_reward`, which is the metric used by the <a href="https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard" target="_blank">leaderboard</a>.)

The function takes a very special argument - `params`, which is a dictionary. **The keys of this dictionary are the names of the hyperparameters we're tuning**, and **the values are sampled at each trial by Optuna's sampler** (from ranges that we'll specify soon).

For example, in a particular trial, `params` might look like this:

```
{'n_epochs': 5, 'gamma': 0.9926, 'total_timesteps': 559_621}
```

And in another trial, `params` might look like this:

```
{'n_epochs': 3, 'gamma': 0.9974, 'total_timesteps': 1_728_482}
```

In [7]:
def run_training(params, verbose=0, save_model=False):
    model = PPO(
        policy='MlpPolicy', 
        env=env, 
        n_steps=1024,
        batch_size=64, 
        n_epochs=params['n_epochs'], # We're tuning this.
        gamma=params['gamma'], # We're tuning this.
        gae_lambda=0.98, 
        ent_coef=0.01, 
        verbose=verbose
    )
    model.learn(total_timesteps=params['total_timesteps']) # We're tuning this.

    mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=50, deterministic=True)
    score = mean_reward - std_reward

    if save_model:
        model.save("PPO-LunarLander-v2")

    return model, score
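
To make the score concrete, here is a minimal sketch of the `mean_reward - std_reward` calculation on its own. The helper name `leaderboard_score` and the reward lists are hypothetical, and the sketch assumes the population standard deviation (NumPy's default), which is what `evaluate_policy()` is understood to return.

```python
from statistics import mean, pstdev


def leaderboard_score(episode_rewards):
    """Return mean - std of the episode rewards (population standard deviation)."""
    return mean(episode_rewards) - pstdev(episode_rewards)


# Hypothetical episode rewards for two agents with the same mean reward.
steady_agent = [250.0, 250.0, 250.0, 250.0]
erratic_agent = [150.0, 350.0, 150.0, 350.0]

print(leaderboard_score(steady_agent))   # 250.0 (std is 0)
print(leaderboard_score(erratic_agent))  # 150.0 (std is 100)
```

Both agents average `250`, but the erratic one is penalized for its variance, which is why this score favors consistent agents.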

Next, we define another function - `objective()`. This function has a single parameter `trial`, which is an object of type `optuna.trial.Trial`. Using this `trial` object, we specify the ranges for the different hyperparameters we want to explore:

- For `n_epochs`: We want to explore integer values between `3` and `5`.
- For `gamma`: We want to explore floating-point values between `0.9900` and `0.9999` (drawn from a uniform distribution).
- For `total_timesteps`: We want to explore integer values between `500_000` and `2_000_000`.

**Note:** If you have more time available, then you can tune other hyperparameters too. Moreover, you can explore wider ranges for each hyperparameter.

The `trial.suggest_int()` and `trial.suggest_float()` methods ask Optuna to suggest hyperparameter values within the specified ranges. (`suggest_float()` samples from a uniform distribution by default; it replaces `suggest_uniform()`, which is deprecated as of Optuna 3.0.) The suggested combination of values is then used to train a model, and the resulting score is returned.

In [8]:
def objective(trial):
  params = {
      "n_epochs": trial.suggest_int("n_epochs", 3, 5),
      "gamma": trial.suggest_float("gamma", 0.9900, 0.9999),
      "total_timesteps": trial.suggest_int("total_timesteps", 500_000, 2_000_000)
  }
  model, score = run_training(params)
  return score
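
To illustrate what "sampling from the ranges" means, the sketch below draws a few `params` dictionaries using Python's `random` module as a stand-in for Optuna's sampler. This is plain random search, not the TPE algorithm, and the `sample_params` helper is hypothetical; it only shows the shape of the values a trial might receive.

```python
import random


def sample_params(rng):
    """Draw one hyperparameter combination from the ranges used in this notebook."""
    return {
        "n_epochs": rng.randint(3, 5),                       # integer in [3, 5]
        "gamma": rng.uniform(0.9900, 0.9999),                # float in [0.9900, 0.9999]
        "total_timesteps": rng.randint(500_000, 2_000_000),  # integer in [500_000, 2_000_000]
    }


rng = random.Random(0)  # seeded so the draws are reproducible
for trial_number in range(3):
    print(trial_number, sample_params(rng))
```

Unlike this stand-in, a sampler such as TPE uses the scores of earlier trials to decide where to sample next.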

Finally, we use Optuna's `create_study()` function to create a study, passing in:

- `sampler=TPESampler()`: This specifies that we want to employ a Bayesian optimization algorithm called Tree-structured Parzen Estimator. Other options are `GridSampler()`, `RandomSampler()`, etc. (The full list can be found <a href="https://optuna.readthedocs.io/en/stable/reference/samplers.html" target="_blank">here</a>.)
- `study_name="PPO-LunarLander-v2"`: This is a name we give to the study (optional).
- `direction="maximize"`: This specifies that our objective is to maximize (not minimize) the score.

Once our study is created, we call the `optimize()` method on it, specifying that we want to conduct `10` trials.

**Note:** If you have more time available, then you can conduct more than `10` trials.

**Warning:** The code cell below will take quite a bit of time to run!

In [None]:
study = optuna.create_study(sampler=TPESampler(), study_name="PPO-LunarLander-v2", direction="maximize")
study.optimize(objective, n_trials=10)

[I 2022-06-18 11:03:43,550] A new study created in memory with name: PPO-LunarLander-v2


Now that all the `10` trials have concluded, let's print out the score and hyperparameters of the best trial.

In [None]:
print("Best trial score:", study.best_trial.value)
print("Best trial hyperparameters:", study.best_trial.params)

## Recreating & Saving The Best Model

Let's recreate the best model and save it.

In [None]:
model, score = run_training(study.best_trial.params, verbose=1, save_model=True)

## Pushing to Hugging Face Hub

To be able to share your model with the community, there are three more steps to follow:

1. (If not done already) create a Hugging Face account -> https://huggingface.co/join

2. Sign in, then get your authentication token from the Hugging Face website.

- Create a new token (https://huggingface.co/settings/tokens) **with write role**.
- Copy the token.
- Run the cell below and paste the token.

In [None]:
notebook_login()
!git config --global credential.helper store

If you aren't using Google Colab or Jupyter Notebook, you need to use this command instead: `huggingface-cli login`

3. We're now ready to push our trained agent to the Hub using the `package_to_hub()` function.

Let's fill in the arguments of the `package_to_hub` function:

- `model`: our trained model

- `model_name`: the name of the trained model that we defined in `model.save()`

- `model_architecture`: the model architecture we used (in our case `"PPO"`)

- `env_id`: the name of the environment (in our case `"LunarLander-v2"`)

- `eval_env`: the evaluation environment

- `repo_id`: the name of the Hugging Face Hub repository that will be created/updated (`repo_id="{username}/{repo_name}"`). (**Note:** A good `repo_id` is `"{username}/{model_architecture}-{env_id}"`.)

- `commit_message`: the commit message

In [None]:
model_name = "PPO-LunarLander-v2"
model_architecture = "PPO"
env_id = "LunarLander-v2"
eval_env = DummyVecEnv([lambda: gym.make(env_id)])
repo_id = "Sadhaklal/PPO-LunarLander-v2" # Replace "Sadhaklal" with your own Hugging Face username.
commit_message = "Upload best PPO LunarLander-v2 agent (tuned with Optuna)."

The following function call will evaluate the agent, record a replay, generate a model card, and push your agent to the Hub.

In [None]:
package_to_hub(
    model=model, 
    model_name=model_name, 
    model_architecture=model_architecture, 
    env_id=env_id, 
    eval_env=eval_env, 
    repo_id=repo_id, 
    commit_message=commit_message
)

That's it! You now know how to perform hyperparameter tuning of Stable-Baselines3 models using Optuna.

To get even better results, try tuning the other hyperparameters of your model.

## Final Tips

1. Read the <a href="https://optuna.readthedocs.io/en/stable/index.html" target="_blank">Optuna documentation</a> to get more familiar with the library and its features.
2. You may have noticed that hyperparameter tuning is a time-consuming process. However, it can be sped up significantly using parallelization. Check out <a href="https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/004_distributed.html" target="_blank">this guide</a> on how to do so.