<a href="https://colab.research.google.com/github/diego-minguzzi/reinforcement-learning/blob/master/optuna_hyperparams_optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameter optimization for Deep RL using Optuna
Diego Minguzzi ([profile](https://www.linkedin.com/in/diego-minguzzi-2775b78/))

Online course [Deep Reinforcement Learning](https://huggingface.co/learn/deep-rl-course), by Hugging Face: exercise of the [Bonus Unit 2](https://huggingface.co/learn/deep-rl-course/unitbonus2/introduction).
<br>
Adapted from Antonin Raffin's ICRA 2022 presentations:

*  [The tutorial on YouTube](https://youtu.be/ihP7E76KGOI)
*  [The related Colab Notebook](https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/optuna_lab.ipynb)


See [Optuna](https://optuna.org/) and its [documentation](https://optuna.readthedocs.io/en/stable/index.html).

The Gymnasium Lunar Lander environment is used.

## Lunar Lander environment

## Imports and package installation

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!apt install swig cmake
!pip install gymnasium[box2d]
!pip install huggingface_sb3
!pip install stable-baselines3
!pip install sb3-contrib
!pip install optuna
!pip install optuna-dashboard
!pip install jupyterlab jupyterlab-optuna

In [3]:
import gymnasium as gym
import collections
import json
import math
import numpy as np
import os.path
import pickle
import torch
from typing import Any, Dict
import torch.nn as nn

In [4]:
from stable_baselines3 import PPO, A2C, SAC, TD3, DQN
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import BaseCallback, CallbackList

In [5]:
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.visualization import plot_optimization_history, plot_param_importances, plot_intermediate_values



## Set all parameters

In [6]:
import logging as log
LOG_FORMAT_STRING = "%(asctime)s [%(levelname)-7s] %(message)s"
LOG_LEVEL= log.DEBUG
LOG_ROOT_LOGGER = ''

N_TRIALS =          25  # Maximum number of trials
N_JOBS =            1   # Number of jobs to run in parallel
N_STARTUP_TRIALS=   3   # Stop random sampling after N_STARTUP_TRIALS
N_EVALUATIONS=      6
N_TIMESTEPS =       int(300000)  # Training budget
EVAL_FREQ =         int(N_TIMESTEPS / N_EVALUATIONS)
N_EVAL_ENVS =       4
N_EVAL_EPISODES =   8
TIMEOUT =           int(60 * 60 * 5) # Timeout in seconds (5 hours).
NUM_STEPS_LR_UPDATE=5000 # After this many steps the learning rate is updated
LR_UPDATE_FACTOR =  0.95 # Learning rate update factor
POLICY=             'MlpPolicy'
IS_DETERMINISTIC=   True
IS_VERBOSE=         0
N_WARMUP_STEPS=     N_EVALUATIONS // 3  # Do not prune before the warmup steps.
SAVED_PARAMS_FILE=  'lunar_lander_params.json'
SAVED_SNAPSHOT_FILE='opt_lunar_lander_study.pkl'
SEED =              2024
ENV_ID =            'LunarLander-v2'
BEST_MODEL_FILE=    f'{ENV_ID}.model'
RESULT_OBJECTIVE=   375.0

HyperParams= collections.namedtuple('HyperParams', ['gamma',
                                                    'learning_rate',
                                                    'max_grad_norm',
                                                    'n_epochs',
                                                    'n_steps',
                                                    'ent_coef',
                                                    'batch_size'])
# Guessed hyperparameters, used for the trials starting at number GUESS_HYPERPARAMS_AFTER_N_TRIAL
guess_hyperparams=[ HyperParams(gamma = 1.0 - 0.0016617315216418454,
                                learning_rate = math.pow(10., -2.682161433845602),
                                max_grad_norm = 0.3733377073391161,
                                n_epochs= 5,
                                n_steps = 2 ** 12,
                                ent_coef= 0.01,
                                batch_size=64),
                    HyperParams(gamma = 1.0 - 0.00010534109595607163,
                                learning_rate = math.pow(10., -2.7388359550396095),
                                max_grad_norm = 0.30386280137092286,
                                n_epochs= 5,
                                n_steps = 2 ** 12,
                                ent_coef= 0.01,
                                batch_size=64)
]

GUESS_HYPERPARAMS_AFTER_N_TRIAL=10



In [7]:
log.basicConfig(level=LOG_LEVEL, format=LOG_FORMAT_STRING)
log.getLogger().setLevel(LOG_LEVEL)
roothandler= log.getLogger().handlers[0]
roothandler.setFormatter( log.Formatter(LOG_FORMAT_STRING) )

In [8]:
DEFAULT_HYPERPARAMS = {
    "policy": POLICY,
    "env": ENV_ID,
}

In [9]:
np.random.seed(SEED)

In [10]:
class DecreaseLearningRateCallback(BaseCallback):
    """ Custom callback to decrease the learning rate periodically during training. """

    def __init__(self, decay_interval, decay_factor):
        super(DecreaseLearningRateCallback, self).__init__()
        self.decay_interval = decay_interval  # Interval at which to decrease the learning rate
        self.decay_factor = decay_factor      # Factor by which to decrease the learning rate

    def _on_step(self) -> bool:
        if 0==(self.num_timesteps % self.decay_interval):
            # Retrieve the current learning rate
            current_lr = self.model.learning_rate

            # Decrease the learning rate by the decay factor
            new_lr = current_lr * self.decay_factor

            # Set the new learning rate
            self.model.learning_rate = new_lr
            if 0==(self.num_timesteps % (10*self.decay_interval)):
              log.debug(f'Num steps:{self.num_timesteps} current_lr:{current_lr} new_lr:{new_lr}')

        return True
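
The callback above multiplies `model.learning_rate` by `LR_UPDATE_FACTOR` every `NUM_STEPS_LR_UPDATE` steps, so after `t` steps the value is `lr0 * factor ** (t // interval)`. Depending on the Stable-Baselines3 version, assigning to `model.learning_rate` after the model is built may not reach the optimizer, because the optimizer reads its rate from a schedule created at construction time. A minimal alternative sketch follows, assuming the SB3 convention that `learning_rate` may be a callable of `progress_remaining` (1.0 at the start, 0.0 at the end). The helper name `make_step_decay_schedule` and the initial rate `3e-4` are hypothetical.

```python
from typing import Callable


def make_step_decay_schedule(initial_lr: float,
                             total_timesteps: int,
                             decay_interval: int,
                             decay_factor: float) -> Callable[[float], float]:
    """Return a schedule mapping progress_remaining (1.0 -> 0.0) to a learning rate.

    Hypothetical helper: it reproduces the step-wise exponential decay of the
    callback, i.e. lr = initial_lr * decay_factor ** (elapsed_steps // decay_interval).
    """
    def schedule(progress_remaining: float) -> float:
        # Recover the number of elapsed steps from the remaining progress fraction.
        elapsed_steps = int(round((1.0 - progress_remaining) * total_timesteps))
        n_decays = elapsed_steps // decay_interval
        return initial_lr * decay_factor ** n_decays

    return schedule


# Values mirror N_TIMESTEPS, NUM_STEPS_LR_UPDATE and LR_UPDATE_FACTOR above.
lr_schedule = make_step_decay_schedule(3e-4, 300_000, 5_000, 0.95)
print(lr_schedule(1.0))   # start of training: no decay applied yet
print(lr_schedule(0.5))   # halfway: 30 decays applied
```

The returned function could then be passed as `learning_rate=lr_schedule` when constructing `PPO`, in place of the callback.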

### Define the sampling function



In [11]:
def sample_ppo_params(trial: optuna.Trial) -> Dict[str, Any]:
    """ Samples the model hyperparameters.
    :param trial: Optuna trial object
    :return: The sampled hyperparameters for the given trial.
    """
    indx_trial= trial.number
    if ( (indx_trial>=GUESS_HYPERPARAMS_AFTER_N_TRIAL)
      and (indx_trial - GUESS_HYPERPARAMS_AFTER_N_TRIAL) < len(guess_hyperparams) ):
      indx_guess= indx_trial - GUESS_HYPERPARAMS_AFTER_N_TRIAL
      log.info(f'indx_trial:{indx_trial} indx_guess:{indx_guess}: using guessed params.')
      gamma = guess_hyperparams[indx_guess].gamma
      learning_rate = guess_hyperparams[indx_guess].learning_rate
      max_grad_norm = guess_hyperparams[indx_guess].max_grad_norm
      n_epochs = guess_hyperparams[indx_guess].n_epochs
      n_steps = guess_hyperparams[indx_guess].n_steps
      ent_coef= guess_hyperparams[indx_guess].ent_coef
      batch_size= guess_hyperparams[indx_guess].batch_size
    else:
      gamma = 1.0 - trial.suggest_float("gamma", 0.0001, 0.2, log=True)
      learning_rate = math.pow(10., trial.suggest_float("learning_rate_exp", -4.0, -1.0))
      max_grad_norm = trial.suggest_float("max_grad_norm", 0.3, 10.0, log=True)
      n_epochs= trial.suggest_int("n_epochs", 2, 16)
      n_steps = 2 ** trial.suggest_int("exponent_n_steps", 6, 13)

    ent_coef= guess_hyperparams[-1].ent_coef
    batch_size= guess_hyperparams[-1].batch_size

    result= {
        "ent_coef": ent_coef,
        "n_steps": n_steps,
        "gamma": gamma,
        "learning_rate": learning_rate,
        "max_grad_norm": max_grad_norm,
        "n_epochs":n_epochs,
        "batch_size":batch_size
    }
    log.info(f'sample_ppo_params() {trial.number}\nParams:{result}')
    return result
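
Note that the search space above is defined over transformed quantities: Optuna records `gamma` as `1 - gamma`, `learning_rate_exp` as a base-10 exponent, and `exponent_n_steps` as a base-2 exponent. The sketch below shows how recorded trial parameters could be mapped back to PPO keyword arguments. The helper name `decode_ppo_params` is hypothetical, and the sample values are those logged for trial 7 in this notebook.

```python
import math
from typing import Any, Dict


def decode_ppo_params(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map Optuna's recorded (transformed) parameters back to PPO kwargs."""
    return {
        "gamma": 1.0 - raw["gamma"],                       # stored as 1 - gamma
        "learning_rate": math.pow(10.0, raw["learning_rate_exp"]),
        "max_grad_norm": raw["max_grad_norm"],
        "n_epochs": raw["n_epochs"],
        "n_steps": 2 ** raw["exponent_n_steps"],
    }


# Recorded values of trial 7; exponent_n_steps=7 is inferred from n_steps=128.
raw_trial_7 = {
    "gamma": 0.01659178487187847,
    "learning_rate_exp": -2.7486827818006527,
    "max_grad_norm": 0.35929037256222,
    "n_epochs": 13,
    "exponent_n_steps": 7,
}
decoded = decode_ppo_params(raw_trial_7)
print(decoded)
```

This matters when reloading the JSON file written from `study.best_params`, which holds the transformed values rather than the arguments PPO expects.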

### Integrate the performance evaluation in Stable-Baselines3

In [12]:
from stable_baselines3.common.callbacks import EvalCallback

class TrialEvalCallback(EvalCallback):
    """
    Callback used for evaluating and reporting a trial.

    :param eval_env: Evaluation environment
    :param trial: Optuna trial object
    :param n_eval_episodes: Number of evaluation episodes
    :param eval_freq:   Evaluate the agent every ``eval_freq`` calls of the callback.
    :param deterministic: Whether the evaluation should
        use a stochastic or deterministic policy.
    :param verbose: Verbosity level passed to ``EvalCallback``
    """
    def __init__(self,
                  eval_env: gym.Env,
                  trial: optuna.Trial,
                  n_eval_episodes: int = 5,
                  eval_freq: int = 10000,
                  deterministic: bool = IS_DETERMINISTIC,
                  verbose: int = IS_VERBOSE):

        super().__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False

    def _on_step(self) -> bool:
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Evaluate policy (done in the parent class)
            super()._on_step()
            self.eval_idx += 1
            # Send report to Optuna
            self.trial.report(self.last_mean_reward, self.eval_idx)
            # Prune the trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
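
`trial.should_prune()` delegates the decision to the study's pruner. The following is a simplified sketch of a median stopping rule for a maximization study, not Optuna's actual implementation. The function name and the data layout (one list of intermediate values per completed trial) are assumptions made for illustration.

```python
from statistics import median
from typing import List, Sequence


def median_rule_should_prune(current_values: Sequence[float],
                             completed_histories: List[Sequence[float]],
                             n_startup_trials: int,
                             n_warmup_steps: int) -> bool:
    """Simplified median stopping rule for a study that maximizes its objective.

    current_values: intermediate values reported so far by the running trial
        (index 0 holds the value reported at step 1).
    completed_histories: intermediate values of each previously completed trial.
    """
    step = len(current_values)
    # Never prune while too few trials have completed, or during warmup.
    if len(completed_histories) < n_startup_trials or step < n_warmup_steps:
        return False
    # Values that earlier trials reported at the same step.
    peers = [h[step - 1] for h in completed_histories if len(h) >= step]
    if not peers:
        return False
    # Prune when the best value seen so far is below the peers' median.
    return max(current_values) < median(peers)


# Made-up evaluation histories of three completed trials.
histories = [[-150.0, -20.0, 60.0], [-200.0, -90.0, 10.0], [-120.0, 30.0, 140.0]]
print(median_rule_should_prune([-300.0, -250.0], histories, 3, 2))  # below the median at step 2
print(median_rule_should_prune([-100.0, 50.0], histories, 3, 2))    # above the median at step 2
```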

### Define the objective function: it evaluates a set of hyperparameters.

In [13]:
def objective(trial: optuna.Trial) -> float:
    """
    Objective function used by Optuna to evaluate
    one configuration (i.e., one set of hyperparameters).

    Given a trial object, it samples hyperparameters, trains a PPO model
    with them, and reports the result (mean episodic reward minus its
    standard deviation, measured after training).

    :param trial: Optuna trial object
    :return: Mean episodic reward minus its standard deviation after training
    """
    print(f'objective started trial:{trial.number}.')
    kwargs = DEFAULT_HYPERPARAMS.copy()

    # 1. Sample hyperparameters and update the keyword arguments
    sampled_params= sample_ppo_params( trial)
    kwargs.update( sampled_params)

    # Create the RL model
    model = PPO(**kwargs)

    # 2. Create envs used for evaluation using `make_vec_env`, `ENV_ID` and `N_EVAL_ENVS`
    eval_env= make_vec_env(ENV_ID, N_EVAL_ENVS)

    # 3. Create the `TrialEvalCallback` callback defined above that will periodically evaluate
    eval_callback = TrialEvalCallback( eval_env, trial, N_EVAL_EPISODES, EVAL_FREQ, deterministic=IS_DETERMINISTIC)
    lr_callback = DecreaseLearningRateCallback(decay_interval=NUM_STEPS_LR_UPDATE, decay_factor=LR_UPDATE_FACTOR)
    list_callbacks = CallbackList([eval_callback, lr_callback])

    nan_encountered = False
    try:
        # Train the model
        model.learn(N_TIMESTEPS, callback=list_callbacks)
    except AssertionError as e:
        # Sometimes, random hyperparams can generate NaN
        log.error(f'Exception in objective:{e}')
        nan_encountered = True
    finally:
        # Free memory
        model.env.close()
        eval_env.close()

    # Tell the optimizer that the trial failed
    if nan_encountered:
        return float("nan")

    if eval_callback.is_pruned:
        raise optuna.exceptions.TrialPruned()

    mean_reward, std_reward = evaluate_policy(model, eval_env, N_EVAL_EPISODES, deterministic=IS_DETERMINISTIC)
    result = mean_reward-std_reward
    log.info(f'objective: trial:{trial.number}\nMean reward:{mean_reward} std_reward:{std_reward} result:{result}')

    if (0==trial.number) or ( (trial.number>0) and (result > trial.study.best_value) ):
      model.save(BEST_MODEL_FILE)
      log.info(f'Saved best model to file:{BEST_MODEL_FILE}')

    eval_callback.mean_minus_std= result
    return result
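
The value returned to Optuna is the mean reward minus one standard deviation, so a policy with a high but erratic reward ranks below a consistent one with the same mean. Below is a small sketch of that scoring applied to per-episode rewards. The helper name `risk_adjusted_score` and the reward lists are made up for illustration, and the population standard deviation is assumed.

```python
from statistics import fmean, pstdev
from typing import Sequence


def risk_adjusted_score(episode_rewards: Sequence[float]) -> float:
    """Mean episodic reward penalized by one (population) standard deviation."""
    return fmean(episode_rewards) - pstdev(episode_rewards)


# Two hypothetical policies with the same mean reward but different spread.
steady = [240.0, 250.0, 260.0, 250.0]
erratic = [100.0, 400.0, 150.0, 350.0]
print(risk_adjusted_score(steady))
print(risk_adjusted_score(erratic))
```

Both lists average 250, yet the erratic one scores far lower, which is the behavior the objective relies on when ranking trials.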

### Run the Optimization loop

The optimization loop runs for a given number of trials, or until the TIMEOUT expires.<br>
Then, a snapshot file is saved.  If the snapshot exists, it is loaded before starting the optimization.

In [14]:
import torch as th

# Set pytorch num threads to 1 for faster training
th.set_num_threads(1)

# Select the sampler, can be random, TPESampler, CMAES, ...
sampler = TPESampler(n_startup_trials=N_STARTUP_TRIALS)

pruner = MedianPruner( n_startup_trials=N_STARTUP_TRIALS, n_warmup_steps=N_WARMUP_STEPS )

# Create the study and start the hyperparameter optimization
if not os.path.exists(SAVED_SNAPSHOT_FILE):
  log.info('Started the study from scratch.')
  study = optuna.create_study(sampler=sampler, pruner=pruner, direction="maximize")
else:
  log.info(f'Resuming the study from the file: {SAVED_SNAPSHOT_FILE}')
  with open(SAVED_SNAPSHOT_FILE, 'rb') as snapshot_file:
    study = pickle.load(snapshot_file)

try:
  study.optimize(objective, n_trials=N_TRIALS, n_jobs=N_JOBS, timeout=TIMEOUT)
except Exception as e:
  log.error(f'Exception during study: {e}')
except KeyboardInterrupt:
    pass

with open(SAVED_SNAPSHOT_FILE, 'wb') as snapshot_file:
    pickle.dump(study, snapshot_file)

log.info(f'Saved the study to the file: {os.path.abspath(SAVED_SNAPSHOT_FILE)}')

2024-01-01 08:42:49,860 [INFO   ] Started the study from scratch.
[I 2024-01-01 08:42:49,862] A new study created in memory with name: no-name-016626b3-0f0b-47de-8ca5-9f23c1fa928f
2024-01-01 08:42:49,871 [INFO   ] sample_ppo_params() 0
Params:{'ent_coef': 0.01, 'n_steps': 1024, 'gamma': 0.9947205391014854, 'learning_rate': 0.0042755129544146675, 'max_grad_norm': 3.5955466505396703, 'n_epochs': 8, 'batch_size': 64}


objective started trial:0.


2024-01-01 08:45:18,801 [DEBUG  ] Num steps:50000 current_lr:0.002694639515789764 new_lr:0.002559907540000276
2024-01-01 08:47:54,833 [DEBUG  ] Num steps:100000 current_lr:0.00161338021603475 new_lr:0.0015327112052330124
2024-01-01 08:50:20,383 [DEBUG  ] Num steps:150000 current_lr:0.0009659903323764005 new_lr:0.0009176908157575804
2024-01-01 08:52:34,980 [DEBUG  ] Num steps:200000 current_lr:0.00057837409494091 new_lr:0.0005494553901938645
2024-01-01 08:55:06,590 [DEBUG  ] Num steps:250000 current_lr:0.00034629393533968793 new_lr:0.0003289792385727035
2024-01-01 08:57:32,397 [DEBUG  ] Num steps:300000 current_lr:0.00020733897092209777 new_lr:0.00019697202237599286
2024-01-01 08:57:35,022 [INFO   ] objective: trial:0
Mean reward:-94.604606875 std_reward:91.11242539314406 result:-185.71703226814407
2024-01-01 08:57:35,039 [INFO   ] Saved best model to file:LunarLander-v2.model
[I 2024-01-01 08:57:35,044] Trial 0 finished with value: -185.71703226814407 and parameters: {'gamma': 0.005279

objective started trial:1.


2024-01-01 08:59:19,161 [DEBUG  ] Num steps:50000 current_lr:0.00037118249628686947 new_lr:0.00035262337147252596
2024-01-01 09:01:45,950 [DEBUG  ] Num steps:100000 current_lr:0.000222240671725661 new_lr:0.00021112863813937794
2024-01-01 09:04:22,531 [DEBUG  ] Num steps:150000 current_lr:0.00013306369956330355 new_lr:0.00012641051458513837
2024-01-01 09:07:07,463 [DEBUG  ] Num steps:200000 current_lr:7.967015220026753e-05 new_lr:7.568664459025415e-05
2024-01-01 09:09:50,966 [DEBUG  ] Num steps:250000 current_lr:4.770146307704395e-05 new_lr:4.531638992319175e-05
2024-01-01 09:12:03,686 [DEBUG  ] Num steps:300000 current_lr:2.856062799994182e-05 new_lr:2.713259659994473e-05
2024-01-01 09:12:11,255 [INFO   ] objective: trial:1
Mean reward:132.43273125000002 std_reward:113.80482462081112 result:18.6279066291889
2024-01-01 09:12:11,266 [INFO   ] Saved best model to file:LunarLander-v2.model
[I 2024-01-01 09:12:11,267] Trial 1 finished with value: 18.6279066291889 and parameters: {'gamma': 0

objective started trial:2.


2024-01-01 09:14:31,400 [DEBUG  ] Num steps:50000 current_lr:9.524192518442121e-05 new_lr:9.047982892520015e-05
2024-01-01 09:17:30,046 [DEBUG  ] Num steps:100000 current_lr:5.7024858772091025e-05 new_lr:5.417361583348647e-05
2024-01-01 09:20:34,314 [DEBUG  ] Num steps:150000 current_lr:3.414288940170259e-05 new_lr:3.243574493161746e-05
2024-01-01 09:23:22,251 [DEBUG  ] Num steps:200000 current_lr:2.0442609097129886e-05 new_lr:1.942047864227339e-05
2024-01-01 09:25:58,078 [DEBUG  ] Num steps:250000 current_lr:1.2239745200862183e-05 new_lr:1.1627757940819074e-05
2024-01-01 09:28:38,157 [DEBUG  ] Num steps:300000 current_lr:7.3283875786218565e-06 new_lr:6.961968199690763e-06
2024-01-01 09:28:51,233 [INFO   ] objective: trial:2
Mean reward:154.44373775 std_reward:81.70893608548222 result:72.73480166451778
2024-01-01 09:28:51,246 [INFO   ] Saved best model to file:LunarLander-v2.model
[I 2024-01-01 09:28:51,249] Trial 2 finished with value: 72.73480166451778 and parameters: {'gamma': 0.001

objective started trial:3.


2024-01-01 09:31:11,515 [DEBUG  ] Num steps:50000 current_lr:0.00010888601291503123 new_lr:0.00010344171226927966
2024-01-01 09:34:08,869 [DEBUG  ] Num steps:100000 current_lr:6.519407809861636e-05 new_lr:6.193437419368554e-05
2024-01-01 09:37:19,366 [DEBUG  ] Num steps:150000 current_lr:3.903410277723337e-05 new_lr:3.70823976383717e-05
2024-01-01 09:40:24,352 [DEBUG  ] Num steps:200000 current_lr:2.3371159222757002e-05 new_lr:2.2202601261619152e-05
2024-01-01 09:43:29,237 [DEBUG  ] Num steps:250000 current_lr:1.3993176339486334e-05 new_lr:1.3293517522512017e-05
2024-01-01 09:46:18,223 [DEBUG  ] Num steps:300000 current_lr:8.378231571726947e-06 new_lr:7.959319993140598e-06
2024-01-01 09:46:38,721 [INFO   ] objective: trial:3
Mean reward:190.6948345 std_reward:89.18296582521033 result:101.51186867478968
2024-01-01 09:46:38,735 [INFO   ] Saved best model to file:LunarLander-v2.model
[I 2024-01-01 09:46:38,738] Trial 3 finished with value: 101.51186867478968 and parameters: {'gamma': 0.00

objective started trial:4.


2024-01-01 09:48:59,416 [DEBUG  ] Num steps:50000 current_lr:0.05024008508043113 new_lr:0.04772808082640957
2024-01-01 09:51:18,709 [DEBUG  ] Num steps:100000 current_lr:0.030080594768133066 new_lr:0.02857656502972641
[I 2024-01-01 09:51:18,710] Trial 4 pruned. 
2024-01-01 09:51:18,735 [INFO   ] sample_ppo_params() 5
Params:{'ent_coef': 0.01, 'n_steps': 64, 'gamma': 0.885997548922207, 'learning_rate': 0.003053843932869535, 'max_grad_norm': 9.84857440545547, 'n_epochs': 16, 'batch_size': 64}


objective started trial:5.


2024-01-01 09:54:19,114 [DEBUG  ] Num steps:50000 current_lr:0.0019246833360821033 new_lr:0.001828449169277998
2024-01-01 09:57:37,005 [DEBUG  ] Num steps:100000 current_lr:0.0011523790096489104 new_lr:0.0010947600591664649
2024-01-01 10:00:28,654 [DEBUG  ] Num steps:150000 current_lr:0.0006899718810797425 new_lr:0.0006554732870257553
[I 2024-01-01 10:00:28,655] Trial 5 pruned. 
2024-01-01 10:00:28,680 [INFO   ] sample_ppo_params() 6
Params:{'ent_coef': 0.01, 'n_steps': 8192, 'gamma': 0.9998797663819543, 'learning_rate': 0.0001228468921929742, 'max_grad_norm': 2.5433164699237056, 'n_epochs': 11, 'batch_size': 64}


objective started trial:6.


2024-01-01 10:02:37,216 [DEBUG  ] Num steps:50000 current_lr:7.742418129112467e-05 new_lr:7.355297222656843e-05
2024-01-01 10:04:56,117 [DEBUG  ] Num steps:100000 current_lr:4.635671732928532e-05 new_lr:4.4038881462821056e-05
[I 2024-01-01 10:04:56,118] Trial 6 pruned. 
2024-01-01 10:04:56,144 [INFO   ] sample_ppo_params() 7
Params:{'ent_coef': 0.01, 'n_steps': 128, 'gamma': 0.9834082151281215, 'learning_rate': 0.00178368113150837, 'max_grad_norm': 0.35929037256222, 'n_epochs': 13, 'batch_size': 64}


objective started trial:7.


2024-01-01 10:07:43,106 [DEBUG  ] Num steps:50000 current_lr:0.001124163980270073 new_lr:0.0010679557812565694
2024-01-01 10:10:36,211 [DEBUG  ] Num steps:100000 current_lr:0.0006730785007489366 new_lr:0.0006394245757114897
2024-01-01 10:13:27,877 [DEBUG  ] Num steps:150000 current_lr:0.0004029969614055751 new_lr:0.0003828471133352963
2024-01-01 10:16:33,161 [DEBUG  ] Num steps:200000 current_lr:0.000241289167194341 new_lr:0.00022922470883462393
2024-01-01 10:19:34,886 [DEBUG  ] Num steps:250000 current_lr:0.00014446873743731713 new_lr:0.00013724530056545128
2024-01-01 10:22:19,214 [DEBUG  ] Num steps:300000 current_lr:8.649876966885222e-05 new_lr:8.21738311854096e-05
2024-01-01 10:22:27,846 [INFO   ] objective: trial:7
Mean reward:11.504465375000002 std_reward:135.15946429080367 result:-123.65499891580367
[I 2024-01-01 10:22:27,849] Trial 7 finished with value: -123.65499891580367 and parameters: {'gamma': 0.01659178487187847, 'learning_rate_exp': -2.7486827818006527, 'max_grad_norm':

objective started trial:8.


2024-01-01 10:24:12,956 [DEBUG  ] Num steps:50000 current_lr:0.014914224436225103 new_lr:0.014168513214413847
2024-01-01 10:25:59,462 [DEBUG  ] Num steps:100000 current_lr:0.008929697090059651 new_lr:0.008483212235556668
[I 2024-01-01 10:25:59,464] Trial 8 pruned. 
2024-01-01 10:25:59,490 [INFO   ] sample_ppo_params() 9
Params:{'ent_coef': 0.01, 'n_steps': 4096, 'gamma': 0.9691463124036962, 'learning_rate': 0.00048030196933375707, 'max_grad_norm': 1.4462312789485925, 'n_epochs': 13, 'batch_size': 64}


objective started trial:9.


2024-01-01 10:28:24,255 [DEBUG  ] Num steps:50000 current_lr:0.0003027100326621677 new_lr:0.0002875745310290593
2024-01-01 10:31:39,522 [DEBUG  ] Num steps:100000 current_lr:0.00018124367843289592 new_lr:0.0001721814945112511
2024-01-01 10:34:57,618 [DEBUG  ] Num steps:150000 current_lr:0.00010851728528121704 new_lr:0.00010309142101715619
[I 2024-01-01 10:34:57,619] Trial 9 pruned. 
2024-01-01 10:34:57,623 [INFO   ] indx_trial:0 indx_guess:0: using guessed params.
2024-01-01 10:34:57,624 [INFO   ] sample_ppo_params() 10
Params:{'ent_coef': 0.01, 'n_steps': 4096, 'gamma': 0.9983382684783582, 'learning_rate': 0.002078923775963909, 'max_grad_norm': 0.3733377073391161, 'n_epochs': 5, 'batch_size': 64}


objective started trial:10.


2024-01-01 10:36:52,818 [DEBUG  ] Num steps:50000 current_lr:0.0013102404826637094 new_lr:0.001244728458530524
2024-01-01 10:39:22,393 [DEBUG  ] Num steps:100000 current_lr:0.0007844893762562853 new_lr:0.000745264907443471
2024-01-01 10:41:40,799 [DEBUG  ] Num steps:150000 current_lr:0.000469702768004713 new_lr:0.00044621762960447733
2024-01-01 10:43:48,589 [DEBUG  ] Num steps:200000 current_lr:0.0002812283976669362 new_lr:0.00026716697778358933
2024-01-01 10:45:44,066 [DEBUG  ] Num steps:250000 current_lr:0.00016838183004601494 new_lr:0.00015996273854371418
2024-01-01 10:47:39,980 [DEBUG  ] Num steps:300000 current_lr:0.00010081642154510783 new_lr:9.577560046785243e-05
2024-01-01 10:47:48,362 [INFO   ] objective: trial:10
Mean reward:261.104244375 std_reward:21.36077974698145 result:239.74346462801856
2024-01-01 10:47:48,378 [INFO   ] Saved best model to file:LunarLander-v2.model
[I 2024-01-01 10:47:48,383] Trial 10 finished with value: 239.74346462801856 and parameters: {}. Best is t

objective started trial:11.


2024-01-01 10:49:40,434 [DEBUG  ] Num steps:50000 current_lr:0.0011499434730113207 new_lr:0.0010924462993607547
2024-01-01 10:51:56,630 [DEBUG  ] Num steps:100000 current_lr:0.0006885136353279492 new_lr:0.0006540879535615517
2024-01-01 10:54:17,026 [DEBUG  ] Num steps:150000 current_lr:0.0004122385466401455 new_lr:0.00039162661930813825
2024-01-01 10:56:22,089 [DEBUG  ] Num steps:200000 current_lr:0.0002468224456513983 new_lr:0.00023448132336882836
2024-01-01 10:58:27,222 [DEBUG  ] Num steps:250000 current_lr:0.00014778171564464927 new_lr:0.0001403926298624168
2024-01-01 11:00:52,395 [DEBUG  ] Num steps:300000 current_lr:8.848237210047371e-05 new_lr:8.405825349545001e-05
2024-01-01 11:01:03,166 [INFO   ] objective: trial:11
Mean reward:236.09613437500002 std_reward:23.615464983534384 result:212.48066939146563
[I 2024-01-01 11:01:03,167] Trial 11 finished with value: 212.48066939146563 and parameters: {}. Best is trial 10 with value: 239.74346462801856.
2024-01-01 11:01:03,193 [INFO   ]

objective started trial:12.


2024-01-01 11:03:01,751 [DEBUG  ] Num steps:50000 current_lr:0.00771092500678526 new_lr:0.0073253787564459966
2024-01-01 11:05:07,764 [DEBUG  ] Num steps:100000 current_lr:0.00461681563725928 new_lr:0.004385974855396315
2024-01-01 11:07:25,028 [DEBUG  ] Num steps:150000 current_lr:0.0027642580636805052 new_lr:0.00262604516049648
2024-01-01 11:09:47,846 [DEBUG  ] Num steps:200000 current_lr:0.001655063412313073 new_lr:0.0015723102416974194
[I 2024-01-01 11:09:47,848] Trial 12 pruned. 
2024-01-01 11:09:47,873 [INFO   ] sample_ppo_params() 13
Params:{'ent_coef': 0.01, 'n_steps': 512, 'gamma': 0.9780240813846404, 'learning_rate': 0.07042914605671688, 'max_grad_norm': 0.30581906003766074, 'n_epochs': 5, 'batch_size': 64}


objective started trial:13.


2024-01-01 11:11:35,576 [DEBUG  ] Num steps:50000 current_lr:0.0443879277296541 new_lr:0.042168531343171396
2024-01-01 11:13:23,832 [DEBUG  ] Num steps:100000 current_lr:0.026576691987987455 new_lr:0.02524785738858808
[I 2024-01-01 11:13:23,838] Trial 13 pruned. 
2024-01-01 11:13:23,879 [INFO   ] sample_ppo_params() 14
Params:{'ent_coef': 0.01, 'n_steps': 64, 'gamma': 0.9950884133596364, 'learning_rate': 0.01333255801499181, 'max_grad_norm': 0.882705758055199, 'n_epochs': 5, 'batch_size': 64}


objective started trial:14.


2024-01-01 11:14:59,126 [DEBUG  ] Num steps:50000 current_lr:0.008402836819067694 new_lr:0.007982694978114309
2024-01-01 11:16:36,910 [DEBUG  ] Num steps:100000 current_lr:0.005031088797968144 new_lr:0.004779534358069736
[I 2024-01-01 11:16:36,911] Trial 14 pruned. 
2024-01-01 11:16:36,940 [INFO   ] sample_ppo_params() 15
Params:{'ent_coef': 0.01, 'n_steps': 2048, 'gamma': 0.9382530456038103, 'learning_rate': 0.001072883121540603, 'max_grad_norm': 2.633769169742853, 'n_epochs': 8, 'batch_size': 64}


objective started trial:15.


2024-01-01 11:19:14,065 [DEBUG  ] Num steps:50000 current_lr:0.000676183954054461 new_lr:0.0006423747563517379
2024-01-01 11:22:17,951 [DEBUG  ] Num steps:100000 current_lr:0.0004048563110126723 new_lr:0.00038461349546203865
[I 2024-01-01 11:22:17,954] Trial 15 pruned. 
2024-01-01 11:22:17,993 [INFO   ] sample_ppo_params() 16
Params:{'ent_coef': 0.01, 'n_steps': 256, 'gamma': 0.9992316651224186, 'learning_rate': 0.006802549124490447, 'max_grad_norm': 5.6745789912337345, 'n_epochs': 12, 'batch_size': 64}


objective started trial:16.


2024-01-01 11:24:43,927 [DEBUG  ] Num steps:50000 current_lr:0.004287302570332762 new_lr:0.004072937441816123
2024-01-01 11:27:05,661 [DEBUG  ] Num steps:100000 current_lr:0.002566966418549871 new_lr:0.0024386180976223772
2024-01-01 11:29:48,364 [DEBUG  ] Num steps:150000 current_lr:0.0015369376165702523 new_lr:0.0014600907357417395
2024-01-01 11:32:57,442 [DEBUG  ] Num steps:200000 current_lr:0.0009202213243456015 new_lr:0.0008742102581283214
2024-01-01 11:36:07,111 [DEBUG  ] Num steps:250000 current_lr:0.0005509704991605728 new_lr:0.0005234219742025442
[I 2024-01-01 11:36:07,114] Trial 16 pruned. 
2024-01-01 11:36:07,157 [INFO   ] sample_ppo_params() 17
Params:{'ent_coef': 0.01, 'n_steps': 512, 'gamma': 0.9887605802100042, 'learning_rate': 0.041062969203816975, 'max_grad_norm': 2.2138314268768258, 'n_epochs': 14, 'batch_size': 64}


objective started trial:17.


2024-01-01 11:38:13,255 [DEBUG  ] Num steps:50000 current_lr:0.02587991210224545 new_lr:0.024585916497133178
2024-01-01 11:40:20,228 [DEBUG  ] Num steps:100000 current_lr:0.015495259359856711 new_lr:0.014720496391863874
[I 2024-01-01 11:40:20,230] Trial 17 pruned. 
2024-01-01 11:40:20,256 [INFO   ] sample_ppo_params() 18
Params:{'ent_coef': 0.01, 'n_steps': 2048, 'gamma': 0.997517902659764, 'learning_rate': 0.00044934551152521056, 'max_grad_norm': 0.48990024871458937, 'n_epochs': 7, 'batch_size': 64}


objective started trial:18.


2024-01-01 11:42:32,831 [DEBUG  ] Num steps:50000 current_lr:0.00028319974340116656 new_lr:0.0002690397562311082
2024-01-01 11:45:03,579 [DEBUG  ] Num steps:100000 current_lr:0.00016956214755710868 new_lr:0.00016108404017925324
2024-01-01 11:47:21,169 [DEBUG  ] Num steps:150000 current_lr:0.0001015231212390296 new_lr:9.644696517707812e-05
2024-01-01 11:49:31,889 [DEBUG  ] Num steps:200000 current_lr:6.0785642872583416e-05 new_lr:5.774636072895424e-05
2024-01-01 11:51:40,394 [DEBUG  ] Num steps:250000 current_lr:3.639460976316776e-05 new_lr:3.457487927500937e-05
2024-01-01 11:53:45,629 [DEBUG  ] Num steps:300000 current_lr:2.179079725437428e-05 new_lr:2.0701257391655567e-05
2024-01-01 11:53:51,988 [INFO   ] objective: trial:18
Mean reward:179.254605 std_reward:105.82060098979024 result:73.43400401020976
[I 2024-01-01 11:53:51,989] Trial 18 finished with value: 73.43400401020976 and parameters: {'gamma': 0.0024820973402359973, 'learning_rate_exp': -3.3474195920829857, 'max_grad_norm': 0.

objective started trial:19.


2024-01-01 11:55:55,724 [DEBUG  ] Num steps:50000 current_lr:0.0009691436176313144 new_lr:0.0009206864367497487
2024-01-01 11:57:41,956 [DEBUG  ] Num steps:100000 current_lr:0.0005802620833029829 new_lr:0.0005512489791378337
2024-01-01 11:59:53,102 [DEBUG  ] Num steps:150000 current_lr:0.0003474243437129131 new_lr:0.0003300531265272674
2024-01-01 12:02:32,840 [DEBUG  ] Num steps:200000 current_lr:0.00020801578817157204 new_lr:0.00019761499876299343
[I 2024-01-01 12:02:32,842] Trial 19 pruned. 
2024-01-01 12:02:32,867 [INFO   ] sample_ppo_params() 20
Params:{'ent_coef': 0.01, 'n_steps': 4096, 'gamma': 0.9529737362916497, 'learning_rate': 0.00028373088354208026, 'max_grad_norm': 4.000835744169026, 'n_epochs': 10, 'batch_size': 64}


objective started trial:20.


2024-01-01 12:04:43,646 [DEBUG  ] Num steps:50000 current_lr:0.00017882122187303791 new_lr:0.000169880160779386
2024-01-01 12:07:44,062 [DEBUG  ] Num steps:100000 current_lr:0.0001070668710551297 new_lr:0.00010171352750237321
[I 2024-01-01 12:07:44,064] Trial 20 pruned. 
2024-01-01 12:07:44,090 [INFO   ] sample_ppo_params() 21
Params:{'ent_coef': 0.01, 'n_steps': 4096, 'gamma': 0.999609498841779, 'learning_rate': 0.0001742992017636454, 'max_grad_norm': 7.501070844118849, 'n_epochs': 15, 'batch_size': 64}


objective started trial:21.


2024-01-01 12:10:07,645 [DEBUG  ] Num steps:50000 current_lr:0.00010985196902700808 new_lr:0.00010435937057565767
2024-01-01 12:13:06,462 [DEBUG  ] Num steps:100000 current_lr:6.577243170453997e-05 new_lr:6.248381011931297e-05
2024-01-01 12:15:55,620 [DEBUG  ] Num steps:150000 current_lr:3.9380384445041556e-05 new_lr:3.7411365222789475e-05
2024-01-01 12:18:29,994 [DEBUG  ] Num steps:200000 current_lr:2.357849084865484e-05 new_lr:2.2399566306222097e-05
2024-01-01 12:21:06,844 [DEBUG  ] Num steps:250000 current_lr:1.411731344258372e-05 new_lr:1.3411447770454533e-05
2024-01-01 12:23:51,545 [DEBUG  ] Num steps:300000 current_lr:8.452557040881393e-06 new_lr:8.029929188837324e-06
2024-01-01 12:24:04,424 [INFO   ] objective: trial:21
Mean reward:254.1390525 std_reward:19.502405953956803 result:234.63664654604318
[I 2024-01-01 12:24:04,426] Trial 21 finished with value: 234.63664654604318 and parameters: {'gamma': 0.0003905011582210037, 'learning_rate_exp': -3.758704601819656, 'max_grad_norm':

objective started trial:22.


2024-01-01 12:26:44,261 [DEBUG  ] Num steps:50000 current_lr:0.0005320924350190845 new_lr:0.0005054878132681302
2024-01-01 12:29:33,222 [DEBUG  ] Num steps:100000 current_lr:0.00031858339593522254 new_lr:0.0003026542261384614
2024-01-01 12:32:15,155 [DEBUG  ] Num steps:150000 current_lr:0.00019074764737442363 new_lr:0.00018121026500570245
2024-01-01 12:34:59,309 [DEBUG  ] Num steps:200000 current_lr:0.00011420766255588396 new_lr:0.00010849727942808976
2024-01-01 12:37:41,715 [DEBUG  ] Num steps:250000 current_lr:6.838034631627956e-05 new_lr:6.496132900046557e-05
2024-01-01 12:40:20,667 [DEBUG  ] Num steps:300000 current_lr:4.0941839257469564e-05 new_lr:3.889474729459608e-05
2024-01-01 12:40:35,563 [INFO   ] objective: trial:22
Mean reward:215.149437875 std_reward:53.2343373591899 result:161.91510051581008
[I 2024-01-01 12:40:35,565] Trial 22 finished with value: 161.91510051581008 and parameters: {'gamma': 0.0016243900310192799, 'learning_rate_exp': -3.0735253631774864, 'max_grad_norm'

objective started trial:23.


2024-01-01 12:43:02,911 [DEBUG  ] Num steps:50000 current_lr:0.001993329476107443 new_lr:0.0018936630023020705
2024-01-01 12:45:39,348 [DEBUG  ] Num steps:100000 current_lr:0.0011934799894182109 new_lr:0.0011338059899473002
2024-01-01 12:48:08,812 [DEBUG  ] Num steps:150000 current_lr:0.0007145805559065118 new_lr:0.0006788515281111862
[I 2024-01-01 12:48:08,814] Trial 23 pruned. 
2024-01-01 12:48:08,840 [INFO   ] sample_ppo_params() 24
Params:{'ent_coef': 0.01, 'n_steps': 4096, 'gamma': 0.990154232513451, 'learning_rate': 0.0002501465694649141, 'max_grad_norm': 3.678972182557771, 'n_epochs': 9, 'batch_size': 64}


objective started trial:24.


2024-01-01 12:50:10,440 [DEBUG  ] Num steps:50000 current_lr:0.00015765472774989804 new_lr:0.00014977199136240313
2024-01-01 12:52:48,249 [DEBUG  ] Num steps:100000 current_lr:9.439370914943381e-05 new_lr:8.967402369196211e-05
[I 2024-01-01 12:52:48,251] Trial 24 pruned. 
2024-01-01 12:52:48,254 [INFO   ] Saved the study to the file: /content/opt_lunar_lander_study.pkl
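
The snapshot logic of this section (load the pickle if it exists, otherwise start fresh, then overwrite it at the end) can be sketched in isolation. This is a minimal standard-library sketch, not the notebook's code: the helper names are hypothetical, and writing to a temporary file before `os.replace` is an added precaution so that an interrupted run does not leave a truncated snapshot.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def load_or_create(path: str, factory: Callable[[], Any]) -> Any:
    """Load a pickled object from path, or build a fresh one if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as fp:
            return pickle.load(fp)
    return factory()


def save_atomically(obj: Any, path: str) -> None:
    """Pickle obj to a temporary file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fp:
            pickle.dump(obj, fp)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        # Clean up the partial temporary file before propagating the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Stand-in for a study: any picklable object works for this illustration.
snapshot_path = os.path.join(tempfile.mkdtemp(), "demo_study.pkl")
state = load_or_create(snapshot_path, dict)   # first run: empty dict
state["n_finished_trials"] = 25
save_atomically(state, snapshot_path)
print(load_or_create(snapshot_path, dict))    # second run: restored from disk
```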


In [15]:
# Write report
study.trials_dataframe().to_csv("study_results_ppo_lunar_lander.csv")



In [16]:
fig1 = plot_optimization_history(study)
fig2 = plot_param_importances(study)
fig3 = plot_intermediate_values(study)

fig1.show()
fig2.show()
fig3.show()



In [17]:
with open(SAVED_PARAMS_FILE, 'w') as fp:
    json.dump(study.best_params, fp)





In [18]:
study.best_params

{}

In [19]:
if len(study.trials)>0:
    trial = study.best_trial
    print(f'Best trial: {trial.number} Value: {trial.value}')

    print("  Params: ")
    for key, value in study.best_params.items():
        print(f"    {key}: {value}")

Best trial: 10 Value: 239.74346462801856
  Params: 
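
The parameter list is empty because the best trial (number 10) used guessed hyperparameters: that branch of `sample_ppo_params` calls no `trial.suggest_*` method, so Optuna records nothing for it. One possible remedy, sketched under assumptions, is to express each guess in the coordinates of the search space and hand it to Optuna's `study.enqueue_trial`, so that the `suggest_*` calls of a later trial return those values. The helper name `guess_to_search_space` is hypothetical, and the numbers are the first entry of `guess_hyperparams`.

```python
import math
from typing import Any, Dict


def guess_to_search_space(gamma: float, learning_rate: float, max_grad_norm: float,
                          n_epochs: int, n_steps: int) -> Dict[str, Any]:
    """Express PPO hyperparameters in the coordinates sampled by sample_ppo_params."""
    return {
        "gamma": 1.0 - gamma,                           # the sampler draws 1 - gamma
        "learning_rate_exp": math.log10(learning_rate),
        "max_grad_norm": max_grad_norm,
        "n_epochs": n_epochs,
        "exponent_n_steps": int(round(math.log2(n_steps))),
    }


raw_guess = guess_to_search_space(
    gamma=1.0 - 0.0016617315216418454,
    learning_rate=math.pow(10.0, -2.682161433845602),
    max_grad_norm=0.3733377073391161,
    n_epochs=5,
    n_steps=2 ** 12,
)
print(raw_guess)
# With a study at hand, the guess could then be queued before optimizing:
# study.enqueue_trial(raw_guess)
```

With this approach the guessed values would pass through the regular `suggest_*` calls and therefore show up in `study.best_params`.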
