## Hyperparameter tuning

In [39]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#### Hyperparameters

Recall our discussion of RLlib configs in Module 2:

In [1]:
from ray.rllib.algorithms.ppo import PPOConfig

We wrote code like this:

In [26]:
ppo_config = (
    PPOConfig()
    .framework("torch")
    .rollouts(create_env_on_local_worker=True)
    .debugging(seed=0, log_level="ERROR")
)

However, we are only setting a tiny fraction of the config options available in RLlib.

#### Embarrassment of configs

Here is the full list:

In [27]:
len(ppo_config.to_dict())

125

In [28]:
ppo_config.to_dict()

{'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'num_gpus': 0,
 'num_cpus_per_worker': 1,
 'num_gpus_per_worker': 0,
 '_fake_gpus': False,
 'custom_resources_per_worker': {},
 'placement_strategy': 'PACK',
 'eager_tracing': False,
 'eager_max_retraces': 20,
 'tf_session_args': {'intra_op_parallelism_threads': 2,
  'inter_op_parallelism_threads': 2,
  'gpu_options': {'allow_growth': True},
  'log_device_placement': False,
  'device_count': {'CPU': 1},
  'allow_soft_placement': True},
 'local_tf_session_args': {'intra_op_parallelism_threads': 8,
  'inter_op_parallelism_threads': 8},
 'env': None,
 'env_config': {},
 'observation_space': None,
 'action_space': None,
 'env_task_fn': None,
 'render_env': False,
 'clip_rewards': None,
 'normalize_actions': True,
 'clip_actions': False,
 'disable_env_checking': False,
 'num_workers': 2,
 'num_envs_per_worker': 1,
 'sample_collector': ray.rllib.evaluation.collectors.simple_list_collector.SimpleListCollector,
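With this many options, it can help to filter the dictionary by key rather than read it top to bottom. The sketch below uses a small hand-copied subset of the output above as a stand-in for `ppo_config.to_dict()`; the helper `filter_config` is a hypothetical name, not part of RLlib.

```python
# A hand-copied subset of the output above, standing in for ppo_config.to_dict().
config_subset = {
    "num_gpus": 0,
    "num_cpus_per_worker": 1,
    "num_gpus_per_worker": 0,
    "custom_resources_per_worker": {},
    "env": None,
    "env_config": {},
    "num_workers": 2,
    "num_envs_per_worker": 1,
}


def filter_config(config, substring):
    """Return the entries whose key contains `substring` (hypothetical helper)."""
    return {key: value for key, value in config.items() if substring in key}


# Keep only the options that mention workers.
worker_options = filter_config(config_subset, "worker")
for key, value in worker_options.items():
    print(f"{key} = {value}")
```

The same call with the real dictionary would let you narrow the 125 entries down to the handful relevant to one topic.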


#### Key hyperparameters

We recommend focusing on the following key hyperparameters:

- `lr`
- `train_batch_size`
- `sgd_minibatch_size`
- `num_sgd_iter`
- `entropy_coeff`
- model architecture

#### Key hyperparameter interpretations

Let's look at the definitions of these key hyperparameters:

- `lr`: learning rate
- `train_batch_size`: number of timesteps collected and batched together for each training iteration
- `sgd_minibatch_size`: minibatch size for SGD
- `num_sgd_iter`: epochs of SGD per iteration of PPO
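- `entropy_coeff`: coefficient of the entropy bonus in the loss; higher values encourage more exploration during training
- model architecture: the structure of the policy network, such as layer sizes and activations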

Notes:

If you are not familiar with deep learning, most of these hyperparameters will not make sense yet. That is fine; you can skip this part.
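To make the role of `entropy_coeff` more concrete: PPO's loss includes an entropy bonus, which is the entropy of the policy's action distribution multiplied by `entropy_coeff`. The following is a minimal sketch in plain Python, not RLlib's implementation; the two example distributions and the coefficient value are made up for illustration.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Two hypothetical policies over four actions.
uniform_policy = [0.25, 0.25, 0.25, 0.25]  # tries all actions equally often
peaked_policy = [0.97, 0.01, 0.01, 0.01]   # almost always picks one action

entropy_coeff = 0.01  # illustrative value, not a recommendation

for name, probs in [("uniform", uniform_policy), ("peaked", peaked_policy)]:
    entropy = categorical_entropy(probs)
    bonus = entropy_coeff * entropy
    print(f"{name}: entropy={entropy:.3f}, entropy bonus={bonus:.5f}")
```

The uniform policy has the higher entropy, so a positive `entropy_coeff` rewards it more. That is the sense in which the coefficient encourages exploration.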

#### Key hyperparameter defaults

Let's look at the defaults of these key hyperparameters:

In [29]:
ppo_config.lr

5e-05

In [30]:
ppo_config.train_batch_size

4000

In [31]:
ppo_config.sgd_minibatch_size

128

In [32]:
ppo_config.num_sgd_iter

30

In [33]:
ppo_config.entropy_coeff

0.0
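These defaults interact: each PPO iteration collects `train_batch_size` timesteps, then makes `num_sgd_iter` passes over them in minibatches of `sgd_minibatch_size`. The arithmetic below gives a rough count of gradient updates per iteration. It assumes only full minibatches are used; how RLlib treats leftover samples is not covered here.

```python
# Defaults shown above.
train_batch_size = 4000
sgd_minibatch_size = 128
num_sgd_iter = 30

# Full minibatches in one pass (epoch) over the training batch.
minibatches_per_epoch = train_batch_size // sgd_minibatch_size

# Approximate number of gradient updates in one PPO iteration.
updates_per_iteration = num_sgd_iter * minibatches_per_epoch

print(minibatches_per_epoch)   # 31
print(updates_per_iteration)   # 930
```

Because the three values multiply together, changing one of them shifts both the training time per iteration and how hard each batch of experience is reused.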

#### Key hyperparameter defaults: model architecture


In [34]:
ppo_config.model

{'_use_default_native_models': False,
 '_disable_preprocessor_api': False,
 '_disable_action_flattening': False,
 'fcnet_hiddens': [256, 256],
 'fcnet_activation': 'tanh',
 'conv_filters': None,
 'conv_activation': 'relu',
 'post_fcnet_hiddens': [],
 'post_fcnet_activation': 'relu',
 'free_log_std': False,
 'no_final_linear': False,
 'vf_share_layers': False,
 'use_lstm': False,
 'max_seq_len': 20,
 'lstm_cell_size': 256,
 'lstm_use_prev_action': False,
 'lstm_use_prev_reward': False,
 '_time_major': False,
 'use_attention': False,
 'attention_num_transformer_units': 1,
 'attention_dim': 64,
 'attention_num_heads': 1,
 'attention_head_dim': 32,
 'attention_memory_inference': 50,
 'attention_memory_training': 50,
 'attention_position_wise_mlp_dim': 32,
 'attention_init_gru_gate_bias': 2.0,
 'attention_use_n_prev_actions': 0,
 'attention_use_n_prev_rewards': 0,
 'framestack': True,
 'dim': 84,
 'grayscale': False,
 'zero_mean': True,
 'custom_model': None,
 'custom_model_config': {},
 'c
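The entry to notice is `fcnet_hiddens`: by default the network is fully connected with two hidden layers of 256 units each. For a rough sense of scale, the sketch below counts the weights and biases of one such network. The input size of 16 and output size of 4 are assumptions chosen for illustration, and `count_mlp_parameters` is a hypothetical helper, not an RLlib function.

```python
def count_mlp_parameters(input_size, hidden_sizes, output_size):
    """Count weights and biases in a fully connected network."""
    layer_sizes = [input_size, *hidden_sizes, output_size]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix
        total += fan_out           # bias vector
    return total


fcnet_hiddens = [256, 256]  # default from the output above

# Assumed sizes: 16 input features and 4 action logits.
num_params = count_mlp_parameters(16, fcnet_hiddens, 4)
print(num_params)  # 71172
```

Most of the parameters sit in the hidden-to-hidden weight matrix, so widening the hidden layers grows the network much faster than changing the input or output size.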

#### How do we tune?

- Do we resort to tuning by hand?
- No!!!

#### Introducing Ray Tune

![](img/rllib_and_tune.png)

In [35]:
from ray import tune

#### Tune usage

- Tune is its own sub-package of Ray, like RLlib
- It is a sophisticated library with its own [documentation](https://docs.ray.io/en/latest/tune/index.html)
- For our purposes, we will focus on this syntax:

Instead of

In [36]:
ppo_config = ppo_config.training(
    lr=1e-4
)

we do

In [37]:
ppo_config = ppo_config.training(
    lr=tune.grid_search([1e-4, 5e-5])
)

We're setting up `tune` to automatically sweep these values!
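Conceptually, a grid search turns every listed value into its own trial, and with several grid-searched options it takes the Cartesian product of their values. The sketch below illustrates that idea in plain Python. It is not how Tune is implemented, and `expand_grid` and the `entropy_coeff` candidate values are hypothetical.

```python
from itertools import product


def expand_grid(search_space):
    """Return one config dict per combination of candidate values."""
    names = list(search_space)
    combos = product(*(search_space[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# The lr values match the sweep above; the entropy_coeff values are made up.
search_space = {
    "lr": [1e-4, 5e-5],
    "entropy_coeff": [0.0, 0.01],
}

trial_configs = expand_grid(search_space)
for trial_config in trial_configs:
    print(trial_config)

print(len(trial_configs))  # 4
```

The number of trials multiplies with every option added, so grid search is best kept to a few options with a few values each.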

#### Running the sweep

In [76]:
ppo_config = ppo_config.environment(env="FrozenLake-v1")

analysis = tune.run(
    "PPO",
    config=ppo_config.to_dict(),
    stop={"training_iteration": 5},
    checkpoint_freq=1,
    verbose=0,
    metric="episode_reward_mean",
    mode="max",
)

(PPO pid=6445)   if LooseVersion(torch.__version__) >= LooseVersion("1.8.0"):
(PPO pid=6445) 2022-07-30 19:16:27,149	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=6445) 2022-07-30 19:16:27,149	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(PPO pid=6444)   if LooseVersion(torch.__version__) >= LooseVersion("1.8.0"):
(PPO pid=6444) 2022-07-30 19:16:27,149	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=6444) 2022-07-30 19:16:27,149	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use th

In [77]:
analysis.best_config["lr"]

0.0001
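The `metric` and `mode` arguments determine what "best" means here. The sketch below mimics that selection in plain Python over made-up final results for two trials. The reward numbers are hypothetical, not taken from the run above, and `pick_best` is not a Tune function.

```python
def pick_best(trial_results, metric, mode):
    """Return the trial whose `metric` is largest ("max") or smallest ("min")."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    chooser = max if mode == "max" else min
    return chooser(trial_results, key=lambda trial: trial[metric])


# Hypothetical final results for a two-trial sweep over lr.
trial_results = [
    {"config": {"lr": 1e-4}, "episode_reward_mean": 0.05},
    {"config": {"lr": 5e-5}, "episode_reward_mean": 0.02},
]

best = pick_best(trial_results, metric="episode_reward_mean", mode="max")
print(best["config"]["lr"])  # 0.0001
```

With `mode="min"`, the same call would return the other trial, which is what you would want for a metric such as a loss.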

In [78]:
analysis.get_best_logdir("episode_reward_mean", mode="max")

'/Users/mike/ray_results/PPO/PPO_FrozenLake-v1_c6bf1_00000_0_lr=0.0001_2022-07-30_19-16-24'

In [79]:
analysis.best_logdir

'/Users/mike/ray_results/PPO/PPO_FrozenLake-v1_c6bf1_00000_0_lr=0.0001_2022-07-30_19-16-24'

In [80]:
analysis.get_best_checkpoint("episode_reward_mean", mode="max")

KeyError: 'episode_reward_mean'

In [84]:
trials = analysis.trials
trials

[PPO_FrozenLake-v1_c6bf1_00000, PPO_FrozenLake-v1_c6bf1_00001]

In [83]:
best_checkpoint = analysis.get_best_checkpoint(trial=trials[0], metric="episode_reward_mean", mode="max")


RuntimeError: Cannot create checkpoint from URI as it is not supported: None

In [72]:
analysis.best_checkpoint

RuntimeError: Cannot create checkpoint from URI as it is not supported: None

Gotcha? Tune checkpoints vs. RLlib checkpoints?

Notes:

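- `config`: the algorithm configuration, passed as a dictionary
- `stop`: the stopping condition. Other possibilities are the total number of timesteps or reaching a certain reward value.
- `checkpoint_at_end`: whether to save a checkpoint when a trial finishes
- `verbose`: how much output Tune prints, from 0 (silent) upward
- `metric`: the reported result used to compare trials, such as `episode_reward_mean`
- `mode`: whether `metric` should be maximized (`"max"`) or minimized (`"min"`)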
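A `stop` dictionary can hold several conditions at once, and a trial ends as soon as any one of them is reached. The sketch below illustrates that rule in plain Python. `should_stop` and the threshold and result values are hypothetical; this is not Tune's implementation.

```python
def should_stop(result, stop):
    """Return True once any metric in `stop` has reached its threshold."""
    return any(
        metric in result and result[metric] >= threshold
        for metric, threshold in stop.items()
    )


# Stop after 10 iterations, 100,000 timesteps, or a mean reward of 0.8,
# whichever comes first (illustrative thresholds).
stop = {
    "training_iteration": 10,
    "timesteps_total": 100_000,
    "episode_reward_mean": 0.8,
}

# Made-up results reported by a trial at two points in training.
early_result = {
    "training_iteration": 3,
    "timesteps_total": 12_000,
    "episode_reward_mean": 0.1,
}
solved_result = {
    "training_iteration": 6,
    "timesteps_total": 24_000,
    "episode_reward_mean": 0.85,
}

print(should_stop(early_result, stop))   # False
print(should_stop(solved_result, stop))  # True
```

Combining an iteration cap with a reward target is a common pattern: the target ends successful trials early, and the cap bounds the cost of unsuccessful ones.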

In [5]:
tune.run("PPO", config=tune_config, stop={"training_iteration": 10}, checkpoint_at_end=True, verbose=2, local_dir=".")

(PPOTrainer pid=63452) 2022-07-14 09:51:34,080	INFO ppo.py:249 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPOTrainer pid=63452) 2022-07-14 09:51:34,090	INFO trainer.py:779 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


Trial PPO_BasicRecommender_36008_00000 reported episode_reward_max=26.229951468536488,episode_reward_min=24.437207722145967,episode_reward_mean=25.367144631882343,episode_len_mean=100.0,episode_media={},episodes_this_iter=40,policy_reward_min={},policy_reward_max={},policy_reward_mean={},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.03270105622161454, 'mean_inference_ms': 0.25230732516966003, 'mean_action_processing_ms': 0.015700894078869987, 'mean_env_wait_ms': 0.014201394919453113, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=2,timesteps_this_iter=4000,agent_timesteps_total=4000,timers={'sample_time_ms': 3011.171, 'sample_throughput': 1328.387, 'load_time_ms': 6.652, 'load_throughput': 601333.907, 'learn_time_ms': 1587.199, 'learn_throughput': 2520.163, 'update_time_ms': 2.732},info={'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'cur_kl_coeff': 0.20000000000000004, 'cur_lr': 0.0010000000000000005, 'total_loss': 

Trial PPO_BasicRecommender_36008_00000 reported episode_reward_max=26.52690953227332,episode_reward_min=24.093393864945554,episode_reward_mean=25.429153526172882,episode_len_mean=100.0,episode_media={},episodes_this_iter=40,policy_reward_min={},policy_reward_max={},policy_reward_mean={},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.036330457471485056, 'mean_inference_ms': 0.2579944025246475, 'mean_action_processing_ms': 0.015710399918824974, 'mean_env_wait_ms': 0.01514268198918096, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=2,timesteps_this_iter=4000,agent_timesteps_total=16000,timers={'sample_time_ms': 2459.617, 'sample_throughput': 1626.269, 'load_time_ms': 1.859, 'load_throughput': 2151752.725, 'learn_time_ms': 1446.732, 'learn_throughput': 2764.852, 'update_time_ms': 3.082},info={'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'cur_kl_coeff': 0.20000000000000004, 'cur_lr': 0.0010000000000000005, 'total_loss': 

Trial PPO_BasicRecommender_36008_00000 reported episode_reward_max=26.771151646915843,episode_reward_min=24.011689111219024,episode_reward_mean=25.575060131062806,episode_len_mean=100.0,episode_media={},episodes_this_iter=40,policy_reward_min={},policy_reward_max={},policy_reward_mean={},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.0378616620354775, 'mean_inference_ms': 0.263690299528818, 'mean_action_processing_ms': 0.016213090765989046, 'mean_env_wait_ms': 0.015341564616789033, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=2,timesteps_this_iter=4000,agent_timesteps_total=28000,timers={'sample_time_ms': 2288.985, 'sample_throughput': 1747.5, 'load_time_ms': 1.285, 'load_throughput': 3112573.534, 'learn_time_ms': 1380.818, 'learn_throughput': 2896.834, 'update_time_ms': 2.364},info={'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'cur_kl_coeff': 0.10000000000000002, 'cur_lr': 0.0010000000000000005, 'total_loss': 20.

Trial PPO_BasicRecommender_36008_00000 reported episode_reward_max=26.6548669958362,episode_reward_min=24.056177303270747,episode_reward_mean=25.540941366961775,episode_len_mean=100.0,episode_media={},episodes_this_iter=40,policy_reward_min={},policy_reward_max={},policy_reward_mean={},custom_metrics={},sampler_perf={'mean_raw_obs_processing_ms': 0.036447590817010475, 'mean_inference_ms': 0.25664676157687394, 'mean_action_processing_ms': 0.015811126708607494, 'mean_env_wait_ms': 0.014775917683812385, 'mean_env_render_ms': 0.0},off_policy_estimator={},num_healthy_workers=2,timesteps_this_iter=4000,agent_timesteps_total=40000,timers={'sample_time_ms': 2217.21, 'sample_throughput': 1804.069, 'load_time_ms': 1.0, 'load_throughput': 4000862.307, 'learn_time_ms': 1385.19, 'learn_throughput': 2887.692, 'update_time_ms': 2.919},info={'learner': {'default_policy': {'learner_stats': {'allreduce_latency': 0.0, 'cur_kl_coeff': 0.10000000000000002, 'cur_lr': 0.0010000000000000005, 'total_loss': 20.

| Trial name | status | loc | iter | total time (s) | ts | reward | episode_reward_max | episode_reward_min | episode_len_mean |
|---|---|---|---|---|---|---|---|---|---|
| PPO_BasicRecommender_36008_00000 | TERMINATED | 127.0.0.1:63452 | 10 | 23.1234 | 40000 | 25.5409 | 26.6549 | 24.0562 | 100 |


2022-07-14 09:51:58,607	INFO tune.py:639 -- Total run time: 27.73 seconds (26.82 seconds for the tuning loop).


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x126f5e7f0>