## Hyperparameter tuning

In [39]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

#### Hyperparameters

Recall our discussion of RLlib config files in Module 2:

In [1]:
from ray.rllib.algorithms.ppo import PPOConfig

We wrote code like this:

In [26]:
ppo_config = (
    PPOConfig()
    .framework("torch")
    .rollouts(create_env_on_local_worker=True)
    .debugging(seed=0, log_level="ERROR")
)

However, we are only setting a tiny fraction of the config options available in RLlib.

#### Embarrassment of configs

Here is the full list:

In [27]:
len(ppo_config.to_dict())

125

In [28]:
ppo_config.to_dict()

{'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 'num_gpus': 0,
 'num_cpus_per_worker': 1,
 'num_gpus_per_worker': 0,
 '_fake_gpus': False,
 'custom_resources_per_worker': {},
 'placement_strategy': 'PACK',
 'eager_tracing': False,
 'eager_max_retraces': 20,
 'tf_session_args': {'intra_op_parallelism_threads': 2,
  'inter_op_parallelism_threads': 2,
  'gpu_options': {'allow_growth': True},
  'log_device_placement': False,
  'device_count': {'CPU': 1},
  'allow_soft_placement': True},
 'local_tf_session_args': {'intra_op_parallelism_threads': 8,
  'inter_op_parallelism_threads': 8},
 'env': None,
 'env_config': {},
 'observation_space': None,
 'action_space': None,
 'env_task_fn': None,
 'render_env': False,
 'clip_rewards': None,
 'normalize_actions': True,
 'clip_actions': False,
 'disable_env_checking': False,
 'num_workers': 2,
 'num_envs_per_worker': 1,
 'sample_collector': ray.rllib.evaluation.collectors.simple_list_collector.SimpleListCollector,


#### Key hyperparameters

We recommend focusing on the following key hyperparameters:

- `lr`
- `train_batch_size`
- `sgd_minibatch_size`
- `num_sgd_iter`
- `entropy_coeff`
- model architecture

#### Key hyperparameter interpretations

Let's look at the definitions of these key hyperparameters:

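- `lr`: the learning rate of the optimizer
- `train_batch_size`: the number of environment timesteps collected and batched together for each training iteration
- `sgd_minibatch_size`: the size of the minibatches that the training batch is split into for SGD
- `num_sgd_iter`: the number of SGD epochs (passes over the training batch) per iteration of PPO
- `entropy_coeff`: the weight of the entropy bonus in the loss; larger values encourage more exploration during training
- model architecture: the architecture of the policy network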
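To make the role of `entropy_coeff` more concrete, here is a minimal sketch, in plain Python rather than RLlib code, of the entropy of a discrete action distribution. PPO-style losses typically include an entropy bonus weighted by this coefficient, so a larger coefficient rewards policies that keep their action probabilities spread out. The function name `policy_entropy`, the example probabilities, and the coefficient value are illustrative assumptions.

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0)

# Hypothetical action distributions over 4 actions.
uniform_policy = [0.25, 0.25, 0.25, 0.25]  # tries all actions equally often
greedy_policy = [0.97, 0.01, 0.01, 0.01]   # almost always picks one action

entropy_coeff = 0.01  # illustrative value only, not a recommendation

for name, probs in [("uniform", uniform_policy), ("greedy", greedy_policy)]:
    h = policy_entropy(probs)
    print(f"{name}: entropy={h:.3f}, entropy bonus={entropy_coeff * h:.5f}")
```

The uniform policy has the higher entropy, so it receives the larger bonus; with a coefficient of zero the bonus vanishes for both.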

Notes:

If you are not familiar with deep learning, most of these hyperparameters will not make sense yet. That is fine; you can skip this part.

#### Key hyperparameter defaults

Let's look at the defaults of these key hyperparameters:

In [29]:
ppo_config.lr

5e-05

In [30]:
ppo_config.train_batch_size

4000

In [31]:
ppo_config.sgd_minibatch_size

128

In [32]:
ppo_config.num_sgd_iter

30

In [33]:
ppo_config.entropy_coeff

0.0
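As a rough sanity check on how the three batch-related defaults above fit together, the arithmetic below estimates how many minibatch gradient updates one PPO training iteration performs. This is a back-of-the-envelope sketch: how a trailing partial minibatch is handled is an assumption here and may differ from RLlib's internals.

```python
import math

# Defaults reported above for PPOConfig.
train_batch_size = 4000
sgd_minibatch_size = 128
num_sgd_iter = 30

# Assumption: a trailing partial minibatch counts as one update.
minibatches_per_epoch = math.ceil(train_batch_size / sgd_minibatch_size)
updates_per_iteration = minibatches_per_epoch * num_sgd_iter

print(minibatches_per_epoch)
print(updates_per_iteration)
```

Under this assumption, each training iteration makes on the order of a thousand gradient updates on the same 4000 timesteps of data, which is why these three settings are usually tuned together.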

#### Key hyperparameter defaults


In [34]:
ppo_config.model

{'_use_default_native_models': False,
 '_disable_preprocessor_api': False,
 '_disable_action_flattening': False,
 'fcnet_hiddens': [256, 256],
 'fcnet_activation': 'tanh',
 'conv_filters': None,
 'conv_activation': 'relu',
 'post_fcnet_hiddens': [],
 'post_fcnet_activation': 'relu',
 'free_log_std': False,
 'no_final_linear': False,
 'vf_share_layers': False,
 'use_lstm': False,
 'max_seq_len': 20,
 'lstm_cell_size': 256,
 'lstm_use_prev_action': False,
 'lstm_use_prev_reward': False,
 '_time_major': False,
 'use_attention': False,
 'attention_num_transformer_units': 1,
 'attention_dim': 64,
 'attention_num_heads': 1,
 'attention_head_dim': 32,
 'attention_memory_inference': 50,
 'attention_memory_training': 50,
 'attention_position_wise_mlp_dim': 32,
 'attention_init_gru_gate_bias': 2.0,
 'attention_use_n_prev_actions': 0,
 'attention_use_n_prev_rewards': 0,
 'framestack': True,
 'dim': 84,
 'grayscale': False,
 'zero_mean': True,
 'custom_model': None,
 'custom_model_config': {},
 'c

#### How do we tune?

- Do we resort to tuning by hand?
- No!!!

#### Introducing Ray Tune

![](img/rllib_and_tune.png)

In [35]:
from ray import tune

#### Tune usage

- Tune is its own sub-package of Ray, like RLlib
- It is a sophisticated library with its own documentation [here](https://docs.ray.io/en/latest/tune/index.html)
- For our purposes, we will focus on this syntax:

Instead of

In [36]:
ppo_config = ppo_config.training(
    lr=1e-4
)

we do

In [37]:
ppo_config = ppo_config.training(
    lr=tune.grid_search([1e-4, 5e-5])
)

We're setting up `tune` to automatically sweep these values!

#### Running the sweep

In [76]:
ppo_config = ppo_config.environment(env="FrozenLake-v1")

analysis = tune.run(
    "PPO",
    config            = ppo_config.to_dict(),
    stop              = {"training_iteration" : 5},
    checkpoint_freq   = 1,
    verbose           = 0,
    metric="episode_reward_mean",
    mode="max",
)

(PPO pid=6445)   if LooseVersion(torch.__version__) >= LooseVersion("1.8.0"):
(PPO pid=6445) 2022-07-30 19:16:27,149	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=6445) 2022-07-30 19:16:27,149	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(PPO pid=6444)   if LooseVersion(torch.__version__) >= LooseVersion("1.8.0"):
(PPO pid=6444) 2022-07-30 19:16:27,149	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=6444) 2022-07-30 19:16:27,149	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use th

In [77]:
analysis.best_config["lr"]

0.0001

In [78]:
analysis.get_best_logdir("episode_reward_mean", mode="max")

'/Users/mike/ray_results/PPO/PPO_FrozenLake-v1_c6bf1_00000_0_lr=0.0001_2022-07-30_19-16-24'

In [79]:
analysis.best_logdir

'/Users/mike/ray_results/PPO/PPO_FrozenLake-v1_c6bf1_00000_0_lr=0.0001_2022-07-30_19-16-24'

In [80]:
analysis.get_best_checkpoint("episode_reward_mean", mode="max")

KeyError: 'episode_reward_mean'

In [84]:
trials = analysis.trials
trials

[PPO_FrozenLake-v1_c6bf1_00000, PPO_FrozenLake-v1_c6bf1_00001]

In [83]:
best_checkpoint = analysis.get_best_checkpoint(trial=trials[0], metric="episode_reward_mean", mode="max")


RuntimeError: Cannot create checkpoint from URI as it is not supported: None

In [72]:
analysis.best_checkpoint

RuntimeError: Cannot create checkpoint from URI as it is not supported: None

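Gotcha: `get_best_checkpoint` expects a trial as its first argument, which likely explains the `KeyError` when the metric name is passed positionally. Even with a trial supplied, no checkpoint is returned in this run; this may come down to Tune checkpoints vs. RLlib checkpoints.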

Notes:

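- `config`: the algorithm configuration, passed as a dictionary
- `stop`: the stopping condition. Other possibilities are the total number of timesteps or reaching a certain reward value.
- `checkpoint_freq`: how often, in training iterations, to save a checkpoint
- `verbose`: how much progress output Tune prints; `0` is silent
- `metric`: the reported metric used to compare trials
- `mode`: whether `metric` should be maximized (`"max"`) or minimized (`"min"`)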
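The way `metric` and `mode` jointly determine the "best" trial can be sketched in plain Python. This is not Tune's implementation; the helper `best_trial` is hypothetical and the reward numbers are made up purely for illustration.

```python
def best_trial(results, metric, mode):
    """Return the result dict with the best value of `metric`.

    mode="max" picks the largest value, mode="min" the smallest.
    """
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    return pick(results, key=lambda r: r[metric])

# Made-up numbers for illustration only; not results from a real run.
results = [
    {"lr": 1e-4, "episode_reward_mean": 0.08},
    {"lr": 5e-5, "episode_reward_mean": 0.05},
]

print(best_trial(results, metric="episode_reward_mean", mode="max")["lr"])
```

For a reward metric, `mode="max"` is the natural choice; for a loss-like metric one would use `mode="min"` instead.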

#### Larger grid searches

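We can also grid search over multiple hyperparameters at once. Tune then trains one agent for every combination of values.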
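A grid over several hyperparameters grows multiplicatively, because every combination of values becomes its own trial. The plain-Python sketch below illustrates the expansion with `itertools.product`; it is not how Tune is implemented, and the grid values are hypothetical.

```python
from itertools import product

# Hypothetical grids over two hyperparameters.
grid = {
    "lr": [1e-3, 1e-4],
    "entropy_coeff": [0.0, 0.01, 0.1],
}

names = list(grid)
trial_configs = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(trial_configs))  # 2 * 3 combinations
for cfg in trial_configs:
    print(cfg)
```

Each added hyperparameter multiplies the number of trials, so large grids quickly become expensive to run.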

#### Other types of searches
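One common alternative to grid search is random search, where each trial draws its hyperparameters from a distribution instead of a fixed list. Tune provides its own sampling functions for this (such as `tune.loguniform`); the sketch below does not use them and only illustrates the idea in plain Python. The helper `sample_loguniform`, the bounds, and the trial budget are assumptions for illustration.

```python
import math
import random

def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
num_samples = 5         # hypothetical trial budget

candidate_lrs = [sample_loguniform(rng, 1e-5, 1e-2) for _ in range(num_samples)]
for lr in candidate_lrs:
    print(f"{lr:.2e}")
```

Sampling on a log scale is a common choice for learning rates, since useful values often span several orders of magnitude.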

#### Let's apply what we learned!

## Key hyperparameters
<!-- multiple choice -->

Which of the following RLlib PPO hyperparameters directly controls the exploration/exploitation tradeoff during training?

- [ ] lr
- [ ] train_batch_size
- [ ] num_sgd_iter
- [x] entropy_coeff

## Grid search
<!-- multiple choice -->

Given the code below, how many agents are trained by the Ray Tune experiment?

```python
ppo_config = (
    PPOConfig()
    .framework("torch")
    .rollouts(create_env_on_local_worker=True)
    .debugging(seed=0, log_level="ERROR")
    .training(model={"fcnet_hiddens" : [64, 64]},
              lr=tune.grid_search([1e-2, 1e-3, 1e-4]),
              train_batch_size=tune.grid_search([400, 4000, 40_000]))
    .environment(env_config=env_config, env=BasicRecommenderWithHistory)
)
```

- [ ] 1 | With grid search, an agent is trained for every combination of hyperparameters, in this case lr and train_batch_size.
- [ ] 3 | With grid search, an agent is trained for every combination of hyperparameters, in this case lr and train_batch_size.
- [ ] 6 | With grid search, an agent is trained for every combination of hyperparameters, in this case lr and train_batch_size.
- [x] 9

## Ray Tune
<!-- multiple choice -->

True or false: Ray Tune is a library specifically for tuning RLlib algorithms.

- [ ] True | Ray Tune can tune models beyond RLlib as well!
- [x] False

## Tuning a model
<!-- coding exercise -->

In Module 4 we claimed that `lr=1e-3` works better than `lr=1e-4` for the recommender environment we created. Complete the code below so that it uses Ray Tune to select the best learning rate from the following candidates: `[1e-2, 1e-3, 1e-4]`. Then, answer the multiple choice question.

In [85]:
# EXERCISE

from envs import BasicRecommenderWithHistory
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune

env_config = {
    "num_candidates" : 2,
    "alpha"          : 0.5,
    "seed"           : 42
}

ppo_config = (
    PPOConfig()
    .framework("torch")
    .rollouts(create_env_on_local_worker=True)
    .debugging(seed=0, log_level="ERROR")
    .training(model={"fcnet_hiddens" : [64, 64]},
              lr=____)
    .environment(env_config=env_config, env=BasicRecommenderWithHistory)
)

analysis = tune.____(
    "PPO",
    config            = ppo_config.to_dict(),
    ____              = {"training_iteration" : 10},
    checkpoint_freq   = 1,
    verbose           = 0
)

____.results_df[["lr", "episode_reward_mean"]]

In [None]:
# SOLUTION

from envs import BasicRecommenderWithHistory
from ray.rllib.algorithms.ppo import PPOConfig
from ray import tune

env_config = {
    "num_candidates" : 2,
    "alpha"          : 0.5,
    "seed"           : 42
}

ppo_config = (
    PPOConfig()
    .framework("torch")
    .rollouts(create_env_on_local_worker=True)
    .debugging(seed=0, log_level="ERROR")
    .training(model={"fcnet_hiddens" : [64, 64]},
              lr=tune.grid_search([1e-2, 1e-3, 1e-4]))
    .environment(env_config=env_config, env=BasicRecommenderWithHistory)
)

analysis = tune.run(
    "PPO",
    config            = ppo_config.to_dict(),
    stop              = {"training_iteration" : 10},
    checkpoint_freq   = 1,
    verbose           = 0
)

analysis.results_df[["lr", "episode_reward_mean"]]

(PPO pid=12987)   if LooseVersion(torch.__version__) >= LooseVersion("1.8.0"):
(PPO pid=12987) 2022-07-31 08:03:27,492	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=12987) 2022-07-31 08:03:27,492	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(PPO pid=12986)   if LooseVersion(torch.__version__) >= LooseVersion("1.8.0"):
(PPO pid=12986) 2022-07-31 08:03:27,492	INFO ppo.py:378 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting simple_optimizer=True if this doesn't work for you.
(PPO pid=12986) 2022-07-31 08:03:27,492	INFO algorithm.py:332 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or 

In [None]:
analysis.results_df[["lr", "episode_reward_mean"]]

In [None]:
analysis.results_df

#### Is a learning rate of 0.001 actually better than 0.0001?

- [x] Yes
- [ ] No