# Using a PPO RL agent as a trainable and optimizing its hyperparameters in the CartPole environment

First, we need to define the search space for the PPO hyperparameters we want to tune. The [PPO config training](https://docs.ray.io/en/latest/rllib/rllib-algorithms.html#ray.rllib.algorithms.ppo.ppo.PPOConfig) section of the Ray RLlib documentation lists the PPO-specific hyperparameters, and the [general RL algorithm hyperparameters](https://docs.ray.io/en/latest/rllib/package_ref/doc/ray.rllib.algorithms.algorithm_config.AlgorithmConfig.training.html#ray-rllib-algorithms-algorithm-config-algorithmconfig-training) page lists the ones shared by all algorithms.

In this notebook, we will optimize the learning rate `lr` and the discount factor `gamma`. Commonly used values for these hyperparameters can be found in the SB3 Zoo [here](https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/rl_zoo3/hyperparams_opt.py); they reflect the community's experience with the software and algorithms.
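To build intuition for the `gamma` candidates, a common rule of thumb treats `1 / (1 - gamma)` as the approximate number of future steps that still contribute noticeably to the discounted return. The sketch below only illustrates that arithmetic; the helper name `effective_horizon` is hypothetical and is not part of RLlib.

```python
def effective_horizon(gamma: float) -> float:
    """Rule-of-thumb horizon 1 / (1 - gamma) for a discount factor in [0, 1)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return 1.0 / (1.0 - gamma)


# Compare a few of the discount factors considered in this notebook.
for gamma in (0.5, 0.9, 0.99, 0.999):
    print(f"gamma={gamma}: ~{effective_horizon(gamma):.0f} steps")
```

Since CartPole-v1 episodes can last up to 500 steps, small values such as `gamma=0.5` make the agent weigh only the next couple of steps, while values close to 1 take most of the episode into account.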

In [1]:
from pathlib import Path
from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.algorithms.algorithm import Algorithm

In [2]:
search_space = {
    "lr": tune.loguniform(1e-5, 1),
    "gamma": tune.choice(
        [
            0.5,
            0.6,
            0.7,
            0.8,
            0.9,
            0.95,
            0.98,
            0.99,
            0.995,
            0.999,
            0.9999,
        ]
    ),
}

You can check the `loguniform` and the `choice` search space [here](https://docs.ray.io/en/latest/tune/api/search_space.html).
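To see why a log-uniform distribution suits the learning rate, the sketch below draws samples whose logarithm is uniform over the same range used for `lr`. It is a plain-Python illustration of the idea, not Ray's implementation, and `sample_loguniform` is a hypothetical helper.

```python
import math
import random


def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
lr_samples = [sample_loguniform(1e-5, 1.0, rng) for _ in range(1000)]

# The range spans five decades, so roughly one fifth of the draws
# should fall into each of them, e.g. into [1e-5, 1e-4).
share_lowest_decade = sum(s < 1e-4 for s in lr_samples) / len(lr_samples)
print(f"share of samples below 1e-4: {share_lowest_decade:.2f}")
```

A plain uniform distribution over the same range would almost never produce values below `1e-4`, which is why learning rates are usually sampled on a logarithmic scale.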

In Ray Tune, a search algorithm is responsible for selecting the hyperparameters used in each trial. In this example, we use random search, which samples each hyperparameter at random from the search space.

In [3]:
search_algo = tune.search.basic_variant.BasicVariantGenerator()  # Random search
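Conceptually, random search draws every trial configuration independently, without using the results of earlier trials. The following is a minimal sketch of that idea in plain Python; `toy_space` and `random_search` are hypothetical names and do not mirror how `BasicVariantGenerator` is implemented.

```python
import random

# Hypothetical stand-in for the search space: each entry is a sampling function.
toy_space = {
    "lr": lambda rng: 10 ** rng.uniform(-5, 0),  # log-uniform in [1e-5, 1]
    "gamma": lambda rng: rng.choice([0.9, 0.99, 0.999]),  # categorical choice
}


def random_search(space, num_samples, seed=0):
    """Return `num_samples` configurations, each drawn independently."""
    rng = random.Random(seed)
    return [
        {name: sampler(rng) for name, sampler in space.items()}
        for _ in range(num_samples)
    ]


for trial_config in random_search(toy_space, num_samples=2):
    print(trial_config)
```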

In Ray Tune, a trial scheduler can terminate poorly performing trials early, pause trials, clone trials, and alter the hyperparameters of a running trial. In this example, we use a simple scheduler that runs the trials in FIFO order without pausing, terminating, or altering any running trial.

In [4]:
scheduler_algo = tune.schedulers.FIFOScheduler()  # FIFO trial scheduler
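The difference between a FIFO scheduler and an early-stopping scheduler can be pictured as a per-trial decision function. The sketch below is a toy illustration with made-up rewards; the function names and the median-based rule are assumptions chosen for clarity and are not Ray Tune's scheduler interface.

```python
from statistics import median

CONTINUE, STOP = "CONTINUE", "STOP"


def fifo_decision(trial_reward, other_rewards):
    """A FIFO-style policy never interferes: every trial keeps running."""
    return CONTINUE


def median_stop_decision(trial_reward, other_rewards):
    """Toy early-stopping policy: stop a trial that is below the median of its peers."""
    if other_rewards and trial_reward < median(other_rewards):
        return STOP
    return CONTINUE


peer_rewards = [120.0, 200.0, 310.0]  # made-up rewards of other trials
print(fifo_decision(50.0, peer_rewards))
print(median_stop_decision(50.0, peer_rewards))
```

With only two short trials, as in this notebook, there is little to gain from early stopping, so letting every trial run to completion is a reasonable choice.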

Once the search and scheduler algorithms are defined, we can define our Tune configuration:

In [5]:
number_trials = 2
tune_config = tune.TuneConfig(
    metric="env_runners/episode_reward_mean",  # That's the metric we want to maximize/minimize
    mode="max",  # Here we indicate we want to maximize the metric env_runners/episode_reward_mean
    scheduler=scheduler_algo,
    search_alg=search_algo,
    num_samples=number_trials,  # Number of trials to run
)
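The metric name contains a slash because RLlib reports its results as a nested dictionary, and Tune addresses nested entries through slash-delimited keys. A minimal sketch of that lookup follows; `lookup_metric` is a hypothetical helper, and the example dictionary is a trimmed-down version of a result from this notebook's run.

```python
def lookup_metric(result: dict, metric: str, delimiter: str = "/"):
    """Follow a slash-delimited metric name through a nested result dict."""
    value = result
    for key in metric.split(delimiter):
        value = value[key]
    return value


# Trimmed-down result dict with values taken from this notebook's run output.
example_result = {
    "training_iteration": 10,
    "env_runners": {"episode_reward_mean": 278.86, "episode_reward_max": 500.0},
}

print(lookup_metric(example_result, "env_runners/episode_reward_mean"))
```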

Now, let's initialize our Tuner and train the PPO RL agent.

In [6]:
config = PPOConfig().environment("CartPole-v1")
stop = {
    "training_iteration": 10,
}
checkpoint_frequency = 0
store_results_path = str(Path("./ray_results/").resolve()) + "/nb_2/"
agent_name = "ppo_cartpole"

tuner = tune.Tuner(
    "PPO",
    param_space={
        **config.to_dict(),
        **search_space,
    },  # Merge the algorithm config with the search space; the search-space entries override `lr` and `gamma`
    tune_config=tune_config,
    run_config=air.RunConfig(
        storage_path=store_results_path,
        name=agent_name,
        stop=stop,
        verbose=2,
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=checkpoint_frequency,
            checkpoint_at_end=True,
        ),
    ),
)
results = tuner.fit()
print(results)

2024-11-30 02:50:09,520	INFO worker.py:1783 -- Started a local Ray instance.
2024-11-30 02:50:10,043	INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `Tuner(...)`.
2024-11-30 02:50:10,045	INFO tune.py:616 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949
  gym.logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
  logger.warn(
  logger.warn(f"{pre} is not within the observation space.")


Current time: 2024-11-30 02:51:52
Running for: 00:01:42.19
Memory: 5.5/23.9 GiB

| Trial name | status | loc | gamma | lr | iter | total time (s) | ts | num_healthy_workers | num_in_flight_async_sample_reqs | num_remote_worker_restarts |
|---|---|---|---|---|---|---|---|---|---|---|
| PPO_CartPole-v1_cfa8c_00000 | TERMINATED | 200.239.93.233:515764 | 0.995 | 1.39853e-05 | 10 | 93.8896 | 40000 | 2 | 0 | 0 |
| PPO_CartPole-v1_cfa8c_00001 | TERMINATED | 200.239.93.233:515765 | 0.5 | 0.0029496 | 10 | 94.2231 | 40000 | 2 | 0 | 0 |


(PPO pid=515765) Install gputil for GPU system monitoring.


Trial name,agent_timesteps_total,counters,custom_metrics,env_runners,episode_media,info,num_agent_steps_sampled,num_agent_steps_sampled_lifetime,num_agent_steps_trained,num_env_steps_sampled,num_env_steps_sampled_lifetime,num_env_steps_sampled_this_iter,num_env_steps_sampled_throughput_per_sec,num_env_steps_trained,num_env_steps_trained_this_iter,num_env_steps_trained_throughput_per_sec,num_healthy_workers,num_in_flight_async_sample_reqs,num_remote_worker_restarts,num_steps_trained_this_iter,perf,timers
PPO_CartPole-v1_cfa8c_00000,40000,"{'num_env_steps_sampled': 40000, 'num_env_steps_trained': 40000, 'num_agent_steps_sampled': 40000, 'num_agent_steps_trained': 40000}",{},"{'episode_reward_max': 500.0, 'episode_reward_min': 30.0, 'episode_reward_mean': np.float64(278.86), 'episode_len_mean': np.float64(278.86), 'episode_media': {}, 'episodes_timesteps_total': 27886, 'policy_reward_min': {'default_policy': np.float64(30.0)}, 'policy_reward_max': {'default_policy': np.float64(500.0)}, 'policy_reward_mean': {'default_policy': np.float64(278.86)}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [79.0, 59.0, 30.0, 99.0, 226.0, 171.0, 131.0, 256.0, 184.0, 77.0, 152.0, 180.0, 94.0, 136.0, 54.0, 227.0, 232.0, 167.0, 85.0, 153.0, 195.0, 162.0, 167.0, 120.0, 172.0, 295.0, 154.0, 203.0, 324.0, 91.0, 250.0, 57.0, 64.0, 103.0, 286.0, 275.0, 233.0, 166.0, 139.0, 267.0, 340.0, 178.0, 237.0, 99.0, 357.0, 206.0, 66.0, 383.0, 389.0, 288.0, 261.0, 388.0, 322.0, 202.0, 332.0, 310.0, 188.0, 348.0, 221.0, 86.0, 99.0, 291.0, 302.0, 500.0, 369.0, 492.0, 367.0, 249.0, 363.0, 279.0, 203.0, 324.0, 122.0, 500.0, 500.0, 324.0, 296.0, 460.0, 500.0, 500.0, 359.0, 418.0, 348.0, 294.0, 500.0, 468.0, 500.0, 412.0, 500.0, 500.0, 500.0, 500.0, 500.0, 427.0, 500.0, 500.0, 500.0, 500.0, 404.0, 500.0], 'episode_lengths': [79, 59, 30, 99, 226, 171, 131, 256, 184, 77, 152, 180, 94, 136, 54, 227, 232, 167, 85, 153, 195, 162, 167, 120, 172, 295, 154, 203, 324, 91, 250, 57, 64, 103, 286, 275, 233, 166, 139, 267, 340, 178, 237, 99, 357, 206, 66, 383, 389, 288, 261, 388, 322, 202, 332, 310, 188, 348, 221, 86, 99, 291, 302, 500, 369, 492, 367, 249, 363, 279, 203, 324, 122, 500, 500, 324, 296, 460, 500, 500, 359, 418, 348, 294, 500, 468, 500, 412, 500, 500, 500, 500, 500, 427, 500, 500, 500, 500, 404, 500], 'policy_default_policy_reward': [79.0, 59.0, 30.0, 99.0, 226.0, 171.0, 131.0, 256.0, 184.0, 77.0, 152.0, 180.0, 94.0, 136.0, 54.0, 227.0, 232.0, 167.0, 85.0, 153.0, 195.0, 162.0, 167.0, 120.0, 172.0, 
295.0, 154.0, 203.0, 324.0, 91.0, 250.0, 57.0, 64.0, 103.0, 286.0, 275.0, 233.0, 166.0, 139.0, 267.0, 340.0, 178.0, 237.0, 99.0, 357.0, 206.0, 66.0, 383.0, 389.0, 288.0, 261.0, 388.0, 322.0, 202.0, 332.0, 310.0, 188.0, 348.0, 221.0, 86.0, 99.0, 291.0, 302.0, 500.0, 369.0, 492.0, 367.0, 249.0, 363.0, 279.0, 203.0, 324.0, 122.0, 500.0, 500.0, 324.0, 296.0, 460.0, 500.0, 500.0, 359.0, 418.0, 348.0, 294.0, 500.0, 468.0, 500.0, 412.0, 500.0, 500.0, 500.0, 500.0, 500.0, 427.0, 500.0, 500.0, 500.0, 500.0, 404.0, 500.0]}, 'sampler_perf': {'mean_raw_obs_processing_ms': np.float64(0.19439578188943432), 'mean_inference_ms': np.float64(0.6707228199427331), 'mean_action_processing_ms': np.float64(0.07895767599599109), 'mean_env_wait_ms': np.float64(0.04029074512553687), 'mean_env_render_ms': np.float64(0.0)}, 'num_faulty_episodes': 0, 'connector_metrics': {'ObsPreprocessorConnector_ms': np.float64(0.0043277740478515625), 'StateBufferConnector_ms': np.float64(0.003129243850708008), 'ViewRequirementAgentConnector_ms': np.float64(0.08161735534667969)}, 'num_episodes': 8, 'episode_return_max': 500.0, 'episode_return_min': 30.0, 'episode_return_mean': np.float64(278.86), 'episodes_this_iter': 8}",{},"{'learner': {'default_policy': {'learner_stats': {'allreduce_latency': np.float64(0.0), 'grad_gnorm': np.float32(0.36715484), 'cur_kl_coeff': np.float64(0.01875), 'cur_lr': np.float64(1.3985265446433982e-05), 'total_loss': np.float64(9.929167386536957), 'policy_loss': np.float64(-0.01733068605904938), 'vf_loss': np.float64(9.94645794078868), 'vf_explained_var': np.float64(-0.06159804110885948), 'kl': np.float64(0.0021413145434349998), 'entropy': np.float64(0.5399863363594137), 'entropy_coeff': np.float64(0.0)}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': np.float64(128.0), 'num_grad_updates_lifetime': np.float64(8835.5), 'diff_num_grad_updates_vs_sampler_policy': np.float64(464.5)}}, 'num_env_steps_sampled': 40000, 'num_env_steps_trained': 40000, 
'num_agent_steps_sampled': 40000, 'num_agent_steps_trained': 40000}",40000,40000,40000,40000,40000,4000,427.598,40000,4000,427.598,2,0,0,4000,"{'cpu_util_percent': np.float64(15.638461538461542), 'ram_util_percent': np.float64(27.600000000000005)}","{'training_iteration_time_ms': 9385.326, 'restore_workers_time_ms': 0.017, 'training_step_time_ms': 9385.279, 'sample_time_ms': 2011.235, 'load_time_ms': 0.323, 'load_throughput': 12378055.187, 'learn_time_ms': 7369.253, 'learn_throughput': 542.796, 'synch_weights_time_ms': 3.983}"
PPO_CartPole-v1_cfa8c_00001,40000,"{'num_env_steps_sampled': 40000, 'num_env_steps_trained': 40000, 'num_agent_steps_sampled': 40000, 'num_agent_steps_trained': 40000}",{},"{'episode_reward_max': 284.0, 'episode_reward_min': 24.0, 'episode_reward_mean': np.float64(83.7), 'episode_len_mean': np.float64(83.7), 'episode_media': {}, 'episodes_timesteps_total': 8370, 'policy_reward_min': {'default_policy': np.float64(24.0)}, 'policy_reward_max': {'default_policy': np.float64(284.0)}, 'policy_reward_mean': {'default_policy': np.float64(83.7)}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [118.0, 58.0, 100.0, 53.0, 68.0, 171.0, 103.0, 99.0, 108.0, 92.0, 284.0, 91.0, 87.0, 58.0, 66.0, 77.0, 81.0, 49.0, 62.0, 93.0, 65.0, 37.0, 71.0, 104.0, 62.0, 90.0, 68.0, 86.0, 62.0, 149.0, 24.0, 57.0, 47.0, 87.0, 93.0, 47.0, 47.0, 54.0, 82.0, 59.0, 90.0, 66.0, 53.0, 92.0, 76.0, 58.0, 119.0, 85.0, 53.0, 100.0, 93.0, 60.0, 78.0, 122.0, 88.0, 55.0, 73.0, 26.0, 79.0, 60.0, 75.0, 94.0, 107.0, 36.0, 110.0, 96.0, 116.0, 124.0, 69.0, 100.0, 57.0, 56.0, 125.0, 82.0, 126.0, 146.0, 75.0, 85.0, 118.0, 77.0, 77.0, 69.0, 83.0, 103.0, 90.0, 80.0, 78.0, 88.0, 69.0, 136.0, 91.0, 65.0, 41.0, 148.0, 79.0, 77.0, 87.0, 72.0, 68.0, 60.0], 'episode_lengths': [118, 58, 100, 53, 68, 171, 103, 99, 108, 92, 284, 91, 87, 58, 66, 77, 81, 49, 62, 93, 65, 37, 71, 104, 62, 90, 68, 86, 62, 149, 24, 57, 47, 87, 93, 47, 47, 54, 82, 59, 90, 66, 53, 92, 76, 58, 119, 85, 53, 100, 93, 60, 78, 122, 88, 55, 73, 26, 79, 60, 75, 94, 107, 36, 110, 96, 116, 124, 69, 100, 57, 56, 125, 82, 126, 146, 75, 85, 118, 77, 77, 69, 83, 103, 90, 80, 78, 88, 69, 136, 91, 65, 41, 148, 79, 77, 87, 72, 68, 60], 'policy_default_policy_reward': [118.0, 58.0, 100.0, 53.0, 68.0, 171.0, 103.0, 99.0, 108.0, 92.0, 284.0, 91.0, 87.0, 58.0, 66.0, 77.0, 81.0, 49.0, 62.0, 93.0, 65.0, 37.0, 71.0, 104.0, 62.0, 90.0, 68.0, 86.0, 62.0, 149.0, 24.0, 57.0, 47.0, 87.0, 93.0, 47.0, 47.0, 54.0, 82.0, 59.0, 90.0, 66.0, 53.0, 92.0, 76.0, 58.0, 119.0, 85.0, 
53.0, 100.0, 93.0, 60.0, 78.0, 122.0, 88.0, 55.0, 73.0, 26.0, 79.0, 60.0, 75.0, 94.0, 107.0, 36.0, 110.0, 96.0, 116.0, 124.0, 69.0, 100.0, 57.0, 56.0, 125.0, 82.0, 126.0, 146.0, 75.0, 85.0, 118.0, 77.0, 77.0, 69.0, 83.0, 103.0, 90.0, 80.0, 78.0, 88.0, 69.0, 136.0, 91.0, 65.0, 41.0, 148.0, 79.0, 77.0, 87.0, 72.0, 68.0, 60.0]}, 'sampler_perf': {'mean_raw_obs_processing_ms': np.float64(0.1992350386598566), 'mean_inference_ms': np.float64(0.6674608949034192), 'mean_action_processing_ms': np.float64(0.07855310032530356), 'mean_env_wait_ms': np.float64(0.04027246995109764), 'mean_env_render_ms': np.float64(0.0)}, 'num_faulty_episodes': 0, 'connector_metrics': {'ObsPreprocessorConnector_ms': np.float64(0.004766941070556641), 'StateBufferConnector_ms': np.float64(0.003278970718383789), 'ViewRequirementAgentConnector_ms': np.float64(0.0824594497680664)}, 'num_episodes': 46, 'episode_return_max': 284.0, 'episode_return_min': 24.0, 'episode_return_mean': np.float64(83.7), 'episodes_this_iter': 46}",{},"{'learner': {'default_policy': {'learner_stats': {'allreduce_latency': np.float64(0.0), 'grad_gnorm': np.float32(0.36532947), 'cur_kl_coeff': np.float64(0.6750000000000002), 'cur_lr': np.float64(0.0029495964113414592), 'total_loss': np.float64(-0.01114009203619614), 'policy_loss': np.float64(-0.01957856527259273), 'vf_loss': np.float64(0.0027076304654568254), 'vf_explained_var': np.float64(0.846721371335368), 'kl': np.float64(0.00849013701074782), 'entropy': np.float64(0.5356226322151), 'entropy_coeff': np.float64(0.0)}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': np.float64(128.0), 'num_grad_updates_lifetime': np.float64(8835.5), 'diff_num_grad_updates_vs_sampler_policy': np.float64(464.5)}}, 'num_env_steps_sampled': 40000, 'num_env_steps_trained': 40000, 'num_agent_steps_sampled': 40000, 'num_agent_steps_trained': 40000}",40000,40000,40000,40000,40000,4000,421.976,40000,4000,421.976,2,0,0,4000,"{'cpu_util_percent': np.float64(15.471428571428573), 
'ram_util_percent': np.float64(27.264285714285716)}","{'training_iteration_time_ms': 9418.291, 'restore_workers_time_ms': 0.017, 'training_step_time_ms': 9418.243, 'sample_time_ms': 2024.944, 'load_time_ms': 0.342, 'load_throughput': 11695514.814, 'learn_time_ms': 7388.48, 'learn_throughput': 541.383, 'synch_weights_time_ms': 3.987}"


(PPO pid=515764) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/lasse/ray_minicourse/lesson_2/ray_results/nb_2/ppo_cartpole/PPO_CartPole-v1_cfa8c_00000_0_gamma=0.9950,lr=0.0000_2024-11-30_02-50-10/checkpoint_000000)
(PPO pid=515764) Install gputil for GPU system monitoring.
2024-11-30 02:51:52,274	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/lasse/ray_minicourse/lesson_2/ray_results/nb_2/ppo_cartpole' in 0.0166s.
2024-11-30 02:51:52,763	INFO tune.py:1041 -- Total run time: 102.72 seconds (102.18 seconds for the tuning loop).


ResultGrid<[
  Result(
    metrics={'custom_metrics': {}, 'episode_media': {}, 'info': {'learner': {'default_policy': {'learner_stats': {'allreduce_latency': np.float64(0.0), 'grad_gnorm': np.float32(0.36715484), 'cur_kl_coeff': np.float64(0.01875), 'cur_lr': np.float64(1.3985265446433982e-05), 'total_loss': np.float64(9.929167386536957), 'policy_loss': np.float64(-0.01733068605904938), 'vf_loss': np.float64(9.94645794078868), 'vf_explained_var': np.float64(-0.06159804110885948), 'kl': np.float64(0.0021413145434349998), 'entropy': np.float64(0.5399863363594137), 'entropy_coeff': np.float64(0.0)}, 'model': {}, 'custom_metrics': {}, 'num_agent_steps_trained': np.float64(128.0), 'num_grad_updates_lifetime': np.float64(8835.5), 'diff_num_grad_updates_vs_sampler_policy': np.float64(464.5)}}, 'num_env_steps_sampled': 40000, 'num_env_steps_trained': 40000, 'num_agent_steps_sampled': 40000, 'num_agent_steps_trained': 40000}, 'env_runners': {'episode_reward_max': 500.0, 'episode_reward_min': 30

In the logs above, you can see two trials that use different learning rate `lr` and discount factor `gamma` values (`gamma=0.995` with `lr=1.39853e-05`, and `gamma=0.5` with `lr=0.0029496`). This confirms that each trial sampled its own configuration from the defined search space.
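Instead of reading the logs by eye, the best trial can also be picked programmatically. The sketch below assumes that `results` is the `ResultGrid` returned by `tuner.fit()` and relies on its `get_best_result` method; the wrapper `summarize_best` is a hypothetical helper written for this illustration.

```python
def summarize_best(result_grid, metric="env_runners/episode_reward_mean", mode="max"):
    """Return the tuned hyperparameters and mean reward of the best trial."""
    best = result_grid.get_best_result(metric=metric, mode=mode)
    return {
        "lr": best.config["lr"],
        "gamma": best.config["gamma"],
        # `metrics` is a nested dict, so the reward sits under "env_runners".
        "episode_reward_mean": best.metrics["env_runners"]["episode_reward_mean"],
    }


# In this notebook it would be called as: summarize_best(results)
```

For the run shown above, this should select the trial with `gamma=0.995`, whose mean episode reward (278.86) is higher than that of the other trial (83.7).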

Finally, you can compare the training reward of the trials in TensorBoard by looking for the metric `episode_reward_mean`. You should see two curves (one per trial) showing how the reward evolves during training.

In [8]:
%load_ext tensorboard
%tensorboard --logdir ray_results/nb_2

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


Reusing TensorBoard on port 6008 (pid 516186), started 0:00:11 ago. (Use '!kill 516186' to kill it.)