# RL Exercise 2 - Proximal Policy Optimization

**GOAL:** The goal of this exercise is to demonstrate how to use the proximal policy optimization (PPO) algorithm.

To understand how to use **RLlib**, see the documentation at http://rllib.io.

PPO is described in detail in https://arxiv.org/abs/1707.06347. It is a variant of Trust Region Policy Optimization (TRPO), which is described in https://arxiv.org/abs/1502.05477.

PPO works in two phases. In the first phase, a large number of rollouts are performed in parallel. In the second phase, the rollouts are aggregated on the driver and a surrogate optimization objective is defined based on them. SGD is then used to find the policy that maximizes that objective, with a penalty term for diverging too much from the current policy.

![ppo](https://raw.githubusercontent.com/ucbrise/risecamp/risecamp2018/ray/tutorial/rllib_exercises/ppo.png)
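To make the "surrogate objective with a penalty term" concrete, here is a minimal sketch for a discrete action space. It is an illustration under stated assumptions, not RLlib's implementation: the function name `kl_penalized_surrogate` is hypothetical, the penalty is taken to be the KL divergence from the old policy to the new one, and the advantages are assumed to be given.

```python
import math


def kl_penalized_surrogate(old_probs, new_probs, actions, advantages, kl_coeff):
    """Mean of ratio * advantage, minus kl_coeff times the mean KL(old || new).

    old_probs / new_probs: one list of action probabilities per time step.
    actions: index of the action actually taken at each time step.
    advantages: advantage estimate for each time step (assumed given).
    """
    n = len(actions)
    surrogate = 0.0
    kl = 0.0
    for p_old, p_new, action, adv in zip(old_probs, new_probs, actions, advantages):
        # Importance ratio between the new and the old policy for the taken action.
        ratio = p_new[action] / p_old[action]
        surrogate += ratio * adv
        # KL divergence from the old to the new action distribution at this step.
        kl += sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0)
    return surrogate / n - kl_coeff * kl / n


old = [[0.5, 0.5], [0.5, 0.5]]
new = [[0.6, 0.4], [0.6, 0.4]]
print(kl_penalized_surrogate(old, new, actions=[0, 1], advantages=[1.0, -0.5], kl_coeff=0.2))
```

When the new policy equals the old one, the ratio is 1 and the KL term is 0, so the objective reduces to the mean advantage. A larger `kl_coeff` makes it less attractive to move far from the current policy.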

**NOTE:** The SGD optimization step is best performed in a data-parallel manner over multiple GPUs. This is exposed through the `num_gpus` field of the `config` dictionary (for this to work, you must be using a machine that has GPUs).

In [3]:
import os
!pip install ray
!pip install lz4

Collecting lz4
  Downloading https://files.pythonhosted.org/packages/83/fe/66da85ed881031de7cf7de9dd38cc98aec8859824c7bcd3e8a88d255f36d/lz4-2.1.6-cp36-cp36m-manylinux1_x86_64.whl (359kB)
     |████████████████████████████████| 368kB 1.7MB/s
Installing collected packages: lz4
Successfully installed lz4-2.1.6


In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import ray
from ray.rllib.agents.ppo import PPOAgent, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

Start up Ray. This must be done before we instantiate any RL agents.

In [5]:
ray.init(ignore_reinit_error=True)

2019-05-16 14:06:11,513	INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-16_14-06-11_118/logs.
2019-05-16 14:06:11,641	INFO services.py:407 -- Waiting for redis server at 127.0.0.1:35248 to respond...
2019-05-16 14:06:11,780	INFO services.py:407 -- Waiting for redis server at 127.0.0.1:64990 to respond...
2019-05-16 14:06:11,784	INFO services.py:804 -- Starting Redis shard with 2.58 GB max memory.
2019-05-16 14:06:11,843	INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-05-16_14-06-11_118/logs.
2019-05-16 14:06:11,846	INFO services.py:1427 -- Starting the Plasma object store with 3.87 GB memory using /dev/shm.


{'node_ip_address': '172.28.0.2',
 'object_store_address': '/tmp/ray/session_2019-05-16_14-06-11_118/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2019-05-16_14-06-11_118/sockets/raylet',
 'redis_address': '172.28.0.2:35248',
 'webui_url': None}

Instantiate a PPOAgent object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `num_sgd_iter` is the number of epochs of SGD (passes through the data) that will be used to optimize the PPO surrogate objective at each iteration of PPO.
- `sgd_minibatch_size` is the SGD batch size that will be used to optimize the PPO surrogate objective.
- `model` contains a dictionary of parameters describing the neural net used to parameterize the policy. The `fcnet_hiddens` parameter is a list of the sizes of the hidden layers.
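To see how `num_sgd_iter` and `sgd_minibatch_size` interact, the rough arithmetic can be sketched as below. The helper `sgd_schedule` is hypothetical. It assumes 4000 sampled steps per training iteration (the `num_steps_sampled` value in the logged output of this notebook) and that a trailing incomplete minibatch is dropped, which is consistent with the logged `num_steps_trained: 3968` for a minibatch size of 128 but is not confirmed against the RLlib source.

```python
def sgd_schedule(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Estimate the SGD work done per training iteration.

    Assumption: only full minibatches are used, so leftover samples are dropped.
    """
    minibatches_per_epoch = train_batch_size // sgd_minibatch_size
    return {
        "minibatches_per_epoch": minibatches_per_epoch,
        "steps_trained_per_epoch": minibatches_per_epoch * sgd_minibatch_size,
        "total_gradient_updates": minibatches_per_epoch * num_sgd_iter,
    }


# 30 epochs with minibatches of 128 versus 10 epochs with minibatches of 256.
heavy = sgd_schedule(train_batch_size=4000, sgd_minibatch_size=128, num_sgd_iter=30)
light = sgd_schedule(train_batch_size=4000, sgd_minibatch_size=256, num_sgd_iter=10)
print(heavy)
print(light)
```

Under these assumptions, fewer epochs combined with a larger minibatch cut the number of gradient updates per training iteration substantially, which is why those two settings are the natural ones to adjust when optimization is too slow.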

In [6]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 1
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0  # This avoids running out of resources in the notebook environment when this cell is re-executed

agent = PPOAgent(config, 'CartPole-v0')

2019-05-16 14:07:37,489	INFO policy_evaluator.py:311 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.random.categorical instead.
Instructions for updating:
Use tf.cast instead.


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Instructions for updating:
Deprecated in favor of operator or tf.math.divide.


2019-05-16 14:07:39,005	INFO policy_evaluator.py:728 -- Built policy map: {'default_policy': <ray.rllib.agents.ppo.ppo_policy_graph.PPOPolicyGraph object at 0x7f32f54bf160>}
2019-05-16 14:07:39,006	INFO policy_evaluator.py:729 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.NoPreprocessor object at 0x7f32f54b1e10>}
2019-05-16 14:07:39,009	INFO policy_evaluator.py:343 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x7f32f549dcf8>}
2019-05-16 14:07:39,040	INFO multi_gpu_optimizer.py:78 -- LocalMultiGPUOptimizer devices ['/cpu:0']


Train the policy on the `CartPole-v0` environment for 2 training iterations. The CartPole problem is described at https://gym.openai.com/envs/CartPole-v0.

**EXERCISE:** Inspect how well the policy is doing by looking for the lines in the printed results that look something like

```
episode_reward_mean: 21.775956284153004
episode_len_mean: 21.775956284153004
```

This indicates how much reward the policy is receiving and how many time steps of the environment the policy ran. The maximum possible reward for this problem is 200. The reward and trajectory length are very close because the agent receives a reward of one for every time step that it survives (however, that is specific to this environment).
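The relationship between reward and trajectory length can be illustrated with a small sketch. The helper `summarize_episodes` is hypothetical and is not how RLlib computes its metrics. It only shows why the two means coincide when every surviving time step pays a reward of one.

```python
def summarize_episodes(episode_rewards):
    """episode_rewards: one list of per-step rewards for each episode."""
    totals = [sum(rewards) for rewards in episode_rewards]
    lengths = [len(rewards) for rewards in episode_rewards]
    return {
        "episode_reward_mean": sum(totals) / len(totals),
        "episode_reward_min": min(totals),
        "episode_reward_max": max(totals),
        "episode_len_mean": sum(lengths) / len(lengths),
    }


# Three made-up CartPole-like episodes: a reward of 1.0 for every step survived.
cartpole_like = [[1.0] * 9, [1.0] * 22, [1.0] * 85]
stats = summarize_episodes(cartpole_like)
print(stats)
```

In an environment with a different reward structure, the mean reward and the mean episode length would generally differ.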

In [7]:
for i in range(2):
    result = agent.train()
    print(pretty_print(result))

(pid=215) 2019-05-16 14:08:03,172	INFO policy_evaluator.py:437 -- Generating sample batch of size 200
(pid=215) 2019-05-16 14:08:03,173	INFO sampler.py:308 -- Raw obs from env: { 0: { 'agent0': np.ndarray((4,), dtype=float64, min=-0.048, max=0.049, mean=-0.01)}}
(pid=215) 2019-05-16 14:08:03,173	INFO sampler.py:309 -- Info return from env: {0: {'agent0': None}}
(pid=215) 2019-05-16 14:08:03,173	INFO sampler.py:407 -- Preprocessed obs: np.ndarray((4,), dtype=float64, min=-0.048, max=0.049, mean=-0.01)
(pid=215) 2019-05-16 14:08:03,174	INFO sampler.py:411 -- Filtered obs: np.ndarray((4,), dtype=float64, min=-0.048, max=0.049, mean=-0.01)
(pid=215) 2019-05-16 14:08:03,174	INFO sampler.py:525 -- Inputs to compute_actions():
(pid=215) 
(pid=215) { 'default_policy': [ { 'data': { 'agent_id': 'agent0',
(pid=215)                                   'env_id': 0,
(pid=215)

2019-05-16 14:08:08,104	INFO multi_gpu_impl.py:144 -- Training on concatenated sample batches:

{ 'inputs': [ np.ndarray((4000, 4), dtype=float32, min=-2.56, max=2.555, mean=0.0),
              np.ndarray((4000,), dtype=float32, min=0.996, max=57.441, mean=12.418),
              np.ndarray((4000,), dtype=float32, min=-1.244, max=4.904, mean=-0.0),
              np.ndarray((4000,), dtype=int64, min=0.0, max=1.0, mean=0.485),
              np.ndarray((4000, 2), dtype=float32, min=-0.012, max=0.009, mean=0.0),
              np.ndarray((4000,), dtype=float32, min=-0.008, max=0.006, mean=-0.0),
              np.ndarray((4000,), dtype=int64, min=0.0, max=1.0, mean=0.465),
              np.ndarray((4000,), dtype=float32, min=0.0, max=1.0, mean=0.954)],
  'placeholders': [ <tf.Tensor 'default_policy/obs:0' shape=(?, 4) dtype=float32>,
                    <tf.Tensor 'default_policy/value_targets:0' shape=(?,) dtype=float32>,
                    <tf.Tensor 'default_policy/advantages:0' shape=(?,

custom_metrics: {}
date: 2019-05-16_14-08-11
done: false
episode_len_mean: 21.775956284153004
episode_reward_max: 85.0
episode_reward_mean: 21.775956284153004
episode_reward_min: 9.0
episodes_this_iter: 183
episodes_total: 183
experiment_id: fd89ac60471b404880810e4490722e65
hostname: f5120e65a079
info:
  grad_time_ms: 3304.501
  learner:
    default_policy:
      cur_kl_coeff: 0.19999995827674866
      cur_lr: 4.999999873689376e-05
      entropy: 0.6646150946617126
      kl: 0.02912997081875801
      policy_loss: -0.039056196808815
      total_loss: 173.78334045410156
      vf_explained_var: 0.022880150005221367
      vf_loss: 173.81658935546875
  load_time_ms: 124.932
  num_steps_sampled: 4000
  num_steps_trained: 3968
  sample_time_ms: 4947.051
  update_time_ms: 827.049
iterations_since_restore: 1
node_ip: 172.28.0.2
num_healthy_workers: 1
num_metric_batches_dropped: 0
off_policy_estimator: {}
pid: 118
policy_reward_mean: {}
sampler_perf:
  mean_env_wait_ms: 0.08069184743532742
  mea

**EXERCISE:** The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective (fewer SGD iterations and a larger batch size should help).

In [8]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 10
config['sgd_minibatch_size'] = 256
config['model']['fcnet_hiddens'] = [20, 20]
config['num_cpus_per_worker'] = 0

agent = PPOAgent(config, 'CartPole-v0')

2019-05-16 14:09:38,367	INFO policy_evaluator.py:311 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2019-05-16 14:09:39,396	INFO policy_evaluator.py:728 -- Built policy map: {'default_policy': <ray.rllib.agents.ppo.ppo_policy_graph.PPOPolicyGraph object at 0x7f32014c9ac8>}
2019-05-16 14:09:39,398	INFO policy_evaluator.py:729 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.NoPreprocessor object at 0x7f32014c9588>}
2019-05-16 14:09:39,399	INFO policy_evaluator.py:343 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x7f32014c9898>}
2019-05-16 14:09:39,468	INFO multi_gpu_optimizer.py:78 -- LocalMultiGPUOptimizer devices ['/cpu:0']


(pid=213) 2019-05-16 14:09:39,447	INFO policy_evaluator.py:311 -- Creating policy evaluation worker 1 on CPU (please ignore any CUDA init errors)
(pid=213) 2019-05-16 14:09:39.461944: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
(pid=213) 2019-05-16 14:09:39.462244: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6e97340 executing computations on platform Host. Devices:
(pid=213) 2019-05-16 14:09:39.462279: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
(pid=213) Instructions for updating:
(pid=213) Colocations handled automatically by placer.
(pid=213) Instructions for updating:
(pid=213) Use tf.random.categorical instead.

(pid=213) Instructions for updating:
(pid=213) Use tf.cast instead.
(pid=213)   "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
(pid=213) Instructions for updating:
(pid=213) Deprecated in favor of operator or tf.math.divide.


**EXERCISE:** Train the agent and try to get a reward of 200. If it's training too slowly you may need to modify the config above to use fewer hidden units, a larger `sgd_minibatch_size`, a smaller `num_sgd_iter`, or a larger `num_workers`.

This should take around 20 or 30 training iterations.
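Rather than fixing the number of iterations in advance, one option is to stop as soon as the mean reward reaches the target. The sketch below assumes only that `train()` returns a dictionary containing `episode_reward_mean`, as in the printed results in this notebook. The helper `train_until` and the stand-in class `FakeAgent` are hypothetical and exist only so the example runs without Ray.

```python
class FakeAgent:
    """Stand-in for a real agent (hypothetical): mean reward grows by 25 per call."""

    def __init__(self):
        self.reward = 0.0

    def train(self):
        self.reward = min(200.0, self.reward + 25.0)
        return {"episode_reward_mean": self.reward}


def train_until(agent, target_reward, max_iterations):
    """Call agent.train() until the mean reward reaches the target or the cap is hit."""
    result = None
    for iteration in range(1, max_iterations + 1):
        result = agent.train()
        if result["episode_reward_mean"] >= target_reward:
            return iteration, result
    return max_iterations, result


iterations, final = train_until(FakeAgent(), target_reward=200.0, max_iterations=30)
print(iterations, final["episode_reward_mean"])
```

With the notebook's own agent, the call would be `train_until(agent, 200.0, 30)`. The cap on iterations keeps the loop from running indefinitely if the target is never reached.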

In [10]:
for i in range(20):
    result = agent.train()
    print(pretty_print(result))

custom_metrics: {}
date: 2019-05-16_14-10-37
done: false
episode_len_mean: 26.22222222222222
episode_reward_max: 92.0
episode_reward_mean: 26.22222222222222
episode_reward_min: 10.0
episodes_this_iter: 153
episodes_total: 504
experiment_id: 7c4787f1bf084d65857dc9dca012a585
hostname: f5120e65a079
info:
  grad_time_ms: 580.712
  learner:
    default_policy:
      cur_kl_coeff: 0.05000000447034836
      cur_lr: 4.999999873689376e-05
      entropy: 0.6797142624855042
      kl: 0.0025639538653194904
      policy_loss: -0.007877949625253677
      total_loss: 358.54803466796875
      vf_explained_var: -5.960067210253328e-05
      vf_loss: 358.5557861328125
  load_time_ms: 22.756
  num_steps_sampled: 12000
  num_steps_trained: 11520
  sample_time_ms: 2678.368
  update_time_ms: 305.095
iterations_since_restore: 3
node_ip: 172.28.0.2
num_healthy_workers: 3
num_metric_batches_dropped: 0
off_policy_estimator: {}
pid: 118
policy_reward_mean: {}
sampler_perf:
  mean_env_wait_ms: 0.09372710773863878


Checkpoint the current model. The call to `agent.save()` returns the path to the checkpointed model, which can be used later to restore the model.

In [11]:
checkpoint_path = agent.save()
print(checkpoint_path)

/root/ray_results/PPO_CartPole-v0_2019-05-16_14-09-38m1nxh4tk/checkpoint_22/checkpoint-22


Now let's use the trained policy to make predictions.

**NOTE:** Here we are loading the trained policy in the same process, but in practice, this would often be done in a different process (probably on a different machine).

In [12]:
trained_config = config.copy()

test_agent = PPOAgent(trained_config, 'CartPole-v0')
test_agent.restore(checkpoint_path)

2019-05-16 14:12:39,729	INFO policy_evaluator.py:311 -- Creating policy evaluation worker 0 on CPU (please ignore any CUDA init errors)
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2019-05-16 14:12:41,167	INFO policy_evaluator.py:728 -- Built policy map: {'default_policy': <ray.rllib.agents.ppo.ppo_policy_graph.PPOPolicyGraph object at 0x7f31fd57f4a8>}
2019-05-16 14:12:41,169	INFO policy_evaluator.py:729 -- Built preprocessor map: {'default_policy': <ray.rllib.models.preprocessors.NoPreprocessor object at 0x7f31fd57fb00>}
2019-05-16 14:12:41,176	INFO policy_evaluator.py:343 -- Built filter map: {'default_policy': <ray.rllib.utils.filter.NoFilter object at 0x7f31fd57fe10>}
2019-05-16 14:12:41,251	INFO multi_gpu_optimizer.py:78 -- LocalMultiGPUOptimizer devices ['/cpu:0']


(pid=468) 2019-05-16 14:12:41,231	INFO policy_evaluator.py:311 -- Creating policy evaluation worker 2 on CPU (please ignore any CUDA init errors)
(pid=470) 2019-05-16 14:12:41,218	INFO policy_evaluator.py:311 -- Creating policy evaluation worker 1 on CPU (please ignore any CUDA init errors)
(pid=470) 2019-05-16 14:12:41.226316: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
(pid=470) 2019-05-16 14:12:41.226559: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6d7d340 executing computations on platform Host. Devices:
(pid=470) 2019-05-16 14:12:41.226599: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
(pid=468) 2019-05-16 14:12:41.248412: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
(pid=468) 2019-05-16 14:12:41.248659: I tensorflow/compiler/xla/s

Now use the trained policy to act in an environment. The key line is the call to `test_agent.compute_action(state)` which uses the trained policy to choose an action.

**EXERCISE:** Verify that the reward received roughly matches up with the reward printed in the training logs.

In [17]:
env = gym.make('CartPole-v0')
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

200.0
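A single episode can be noisy, so averaging over several episodes gives a fairer comparison with `episode_reward_mean` from the training logs. The sketch below assumes the same environment interface used in the cell above (`reset()` returns a state, `step()` returns a 4-tuple). The helper `average_return` and the stand-in class `ConstantRewardEnv` are hypothetical and exist only so the example runs without Gym.

```python
class ConstantRewardEnv:
    """Stand-in environment (hypothetical): reward 1.0 per step for a fixed length."""

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return 0, 1.0, done, {}


def average_return(env, choose_action, num_episodes):
    """Run several episodes and return the mean cumulative reward."""
    totals = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        total = 0.0
        while not done:
            state, reward, done, _ = env.step(choose_action(state))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)


mean_reward = average_return(ConstantRewardEnv(200), lambda state: 0, num_episodes=5)
print(mean_reward)
```

With the objects defined in this notebook, the equivalent call would be `average_return(env, test_agent.compute_action, num_episodes=10)`.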


## Visualize results with TensorBoard

**EXERCISE**: Finally, you can visualize your training results using TensorBoard. To do this, open a new terminal in JupyterLab using the "+" button, and run:
    
`$ tensorboard --logdir=~/ray_results --host=0.0.0.0`

Then open your browser to the address printed (or change the current URL to go to port 6006). Check the "episode_reward_mean" learning curve of the PPO agent. Toggle the horizontal axis between the "STEPS" and "RELATIVE" views to compare efficiency in number of timesteps versus wall-clock time.

Note that TensorBoard will not work in Binder.