# RL Exercise 3 - Proximal Policy Optimization

**GOAL:** The goal of this exercise is to demonstrate how to use the proximal policy optimization (PPO) algorithm.

PPO is described in detail in https://arxiv.org/abs/1707.06347. It is a variant of Trust Region Policy Optimization (TRPO), which is described in https://arxiv.org/abs/1502.05477.

PPO alternates between two phases. In the first phase, a large number of rollouts are performed in parallel. In the second phase, the rollouts are aggregated on the driver and a surrogate optimization objective is defined based on them. We then use SGD to find the policy that maximizes that objective, subject to a penalty term for diverging too much from the current policy.
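To make the penalized objective concrete, here is a minimal sketch for a discrete action space. It assumes the penalty is a KL divergence between the old and new policies weighted by a coefficient (called `kl_coeff` here, a hypothetical name), and that all probabilities are strictly positive. It illustrates the idea only and is not the RLlib implementation.

```python
import math

def kl_penalized_surrogate(new_probs, old_probs, actions, advantages, kl_coeff):
    """Surrogate objective (to be maximized) for a batch of rollout samples.

    new_probs, old_probs: per-sample action distributions (lists of lists).
    actions: index of the action taken in each sample.
    advantages: advantage estimate for each sample.
    kl_coeff: weight of the KL penalty (hypothetical name).
    """
    n = len(actions)
    gain = 0.0
    kl = 0.0
    for p_new, p_old, a, adv in zip(new_probs, old_probs, actions, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s) times the advantage.
        gain += (p_new[a] / p_old[a]) * adv
        # KL(pi_old || pi_new) for this sample.
        kl += sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new))
    return gain / n - kl_coeff * kl / n


old = [[0.5, 0.5], [0.5, 0.5]]
actions = [0, 1]
advantages = [1.0, -1.0]

# Unchanged policy: every ratio is 1 and the KL term is 0.
same = kl_penalized_surrogate(old, old, actions, advantages, kl_coeff=0.2)

# Shifting probability toward the action with positive advantage raises the
# first term, while the KL penalty grows with the size of the shift.
shifted = [[0.6, 0.4], [0.6, 0.4]]
better = kl_penalized_surrogate(shifted, old, actions, advantages, kl_coeff=0.2)
print(same, better)
```

The penalty keeps each SGD phase from moving the policy far from the one that generated the rollouts, which is what makes reusing the same batch for several epochs reasonable.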

**NOTE:** The SGD optimization step is best performed in a data-parallel manner over multiple GPUs. This is exposed through the `devices` field of the `config` dictionary (for this to work, you must be using a machine that has GPUs).
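As a rough illustration of setting the `devices` field, the sketch below uses a stand-in dictionary rather than the real `DEFAULT_CONFIG`. It assumes the field takes a list of TensorFlow-style device strings such as `/gpu:0`; that format and the helper `gpu_device_list` are assumptions for illustration, so check the default value in your Ray version before relying on them.

```python
def gpu_device_list(num_gpus):
    """Hypothetical helper: TensorFlow-style names for the first num_gpus GPUs."""
    return ['/gpu:%d' % i for i in range(num_gpus)]

# Stand-in for the PPO config dictionary; only the `devices` key matters here.
config = {'devices': ['/cpu:0']}
config['devices'] = gpu_device_list(2)
print(config['devices'])
```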

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import ray
from ray.rllib.ppo import PPOAgent, DEFAULT_CONFIG



Start up Ray. This must be done before we instantiate any RL agents. We pass in `num_workers=0` because the training agent's constructor will create a number of actors.

In [3]:
ray.init(num_workers=0)

Waiting for redis server at 127.0.0.1:56402 to respond...
Waiting for redis server at 127.0.0.1:52600 to respond...
Starting local scheduler with the following resources: {'CPU': 8, 'GPU': 0}.

View the web UI at http://localhost:8895/notebooks/ray_ui65673.ipynb?token=a04e05bf3ae4bb9dbae3a4b37ef0e91ff356d3cc284f32cf



AssertionError: Perhaps you called ray.init twice by accident?

Instantiate a PPOAgent object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `num_sgd_iter` is the number of epochs of SGD (passes through the data) that will be used to optimize the PPO surrogate objective at each iteration of PPO.
- `sgd_batchsize` is the SGD batch size that will be used to optimize the PPO surrogate objective.
- `model` contains a dictionary of parameters describing the neural net used to parameterize the policy. The `fcnet_hiddens` parameter is a list of the sizes of the hidden layers.
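One thing to keep in mind when editing nested entries such as `model`: a dictionary's `.copy()` method is shallow, so nested dictionaries are shared with the original. The sketch below demonstrates this with a small stand-in dictionary (`DEFAULTS` and its values are hypothetical, not the real `DEFAULT_CONFIG`). If you want a fully independent config, `copy.deepcopy` is one option.

```python
import copy

# Stand-in for DEFAULT_CONFIG; the real dictionary has many more keys.
DEFAULTS = {'num_sgd_iter': 30, 'model': {'fcnet_hiddens': [256, 256]}}

shallow = DEFAULTS.copy()
shallow['num_sgd_iter'] = 10                  # top-level key: DEFAULTS is unaffected
shallow['model']['fcnet_hiddens'] = [50, 50]  # nested dict is shared: DEFAULTS changes too
leaked = DEFAULTS['model']['fcnet_hiddens'] == [50, 50]

# Reset the stand-in and repeat with a deep copy.
DEFAULTS['model']['fcnet_hiddens'] = [256, 256]
deep = copy.deepcopy(DEFAULTS)
deep['model']['fcnet_hiddens'] = [50, 50]
isolated = DEFAULTS['model']['fcnet_hiddens'] == [256, 256]
print(leaked, isolated)
```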

In [4]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 30
config['sgd_batchsize'] = 128
config['model']['fcnet_hiddens'] = [100, 100]

agent = PPOAgent(config, 'CartPole-v0')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Observation shape is (4,)
Not using any observation preprocessor.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Instructions for updating:
keep_dims is deprecated, use keepdims instead
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Constructing fcnet [100, 100] <function tanh at 0x11cf7f378>
Constructing fcnet [100

Train the policy on the `CartPole-v0` environment for 2 training iterations. The CartPole problem is described at https://gym.openai.com/envs/CartPole-v0.

**EXERCISE:** Inspect how well the policy is doing by looking for the lines that say something like

```
total reward is  22.3215974777
trajectory length mean is  21.3215974777
```

This indicates how much reward the policy is receiving and how many time steps of the environment the policy ran. The maximum possible reward for this problem is 200. The reward and trajectory length are very close because the agent receives a reward of one for every time step that it survives (however, that is specific to this environment).
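The relationship between reward and trajectory length can be checked with a few lines of plain Python. The episodes below are made up for illustration, and `episode_stats` is a hypothetical helper. With a reward of one per surviving step, the total reward of an episode equals its length.

```python
def episode_stats(episodes):
    """Return (mean total reward, mean length) for a list of per-step reward lists."""
    totals = [sum(rewards) for rewards in episodes]
    lengths = [len(rewards) for rewards in episodes]
    return sum(totals) / len(episodes), sum(lengths) / len(episodes)

# Three made-up CartPole-style episodes: reward 1.0 at every step survived.
episodes = [[1.0] * 18, [1.0] * 25, [1.0] * 23]
mean_reward, mean_length = episode_stats(episodes)
print(mean_reward, mean_length)
```

In an environment with a different reward structure, the two numbers would generally differ.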

In [7]:
for i in range(2):
    result = agent.train()
    print(result)

===> iteration 4
Computing policy (iterations=30, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    3.45683e+03    8.80492e-04    3.45683e+03    3.59236e-05    5.71446e-01
              1    3.43720e+03    6.25133e-04    3.43720e+03    1.09336e-04    5.69045e-01
              2    3.41699e+03    4.28345e-04    3.41699e+03    2.71533e-04    5.66235e-01
              3    3.39670e+03    3.34345e-04    3.39670e+03    2.91015e-04    5.65247e-01
              4    3.37699e+03    8.83215e-05    3.37699e+03    4.21387e-04    5.63902e-01
              5    3.35746e+03   -7.62925e-05    3.35746e+03    5.40983e-04    5.63376e-01
              6    3.33838e+03   -1.59915e-04    3.33838e+03    8.15238e-04    5.61826e-01
              7    3.31956e+03   -3.14552e-04    3.31956e+03    7.72596e-04    5.60851e-01
              8    3.30098e+03   -4.14943e-04    3.30098e+03    9.78424e-04    5.59300e-01
              9    3.28



Computing policy (iterations=30, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    2.92967e+03    3.22774e-03    2.92966e+03    1.41356e-05    5.62545e-01
              1    2.91633e+03    2.45404e-03    2.91633e+03    1.54543e-04    5.62640e-01
              2    2.90225e+03    1.84078e-03    2.90225e+03    6.59957e-04    5.62127e-01
              3    2.88752e+03    1.16049e-03    2.88752e+03    1.60652e-03    5.60807e-01
              4    2.87282e+03    6.16450e-04    2.87282e+03    2.68919e-03    5.59769e-01
              5    2.85828e+03    4.26363e-05    2.85828e+03    3.63385e-03    5.59570e-01
              6    2.84379e+03   -4.11539e-04    2.84379e+03    4.58249e-03    5.58834e-01
              7    2.82943e+03   -6.85314e-04    2.82943e+03    5.45573e-03    5.57052e-01
              8    2.81516e+03   -7.94172e-04    2.81516e+03    5.61973e-03    5.55575e-01
              9    2.80097e+03   -8.3989

**EXERCISE:** The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective (fewer SGD iterations and a larger batch size should help).
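To get a feel for why fewer SGD iterations and a larger batch size help, the sketch below counts minibatch updates per PPO iteration. It assumes each SGD epoch is one full pass over the collected timesteps in minibatches of `sgd_batchsize`. The helper `minibatch_steps` and the figure of 7000 timesteps are illustrative assumptions, and the actual implementation may batch the data differently.

```python
import math

def minibatch_steps(timesteps, num_sgd_iter, sgd_batchsize):
    """Minibatch updates per PPO iteration if each epoch is one full pass over the data."""
    return num_sgd_iter * math.ceil(timesteps / sgd_batchsize)

# Hypothetical batch of 7000 timesteps collected per PPO iteration.
timesteps = 7000
heavy = minibatch_steps(timesteps, num_sgd_iter=30, sgd_batchsize=128)
light = minibatch_steps(timesteps, num_sgd_iter=10, sgd_batchsize=256)
print(heavy, light)
```

Under these assumptions, the lighter setting performs several times fewer updates per iteration, which is where most of the speedup in the SGD phase would come from.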

In [8]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 10
config['sgd_batchsize'] = 256
config['model']['fcnet_hiddens'] = [50, 50]

agent = PPOAgent(config, 'CartPole-v0')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Observation shape is (4,)
Not using any observation preprocessor.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>


**EXERCISE:** Train the agent and try to get a reward of 200. If it's training too slowly, you may need to modify the config above to use fewer hidden units, a larger `sgd_batchsize`, a smaller `num_sgd_iter`, or a larger `num_workers`.

This should take roughly 20 to 30 training iterations.
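Rather than fixing the number of iterations in advance, one option is to stop as soon as the mean episode reward reaches the target. The sketch below relies only on `agent.train()` returning an object with an `episode_reward_mean` field, as the printed results in this notebook show. `train_until`, `FakeAgent`, and `FakeResult` are hypothetical stand-ins so the example runs without Ray.

```python
from collections import namedtuple

def train_until(agent, target_reward, max_iters):
    """Call agent.train() until episode_reward_mean reaches target_reward or max_iters is hit."""
    result = None
    for i in range(max_iters):
        result = agent.train()
        if result.episode_reward_mean >= target_reward:
            return i + 1, result
    return max_iters, result

# Stand-in agent whose mean reward rises by 40 per iteration, capped at 200.
FakeResult = namedtuple('FakeResult', ['episode_reward_mean'])

class FakeAgent(object):
    def __init__(self):
        self.reward = 0.0

    def train(self):
        self.reward = min(200.0, self.reward + 40.0)
        return FakeResult(self.reward)

iters, final = train_until(FakeAgent(), target_reward=200.0, max_iters=30)
print(iters, final.episode_reward_mean)
```

With the real agent, the mean reward can fluctuate from one iteration to the next, so reaching the target once does not guarantee that later iterations stay there.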

In [12]:
for _ in range(20):
    result = agent.train()
    print(result)

===> iteration 6
Computing policy (iterations=10, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    3.72534e+03    2.65221e-03    3.72534e+03    3.72436e-05    6.15637e-01
              1    3.72228e+03    1.04352e-03    3.72228e+03    4.07155e-04    6.14076e-01
              2    3.71903e+03   -9.03456e-04    3.71903e+03    1.24829e-03    6.12161e-01
              3    3.71569e+03   -2.54648e-03    3.71570e+03    2.56542e-03    6.09897e-01
              4    3.71229e+03   -4.29543e-03    3.71230e+03    4.23522e-03    6.07125e-01
              5    3.70879e+03   -5.87979e-03    3.70880e+03    6.29925e-03    6.04031e-01
              6    3.70521e+03   -7.30158e-03    3.70522e+03    8.76495e-03    6.00705e-01
              7    3.70149e+03   -8.04212e-03    3.70150e+03    1.05392e-02    5.98145e-01
              8    3.69768e+03   -8.18685e-03    3.69768e+03    1.09233e-02    5.96920e-01
              9    3.69



Computing policy (iterations=10, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    5.95067e+03    7.02002e-05    5.95067e+03    3.74175e-06    6.05889e-01
              1    5.94350e+03   -1.08898e-05    5.94350e+03    2.05089e-05    6.06253e-01
              2    5.93589e+03   -9.57919e-05    5.93589e+03    5.62319e-05    6.07081e-01
              3    5.92806e+03   -2.35799e-04    5.92806e+03    1.11483e-04    6.07741e-01
              4    5.92010e+03   -3.19806e-04    5.92010e+03    1.67116e-04    6.08016e-01
              5    5.91203e+03   -4.06945e-04    5.91203e+03    2.44078e-04    6.08678e-01
              6    5.90378e+03   -5.22747e-04    5.90378e+03    3.27264e-04    6.09033e-01
              7    5.89534e+03   -6.06124e-04    5.89534e+03    4.00025e-04    6.09100e-01
              8    5.88676e+03   -7.01083e-04    5.88676e+03    5.16007e-04    6.09645e-01
              9    5.87804e+03   -7.8719

              6    6.00477e+03   -6.06764e-04    6.00477e+03    2.90020e-04    5.76008e-01
              7    5.98879e+03   -6.50456e-04    5.98879e+03    4.66421e-04    5.75966e-01
              8    5.97369e+03   -8.17682e-04    5.97369e+03    6.22073e-04    5.75581e-01
              9    5.95897e+03   -9.35263e-04    5.95897e+03    7.97761e-04    5.74760e-01
TrainingResult(timesteps_total=81380, done=None, info={'kl_divergence': 0.00079776125, 'kl_coefficient': 0.003125, 'rollouts_time': 3.1342482566833496, 'shuffle_time': 0.0015649795532226562, 'load_time': 0.001199960708618164, 'sgd_time': 0.32426881790161133, 'sample_throughput': 13963.723151986424}, episode_reward_mean=192.14634146341464, episode_len_mean=192.14634146341464, episodes_total=None, mean_accuracy=None, mean_validation_accuracy=None, mean_loss=None, neg_mean_loss=None, experiment_id='e3efea1e166e44919724746e8ca51e8a', training_iteration=11, timesteps_this_iter=7878, time_this_iter_s=3.470571994781494, time_total_s=43

Computing policy (iterations=10, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    5.64856e+03    1.29332e-04    5.64856e+03    1.96159e-06    5.50513e-01
              1    5.63501e+03   -1.42558e-04    5.63501e+03    4.57310e-05    5.50049e-01
              2    5.62147e+03   -6.10969e-04    5.62147e+03    1.50330e-04    5.48999e-01
              3    5.60824e+03   -8.90095e-04    5.60825e+03    4.84986e-04    5.48259e-01
              4    5.59522e+03   -1.47454e-03    5.59522e+03    7.58791e-04    5.47077e-01
              5    5.58244e+03   -1.90128e-03    5.58244e+03    1.17708e-03    5.45979e-01
              6    5.56989e+03   -2.31896e-03    5.56989e+03    1.65945e-03    5.44985e-01
              7    5.55752e+03   -2.66542e-03    5.55752e+03    2.41435e-03    5.44046e-01
              8    5.54535e+03   -3.11905e-03    5.54535e+03    3.24475e-03    5.42890e-01
              9    5.53337e+03   -3.3955

              7    5.44629e+03    6.78124e-04    5.44629e+03    4.45216e-04    5.46537e-01
              8    5.43798e+03    5.61219e-04    5.43798e+03    5.98634e-04    5.45495e-01
              9    5.42974e+03    4.70412e-04    5.42974e+03    7.50912e-04    5.44430e-01
TrainingResult(timesteps_total=132591, done=None, info={'kl_divergence': 0.00075091206, 'kl_coefficient': 2.44140625e-05, 'rollouts_time': 2.7159507274627686, 'shuffle_time': 0.0010960102081298828, 'load_time': 0.00086212158203125, 'sgd_time': 0.2708714008331299, 'sample_throughput': 14767.155143352313}, episode_reward_mean=200.0, episode_len_mean=200.0, episodes_total=None, mean_accuracy=None, mean_validation_accuracy=None, mean_loss=None, neg_mean_loss=None, experiment_id='e3efea1e166e44919724746e8ca51e8a', training_iteration=18, timesteps_this_iter=7000, time_this_iter_s=2.995709180831909, time_total_s=66.94217348098755, pid=8668, date='2018-03-28_20-49-12', timestamp=1522295352, hostname='DILBAG-M-X2Y6', config={'

Computing policy (iterations=10, stepsize=5e-05):
           iter     total loss    policy loss        vf loss             kl        entropy
              0    5.17134e+03   -2.74704e-04    5.17134e+03    2.76425e-06    5.32934e-01
              1    5.16472e+03   -3.73292e-04    5.16472e+03    1.32440e-05    5.32601e-01
              2    5.15815e+03   -4.25899e-04    5.15815e+03    3.93870e-05    5.32371e-01
              3    5.15160e+03   -5.37482e-04    5.15160e+03    7.35631e-05    5.31598e-01
              4    5.14507e+03   -6.53202e-04    5.14507e+03    1.11934e-04    5.31666e-01
              5    5.13859e+03   -7.18052e-04    5.13859e+03    1.40243e-04    5.30932e-01
              6    5.13211e+03   -8.22091e-04    5.13211e+03    1.92913e-04    5.30676e-01
              7    5.12564e+03   -8.75232e-04    5.12564e+03    2.88889e-04    5.30042e-01
              8    5.11919e+03   -1.00700e-03    5.11919e+03    3.83385e-04    5.29917e-01
              9    5.11279e+03   -1.1012

              7    4.92075e+03    3.20098e-03    4.92075e+03    6.01013e-04    5.28943e-01
              8    4.91498e+03    3.16703e-03    4.91498e+03    6.94038e-04    5.28366e-01
              9    4.90927e+03    3.07371e-03    4.90927e+03    8.10775e-04    5.27563e-01
TrainingResult(timesteps_total=182735, done=None, info={'kl_divergence': 0.0008107755, 'kl_coefficient': 1.9073486328125e-07, 'rollouts_time': 2.73003888130188, 'shuffle_time': 0.0009980201721191406, 'load_time': 0.0009150505065917969, 'sgd_time': 0.27130842208862305, 'sample_throughput': 14806.027653236082}, episode_reward_mean=194.91666666666666, episode_len_mean=194.91666666666666, episodes_total=None, mean_accuracy=None, mean_validation_accuracy=None, mean_loss=None, neg_mean_loss=None, experiment_id='e3efea1e166e44919724746e8ca51e8a', training_iteration=25, timesteps_this_iter=7017, time_this_iter_s=3.0101749897003174, time_total_s=88.94900345802307, pid=8668, date='2018-03-28_20-49-34', timestamp=1522295374, hos

Checkpoint the current model. The call to `agent.save()` returns the path to the checkpointed model, which can be used later to restore it.

In [23]:
checkpoint_path = agent.save()

INFO:tensorflow:/Users/alokbeniwal/ray_results/2018-03-28_20-47-3088_z_7cy/checkpoint-26 is not in all_model_checkpoint_paths. Manually adding it.


Now let's use the trained policy to make predictions.

**NOTE:** Here we are loading the trained policy in the same process, but in practice, this would often be done in a different process (probably on a different machine).

In [24]:
trained_config = config.copy()

test_agent = PPOAgent(trained_config, 'CartPole-v0')
test_agent.restore(checkpoint_path)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Observation shape is (4,)
Not using any observation preprocessor.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
Constructing fcnet [50, 50] <function tanh at 0x11cf7f378>
INFO:tensorflow:Restoring parameters from /Users/alokbeniwal/ray_results/2018-03-28_20-47-3088_z_7cy/checkpoint-26


Now use the trained policy to act in an environment. The key line is the call to `test_agent.compute_action(state)`, which uses the trained policy to choose an action.

**EXERCISE:** Verify that the reward received roughly matches up with the reward printed in the training logs.

In [26]:
env = gym.make('CartPole-v0')
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
200.0
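
The loop above runs a single episode, whereas the training log reports a mean over many episodes. For a fairer comparison, a sketch like the following averages the return over several episodes. It assumes the same `reset()` and 4-tuple `step()` interface used above. `average_return` and `CountdownEnv` are hypothetical names, and the stand-in environment exists only so the example runs without gym.

```python
def average_return(env, choose_action, num_episodes):
    """Mean cumulative reward over num_episodes, using the 4-tuple env.step interface."""
    total = 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, _ = env.step(choose_action(state))
            total += reward
    return total / num_episodes

class CountdownEnv(object):
    """Stand-in environment: reward 1 per step, episode ends after `horizon` steps."""
    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, {}

mean_return = average_return(CountdownEnv(horizon=200), lambda state: 0, num_episodes=5)
print(mean_return)
```

With the objects from this notebook, the equivalent call would be `average_return(env, test_agent.compute_action, 10)`.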
