# RLlib Sample Application: CartPole-v0

The [introduction](01-Introduction-to-Reinforcement-Learning.ipynb) briefly presented the _CartPole_ example and the OpenAI Gym `CartPole-v0` environment ([gym.openai.com/envs/CartPole-v0/](https://gym.openai.com/envs/CartPole-v0/)). This lesson uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy for _CartPole_.

This example is simple and quick to run, and its results are easy to interpret visually.

For more background about this problem, see:

* ["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems"](https://ieeexplore.ieee.org/document/6313077), AG Barto, RS Sutton, and CW Anderson, *IEEE Transactions on Systems, Man, and Cybernetics* (1983)
* ["Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning)"](https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288), [Greg Surma](https://twitter.com/GSurma)

First, import Ray and the PPO module in RLlib, then start Ray.

In [1]:
import ray
import ray.rllib.agents.ppo as ppo

In [2]:
ray.init(ignore_reinit_error=True)

2020-05-01 11:54:47,624	INFO resource_spec.py:212 -- Starting Ray with 4.79 GiB memory available for workers and up to 2.42 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-01 11:54:47,952	INFO services.py:1148 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:19062',
 'object_store_address': '/tmp/ray/session_2020-05-01_11-54-47_613558_27290/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-05-01_11-54-47_613558_27290/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-05-01_11-54-47_613558_27290'}

The Ray Dashboard is useful for monitoring Ray:

In [3]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


Next we'll train an RLlib policy with the [`CartPole-v0` environment](https://gym.openai.com/envs/CartPole-v0/).

By default, training runs for `10` iterations. Increase the `N_ITER` setting if you want to train longer and see the resulting rewards improve.
Also note that *checkpoints* get saved after each iteration into the `/tmp/ppo/cart` directory.

> **Note:** If you prefer to use a different directory root than `/tmp`, change it in the next cell **and** in the `rllib rollout` command below.

In [9]:
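SELECT_ENV = "CartPole-v0"   # Gym environment to train on
N_ITER = 10                  # Number of training iterations

# Start from PPO's default configuration and reduce log noise
config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

reward_history = []          # Maximum episode reward seen in each iteration

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

for _ in range(N_ITER):
    result = agent.train()
    print(result)

    max_reward = result["episode_reward_max"]
    reward_history.append(max_reward)

    # Save a checkpoint after every iteration
    file_name = agent.save('/tmp/ppo/cart')
    print(f'\ncheckpoint saved to {file_name}')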

2020-05-01 12:01:15,835	INFO trainer.py:428 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-05-01 12:01:15,858	INFO trainer.py:585 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-05-01 12:01:17,762	INFO trainable.py:217 -- Getting current IP.


{'episode_reward_max': 79.0, 'episode_reward_min': 8.0, 'episode_reward_mean': 21.690217391304348, 'episode_len_mean': 21.690217391304348, 'episodes_this_iter': 184, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [11.0, 15.0, 32.0, 17.0, 36.0, 75.0, 34.0, 22.0, 47.0, 27.0, 21.0, 21.0, 14.0, 23.0, 21.0, 18.0, 20.0, 18.0, 23.0, 14.0, 15.0, 19.0, 20.0, 19.0, 10.0, 28.0, 25.0, 19.0, 25.0, 36.0, 17.0, 12.0, 18.0, 21.0, 10.0, 24.0, 27.0, 19.0, 31.0, 22.0, 10.0, 11.0, 19.0, 35.0, 26.0, 13.0, 24.0, 33.0, 13.0, 10.0, 16.0, 11.0, 31.0, 31.0, 21.0, 14.0, 24.0, 14.0, 47.0, 18.0, 14.0, 39.0, 41.0, 11.0, 25.0, 23.0, 39.0, 30.0, 19.0, 14.0, 16.0, 11.0, 25.0, 12.0, 11.0, 30.0, 14.0, 25.0, 9.0, 62.0, 26.0, 28.0, 23.0, 19.0, 11.0, 14.0, 14.0, 12.0, 14.0, 18.0, 19.0, 16.0, 25.0, 50.0, 28.0, 19.0, 13.0, 15.0, 33.0, 29.0, 15.0, 20.0, 23.0, 16.0, 49.0, 17.0, 28.0, 18.0, 43.0, 20.0, 26.0, 9.0, 38.0, 23.0, 24.0, 15.0, 11.0, 17

In [10]:
print(reward_history)

[79.0, 96.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0]


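Do the episode rewards increase over successive iterations?
An upward trend shows that the policy is improving. `CartPole-v0` episodes end after at most 200 steps, so `200.0` is the highest reward an episode can reach.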
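One way to make "improving" concrete is to find the first iteration whose maximum episode reward reaches the environment's cap. The following is a minimal sketch, assuming the 200-step episode limit of `CartPole-v0`. The helper name `first_iteration_reaching` is hypothetical and not part of RLlib.

```python
def first_iteration_reaching(rewards, target=200.0):
    """Return the 1-based iteration at which `rewards` first reaches `target`, or None."""
    for iteration, reward in enumerate(rewards, start=1):
        if reward >= target:
            return iteration
    return None


# Values copied from the reward_history output shown in this lesson
sample_history = [79.0, 96.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0, 200.0]
print(first_iteration_reaching(sample_history))  # 3
```

Because `reward_history` tracks `episode_reward_max`, it saturates as soon as a single episode reaches the cap. The `episode_reward_mean` value in the same `result` dictionary is usually a steadier signal of progress.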

Also, print out the policy and model to see the results of training in detail…

In [11]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(4, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(4, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(256, 2) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(2,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(256, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
Model: "model"
__________
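The variable listing above is enough to work out the size of the network by hand. The sketch below copies the printed shapes into a plain dictionary and sums the number of trainable values; the helper name `count_parameters` is hypothetical. The 4 inputs match the four values in a _CartPole_ observation, and the 2 outputs of `fc_out` match the two actions (push left or push right).

```python
# Shapes copied from the tf.Variable listing printed above
variable_shapes = {
    "fc_1/kernel": (4, 256),
    "fc_1/bias": (256,),
    "fc_value_1/kernel": (4, 256),
    "fc_value_1/bias": (256,),
    "fc_2/kernel": (256, 256),
    "fc_2/bias": (256,),
    "fc_value_2/kernel": (256, 256),
    "fc_value_2/bias": (256,),
    "fc_out/kernel": (256, 2),
    "fc_out/bias": (2,),
    "value_out/kernel": (256, 1),
    "value_out/bias": (1,),
}


def count_parameters(shapes):
    """Sum the number of scalar entries over all variable shapes."""
    total = 0
    for shape in shapes.values():
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total


print(count_parameters(variable_shapes))  # 134915
```

The policy branch (`fc_1`, `fc_2`, `fc_out`) and the value branch (`fc_value_1`, `fc_value_2`, `value_out`) are separate stacks of two 256-unit layers, which is why the two halves are nearly the same size.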

Next we'll evaluate the trained policy with the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies), invoked through the `rllib rollout` command.

This visualizes the "cartpole" agent operating within the simulation: moving the cart left or right to avoid having the pole fall over.

In [15]:
! rllib rollout \
    /tmp/ppo/cart/checkpoint_2/checkpoint-2 \
    --config "{\"env\": \"CartPole-v0\"}" --run PPO \
    --steps 2000

2020-05-01 12:08:49,808	INFO resource_spec.py:212 -- Starting Ray with 4.3 GiB memory available for workers and up to 2.16 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-01 12:08:50,119	INFO services.py:1148 -- View the Ray dashboard at localhost:8266
2020-05-01 12:08:50,771	INFO trainer.py:428 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-05-01 12:08:50,799	INFO trainer.py:585 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-05-01 12:08:53,763	INFO trainable.py:217 -- Getting current IP.
2020-05-01 12:08:53,828	INFO trainable.py:217 -- Getting current IP.
2020-05-01 12:08:53,829	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /tmp/ppo/cart/checkpoint_2/checkpoint-2
2020-05-01 12:08:53,829	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 2, '_times

The rollout uses the second saved checkpoint and evaluates it for `2000` steps.
Modify the path to evaluate other checkpoints.
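To point the rollout at a different checkpoint, the path can be built from the iteration number. This sketch assumes the directory layout seen in the command above (`checkpoint_<n>/checkpoint-<n>` under the root passed to `agent.save`), which reflects the Ray version used in this lesson and may differ in other releases. The helper name `checkpoint_path` is hypothetical.

```python
def checkpoint_path(iteration, root="/tmp/ppo/cart"):
    """Build the checkpoint path for a given training iteration (layout assumed from this lesson)."""
    return f"{root}/checkpoint_{iteration}/checkpoint-{iteration}"


print(checkpoint_path(2))  # /tmp/ppo/cart/checkpoint_2/checkpoint-2
```

If you changed the directory root in the training cell, pass the same value as `root` here so that both locations stay in sync.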

Finally, launch [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started), then follow the instructions it prints (for example, click the URL link or copy and paste it into a browser) to visualize key metrics from training with RLlib…

TODO: explain what to do with Tensorboard

In [None]:
!tensorboard --logdir=~/ray_results/

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.1.1 at http://localhost:6007/ (Press CTRL+C to quit)
W0501 12:15:52.709307 123145370316800 core_plugin.py:200] Unable to get first event timestamp for run PPO_CartPole-v0_2020-03-26_15-11-39yyf7qw4j: No event timestamp could be found
W0501 12:15:52.711589 123145370316800 core_plugin.py:200] Unable to get first event timestamp for run PPO_CartPole-v0_2020-03-26_15-21-05wb9qng9p: No event timestamp could be found
W0501 12:15:52.713537 123145370316800 core_plugin.py:200] Unable to get first event timestamp for run PPO_CartPole-v0_2020-03-26_15-43-39ai8xvs0i: No event timestamp could be found
W0501 12:15:52.724436 123145370316800 core_plugin.py:200] Unable to get first event timestamp for run PPO_CartPole-v0_2020-05-01_10-29-417oh9pqog: No event timestamp could be found
W0501 12:15:52.741851 123145370316800 core_plugin.py:200] Unable to get first event timestamp for run PPO_CartPole-v0_
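The same metrics can also be inspected without TensorBoard. The sketch below assumes that each trial directory under `~/ray_results` contains a `progress.csv` file with an `episode_reward_mean` column, as written by Ray's default loggers in the version used here; verify this against your own results directory. The helper name `reward_curve` is hypothetical, and the inline sample values are illustrative only, not real training output.

```python
import csv


def reward_curve(lines, column="episode_reward_mean"):
    """Extract one metric column from CSV text given as an iterable of lines."""
    return [float(row[column]) for row in csv.DictReader(lines)]


# Tiny inline sample standing in for a real progress.csv (illustrative values only)
sample = [
    "training_iteration,episode_reward_mean\n",
    "1,21.5\n",
    "2,40.0\n",
]
print(reward_curve(sample))  # [21.5, 40.0]
```

With a real file, pass an open file handle instead of the sample list, for example `with open(path, newline="") as f: curve = reward_curve(f)`.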

## Exercise 1

TODO

Next, go through any of the following lessons:

* [04a: Application: Mountain Car](04a-Application-Mountain-Car.ipynb) -- Based on the `MountainCar-v0` environment from OpenAI Gym.
* [04b: Application: Taxi](04b-Application-Taxi.ipynb) -- Based on the `Taxi-v3` environment from OpenAI Gym.
* [04c: Application: Frozen Lake](04c-Application-Frozen-Lake.ipynb) -- Based on the `FrozenLake-v0` environment from OpenAI Gym.