# RLlib Sample Application: CartPole-v0

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

We were briefly introduced to the _CartPole_ example and the OpenAI gym `CartPole-v0` environment ([gym.openai.com/envs/CartPole-v0/](https://gym.openai.com/envs/CartPole-v0/)) in the [introduction](01-Introduction-to-Reinforcement-Learning.ipynb). This lesson uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy for _CartPole_.

Recall that the `gym` Python module provides MDP interfaces to a variety of simulators, like the simple simulator for the physics of balancing a pole on a cart that is used by the CartPole environment. The _CartPole_ problem is described at https://gym.openai.com/envs/CartPole-v0. Here is an image from that website:

![Cart Pole](../images/Cart-Pole.png)

This example is simple and quick to run, and its results are easy to understand visually.
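To make the MDP interface mentioned above concrete, here is a minimal sketch of a `reset()`/`step()` loop in pure Python. `ToyCartPole`, `run_episode`, and the two policies are hypothetical names for illustration, not part of `gym` or RLlib. The physical constants and termination limits are assumptions modeled on the classic cart-pole formulation, and the start state is fixed (rather than random) to keep the sketch deterministic.

```python
import math


class ToyCartPole:
    """Illustrative stand-in for a gym-style CartPole environment (not gym's own code)."""

    GRAVITY = 9.8                  # m/s^2 (assumed)
    CART_MASS = 1.0                # kg (assumed)
    POLE_MASS = 0.1                # kg (assumed)
    POLE_HALF_LENGTH = 0.5         # m (assumed)
    FORCE = 10.0                   # N, magnitude of each push (assumed)
    DT = 0.02                      # s, simulation time step (assumed)
    MAX_ANGLE = math.radians(12)   # episode ends if the pole tilts further (assumed)
    MAX_POSITION = 2.4             # episode ends if the cart moves further (assumed)
    MAX_STEPS = 200                # cap corresponding to the max score of 200

    def reset(self):
        # Fixed, slightly tilted start so the sketch is deterministic.
        self.state = (0.0, 0.0, 0.01, 0.0)  # x, x_dot, theta, theta_dot
        self.steps = 0
        return self.state

    def step(self, action):
        x, x_dot, theta, theta_dot = self.state
        force = self.FORCE if action == 1 else -self.FORCE  # 1 = push right, 0 = push left
        total_mass = self.CART_MASS + self.POLE_MASS
        pole_ml = self.POLE_MASS * self.POLE_HALF_LENGTH
        sin_t, cos_t = math.sin(theta), math.cos(theta)

        temp = (force + pole_ml * theta_dot ** 2 * sin_t) / total_mass
        theta_acc = (self.GRAVITY * sin_t - cos_t * temp) / (
            self.POLE_HALF_LENGTH * (4.0 / 3.0 - self.POLE_MASS * cos_t ** 2 / total_mass))
        x_acc = temp - pole_ml * theta_acc * cos_t / total_mass

        # Explicit Euler integration of the equations of motion.
        x, x_dot = x + self.DT * x_dot, x_dot + self.DT * x_acc
        theta, theta_dot = theta + self.DT * theta_dot, theta_dot + self.DT * theta_acc

        self.state = (x, x_dot, theta, theta_dot)
        self.steps += 1
        done = (abs(x) > self.MAX_POSITION
                or abs(theta) > self.MAX_ANGLE
                or self.steps >= self.MAX_STEPS)
        reward = 1.0  # one point for every step taken
        return self.state, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return (total reward, number of steps)."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total_reward += reward
    return total_reward, env.steps


def always_right(obs):
    return 1  # ignore the observation and always push right


def lean_policy(obs):
    return 1 if obs[2] > 0 else 0  # push toward the side the pole is leaning


print("always right:", run_episode(ToyCartPole(), always_right))
print("lean policy: ", run_episode(ToyCartPole(), lean_policy))
```

Because the reward is one point per step, the total reward of an episode equals its length, which is why the reward and length means reported during training track each other.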

For more background about this problem, see:

* ["Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems"](https://ieeexplore.ieee.org/document/6313077), AG Barto, RS Sutton, and CW Anderson, *IEEE Transactions on Systems, Man, and Cybernetics* (1983)
* ["Cartpole - Introduction to Reinforcement Learning (DQN - Deep Q-Learning)"](https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288), [Greg Surma](https://twitter.com/GSurma)

First, import Ray and the PPO module in RLlib, then start Ray.

In [1]:
import ray
import ray.rllib.agents.ppo as ppo

In [2]:
import pandas as pd
import json, os, shutil

Model *checkpoints* will get saved after each iteration into directories under `/tmp/ppo/cart`.

> **Note:** If you prefer to use a different directory root than `/tmp`, change it in the next cell _and_ in the `rllib rollout` command below.

In [3]:
checkpoint_root = '/tmp/ppo/cart'

Clean up output of previous lessons (optional):

In [4]:
# Where checkpoints are written:
shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)

# Where some data will be written and used by Tensorboard below:
ray_results = os.path.expanduser('~/ray_results/')
shutil.rmtree(ray_results, ignore_errors=True, onerror=None)

In [5]:
ray.init(ignore_reinit_error=True)

2020-05-11 13:20:42,967	INFO resource_spec.py:212 -- Starting Ray with 3.91 GiB memory available for workers and up to 1.97 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-11 13:20:43,306	INFO services.py:1170 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:45208',
 'object_store_address': '/tmp/ray/session_2020-05-11_13-20-42_955441_17141/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-05-11_13-20-42_955441_17141/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-05-11_13-20-42_955441_17141'}

The Ray Dashboard is useful for monitoring Ray:

In [6]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


Next we'll train an RLlib policy with the [`CartPole-v0` environment](https://gym.openai.com/envs/CartPole-v0/).

By default, training runs for `10` iterations. Increase the `N_ITER` setting if you want to train longer and see the resulting rewards improve. However, if the max score of `200` is achieved early, you can use a smaller number of iterations.

> **Note:** If you change the values shown for `config['model']['fcnet_hiddens']`, make the same change in the `rllib rollout` command below!

In [9]:
SELECT_ENV = "CartPole-v0"
N_ITER = 10

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"
# Other settings we might use:
config['num_workers'] = 1
config['num_sgd_iter'] = 10
config['sgd_minibatch_size'] = 250
config['model']['fcnet_hiddens'] = [100, 50]
config['num_cpus_per_worker'] = 0  # This avoids running out of resources in the notebook environment when this cell is re-executed

In [10]:
agent = ppo.PPOTrainer(config, env=SELECT_ENV)

results = []
episode_data = []
episode_json = []
for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    print(f'Max reward: {episode["episode_reward_max"]}. Checkpoint saved to {file_name}')

2020-05-11 13:40:23,207	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-05-11 13:40:23,394	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-05-11 13:40:26,195	INFO trainable.py:217 -- Getting current IP.


Max reward: 69.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_1/checkpoint-1
Max reward: 70.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_2/checkpoint-2
Max reward: 96.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_3/checkpoint-3
Max reward: 122.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_4/checkpoint-4
Max reward: 161.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_5/checkpoint-5
Max reward: 165.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_6/checkpoint-6
Max reward: 200.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_7/checkpoint-7
Max reward: 200.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_8/checkpoint-8
Max reward: 200.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_9/checkpoint-9
Max reward: 200.0. Checkpoint saved to /tmp/ppo/cart/checkpoint_10/checkpoint-10


The episode rewards should increase after multiple iterations. Try tweaking the config parameters. Smaller values for the `num_sgd_iter`, `sgd_minibatch_size`, or the `model`'s `fcnet_hiddens` will train faster, but take longer to improve the policy.

In [11]:
df = pd.DataFrame(data=episode_data)
df

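|   | n | episode_reward_mean | episode_reward_max | episode_len_mean |
|---|---|---------------------|--------------------|------------------|
| 0 | 0 | 22.581921 | 69.0 | 22.581921 |
| 1 | 1 | 23.22093 | 70.0 | 23.22093 |
| 2 | 2 | 29.272059 | 96.0 | 29.272059 |
| 3 | 3 | 40.49 | 122.0 | 40.49 |
| 4 | 4 | 45.88 | 161.0 | 45.88 |
| 5 | 5 | 51.87 | 165.0 | 51.87 |
| 6 | 6 | 61.55 | 200.0 | 61.55 |
| 7 | 7 | 68.09 | 200.0 | 68.09 |
| 8 | 8 | 82.37 | 200.0 | 82.37 |
| 9 | 9 | 95.12 | 200.0 | 95.12 |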
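Since the maximum score for `CartPole-v0` is `200`, the results above can be scanned for the first iteration whose `episode_reward_max` reaches that cap. The following is a minimal sketch: `first_checkpoint_reaching` is a hypothetical helper, and the list copies the `episode_reward_max` values from this particular run, so your numbers will differ.

```python
# episode_reward_max values copied from the run shown above (n = 0..9).
max_rewards = [69.0, 70.0, 96.0, 122.0, 161.0, 165.0, 200.0, 200.0, 200.0, 200.0]


def first_checkpoint_reaching(rewards, target=200.0):
    """Return the 1-based checkpoint number of the first iteration that hit `target`.

    Checkpoints are numbered from 1, while the loop index `n` starts at 0,
    so the checkpoint number is n + 1. Returns None if the target was never hit.
    """
    for checkpoint_number, reward in enumerate(rewards, start=1):
        if reward >= target:
            return checkpoint_number
    return None


print(first_checkpoint_reaching(max_rewards))
```

For this run the helper returns `7`, which corresponds to `checkpoint_7`. Note that a maximum of `200` only means that at least one episode reached the cap; the mean reward is still well below it at that point.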


In [12]:
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource
from bokeh.models.tools import HoverTool
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [13]:
source = ColumnDataSource(df)

plot = figure(title='Episode reward and length means/maxes')
plot.grid.grid_line_alpha=0.2
plot.xaxis.axis_label = 'n'
plot.yaxis.axis_label = 'value'

plot.line(x='n', y='episode_reward_mean', source=source, color='blue', legend_label='Episode reward mean', name='Episode reward mean')
plot.circle(x='n', y='episode_reward_mean', source=source, color='blue', size=8)
plot.line(x='n', y='episode_reward_max', source=source, color='green', legend_label='Episode reward max', name='Episode reward max')
plot.circle(x='n', y='episode_reward_max', source=source, color='green', size=8)
plot.legend.location = "top_left"

hover = HoverTool()
hover.tooltips = [
    ("n", "$x"),
    ("mean", "$y")]
plot.add_tools(hover)

show(plot)

Also, print out the policy and model to see the results of training in detail…

In [14]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(4, 100) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(100,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(4, 100) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(100,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(100, 50) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(50,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(100, 50) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(50,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(50, 2) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(2,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(50, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
Model: "model"
________________
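The variable shapes listed above show two separate stacks of fully connected layers, one for the policy output and one for the value function, each sized by `fcnet_hiddens`. As a rough sanity check, the number of trainable parameters can be tallied from those shapes. This is only a sketch: `dense_param_count` is a hypothetical helper, and the input size of `4` and output sizes of `2` and `1` are read off the printed shapes.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


obs_dim, num_actions = 4, 2   # from the shapes (4, 100) and (50, 2) above
fcnet_hiddens = [100, 50]     # same value as config['model']['fcnet_hiddens']

policy_params = dense_param_count([obs_dim] + fcnet_hiddens + [num_actions])
value_params = dense_param_count([obs_dim] + fcnet_hiddens + [1])

print(f"policy branch: {policy_params}")
print(f"value branch:  {value_params}")
print(f"total:         {policy_params + value_params}")
```

Rerunning the calculation with a different `fcnet_hiddens` list gives a quick feel for how much smaller or larger the network becomes.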

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies), using the `rllib rollout` command line, to evaluate the trained policy.

This visualizes the "cartpole" agent operating within the simulation: moving the cart left or right to avoid having the pole fall over.

We'll use the seventh saved checkpoint, `checkpoint_7`, for the rollout, evaluated through `2000` steps.
Modify the path to view other checkpoints. Note that you have to change the number _twice_ in the path. You will also have to choose a different checkpoint if you changed `N_ITER` above to a number smaller than `7`.

> **Notes:** 
>
> 1. If you defined `checkpoint_root` above to be different from `/tmp/ppo/cart`, then change it here, too. Because of bugs in variable substitution in Jupyter notebooks, we unfortunately can't use variables in the next cell.
> 2. If you changed the model parameters, specifically the `fcnet_hiddens` array in the `config` object above, make the same change here.
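One possible way to keep the path and model settings in sync is to assemble the command's arguments in Python. The sketch below only builds and prints the argument list; it does not run anything and is not what this lesson does. `rollout_command` is a hypothetical helper, and it uses only the `--config`, `--run`, and `--steps` flags that appear in the next cell.

```python
import json
import shlex


def rollout_command(checkpoint_root, iteration, env, fcnet_hiddens, steps):
    """Build the argument list for an `rllib rollout` call (hypothetical helper)."""
    # The iteration number appears twice in the checkpoint path.
    checkpoint = f"{checkpoint_root}/checkpoint_{iteration}/checkpoint-{iteration}"
    config = {"env": env, "model": {"fcnet_hiddens": fcnet_hiddens}}
    return ["rllib", "rollout", checkpoint,
            "--config", json.dumps(config),
            "--run", "PPO",
            "--steps", str(steps)]


cmd = rollout_command("/tmp/ppo/cart", 7, "CartPole-v0", [100, 50], 2000)

# Print a shell-quoted version that could be pasted into a terminal.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The same list could be passed to `subprocess.run(cmd)` instead of using the `!` shell escape, which avoids quoting the JSON by hand.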

In [45]:
!rllib rollout /tmp/ppo/cart/checkpoint_7/checkpoint-7 \
    --config "{\"env\": \"CartPole-v0\", \"model\": {\"fcnet_hiddens\": [100, 50]}}" \
    --run PPO \
    --steps 2000

2020-05-11 13:55:06,261	INFO resource_spec.py:212 -- Starting Ray with 3.86 GiB memory available for workers and up to 1.94 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-11 13:55:06,662	INFO services.py:1170 -- View the Ray dashboard at localhost:8266
2020-05-11 13:55:07,225	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-05-11 13:55:07,254	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-05-11 13:55:10,188	INFO trainable.py:217 -- Getting current IP.
2020-05-11 13:55:10,289	INFO trainable.py:217 -- Getting current IP.
2020-05-11 13:55:10,289	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /tmp/ppo/cart/checkpoint_7/checkpoint-7
2020-05-11 13:55:10,289	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 7, '_time

The rollout restores the checkpoint and runs the policy for `2000` steps.

Finally, launch [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started) then follow the instructions (for example, click the URL link or copy and paste it into a browser) to visualize key metrics from training with RLlib.


In [None]:
!tensorboard --logdir={ray_results}

Next, go through any of the following lessons:

* [04a: Application: Mountain Car](04a-Application-Mountain-Car.ipynb) -- Based on the `MountainCar-v0` environment from OpenAI Gym.
* [04b: Application: Taxi](04b-Application-Taxi.ipynb) -- Based on the `Taxi-v3` environment from OpenAI Gym.
* [04c: Application: Frozen Lake](04c-Application-Frozen-Lake.ipynb) -- Based on the `FrozenLake-v0` environment from OpenAI Gym.