# Ray RLlib - Explore RLlib - Sample Application: BipedalWalker-v3 (Optional)


© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This example uses a harder problem, the _Bipedal Walker_, a two-legged "robot" in two dimensions (see [here](https://gym.openai.com/envs/BipedalWalker-v2/) and [here](https://github.com/openai/gym/wiki/BipedalWalker-v2); we'll actually use version 3, not 2). 
![Bipedal Walker](../../images/rllib/Bipedal-Walker.png)

([source](https://gym.openai.com/envs/BipedalWalker-v2/))

Reward is given for moving forward, totaling 300+ points for reaching the far end. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so an agent that reaches the end while applying less torque gets a better score. The state consists of the hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints, joints angular speed, legs contact with ground, and 10 LIDAR rangefinder measurements. There are no coordinates in the state vector.
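As a rough illustration of the state description above, the following sketch lays out one plausible ordering of the observation vector. The label names are hypothetical, and the per-leg breakdown into hip and knee joints is an assumption rather than something stated in this notebook. The resulting length of 24 does agree with the `(24, 512)` first-layer kernel shape printed in the model summary later on.

```python
# Hypothetical labels for the BipedalWalker observation vector.
# The grouping (4 hull values, 5 values per leg, 10 LIDAR readings) is an
# assumption based on the prose description; only the total is checked.
HULL = ["hull_angle", "hull_angular_velocity", "velocity_x", "velocity_y"]
LEG = ["hip_angle", "hip_speed", "knee_angle", "knee_speed", "ground_contact"]
N_LIDAR = 10


def state_labels():
    """Return one label per entry of the assumed observation vector."""
    labels = list(HULL)
    for leg in ("leg1", "leg2"):
        labels += [f"{leg}_{name}" for name in LEG]
    labels += [f"lidar_{i}" for i in range(N_LIDAR)]
    return labels


labels = state_labels()
print(len(labels))
```

Note that none of the labels is an absolute position, which is what "no coordinates in the state vector" means in practice.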

This notebook is offered as a "homework" exercise, as more computation is required to make it work. To accelerate your efforts somewhat, we provide a checkpoint from previous training episodes. Even working through this notebook as is, you'll see good results. However, consider iterating on the neural network structure and running more training iterations. How well can you train the walker?

First, import Ray and the PPO module in RLlib, then start Ray.

In [1]:
import ray
import ray.rllib.agents.ppo as ppo

In [2]:
import pandas as pd
import json, os, shutil, sys

In [3]:
sys.path.append('../..') # so we can import from "util"
from util.line_plots import plot_line, plot_line_with_min_max, plot_line_with_stddev

Model *checkpoints* will get saved after each iteration into directories under `tmp/ppo/bipedal-walker`, i.e., relative to this directory.
The default directories for checkpoints are `$HOME/ray_results/<algo_env>/.../checkpoint_N`.

> **Note:** If you prefer to use a different directory root, change it in the next cell _and_ in the `rllib rollout` command below.

In [4]:
checkpoint_root = 'tmp/ppo/bipedal-walker'
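Because the checkpoint number grows with every training iteration, it can be handy to look up the most recent one programmatically. The sketch below is a hypothetical helper (`latest_checkpoint` is not part of RLlib) that assumes the `checkpoint_N/checkpoint-N` layout shown in the training output later in this notebook.

```python
import os
import re


def latest_checkpoint(root):
    """Return the path of the highest-numbered checkpoint under root, or None.

    Assumes directories named checkpoint_N, each holding a file checkpoint-N.
    """
    if not os.path.isdir(root):
        return None
    best_n, best_path = -1, None
    for name in os.listdir(root):
        match = re.fullmatch(r"checkpoint_(\d+)", name)
        if match and int(match.group(1)) > best_n:
            best_n = int(match.group(1))
            best_path = os.path.join(root, name, f"checkpoint-{best_n}")
    return best_path


# Prints None until at least one checkpoint has been saved.
print(latest_checkpoint("tmp/ppo/bipedal-walker"))
```

Comparing the numbers as integers matters here: a plain string sort would place `checkpoint_9` after `checkpoint_100`.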

Clean up output of previous lessons (optional):

In [5]:
# Where checkpoints are written:
#shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)

# Where some data will be written and used by Tensorboard below:
ray_results = f'{os.getenv("HOME")}/ray_results/'
#shutil.rmtree(ray_results, ignore_errors=True, onerror=None)

Make sure Ray is running:

In [6]:
!../../tools/start-ray.sh  --check --verbose

INFO: Ray is already running.


In [7]:
ray.init(address='auto', ignore_reinit_error=True)

2020-06-14 07:07:03,247	INFO resource_spec.py:212 -- Starting Ray with 3.52 GiB memory available for workers and up to 1.78 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-14 07:07:03,615	INFO services.py:1170 -- View the Ray dashboard at localhost:8266


{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:61931',
 'object_store_address': '/tmp/ray/session_2020-06-14_07-07-03_235068_29259/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-14_07-07-03_235068_29259/sockets/raylet',
 'webui_url': 'localhost:8266',
 'session_dir': '/tmp/ray/session_2020-06-14_07-07-03_235068_29259'}

The Ray Dashboard is useful for monitoring Ray:

In [8]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8266


Next we'll train a policy for the [Bipedal Walker](https://gym.openai.com/envs/BipedalWalker-v2/) environment.

> **Note:** If you change the values shown for `config['model']['fcnet_hiddens']`, make the same change in the `rllib rollout` command below!

In [9]:
SELECT_ENV = "BipedalWalker-v3"                 # Specifies the OpenAI Gym environment
N_ITER = 50                                     # Number of training runs.

config = ppo.DEFAULT_CONFIG.copy()              # PPO's default configuration. See the next code cell.
config["log_level"] = "WARN"                    # Suppress too many messages, but try "INFO" to see what can be printed.
# Other settings we might adjust:
config['num_workers'] = 4                       # Use > 1 for using more CPU cores, including over a cluster
config['num_sgd_iter'] = 50                     # Number of SGD (stochastic gradient descent) iterations per training minibatch.
                                                # I.e., for each minibatch of data, do this many passes over it to train. 
config['sgd_minibatch_size'] = 250              # The amount of data records per minibatch
config['model']['fcnet_hiddens'] = [512, 512]   # Larger network than we used for CartPole.
config['num_cpus_per_worker'] = 0               # This avoids running out of resources in the notebook environment when this cell is re-executed
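To get a feel for how these settings interact, here is a small back-of-the-envelope calculation. It assumes that PPO splits each training batch (`train_batch_size`, 4000 by default) into minibatches of `sgd_minibatch_size` records and makes `num_sgd_iter` passes over the batch. The function name is hypothetical and the rounding behavior is an assumption; treat the result as an estimate, not a description of RLlib internals.

```python
import math


def sgd_updates_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Estimate the number of minibatch gradient updates in one training iteration."""
    # Assumption: a final partial minibatch, if any, still counts as one update.
    minibatches_per_pass = math.ceil(train_batch_size / sgd_minibatch_size)
    return minibatches_per_pass * num_sgd_iter


# Values used in this notebook: 4000 / 250 = 16 minibatches, times 50 passes.
updates = sgd_updates_per_iteration(4000, 250, 50)
print(updates)
```

Halving `num_sgd_iter` or doubling `sgd_minibatch_size` halves this estimate, which is one reason those two settings have such a visible effect on the time per iteration.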

Recall you can see what configuration settings are defined for PPO. Note in particular the parameters for the deep learning `model`. As you try to make the performance better and better, what else might you modify here?

In [10]:
ppo.DEFAULT_CONFIG

{'num_workers': 2,
 'num_envs_per_worker': 1,
 'rollout_fragment_length': 200,
 'sample_batch_size': -1,
 'batch_mode': 'truncate_episodes',
 'num_gpus': 0,
 'train_batch_size': 4000,
 'model': {'conv_filters': None,
  'conv_activation': 'relu',
  'fcnet_activation': 'tanh',
  'fcnet_hiddens': [512, 512],
  'free_log_std': False,
  'no_final_linear': False,
  'vf_share_layers': True,
  'use_lstm': False,
  'max_seq_len': 20,
  'lstm_cell_size': 256,
  'lstm_use_prev_action_reward': False,
  'state_shape': None,
  'framestack': True,
  'dim': 84,
  'grayscale': False,
  'zero_mean': True,
  'custom_model': None,
  'custom_action_dist': None,
  'custom_options': {},
  'custom_preprocessor': None},
 'optimizer': {},
 'gamma': 0.99,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'env_config': {},
 'env': None,
 'normalize_actions': False,
 'clip_rewards': None,
 'clip_actions': True,
 'preprocessor_pref': 'deepmind',
 'lr': 5e-05,
 'monitor': False,
 'log_level': 'WAR

In [11]:
agent = ppo.PPOTrainer(config, env=SELECT_ENV)

2020-06-14 07:07:04,233	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-14 07:07:04,266	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-14 07:07:09,460	INFO trainable.py:217 -- Getting current IP.


Restore from a previously-captured checkpoint:

In [13]:
agent.restore('bipedal-walker-checkpoint/checkpoint-50')

2020-06-14 07:07:32,563	INFO trainable.py:217 -- Getting current IP.
2020-06-14 07:07:32,564	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: bipedal-walker-checkpoint/checkpoint-50
2020-06-14 07:07:32,565	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 50, '_timesteps_total': 200000, '_time_total': 458.983549118042, '_episodes_total': 180}


In [14]:
results = []
episode_data = []
episode_json = []
for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}. Checkpoint saved to {file_name}')

  0: Min/Mean/Max reward: -111.0537/-111.0537/-111.0537. Checkpoint saved to tmp/ppo/bipedal-walker/checkpoint_51/checkpoint-51
  1: Min/Mean/Max reward: -111.0537/ 55.3963/105.9468. Checkpoint saved to tmp/ppo/bipedal-walker/checkpoint_52/checkpoint-52
  2: Min/Mean/Max reward: -111.0537/ 55.3963/105.9468. Checkpoint saved to tmp/ppo/bipedal-walker/checkpoint_53/checkpoint-53
  3: Min/Mean/Max reward: -111.0537/ 74.3217/105.9468. Checkpoint saved to tmp/ppo/bipedal-walker/checkpoint_54/checkpoint-54
  4: Min/Mean/Max reward: -111.0537/ 90.6076/141.5110. Checkpoint saved to tmp/ppo/bipedal-walker/checkpoint_55/checkpoint-55
  5: Min/Mean/Max reward: -118.5845/ 60.5697/141.5110. Checkpoint saved to tmp/ppo/bipedal-walker/checkpoint_56/checkpoint-56
  6: Min/Mean/Max reward: -118.5845/ 52.8854/141.5110. Checkpoint saved to tmp/ppo/bipedal-walker/checkpoint_57/checkpoint-57
  7: Min/Mean/Max reward: -118.5845/ 60.6714/141.5110. Checkpoint saved to tmp/ppo/bipedal-walker/checkpoint_58/chec

The episode rewards should increase after multiple iterations. Try tweaking the config parameters. Smaller values for `num_sgd_iter`, `sgd_minibatch_size`, or the `model`'s `fcnet_hiddens` make each training iteration run faster, but the policy may then need more iterations to improve.

In [15]:
df = pd.DataFrame(data=episode_data)
df

| | n | episode_reward_min | episode_reward_mean | episode_reward_max | episode_len_mean |
|---|---|---|---|---|---|
| 0 | 0 | -111.053682 | -111.053682 | -111.053682 | 157.0 |
| 1 | 1 | -111.053682 | 55.396269 | 105.946782 | 1311.4 |
| 2 | 2 | -111.053682 | 55.396269 | 105.946782 | 1311.4 |
| 3 | 3 | -111.053682 | 74.321686 | 105.946782 | 1439.666667 |
| 4 | 4 | -111.053682 | 90.60761 | 141.510993 | 1489.0 |
| 5 | 5 | -118.584451 | 60.569715 | 141.510993 | 1340.875 |
| 6 | 6 | -118.584451 | 52.885396 | 141.510993 | 1293.789474 |
| 7 | 7 | -118.584451 | 60.671436 | 141.510993 | 1328.869565 |
| 8 | 8 | -118.584451 | 60.671436 | 141.510993 | 1328.869565 |
| 9 | 9 | -118.584451 | 72.813825 | 154.206144 | 1369.037037 |
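The mean reward is noisy from one iteration to the next (for example, it drops from about 90.6 to 60.6 between iterations 4 and 5), so a trailing average can make the trend easier to judge. The sketch below uses plain Python and the first ten mean rewards shown above; `rolling_mean` is a hypothetical helper, and the window size of 3 is an arbitrary choice.

```python
def rolling_mean(values, window):
    """Trailing mean over at most `window` values, one output per input."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# episode_reward_mean for iterations 0-9, copied from the output above.
mean_rewards = [-111.053682, 55.396269, 55.396269, 74.321686, 90.60761,
                60.569715, 52.885396, 60.671436, 60.671436, 72.813825]

smoothed = rolling_mean(mean_rewards, window=3)
for n, value in enumerate(smoothed):
    print(f"{n:3d}: {value:9.4f}")
```

With the `df` DataFrame from the previous cell, `df['episode_reward_mean'].rolling(3, min_periods=1).mean()` should give the same kind of trailing average.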


In [16]:
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

Here are the results from training runs 51-100:

In [17]:
plot_line_with_min_max(df, x_col='n', y_col='episode_reward_mean', min_col='episode_reward_min', max_col='episode_reward_max',
                       title='Bipedal Walker Episode Rewards', x_axis_label = 'n', y_axis_label='reward')

([image](../images/rllib/Bipedal-Walker-Rewards-100.png) after 100 training runs)

Compare with this image after 50 runs:

![image](../images/rllib/Bipedal-Walker-Rewards-50.png)

Also, print out the policy and model to see the results of training in detail…

In [18]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(24, 512) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(24, 512) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(512, 512) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(512, 512) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(512,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(512, 8) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(8,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(512, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
Model: "model"
________
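The variable shapes listed above are enough to work out the size of the network by hand. The sketch below copies those shapes and totals the trainable parameters for the policy and value branches. Interpreting the 8 outputs of `fc_out` as a mean and a log standard deviation for each of 4 motor actions is an assumption, not something the output states.

```python
# Shapes copied from the model.variables() output above.
shapes = {
    "fc_1/kernel": (24, 512), "fc_1/bias": (512,),
    "fc_value_1/kernel": (24, 512), "fc_value_1/bias": (512,),
    "fc_2/kernel": (512, 512), "fc_2/bias": (512,),
    "fc_value_2/kernel": (512, 512), "fc_value_2/bias": (512,),
    "fc_out/kernel": (512, 8), "fc_out/bias": (8,),
    "value_out/kernel": (512, 1), "value_out/bias": (1,),
}


def num_params(shape):
    """Multiply the dimensions of a shape to get its element count."""
    count = 1
    for dim in shape:
        count *= dim
    return count


# Value-branch variables are the ones with "value" in their name.
value_params = sum(num_params(s) for name, s in shapes.items() if "value" in name)
policy_params = sum(num_params(s) for name, s in shapes.items() if "value" not in name)
total_params = policy_params + value_params

print(f"policy: {policy_params}, value: {value_params}, total: {total_params}")
```

Almost all of the parameters sit in the two 512 x 512 hidden layers, so changing `fcnet_hiddens` is the quickest way to grow or shrink the model.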

## Rollout

Next we'll use the [RLlib rollout CLI](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

We'll use the last saved checkpoint you created for the rollout, `checkpoint_100` (or a different number you might have, see the output from the training above), evaluated through `2000` steps.

> **Notes:** 
>
> 1. If you changed the `checkpoint_root` value above, then change it here, too. Unfortunately, because of bugs in variable substitution in Jupyter notebooks, we can't use variables in the next cell.
> 2. If you changed the model parameters, specifically the `fcnet_hiddens` array in the `config` object above, make the same change here.

You may need to make one more modification, depending on how you are running this tutorial:

1. Running on your laptop? - Remove the line `--no-render`. 
2. Running on the Anyscale Service? The popup windows that would normally be created by the rollout can't be viewed in this case, so the `--no-render` flag suppresses them. The code cell afterwards provides a sample video. You can try adding `--video-dir tmp/ppo/bipedal-walker`, which will generate MP4 videos, then download them to view them. Or copy the `Video` cell below and use it to view the movies.
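Since the checkpoint path and the `fcnet_hiddens` value have to be kept in sync by hand, one option is to generate the command text in Python and paste it into the cell. The sketch below is a hypothetical helper that only assembles a string from the same flags used in this notebook; it does not run anything.

```python
import json
import shlex


def rollout_command(checkpoint_root, n, fcnet_hiddens, steps=2000, render=False):
    """Build the `rllib rollout` command line for checkpoint number n."""
    checkpoint = f"{checkpoint_root}/checkpoint_{n}/checkpoint-{n}"
    config = {"env": "BipedalWalker-v3", "model": {"fcnet_hiddens": fcnet_hiddens}}
    args = ["rllib", "rollout", checkpoint,
            "--config", json.dumps(config),
            "--run", "PPO",
            "--steps", str(steps)]
    if not render:
        args.append("--no-render")
    # shlex.quote protects the JSON config from the shell.
    return " ".join(shlex.quote(arg) for arg in args)


cmd = rollout_command("tmp/ppo/bipedal-walker", 100, [512, 512])
print(cmd)
```

Using `json.dumps` for the `--config` value avoids the hand-escaped quotes, and passing `render=True` drops the `--no-render` flag for laptop use.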

In [20]:
!RAY_ADDRESS=auto rllib rollout tmp/ppo/bipedal-walker/checkpoint_100/checkpoint-100 \
    --config "{\"env\": \"BipedalWalker-v3\", \"model\": {\"fcnet_hiddens\": [512, 512]}}" \
    --run PPO \
    --no-render \
    --steps 2000

2020-06-14 07:22:19,934	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-14 07:22:19,952	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-14 07:22:23,050	INFO trainable.py:217 -- Getting current IP.
2020-06-14 07:22:23,128	INFO trainable.py:217 -- Getting current IP.
2020-06-14 07:22:23,128	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: tmp/ppo/bipedal-walker/checkpoint_100/checkpoint-100
2020-06-14 07:22:23,128	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 100, '_timesteps_total': 400000, '_time_total': 938.3951942920685, '_episodes_total': 326}
Episode #0: reward: 204.43550759620186
Episode #1: reward: 48.09873094424978


Here is a sample episode video after 100 training iterations.

> **Note:** This video was created by running the previous `rllib rollout` command with the additional argument `--video-dir tmp/ppo/bipedal-walker` (then the video was copied to the location below). It creates one video per episode.

In [5]:
from IPython.display import Video

sample_video='../../images/rllib/Bipedal-Walker-Example-100.mp4'
Video(sample_video, embed=True)

Finally, use [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started) to visualize the results.

In [4]:
import os
os.curdir

'.'

## Exercise ("Homework")

In addition to _Cart Pole_, _Bipedal Walker_, and _Mountain Car_, there are other so-called ["classic control"](https://gym.openai.com/envs/#classic_control) examples you can try. Make a copy of this notebook and edit as required.