# Ray RLlib - Explore RLlib - Sample Application: BipedalWalker-v3 (Optional)


© 2019-2022, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademyLogo.png)

This example tackles a harder problem, the _Bipedal Walker_, a two-legged "robot" in two dimensions. See [here](https://gym.openai.com/envs/BipedalWalker-v2/) and [here](https://github.com/openai/gym/wiki/BipedalWalker-v2) for descriptions of version 2; we'll actually use version 3.

![Bipedal Walker](../../images/rllib/Bipedal-Walker.png)

([source](https://gym.openai.com/envs/BipedalWalker-v2/))

Reward is given for moving forward, totaling 300+ points if the robot reaches the far end. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so an agent that minimizes torque application will get a better score. The state consists of the hull angle, angular velocity, horizontal speed, vertical speed, joint positions, joint angular speeds, leg contact with the ground, and 10 LIDAR rangefinder measurements. There are no coordinates in the state vector.
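
As a sanity check on the description above, here is a minimal sketch that tallies the components of the state vector. The group labels are illustrative, and the per-group counts for the hull, joints, and leg contacts are assumptions based on a reading of the Gym source for this environment (only the 10 LIDAR measurements are stated above). Compare the total against `env.observation_space.shape` when you have Gym installed.

```python
# Assumed layout of the BipedalWalker observation vector. The labels are
# illustrative and the counts (except LIDAR) are assumptions to verify
# against env.observation_space.shape.
OBSERVATION_LAYOUT = {
    "hull (angle, angular velocity, horizontal speed, vertical speed)": 4,
    "joints (position and angular speed for 2 hips and 2 knees)": 8,
    "leg ground-contact flags": 2,
    "LIDAR rangefinder measurements": 10,
}

observation_size = sum(OBSERVATION_LAYOUT.values())

# Print one line per group, then the total number of state elements.
for name, count in OBSERVATION_LAYOUT.items():
    print(f"{count:2d}  {name}")
print(f"{observation_size:2d}  total")
```

If these assumptions hold, the total is 24 values per observation, none of which is an absolute coordinate.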

This notebook requires more computation than the other lessons to achieve a well-trained policy. To speed things up, we provide a checkpoint from a previous training run. Starting from that checkpoint, you'll see good results after only a few more iterations. However, consider iterating on the neural network structure and running more training iterations. How well can you train the walker?

First, import Ray and the PPO module in RLlib. We'll start Ray a few cells below.

> **NOTE:** There are lots of warnings from TF code for transitioning between V1 and V2. Please ignore them.

In [None]:
import ray
import ray.rllib.agents.ppo as ppo

In [None]:
import pandas as pd
import json
import os
import shutil
import sys

Model *checkpoints* will be saved after each iteration into directories under `tmp/ppo/bipedal-walker`, relative to this notebook's directory.
If you don't specify a location, checkpoints go to `$HOME/ray_results/<algo_env>/.../checkpoint_N` by default.

> **Note:** If you prefer to use a different directory root, change it in the next cell _and_ in the `rllib rollout` command below.

In [None]:
checkpoint_root = "tmp/ppo/bipedal-walker"

Clean up output of previous lessons (optional):

In [None]:
# Where checkpoints are written:
shutil.rmtree(checkpoint_root, ignore_errors=True)

# Where some data will be written and used by TensorBoard below.
# Caution: this deletes the results of all previous Ray runs, not just this lesson's.
ray_results = os.path.expanduser("~/ray_results")
shutil.rmtree(ray_results, ignore_errors=True)

Start Ray:

In [None]:
info = ray.init(ignore_reinit_error=True)

The Ray Dashboard is useful for monitoring Ray:

In [None]:
print("Dashboard URL: http://{}".format(info["webui_url"]))

Next we'll train a policy for the [Bipedal Walker](https://gym.openai.com/envs/BipedalWalker-v2/) environment.

> **Note:** If you change the values shown for `config['model']['fcnet_hiddens']`, make the same change in the `rllib rollout` command below!

In [None]:
SELECT_ENV = "BipedalWalker-v3"                 # Specifies the OpenAI Gym environment
N_ITER = 20                                     # Number of training iterations. We'll only do 20 because this is compute intensive.
                                                # If you have a powerful machine or cluster or more time, try a bigger number like 50 or 100!

config = ppo.DEFAULT_CONFIG.copy()              # PPO's default configuration. See the next code cell.
config["log_level"] = "WARN"                    # Suppress too many messages, but try "INFO" to see what can be printed.
config["framework"] = "tf"                      # TensorFlow

# Other settings we might adjust:
config["num_workers"] = 4                       # Use > 1 for using more CPU cores, including over a cluster
config["num_sgd_iter"] = 50                     # Number of SGD (stochastic gradient descent) passes over each train batch.
                                                # I.e., in every training iteration, the collected batch is split into minibatches and swept this many times.
config["sgd_minibatch_size"] = 250              # Number of samples (timesteps) per SGD minibatch
config["model"]["fcnet_hiddens"] = [512, 512]   # Larger network than we used for CartPole.
config["num_cpus_per_worker"] = 0               # This avoids running out of resources in the notebook environment when this cell is re-executed
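
To get a feel for how the SGD settings above interact, the following minimal sketch estimates how many minibatch gradient updates happen in one training iteration. The helper name is hypothetical, and the train batch size of 4000 is an assumption (it is believed to be PPO's default in this RLlib version); read the real value from `config["train_batch_size"]`. The count is a rough estimate, not RLlib's exact bookkeeping.

```python
import math


def sgd_steps_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Rough count of minibatch gradient updates in one training iteration."""
    # Each pass over the train batch is split into minibatches.
    minibatches_per_pass = math.ceil(train_batch_size / sgd_minibatch_size)
    # The whole batch is swept num_sgd_iter times.
    return minibatches_per_pass * num_sgd_iter


# Assumed value; check config["train_batch_size"] for the real one.
ASSUMED_TRAIN_BATCH_SIZE = 4000

steps = sgd_steps_per_iteration(ASSUMED_TRAIN_BATCH_SIZE,
                                sgd_minibatch_size=250,
                                num_sgd_iter=50)
print(f"About {steps} minibatch updates per training iteration")
```

Under that assumption, lowering `num_sgd_iter` reduces the number of updates per iteration proportionally.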

Recall you can see what configuration settings are defined for PPO. Note in particular the parameters for the deep learning `model`. As you try to make the performance better and better, what else might you modify here?

In [None]:
ppo.DEFAULT_CONFIG

> **Note:** You can safely ignore the following warnings if you see them:
> ```
> WARNING:tensorflow:From .../python3.X/site-packages/tensorflow_core/python/compat/v2_compat.py:88: disable_resource_variables 
> (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version...
> ```
> Also, there may be warnings about `box bound precision...`.

In [None]:
agent = ppo.PPOTrainer(config, env=SELECT_ENV)

Restore from a checkpoint that was captured previously, after 100 training iterations.

> **WARNING:** If you change the configuration parameters above and you get an exception on the next line, it probably means the checkpoint is incompatible with the change. Just skip loading the checkpoint, but consider training for 100-200 iterations instead of 20.

In [None]:
agent.restore("bipedal-walker-checkpoint/checkpoint-100")
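
If you expect the checkpoint to be incompatible with your configuration changes, one option is to wrap the restore call so the notebook keeps running. This is only a sketch: `restore_or_skip` is a hypothetical helper, it assumes nothing about the agent beyond the `restore(path)` method used above, and because the exact exception type is not known here, it catches `Exception` broadly.

```python
def restore_or_skip(agent, checkpoint_path):
    """Try to restore `agent` from `checkpoint_path`; return True on success.

    Hypothetical helper: relies only on `agent.restore(path)`. The exception
    raised for an incompatible checkpoint is an unknown here, so any
    Exception is treated as "skip the checkpoint".
    """
    try:
        agent.restore(checkpoint_path)
    except Exception as err:
        print(f"Could not restore {checkpoint_path!r}: {err}")
        print("Continuing with a freshly initialized policy.")
        return False
    return True
```

You could then call `restore_or_skip(agent, "bipedal-walker-checkpoint/checkpoint-100")` and, if it returns `False`, raise `N_ITER` to compensate for starting from scratch.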

Train for an additional `N_ITER` iterations. 

> **Note:** Depending on the machine or cluster you are running on, this can take a long time. If you are on a powerful laptop or running in a cluster, or you don't mind waiting, try using a larger value for `N_ITER`.

In [None]:
results = []
episode_data = []
episode_json = []

for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}
    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    
    print(f'{n:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}. Checkpoint saved to {file_name}')

The episode rewards should increase after multiple iterations. Try tweaking the config parameters. Smaller values for the `num_sgd_iter`, `sgd_minibatch_size`, or the `model`'s `fcnet_hiddens` will train faster, but take longer to improve the policy.

In [None]:
df = pd.DataFrame(data=episode_data)
df
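
The training loop also builds `episode_json`, a list of JSON strings that this notebook doesn't use further. One possible use is to persist the per-iteration metrics so that several runs can be compared later. The sketch below writes records as JSON lines and reads them back. The helper names and the file name are hypothetical, and the records are placeholders with made-up numbers, not real training output.

```python
import json
import os
import tempfile


def write_json_lines(records, path):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Read back a list of dicts, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder records with the same keys as `episode` in the training loop.
# The values are made up for illustration only.
placeholder_episodes = [
    {"n": 0, "episode_reward_min": -1.0, "episode_reward_mean": 0.0,
     "episode_reward_max": 1.0, "episode_len_mean": 10.0},
    {"n": 1, "episode_reward_min": -0.5, "episode_reward_mean": 0.5,
     "episode_reward_max": 1.5, "episode_len_mean": 12.0},
]

history_path = os.path.join(tempfile.mkdtemp(), "episodes.jsonl")
write_json_lines(placeholder_episodes, history_path)
restored = read_json_lines(history_path)
print(f"Wrote and re-read {len(restored)} records from {history_path}")
```

In the notebook you would pass `episode_data` instead of the placeholder list, and a list read back this way can be handed to `pd.DataFrame(data=...)` just like `episode_data`.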

Here are the results of starting from the iteration-100 checkpoint and training for an additional `N_ITER` iterations:

In [None]:
df.plot(x="n", y=["episode_reward_mean", "episode_reward_min", "episode_reward_max"], secondary_y=True)

Compare with these images after 50 and 100 iterations. Note the sign of the `reward` in all graphs!

After 100 iterations, starting from a checkpoint at 50 (so 50 _new_ iterations):

![image](../../images/rllib/Bipedal-Walker-Rewards-100.png)

After the first 50 iterations:

![image](../../images/rllib/Bipedal-Walker-Rewards-50.png)

By 100 iterations, the reward has mostly leveled off.

Let's print out the policy and model to see the results of training in detail…

In [None]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

## Rollout

Next we'll use the [RLlib rollout CLI](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

We'll use the last checkpoint you saved for the rollout, `checkpoint_120` (the number will differ if you changed the number of training iterations; see the output from the training loop above), and evaluate it for `2000` steps.
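
If you aren't sure which checkpoint is the latest, a small helper can find it. The sketch below assumes the `checkpoint_N/checkpoint-N` layout that appears in the rollout command in this notebook; `latest_checkpoint` is a hypothetical helper, and the demo runs against a throwaway directory rather than your real checkpoints.

```python
import os
import re
import tempfile

# Matches directory names such as "checkpoint_120".
CHECKPOINT_DIR = re.compile(r"^checkpoint_(\d+)$")


def latest_checkpoint(root):
    """Return `root/checkpoint_N/checkpoint-N` for the largest N, or None."""
    best = None  # (number, directory name)
    for name in os.listdir(root):
        match = CHECKPOINT_DIR.match(name)
        if match and os.path.isdir(os.path.join(root, name)):
            number = int(match.group(1))
            if best is None or number > best[0]:
                best = (number, name)
    if best is None:
        return None
    number, name = best
    return os.path.join(root, name, f"checkpoint-{number}")


# Demonstrate on a throwaway directory tree, not on real checkpoints.
demo_root = tempfile.mkdtemp()
for n in (9, 101, 120):
    os.makedirs(os.path.join(demo_root, f"checkpoint_{n}"))

print(latest_checkpoint(demo_root))
```

Comparing the numbers as integers matters here: a plain string sort would rank `checkpoint_9` after `checkpoint_120`.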

> **Notes:** 
>
> 1. If you changed the `checkpoint_root` value above, then change it here, too. Unfortunately, because of bugs in variable substitution in Jupyter notebooks, we can't use variables in the next cell.
> 2. If you changed the model parameters, specifically the `fcnet_hiddens` array in the `config` object above, make the same change here.

You may need to make one more modification, depending on how you are running this tutorial:

1. Running on your laptop? Remove the line `--no-render`.
2. Running on the Anyscale Service? The popup windows that the rollout would normally create can't be viewed in this case, so the `--no-render` flag suppresses them. The code cell after the rollout shows a sample video. You can also add `--video-dir tmp/ppo/bipedal-walker`, which generates MP4 videos that you can download and view, or copy the `Video` cell below and use it to view them.

In [None]:
!rllib rollout tmp/ppo/bipedal-walker/checkpoint_120/checkpoint-120 \
    --config "{\"env\": \"BipedalWalker-v3\", \"model\": {\"fcnet_hiddens\": [512, 512]}}" \
    --run PPO \
    --no-render \
    --steps 2000
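
Because the command above hard-codes the checkpoint path and the escaped JSON configuration, it is easy for them to drift out of sync with the training cells. One possible workaround, sketched here, is to assemble the same arguments in Python. `build_rollout_command` is a hypothetical helper; it uses only the flags that appear in the command above and only prints the result.

```python
import json
import shlex


def build_rollout_command(checkpoint_path, env, fcnet_hiddens, steps, render=False):
    """Assemble the `rllib rollout` arguments shown above as a list of strings."""
    # json.dumps takes care of the quoting that is escaped by hand above.
    config = {"env": env, "model": {"fcnet_hiddens": fcnet_hiddens}}
    command = [
        "rllib", "rollout", checkpoint_path,
        "--config", json.dumps(config),
        "--run", "PPO",
        "--steps", str(steps),
    ]
    if not render:
        command.append("--no-render")
    return command


command = build_rollout_command(
    "tmp/ppo/bipedal-walker/checkpoint_120/checkpoint-120",
    env="BipedalWalker-v3",
    fcnet_hiddens=[512, 512],
    steps=2000,
)

# Print a shell-quoted version for inspection (shlex.join needs Python 3.8+).
print(shlex.join(command))
```

To actually run it, you could pass the list to `subprocess.run(command, check=True)`, which avoids shell quoting issues altogether.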

Here is a sample episode video after 100 training iterations.

> **Note:** This video was created by running the previous `rllib rollout` command with the additional argument `--video-dir tmp/ppo/bipedal-walker` (then the video was copied to the location below). It creates one video per episode.

In [None]:
from IPython.display import Video

sample_video = "../../images/rllib/Bipedal-Walker-Example-100.mp4"
Video(sample_video, embed=True)

Finally, use [TensorBoard](https://ray.readthedocs.io/en/latest/rllib-training.html#getting-started) to visualize the results that were written under `$HOME/ray_results`. When you are done, shut down Ray:

In [None]:
ray.shutdown()

## Exercise 1 ("Homework")

Try a long training run while you do other work. Increase `N_ITER` above to some large number. When it finishes, change the `rllib rollout` command to use the last checkpoint. How well does it run? 

Redo the experiment a few times. You might increase `N_ITER`. For each run, load the last checkpoint that was saved in the previous run. How well can you train the walker?

## Exercise 2 ("Homework")

In addition to _Cart Pole_, _Bipedal Walker_, and _Mountain Car_ (see the `extras` folder), there are other so-called ["classic control"](https://gym.openai.com/envs/#classic_control) examples you can try. Make a copy of this notebook and edit as required.