# Ray RLlib - Extra Application Example - MountainCar-v0

© 2019-2021, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademyLogo.png)

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `MountainCar-v0` environment ([gym.openai.com/envs/MountainCar-v0/](gym.openai.com/envs/MountainCar-v0/)). The idea is that a car starts at an arbitrary point on a hill. Without any "pushes", it will rock back and forth between the two sides of the valley below, never rising above the starting point. However, there are three actions: accelerate to the left (by some unit), accelerate to the right, or apply no acceleration. Timing accelerations in the appropriate directions at the appropriate steps is the key to getting to the top of the hill.

The primary idea demonstrated in this lesson is how to start from a previous checkpoint. A checkpoint is provided in the `mountain-car-checkpoint` directory, captured after 200 training episodes. Still, even with the provided checkpoint and an additional 50 episodes of training, the car is unable to reach the top.

Hence, you should consider this lesson a big exercise to try when you aren't pressed for time (as you would be in a class setting). Modifications you can try are discussed below.

> **Note:** The rollout at the end of this lesson can only show its visualization popup windows when running on a local laptop.

Like `CartPole`, _MountainCar_ is one of OpenAI Gym's ["classic control"](https://gym.openai.com/envs/#classic_control) examples.

For more background about this problem, see:

* ["Efficient memory-based learning for robot control"](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-209.pdf), [Andrew William Moore](https://www.cl.cam.ac.uk/~awm22/), University of Cambridge (1990)
* ["Solving Mountain Car with Q-Learning"](https://medium.com/@ts1829/solving-mountain-car-with-q-learning-b77bf71b1de2), [Tim Sullivan](https://twitter.com/ts_1829)

Import Ray and the PPO support, then start Ray…

In [None]:
import pandas as pd
import json
import os
import shutil
import sys
import ray
import ray.rllib.agents.ppo as ppo

In [None]:
info = ray.init(ignore_reinit_error=True)

The Ray Dashboard is useful for monitoring Ray:

In [None]:
print("Dashboard URL: http://{}".format(info["webui_url"]))

Next we'll train an RLlib policy with the `MountainCar-v0` environment.

By default, training runs for `20` iterations. Increase the `N_ITER` setting _considerably_ if you want to train long enough to see good results. Consider saving checkpoints from high iteration counts and using them in the `agent.restore()` cell below. Note that the directory used in that cell is different from the directory used to save the *checkpoints* after each iteration, `tmp/ppo/mountain-car`.

For `MountainCar`, the environment has these parameters and behaviors (from this [source code](https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py)):

```
Observation
    Type: Box(2)
    Num    Observation               Min            Max
    0      Car Position              -1.2           0.6
    1      Car Velocity              -0.07          0.07
Actions:
    Type: Discrete(3)
    Num    Action
    0      Accelerate to the Left
    1      Don't accelerate
    2      Accelerate to the Right
    Note: This does not affect the amount of velocity affected by the
    gravitational pull acting on the car.
Reward:
     Reward of 0 is awarded if the agent reached the flag (position = 0.5)
     on top of the mountain.
     Reward of -1 is awarded if the position of the agent is less than 0.5.
Starting State:
     The position of the car is assigned a uniform random value in
     [-0.6 , -0.4].
     The starting velocity of the car is always assigned to 0.
 Episode Termination:
     The car position is more than 0.5
     Episode length is greater than 200
```
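The description above can be turned into a small stand-alone simulation. The following is a minimal sketch, not part of the lesson's RLlib code: it assumes the update rule and constants used in the linked source file, and the helper names (`step`, `steps_to_goal`, `coast`, `push_with_motion`) are hypothetical. It compares never accelerating with a hand-written rule that always accelerates in the current direction of travel.

```python
import math

# Constants assumed to match the linked gym source for MountainCar-v0.
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL_POS = -1.2, 0.6, 0.07, 0.5


def step(position, velocity, action):
    """Advance one time step; action is 0 (left), 1 (none), or 2 (right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left boundary stops the car
    return position, velocity


def steps_to_goal(policy, start=-0.5, max_steps=200):
    """Return the number of steps needed to reach the flag, or None on timeout."""
    position, velocity = start, 0.0
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POS:
            return t
    return None


def coast(position, velocity):
    return 1  # never accelerate


def push_with_motion(position, velocity):
    return 2 if velocity >= 0 else 0  # accelerate in the direction of travel


print("coasting:", steps_to_goal(coast))
print("pushing with the motion:", steps_to_goal(push_with_motion))
```

The hand-written rule is only a reference point for what a trained policy has to discover on its own; it is not what PPO learns.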

Clean up previous stuff:

In [None]:
checkpoint_root = "tmp/ppo/mountain-car"
shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)   # clean up old runs

Here is the default configuration for PPO applied to this environment. There are no configuration parameters that are passed to _MountainCar_ itself:

In [None]:
ppo.DEFAULT_CONFIG

The next cell copies the default configuration and makes a few modifications, like a larger training batch size. Other changes you might consider are the following:

* Tweak the `model` parameters for the neural net.
* Try other `train_batch_size` values (default: `4000`).
* SGD parameters: `num_sgd_iter` and `sgd_minibatch_size`.

To speed up training:

* Increase the `num_workers` to fully utilize your available machine or cluster.
* Use GPUs if you have them available.
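
One way to pick these resource settings is sketched below. The helper `scaling_overrides` and its defaults are hypothetical; `num_workers` and `num_gpus` are the configuration keys, and the choice to leave one CPU free for the trainer process is an assumption, not a rule from the lesson.

```python
import os


def scaling_overrides(reserved_cpus=1, num_gpus=0):
    """Suggest resource settings (hypothetical helper, not an RLlib API)."""
    cpus = os.cpu_count() or 1
    return {
        # Use the remaining CPUs for rollout workers, but always at least one.
        "num_workers": max(1, cpus - reserved_cpus),
        # GPUs assigned to the trainer process; 0 means CPU only.
        "num_gpus": num_gpus,
    }


overrides = scaling_overrides()
print(overrides)
```

In the notebook, the returned dictionary could be merged with `config.update(overrides)` before the trainer is created.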

In [None]:
SELECT_ENV = "MountainCar-v0"
N_ITER = 20

In [None]:
config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"            # the default, at this time
config["num_workers"] = 4               # default = 2
config["train_batch_size"] = 10000      # default = 4000
config["sgd_minibatch_size"] = 256      # default = 128
config["evaluation_num_episodes"] = 50  # default = 10

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

In [None]:
agent.restore("mountain-car-checkpoint/checkpoint-20")

In [None]:
results = []
episode_data = []
episode_json = []

for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']
              }
    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    
    print(f'{n+1:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}, len mean: {result["episode_len_mean"]:8.4f}. Checkpoint saved to {file_name}')

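Training gives up on an episode after 200 steps. Each step that leaves the car short of the flag earns a reward of `-1`, so an episode of `N` steps that never reaches the top of the hill has a total reward of `-1*N` (`-200` at the step limit). The step that reaches the top earns a reward of zero. Hence, there are no incremental rewards for getting closer; it's success or failure.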
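
The per-iteration records collected in the training loop can be inspected with a few lines of plain Python. This is a sketch only: the records below are placeholders in the same shape as `episode_data`, with invented numbers that are not results from a real run, and `best_iteration` is a hypothetical helper.

```python
import json

# Placeholder records shaped like the entries of episode_data.
# The values are invented for illustration, not measured.
episode_data = [
    {"n": 0, "episode_reward_min": -200.0, "episode_reward_mean": -200.0,
     "episode_reward_max": -200.0, "episode_len_mean": 200.0},
    {"n": 1, "episode_reward_min": -200.0, "episode_reward_mean": -198.5,
     "episode_reward_max": -170.0, "episode_len_mean": 198.5},
    {"n": 2, "episode_reward_min": -200.0, "episode_reward_mean": -199.2,
     "episode_reward_max": -181.0, "episode_len_mean": 199.2},
]


def best_iteration(records):
    """Return the record with the highest mean episode reward."""
    return max(records, key=lambda r: r["episode_reward_mean"])


best = best_iteration(episode_data)

# An episode that reaches the flag stops early, so its reward is above -200.
solved_any = any(r["episode_reward_max"] > -200 for r in episode_data)

print(json.dumps(best))
print("any episode reached the flag:", solved_any)
```

A maximum reward stuck at `-200` across all iterations means no episode reached the flag, which is the situation described at the start of this lesson.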

Let's print out the policy and model to see the results of training in detail…

In [None]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

In [None]:
ray.shutdown()

## Rollout

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

This visualizes the "car" agent operating within the simulation: rocking back and forth to gain momentum to overcome the mountain, using the last checkpoint. Edit the number in the checkpoint path if necessary! Also change the `--config` argument to match any configuration changes you made above.

> **Note:** This rollout can only show the visualization popup windows when running on a local laptop.

In [None]:
!rllib rollout \
    tmp/ppo/mountain-car/checkpoint_200/checkpoint-200 \
    --config '{"env": "MountainCar-v0", "num_workers":4, "train_batch_size":10000, "sgd_minibatch_size":256, "evaluation_num_episodes":50}' --run PPO \
    --steps 2000

The rollout uses the checkpoint given in the path above, evaluated through `2000` steps.
Modify the path to view other checkpoints.
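
To avoid editing the number by hand, the newest checkpoint can be looked up from the directory listing. The sketch below assumes the `checkpoint_<N>/checkpoint-<N>` layout seen in the rollout command above (the directory number may be zero-padded in some Ray versions); `latest_checkpoint` is a hypothetical helper, not part of RLlib.

```python
import os
import re


def latest_checkpoint(root):
    """Return root/checkpoint_<N>/checkpoint-<N> for the largest N, or None."""
    if not os.path.isdir(root):
        return None
    candidates = []
    for name in os.listdir(root):
        match = re.fullmatch(r"checkpoint_(\d+)", name)
        if match and os.path.isdir(os.path.join(root, name)):
            # Keep the directory name as found, in case it is zero-padded.
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    number, name = max(candidates)
    return os.path.join(root, name, f"checkpoint-{number}")


print(latest_checkpoint("tmp/ppo/mountain-car"))
```

The printed path, if any, is what would go in place of the checkpoint argument of `rllib rollout`.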

## Exercise ("Homework")

In addition to _Mountain Car_ and _Cart Pole_, there are other so-called ["classic control"](https://gym.openai.com/envs/#classic_control) examples you can try.