# Ray RLlib - Extra Application Example - Taxi-v3

© 2019-2021, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademyLogo.png)

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `Taxi-v3` environment ([gym.openai.com/envs/Taxi-v3/](https://gym.openai.com/envs/Taxi-v3/)). The goal is to pick up a passenger and drop them off at their destination in as few steps as possible, navigating around the walls in the grid. This is one of OpenAI Gym's ["toy text"](https://gym.openai.com/envs/#toy_text) problems.

For more background about this problem, see:

* ["Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"](https://arxiv.org/abs/cs/9905014), [Thomas G. Dietterich](https://twitter.com/tdietterich)
* ["Reinforcement Learning: let’s teach a taxi-cab how to drive"](https://towardsdatascience.com/reinforcement-learning-lets-teach-a-taxi-cab-how-to-drive-4fd1a0d00529), [Valentina Alto](https://twitter.com/AltoValentina)

In [None]:
import pandas as pd
import json
import os
import shutil
import sys
import ray
import ray.rllib.agents.ppo as ppo

In [None]:
info = ray.init(ignore_reinit_error=True)

In [None]:
print("Dashboard URL: http://{}".format(info["webui_url"]))

Set up the checkpoint location:

In [None]:
checkpoint_root = "tmp/ppo/taxi"
shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)   # clean up old runs

Next we'll train an RLlib policy with the `Taxi-v3` environment.

By default, training runs for `10` iterations. Increase the `N_ITER` setting if you want to see the resulting rewards improve.
Also note that *checkpoints* get saved after each iteration into the `tmp/ppo/taxi` directory, relative to the notebook's working directory.

> **Note:** If you prefer a different directory than `tmp/ppo/taxi`, change `checkpoint_root` in the cell above **and** the checkpoint path in the `rllib rollout` command below.

In [None]:
SELECT_ENV = "Taxi-v3"
N_ITER = 10

config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

In [None]:
results = []
episode_data = []
episode_json = []

for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']
              }
    
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    
    print(f'{n+1:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}, len mean: {result["episode_len_mean"]:8.4f}. Checkpoint saved to {file_name}')

Do the episode rewards increase after multiple iterations?
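
One way to answer that question is to compare the mean reward of the first and last iterations. The following is a minimal sketch: `reward_trend` is a hypothetical helper, and the sample values are placeholders for illustration only, not results from a real run. In the notebook you would pass the `episode_data` list collected above instead.

```python
def reward_trend(episodes):
    """Return (first, last, change) of the mean episode reward across iterations."""
    means = [e["episode_reward_mean"] for e in episodes]
    return means[0], means[-1], means[-1] - means[0]

# Placeholder values for illustration only; NOT results from a real training run.
sample_episode_data = [
    {"n": 0, "episode_reward_mean": -700.0},
    {"n": 1, "episode_reward_mean": -650.0},
    {"n": 2, "episode_reward_mean": -600.0},
]

first, last, change = reward_trend(sample_episode_data)
print(f"mean reward: {first:.1f} -> {last:.1f} (change {change:+.1f})")
```

A positive change suggests the policy is improving. With only a few iterations the trend can be noisy, so treat it as a rough indicator.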

Also, print out the policy and model to see the results of training in detail…

In [None]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

## Rollout

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

The output from the following command visualizes the "taxi" agent operating within its simulation: picking up a passenger, driving, turning, dropping off a passenger ("put-down"), and so on. 

A 2-D map of the *observation space* is visualized as text, which needs some decoding instructions:

  * `R` -- R(ed) location in the Northwest corner
  * `G` -- G(reen) location in the Northeast corner
  * `Y` -- Y(ellow) location in the Southwest corner
  * `B` -- B(lue) location in the Southeast corner
  * `:` -- boundaries between cells that the taxi can drive across
  * `|` -- obstructions ("walls") which the taxi cannot drive through
  * a blue letter marks the current passenger's location for pick-up
  * a purple letter marks the drop-off location
  * a yellow rectangle marks the current location of our taxi/agent

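The taxi can occupy any of the 25 cells of the 5×5 grid, the passenger can be at one of the four lettered locations or inside the taxi (5 possibilities), and the destination is one of the four lettered locations. That allows for a total of 25 × 5 × 4 = 500 states, which are numbered from 0 to 499.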
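
To see how four small components can map onto a single state number between 0 and 499, here is a minimal sketch of a mixed-radix encoding. It follows our understanding of the scheme used by the Gym environment, but the function and constant names are illustrative, and we assume a 5×5 grid with passenger index 4 meaning "in the taxi".

```python
GRID_ROWS, GRID_COLS = 5, 5
NUM_PASSENGER_LOCATIONS = 5  # R, G, Y, B, or in the taxi (assumed index 4)
NUM_DESTINATIONS = 4         # R, G, Y, B

NUM_STATES = GRID_ROWS * GRID_COLS * NUM_PASSENGER_LOCATIONS * NUM_DESTINATIONS

def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four state components into one integer."""
    state = taxi_row
    state = state * GRID_COLS + taxi_col
    state = state * NUM_PASSENGER_LOCATIONS + passenger_loc
    state = state * NUM_DESTINATIONS + destination
    return state

def decode(state):
    """Unpack a state integer into (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, NUM_DESTINATIONS)
    state, passenger_loc = divmod(state, NUM_PASSENGER_LOCATIONS)
    taxi_row, taxi_col = divmod(state, GRID_COLS)
    return taxi_row, taxi_col, passenger_loc, destination

print(NUM_STATES)          # 500
print(encode(4, 4, 4, 3))  # 499, the largest state number
print(decode(499))         # (4, 4, 4, 3)
```

Because every component has a fixed range, each combination gets a unique number and the mapping can be reversed with `decode`.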

The *action space* for the taxi/agent is defined as:

  * move the taxi one square North
  * move the taxi one square South
  * move the taxi one square East
  * move the taxi one square West
  * pick-up the passenger
  * put-down the passenger
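
When you inspect actions in code, they appear as integers rather than names. The sketch below gives them readable names. The numbering is an assumption based on our understanding of Gym's `Taxi-v3` (note that South comes before North there), and `TaxiAction` is a hypothetical helper, not part of Gym or RLlib, so verify the numbering against your installed environment.

```python
from enum import IntEnum

class TaxiAction(IntEnum):
    """Assumed integer codes for the six Taxi-v3 actions."""
    SOUTH = 0
    NORTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    DROPOFF = 5

for action in TaxiAction:
    print(int(action), action.name)
```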

The *rewards* are structured as -1 for each action, with two exceptions:

 * +20 points when the taxi performs a correct drop-off for the passenger
 * -10 points when the taxi attempts illegal pick-up/drop-off actions
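
To get a feel for the episode rewards printed during training, here is a small sketch that totals the reward for a sequence of steps. It assumes that the +20 and -10 values replace the per-step -1 rather than add to it, which is our understanding of the Gym implementation. The outcome labels (`"move"`, `"pickup_ok"`, `"dropoff_ok"`, `"illegal"`) are hypothetical names used only for this illustration.

```python
STEP_REWARD = -1      # any ordinary action, including a legal pick-up
DROPOFF_REWARD = 20   # correct drop-off at the destination
ILLEGAL_REWARD = -10  # illegal pick-up or drop-off attempt

def step_reward(outcome):
    """Map a (hypothetical) outcome label to its reward."""
    if outcome == "dropoff_ok":
        return DROPOFF_REWARD
    if outcome == "illegal":
        return ILLEGAL_REWARD
    return STEP_REWARD

def episode_return(outcomes):
    """Sum the rewards over one episode."""
    return sum(step_reward(o) for o in outcomes)

# Ten moves, one legal pick-up, one correct drop-off: 11 * (-1) + 20
print(episode_return(["move"] * 10 + ["pickup_ok", "dropoff_ok"]))  # 9
```

Under these assumptions, a short and efficient trip yields a small positive return, while wandering or illegal actions quickly push the total negative.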

Admittedly it'd be better if these state visualizations showed the *reward* along with observations.

In [None]:
!rllib rollout \
    tmp/ppo/taxi/checkpoint_10/checkpoint-10 \
    --config "{\"env\": \"Taxi-v3\"}" \
    --run PPO \
    --steps 2000
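
The checkpoint path in the command above depends on both the checkpoint root and the number of training iterations, so it needs updating if you change either one. The sketch below builds the path from those two values. `checkpoint_path` is a hypothetical helper, and the `checkpoint_<n>/checkpoint-<n>` layout is assumed from the command above; it may differ in other Ray versions.

```python
def checkpoint_path(checkpoint_root, iteration):
    """Build the checkpoint path for a given iteration (layout assumed from the rollout command)."""
    return f"{checkpoint_root}/checkpoint_{iteration}/checkpoint-{iteration}"

print(checkpoint_path("tmp/ppo/taxi", 10))  # tmp/ppo/taxi/checkpoint_10/checkpoint-10
```

Alternatively, the `file_name` value printed by the training loop after the last iteration already holds the path of the most recent checkpoint.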

In [None]:
ray.shutdown()  # "Undo ray.init()".

## Exercise ("Homework")

In addition to _Taxi_, there are other so-called ["toy text"](https://gym.openai.com/envs/#toy_text) problems you can try.