The code below, from the RLlib docs, trains a PPO agent on the OpenAI Gym Taxi environment (`Taxi-v3`).

There are 6 valid actions: 0, 1, 2, 3, 4, 5. Try changing the actions to see what happens to the taxi (yellow) in the grid environment.
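
As a reference while experimenting, the six action codes can be given readable names. The mapping below follows the Taxi-v3 action encoding, and it agrees with the `(North)`, `(Pickup)`, `(South)`, and `(West)` labels in the rendered output further down. `ACTION_NAMES` and `describe` are hypothetical helper names introduced only for this sketch.

```python
# Taxi-v3 action codes mapped to readable names (helper names are illustrative).
ACTION_NAMES = {
    0: "south",
    1: "north",
    2: "east",
    3: "west",
    4: "pickup",
    5: "dropoff",
}


def describe(actions):
    """Return the readable name for each action code in a sequence."""
    return [ACTION_NAMES[a] for a in actions]


print(describe([1, 4, 0]))
```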

In the `answer_here` variable, store a sequence of steps that successfully picks up the passenger at `G` and drops them off at `Y`.
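
One way to check a candidate sequence is to replay it and confirm that the episode ends, with a positive reward, exactly on the final action. The sketch below assumes the older Gym API, where `env.step` returns `(observation, reward, done, info)`, and relies on Taxi-v3 giving a positive reward only for a successful drop-off. `solves_episode` is a hypothetical helper, not part of Gym or RLlib.

```python
def solves_episode(env, actions):
    """Replay `actions` after a reset; True if the last action completes the task.

    Assumes the older Gym step API: (observation, reward, done, info).
    """
    env.reset()
    for i, action in enumerate(actions):
        _, reward, done, _ = env.step(action)
        if done:
            # Success means a positive final reward on the very last action;
            # a time-limit cutoff also sets done, but without that reward.
            return reward > 0 and i == len(actions) - 1
    return False
```

It could be called with an environment created by `gym.make("Taxi-v3")` and seeded beforehand, so that the start state matches the one you planned the sequence for.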

In [5]:
import gym

In [12]:
# Import the RL algorithm (Trainer) we would like to use.
from ray.rllib.agents.ppo import PPOTrainer

# Configure the algorithm.
config = {
    # Environment (RLlib understands OpenAI Gym registered strings).
    "env": "Taxi-v3",
    "log_level": "ERROR",
    # Only for evaluation runs, render the env.
    "evaluation_config": {
        "render_env": True,
    },
}

# Create our RLlib Trainer.
trainer = PPOTrainer(config=config)

# Run it for n training iterations. A training iteration includes
# parallel sample collection by the environment workers as well as
# loss calculation on the collected batch and a model update.
for _ in range(3):
    print(trainer.train())

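# Evaluate the trained Trainer (and render each timestep to the shell's
# output). With the config above, no evaluation workers exist, so this
# call raises the ValueError shown in the output below.
trainer.evaluate()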




[2m[1m[36m(scheduler +2m46s)[0m Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.


Required resources for this actor or task: {CPU: 1.000000}
Available resources on this node: {0.000000/8.000000 CPU, 3.187335 GiB/3.187335 GiB memory, 1.593668 GiB/1.593668 GiB object_store_memory, 1.000000/1.000000 node:127.0.0.1}
 In total there are 0 pending tasks and 1 pending actors on this node.


{'episode_reward_max': -659.0, 'episode_reward_min': -875.0, 'episode_reward_mean': -802.1, 'episode_len_mean': 200.0, 'episode_media': {}, 'episodes_this_iter': 20, 'policy_reward_min': {}, 'policy_reward_max': {}, 'policy_reward_mean': {}, 'custom_metrics': {}, 'hist_stats': {'episode_reward': [-875.0, -731.0, -758.0, -848.0, -758.0, -848.0, -821.0, -740.0, -821.0, -830.0, -857.0, -776.0, -812.0, -776.0, -857.0, -875.0, -767.0, -776.0, -659.0, -857.0], 'episode_lengths': [200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200]}, 'sampler_perf': {'mean_raw_obs_processing_ms': 0.057644334094396885, 'mean_inference_ms': 0.35419343770116274, 'mean_action_processing_ms': 0.02253395149196642, 'mean_env_wait_ms': 0.02928825094365048, 'mean_env_render_ms': 0.0}, 'off_policy_estimator': {}, 'num_healthy_workers': 2, 'timesteps_total': 4000, 'timesteps_this_iter': 4000, 'agent_timesteps_total': 4000, 'timers': {'sample_time_ms': 35559.153, 'sample_th

ValueError: Cannot evaluate w/o an evaluation worker set in the Trainer or w/o an env on the local worker!
Try one of the following:
1) Set `evaluation_interval` >= 0 to force creating a separate evaluation worker set.
2) Set `create_env_on_driver=True` to force the local (non-eval) worker to have an environment to evaluate on.
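
The error message itself names two config settings that would let `trainer.evaluate()` run. A minimal sketch of an adjusted config follows, assuming the same RLlib version as above. Only `evaluation_interval` is new relative to the original config, and the value 1 is an illustrative choice rather than a recommendation. `build_trainer` is a hypothetical wrapper that defers the `ray` import until it is called.

```python
# Same config as before, plus the setting suggested by the error message.
config = {
    "env": "Taxi-v3",
    "log_level": "ERROR",
    # Option 1 from the error message: create a separate evaluation worker set.
    "evaluation_interval": 1,
    # Only for evaluation runs, render the env.
    "evaluation_config": {
        "render_env": True,
    },
    # Option 2 (alternative): "create_env_on_driver": True,
}


def build_trainer(cfg=config):
    """Construct the PPO Trainer; requires ray[rllib] to be installed."""
    from ray.rllib.agents.ppo import PPOTrainer  # imported lazily

    return PPOTrainer(config=cfg)
```

Note that with an interval of 1, RLlib would also run an evaluation pass during each `train()` call, so the rendering would appear during training as well.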

In [2]:
policy = trainer.get_policy()

In [6]:
env = gym.make("Taxi-v3")

In [7]:
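# Action sequence: north, pickup, south x2, west x4, south x2, dropoff.
steps = [1, 4, 0, 0, 3, 3, 3, 3, 0, 0, 5]

env.seed(42)  # fix the seed so the start state is reproducible
env.reset()
env.render()

# Apply each action in order and render the grid after every step.
for action in steps:
    env.step(action)
    env.render()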

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :G|
| : | : :_|
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : | : : |
| : : : :_|
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :G|
| : | : : |
| : : :_: |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : | : : |
| : :_: : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : | : : |
| :_: : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: | : :G|
| : | : : |
|_: : : : |
| | : | : |
|Y| : |B: |
+--