# Demo - Taxi

In this demo we will look at the [Gymnasium](https://gymnasium.farama.org/) library, which is a continuation of the now abandoned [Gym](https://github.com/openai/gym) library that was originally created by OpenAI.

The library contains many different reinforcement learning environments that you can use to evaluate your algorithms. They range from simple environments with only a few states to complete video games ([Atari](https://gymnasium.farama.org/environments/atari/)) and 3D simulated robots ([MuJoCo](https://gymnasium.farama.org/environments/mujoco/)). All these environments share the same basic interface, which makes it easy to compare and evaluate RL algorithms across them and has made the library a de facto standard benchmark in the RL community.

In this demo we will use the `Taxi` environment and we will first explore how it works.

Let's create the environment first by simply calling `gymnasium.make`.

In [None]:
import gymnasium

env = gymnasium.make('Taxi-v3', render_mode='rgb_array')

print(f'Observation: {env.observation_space}, {env.observation_space.dtype}')
print(f'Action:      {env.action_space}, {env.action_space.dtype}')

You can now see that the environment returns observations (a.k.a. states) from a `Discrete(500)` space. This means each observation is a single discrete value (i.e. an integer) from 0 to 499. Other environments may instead use `Box` spaces, which are multidimensional. The action space is also `Discrete`, meaning the environment accepts only a single integer value from 0 to 5 as the action.

So this is a very simple environment that gives a single number and accepts a single number.

Furthermore, the environment supports rendering the current state so you can see what is going on. In this case it is a nice comic image representation of the world.

In [None]:
from utils import create_frame, update_frame
env.reset()
frame = create_frame(env)

The world consists of a 5x5 grid with 4 pickup and dropoff positions (R, G, Y, B). There is a taxi driving around in this world. The passenger is waiting at one of the four colored squares and their destination (a hotel) is located on another colored square. There are also some walls to make the route a bit more challenging.

The goal of the agent is to pick up the passenger and drop them off at their desired destination as fast as possible. The agent has 6 actions (in this order): move south (0), north (1), east (2), west (3), pick up the passenger (4), and drop off the passenger (5).
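
Bare integers are easy to mix up, so it can help to give the actions names. The sketch below is only a convenience for your own code: `TaxiAction` is a hypothetical helper name, not something Gymnasium provides, and the numbering simply follows the list above.

```python
from enum import IntEnum


class TaxiAction(IntEnum):
    """Readable names for the six Taxi actions (hypothetical helper)."""
    SOUTH = 0
    NORTH = 1
    EAST = 2
    WEST = 3
    PICKUP = 4
    DROPOFF = 5


# Convert a name to the integer the environment expects
print(int(TaxiAction.PICKUP))

# Convert an integer back to a readable name
print(TaxiAction(2).name)
```

You could then pass, for example, `int(TaxiAction.DROPOFF)` wherever the environment expects an action.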

You can interact with the environment in the following way.

First the environment must be reset so that it is initialized and in its start state. This is done with the `env.reset()` function, which returns the current state (observation) and some extra information that we will not use in this exercise. In Python, `_` is conventionally used as a name for values you want to discard. The start state is chosen randomly, so each time you execute the next cell the output should be different.

In [None]:
obs, _ = env.reset()
obs

Then the agent can perform an action by calling the `step(a)` function, which only requires the action to take, in this case simply an integer value. It returns the next observation (state) after the action is performed, the resulting reward, a flag indicating whether the episode is finished (done; Gymnasium calls this flag `terminated`), a flag indicating whether the episode was truncated (not used now), and some extra information. We will discard the truncated flag and the extra information because they are not used in this exercise.

In [None]:
# Action 5 is "dropoff"; the truncated flag and the extra info are discarded
obs, reward, done, _, _ = env.step(5)
update_frame(frame)
reward, done

The state the environment returns is a single number, which represents the entire state of the world. There are 5 × 5 = 25 possible positions for the taxi, 5 possible locations for the passenger (the 4 colored squares, or on board the taxi), and 4 possible destinations. So there are 25 × 5 × 4 = 500 different states.
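
The counting argument above can be sketched as a mixed-radix encoding. The ordering used here (taxi row, taxi column, passenger location, destination) follows the formula given in the Gymnasium documentation for Taxi; the function names `encode` and `decode` are our own for this illustration, and passenger location 4 is assumed to mean "on board the taxi".

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into a single state number in 0 .. 499."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination


def decode(state):
    """Unpack a state number, peeling off the components in reverse order."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination


# Smallest and largest state numbers
print(encode(0, 0, 0, 0), encode(4, 4, 4, 3))

# Decoding undoes encoding
print(decode(encode(3, 1, 2, 0)))
```

The agent in this demo does not need this decoding; it only illustrates why exactly 500 integers are enough to describe every situation.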

We don't know anything else about this environment. We don't know to which state an action will bring us, what the reward will be, or when an episode is finished. We only have this interface.