# Taxi-v3

Gym provides different game environments which we can plug into our code and train an agent with. The library takes care of the API for providing all the information that our agent requires, like possible actions, score, and current state. We just need to focus on the algorithm part of our agent.

We'll be using the Gym environment called Taxi-v3, from which all of the details explained in the previous notebook were pulled. The objectives, rewards, and actions are all the same.

## 1. Install Gym

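We need to install Gym first. Executing the following should work. The outputs in this notebook were produced with gym 0.18.0; later releases changed parts of the API (for example the return values of `reset` and `step`), so install that version if the cells further down fail.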

In [1]:
pip install gym

Collecting gym
  Downloading gym-0.18.0.tar.gz (1.6 MB)
Collecting Pillow<=7.2.0
  Downloading Pillow-7.2.0-cp38-cp38-manylinux1_x86_64.whl (2.2 MB)
Collecting cloudpickle<1.7.0,>=1.2.0
  Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)
Collecting pyglet<=1.5.0,>=1.4.0
  Downloading pyglet-1.5.0-py2.py3-none-any.whl (1.0 MB)
Building wheels for collected packages: gym
  Building wheel for gym (setup.py) ... done
  Created wheel for gym: filename=gym-0.18.0-py3-none-any.whl size=1656445 sha256=2bbec43e8efc2575754d6423d67749e3b8bacc38d162e7d1c9f2aec8ebbdd614
  Stored in directory: /home/yori/.cache/pip/wheels/d8/e7/68/a3f0f1b5831c9321d7523f6fd4e0d3f83f2705a1cbd5daaa79
Successfully built gym
Installing collected packages: Pillow, cloudpickle, pygle

## 2. Gym's interface

Once installed, we can load the game environment and render what it looks like. We can also print the action and state space.

In [2]:
import gym

env = gym.make("Taxi-v3").env # load the game environment

env.render() # visualize the environment

print("Action Space: {}".format(env.action_space))
print("State Space: {}\n".format(env.observation_space))

print("Current state: %d" % env.s)

+---------+
|[35mR[0m: | : :[34;1mG[0m|
| : |[43m [0m: : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

Action Space: Discrete(6)
State Space: Discrete(500)

Current state: 144


Reminder of our problem:

- The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
- The pipe ("|") represents a wall which the taxi cannot cross.
- R, G, Y, B are the possible pickup and dropoff locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination. In the illustration, these colors are reversed.

- As verified by the prints, we have an Action Space of size 6 (south, north, east, west, pickup and dropoff) and a State Space of size 500 (25 taxi positions × 5 passenger locations, including inside the taxi, × 4 destinations).
- The current state is randomly chosen (a state between 0 and 499).
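
The size of the state space can be checked by enumerating the combinations directly. The sketch below does not use Gym; the constant names are illustrative, and it assumes the 5×5 grid, the four named locations, and "in taxi" as a fifth passenger location.

```python
from itertools import product

GRID_ROWS, GRID_COLS = 5, 5  # taxi positions on the 5x5 grid
PASSENGER_LOCATIONS = ["R", "G", "Y", "B", "in taxi"]
DESTINATIONS = ["R", "G", "Y", "B"]

# Every combination of taxi row, taxi column, passenger location and destination
all_states = list(product(range(GRID_ROWS), range(GRID_COLS),
                          PASSENGER_LOCATIONS, DESTINATIONS))

print(len(all_states))  # 5 * 5 * 5 * 4 = 500
```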

We can reset the environment to a new, random state.

In [3]:
env.reset() # reset the environment to a new, random state
env.render()
print("Current state: %d" % env.s)

+---------+
|R: | : :G|
| : | : : |
| : :[43m [0m: : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+

Current state: 251


## 3. Back to our illustration

We can actually take our illustration, encode its state, and give it to the environment to render in Gym.

<img src="./resources/taxi.png" style="height: 350px"/>

Recall that we have the taxi at row 3, column 1, our passenger is at location 2 (=Y), and our destination is location 0 (=R). Using the Taxi-v3 state encoding method, we can do the following:

In [4]:
state = env.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger location, destination location)
print("State:", state)

env.s = state
env.render()

State: 328
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+
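
The state number is simply the four components packed into one integer. The following is a minimal re-implementation for illustration, assuming the layout ((row × 5 + column) × 5 + passenger) × 4 + destination; `encode_state` and `decode_state` are hypothetical helper names, not part of Gym's API.

```python
def encode_state(taxi_row, taxi_col, passenger, destination):
    """Pack the four components into a single state index (0..499)."""
    index = taxi_row
    index = index * 5 + taxi_col      # 5 columns
    index = index * 5 + passenger     # 5 passenger locations (4 = in taxi)
    index = index * 4 + destination   # 4 destinations
    return index


def decode_state(index):
    """Inverse of encode_state: unpack a state index into its components."""
    index, destination = divmod(index, 4)
    index, passenger = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger, destination


print(encode_state(3, 1, 2, 0))  # 328, the illustration's state
print(decode_state(328))         # (3, 1, 2, 0)
```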



## 4. Initial State - Exercise

Generate the state where the taxi is in the lower right corner. The passenger is in the taxi and the destination is B. What is the color of the taxi right now?

In [5]:
state = env.encode(4, 4, 4, 3) # (taxi row, taxi column, passenger location, destination location)
print("State:", state)

env.s = state
env.render()

State: 499
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m:[42m_[0m|
+---------+



## 5. Step-method

The agent performs an action by using the step-method:

```python
state, reward, done, info = env.step(action)
```

Remember: There are 6 actions (0=south, 1=north, 2=east, 3=west, 4=pickup and 5=dropoff).

The step-method returns:

- state: the new state
- reward: the reward received for the action just taken, which tells us how beneficial it was
- done: `True` once the passenger has been dropped off at the destination, which ends the episode
- info: additional info such as performance and latency for debugging purposes
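
To keep the action numbers readable while experimenting, it can help to give them names. This small helper is only a sketch: `ACTION_NAMES` and `describe_step` are made-up names, and the function merely formats the kind of values that `env.step` returns.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def describe_step(action, state, reward, done):
    """Format one transition as a human-readable line."""
    status = "episode finished" if done else "episode continues"
    return f"{ACTION_NAMES[action]}: new state {state}, reward {reward}, {status}"


# Example with hand-written values (no environment needed)
print(describe_step(3, 328, -1, False))
```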

Let's go back to the illustration's state and try the different actions.

<img src="./resources/taxi.png" style="height: 250px"/>

In [6]:
state = env.encode(3, 1, 2, 0) # (taxi row, taxi column, passenger index, destination index)
print("Initial state:", state)
env.s = state
env.render()

state, reward, done, info = env.step(3) # go west

print("New state: %d, reward: %d" % (state, reward)) # a -1 penalty for every wall hit and the taxi won't move anywhere

Initial state: 328
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+

New state: 328, reward: -1


In [7]:
state, reward, done, info = env.step(1) # go north

print("New state: %d, reward: %d" % (state, reward)) # a slight negative reward for not making it to the destination

env.render() # taxi moves up

New state: 228, reward: -1
+---------+
|[35mR[0m: | : :G|
| : | : : |
| :[43m [0m: : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (North)


## 5. Step-method - Exercise

Generate the state again where the taxi is in the lower right corner. The passenger is in the taxi and the destination is B.

In [None]:
state = env.encode(4, 4, 4, 3) # (taxi row, taxi column, passenger index, destination index)
print("Initial state:", state)
env.s = state
env.render()


Now carry out the two steps to drive the taxi to the destination location and drop off the passenger. At each step, render the environment and print the new state number, the reward and done.

## 6. The Reward Table

When the Taxi environment is created, there is an initial Reward table that's also created, called *P*. We can think of it like a matrix that has the number of states as rows and the number of actions as columns, i.e. a states × actions matrix.

Since every state is in this matrix, we can see the default reward values assigned for state 479.

In [8]:
env.s = 479
env.render()

env.P[479]

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35m[42mB[0m[0m: |
+---------+
  (North)


{0: [(1.0, 479, -1, False)],
 1: [(1.0, 379, -1, False)],
 2: [(1.0, 499, -1, False)],
 3: [(1.0, 479, -1, False)],
 4: [(1.0, 479, -10, False)],
 5: [(1.0, 475, 20, True)]}

This dictionary has the structure `{action: [(probability, nextstate, reward, done)]}`.

A few things to note:

- The keys 0-5 correspond to the actions (south, north, east, west, pickup, dropoff) the taxi can perform in the current state.
- In this environment, the transition probability is always 1.0: every action has exactly one possible outcome.
- The nextstate is the state we would be in if we take the action at this index of the dictionary.
- In this particular state, all the movement actions have a -1 reward, the pickup action has a -10 reward, and the dropoff action has a +20 reward.
- done tells us when we have successfully dropped off a passenger at the right location (action 5 in this state). Each successful dropoff is the end of an episode.
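
Reading the table programmatically could look like this. The sketch copies the entry for state 479 printed above into a plain dictionary, so it runs without Gym, and picks the action with the highest immediate reward; `expected_reward` and `best_immediate_action` are hypothetical helpers.

```python
# Entry for state 479, copied from the output above:
# {action: [(probability, nextstate, reward, done)]}
P_479 = {
    0: [(1.0, 479, -1, False)],
    1: [(1.0, 379, -1, False)],
    2: [(1.0, 499, -1, False)],
    3: [(1.0, 479, -1, False)],
    4: [(1.0, 479, -10, False)],
    5: [(1.0, 475, 20, True)],
}


def expected_reward(transitions):
    """Probability-weighted reward over all possible outcomes of one action."""
    return sum(prob * reward for prob, _next_state, reward, _done in transitions)


def best_immediate_action(table_entry):
    """Action whose expected immediate reward is highest."""
    return max(table_entry, key=lambda action: expected_reward(table_entry[action]))


print(best_immediate_action(P_479))  # 5 (dropoff), the only action with a positive reward
```

Looking only at the immediate reward works in this state because the episode ends with the dropoff. In most states every move costs -1, so a longer-term estimate is needed to tell the moves apart.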

## 7. Solving the environment without Reinforcement Learning

Let's see what will happen if we try to brute-force our way to solving the problem without RL.

We'll create an infinite loop which runs until one passenger reaches one destination (one episode), or in other words, when the received reward is 20. The `env.action_space.sample()` method automatically selects one random action from the set of all possible actions.

Let's see what happens.

In [9]:
env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 4199
Penalties incurred: 1344


Let us now show the different steps our taxi carried out, picking up the passenger and delivering him to the destination.

In [10]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.02)
        
print_frames(frames)

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 4199
State: 0
Action: 5
Reward: 20


Not good. Our agent takes thousands of timesteps (4199 in this run) and incurs over a thousand penalties for illegal pickups and dropoffs, just to deliver one passenger to the right destination.

This is because we aren't learning from past experience. We can run this over and over, and it will never improve. The agent has no memory of which action was best in each state; giving it that memory is exactly what Reinforcement Learning will do for us.

## 8. Q-learning

We are going to use a simple RL algorithm called Q-learning which will give our agent some memory. Essentially, Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.

To do so, the Q-learning algorithm uses a Q-table. The Q-table is a matrix where we have a row for every state (500) and a column for every action (6). The matrix is first initialized to 0, and then values are updated during training.

<img src="./resources/qtable.png" style="height: 650px"/>

The optimal action at every state is the action with the highest Q-value. So for state 328 the highest value is -1.971 (=North). For state 499 the highest value is 29 (=West). These actions indeed seem to be the best options.

In [11]:
env.s = 328
env.render()
print("North is best?\n")

env.s = 499
env.render()
print("West is best?")

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+
  (Dropoff)
North is best?

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m:[42m_[0m|
+---------+
  (Dropoff)
West is best?


## 9. Training the Agent

First, we'll initialize the Q-table to a 500×6 matrix of zeros.

In [12]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])

We can now create the training algorithm that will update this Q-table as the agent explores the environment over 100 000 episodes (of course you don't need to write this algorithm yourself).

In [13]:
%%time
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

Episode: 100000
Training finished.

CPU times: user 36.2 s, sys: 7.41 s, total: 43.6 s
Wall time: 35.8 s
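
The heart of the training loop is the single line that computes `new_value`. As a worked example, the sketch below applies the same formula to made-up numbers, using the hyperparameters from the cell above (alpha = 0.1, gamma = 0.6); `q_update` is an illustrative helper, not a library function.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """One Q-learning update: blend the old estimate with the new target."""
    target = reward + gamma * next_max
    return (1 - alpha) * old_value + alpha * target


# An ordinary move (reward -1) in a fresh table where all values are still 0
print(q_update(old_value=0.0, reward=-1, next_max=0.0))  # -0.1

# A successful dropoff (reward +20) in a fresh table
print(q_update(old_value=0.0, reward=20, next_max=0.0))  # 2.0
```

With alpha = 0.1, each update moves the estimate only 10% of the way towards the target, which is one reason many episodes are needed. In the long run the dropoff value approaches 20, and a move that leads directly into that state settles at -1 + 0.6 × 20 = 11.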


Now that the Q-table has been established over 100 000 episodes, let's see what the Q-values are for our two states. Are North and West indeed the two most preferable moves for the two states?

In [14]:
print(q_table[328])
print(q_table[499])

[ -2.4043024   -2.27325184  -2.41131299  -2.36087579 -10.60118881
 -11.06768132]
[ 1.85606096  1.36966912  3.77152136 11.         -2.59489135 -2.72256117]
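
To answer the question without comparing the numbers by eye, we can look up the index of the largest value in each row. The sketch below copies the two printed rows into plain lists so that it runs on its own; `ACTION_NAMES` and `greedy_action` are illustrative names.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Q-values copied from the output above
q_row_328 = [-2.4043024, -2.27325184, -2.41131299, -2.36087579,
             -10.60118881, -11.06768132]
q_row_499 = [1.85606096, 1.36966912, 3.77152136, 11.0,
             -2.59489135, -2.72256117]


def greedy_action(q_row):
    """Index of the highest Q-value in one row of the Q-table."""
    return max(range(len(q_row)), key=lambda action: q_row[action])


print(ACTION_NAMES[greedy_action(q_row_328)])  # north
print(ACTION_NAMES[greedy_action(q_row_499)])  # west
```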


## 10. Evaluating the agent

Let's evaluate the performance of our agent. We don't need to explore actions any further, so now the next action is always selected using the best Q-value.

In [15]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state]) # take the action with the highest q-value
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes:
Average timesteps per episode: 12.96
Average penalties per episode: 0.0


We can see from the evaluation that the agent's performance improved significantly: it needs about 13 timesteps per episode on average, compared with thousands for the random agent, and it incurred no penalties, which means it performed the correct pickup/dropoff actions with 100 different passengers.
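
The only difference between the policy used during training and the one used for evaluation is the exploration step. Here is a minimal sketch of both, assuming a Q-table row is a plain list of six numbers; `choose_action` is a hypothetical helper, not part of Gym or NumPy, and the Q-values are made up.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                   # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit


q_row = [1.9, 1.4, 3.8, 11.0, -2.6, -2.7]  # made-up values; index 3 is best

# Evaluation: epsilon = 0 never explores, so the choice is always greedy.
print(choose_action(q_row, epsilon=0.0))  # 3

# Training: epsilon = 0.1 replaces the greedy choice with a random one
# in roughly 10% of the calls.
rng = random.Random(0)
picks = [choose_action(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(3) / len(picks))  # mostly, but not always, the greedy action
```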

## 11. Solving the environment with Reinforcement Learning - Exercise

Now that our agent is trained, use the q-table to solve the taxi problem with 328 as the initial state (by analogy with 7. Solving the environment without RL). Put every frame in a dictionary for later animation. Print the timesteps taken and penalties incurred. Pretty impressive! No?

In [39]:
state = 328  # illustration's state
env.s = state  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = np.argmax(q_table[state]) # take the action with the highest Q-value
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))


Show the different steps our taxi carried out, picking up the passenger and delivering him to the destination. Wait 0.5 seconds between two frames.

In [40]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.5) # wait 0.5 seconds between two frames
        
print_frames(frames)

## 12. Solving the environment with RL - Exercise

Solve the problem for these two initial states.

<img src="./resources/taxi1.png" style="height: 300px"/>
<img src="./resources/taxi2.png" style="height: 300px"/>

In [None]:
# solve the problem with the first/second initial state



In [None]:
# print the solution

