# The Taxi Problem

This notebook is largely based on the one [from here](https://casey-barr.github.io/open-ai-taxi-problem/), so refer to that link for more information. As usual, though, I will give you as much detail as possible here.


### Problem Description:
The Taxi Problem
    from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"
    by Tom Dietterich
    
**Description:**
  
There are four designated locations in the grid world, indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location (in the renders, the passenger location is highlighted in **bold**). The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations, **coloured**), and then drops off the passenger. Once the passenger is dropped off, the episode ends.


**Observations:** 

There are 500 discrete states, since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations (25 × 5 × 4 = 500).
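
As a rough sketch of how these 500 states can be indexed, the snippet below packs the four components into a single integer with mixed-radix arithmetic and unpacks them again. The helper names `encode_state` and `decode_state` are hypothetical, and the ordering (taxi row, taxi column, passenger location, destination) is an assumption taken from the state tuple described in this notebook; the environment's own internals may differ in detail.

```python
def encode_state(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four state components into one integer in [0, 500)."""
    index = taxi_row                        # 5 possible rows
    index = index * 5 + taxi_col            # 5 possible columns
    index = index * 5 + passenger_location  # 5 passenger locations (4 = in taxi)
    index = index * 4 + destination         # 4 destinations
    return index


def decode_state(index):
    """Invert encode_state: recover the four components from the integer."""
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


example = encode_state(3, 1, 2, 0)  # taxi at row 3, col 1; passenger at Y; destination R
print(example)                # 328
print(decode_state(example))  # (3, 1, 2, 0)
```

Under this assumed ordering, the largest index is `encode_state(4, 4, 4, 3) = 499`, which agrees with the count of 500 states.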
    
**Passenger locations:**
    - 0: R(ed)
    - 1: G(reen)
    - 2: Y(ellow)
    - 3: B(lue)
    - 4: in taxi
    
**Destinations:**
    - 0: R(ed)
    - 1: G(reen)
    - 2: Y(ellow)
    - 3: B(lue)
    
**Actions** (_There are 6 discrete deterministic actions_):
    - 0: move south
    - 1: move north
    - 2: move east
    - 3: move west
    - 4: pickup passenger
    - 5: drop off passenger
    
**Rewards:**
There is a default per-step reward of -1. The exceptions are delivering the passenger, which gives +20, and executing the "pickup" or "drop-off" action illegally, which gives -10 (counted as a "penalty" in the code below).
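
To make the reward scheme concrete, here is a small sketch that adds up the total (undiscounted) reward of one finished episode. The function name `episode_return` and its arguments are illustrative only, and it assumes the episode ends with a successful drop-off.

```python
def episode_return(total_steps, illegal_actions):
    """Total undiscounted reward of an episode that ends with a successful drop-off.

    total_steps counts every action taken, including illegal pickup/drop-off
    attempts and the final, successful drop-off.
    """
    if total_steps < 1 or not 0 <= illegal_actions <= total_steps - 1:
        raise ValueError("inconsistent step counts")
    ordinary_steps = total_steps - 1 - illegal_actions  # steps rewarded with -1
    return 20 - ordinary_steps - 10 * illegal_actions


print(episode_return(12, 0))  # 9
print(episode_return(12, 2))  # -9
```

The second call shows how quickly illegal actions dominate the total: two penalties turn a positive return into a negative one.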
    
**Rendering:**
    - blue: passenger
    - magenta: destination
    - yellow: empty taxi
    - green: full taxi
    - other letters (R, G, Y and B): locations for passengers and destinations

**State space:**
    Each state is represented by the tuple (taxi_row, taxi_col, passenger_location, destination).

In [None]:
import random
from time import sleep

import numpy as np
from IPython.display import clear_output


import gym

In [None]:
env = gym.make("Taxi-v3").env

In [None]:
# reset environment to a new, random state
env.reset() 

# Renders the environment at the current status:
# * Taxi location (yellow square), 
# * Passenger pick-up location as a bold and coloured letter, and 
# * Passenger drop-off location, as a coloured letter (not bold)
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

### Solving the environment without reinforcement learning

At every state, we simply sample a random action from the action space... (not a great strategy, is it?)

In [None]:
env.s = 328  # set a fixed start state: taxi at row 3, col 1; passenger at Y; destination R

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

In [None]:
# A utility function to print the animations of all the actions taken by the taxi agent
def print_frames(frames, sleep_time=0.1):
    for i, frame in enumerate(frames):
        sleep(sleep_time)
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1} / {len(frames)}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")

In [None]:
ANIMATION_LENGTH = 30 # seconds
        
print_frames(frames, sleep_time=ANIMATION_LENGTH/len(frames))

### Using Q-learning to train via reinforcement

We will learn this algorithm later on in this course, but for now just bear in mind that it is based on a Markov Decision Process (and also draws on ideas from Dynamic Programming and Monte Carlo methods).

You can see this Q-learning implementation as the last stage of the course. After we have seen all of the above in detail, you can come back to this Notebook and you will better understand the code below.
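
For orientation, the core of tabular Q-learning can be sketched as two small functions: an epsilon-greedy action choice, and the update $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl(r + \gamma \max_{a'} Q(s',a')\bigr)$. The names `choose_action` and `q_update` are hypothetical helpers, and the numbers in the example call are purely illustrative; treat this as a sketch under those assumptions rather than the exact training code.

```python
import random

import numpy as np


def choose_action(q_table, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(q_table.shape[1])  # random action (explore)
    return int(np.argmax(q_table[state]))          # best known action (exploit)


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q-value."""
    old_value = q_table[state, action]
    next_max = np.max(q_table[next_state])  # value of the best next action
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
    return q_table[state, action]


# Worked example on an all-zero table (500 states, 6 actions):
q_table = np.zeros((500, 6))
new_value = q_update(q_table, state=328, action=1, reward=-1,
                     next_state=228, alpha=0.1, gamma=0.6)
print(new_value)  # -0.1
```

With an all-zero table the update reduces to `alpha * reward`, so a single ordinary step nudges the Q-value from 0 to -0.1.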

In [None]:
%%time
"""Training the agent"""

# Hyperparameters of the Q-learning algorithm
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

q_table = np.zeros([env.observation_space.n, env.action_space.n])

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

In [None]:
"""Evaluate agent's performance after Q-learning"""
# Doing 100 runs of the Taxi problem, and averaging the results...

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

### Testing our MDP-based Reinforcement Learning results!

Every time you run the cell below, a new random starting state is created and then solved using the result of the Q-learning algorithm above. Now the actions our agent takes are quite well informed! (**the policy is optimised!**)
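
As a small illustration of what following the learned policy means, the sketch below extracts the greedy action for every state of a Q-table. The tiny table is made up so that the snippet runs on its own, and `greedy_policy` and `ACTION_NAMES` are hypothetical names; with a trained table, the same `np.argmax(..., axis=1)` call would give one action per state.

```python
import numpy as np

# Action indices as listed in the problem description
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def greedy_policy(q_table):
    """Return, for each state (row), the index of the highest-valued action."""
    return np.argmax(q_table, axis=1)


# A made-up Q-table with 3 states and 6 actions, for illustration only
toy_q_table = np.array([
    [-2.0, -1.0, -3.0, -2.5, -10.0, -10.0],
    [-1.5, -2.0, -1.0, -2.0, -10.0, -10.0],
    [-3.0, -3.0, -3.0, -3.0, -10.0, 20.0],
])

policy = greedy_policy(toy_q_table)
print([ACTION_NAMES[a] for a in policy])  # ['north', 'east', 'dropoff']
```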

In [None]:
state = env.reset()

frames = []
done = False

while not done:
    action = np.argmax(q_table[state])
    state, reward, done, info = env.step(action)

    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )
    
ANIMATION_LENGTH = 10 # seconds

print_frames(frames, sleep_time=ANIMATION_LENGTH/len(frames))