# **Sprint Challenge - Reinforcement Learning Taxi**

# Setup

First, let's make sure this notebook works in both Python 2 and Python 3, and import a few common modules:

In [28]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import sys
import random



Note: there may be minor differences between the output of this notebook and the examples shown in the book. You can safely ignore these differences. They are mainly due to the fact that most of the environments provided by OpenAI gym have some randomness.

# OpenAI gym Taxi

In this notebook we will be using [OpenAI gym](https://gym.openai.com/), a great toolkit for developing and comparing Reinforcement Learning algorithms. It provides many environments for your learning *agents* to interact with. Let's start by importing `gym`:

In [3]:
import gym

Next we will load the Taxi environment, version 2.

In [4]:
env = gym.make('Taxi-v2')

Let's initialize the environment by calling its `reset()` method. This returns the initial state:

In [5]:
state = env.reset()

In [6]:
env.render()

+---------+
|[34;1mR[0m: | : :G|
| : : : : |
| :[43m [0m: : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
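In the rendered grid, the leftover color codes mark the passenger (`[34;1m`, blue), the destination (`[35m`, magenta) and the taxi (`[43m`, yellow when empty; `[42m`, green when carrying the passenger). The integer state packs the same information: taxi row, taxi column, passenger location and destination. The sketch below reproduces that packing in plain Python, assuming the standard 5x5 Taxi layout with 5 passenger locations (R, G, Y, B, or in the taxi) and 4 destinations, which gives 25 * 5 * 4 = 500 states. The helper names `encode_taxi_state` and `decode_taxi_state` are illustrative and not part of the notebook.

```python
def encode_taxi_state(taxi_row, taxi_col, passenger, destination):
    """Pack the four components into one integer in the range [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger) * 4 + destination


def decode_taxi_state(state):
    """Invert encode_taxi_state by peeling off one component at a time."""
    state, destination = divmod(state, 4)   # 4 possible destinations
    state, passenger = divmod(state, 5)     # 4 stops + "in the taxi"
    taxi_row, taxi_col = divmod(state, 5)   # 5 columns per row
    return taxi_row, taxi_col, passenger, destination


example_state = encode_taxi_state(3, 1, 2, 0)
print(example_state)                     # 328
print(decode_taxi_state(example_state))  # (3, 1, 2, 0)
```

Decoding a state this way is handy when inspecting individual rows of a Q-table, because it tells you which situation a row index corresponds to.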



In [21]:
total_reward = 0 
done = False
while not done:
    state, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
    env.render()
    
print('Total reward for single round:', total_reward)

+---------+
|R: | : :[35m[42mG[0m[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | :[42m_[0m:[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: |[42m_[0m: :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: |[42m_[0m: :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
+---------+
|R: |[42m_[0m: :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (West)
+---------+
|R: |[42m_[0m: :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | :[42m_[0m:[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | :[42m_[0m:[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
+---------+
|R: | : :[35mG[0m|
| : : :[42m_[0m: |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (

Next, play multiple rounds (episodes) with random actions and report the average score per episode.

In [22]:
episodes = 1000
rewards = []
max_steps = 99

for episode in range(episodes):
    state = env.reset()
    total_rewards = 0
    
    for step in range(max_steps):
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        total_rewards += reward
        if done:
            break
    rewards.append(total_rewards)
    
print('Average score over time: ', sum(rewards) / episodes)

Average score over time:  -386.276
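A rough back-of-the-envelope estimate shows why the random agent scores so poorly. The sketch below assumes Taxi's usual reward scheme (-1 per step, -10 for an illegal pickup or dropoff, +20 for a successful delivery) and makes the simplifying assumption that a randomly chosen pickup or dropoff is almost always illegal and that episodes almost never finish early.

```python
# Assumed reward scheme for Taxi; the constants below are not defined in the notebook.
STEP_REWARD = -1.0       # any of the 4 movement actions
ILLEGAL_REWARD = -10.0   # pickup or dropoff attempted in the wrong situation
N_ACTIONS = 6            # 4 moves + pickup + dropoff
MAX_STEPS = 99           # same step cap as the loop above

# With uniformly random actions, 4 of 6 choices cost -1 and 2 of 6 cost -10.
expected_per_step = (4 * STEP_REWARD + 2 * ILLEGAL_REWARD) / N_ACTIONS
expected_per_episode = expected_per_step * MAX_STEPS

print(expected_per_step)     # -4.0
print(expected_per_episode)  # -396.0
```

The measured average of -386.276 is slightly better than this estimate, which is consistent with a few pickups and dropoffs being legal and the occasional episode ending early.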


## Goal 1 - Beat Random

To beat the random agent, we will use a Q-table with one row per state and one column per action.
First we initialize it to zeros.

In [25]:
action_size = env.action_space.n
print('Action size', action_size)

state_size = env.observation_space.n
print('State size', state_size)

qtable = np.zeros((state_size, action_size))
print(qtable)

Action size 6
State size 500
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 ...
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
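One consequence of starting from an all-zero table is worth knowing: `np.argmax` returns the first index when several entries tie, so a purely greedy lookup on an untrained row always picks action 0. The sketch below illustrates this, along with one possible way to break ties at random. The variable names are illustrative, and the notebook itself does not use random tie-breaking.

```python
import numpy as np

q_row = np.zeros(6)  # one untrained row of the Q-table (6 actions)

# np.argmax returns the first maximal index, so ties resolve to action 0.
greedy_action = int(np.argmax(q_row))
print(greedy_action)  # 0

# Optional alternative: choose uniformly among all maximal entries.
rng = np.random.RandomState(0)
best_actions = np.flatnonzero(q_row == q_row.max())
random_greedy_action = int(rng.choice(best_actions))
print(random_greedy_action)
```

This is harmless during training as long as epsilon starts high, because early actions are then mostly random anyway.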


Set the hyperparameters:

In [26]:
total_episodes = 50000
total_test_episodes = 100
max_steps = 99

learning_rate = 0.7
gamma = 0.618

epsilon = 1.0
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.01
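To get a feel for these values, it helps to look at how quickly the exploration rate falls under the exponential schedule `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)` used during training. The helper `epsilon_at` below is an illustrative name; its defaults simply mirror the hyperparameters above.

```python
import math


def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.01):
    """Exploration rate after the given number of completed episodes."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for episode in (0, 100, 500, 1000):
    print("episode %5d -> epsilon %.4f" % (episode, epsilon_at(episode)))
```

With `decay_rate = 0.01`, epsilon is already close to its floor after roughly a thousand episodes, so the vast majority of the 50000 training episodes are almost entirely greedy.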


Train the agent with the Q-learning algorithm:

In [29]:
# Train for a fixed number of episodes
for episode in range(total_episodes):
    # Reset the environment at the start of each episode
    state = env.reset()
    done = False

    for step in range(max_steps):
        # Choose an action in the current state (s) with an epsilon-greedy policy:
        # first draw a uniform random number in [0, 1]
        exp_exp_tradeoff = random.uniform(0, 1)

        # If the number is greater than epsilon --> exploitation (take the largest Q-value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state, :])

        # Otherwise make a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the new state (s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
        qtable[state, action] = qtable[state, action] + learning_rate * (
            reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])

        # The new state becomes the current state
        state = new_state

        # If the episode is done, stop stepping
        if done:
            break

    # Reduce epsilon (because we need less and less exploration);
    # episode + 1 is the number of episodes completed so far
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * (episode + 1))
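To make the update rule concrete, here is the same Q-learning step isolated in a small function and applied twice to the same transition. This is a minimal sketch: `q_update` and `demo_table` are illustrative names, and the numbers use the learning rate and discount factor set earlier.

```python
import numpy as np


def q_update(qtable, state, action, reward, new_state, learning_rate, gamma):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    td_target = reward + gamma * np.max(qtable[new_state, :])
    td_error = td_target - qtable[state, action]
    qtable[state, action] += learning_rate * td_error
    return qtable[state, action]


demo_table = np.zeros((500, 6))

# First visit: 0 + 0.7 * (-1 + 0.618 * 0 - 0) = -0.7
first = q_update(demo_table, state=0, action=1, reward=-1, new_state=1,
                 learning_rate=0.7, gamma=0.618)

# Second visit: -0.7 + 0.7 * (-1 + 0.618 * 0 - (-0.7)) = -0.91
second = q_update(demo_table, state=0, action=1, reward=-1, new_state=1,
                  learning_rate=0.7, gamma=0.618)

print(first, second)
```

Each update moves the estimate a fraction `learning_rate` of the way toward the target, so repeated visits converge on the target geometrically.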


Let's play Taxi with our Q-table!

In [33]:
rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    done = False
    total_rewards = 0

    for step in range(max_steps):
        env.render()
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state, :])

        new_state, reward, done, info = env.step(action)

        total_rewards += reward

        if done:
            break
        state = new_state

    # Record the score even if the episode reached max_steps without finishing,
    # so the average below really covers all test episodes
    rewards.append(total_rewards)

env.close()
print("Average Score using Q-table over time: " + str(sum(rewards) / total_test_episodes))

+---------+
|R: | :[43m [0m:G|
| : : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+

+---------+
|R: |[43m [0m: :G|
| : : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (West)
+---------+
|R: | : :G|
| : :[43m [0m: : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (South)
+---------+
|R: | : :G|
| :[43m [0m: : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| :[43m [0m: : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
|[43m [0m: : : : |
| | : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (West)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
|[43m [0m| : | : |
|[34;1mY[0m| : |[35mB[0m: |
+---------+
  (South)
+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[34;1m[43mY[0m[0m| : |[35mB[0m: |
+---------+
  (South)
+---------+
|R: | : :G|
| : 

+---------+
|[35mR[0m: |[43m [0m: :G|
| : : : : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+

+---------+
|[35mR[0m: | : :G|
| : :[43m [0m: : |
| : : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (South)
+---------+
|[35mR[0m: | : :G|
| : : : : |
| : :[43m [0m: : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (South)
+---------+
|[35mR[0m: | : :G|
| : : : : |
| :[43m [0m: : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (West)
+---------+
|[35mR[0m: | : :G|
| : : : : |
|[43m [0m: : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (West)
+---------+
|[35mR[0m: | : :G|
| : : : : |
| : : : : |
|[43m [0m| : | : |
|[34;1mY[0m| : |B: |
+---------+
  (South)
+---------+
|[35mR[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[34;1m[43mY[0m[0m| : |B: |
+---------+
  (South)
+---------+
|[35mR[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[42mY[0m| : |B: |
+---------+
  (Pickup)
+---------+
|[35mR[0m: | : :G|
| : 
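The evaluation loop above can also be wrapped in a function that skips rendering and records a score for every episode, including any that reach `max_steps` without finishing. The sketch below assumes the older Gym interface used in this notebook, where `reset()` returns the state and `step()` returns a 4-tuple. `evaluate_greedy` and the tiny `TwoStateEnv` stand-in are hypothetical and exist only so the example runs without Gym.

```python
import numpy as np


def evaluate_greedy(env, qtable, n_episodes, max_steps):
    """Average total reward of the greedy policy over n_episodes."""
    scores = []
    for _ in range(n_episodes):
        state = env.reset()
        total = 0
        for _ in range(max_steps):
            action = int(np.argmax(qtable[state, :]))
            state, reward, done, _info = env.step(action)
            total += reward
            if done:
                break
        scores.append(total)  # unfinished episodes are counted too
    return sum(scores) / float(len(scores))


class TwoStateEnv(object):
    """Hypothetical stand-in: action 1 ends the episode with +20, action 0 costs -1."""

    def reset(self):
        return 0

    def step(self, action):
        if action == 1:
            return 1, 20, True, {}
        return 0, -1, False, {}


trained_q = np.array([[0.0, 1.0], [0.0, 0.0]])  # prefers action 1 in state 0
untrained_q = np.zeros((2, 2))                  # ties resolve to action 0

print(evaluate_greedy(TwoStateEnv(), trained_q, n_episodes=3, max_steps=5))    # 20.0
print(evaluate_greedy(TwoStateEnv(), untrained_q, n_episodes=3, max_steps=5))  # -5.0
```

With the real environment, the call would look like `evaluate_greedy(env, qtable, total_test_episodes, max_steps)`.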

## Goal 2 - Beat Basic Q-Learning

Let's try to tune the hyperparameters.
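One way to organize this kind of tuning is a small grid search that tries every combination of candidate values and keeps the best one. The sketch below is generic: `grid_search` is an illustrative helper, and `toy_score` is a made-up stand-in for "train a fresh Q-table with these settings and return its average test score". Its values are not measured results.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination in grid and return the best."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(learning_rate, gamma):
    # Placeholder objective for illustration only; replace it with a
    # function that trains and evaluates the agent.
    return -((learning_rate - 0.7) ** 2 + (gamma - 0.5) ** 2)


grid = {"learning_rate": [0.5, 0.7, 0.8], "gamma": [0.4, 0.5, 0.618]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params)
```

Because Q-learning results vary from run to run, averaging each setting over several training runs gives a fairer comparison than a single score per setting.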


In [46]:
total_episodes = 50000
total_test_episodes = 100
max_steps = 99

# Hyperparameter trials (epsilon = 1.0 and max_epsilon = 1.0 in every trial):
#
# learning_rate  gamma  min_epsilon  decay_rate  average score
# 0.7            0.618  0.01         0.01        8.4
# 0.5            0.5    0.01         0.01        6
# 0.8            0.8    0.01         0.01        8.1
# 0.7            0.5    0.01         0.02        8.65
# 0.7            0.5    0.01         0.03        8.5
# 0.7            0.5    0.02         0.02        8.71  (highest; used below)
# 0.7            0.5    0.03         0.02        8.66
# 0.75           0.5    0.02         0.02        8.26
# 0.7            0.4    0.02         0.02        7.1

# These settings returned the highest average score, 8.71
learning_rate = 0.7
gamma = 0.5

epsilon = 1.0
max_epsilon = 1.0
min_epsilon = 0.02
decay_rate = 0.02

# Start from a fresh Q-table so this run is not influenced by earlier training
qtable = np.zeros((state_size, action_size))

# Train for a fixed number of episodes
for episode in range(total_episodes):
    # Reset the environment at the start of each episode
    state = env.reset()
    done = False

    for step in range(max_steps):
        # Choose an action in the current state (s) with an epsilon-greedy policy:
        # first draw a uniform random number in [0, 1]
        exp_exp_tradeoff = random.uniform(0, 1)

        # If the number is greater than epsilon --> exploitation (take the largest Q-value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state, :])

        # Otherwise make a random choice --> exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the new state (s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
        qtable[state, action] = qtable[state, action] + learning_rate * (
            reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])

        # The new state becomes the current state
        state = new_state

        # If the episode is done, stop stepping
        if done:
            break

    # Reduce epsilon (because we need less and less exploration);
    # episode + 1 is the number of episodes completed so far
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * (episode + 1))

    
rewards = []

for episode in range(total_test_episodes):
    state = env.reset()
    done = False
    total_rewards = 0

    for step in range(max_steps):
        env.render()
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state, :])

        new_state, reward, done, info = env.step(action)

        total_rewards += reward

        if done:
            break
        state = new_state

    # Record the score even if the episode reached max_steps without finishing,
    # so the average below really covers all test episodes
    rewards.append(total_rewards)

env.close()
print("Average Score using Q-table over time: " + str(sum(rewards) / total_test_episodes))


+---------+
|[34;1mR[0m: | : :[35mG[0m|
|[43m [0m: : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

+---------+
|[34;1m[43mR[0m[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|[42mR[0m: | : :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Pickup)
+---------+
|R: | : :[35mG[0m|
|[42m_[0m: : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (South)
+---------+
|R: | : :[35mG[0m|
| :[42m_[0m: : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :[35mG[0m|
| : :[42m_[0m: : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: |[42m_[0m: :[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)
+---------+
|R: | :[42m_[0m:[35mG[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (East)
+---------+
|R: | : :[35m[42mG[0m[0m|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+--