# Reinforcement Learning Sprint Challenge - play Taxi

For the sprint challenge, we will apply the techniques we have learned to play
[Taxi](https://gym.openai.com/envs/Taxi-v2/), an environment in the OpenAI Gym.
In this task the agent controls a taxi that can navigate between four locations.
The goal is to pick up a passenger from one location and drop them off to
another. You get 20 points for each successful drop off, but lose 1 point for
each step you take, and additionally there is a 10 point penalty for illegal
pick-up/drop-off actions.

You can create the environment and watch a random agent play with this code:

```python
import gym

env = gym.make('Taxi-v2')
state = env.reset()
env.render()

total_reward = 0
done = False
while not done:
    state, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
    env.render()

print('Total reward:', total_reward)
```

You'll see that a random agent doesn't do very well - in a trial run the score
reached -713 before the environment terminated.


In [37]:
import gym

env = gym.make('Taxi-v2')
state = env.reset()
env.render()
print('^^ Initial State ^^')

counterrr = 0
total_reward = 0
done = False
while not done:
    print('\n\n\\/\\/\\/\\/\\/\\/\n')
    
    state, reward, done, info = env.step(env.action_space.sample())
    
    total_reward += reward
    counterrr += 1
    
    env.render()
    print(state, reward, done, info)
    
    print('\n<<^^^^^^^^^^>>\n\n')

print('Total reward: ', total_reward)
print('Steps: ', counterrr)



+---------+
|[34;1mR[0m:[43m [0m| : :G|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+

^^ Initial State ^^


\/\/\/\/\/\/

+---------+
|[34;1mR[0m:[43m [0m| : :G|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (East)
22 -1 False {'prob': 1.0}

<<^^^^^^^^^^>>




\/\/\/\/\/\/

+---------+
|[34;1m[43mR[0m[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (West)
2 -1 False {'prob': 1.0}

<<^^^^^^^^^^>>




\/\/\/\/\/\/

+---------+
|[34;1m[43mR[0m[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (North)
2 -1 False {'prob': 1.0}

<<^^^^^^^^^^>>




\/\/\/\/\/\/

+---------+
|[34;1m[43mR[0m[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (Dropoff)
2 -10 False {'prob': 1.0}

<<^^^^^^^^^^>>




\/\/\/\/\/\/

+---------+
|[42mR[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B: |
+---------+
  (Pickup)
18 -1 F

In [40]:
### Random reward == -713
print('Total reward: ', total_reward)
print('Steps: ', counterrr)
print('Average Reward: ', (total_reward/counterrr))

Total reward:  -713
Steps:  200
Average Reward:  -3.565


## Instructions

Make a Python notebook where you work on the below goals. You can use whatever
environment you wish to develop, but for turning in you should add the file to
the `ML-Reinforcement-Learning` repository in the `sprintchallenge/` directory.
Add, commit, push, and it will appear in your already open pull request.

The goals involve trying to beat a score in Taxi - be sure to measure the score
of your approach after it is trained, and not during the training. This snippet
measures performance (run a simulation repeatedly and average total rewards):

```python
episodes = 1000
rewards = []
max_steps = 99

for episode in range(episodes):
    state = env.reset()  # Assuming you already have env created as above
    total_rewards = 0
    
    for step in range(max_steps):
        action = env.action_space.sample() # TODO your policy here!
        state, reward, done, info = env.step(env.action_space.sample())
        total_rewards += reward
        if done:
            break
    rewards.append(total_rewards)        

print('Average score over time:', sum(rewards) / episodes)
```



In [39]:
print('Observation Space: ', env.observation_space)
print('Action Space:      ', env.action_space)

Observation Space:  Discrete(500)
Action Space:       Discrete(6)


In [27]:
import numpy as np
import random
from collections import defaultdict

In [29]:
### TheHYPErPARAMETERS ###

total_episodes = 1000
learning_rate = 0.9
max_steps = 99
gamma = 0.99


# Exploration Parameters

epsilon = 1.0         # Exploration Rate
max_epsilon = 1.0     # Initial exploration probability
min_epsilon = 0.01    # Minimum exploration probability
decay_rate = 0.01     # Exponential decay rate for exploration

In [None]:
%%time

rewards = []

# Learn from the episodes
for episode in range(total_episodes):
    # Reset for a blank slate
    state = env.reset()
    done = False
    total_rewards = 0
    env.render()
    print('^^ initial state ^^')
print('Average score over time:', sum(rewards) / episodes)

for step in range(max_steps):
        taxi_location = state, reward, done, info = env.step(env.action_space.sample())

        print('\n\n\\/\\/\\/\\/\\/\\/\n')
    
        state, reward, done, info = env.step(env.action_space.sample())
    
        total_reward += reward
        counterrr += 1
    
        env.render()
        print(state, reward, done, info)
    
        print('\n<<^^^^^^^^^^>>\n\n')

print('Total reward: ', total_reward)
print('Steps: ', counterrr)

In [32]:
# this is a dictionary of actions 
qtable_dict = defaultdict(lambda: np.zeros(6)) 

In [42]:
%%time 

env.render()
print('^^ initial state ^^')

episodes = 10000
rewards = []
max_steps = 99
   
for episode in range(episodes):
    state = env.reset()  # Assuming you already have env created as above
    total_rewards = 0

    for step in range(max_steps):
        if random.uniform(0, 1) < epsilon:
            # We explore at this level of epsilon
            action = env.action_space.sample()
        else: 
            # Exploit based on best available rewards
            action = np.argmax(qtable_dict[state])
            
        #action = env.action_space.sample() # TODO your policy here!
        new_state, reward, done, info = env.step(env.action_space.sample())
        
        # update qtable
        qtable_dict[state][action] += learning_rate * (reward + gamma * (np.max(qtable_dict[new_state])) - qtable_dict[state][action])
        
        total_rewards += reward
        state = new_state
        if done:
            break
        # Explore less as we learn    
        epsilon = (min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode))
    rewards.append(total_rewards)   
    
print('\nAverage score over time:', sum(rewards) / episodes)
print('\nSum of Rewards: ',sum(rewards))

+---------+
|[35m[42mR[0m[0m: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)
^^ initial state ^^

Average score over time: -385.2214

Sum of Rewards:  -3852214
CPU times: user 1min 5s, sys: 514 ms, total: 1min 6s
Wall time: 1min 14s


In [78]:
qtable_dict = defaultdict(lambda: np.zeros(6)) 

In [125]:
### TheHYPErPARAMETERS ###

episodes = 100000
learning_rate = 0.0009
max_steps = 9
gamma = 0.99


# Exploration Parameters

epsilon = 0.7        # Exploration Rate
max_epsilon = 1.0     # Initial exploration probability
min_epsilon = 0.01    # Minimum exploration probability
decay_rate = 0.01     # Exponential decay rate for exploration

In [126]:
%%time 

env.render()
print('^^ initial state ^^')

#episodes = 10000
rewards = []
counterrr = 0

   
for episode in range(episodes):
    state = env.reset()  # Assuming you already have env created as above
    total_rewards = 0

    for step in range(max_steps):
        if random.uniform(0, 1) < epsilon:
            # We explore at this level of epsilon
            action = env.action_space.sample()
        else: 
            # Exploit based on best available rewards
            action = np.argmax(qtable_dict[state])
            
        #action = env.action_space.sample() # TODO your policy here!
        new_state, reward, done, info = env.step(env.action_space.sample())
        
        # update qtable
        qtable_dict[state][action] += learning_rate * (reward + gamma * (np.max(qtable_dict[new_state])) - qtable_dict[state][action])
        
        total_rewards += reward
        counterrr += 1
        state = new_state
        if reward == 20:
        #if done:
            break
        # Explore less as we learn    
        epsilon = (min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode))
    rewards.append(total_rewards)   
    
print('\nAverage score over time:', sum(rewards) / episodes)
print('Counter: ', counterrr)
print('Epsilon: ', epsilon)
print('\nSum of Rewards: ',sum(rewards))

+---------+
|R: | : :[35mG[0m|
| : : : : |
|[43m [0m: : : : |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+
  (West)
^^ initial state ^^

Average score over time: -35.48007
Counter:  900000
Epsilon:  0.01

Sum of Rewards:  -3548007
CPU times: user 1min 3s, sys: 652 ms, total: 1min 3s
Wall time: 1min 6s


In [77]:
print(qtable_dict)

defaultdict(<function <lambda> at 0x1a14d39598>, {288: array([-176.14658751, -176.17186024, -176.11452486, -176.16499972,
       -176.15388366, -176.18308107]), 388: array([-177.85993325, -177.85927862, -177.84744681, -177.82455216,
       -177.85828156, -177.88030339]), 268: array([-175.19966918, -175.28745641, -175.22532529, -175.23073531,
       -175.24805077, -175.22522602]), 368: array([-177.21715876, -177.16124848, -177.20725834, -177.25781783,
       -177.20253607, -177.20381167]), 188: array([-176.73225411, -175.85224441, -175.77015821, -175.79199091,
       -175.80284759, -175.79078985]), 88: array([-175.02341577, -175.07971885, -175.09535717, -175.09971254,
       -175.06069134, -175.11278166]), 68: array([-175.32786561, -175.37822877, -175.38575845, -175.38939129,
       -175.32062635, -175.33471621]), 168: array([-174.39671835, -174.39786686, -174.32735421, -174.36384952,
       -174.35681707, -174.35353691]), 468: array([-178.91203955, -178.87464075, -178.93034645, -178.92

## Goal 1 - Beat Random

As an initial goal, come up with an agent/policy that does better than random.
And more specifically, try to at least have a positive score (>0) average.

This game is discrete, and so you can use the Q-learning approach and build a
matrix of states by actions populated with expected rewards. This approach
should work well and it is suggested you start with it.


## Goal 2 - Beat Basic Q-learning

Once you've got an initial Q-learning approach working, you should try to
improve it via hyperparameter optimization. A score (average performance across
many games) generated without optimizing hyperparameters that you should try to
beat: `8.467`

You should be able to do better without having to use different techniques (i.e.
just with hyperparameter optimization).

## Goal 3 - Beat Optimized Q-learning (stretch)

Now the sky's the limit - or rather, the best possible performance in Taxi. With
the default environment, optimized Q-learning has achieved an average score of
`9.423`. See if you can get in that range, or possibly even beat it, by
employing alternative techniques.

What is an alternative technique? It's anything that maps from environment state
to action - Q-learning achieves this by populating a Q-table, but any model that
can take the environment state and possible action as input and give predicted
reward as output can serve the same purpose. And the Gym environment object
gives us a simulator perfect for generating arbitrary amounts of training data
to train such a model.

The true optimal performance in Taxi is probably not much more than the
optimized Q-learning score - the score certainly has to be less than 20 as the
taxi must always take at least some steps to achieve the task.

If you get this far, feel free to share the best score you get, and see how your
classmates are doing. Good luck!