Reinforcement Learning
===

This notebook tackles the Taxi problem: a self-driving taxi must pick up a passenger and deliver them to the drop-off location in as few steps as possible.

In [8]:
import gym
import random

random.seed(0)

streets = gym.make("Taxi-v3").env
streets.render()

+---------+
|R: | : :[35mG[0m|
| : | : : |
| : : : : |
| | :[43m [0m| : |
|[34;1mY[0m| : |B: |
+---------+



Let's break down what we're seeing here:
* R, G, B, and Y are the possible pickup/drop-off locations
* The BLUE letter indicates where we need to pick someone up
* The MAGENTA letter indicates where the passenger wants to go
* Solid lines represent walls, which the taxi cannot cross
* The filled rectangle represents the taxi itself: yellow if empty, green if carrying a passenger

There are 500 states in total: 25 (taxi positions in the 5x5 grid) x 5 (passenger locations: R, G, Y, B, or inside the taxi) x 4 (destinations: R, G, Y, B).
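The state count can be checked with a small sketch of the numbering. It assumes the states are packed as taxi row, then taxi column, then passenger location, then destination, which matches the value `streets.encode(2, 3, 2, 0)` produces in this notebook. The helper names `encode_state` and `decode_state` are illustrative and are not part of gym.

```python
# Sketch of the state numbering (assumed layout: row, column, passenger, destination).
N_ROWS, N_COLS = 5, 5
N_PASSENGER_LOCS = 5   # 0=R, 1=G, 2=Y, 3=B, 4=inside the taxi
N_DESTINATIONS = 4     # 0=R, 1=G, 2=Y, 3=B


def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into a single integer state (hypothetical helper)."""
    state = taxi_row
    state = state * N_COLS + taxi_col
    state = state * N_PASSENGER_LOCS + passenger_loc
    state = state * N_DESTINATIONS + destination
    return state


def decode_state(state):
    """Invert encode_state, returning (taxi_row, taxi_col, passenger_loc, destination)."""
    state, destination = divmod(state, N_DESTINATIONS)
    state, passenger_loc = divmod(state, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(state, N_COLS)
    return taxi_row, taxi_col, passenger_loc, destination


total_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(total_states)              # 500
print(encode_state(2, 3, 2, 0))  # 268
print(decode_state(268))         # (2, 3, 2, 0)
```

Because every component has a fixed range, each combination maps to a unique integer from 0 to 499, which is what lets a plain 2D array serve as the Q-table.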

For each state, there are six actions:
* move south, north, east, or west (actions 0 to 3)
* pick up the passenger (action 4)
* drop off the passenger (action 5)

Q-learning needs a reward signal, so each action earns a reward or a penalty:
* successful drop-off: +20 pts
* every time step taken: -1 pt
* illegal pick-up or drop-off: -10 pts

Moving across a wall is impossible: the taxi simply stays where it is.

Let's define an initial state with the taxi at location (2, 3), the passenger at pickup location 2 (Y), and the destination at location 0 (R):

In [11]:
initial_state = streets.encode(2, 3, 2, 0)
streets.s = initial_state
streets.render()

+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|[34;1mY[0m| : |B: |
+---------+



Let's examine the reward table for the initial state:

In [12]:
streets.P[initial_state]

{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}

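Interpretation: each key is an action, and each value lists the possible outcomes of that action as tuples. The 1st component is the probability of that outcome (always 1.0 here, since the environment is deterministic), the 2nd is the resulting state, the 3rd is the immediate reward for that step (not a cumulative total), and the 4th is whether the episode has ended with a successful drop-off.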
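To make the table easier to read, the sketch below copies the output above into a dictionary and prints one labelled line per action. The `ACTION_NAMES` list and the `describe` helper are illustrative names; the action order (south, north, east, west, pickup, dropoff) is the one Taxi-v3 uses.

```python
# Action indices as used by Taxi-v3.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Copied from the output of streets.P[initial_state] above.
transitions = {
    0: [(1.0, 368, -1, False)],
    1: [(1.0, 168, -1, False)],
    2: [(1.0, 288, -1, False)],
    3: [(1.0, 248, -1, False)],
    4: [(1.0, 268, -10, False)],
    5: [(1.0, 268, -10, False)],
}


def describe(table):
    """Return one readable line per (action, outcome) pair (hypothetical helper)."""
    lines = []
    for action, outcomes in table.items():
        for probability, next_state, reward, done in outcomes:
            lines.append(
                f"{ACTION_NAMES[action]:>7}: p={probability}, "
                f"next_state={next_state}, reward={reward}, done={done}"
            )
    return lines


for line in describe(transitions):
    print(line)
```

Note that pickup and dropoff both leave the taxi in state 268, the initial state, and cost 10 points: with the passenger waiting at Y and the taxi at (2, 3), neither action is legal here.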

Now for the Q-learning itself. We train the model over 10,000 taxi runs, giving each step a 10% chance of being a random exploratory action:

In [14]:
import numpy as np
q_table = np.zeros([streets.observation_space.n,streets.action_space.n]) #2D array of every state and action
learning_rate = .1
discount_factor = .6
exploration = .1
epoch = 10000
for taxi_run in range(epoch):
    state = streets.reset()
    done = False
    while not done:
        random_value = random.uniform(0,1)
        if random_value < exploration:
            action = streets.action_space.sample() #explore a random action
        else:
            action = np.argmax(q_table[state]) #use action with highest q_value
        next_state, reward, done, info = streets.step(action)
        
        prev_q = q_table[state, action]
        next_max_q = np.max(q_table[next_state])
        new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
        q_table[state, action] = new_q
        
        state = next_state
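The update inside the loop blends the old estimate with the newly observed return: new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q). As a minimal worked example, the sketch below isolates that one line in a function (`q_update` is an illustrative name, not something defined in the notebook) and applies it to two simple cases starting from an all-zero table.

```python
def q_update(prev_q, reward, next_max_q, learning_rate=0.1, discount_factor=0.6):
    """Apply a single Q-learning update and return the new Q-value."""
    target = reward + discount_factor * next_max_q
    return (1 - learning_rate) * prev_q + learning_rate * target


# First ordinary move from an all-zero table: only the step penalty matters.
first_move = q_update(prev_q=0.0, reward=-1, next_max_q=0.0)

# First successful drop-off from an all-zero table: a tenth of the +20 reward is absorbed.
first_dropoff = q_update(prev_q=0.0, reward=20, next_max_q=0.0)

print(round(first_move, 4))     # -0.1
print(round(first_dropoff, 4))  # 2.0
```

With a learning rate of 0.1, each visit moves the estimate only 10% of the way toward the target, which is why many runs are needed before the values settle.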

In [25]:
q_table[initial_state]

array([-2.40585473, -2.40490312, -2.3984878 , -2.3639511 , -8.38468125,
       -7.27981943])

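The highest (least negative) Q-value here, -2.364 at index 3, corresponds to the action "go west". That makes sense: the passenger is waiting at Y, to the south-west of the taxi. The two illegal actions, pickup and dropoff, have by far the lowest values.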
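To read off which action this row of Q-values favours, the sketch below copies the values from the output above and picks the index of the largest one, the same choice `np.argmax` makes during the trips. It uses plain Python so it runs without the trained table; the `ACTION_NAMES` list is an illustrative label for Taxi-v3's action order.

```python
# Action indices as used by Taxi-v3.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Copied from the output of q_table[initial_state] above.
q_values = [-2.40585473, -2.40490312, -2.3984878, -2.3639511, -8.38468125, -7.27981943]

# Greedy policy: choose the action with the largest Q-value.
best_action = max(range(len(q_values)), key=lambda a: q_values[a])

print(best_action, ACTION_NAMES[best_action])  # 3 west
```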

In [19]:
from IPython.display import clear_output
from time import sleep

for tripnum in range(1,11):
    state = streets.reset()
    
    done = False
    while not done:
        action = np.argmax(q_table[state])
        next_state, reward, done, info = streets.step(action)
        clear_output(wait=True)
        print("Trip number " + str(tripnum))
        print(streets.render(mode = 'ansi'))
        sleep(.5)
        state = next_state
    sleep(2)

Trip number 10
+---------+
|R: | : :[35m[34;1m[43mG[0m[0m[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

