# Reinforcement Learning using Q-Learning

Notes: https://github.com/daviskregers/notes/blob/master/data-science/06-more-data-mining-and-machine-learning-techniques/04-reinforcement-learning.md

---

We want to build a self-driving taxi that can pick up passengers at one of a set of fixed locations, drop them off at another, and get there as quickly as possible while avoiding obstacles.

OpenAI Gym lets us create this environment quickly.

In [3]:
!pip install gym

Collecting gym
  Downloading gym-0.18.0.tar.gz (1.6 MB)
Collecting pyglet<=1.5.0,>=1.4.0
  Downloading pyglet-1.5.0-py2.py3-none-any.whl (1.0 MB)
Collecting Pillow<=7.2.0
  Downloading Pillow-7.2.0-cp38-cp38-manylinux1_x86_64.whl (2.2 MB)
Building wheels for collected packages: gym
  Building wheel for gym (setup.py) ... done
  Created wheel for gym: filename=gym-0.18.0-py3-none-any.whl size=1656446 sha256=644fc3bd1b144bfcc91eb2f9dc7399f9ea938c0a9053ef2f4bd8d448d3763e94
  Stored in directory: /home/davis/.cache/pip/wheels/d8/e7/68/a3f0f1b5831c9321d7523f6fd4e0d3f83f2705a1cbd5daaa79
Successfully built gym
Installing collected packages: pyglet, Pillow, gym
  Attempting uninstall: Pillow
    Found existing installation: Pillow 8.0.1
    Uninstalling Pillow-8.0.1:

In [4]:
import gym
import random

random.seed(1234)

streets = gym.make("Taxi-v3").env #New versions keep getting released; if -v3 doesn't work, try -v2 or -v4
streets.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
(Taxi at row 1, column 3, zero-based, and empty; passenger waiting at B; destination G.)



R, G, B and Y are pickup or dropoff locations.
- The blue letter indicates where we need to pick someone up
- The magenta letter indicates where that passenger wants to go
- The solid lines represent walls that the taxi cannot cross
- The filled rectangle represents the taxi itself. It's yellow when empty, green when carrying a passenger.

The world, which we've called `streets`, is a 5x5 grid. The state of this world at any time can be defined by:

- Where the taxi is (one of 25 locations)
- What the current destination is (4 possibilities)
- Where the passenger is (5 possibilities: at one of the destinations or inside the taxi)

So there are a total of 25 x 4 x 5 = 500 possible states that describe our world.
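The count can be checked by enumerating the three components. The sketch below also packs them into a single integer in the order that `streets.encode(taxi_row, taxi_col, passenger, destination)` takes its arguments; the helper names `encode_state` and `decode_state` are illustrative and not part of Gym, and the packing order is an assumption about how the environment numbers its states.

```python
from itertools import product

GRID_SIZE = 5            # 5x5 grid -> 25 taxi positions
PASSENGER_LOCATIONS = 5  # R, G, Y, B, or inside the taxi
DESTINATIONS = 4         # R, G, Y, B


def encode_state(taxi_row, taxi_col, passenger, destination):
    """Pack the four components into one integer in [0, 500). Hypothetical helper."""
    index = taxi_row
    index = index * GRID_SIZE + taxi_col
    index = index * PASSENGER_LOCATIONS + passenger
    index = index * DESTINATIONS + destination
    return index


def decode_state(index):
    """Inverse of encode_state: recover (taxi_row, taxi_col, passenger, destination)."""
    index, destination = divmod(index, DESTINATIONS)
    index, passenger = divmod(index, PASSENGER_LOCATIONS)
    taxi_row, taxi_col = divmod(index, GRID_SIZE)
    return taxi_row, taxi_col, passenger, destination


# Enumerate every combination and collect the distinct state numbers.
all_states = {
    encode_state(row, col, passenger, destination)
    for row, col, passenger, destination in product(
        range(GRID_SIZE), range(GRID_SIZE),
        range(PASSENGER_LOCATIONS), range(DESTINATIONS),
    )
}
print(len(all_states))  # 500
```

Because every combination maps to a distinct integer, a Q-table with one row per state needs exactly 500 rows.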

For each state, there are six possible actions:

- Move South, East, North or West
- Pick up a passenger
- Drop off a passenger

Q-learning will take place using the following rewards and penalties at each state:

- A successful drop-off yields +20 points
- Every time step yields a -1 point penalty, whether or not a passenger is on board
- Picking up or dropping off at an illegal location yields a -10 point penalty

Moving across a wall just isn't allowed at all.

Let's define an initial state, with the taxi at location (2,3), the passenger at pickup location 2, and the destination at location 0.


In [5]:
initial_state = streets.encode(2, 3, 2, 0)
streets.s = initial_state
streets.render()

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



Let's examine the reward table for this initial state.

In [7]:
streets.P[initial_state]

{0: [(1.0, 368, -1, False)],
 1: [(1.0, 168, -1, False)],
 2: [(1.0, 288, -1, False)],
 3: [(1.0, 248, -1, False)],
 4: [(1.0, 268, -10, False)],
 5: [(1.0, 268, -10, False)]}

Each row is a potential action at this state; actions 0 through 5 are south, north, east, west, pickup and dropoff. The four values in each row are the probability of that transition, the next state that results from the action, the reward for the action, and whether the action ends the episode with a successful dropoff.
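To make the state numbers in that table readable, they can be unpacked back into grid coordinates. The sketch below copies the table from the output above and assumes the state index is packed as taxi row, taxi column, passenger location, destination (the argument order of `streets.encode`); `decode_state` and `ACTION_NAMES` are illustrative names, not Gym attributes.

```python
# Transition table copied from the output above:
# action -> [(probability, next_state, reward, done)]
transitions = {
    0: [(1.0, 368, -1, False)],
    1: [(1.0, 168, -1, False)],
    2: [(1.0, 288, -1, False)],
    3: [(1.0, 248, -1, False)],
    4: [(1.0, 268, -10, False)],
    5: [(1.0, 268, -10, False)],
}

# Assumed Taxi-v3 action ordering.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def decode_state(index):
    """Unpack a state number into (taxi_row, taxi_col, passenger, destination)."""
    index, destination = divmod(index, 4)
    index, passenger = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger, destination


# Where the taxi ends up after each action taken from (2, 3).
landing = {}
for action, outcomes in transitions.items():
    for probability, next_state, reward, done in outcomes:
        row, col, passenger, destination = decode_state(next_state)
        landing[ACTION_NAMES[action]] = (row, col)
        print(f"{ACTION_NAMES[action]:>7}: taxi -> ({row}, {col}), reward {reward}, done={done}")
```

Under these assumptions the four moves shift the taxi by one cell in the expected direction, while the illegal pickup and dropoff leave it at (2, 3) and cost 10 points each.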

So, let's do it. First we need to train our model. At a high level, we'll train over 10,000 simulated taxi runs. For each run, we'll step through time, with a 10% chance at each step of taking a random, exploratory action instead of using the learned Q-values to guide our actions.
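The 10% exploration rule is commonly called epsilon-greedy action selection. As a minimal standalone sketch of that rule (the function `choose_action` and the Q-values in `q_row` are made up for illustration and are not part of the notebook):

```python
import random


def choose_action(q_row, exploration, rng):
    """Pick a random action with probability `exploration`, else the greedy one."""
    if rng.random() < exploration:
        return rng.randrange(len(q_row))  # explore
    return max(range(len(q_row)), key=lambda a: q_row[a])  # exploit

rng = random.Random(0)
q_row = [0.0, -1.0, 2.5, 0.3, -4.0, -4.0]  # made-up Q-values for a single state

picks = [choose_action(q_row, 0.1, rng) for _ in range(10_000)]
greedy_share = picks.count(2) / len(picks)
print(f"greedy action chosen {greedy_share:.1%} of the time")
```

The greedy action is picked slightly more than 90% of the time, because a random draw can also land on it by chance.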

In [9]:
import numpy as np

q_table = np.zeros([streets.observation_space.n, streets.action_space.n])

learning_rate = 0.1
discount_factor = 0.6
exploration = 0.1
epochs = 10000

for taxi_run in range(epochs):
    state = streets.reset()
    done = False
    
    while not done:
        random_value = random.uniform(0, 1)
        if (random_value < exploration):
            action = streets.action_space.sample() # Explore a random action
        else:
            action = np.argmax(q_table[state]) # Use the action with the highest q-value
            
        next_state, reward, done, info = streets.step(action)
        
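        # Q-learning update: blend the old estimate with the observed reward
        # plus the discounted best Q-value available from the next state
        prev_q = q_table[state, action]
        next_max_q = np.max(q_table[next_state])
        new_q = (1 - learning_rate) * prev_q + learning_rate * (reward + discount_factor * next_max_q)
        q_table[state, action] = new_q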
        
        state = next_state
        
        

So now we have a table of Q-values that can be quickly used to determine the optimal next step for any given state.
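Each entry in that table was built up from many small updates of the same form. Here is a worked sketch of a single update, using the same learning rate and discount factor as the training loop; the function name `q_update` is illustrative only.

```python
def q_update(prev_q, reward, next_max_q, learning_rate=0.1, discount_factor=0.6):
    """One tabular Q-learning update for a single (state, action) pair."""
    target = reward + discount_factor * next_max_q
    return (1 - learning_rate) * prev_q + learning_rate * target

# First visit: the whole table is still zero and an ordinary move costs -1.
first_move = q_update(prev_q=0.0, reward=-1, next_max_q=0.0)

# A successful drop-off (+20) from a state whose estimate is still zero.
first_dropoff = q_update(prev_q=0.0, reward=20, next_max_q=0.0)

# Later on, a move into a state whose best known Q-value is 2.0.
later_move = q_update(prev_q=first_move, reward=-1, next_max_q=first_dropoff)

print(first_move, first_dropoff, later_move)
```

With a learning rate of 0.1, each update moves the estimate only a tenth of the way toward its target, which is why thousands of runs are needed before the values settle.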

In [10]:
q_table[initial_state]

array([-2.4145798 , -2.4146969 , -2.40612615, -2.3639511 , -7.31385793,
       -8.40116511])

The highest Q-value here corresponds to the action "go west" (index 3), which makes sense: that's the most direct route toward our destination from that point.
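The greedy choice can be read off directly from the printed row. This sketch copies the Q-values from the output above and assumes the Taxi-v3 action ordering (south, north, east, west, pickup, dropoff); `ACTION_NAMES` is an illustrative list, not something Gym exposes.

```python
# Q-values copied from the output above, one per action index.
q_values = [-2.4145798, -2.4146969, -2.40612615, -2.3639511, -7.31385793, -8.40116511]

# Assumed Taxi-v3 action ordering.
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# The greedy policy picks the index of the largest (least negative) Q-value.
best_action = max(range(len(q_values)), key=lambda a: q_values[a])
print(ACTION_NAMES[best_action])  # west

# Rank all actions from most to least attractive.
ranked = sorted(zip(ACTION_NAMES, q_values), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:>7}: {value:.3f}")
```

The pickup and dropoff actions rank far below the four moves, reflecting the -10 penalty for attempting them at an illegal location.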

In [11]:
from IPython.display import clear_output
from time import sleep

for tripnum in range(1, 11):
    state = streets.reset()
   
    done = False
    trip_length = 0
    
    while not done and trip_length < 25:
        action = np.argmax(q_table[state])
        next_state, reward, done, info = streets.step(action)
        clear_output(wait=True)
        print("Trip number " + str(tripnum) + " Step " + str(trip_length))
        print(streets.render(mode='ansi'))
        sleep(.5)
        state = next_state
        trip_length += 1
        
    sleep(2)

Trip number 10 Step 9
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

