## Project Overview:
The Smartcab's job is to pick up the passenger at one location and drop them off in another. Here are a few things that we'd love our Smartcab to take care of:
<ul>
<li>Drop off the passenger at the right location.</li>
<li>Save the passenger's time by completing the trip in as few time-steps as possible.</li>
<li>Take care of the passenger's safety and obey traffic rules.</li>
</ul>

### Components that need to be considered for designing our agent:
<ul>
    <li>Reward:
        <ul>
            <li>Our agent is reward-motivated and learns to control the cab through trial and error in the environment.</li>
            <li>The agent should receive a high positive reward for a successful drop-off, because this behavior is highly desired.</li>
            <li>The agent should be penalized if it tries to drop off a passenger at the wrong location.</li>
            <li>The agent should get a slight negative reward at every time-step in which it has not yet reached the destination. The penalty is only "slight" because we would rather have the agent arrive late than make wrong moves while trying to reach the destination as fast as possible.</li>
        </ul>
    </li>
    <li>State Space:
        <ul>
            <li>At each step the agent observes a state and takes an action according to the state it is in.</li>
            <li>The state should contain the information the agent needs to choose the right action.</li>
            <li>PEAS for our car agent:
                <ul>
                    <li>Performance: try to get high positive reward</li>
                    <li>Environment:
                        <ul>
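                            <li>A 5x5 grid, giving 25 possible taxi positions.</li>
                            <li>Four locations (R, G, B, Y) that serve as pick-up and drop-off points.</li>
                            <li>The passenger can be at one of 5 locations (one of the 4 marked locations, or inside the taxi) and there are 4 possible destinations, so there are 5*5*5*4 = 500 possible states.</li>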
                        </ul>
                    </li>
                    <li>Actuator: Wheel</li>
                    <li>Sensor: Camera or some computer vision</li>
                </ul>
            </li>
        </ul>
    </li>
    <li>Action Space:
        <ul>
            <li>The set of all actions that our agent can take in a given state.</li>
            <li>The actions are: move north, south, east, or west, pick up, and drop off.</li>
        </ul>
    </li>
</ul>
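The 500-state count can be checked with a short sketch. The function names `encode_state` and `decode_state` are hypothetical helpers written for illustration, and the packing order (taxi row, taxi column, passenger location, destination) is an assumption about how a Taxi-style state could be flattened into a single integer.

```python
# Sketch of how a Taxi-style state could be packed into one integer.
# Assumed layout: 5 rows x 5 columns x 5 passenger locations x 4 destinations.
N_ROWS, N_COLS, N_PASSENGER_LOCS, N_DESTINATIONS = 5, 5, 5, 4


def encode_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack the four components into a single index in [0, 500)."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASSENGER_LOCS + passenger_loc
    index = index * N_DESTINATIONS + destination
    return index


def decode_state(index):
    """Invert encode_state, returning (row, col, passenger_loc, destination)."""
    index, destination = divmod(index, N_DESTINATIONS)
    index, passenger_loc = divmod(index, N_PASSENGER_LOCS)
    taxi_row, taxi_col = divmod(index, N_COLS)
    return taxi_row, taxi_col, passenger_loc, destination


total_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(total_states)               # 500
print(encode_state(2, 2, 3, 1))   # 253
print(decode_state(253))          # (2, 2, 3, 1)
```

Under this assumed packing, every combination of taxi position, passenger location, and destination gets its own index, which is what lets a table with one row per state cover the whole problem.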

### Framework used for this project: OpenAI Gym
<ul>
    <li>Provides different game environments that we can plug into our code to test an agent.</li>
    <li>Provides all the information our agent requires, such as the possible actions, the score, and the current state.</li>
</ul>

In [9]:
pip install gym

Collecting gym
  Using cached gym-0.17.1.tar.gz (1.6 MB)
Collecting pyglet<=1.5.0,>=1.4.0
  Using cached pyglet-1.5.0-py2.py3-none-any.whl (1.0 MB)
Collecting cloudpickle<1.4.0,>=1.2.0
  Using cached cloudpickle-1.3.0-py2.py3-none-any.whl (26 kB)
Collecting future
  Using cached future-0.18.2.tar.gz (829 kB)

### Generate the Environment

In [12]:
import gym

env = gym.make("Taxi-v3").env
state = env.encode(2, 2, 3, 1) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

env.s = state
env.render()

State: 253
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



From our Taxi-v3 environment:
<ul>
    <li>The filled square represents the taxi, which is yellow without a passenger and green with a passenger.</li>
    <li>The pipe ("|") represents a wall which the taxi cannot cross.</li>
    <li>R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.</li>
    <li>Each state is identified by a unique number (0 to 499), and each action is identified by a number as well:
        <ul>
            <li>0 = south</li>
            <li>1 = north</li>
            <li>2 = east</li>
            <li>3 = west</li>
            <li>4 = pickup</li>
            <li>5 = dropoff</li>
        </ul>
    </li>
</ul>

In [5]:
print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

Action Space Discrete(6)
State Space Discrete(500)


Reinforcement Learning learns a mapping from states to the best action to perform in each state. It does so by exploration: the agent tries actions in the environment and adjusts its behavior based on the rewards the environment defines.

### Our Reward Table:

In [22]:
env.P[253]

{0: [(1.0, 353, -1, False)],
 1: [(1.0, 153, -1, False)],
 2: [(1.0, 273, -1, False)],
 3: [(1.0, 233, -1, False)],
 4: [(1.0, 253, -10, False)],
 5: [(1.0, 253, -10, False)]}

<ul>
    <li>The reward table <code>env.P</code> can be thought of as a matrix with one row per state and one column per action.</li>
    <li>The dictionary above has the structure <code>{action: [(probability, nextstate, reward, done)]}</code>:
        <ul>
            <li>The keys 0 to 5 are the <code>actions</code> (south, north, east, west, pickup, dropoff) the taxi can perform in the current state.</li>
            <li><code>probability</code> is always 1.0 in this environment.</li>
            <li>All the movement actions have a <code>reward</code> of -1, and the pickup/dropoff actions have a reward of -10 in this particular state. We would see a reward of 20 for the dropoff action if the taxi dropped the passenger off at the right destination.</li>
            <li><code>done</code> tells us when we have successfully dropped off a passenger at the right location.</li>
        </ul>
    </li>
</ul>
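To make the dictionary structure concrete, here is a minimal sketch that walks over the entries shown for state 253. The values are copied from the output above; the helper names `summarize` and `penalized_actions` are hypothetical and exist only for this illustration.

```python
# Transition entries for state 253, copied from the env.P[253] output above.
# Layout: {action: [(probability, next_state, reward, done)]}
transitions_253 = {
    0: [(1.0, 353, -1, False)],
    1: [(1.0, 153, -1, False)],
    2: [(1.0, 273, -1, False)],
    3: [(1.0, 233, -1, False)],
    4: [(1.0, 253, -10, False)],
    5: [(1.0, 253, -10, False)],
}

ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]


def summarize(transitions):
    """Flatten the table into (action name, next state, reward, done) rows."""
    rows = []
    for action, outcomes in transitions.items():
        for probability, next_state, reward, done in outcomes:
            rows.append((ACTION_NAMES[action], next_state, reward, done))
    return rows


def penalized_actions(transitions, penalty=-10):
    """Names of the actions that incur the illegal pickup/dropoff penalty."""
    return [
        ACTION_NAMES[action]
        for action, outcomes in transitions.items()
        if any(reward == penalty for _, _, reward, _ in outcomes)
    ]


for row in summarize(transitions_253):
    print(row)
print(penalized_actions(transitions_253))   # ['pickup', 'dropoff']
```

In this state the taxi holds no passenger and is not on a pickup location, so both pickup and dropoff leave the state unchanged at 253 and cost -10.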

### Without Reinforcement Learning:
<ul>
    <li>Navigate the taxi by picking random actions, with the environment's rewards as the only feedback.</li>
    <li>Run a loop that continues until the passenger reaches the destination (one episode).</li>
</ul>

In [23]:
"""Without Q-Learning"""
env.s = 253  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = [] # for animation

done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
        }
    )

    epochs += 1
    
    
print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 2177
Penalties incurred: 705


In [24]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)
        
print_frames(frames)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 2177
State: 85
Action: 5
Reward: 20


Not good. Our agent takes thousands of timesteps and makes lots of wrong drop-offs to deliver just one passenger to the right destination.

This is because we aren't learning from past experience. We can run this over and over, and it will never improve. The agent has no memory of which action was best in each state, and providing that memory is exactly what Reinforcement Learning will do for us.

### Q-Learning
Q-learning lets the agent use the environment's rewards to learn, over time, the best action to take in a given state.
It does this by receiving a reward for taking an action in the current state, then updating a Q-value to remember whether that action was beneficial.
Each Q-value maps to a <code>(state, action)</code> combination and represents the "quality" of taking that action from that state. Higher Q-values imply better chances of getting greater rewards.<br>
Q-values are updated using the equation:

$$Q(\text{state}, \text{action}) \leftarrow (1 - \alpha)\, Q(\text{state}, \text{action}) + \alpha \left( \text{reward} + \gamma \max_{a} Q(\text{next state}, a) \right)$$
<ul>
    <li><code>α</code> (alpha) is the learning rate (0 &lt; α ≤ 1). It is the extent to which our Q-values are updated in every iteration.</li>
    <li><code>γ</code> (gamma) is the discount factor (0 ≤ γ ≤ 1). It determines how much importance we want to give to future rewards.</li>
</ul>
In the equation above, we assign (←), or update, the Q-value of the agent's current state and action by first taking a weight (1−α) of the old Q-value, then adding α times the learned value. The learned value is the reward for taking the current action in the current state plus the discounted maximum Q-value of the next state we will be in once we take the current action.
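A single update can be worked through numerically. The sketch below assumes α = 0.1 and γ = 0.6 as illustrative hyperparameters, and `q_update` is a hypothetical helper name, not part of any library.

```python
def q_update(old_value, reward, next_max, alpha=0.1, gamma=0.6):
    """Blend the old Q-value with the newly learned value."""
    learned_value = reward + gamma * next_max
    return (1 - alpha) * old_value + alpha * learned_value


# A move that costs -1 and leads to a state whose best Q-value is still 0.
after_move = q_update(old_value=0.0, reward=-1, next_max=0.0)
print(round(after_move, 4))      # -0.1

# A successful drop-off (+20) from a state-action pair currently valued at -0.1.
after_dropoff = q_update(old_value=-0.1, reward=20, next_max=0.0)
print(round(after_dropoff, 4))   # 1.91
```

With a small α, each experience nudges the stored value only a little, so a Q-value settles over many visits instead of jumping to the latest reward.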

#### Q-Table
The Q-table is a matrix with a row for every state (500) and a column for every action (6). It is first initialized to 0, and its values are then updated during training.

#### Q-Learning Process:
Breaking it down into steps, we get:
<ul>
    <li>Initialize the Q-table with all zeros.</li>
    <li>Start exploring actions: for the current state (S), select any one of the possible actions.</li>
    <li>Travel to the next state (S') as a result of that action (a).</li>
    <li>From all possible actions in state (S'), select the one with the highest Q-value.</li>
    <li>Update the Q-table values using the equation.</li>
    <li>Set the next state as the current state.</li>
    <li>If the goal state is reached, end the episode and repeat the process.</li>
</ul>
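The steps above can be sketched on a much smaller problem that needs no external library. The environment here is a hypothetical five-cell corridor (not the Taxi environment): the agent starts in cell 0, pays -1 per move, and receives +20 on reaching cell 4. The hyperparameter values and the names `step` and `train` are assumptions made for this illustration.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal (toy task)
LEFT, RIGHT = 0, 1
ACTIONS = (LEFT, RIGHT)


def step(state, action):
    """Move one cell; return (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    if next_state == N_STATES - 1:
        return next_state, 20, True    # large reward at the goal
    return next_state, -1, False       # small per-step penalty


def train(episodes=500, alpha=0.1, gamma=0.6, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize the Q-table with all zeros: one row per state.
    q_table = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, otherwise exploit.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q_table[state][a])
            # Travel to the next state as a result of the action.
            next_state, reward, done = step(state, action)
            # Highest Q-value available from the next state.
            next_max = max(q_table[next_state])
            # Apply the update equation.
            old_value = q_table[state][action]
            q_table[state][action] = (
                (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
            )
            # The next state becomes the current state.
            state = next_state
    return q_table


q_table = train()
greedy_policy = [
    max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)
]
print(greedy_policy)   # [1, 1, 1, 1]: move right from every non-goal cell
```

The same loop structure carries over to larger problems; only the environment's `step` function and the size of the table change.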

### Training the Agent
Initialize the Q-table to a 500×6 matrix of zeros:

In [19]:
import numpy as np
q_table = np.zeros([env.observation_space.n, env.action_space.n])

Start the training algorithm:

In [20]:
"""Training the agent"""

import random
from IPython.display import clear_output

# Hyperparameters
alpha = 0.1
gamma = 0.6
epsilon = 0.1

# For plotting metrics
all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False
    
    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample() # Explore action space
        else:
            action = np.argmax(q_table[state]) # Exploit learned values

        next_state, reward, done, info = env.step(action) 
        
        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])
        
        new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1
        
    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

Episode: 100000
Training finished.



In the first part of <code>while not done</code>, we decide whether to pick a random action or to exploit the already computed Q-values. We draw a random number between 0 and 1 with <code>random.uniform(0, 1)</code> and compare it to <code>epsilon</code>: if the number is smaller, the agent explores with a random action; otherwise it exploits by choosing the action with the highest Q-value.

We execute the chosen action in the environment to obtain the <code>next_state</code> and the <code>reward</code> from performing the action. After that, we calculate the maximum Q-value over the actions available in <code>next_state</code>, and with that we can update our Q-value to <code>new_value</code>.
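The explore-or-exploit decision can be isolated into a small function. This is a sketch only: `choose_action` is a hypothetical helper, and it works on a plain list of Q-values for one state instead of the NumPy table used in the training cell.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy choice over the Q-values of a single state."""
    if rng.uniform(0, 1) < epsilon:
        # Explore: any action index, chosen uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: index of the largest Q-value (first one wins on ties).
    return max(range(len(q_row)), key=lambda a: q_row[a])


q_row = [0.0, 1.5, -2.0]
print(choose_action(q_row, epsilon=0.0))   # 1, always greedy when epsilon is 0

# With epsilon = 0.1, roughly one choice in ten is random.
rng = random.Random(42)
choices = [choose_action(q_row, epsilon=0.1, rng=rng) for _ in range(1000)]
print(choices.count(1) / len(choices))
```

A larger `epsilon` spends more steps on exploration, which slows down exploitation but reduces the risk of locking onto an early, poor estimate.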

Q-table has been established over 100,000 episodes

In [25]:
"""Q-table has been established over 100,000 episodes"""
q_table[253]

array([ -2.41836112,  -2.41834853,  -2.27325184,  -2.41836473,
       -11.36355604, -11.36242973])

The highest Q-value (-2.27) belongs to action 2, "east", so the agent has learned that moving east is the best action from the state illustrated above. It seems our agent has learned from Q-learning.
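Reading the best action out of a Q-table row is a plain argmax. The sketch below copies the six values printed for state 253 and maps the winning index back to an action name; it uses a Python list so that it runs without NumPy.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Q-values for state 253, copied from the output above.
q_row_253 = [-2.41836112, -2.41834853, -2.27325184,
             -2.41836473, -11.36355604, -11.36242973]

# Index of the largest Q-value, i.e. the greedy action.
best_action = max(range(len(q_row_253)), key=lambda a: q_row_253[a])
print(best_action, ACTION_NAMES[best_action])   # 2 east

# All actions ranked from most to least preferred.
ranked = sorted(range(len(q_row_253)), key=lambda a: q_row_253[a], reverse=True)
print([ACTION_NAMES[a] for a in ranked])
# ['east', 'north', 'south', 'west', 'dropoff', 'pickup']
```

The two illegal actions sit far below the four moves, which reflects the -10 penalty they receive in this state.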

### Evaluating the Agent

In [26]:
"""Evaluate agent's performance after Q-learning"""

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    
    done = False
    
    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes:
Average timesteps per episode: 13.14
Average penalties per episode: 0.0


The evaluation shows that the agent's performance improved significantly: it needed about 13 timesteps per episode on average and incurred no penalties, which means it performed the correct pickup and dropoff actions for 100 different passengers.