<a href="https://colab.research.google.com/github/aliyah-smith/reinforcement_learning_tutorial/blob/master/Q_learning_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial shows how to implement a Q-learning algorithm for two simple toy problems. Comment or uncomment the relevant lines in the code below to choose which game to run.


> The Frozen Lake environment is a 4x4 grid, so both the state space and the action space are discrete: the agent occupies one of the 16 squares and can move up, down, left, or right. The goal of the game is to move from the start block to the end block without falling into one of the dangerous holes. There is a +1 reward for reaching the goal and 0 reward otherwise.


> The Taxi environment is a 5x5 grid. There are four marked locations in the grid, and the taxi learns to pick up a passenger at one of the locations and drop them off at another in the shortest amount of time. There is a +20 reward for a successful drop-off, a -10 penalty for illegal pick-up and drop-off actions, and a -1 penalty at every timestep.



This tutorial is modified from Arthur Juliani, "Simple Reinforcement Learning with Tensorflow", 2016.

In [None]:
import gym 
import numpy as np
import random


# Load an OpenAI Gym environment (uncomment Frozen Lake to use it instead of Taxi)
#env = gym.make('FrozenLake-v0')
env = gym.make('Taxi-v3')



**Initialize Q-table**
> The Q-table keeps track of the Q-value (the expected long-term reward from taking action a in state s) for each state-action pair. For Frozen Lake, the Q-table is 16 x 4 (16 states, 4 actions); for Taxi, it is 500 x 6 (500 states, 6 actions).




In [None]:
Q = np.zeros([env.observation_space.n,env.action_space.n])
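
The table's shape depends on which environment is loaded. As a rough sketch that runs without Gym installed, the sizes below are worked out by hand. The Frozen Lake counts come from the description above; the Taxi counts (25 taxi squares, 5 passenger locations including "in the taxi", 4 destinations, 6 actions) and the `encode_taxi_state` helper are assumptions about how Gym's Taxi-v3 lays out its states, and the helper name is hypothetical.

```python
import numpy as np

# Frozen Lake: 4x4 grid and 4 moves (from the description above)
frozen_lake_states = 4 * 4
frozen_lake_actions = 4

# Taxi (assumed layout): 25 taxi squares x 5 passenger locations x 4 destinations
taxi_states = 5 * 5 * 5 * 4
taxi_actions = 6  # assumed: 4 moves + pick up + drop off


def encode_taxi_state(taxi_row, taxi_col, passenger_location, destination):
    """Hypothetical helper: flatten the four components into one row index."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination


Q_frozen_lake = np.zeros([frozen_lake_states, frozen_lake_actions])
Q_taxi = np.zeros([taxi_states, taxi_actions])

print(Q_frozen_lake.shape)            # (16, 4)
print(Q_taxi.shape)                   # (500, 6)
print(encode_taxi_state(4, 4, 4, 3))  # 499, the last row of the table
```

In the notebook itself, `env.observation_space.n` and `env.action_space.n` supply these sizes automatically, so nothing needs to be hard-coded.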

**Set the parameters**

*   learning rate: value between 0 and 1 which controls how fast learning occurs (0 - nothing is learned, closer to 1 - learning occurs quickly)
*   discount factor (gamma): value between 0 and 1 which controls whether immediate or future rewards are prioritized (0 - only immediate rewards matter, closer to 1 - future rewards matter more)
*   epsilon: parameter for the epsilon-greedy policy - epsilon is the probability that a random action will be taken







In [None]:

# Frozen Lake Parameters
#learning_rate = 0.6
#gamma = 0.95

# Taxi Parameters
learning_rate = 0.1 
gamma = 0.6 
epsilon = 0.1

# Set Number of episodes
episodes = 2000

# Initialize the list that will store the total reward per episode
rewards = []
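
To see how gamma trades off immediate against future rewards, the sketch below computes the discounted return of one made-up reward sequence under three discount factors. The `discounted_return` helper and the example sequence are hypothetical illustrations, not part of the original notebook.

```python
def discounted_return(reward_sequence, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay in steps."""
    total = 0.0
    for delay, reward in enumerate(reward_sequence):
        total += (gamma ** delay) * reward
    return total


# Hypothetical episode: three ordinary steps, then a successful drop-off
example_rewards = [-1, -1, -1, 20]

for g in (0.0, 0.6, 0.95):
    print(g, round(discounted_return(example_rewards, g), 4))
```

With gamma = 0 only the first -1 counts, so the return is -1.0. With gamma = 0.6 the delayed +20 is worth 20 x 0.6^3 = 4.32 and the return is 2.36, while gamma = 0.95 gives about 14.295.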




**Action Policy**
Set up the Q-learning algorithm with one of two action policies that balance exploration and exploitation:

1.   Action that maximizes Q + noise - the Frozen Lake parameters are tuned for this method
2.   Epsilon-greedy policy - the Taxi parameters are tuned for this method

The epsilon-greedy policy: with probability epsilon, the agent performs a random action. With probability 1-epsilon, the agent chooses the action that maximizes Q.
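
Both policies can be sketched as small standalone functions operating on one row of the Q-table. The function names, the made-up Q-values, and the use of NumPy's `default_rng` generator are assumptions for illustration; the notebook's own loop inlines the same logic.

```python
import random
import numpy as np


def epsilon_greedy(q_row, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.uniform(0, 1) < epsilon:
        return rng.randrange(len(q_row))
    return int(np.argmax(q_row))


def noisy_greedy(q_row, episode_index, np_rng):
    """Greedy action on Q plus Gaussian noise that shrinks as 1/(episode_index + 1)."""
    noise = np_rng.standard_normal(len(q_row)) / (episode_index + 1)
    return int(np.argmax(q_row + noise))


q_row = np.array([0.1, 0.5, 0.2, 0.0])  # made-up Q-values for one state

random.seed(0)
picks = [epsilon_greedy(q_row, epsilon=0.1) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)
print(greedy_share)  # close to 1 - epsilon + epsilon/4

late_action = noisy_greedy(q_row, 10**6, np.random.default_rng(0))
print(late_action)  # noise is negligible this late, so the greedy action wins
```

The greedy action is chosen slightly more often than 1 - epsilon because the random branch can also land on it. The noisy policy explores heavily in early episodes and becomes effectively greedy as the episode index grows.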





In [None]:
# Loop over episodes
for i in range(episodes):
    # Reset the environment and get the first observation
    s = env.reset()
    rewards_all = 0  # total reward collected in this episode
    d = False
    j = 0

    # Q-learning algorithm (at most 99 steps per episode)
    while j < 99:
        j += 1
        # Choose action with decaying noise (which encourages exploration) - Frozen Lake
        #a = np.argmax(Q[s,:]+ np.random.randn(1,env.action_space.n)*(1./(i+1)))
        # OR choose action with the epsilon-greedy policy - Taxi
        if random.uniform(0,1) < epsilon:
            a = env.action_space.sample()
        else:
            a = np.argmax(Q[s,:])

        # Get new state and reward from environment
        s_new,r,d,_ = env.step(a)

        # Update Q-table
        Q[s,a] = Q[s,a] + learning_rate*(r+gamma*np.max(Q[s_new,:]) - Q[s,a])
        rewards_all += r
        s = s_new
        if d:
            break
    # Record the episode's total reward once, after the episode ends
    rewards.append(rewards_all)
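
A single application of the update rule can be worked through by hand. The sketch below uses the Taxi learning rate and discount factor from the parameter cell on a tiny made-up table; `q_update`, `Q_demo`, and the choice of state and action indices are hypothetical.

```python
import numpy as np

learning_rate = 0.1
gamma = 0.6


def q_update(Q, s, a, r, s_new):
    """Apply one tabular Q-learning update in place and return the new value."""
    td_target = r + gamma * np.max(Q[s_new, :])
    Q[s, a] = Q[s, a] + learning_rate * (td_target - Q[s, a])
    return Q[s, a]


Q_demo = np.zeros([2, 6])  # tiny made-up table: 2 states, 6 actions

# Illegal drop-off (action 5 here) in state 0: reward -10, state unchanged
print(q_update(Q_demo, s=0, a=5, r=-10, s_new=0))  # -1.0
# Ordinary move (action 0) from state 0 to state 1: reward -1
print(q_update(Q_demo, s=0, a=0, r=-1, s_new=1))   # -0.1
```

Starting from zeros, one penalized action moves its entry to 0.1 x (-10) = -1.0, which is consistent with a state-action pair that was tried exactly once.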

Print out the final Q-table

In [None]:
print("Score over time: ", sum(rewards)/episodes)
print("Final Q-table Values")
print(Q)


Score over time:  -3661.3525
Final Q-table Values
[[ 0.          0.          0.          0.          0.          0.        ]
 [-2.29593199 -2.29870647 -2.2876985  -2.30060569 -2.27913591 -6.82796348]
 [-1.73443868 -1.75961522 -1.80314659 -1.73494657 -0.7671085  -6.13238327]
 ...
 [-1.19403701 -0.8711333  -1.18986086 -1.23611754 -1.96       -1.96      ]
 [-1.98750763 -2.00167605 -1.99042537 -1.991924   -3.68759462 -5.04540497]
 [-0.196      -0.2878     -0.196       3.8686428  -1.         -1.        ]]
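
Each row of the table can be turned into a greedy action by taking the index of its largest entry. The sketch below does this for three rows copied from the output above. The action names assume the Taxi-v3 ordering (South, North, East, West, Pickup, Dropoff), which should be checked against the installed Gym version.

```python
import numpy as np

# Assumed Taxi-v3 action order; verify against your Gym version
ACTION_NAMES = ["South", "North", "East", "West", "Pickup", "Dropoff"]

# Three rows copied from the printed Q-table above
Q_rows = np.array([
    [-2.29593199, -2.29870647, -2.2876985, -2.30060569, -2.27913591, -6.82796348],
    [-1.73443868, -1.75961522, -1.80314659, -1.73494657, -0.7671085, -6.13238327],
    [-0.196, -0.2878, -0.196, 3.8686428, -1.0, -1.0],
])

# Index of the largest Q-value in each row is the greedy action for that state
greedy_actions = np.argmax(Q_rows, axis=1)
for row, action in zip(Q_rows, greedy_actions):
    print(ACTION_NAMES[action], round(float(row[action]), 4))
```

Applied to the full table, `np.argmax(Q, axis=1)` gives one greedy action per state, which is the policy the agent follows whenever it is not exploring.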


Evaluate the agent's performance after Q-learning

In [None]:
# Run a single evaluation episode; j counts the steps within it
#epsilon = 0.05
s = env.reset()
rewards_all = 0
d = False
j = 0
while j < 99:
    env.render()
    print("Episode: ",j)
    j+=1
    # Choose action with noise - Frozen Lake
    #a = np.argmax(Q[s,:]+ np.random.randn(1,env.action_space.n)*(1./(i+1)))
    # OR choose action with the epsilon-greedy policy - Taxi
    if random.uniform(0,1) < epsilon:
        a = env.action_space.sample()
    else:
        a = np.argmax(Q[s,:])

    # Get new state and reward from environment
    s_new,r,d,_ = env.step(a)

    # Update Q-table (the agent keeps learning during this run)
    Q[s,a] = Q[s,a] + learning_rate*(r+gamma*np.max(Q[s_new,:]) - Q[s,a])
    rewards_all += r
    s = s_new
    if d:
        break

+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : |[43m [0m: |
|[35mY[0m| : |B: |
+---------+

Episode:  0
+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : |[43m [0m: |
|[35mY[0m| : |B: |
+---------+
  (West)
Episode:  1
+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |[43mB[0m: |
+---------+
  (South)
Episode:  2
+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B:[43m [0m|
+---------+
  (East)
Episode:  3
+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B:[43m [0m|
+---------+
  (South)
Episode:  4
+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | : |
|[35mY[0m| : |B:[43m [0m|
+---------+
  (East)
Episode:  5
+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : : |
| | : | :[43m [0m|
|[35mY[0m| : |B: |
+---------+
  (North)
Episode:  6
+---------+
|R: | : :[34;1mG[0m|
| : | : : |
| : : : :[43m [0m|

The agent has learned how to successfully reach the final destination.
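
One way to check a claim like this is to run the learned table greedily, with no random actions and no further Q updates, and compare the result with the known shortest route. The sketch below does this on a made-up five-square corridor rather than on Taxi, so that it runs without Gym. `corridor_step`, `greedy_rollout`, and the corridor rewards are hypothetical stand-ins, not part of the original notebook.

```python
import random
import numpy as np

N_STATES = 5          # hypothetical corridor: states 0..4, goal at state 4
LEFT, RIGHT = 0, 1


def corridor_step(s, a):
    """Toy stand-in for env.step: returns (new_state, reward, done)."""
    s_new = max(0, s - 1) if a == LEFT else s + 1
    done = s_new == N_STATES - 1
    return s_new, (20 if done else -1), done


def greedy_rollout(Q, start_state=0, max_steps=99):
    """Follow argmax(Q) with no exploration and no learning."""
    s, total, done, steps = start_state, 0, False, 0
    while not done and steps < max_steps:
        s, r, done = corridor_step(s, int(np.argmax(Q[s, :])))
        total += r
        steps += 1
    return total, steps, done


# Train with the same update rule and epsilon-greedy policy as the tutorial
random.seed(0)
Q_toy = np.zeros([N_STATES, 2])
for _ in range(500):
    s, done, j = 0, False, 0
    while not done and j < 99:
        j += 1
        if random.uniform(0, 1) < 0.1:
            a = random.randrange(2)
        else:
            a = int(np.argmax(Q_toy[s, :]))
        s_new, r, done = corridor_step(s, a)
        Q_toy[s, a] += 0.1 * (r + 0.6 * np.max(Q_toy[s_new, :]) - Q_toy[s, a])
        s = s_new

total_reward, steps_taken, reached_goal = greedy_rollout(Q_toy)
print(total_reward, steps_taken, reached_goal)
```

The shortest route in this corridor takes four steps for a total reward of -1 - 1 - 1 + 20 = 17, so a greedy rollout that matches those numbers indicates the table encodes the optimal path. The same kind of check on Taxi would replace `corridor_step` with `env.step` and compare the step count against the shortest possible trip.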