# Abstract
This notebook uses reinforcement learning and OpenAI Gym to train an agent to play the Frozen Lake game.
The goal of the game is to reach the goal tile without falling into a hole: the agent starts from the starting state (S) and walks over frozen tiles (F), avoiding the holes (H), until it reaches the goal state (G). While learning, the agent balances exploration against exploitation. We use Q-learning, which is a value-based reinforcement learning algorithm.

In [2]:
import numpy as np
import gym
import random

In [3]:
# Create the Frozen Lake environment using OpenAI Gym
env = gym.make("FrozenLake-v0")

In [4]:
# Read the action and state space sizes, which give the Q-table's dimensions
action_size = env.action_space.n
print("Action size", action_size)
state_size = env.observation_space.n
print("State size", state_size)

Action size 4
State size 16
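
To make the 16 states and 4 actions concrete, the sketch below maps a state index to its grid cell. It assumes the default 4x4 map that `env.render()` prints later in this notebook and row-major state numbering; `GRID`, `ACTIONS` and `describe_state` are illustrative names, not part of Gym.

```python
# Default 4x4 FrozenLake layout, as printed by env.render() in this notebook.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

# Action indices used by FrozenLake (0..3).
ACTIONS = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}

def describe_state(state, grid=GRID):
    """Return (row, col, tile) for a state index, assuming row-major numbering."""
    n_cols = len(grid[0])
    row, col = divmod(state, n_cols)
    return row, col, grid[row][col]

# Terminal tiles: holes end the episode with reward 0, the goal with reward 1.
hole_states = [s for s in range(16) if describe_state(s)[2] == "H"]
goal_state = next(s for s in range(16) if describe_state(s)[2] == "G")
print(hole_states, goal_state)
```

The hole indices found this way are the rows that stay all-zero in the trained Q-table, since the agent never acts from a terminal state.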


In [5]:
qtable = np.zeros((state_size, action_size))
print(qtable)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [6]:
# Setting up hyperparameters
total_episodes = 15000        # Total episodes
learning_rate = 0.8           # Learning rate
max_steps = 99                # Max steps per episode
gamma = 0.95                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.005             # Exponential decay rate for exploration prob
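
The exploration parameters define an exponentially decaying epsilon. The sketch below evaluates the same schedule the training loop applies after each episode, using the values from this cell as defaults; `epsilon_at` is an illustrative helper name.

```python
import math

def epsilon_at(episode, min_epsilon=0.01, max_epsilon=1.0, decay_rate=0.005):
    """Exploration probability after `episode` episodes of exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)

# Print the schedule at a few points to see how quickly exploration fades.
for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(epsilon_at(episode), 4))
```

A larger `decay_rate` makes the agent switch to exploitation sooner; epsilon never drops below `min_epsilon`.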

Running RL on Training Data

In [7]:
#Policy with maxQ(s', a')

rewards = []

for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## Exploitation (taking the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        # Exploration
        else:
            action = env.action_space.sample()

        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards += reward
        
        # The new state becomes the current state
        state = new_state
        
        # If the episode is over (goal reached or fell into a hole), stop stepping
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)

print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(qtable)

Score over time: 0.482
[[1.71197542e-01 2.45175588e-02 6.07797955e-02 2.78679865e-02]
 [7.28160770e-04 1.98629638e-04 1.20906018e-02 1.52703096e-02]
 [2.79590879e-03 7.45299671e-03 1.32275040e-01 9.88112260e-03]
 [4.88487075e-03 2.51783451e-03 1.83184675e-03 1.13642586e-02]
 [2.03112621e-01 4.11257475e-03 5.55047555e-02 3.98240792e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.24333031e-01 7.35268576e-06 4.22380754e-05 3.27672527e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [9.06929271e-03 5.40340226e-02 5.79250972e-03 3.62936749e-01]
 [1.27897253e-01 5.00698746e-01 1.04849127e-02 1.79446000e-02]
 [4.82625347e-01 7.71293395e-04 4.90190887e-05 4.61339719e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.72165200e-02 8.97850517e-02 7.76263629e-01 8.54193277e-02]
 [2.47001452e-01 9.81117860e-01 2.39442387e-01 1.91151784e-01]
 [0.00000000e+00 0.00000000e+00 
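
As a worked example of the update rule used in the training loop, the sketch below applies it to two hand-picked transitions on an empty table. The transitions are hypothetical (the real environment is stochastic), and `q_update` is an illustrative helper, not a function from the notebook.

```python
import numpy as np

def q_update(qtable, state, action, reward, new_state, learning_rate=0.8, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    td_target = reward + gamma * np.max(qtable[new_state, :])
    td_error = td_target - qtable[state, action]
    qtable[state, action] += learning_rate * td_error
    return qtable[state, action]

q = np.zeros((16, 4))
# Hypothetical step from state 14 into the goal (state 15) with reward 1.
first = q_update(q, state=14, action=2, reward=1.0, new_state=15)
# A later step from state 13 into state 14 with reward 0 pulls that value backwards.
second = q_update(q, state=13, action=2, reward=0.0, new_state=14)
print(first, second)
```

The first update moves Q(14, Right) to 0.8 * 1, and the second moves Q(13, Right) to 0.8 * 0.95 * 0.8, which is how reward information spreads from the goal toward the start.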

In [8]:
env.reset()

for episode in range(5):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Only render the final state, to see whether the agent reached the goal or fell into a hole
            env.render()
            if new_state == 15:
                print("We reached our Goal 🎯")
            else:
                print("We fell into a hole 💀")
            
            # Print the zero-based index of the last step (steps taken = step + 1)
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 32
****************************************************
EPISODE  1
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 22
****************************************************
EPISODE  2
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 33
****************************************************
EPISODE  3
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 9
****************************************************
EPISODE  4
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 49
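
The five episodes printed above can be summarized as a success rate and a mean episode length. This is a minimal sketch that copies the outcomes from that output by hand; `summarize` is an illustrative helper, and it assumes the printed number is the zero-based loop index `step`.

```python
# (reached_goal, last_step_index) for the five episodes printed above.
episodes = [(True, 32), (True, 22), (True, 33), (True, 9), (False, 49)]

def summarize(episodes):
    """Return (success_rate, mean_steps) for a list of (reached_goal, last_step_index)."""
    n = len(episodes)
    success_rate = sum(1 for reached, _ in episodes if reached) / n
    # `step` is zero-based, so the number of actions taken is step + 1.
    mean_steps = sum(step + 1 for _, step in episodes) / n
    return success_rate, mean_steps

print(summarize(episodes))
```

Collecting `step + 1` for every episode and averaging the list is one way to obtain a true average number of steps per episode.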


Running RL on Test Data

In [9]:
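# Hyperparameters recorded for the test run
# Note: the evaluation loop in the next cell only reads qtable; alpha, gamma, epsilon and decay_rate are not used by it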
total_episodes = 5000
total_test_episodes = 100
max_steps = 99 
alpha= 0.7                # Learning rate 
gamma = 0.8               # Discounting rate 
epsilon = 1.0             # Exploration rate
decay_rate = 0.01        # Exponential decay rate

In [10]:
# Evaluate the greedy policy (argmax over the Q-table) without exploration or learning
env.reset()
    
total_rewards_1=[]
test_episodes_1=[]

for episode in range(1,total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    rewards_current_episode = 0
    for step in range(max_steps):
        action = np.argmax(qtable[state,:])
        new_state, reward, done, info = env.step(action)
        rewards_current_episode += reward
        if done:
            total_rewards_1.append(rewards_current_episode)
            test_episodes_1.append(episode)
            break
        state = new_state
env.close()

rewards_per_hundred_episodes=np.split(np.array(total_rewards_1),total_test_episodes/100)

print ("Score over time: " +  str(sum(total_rewards_1)/total_test_episodes))
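# Note: `step` holds only the last episode's final step index, so this prints
# last_step / total_test_episodes rather than a true per-episode average
print('Average steps taken per episode: ', step/total_test_episodes)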
print(qtable)

Score over time: 0.62
Average steps taken per episode:  0.61
[[1.71197542e-01 2.45175588e-02 6.07797955e-02 2.78679865e-02]
 [7.28160770e-04 1.98629638e-04 1.20906018e-02 1.52703096e-02]
 [2.79590879e-03 7.45299671e-03 1.32275040e-01 9.88112260e-03]
 [4.88487075e-03 2.51783451e-03 1.83184675e-03 1.13642586e-02]
 [2.03112621e-01 4.11257475e-03 5.55047555e-02 3.98240792e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.24333031e-01 7.35268576e-06 4.22380754e-05 3.27672527e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [9.06929271e-03 5.40340226e-02 5.79250972e-03 3.62936749e-01]
 [1.27897253e-01 5.00698746e-01 1.04849127e-02 1.79446000e-02]
 [4.82625347e-01 7.71293395e-04 4.90190887e-05 4.61339719e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.72165200e-02 8.97850517e-02 7.76263629e-01 8.54193277e-02]
 [2.47001452e-01 9.81117860e-01 2.39442387e-01 1.91151784

In [11]:
env.reset()

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Only render the final state, to see whether the agent reached the goal or fell into a hole
            env.render()
            if new_state == 15:
                print("We reached our Goal 🎯")
            else:
                print("We fell into a hole 💀")
            
            # Print the zero-based index of the last step (steps taken = step + 1)
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 19
****************************************************
EPISODE  1
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 40
****************************************************
EPISODE  2
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 37
****************************************************
EPISODE  3
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 19
****************************************************
EPISODE  4
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 19
****************************************************
EPISODE  5
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 20
****************************************************
EPISODE  6
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 42

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 56
****************************************************
EPISODE  73
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 58
****************************************************
EPISODE  74
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 30
****************************************************
EPISODE  75
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 20
****************************************************
EPISODE  76
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 19
****************************************************
EPISODE  77
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 36
****************************************************
EPISODE  78
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 77
****************************************************
EPIS

In [12]:
# Change alpha and gamma; keep epsilon and decay_rate the same
total_episodes = 5000
total_test_episodes = 100
max_steps = 99 
alpha= 0.65 
gamma = 0.90
epsilon = 1.0 
decay_rate = 0.01 

In [13]:
# Evaluate the greedy policy (argmax over the Q-table) without exploration or learning
env.reset()
    
total_rewards=[]
test_episodes=[]

for episode in range(1,total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    rewards_current_episode = 0
    for step in range(max_steps):
        action = np.argmax(qtable[state,:])
        new_state, reward, done, info = env.step(action)
        rewards_current_episode += reward
        if done:
            total_rewards.append(rewards_current_episode)
            test_episodes.append(episode)
            break
        state = new_state
env.close()

rewards_per_hundred_episodes=np.split(np.array(total_rewards),total_test_episodes/100)

print ("Score over time: " +  str(sum(total_rewards)/total_test_episodes))
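# Note: `step` holds only the last episode's final step index, so this prints
# last_step / total_test_episodes rather than a true per-episode average
print('Average steps taken per episode: ', step/total_test_episodes)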
print(qtable)

Score over time: 0.74
Average steps taken per episode:  0.11
[[1.71197542e-01 2.45175588e-02 6.07797955e-02 2.78679865e-02]
 [7.28160770e-04 1.98629638e-04 1.20906018e-02 1.52703096e-02]
 [2.79590879e-03 7.45299671e-03 1.32275040e-01 9.88112260e-03]
 [4.88487075e-03 2.51783451e-03 1.83184675e-03 1.13642586e-02]
 [2.03112621e-01 4.11257475e-03 5.55047555e-02 3.98240792e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.24333031e-01 7.35268576e-06 4.22380754e-05 3.27672527e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [9.06929271e-03 5.40340226e-02 5.79250972e-03 3.62936749e-01]
 [1.27897253e-01 5.00698746e-01 1.04849127e-02 1.79446000e-02]
 [4.82625347e-01 7.71293395e-04 4.90190887e-05 4.61339719e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.72165200e-02 8.97850517e-02 7.76263629e-01 8.54193277e-02]
 [2.47001452e-01 9.81117860e-01 2.39442387e-01 1.91151784

In [14]:
env.reset()

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Only render the final state, to see whether the agent reached the goal or fell into a hole
            env.render()
            if new_state == 15:
                print("We reached our Goal 🎯")
            else:
                print("We fell into a hole 💀")
            
            # Print the zero-based index of the last step (steps taken = step + 1)
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 35
****************************************************
EPISODE  1
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 97
****************************************************
EPISODE  2
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 62
****************************************************
EPISODE  3
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 22
****************************************************
EPISODE  4
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 68
****************************************************
EPISODE  5
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 20
****************************************************
EPISODE  6
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 16

  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 15
****************************************************
EPISODE  83
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 47
****************************************************
EPISODE  84
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 78
****************************************************
EPISODE  85
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 24
****************************************************
EPISODE  86
****************************************************
EPISODE  87
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 32
****************************************************
EPISODE  88
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 18
****************************************************
EPISODE  89
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of s

In [15]:
# Change epsilon and decay_rate; keep alpha and gamma the same
total_episodes = 5000
total_test_episodes = 100
max_steps = 99 
alpha= 0.65 
gamma = 0.9 
epsilon = 0.9 
decay_rate = 0.02

In [16]:
# Evaluate the greedy policy (argmax over the Q-table) without exploration or learning
env.reset()
    
total_rewards_2=[]
test_episodes_2=[]

for episode in range(1,total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    rewards_current_episode = 0
    for step in range(max_steps):
        action = np.argmax(qtable[state,:])
        new_state, reward, done, info = env.step(action)
        rewards_current_episode += reward
        if done:
            total_rewards_2.append(rewards_current_episode)
            test_episodes_2.append(episode)
            break
        state = new_state
env.close()

rewards_per_hundred_episodes=np.split(np.array(total_rewards_2),total_test_episodes/100)

print ("Score over time: " +  str(sum(total_rewards_2)/total_test_episodes))
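# Note: `step` holds only the last episode's final step index, so this prints
# last_step / total_test_episodes rather than a true per-episode average
print('Average steps taken per episode: ', step/total_test_episodes)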
print(qtable)

Score over time: 0.73
Average steps taken per episode:  0.85
[[1.71197542e-01 2.45175588e-02 6.07797955e-02 2.78679865e-02]
 [7.28160770e-04 1.98629638e-04 1.20906018e-02 1.52703096e-02]
 [2.79590879e-03 7.45299671e-03 1.32275040e-01 9.88112260e-03]
 [4.88487075e-03 2.51783451e-03 1.83184675e-03 1.13642586e-02]
 [2.03112621e-01 4.11257475e-03 5.55047555e-02 3.98240792e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.24333031e-01 7.35268576e-06 4.22380754e-05 3.27672527e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [9.06929271e-03 5.40340226e-02 5.79250972e-03 3.62936749e-01]
 [1.27897253e-01 5.00698746e-01 1.04849127e-02 1.79446000e-02]
 [4.82625347e-01 7.71293395e-04 4.90190887e-05 4.61339719e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.72165200e-02 8.97850517e-02 7.76263629e-01 8.54193277e-02]
 [2.47001452e-01 9.81117860e-01 2.39442387e-01 1.91151784
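
The three evaluation cells above act greedily on the same `qtable` (the printed tables are identical), so the different scores (0.62, 0.74, 0.73) most likely reflect run-to-run randomness in the environment rather than the changed hyperparameters. As a rough check, the sketch below estimates the standard error of a success rate, assuming independent episodes with a fixed success probability. `success_rate_std_error` is an illustrative helper, and using 100 as the episode count is an approximation, since each loop runs `range(1, total_test_episodes)`.

```python
import math

def success_rate_std_error(p, n):
    """Standard error of a success rate p estimated from n independent episodes."""
    return math.sqrt(p * (1.0 - p) / n)

scores = [0.62, 0.74, 0.73]   # "Score over time" values printed by the evaluation cells
n_episodes = 100              # nominal divisor used in those cells
for p in scores:
    print(p, round(success_rate_std_error(p, n_episodes), 3))
```

With a standard error of roughly 0.04 to 0.05 per run, gaps of about 0.1 between runs are not clearly larger than what sampling noise alone could produce.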

In [17]:
env.reset()

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Only render the final state, to see whether the agent reached the goal or fell into a hole
            env.render()
            if new_state == 15:
                print("We reached our Goal 🎯")
            else:
                print("We fell into a hole 💀")
            
            # Print the zero-based index of the last step (steps taken = step + 1)
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 71
****************************************************
EPISODE  1
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 43
****************************************************
EPISODE  2
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 16
****************************************************
EPISODE  3
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 9
****************************************************
EPISODE  4
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 31
****************************************************
EPISODE  5
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 38
****************************************************
EPISODE  6
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 23


  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 12
****************************************************
EPISODE  75
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 14
****************************************************
EPISODE  76
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 74
****************************************************
EPISODE  77
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 89
****************************************************
EPISODE  78
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 69
****************************************************
EPISODE  79
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 18
****************************************************
EPISODE  80
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 51
****************************************************
EPIS

Policy Change

In [18]:
# Hyperparameters for the min-Q training run
# Note: the training loop in the next cell uses learning_rate from In [6], not alpha
total_episodes = 5000
total_test_episodes = 100
max_steps = 99 
alpha= 0.7                # Learning rate 
gamma = 0.8               # Discounting rate 
epsilon = 1.0             # Exploration rate
decay_rate = 0.01        # Exponential decay rate

In [19]:
# Policy with minQ(s', a')
rewards_1 = []

for episode in range(total_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards_1 = 0
    
    for step in range(max_steps):
        exp_exp_tradeoff = random.uniform(0, 1)
        
        
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        
        else:
            action = env.action_space.sample()

        
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * min Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.min(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards_1 += reward
        
        
        state = new_state
        
        
        if done == True: 
            break
        
    
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards_1.append(total_rewards_1)

print ("Score over time: " +  str(sum(rewards_1)/total_episodes))
print(qtable)

Score over time: 0.0474
[[4.79689747e-45 1.87635444e-42 6.31141461e-45 5.49688514e-45]
 [4.02567735e-44 7.36455488e-38 1.65149572e-43 1.45149113e-43]
 [1.09382202e-31 1.39971545e-36 4.26557797e-28 9.22044617e-38]
 [2.36044448e-39 9.80347750e-43 9.23665567e-43 9.71600283e-43]
 [2.69149409e-41 4.45891371e-40 2.18474099e-41 1.19681003e-41]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.53891215e-27 3.78205922e-19 6.41386371e-27 5.33887412e-28]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [5.37567326e-35 2.10031516e-34 1.18756421e-34 3.98812624e-22]
 [6.56448537e-22 2.27197704e-20 6.45197076e-22 5.01883296e-22]
 [3.20016401e-18 1.00823031e-14 6.64483101e-19 4.92455584e-19]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.21601000e-19 5.00123388e-14 3.11174197e-20 2.56155908e-20]
 [3.89971447e-13 2.66257176e-04 3.16399431e-12 5.29811174e-12]
 [0.00000000e+00 0.00000000e+00

In [20]:
env.reset()

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Only render the final state, to see whether the agent reached the goal or fell into a hole
            env.render()
            if new_state == 15:
                print("We reached our Goal 🎯")
            else:
                print("We fell into a hole 💀")
            
            # Print the zero-based index of the last step (steps taken = step + 1)
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 3
****************************************************
EPISODE  1
  (Down)
SFFF
FHF[41mH[0m
FFFH
HFFG
We fell into a hole 💀
Number of steps 4
****************************************************
EPISODE  2
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 1
****************************************************
EPISODE  3
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 2
****************************************************
EPISODE  4
  (Down)
SFFF
FHFH
FFF[41mH[0m
HFFG
We fell into a hole 💀
Number of steps 7
****************************************************
EPISODE  5
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 2
****************************************************
EPISODE  6
  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG
We fell into a hole 💀
Number of steps 3
******

In [21]:
# Hyperparameters for the avg-Q training run
# Note: the training loop in the next cell uses learning_rate from In [6], not alpha
total_episodes = 5000
total_test_episodes = 100
max_steps = 99 
alpha= 0.7                # Learning rate 
gamma = 0.8               # Discounting rate 
epsilon = 1.0             # Exploration rate
decay_rate = 0.01        # Exponential decay rate

In [22]:
# Policy with avgQ(s', a')
rewards_1 = []

for episode in range(total_episodes):
    state = env.reset()
    step = 0
    done = False
    total_rewards_1 = 0
    
    for step in range(max_steps):
        exp_exp_tradeoff = random.uniform(0, 1)
        
        
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])

        
        else:
            action = env.action_space.sample()

        
        new_state, reward, done, info = env.step(action)

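        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * np.all(Q(s',:)) - Q(s,a)]
        # Note: np.all returns True/False (1 or 0), not the average; np.mean(qtable[new_state, :]) would give avg Q(s',a')
        # qtable[new_state,:] : all the actions we can take from new state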
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.all(qtable[new_state, :]) - qtable[state, action])
        
        total_rewards_1 += reward
        
        
        state = new_state
        
        
        if done == True: 
            break
        
    
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards_1.append(total_rewards_1)

print ("Score over time: " +  str(sum(rewards_1)/total_episodes))
print(qtable)

Score over time: 0.6066
[[8.00000000e-01 8.00000000e-01 8.00000000e-01 8.00000000e-01]
 [7.94879607e-01 6.65806835e-01 7.73367401e-01 8.00000000e-01]
 [8.00000000e-01 8.00000000e-01 8.00000000e-01 7.99999999e-01]
 [7.73171200e-01 7.98975659e-01 6.66625638e-01 8.00000000e-01]
 [8.00000000e-01 7.94880000e-01 7.93651131e-01 1.59754171e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [5.33305754e-03 4.10260855e-05 2.56508035e-02 1.96620583e-06]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [7.98975987e-01 7.73162929e-01 6.45375659e-01 8.00000000e-01]
 [1.59989842e-01 8.00000000e-01 7.99957336e-01 7.94675197e-01]
 [8.00000000e-01 1.54634171e-01 7.99957399e-01 7.74145655e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [6.70975591e-01 1.58933073e-01 8.00000000e-01 7.94837320e-01]
 [8.00000000e-01 9.92256410e-01 8.00000000e-01 8.00000003e-01]
 [0.00000000e+00 0.00000000e+00
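
To see how the choice of backup changes the bootstrapped term, the sketch below applies `np.max`, `np.min`, `np.mean` and `np.all` to one row of Q-values. The row is state 14 from the max-Q training output earlier in the notebook, and `gamma = 0.8` matches the setting for this run; the `backups` dictionary is only for illustration.

```python
import numpy as np

# Q-values of state 14 after the max-Q training run, copied from the table printed earlier.
row = np.array([2.47001452e-01, 9.81117860e-01, 2.39442387e-01, 1.91151784e-01])
gamma = 0.8

backups = {
    "max": np.max(row),     # standard Q-learning target
    "min": np.min(row),     # pessimistic variant used in the min-Q cell
    "mean": np.mean(row),   # an actual average over the next actions
    "all": np.all(row),     # boolean: True when every entry is non-zero
}
for name, value in backups.items():
    print(name, gamma * value)
```

Because `np.all` yields `True` for any row without zeros, the term `gamma * np.all(...)` equals `gamma` itself, which would be consistent with the many entries near 0.8 in the table printed above.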

In [23]:
env.reset()

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward in this state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Only render the final state, to see whether the agent reached the goal or fell into a hole
            env.render()
            if new_state == 15:
                print("We reached our Goal 🎯")
            else:
                print("We fell into a hole 💀")
            
            # Print the zero-based index of the last step (steps taken = step + 1)
            print("Number of steps", step)
            break
        state = new_state
env.close()

****************************************************
EPISODE  0
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 25
****************************************************
EPISODE  1
  (Right)
SFFF
FHF[41mH[0m
FFFH
HFFG
We fell into a hole 💀
Number of steps 33
****************************************************
EPISODE  2
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 66
****************************************************
EPISODE  3
  (Right)
SFFF
FHF[41mH[0m
FFFH
HFFG
We fell into a hole 💀
Number of steps 21
****************************************************
EPISODE  4
  (Right)
SFFF
FHF[41mH[0m
FFFH
HFFG
We fell into a hole 💀
Number of steps 44
****************************************************
EPISODE  5
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 76
****************************************************
EPISODE  6
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps

  (Right)
SFFF
FHF[41mH[0m
FFFH
HFFG
We fell into a hole 💀
Number of steps 45
****************************************************
EPISODE  63
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 14
****************************************************
EPISODE  64
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 39
****************************************************
EPISODE  65
****************************************************
EPISODE  66
  (Right)
SFFF
FHF[41mH[0m
FFFH
HFFG
We fell into a hole 💀
Number of steps 60
****************************************************
EPISODE  67
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 16
****************************************************
EPISODE  68
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of steps 35
****************************************************
EPISODE  69
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
We reached our Goal 🎯
Number of

* Establish a baseline performance. How well did your RL Q-learning do on your problem?
--> The baseline was established on the training run with the maxQ(s', a') update. Q-learning reached an average score of about 0.48 per episode (the training cell prints 0.482).
* What are the states, the actions and the size of the Q-table?
--> Each state is one of the 16 tiles of the 4x4 grid; a tile is either the starting state (S), a frozen tile (F), a hole (H) or the goal state (G). The actions are the 4 moves the agent can make (left, down, right, up) to reach the goal state without falling into a hole. The Q-table therefore has size 16 x 4.
* What are the rewards? Why did you choose them?
--> The reward is 1 if the agent reaches the goal state and 0 otherwise. These rewards are the ones defined by the Frozen Lake environment itself. An episode ends when the agent reaches the goal state or falls into a hole.
* How did you choose alpha and gamma in the following equation?  
--> Initially we use the default values given for alpha and gamma, then decrease alpha and increase gamma from those defaults to see how the performance changes.
* Try a policy other than maxQ(s', a'). How did it change the baseline performance? 
--> We tried minQ(s', a') in place of maxQ(s', a'). Performance dropped significantly: the training score fell from 0.482 to 0.0474.
* What is the average number of steps taken per episode?
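--> The value reported was 0.24. The test cells compute this figure as `step/total_test_episodes`, that is, the last episode's final step index divided by the number of test episodes, so it is not a true per-episode average; the episodes printed in the test output take between a few and several dozen steps each.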
* Does Q-learning use value-based or policy-based iteration?
--> Q-learning is value-based. It iteratively improves the Q-table, and the agent then picks the action with the highest Q-value in each state to follow the best path.
* What is meant by expected lifetime value in the Bellman equation?
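--> The expected lifetime value is the expected sum of discounted rewards the agent collects from the current state until the end of the episode. In principle the sum can run over anywhere from zero to infinitely many future steps.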
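
As a small illustration of the discounted sum behind the expected lifetime value, the sketch below computes the return of a single reward sequence. `discounted_return` is an illustrative helper, and the 10-step episode is a made-up example rather than one taken from the runs above.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode's reward sequence."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Frozen Lake pays 1 only on the step that reaches the goal, so an episode that
# reaches it on its 10th action has return gamma**9.
rewards = [0.0] * 9 + [1.0]
print(discounted_return(rewards, gamma=0.95))
```

The expected lifetime value of a state is the average of such returns over all the trajectories the agent might follow from that state, so shorter paths to the goal are worth more when gamma is below 1.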


# Conclusion
We learned to use the OpenAI Gym library to create an environment and train a reinforcement learning agent in it. The Frozen Lake environment has simple rules, which allowed us to explore the fundamental concepts of reinforcement learning.

# Contribution
The notebook was written from scratch, with about 40% of the code taken from the original source. All evaluations and algorithm experiments are my own.

# Citations
1. Application of Reinforcement Learning: https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12

2. Q Learning and Bellman Equations: https://www.freecodecamp.org/news/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe/

3. Open Gym and Frozen Lake: https://reinforcementlearning4.fun/2019/06/16/gym-tutorial-frozen-lake/

# License
Copyright 2020 Sohail Budhwani

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.