# FrozenLake-v0 using Q-learning

The goal of this game is to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) while avoiding holes (H). However, the ice is slippery, so you won't always move in the direction you intend: the environment is stochastic.
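To make the "slippery" behaviour concrete, here is a minimal sketch that does not use Gym at all. It assumes the default slippery rule of FrozenLake, where the intended direction and the two perpendicular directions are each taken with probability 1/3; the helper `slippery_action` is a hypothetical name introduced only for this illustration.

```python
import random

# Action encoding used by FrozenLake: 0 = Left, 1 = Down, 2 = Right, 3 = Up.
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

def slippery_action(intended, rng):
    """Return the direction actually taken on slippery ice.

    Assumption: the intended direction and its two perpendicular
    neighbours are equally likely (1/3 each); the opposite direction
    is never taken.
    """
    return rng.choice([(intended - 1) % 4, intended, (intended + 1) % 4])

rng = random.Random(0)  # fixed seed so the sketch is repeatable
counts = {name: 0 for name in ACTION_NAMES}
for _ in range(3000):
    # The agent always *intends* to go Right (action 2).
    counts[ACTION_NAMES[slippery_action(2, rng)]] += 1

print(counts)  # "Left" stays at 0; the other three share the 3000 moves
```

Under this assumption the agent only moves where it intends about a third of the time, which is why even a well-trained policy still falls into holes occasionally.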

In [1]:
import numpy as np
import gym
import random

## Step 1: Create the environment

1. Here we'll create the FrozenLake environment.
2. OpenAI Gym is a library composed of many environments that we can use to train our agents.
3. In our case we choose to use Frozen Lake.



In [6]:
env = gym.make("FrozenLake-v0")
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG



## Step 2: Create the Q-table and initialize it 


   1. Now we'll create our Q-table. To know how many rows (states) and columns (actions) it needs, we first compute action_size and state_size.
   2. OpenAI Gym gives us both values directly: env.action_space.n and env.observation_space.n
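As a rough illustration of where the 16 rows and 4 columns come from, the sketch below builds the same table shape without Gym. It assumes states are numbered row by row across the 4x4 map shown earlier; `state_to_cell` is a hypothetical helper, not part of Gym.

```python
import numpy as np

n_rows, n_cols = 4, 4            # size of the 4x4 map rendered above
state_size = n_rows * n_cols     # one state per tile
action_size = 4                  # Left, Down, Right, Up

def state_to_cell(state):
    """Map a state index to (row, col), assuming row-major numbering."""
    return divmod(state, n_cols)

q_table = np.zeros((state_size, action_size))
print(q_table.shape)         # (16, 4)
print(state_to_cell(5))      # (1, 1): the first hole in the row "FHFH"
print(state_to_cell(15))     # (3, 3): the goal tile G
```

Each row of the table holds the estimated value of the four actions for one tile, so `q_table[5, :]` would describe the hole at row 1, column 1.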



In [3]:
action_size = env.action_space.n
state_size = env.observation_space.n

In [4]:
q_table = np.zeros((state_size,action_size))
q_table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

## Step 3: Hyperparameters

In [27]:
total_episodes = 10000
total_test_episodes = 5
max_steps = 99

learning_rate = 0.8 
# gamma = discounting rate
gamma = 0.95

'''   Exploration parameters
epsilon = exploration rate
max_epsilon = starting epsilon value
min_epsilon = minimum exploration probability
decay_rate = exponential decay factor
'''
epsilon = 1.0

max_epsilon = 1.0
min_epsilon = 0.005
decay_rate = 0.005
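To get a feel for how quickly exploration fades with these values, the following sketch evaluates the exponential schedule used in the training loop at a few episode numbers. The function name `epsilon_at` is introduced here only for illustration.

```python
import numpy as np

max_epsilon = 1.0
min_epsilon = 0.005
decay_rate = 0.005

def epsilon_at(episode):
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

# Print the exploration rate at a few points of the 10000-episode run.
for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(float(epsilon_at(episode)), 4))
```

With these settings epsilon drops to roughly 0.012 by episode 1000, so most of the 10000 training episodes are spent almost entirely exploiting the current Q-table.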

In [28]:
rewards = []

for episode in range(total_episodes):
    #Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # Choose action (a) in current world state (s)
        # first we randomize a number
        
        exp_exp_tradeoff = random.uniform(0,1)
        
        # if this is greater than epsilon --> exploitation mode (choose the action with the largest Q value)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(q_table[state, :])

        # otherwise make a random choice (exploration)
        else:
            action = env.action_space.sample()
            
        # take action (a) and observe new state & reward 
        new_state, reward, done, info = env.step(action)
        
        # Update the Q-table using the Bellman equation:
        # Q(s,a) <- Q(s,a) + lr * [R + gamma * max_a' Q(s',a') - Q(s,a)]
        q_table[state, action] = q_table[state, action] + learning_rate * (
            reward + gamma * np.max(q_table[new_state, :]) - q_table[state, action])
        
        state = new_state
        total_rewards += reward
        
        if done:
            break

    # Decay epsilon after each episode so that later episodes explore less
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * (episode + 1))
    rewards.append(total_rewards)

print("Score over time: " + str(sum(rewards) / total_episodes))
print(q_table)

Score over time: 0.5209
[[1.74923322e-01 4.15917778e-02 3.21399249e-02 4.15509268e-02]
 [5.41052296e-03 2.17630960e-02 3.49208317e-03 6.34712236e-02]
 [1.71590974e-02 1.71671252e-02 1.23352138e-02 6.89743346e-02]
 [1.10634067e-04 1.99511801e-04 7.05226566e-06 5.08009091e-02]
 [3.31767651e-01 4.64519251e-02 6.74254496e-04 3.32705961e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.01813160e-01 1.74290825e-04 5.17452937e-03 7.78710733e-07]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.30793373e-02 5.72394990e-02 5.19795727e-02 3.35322326e-01]
 [1.41067808e-02 6.04978863e-01 5.32693489e-02 1.47231020e-03]
 [5.11345930e-01 5.64180640e-03 3.15707679e-04 2.07388865e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.07123649e-02 2.85577277e-02 8.46730461e-01 1.16989153e-01]
 [2.91109730e-01 8.29353108e-01 2.06517509e-01 3.15238983e-01]
 [0.00000000e+00 0.00000000e+00
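To see what a single update inside the training loop does, here is a small worked example with the same learning rate and discount factor. The transition (state 14, action Right, reward 1, landing on the goal state 15) is a hypothetical one chosen for illustration and ignores slipping; `q_update` is a helper name that does not appear in the notebook.

```python
import numpy as np

learning_rate = 0.8
gamma = 0.95

def q_update(q_table, state, action, reward, new_state):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    td_target = reward + gamma * np.max(q_table[new_state, :])
    td_error = td_target - q_table[state, action]
    q_table[state, action] += learning_rate * td_error
    return float(q_table[state, action])

q = np.zeros((16, 4))

# First visit: 0 + 0.8 * (1 + 0.95 * 0 - 0) = 0.8
print(round(q_update(q, 14, 2, 1.0, 15), 2))   # 0.8

# Second visit: 0.8 + 0.8 * (1 + 0.95 * 0 - 0.8) = 0.96
print(round(q_update(q, 14, 2, 1.0, 15), 2))   # 0.96
```

Because the goal row of the table stays at zero, the value of the final step converges toward the reward of 1, and earlier states then pick up a discounted share of it through the `gamma * max` term.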

In [29]:
env.reset()

for episode in range(total_test_episodes):
    state = env.reset()
    step = 0
    done = False
    print("****************************************************")
    print("EPISODE ", episode)

    for step in range(max_steps):
        
        # Take the action (index) that has the maximum expected future reward given the current state
        action = np.argmax(q_table[state,:])
        
        new_state, reward, done, info = env.step(action)
        
        if done:
            # Here, we only render the last state (to see whether our agent reached the goal or fell into a hole)
            env.render()
            
            # We print the number of steps it took.
            print("Number of steps", step)
            break
        state = new_state
env.close()



****************************************************
EPISODE  0
  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG
Number of steps 39
****************************************************
EPISODE  1
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 22
****************************************************
EPISODE  2
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 65
****************************************************
EPISODE  3
  (Down)
SFFF
FHFH
FFFH
HFF[41mG[0m
Number of steps 22
****************************************************
EPISODE  4
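The test loop always picks `np.argmax` of the current row, so the learned behaviour can be summarised as one preferred action per state. The sketch below shows that idea on a small, hypothetical 4-state table (the values are made up for illustration, not taken from the run above); `greedy_policy` is a helper name introduced here.

```python
import numpy as np

# Action encoding used by FrozenLake: 0 = Left, 1 = Down, 2 = Right, 3 = Up.
ACTION_NAMES = ["Left", "Down", "Right", "Up"]

def greedy_policy(q_table):
    """Return the index of the highest-valued action for every state."""
    return np.argmax(q_table, axis=1)

# Hypothetical Q-table with 4 states; the last row mimics a terminal state.
q_example = np.array([
    [0.17, 0.04, 0.03, 0.04],
    [0.01, 0.02, 0.00, 0.06],
    [0.01, 0.03, 0.85, 0.12],
    [0.00, 0.00, 0.00, 0.00],
])

policy = greedy_policy(q_example)
for state, action in enumerate(policy):
    print(state, ACTION_NAMES[action])
# 0 Left
# 1 Up
# 2 Right
# 3 Left
```

Note that `np.argmax` returns the first index on ties, so an all-zero row (a hole or the goal, which are never updated) maps to Left; that entry carries no information about the policy.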
