# Reinforcement learning: Q-learning with value iteration

### Frozen Lake Game

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intended. The surface is described using a grid like the following:
    
    SFFF
    FHFH
    FFFH
    HFFG
    
Here S is the start, where the agent begins, F is frozen surface (safe), H is a hole, and G is the goal, where the frisbee is.
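
As a small illustration of how this grid relates to the integer states used later, the sketch below numbers the tiles in row-major order (left to right, top to bottom), which is how FrozenLake reports the agent's position. The names `GRID` and `state_index` are hypothetical helpers introduced only for this example.

```python
# Layout copied from the grid above.
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(GRID[0])


def state_index(row, col):
    """Map a (row, col) grid position to its integer state (row-major order)."""
    return row * N_COLS + col


# Collect the state numbers belonging to each tile type.
tiles = {}
for r, line in enumerate(GRID):
    for c, tile in enumerate(line):
        tiles.setdefault(tile, []).append(state_index(r, c))

print("start:", tiles["S"])
print("holes:", tiles["H"])
print("goal: ", tiles["G"])
```

The holes and the goal end an episode as soon as the agent lands on them, so no action is ever taken from those states.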
    


### Importing libraries

In [1]:
# import libraries
import numpy as np
import gym
import random
import time

from IPython.display import clear_output

### Step 1: Create the environment

In [2]:
# the environment object lets us query information about the environment, sample states and actions,
# retrieve rewards, and have the agent navigate the environment
env = gym.make('FrozenLake-v0')

### Step 2: Create the Q-table and initialize it

In [3]:
# size of the action space 
action_space_size = env.action_space.n
# size of the state space in the environment
state_space_size = env.observation_space.n

In [4]:
# build Q-table and fill it with zeros
q_table = np.zeros((state_space_size, action_space_size))

In [5]:
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


### Step 3: Initializing Q-learning hyper-parameters

In [6]:
# create and initialize all the parameters needed to implement the Q-learning algorithm.
num_episodes = 10000 # number of plays during training
max_steps_per_episode = 100 # max number of steps agent allowed to take in a single episode

learning_rate = 0.1
discount_rate = 0.99

# related to the exploration-exploitation trade-off using epsilon-greedy policy
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01   # max and min are the bounds to how large or small our exploration rate can be
exploration_decay_rate = 0.001

# Note: the exploration rate is the epsilon of the epsilon-greedy policy

Note: The lower the exploration decay rate, the longer the agent will be able to explore. With 0.01 as the decay rate, the agent was only able to explore for a relatively short amount of time until it went into full exploitation mode without having a chance to fully explore and learn about the environment. Decreasing the decay rate to 0.001 allowed the agent to explore for longer and learn more.

If the exploration decay rate is too large, the second term in the exploration-rate update is ≈ 0 (because the exponential term is ≈ 0). As a result, the exploration rate quickly converges to "min_exploration_rate", and subsequent epsilon-greedy choices get stuck in "exploitation" mode, with little to no exploration.

This behavior would be much clearer if the game were deterministic (no slipping on the ice), since the randomness introduced by slipping helps hide the phenomenon.
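
To make the effect of the decay rate concrete, here is a minimal sketch that evaluates the exploration-rate schedule at a few episode numbers for both decay rates discussed above. It uses only the standard `math` module; `exploration_rate_at` is a hypothetical helper, and the bounds are the ones set in Step 3.

```python
import math

min_exploration_rate = 0.01
max_exploration_rate = 1.0


def exploration_rate_at(episode, decay_rate):
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_exploration_rate + (
        max_exploration_rate - min_exploration_rate
    ) * math.exp(-decay_rate * episode)


# Compare the two decay rates at a few points during training.
for decay_rate in (0.01, 0.001):
    rates = [
        round(exploration_rate_at(episode, decay_rate), 3)
        for episode in (0, 100, 1000, 5000)
    ]
    print("decay rate", decay_rate, "->", rates)
```

With a decay rate of 0.01 the schedule is essentially at its minimum by episode 1,000, while with 0.001 the agent still explores on roughly a third of its steps at that point.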

### Step 4: The Q learning algorithm

Sources:
- https://www.kaggle.com/sandovaledwin/q-learning-algorithm-for-solving-frozenlake-game/code
- http://deeplizard.com/learn/video/HGeI30uATws

In [7]:
# create list to hold all of the rewards we’ll get from each episode, to see how our game score changes over time.
rewards_all_episodes = []


#### Q-learning algorithm

The outer for-loop contains everything that happens within a single episode. The inner, nested loop contains everything that happens within a single time step.


##### Update Q-value formula:

$$q^{new}(s,a)\;=\;(1 - \alpha)\,q(s,a) + \alpha\left(R_{t+1} + \gamma \max_{a'}\;q(s',a')\right)$$

Used to update the Q-table.
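
As a worked example, the sketch below applies one update by hand: the new Q-value is a weighted average of the old estimate and the "learned value" (the reward plus the discounted best Q-value of the next state). The function name `q_update` and the input numbers are made up for illustration; the learning rate and discount rate are the ones set in Step 3.

```python
learning_rate = 0.1   # alpha
discount_rate = 0.99  # gamma


def q_update(old_q, reward, max_next_q):
    """Blend the old Q-value with the newly observed learned value."""
    learned_value = reward + discount_rate * max_next_q
    return (1 - learning_rate) * old_q + learning_rate * learned_value


# Ordinary step: no reward yet, but the next state looks promising.
mid_episode = q_update(old_q=0.5, reward=0.0, max_next_q=0.8)
print(round(mid_episode, 4))

# Step onto the goal: reward of 1, and the terminal state has no future value.
goal_step = q_update(old_q=0.0, reward=1.0, max_next_q=0.0)
print(round(goal_step, 4))
```

Because the learning rate is 0.1, each update moves the stored value only a tenth of the way toward the learned value, which smooths out the randomness of the slippery ice.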

In [8]:
# Step 4: Q-learning algorithm

# everything that happens within a single episode
for episode in range(num_episodes):
    # reset the state of the environment back to the starting state
    state = env.reset() 
    
    # initialize new episode params
    done = False # tracks whether the episode has finished; starts as False at the beginning of each episode
    
    # keep track of the rewards within the current episode 
    rewards_current_episode = 0 # set to zero since we start out with no rewards at the beginning of each episode
    
    # everything that happens within a single time step of the episode
    for step in range(max_steps_per_episode):

        # Exploration-exploitation trade-off
        # Draw a random number in [0, 1] and compare it with the exploration rate (epsilon) to decide whether to explore or exploit on this time step
        exploration_rate_threshold = random.uniform(0, 1)
        
        #if exploration_rate_threshold > exploration_rate and ~np.all(q_table[state,:]==0):
        if exploration_rate_threshold > exploration_rate:
            # Exploitation: take the greedy action, i.e. the one with the highest Q-value for this state
            # (the commented-out condition above would additionally skip states whose Q-values are all 0)
            action = np.argmax(q_table[state,:])
        else:
            # Explore: Take new action
            action = env.action_space.sample()
            #print('Exploration')
        
        # Take the action; returns the new state, the reward, whether the episode is done, and diagnostic 'info' about the environment
        new_state, reward, done, info = env.step(action)
        
        # Update Q(s,a) in the Q-table as a weighted sum of the old value and the "learned value"
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) \
        + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        
        # Transition to the next state
        # Set new state
        state = new_state
        
        # Add new reward 
        rewards_current_episode += reward
        
        # Check to see if our last action ended the episode (did our agent step in a hole or reach the goal?)
        if done:
            break
            
    # Exploration rate decay
    exploration_rate = min_exploration_rate + \
    (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
     
    # Add current episode reward to total rewards list
    rewards_all_episodes.append(rewards_current_episode)                                 

#### After all episodes complete

We now calculate the average reward per thousand episodes from the list of rewards for all episodes, so that we can print it and see how the rewards changed over time.

In [9]:
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)  # integer number of sections
count = 1000

In [10]:
print("******** Average reward per thousand episodes ********\n")
for reward in rewards_per_thousand_episodes:
    print(count, "", str(sum(reward/1000)))
    count += 1000

******** Average reward per thousand episodes ********

1000  0.04200000000000003
2000  0.22700000000000017
3000  0.3880000000000003
4000  0.5270000000000004
5000  0.6160000000000004
6000  0.6430000000000005
7000  0.6970000000000005
8000  0.6910000000000005
9000  0.6970000000000005
10000  0.7000000000000005


In [11]:
# Print updated Q-table to see how that has transitioned from its initial state of all zeros.
print("\n\n******** Q-table ********\n")
print(q_table)



******** Q-table ********

[[0.55305969 0.50936222 0.51268992 0.51292919]
 [0.36488135 0.33360739 0.26946678 0.50493763]
 [0.44790108 0.44540804 0.44219952 0.47713909]
 [0.22650871 0.21555082 0.29339466 0.4616523 ]
 [0.57257193 0.34338435 0.42032957 0.42608411]
 [0.         0.         0.         0.        ]
 [0.41637697 0.19327921 0.18684407 0.13209898]
 [0.         0.         0.         0.        ]
 [0.36310989 0.45718635 0.29683875 0.60511954]
 [0.40807201 0.7130411  0.43457842 0.40485577]
 [0.68275394 0.4480717  0.42565895 0.33215237]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.40610197 0.30692862 0.80346477 0.39405881]
 [0.76452398 0.89945059 0.80661165 0.79968618]
 [0.         0.         0.         0.        ]]
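
One way to read the learned Q-table is as a policy: in each state, pick the action with the largest Q-value. The sketch below does this for the values printed above and lays the result out on the grid. It relies on FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up); `greedy_action` and `policy_rows` are hypothetical names used only here.

```python
# Q-values copied from the printed table above (rows = states, columns = actions).
q_rows = [
    [0.55305969, 0.50936222, 0.51268992, 0.51292919],
    [0.36488135, 0.33360739, 0.26946678, 0.50493763],
    [0.44790108, 0.44540804, 0.44219952, 0.47713909],
    [0.22650871, 0.21555082, 0.29339466, 0.4616523],
    [0.57257193, 0.34338435, 0.42032957, 0.42608411],
    [0.0, 0.0, 0.0, 0.0],
    [0.41637697, 0.19327921, 0.18684407, 0.13209898],
    [0.0, 0.0, 0.0, 0.0],
    [0.36310989, 0.45718635, 0.29683875, 0.60511954],
    [0.40807201, 0.7130411, 0.43457842, 0.40485577],
    [0.68275394, 0.4480717, 0.42565895, 0.33215237],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.40610197, 0.30692862, 0.80346477, 0.39405881],
    [0.76452398, 0.89945059, 0.80661165, 0.79968618],
    [0.0, 0.0, 0.0, 0.0],
]
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]
ARROWS = ["←", "↓", "→", "↑"]  # index = action: left, down, right, up


def greedy_action(q_values):
    """Index of the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


policy_rows = []
for r, line in enumerate(GRID):
    symbols = []
    for c, tile in enumerate(line):
        state = r * len(line) + c
        # Holes and the goal are terminal, so show the tile instead of an action.
        symbols.append(tile if tile in "HG" else ARROWS[greedy_action(q_rows[state])])
    policy_rows.append("".join(symbols))

print("\n".join(policy_rows))
```

Some of the chosen actions point away from the goal or into a wall. That is plausibly a consequence of the slippery ice: the agent can be pushed sideways, so the safest intended direction is not always the most direct one.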


The agent played 10,000 episodes. At each time step within an episode, the agent received a reward of one if it reached the frisbee and a reward of zero otherwise. If the agent did reach the frisbee, the episode finished at that time step. This means the total reward for an entire episode is either one or zero, so the average reward over a block of episodes is the fraction of episodes the agent won. For the first thousand episodes, the first score in the printout means that the agent received a reward of one and won the episode about 4% of the time. By the last thousand episodes, the agent was winning 70% of the time. From the grid environment we can see that it is easy for the agent to fall into a hole, or to reach the maximum number of time steps, before it reaches the frisbee. Reaching the frisbee 70% of the time by the end of training is therefore not bad, especially since the agent was never given explicit instructions to reach the frisbee. It learned through reinforcement which actions lead there.

The printout shows that the average reward per thousand episodes did improve over the course of training. In the first thousand episodes the agent averaged a reward of only 0.042, but by the last thousand episodes the average reward had improved drastically to 0.700.
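
Because every episode's reward is either 0 or 1, the average over a block of episodes is simply the win rate for that block. The sketch below shows the same block-averaging idea in plain Python on a short, made-up reward list; `average_per_chunk` is a hypothetical helper, and it assumes the list length is a multiple of the chunk size.

```python
def average_per_chunk(rewards, chunk_size):
    """Average reward for each consecutive block of `chunk_size` episodes."""
    return [
        sum(rewards[i:i + chunk_size]) / chunk_size
        for i in range(0, len(rewards), chunk_size)
    ]


# Made-up 0/1 episode rewards: the agent wins more often as training goes on.
episode_rewards = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0]
win_rates = average_per_chunk(episode_rewards, chunk_size=4)
print(win_rates)
```

Each entry is the fraction of episodes won in that block, which is how the scores in the printout above can be read as percentages.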

In [12]:
# Step 5: Use our Q-table to play FrozenLake !

# List of rewards
rewards = []

# Watch our agent play Frozen Lake by playing the best action 
# from each state according to the Q-table
for episode in range(3): # watch the agent play three episodes
    # Reset the environment
    state = env.reset()
    
    # initialize new episode params
    step = 0
    done = False
    print('****** EPISODE ', episode+1, '*******\n\n\n')
    time.sleep(1) # allow print out to be read before disappearing
    
    for step in range(max_steps_per_episode):
        # clear output from jupyter notebook cell
        clear_output(wait=True) 
        
        # Show current state of environment on screen
        env.render() # render current state of the environment to display where the agent is in the grid (visually see the game grid)
        time.sleep(0.3) # sleep 300 milliseconds, will allow to see the current state of the environment before moving on to the next time step
        
        # Choose the action with the highest Q-value for the current state,
        # i.e. the action (index) with the maximum expected future reward given that state.
        action = np.argmax(q_table[state,:])
        
        # Take the new action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action) # take action
        
        # If done (the agent reached the goal, fell into a hole, or the episode ended): finish the episode
        if done:
            # Here, we only print the last state (to see whether our agent is on the goal or fell into a hole)
            clear_output(wait=True)
            env.render()
            if reward == 1: # Agent reached the goal and won episode
                print('***** You reached the goal! *****')
                time.sleep(3)
            else: # Agent did not reach the goal (it stepped in a hole, or the episode ended without a reward) and lost the episode
                print('**** You fell through a hole! *****')
                time.sleep(3)
                
            # clear output from jupyter cell
            clear_output(wait=True)
            
            # Print the zero-based index of the step on which the episode ended
            print("Number of steps", step)
            break
            
        # Set our new state
        state = new_state

# After all three episodes are done, we then close the environment
env.close()

Number of steps 99
