## Frozen Lake

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:

    SFFF
    FHFH
    FFFH
    HFFG

This grid is our environment where S is the agent’s starting point, and it’s safe. F represents the frozen surface and is also safe. H represents a hole, and if our agent steps in a hole in the middle of a frozen lake, well, that’s not good. Finally, G represents the goal, which is the space on the grid where the prized frisbee is located.

The agent can move left, right, up, or down. An episode ends when the agent reaches the goal or falls into a hole. The agent receives a reward of 1 if it reaches the goal, and 0 otherwise.
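
To make the grid concrete, here is a small sketch that stores the map as a list of strings and looks up the tile behind a state index. It assumes states are numbered row by row starting from the top-left corner, which is how the 4x4 FrozenLake environment numbers them. The helper names (`state_to_cell`, `is_terminal`, `reward_for`) are illustrative and are not part of gym.

```python
# The 4x4 map from above, one string per row.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(LAKE_MAP[0])


def state_to_cell(state):
    """Return (row, col, tile) for a state index, assuming row-major numbering."""
    row, col = divmod(state, N_COLS)
    return row, col, LAKE_MAP[row][col]


def is_terminal(state):
    """An episode ends when the agent lands on a hole (H) or the goal (G)."""
    return state_to_cell(state)[2] in "HG"


def reward_for(state):
    """The reward is 1 only on the goal tile and 0 everywhere else."""
    return 1 if state_to_cell(state)[2] == "G" else 0


# List the state indices that correspond to holes.
holes = [s for s in range(16) if state_to_cell(s)[2] == "H"]
print(holes)
```

This sketch ignores the slippery ice: it only describes which tiles exist, not where a given action actually takes the agent.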

In [10]:
# Importing Necessary Packages 
import numpy as np
import gym
import random
import time
from IPython.display import clear_output

### Parameter Description

#### Learning Rate(α): 
The learning rate is a number between 0 and 1, which can be thought of as how quickly the agent abandons the previous Q-value for the new Q-value.
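
To see this trade-off in numbers, the sketch below applies a single update for a few learning rates. The values of `old_q` and `target` are made up for illustration; `target` stands for the reward plus the discounted best Q-value of the next state.

```python
def blend(old_q, target, learning_rate):
    """Weighted average of the old Q-value and the newly observed target."""
    return (1 - learning_rate) * old_q + learning_rate * target


old_q, target = 0.5, 1.0  # hypothetical values, for illustration only

# A rate of 0 keeps the old value, a rate of 1 replaces it entirely,
# and anything in between moves the estimate part of the way.
for alpha in (0.0, 0.1, 1.0):
    print(alpha, blend(old_q, target, alpha))
```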

#### Discount Rate(γ): 
It is a value between 0 and 1 that sets the rate at which we discount future rewards in the expected return. With this definition of the discounted return, our agent cares more about immediate rewards than about future ones, because future rewards are discounted more heavily. So, while the agent does consider the rewards it expects to receive in the future, the more immediate rewards have more influence when the agent decides whether to take a particular action.
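
As a rough illustration, the sketch below computes the discounted return for two made-up episodes that both end with the goal reward of 1, one after 6 steps and one after 40. The reward sequences are hypothetical; the point is only that the later reward contributes less.

```python
def discounted_return(rewards, discount_rate):
    """Sum of rewards, with the reward at step t scaled by discount_rate ** t."""
    return sum(r * discount_rate ** t for t, r in enumerate(rewards))


# Hypothetical episodes: zero reward on every step except the last one.
short_path = [0] * 5 + [1]
long_path = [0] * 39 + [1]

print(discounted_return(short_path, 0.99))
print(discounted_return(long_path, 0.99))
```

With a discount rate close to 1 the gap between the two returns is modest; a smaller rate would penalise the longer path much more strongly.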

#### Exploration rate(ϵ): 
It is the probability that the agent will explore the environment rather than exploit its current information about the environment. It is initially set to 1 because the agent has no information when the game starts. At the start of each new episode, ϵ decays by a rate that we set, so exploration becomes less and less likely as the agent learns more about the environment. This is called the epsilon-greedy strategy.
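
The strategy can be sketched in isolation as follows. The exponential decay form and the default values mirror the ones used in the training cell further down; the function names `decayed_epsilon` and `choose_action` are illustrative, and the Q-values passed in are made up.

```python
import math
import random


def decayed_epsilon(episode, min_rate=0.01, max_rate=1.0, decay_rate=0.001):
    """Exponentially decay epsilon from max_rate towards min_rate."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy choice over the Q-values of a single state."""
    if rng.uniform(0, 1) > epsilon:
        # Exploit: pick the action with the highest Q-value.
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: pick any action uniformly at random.
    return rng.randrange(len(q_row))


# Epsilon starts at 1 and shrinks as more episodes are played.
for episode in (0, 1000, 5000, 10000):
    print(episode, decayed_epsilon(episode))

# With epsilon = 0 the choice is always greedy (hypothetical Q-values).
print(choose_action([0.0, 0.5, 0.2, 0.0], epsilon=0.0))
```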


In [None]:
# Creating the game environment
env = gym.make("FrozenLake-v0")

# Specifying the action space and state space sizes
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

# Initialising the Q-table with 0 
q_table = np.zeros((state_space_size, action_space_size))
print(q_table)

# No. of episodes we want our agent to play
num_episodes = 10000

# Here, we define the maximum number of steps that our agent is allowed to take within a single episode
# If the agent fails to terminate the episode, it will auto terminate
max_steps_per_episode = 100

# Learning Rate
learning_rate = 0.1

# Discount Rate
discount_rate = 0.99

# Initial Exploration Rate
exploration_rate = 1

# Maximum Exploration Rate
max_exploration_rate = 1

# Minimum Exploration Rate
min_exploration_rate = 0.01

# The rate of decay of exploration rate
exploration_decay_rate = 0.001

# List to store the reward in each episode. This will be so we can see how our game score changes over time.
rewards_all_episodes = []

In [12]:
# Q-learning algorithm
for episode in range(num_episodes):
    
    # Initialize new episode:
    # Reset the environment
    state = env.reset()
    
    #The done variable just keeps track of whether or not our episode is finished
    done = False
    
    #Reset the reward in the current episode to 0
    rewards_current_episode = 0
    
    for step in range(max_steps_per_episode): 
        
        # Exploration-exploitation trade-off
        # Generating a random number between 0 and 1
        exploration_rate_threshold = random.uniform(0, 1)
        
        # If the generated number is greater than the exploration rate, we exploit
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:]) 
            
        # Else, we explore randomly    
        else:
            action = env.action_space.sample()

        # Now that our action is chosen, we then take that action by calling step() on our env object 
        # and passing our action to it. The function step() returns a tuple containing the new state, 
        # the reward for the action we took, whether or not the action ended our episode, and diagnostic 
        # information regarding our environment, which may be helpful for us if we end up needing to do any debugging.    
        new_state, reward, done, info = env.step(action)
    
        # Update Q-table for Q(s,a)
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        
        # Next, we set our current state to the new_state that was returned to us once we took our last action
        state=new_state
        # we then update the rewards from our current episode by adding the reward we received for our previous action.
        rewards_current_episode += reward
        
        # If the episode is terminated, break the loop
        if done: 
            break
        
    # Once an episode is finished, we need to update our exploration_rate using exponential decay
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    
    # We then just append the rewards from the current episode to the list of rewards from all episodes
    rewards_all_episodes.append(rewards_current_episode)
    
    # We’re good to move on to the next episode.
    
print("done")    

done


In [None]:
# Calculate and print the average reward per thousand episodes
# Integer division keeps the number of sections an int
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 1000)
count = 1000

print("Average reward per thousand episodes:\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000
    

# Print updated Q-table
print("\n\nQ-table\n")
print(q_table)

In [None]:
# Watch our agent play Frozen Lake 

for episode in range(100):
    
    # Initialize new episode:
    # Reset the environment
    state = env.reset()
    
    # The done variable just keeps track of whether or not our episode is finished
    done = False
    
    # Print the number of episodes played
    print("EPISODE ", episode+1, "\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps_per_episode):    
        
        # Clear the output window; wait=True delays the clear until new output is available to replace the old one.
        clear_output(wait=True)
        
        # We then call render() on our env object, which will render the current state of the environment to the display
        env.render()
        time.sleep(0.3)
        
        # Now that our agent has learnt how to play, we follow the optimum policy (exploitation)
        action = np.argmax(q_table[state,:])        
        
        # We now take the action by calling step() on our env object
        new_state, reward, done, info = env.step(action)
        
        # If the episode terminated
        if done:
            
            # Clear the output window; wait=True delays the clear until new output is available to replace the old one.
            clear_output(wait=True)
            
            # We then call render() on our env object, which will render the current state of the environment to the display
            env.render()
            
            # If the reward in the current step is 1
            if reward == 1:
                print("You reached the goal!")
                time.sleep(1)
                
            else:
                print("You fell through a hole!")
                time.sleep(1)
            # Clear output again    
            clear_output(wait=True)
            break
            
        # Set new state
        state = new_state
        
env.close()  
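
Besides watching episodes play out, it can help to look at the greedy policy directly. The sketch below turns a Q-table into a grid of arrows, one per state, assuming the usual FrozenLake action order (0 = left, 1 = down, 2 = right, 3 = up) and a 4x4 map. The `demo_q` table is a made-up example so the snippet runs on its own; in the notebook you would pass the trained `q_table` instead.

```python
import numpy as np

# FrozenLake action indices: 0 = left, 1 = down, 2 = right, 3 = up.
ARROWS = np.array(["<", "v", ">", "^"])


def greedy_policy_grid(q_table, n_rows=4, n_cols=4):
    """Return a grid of arrows showing the highest-valued action in each state."""
    best_actions = np.argmax(q_table, axis=1)
    return ARROWS[best_actions].reshape(n_rows, n_cols)


# Hypothetical Q-table, for illustration only.
demo_q = np.zeros((16, 4))
demo_q[0, 2] = 0.5  # state 0 prefers "right"
demo_q[1, 1] = 0.3  # state 1 prefers "down"

grid = greedy_policy_grid(demo_q)
for row in grid:
    print(" ".join(row))
```

States whose Q-values are all zero, such as holes and the goal, show `<` simply because `np.argmax` returns the first index on ties, so those arrows carry no meaning.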
