In [3]:
import numpy as np
import gym

The FrozenLake environment consists of a 4 by 4 grid representing a frozen surface. The agent always starts in state 0, cell [0,0] of the grid, and its goal is to reach state 15, cell [3,3] (states and cells are numbered from zero). Along the way it either walks on frozen tiles or falls into a hole. If it falls, the episode ends. When the agent reaches the goal, the reward is 1; otherwise it is 0.
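The link between a state index and its grid cell can be sketched as below, assuming the row-major, zero-based numbering of the default 4 by 4 map (state = row * 4 + column). The helper names `state_to_cell` and `cell_to_state` are illustrative and are not part of gym.

```python
N_COLS = 4  # width of the default FrozenLake map


def state_to_cell(state, n_cols=N_COLS):
    """Convert a state index to a zero-based (row, column) pair."""
    return divmod(state, n_cols)


def cell_to_state(row, col, n_cols=N_COLS):
    """Convert a zero-based (row, column) pair to a state index."""
    return row * n_cols + col


print(state_to_cell(0))     # start cell
print(state_to_cell(15))    # goal cell
print(cell_to_state(3, 3))  # back to the goal's state index
```

With 16 states numbered 0 to 15, the last state maps to the bottom-right cell of the grid.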

In [5]:
env = gym.make("FrozenLake-v1")
n_observations = env.observation_space.n
n_actions = env.action_space.n

In [6]:
#Initialize the Q-table to 0
Q_table = np.zeros((n_observations,n_actions))
print(Q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


Gamma (γ):
Gamma is the discount factor, and it represents the extent to which future rewards should be considered in the decision-making process.
It is a value between 0 and 1, where 0 means the agent only considers immediate rewards, and 1 means the agent considers all future rewards equally.
By discounting future rewards, the algorithm accounts for the fact that future rewards are generally worth less than immediate rewards. This is because the environment is uncertain, and there is a possibility that the agent might not reach future states.
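To make the effect of gamma concrete, here is a minimal sketch that computes the discounted return of one hypothetical episode in which the only reward arrives on the sixth step. The function name `discounted_return` and the reward sequence are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical episode: the goal (reward 1) is reached on the sixth step.
rewards = [0, 0, 0, 0, 0, 1]

for gamma in (0.0, 0.5, 0.99, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With gamma equal to 0 the delayed reward contributes nothing, with gamma equal to 1 it counts in full, and intermediate values shrink it by a factor of gamma for every step of delay.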

Learning Rate (α):
The learning rate is a parameter that controls the step size in updating the Q-values based on new information.
It is a value between 0 and 1 and determines the proportion of the difference between the old and new Q-values that is used to update the old Q-value.
A higher learning rate allows the agent to give more weight to the most recent information, making the learning process more volatile. On the other hand, a lower learning rate results in slower but more stable learning.
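The role of the learning rate can be illustrated with a single Q-value update written as a standalone function. This is a sketch of the standard Q-learning rule; the name `q_update` and the sample numbers are assumptions for illustration only.

```python
def q_update(q_old, reward, max_q_next, lr, gamma):
    """One Q-learning step: move q_old toward the target by a fraction lr."""
    target = reward + gamma * max_q_next
    return q_old + lr * (target - q_old)


# Same transition, different learning rates (values chosen for illustration).
for lr in (0.1, 0.5, 1.0):
    print(lr, q_update(q_old=0.0, reward=1.0, max_q_next=0.0, lr=lr, gamma=0.99))
```

The expression `q_old + lr * (target - q_old)` is algebraically the same as `(1 - lr) * q_old + lr * target`. With `lr = 1` the old estimate is replaced by the target outright, while a small `lr` moves it only a short way toward the target.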


In [7]:
# number of episodes to run
n_episodes = 10000

# maximum number of iterations per episode
max_iter_episode = 100

# initial exploration probability
exploration_proba = 1

# decay rate for the exponential decrease of the exploration probability
exploration_decreasing_decay = 0.001

# minimum exploration probability
min_exploration_proba = 0.01

# discount factor
gamma = 0.99

# learning rate
lr = 0.1
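To see how these exploration settings interact, the sketch below evaluates the schedule `max(floor, exp(-decay * episode))` at a few episode counts and finds the first episode count at which the floor takes over. The function name `exploration_after` is an assumption for illustration. The defaults mirror `exploration_decreasing_decay` and `min_exploration_proba` above.

```python
import math


def exploration_after(episode, decay=0.001, floor=0.01):
    """Exploration probability after `episode` completed episodes."""
    return max(floor, math.exp(-decay * episode))


for e in (0, 1000, 2000, 4000, 5000):
    print(e, round(exploration_after(e), 3))

# First episode count at which exp(-decay * e) drops below the floor.
floor_episode = math.ceil(-math.log(0.01) / 0.001)
print("floor reached after", floor_episode, "episodes")
```

Under these values the agent explores at the minimum rate for roughly the second half of the 10000 training episodes.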


In [11]:
total_rewards_episode = list()

In [24]:
#we iterate over episodes
rewards_per_episode = []
for e in range(n_episodes):
    #we initialize the first state of the episode
    current_state,_ = env.reset()
    done = False
    
    #sum the rewards that the agent gets from the environment
    total_episode_reward = 0
    
    for i in range(max_iter_episode): 
        # Sample a float uniformly from [0, 1).
        # If it is below the exploration probability,
        #     the agent takes a random action;
        # otherwise
        #     it exploits its knowledge by taking the action with the highest Q-value.
        
        if np.random.uniform(0,1) < exploration_proba:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q_table[current_state,:])
        
        # The environment runs the chosen action and returns the next state,
        # a reward, and a flag that is True if the episode has terminated
        # (the truncation flag and the info dict are ignored here).
        next_state, reward, done, _, _ = env.step(action)
        
    
        # Update the Q-table with the Q-learning rule
        Q_table[current_state, action] = (1 - lr) * Q_table[current_state, action] + lr * (reward + gamma * np.max(Q_table[next_state, :]))
        total_episode_reward = total_episode_reward + reward
        # If the episode is finished, we leave the for loop
        if done:
            break
        current_state = next_state
    # Update the exploration probability with the exponential decay formula
    exploration_proba = max(min_exploration_proba, np.exp(-exploration_decreasing_decay*e))
    rewards_per_episode.append(total_episode_reward)
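Once training finishes, the greedy policy can be read off the Q-table by taking the best action in each state. The sketch below renders that policy as a grid of arrows, assuming FrozenLake's action encoding (0 = left, 1 = down, 2 = right, 3 = up). The function `greedy_policy_grid` and the toy table are hypothetical. It uses plain Python so that it runs on nested lists as well as on the NumPy `Q_table`.

```python
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}  # left, down, right, up


def greedy_policy_grid(q_table, n_cols=4):
    """Return one string per grid row with the greedy action of each state."""
    # Ties are resolved in favour of the lowest action index, like np.argmax.
    actions = [max(range(len(row)), key=lambda a: row[a]) for row in q_table]
    symbols = [ARROWS[a] for a in actions]
    return ["".join(symbols[i:i + n_cols]) for i in range(0, len(symbols), n_cols)]


# Toy 2 by 2 example with made-up Q-values (not the trained table).
toy_q = [
    [0, 0, 1, 0],  # state 0: right
    [0, 1, 0, 0],  # state 1: down
    [0, 0, 0, 1],  # state 2: up
    [0, 0, 0, 0],  # state 3: all zeros, so the tie falls to action 0 (left)
]
for line in greedy_policy_grid(toy_q, n_cols=2):
    print(line)
```

Calling `greedy_policy_grid(Q_table)` on the trained table gives four lines of four arrows. Rows of the table that are never updated, such as holes and the goal, stay at zero, so their arrows carry no information.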

In [27]:
print("Mean reward per thousand episodes")
for i in range(10):
    print((i+1)*1000, ": mean episode reward: ",
          np.mean(rewards_per_episode[1000*i:1000*(i+1)]))

Mean reward per thousand episodes
1000 : mean episode reward:  0.033
2000 : mean episode reward:  0.215
3000 : mean episode reward:  0.432
4000 : mean episode reward:  0.599
5000 : mean episode reward:  0.655
6000 : mean episode reward:  0.692
7000 : mean episode reward:  0.676
8000 : mean episode reward:  0.664
9000 : mean episode reward:  0.69
10000 : mean episode reward:  0.668
