<img src="https://www.th-koeln.de/img/logo.svg" style="float:right;" width="200">

# 12th exercise: <font color="#C70039">First Reinforcement Learning Game (*Frozen Lake*) using OpenAI Gym</font>
* Course: AML
* Lecturer: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Author of notebook: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>. This notebook is based on the great post and notebook from [Rodolfo Mendes](https://morioh.com/p/18a96b9091d3).
* Date:   21.12.2025
* Student: Tim Voßmerbäumer
* Matr.Nr.: 11474232

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*i53DAlKJx_91HgcSiFwyJQ.png" style="float: center;" width="600">

---------------------------------
**GENERAL NOTE 1**: 
Please make sure you read the entire notebook, since it contains a lot of information on your tasks (e.g. regarding the setting of certain parameters or a specific computational trick). The markdown cells as well as the code comments explain how things work together as a whole. 

**GENERAL NOTE 2**: 
* Please, when commenting source code, just use English language only. 
* When describing an observation please use English language, too.
* This applies to all exercises throughout this course.

---------------------------------

### <font color="ce33ff">DESCRIPTION</font>:

#### OpenAI Gym
In this exercise you will be using Python and OpenAI Gym to develop your reinforcement learning algorithm. The Gym library is a collection of environments that can be used freely with reinforcement learning algorithms.

Gym has a ton of environments ranging from simple text-based games to Atari games like Breakout and Space Invaders. The library is intuitive to use and simple to install. Just run **pip install gym** and you are good to go! Note that the code cells in this notebook import the maintained successor package under the name *gymnasium* (installed with **pip install gymnasium**). 

Further reading about OpenAI Gym is available under https://www.gymlibrary.dev/.
This notebook is based on this great post and notebook from [Rodolfo Mendes](https://morioh.com/p/18a96b9091d3).

#### Frozen Lake
This description of the game is copied directly from Gym's website. 

*Winter is coming. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water and die (Game over). At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:*

* SFFF
* FHFH
* FFFH
* HFFG

This grid is your environment! S is your (the agent's) starting point and it's safe. F represents the frozen surface and is also safe. H represents a hole and if your agent steps in a hole in the middle of a frozen lake, the game is over because the agent dies. Finally, G represents the goal, which is the space on the grid where the frisbee is located.

The agent can navigate *left, right, up, down* and the episode ends when the agent reaches the goal or falls in a hole. It receives a reward of **1** if it reaches the goal and **0** otherwise.
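
To make the grid concrete, here is a minimal sketch that maps a state index to its tile. It assumes that the 16 states are numbered row by row, from 0 (top left) to 15 (bottom right); the names `LAKE_MAP` and `tile_at` are made up for this illustration and are not part of Gym.

```python
# The 4x4 map from the description above, one string per row.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

def tile_at(state, lake_map=LAKE_MAP):
    """Return the tile letter (S, F, H or G) for a row-major state index."""
    n_cols = len(lake_map[0])
    row, col = divmod(state, n_cols)
    return lake_map[row][col]

# Collect the indices of all holes and of the goal (assumed row-major numbering).
n_states = len(LAKE_MAP) * len(LAKE_MAP[0])
holes = [s for s in range(n_states) if tile_at(s) == "H"]
goal = [s for s in range(n_states) if tile_at(s) == "G"]
print("holes:", holes)  # holes: [5, 7, 11, 12]
print("goal:", goal)    # goal: [15]
```

The agent never acts from a hole or from the goal, because the episode ends there. The Q-table rows of these states are therefore never updated and stay at zero.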

Here is the summary:
<img src="./images/FrozenLake.States.Rewards.png" style="float: center;" width="800">

---------------------------------

### <font color="FFC300">TASKS</font>:
The tasks that you need to work on within this notebook are always indicated below as bullet points. 
If a task is more challenging and consists of several steps, this is indicated as well. 
Make sure you have worked through the task list and commented on what you did. 
This should be done by using markdown.<br> 
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook.</font>

**YOUR TASKS in this exercise are as follows**:
1. import the notebook to Google Colab or use your local machine.
2. make sure you specified your name and your matriculation number in the header below my name and date. 
    * set the date too and remove mine.
3. read the entire notebook carefully 
    * add comments wherever you feel it is necessary for better understanding
    * run the notebook for the first time. 
4. install gym into your env!
5. You will train an agent to play the *Frozen Lake* game using Q-learning and you will get a playback of how the agent does after being trained.
6. Again the task: Your agent has to navigate the grid by staying on the frozen surface without falling into any holes until it reaches the frisbee. If it reaches the frisbee, it wins with a reward of plus one. If it falls in a hole, it loses and receives no points for the entire episode.
7. Your tasks are highlighted in the notebook (see below)
---------------------------------

### Imports 
Import all required libraries, including Gym (here via the *gymnasium* package).

In [1]:
import numpy as np
import gymnasium as gym
import random
import time
from   IPython.display import clear_output

In [2]:
print(gym.__version__) # the original notebook targeted gym==0.26.2; this run uses gymnasium (see the printed version)

1.2.3


### Creating the Environment
For creating your environment, just call *gym.make()* and pass a string of the name of the environment you want to set up. 
All the environments with their corresponding names you can use here are available on Gym's website (see above).
With this *env* object, you are able to query for information about the environment, sample states and actions, retrieve rewards and have your agent navigate the frozen lake. That is all made available to you conveniently with Gym.

In [3]:
env = gym.make("FrozenLake-v1")

### Creating the Q-Table
Now, construct your Q-table, and initialize all the Q-values to zero for each state-action pair.
The number of rows in the table is equivalent to the size of the state space in the environment, and the number of columns is equivalent to the size of the action space (see above). You can get this information using *env.observation_space.n* and *env.action_space.n* as shown below in the code. Then, you can use this information to build the Q-table and initialize it with zeros.

In [4]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

In [5]:
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


### Initializing Q-Learning hyperparameters
Now, we're going to create and initialize all the parameters needed to implement the Q-learning algorithm.

First, with *num_episodes*, you define the total number of episodes you want the agent to play during training. Then, with *max_steps_per_episode*, you define a maximum number of steps that your agent is allowed to take within a single episode. So, if the agent has not reached the frisbee or fallen through a hole by the last allowed step (the 200th with the value set below), then the episode will terminate with the agent receiving zero points.

Next, you will set your *learning_rate* and your *discount_rate*. The discount rate was represented with the symbol (lambda) in the course slides (keyword: discounted return G_t).
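
As a rough illustration of how these two parameters act together, the sketch below applies a single Q-learning update to scalar values. It has the same form as the update line in the training loop further down; the function name `q_update` and the example numbers are made up for this illustration.

```python
def q_update(q_old, reward, max_q_next, learning_rate, discount_rate):
    """Return the new Q(s, a) after one Q-learning step."""
    # Learned value: immediate reward plus discounted best value of the next state.
    target = reward + discount_rate * max_q_next
    # Weighted average of the old value and the learned value.
    return (1 - learning_rate) * q_old + learning_rate * target

# Reaching the goal (reward 1) from a state-action pair that is still at zero.
first = q_update(q_old=0.0, reward=1.0, max_q_next=0.0,
                 learning_rate=0.01, discount_rate=0.99)
# A step without reward whose successor state already has a best value of 1.0.
second = q_update(q_old=0.5, reward=0.0, max_q_next=1.0,
                  learning_rate=0.01, discount_rate=0.99)
print(round(first, 4), round(second, 4))  # 0.01 0.5049
```

With a learning rate of 0.01, a single win moves a Q-value only one hundredth of the way towards its target, so many repeated visits are needed before the values become informative.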

Now, the last four parameters are all related to the exploration-exploitation dilemma with respect to the epsilon-greedy policy. You are initializing your *exploration_rate* to **1** and setting the *max_exploration_rate* to **1** and a *min_exploration_rate* to **0.01**. The *max* and *min* are just bounds to how large or small your exploration rate can be. Remember, the exploration rate was represented with the symbol (epsilon) when discussed in the course slides.
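
The epsilon-greedy rule itself can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper `choose_action`, the fixed seed and the Q-values are made up, and the training loop below uses NumPy and the environment's own action sampler instead.

```python
import random

def choose_action(q_row, exploration_rate, rng):
    """Pick an action index for one state using an epsilon-greedy rule."""
    if rng.uniform(0, 1) > exploration_rate:
        # Exploit: take the action with the highest Q-value.
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Explore: take a uniformly random action.
    return rng.randrange(len(q_row))

rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
q_row = [0.1, 0.7, 0.2, 0.0]  # made-up Q-values for a single state

# exploration_rate = 0 always exploits, exploration_rate = 1 always explores.
greedy_picks = [choose_action(q_row, 0.0, rng) for _ in range(100)]
random_picks = [choose_action(q_row, 1.0, rng) for _ in range(100)]
print(set(greedy_picks))
print(sorted(set(random_picks)))
```

An exploration rate between these two extremes mixes both behaviours: the closer it is to 1, the more often a random action is taken.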

Lastly, you will set the *exploration_decay_rate* to **0.01** to determine the rate at which the *exploration_rate* will decay.
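
To get a feeling for the decay, the following sketch evaluates the exponential decay formula used in the training loop at a few episode numbers. The function name `exploration_rate_at` is made up; the default values are the ones described above.

```python
import math

def exploration_rate_at(episode, min_rate=0.01, max_rate=1.0, decay_rate=0.01):
    """Exploration rate after `episode` episodes of exponential decay."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)

# Print the exploration rate at a few points of the training.
for episode in (0, 100, 500, 1000):
    print(episode, round(exploration_rate_at(episode), 4))
```

With these values the rate falls to roughly 0.37 after 100 episodes and is close to its minimum of 0.01 after a few hundred episodes, so most of a 10000-episode run is spent exploiting.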

**YOUR <font color="FFC300">TASK</font> in this exercise is as follows** (point 7 from the task list above):

All of the above parameters can change!
Your task is to create a *testplan* and tune all parameters by yourself and observe how they influence and change the performance of the algorithm. 
Make notes! They will help you during the exam.

## Heisenberg Example

In [6]:
num_episodes = 10000
max_steps_per_episode = 200

learning_rate = 0.01
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01

Create a list to hold all of the rewards you will get from each episode. 
By means of this you can observe how your game score changes over time.

In [7]:
rewards_all_episodes = []

In the following code section, the entire Q-learning algorithm is implemented as discussed in detail in the AML course. 
When this code is executed, this is exactly where the training will take place. 
* The first for-loop contains everything that happens within a single episode. 
* The second nested loop contains everything that happens for a single time-step.

Read all the red comments, as they contain lots of important information on the implementation.

In [8]:
# Q-learning algorithm

# loop: for a single episode
for episode in range(num_episodes):
    # initialize 'new episode' parameters
    state, info = env.reset()
    ''' The done variable just keeps track of whether or not your episode is finished.
    Initialize it to False when first starting the episode and you will see later where 
    it will get updated to notify you when the episode is over.'''
    done = False
    
    ''' Keep track of the rewards within the current episode as well.
    Hence, set rewards_current_episode = 0 since you start 
    with no rewards at the beginning of each episode.'''
    rewards_current_episode = 0

    # nested loop: for a single time-step
    for step in range(max_steps_per_episode): 
        # Exploration-exploitation trade-off
        '''For each time-step within an episode set your exploration_rate_threshold 
        to a random number between 0 and 1. This will be used to determine whether 
        your agent will explore or exploit the environment in this time-step.'''
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:]) 
        else:
            action = env.action_space.sample()

        # Take new action
        '''After the action is chosen, take it by calling step() on your env object and 
        pass your action to it. In gymnasium, step() returns a 5-tuple: the new state, the reward, 
        whether the episode terminated (hole or goal), whether it was truncated (time limit) and 
        diagnostic information regarding the environment (helpful for debugging).'''
        new_state, reward, terminated, truncated, info = env.step(action)
        # The episode is over if it either terminated or was truncated
        done = terminated or truncated

        # Update Q-table for Q(s,a)
        '''Compare this implementation with the equation in the course slides.'''
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        
        '''Set your current state to the new_state that was returned when taking the last action 
        and then update the rewards from your current episode by adding the reward you received 
        for your previous action.'''
        # Set new state
        state = new_state
        # Add new reward 
        rewards_current_episode += reward 
        '''Then, check to see if your last action ended the episode 
        (game over by agent stepping in a hole or reaching the goal)! 
        If the action did end the episode, then jump out of this loop and start the next episode.
        Otherwise, transition to the next time-step.'''
        if done == True: 
            break
            
    # Exploration rate decay
    '''Once an episode is finished, you need to update your exploration_rate using exponential decay, 
    which just means that the exploration rate decays at a rate proportional to its current value. 
    You can decay the exploration_rate using the formula below, which makes use of all the exploration 
    rate parameters that were defined above in the hyperparameter section.'''
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    
    # Add current episode reward to total rewards list and move on to the next episode
    rewards_all_episodes.append(rewards_current_episode)


### All episodes training completed
After all episodes are finished, calculate the average reward per thousand episodes from the list that contains the rewards for all episodes, so that you can print it out and see how the rewards changed over time.

In [9]:
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes),num_episodes/1000)
count = 1000

print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.02100000000000001
2000 :  0.017000000000000008
3000 :  0.022000000000000013
4000 :  0.014000000000000005
5000 :  0.012000000000000004
6000 :  0.023000000000000013
7000 :  0.024000000000000014
8000 :  0.022000000000000013
9000 :  0.022000000000000013
10000 :  0.022000000000000013
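
The block-wise averaging above can also be sketched without NumPy. The helper name `average_reward_per_block` and the short reward list are made up for this illustration; they are not results from the training run.

```python
def average_reward_per_block(rewards, block_size):
    """Split `rewards` into consecutive blocks and return each block's mean."""
    if len(rewards) % block_size != 0:
        raise ValueError("number of episodes must be a multiple of block_size")
    return [
        sum(rewards[start:start + block_size]) / block_size
        for start in range(0, len(rewards), block_size)
    ]

# Made-up episode rewards (1 = reached the frisbee, 0 = did not).
example_rewards = [0, 1, 0, 0, 1, 1, 0, 1]
win_rates = average_reward_per_block(example_rewards, block_size=4)
print(win_rates)  # [0.25, 0.75]
```

Because every episode ends with a total reward of either 0 or 1, each block average is simply the fraction of episodes won in that block.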


### Interpretation

From this print, you can see how the average reward per thousand episodes developed over time. In this run it barely progressed: the first thousand episodes averaged a reward of about **0.02**, and the last thousand episodes still averaged only about **0.02**.

Let's take a second to understand how you can interpret these results. Your agent played **10000** episodes. At each time step within an episode, the agent received a reward of **1** if it reached the frisbee, otherwise, it received a reward of **0**. If the agent did indeed reach the frisbee, then the episode finished at that time-step.

Hence, that means for each episode, the total reward received by the agent for the entire episode is either **1** or **0**. So, for the first thousand episodes, you can interpret this score as meaning that about **2%** of the time the agent received a reward of **1** and won the episode. And by the last thousand episodes from a total of **10000**, the agent was still winning only about **2%** of the time.

By analyzing the grid of the game, you can see it is a lot more likely that the agent would fall in a hole or perhaps reach the max time steps than it is to reach the frisbee. A win rate this low means that, with these parameter values, the agent has not yet learned a reliable route to the frisbee, which makes this run a useful baseline for the parameter tuning task above.

* SFFF
* FHFH
* FFFH
* HFFG

At last, print out your updated Q-table to see how that has transitioned from its initial state of all zeros.

In [10]:
# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)



********Q-table********

[[5.67056287e-03 1.16721366e-02 5.26768308e-03 4.43822369e-03]
 [5.41430444e-03 7.92474295e-04 5.26853490e-04 1.08740254e-03]
 [3.89178161e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.96664166e-02 3.61056675e-03 3.27552826e-03 3.28986824e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.31433011e-02 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.74314590e-03 3.93068407e-03 3.01676582e-02 3.39537158e-03]
 [6.85613581e-02 4.58193360e-03 4.63012546e-03 2.14699157e-04]
 [1.59567079e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [5.89540593e-04 5.07282907e-03 6.42080050e-04 2.14036242e-01]
 [1.00526930e-04 2.31846499e-04 5.91124489e-01 2.23746286e-03]
 [0.00000000e+00 0.00000000e

## Voßmerbäumer Try (Longer, but more difficult training values)

Since I need to restart the kernel for every run, I repeat the setup cells here.

In [11]:
import numpy as np
import gymnasium as gym
import random
import time
from   IPython.display import clear_output

In [12]:
env = gym.make("FrozenLake-v1")

In [13]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

In [14]:
num_episodes = 100000 # Was 10000
max_steps_per_episode = 200 

learning_rate = 0.005 # Was 0.01 - Since the training is longer, I let it update its Q-values more slowly
discount_rate = 0.999 # Was 0.99 - I put it even closer to one in hopes it would go even more for long term rewards.

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.0001 # Was 0.01 - I am hoping to get it very exploitative to benefit from the right policy (if found)
exploration_decay_rate = 0.001 # Was 0.01 - Since it has more time, I would like to let it explore longer

In [15]:
rewards_all_episodes = []

In [16]:
# Q-learning algorithm

# loop: for a single episode
for episode in range(num_episodes):
    # initialize 'new episode' parameters
    state, info = env.reset()
    ''' The done variable just keeps track of whether or not your episode is finished.
    Initialize it to False when first starting the episode and you will see later where 
    it will get updated to notify you when the episode is over.'''
    done = False
    
    ''' Keep track of the rewards within the current episode as well.
    Hence, set rewards_current_episode = 0 since you start 
    with no rewards at the beginning of each episode.'''
    rewards_current_episode = 0

    # nested loop: for a single time-step
    for step in range(max_steps_per_episode): 
        # Exploration-exploitation trade-off
        '''For each time-step within an episode set your exploration_rate_threshold 
        to a random number between 0 and 1. This will be used to determine whether 
        your agent will explore or exploit the environment in this time-step.'''
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:]) 
        else:
            action = env.action_space.sample()

        # Take new action
        '''After the action is chosen, take it by calling step() on your env object and 
        pass your action to it. In gymnasium, step() returns a 5-tuple: the new state, the reward, 
        whether the episode terminated (hole or goal), whether it was truncated (time limit) and 
        diagnostic information regarding the environment (helpful for debugging).'''
        new_state, reward, terminated, truncated, info = env.step(action)
        # The episode is over if it either terminated or was truncated
        done = terminated or truncated

        # Update Q-table for Q(s,a)
        '''Compare this implementation with the equation in the course slides.'''
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        
        '''Set your current state to the new_state that was returned when taking the last action 
        and then update the rewards from your current episode by adding the reward you received 
        for your previous action.'''
        # Set new state
        state = new_state
        # Add new reward 
        rewards_current_episode += reward 
        '''Then, check to see if your last action ended the episode 
        (game over by agent stepping in a hole or reaching the goal)! 
        If the action did end the episode, then jump out of this loop and start the next episode.
        Otherwise, transition to the next time-step.'''
        if done == True: 
            break
            
    # Exploration rate decay
    '''Once an episode is finished, you need to update your exploration_rate using exponential decay, 
    which just means that the exploration rate decays at a rate proportional to its current value. 
    You can decay the exploration_rate using the formula below, which makes use of all the exploration 
    rate parameters that were defined above in the hyperparameter section.'''
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    
    # Add current episode reward to total rewards list and move on to the next episode
    rewards_all_episodes.append(rewards_current_episode)


### All episodes training completed

In [17]:
# Calculate and print the average reward per ten thousand episodes
rewards_per_ten_thousand_episodes = np.split(np.array(rewards_all_episodes), num_episodes // 10000)
count = 10000

print("********Average reward per ten thousand episodes********\n")
for r in rewards_per_ten_thousand_episodes:
    print(count, ": ", str(sum(r/10000)))
    count += 10000

********Average reward per ten thousand episodes********

10000 :  0.06160000000000074
20000 :  0.070600000000001
30000 :  0.07210000000000104
40000 :  0.070600000000001
50000 :  0.06960000000000097
60000 :  0.070800000000001
70000 :  0.06550000000000085
80000 :  0.07110000000000101
90000 :  0.07640000000000116
100000 :  0.07710000000000118


### Interpretation

Parameters:
num_episodes = 100000 <br>
max_steps_per_episode = 200 

learning_rate = 0.005 <br>
discount_rate = 0.999 <br>
<br>
exploration_rate = 1 <br>
max_exploration_rate = 1 <br>
min_exploration_rate = 0.0001 <br>
exploration_decay_rate = 0.001 <br>
<br>
Even with this high number of episodes, the average reward stayed between roughly 6 % and 8 % for every block of 10,000 episodes. The highest value (about 7.7 %) was reached in the last block.<br>
The low results might have several causes. First, the very high discount rate of 0.999 means that rewards far in the future are valued almost as much as immediate rewards, and the Frozen Lake example has very sparse rewards.<br> <br>
Additionally, the very low minimum exploration rate and the lower exploration decay rate pull in different directions: the low minimum lets the agent eventually stop exploring and exploit the learned policy almost entirely, while the slower decay keeps it exploring for longer, which makes it harder to settle on a good policy.<br>
Lastly, to go deeper into the exploration decay rate (0.001 in the parameter cell of this run): the lower it is, the larger the share of episodes in which the agent explores at random, so it takes longer before the low minimum exploration rate comes into play.<br>
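
To compare decay rates more concretely, the sketch below inverts the exponential decay formula from the training loop and estimates after how many episodes the exploration rate falls to a chosen threshold. The helper name `episodes_until_below`, the threshold of 0.05 and the list of candidate decay rates are assumptions made for this illustration.

```python
import math

def episodes_until_below(threshold, min_rate, max_rate, decay_rate):
    """Number of episodes after which the exploration rate reaches `threshold`."""
    if not min_rate < threshold <= max_rate:
        raise ValueError("threshold must lie in (min_rate, max_rate]")
    # Solve threshold = min + (max - min) * exp(-decay * episode) for episode.
    return math.log((max_rate - min_rate) / (threshold - min_rate)) / decay_rate

# Compare a few candidate decay rates for the same minimum and maximum.
for decay_rate in (0.01, 0.001, 0.0001):
    episodes = episodes_until_below(0.05, min_rate=0.0001, max_rate=1.0,
                                    decay_rate=decay_rate)
    print(decay_rate, round(episodes))
```

Dividing the decay rate by ten stretches the exploration phase by a factor of ten, so the decay rate is best chosen in relation to the total number of episodes.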

In [18]:
# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)



********Q-table********

[[8.14789449e-03 7.58162516e-02 6.38858643e-03 7.07046929e-03]
 [9.10616271e-04 6.95958724e-02 2.96475542e-03 2.07634835e-03]
 [6.75189592e-03 5.55731872e-03 1.17482573e-01 2.99989148e-03]
 [3.04160590e-03 2.13201969e-03 2.00771470e-03 1.18536191e-01]
 [8.33722767e-02 1.78197796e-03 2.36606219e-03 3.74819868e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.11798533e-01 2.90507432e-03 3.24350056e-03 2.34213922e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.06636071e-03 4.72308014e-03 9.17465292e-02 2.19053817e-03]
 [3.26028458e-03 2.02671970e-01 3.98696041e-03 2.21035230e-03]
 [3.69481102e-03 2.30408013e-01 9.42830802e-03 1.13038403e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [2.45184943e-03 2.83306175e-01 6.80278247e-03 5.00751007e-03]
 [4.32039289e-03 5.43508548e-02 5.03250303e-02 4.88628692e-01]
 [0.00000000e+00 0.00000000e

## Voßmerbäumer Try 2 ("Ideal" Parameters) 

### The parameters from the Heisenberg example seem to be very nice already, but I will try it with 10,000 more episodes

In [19]:
import numpy as np
import gymnasium as gym
import random
import time
from   IPython.display import clear_output

In [20]:
env = gym.make("FrozenLake-v1")

In [21]:
action_space_size = env.action_space.n
state_space_size = env.observation_space.n

q_table = np.zeros((state_space_size, action_space_size))

In [22]:
num_episodes = 20000
max_steps_per_episode = 200

learning_rate = 0.01
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.01
exploration_decay_rate = 0.01

In [23]:
rewards_all_episodes = []

In [24]:
# Q-learning algorithm

# loop: for a single episode
for episode in range(num_episodes):
    # initialize 'new episode' parameters
    state, info = env.reset()
    ''' The done variable just keeps track of whether or not your episode is finished.
    Initialize it to False when first starting the episode and you will see later where 
    it will get updated to notify you when the episode is over.'''
    done = False
    
    ''' Keep track of the rewards within the current episode as well.
    Hence, set rewards_current_episode = 0 since you start 
    with no rewards at the beginning of each episode.'''
    rewards_current_episode = 0

    # nested loop: for a single time-step
    for step in range(max_steps_per_episode): 
        # Exploration-exploitation trade-off
        '''For each time-step within an episode set your exploration_rate_threshold 
        to a random number between 0 and 1. This will be used to determine whether 
        your agent will explore or exploit the environment in this time-step.'''
        exploration_rate_threshold = random.uniform(0, 1)
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state,:]) 
        else:
            action = env.action_space.sample()

        # Take new action
        '''After the action is chosen, take it by calling step() on your env object and 
        pass your action to it. In gymnasium, step() returns a 5-tuple: the new state, the reward, 
        whether the episode terminated (hole or goal), whether it was truncated (time limit) and 
        diagnostic information regarding the environment (helpful for debugging).'''
        new_state, reward, terminated, truncated, info = env.step(action)
        # The episode is over if it either terminated or was truncated
        done = terminated or truncated

        # Update Q-table for Q(s,a)
        '''Compare this implementation with the equation in the course slides.'''
        q_table[state, action] = q_table[state, action] * (1 - learning_rate) + learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))
        
        '''Set your current state to the new_state that was returned when taking the last action 
        and then update the rewards from your current episode by adding the reward you received 
        for your previous action.'''
        # Set new state
        state = new_state
        # Add new reward 
        rewards_current_episode += reward 
        '''Then, check to see if your last action ended the episode 
        (game over by agent stepping in a hole or reaching the goal)! 
        If the action did end the episode, then jump out of this loop and start the next episode.
        Otherwise, transition to the next time-step.'''
        if done == True: 
            break
            
    # Exploration rate decay
    '''Once an episode is finished, you need to update your exploration_rate using exponential decay, 
    which just means that the exploration rate decays at a rate proportional to its current value. 
    You can decay the exploration_rate using the formula below, which makes use of all the exploration 
    rate parameters that were defined above in the hyperparameter section.'''
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate*episode)
    
    # Add current episode reward to total rewards list and move on to the next episode
    rewards_all_episodes.append(rewards_current_episode)


### All episodes training completed

In [25]:
# Calculate and print the average reward per thousand episodes
rewards_per_thousand_episodes = np.split(np.array(rewards_all_episodes),num_episodes/1000)
count = 1000

print("********Average reward per thousand episodes********\n")
for r in rewards_per_thousand_episodes:
    print(count, ": ", str(sum(r/1000)))
    count += 1000

********Average reward per thousand episodes********

1000 :  0.03200000000000002
2000 :  0.04000000000000003
3000 :  0.04900000000000004
4000 :  0.03800000000000003
5000 :  0.04100000000000003
6000 :  0.04900000000000004
7000 :  0.047000000000000035
8000 :  0.03400000000000002
9000 :  0.03400000000000002
10000 :  0.028000000000000018
11000 :  0.03000000000000002
12000 :  0.04100000000000003
13000 :  0.035000000000000024
14000 :  0.04500000000000003
15000 :  0.03800000000000003
16000 :  0.02900000000000002
17000 :  0.04500000000000003
18000 :  0.04100000000000003
19000 :  0.04900000000000004
20000 :  0.05600000000000004


In [26]:
# Print updated Q-table
print("\n\n********Q-table********\n")
print(q_table)



********Q-table********

[[2.25200316e-02 2.28478236e-02 2.17351492e-02 3.60819985e-02]
 [9.84761299e-03 3.85417898e-02 1.67213008e-02 2.07172026e-02]
 [2.62634472e-02 2.72219938e-02 8.56834628e-02 2.74245119e-02]
 [1.98814837e-02 2.43719198e-02 2.13697089e-02 7.72860489e-02]
 [1.94142654e-02 1.49317446e-04 2.94455917e-09 1.49254119e-09]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.95033981e-02 1.14030620e-01 1.35575845e-02 7.87872784e-03]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.96223749e-14 2.42347146e-15 3.56558838e-11 1.72377003e-02]
 [6.45300001e-06 0.00000000e+00 0.00000000e+00 3.97802244e-02]
 [2.35045126e-02 1.90269465e-02 2.90725418e-01 8.61466397e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [5.63772556e-03 3.35992620e-01 2.89011195e-03 2.44237113e-04]
 [4.36312707e-02 6.88333232e-01 5.69096821e-02 4.29842422e-02]
 [0.00000000e+00 0.00000000e

### Interpretation
Despite increasing the number of episodes to 20,000 while maintaining the "ideal" parameters (learning_rate=0.01, discount_rate=0.99, min_exploration_rate=0.01, exploration_decay_rate=0.01), the agent's performance in the Frozen Lake environment shows a modest improvement but remains relatively low. The average reward per thousand episodes peaked at approximately 5.6% by the end of the training.

This suggests that while the agent is learning, it is not consistently finding the optimal policy to reach the goal, or that the environment's inherent stochasticity makes it difficult to achieve very high win rates with these parameters. Q-learning is a model-free approach, and in environments with high stochasticity and sparse rewards like Frozen Lake, it seems to take significantly longer to achieve robust performance.
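
One way to check what the agent has actually learned is to read the greedy policy out of the Q-table, i.e. the highest-valued action per state. The sketch below is an illustration under assumptions: it uses plain lists instead of NumPy, made-up Q-values rather than the values printed above, and it assumes the action encoding 0 = left, 1 = down, 2 = right, 3 = up.

```python
# Action encoding assumed here: 0 = left, 1 = down, 2 = right, 3 = up.
ACTION_NAMES = ["left", "down", "right", "up"]

def greedy_policy(q_table):
    """Return the highest-valued action name per state, or None for all-zero rows."""
    policy = []
    for q_row in q_table:
        if all(value == 0 for value in q_row):
            policy.append(None)  # state never updated (e.g. a hole or the goal)
        else:
            best = max(range(len(q_row)), key=lambda a: q_row[a])
            policy.append(ACTION_NAMES[best])
    return policy

# Made-up Q-values for three states, not taken from the runs above.
example_q_table = [
    [0.02, 0.03, 0.01, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.10, 0.05, 0.40, 0.20],
]
print(greedy_policy(example_q_table))  # ['down', None, 'right']
```

Comparing such a policy with the grid shows quickly whether the preferred actions steer away from the holes or whether the Q-values are still too noisy to encode a sensible route.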
