# Notebook 10 - Reinforcement Learning / Self-Driving Cab

CSI4106 Artificial Intelligence   
Fall 2020  
Prepared by Julian Templeton and Caroline Barrière

***INTRODUCTION***:  
In this notebook we will explore the use of Reinforcement Learning to help an agent solve a specific task in an environment provided by [OpenAI's Gym library](https://gym.openai.com/). This library provides a number of environments that we can train an AI to master. Within this notebook we will explore a scenario in which a taxi located on a grid must be controlled by an agent to pick up a passenger located in one of four positions and drop the passenger off in one of the three other positions.

To familiarize yourself with the Self-Driving Cab problem tackled in this notebook, please go to https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/ and read section 1 (rewards), section 2 (state space), which explains why there are 500 possible states, and section 3 (action space), which describes the possible actions.

Throughout the notebook we will be working with a Baseline approach and a Q-Learning-based approach. This will provide insight into how Q-Learning can be applied to problems and how an agent can use Reinforcement Learning to solve problems in an environment.

**When submitting this notebook, ensure that you do NOT reset the outputs from running the code (plus remember to save the notebook with ctrl+s).**      

**In order to keep the installation easy, you will be once again running this notebook in Google Colab, NOT on your local machine.**    

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, sign the notebook (at the end of the notebook), and submit it.

*The notebook will be marked on 30.  
Each **(TO DO)** has a number of points associated with it.*
***

**1.0 - Setting up the Taxi Game**   

To begin the notebook, we will need to set up and explore the environment that our agent will be working with. OpenAI's Gym provides many different experiments to use. These range from balancing acts to self-driving cars to playing a simple Atari game. Unfortunately, not every option available to us can be easily worked with. Many can take hours of training before showing exciting results. Each of these experiments uses agents that can be trained by Reinforcement Learning to master the specified task. The methods used can range from the simple use of Q-Learning to the more complex use of one or more Deep Learning models that work in conjunction with Reinforcement Learning techniques.

One simple, yet interesting, experiment involves an AI-controlled taxi that must pick up and drop off a passenger. This is the problem that we will be exploring throughout the notebook. The code used throughout the notebook comes from [this example](https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/) and has been modified accordingly.

To start, we will install some of the packages that we will need to run the program.

In [1]:
# Install the necessary libraries
!pip install cmake 'gym[atari]' scipy



In [2]:
# Import the necessary libraries
import random
import gym
import numpy as np
from IPython.display import clear_output

With all of the libraries installed, we will now make use of the Taxi program provided by Gym. Below we load the program as the active environment and render an image representing the current state of the program.

From the image seen below, there are four key locations in the environment, represented by *R*, *G*, *B*, and *Y*. The letter bolded in blue represents where the passenger needs to be picked up and the letter in purple represents where the passenger wants to be dropped off. The yellow block represents the cell in which the taxi cab is currently located. Therefore, the taxi cab must first pick up the passenger and then drop them off at the dropoff location. When a passenger is in the taxi, the taxi turns green until the passenger is dropped off.

In [3]:
# Load the environment
env = gym.make("Taxi-v3").env
# Render the current state of the program
env.render()

+---------+
|[43mR[0m: | : :[35mG[0m|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[34;1mB[0m: |
+---------+



Next we will reset the state of the environment and re-render the current state. We also print the total number of actions available to our agent (defined as the *Action Space*) and the *State Space*, which represents the state of the program (the locations of the cab, the passenger, and the dropoff point).

In [4]:
env.reset() # reset environment to a new, random state
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

+---------+
|[34;1mR[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35m[43mB[0m[0m: |
+---------+

Action Space Discrete(6)
State Space Discrete(500)


Intuitively, we want our agent to learn which action to take given a specific state, that is, which action should be taken based on where the taxi cab is located in relation to the passenger location and dropoff location. The six possible actions that the taxi can take at a given time step are:

Action = 0: Head south    
Action = 1: Head north    
Action = 2: Head east    
Action = 3: Head west    
Action = 4: Pickup     
Action = 5: Dropoff    
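
When reading logged actions later in the notebook, it can help to map the action indices listed above to names. The following is a minimal sketch; `ACTION_NAMES` and `describe_action` are hypothetical helpers introduced here for illustration and are not part of Gym.

```python
# Hypothetical helper (not part of Gym): map Taxi action indices to labels,
# following the list of six actions given in this notebook.
ACTION_NAMES = {
    0: "South",
    1: "North",
    2: "East",
    3: "West",
    4: "Pickup",
    5: "Dropoff",
}

def describe_action(action):
    """Return a human-readable label for an action index."""
    return ACTION_NAMES.get(action, f"Unknown action {action}")

print(describe_action(5))  # Dropoff
```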

Below is an example of setting the state to a specific encoding and rendering that state.

In [6]:
# The encoding below represents: (taxi row, taxi column, passenger index, destination index)
state = env.encode(3, 1, 2, 0) 
print("State:", state)

env.s = state
env.render()

State: 328
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+



The following example showcases how a state with the passenger within the taxi can be set.

In [7]:
# The encoding below represents: (taxi row, taxi column, passenger index, destination index)
state = env.encode(0, 1, 4, 0) 
print("State:", state)

env.s = state
env.render()

State: 36
+---------+
|[35mR[0m:[42m_[0m| : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
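
The two state numbers printed above (328 and 36) are consistent with a simple mixed-radix encoding over 5 rows, 5 columns, 5 passenger locations, and 4 destinations, which also accounts for the 500 states. The sketch below reproduces that arithmetic in plain Python. `encode_state` and `decode_state` are hypothetical names used for illustration, not Gym functions, and the index-to-letter mapping is inferred from the renders in this notebook.

```python
# Hypothetical re-implementation of the state arithmetic (an assumption
# inferred from the outputs above, not Gym's own code).
LOCATION_NAMES = ["R", "G", "Y", "B"]  # indices 0-3; passenger index 4 = in taxi

def encode_state(taxi_row, taxi_col, pass_idx, dest_idx):
    """Pack the four components into a single integer in [0, 500)."""
    return ((taxi_row * 5 + taxi_col) * 5 + pass_idx) * 4 + dest_idx

def decode_state(state):
    """Invert encode_state: return (taxi_row, taxi_col, pass_idx, dest_idx)."""
    rest, dest_idx = divmod(state, 4)
    rest, pass_idx = divmod(rest, 5)
    taxi_row, taxi_col = divmod(rest, 5)
    return taxi_row, taxi_col, pass_idx, dest_idx

print(encode_state(3, 1, 2, 0))  # 328
print(encode_state(0, 1, 4, 0))  # 36
print(decode_state(328))         # (3, 1, 2, 0)
print(5 * 5 * 5 * 4)             # 500
```

Decoding a state number this way is a quick check that a rendered grid matches the state that was intended.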



**(TO DO) Q1**    
Now that we have seen how to set a state via an encoding, you will need to set the state to match the descriptions below and render them.   
a) Set the passenger to be at position G, with the passenger wanting to be dropped off at position R, and the taxi positioned at a random point on the grid (the selected position of the taxi must be selected randomly). After setting the position, render the state.          
b) Set the passenger to be in the taxi (at any position without a letter on it) and set the passenger dropoff point to be position B. After setting the position, render the state.

**(TO DO) Q1 (a) - 2 marks**    
a) Set the passenger to be at position G, with the passenger wanting to be dropped off at position R, and the taxi positioned at a random point on the grid (the selected position of the taxi must be selected randomly). After setting the position, render the state.   

In [13]:
# TODO (remember to use random coordinates within the grid for the taxi)...
taxi_row = random.randint(0,4)
taxi_col = random.randint(0,4)
pass_ind = 1
dest_ind = 0


state = env.encode(taxi_row, taxi_col, pass_ind, dest_ind) 
print("State:", state)

env.s = state
env.render()

State: 264
+---------+
|[35mR[0m: | : :[34;1mG[0m|
| : | : : |
| : : :[43m [0m: |
| | : | : |
|Y| : |B: |
+---------+



**(TO DO) Q1 (b) - 2 marks**    
b) Set the passenger to be in the taxi (at any position without a letter on it) and set the passenger dropoff point to be position B. After setting the position, render the state.

In [22]:
# TODO ...

taxi_row = random.randint(1,3)
taxi_col = random.randint(1,2)

dest_ind = 3


state = env.encode(taxi_row, taxi_col, 4, dest_ind) 
print("State:", state)

env.s = state
env.render()

State: 139
+---------+
|R: | : :G|
| :[42m_[0m| : : |
| : : : : |
| | : | : |
|Y| : |[35mB[0m: |
+---------+



For every action that the taxi can take, we have a list representing the key information with respect to what will happen when an action is performed. After performing an action, the agent will receive a reward or a penalty. This reward or penalty will tell the agent how good or bad their decision to perform the specified action was.     

Below we display a dictionary that contains all possible actions along with the following information within the corresponding tuples:     

(     
  The probability of this outcome when the action is taken (1.0 for every entry shown below),  
  The resulting state after taking that action,  
  The reward for taking that action,  
  Whether or not the episode ends after performing the action  
)      

Example tuple: (1.0, 328, -1, False)

In [24]:
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

Although not displayed by the code above, if the taxi is holding the passenger and is over the dropoff point, the reward for the dropoff action is 20.
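
To make the tuples concrete, the sketch below walks over the transition table for state 328 and sorts the actions by what they do. The table is copied by hand from the output above so that the example runs without Gym; the variable names are hypothetical.

```python
# Transition table for state 328, copied from the env.P[328] output above.
transitions_328 = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}

current_state = 328
illegal_actions = []  # actions penalised with a reward of -10
blocked_moves = []    # moves that cost -1 but leave the state unchanged

for action, outcomes in transitions_328.items():
    for probability, next_state, reward, done in outcomes:
        if reward == -10:
            illegal_actions.append(action)
        elif next_state == current_state:
            blocked_moves.append(action)

print(illegal_actions)  # [4, 5]
print(blocked_moves)    # [3]
```

In this state both pickup and dropoff are penalised, and heading west leaves the taxi where it is, which matches the wall drawn to the left of the taxi in the render.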

**2.0 - Baseline Approach to the Taxi Game**   

To start, we will perform the simulation of the taxi cab scenario with a baseline approach that does not use Q-Learning. This approach will simply work by selecting a random available action at each time step, regardless of the current state. We will also prepare a method of playing through all frames within an episode to view how the agent controls the taxi in the scenario.

In [25]:
def run_single_simulation_baseline(env, state, disable_prints=False):
    '''
    Given the environment and a specific state, randomly select an action for the taxi
    to perform until the goal is completed.
    '''
    if not disable_prints:
        print("Testing for simulation: {}".format(state))
    # Set the state of the environment
    env.s = state
    # Used to hold all information for a single time step (including the image data)
    frames = []
    # Used to determine when the simulation has been completed
    done = False
    # Tracks the number of time steps that the simulation has run for
    time_steps = 0
    # The total values used to determine how many times the agent mistakenly
    # picks up no one or attempts to dropoff no passenger or attempts to
    # dropoff a passenger in the wrong position.
    penalties, reward = 0, 0
    # Run until the passenger has been picked up and dropped off in the target location
    while not done:
        # Perform a random action from the set of available actions in the environment
        action = env.action_space.sample()
        # From performing the action, retrieve the new state, the reward from taking the action,
        # whether the simulation is complete, and other information from performing the action.
        state, reward, done, info = env.step(action)
        # If an incorrect dropoff or pickup is performed, increment the penalty count
        if reward == -10:
            penalties += 1
        # Put each rendered frame into dict to use for animating the process and
        # tracking the details over the run
        frames.append({
            'frame': env.render(mode='ansi'),
            'state': state,
            'action': action,
            'reward': reward
            }
        )
        # Increment the time step count
        time_steps += 1
    # State the total number of steps taken and the total penalties that have occurred.
    if not disable_prints:
        print("Timesteps taken: {}".format(time_steps))
        print("Penalties incurred: {}".format(penalties))
    # Return the frame data, the total penalties, and the total time steps
    return frames, penalties, time_steps

With the baseline approach defined, we will run a test with this approach to see how long it takes an agent using this approach to find a solution for simulation 328 and how many major penalties the agent receives.

In [44]:
state = 328
# Run a test and collect all frames from the run
frames, _, _ = run_single_simulation_baseline(env, state)

Testing for simulation: 328
Timesteps taken: 354
Penalties incurred: 108


After performing a simulation and retrieving the results, we can pass the frames obtained from the simulation to the *print_frames* function below to display an animation of all frames, along with the information recorded at the time step that each frame corresponds to.

For the first episode that you view, it is recommended to run through the entire process at a slower speed (such as 0.3 or 0.5 in the sleep call). However you are free to increase the speed of the process by reducing the number in the sleep function call in the *print_frames* function below.

In [46]:
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    '''
    For each frame, show the frame and display the timestep it occurred at,
    the number of the active state, the action selected, and the corresponding reward.
    '''
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        # Can adjust speed here
        sleep(.01)
# Print the frames from the episode
print_frames(frames)

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 354
State: 0
Action: 5
Reward: 20


**(TO DO) Q2**   
a) Using the state defined from Q1 (a), retrieve the corresponding frames obtained from using the baseline approach above. Then display those frames.     
b) Using the state defined from Q1 (b), retrieve the corresponding frames obtained from using the baseline approach above. Then display those frames.  

**(TO DO) Q2 (a) - 2 marks**   
a) Using the state defined from Q1 (a), retrieve the corresponding frames obtained from using the baseline approach above. Then display those frames.

In [47]:
# TODO: Retrieve the corresponding frames from running the simulation starting from the state found in Q1 (a). Then show those frames.
state = 264
# Run a test and collect all frames from the run
frames, _, _ = run_single_simulation_baseline(env, state)
print_frames(frames)

+---------+
|[35m[34;1m[43mR[0m[0m[0m: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (Dropoff)

Timestep: 1609
State: 0
Action: 5
Reward: 20


**(TO DO) Q2 (b) - 2 marks**   
b) Using the state defined from Q1 (b), retrieve the corresponding frames obtained from using the baseline approach above. Then display those frames.

In [48]:
# TODO: Retrieve the corresponding frames from running the simulation starting from the state found in Q1 (b). Then show those frames.
state = 139
# Run a test and collect all frames from the run
frames, _, _ = run_single_simulation_baseline(env, state)
print_frames(frames)

+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |[35m[34;1m[43mB[0m[0m[0m: |
+---------+
  (Dropoff)

Timestep: 997
State: 475
Action: 5
Reward: 20


With the ability to simulate single runs of an episode with the Baseline approach, we will now define a function that we will use to evaluate the general performance of the baseline model when averaged over many episodes. The *evaluate_agent_baseline* function below accepts as input the total number of randomly selected episodes to run along with the environment, runs the random episodes, displays the average number of timesteps taken per episode along with the average penalties incurred, and returns the frame data.

***If the evaluate_agent_baseline function ever seems to be running for far too long (several minutes, not just one), stop the run by clicking the button at the top-left of the code cell being executed and run it again.***

In [49]:
def evaluate_agent_baseline(episodes, env):
    '''
    Given a number of episodes and an environment, run the specified
    number of episodes, where each run begins with a random state, display the
    average timesteps per episode and the average penalties per episode, and output
    the frames to be displayed.
    '''
    total_time_steps, total_penalties = 0, 0
    frames = []
    # Run through the total number of episodes
    for _ in range(episodes):
        # Get a random state
        state = env.reset()
        # Run the simulation, obtaining the results
        frame_data, penalties, time_steps = run_single_simulation_baseline(env, state, True)
        # Update the tracked data over all simulations
        total_penalties += penalties
        total_time_steps += time_steps
        frames = frames + frame_data
    print(f"Results after {episodes} episodes:")
    print(f"Average timesteps per episode: {total_time_steps / episodes}")
    print(f"Average penalties per episode: {total_penalties / episodes}")
    return frames

**(TO DO) Q3**    
a) Use the *evaluate_agent_baseline* function defined above to run through 100 random episodes for the environment.     
b) From the output seen from Q3 (a), how did the Baseline approach do and why do you think that it performed well or poorly? Explain with respect to the average timesteps per episode and the average penalties per episode.       
c) Without moving to a Reinforcement Learning approach, how can the Baseline approach be modified to perform slightly better?

**(TO DO) Q3 (a) - 1 mark**    
a) Use the *evaluate_agent_baseline* function defined above to run through 100 random episodes for the environment.   

In [51]:
# TODO ...
f = evaluate_agent_baseline(100, env)

Results after 100 episodes:
Average timesteps per episode: 2231.29
Average penalties per episode: 721.97


**(TO DO) Q3 (b) - 1 mark**    
b) From the output seen from Q3 (a), how did the Baseline approach do and why do you think that it performed well or poorly? Explain with respect to the average timesteps per episode and the average penalties per episode.     

TODO ...    

The Baseline approach performed very poorly. Averaged over 100 random episodes it needed 2231.29 timesteps and incurred 721.97 penalties per episode, which is far worse than the earlier single run from state 328 (354 timesteps and 108 penalties). Because every action is chosen at random, regardless of the current state, the taxi wanders the grid and repeatedly attempts illegal pickups and dropoffs, which explains both the high timestep count and the high penalty count.

**(TO DO) Q3 (c) - 1 mark**    
c) Without moving to a Reinforcement Learning approach, how can the Baseline approach be modified to perform slightly better?

TODO ...  
We could hard-code rules so that the agent:   
*   avoids dropping the passenger off before reaching the destination point
*   avoids attempting a pickup at a point where no passenger is waiting
*   avoids attempting a dropoff when there is nobody in the cab
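
The rules above could be sketched as a filter applied before the random choice. This is only an illustration under stated assumptions: the coordinates of R, G, Y, and B are read from the rendered map, passenger index 4 means "in the taxi" as in the earlier example, and `allowed_actions` and `choose_baseline_action` are hypothetical helpers that are not part of the notebook or of Gym.

```python
import random

# Grid coordinates (row, column) of R, G, Y, B, read from the rendered map.
LOCATIONS = [(0, 0), (0, 4), (4, 0), (4, 3)]
IN_TAXI = 4  # passenger index used when the passenger is inside the taxi

def allowed_actions(taxi_row, taxi_col, pass_idx, dest_idx):
    """Return the actions that do not break the hard-coded rules."""
    actions = [0, 1, 2, 3]  # the four movement actions are always allowed
    taxi = (taxi_row, taxi_col)
    # Pickup (4) only where the passenger is actually waiting
    if pass_idx != IN_TAXI and taxi == LOCATIONS[pass_idx]:
        actions.append(4)
    # Dropoff (5) only with a passenger on board and at the destination
    if pass_idx == IN_TAXI and taxi == LOCATIONS[dest_idx]:
        actions.append(5)
    return actions

def choose_baseline_action(taxi_row, taxi_col, pass_idx, dest_idx, rng=random):
    """Pick a random action among those permitted by the rules."""
    return rng.choice(allowed_actions(taxi_row, taxi_col, pass_idx, dest_idx))

print(allowed_actions(3, 1, 2, 0))  # [0, 1, 2, 3]
print(allowed_actions(4, 0, 2, 0))  # [0, 1, 2, 3, 4]
print(allowed_actions(0, 0, 4, 0))  # [0, 1, 2, 3, 5]
```

Movement remains fully random under this sketch, so it would only remove the -10 penalties, not shorten the wandering.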

**3.0 - Training an Agent with Q-Learning to play the Taxi Game**   

Now that we have had an agent use the baseline model to complete the taxi simulation, we will have the agent use Q-Learning to try applying a Reinforcement Learning approach to the problem. To start the process, we will create a matrix of Q values for each action-state possibility (initializing it as zero). The agent will update this matrix when training and will need the matrix reset whenever the agent wants to reset its training.

In [56]:
# Initialize the table of Q values for the state-action pairs
q_table = np.zeros([env.observation_space.n, env.action_space.n])

With the matrix of Q values initialized, we will now define the training function that adjusts the Q values within *q_table*. The training process consists of running through a number of random simulations and updating the Q values for each state via Q-Learning.    

There are a number of hyperparameters used by the training function:    

- *alpha*: Learning parameter (you will need to describe it in a later question).   
- *gamma*: The long term reward discount parameter.    
- *epsilon*: Exploitation/Exploration parameter (you will need to describe it in a later question).  
- *num_simulations*: The number of random episodes generated for the agent to use when updating its Q values.    

Thus, by running through this algorithm, an agent can learn which Q-values to use when working with other episodes.
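
As a small worked example of a single Q-Learning update, the sketch below applies the same formula that `train_agent` uses to one state-action pair. `q_update` is a hypothetical helper introduced only for this illustration, and the input values are chosen by hand.

```python
def q_update(old_value, reward, next_max, alpha, gamma):
    """Blend the old Q value with the new estimate (reward + discounted best next value)."""
    return (1 - alpha) * old_value + alpha * (reward + gamma * next_max)

# First visit to a state-action pair: the table is all zeros and the step costs -1.
print(q_update(0.0, -1, 0.0, alpha=0.1, gamma=0.5))   # -0.1

# Same situation, but the action was an illegal pickup or dropoff (reward -10).
print(q_update(0.0, -10, 0.0, alpha=0.1, gamma=0.5))  # -1.0

# A larger alpha moves the Q value further towards the new estimate in one update.
print(q_update(0.0, -1, 0.0, alpha=0.7, gamma=0.5))   # -0.7
```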

In [52]:
def train_agent(alpha, gamma, epsilon, num_simulations):
    '''
    Trains an agent by updating its Q values for a total of num_simulations
    episodes with the alpha, gamma, and epsilon hyperparameters. 
    '''
    # For plotting metrics
    all_time_steps = []
    all_penalties = []
    # Generate the specified number of episodes
    for i in range(1, num_simulations + 1):
        # Generate a new state by resetting it
        state = env.reset()
        # Variables tracked (time steps, total penalties, the reward value)
        time_steps, penalties, reward = 0, 0, 0
        done = False
        # Run the simulation 
        while not done:
            # Select a random action if the number randomly drawn from a
            # uniform distribution is less than epsilon
            if random.uniform(0, 1) < epsilon:
                action = env.action_space.sample() # Explore action space
            # Otherwise use the currently learned Q values
            else:
                action = np.argmax(q_table[state]) # Exploit learned values
            # Retrieve the relevant information after performing the action
            next_state, reward, done, info = env.step(action) 
            # Retrieve the old Q value and the maximum Q value from the next state
            old_value = q_table[state, action]
            next_max = np.max(q_table[next_state])
            # Update the current Q value
            new_value = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
            q_table[state, action] = new_value
            # Track anytime an incorrect dropoff or pickup is made
            if reward == -10:
                penalties += 1
            # Proceed to the next state and time step
            state = next_state
            time_steps += 1
        # Display progress for each 100 episodes
        if i % 100 == 0:
            clear_output(wait=True)
            print(f"Episode: {i}")
    print("Training finished.\n")

We now use the training function with a set of hyperparameters to train the agent with Q-Learning to potentially improve performance over time.

In [57]:
# Hyperparameters
alpha = 0.1
gamma = 0.5
epsilon = 0.1
num_simulations = 100000
# Train the agent
train_agent(alpha, gamma, epsilon, num_simulations)

Episode: 100000
Training finished.



After the training, we can look at the Q-values that have been obtained in our state-action table for a specific state. Below we see that the Q-values for the six possible actions available in state 328 have been updated accordingly.

**(TO DO) Q4 - 2 marks**     
Below we print the Q-values that are available for the six actions at state 328 and we render that state to view it. Based on the available Q-values (assuming we are in exploitation mode), which action would be the next to be selected (or if there are ties, list all possible actions that would be considered)? Do any of the actions that contain larger Q values seem problematic if they were selected? Why or why not?

In [58]:
print(q_table[328])
env.s = 328
env.render()

[ -1.9864243   -1.95703125  -1.98470273  -1.97617815 -10.74944318
 -10.49604289]
+---------+
|[35mR[0m: | : :G|
| : | : : |
| : : : : |
| |[43m [0m: | : |
|[34;1mY[0m| : |B: |
+---------+
  (Dropoff)


TODO ... 
The next action to be selected is the one with index 1 (North), since it has the largest Q value (-1.95703125) and there are no ties.  
The actions with index 4 (Pickup) and 5 (Dropoff) have by far the most negative Q values. In state 328 neither action is appropriate, so each one would incur a large immediate negative reward (-10).
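
The greedy choice can be checked directly from the printed Q-values. The sketch below also compares the Q-value for North with the discounted return of one candidate route, assuming that the shortest route from this state takes 10 actions (four moves to reach Y, a pickup, four moves to reach R, and the dropoff), as read from the rendered grid, and using gamma = 0.5 from the training run. The variable names are hypothetical.

```python
# Q-values for state 328, copied from the output above.
q_328 = [-1.9864243, -1.95703125, -1.98470273, -1.97617815, -10.74944318, -10.49604289]

# Greedy selection: every action whose Q-value equals the maximum.
best = max(q_328)
greedy_actions = [action for action, q in enumerate(q_328) if q == best]
print(greedy_actions)  # [1]

# Discounted return of the assumed 10-action route:
# nine steps with reward -1, then the successful dropoff with reward 20.
gamma = 0.5
rewards = [-1] * 9 + [20]
discounted_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(discounted_return)  # -1.95703125
```

The discounted return matches the printed Q-value for North, which suggests that the table has converged for that action under the stated route assumption.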


With the training complete, we can now evaluate the Q-Learning approach in a manner similar to how we evaluated the Baseline approach. By passing the number of episodes to test for and the environment, we generate that number of random episodes and average the results obtained from running the Q-Learning approach to complete the episodes. Note that, unlike in training, the hyperparameters are not used here. The agent simply uses the maximum Q-value to determine which action to take at a given time step.   

***If the evaluate_agent_QL function ever seems to be running for far too long (a minute or more), stop the run by clicking the button at the top-left of the code cell being executed and run it again. This occurs because the training was insufficient at setting valid Q values and resulted in a dead-end for a specific state.***
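
One possible guard against such dead-ends is a cap on the number of steps per episode. The sketch below shows the idea on a toy one-state problem so that it runs without Gym. `greedy_action`, `run_greedy_episode`, `toy_step`, and the `max_steps` parameter are all hypothetical and are not part of the notebook's functions.

```python
def greedy_action(q_values):
    """Return the index of the largest Q value (first index on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def run_greedy_episode(q_table, step, state, max_steps=200):
    """Follow the greedy policy until `step` reports done or max_steps is hit.

    `step(state, action)` is a hypothetical stand-in for env.step and must
    return (next_state, reward, done).
    """
    time_steps, done = 0, False
    while not done and time_steps < max_steps:
        action = greedy_action(q_table[state])
        state, _reward, done = step(state, action)
        time_steps += 1
    return time_steps, done

# Toy one-state problem: action 0 stays put forever, action 1 ends the episode.
def toy_step(state, action):
    return state, -1, action == 1

untrained = [[0.0, 0.0]]  # the tie resolves to action 0, so the agent never finishes
trained = [[-2.0, 20.0]]  # action 1 is clearly better

print(run_greedy_episode(untrained, toy_step, 0))  # (200, False)
print(run_greedy_episode(trained, toy_step, 0))    # (1, True)
```

Returning the `done` flag alongside the step count lets the caller tell a capped episode apart from a solved one.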

In [60]:
def evaluate_agent_QL(episodes, env):
    '''
    Given a number to specify how many random states to run and the environment to use,
    display the averaged metrics obtained from the tests and return the frames obtained from the tests.
    '''
    total_time_steps, total_penalties = 0, 0
    frames = []
    for _ in range(episodes):
        # Generate a random state to use
        state = env.reset()
        # The information collected throughout the run
        time_steps, penalties, reward = 0, 0, 0
        # Determines when the episode is complete
        done = False
        # Run through the episode until complete
        while not done:
            # Select the action containing the maximum Q value
            action = np.argmax(q_table[state])
            # Run that action and retrieve the reward and other details
            state, reward, done, info = env.step(action)

            # Put each rendered frame into dict for animation
            frames.append({
                'frame': env.render(mode='ansi'),
                'state': state,
                'action': action,
                'reward': reward
                }
            )
            # Specify whether the agent incorrectly chose to pick up or dropoff a passenger
            if reward == -10:
                penalties += 1
            # Increment the current time step
            time_steps += 1
        # Track the totals
        total_penalties += penalties
        total_time_steps += time_steps
    # Display the performance over the tests
    print(f"Results after {episodes} episodes:")
    print(f"Average timesteps per episode: {total_time_steps / episodes}")
    print(f"Average penalties per episode: {total_penalties / episodes}")
    # Return the frames to allow a user to view the runs
    return frames

**(TO DO) Q5**     
a) Run the *evaluate_agent_QL* for 100 episodes to retrieve the average number of time steps and the average penalty after training.     
b) Given your results from Q5 (a), how do the observed results from the tests compare to the tests from the Baseline model in Q3 (a)? Specifically, which agent performs better with respect to the average number of penalties throughout the tests and which agent is able to solve the problems quicker (on average).    

**(TO DO) Q5 (a) - 1 mark**     
a) Run the *evaluate_agent_QL* for 100 episodes to retrieve the average number of time steps and the average penalty after training.     

In [67]:
# TODO ...
ql = evaluate_agent_QL(100, env)

Results after 100 episodes:
Average timesteps per episode: 12.78
Average penalties per episode: 0.0


**(TO DO) Q5 (b) - 2 marks**     
b) Given your results from Q5 (a), how do the observed results from the tests compare to the tests from the Baseline model in Q3 (a)? Specifically, which agent performs better with respect to the average number of penalties throughout the tests and which agent is able to solve the problems quicker (on average).     

TODO ... 
Overall, the results from Q5 show that the Q-Learning agent performs far better than the Baseline agent from Q3. With respect to penalties, the Baseline agent averaged 721.97 penalties per episode, whereas the Q-Learning agent incurred none in any of the 100 test episodes (an average of 0.0). The Q-Learning agent also solves the problem much more quickly, averaging 12.78 timesteps per episode compared with 2231.29 for the Baseline agent.
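
The size of the gap can be made explicit with a short calculation. The numbers below are copied from the Q3 (a) and Q5 (a) outputs in this notebook; the variable names are hypothetical.

```python
# Averages copied from the Q3 (a) and Q5 (a) outputs above.
baseline = {"timesteps": 2231.29, "penalties": 721.97}
q_learning = {"timesteps": 12.78, "penalties": 0.0}

# Ratio of average timesteps per episode (Baseline relative to Q-Learning).
ratio = baseline["timesteps"] / q_learning["timesteps"]
print(f"Baseline needs about {ratio:.0f}x as many timesteps per episode")

# Difference in average penalties per episode.
print(baseline["penalties"] - q_learning["penalties"])  # 721.97
```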

**4.0 - Testing Different Hyperparameters**   

Now we will try retraining the agent using different settings for the hyperparameters. This will allow you to explore their impact on the Q-Learning as well as understand their purpose during the training.     

**(TO DO) Q6**     
Below we explore variations for all four hyperparameters used by the Q-Learning approach to better understand their impact on the training. When answering the questions, ***be careful to correctly set the hyperparameters***.   

a) Retrain the agent by resetting the Q learning values and training for only **35000 episodes** (with the same alpha, gamma, and epsilon values used in section 3.0 of this notebook). Then perform another test for 100 episodes with the environment.     
b) Retrain the agent by resetting the Q learning values and training for **100000 episodes**, but with an **epsilon value of 0.8** (with the same alpha and gamma values used in section 3.0 of this notebook). Then perform another test for 100 episodes with the environment.    
c) Retrain the agent by resetting the Q learning values and training for **100000 episodes**, but with an **alpha value of 0.7** (with the same gamma and epsilon values used in section 3.0 of this notebook). Then perform another test for 100 episodes with the environment.    
d) Retrain the agent by resetting the Q learning values and training for **100000 episodes**, but with a **gamma value of 0.15** (with the same alpha and epsilon values used in section 3.0 of this notebook). Then perform another test for 100 episodes with the environment.    
e) Based on your knowledge describe what the alpha and epsilon values are within the training function (i.e. what do they affect/do).    
f) Using the results obtained from your tests in Q6 (a), (b), (c), and (d), along with the initial results found from Q5, explain the impacts of modifying the number of episodes trained on (less vs more), the alpha value (lower vs higher), the gamma value (lower vs higher), and the epsilon value (lower vs higher). Even if the differences in the comparisons are minor, state them.       

As a note, below are the initial hyperparameter values used from section 3.0 of this notebook to use as reference:    

*alpha* = 0.1   
*gamma* = 0.5   
*epsilon* = 0.1   
*num_simulations* = 100000 

**(TO DO) Q6 (a) - 2 marks**     
a) Retrain the agent by resetting the Q learning values and training for only **35000 episodes** (with the same alpha, gamma, and epsilon values used in section 3.0 of this notebook). Then perform another test for 100 episodes with the environment.   

In [72]:
# TODO: Reset q_table
q_table = np.zeros([env.observation_space.n, env.action_space.n])
# TODO: Retrain with the specified hyperparameters
# Hyperparameters
alpha = 0.1
gamma = 0.5
epsilon = 0.1
num_simulations = 35000
train_agent(alpha, gamma, epsilon, num_simulations)
# TODO: Test for 100 episodes
q6a = evaluate_agent_QL(100, env)

Episode: 35000
Training finished.

Results after 100 episodes:
Average timesteps per episode: 13.07
Average penalties per episode: 0.0


**(TO DO) Q6 (b) - 2 marks**      
b) Retrain the agent by resetting the Q learning values and training for **100000 episodes**, but with an **epsilon value of 0.8** (with the same alpha and gamma values used in section 3.0 of this notebook). Then perform another test for 100 episodes with the environment. 

In [74]:
# TODO: Reset q_table
q_table = np.zeros([env.observation_space.n, env.action_space.n])
# TODO: Retrain with the specified hyperparameters
# Hyperparameters
alpha = 0.1
gamma = 0.5
epsilon = 0.8
num_simulations = 100000
train_agent(alpha, gamma, epsilon, num_simulations)
# TODO: Test for 100 episodes
q6b = evaluate_agent_QL(100, env)

Episode: 100000
Training finished.

Results after 100 episodes:
Average timesteps per episode: 12.86
Average penalties per episode: 0.0


**(TO DO) Q6 (c) - 2 marks**      
c) Retrain the agent by resetting the Q learning values and training for **100000 episodes**, but with an **alpha value of 0.7** (with the same gamma and epsilon values used in section 3.0 of this notebook). Then perform another test for 100 episodes with the environment.    

In [73]:
# TODO: Reset q_table
q_table = np.zeros([env.observation_space.n, env.action_space.n])
# TODO: Retrain with the specified hyperparameters
# Hyperparameters
alpha = 0.7
gamma = 0.5
epsilon = 0.1
num_simulations = 100000
train_agent(alpha, gamma, epsilon, num_simulations)
# TODO: Test for 100 episodes
q6c = evaluate_agent_QL(100, env)

Episode: 100000
Training finished.

Results after 100 episodes:
Average timesteps per episode: 13.05
Average penalties per episode: 0.0


**(TO DO) Q6 (d) - 2 marks**      
d) Retrain the agent by resetting the Q learning values and training for **100000 episodes**, but with a **gamma value of 0.15** (with the same alpha and epsilon values used in section 3.0 of this notebook). Then perform another test for 100 episodes with the environment.    

In [71]:
# TODO: Reset q_table
q_table = np.zeros([env.observation_space.n, env.action_space.n])
# TODO: Retrain with the specified hyperparameters
# Hyperparameters
alpha = 0.1
gamma = 0.15
epsilon = 0.1
num_simulations = 100000
train_agent(alpha, gamma, epsilon, num_simulations)
# TODO: Test for 100 episodes
q6d = evaluate_agent_QL(100, env)

Episode: 100000
Training finished.

Results after 100 episodes:
Average timesteps per episode: 13.05
Average penalties per episode: 0.0


**(TO DO) Q6 (e) - 2 marks**      
e) Based on your knowledge describe what the alpha and epsilon values are within the training function (i.e. what do they affect/do).    

TODO ...  
*   Alpha is the learning rate. It controls the proportions in which the old Q value and the newly estimated value (the reward plus the discounted maximum Q value of the next state) are blended to produce the updated Q value.
*   Epsilon is the probability that, at each step, the agent takes a random action (exploring) rather than the action recommended by the current Q values (exploiting).
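
The role of epsilon can be illustrated with a small action-selection sketch. `choose_action` is a hypothetical helper written for this example; it mirrors the explore/exploit branch in `train_agent` but uses plain Python lists instead of the NumPy table.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy selection over a list of Q values."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest Q value (first index on ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.0, 5.0, 2.0]
print(choose_action(q_values, epsilon=0.0))  # 1 (always exploits)

# With epsilon = 1.0 every choice is random, so all actions get tried.
samples = {choose_action(q_values, epsilon=1.0) for _ in range(1000)}
print(sorted(samples))
```

With epsilon at 0.1, as in section 3.0, roughly one step in ten would be exploratory; at 0.8 most steps would be.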

**(TO DO) Q6 (f) - 4 marks**      
f) Using the results obtained from your tests in Q6 (a), (b), (c), and (d), along with the initial results found from Q5 to serve as the Q-Learning baseline to compare with, explain the impacts of modifying the number of episodes trained on (less vs more), the alpha value (lower vs higher), the gamma value (lower vs higher), and the epsilon value (lower vs higher). Even if the differences in the comparisons are minor, state them.

TODO ... 


*   Fewer training episodes (35000 instead of 100000) slightly increased the average number of steps needed to reach the goal (13.07 vs 12.78).
*   A higher epsilon value (0.8 instead of 0.1) increased the time taken to train the model, probably because random actions were more frequent; the average number of steps during testing was slightly higher (12.86 vs 12.78).
*   A higher alpha value (0.7 instead of 0.1) slightly increased the average number of steps taken to reach the goal during the 100 test episodes (13.05 vs 12.78).
*   A lower gamma value (0.15 instead of 0.5) slightly increased the average number of steps taken to reach the goal during the 100 test episodes (13.05 vs 12.78).
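
The comparison can be tabulated from the outputs recorded in this notebook. The sketch below only restates those averages and their difference from the Q5 reference run; the dictionary keys are hypothetical labels. Each average comes from a different set of 100 random test episodes, so small differences should be read with caution.

```python
# Average timesteps per episode, copied from the Q5 and Q6 outputs above.
reference = 12.78  # Q5: alpha=0.1, gamma=0.5, epsilon=0.1, 100000 episodes
results = {
    "Q6 (a) 35000 episodes": 13.07,
    "Q6 (b) epsilon = 0.8": 12.86,
    "Q6 (c) alpha = 0.7": 13.05,
    "Q6 (d) gamma = 0.15": 13.05,
}

# Difference from the reference run, rounded to two decimals.
differences = {name: round(steps - reference, 2) for name, steps in results.items()}

for name, steps in results.items():
    print(f"{name}: {steps:.2f} timesteps ({differences[name]:+.2f} vs Q5)")
```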





***SIGNATURE:***
My name is Ange Michaella Niyonkuru.
My student number is 8962161.
I certify being the author of this assignment.