# Deep Reinforcement Learning for OpenAI's "cartpole-v0"

In this notebook, we will run the experiments, tune the hyperparameters and visualize the results.

The agent classes are in the Python module agents.DQNforCartpole. 
We will first import the DQN agent and perform a number of experiments with it. 

For logging and visualization, the files logz.py and plot.py are used. They have been 
taken from UC Berkeley's course on deep reinforcement learning, homework 2, available here: https://github.com/berkeleydeeprlcourse/homework/tree/master/hw2 

## Setup

In [None]:
from agents.DQNforCartpole import DQNforCartpole
from agents.PGAforCartpole import PolicyGradientAgent
from environments import Environments
import os, time
from util.plotting import plot_result
import pickle
import json


## 1. Deep Q-Learning

Define a function to perform the experiments and save the location of 
the experiments in a separate results file. 

In [None]:
def do_experiment(allDQNs, numberOfTrials, numberOfEpisodesForEachTrial):
    """
    Calls the method run_numberOfTrials_experiments on each agent in the list allDQNs. 
    
    param: allDQNs: list of all agents for which to perform the experiments
    param: numberOfTrials: int that specifies the number of independent trials to run for each experiment
    param: numberOfEpisodesForEachTrial: int that specifies the maximum number of episodes that each trial can take
    """
    # set up dict to save the locations of the results files for each
    # experiment
    target = 'data/logdirs.p'
    # start with an empty dict so the name is defined even if the file is empty
    dict_of_logdirs = dict()
    try:
        if os.path.getsize(target) > 0:
            with open(target, "rb") as handle:
                dict_of_logdirs = pickle.load(handle)
            print("Loading dictionary of logdirs")
    except (OSError, pickle.UnpicklingError, EOFError):
        # file is missing or unreadable: keep the empty dict
        print("Creating empty dict")
    
    for dqn in allDQNs:
        # make directory for experiment
        if not(os.path.exists('data')):
            os.makedirs('data')
        logdir = "DQN"+'-cartpole' + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
        logdir = os.path.join('data', logdir)
        if not(os.path.exists(logdir)):
            os.makedirs(logdir)
            
        # save logdir for current experiment for visualization later on
        dict_of_logdirs[dqn.exp_name] = logdir
        
        # run experiment
        dqn.run_numberOfTrials_experiments(
            numberOfTrials=numberOfTrials,
            numberOfEpisodesForEachTrial=numberOfEpisodesForEachTrial, 
            logdir=logdir
        )
        
    # save the dict_of_logdirs to disk and close the file afterwards
    with open(target, 'wb') as handle:
        pickle.dump(dict_of_logdirs, handle)

In [None]:
def visualize_results(experiment_numbers_to_visualize, value_to_visualize="AvgScoresFor100Episodes"):
    """
    Visualizes each agent that is specified via the parameter experiment_numbers_to_visualize
    """
    if not isinstance(experiment_numbers_to_visualize, set):
        raise TypeError("Argument to visualize_results must be a set of numbers")
    
    # load the mapping from experiment name to log directory
    with open('./data/logdirs.p', 'rb') as handle:
        dict_of_logdirs = pickle.load(handle)
    for exp_number in experiment_numbers_to_visualize:
        plot_result(dict_of_logdirs['dqn{}'.format(exp_number)],
                    value_to_visualize)

First, we specify the environment to use. As of now, this is not particularly difficult because we've only implemented
one: the cartpole. 

In [None]:
# create the cartpole environment
env = Environments.importCartpole()

Next, we will instantiate a deep q-learning agent. This agent is based on Mnih et al. (2013), which means that it 
uses experience replay but does not use a target network, unlike the agent in Mnih et al. (2015). For the hyperparameters, 
we will largely use the values of Mnih et al., with the exception of the replay memory capacity, the neural
network architecture and the replay start size. The cartpole problem is much lower-dimensional than the 
visual input from the Atari games, so we get away with a significantly simpler function approximator than the 
CNN used by DeepMind.
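
The replay-related hyperparameters used below (`replay_memory_capacity`, `replay_sampling_batch_size`, `replay_start_size`) can be illustrated with a minimal replay memory. This is only a sketch: the class `ReplayMemory` and its method names are hypothetical and not taken from `agents.DQNforCartpole`, whose actual implementation may differ.

```python
import random
from collections import deque


class ReplayMemory:
    """Hypothetical fixed-capacity buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, start_size, batch_size):
        # a deque with maxlen drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)
        self.start_size = start_size
        self.batch_size = batch_size

    def add(self, transition):
        self.buffer.append(transition)

    def ready(self):
        # learning starts only once enough transitions have been collected
        return len(self.buffer) >= max(self.start_size, self.batch_size)

    def sample(self):
        # uniform sampling without replacement from the stored transitions
        return random.sample(list(self.buffer), self.batch_size)
```

With `start_size=32` and `batch_size=32`, as in the benchmark configuration, such a buffer would report `ready()` as soon as a single batch can be drawn.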

In [None]:
# benchmark model: hyperparameters similar to Mnih et al. (2015)
dqn1 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.99,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[10],
                      replay_start_size=32,
                      exp_name="dqn1"
                      )

We will perform some parameter tuning for the following experiments. The hyperparameter to be tuned will be specified
in each comment. 

In [None]:
# First, we will double the number of hidden layers.
dqn2 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.99,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[10, 10],
                      replay_start_size=32,
                      exp_name="dqn2"
                      )

In [None]:
# Next, we double the number of hidden nodes for dqn1
dqn3 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.99,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20],
                      replay_start_size=32,
                      exp_name="dqn3"
                      )

## Performing the Experiments

Perform the experiment with the specified agents for a certain number of 
trials and a given number of episodes in each trial. 
#### WARNING: executing the next cell (and setting do_it to True) can take a significant amount of time!

In [None]:
allDQNs = [dqn1, dqn2, dqn3]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

## Visualize the Results

In [None]:
visualize_results(set([1,2,3]))

We see that dqn2 achieves a much higher average reward over 100 episodes than either dqn1 or dqn3. We will keep building
on that and double the number of hidden nodes in the first layer. 

In [None]:
# Next, we double the number of hidden nodes in the first layer of dqn2
dqn4 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.99,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn4"
                      )

In [None]:
allDQNs = [dqn4]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([2,4]))

Comparing these 2 plots, we see that dqn4 achieves a higher average reward over 100 episodes than dqn2, so we will use 
dqn4 as the basis for our next experiment. 

In [None]:
# add 10 nodes to 1st and 2nd layer
dqn5 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.99,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[30, 20],
                      replay_start_size=32,
                      exp_name="dqn5"
                      )

In [None]:
allDQNs = [dqn5]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([4,5]))

The results for dqn4 look more promising than those for dqn5, so we will stay with the architecture in dqn4 for now. We will now 
adjust the discount rate.

In [None]:
# reduce discount rate to 0.95
dqn6 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.95,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn6"
                      )

In [None]:
# reduce discount rate to 0.90
dqn7 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.90,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn7"
                      )

In [None]:
allDQNs = [dqn6, dqn7]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([4,6,7]))

We see that dqn7 clearly beats the other 2 experiments. It also shows the smallest variance across the trials. 
Adjusting the discount factor seems to be a promising avenue towards better performance. We will explore this further
in the next experiment, where we further decrease the discount rate. 
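
To get a feeling for what the discount rate changes, recall that cartpole gives a reward of 1 for every time step and that an episode of cartpole-v0 lasts at most 200 steps. The sketch below computes the discounted return of a full-length episode for the discount rates used in this notebook. The helper `discounted_return` is hypothetical and only illustrates the arithmetic; it is not part of the agent and does not by itself explain the results.

```python
def discounted_return(discount_rate, n_steps, reward=1.0):
    """Sum of reward * discount_rate**t for t = 0 .. n_steps - 1."""
    return sum(reward * discount_rate ** t for t in range(n_steps))


# discounted return of a 200-step episode with reward 1 per step
for rate in (0.99, 0.95, 0.90, 0.85, 0.80):
    print(rate, round(discounted_return(rate, 200), 2))
```

The return shrinks from roughly 87 for a discount rate of 0.99 to about 20 for 0.95, 10 for 0.90 and 5 for 0.80, so lower discount rates give the network much smaller target values to fit.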

In [None]:
# reduce discount rate to 0.85
dqn8 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.85,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn8"
                      )

In [None]:
allDQNs = [dqn8]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([7,8]))

While the difference in performance between dqn7 and dqn8 is not quite as clear as before, dqn8 is still better in terms
of the number of episodes needed to reach an average score of 195 (at which point the curves in the plots stop). We will 
further decrease the discount factor to see if we can enhance the performance even more.
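
The metric "number of episodes needed to reach an average score of 195" can be sketched as follows. The function `episodes_until_solved` is a hypothetical helper; it assumes that the average is taken over the last 100 episodes, as the name of the logged value `AvgScoresFor100Episodes` suggests. The agents' own logging code may compute it differently.

```python
from collections import deque


def episodes_until_solved(scores, window=100, goal=195.0):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `goal`, or None if it never does."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    for episode, score in enumerate(scores, start=1):
        if len(recent) == window:
            # the oldest score is about to be pushed out of the window
            running_sum -= recent[0]
        recent.append(score)
        running_sum += score
        if len(recent) == window and running_sum / window >= goal:
            return episode
    return None
```

Applied to the per-episode scores of each trial, this yields one number per trial, which makes the comparison between two agents less dependent on reading values off a plot.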

In [None]:
# reduce discount rate to 0.80
dqn9 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn9"
                      )

In [None]:
allDQNs = [dqn9]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([8, 9]))

It's hard to judge the difference here. It seems that, on average, dqn9 achieves higher scores faster. For example, at episode 
1000, dqn8 has a score of around 70, while dqn9 has a score of around 100. So we will further reduce the discount rate.

In [None]:
# reduce discount rate to 0.75
dqn10 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.75,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn10"
                      )

In [None]:
allDQNs = [dqn10]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([8,9,10]))

The variance has increased quite a bit for dqn10. Although the best agent was able to reach the goal of 195 in just 
around 1500 episodes, we prefer the low-variance results of dqn8 and dqn9. We will take dqn9 as the new reference
and will now investigate the impact of the learning rate.

In [None]:
# double learning rate to 0.0005
dqn11 = DQNforCartpole(environment=env,
                      learning_rate=0.00050,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn11"
                      )

In [None]:
# halve the learning rate to 0.000125
dqn12 = DQNforCartpole(environment=env,
                      learning_rate=0.000125,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn12"
                      )

In [None]:
allDQNs = [dqn11, dqn12]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([9,11,12]))

It seems that the higher learning rate (dqn11) leads to a faster increase of the score, while the lower learning rate 
(dqn12) leads to slower learning. This is consistent with previous experience. The optimal thing to do would be to 
lower the learning rate of dqn11 once it reaches around 150 points at episode 1000, but that would go too far for this
project. Instead, we will focus on the epsilon decay next. We will still use dqn9 as the currently best-performing model.
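
The idea of lowering the learning rate once a certain score is reached could look like the sketch below. The function `scheduled_learning_rate` and its parameters are hypothetical; `DQNforCartpole` is not assumed to support changing the learning rate during training, so this only illustrates the schedule itself.

```python
def scheduled_learning_rate(avg_score, initial_rate=0.0005,
                            reduced_rate=0.00025, threshold=150.0):
    """Return the learning rate to use for the current 100-episode average score."""
    # keep the higher rate for fast initial progress, then switch to the lower one
    if avg_score >= threshold:
        return reduced_rate
    return initial_rate
```

The default values mirror the learning rates of dqn11 and dqn9 and the score of 150 mentioned above.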

In [None]:
# set exploration rate decay to 0.995
dqn13 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.995,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn13"
                      )

In [None]:
# set exploration rate decay to 0.99
dqn14 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.99,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32,
                      exp_name="dqn14"
                      )

In [None]:
allDQNs = [dqn13, dqn14]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([9,13,14]))

It seems that lowering the exploration rate decay does not yield better performance. In retrospect, this makes sense, 
since with a decay of 0.99 the exploration rate already reaches its minimum of 0.1 at around episode 229 
(since $0.99^{229} \approx 0.1$). This explains the high variance in the subsequent episodes: 
the q-function is still far from the optimal one at that point. 
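
The same calculation can be done for all three decay values that were tried. The sketch assumes, as in the argument above, that the exploration rate is multiplied by the decay factor once per episode; the helper `episodes_until_min_epsilon` is hypothetical.

```python
import math


def episodes_until_min_epsilon(start, minimum, decay):
    """Solve start * decay**n = minimum for n (a real number of decay steps)."""
    return math.log(minimum / start) / math.log(decay)


for decay in (0.999, 0.995, 0.99):
    print(decay, round(episodes_until_min_epsilon(1.0, 0.1, decay), 1))
```

For a decay of 0.999 the minimum is only reached after roughly 2300 episodes, compared to about 460 episodes for 0.995 and about 229 for 0.99.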

Next, we will increase the replay start size. This means that we will have a larger pool of experiences to sample from 
when learning starts. 

In [None]:
# increase replay_start_size by factor of 10
dqn15 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32*10,
                      exp_name="dqn15"
                      )

In [None]:
# increase replay_start_size by factor of 100
dqn16 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32*100,
                      exp_name="dqn16"
                      )

In [None]:
allDQNs = [dqn15, dqn16]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([9, 15, 16]))

dqn16 combines relatively smooth learning with quite low variance. But keep in mind that we only have 5 trials here,
so the results have to be taken with a grain of salt. In theory, the more episodes you simulate before starting to 
sample from them, the less correlated the samples should be. For this environment, running 1500 episodes
without learning takes about 1.5 sec, so we can safely use a dqn with a higher replay_start_size 
without sacrificing any performance. This is different for environments where simulation takes a long time, but for this
very simple problem, it does not matter. So we will take dqn16 as our new baseline and will now
vary the replay memory capacity. 

In [None]:
# multiply replay_memory_capacity by factor of 1/10
dqn17 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=1000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32*100,
                      exp_name="dqn17"
                      )

In [None]:
# multiply replay_memory_capacity by factor of 10
dqn18 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.80,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=100000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20, 10],
                      replay_start_size=32*100,
                      exp_name="dqn18"
                      )

In [None]:
allDQNs = [dqn17, dqn18]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment(
        allDQNs=allDQNs, 
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results(set([16,17,18]))

dqn17 seems too random for my taste. The sudden drop in the score at about 1400 episodes is much more pronounced than it 
is for dqn16. Surprisingly, dqn18 does not seem to show this drop at all. 

dqn16 seems to learn faster (score of 100 reached at about 1000 to 1200 episodes) than dqn18. 

## 2. Policy Gradients

In [None]:
def do_experiment_pga(allPGAs, numberOfTrials, numberOfEpisodesForEachTrial):
    """
    Calls the method run_numberOfTrials_experiments on each agent in the list allPGAs
    
    param: allPGAs: list of all policy gradient agents for which to perform the experiments
    param: numberOfTrials: int that specifies the number of independent trials to run for each experiment
    param: numberOfEpisodesForEachTrial: int that specifies the maximum number of episodes that each trial can take
    """
    # set up dict to save the locations of the results files for each
    # experiment
    target = 'data/pg_logdirs.p'
    # start with an empty dict so the name is defined even if the file is empty
    dict_of_logdirs = dict()
    try:
        if os.path.getsize(target) > 0:
            with open(target, "rb") as handle:
                dict_of_logdirs = pickle.load(handle)
            print("Loading dictionary of logdirs")
    except (OSError, pickle.UnpicklingError, EOFError):
        # file is missing or unreadable: keep the empty dict
        print("Creating empty dict")
    
    for pga in allPGAs:
        # make directory for experiment
        if not(os.path.exists('data')):
            os.makedirs('data')
        logdir = "PGA"+'-cartpole' + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
        logdir = os.path.join('data', logdir)
        if not(os.path.exists(logdir)):
            os.makedirs(logdir)
            
        # save logdir for current experiment for visualization later on
        dict_of_logdirs[pga.exp_name] = logdir
        
        # run experiment
        pga.run_numberOfTrials_experiments(
            numberOfTrials=numberOfTrials,
            numberOfEpisodesForEachTrial=numberOfEpisodesForEachTrial, 
            logdir=logdir
        )
        
    # save the dict_of_logdirs to disk and close the file afterwards
    with open(target, 'wb') as handle:
        pickle.dump(dict_of_logdirs, handle)

In [None]:
def visualize_results_pga(experiment_numbers_to_visualize, value_to_visualize="AvgScoresFor100Episodes"):
    """
    Visualizes each agent that is specified via the parameter experiment_numbers_to_visualize
    """
    if not isinstance(experiment_numbers_to_visualize, set):
        raise TypeError("Argument to visualize_results_pga must be a set of numbers")
    
    # load the mapping from experiment name to log directory
    with open('./data/pg_logdirs.p', 'rb') as handle:
        dict_of_logdirs = pickle.load(handle)
    for exp_number in experiment_numbers_to_visualize:
        plot_result(dict_of_logdirs['pga{}'.format(exp_number)],
                    value_to_visualize)

In [None]:
# Same hyperparameters as Géron (2017), Hands-On Machine Learning
pga1 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=10,
                          nn_architecture=[4],
                          exp_name='pga1'
                          )

In [None]:
allPGAs = [pga1]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment_pga(
        allPGAs=allPGAs,
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results_pga(set([1]))

The learning is much smoother than with deep q-learning. For the best trial, the goal of 195 was already reached after 
1300 episodes. However, there is also a very high variance in the results. For the worst trial, the goal was
not reached within the set maximum of 3000 episodes. 

We will now try to improve the performance by hyperparameter tuning. We will start with the network architecture since
I do not think that 4 hidden nodes yield a sufficient approximation for a 4-dimensional input vector. 

In [None]:
# set nn_architecture to [10]
pga2 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=10,
                          nn_architecture=[10],
                          exp_name='pga2'
                          )

In [None]:
# set nn_architecture to [4,4]
pga3 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=10,
                          nn_architecture=[4, 4],
                          exp_name='pga3'
                          )

In [None]:
allPGAs = [pga2, pga3]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment_pga(
        allPGAs=allPGAs,
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results_pga(set([1,2,3]))

pga2 clearly has the best performance out of these 3 agents. Its variance is significantly lower than that of either pga1 or
pga3. We will further adjust the neural network architecture to find out if we can get even better results. 

In [None]:
# set nn_architecture to [20]
pga4 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=10,
                          nn_architecture=[20],
                          exp_name='pga4'
                          )

In [None]:
# set nn_architecture to [10,10]
pga5 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=10,
                          nn_architecture=[10, 10],
                          exp_name='pga5'
                          )

In [None]:
allPGAs = [pga4, pga5]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment_pga(
        allPGAs=allPGAs,
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results_pga(set([2,4,5]))

It's actually hard to make a decision here. pga5 seems to facilitate the fastest learning. The best agent already reached a score
of 195 at around 600 episodes, which no other agent has achieved so far. In terms of variance, however, it's the worst of the 
3 agents. pga4 learns faster than pga2, but there is some crawling going on near the top, which is not present 
in the other agents. For pga4, there is already some variance present when starting out, whereas pga2 and pga5 are 
almost variance-free under 250 episodes. We will go with pga5 because even the worst trial has achieved a score of 150 
at 600 iterations. We will now vary the discount rate.

In [None]:
# increase discount rate to 0.99
pga6 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.99,
                          number_of_episodes_per_update=10,
                          nn_architecture=[10, 10],
                          exp_name='pga6'
                          )

In [None]:
# decrease discount rate to 0.8
pga7 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.8,
                          number_of_episodes_per_update=10,
                          nn_architecture=[10, 10],
                          exp_name='pga7'
                          )

In [None]:
allPGAs = [pga6, pga7]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment_pga(
        allPGAs=allPGAs,
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results_pga(set([5,6,7]))

pga5 and pga7 are very similar, whereas pga6 shows a much higher variance between trials. We keep pga5 as our benchmark
model and will now turn towards adjusting the number of episodes per update.
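
The hyperparameter `number_of_episodes_per_update` determines how many episodes are collected before the policy network is updated once. In the approach described in Géron's book, the rewards of each episode are first discounted and then normalized over all episodes of the batch. The sketch below illustrates that step in plain Python. It is an assumption that `PolicyGradientAgent` works the same way; the function names are hypothetical.

```python
from statistics import mean, pstdev


def discount_rewards(rewards, discount_rate):
    """Return the discounted reward-to-go for every step of one episode."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    # walk backwards so each step adds its reward to the discounted future
    for step in reversed(range(len(rewards))):
        running = rewards[step] + discount_rate * running
        discounted[step] = running
    return discounted


def discount_and_normalize(all_rewards, discount_rate):
    """Discount each episode, then standardize over the whole batch of episodes."""
    all_discounted = [discount_rewards(r, discount_rate) for r in all_rewards]
    flat = [value for episode in all_discounted for value in episode]
    mu = mean(flat)
    # guard against a zero standard deviation when all values are equal
    sigma = pstdev(flat) or 1.0
    return [[(value - mu) / sigma for value in episode]
            for episode in all_discounted]
```

With more episodes per update, the mean and standard deviation are estimated from more steps, at the price of fewer network updates for the same number of episodes.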

In [None]:
# double number of episodes per update to 20
pga8 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=20,
                          nn_architecture=[10, 10],
                          exp_name='pga8'
                          )

In [None]:
# halve number of episodes per update to 5
pga9 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=5,
                          nn_architecture=[10, 10],
                          exp_name='pga9'
                          )

In [None]:
allPGAs = [pga8, pga9]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment_pga(
        allPGAs=allPGAs,
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results_pga(set([5,8,9]))

pga9 clearly outperforms pga5 and pga8 in the metric we are interested in: the number of episodes until an average score 
of 195 is reached. Let's see if we can improve this even more by going down to one update per episode.

In [None]:
# decrease number of episodes per update to 2
pga10 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=2,
                          nn_architecture=[10, 10],
                          exp_name='pga10'
                          )

In [None]:
# decrease number of episodes per update to 1
pga11 = PolicyGradientAgent(environment=env,
                          learning_rate=0.01,
                          discount_rate=0.95,
                          number_of_episodes_per_update=1,
                          nn_architecture=[10, 10],
                          exp_name='pga11'
                          )

In [None]:
allPGAs = [pga10, pga11]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment_pga(
        allPGAs=allPGAs,
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results_pga(set([9,10,11]))

A further decrease in the number of episodes per update does not yield any increase in performance. pga9 still is the 
best-performing agent. We will now turn towards our last hyperparameter to tune: the learning rate.

In [None]:
# double the learning rate to 0.02
pga12 = PolicyGradientAgent(environment=env,
                          learning_rate=0.02,
                          discount_rate=0.95,
                          number_of_episodes_per_update=5,
                          nn_architecture=[10, 10],
                          exp_name='pga12'
                          )

In [None]:
# halve the learning rate to 0.005
pga13 = PolicyGradientAgent(environment=env,
                          learning_rate=0.005,
                          discount_rate=0.95,
                          number_of_episodes_per_update=5,
                          nn_architecture=[10, 10],
                          exp_name='pga13'
                          )

In [None]:
allPGAs = [pga12, pga13]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000
do_it = False

if do_it:
    do_experiment_pga(
        allPGAs=allPGAs,
        numberOfTrials=numberOfTrials,
        numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
        )

In [None]:
visualize_results_pga(set([9, 12, 13]))