# Deep Reinforcement Learning for OpenAI's "CartPole-v0"

In this notebook, we run the experiments, tune the hyperparameters and visualize the results.

The agent classes are in the Python module agents.DQNforCartpole. 
We will first import the DQN agent and perform a number of experiments with it. 

For logging and visualization, the files logz.py and plot.py are used. They have been 
taken from UC Berkeley's course on deep reinforcement learning, homework 2, available here: https://github.com/berkeleydeeprlcourse/homework/tree/master/hw2 

## Setup

In [1]:
from agents.DQNforCartpole import DQNforCartpole
from environments import Environments
import os, time
from util.plotting import plot_result
import pickle
import json


Using TensorFlow backend.


Define a function that runs the experiments and records the log directory of each experiment 
in a separate results file (data/logdirs.p). 

In [2]:
def do_experiment(allDQNs, numberOfTrials, numberOfEpisodesForEachTrial):
    # dict mapping each experiment name (e.g. 'dqn1') to its log directory;
    # reuse the dict from earlier runs if a non-empty results file exists
    os.makedirs('data', exist_ok=True)
    target = os.path.join('data', 'logdirs.p')
    if os.path.exists(target) and os.path.getsize(target) > 0:
        with open(target, "rb") as handle:
            dict_of_logdirs = pickle.load(handle)
        print("Loading dictionary of logdirs")
    else:
        print("Creating empty dict")
        dict_of_logdirs = dict()

    for dqn in allDQNs:
        # make a timestamped directory for this experiment
        logdir = "DQN" + '-cartpole' + '_' + time.strftime("%d-%m-%Y_%H-%M-%S")
        logdir = os.path.join('data', logdir)
        os.makedirs(logdir, exist_ok=True)

        # run experiment
        dqn.run_numberOfTrials_experiments(
            numberOfTrials=numberOfTrials,
            numberOfEpisodesForEachTrial=numberOfEpisodesForEachTrial,
            logdir=logdir
        )

        # save the logdir of the finished experiment for visualization later on;
        # writing after every experiment keeps completed results if a later run is interrupted
        dict_of_logdirs[dqn.exp_name] = logdir
        with open(target, 'wb') as handle:
            pickle.dump(dict_of_logdirs, handle)

In [3]:
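def visualize_results(set_of_numbers):
    # expects a set of experiment numbers, e.g. {1, 2, 3} for dqn1, dqn2 and dqn3
    if not isinstance(set_of_numbers, set):
        raise TypeError("Argument to visualize_results must be a set of numbers")

    # load the mapping from experiment name to log directory
    with open(os.path.join('data', 'logdirs.p'), 'rb') as handle:
        dict_of_logdirs = pickle.load(handle)

    # plot the logged AvgScoresFor100Episodes column for each requested experiment
    for exp_number in sorted(set_of_numbers):
        plot_result(dict_of_logdirs['dqn{}'.format(exp_number)],
                    'AvgScoresFor100Episodes')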

First, we specify the environment. As of now, this choice is simple because only one environment has been implemented: 
CartPole. 

In [4]:
# create the cartpole environment
env = Environments.importCartpole()


Importing environment CartPole-v0
----------------------------------
CartPole-v0's action space:       Discrete(2)
CartPole-v0's observation space:  Box(4,)

For the CartPole environment, each observation has four components:

- obs[0]: the horizontal position of the cart (0.0 = center)
- obs[1]: the velocity of the cart (0.0 = standing still)
- obs[2]: the angle of the pole (0.0 = vertical)
- obs[3]: the angular velocity of the pole (0.0 = not rotating)
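To make these four components concrete, here is a minimal sketch that names them and checks whether a state counts as a failure. The thresholds (cart position beyond ±2.4, pole angle beyond ±12 degrees) are those of Gym's CartPole-v0 and are not defined anywhere in this notebook; `CartPoleObservation` and `is_failure` are hypothetical helpers, not part of the `environments` module. CartPole-v0 additionally ends every episode after 200 steps, so returns in this environment are capped at 200.

```python
import math
from typing import NamedTuple, Sequence

# Failure thresholds of Gym's CartPole-v0 (assumed here, not taken from this notebook)
X_THRESHOLD = 2.4                                  # cart position limit
THETA_THRESHOLD_RADIANS = 12 * 2 * math.pi / 360   # 12 degrees, about 0.2094 rad


class CartPoleObservation(NamedTuple):
    """Named view of the 4-dimensional CartPole observation."""
    cart_position: float           # 0.0 = center of the track
    cart_velocity: float           # 0.0 = standing still
    pole_angle: float              # radians, 0.0 = vertical
    pole_angular_velocity: float   # 0.0 = not rotating


def is_failure(obs: Sequence[float]) -> bool:
    """Return True if the cart left the track or the pole tilted too far."""
    o = CartPoleObservation(*obs)
    return (abs(o.cart_position) > X_THRESHOLD
            or abs(o.pole_angle) > THETA_THRESHOLD_RADIANS)


print(is_failure([0.0, 0.0, 0.05, 0.0]))   # pole nearly upright -> False
print(is_failure([0.0, 0.0, 0.25, 0.0]))   # pole tilted about 14 degrees -> True
```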


Next, we instantiate a deep Q-learning agent. The agent is based on Mnih et al. (2013): it uses experience replay but, 
unlike the agent in Mnih et al. (2015), no target network. For the hyperparameters, we largely follow Mnih et al., 
except for the replay memory capacity, the replay start size, the exploration schedule and the neural network 
architecture. The cartpole observation is much lower-dimensional than the visual input from the Atari games, so a 
significantly simpler function approximator suffices compared to the CNN used by DeepMind.
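The difference between the 2013 and 2015 variants shows up in how the regression targets are computed. The sketch below is illustrative and is not the code in `agents.DQNforCartpole`: it stores transitions in a fixed-capacity replay memory, samples a uniform minibatch, and computes the Q-learning targets with the same network that is being trained (no target network). `ReplayMemory`, `q_learning_targets` and the toy linear Q function are hypothetical stand-ins.

```python
import random
from collections import deque

import numpy as np


class ReplayMemory:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks the correlation between consecutive transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


def q_learning_targets(q_values_fn, rewards, next_states, dones, discount_rate):
    """y = r for terminal transitions, y = r + gamma * max_a' Q(s', a') otherwise.

    Q(s', .) comes from the same network that is being trained (2013 variant,
    no separate target network).
    """
    next_q = q_values_fn(next_states)          # shape (batch, n_actions)
    max_next_q = next_q.max(axis=1)
    return rewards + discount_rate * max_next_q * (1.0 - dones.astype(float))


# toy linear Q "network" with 4 inputs and 2 actions (hypothetical stand-in)
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))
q_values_fn = lambda s: s @ W

memory = ReplayMemory(capacity=10000)
for _ in range(64):
    memory.add(rng.normal(size=4), rng.integers(2), 1.0, rng.normal(size=4), False)

states, actions, rewards, next_states, dones = memory.sample(32)
targets = q_learning_targets(q_values_fn, rewards, next_states, dones, 0.99)
print(targets.shape)   # (32,)
```

In the 2015 variant, `q_values_fn` would instead be a periodically synchronized copy of the network. Training would also only start once `len(memory)` reaches the replay start size (32 for the agents below).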

In [5]:
# benchmark model: hyperparameters similar to Mnih et al. (2015)
dqn1 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.99,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[10],
                      replay_start_size=32,
                      exp_name="dqn1"
                      )

Initializing DQN agent...
 .... dimension of state space: 4
 .... dimension of action space: 2
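The logged Epsilon values in the training output (0.999 after episode 1, 0.987 after episode 13, 0.975 after episode 25) suggest that the agent multiplies the exploration rate by `exploration_rate_decay` once per episode, floored at `exploration_rate_min`. Assuming that schedule, the sketch below computes when exploration reaches its minimum; `epsilon_after` and `episodes_until_minimum` are hypothetical helpers.

```python
import math


def epsilon_after(episode, start=1.0, decay=0.999, minimum=0.1):
    """Exploration rate after `episode` episodes, assuming one multiplicative
    decay step per episode, floored at `minimum`."""
    return max(minimum, start * decay ** episode)


def episodes_until_minimum(start=1.0, decay=0.999, minimum=0.1):
    """First episode after which the decayed rate has reached the floor."""
    return math.ceil(math.log(minimum / start) / math.log(decay))


print(round(epsilon_after(13), 3))   # 0.987, the value logged at episode 13
print(episodes_until_minimum())      # 2302
```

Under this assumption, with 3000 episodes per trial, roughly the last 700 episodes of each trial run at the minimum exploration rate of 0.1.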


We will now tune some of the hyperparameters. First, we double the number of hidden layers, from one to two layers of 10 units each.

In [6]:
dqn2 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.99,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[10, 10],
                      replay_start_size=32,
                      exp_name="dqn2"
                      )

Initializing DQN agent...
 .... dimension of state space: 4
 .... dimension of action space: 2


Next, we return to a single hidden layer but double its width from 10 to 20 units.

In [7]:
dqn3 = DQNforCartpole(environment=env,
                      learning_rate=0.00025,
                      discount_rate=0.99,
                      exploration_rate=1.0,
                      exploration_rate_min=0.1,
                      exploration_rate_decay=0.999,
                      replay_memory_capacity=10000, 
                      replay_sampling_batch_size=32,
                      nn_architecture=[20],
                      replay_start_size=32,
                      exp_name="dqn3"
                      )

Initializing DQN agent...
 .... dimension of state space: 4
 .... dimension of action space: 2
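To compare the capacity of the three function approximators, the sketch below counts their trainable parameters. It assumes that `nn_architecture` lists the widths of fully connected hidden layers (with biases) between the 4 state inputs and one Q-value output per action (2); `layer_sizes` and `count_parameters` are hypothetical helpers.

```python
def layer_sizes(nn_architecture, n_inputs=4, n_outputs=2):
    """Full list of layer widths: input, hidden layers, output."""
    return [n_inputs] + list(nn_architecture) + [n_outputs]


def count_parameters(nn_architecture, n_inputs=4, n_outputs=2):
    """Weights plus biases of a fully connected network."""
    sizes = layer_sizes(nn_architecture, n_inputs, n_outputs)
    return sum(fan_in * fan_out + fan_out          # weight matrix + bias vector
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))


for name, arch in [("dqn1", [10]), ("dqn2", [10, 10]), ("dqn3", [20])]:
    print(name, arch, count_parameters(arch))
# dqn1 [10] 72
# dqn2 [10, 10] 182
# dqn3 [20] 142
```

Under that assumption, adding a second hidden layer (dqn2) adds more parameters than widening the single hidden layer to 20 units (dqn3).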


## Performing the Experiments

Run the experiments with the specified agents for a given number of 
trials, with a given number of episodes in each trial. 
#### WARNING: executing the next cell can take a significant amount of time!

In [8]:
allDQNs = [dqn1, dqn2, dqn3]
numberOfTrials = 5
number_of_episodes_for_each_trial = 3000

do_experiment(
    allDQNs=allDQNs, 
    numberOfTrials=numberOfTrials,
    numberOfEpisodesForEachTrial=number_of_episodes_for_each_trial
    )

Creating empty dict
Logging data to data/DQN-cartpole_29-01-2018_16-11-20/trial_1/log.txt
Starting new trial
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |               1 |
|                    Time |            0.59 |
|           AverageReturn |            20.8 |
|               StdReturn |            9.25 |
|               MaxReturn |              44 |
|               MinReturn |              11 |
|                 Epsilon |           0.999 |
|               EpLenMean |            20.8 |
|                EpLenStd |            9.25 |
| AvgScoresFor100Episodes |            19.8 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |              13 |
|                    Time |             1.3 |
|           AverageReturn |            21.6 |
|               StdReturn |            8.97 |
|               MaxReturn |              44 |
|               MinReturn |              11 |
|                 Epsilon |           0.987 |
|               EpLenMean |            21.6 |
|                EpLenStd |            8.97 |
| AvgScoresFor100Episodes |            20.6 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |              25 |
|                    Time |            1.99 |
|           AverageReturn |            20.9 |
|               StdReturn |            8.49 |
|               MaxReturn |              44 |
|               MinReturn |               9 |
|                 Epsilon |           0.975 |
|               EpLenMean |            20.9 |
|                EpLenStd |            8.49 |
| AvgScoresFor100Episodes |            19.9 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |              37 |
|                    Time |            2.71 |
|           AverageReturn |            21.4 |
|               StdReturn |            9.16 |
|               MaxReturn |              51 |
|               MinReturn |               9 |
|                 Epsilon |           0.964 |
|               EpLenMean |            21.4 |
|                EpLenStd |            9.16 |
| AvgScoresFor100Episodes |            20.4 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |              49 |
|                    Time |            3.39 |
|           AverageReturn |            22.5 |
|               StdReturn |            10.8 |
|               MaxReturn |              68 |
|               MinReturn |               9 |
|                 Epsilon |           0.952 |
|               EpLenMean |            22.5 |
|                EpLenStd |            10.8 |
| AvgScoresFor100Episodes |            21.5 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |              61 |
|                    Time |            4.14 |
|           AverageReturn |            22.8 |
|               StdReturn |            10.7 |
|               MaxReturn |              68 |
|               MinReturn |               9 |
|                 Epsilon |           0.941 |
|               EpLenMean |            22.8 |
|                EpLenStd |            10.7 |
| AvgScoresFor100Episodes |            21.8 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |              73 |
|                    Time |            4.82 |
|           AverageReturn |            22.4 |
|               StdReturn |            10.5 |
|               MaxReturn |              68 |
|               MinReturn |               9 |
|                 Epsilon |            0.93 |
|               EpLenMean |            22.4 |
|                EpLenStd |            10.5 |
| AvgScoresFor100Episodes |            21.7 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |              85 |
|                    Time |            5.48 |
|           AverageReturn |              22 |
|               StdReturn |            10.2 |
|               MaxReturn |              68 |
|               MinReturn |               9 |
|                 Epsilon |           0.918 |
|               EpLenMean |              22 |
|                EpLenStd |            10.2 |
| AvgScoresFor100Episodes |            21.1 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |              97 |
|                    Time |            6.15 |
|           AverageReturn |            22.3 |
|               StdReturn |            10.6 |
|               MaxReturn |              68 |
|               MinReturn |               9 |
|                 Epsilon |           0.908 |
|               EpLenMean |            22.3 |
|                EpLenStd |            10.6 |
| AvgScoresFor100Episodes |            21.5 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             109 |
|                    Time |            6.82 |
|           AverageReturn |            22.1 |
|               StdReturn |            10.5 |
|               MaxReturn |              68 |
|               MinReturn |               9 |
|                 Epsilon |           0.897 |
|               EpLenMean |            22.1 |
|                EpLenStd |            10.5 |
| AvgScoresFor100Episodes |            21.6 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             121 |
|                    Time |            7.49 |
|           AverageReturn |            22.1 |
|               StdReturn |              11 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.886 |
|               EpLenMean |            22.1 |
|                EpLenStd |              11 |
| AvgScoresFor100Episodes |            21.7 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             133 |
|                    Time |            8.17 |
|           AverageReturn |            22.1 |
|               StdReturn |            11.2 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.875 |
|               EpLenMean |            22.1 |
|                EpLenStd |            11.2 |
| AvgScoresFor100Episodes |            21.7 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             145 |
|                    Time |            8.89 |
|           AverageReturn |            22.2 |
|               StdReturn |            11.3 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.865 |
|               EpLenMean |            22.2 |
|                EpLenStd |            11.3 |
| AvgScoresFor100Episodes |            21.3 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             157 |
|                    Time |            9.62 |
|           AverageReturn |            22.1 |
|               StdReturn |              11 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.855 |
|               EpLenMean |            22.1 |
|                EpLenStd |              11 |
| AvgScoresFor100Episodes |            20.7 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             169 |
|                    Time |            10.3 |
|           AverageReturn |            22.2 |
|               StdReturn |              11 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.844 |
|               EpLenMean |            22.2 |
|                EpLenStd |              11 |
| AvgScoresFor100Episodes |            20.8 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             181 |
|                    Time |              11 |
|           AverageReturn |            21.8 |
|               StdReturn |            10.8 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.834 |
|               EpLenMean |            21.8 |
|                EpLenStd |            10.8 |
| AvgScoresFor100Episodes |            20.4 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             193 |
|                    Time |            11.8 |
|           AverageReturn |            21.8 |
|               StdReturn |            10.9 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.824 |
|               EpLenMean |            21.8 |
|                EpLenStd |            10.9 |
| AvgScoresFor100Episodes |            20.4 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             205 |
|                    Time |            12.4 |
|           AverageReturn |            21.9 |
|               StdReturn |            10.9 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.815 |
|               EpLenMean |            21.9 |
|                EpLenStd |            10.9 |
| AvgScoresFor100Episodes |            20.4 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             217 |
|                    Time |            13.1 |
|           AverageReturn |            21.6 |
|               StdReturn |            10.8 |
|               MaxReturn |              70 |
|               MinReturn |               9 |
|                 Epsilon |           0.805 |
|               EpLenMean |            21.6 |
|                EpLenStd |            10.8 |
| AvgScoresFor100Episodes |            19.7 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             229 |
|                    Time |            13.8 |
|           AverageReturn |            21.5 |
|               StdReturn |            10.9 |
|               MaxReturn |              70 |
|               MinReturn |               8 |
|                 Epsilon |           0.795 |
|               EpLenMean |            21.5 |
|                EpLenStd |            10.9 |
| AvgScoresFor100Episodes |            19.7 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             241 |
|                    Time |            14.5 |
|           AverageReturn |            21.3 |
|               StdReturn |            10.7 |
|               MaxReturn |              70 |
|               MinReturn |               8 |
|                 Epsilon |           0.786 |
|               EpLenMean |            21.3 |
|                EpLenStd |            10.7 |
| AvgScoresFor100Episodes |            18.7 |
---------------------------------------------
---------------------------------------------
|              Experiment |            dqn1 |
|                 TrialNo |               1 |
|                    seed |               1 |
|                 Episode |             253 |
|                    Time |            15.2 |
|           AverageReturn |            21.3 |
|               StdReturn |            10.6 |
|               MaxReturn |              70 |
|               MinReturn |               8 |
|                 Epsilon |           0.776 |
|               EpLenMean |            21.3 |
|                EpLenStd |            10.6 |
| AvgScoresFor100Episodes |            18.9 |
---------------------------------------------

KeyboardInterrupt: 

## Visualize the Results

In [None]:
visualize_results(set([1,2,3]))
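Besides plotting, it can be useful to average a logged quantity over all trials of one experiment. A minimal sketch, assuming each trial writes a tab-separated `log.txt` with a header row under `<logdir>/trial_<n>/` (the layout suggested by the "Logging data to" messages above and used by the Berkeley `logz.py`); `mean_curve_across_trials` is a hypothetical helper.

```python
import csv
import glob
import os
import statistics


def mean_curve_across_trials(logdir, column="AvgScoresFor100Episodes"):
    """Average one logged column over all trial_*/log.txt files in `logdir`.

    Assumes each log.txt is tab-separated with a header row. Curves are cut to
    the length of the shortest trial so that every point averages all trials.
    """
    curves = []
    for path in sorted(glob.glob(os.path.join(logdir, "trial_*", "log.txt"))):
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            curves.append([float(row[column]) for row in reader])
    if not curves:
        raise FileNotFoundError("no trial_*/log.txt files under " + logdir)
    length = min(len(curve) for curve in curves)
    return [statistics.mean(curve[i] for curve in curves) for i in range(length)]
```

For example, `mean_curve_across_trials(dict_of_logdirs['dqn1'])` would return the row-by-row mean of `AvgScoresFor100Episodes` over all completed trials of the first agent.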