## Installation

To run this you need several packages. First of all, you need anaconda, which you most likely already have if you're viewing this through jupyter. If not then check the readme on the class page.

System requirements: This should work on all operating systems (Linux, Mac, and Windows). However, several of the environments in the OpenAI-gym require additional simulators which don't aren't easy to get on Windows. In any case, it is strongly recommended that you use Linux, although you should be ok with Mac. (HINT: if you're on Windows check out the Windows Subsystem for Linux (WSL), although it'll make visualizing your policies a little tricky).

Then install the following packages (using conda or pip):

- pytorch --> `conda install pytorch -c pytorch`
- gym --> `pip install gym`
- gym (the cool environments, doesnt work on Windows) --> `pip install gym[all]`
(When install gym[all] don't worry if the mujoco installation doesn't work. That's a more advanced 3D physics simulator that has to be set up separately (see website). Anyway, we don't need it necessarily).

In [31]:
# If you're using colab, this will install the necessary packages!
!pip install torch
#!pip install gym
!pip install gym==0.12.1
#!wget https://pjreddie.com/media/files/rlhw_util.py



In [32]:
import sys, os, time
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch.multiprocessing as mp
from torch import distributions
from torch.distributions import Categorical
from itertools import islice

import gym

from IPython.core.debugger import set_trace # import break point
import pickle as pkl
#%matplotlib inline

# Introduction
Welcome to the RL playground. Your task is to implement the REINFORCE and A3C algorithm to solve various OpenAI-gym environments. If you are not familiar with OpenAI-gym, stop reading and visit https://gym.openai.com/envs/ to see all the tasks you can try to solve.

In this homework, we will only look at tasks with a discrete (and small) action space. That being said, both algorithms can be modified slightly to work on tasks with continuous action spaces. For full credit you must fill in the code below so you achieve an average total reward per episode on the cartpole task (CartPole-v1) of at least 499 (for an episode length of 500) for both REINFORCE and A3C. Then you must apply your code to any one other environment in OpenAI-gym, and plot and compare the learning curves (average total reward per episode vs number of episodes trained on) between REINFORCE and A3C (where at least one of the algorithms shows significant improvement from initialization).

Below there's an overview of what every iteration will look like, regardless of whether you want to train or evaluate your agent.

In [33]:
SMALLMODEL = True
MEDMODEL = False
LARGEMODEL = False

In [34]:
from rlhw_util import * # <-- look whats inside here - it could save you a lot of work!

def run_iteration(mode, N, agent, gen, horizon=None, render=False):
    train = mode == 'train' 
    if train: # if mode is 'train'
        agent.train() # train the agent
    else: # if mode is not 'train'
        agent.eval() # evaluate agent

    states, actions, rewards = zip(*[gen(horizon=horizon, render=render) for _ in range(N)])
    
    #if render and renderClose:
    #    gen.env._env.close()     

    loss = None # loss initilized as None
    if train: # if mode is 'train'
        loss = agent.learn(states, actions, rewards) # loss returned from the training the agent 

    reward = sum([r.sum() for r in rewards]) / N # average over all rewards 

    return reward, loss # return avg reward and loss

## The Actor

We need to learn a policy which, given some state, outputs a distribution over all possible actions. As this is deep RL, we'll use a deep neural network to turn the observed state into the requisite action distribution. From this action distribution we can choose what action to take using `get_action`. Pytorch, brilliant as it is, makes our task incredibly easy, as we can use the `torch.distributions.Categorical` class for sampling.

You can experiment with all sorts of network architectures, but remember this is RL, not image classification on ImageNet, so you probably won't need a very deep network (HINT: look below at the state and action dimensionality to get a feel for the task).

In [35]:
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Actor, self).__init__() # super used to inherit functionality of parent class (nn.Module), for subclass Actor, and instance self
        
        # TODO: Fill in the code to define your policy
        
        if SMALLMODEL:
            Cin = state_dim # input layer dimension
            Cout = action_dim # output layer dimension
            hidden_dim1 = 256 # hidden layer dimension
            
            self.lin1 = nn.Linear(state_dim, hidden_dim1) # linear layer
            self.lin2 = nn.Linear(hidden_dim1, action_dim) # linear layer
            
        elif MEDMODEL:
            Cin = state_dim # input layer dimension
            Cout = action_dim # output layer dimension
            hidden_dim1 = 256 # hidden layer 1 dimension
            hidden_dim2 = 512 # hidden layer 2 dimension
            hidden_dim3 = 256 # hidden layer 3 dimension
            
            self.lin1 = nn.Linear(state_dim, hidden_dim1) # linear layer
            self.lin2 = nn.Linear(hidden_dim1, hidden_dim2) # linear layer
            self.lin3 = nn.Linear(hidden_dim2, hidden_dim3) # linear layer
            self.lin4 = nn.Linear(hidden_dim3, action_dim) # linear layer
            
            
        elif LARGEMODEL:
            Cin = state_dim # input layer dimension
            Cout = action_dim # output layer dimension
            hidden_dim1 = 512 # hidden layer 1 dimension
            hidden_dim2 = 512 # hidden layer 2 dimension
            hidden_dim3 = 1024 # hidden layer 3 dimension
            hidden_dim4 = 512 # hidden layer 4 dimension
            hidden_dim5 = 512 # hidden layer 5 dimension
            
            self.lin1 = nn.Linear(state_dim, hidden_dim1) # linear layer
            self.lin2 = nn.Linear(hidden_dim1, hidden_dim2) # linear layer
            self.lin3 = nn.Linear(hidden_dim2, hidden_dim3) # linear layer
            self.lin4 = nn.Linear(hidden_dim3, hidden_dim4) # linear layer
            self.lin5 = nn.Linear(hidden_dim4, hidden_dim5) # linear layer
            self.lin6 = nn.Linear(hidden_dim5, action_dim) # linear layer
            
        else:
            return
        
        #raise NotImplementedError
        
    def forward(self, state):
        
        # TODO: Fill in the code to run a forward pass of your policy to get a distribution over actions (HINT: probabilities sum to 1)
        
        if SMALLMODEL:
            state = F.relu(self.lin1(state)) # use relu transistion function
            state = F.relu(self.lin2(state)) # use relu transistion function
            return F.softmax(state, dim=-1) # sum probs to 1
        
        elif MEDMODEL:
            state = F.relu(self.lin1(state)) # use relu transistion function
            state = F.relu(self.lin2(state)) # use relu transistion function
            state = F.relu(self.lin3(state)) # use relu transistion function
            state = F.relu(self.lin4(state)) # use relu transistion function
            return F.softmax(state, dim=-1) # sum probs to 1
        
        elif LARGEMODEL:
            state = F.relu(self.lin1(state)) # use relu transistion function
            state = F.relu(self.lin2(state)) # use relu transistion function
            state = F.relu(self.lin3(state)) # use relu transistion function
            state = F.relu(self.lin4(state)) # use relu transistion function
            state = F.relu(self.lin5(state)) # use relu transistion function
            state = F.relu(self.lin6(state)) # use relu transistion function
            return F.softmax(state, dim=-1) # sum probs to 1
            
        else:
            return
            
        #raise NotImplementedError

    def get_policy(self, state):
        return Categorical(self(state))

    def get_action(self, state, greedy=None):
        if greedy is None:
            greedy = not self.training

        policy = self.get_policy(state)
        return MLE(policy) if greedy else policy.sample()

## The REINFORCE Agent

The Actor defines our policy, but we also have to define how and when we'll be updating our policy, which brings us to the agent. The agent will house the policy (an `Actor`), and can then be used to generate rollouts (using `forward()`) or update the policy given a list of rollouts (using `learn()`).

The REINFORCE algorithm naively uses the returns directly to weight the gradients, however this makes the variance in the policy gradient estimation very large. As a result, we will use a baseline which is a linear model which takes in a state and outputs the return (sounds like a value function, right?). Except we're not going to train our baseline using gradient descent, instead we'll just solve the linear system analytically in every iteration, and use the solution in the next iteration. Don't worry about training/updating the baseline, but you do have to use it in the right way. (Optional experiment: try removing the baseline and see how performance changes)

In [36]:
class REINFORCE(nn.Module):
    
    def __init__(self, state_dim, action_dim, discount=0.97, lr=1e-3, weight_decay=1e-4):
        super(REINFORCE, self).__init__()
        self.actor = Actor(state_dim, action_dim)
        
        self.baseline = nn.Linear(state_dim, 1)
        
        # TODO: create an optimizer for the parameters of your actor (HINT: use the passed in lr and weight_decay args)
        
        self.optimizer = optim.Adam(self.parameters(), lr=lr, weight_decay=weight_decay) # create optimized for actor

        #raise NotImplementedError
    
        self.discount = discount
        
    def forward(self, state):
        return self.actor.get_action(state)
    
    def learn(self, states, actions, rewards):
        '''
        Takes in three arguments each of which is a list with equal length. Each element in the list is a 
        pytorch tensor with 1 row for every step in the episode, and the columns are state_dim, action_dim, 
        and 1, respectively.
        '''
        
        # TODO: implement the REINFORCE algorithm (HINT: check the slides/papers)
        
        returns = [compute_returns(rs, discount=self.discount) for rs in rewards]
        
        states, actions, returns = torch.cat(states), torch.cat(actions), torch.cat(returns)
        
        self.optimizer.zero_grad() # https://pytorch.org/docs/stable/optim.html
        
        lossSum = 0
        for i in range(len(states)): # iterate through state, action, reward for single episode 
            s = states[i] # get state
            a = actions[i] # get action
            r = returns[i] # get associated reward

            pi = self.actor.get_policy(s) # get the policy pi

            #loss = -pi.log_prob(a) * r # https://pytorch.org/docs/stable/distributions.html
            loss = pi.log_prob(a) * r # https://pytorch.org/docs/stable/distributions.html

            lossSum += loss
            
        lossSum.backward()
        self.optimizer.step()
            
        #raise NotImplementedError
        
        error = F.mse_loss(self.baseline(states).squeeze(), returns).detach()
        solve(states, returns, out=self.baseline)
        #error = F.mse_loss(self.baseline(states).squeeze(), returns).detach()
        
        return error.item() # Returns a rough estimate of the error in the baseline (dont worry about this too much)

## The Critic

Now we can introduce a critic, which is essentially a value function to estimate the expected discounted reward of a state.

In [37]:
class Critic(nn.Module):
    def __init__(self, state_dim):
        super(Critic, self).__init__()
        
        # TODO: define your value function network
        Cin = state_dim # input layer dimension
        Cout = 1 # output layer dimension - 1 for single value
        hidden_dim1 = 256 # hidden layer dimension

        self.lin1 = nn.Linear(Cin, hidden_dim1) # linear layer
        self.lin2 = nn.Linear(hidden_dim1, Cout) # linear layer

        #raise NotImplementedError

    def forward(self, state):
        
        # TODO: apply your value function network to get a value given this batch of states
        state = F.relu(self.lin1(state)) # use relu transistion function
        state = F.relu(self.lin2(state)) # use relu transistion function
        return F.softmax(state, dim=-1) # sum probs to 1
        
        #raise NotImplementedError

## The A3C Agent

Now we can put the actor and critic together using the A3C algorithm. It turns out, the tasks in the gym are all so simple that there is essentially no gain in parallelization, so technically we're implementing A2C (no async), but the RL part is the same.

In [38]:
class A3C(nn.Module):
    
    def __init__(self, state_dim, action_dim, discount=0.97, lr=1e-3, weight_decay=1e-4):
        super(A3C, self).__init__()
        self.actor = Actor(state_dim, action_dim)
        self.critic = Critic(state_dim)
        self.baseline = nn.Linear(state_dim, 1)
        
        # TODO: create an optimizer for the parameters of your actor (HINT: use the passed in lr and weight_decay args)
        # (HINT: the actor and critic have different objectives, so how many optimizers do you need?)
        
        # Create 2 optimizers, one for the actor, one for the critic (assume same lr and weight decay)
        self.optimizerAct = optim.Adam(self.actor.parameters(), lr=lr, weight_decay=weight_decay) # create optimized for actor
        self.optimizerCrit = optim.Adam(self.critic.parameters(), lr=lr, weight_decay=weight_decay) # create optimized for critic

        #raise NotImplementedError
    
        self.discount = discount
        
    def forward(self, state):
        return self.actor.get_action(state)
    
    def learn(self, states, actions, rewards):
        
        returns = [compute_returns(rs, discount=self.discount) for rs in rewards]
        
        states, actions, returns = torch.cat(states), torch.cat(actions), torch.cat(returns)
        
        # TODO: implement A3C (HINT: algorithm details found in A3C paper supplement) 
        # (HINT2: the algorithm is actually very similar to REINFORCE, the only difference is now we have a critic, what might that do?)
        
        self.optimizerAct.zero_grad() # https://pytorch.org/docs/stable/optim.html
        self.optimizerCrit.zero_grad()
        
        lossSum = 0.0
        for i in range(len(states)): # iterate through state, action, reward for single episode 
            s = states[i] # get state
            a = actions[i] # get action
            r = returns[i] # get associated reward
            
            # We used the pseudocode from the paper (page 14): https://arxiv.org/pdf/1602.01783.pdf
            V = self.critic(s)
            R = r + self.discount * V # check terminal state case??????????
            pi = self.actor.get_policy(s) # get the policy pi
            actLoss = -pi.log_prob(a) * (R - V) # https://pytorch.org/docs/stable/distributions.html
            critLoss = -(R - V) ** 2
            lossSum += actLoss + critLoss
            
        lossSum.backward()
        self.optimizerAct.step()
        self.optimizerCrit.step()
        
        error = F.mse_loss(self.baseline(states).squeeze(), returns).detach() # adapted from given code for actor 
        solve(states, returns, out=self.baseline)
        #error = F.mse_loss(self.baseline(states).squeeze(), returns).detach()
        
        return error.item() # Returns a rough estimate of the error in the baseline (dont worry about this too much)
        
        #raise NotImplementedError

## Part 1: Balancing a pole with a cart

First, we'll test both algorithms on a very simple toy system: the cartpole. Eventhough it's very low dimensional (state=4, action=2), this task is nontrival because it is underactuated. Nevertheless after a few thousand episodes our policy shouldn't have a problem! 

In [39]:
# Optimization hyperparameters
lr = 1e-3
weight_decay = 1e-4

In [40]:
env_name = 'CartPole-v1' 
#env_name = 'LunarLander-v2'
#env_name = 'Acrobot-v1'

e = Pytorch_Gym_Env(env_name)
state_dim = e.observation_space.shape[0]
action_dim = e.action_space.n

In [41]:
# Debug Cell

print(action_dim)
print(state_dim)
print(e.observation_space)
print(e.observation_space.high)
print(e.action_space)

2
4
Box(4,)
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Discrete(2)


In [42]:
# Choose what agent to use
#agent = REINFORCE(state_dim, action_dim, lr=lr, weight_decay=weight_decay)
agent = A3C(state_dim, action_dim, lr=lr, weight_decay=weight_decay)

total_episodes = 0
print(agent) # Let's take a look at what we're working with...

A3C(
  (actor): Actor(
    (lin1): Linear(in_features=4, out_features=256, bias=True)
    (lin2): Linear(in_features=256, out_features=2, bias=True)
  )
  (critic): Critic(
    (lin1): Linear(in_features=4, out_features=256, bias=True)
    (lin2): Linear(in_features=256, out_features=1, bias=True)
  )
  (baseline): Linear(in_features=4, out_features=1, bias=True)
)


In [43]:
# Create a 
gen = Generator(e, agent)

### Let's do this!!

Below is the loop to train and evaluate your agent. You can play around with the number of iterations to run, and the number of rollouts per iteration. 

You can rerun this cell multiple times to keep training your model for more episodes. In any case, it shouldn't take more than 30 min to an 1 hour to train. (training never took me more than 5 min). HINT: Keep an eye on the eval_reward, it'll be pretty noisy, but if that should be slowly increasing.

In [44]:
# Lists for plottings model training

# REINFORCE Files 
#file_name = 'data/cartpoleREINFORCE.pkl'
#file_name = 'data/lunarlanderREINFORCE.pkl'
#file_name = 'data/acrobatREINFORCE.pkl'

# A3C Files
file_name = 'data/cartpoleA3C.pkl'
#file_name = 'data/lunarlander_A3C.pkl'
#file_name = 'data/acrobatlander_A3C.pkl'


# Lists to store plotting information
episodeList = []
totalRewardList = []
trainLossList = []
evalRewardList = []

In [47]:
num_iter = 100
#num_iter = 500 # for longer training
num_train = 10
num_eval = 10 # dont change this

for itr in range(num_iter):
    #agent.model.epsilon = epsilon * epsilon_decay ** (total_episodes / epsilon_decay_episodes)
    #print('** Iteration {}/{} **'.format(itr+1, num_iter))
    train_reward, train_loss = run_iteration('train', num_train, agent, gen)
    #train_reward, train_loss = run_iteration('train', num_train, agent, gen, render=True)
    eval_reward, _ = run_iteration('eval', num_eval, agent, gen)
    total_episodes += num_train
    print('Ep:{}: reward={:.3f}, loss={:.3f}, eval={:.3f}'.format(total_episodes, train_reward, train_loss, eval_reward))
    
    # add values to plotting lists
    episodeList.append(total_episodes)
    totalRewardList.append(train_reward)
    trainLossList.append(train_loss)
    evalRewardList.append(eval_reward)
    
    if eval_reward > 499 and env_name == 'CartPole-v1': # dont change this
        print('Success!!! You have solved cartpole task! Time for a bigger challenge!')
        #break
        
    if env_name == 'Acrobot-v1':
        if len(episodeList) > 10:
            if evalRewardList[-1] > -100:
                if evalRewardList[-2] > -100: 
                    if evalRewardList[-3] > -100:
                        if evalRewardList[-4] > -100:
                            print('solved acrobat!')
                            break


print('Done Training')

# save model
# Saving Model Metrics
print('Saving Metrics to Plot')

# dictionary to hold all model data
plotDict = {}
plotDict['episodeList'] = episodeList
plotDict['totalRewardList'] = totalRewardList
plotDict['trainLossList'] = trainLossList
plotDict['evalRewardList'] = evalRewardList

# dump values to pickle file for storage
with open(file_name, 'wb') as pklOutput:
    pkl.dump(plotDict, pklOutput, pkl.HIGHEST_PROTOCOL)

print('All Metrics Saved')
    
    
    

Ep:1010: reward=24.000, loss=37.787, eval=118.900
Ep:1020: reward=32.000, loss=40.742, eval=245.500
Ep:1030: reward=37.000, loss=111.563, eval=236.900
Ep:1040: reward=23.800, loss=154.022, eval=214.800
Ep:1050: reward=30.800, loss=126.136, eval=180.400
Ep:1060: reward=18.000, loss=33.650, eval=210.300
Ep:1070: reward=23.300, loss=49.865, eval=176.100
Ep:1080: reward=20.300, loss=26.368, eval=284.800
Ep:1090: reward=41.100, loss=86.481, eval=230.000
Ep:1100: reward=23.400, loss=43.785, eval=196.100
Ep:1110: reward=29.800, loss=51.320, eval=114.400
Ep:1120: reward=24.400, loss=65.232, eval=171.500
Ep:1130: reward=32.600, loss=140.103, eval=225.600
Ep:1140: reward=31.500, loss=61.587, eval=295.900
Ep:1150: reward=30.000, loss=51.124, eval=324.100
Ep:1160: reward=18.400, loss=58.910, eval=425.000
Ep:1170: reward=31.300, loss=148.251, eval=405.600
Ep:1180: reward=32.800, loss=39.557, eval=340.300
Ep:1190: reward=38.500, loss=76.025, eval=421.000
Ep:1200: reward=33.800, loss=51.265, eval=459

FileNotFoundError: [Errno 2] No such file or directory: 'data/cartpoleA3C.pkl'

In [None]:
# You can visualize your policy at any time
avgReward, avgLoss = run_iteration('eval', 5, agent, gen, render=True)
print(avgReward, avgLoss)

### Analysis

Plot the performance of each of your agents for the cartpole task and one additional task. When choosing a new environment, make sure is has a discrete action space. For each plot the x axis should show the total number of episodes the model was trained on, and the y axis shows the average total reward per episode.

You can leave the plots as cell outputs below, or you can save them as images and submit them separately.

### Deliverables
- single plot showing both the REINFORCE algorithm's performance, and A3C's performance on the same plot for the cartpole environment (CartPole-v1).
- single plot showing both the REINFORCE algorithm's performance, and A3C's performance on the same plot for a second environment of your choice (suggested -> LunarLander-v2, it's a little tricky but watching the agent fly spaceships is very entertaining!).
- in every case you models have to learn something for full credit.

In [None]:
CARTPOLE = True
LUNARLANDER = True
ACROBAT = True

if CARTPOLE:
    with open('data/cartpoleREINFORCE_smallmodel.pkl', 'rb') as cartpoleREINFORCE_smallmodel:
        cpRDict_sml = pkl.load(cartpoleREINFORCE_smallmodel)
        cpREINFORCE_sml_episode = cpRDict_sml['episodeList']
        cpREINFORCE_sml_totalReward = cpRDict_sml['totalRewardList']
        cpREINFORCE_sml_trainLoss = cpRDict_sml['trainLossList']
        cpREINFORCE_sml_evalReward = cpRDict_sml['evalRewardList']
        
    with open('data/cartpoleREINFORCE_medmodel.pkl', 'rb') as cartpoleREINFORCE_mediummodel:
        cpRDict_med = pkl.load(cartpoleREINFORCE_mediummodel)
        cpREINFORCE_med_episode = cpRDict_med['episodeList']
        cpREINFORCE_med_totalReward = cpRDict_med['totalRewardList']
        cpREINFORCE_med_trainLoss = cpRDict_med['trainLossList']
        cpREINFORCE_med_evalReward = cpRDict_med['evalRewardList']
        
    with open('data/cartpoleREINFORCE_largemodel.pkl', 'rb') as cartpoleREINFORCE_largemodel:
        cpRDict_lrg = pkl.load(cartpoleREINFORCE_largemodel)
        cpREINFORCE_lrg_episode = cpRDict_lrg['episodeList']
        cpREINFORCE_lrg_totalReward = cpRDict_lrg['totalRewardList']
        cpREINFORCE_lrg_trainLoss = cpRDict_lrg['trainLossList']
        cpREINFORCE_lrg_evalReward = cpRDict_lrg['evalRewardList']
    
    plt.figure(figsize=(15,20))
    plt.tight_layout()
    
    plt.subplot(3,1,1)
    plt.plot(cpREINFORCE_sml_episode, cpREINFORCE_sml_totalReward,'r', label='Small Model')
    plt.plot(cpREINFORCE_med_episode, cpREINFORCE_med_totalReward,'g', label='Medium Model')
    plt.plot(cpREINFORCE_lrg_episode, cpREINFORCE_lrg_totalReward,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Avg Reward')
    plt.title('Cartpole - Avg Reward vs Training Episodes')
    
    plt.subplot(3,1,2)
    plt.plot(cpREINFORCE_sml_episode, cpREINFORCE_sml_trainLoss,'r', label='Small Model')
    plt.plot(cpREINFORCE_med_episode, cpREINFORCE_med_trainLoss,'g', label='Medium Model')
    plt.plot(cpREINFORCE_lrg_episode, cpREINFORCE_lrg_trainLoss,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Train Loss')
    plt.title('Cartpole - Training Loss vs Training Episodes')
    
    plt.subplot(3,1,3)
    plt.plot(cpREINFORCE_sml_episode, cpREINFORCE_sml_evalReward,'r', label='Small Model')
    plt.plot(cpREINFORCE_med_episode, cpREINFORCE_med_evalReward,'g', label='Medium Model')
    plt.plot(cpREINFORCE_lrg_episode, cpREINFORCE_lrg_evalReward,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Eval Reward')
    plt.title('Cartpole - Eval Reward vs Training Episodes')

    
    
if LUNARLANDER:
    with open('data/lunarlanderREINFORCE_smallmodel.pkl', 'rb') as lunarlanderREINFORCE_smallmodel:
        llRDict_sml = pkl.load(lunarlanderREINFORCE_smallmodel)
        llREINFORCE_sml_episode = llRDict_sml['episodeList']
        llREINFORCE_sml_totalReward = llRDict_sml['totalRewardList']
        llREINFORCE_sml_trainLoss = llRDict_sml['trainLossList']
        llREINFORCE_sml_evalReward = llRDict_sml['evalRewardList']
        
    with open('data/lunarlanderREINFORCE_medmodel.pkl', 'rb') as lunarlanderREINFORCE_mediummodel:
        llRDict_med = pkl.load(lunarlanderREINFORCE_mediummodel)
        llREINFORCE_med_episode = llRDict_med['episodeList']
        llREINFORCE_med_totalReward = llRDict_med['totalRewardList']
        llREINFORCE_med_trainLoss = llRDict_med['trainLossList']
        llREINFORCE_med_evalReward = llRDict_med['evalRewardList']
        
    with open('data/lunarlanderREINFORCE_largemodel.pkl', 'rb') as lunarlanderREINFORCE_largemodel:
        llRDict_lrg = pkl.load(lunarlanderREINFORCE_largemodel)
        llREINFORCE_lrg_episode = llRDict_lrg['episodeList']
        llREINFORCE_lrg_totalReward = llRDict_lrg['totalRewardList']
        llREINFORCE_lrg_trainLoss = llRDict_lrg['trainLossList']
        llREINFORCE_lrg_evalReward = llRDict_lrg['evalRewardList']
 
    plt.figure(figsize=(15,20))
    plt.tight_layout()
    
    plt.subplot(3,1,1)
    plt.plot(llREINFORCE_sml_episode, llREINFORCE_sml_totalReward,'r', label='Small Model')
    plt.plot(llREINFORCE_med_episode, llREINFORCE_med_totalReward,'g', label='Medium Model')
    plt.plot(llREINFORCE_lrg_episode, llREINFORCE_lrg_totalReward,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Avg Reward')
    plt.title('Lunar Lander - Avg Reward vs Training Episodes')
    
    plt.subplot(3,1,2)
    plt.plot(llREINFORCE_sml_episode, llREINFORCE_sml_trainLoss,'r', label='Small Model')
    plt.plot(llREINFORCE_med_episode, llREINFORCE_med_trainLoss,'g', label='Medium Model')
    plt.plot(llREINFORCE_lrg_episode, llREINFORCE_lrg_trainLoss,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Train Loss')
    plt.title('Lunar Lander - Training Loss vs Training Episodes')
    
    plt.subplot(3,1,3)
    plt.plot(llREINFORCE_sml_episode, llREINFORCE_sml_evalReward,'r', label='Small Model')
    plt.plot(llREINFORCE_med_episode, llREINFORCE_med_evalReward,'g', label='Medium Model')
    plt.plot(llREINFORCE_lrg_episode, llREINFORCE_lrg_evalReward,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Eval Reward')
    plt.title('Lunar Lander - Eval Reward vs Training Episodes')
  
    
    
if ACROBAT:
    with open('data/acrobatREINFORCE_smallmodel.pkl', 'rb') as acrobatREINFORCE_smallmodel:
        abRDict_sml = pkl.load(acrobatREINFORCE_smallmodel)
        abREINFORCE_sml_episode = abRDict_sml['episodeList']
        abREINFORCE_sml_totalReward = abRDict_sml['totalRewardList']
        abREINFORCE_sml_trainLoss = abRDict_sml['trainLossList']
        abREINFORCE_sml_evalReward = abRDict_sml['evalRewardList']
        
    with open('data/acrobatREINFORCE_medmodel.pkl', 'rb') as acrobatREINFORCE_mediummodel:
        abRDict_med = pkl.load(acrobatREINFORCE_mediummodel)
        abREINFORCE_med_episode = abRDict_med['episodeList']
        abREINFORCE_med_totalReward = abRDict_med['totalRewardList']
        abREINFORCE_med_trainLoss = abRDict_med['trainLossList']
        abREINFORCE_med_evalReward = abRDict_med['evalRewardList']
        
    with open('data/acrobatREINFORCE_largemodel.pkl', 'rb') as acrobatREINFORCE_largemodel:
        abRDict_lrg = pkl.load(acrobatREINFORCE_largemodel)
        abREINFORCE_lrg_episode = abRDict_lrg['episodeList']
        abREINFORCE_lrg_totalReward = abRDict_lrg['totalRewardList']
        abREINFORCE_lrg_trainLoss = abRDict_lrg['trainLossList']
        abREINFORCE_lrg_evalReward = abRDict_lrg['evalRewardList']

    plt.figure(figsize=(15,20))
    plt.tight_layout()
    
    plt.subplot(3,1,1)
    plt.plot(abREINFORCE_sml_episode, abREINFORCE_sml_totalReward,'r', label='Small Model')
    plt.plot(abREINFORCE_med_episode, abREINFORCE_med_totalReward,'g', label='Medium Model')
    plt.plot(abREINFORCE_lrg_episode, abREINFORCE_lrg_totalReward,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Avg Reward')
    plt.title('Acrobat - Avg Reward vs Training Episodes')
    
    plt.subplot(3,1,2)
    plt.plot(abREINFORCE_sml_episode, abREINFORCE_sml_trainLoss,'r', label='Small Model')
    plt.plot(abREINFORCE_med_episode, abREINFORCE_med_trainLoss,'g', label='Medium Model')
    plt.plot(abREINFORCE_lrg_episode, abREINFORCE_lrg_trainLoss,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Train Loss')
    plt.title('Acrobat - Training Loss vs Training Episodes')
    
    plt.subplot(3,1,3)
    plt.plot(abREINFORCE_sml_episode, abREINFORCE_sml_evalReward,'r', label='Small Model')
    plt.plot(abREINFORCE_med_episode, abREINFORCE_med_evalReward,'g', label='Medium Model')
    plt.plot(abREINFORCE_lrg_episode, abREINFORCE_lrg_evalReward,'b', label='Large Model')
    
    plt.legend()
    plt.xlabel('Episodes')
    plt.ylabel('Eval Reward')
    plt.title('Acrobat - Eval Reward vs Training Episodes')

In [None]:
if CARTPOLE:
    with open('data/cartpoleREINFORCE_converge.pkl', 'rb') as cartpoleREINFORCE_conv:
        cpRDict_conv = pkl.load(cartpoleREINFORCE_conv)
        cpREINFORCE_conv_episode = cpRDict_conv['episodeList']
        cpREINFORCE_conv_totalReward = cpRDict_conv['totalRewardList']
        cpREINFORCE_conv_trainLoss = cpRDict_conv['trainLossList']
        cpREINFORCE_conv_evalReward = cpRDict_conv['evalRewardList']
    
    plt.figure(figsize=(15,20))
    plt.tight_layout()
    
    plt.subplot(3,1,1)
    plt.plot(cpREINFORCE_conv_episode[0:200], cpREINFORCE_conv_totalReward[0:200])
    plt.xlabel('Episodes')
    plt.ylabel('Avg Reward')
    plt.title('Cartpole - Avg Reward vs Training Episodes')
    
    plt.subplot(3,1,2)
    plt.plot(cpREINFORCE_conv_episode[0:200], cpREINFORCE_conv_trainLoss[0:200])
    plt.xlabel('Episodes')
    plt.ylabel('Train Loss')
    plt.title('Cartpole - Training Loss vs Training Episodes')
    
    plt.subplot(3,1,3)
    plt.plot(cpREINFORCE_conv_episode[0:200], cpREINFORCE_conv_evalReward[0:200])
    plt.xlabel('Episodes')
    plt.ylabel('Eval Reward')
    plt.title('Cartpole - Eval Reward vs Training Episodes')


In [None]:
if ACROBAT:
    with open('data/acrobatREINFORCE_conv.pkl', 'rb') as acrobatREINFORCE_conv:
        abRDict_conv = pkl.load(acrobatREINFORCE_conv)
        abREINFORCE_conv_episode = abRDict_conv['episodeList']
        abREINFORCE_conv_totalReward = abRDict_conv['totalRewardList']
        abREINFORCE_conv_trainLoss = abRDict_conv['trainLossList']
        abREINFORCE_conv_evalReward = abRDict_conv['evalRewardList']
    
    plt.figure(figsize=(15,20))
    plt.tight_layout()
    
    plt.subplot(3,1,1)
    plt.plot(cpREINFORCE_conv_episode[0:200], cpREINFORCE_conv_totalReward[0:200])
    plt.xlabel('Episodes')
    plt.ylabel('Avg Reward')
    plt.title('Cartpole - Avg Reward vs Training Episodes')
    
    plt.subplot(3,1,2)
    plt.plot(cpREINFORCE_conv_episode[0:200], cpREINFORCE_conv_trainLoss[0:200])
    plt.xlabel('Episodes')
    plt.ylabel('Train Loss')
    plt.title('Cartpole - Training Loss vs Training Episodes')
    
    plt.subplot(3,1,3)
    plt.plot(cpREINFORCE_conv_episode[0:200], cpREINFORCE_conv_evalReward[0:200])
    plt.xlabel('Episodes')
    plt.ylabel('Eval Reward')
    plt.title('Cartpole - Eval Reward vs Training Episodes')