# Reinforcement Learning - Policy Gradient
If you want to test/submit your solution **restart the kernel, run all cells and submit the pg_autograde.py file into codegrade.**

In [1]:
# This cell imports %%execwritefile command (executes cell and writes it into file). 
from custommagics import CustomMagics
get_ipython().register_magics(CustomMagics)

In [2]:
%%execwritefile pg_autograde.py
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from tqdm import tqdm as _tqdm

def tqdm(*args, **kwargs):
    return _tqdm(*args, **kwargs, mininterval=1)  # Safety, do not overflow buffer

Overwriting pg_autograde.py


In [3]:
%matplotlib inline

import matplotlib.pyplot as plt
import sys

import gym
import time

assert sys.version_info[:3] >= (3, 6, 0), "Make sure you have Python 3.6 installed!"

---

## 3. Policy Gradient

### 3.1 Policy Network

In order to implement policy gradient, we will first implement a class with a policy network. Although in general this does not have to be the case, we will use an architecture very similar to the Q-network that we used (two layers with ReLU activation for the hidden layer). Since we have discrete actions, our model will output one value per action, where each value represents the (normalized!) probability of selecting that action. *Use the softmax activation function.*

In [4]:
%%execwritefile -a pg_autograde.py

class NNPolicy(nn.Module):
    
    def __init__(self, num_hidden=128):
        nn.Module.__init__(self)
        self.l1 = nn.Linear(4, num_hidden)
        self.l2 = nn.Linear(num_hidden, 2)

    def forward(self, x):
        """
        Performs a forward pass through the network.
        
        Args:
            x: input tensor (first dimension is a batch dimension)
            
        Return:
            Probabilities of performing all actions in given input states x. Shape: batch_size x action_space_size
        """
        # YOUR CODE HERE
        model = torch.nn.Sequential(
            self.l1,
            nn.ReLU(),
            self.l2,
            nn.Softmax(dim = 1)
        )
        return model(x.float())
        
    def get_probs(self, obs, actions):
        """
        This function takes a tensor of states and a tensor of actions and returns a tensor that contains 
        a probability of perfoming corresponding action in all states (one for every state action pair). 

        Args:
            obs: a tensor of states. Shape: batch_size x obs_dim
            actions: a tensor of actions. Shape: batch_size x 1

        Returns:
            A torch tensor filled with probabilities. Shape: batch_size x 1.
        """
        # YOUR CODE HERE
        probs = self.forward(obs)
        actions = actions.squeeze(1)
        
#         print('actions', actions, actions.shape)
#         print('probs', probs, probs.shape)

        # Selects the probability from the action
        # Should be rewritten with indexing instead of for loop
        action_probs = torch.tensor([prob[action] for prob, action in zip(probs, actions)])
        action_probs = action_probs.unsqueeze(1)
        
        return action_probs
    
    def sample_action(self, obs):
        """
        This method takes a state as input and returns an action sampled from this policy.  

        Args:
            obs: state as a tensor. Shape: 1 x obs_dim or obs_dim

        Returns:
            An action (int).
        """
        # YOUR CODE HERE
        # Reshape if necessary, input shape is variable
        assert len(obs.shape) in [1, 2]
        if len(obs.shape) == 1: obs = obs.unsqueeze(0)
        
        probs = self.forward(obs).squeeze(0).detach().numpy() # Put observation through network
        action = np.random.choice([0, 1], p = probs) # Select action with weighted by network output
        
        return action

Appending to pg_autograde.py


In [5]:
# Let's instantiate and test if it works
num_hidden = 128
torch.manual_seed(1234)
policy = NNPolicy(num_hidden)

states = torch.rand(10, 4)
actions = torch.randint(low=0, high=2, size=(10,1))
print(actions)

# Does the outcome make sense?
forward_probs = policy.forward(states)
print(forward_probs)
assert forward_probs.shape == (10,2), "Output of forward has incorrect shape."
sampled_action = policy.sample_action(states[0])
assert sampled_action == 0 or sampled_action == 1, "Output of sample action is not 0 or 1"

action_probs = policy.get_probs(states, actions)
print(action_probs)
assert action_probs.shape == (10,1), "Output of get_probs has incorrect shape."

tensor([[0],
        [1],
        [1],
        [0],
        [0],
        [1],
        [0],
        [1],
        [0],
        [0]])
tensor([[0.4578, 0.5422],
        [0.4657, 0.5343],
        [0.4563, 0.5437],
        [0.4634, 0.5366],
        [0.4564, 0.5436],
        [0.4725, 0.5275],
        [0.4769, 0.5231],
        [0.4834, 0.5166],
        [0.4797, 0.5203],
        [0.4618, 0.5382]], grad_fn=<SoftmaxBackward0>)
tensor([[0.4578],
        [0.5343],
        [0.5437],
        [0.4634],
        [0.4564],
        [0.5275],
        [0.4769],
        [0.5166],
        [0.4797],
        [0.4618]])


### 3.2 Monte Carlo REINFORCE

Now we will implement the *Monte Carlo* policy gradient algorithm. Remember that this means that we will estimate returns for states by sample episodes. Compared to DQN, this means that we do *not* perform an update step at every environment step, but only at the end of each episode. This means that we should generate an episode of data, compute the REINFORCE loss (which requires computing the returns) and then perform a gradient step.

* You can use `torch.multinomial` to sample from a categorical distribution.
* The REINFORCE loss is defined as $- \sum_t \log \pi_\theta(a_t|s_t) G_t$, which means that you should compute the (discounted) return $G_t$ for all $t$. Make sure that you do this in **linear time**, otherwise your algorithm will be very slow! Note the - (minus) since you want to maximize return while you want to minimize the loss.

To help you, we wrote down signatures of a few helper functions. Start by implementing a sampling routine that samples a single episode (similarly to the one in Monte Carlo lab).

In [31]:
%%execwritefile -a pg_autograde.py

def sample_episode(env, policy):
    """
    A sampling routine. Given environment and a policy samples one episode and returns states, actions, rewards
    and dones from environment's step function as tensors.

    Args:
        env: OpenAI gym environment.
        policy: A policy which allows us to sample actions with its sample_action method.

    Returns:
        Tuple of tensors (states, actions, rewards, dones). All tensors should have same first dimension and 
        should have dim=2. This means that vectors of length N (states, rewards, actions) should be Nx1.
        Hint: Do not include the state after termination in states.
    """
    states = []
    actions = []
    rewards = []
    dones = []
    
    # YOUR CODE HERE
    state = env.reset()
    done = False
    
    while not done:
        action = policy.sample_action(torch.from_numpy(state))
        new_state, reward, done, info = env.step(action)
        states.append(torch.as_tensor(state))
        actions.append(torch.as_tensor(action))
        rewards.append(torch.as_tensor(reward))
        dones.append(torch.as_tensor(done))
        
        state = new_state
    
    
    states = torch.stack(states)
    actions = torch.stack(actions).unsqueeze(1)
    rewards = torch.stack(rewards).unsqueeze(1)
    dones = torch.stack(dones).unsqueeze(1)
    
    print(states.shape, actions.shape, rewards.shape, dones.shape)
    
    return states, actions, rewards, dones

Appending to pg_autograde.py


In [155]:
# Let's sample some episodes
env = gym.envs.make("CartPole-v1")
num_hidden = 128
torch.manual_seed(1234)
policy = NNPolicy(num_hidden)
for episode in range(3):
    trajectory_data = sample_episode(env, policy)

torch.Size([12, 4]) torch.Size([12, 1]) torch.Size([12, 1]) torch.Size([12, 1])
torch.Size([15, 4]) torch.Size([15, 1]) torch.Size([15, 1]) torch.Size([15, 1])
torch.Size([13, 4]) torch.Size([13, 1]) torch.Size([13, 1]) torch.Size([13, 1])


Now implement loss computation and training loop of the algorithm.

In [165]:
states, actions, rewards, dones = trajectory_data
t = torch.arange(0, len(states)).unsqueeze(1)
action_probs = policy.get_probs(states, actions)
r = t.squeeze(1)
t = t.squeeze(1) + 1
grid_r, grid_t = torch.meshgrid(r, t)
grid = (grid_r*grid_t).T
grid[0]

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [166]:
grid

tensor([[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12],
        [  0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24],
        [  0,   3,   6,   9,  12,  15,  18,  21,  24,  27,  30,  33,  36],
        [  0,   4,   8,  12,  16,  20,  24,  28,  32,  36,  40,  44,  48],
        [  0,   5,  10,  15,  20,  25,  30,  35,  40,  45,  50,  55,  60],
        [  0,   6,  12,  18,  24,  30,  36,  42,  48,  54,  60,  66,  72],
        [  0,   7,  14,  21,  28,  35,  42,  49,  56,  63,  70,  77,  84],
        [  0,   8,  16,  24,  32,  40,  48,  56,  64,  72,  80,  88,  96],
        [  0,   9,  18,  27,  36,  45,  54,  63,  72,  81,  90,  99, 108],
        [  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120],
        [  0,  11,  22,  33,  44,  55,  66,  77,  88,  99, 110, 121, 132],
        [  0,  12,  24,  36,  48,  60,  72,  84,  96, 108, 120, 132, 144],
        [  0,  13,  26,  39,  52,  65,  78,  91, 104, 117, 130, 143, 156]])

In [99]:
reversed(torch.cumsum(r, dim = 0))

tensor([[171, 171],
        [153, 153],
        [136, 136],
        [120, 120],
        [105, 105],
        [ 91,  91],
        [ 78,  78],
        [ 66,  66],
        [ 55,  55],
        [ 45,  45],
        [ 36,  36],
        [ 28,  28],
        [ 21,  21],
        [ 15,  15],
        [ 10,  10],
        [  6,   6],
        [  3,   3],
        [  1,   1],
        [  0,   0]])

In [88]:
torch.flip( torch.cumsum(r, dim = 0), dims = 0)

TypeError: flip(): argument 'dims' must be tuple of ints, not int

In [82]:
torch.cumsum(r[:-1], dim = 0)

tensor([[ 1.,  1.],
        [ 2.,  2.],
        [ 3.,  3.],
        [ 4.,  4.],
        [ 5.,  5.],
        [ 6.,  6.],
        [ 7.,  7.],
        [ 8.,  8.],
        [ 9.,  9.],
        [10., 10.],
        [11., 11.],
        [12., 12.],
        [13., 13.],
        [14., 14.],
        [15., 15.],
        [16., 16.],
        [17., 17.],
        [18., 18.]])

In [73]:
rewards

tensor([[1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.]])

In [58]:
loss = - torch.sum(torch.log(action_probs) * torch.pow(0.99, t) rewards, dim = 1)
torch.log(action_probs) * rewards

tensor([[-0.5967],
        [-0.5524],
        [-0.9258],
        [-0.8595],
        [-0.8059],
        [-0.7338],
        [-0.7301],
        [-0.6494],
        [-0.5861],
        [-0.5424],
        [-0.9491],
        [-0.5401],
        [-0.4852],
        [-1.0380],
        [-0.4786],
        [-0.4301],
        [-0.3890],
        [-0.3497],
        [-0.3176]])

In [54]:
%%execwritefile -a pg_autograde.py

def compute_reinforce_loss(policy, episode, discount_factor):
    """
    Computes reinforce loss for given episode.

    Args:
        policy: A policy which allows us to get probabilities of actions in states with its get_probs method.

    Returns:
        loss: reinforce loss
    """
    # Compute the reinforce loss
    # Make sure that your function runs in LINEAR TIME
    # Note that the rewards/returns should be maximized 
    # while the loss should be minimized so you need a - somewhere
    
    # YOUR CODE HERE
    states, actions, rewards, dones = trajectory_data = episode
    action_probs = policy.get_probs(states, actions)
    loss = - torch.sum(torch.log(action_probs) * rewards, dim = 1)
    
    return loss

# YOUR CODE HERE
# raise NotImplementedError

def run_episodes_policy_gradient(policy, env, num_episodes, discount_factor, learn_rate, 
                                 sampling_function=sample_episode):
    optimizer = optim.Adam(policy.parameters(), learn_rate)
    
    episode_durations = []
    for i in range(num_episodes):
        
        # YOUR CODE HERE
        raise NotImplementedError
                           
        if i % 10 == 0:
            print("{2} Episode {0} finished after {1} steps"
                  .format(i, len(episode[0]), '\033[92m' if len(episode[0]) >= 195 else '\033[99m'))
        episode_durations.append(len(episode[0]))
        
    return episode_durations

Appending to pg_autograde.py


In [55]:
# Smoothing function for nicer plots
def smooth(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / float(N)

In [56]:
# Feel free to play around with the parameters!
num_episodes = 500
discount_factor = 0.99
learn_rate = 0.001
seed = 42
env = gym.envs.make("CartPole-v1")
torch.manual_seed(seed)
env.seed(seed)
policy = NNPolicy(num_hidden)

episode_durations_policy_gradient = run_episodes_policy_gradient(
    policy, env, num_episodes, discount_factor, learn_rate)

plt.plot(smooth(episode_durations_policy_gradient, 10))
plt.title('Episode durations per episode')
plt.legend(['Policy gradient'])

NotImplementedError: 

If you want to test/submit your solution **restart the kernel, run all cells and submit the pg_autograde.py file into codegrade.**