# Introduction to Policy Gradients

Authors: Sean Lin, Jaiveer Singh, Ethan Mehta

Estimated Time: 1 hr. 30 min.

References: 

- OpenAI Spinning Up (https://spinningup.openai.com/en/latest/index.html)
<br> OpenAI's open-source reinforcement learning resource, Spinning Up, provides the utility files and skeleton code for vanilla policy gradients that we built upon for this assignment.

- EECS 16A Segway Tours Problem: https://eecs16a.org/homework/prob4.pdf

- Vanilla Policy Gradients: https://medium.com/@aniket.tcdav/vanilla-policy-gradient-with-tensorflow-2-9855df271472, https://www.janisklaise.com/post/rl-policy-gradients/

## Introduction

Recall the classic EECS 16AB self-balancing segway problem. The segway problem is used as a means to develop fundamental understanding of control theory and linear algebra. Interestingly, we can use reinforcement learning methods to teach a segway how to self-balance! This assignment will use the segway problem as a means to introduce students to deep reinforcement learning.

In the previous assignment, you explored various methods for policy updates in a discrete state space and discrete action space. However, those update methods only work when the state space is discrete. Many real-world problems, such as the segway problem, have a continuous state space. This assignment discusses how to learn a policy when the state space is continuous, via a method known as policy gradients!

In this assignment, we will discuss how to represent a policy when given a continuous state space, which provides a first look at deep reinforcement learning, the use of deep learning in RL applications. We will then implement such a policy and introduce how to optimize it via a policy gradient. Next, we will implement a fundamental policy gradient algorithm known as <b> Vanilla Policy Gradient </b> (VPG), and use our algorithm to solve the self-balancing segway problem.

<img src="images/segway.png">

## Setup

We have prepared a conda environment for this assignment, because it uses packages such as TensorFlow, NumPy, and OpenAI Gym that can cause problems when their versions are incompatible. Follow these steps to set up the environment:

1) In this problem's directory, you should see an `environment.yml` file.
<br> 2) Create a new conda environment named spinningup by running `conda env create --file environment.yml`
<br> 3) Run `conda activate spinningup` to activate the environment
<br> 4) Run `python -m ipykernel install --user --name=spinningup` to register this environment as a Jupyter notebook kernel so that we can use it in this notebook.
<br> 5) Restart Jupyter Notebook. In the menu, go to `Kernel -> Change Kernel`. You should see an environment named `spinningup` in the list of kernels. Switch your kernel to `spinningup`, and now you are all set!
<br>
<img src='images/kernel_change.png' width="909" height="500">
</br>
<br> 6) Run the imports below

In [1]:
import numpy as np
import tensorflow as tf
import gym
import time
import core
import vpg
from utils.logx import EpochLogger
from utils.mpi_tf import MpiAdamOptimizer, sync_all_params
from utils.mpi_tools import mpi_fork, mpi_avg, proc_id, mpi_statistics_scalar, num_procs
from utils.run_utils import setup_logger_kwargs




<b> Note:</b> Because machine learning, and specifically deep reinforcement learning, is still a relatively nascent field, many machine-learning libraries are fragile and depend on packages that can quickly become deprecated (e.g. tensorflow 1.15 vs. tensorflow 2.0). To reproduce machine learning experiments and projects, it is essential to know how to freeze and export an environment for later use, as we did for you in this assignment. We recommend that all students taking EECS 16ML be well-versed in this practice.

## Policies given continuous state space, discrete action space

Recall that a policy is a rule used to decide what actions to take given the state or observations of the world. We can conceptualize a policy as a mapping from input observations to an output action. In the previous problem, we looked at worlds with discrete state spaces and discrete action spaces. For instance, in the taxi problem there is a discrete set of states that the taxi and passenger can be in. Likewise, there is a discrete set of actions that the taxi can take (move up, move down, move left, move right, pick up, and drop off). There, the policy takes the form of a dictionary, where one state maps to one action.
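
As a small illustration of the dictionary view, here is a minimal sketch of a tabular policy. The state encoding `(taxi_row, taxi_col, passenger_in_taxi)` and the handful of entries are hypothetical and simplified; they are not the actual taxi environment from the previous problem.

```python
# Hypothetical, simplified taxi states: (taxi_row, taxi_col, passenger_in_taxi).
# Both the encoding and the entries are illustrative only.
tabular_policy = {
    (0, 0, False): "move right",
    (0, 1, False): "pick up",
    (0, 1, True): "move down",
    (1, 1, True): "drop off",
}


def act(state):
    """Look up the single action that the policy assigns to a discrete state."""
    return tabular_policy[state]


print(act((0, 1, False)))

# A state with real-valued coordinates has no entry, and there are
# infinitely many such states, so we could never enumerate them as keys.
print((0.5, 1.25, False) in tabular_policy)
```

The lookup only works because every state the agent can reach is one of finitely many keys.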

The self-balancing segway presents a different challenge. In this world, the segway can move anywhere along the line and lean at any angle. The observations that we gather take the form:
$$[\text{position of cart}, \text{velocity of cart}, \text{angle of pole}, \text{rotation rate of pole}]$$

In other words, the state space is continuous. Our previous notion of a policy no longer applies: we cannot represent a policy as a dictionary because the set of states, which served as the keys of the dictionary, is infinite! Therefore, we must formulate our policy in some other way.

This is where neural networks come in handy. Given continuous observations as input, we can build a classification neural network that outputs the probability of picking each action. We then sample from the categorical (multinomial) distribution defined by these probabilities to pick which action to take.

In the case of CartPole, we have two actions that we can take: move left or move right. After feeding our observations into the neural network, we do some processing to obtain the probabilities of moving left ($p_l$) and right ($p_r$), respectively. Since we have only two actions, sampling reduces to a single $\text{Bernoulli}(p_l)$ draw (a binomial distribution with one trial)! In the following exercises, we will implement such a categorical policy.
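
To see what sampling an action looks like, here is a minimal NumPy sketch. The value `p_left = 0.7` is a made-up stand-in for what the network might output for one observation, and `sample_action` is a hypothetical helper, not part of the assignment code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical probability of moving left for one observation.
p_left = 0.7


def sample_action(rng, p_left):
    """Return 0 (move left) with probability p_left, otherwise 1 (move right)."""
    return 0 if rng.random() < p_left else 1


action = sample_action(rng, p_left)

# Sampling many times shows that the policy is stochastic: the same
# observation leads to "left" about 70% of the time, not every time.
draws = np.array([sample_action(rng, p_left) for _ in range(10_000)])
fraction_left = np.mean(draws == 0)
print(action, fraction_left)
```

Because the action is sampled rather than chosen greedily, the agent keeps exploring the less likely action some of the time.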


<img src="images/policy_nn.png" width="576" height="300">

### Task 1: Implement Multilayer Perceptron

In this task, we will define a categorical policy computed via a multilayer perceptron. 

First, fill in the function `mlp`, which builds a multilayer perceptron. It takes a parameter `layers`, which specifies the number of units per layer (e.g. if `layers=[64,64,2]`, we have 2 fully connected layers of 64 units and an output layer of 2 units). It also takes `x`, the input tensor, and the activation functions needed to construct the neural network.

Hint: You may find the function `tf.layers.dense` helpful.
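
If the layer bookkeeping is unclear, the following NumPy sketch shows what a multilayer perceptron computes, independent of TensorFlow. It is not the solution to this task: `init_mlp` and `forward` are hypothetical helpers, the weight initialization is arbitrary, and the input size of 4 is taken from the four-element observation described above.

```python
import numpy as np


def init_mlp(input_dim, layers, rng):
    """Create one (weights, bias) pair per fully connected layer."""
    params = []
    fan_in = input_dim
    for units in layers:
        weights = rng.normal(scale=0.1, size=(fan_in, units))  # arbitrary init
        bias = np.zeros(units)
        params.append((weights, bias))
        fan_in = units
    return params


def forward(x, params, activation=np.tanh, output_activation=None):
    """Apply `activation` after every layer except the last one."""
    for weights, bias in params[:-1]:
        x = activation(x @ weights + bias)
    weights, bias = params[-1]
    out = x @ weights + bias
    return out if output_activation is None else output_activation(out)


rng = np.random.default_rng(seed=0)
params = init_mlp(input_dim=4, layers=[64, 64, 2], rng=rng)

observations = rng.normal(size=(5, 4))  # batch of 5 made-up observations
logits = forward(observations, params)
n_params = sum(w.size + b.size for w, b in params)
print(logits.shape, n_params)  # (5, 2) 4610
```

The count works out as (4·64 + 64) + (64·64 + 64) + (64·2 + 2) = 4610, which agrees with the `pi` parameter count that the training log reports in Task 3.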



In [2]:
def mlp(x, layers=[64,64,2], activation=tf.tanh, output_activation=None):
    """
    Builds a multi-layer perceptron in Tensorflow.

    Args:
        x: Input tensor.

        layers: Tuple, list, or other iterable giving the number of units
            for each layer of the MLP.

        activation: Activation function for all layers except last.

        output_activation: Activation function for last layer.

    Returns:
        A TF symbol for the output of an MLP that takes x as an input.

    """
    
    ### SOLUTION ###
    for layer in layers[:-1]:
        x = tf.layers.dense(x, units=layer, activation=activation)
    return tf.layers.dense(x, units=layers[-1], activation=output_activation)
    ### END ###

### Task 2: Implement Categorical Policy

Now that we have a multilayer perceptron that can take in observations as input and compute the log-odds (logits) of our two actions, we want to define a categorical policy that makes use of this MLP to determine our action. Follow these steps to implement our categorical policy:

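1) Call `mlp` on the appropriate arguments to obtain a tensor of logits (log-odds) for our actions. Logits are defined as follows:

$$\text{logit}(p_a) = \log{\frac{p_a}{1-p_a}}$$

where $p_a$ is defined as the probability of taking action $a$. Strictly speaking, the network outputs only need to match the log-probabilities up to an additive constant, since the softmax in the next step normalizes them.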

2) To convert logits to probabilities, we take a softmax of the logits. For an action $a$, the softmax is defined as follows:

$$ \text{Softmax}(a) = \frac{e^{l_a}}{\sum_{a' \in A} e^{l_{a'}}} $$

where $l_a$ is the logit for action $a$, and $A$ is the set of all actions.
<br>In this case, we take the log of the softmax, because the policy gradient is computed from log-probabilities and `tf.nn.log_softmax` computes them in a numerically stable way.

3) Sample from a multinomial distribution to obtain our action.

<b> Hint 1: </b> Read up on the functions `tf.nn.log_softmax` and `tf.multinomial`
<br><b> Hint 2: </b> The number of actions that we can take is represented as `action_space.n`
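
The three steps above can be sketched in plain NumPy to check your understanding before writing the TensorFlow version. This is only an illustration with made-up logits: `log_softmax` and `one_hot` here are hypothetical stand-ins for `tf.nn.log_softmax` and `tf.one_hot`, and `rng.choice` plays the role of the multinomial sampler.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax using the log-sum-exp trick for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def one_hot(indices, depth):
    """Turn integer action indices into one-hot rows."""
    return np.eye(depth)[indices]


rng = np.random.default_rng(seed=0)

# Made-up logits with shape [batch, n_actions]; the last row has large
# values that would overflow a naive exp() without the max subtraction.
logits = np.array([[2.0, 0.5], [-1.0, 1.0], [1000.0, 1001.0]])

logp_all = log_softmax(logits)  # log-probability of every action
probs = np.exp(logp_all)        # each row sums to 1

# Sample one action index per row from the categorical distribution.
actions = np.array([rng.choice(2, p=row) for row in probs])

# Multiplying by a one-hot mask and summing keeps only the
# log-probability of the action that was actually sampled.
logp_pi = (one_hot(actions, depth=2) * logp_all).sum(axis=1)

# With two actions, the difference of the logits equals the log-odds.
log_odds = logp_all[:, 0] - logp_all[:, 1]
print(probs, actions, logp_pi, log_odds)
```

The one-hot trick is the same pattern used for `logp` and `logp_pi` in the function you are asked to complete.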

In [3]:
def mlp_categorical_policy(x, a, nn_sizes, activation, output_activation, action_space):
    """
    Builds TF symbols to sample actions and compute log-probabilities of those actions.

    Args:
        x: Input tensor of states. Shape [batch, obs_dim].

        a: Input tensor of integer action indices. Shape [batch].

        nn_sizes: Sizes of the layers for action network MLP, excluding the output layer.

        activation: Activation function for all layers except last.

        output_activation: Activation function for last layer (action layer).

        action_space: A gym.spaces object describing the action space of the
            environment this agent will interact with.

    Returns:
        pi: A symbol for sampling stochastic actions from a multinomial distribution

        logp: A symbol for computing log-likelihoods of actions from a multinomial distribution.

        logp_pi: A symbol for computing log-likelihoods of actions in pi from a multinomial distribution

    """
    
    ### SOLUTION ###
    act_dim = action_space.n
    logits = mlp(x, list(nn_sizes)+[act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    sample = tf.multinomial(logits,1)
    ### END ###
    pi = tf.squeeze(sample, axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)
    return pi, logp, logp_pi

### Task 3: Run the Segway Experiment!

Below is the starter code to run the CartPole experiment. We use an algorithm called Vanilla Policy Gradient (VPG) that takes in your categorical policy and trains it on the segway problem. All that is left to do is initialize some of the experiment parameters. We want our categorical policy to be a 3-layer neural network in which the first two layers consist of 64 units each. Given this information, fill in `nn_units` and `depth` accordingly. Remember that the last layer of the neural network is not defined here, but in the `mlp_categorical_policy` function that you wrote above.
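
Before running the experiment, it may help to see the core idea of a policy-gradient update in isolation. The sketch below is not the code inside `vpg.run`, whose internals are not shown in this notebook. It assumes the standard surrogate objective (the negative mean of log-probability times advantage) and, to stay short, differentiates with respect to the logits only rather than the network weights. The logits, actions, and advantages are made up.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax using the log-sum-exp trick."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def surrogate_loss(logits, actions, advantages):
    """Negative mean of log pi(a|s) * advantage (assumed objective)."""
    logp = log_softmax(logits)[np.arange(len(actions)), actions]
    return -np.mean(logp * advantages)


def surrogate_grad(logits, actions, advantages):
    """Analytic gradient of surrogate_loss with respect to the logits.

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = np.exp(log_softmax(logits))
    mask = np.eye(logits.shape[1])[actions]
    return -(advantages[:, None] * (mask - probs)) / len(actions)


# Made-up batch: three observations, two actions.
logits = np.array([[0.1, -0.3], [0.5, 0.2], [-0.4, 0.4]])
actions = np.array([0, 1, 1])
advantages = np.array([1.0, -0.5, 2.0])

grad = surrogate_grad(logits, actions, advantages)

# One gradient-descent step: actions with positive advantage become more
# likely, and actions with negative advantage become less likely.
new_logits = logits - 0.5 * grad
print(surrogate_loss(logits, actions, advantages))
print(surrogate_loss(new_logits, actions, advantages))
```

In a real implementation, an automatic differentiation library such as TensorFlow propagates this gradient further back into the weights of the MLP.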

Train for 100 epochs with 4000 steps per epoch. The model should take roughly 20-30 min. to train. As you train your RL model, you will be able to watch the CartPole being trained. Pay attention to the metric AverageEpRet: this is the average return (total reward per episode) that your model achieved in the epoch. If your code is implemented correctly, AverageEpRet should reach roughly 200 or higher in the last 10 epochs.
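
To make the metric concrete, here is a minimal sketch of how an average episode return could be computed. It assumes that AverageEpRet is the mean, over the episodes in an epoch, of the undiscounted sum of rewards, and it relies on CartPole giving a reward of +1 for every step the pole stays up. The episode lengths are made up, and `discounted_rewards_to_go` is a hypothetical helper that shows one common way a discount factor such as `gamma` is applied; the internals of `vpg.run` may differ.

```python
import numpy as np


def episode_return(rewards):
    """Undiscounted return: the plain sum of rewards in one episode."""
    return float(np.sum(rewards))


def discounted_rewards_to_go(rewards, gamma):
    """For each step t, sum gamma**(k - t) * rewards[k] over all k >= t."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Made-up epoch with three episodes that lasted 12, 45, and 200 steps.
episodes = [np.ones(n) for n in (12, 45, 200)]
average_ep_ret = np.mean([episode_return(r) for r in episodes])
print(average_ep_ret)

# Discounting weights near-term rewards more heavily than distant ones.
print(discounted_rewards_to_go(np.ones(3), gamma=0.99))
```

Under these assumptions, a longer-lived pole directly means a higher return, which is why AverageEpRet rising over the epochs indicates that the policy is improving.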

Answer the following questions:

1) What is the AverageEpRet of your model in the final 10 epochs? 
<br> <i> Answer: </i>


2) What do you notice about the CartPole as it is training? Describe its improvement over time.
<br> <i> Answer: </i>

In [4]:
env = 'CartPole-v1'

### SOLUTION ###
nn_units = 64
depth = 2
steps = 4000
epochs = 100
### END ###
gamma = 0.99
seed = 0
# parser.add_argument('--cpu', type=int, default=2)
exp_name = 'vpg'

# Reset the default graph to prevent errors on multiple runs of Vanilla Policy Gradient
tf.reset_default_graph()
logger_kwargs = setup_logger_kwargs(exp_name, seed)

vpg.run(lambda : gym.make(env), actor_critic=core.mlp_actor_critic,
    ac_kwargs=dict(policy=mlp_categorical_policy, nn_sizes=[nn_units]*depth), gamma=gamma, 
    seed=seed, steps_per_epoch=steps, epochs=epochs,
    logger_kwargs=logger_kwargs)

Logging data to /Users/seanlin/Desktop/MehtaKnights-189/vpg-problem/data/vpg/vpg_s0/progress.txt
Saving config:
{
    "ac_kwargs":	{
        "nn_sizes":	[
            64,
            64
        ],
        "policy":	"mlp_categorical_policy"
    },
    "actor_critic":	"mlp_actor_critic",
    "env_fn":	"<function <lambda> at 0x108e6d9d8>",
    "epochs":	100,
    "exp_name":	"vpg",
    "gamma":	0.99,
    "lam":	0.97,
    "logger":	{
        "<utils.logx.EpochLogger object at 0x1303de4e0>":	{
            "epoch_dict":	{},
            "exp_name":	"vpg",
            "first_row":	true,
            "log_current_row":	{},
            "log_headers":	[],
            "output_dir":	"/Users/seanlin/Desktop/MehtaKnights-189/vpg-problem/data/vpg/vpg_s0",
            "output_file":	{
                "<_io.TextIOWrapper name='/Users/seanlin/Desktop/MehtaKnights-189/vpg-problem/data/vpg/vpg_s0/progress.txt' mode='w' encoding='UTF-8'>":	{
                    "mode":	"w"
             






Number of parameters: pi: 4610, v: 4545

### You're done! Isn't it cool what reinforcement learning can do?