# REINFORCEMENT LEARNING (RL) – POLICY GRADIENTS

- https://theneuralperspective.com/2016/11/25/reinforcement-learning-rl-policy-gradients-i/
- https://theneuralperspective.com/2016/11/26/1656/
- https://github.com/ashutoshkrjha/Cartpole-OpenAI-Tensorflow/blob/master/cartpole.py
- https://gist.github.com/shanest/535acf4c62ee2a71da498281c2dfc4f4


- https://github.com/dennybritz/reinforcement-learning

### OBJECTIVE

Our model has three main components: the state, the action and the reward. The state describes the environment; given a state, the policy produces an action, and the action leads to a reward. Actions can also alter the state, and the reward is often delayed rather than an immediate response to a given action.

```    
               -----------------
              |                 |
             \ /                |
              -                 |
            State-----------> Action-----------> Reward
              |                                    -
              |                                   / \
              |                                    |
              --------------------------------------
```

s  ---> state   
a  ---> action   
a* ---> correct action  
r  ---> reward   
$\pi$  ---> policy  
$\theta$  --->  policy weights   
R ---> total reward   
$\hat A$  ---> advantage estimate   
$\gamma$  ---> discount factor   

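In supervised learning, we choose the weights that maximize the log likelihood of the true label $y_n$ given the input $x_n$. Treating the state as the input and the correct action as the label gives the same objective:

$\max_\theta \sum_{n=1}^{N}\ \log P(y_n|x_n;\theta)$    
= $\max_\theta \sum_{n=1}^{N}\ \log P(a^*_n|s_n;\theta)$    
= $\min_\theta \big[- \sum_{n=1}^{N}\ \log P(a^*_n|s_n;\theta)\big]$ 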

Since we don’t have the correct actions to take, the best we can do is try some actions that may turn out good or bad, and then train our policy weights ($\theta$) so that we increase the chances of the good actions. One common approach is to collect a series of states, actions and the corresponding rewards (s0, a0, r0, s1, a1, r1, …) and from this calculate R, the total reward, as the sum of all the rewards r. This gives us the policy gradient:

$$
\frac{\partial J}{\partial \theta} = \frac{\partial \sum \log \pi (a|s;\theta)}{\partial \theta} R
$$

We will take a closer look at why this policy gradient is expressed as it is later in the section where we draw parallels with supervised learning.
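
As a rough illustration of the expression above, the sketch below computes the gradient for a linear-softmax policy in NumPy. The policy form, the helper names `softmax` and `policy_gradient`, and the three-step episode are assumptions made for the example; they do not come from the CartPole code later in this notebook.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def policy_gradient(theta, states, actions, total_reward):
    """Estimate dJ/dtheta = R * d/dtheta sum_t log pi(a_t | s_t; theta).

    Assumes a linear-softmax policy: pi(. | s) = softmax(s @ theta).
    """
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        probs = softmax(s @ theta)
        one_hot = np.zeros_like(probs)
        one_hot[a] = 1.0
        # Gradient of log pi(a | s) w.r.t. theta for a linear-softmax policy.
        grad += np.outer(s, one_hot - probs)
    return total_reward * grad

# Hypothetical episode: 3 steps, 2 state features, 2 actions.
theta = np.zeros((2, 2))
states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
actions = [0, 1, 0]
R = 3.0
grad = policy_gradient(theta, states, actions, R)
```

Every step's log-probability gradient is scaled by the same R, which is why a single episode reward pushes all of its actions in the same direction.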

When calculating the total reward R for an episode (a series of events), we will have both good and bad actions. According to our policy gradient, however, we update the weights to favor ALL the actions in a given episode if the total reward is positive, and the magnitude of the update depends on the magnitude of both R and the gradient. When we repeat this over enough episodes, the good actions are reinforced more often than the bad ones, and the policy becomes quite good at choosing actions that maximize the reward for a given state.

There are several additions we can make to our policy gradient, such as adding on an advantage estimator to determine which specific actions were good/bad instead of just using the total reward to judge an episode as good/bad.  
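
One very simple way to move in that direction, sketched here under the assumption that per-step discounted returns are already available, is to subtract their mean and divide by their standard deviation. This is only a crude baseline, not a learned advantage estimator, and `normalize_returns` is a hypothetical helper name.

```python
import numpy as np

def normalize_returns(returns, eps=1e-8):
    """Center and scale per-step returns so that above-average steps get
    a positive weight and below-average steps get a negative weight."""
    returns = np.asarray(returns, dtype=np.float64)
    # eps guards against division by zero when all returns are equal.
    return (returns - returns.mean()) / (returns.std() + eps)

# Hypothetical per-step discounted returns from one episode.
returns = [4.0, 3.0, 2.0, 1.0]
advantages = normalize_returns(returns)
```

With this weighting, early steps of the hypothetical episode are reinforced and late steps are discouraged, instead of the whole episode being judged by one number.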

So how does this supervised learning technique relate to reinforcement learning policy gradients? If you inspect the loss functions, you will see that they are exactly the same in principle. The negative log likelihood loss function above is the simplification of the multi-class cross entropy loss function below.

$$
J(\theta) = - \sum_i y_i \ln(\hat y_i)
$$

Example:   
Computed ($\hat y$): [0.3, 0.3, 0.4]  
Targets ($y$): [0, 0, 1]   

$$
J(\theta) = - [0 \cdot \ln(0.3) + 0 \cdot \ln(0.3) + 1 \cdot \ln(0.4)] = - \ln(0.4)
$$

With the multinomial cross entropy, you can see that we only keep the loss contribution from the correct class. With neural nets this is usually the case, since the targets are sparse (one-hot, just 1 true class). Therefore, we can rewrite our loss as just $\sum(-\log(\hat y))$, where $\hat y$ is the probability of the correct class. We replace $y_i$ (true y) with 1; the probabilities of the other classes don’t matter because their $y_i$ is 0. This is referred to as negative log likelihood. But for drawing the parallel between supervised learning and RL, let’s keep it in the explicit cross-entropy form.
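
The worked example and its reduction to negative log likelihood can be checked numerically. This is a small NumPy sketch using the same numbers as above; the variable names are chosen for the example.

```python
import numpy as np

y_hat = np.array([0.3, 0.3, 0.4])  # predicted class probabilities
y = np.array([0.0, 0.0, 1.0])      # one-hot target

# Full multi-class cross entropy: -sum_i y_i * ln(y_hat_i)
cross_entropy = -np.sum(y * np.log(y_hat))

# With a one-hot target only the correct class contributes,
# so the loss collapses to the negative log likelihood.
nll = -np.log(y_hat[np.argmax(y)])
```

Both expressions evaluate to $-\ln(0.4)$, roughly 0.916.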

Supervised: $J(\theta) = - \sum y_i \log (\hat y_i)$    
Reinforcement: $J(\theta) = - \sum r \log \pi (a|s;\theta)$


In supervised learning, we have a prediction ($\hat y$) and a true label ($y_i$). In a typical case, only one label in $y_i$ will be 1 (true), so only the log of the true class’s prediction enters the loss. The gradient, however, still depends on the predicted probabilities of all the classes. In RL, we take the log of the probability that our policy ($\pi$) assigns to the chosen action (a), and multiply that log probability by the reward for the action. 

Note: The action is an index into the outputs (e.g. if the chosen action is 2, we take the 0.9 from [0.2, 0.3, 0.9, 0.1] to put into the log). The reward can be of any magnitude and sign but, like the label in our supervised case, it helps determine the loss and adjusts the weights by influencing the gradient. If the reward is positive, the weights are altered via the gradient to favor the action that was taken in that state. If the reward is negative, the weights are altered to make that action less likely in that state. DO NOT draw parallels by saying the reward is like $y_i$ because, as you can see, that is not the case. 
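
To make the effect of the reward's sign concrete, here is a minimal sketch that takes one gradient-descent step on $-r \log \pi(a|s)$ for a softmax policy. The logits, the learning rate and the helper `step_on_loss` are hypothetical values and names picked for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def step_on_loss(logits, action, reward, lr=0.1):
    """One gradient-descent step on J = -reward * log pi(action)."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # dJ/dlogits for a softmax policy: reward * (probs - one_hot).
    grad = reward * (probs - one_hot)
    return logits - lr * grad

logits = np.array([0.0, 0.5, 1.0, -0.5])  # hypothetical policy outputs
action = 2
p_before = softmax(logits)[action]
p_after_pos = softmax(step_on_loss(logits, action, reward=+1.0))[action]
p_after_neg = softmax(step_on_loss(logits, action, reward=-1.0))[action]
```

A positive reward raises the probability of the chosen action and a negative reward lowers it, which matches the description in the note.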

### NUANCES

You may notice that we do an additional operation on our rewards before feeding them in for training: we discount them. The idea is that each reward is replaced by the sum of itself and all the rewards that follow it, since the action responsible for the current reward also influences the rewards of subsequent events. Each later reward is weighted by the discount factor raised to the number of steps since the current one, $\gamma^k$, so rewards further in the future count less. Each reward is recalculated by the following expression:

$$
r_t = \sum_{k=0}^{\infty} \gamma ^ k r_{t+k}
$$
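
A short worked example of discounting, as a sketch: each step's discounted reward is its own reward plus $\gamma$ times the discounted reward of the next step, so a single backward pass is enough. The function name `discounted_returns` and the toy reward list are assumptions for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return sum_k gamma**k * rewards[t + k] for every step t."""
    rewards = np.asarray(rewards, dtype=np.float64)
    out = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step reuses the value of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
# returns -> [1.75, 1.5, 1.0]
```

The first step sees $1 + 0.5 + 0.25 = 1.75$, while the last step only sees its own reward.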


https://gym.openai.com/docs

In [4]:
import gym
import time
env = gym.make('CartPole-v0')
env.reset()
total_reward = 0
for _ in range(10000):
    env.render()
    time.sleep(0.1)
    observation, reward, done, info = env.step(env.action_space.sample()) # take a random action
    total_reward += reward
    if done: break
env.close()

print('Total Reward is:', total_reward)

[2017-12-05 10:34:52,539] Making new env: CartPole-v0


Total Reward is: 19.0


In [5]:
#possible actions available
for i in range(20):
    print(env.action_space.sample(), end=' ')

0 1 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 

In [6]:
print(env.action_space)
#> Discrete(2), i.e. valid actions are either 0 or 1.
print(env.observation_space)
#> Box(4,)
print(env.observation_space.high)
#> [4.8, 3.4e+38, 0.419, 3.4e+38]: cart position, cart velocity, pole angle (rad), pole angular velocity
print(env.observation_space.low)
#> [-4.8, -3.4e+38, -0.419, -3.4e+38]

Discrete(2)
Box(4,)
[  4.80000000e+00   3.40282347e+38   4.18879020e-01   3.40282347e+38]
[ -4.80000000e+00  -3.40282347e+38  -4.18879020e-01  -3.40282347e+38]


In [7]:
# from gym import envs
# avaiable_envs = envs.registry.all()
# [print(each) for each in list(avaiable_envs)]

In [8]:
import gym
env = gym.make('CartPole-v0')

input_initial = env.reset()

print('input_initial: ', input_initial)
for i_episode in range(10):
    total_reward = 0
    observation = env.reset()
    for t in range(1000):
        env.render()
        time.sleep(0.01)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            
#             print(observation, reward, done, info )
            print("Episode finished after {} timesteps with score {}".format(t+1, total_reward))
            break
env.close()

[2017-12-05 10:35:20,341] Making new env: CartPole-v0


input_initial:  [ 0.0171071  -0.02363623  0.04319796 -0.01053885]
Episode finished after 14 timesteps with score 14.0
Episode finished after 28 timesteps with score 28.0
Episode finished after 16 timesteps with score 16.0
Episode finished after 13 timesteps with score 13.0
Episode finished after 13 timesteps with score 13.0
Episode finished after 28 timesteps with score 28.0
Episode finished after 13 timesteps with score 13.0
Episode finished after 21 timesteps with score 21.0
Episode finished after 24 timesteps with score 24.0
Episode finished after 15 timesteps with score 15.0


In [9]:
import sys
import numpy as np
import json
import os, inspect
import math
sys.path.append("../")
%load_ext autoreload
%autoreload 2
import logging
logger = logging.getLogger(__name__)

import pickle
import tensorflow as tf

import matplotlib.pyplot as plt
import math

import gym



In [27]:

env = gym.make('CartPole-v0')

env.reset()

# Hyperparameters
H_SIZE = 10  # Number of hidden layer neurons
batch_size = 5  # Update Params after every 5 episodes
ETA = 1e-2  # Learning Rate
GAMMA = 0.99  # Discount factor

INPUT_DIM = 4  # Input dimensions

# Initializing
tf.reset_default_graph()

In [28]:
# Network to define moving left or right
input = tf.placeholder(tf.float32, [None, INPUT_DIM], name="input_x")
W1 = tf.get_variable("W1", shape=[INPUT_DIM, H_SIZE],
                     initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(input, W1))
W2 = tf.get_variable("W2", shape=[H_SIZE, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1, W2)
probability = tf.nn.sigmoid(score)

# From here we define the parts of the network needed for learning a good policy.
tvars = tf.trainable_variables()
input_y = tf.placeholder(tf.float32, [None, 1], name="input_y")
advantages = tf.placeholder(tf.float32, name="reward_signal")

# The loss function. This sends the weights in the direction of making actions
# that gave good advantage (reward over time) more likely, and actions that didn't less likely.
loglik = tf.log(input_y * (input_y - probability) + (1 - input_y) * (input_y + probability))
loss = -tf.reduce_mean(loglik * advantages)
newGrads = tf.gradients(loss, tvars)

[2017-09-09 21:39:58,084] Making new env: CartPole-v0
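
The expression inside `tf.log` in the loss cell above is compact, so here is a plain-Python check of what it selects. It assumes, as the training loop further down does, that `input_y` is 1 when action 0 was taken and 0 when action 1 was taken, and that the network output is read as the probability of action 1. The helper name and the value of `p` are hypothetical.

```python
def selected_probability(y, p):
    """Mirror of the expression inside tf.log in the loss cell."""
    return y * (y - p) + (1 - y) * (y + p)

p = 0.8  # hypothetical network output, read as P(action = 1)
prob_when_y1 = selected_probability(1.0, p)  # action 0 was taken: 1 - p
prob_when_y0 = selected_probability(0.0, p)  # action 1 was taken: p
```

In both cases the expression evaluates to the probability of the action that was actually taken, so `loglik` is $\log \pi(a|s)$ for that action.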


In [None]:
adam = tf.train.AdamOptimizer(learning_rate=ETA)  # Adam optimizer
W1Grad = tf.placeholder(tf.float32, name="batch_grad1")  # Placeholders for final gradients once update happens
W2Grad = tf.placeholder(tf.float32, name="batch_grad2")
batchGrad = [W1Grad, W2Grad]
updateGrads = adam.apply_gradients(zip(batchGrad, tvars))


In [None]:
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(0, r.size)):
        running_add = running_add * GAMMA + r[t]
        discounted_r[t] = running_add
    return discounted_r


xs, hs, drs, ys = [], [], [], []  # Arrays to store parameters till an update happens
running_reward = None
reward_sum = 0
episode_number = 1
total_episodes = 10000
init = tf.global_variables_initializer()

In [None]:
# Training
with tf.Session() as sess:
    rendering = False
    sess.run(init)
    input_initial = env.reset()  # Initial state of the environment

    # Array to store gradients for each mini-batch step
    gradBuffer = sess.run(tvars)
    for ix, grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0

    while episode_number <= total_episodes:

        if reward_sum / batch_size > 100 or rendering == True:  # Render environment only after avg reward reaches 100
            env.render()
            rendering = True

        # Format the state for placeholder
        x = np.reshape(input_initial, [1, INPUT_DIM])

        # Run policy network
        tfprob = sess.run(probability, feed_dict={input: x})
        action = 1 if np.random.uniform() < tfprob else 0

        xs.append(x)  # Store x
        y = 1 if action == 0 else 0
        ys.append(y)

        # take action for the state
        input_initial, reward, done, info = env.step(action)
        reward_sum += reward

        drs.append(reward)  # store reward after action is taken

        if done:
            episode_number += 1
            # Stack the memory arrays to feed in session
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)

            xs, hs, drs, ys = [], [], [], []  # Reset Arrays

            # Compute the discounted reward
            discounted_epr = discount_rewards(epr)

            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)

            # Get and save the gradient
            tGrad = sess.run(newGrads, feed_dict={input: epx, input_y: epy, advantages: discounted_epr})
            for ix, grad in enumerate(tGrad):
                gradBuffer[ix] += grad

            # Update params after a mini-batch of episodes
            if episode_number % batch_size == 0:
                sess.run(updateGrads, feed_dict={W1Grad: gradBuffer[0], W2Grad: gradBuffer[1]})
                for ix, grad in enumerate(gradBuffer):
                    gradBuffer[ix] = grad * 0

                # Print details of the present model
                running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
                print('Average reward for episode %f.  Total average reward %f.' % (
                    reward_sum / batch_size, running_reward / batch_size))

                # Note: CartPole-v0 caps episodes at 200 steps, so this strict
                # threshold is effectively never reached.
                if reward_sum / batch_size > 200:
                    print("Task solved in", episode_number, 'episodes')
                    break

                reward_sum = 0

            input_initial = env.reset()

print(episode_number, 'Episodes completed.')

In [None]:
#https://github.com/ashutoshkrjha/Cartpole-OpenAI-Tensorflow


import numpy as np
import _pickle as pickle
import tensorflow as tf

import matplotlib.pyplot as plt
import math
from overrides import overrides


#TensorFlow
from dhira.tf.models.internal.base_tf_model import BaseTFModel

class PolicyGradient(BaseTFModel):
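    def __init__(self,
                 name='PolicyGradient',
                 run_id=0,
                 save_dir=None,
                 log_dir=None):
        # Zero-argument super() avoids the infinite recursion that
        # super(self.__class__, self) causes when this class is subclassed.
        super().__init__(name=name,
                         run_id=run_id,
                         save_dir=save_dir,
                         log_dir=log_dir)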

        # Hyperparameters
        self.H_SIZE = 10  # Number of hidden layer neurons
        self.ETA = 1e-2  # Learning Rate
        self.GAMMA = 0.99  # Discount factor

        self.INPUT_DIM = 4  # Input dimensions

    def _create_placeholders(self):
        # Network to define moving left or right
        self.observations = tf.placeholder(tf.float32, [None, self.INPUT_DIM], name="observations")
        self.actions = tf.placeholder(tf.float32, [None, 1], name="action")
        # self.reward = tf.placeholder(tf.float32, name="reward_signal")
        self.rewards = tf.placeholder(tf.float32, [None, 1], name="reward_signal")


    @overrides
    def _setup_graph_def(self):
        layer = tf.layers.dense(inputs=self.observations,
                                units=self.H_SIZE,
                                activation=tf.nn.relu,
                                kernel_initializer=tf.contrib.layers.xavier_initializer())
        self.pred_actions = tf.layers.dense(inputs=layer,
                                           units=1,
                                           activation=tf.nn.sigmoid,
                                           kernel_initializer=tf.contrib.layers.xavier_initializer())


        # The loss function. This sends the weights in the direction of making actions
        # that gave good advantage (reward over time) more likely, and actions that didn't less likely.
        # log(y * (y - y^) + (1 - y) * (y + y^))
        self._loglik = tf.log(self.actions * (self.actions - self.pred_actions) +
                        (1 - self.actions) * (self.actions + self.pred_actions))
        self._loss = -tf.reduce_mean(self._loglik * self.rewards)

        #  mean(log(y * log(y^) + (1 - y) * log(1 - y^))) * rewards
        # self._loss = - tf.reduce_mean((self.actions * tf.log(self.pred_action) +
        #                               (1 - self.actions) * (tf.log(1 - self.pred_action))) *
        #                               self.rewards,0)


        self._optimizer = tf.train.AdamOptimizer(learning_rate=self.ETA).minimize(self._loss)#, global_step=self.global_step)  # Adam optimizer

    @overrides
    def _get_eval_metric(self):
        return self._loss

    @overrides
    def _get_prediction(self):
        return self.pred_actions

    @overrides
    def _get_optimizer(self):
        return self._optimizer

    @overrides
    def _get_loss(self):
        return self._loss

    # def discount_rewards(self, r, GAMMA=0.99):
    #     """ take 1D float array of rewards and compute discounted reward """
    #     discounted_r = np.zeros_like(r)
    #     running_add = 0
    #     for t in reversed(range(0, r.size)):
    #         running_add = running_add * GAMMA + r[t]
    #         discounted_r[t] = running_add
    #     return discounted_r

    def discount_rewards(self, rewards, gamma):
        """
        Return discounted rewards weighted by gamma.
        Each reward is replaced with a weighted reward that
        involves itself and all the other rewards occurring after it.
        The later a reward occurs, the less effect it has on the
        current reward's discounted reward, since gamma < 1.

        [r0, r1, r2, ..., r_N] will look something like:
        [(r0 + r1*gamma^1 + ... + r_N*gamma^N), (r1 + r2*gamma^1 + ...), ...]
        """
        return np.array([sum([gamma ** t * r for t, r in enumerate(rewards[i:])])
                         for i in range(len(rewards))])

    @overrides
    def _get_train_feed_dict(self, batch, is_done):
        inputs, labels = batch
        observation, reward = inputs
        if is_done is True:
            # Compute the discounted reward once and stack it into a column
            # to match the [None, 1] shape of the rewards placeholder.
            discounted_epr = np.vstack(
                self.discount_rewards(reward, self.GAMMA))

            # Standardize the discounted rewards (zero mean, unit variance)
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)

            return {self.actions: labels[0], self.observations: observation[0], self.rewards: discounted_epr}

        else:
            return {self.observations: observation[0]}