# Implementing the PPO-clipped method

Let's implement the PPO-clipped method for the pendulum swing-up task. The code
used in this section is adapted from the PPO implementation (https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/tree/master/contents/12_Proximal_Policy_Optimization) by Morvan.

First, let's import the necessary libraries:

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import matplotlib.pyplot as plt
import gym

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


## Creating the gym environment

Let's create a pendulum environment using gym:

In [2]:
env = gym.make('Pendulum-v0').unwrapped

Get the state shape of the environment:

In [3]:
state_shape = env.observation_space.shape[0]

Get the action shape of the environment:

In [4]:
action_shape = env.action_space.shape[0]

Note that the pendulum environment has a continuous action space, so every action is a
real value within a fixed range. So, we get the lower and upper bounds of our action space:

In [5]:
action_bound = [env.action_space.low, env.action_space.high]

Set the epsilon value, which defines the clip range $[1-\epsilon, 1+\epsilon]$ used in the clipped objective:

In [6]:
epsilon = 0.2 
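
To see what this clip range does, here is a minimal NumPy sketch of the clipped surrogate objective, kept separate from the TensorFlow code in this notebook. The function name `clipped_surrogate` and the sample ratios and advantages are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A)."""
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)

    # unclipped term: probability ratio times advantage
    unclipped = ratio * advantage

    # clipped term: the ratio is restricted to [1 - epsilon, 1 + epsilon]
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage

    # take the element-wise minimum, then average over the batch
    return np.mean(np.minimum(unclipped, clipped))

# hypothetical ratios and advantages, chosen only for illustration
ratios = np.array([0.5, 1.0, 1.5])
advantages = np.array([1.0, 1.0, 1.0])

objective = clipped_surrogate(ratios, advantages)
print(objective)
```

With positive advantages, a ratio above `1 + epsilon` is capped, so the policy gains nothing from moving further in that direction. With negative advantages, the minimum keeps the more pessimistic of the two terms.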

## Defining the PPO class

Let's define a class called PPO in which we implement the PPO algorithm. For a clearer understanding, you can also check the detailed explanation of the code in the book.

In [7]:
class PPO(object):
    #first, let's define the init method
    def __init__(self):
        
        #start the TensorFlow session
        self.sess = tf.Session()
        
        #define the placeholder for the state
        self.state_ph = tf.placeholder(tf.float32, [None, state_shape], 'state')

        #now, let's build the value network which returns the value of a state
        with tf.variable_scope('value'):
            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu)
            self.v = tf.layers.dense(layer1, 1)
            
            #define the placeholder for the Q value
            self.Q = tf.placeholder(tf.float32, [None, 1], 'discounted_r')
            
            #define the advantage value as the difference between the Q value and state value
            self.advantage = self.Q - self.v

            #compute the loss of the value network
            self.value_loss = tf.reduce_mean(tf.square(self.advantage))
            
            #train the value network by minimizing the loss using Adam optimizer
            self.train_value_nw = tf.train.AdamOptimizer(0.002).minimize(self.value_loss)

        #now, we obtain the policy and its parameter from the policy network
        pi, pi_params = self.build_policy_network('pi', trainable=True)

        #obtain the old policy and its parameter from the policy network
        oldpi, oldpi_params = self.build_policy_network('oldpi', trainable=False)
        
        #sample an action from the new policy
        with tf.variable_scope('sample_action'):
            self.sample_op = tf.squeeze(pi.sample(1), axis=0)       

        #update the parameters of the old policy
        with tf.variable_scope('update_oldpi'):
            self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]

        #define the placeholder for the action
        self.action_ph = tf.placeholder(tf.float32, [None, action_shape], 'action')
        
        #define the placeholder for the advantage
        self.advantage_ph = tf.placeholder(tf.float32, [None, 1], 'advantage')

        #now, let's define our surrogate objective function of the policy network
        with tf.variable_scope('loss'):
            with tf.variable_scope('surrogate'):
                
                #first, let's define the ratio 
                ratio = pi.prob(self.action_ph) / oldpi.prob(self.action_ph)
    
                #define the objective by multiplying ratio and the advantage value
                objective = ratio * self.advantage_ph
                
                #define the objective function with the clipped and unclipped objective:
                L = tf.reduce_mean(tf.minimum(objective, 
                                   tf.clip_by_value(ratio, 1.-epsilon, 1.+ epsilon)*self.advantage_ph))
                
            
            #now, we can compute the gradient and maximize the objective function using gradient
            #ascent. However, instead of doing that, we can convert the above maximization objective
            #into the minimization objective by just adding a negative sign. So, we can denote the loss of
            #the policy network as:
            
            self.policy_loss = -L
    
        #train the policy network by minimizing the loss using Adam optimizer:
        with tf.variable_scope('train_policy'):
            self.train_policy_nw = tf.train.AdamOptimizer(0.001).minimize(self.policy_loss)
        
        #initialize all the TensorFlow variables
        self.sess.run(tf.global_variables_initializer())

    #now, let's define the train function
    def train(self, state, action, reward):
        
        #update the old policy
        self.sess.run(self.update_oldpi_op)
        
        #compute the advantage value
        adv = self.sess.run(self.advantage, {self.state_ph: state, self.Q: reward})
            
        #train the policy network for 10 iterations
        for _ in range(10):
            self.sess.run(self.train_policy_nw, {self.state_ph: state, self.action_ph: action, self.advantage_ph: adv})
        
        #train the value network for 10 iterations
        for _ in range(10):
            self.sess.run(self.train_value_nw, {self.state_ph: state, self.Q: reward})

    
    #we define a function called build_policy_network for building the policy network. Note
    #that our action space is continuous here, so our policy network returns the mean and
    #standard deviation of the action as an output and then we generate a normal distribution using
    #this mean and standard deviation and we select an action by sampling from this normal distribution

    def build_policy_network(self, name, trainable):
        with tf.variable_scope(name):
            
            #define the layer of the network
            layer1 = tf.layers.dense(self.state_ph, 100, tf.nn.relu, trainable=trainable)
            
            #compute mean
            mu = 2 * tf.layers.dense(layer1, action_shape, tf.nn.tanh, trainable=trainable)
            
            #compute standard deviation
            sigma = tf.layers.dense(layer1, action_shape, tf.nn.softplus, trainable=trainable)
            
            #compute the normal distribution
            norm_dist = tf.distributions.Normal(loc=mu, scale=sigma)
            
        #get the parameters of the policy network
        params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
        return norm_dist, params

    #let's define a function called select_action for selecting the action
    def select_action(self, state):
        state = state[np.newaxis, :]
        
        #sample an action from the normal distribution generated by the policy network
        action = self.sess.run(self.sample_op, {self.state_ph: state})[0]
        
        #we clip the action so that they lie within the action bound and then we return the action
        action =  np.clip(action, action_bound[0], action_bound[1])

        return action

    #we define a function called get_state_value to obtain the value of the state computed by the value network
    def get_state_value(self, state):
        if state.ndim < 2: state = state[np.newaxis, :]
        return self.sess.run(self.v, {self.state_ph: state})[0, 0]
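
The ratio used in the surrogate objective is the density of the taken action under the new policy divided by its density under the old policy. The following is a minimal NumPy sketch of that computation for a one-dimensional Gaussian policy. The helper `gaussian_pdf` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(action, mu, sigma):
    """Density of a univariate normal distribution evaluated at `action`."""
    z = (action - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# hypothetical policy outputs for a single state (not taken from the notebook)
action = 0.5
new_mu, new_sigma = 0.5, 1.0   # new policy: mean and standard deviation
old_mu, old_sigma = 0.0, 1.0   # old policy: mean and standard deviation

# probability ratio between the new and the old policy for this action
ratio = gaussian_pdf(action, new_mu, new_sigma) / gaussian_pdf(action, old_mu, old_sigma)
print(ratio)
```

A ratio above 1 means the new policy makes the action more likely than the old policy did. Many implementations compute this ratio as the exponential of a log-probability difference for numerical stability; the sketch keeps the direct division to mirror the class above.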


## Training the network

Now, let's start training the network. First, let's create an instance of our PPO class:

In [8]:
ppo = PPO()


Define the number of episodes:

In [9]:
num_episodes = 1000

Define the number of time steps in each episode:

In [10]:
num_timesteps = 200

Define the discount factor, $\gamma$:

In [11]:
gamma = 0.9

Set the batch size:

In [12]:
batch_size = 32
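
The training loop in this section updates the networks whenever `(t+1) % batch_size == 0` or when the final time step of the episode is reached. As a quick check of what that means for the settings above, this small sketch lists the time steps that trigger an update and the number of transitions in each batch. The names `update_steps` and `batch_lengths` are introduced here only for illustration.

```python
num_timesteps = 200
batch_size = 32

# time steps (0-indexed) at which the update condition is true
update_steps = [t for t in range(num_timesteps)
                if (t + 1) % batch_size == 0 or t == num_timesteps - 1]

# number of transitions collected since the previous update
batch_lengths = [b - a for a, b in zip([-1] + update_steps[:-1], update_steps)]

print(update_steps)
print(batch_lengths)
```

Because 200 is not a multiple of 32, the last update of each episode uses a shorter batch than the others.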

Now, let's train the network:

In [None]:
#for each episode
for i in range(num_episodes):
    
    #initialize the state by resetting the environment
    state = env.reset()
    
    #initialize the lists for holding the states, actions, and rewards obtained in the episode
    episode_states, episode_actions, episode_rewards = [], [], []
    
    #initialize the return
    Return = 0
    
    #for every step
    for t in range(num_timesteps):   
        
        #render the environment
        env.render()
        
        #select the action
        action = ppo.select_action(state)
        
        #perform the selected action
        next_state, reward, done, _ = env.step(action)
        
        #store the state, action, and reward in the list
        episode_states.append(state)
        episode_actions.append(action)
        episode_rewards.append((reward+8)/8)    
        
        #update the state to the next state
        state = next_state
        
        #update the return
        Return += reward
        
        #if we reached the batch size or if we reached the final step of the episode
        if (t+1) % batch_size == 0 or t == num_timesteps-1:
            
            #compute the value of the next state
            v_s_ = ppo.get_state_value(next_state)
            
            #compute Q value as sum of reward and discounted value of next state
            discounted_r = []
            for reward in episode_rewards[::-1]:
                v_s_ = reward + gamma * v_s_
                discounted_r.append(v_s_)
            discounted_r.reverse()
    
            #stack the episode states, actions, and rewards:
            es, ea, er = np.vstack(episode_states), np.vstack(episode_actions), np.array(discounted_r)[:, np.newaxis]
            
            #empty the lists
            episode_states, episode_actions, episode_rewards = [], [], []
            
            #train the network
            ppo.train(es, ea, er)
        
    #print the return for every 10 episodes
    if i % 10 == 0:
        print("Episode:{}, Return: {}".format(i, Return))

Episode:0, Return: -1478.0535012255646
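
The target values fed to the value network are computed backward from the value of the last next state, as in the inner loop over `episode_rewards[::-1]`. The following standalone sketch isolates that step. The function name `bootstrapped_returns` and the example numbers are assumptions made for illustration.

```python
def bootstrapped_returns(rewards, bootstrap_value, gamma=0.9):
    """Discounted returns for a batch, bootstrapped from the value of the state after the last step."""
    returns = []
    running = bootstrap_value

    # walk backward through the rewards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)

    # restore chronological order
    returns.reverse()
    return returns

# hypothetical rewards and bootstrap value, chosen so the arithmetic is easy to follow
example = bootstrapped_returns([1.0, 1.0, 1.0], bootstrap_value=0.0, gamma=0.5)
print(example)
```

In the training loop, `bootstrap_value` corresponds to `ppo.get_state_value(next_state)`, so rewards beyond the end of the batch are approximated by the value network.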


Now that we have learned how PPO with the clipped objective works and how to implement it, in the next section we will learn about another interesting type of PPO algorithm called PPO with
the penalized objective.