# Cart pole balancing with policy gradient

Now, let's learn how to implement the policy gradient algorithm with reward-to-go for the
cart pole balancing task.

First, let's import the necessary libraries:

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import gym

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


Create the cart pole environment using gym:

In [2]:
env = gym.make('CartPole-v0')

Get the state shape:

In [3]:
state_shape = env.observation_space.shape[0]

Get the number of actions:

In [4]:
num_actions = env.action_space.n

## Computing discounted and normalized reward

Instead of using the raw rewards directly, we use the discounted reward-to-go at every step and normalize it to zero mean and unit standard deviation.

Define the discount factor, $\gamma$:

In [5]:
gamma = 0.95

Let's define a function called discount_and_normalize_rewards for computing the discounted and normalized rewards:

In [6]:
def discount_and_normalize_rewards(episode_rewards):
    
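    #initialize a float array for storing the discounted reward
    #(the explicit float dtype keeps the in-place normalization below valid even for integer rewards)
    discounted_rewards = np.zeros_like(episode_rewards, dtype=np.float64)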
    
    #compute the discounted reward
    reward_to_go = 0.0
    for i in reversed(range(len(episode_rewards))):
        reward_to_go = reward_to_go * gamma + episode_rewards[i]
        discounted_rewards[i] = reward_to_go
        
    #normalize and return the reward
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    
    return discounted_rewards
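
To see what this function produces, here is a small worked example in plain Python. It is only a sketch that mirrors the logic above without NumPy; the helper name `reward_to_go_normalized` and the three-step episode are hypothetical and chosen purely for illustration. Each entry is the discounted sum of the rewards from that step onward, and the result is then shifted to zero mean and scaled to unit standard deviation.

```python
import math

def reward_to_go_normalized(rewards, gamma=0.95):
    # Hypothetical helper mirroring discount_and_normalize_rewards without NumPy.
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates the discounted future rewards.
    for i in reversed(range(len(rewards))):
        running = running * gamma + rewards[i]
        discounted[i] = running
    # Normalize to zero mean and unit (population) standard deviation.
    mean = sum(discounted) / len(discounted)
    std = math.sqrt(sum((d - mean) ** 2 for d in discounted) / len(discounted))
    normalized = [(d - mean) / std for d in discounted]
    return discounted, normalized

# A made-up episode of three steps, each with reward 1.
raw, norm = reward_to_go_normalized([1.0, 1.0, 1.0])
print(raw)   # discounted reward-to-go, largest at the first step
print(norm)  # same values after normalization
```

Because every step here carries the same reward, earlier steps receive a larger reward-to-go than later ones, so after normalization the early actions get positive weights and the late ones get negative weights.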

## Building the policy network

First, let's define the placeholder for the state:

In [7]:
state_ph = tf.placeholder(tf.float32, [None, state_shape], name="state_ph")

Define the placeholder for the action:

In [8]:
action_ph = tf.placeholder(tf.int32, [None, num_actions], name="action_ph")

Define the placeholder for the discounted reward:

In [9]:
discounted_rewards_ph = tf.placeholder(tf.float32, [None,], name="discounted_rewards")

Define the first layer, a fully connected layer with 32 units and ReLU activation:

In [10]:
layer1 = tf.layers.dense(state_ph, units=32, activation=tf.nn.relu)


Define the second layer. Note that the number of units in this layer is set to the number of
actions, so that it outputs one value (logit) per action:

In [11]:
layer2 = tf.layers.dense(layer1, units=num_actions)

Obtain the probability distribution over the action space as an output of the network by
applying the softmax function to the result of layer 2:

In [12]:
prob_dist = tf.nn.softmax(layer2)
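
As a rough illustration of what the softmax does to the outputs of layer 2, the following plain-Python sketch converts two made-up logits into probabilities and samples an action from them. The `softmax` helper and the logit values are assumptions for illustration and are not part of the network above.

```python
import math
import random

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5]        # made-up outputs for the two cart pole actions
probs = softmax(logits)    # non-negative values that sum to 1

# Sample an action index according to these probabilities.
action = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, action)
```

The action with the larger logit is chosen more often, but the other action still has a nonzero probability, which is what makes the policy stochastic.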

We learned that we compute the gradient as:

$$\nabla_{\theta} J(\theta) = \frac{1}{N} \sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}  \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)R_t\right] $$
    
After computing the gradient, we update the parameters of the network using gradient
ascent:

$$\theta = \theta + \alpha \nabla_{\theta} J(\theta) $$

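However, it is a standard convention to perform minimization rather than maximization.
So, we can convert the above maximization objective into a minimization objective by just
adding a negative sign.

Because action_ph holds the one-hot encoded actions, the softmax cross-entropy between the
output of layer 2 and action_ph equals $-\log \pi_{\theta}(a_t | s_t)$ for the action that was
taken. Thus, we can define the negative log policy as: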


In [13]:
neg_log_policy = tf.nn.softmax_cross_entropy_with_logits_v2(logits = layer2, labels = action_ph)
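
The sketch below checks, in plain Python and with made-up numbers, that the cross-entropy between a one-hot action label and the softmax of the logits is the same as the negative log probability of the chosen action. The helper names `log_softmax` and `cross_entropy` are hypothetical; this is only a numerical sanity check, not a description of how TensorFlow implements the op.

```python
import math

def log_softmax(logits):
    # log(softmax(z)), computed stably via the log-sum-exp trick.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy(logits, one_hot):
    # -sum_k label_k * log(softmax(logits)_k)
    return -sum(y * lp for y, lp in zip(one_hot, log_softmax(logits)))

logits = [2.0, 0.5]     # made-up logits for two actions
one_hot = [0.0, 1.0]    # the agent took action 1

ce = cross_entropy(logits, one_hot)
neg_log_prob = -log_softmax(logits)[1]
print(ce, neg_log_prob)  # the two values agree
```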

Now, let's define the loss:

In [14]:
loss = tf.reduce_mean(neg_log_policy * discounted_rewards_ph) 
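
Written out for a single episode of length $T$, and assuming $\hat{R}_t$ denotes the discounted and normalized reward-to-go fed through discounted_rewards_ph (the hat notation is introduced here only for illustration), this loss corresponds to:

$$L(\theta) = -\frac{1}{T} \sum_{t=0}^{T-1} \log \pi_{\theta}\left(a_{t} | s_{t}\right) \hat{R}_t $$

Its gradient is the negative of a single-episode estimate of $\nabla_{\theta} J(\theta)$, with $R_t$ replaced by $\hat{R}_t$ and scaled by $1/T$, so minimizing it plays the role of the gradient ascent update.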

Define the train operation for minimizing the loss using the Adam optimizer with a learning rate of 0.01:

In [15]:
train = tf.train.AdamOptimizer(0.01).minimize(loss)

## Training the network

Now, let's train the network for several iterations. For simplicity, let's just generate one
episode on every iteration.

Set the number of iterations:

In [16]:
num_iterations = 1000

In [None]:
#start the TensorFlow session
with tf.Session() as sess:
    
    #initialize all the TensorFlow variables
    sess.run(tf.global_variables_initializer())
    
    #for every iteration
    for i in range(num_iterations):
        
        #initialize an empty list for storing the states, actions, and rewards obtained in the episode
        episode_states, episode_actions, episode_rewards = [],[],[]
    
        #set the done to False
        done = False
        
        #initialize the state by resetting the environment
        state = env.reset()

        #initialize the return
        Return = 0

        #while the episode is not over
        while not done:
            
            #reshape the state into a batch of size 1
            state = state.reshape([1, state_shape])
            
            #feed the state to the policy network and the network returns the probability distribution
            #over the action space as output which becomes our stochastic policy 
            pi = sess.run(prob_dist, feed_dict={state_ph: state})
            
            #now, we select an action using this stochastic policy
            a = np.random.choice(range(pi.shape[1]), p=pi.ravel()) 
            
            #perform the selected action
            next_state, reward, done, info = env.step(a)
            
            #render the environment
            env.render()
            
            #update the return
            Return += reward
            
            #one-hot encode the action
            action = np.zeros(num_actions)
            action[a] = 1
            
            #store the state, action, and reward into their respective list
            episode_states.append(state)
            episode_actions.append(action)
            episode_rewards.append(reward)

            #update the state to the next state
            state = next_state


        #compute the discounted and normalized reward
        discounted_rewards= discount_and_normalize_rewards(episode_rewards)
        
        #define the feed dictionary
        feed_dict = {state_ph: np.vstack(np.array(episode_states)),
                     action_ph: np.vstack(np.array(episode_actions)), 
                     discounted_rewards_ph: discounted_rewards 
                    }
                    
        #train the network
        loss_, _ = sess.run([loss, train], feed_dict=feed_dict)

        #print the return every 10 iterations
        if i%10==0:
            print("Iteration:{}, Return: {}".format(i,Return))  


Iteration:0, Return: 16.0


Now that we have learned how to implement the policy gradient algorithm with reward-to-go,
in the next section we will learn another interesting variance reduction technique
called policy gradient with baseline.