# OpenAI GYM Pendulum-v0 # 
---
This notebook describes the algorithm used for the swing-up and balancing of a pendulum using an actor-critic reinforcement learning agent. The algorithm is based on
[DDPG](https://arxiv.org/pdf/1509.02971.pdf "Continuous Control with Deep Reinforcement Learning (2015)"). It is applied to the [Pendulum-v0](https://gym.openai.com/envs/Pendulum-v0/) environment of [OpenAI Gym](https://github.com/openai/gym), a simple and effective toolkit to develop and prototype Reinforcement Learning algorithms. 

To install and manage all requirements Anaconda or Miniconda is recommended. Use the provided yaml file to install all dependencies or install them manually in a new environment called `tf-gym` using the following commands:
```
conda create -n tf-gym python=3
source activate tf-gym
conda install numpy matplotlib jupyter notebook
pip install --ignore-installed --upgrade tensorflow
pip install gym 
```

> **NOTE:** If you want to run tensorflow on the gpu, replace `tensorflow` with `tensorflow-gpu`. Don't forget to install `CUDA` and `cuDNN`. See the [TensorFlow](https://www.tensorflow.org/) website for more info.

This DDPG approach uses batch normalization on both hidden layers. Batch normalization keeps an internal moving mean and variance, and these statistics are not updated by the soft target network updates. Thoughts on the correctness of this approach are appreciated.
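
For reference, the soft target update in DDPG moves each target parameter a small fraction `tau` toward its learned counterpart. The sketch below uses plain Python lists instead of TensorFlow variables; `soft_update` and the parameter values are hypothetical and only illustrate the update rule.

```python
def soft_update(target, source, tau):
    """Return target parameters moved a fraction tau toward the source parameters."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]

# Hypothetical parameter vectors, for illustration only
source_params = [1.0, -2.0, 0.5]
target_params = [0.0, 0.0, 0.0]

# One soft update with the same tau as used in this notebook
target_params = soft_update(target_params, source_params, tau=0.001)
print(target_params)
```

Any variable that is left out of this blending, such as a moving statistic, keeps its own value in the target network.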

---
## General Imports ##

In [1]:
import tensorflow as tf
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline 

## Hyperparameters ##

Hyperparameters are user-set values that configure and tune the learning process of the agent. They cannot be learned by the program itself.

In [2]:
# Render the environment
RENDER = False

# Training epochs
EPOCHS = 50000

# Maximum number of steps per epoch
EPOCH_LENGTH = 1000   

# Number of experiences stored in the replay buffer
BUFFER_SIZE = 1000000

# Number of experiences per training batch
BATCH_SIZE = 64

# Reinforcement Learning Parameters
GAMMA = 0.99
TAU = 0.001
LEARN_RATE_ACTOR = 0.001
LEARN_RATE_CRITIC = 0.01
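
As a rough guide to two of the values above: a discount factor `GAMMA` corresponds to a rule-of-thumb horizon of about `1 / (1 - GAMMA)` steps, and a soft update factor `TAU` leaves a fraction `(1 - TAU) ** n` of the initial gap between target and learned weights after `n` updates. The helper `remaining_gap` below is hypothetical and only illustrates this arithmetic.

```python
GAMMA = 0.99
TAU = 0.001

# Rule-of-thumb horizon: rewards much further away than this contribute little
effective_horizon = 1.0 / (1.0 - GAMMA)

def remaining_gap(n_updates, tau=TAU):
    """Fraction of the initial target/learned weight gap left after n soft updates."""
    return (1.0 - tau) ** n_updates

print(round(effective_horizon))
print(round(remaining_gap(1000), 3))
```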

## Create OpenAI Gym environment ##

In [3]:
import gym

env = gym.make('Pendulum-v0')

# Read out the expected size of the state and input
num_states = np.prod(np.array(env.observation_space.shape)) 
num_actions = np.prod(np.array(env.action_space.shape))

## ReplayBuffer Class ##
Class to create and handle the buffer used to replay previous experiences. Sampling from this buffer removes temporal correlation from the state data and has been shown to improve the training performance of RL agents.

In [4]:
import random
from collections import deque

class ReplayBuffer:

    num_state = 3
    
    def __init__(self, buffer_size):
        " Initializes the replay buffer by creating a deque() and setting the size and buffer count. "
        self.buffer = deque()
        self.buffer_size = buffer_size
        self.count = 0
         
    def add(self, s, a, r, d, s2):
         
        """ Adds new experience to the ReplayBuffer(). If the buffer size is
        reached, the oldest item is removed.
         
        Inputs needed to create new experience:
            s      - State
            a      - Action
            r      - Reward
            d      - Done
            s2     - Resulting State     
        """
        # Create experience list
        experience = (s, a, r, d, s2)
        
        # Check the size of the buffer
        if self.count < self.buffer_size:
            self.count += 1
        else:
            self.buffer.popleft()
            
        # Add experience to buffer
        self.buffer.append(experience)
        
    def size(self):
        " Return the amount of stored experiences. " 
        return self.count
    
    def batch(self, batch_size):
        "Return a \"batch_size\" number of random samples from the buffer."
        
        if self.count < batch_size:
            batch = random.sample(self.buffer, self.count)
            batch_size = self.count
        else:
            batch = random.sample(self.buffer, batch_size)
            
        batch_state = np.array([item[0] for item in batch]).reshape([batch_size,self.num_state])
        batch_action = np.array([item[1] for item in batch]).reshape([batch_size, 1])
        batch_reward = np.array([item[2] for item in batch]).reshape([batch_size, 1])
        batch_done = np.array([item[3] for item in batch]).reshape([batch_size, 1])
        batch_next_state = np.array([item[4] for item in batch]).reshape([batch_size,self.num_state])
        
        return batch_state, batch_action, batch_reward, batch_done, batch_next_state 
            
    def clear(self):
        " Remove all entries from the ReplayBuffer. "
        self.buffer.clear()
        self.count = 0
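
The eviction rule used by `add` can also be expressed with the `maxlen` argument of `collections.deque`, which drops the oldest item automatically. The following is a minimal standalone sketch of that idea, not a replacement for the class above; the tuple layout mirrors `(s, a, r, d, s2)` and all values are made up.

```python
import random
from collections import deque

# A bounded deque discards the oldest entry once maxlen is reached
buffer = deque(maxlen=3)

for step in range(5):
    # Made-up experience tuple: (state, action, reward, done, next_state)
    buffer.append((step, 0.0, -1.0, False, step + 1))

# Only the three most recent experiences remain
stored_states = [experience[0] for experience in buffer]
print(stored_states)

# Uniform sampling without replacement; converting to a list avoids
# slow indexing into the middle of the deque
batch = random.sample(list(buffer), 2)
```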

## ActionNoise ##
To aid the exploration of the agent, noise is added to the predicted optimal action as calculated by the Actor network. This class manages this noise input and is based on Ornstein-Uhlenbeck noise. 

In [5]:
import random

class ActionNoise():
    def __init__(self, mu = np.zeros(1), sigma= 0.3, theta = 0.15, dt = 1e-2, x0 = None):
        self.theta = theta
        self.mu = mu
        self.sigma = sigma
        self.dt = dt
        self.x0 = x0
        self.reset()
        
    def __call__(self):
        x = (self.x_prev + self.theta*(self.mu - self.x_prev)*self.dt 
             + self.sigma*np.sqrt(self.dt)* np.random.normal(size=self.mu.shape))
        self.x_prev = x
        
        return x
    
    def reset(self):
        self.x_prev = self.x0 if self.x0 is not None else np.zeros_like(self.mu)
        
    def __repr__(self):
        return 'ActionNoise(mu={}, sigma={})'.format(self.mu, self.sigma)
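
The `__call__` method above is an Euler-Maruyama step of an Ornstein-Uhlenbeck process: `x_next = x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)`. The scalar sketch below uses only the standard library; `ou_step` is a hypothetical helper, and setting `sigma` to zero is done only to make the mean-reverting drift visible.

```python
import math
import random

def ou_step(x_prev, mu=0.0, sigma=0.3, theta=0.15, dt=1e-2, rng=random):
    """One Euler-Maruyama step of a scalar Ornstein-Uhlenbeck process."""
    drift = theta * (mu - x_prev) * dt
    diffusion = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x_prev + drift + diffusion

# With sigma = 0 the noise term vanishes and the state decays toward mu
x = 1.0
for _ in range(1000):
    x = ou_step(x, sigma=0.0)

print(x)  # close to exp(-theta * 1000 * dt) = exp(-1.5)
```

Because each sample depends on the previous one, consecutive noise values are correlated in time, which is the property the DDPG paper relies on for exploration in physical control tasks.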
        

## Neural Networks ##
Create the neural network classes for the actor and critic part of the reinforcement learning controller. 

### Actor Neural Network ###

In [16]:
class ActorNetwork:
    
    # Actor Network Parameters
    num_outputs = 1
    num_hidden_1 = 400
    num_hidden_2 = 300
    
    def __init__(self, session, num_states, action_range, learning_rate, tau, batch_size):
        " Initialize the actor and target network. "
        
        # Set session
        self.session = session
        
        # Set input and output parameters
        self.num_inputs = num_states
        self.output_min = action_range[0]
        self.output_max = action_range[1]
        
        # Set learning parameters
        self.learning_rate = learning_rate
        self.tau = tau
        self.batch_size = batch_size
        
        # Create actor and target networks
        self.actor_input, self.actor_is_training, self.actor_network = self.make('actor')
        self.target_input, self.target_is_training, self.target_network = self.make('target_actor', source_scope = 'actor')
        
        # Retrieve the collections with the variables of the actor and target networks
        self.actor_collection = tf.get_collection('actor')
        self.target_collection = tf.get_collection('target_actor')
        
        # Create update_target_network_op
        self.update_target_network_op = [t.assign(tf.scalar_mul(self.tau, a) + tf.scalar_mul(1 - self.tau, t)) 
                                        for a, t in zip(self.actor_collection, self.target_collection)]

        # Action gradient placeholder, provided by critic network
        self.action_gradient = tf.placeholder(tf.float32, [None, 1])
        
        # Actor gradients, scaled by 1/batch_size to average over the batch
        self.actor_gradients = [ (gradient/batch_size) 
                for gradient in tf.gradients(self.actor_network, self.actor_collection, -self.action_gradient) ]
        
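        # Create optimizer; only depend on the batch-norm update ops of the 'actor' scope,
        # since the target network's update ops would require the target input to be fed
        with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS, scope = 'actor')):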
            self.optimizer = tf.train.AdamOptimizer(self.learning_rate)\
                                    .apply_gradients(zip(self.actor_gradients, self.actor_collection)) 

        
        
    def make(self, scope, source_scope = None):
        " Create an actor network "
        
        # Create initializer
        init_counter = 0
        source = tf.get_collection(source_scope) if source_scope is not None else []
            
        def initializer():
            # Returns the initial value of the next source variable, or None without a source scope
            nonlocal init_counter
            
            if source_scope is None: 
                return None
            
            value = source[init_counter].initialized_value()
            init_counter += 1
                
            return value
        
        # Add variable scope to easily differentiate between actor and target networks
        with tf.variable_scope(scope): 
            
            # Define input placeholder
            input = tf.placeholder(tf.float32, [None,self.num_inputs])
            
            # Define with_default placeholder for the is_training boolean
            is_training = tf.placeholder_with_default(False, [], name = 'is_training')
            
            # Create TensorFlow network
            layer_1 = tf.layers.dense(input, self.num_hidden_1, name = 'dense_layer_1')
            layer_1 = tf.layers.batch_normalization(layer_1, training = is_training, name = 'batch_norm_1')
            layer_1 = tf.nn.relu(layer_1)
            
            layer_2 = tf.layers.dense(layer_1, self.num_hidden_2, name = 'dense_layer_2')
            layer_2 = tf.layers.batch_normalization(layer_2, training = is_training, name = 'batch_norm_2')
            layer_2 = tf.nn.relu(layer_2)
            
            network = tf.layers.dense(layer_2, self.num_outputs, name = 'output_layer')
            network = tf.nn.sigmoid(network)
            
            # Scale output to action range
            network = tf.add(tf.multiply(network, (self.output_max - self.output_min)), self.output_min)   
        
        # Add network variables to tf.collection for easy retrieval
        with tf.variable_scope(scope, reuse = True):
            
            # Layer 1: Dense variables + Batch Norm variables
            tf.add_to_collection(scope, tf.get_variable('dense_layer_1/kernel'))
            tf.add_to_collection(scope, tf.get_variable('dense_layer_1/bias'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_1/gamma'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_1/beta')) 
            
            # Layer 2: Dense variables + Batch Norm variables
            tf.add_to_collection(scope, tf.get_variable('dense_layer_2/kernel'))
            tf.add_to_collection(scope, tf.get_variable('dense_layer_2/bias'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_2/gamma'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_2/beta'))
            
            # Output layer: Dense variables
            tf.add_to_collection(scope, tf.get_variable('output_layer/kernel'))
            tf.add_to_collection(scope, tf.get_variable('output_layer/bias'))
            
            # Non-Trainable Batch Norm Variables
            tf.add_to_collection(scope, tf.get_variable('batch_norm_1/moving_mean'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_1/moving_variance'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_2/moving_mean'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_2/moving_variance'))
            
        return input, is_training, network  
       
        
    def train(self, inputs, gradient):
        self.session.run(self.optimizer, feed_dict = {self.actor_input: inputs, 
                                                      self.action_gradient: gradient, 
                                                      self.actor_is_training: True})
       
    
    def update_target_network(self):
        self.session.run(self.update_target_network_op)  
    
    
    def predict(self, state):
        return self.session.run(self.actor_network, feed_dict = {self.actor_input: state})
    
    
    def predict_target(self, state):
        return self.session.run(self.target_network, feed_dict = {self.target_input: state})
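
The output layer of `make` squashes the raw network output with a sigmoid and then maps the interval (0, 1) linearly onto `[output_min, output_max]`. A standalone sketch of that mapping follows; `scale_action` is a hypothetical helper, and the range of -2 to 2 is assumed here to stand in for the torque limits reported by the environment.

```python
import math

def scale_action(pre_activation, output_min, output_max):
    """Squash a raw network output with a sigmoid and map it to [output_min, output_max]."""
    squashed = 1.0 / (1.0 + math.exp(-pre_activation))
    return squashed * (output_max - output_min) + output_min

# A raw output of zero lands in the middle of the range
print(scale_action(0.0, -2.0, 2.0))

# Large raw outputs saturate near the bounds
print(scale_action(10.0, -2.0, 2.0))
print(scale_action(-10.0, -2.0, 2.0))
```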


### Critic Neural Network ###


In [17]:
class CriticNetwork:
    
    # Critic Network Parameters
    num_outputs = 1
    num_hidden_1 = 400
    num_hidden_2 = 300
    
    def __init__(self, session, num_states, learning_rate, tau, batch_size):
        
        # Set session
        self.session = session
        
        # Set input and output parameters
        self.num_inputs = num_states
        
        # Set learning parameters
        self.learning_rate = learning_rate
        self.tau = tau
        self.batch_size = batch_size
        
        # Create critic and target networks
        self.critic_input, self.action, self.critic_is_training, self.critic_network = self.make('critic')
        self.target_input, self.target_action, self.target_is_training, self.target_network = self.make('target_critic')
        
        # Retrieve the collections with the variables of the critic and target networks
        critic_collection = tf.get_collection('critic')
        target_collection = tf.get_collection('target_critic')
        
        # Create update_target_network_op
        self.update_target_network_op = [t.assign(tf.scalar_mul(self.tau, c) + tf.scalar_mul(1. - self.tau, t)) 
                                        for c, t in zip(critic_collection, target_collection)]
        
        # Target y_i values
        self.target_q_value = tf.placeholder(tf.float32, [None, 1])
        
        # Loss and optimization 
        self.loss = tf.reduce_mean(tf.square(self.target_q_value - self.critic_network))
        self.optimizer = tf.train.AdamOptimizer(self.learning_rate).minimize(self.loss)
        
        # Gradient with respect to actions
        self.action_gradients_op = tf.gradients(self.critic_network, self.action)
        

    def make(self, scope):
        
        # Add variable scope to easily differentiate between critic and target networks
        with tf.variable_scope(scope): 
            
            # Define input placeholders
            input = tf.placeholder(tf.float32, [None,self.num_inputs])
            action = tf.placeholder(tf.float32, [None, 1])
            is_training = tf.placeholder_with_default(False, [], name = 'is_training')

            layer_1 = tf.layers.dense(input, self.num_hidden_1, name = 'dense_layer_1')
            layer_1 = tf.layers.batch_normalization(layer_1, training = is_training, name = 'batch_norm_1')
            layer_1 = tf.nn.relu(layer_1)
            
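            # Feed the action into the second hidden layer so the Q-value depends on it
            layer_2 = tf.layers.dense(tf.concat([layer_1, action], axis = 1), self.num_hidden_2, name = 'dense_layer_2')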
            layer_2 = tf.layers.batch_normalization(layer_2, training = is_training, name = 'batch_norm_2')
            layer_2 = tf.nn.relu(layer_2)
            
            network = tf.layers.dense(layer_2, self.num_outputs, name = 'output_layer')
            
        # Add network variables to tf.collection for easy retrieval
        with tf.variable_scope(scope, reuse = True):
            
            # Layer 1: Dense variables + Batch Norm variables
            tf.add_to_collection(scope, tf.get_variable('dense_layer_1/kernel'))
            tf.add_to_collection(scope, tf.get_variable('dense_layer_1/bias'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_1/gamma'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_1/beta'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_1/moving_mean'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_1/moving_variance'))
            
            # Layer 2: Dense variables + Batch Norm variables
            tf.add_to_collection(scope, tf.get_variable('dense_layer_2/kernel'))
            tf.add_to_collection(scope, tf.get_variable('dense_layer_2/bias'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_2/gamma'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_2/beta'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_2/moving_mean'))
            tf.add_to_collection(scope, tf.get_variable('batch_norm_2/moving_variance'))
            
            # Output layer: Dense variables
            tf.add_to_collection(scope, tf.get_variable('output_layer/kernel'))
            tf.add_to_collection(scope, tf.get_variable('output_layer/bias'))
        
        return input, action, is_training, network
        

    def train(self, inputs, actions, target_q_values):
        self.session.run([self.critic_network, self.optimizer], feed_dict = {
            self.critic_input: inputs, 
            self.action: actions,
            self.target_q_value: target_q_values,
            self.critic_is_training: True})
       
    
    def update_target_network(self):
        self.session.run(self.update_target_network_op)  
    
    
    def predict(self, states, actions):
        return self.session.run(self.critic_network, feed_dict = {
            self.critic_input: states,
            self.action: actions})
    
    
    def predict_target(self, states, actions):
        return self.session.run(self.target_network, feed_dict = {
            self.target_input: states,
            self.target_action: actions})
    
    def action_gradients(self, states, actions):
        return self.session.run(self.action_gradients_op, feed_dict = {
            self.critic_input: states,
            self.action: actions})[0]
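
The action gradient returned by the critic is combined with the actor's own gradient through the chain rule, `dQ/dtheta = dQ/da * da/dtheta`, and the actor then takes an ascent step on Q. The scalar toy example below illustrates this with a made-up critic and a one-parameter linear policy; none of these functions exist in the notebook.

```python
def q_value(state, action):
    # Made-up critic: the best action for a given state is 2 * state
    return -(action - 2.0 * state) ** 2

def dq_da(state, action):
    # Analytic gradient of the made-up critic with respect to the action
    return -2.0 * (action - 2.0 * state)

weight = 0.0          # single parameter of a linear policy: action = weight * state
learning_rate = 0.1
states = [0.5, 1.0, -1.0]

for _ in range(200):
    # Chain rule: dQ/dweight = dQ/da * da/dweight, averaged over the batch
    gradient = sum(dq_da(s, weight * s) * s for s in states) / len(states)
    weight += learning_rate * gradient   # gradient ascent on Q

print(round(weight, 3))
print(q_value(1.0, weight * 1.0))
```

An optimizer that minimizes, such as Adam, achieves the same ascent when it is given the negated gradient.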
   

## Run Model ##

In [59]:
# Reset tf.Graph()
tf.reset_default_graph()

# Create action range
action_range = np.append(env.action_space.low, env.action_space.high)

""" Start TensorFlow session"""
with tf.Session() as session:
    
    # Create buffer
    replay_buffer = ReplayBuffer(BUFFER_SIZE)

    # Create networks
    actor = ActorNetwork(session, num_states, action_range, LEARN_RATE_ACTOR, TAU, BATCH_SIZE)
    critic = CriticNetwork(session, num_states, LEARN_RATE_CRITIC, TAU, BATCH_SIZE)
    
    # Initialize 
    session.run(tf.global_variables_initializer())
    
    # Epoch reward
    episode_reward = np.zeros(EPOCHS)
    
    # Action noise generator
    noise = ActionNoise()
    
    """ Epochs loop """
    for i in range(EPOCHS):
        
        # Reset the environment at the start of each epoch
        state = np.reshape(env.reset(), [num_states, 1])

        if i % 10 == 0: print('EPOCH: {}/{}'.format(i, EPOCHS))
        """ Steps in each epoch """
        for j in range(EPOCH_LENGTH):
            
            # Render the environment if specified
            if RENDER:
                env.render()
                
            """ Select action, execute and update buffer """
            # Determine action and critic values
            action = actor.predict(state.reshape([1, num_states])) + noise()
            
            # Run the environment
            next_state, reward, done, _info = env.step(action)

            # Add experience to the replay buffer
            replay_buffer.add(state, action, reward, done, next_state)

            # Update state & reward
            state = next_state
            episode_reward[i] += reward
            
            """ Start Training """
            if replay_buffer.size() >= BATCH_SIZE:
                # Replay batch
                batch_state, batch_action, batch_reward, batch_done, batch_next_state = replay_buffer.batch(BATCH_SIZE)

                # Predict the value of the next state under the target policy
                target_actions = actor.predict_target(batch_next_state)
                target_q = critic.predict_target(batch_next_state, target_actions)

                # Train critic
                y_i = batch_reward + GAMMA*target_q
                critic.train(batch_state, batch_action, y_i)

                # Update actor
                actions = actor.predict(batch_state)
                gradients = critic.action_gradients(batch_state, actions)
                actor.train(batch_state, gradients)

                # Update target networks
                actor.update_target_network()
                critic.update_target_network()     
                
     
env.close()
        

EPOCH: 0/50000


KeyboardInterrupt: 
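
In the DDPG paper linked at the top, the critic's regression target is `y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))`, where `Q'` and `mu'` are the target critic and target actor. The sketch below computes these targets with plain Python; `td_targets` and the two stand-in target functions are hypothetical and replace the TensorFlow networks only for illustration.

```python
GAMMA = 0.99

def td_targets(rewards, next_states, target_actor, target_critic, gamma=GAMMA):
    """Compute y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for a batch."""
    targets = []
    for reward, next_state in zip(rewards, next_states):
        next_action = target_actor(next_state)
        targets.append(reward + gamma * target_critic(next_state, next_action))
    return targets

# Stand-in target networks, made up for illustration
def target_actor(state):
    return -state

def target_critic(state, action):
    return 2.0 * state - action

print(td_targets([-1.0, -0.5], [0.2, 0.4], target_actor, target_critic))
```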

In [None]:
plt.plot(range(EPOCHS), episode_reward)

In [17]:
state = env.reset()

for i in range(10000):
    
    env.render()
    
    # Determine action and critic values
    action = actor.predict(state.reshape([1, num_states])) + noise()
            
    # Run the environment
    next_state, reward, done, _info = env.step(action)

RuntimeError: Attempted to use a closed Session.

## Save Network ##

In [12]:
# TODO: Create checkpoint
target_q.shape

(256, 1)

In [13]:
np.zeros(EPOCHS).shape

(1000,)

In [14]:
print('EPOCH: {}/{}'.format(0, 10))

EPOCH: 0/10


In [18]:
# Reset tf.Graph()
tf.reset_default_graph()

# Create action range
action_range = np.append(env.action_space.low, env.action_space.high)

""" Start TensorFlow session"""
with tf.Session() as session:
    
    # Create buffer
    replay_buffer = ReplayBuffer(BUFFER_SIZE)

    # Create networks
    actor = ActorNetwork(session, num_states, action_range, LEARN_RATE_ACTOR, TAU, BATCH_SIZE)

In [19]:
actor

<__main__.ActorNetwork at 0x1987fdd8>

In [20]:
tf.trainable_variables('actor')

[<tf.Variable 'actor/dense_layer_1/kernel:0' shape=(3, 400) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_1/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/gamma:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/beta:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_2/kernel:0' shape=(400, 300) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_2/bias:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/gamma:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/beta:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/output_layer/kernel:0' shape=(300, 1) dtype=float32_ref>,
 <tf.Variable 'actor/output_layer/bias:0' shape=(1,) dtype=float32_ref>]

In [21]:
tf.global_variables('actor')

[<tf.Variable 'actor/dense_layer_1/kernel:0' shape=(3, 400) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_1/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/gamma:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/beta:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/moving_mean:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/moving_variance:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_2/kernel:0' shape=(400, 300) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_2/bias:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/gamma:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/beta:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/moving_mean:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/moving_variance:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/output_layer/kernel:0' shap

In [22]:
tf.get_collection('actor')

[<tf.Variable 'actor/dense_layer_1/kernel:0' shape=(3, 400) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_1/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/gamma:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/beta:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/moving_mean:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_1/moving_variance:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_2/kernel:0' shape=(400, 300) dtype=float32_ref>,
 <tf.Variable 'actor/dense_layer_2/bias:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/gamma:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/beta:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/moving_mean:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/batch_norm_2/moving_variance:0' shape=(300,) dtype=float32_ref>,
 <tf.Variable 'actor/output_layer/kernel:0' shap

In [None]:


            # Build the network layer by layer
            layer_1 = tf.add(tf.matmul(input, weights['hidden_1']), biases['hidden_1'])
            layer_1 = tf.contrib.layers.batch_norm(layer_1, is_training = is_training)