# DDPG - BipedalWalker-v2

- Xinyao Qian
- Tianhao Liu

- Get familiar with the BipedalWalker-v2 environment first

A walker that simply takes random actions performs very poorly, as the loop below shows.

In [None]:
import tensorflow as tf
import numpy as np
import gym

# Load Environment
ENV_NAME = 'BipedalWalker-v2'
env = gym.make(ENV_NAME)
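# Reproducible environment parameters
env.seed(1)

episode = 100
steps = 5000
for i in range(episode):
    s = env.reset()  # start every episode from a fresh state
    for j in range(steps):
        env.render()
        a = env.action_space.sample()  # random action
        s_, r, d, _ = env.step(a)
        s = s_

        if d:  # episode is over, move on to the next one
            break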

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


# Our solution


**Since the action space of BipedalWalker is continuous, value-based models such as Q-Learning or DQN are not directly applicable.** A value-based model fits a value function that tells us how good it is to be in a certain state s (V(s)) or to take action a in state s (Q(s,a)), and we still need to choose a specific action by comparing the values of all candidate actions under our exploration strategy (e.g. $\epsilon$-greedy). That comparison is not feasible when the actions are continuous and therefore uncountably many.
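
To make the limitation concrete, here is a minimal sketch of $\epsilon$-greedy selection over a finite action set. The function name `epsilon_greedy` and the sample Q-values are illustrative assumptions, not part of this notebook. The greedy branch has to enumerate every action, which has no direct counterpart when each action is a vector of continuous values.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a finite list of Q-values."""
    if rng.random() < epsilon:
        # explore: any action index with equal probability
        return rng.randrange(len(q_values))
    # exploit: compare the value of every action and take the best one
    return max(range(len(q_values)), key=lambda i: q_values[i])

q_values = [0.1, 0.7, 0.3]          # hypothetical Q(s, a) for three discrete actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0)
```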

So we next considered **policy-based models**, for example REINFORCE. However, REINFORCE can only update its parameters once an episode has ended, which slows down convergence.
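
The reason for the delay is that REINFORCE weights each step by the discounted return that follows it, and that return is only known once the episode is complete. A small sketch, assuming a hypothetical helper `discounted_returns` and a made-up three-step reward sequence:

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of a finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # walk backwards so each step can reuse the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

The backward pass cannot start until the last reward has been observed, so no parameter update is possible in the middle of an episode.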

We then found another family of models, **Actor-Critic, which combines the advantages of value-based and policy-based models and makes it possible for the policy to update itself at every step**.

Specifically, we train a policy network and a Q-value network simultaneously. The policy network acts as the actor: it takes in an observation and outputs the action to take. The value network acts as the critic: it takes in the observation together with the actor's action and tells the actor how good that action is. The actor updates its parameters according to this feedback, while the critic updates its own parameters the way Q-Learning does, from the temporal-difference error. **In a sense, the actor and the critic supervise each other and improve together**.

<center>
![](https://morvanzhou.github.io/static/results/ML-intro/AC3.png)
</center>


> https://morvanzhou.github.io/static/results/ML-intro/AC3.png
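
The two updates described above can be sketched on a toy problem. Everything here is an assumption for illustration: the actor is a single weight `w` with `a = w * s`, and the critic is fixed to `Q(s, a) = -(a - 2 s)^2` instead of being learned, so the best policy is `w = 2`. The networks defined later in this notebook are not used.

```python
import numpy as np

def td_target(reward, gamma, next_q):
    """One-step target the critic regresses towards: r + gamma * Q'(s', a')."""
    return reward + gamma * next_q

def actor_ascent_step(w, states, lr):
    """One gradient-ascent step for the toy linear actor a = w * s."""
    actions = w * states
    dq_da = -2.0 * (actions - 2.0 * states)   # critic's feedback: dQ/da
    da_dw = states                            # how the action changes with w
    return w + lr * np.mean(dq_da * da_dw)    # chain rule, then ascend on Q

states = np.array([0.5, 1.0, 1.5])
w = 0.0
for _ in range(200):
    w = actor_ascent_step(w, states, lr=0.1)
```

The actor never sees a reward directly. It only follows the gradient of the critic's estimate with respect to the action, which is the role `action_gradients` plays in the classes below.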


# Environment preparation & Definition of Classes: Actor, Critic, Memory

In [None]:
class Actor:
    def __init__(self, sess, action_dim, action_bound, learning_rate, t_replace_iter):
        self.sess = sess
        self.action_dim = action_dim
        self.action_bound = action_bound
        self.learning_rate = learning_rate
        self.t_replace_iter = t_replace_iter  # copy eval -> target every this many learn() calls
        self.t_replace_counter = 0

        # S and S_ are the global placeholders for the current and next state
        with tf.variable_scope('Actor'):
            # input current state, output action to be taken
            self.a = self.build_neural_network(S, scope='eval_nn', trainable=True)
            # target network: input next state, output next action (used in the critic's target)
            self.a_ = self.build_neural_network(S_, scope='target_nn', trainable=False)

        self.eval_parameters = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/eval_nn')
        self.target_parameters = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target_nn')
        # build the replacement ops once, so learn() does not add new ops to the graph
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(self.target_parameters, self.eval_parameters)]

    def act(self, s):
        s = s[np.newaxis, :]  # add a batch dimension
        return self.sess.run(self.a, feed_dict={S: s})[0]  # drop the batch dimension

    def learn(self, state):  # update parameters
        self.sess.run(self.train_op, feed_dict={S: state})
        if self.t_replace_counter % self.t_replace_iter == 0:
            self.sess.run(self.replace_target_op)
        self.t_replace_counter += 1

    def build_neural_network(self, s, scope, trainable):
        # the variable scope keeps the eval and target networks' variables apart
        with tf.variable_scope(scope):
            init_weights = tf.random_normal_initializer(0., 0.1)
            init_bias = tf.constant_initializer(0.01)
            # three dense layers
            nn = tf.layers.dense(s, 500, activation=tf.nn.relu,
                                 kernel_initializer=init_weights, bias_initializer=init_bias, name='l1', trainable=trainable)
            nn = tf.layers.dense(nn, 200, activation=tf.nn.relu,
                                 kernel_initializer=init_weights, bias_initializer=init_bias, name='l2', trainable=trainable)
            actions = tf.layers.dense(nn, self.action_dim, activation=tf.nn.tanh, kernel_initializer=init_weights,
                                      bias_initializer=init_bias, name='a', trainable=trainable)
            # tanh outputs lie in [-1, 1]; scale them to the environment's action range
            scaled_actions = tf.multiply(actions, self.action_bound, name='scaled_actions')

        return scaled_actions

    def add_gradient(self, a_gradients):
        # chain rule: dQ/da (from the critic) * da/dparams (from the actor)
        self.policy_gradients = tf.gradients(ys=self.a, xs=self.eval_parameters, grad_ys=a_gradients)
        opt = tf.train.RMSPropOptimizer(-self.learning_rate) # negative rate = gradient ascent
        self.train_op = opt.apply_gradients(zip(self.policy_gradients, self.eval_parameters), global_step=GLOBAL_STEP)

class Critic(object):
    def __init__(self, sess, state_dim, action_dim, learning_rate, discount_factor, t_replace_iter, action, next_action):
        self.sess = sess
        self.state_dimension = state_dim
        self.action_dimension = action_dim
        self.discount_factor = discount_factor
        self.learning_rate = learning_rate
        self.t_replace_iter = t_replace_iter  # copy eval -> target every this many learn() calls
        self.t_replace_counter = 0

        # S, S_ and R are the global placeholders for state, next state and reward
        with tf.variable_scope('Critic'):
            # Input (state, action) pair, output q values
            self.action = action
            self.q = self.build_nn(S, self.action, 'evaluation_nn', trainable=True)
            # target network scores the target actor's next action in the next state
            self.next_q = self.build_nn(S_, next_action, 'target_nn', trainable=False)

        self.evaluation_parameters = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/evaluation_nn')
        self.target_parameters = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target_nn')
        self.replace_target_op = [tf.assign(t, e) for t, e in zip(self.target_parameters, self.evaluation_parameters)]

        self.target_q = R + self.discount_factor * self.next_q

        # absolute temporal-difference error, also used as the replay priority
        self.td = tf.abs(self.target_q - self.q)
        self.weights = tf.placeholder(tf.float32, [None, 1], name='weights')

        self.loss = tf.reduce_mean(self.weights * tf.squared_difference(self.target_q, self.q))
        # train only the critic's evaluation network; passing global_step increments it on every update
        self.train_op = tf.train.AdamOptimizer(self.learning_rate).minimize(
            self.loss, var_list=self.evaluation_parameters, global_step=GLOBAL_STEP)
        self.action_gradients = tf.gradients(self.q, self.action)[0]  # dQ/da, handed to the actor

    def learn(self, state, action, reward, nextState, weights):
        _, td = self.sess.run([self.train_op, self.td], feed_dict={S: state, self.action: action, R: reward, S_: nextState, self.weights: weights})
        if self.t_replace_counter % self.t_replace_iter == 0:
            self.sess.run(self.replace_target_op)
        self.t_replace_counter += 1
        return td

    def build_nn(self, s, a, scope, trainable):
        # the variable scope keeps the evaluation and target networks' variables apart
        with tf.variable_scope(scope):
            init_weights = tf.random_normal_initializer(0., 0.01)
            init_bias = tf.constant_initializer(0.01)

            # first layer combines the state and the action
            w1_state = tf.get_variable('w1_state', [self.state_dimension, 500], initializer=init_weights, trainable=trainable)
            w1_action = tf.get_variable('w1_action', [self.action_dimension, 500], initializer=init_weights, trainable=trainable)
            b1 = tf.get_variable('b1', [1, 500], initializer=init_bias, trainable=trainable)
            nn = tf.nn.relu(tf.matmul(s, w1_state) + tf.matmul(a, w1_action) + b1)

            nn = tf.layers.dense(nn, 20, activation=tf.nn.relu, kernel_initializer=init_weights,
                                 bias_initializer=init_bias, name='l2', trainable=trainable)

            q = tf.layers.dense(nn, 1, kernel_initializer=init_weights, bias_initializer=init_bias,
                                name='q', trainable=trainable)   # Q(s,a)
        return q


# https://github.com/jaromiru/AI-blog/blob/master/SumTree.py
class SumTree:
    write = 0

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros( 2*capacity - 1 )         # internal nodes hold sums, leaves hold priorities
        self.data = np.zeros( capacity, dtype=object ) # stored transitions, one per leaf

    def _propagate(self, idx, change):
        parent = (idx - 1) // 2

        self.tree[parent] += change

        if parent != 0:
            self._propagate(parent, change)

    def _retrieve(self, idx, s):
        left = 2 * idx + 1
        right = left + 1

        if left >= len(self.tree):
            return idx

        if s <= self.tree[left]:
            return self._retrieve(left, s)
        else:
            return self._retrieve(right, s-self.tree[left])

    def total(self):
        return self.tree[0]

    def add(self, p, data):
        idx = self.write + self.capacity - 1

        self.data[self.write] = data
        self.update(idx, p)

        self.write += 1
        if self.write >= self.capacity:
            self.write = 0

    def update(self, idx, p):
        change = p - self.tree[idx]

        self.tree[idx] = p
        self._propagate(idx, change)

    def get(self, s):
        idx = self._retrieve(0, s)
        dataIdx = idx - self.capacity + 1

        return (idx, self.tree[idx], self.data[dataIdx])

    
# https://github.com/jaromiru/AI-blog/blob/master/Seaquest-DDQN-PER.py
import random  # needed by Memory.sample

class Memory:   # stored as ( s, a, r, s_ ) in SumTree
    e = 0.01
    a = 0.6

    def __init__(self, capacity):
        self.tree = SumTree(capacity)

    def _getPriority(self, error):
        return (error + self.e) ** self.a

    def add(self, error, sample):
        p = self._getPriority(error)
        self.tree.add(p, sample) 

    def sample(self, n):
        batch = []
        segment = self.tree.total() / n

        for i in range(n):
            a = segment * i
            b = segment * (i + 1)

            s = random.uniform(a, b)
            (idx, p, data) = self.tree.get(s)
            batch.append( (idx, data) )

        return batch

    def update(self, idx, error):
        p = self._getPriority(error)
        self.tree.update(idx, p)


WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Finished!
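
The `Memory` class above turns an error into a priority with `(error + e) ** a` and samples in proportion to it. Prioritized replay is often paired with importance-sampling weights that correct for this non-uniform sampling; the `Memory` class here does not compute them. The sketch below shows one common form under stated assumptions: the helper names and the exponent `beta=0.4` are illustrative choices, not values taken from this notebook.

```python
import numpy as np

def priorities_from_errors(td_errors, e=0.01, a=0.6):
    """Same rule as Memory._getPriority, applied to absolute errors."""
    return (np.abs(td_errors) + e) ** a

def importance_weights(priorities, beta=0.4):
    """Sampling probabilities P(i) = p_i / sum(p) and weights (N * P(i)) ** -beta,
    rescaled so that the largest weight equals 1."""
    probs = priorities / priorities.sum()
    weights = (len(priorities) * probs) ** (-beta)
    return probs, weights / weights.max()

errors = np.array([0.0, 0.5, 2.0])   # hypothetical TD errors of three transitions
probs, weights = importance_weights(priorities_from_errors(errors))
```

Transitions with a large error are sampled more often but receive a smaller weight in the loss, which is the purpose of the `weights` placeholder in `Critic`.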


# Main loop for training
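
The training cell below decays both learning rates with `tf.train.exponential_decay(..., staircase=True)`. As a rough sketch of what that schedule computes, the rate is multiplied by the decay factor once per completed interval of steps. The helper name `staircase_decay` is an illustrative assumption; the initial rate 0.05, the 10000-step interval and the 0.95 factor are the actor's settings from the cell.

```python
def staircase_decay(initial_lr, step, decay_steps, decay_rate):
    """Learning rate after `step` steps: initial_lr * decay_rate ** floor(step / decay_steps)."""
    return initial_lr * decay_rate ** (step // decay_steps)

# two full intervals have passed at step 25000, so the rate has been decayed twice
lr_at_25k = staircase_decay(0.05, step=25000, decay_steps=10000, decay_rate=0.95)
```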

In [None]:
import tensorflow as tf
import gym
import numpy as np
import shutil
import os

# reproducible results
np.random.seed(1)
tf.set_random_seed(1)

# Load Environment
ENV_NAME = 'BipedalWalker-v2'
env = gym.make(ENV_NAME)
# Reproducible environment parameters
env.seed(1)


STATE_DIMENSION = env.observation_space.shape[0] 
ACTION_DIMENSION = env.action_space.shape[0] 
ACTION_BOUND = env.action_space.high 

########################################  Hyperparameters  ########################################

# number of episodes to be trained
TRAIN_EPI_NUM=500
# Learning rate for actor and critic
ACTOR_LR=0.05
CRITIC_LR=0.05
R_DISCOUNT=0.9 # reward discount

MEMORY_CAPACITY=1000000

ACTOR_REP_ITE=1700 # copy the actor's evaluation network into its target network every this many learning steps
CRITIC_REP_ITE=1500 # same for the critic

BATCH=40 # size of batch used to learn

# Path used to store training result (parameters)
TRAIN_DATA_PATH='./train'


GLOBAL_STEP = tf.Variable(0, trainable=False) # record how many steps we have gone through
INCREASE_GLOBAL_STEP = GLOBAL_STEP.assign(tf.add(GLOBAL_STEP, 1))


# decay the learning rates automatically as training progresses to help convergence
ACTOR_LR = tf.train.exponential_decay(ACTOR_LR, GLOBAL_STEP, 10000, .95, staircase=True)
CRITIC_LR = tf.train.exponential_decay(CRITIC_LR, GLOBAL_STEP, 10000, .90, staircase=True)


END_POINT = (200 - 10) * (14/30)    # The end point of the game


##################################################
LOAD_MODEL = True # Whether to load trained model#
##################################################


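# Placeholders shared by the Actor and Critic graphs
S = tf.placeholder(tf.float32, shape=[None, STATE_DIMENSION], name='s')
R = tf.placeholder(tf.float32, shape=[None, 1], name='r')
S_ = tf.placeholder(tf.float32, shape=[None, STATE_DIMENSION], name='s_')

with tf.Session() as sess:

    # Create actor and critic.
    actor = Actor(sess, ACTION_DIMENSION, ACTION_BOUND, ACTOR_LR, ACTOR_REP_ITE)
    critic = Critic(sess, STATE_DIMENSION, ACTION_DIMENSION, CRITIC_LR, R_DISCOUNT, CRITIC_REP_ITE, actor.a, actor.a_)

    # Feed the critic's dQ/da into the actor's policy-gradient update
    actor.add_gradient(critic.action_gradients)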

    # Memory class implementation from: https://github.com/jaara/AI-blog/blob/master/Seaquest-DDQN-PER.py
    memory = Memory(MEMORY_CAPACITY)

    # saver is used to store or restore trained parameters
    saver = tf.train.Saver(max_to_keep=100)  # Maximum number of recent checkpoints to keep. Defaults to 5.


    ################################# Resume from a checkpoint or start a new training run #################################
    if LOAD_MODEL: # Returns CheckpointState proto from the "checkpoint" file.
        checkpoints = tf.train.get_checkpoint_state(TRAIN_DATA_PATH, 'checkpoint').all_model_checkpoint_paths
        saver.restore(sess, checkpoints[-1]) # reload trained parameters into the tf session
    else:
        if os.path.isdir(TRAIN_DATA_PATH): 
          shutil.rmtree(TRAIN_DATA_PATH) # recursively remove all files under directory
        os.mkdir(TRAIN_DATA_PATH)

        sess.run(tf.global_variables_initializer())

    explore_degree=0.1
    explore_degree_minimum=0.0001
    explore_decay_factor=0.99

    #################################  Main loop for training #################################
    for i_episode in range(TRAIN_EPI_NUM):

        state = env.reset()
        episode_reward = 0 # the episode reward

        while True:

            action = actor.act(state)

            action = np.clip(np.random.normal(action, explore_degree), -ACTION_BOUND, ACTION_BOUND)   # explore using randomness
            next_state, reward, done, _ = env.step(action)

            transition = np.hstack((state, action, [reward], next_state))
            # new transitions get the current maximum priority so they are likely to be replayed soon
            max_priority = np.max(memory.tree.tree[-memory.tree.capacity:])
            memory.tree.add(max_priority if max_priority > 0 else 1.0, transition)  # stored for later learning

            # a reward of -100 means the BipedalWalker has fallen to the ground
            episode_reward += reward


            # after a warm-up of MEMORY_CAPACITY/20 steps, start learning and decay the exploration noise
            if GLOBAL_STEP.eval(sess) > MEMORY_CAPACITY/20:
                explore_degree = max([explore_decay_factor*explore_degree, explore_degree_minimum])  # decay the action randomness
                batch = memory.sample(BATCH)    # list of (tree index, transition) pairs for the critic update
                tree_index = [idx for idx, _ in batch]
                b_memory = np.vstack([data for _, data in batch])
                # Memory.sample returns no importance-sampling weights, so weight every sample equally
                weights = np.ones((BATCH, 1), dtype=np.float32)

                b_state = b_memory[:, :STATE_DIMENSION]
                b_action = b_memory[:, STATE_DIMENSION: STATE_DIMENSION + ACTION_DIMENSION]
                b_reward = b_memory[:, -STATE_DIMENSION - 1: -STATE_DIMENSION]
                b_next_state = b_memory[:, -STATE_DIMENSION:]

                td = critic.learn(b_state, b_action, b_reward, b_next_state, weights)
                actor.learn(b_state)

                for i in range(len(tree_index)):  # update priority
                    memory.update(tree_index[i], td[i][0])


            # if GLOBAL_STEP.eval(sess) % SAVE_MODEL_ITER == 0:
            #     ckpt_path = os.path.join(TRAIN_DATA_PATH, 'DDPG.ckpt')
            #     save_path = saver.save(sess, ckpt_path, global_step=GLOBAL_STEP, write_meta_graph=False)
            #     print("\nSave Model %s\n" % save_path)

            if done:
                if "running_reward" not in globals():
                    running_reward = episode_reward
                else:
                    # exponential moving average of the episode rewards
                    running_reward = 0.95*running_reward + 0.05*episode_reward

                print('Episode:',i_episode,'running reward: ',running_reward,', episode reward: ',episode_reward)
                break # start new episode

            state = next_state
            sess.run(INCREASE_GLOBAL_STEP)




INFO:tensorflow:Restoring parameters from ./data/DDPG.ckpt-1200000
Episode: 0 | Achieve  | Running_r: 271 | Epi_r: 271.74 | Exploration: 0.000 | Pos: 88 | LR_A: 0.000000 | LR_C: 0.000000
Episode: 1 | Achieve  | Running_r: 271 | Epi_r: 269.24 | Exploration: 0.000 | Pos: 88 | LR_A: 0.000000 | LR_C: 0.000000
Episode: 2 | Achieve  | Running_r: 271 | Epi_r: 273.15 | Exploration: 0.000 | Pos: 88 | LR_A: 0.000000 | LR_C: 0.000000
Episode: 3 | Achieve  | Running_r: 271 | Epi_r: 271.24 | Exploration: 0.000 | Pos: 88 | LR_A: 0.000000 | LR_C: 0.000000
Episode: 4 | Achieve  | Running_r: 271 | Epi_r: 269.90 | Exploration: 0.000 | Pos: 88 | LR_A: 0.000000 | LR_C: 0.000000
Episode: 5 | Achieve  | Running_r: 271 | Epi_r: 268.49 | Exploration: 0.000 | Pos: 88 | LR_A: 0.000000 | LR_C: 0.000000
Episode: 6 | Achieve  | Running_r: 271 | Epi_r: 271.28 | Exploration: 0.000 | Pos: 88 | LR_A: 0.000000 | LR_C: 0.000000
Episode: 7 | Achieve  | Running_r: 271 | Epi_r: 269.52 | Exploration: 0.000 | Pos: 88 | LR_A: