# Actor Critic Network

Both value and policy based methods have big drawbacks. That's why we use "hybrid method" Actor Critic, which has two networks:
- a Critic which measures how good the taken action is
- an Actor that controls how our agent behaves

The Policy Gradient method has a big problem because of Monte Carlo, which waits until the end of episode to calculate the reward. We may conclude that if we have a high reward $R(t)$, all actions that we took were good, even if some were really bad.

## Actor Critic

Instead of waiting until the end of the episode as we do in Monte Carlo REINFORCE, we make an update at each step (TD Learning).

Because we do an update at each time step, we can't use the total rewards $R(t)$. Instead, we need to train a Critic model that approximates the Q-value function. This value function replaces the reward function in policy gradient that calculates the rewards only at the end of the episode.

Because we have two models (Actor and Critic) that must be trained, it means that we have two set of weights ($\theta$ for our action and $w$ for our Critic) that must be optimized separately:
$$\Delta \theta = \alpha_1 \nabla_{\theta}(\log \pi_{\theta}(s, a)) q_{w}(s, a)$$
$$\Delta w = \alpha_2 \nabla_{w} L(R(s, a) + \lambda q_{w}(s_{t + 1}, a_{t + 1}), q_{w}(s_t, a_t))$$

## Advantage Actor Critic

Value-based methods have high variability. To reduce this problem we use advantage function instead of value function:
$$A(s, a) = Q(s, a) - V(s)$$
where $V(s)$ is average value of that state. This function will tell us the improvement compared to the average the action taken at that state is.

The problem of implementing this advantage function is that is requires two value functions  -  $Q(s,a)$ and $V(s)$. Fortunately, we can use the TD error as a good estimator of the advantage function:
$$A(s, a) = Q(s, a) - V(s) = r + \lambda V(s') - V(s)$$

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np
import random

import gym

In [2]:
RANDOM_SEED = 40

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tf.set_random_seed(RANDOM_SEED)

In [3]:
env = gym.make("CartPole-v0")

a_size = env.action_space.n
s_size = env.observation_space.shape[0]

print("Action space size: {}".format(a_size))
print("State space size: {}".format(s_size))

possible_actions = np.identity(a_size)

Action space size: 2
State space size: 4


In [4]:
class A2CNetwork(object):
    def __init__(self, s_size, a_size, learning_rate=0.01):
        self.a_size = a_size
        self.s_size = s_size
        
        self.states = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
        self.dense = tf.layers.dense(inputs=self.states, units=32, activation=tf.nn.relu)
        self.policy = tf.layers.dense(inputs=self.dense, units=self.a_size, activation=tf.nn.softmax)
        self.value = tf.layers.dense(inputs=self.dense, units=1)
        
        self.actions = tf.placeholder(shape=[None, a_size], dtype=tf.float32)
        self.target_values = tf.placeholder(shape=[None,], dtype=tf.float32)
        self.advantages = tf.placeholder(shape=[None,], dtype=tf.float32)
        
        # policy loss
        log_prob = tf.log(tf.clip_by_value(self.policy, 0.000001, 0.999999))
        neg_log_responsible_policy = -tf.reduce_sum(tf.multiply(log_prob, self.actions), reduction_indices=1)
        self.policy_loss = tf.reduce_mean(tf.multiply(neg_log_responsible_policy, self.advantages))
        
        # value loss
        self.value_loss = tf.reduce_mean(tf.square(self.target_values - self.value))
        
        #loss
        self.loss = 0.5 * self.value_loss + self.policy_loss
        
        trainer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        self.optimize = trainer.minimize(self.loss)

In [5]:
tf.reset_default_graph()

network = A2CNetwork(s_size, a_size)
init = tf.global_variables_initializer()

In [6]:
sess = tf.Session()
sess.run(init)

In [7]:
num_episodes = 300
min_batch_size = 32
discount_factor = 0.95

for episode in range(num_episodes):
    episode_states = []
    episode_rewards = []
    episode_actions = []
    episode_values = []
    r_total = 0
    
    s = env.reset()
    done = False
    
    while not done:
        pi, value = sess.run([network.policy, network.value], feed_dict={
            network.states: [s]
        })
        action = np.random.choice(a_size, p=pi[0])
        s1, r, done, _ = env.step(action)
        
        action_vec = possible_actions[action]
        
        episode_states.append(s)
        episode_rewards.append(r)
        episode_actions.append(action_vec)
        episode_values.append(value[0][0])
        r_total += r
        
        if done or len(episode_states) > min_batch_size:
            target_value = 0
            if not done: 
                target_value = sess.run(network.value, feed_dict={network.states: [s1]})[0]

            target_values = np.zeros_like(episode_rewards)
            for i in range(len(episode_states) - 1, -1, -1):
                target_value = episode_rewards[i] +  discount_factor * target_value
                target_values[i] = target_value

            advantages = target_values - np.array(episode_values)

            loss, _ = sess.run([network.loss, network.optimize], feed_dict={
                network.states: episode_states,
                network.advantages: advantages,
                network.actions: episode_actions,
                network.target_values: target_values
            })


            episode_states = []
            episode_rewards = []
            episode_actions = []
            episode_values = []
        
        if done and episode % 10 == 0:
            print("EPIDOSE {:0>5}: {}".format(episode, r_total))
        
        s = s1

EPIDOSE 00000: 33.0
EPIDOSE 00010: 21.0
EPIDOSE 00020: 15.0
EPIDOSE 00030: 17.0
EPIDOSE 00040: 45.0
EPIDOSE 00050: 19.0
EPIDOSE 00060: 26.0
EPIDOSE 00070: 53.0
EPIDOSE 00080: 40.0
EPIDOSE 00090: 37.0
EPIDOSE 00100: 127.0
EPIDOSE 00110: 37.0
EPIDOSE 00120: 88.0
EPIDOSE 00130: 88.0
EPIDOSE 00140: 45.0
EPIDOSE 00150: 32.0
EPIDOSE 00160: 106.0
EPIDOSE 00170: 200.0
EPIDOSE 00180: 154.0
EPIDOSE 00190: 108.0
EPIDOSE 00200: 150.0
EPIDOSE 00210: 19.0
EPIDOSE 00220: 101.0
EPIDOSE 00230: 180.0
EPIDOSE 00240: 27.0
EPIDOSE 00250: 33.0
EPIDOSE 00260: 96.0
EPIDOSE 00270: 200.0
EPIDOSE 00280: 81.0
EPIDOSE 00290: 26.0


In [None]:
s = env.reset()
# for i in range(3): env.step(0)
r_total = 0
done = False
while True:
    env.render()
    pi = sess.run(network.policy, feed_dict={network.states: [s]})
    a = np.random.choice(a_size, p=pi[0])
    s, r, done, _ = env.step(a)
    r_total += r
    #print(done)
#     if done == True:
#         print(r_total)
#         break

[33mWARN: You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.[0m


In [9]:
env.close()