# Actor Critic Network

Both value-based and policy-based methods have significant drawbacks. That is why we use a hybrid method, Actor Critic, which has two networks:
- a Critic which measures how good the taken action is
- an Actor that controls how our agent behaves

The Policy Gradient method (REINFORCE) has a big problem because it relies on Monte Carlo sampling, which waits until the end of the episode to calculate the return. If the total return $R(t)$ is high, we may conclude that all the actions we took were good, even if some of them were really bad. This makes the gradient estimates noisy and learning slow.
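
To make the credit assignment problem concrete, here is a minimal sketch. The reward values are hypothetical and chosen only for illustration; the point is that every action in the episode ends up weighted by the same total return.

```python
# Hypothetical 4-step episode: per-step rewards chosen for illustration only.
rewards = [1.0, 1.0, -5.0, 10.0]

# With Monte Carlo, the total return is only known once the episode has ended...
total_return = sum(rewards)

# ...and the log-probability gradient of every action is scaled by that same number.
weights = [total_return for _ in rewards]

print(total_return)  # 7.0
print(weights)       # [7.0, 7.0, 7.0, 7.0]
```

The third action earned a reward of -5, yet it is reinforced with the same positive weight as the others.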

## Actor Critic

Instead of waiting until the end of the episode as we do in Monte Carlo REINFORCE, we make an update at each step (TD Learning).

Because we do an update at each time step, we can't use the total return $R(t)$. Instead, we need to train a Critic model that approximates the Q-value function. This estimate replaces the total return in the policy gradient, which is only available at the end of the episode.

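Because we have two models (Actor and Critic) that must be trained, we have two sets of weights ($\theta$ for our Actor and $w$ for our Critic) that must be optimized separately:
$$\Delta \theta = \alpha_1 \nabla_{\theta}(\log \pi_{\theta}(s_t, a_t)) q_{w}(s_t, a_t)$$
$$\Delta w = -\alpha_2 \nabla_{w} L(R(s_t, a_t) + \lambda q_{w}(s_{t + 1}, a_{t + 1}), q_{w}(s_t, a_t))$$
where $L$ is a loss between the TD target and the Critic's current estimate (for example the squared error), and $\lambda$ is the discount factor. The Actor follows the gradient upwards, while the Critic descends the gradient of its loss.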
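
The two updates can be sketched in plain Python. This is only an illustration under several assumptions that are not in the text above: both models are tabular (one logit `theta[s][a]` and one estimate `q[s][a]` per state-action pair) instead of neural networks, the Critic loss is half the squared error with the TD target held fixed, and the function name and step sizes are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, q, s, a, r, s_next, a_next,
                      alpha_actor=0.1, alpha_critic=0.1, discount=0.9):
    """Apply one in-place update for the transition (s, a, r, s_next, a_next)."""
    pi = softmax(theta[s])
    q_sa = q[s][a]  # Critic's estimate before either update

    # Actor: for a tabular softmax, d log pi(a|s) / d theta[s][b] = 1[a == b] - pi[b].
    for b in range(len(pi)):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_actor * grad_log_pi * q_sa

    # Critic: gradient descent on 0.5 * (td_target - q[s][a]) ** 2.
    td_target = r + discount * q[s_next][a_next]
    q[s][a] += alpha_critic * (td_target - q_sa)
    return td_target - q_sa


# Two states, two actions, everything initialised to zero.
theta = [[0.0, 0.0], [0.0, 0.0]]
q = [[0.0, 0.0], [0.0, 0.0]]

# Replay the same hypothetical transition twice.
for _ in range(2):
    td_error = actor_critic_step(theta, q, s=0, a=1, r=1.0, s_next=1, a_next=0)

print(theta[0], q[0], td_error)
```

On the first pass the Critic still outputs 0, so the Actor does not move. Once the Critic has learned that the action is worth something, the second pass shifts the policy towards it.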

## Advantage Actor Critic

Using raw value estimates in the update leads to high variance. To reduce this problem we use the advantage function instead of the value function:
$$A(s, a) = Q(s, a) - V(s)$$
where $V(s)$ is the average value of that state. This function tells us how much better the action taken in that state is compared to the average.

The problem with implementing this advantage function is that it requires two value functions, $Q(s,a)$ and $V(s)$. Fortunately, we can use the TD error as a good estimator of the advantage function:
$$A(s, a) = Q(s, a) - V(s) \approx r + \lambda V(s') - V(s)$$
where $r$ is the reward received, $s'$ is the next state and $\lambda$ is the discount factor.
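
As a small worked example, the TD error can be computed from a single transition and two Critic outputs. The helper name `td_advantage` and the numbers below are hypothetical. The `done` flag is an added assumption: it treats the value of a terminal next state as 0.

```python
def td_advantage(reward, value_s, value_next, discount=0.95, done=False):
    """One-step TD error r + discount * V(s') - V(s), used as an advantage estimate."""
    bootstrap = 0.0 if done else discount * value_next
    return reward + bootstrap - value_s


# Hypothetical Critic outputs, for illustration only.
better_than_expected = td_advantage(reward=1.0, value_s=5.0, value_next=6.0)
worse_than_expected = td_advantage(reward=1.0, value_s=5.0, value_next=2.0)
terminal_step = td_advantage(reward=1.0, value_s=5.0, value_next=6.0, done=True)

print(better_than_expected, worse_than_expected, terminal_step)
```

A positive result means the action led somewhere better than the Critic expected from state `s`; a negative result means it led somewhere worse.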

In [10]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np
import random

import gym

In [4]:
RANDOM_SEED = 40

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
tf.set_random_seed(RANDOM_SEED)

In [5]:
env = gym.make("CartPole-v0")

a_size = env.action_space.n
s_size = env.observation_space.shape[0]

print("Action space size: {}".format(a_size))
print("State space size: {}".format(s_size))

possible_actions = np.identity(a_size)

Action space size: 2
State space size: 4


In [7]:
class A2CNetwork(object):
    def __init__(self, s_size, a_size, learning_rate=0.01):
        self.a_size = a_size
        self.s_size = s_size
        
        self.states = tf.placeholder(shape=[None, s_size], dtype=tf.float32)
        self.dense = tf.layers.dense(inputs=self.states, units=32, activation=tf.nn.relu)
        self.policy = tf.layers.dense(inputs=self.dense, units=self.a_size, activation=tf.nn.softmax)
        self.value = tf.layers.dense(inputs=self.dense, units=1)
        
        self.actions = tf.placeholder(shape=[None, a_size], dtype=tf.float32)
        self.target_values = tf.placeholder(shape=[None,], dtype=tf.float32)
        self.advantages = tf.placeholder(shape=[None,], dtype=tf.float32)
        
        # policy loss: -log pi(a|s) * advantage, averaged over the batch
        log_prob = tf.log(tf.clip_by_value(self.policy, 0.000001, 0.999999))
        neg_log_responsible_policy = -tf.reduce_sum(tf.multiply(log_prob, self.actions), axis=1)
        self.policy_loss = tf.reduce_mean(tf.multiply(neg_log_responsible_policy, self.advantages))
        
        # value loss: squeeze the [batch, 1] critic output to [batch] so it lines up
        # with target_values instead of broadcasting to [batch, batch]
        self.value_loss = tf.reduce_mean(tf.square(self.target_values - tf.squeeze(self.value, axis=1)))
        
        # total loss
        self.loss = self.value_loss + self.policy_loss
        
        trainer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        self.optimize = trainer.minimize(self.loss)

In [8]:
tf.reset_default_graph()

network = A2CNetwork(s_size, a_size)
init = tf.global_variables_initializer()

In [9]:
sess = tf.Session()
sess.run(init)
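
The training loop below builds its Critic targets by walking backwards through a batch of rewards, starting from a bootstrap value. The same computation is sketched here as a standalone function. The function name `bootstrapped_targets` and the example numbers are hypothetical, and plain lists are used instead of NumPy arrays to keep the sketch self-contained.

```python
def bootstrapped_targets(rewards, bootstrap_value, discount_factor=0.95):
    """Discounted targets for a batch of consecutive steps.

    bootstrap_value is the Critic's estimate for the state after the last
    step, or 0.0 if the episode ended there.
    """
    targets = [0.0] * len(rewards)
    running = bootstrap_value
    for i in reversed(range(len(rewards))):
        running = rewards[i] + discount_factor * running
        targets[i] = running
    return targets


# Three steps with reward 1 each, episode finished, discount 0.5 for easy arithmetic.
targets = bootstrapped_targets([1.0, 1.0, 1.0], bootstrap_value=0.0, discount_factor=0.5)
print(targets)  # [1.75, 1.5, 1.0]

# Advantages are the targets minus the Critic's value estimates for the same states.
values = [1.0, 1.0, 1.0]
advantages = [t - v for t, v in zip(targets, values)]
print(advantages)  # [0.75, 0.5, 0.0]
```

The targets train the Critic and the advantages weight the policy loss, so both are needed for each update.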

In [22]:
num_episodes = 300
min_batch_size = 32
discount_factor = 0.95

for episode in range(num_episodes):
    episode_states = []
    episode_rewards = []
    episode_actions = []
    episode_values = []
    r_total = 0
    
    s = env.reset()
    done = False
    
    while not done:
        pi, value = sess.run([network.policy, network.value], feed_dict={
            network.states: [s]
        })
        action = np.random.choice(a_size, p=pi[0])
        s1, r, done, _ = env.step(action)
        
        action_vec = possible_actions[action]
        
        episode_states.append(s)
        episode_rewards.append(r)
        episode_actions.append(action_vec)
        episode_values.append(value[0][0])
        r_total += r
        
        if done or len(episode_states) > min_batch_size:
            # Bootstrap from the critic's estimate of the next state,
            # or from 0 if the episode has terminated.
            target_value = 0.0
            if not done:
                target_value = sess.run(network.value, feed_dict={network.states: [s1]})[0][0]

            # Discounted targets, accumulated backwards through the batch.
            target_values = np.zeros(len(episode_rewards), dtype=np.float32)
            for i in range(len(episode_states) - 1, -1, -1):
                target_value = episode_rewards[i] + discount_factor * target_value
                target_values[i] = target_value

            advantages = target_values - np.array(episode_values)
            # The loss depends on all four placeholders, so target_values
            # must be fed as well.
            loss, _ = sess.run([network.loss, network.optimize], feed_dict={
                network.states: episode_states,
                network.advantages: advantages,
                network.actions: episode_actions,
                network.target_values: target_values
            })

            episode_states = []
            episode_rewards = []
            episode_actions = []
            episode_values = []

        if done and episode % 10 == 0:
            print("EPISODE {:0>5}: {}".format(episode, r_total))

        s = s1
