# Actor Critic Methods

### Issues with REINFORCE

In the previous notebook, we learned one of the policy gradient algorithms - REINFORCE. Even though it is a relatively easy and successful method, it has some drawbacks:

- Because trajectories are sampled randomly, the action probabilities and reward values vary widely from sample to sample, which translates into fluctuating gradients.
- REINFORCE fails to learn from trajectories that have a cumulative reward of 0, because they contribute nothing to the gradient. This makes it a slower algorithm.

### Reducing variance

One of the ways of reducing such variance is to subtract a baseline from the cumulative reward. Smaller rewards will translate into smaller gradients which will make training smoother.

Let's look at the following example. Suppose that for three sampled trajectories $\nabla_{\theta}\log\pi_{\theta}(\tau) = [0.5, 0.2, 0.3]$ and $R(\tau) = [1000, 1001, 1002]$.

In this case, the sample variance of the products $\nabla_{\theta}\log\pi_{\theta}(\tau)\,R(\tau)$ is **23286.8**.

If we decrease the rewards by a constant value of 1001, the variance drops to **0.1633**.
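The two numbers can be reproduced with a few lines of NumPy. This is a minimal sketch that assumes the quoted figures are sample variances (`ddof=1`) of the element-wise products of gradient term and reward; the variable names are illustrative only.

```python
import numpy as np

grad_log_prob = np.array([0.5, 0.2, 0.3])
rewards = np.array([1000.0, 1001.0, 1002.0])

# Sample variance (ddof=1) of the per-trajectory gradient terms.
var_raw = float(np.var(grad_log_prob * rewards, ddof=1))

# Same computation after subtracting a constant baseline from the rewards.
baseline = 1001.0
var_baseline = float(np.var(grad_log_prob * (rewards - baseline), ddof=1))

print(round(var_raw, 1))       # 23286.8
print(round(var_baseline, 4))  # 0.1633
```

The gradient terms themselves are unchanged; only the scale of the rewards they are multiplied by has shrunk.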

This becomes the basis of Actor Critic methods: from the mathematical perspective, they are (*kind of*) the same REINFORCE algorithm with a baseline function subtracted from the cumulative reward.
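One way to see why subtracting a baseline is legitimate, sketched here under the assumption that the baseline $b$ is a constant that does not depend on the sampled trajectory: the extra term that the baseline adds to the gradient has zero expectation, so the gradient estimate keeps the same mean while its variance can shrink.

$$
\mathbb{E}_{\tau \sim \pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(\tau)\, b\right]
= b \int \pi_{\theta}(\tau)\, \nabla_{\theta}\log\pi_{\theta}(\tau)\, d\tau
= b \int \nabla_{\theta}\pi_{\theta}(\tau)\, d\tau
= b\, \nabla_{\theta} \int \pi_{\theta}(\tau)\, d\tau
= b\, \nabla_{\theta} 1 = 0
$$

The same argument is commonly extended to baselines that depend on the state but not on the action.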

### Actor Critic methods

As already mentioned, Actor Critic methods follow the same update rule as REINFORCE - the only difference is a baseline function that is subtracted from the cumulative reward.

From the structural standpoint, Actor Critic methods have two parts:
 - The Critic, which estimates the value function (the baseline that we subtract from the cumulative reward)
 - The Actor, which updates the policy
 
The following picture demonstrates some of the existing Actor Critic methods. For this tutorial, however, we will analyze the Advantage Actor Critic algorithm (A2C for short).

![actor-critic](https://miro.medium.com/max/2000/1*T1zTYVLkMNngE09fOqTkSA.png)

### Advantage Actor Critic (A2C)

As we can see from the figure above, the Advantage Actor-Critic algorithm includes the advantage function $A(s_t, a_t)$. In simple terms, the advantage tells us how much better it is to take a specific action $a_t$ in state $s_t$ than to take the average action the policy would choose there. Mathematically, this can be written as:
$$
A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
$$

Here we are introduced to two new functions which might not make sense at the moment.

- $Q(s_t, a_t)$ is called a **Q-value** and represents the expected discounted future return. In other words, it shows how much return can be expected if we take action $a_t$ in state $s_t$ and then continue our trajectory until the end $T$.

- $V(s_t)$ is the **state value**. It shows how much return can be expected from the current state $s_t$ until the final step $T$, before any particular action has been chosen.
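Written out, the two quantities can be sketched as follows. This assumes a discount factor $\gamma \in [0, 1]$ and per-step rewards $r_t$; neither symbol is introduced above, and the indexing is one common convention rather than the only one.

$$
Q(s_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k} \;\middle|\; s_t, a_t\right],
\qquad
V(s_t) = \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)}\left[Q(s_t, a_t)\right]
$$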

If both of these definitions sound similar to you, you are not wrong - the only difference lies in the first step. The $Q$ value describes the expected return **after** a specific action is taken, while $V$ describes the expected return of just being in the state.

Consequently, the difference between these functions lets us see the advantage of a certain action.
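A small numeric illustration may help. The numbers below are made up for a single state with two actions, and the sketch relies on the standard relation that $V$ is the policy-weighted average of the $Q$ values.

```python
import numpy as np

# Hypothetical numbers for one state with two possible actions.
policy = np.array([0.7, 0.3])      # pi(a | s): probability of each action
q_values = np.array([10.0, 4.0])   # Q(s, a): expected return after each action

# V(s) is the average of Q(s, a) weighted by the policy: 0.7*10 + 0.3*4 = 8.2
v = float(np.dot(policy, q_values))

# A(s, a) = Q(s, a) - V(s)
advantage = q_values - v

print(v)          # 8.2
print(advantage)  # approximately [ 1.8 -4.2]
```

The first action is better than average (positive advantage) and the second is worse (negative advantage). Weighted by the policy, the advantages cancel out to zero.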

### Implementation

Similar to the REINFORCE algorithm, we are not going to provide an implementation straight away - you can find it in the Challenge solution notebook.

On the other hand, it is useful to discuss the logic behind the A2C model.

As it has been mentioned earlier, the A2C model consists of two parts:

- Actor - takes a state and outputs a probability distribution over the actions
- Critic - takes a state and outputs a single estimate of the expected return from that state, averaged over the range of actions the policy might take

Let's say we want to build a game agent in which the state is described by four numbers and there are only two possible actions at the given state. The following shared neural network could be used for both the actor and the critic parts.

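```python
from tensorflow import keras
from tensorflow.keras import layers


def model():
    # The state is a vector of 4 numbers.
    inputs = layers.Input(shape=(4,))
    # Hidden layer shared by the actor and the critic.
    common = layers.Dense(128, activation="relu")(inputs)
    # Actor head: probabilities of the 2 possible actions.
    action = layers.Dense(2, activation="softmax")(common)
    # Critic head: a single scalar value estimate for the state.
    critic = layers.Dense(1)(common)

    return keras.Model(inputs=inputs, outputs=[action, critic])
```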

Such a model takes the state as input and outputs both the probability distribution over the two actions and a single estimate of the expected return from that state.
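To make the two outputs concrete without depending on Keras, here is a NumPy sketch of the same forward pass. It assumes random, untrained weights with the same layer shapes as the network above; the names `softmax`, `forward`, and the `W_*`/`b_*` arrays are illustrative and not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Random (untrained) weights: 4 -> 128 shared, then 128 -> 2 and 128 -> 1.
W_common, b_common = rng.normal(scale=0.1, size=(4, 128)), np.zeros(128)
W_actor, b_actor = rng.normal(scale=0.1, size=(128, 2)), np.zeros(2)
W_critic, b_critic = rng.normal(scale=0.1, size=(128, 1)), np.zeros(1)


def forward(states):
    hidden = np.maximum(0.0, states @ W_common + b_common)  # shared ReLU layer
    action_probs = softmax(hidden @ W_actor + b_actor)      # actor head
    value = hidden @ W_critic + b_critic                    # critic head
    return action_probs, value


state = np.array([[0.1, -0.2, 0.05, 0.3]])  # a batch with one made-up state
action_probs, value = forward(state)
print(action_probs.shape, value.shape)  # (1, 2) (1, 1)
```

Both heads read the same hidden representation, so the shared layer's weights serve the policy and the value estimate at once.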

The overall pipeline of the code would look something like this:
1. Extract environment information (state, action, etc.)
2. Pass the state through the model to generate the action and critic outputs
3. Sample an action from the probability distribution
4. Calculate the rewards
5. Compare the returns observed along the trajectory with the values the critic predicted at each step to generate the loss
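Steps 3 to 5 can be sketched in plain NumPy. This is a simplified illustration, not the Challenge solution: it assumes the observed discounted return stands in for $Q$, a squared-error critic loss, and a discount factor `gamma`. The helper names `discounted_returns` and `a2c_losses` and the rollout numbers are hypothetical.

```python
import numpy as np


def discounted_returns(rewards, gamma):
    # Return at step t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def a2c_losses(action_probs, actions, values, rewards, gamma=0.99):
    returns = discounted_returns(rewards, gamma)
    # Advantage estimate: observed return minus the critic's prediction.
    advantages = returns - values
    # Log-probability of the action that was actually taken at each step.
    log_probs = np.log(action_probs[np.arange(len(actions)), actions])
    actor_loss = -np.sum(log_probs * advantages)
    critic_loss = np.sum(advantages ** 2)
    return actor_loss, critic_loss, returns, advantages


# Step 3: sample an action from a made-up probability distribution.
rng = np.random.default_rng(0)
sampled_action = rng.choice(2, p=[0.6, 0.4])

# Steps 4-5 on a made-up three-step rollout.
rewards = np.array([1.0, 1.0, 1.0])
values = np.array([2.5, 1.5, 0.5])  # critic outputs at each step
action_probs = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])
actions = np.array([0, 1, 0])
actor_loss, critic_loss, returns, advantages = a2c_losses(
    action_probs, actions, values, rewards, gamma=1.0
)
print(returns)      # [3. 2. 1.]
print(critic_loss)  # 0.75
```

In a framework with automatic differentiation, the advantage is usually treated as a constant in the actor term, so that the actor loss updates only the policy while the critic loss updates the value estimate.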