# Seminar 4 Part 2: Advantage Actor-Critic
-----------------

Actor-Critic methods are policy gradient methods that represent the policy function independently of the value function.

A policy function (or policy) returns a probability distribution over actions that the agent can take based on the given state.
A value function determines the expected return for an agent starting at a given state and acting according to a particular policy forever after.

In the Actor-Critic method, the policy is referred to as the ***actor*** that proposes a set of possible actions given a state, and the estimated value function is referred to as the ***critic***, which evaluates actions taken by the ***actor*** based on the given policy.

In this tutorial, both the ***Actor*** and ***Critic*** will be represented using neural networks.


## CartPole-v1

In the [CartPole-v1 environment](https://gym.openai.com/envs/CartPole-v1), a pole is attached to a cart moving along a frictionless track. 
The pole starts upright and the goal of the agent is to prevent it from falling over by applying a force of -1 or +1 to the cart. 
A reward of +1 is given for every time step the pole remains upright.
An episode ends when (1) the pole is more than 12 degrees from vertical or (2) the cart moves more than 2.4 units from the center.

<center>
  <figure>
    <img src="./graphs/cartpole-v0.gif">
    <figcaption>
      Trained actor-critic model in the CartPole-v1 environment
    </figcaption>
  </figure>
</center>

The problem is considered "solved" when the average total reward per episode reaches 475 over 100 consecutive episodes.
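
The success criterion can be tracked with a fixed-size window over recent episode rewards. The sketch below is a minimal illustration and not part of the original notebook: the helper name `SolvedTracker` and its `threshold` and `window` parameters are hypothetical, and the threshold is passed in by the caller rather than hard-coded.

```python
from collections import deque


class SolvedTracker:
    """Tracks a running average of episode rewards (hypothetical helper)."""

    def __init__(self, threshold, window=100):
        self.threshold = threshold
        self.window = window
        self.rewards = deque(maxlen=window)  # keeps only the most recent episodes

    def add(self, episode_reward):
        """Record one episode and report whether the criterion is met."""
        self.rewards.append(episode_reward)
        if len(self.rewards) < self.window:
            return False  # not enough episodes yet
        return sum(self.rewards) / self.window >= self.threshold


# Toy usage with a small window so the behaviour is easy to follow.
tracker = SolvedTracker(threshold=10.0, window=3)
results = [tracker.add(r) for r in [8, 10, 12, 14]]
print(results)  # [False, False, True, True]
```

In a training loop, `add` would be called once per episode with that episode's total reward, and training could stop as soon as it returns `True`.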

The code below is adapted from this [repository](https://github.com/yc930401/Actor-Critic-pytorch).

## Setup Environment

Import necessary packages and configure global settings.

In [1]:
import gym

env = gym.make("CartPole-v1").unwrapped
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

print("Dimension of state: ", state_size)
print("Number of action: ", action_size)

Dimension of state:  4
Number of action:  2


For CartPole-v1, the state consists of four values:

- cart position,
- cart velocity,
- pole angle,
- pole angular velocity.

The agent can take two actions to push the cart: 
- left (0),
- right (1).

## Model

The *Actor* and *Critic* will be modeled using two separate neural networks that generate the action probabilities and the critic value, respectively.

During the forward pass, each network takes the state as input: the actor outputs the action probabilities and the critic outputs the value $V$, which models the state-dependent [value function](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#value-functions). The goal is to train a model that chooses actions based on a policy $\pi$ that maximizes expected [return](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#reward-and-return).
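
To make the step from network outputs to a sampled action concrete, here is a minimal plain-Python sketch of what a softmax followed by categorical sampling does. The helper names `softmax` and `sample_action` and the example logits are assumptions made for illustration; the notebook itself relies on PyTorch for both steps.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [0.0, 0.0]          # hypothetical actor outputs for (left, right)
probs = softmax(logits)      # equal logits give equal probabilities
rng = random.Random(0)
action = sample_action(probs, rng)
print(probs)                 # [0.5, 0.5]
print(action in (0, 1))      # True
```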

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Actor(nn.Module):
    def __init__(self, state_size, action_size):
        super(Actor, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.linear1 = nn.Linear(self.state_size, 128)
        self.linear2 = nn.Linear(128, 256)
        self.linear3 = nn.Linear(256, self.action_size)

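    def forward(self, state):
        output = F.relu(self.linear1(state))
        output = F.relu(self.linear2(output))
        # Raw scores (logits), one per action.
        output = self.linear3(output)
        # Softmax turns the logits into probabilities; Categorical supports sampling and log_prob.
        distribution = Categorical(F.softmax(output, dim=-1))
        return distribution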


class Critic(nn.Module):
    def __init__(self, state_size, action_size):
        super(Critic, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.linear1 = nn.Linear(self.state_size, 128)
        self.linear2 = nn.Linear(128, 256)
        self.linear3 = nn.Linear(256, 1)

    def forward(self, state):
        output = F.relu(self.linear1(state))
        output = F.relu(self.linear2(output))
        # Single scalar output: the estimated value V(s) of the input state.
        value = self.linear3(output)
        return value

### Loss functions for actor and critic

#### Actor loss

The actor loss is based on [policy gradients with the critic as a state-dependent baseline](https://www.youtube.com/watch?v=EKqxumCuAAY&t=62m23s) and computed with single-sample (per-episode) estimates.

$$L_{\text{actor}} = -\sum^{T}_{t=1} \log\pi_{\theta}(a_{t} | s_{t})[G(s_{t}, a_{t})  - V^{\pi}_{\theta}(s_{t})]$$

where:
- $T$: the number of timesteps per episode, which can vary per episode
- $s_{t}$: the state at timestep $t$
- $a_{t}$: the action chosen at timestep $t$ given state $s_{t}$
- $\pi_{\theta}$: the policy (***actor***) parameterized by $\theta$
- $V^{\pi}_{\theta}$: the value function (***critic***); the formula reuses $\theta$ for brevity, although in this notebook the critic is a separate network with its own parameters
- $G(s_{t}, a_{t})$: the expected return for a given state, action pair at timestep $t$
- $G(s_{t}, a_{t})  - V^{\pi}_{\theta}(s_{t})$: the advantage, which indicates how much better an action is in a particular state than a random action selected according to the policy $\pi$ for that state.
    - Subtracting the baseline $V$ is optional, but omitting it may result in high variance during training.

The sum is negated because the optimizer minimizes the loss, while the goal is to maximize the probabilities of actions that yield higher returns.
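
As a worked example of the formula, the sketch below evaluates the actor loss for a made-up three-step episode in plain Python. The function name `actor_loss` and all numbers are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
def actor_loss(log_probs, returns, values):
    """Negative sum of log-probabilities weighted by the advantage G_t - V(s_t)."""
    return -sum(lp * (g - v) for lp, g, v in zip(log_probs, returns, values))


# Hypothetical 3-step episode.
log_probs = [-0.5, -1.0, -0.25]   # log pi(a_t | s_t)
returns = [3.0, 2.0, 1.0]         # G_t
values = [2.0, 2.0, 2.0]          # critic estimates V(s_t)

# Advantages are [1, 0, -1], so the loss is -(-0.5 * 1 + -1.0 * 0 + -0.25 * -1).
loss = actor_loss(log_probs, returns, values)
print(loss)  # 0.25
```

The notebook's training code averages over timesteps with `.mean()` instead of summing, which only rescales the loss by the episode length.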

##### Empirical merit of using the advantage


Without the advantage, the algorithm would try to increase probabilities for actions taken in a particular state based on expected return $G_t$, which may not make much of a difference if the relative probabilities between actions remain the same.

For instance, suppose that two actions for a given state would yield the same expected return. Without the advantage, the algorithm would still try to raise the probability of whichever action was taken, in proportion to $G_t$. With the advantage, it may turn out that there is no advantage ($G - V = 0$), so nothing is gained by increasing the actions' probabilities and the corresponding gradient term is zero.
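
To see why a zero advantage leaves the policy unchanged, differentiate the actor loss with respect to the policy parameters while treating the advantage as a constant (the training code does this by detaching it). The symbol $A_t$ is introduced here only as shorthand for the advantage.

$$\nabla_{\theta} L_{\text{actor}} = -\sum^{T}_{t=1} A_t \, \nabla_{\theta}\log\pi_{\theta}(a_{t} | s_{t}), \qquad A_t = G(s_{t}, a_{t}) - V^{\pi}_{\theta}(s_{t})$$

Each timestep's contribution is scaled by $A_t$, so a timestep with $A_t = 0$ contributes nothing to the gradient.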

#### Critic loss

Training $V$ to be as close as possible to $G$ can be set up as a regression problem with the following squared-error loss function:

$$L_{\text{critic}} = \sum_{t=1}^T(G(s_t, a_t) - V^{\pi}_{\theta}(s_t))^2.$$
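
A quick numeric check of this loss, written in plain Python for a made-up three-step episode. The function name `critic_loss` and the numbers are hypothetical.

```python
def critic_loss(returns, values):
    """Sum of squared differences between returns G_t and estimates V(s_t)."""
    return sum((g - v) ** 2 for g, v in zip(returns, values))


returns = [3.0, 2.0, 1.0]   # G_t from a hypothetical 3-step episode
values = [2.0, 2.0, 2.0]    # critic estimates V(s_t)

# Squared errors are [1, 0, 1].
print(critic_loss(returns, values))  # 2.0
```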

##### Computing expected returns $G(s_t, a_t)$

The sequence of rewards $\{r_{1}, r_2, \ldots, r_t, \ldots, r_T\}$ collected during one episode is converted into a sequence of expected returns $\{G_{1}, G_2, \ldots, G_t, \ldots, G_T\}$ where 

$$G_{t} = \sum^{T}_{t'=t} \gamma^{t'-t}r_{t'}$$

and $\gamma\in (0, 1]$. The implementation is below. 

In [3]:
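def compute_returns(rewards, masks, gamma=0.99):
    """Convert per-step rewards into discounted returns G_t, working backwards."""
    # Start from a plain float so the result inherits the device of the reward tensors.
    R = 0.0
    returns = []
    for step in reversed(range(len(rewards))):
        # masks[step] is 0 at a terminal step, so nothing is carried back across it.
        R = rewards[step] + gamma * R * masks[step]
        returns.insert(0, R)
    return returns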
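
The backward recursion can be checked by hand on a tiny example. The sketch below is a plain-Python analogue of `compute_returns` that works on floats instead of tensors; the name `discounted_returns` and the inputs are hypothetical, and $\gamma = 0.5$ is used only to keep the numbers simple.

```python
def discounted_returns(rewards, masks, gamma):
    """Plain-Python analogue: G_t = r_t + gamma * G_{t+1} * mask_t."""
    running = 0.0
    returns = []
    for reward, mask in zip(reversed(rewards), reversed(masks)):
        running = reward + gamma * running * mask
        returns.append(running)
    returns.reverse()  # restore chronological order
    return returns


rewards = [1.0, 1.0, 1.0]
masks = [1.0, 1.0, 0.0]      # 0 marks the terminal step
print(discounted_returns(rewards, masks, gamma=0.5))  # [1.75, 1.5, 1.0]
```

Reading the output from right to left: the last return is just the last reward, the middle one is $1 + 0.5 \cdot 1$, and the first is $1 + 0.5 \cdot 1.5$.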

### Training

To train the agent, you will follow these steps:

1. Run the agent on the environment to collect training data per episode.
2. Compute expected return at each time step.
3. Compute the loss for the actor and critic models.
4. Compute gradients and update network parameters.
5. Repeat 1-4 until either success criterion or max episodes has been reached.

Note that this method differs algorithmically from the lecture and from the Sutton-Barto book. There, the weights are updated online after each step, while the implementation in this notebook updates the parameters only once, at the end of each episode.
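
For comparison, the online variant updates both networks after every transition using the one-step temporal-difference error. The sketch below follows the one-step actor-critic scheme as presented in Sutton and Barto, omitting the extra $\gamma^{t}$ factor on the actor update that appears in their episodic formulation. The symbols $\delta_t$, $w$, $\alpha^{w}$ and $\alpha^{\theta}$ are not used elsewhere in this notebook and are introduced only for this comparison; $V_{w}(s_{t+1})$ is taken to be zero when $s_{t+1}$ is terminal.

$$\delta_t = r_{t} + \gamma V_{w}(s_{t+1}) - V_{w}(s_t)$$

$$w \leftarrow w + \alpha^{w} \, \delta_t \, \nabla_{w} V_{w}(s_t)$$

$$\theta \leftarrow \theta + \alpha^{\theta} \, \delta_t \, \nabla_{\theta}\log\pi_{\theta}(a_t | s_t)$$

Here $\delta_t$ plays the role that the advantage $G - V$ plays in the per-episode losses above, but it is available immediately after each step.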

In [4]:
from itertools import count
import warnings

warnings.filterwarnings("ignore")


def trainIters(actor, critic, n_iters):
    optimizerA = optim.Adam(actor.parameters())
    optimizerC = optim.Adam(critic.parameters())
    for iter in range(n_iters):
        # Reset once per episode; reset() returns (observation, info).
        state, _ = env.reset(seed=iter)
        log_probs = []
        values = []
        rewards = []
        masks = []

        for i in count():
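            state = torch.FloatTensor(state).to(device)
            # The actor returns a distribution over actions, the critic a value estimate.
            dist, value = actor(state), critic(state)

            action = dist.sample()
            # step() returns (observation, reward, terminated, truncated, info).
            next_state, reward, done, _, _ = env.step(action.item())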

            log_prob = dist.log_prob(action).unsqueeze(0)

            log_probs.append(log_prob)
            values.append(value)
            rewards.append(torch.tensor([reward], dtype=torch.float, device=device))
            masks.append(torch.tensor([1 - done], dtype=torch.float, device=device))

            state = next_state

            if done:
                print("Iteration: {}, Score: {}".format(iter, i))
                break

        # Turn the episode's rewards into discounted returns G_t.
        returns = compute_returns(rewards, masks)

        log_probs = torch.cat(log_probs)
        returns = torch.cat(returns).detach()
        values = torch.cat(values)

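        advantage = returns - values

        # Detach the advantage so the actor loss does not backpropagate into the critic.
        actor_loss = -(log_probs * advantage.detach()).mean()
        # Regress the critic's estimates towards the observed returns.
        critic_loss = advantage.pow(2).mean()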

        optimizerA.zero_grad()
        optimizerC.zero_grad()
        actor_loss.backward()
        critic_loss.backward()
        optimizerA.step()
        optimizerC.step()
    env.close()

In [5]:
actor = Actor(state_size, action_size).to(device)
critic = Critic(state_size, action_size).to(device)
trainIters(actor, critic, n_iters=200)

Iteration: 0, Score: 14
Iteration: 1, Score: 14
Iteration: 2, Score: 36
Iteration: 3, Score: 11
Iteration: 4, Score: 24
Iteration: 5, Score: 12
Iteration: 6, Score: 28
Iteration: 7, Score: 21
Iteration: 8, Score: 17
Iteration: 9, Score: 21
Iteration: 10, Score: 22
Iteration: 11, Score: 16
Iteration: 12, Score: 27
Iteration: 13, Score: 34
Iteration: 14, Score: 24
Iteration: 15, Score: 20
Iteration: 16, Score: 29
Iteration: 17, Score: 21
Iteration: 18, Score: 11
Iteration: 19, Score: 50
Iteration: 20, Score: 23
Iteration: 21, Score: 26
Iteration: 22, Score: 30
Iteration: 23, Score: 38
Iteration: 24, Score: 38
Iteration: 25, Score: 97
Iteration: 26, Score: 35
Iteration: 27, Score: 20
Iteration: 28, Score: 41
Iteration: 29, Score: 44
Iteration: 30, Score: 21
Iteration: 31, Score: 34
Iteration: 32, Score: 21
Iteration: 33, Score: 34
Iteration: 34, Score: 41
Iteration: 35, Score: 81
Iteration: 36, Score: 34
Iteration: 37, Score: 21
Iteration: 38, Score: 8
Iteration: 39, Score: 49
Iteration: 