# Actor-Critic Methods

In _value-based_ methods, we use a value function to determine the optimal policy. The exploration
and exploitation tradeoff is left to manual tuning. In _policy-based_ methods, we learn the policy
directly, but Monte Carlo estimates have high variance and tend to learn more slowly (mitigating
that by collecting more samples makes the algorithm less sample-efficient).

Actor-critic methods combine the two approaches:

 * We learn the _actor_ policy, to control how the agent behaves.
 * We measure how good the chosen actions are via a _critic_, which learns the value function.

Remember the _baseline_ in policy gradient? The _critic_ helps compute it!

In [1]:
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F

import gymnasium as gym

from util.gymnastics import DEVICE, ReplayBuffer, gym_simulation
from util.gymnastics import init_random, plot_scores, soft_update_model_params

## Environment

In [None]:
gym_simulation("Pendulum-v1")

In [2]:
# For convenience, hardcoding the actions' interval [-2.0, 2.0]
ACTION_SCALE = 2.0

## General (1-Step) Algorithm

The simplified version of an [actor-critic](https://arxiv.org/abs/1602.01783) algorithm goes as
follows:

 1. In state $S_t$, the _actor_ samples the action $A_t$ from the policy
    $\pi_{\theta}(\cdot | S_t)$, and the environment returns $R_{t+1}$ and $S_{t+1}$
 2. The _critic_ outputs the value of state and next state: $V_t = \hat{v}_w(S_t),
    V_{t+1} = \hat{v}_w(S_{t+1})$
 3. We compute the _advantage_ (written $\hat{A}_t$ to tell it apart from the action $A_t$):
    $\hat{A}_t = Q(S_t, A_t) - V(S_t) \approx [R_{t+1} + \gamma V_{t+1}] - V_t$
 4. The _actor_ updates the policy parameters $\theta$ using the _advantage_:
    $\Delta \theta = \alpha \nabla_{\theta} [ \log \pi_{\theta}(A_t | S_t) ] \hat{A}_t$
 5. The _critic_ updates the value function parameters $w$ by minimizing the squared advantage
    $\hat{A}_t^2$
 6. Repeat

The _advantage actor critic (A2C)_ algorithm works basically this way. For a robust implementation,
check [Stable Baselines3 A2C](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html).
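
As a rough illustration of the one-step advantage described above, here is a minimal sketch in
plain Python (no PyTorch, to keep it dependency-free). The function name `one_step_advantage`, the
`done` flag, and all the numbers are assumptions made up for this example; they are not part of
the notebook's utilities.

```python
def one_step_advantage(reward, value, next_value, gamma=0.99, done=False):
    """One-step advantage estimate: [r + gamma * V(s')] - V(s)."""
    # Assumption: we do not bootstrap from the next state once the episode has ended.
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


# Hypothetical numbers: the critic predicted V(s) = 1.0 and V(s') = 2.0, the reward was 0.5.
advantage = one_step_advantage(reward=0.5, value=1.0, next_value=2.0, gamma=0.9)

# Hypothetical log-probability of the chosen action under the current policy.
log_prob = -0.7

# The actor minimizes -log_prob * advantage (i.e., it ascends log_prob * advantage),
# while the critic minimizes the squared advantage.
actor_loss = -log_prob * advantage
critic_loss = advantage ** 2

print(f"advantage={advantage:.3f} actor_loss={actor_loss:.3f} critic_loss={critic_loss:.3f}")
```

A positive advantage means the action did better than the critic expected, so the actor update
raises its probability; a negative one lowers it.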

## Neural Network Models

In [4]:
class ActorNetwork(nn.Module):
    """Actor (Policy) Model."""

    def __init__(self, state_size, action_size, fc1_units=400, fc2_units=300):
        super(ActorNetwork, self).__init__()
        # TODO: Create three linear layers, first two initialized via Kaiming normal, the last one
        #       with uniform distribution in [-3e-3, 3e-3]. The input of the policy network is a
        #       state, while the output is a (continuous) action.

    def forward(self, state):
        """Build an actor (policy) network that maps states -> actions."""
        # TODO: Use ReLU, ReLU, TanH :)
        pass

In [5]:
class CriticNetwork(nn.Module):
    """Critic (Value) Model."""

    def __init__(self, state_size, action_size, fcs1_units=400, fc2_units=300):
        super(CriticNetwork, self).__init__()
        # TODO: Same architecture as the policy network, but now the input is the flattened state
        #       and action, while the output is a single value: the state-action value.

    def forward(self, state, action):
        """Build a critic (value) network that maps (state, action) pairs -> Q-values."""
        # TODO: concat state and action (pay attention to the dimension!), ReLU for non-linearity.
        #       The output is directly the output of the last linear layer (without non-linearity).
        pass


## Q-Learning Based Actor-Critic Algorithms

We are going to implement modern actor-critic algorithms such as DDPG, TD3, and SAC. They are
closely related to Q-Learning, and sit in between DQN and policy-gradient methods. That is because
these algorithms learn approximators for the optimal action-value function $Q^*(s, a)$ and the
optimal (deterministic, excluding SAC) policy $a^*(s) = \arg\max_{a} Q^*(s, a)$, exploiting the
fact that the action-value function is differentiable with respect to the action.

All these algorithms are analyzed in detail in
[OpenAI SpinningUp](https://spinningup.openai.com/en/latest) and I strongly recommend reading that
amazing resource!

With continuous actions, finding the max of the action-value function would be an optimization
problem in and of itself, so instead we learn a deterministic policy $\mu(s)$ such that:
$Q(s, \mu(s)) \approx \max_a Q(s,a)$.
All of these algorithms are off-policy and use a replay buffer (like DQN).
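
To make the idea of learning $\mu(s)$ by differentiating $Q$ concrete, here is a toy sketch in
plain Python. Everything in it is an assumption chosen for illustration: a hand-written, known
`q_value` whose maximizer is $a^*(s) = 2s$, and a one-parameter linear policy $\mu_w(s) = w s$.
In the real algorithms both are neural networks and the gradient comes from autograd.

```python
def q_value(state, action):
    """Toy, known action-value function whose maximizer is a*(s) = 2 * s."""
    return -(action - 2.0 * state) ** 2


def dq_daction(state, action):
    """Analytic derivative of q_value with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def train_linear_policy(states, lr=0.1, steps=200):
    """Fit mu_w(s) = w * s by gradient ascent on the mean of Q(s, mu_w(s))."""
    w = 0.0
    for _ in range(steps):
        # Chain rule: dQ/dw = dQ/da * da/dw, with da/dw = s for a linear policy.
        grad = sum(dq_daction(s, w * s) * s for s in states) / len(states)
        w += lr * grad
    return w


states = [-1.0, -0.5, 0.5, 1.0]
w = train_linear_policy(states)
print(f"learned w = {w:.4f} (the maximizing policy has w = 2)")
```

The policy never searches over actions: it only follows the slope of $Q$ with respect to the
action, which is the mechanism the actor updates below rely on.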


## Training Loop

In [6]:
def train_actor_critic(env, agent, n_episodes=1_000, max_t=300):
    """General training loop for actor-critic algorithms in this lecture."""
    # Records all episode scores.
    scores = []
    for i_episode in range(1, n_episodes+1):
        # TODO: Reset the environment.
        state, _ = None
        # TODO: Reset the score to zero.
        score = None
        # Run training for max_timesteps.
        for _ in range(max_t):
            # TODO: Select an action via agent.act(state, add_noise=True) -> notice the noise!
            action = None
            # TODO: Perform a step in the environment.
            next_state, reward, terminated, truncated, _ = None
            # TODO: Compute `done`.
            done = None
            # TODO: Perform a step for the agent calling agent.step(...)
            # ...
            # Update state, score, and check termination.
            state = next_state
            score += reward
            if done:
                break
        # Record statistics and print debugging information.
        scores.append(score)
        avg_score = np.mean(scores[-100:])
        print(f'\rEpisode {i_episode}\tAverage Score: {avg_score:.2f}',
              end="\n" if i_episode % 50 == 0 else "")
        if avg_score >= -370:
            print(f'\rEpisode {i_episode} solved environment!\tAverage Score: {avg_score:.2f}')
            break
            
    return scores

## DDPG

The _Deep Deterministic Policy Gradient_ algorithm can be thought of as the DQN algorithm for
continuous action spaces, and it uses the same techniques: a replay buffer and target networks (for
both actor and critic).

We already learned how the _actor_ learns a deterministic policy $\mu(s)$ by maximizing the critic
value. The _critic_ learns the action-value function $Q(s, a)$ by minimizing a _mean-squared Bellman
error_ (MSBE, the squared TD error, or the squared _advantage_ if you will), like DQN.

Because the policy is deterministic, the _exploration / exploitation_ tradeoff can be tuned by
adding noise to the action. The DDPG authors, building on the
[original deterministic policy gradient paper](https://proceedings.mlr.press/v32/silver14.pdf),
used
[Ornstein-Uhlenbeck noise](https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process), but
it turns out that zero-mean Gaussian noise works just as well (hence, we'll use that here).
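
For intuition on how the two noise types differ, the sketch below simulates both in plain Python
and compares their lag-1 autocorrelation. The function names and the parameter values
(`theta=0.15`, `sigma=0.2`, `dt=1.0`) are assumptions picked for this illustration, not settings
used elsewhere in this notebook.

```python
import random


def gaussian_noise(n_steps, sigma=0.2, seed=0):
    """Independent zero-mean Gaussian samples (no memory between steps)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma) for _ in range(n_steps)]


def ornstein_uhlenbeck_noise(n_steps, theta=0.15, sigma=0.2, dt=1.0, seed=0):
    """Discretized OU process: x += -theta * x * dt + sigma * sqrt(dt) * N(0, 1)."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_steps):
        # The -theta * x term pulls the noise back towards zero (mean reversion).
        x += -theta * x * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def lag1_autocorrelation(xs):
    """Sample correlation between consecutive values of the sequence."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


ou = ornstein_uhlenbeck_noise(5_000)
white = gaussian_noise(5_000)
print(f"OU lag-1 autocorrelation:       {lag1_autocorrelation(ou):.2f}")
print(f"Gaussian lag-1 autocorrelation: {lag1_autocorrelation(white):.2f}")
```

OU noise is temporally correlated, so consecutive actions get pushed in a similar direction, while
Gaussian noise is independent at every step.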

In [7]:
class AgentDDPG:
    def __init__(self, state_size, action_size, start_mem_size=128,
                 gamma=0.99, lr_actor=1e-4, lr_critic=1e-3, exploration_noise_scale=0.1):
        self.state_size = state_size
        self.action_size = action_size
        self.start_mem_size = start_mem_size
        self.gamma = gamma
        self.exploration_noise_scale = exploration_noise_scale

        # Actor network (w/ target network)
        # TODO: Create actor local and target networks, plus the Adam optimizer. Make sure to keep
        #       the target network in eval mode and copy the local network parameters.
        self.actor        = None
        self.actor_target = None
        # ...
        self.actor_optimizer = None

        # Critic network (w/ target network)
        # TODO: Create critic local and target networks, plus the Adam optimizer. Make sure to keep
        #       the target network in eval mode and copy the local network parameters.
        self.critic        = None
        self.critic_target = None
        # ...
        self.critic_optimizer = None

        # TODO: Create the replay buffer (using the provided one in the `util` module).
        self.memory = None

    @torch.no_grad()
    def act(self, state: np.ndarray, add_noise=False):
        """Returns actions for given state as per current policy."""
        # TODO: Convert the state into a tensor.
        state = None
        # TODO: Get the action via the actor local network
        action = None
        # In the original paper, the noise is generated via an Ornstein-Uhlenbeck process. It turns
        # out, a normal gaussian noise works just as well. Hence, that's what we use.
        if add_noise:
            # TODO: Add noise via normal distribution (np.random.normal) times the parameter
            #       `self.exploration_noise_scale`.
            action += None
        # TODO: Return the action, making sure to clip it based on ACTION_SCALE.
        return None

    def step(self, state, action, reward, next_state, done):
        """Save experience in replay memory, and use random sample from buffer to learn."""
        # TODO: Save experience / reward in the memory.
        # ...

        # TODO: Learn, if enough samples are available in memory.
        if len(self.memory) > self.start_mem_size:
            pass

    def learn(self, experiences):
        states, actions, rewards, next_states, dones = experiences

        # CRITIC UPDATE.
        with torch.no_grad():
            # TODO: Get the next actions using the actor target on next_states.
            actions_next = None
            # TODO: Get the Q_targets_next using the critic target.
            Q_targets_next = None
            # TODO: Compute Q targets: rewards + (gamma * Q_targets_next * (1 - dones))
            Q_targets = None

        # TODO: Compute Q_values (the expected Q) using the critic local network.
        Q_values = None
        # TODO: Compute critic loss: MSE between Q_values and Q_targets.
        critic_loss = None
        # TODO: Perform a minimization step of the critic loss with its optimizer.
        # ...

        # ACTOR UPDATE.
        # TODO: Compute the action predictions via the actor local network (self.actor).
        actions_pred = None
        # TODO: Compute actor loss: the negative mean of self.critic(states, actions_pred).
        actor_loss = None
        # TODO: Perform a minimization step of the actor loss with its optimizer.
        # ...

        # TODO: update target networks, calling the `soft_update_model_params` utility function.
        # ...

In [None]:
with init_random(gym.make('Pendulum-v1')) as env:
    agent_ddpg = AgentDDPG(env.observation_space.shape[0], env.action_space.shape[0])
    scores_ddpg = train_actor_critic(env, agent_ddpg)
plot_scores(scores_ddpg)

In [None]:
gym_simulation("Pendulum-v1", agent_ddpg)

## TD3

The _Twin Delayed DDPG_ (TD3) algorithm expands on DDPG with a couple of additional tricks:

 * It learns _two_ Q functions, and uses the smaller Q value to form the target. That is to address
   overestimation of Q values in DDPG. The "_twin_" part of the name comes from this.
 * It updates the policy (and target) networks less frequently than the Q functions (hence,
   "_delayed_"), which keeps the targets and the learning more stable.
 * Finally, it adds noise to the target action to "smooth out" the action value and make it harder
   for the policy to exploit errors in the Q function.
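
The first and third tricks both act on the critic target, and can be sketched for a single
transition in plain Python. This is only an illustration under assumptions: `td3_target`, `clip`,
and the two toy critics `q1` and `q2` are hypothetical stand-ins (functions of the action only, for
one fixed next state), not the networks built in the exercise below.

```python
import random


def clip(x, low, high):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, x))


def td3_target(reward, done, next_action, q1_next, q2_next, gamma=0.99,
               policy_noise=0.2, noise_clamp=0.5, action_scale=2.0, rng=random):
    """Return (TD target, smoothed next action) for one transition."""
    # Target policy smoothing: clipped Gaussian noise, scaled to the action range.
    noise = clip(rng.gauss(0.0, policy_noise), -noise_clamp, noise_clamp) * action_scale
    smoothed_action = clip(next_action + noise, -action_scale, action_scale)
    # Twin critics: be pessimistic and keep the smaller of the two estimates.
    q_next = min(q1_next(smoothed_action), q2_next(smoothed_action))
    # Same Bellman target as DDPG, with no bootstrapping on terminal transitions.
    return reward + gamma * q_next * (1.0 - done), smoothed_action


def q1(action):
    """Hypothetical target critic 1."""
    return -(action - 1.0) ** 2


def q2(action):
    """Hypothetical target critic 2: a slightly more pessimistic estimate."""
    return -(action - 1.0) ** 2 - 0.5


target, smoothed_action = td3_target(
    reward=1.0, done=0.0, next_action=1.9, q1_next=q1, q2_next=q2, rng=random.Random(0))
print(f"smoothed action={smoothed_action:.3f} target={target:.3f}")
```

Because the target is evaluated at a slightly perturbed action and with the smaller estimate, a
narrow, spuriously high peak in one critic has less influence on what the critics regress towards.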

In [10]:
class AgentTD3:
    def __init__(self, state_size, action_size, start_mem_size=128,
                 gamma=0.99, lr_actor=1e-4, lr_critic=1e-3, exploration_noise_scale=0.1,
                 policy_noise = 0.2, noise_clamp=0.5,
                 policy_freq=2):
        self.state_size = state_size
        self.action_size = action_size
        self.start_mem_size = start_mem_size
        self.gamma = gamma
        self.exploration_noise_scale = exploration_noise_scale
        self.policy_noise = policy_noise
        self.noise_clamp = noise_clamp
        self.policy_freq = policy_freq
        self.t_step = 0

        # Actor network (w/ target network)
        # TODO: Build the same actor local and target network + optimizer as DDPG.
        # ...

        # TD3 trick n.1: Twin critic networks (w/ target network)
        # TODO: Build **two** twin critic networks!
        self.twin_critic_1        = None
        # ...
        self.twin_critic_2        = None
        # ...

        # TODO: Build a single Adam optimizer (hint: concatenate all parameters as list).
        self.critic_optimizer     = None

        # TODO: Instantiate the replay buffer.
        self.memory = ReplayBuffer()

    @torch.no_grad()
    def act(self, state, add_noise=False):
        """Returns actions for given state as per current policy."""
        # TODO: Convert the state into a tensor.
        state = None
        # TODO: Get the action via the actor local network
        action = None
        # As in the DDPG agent above, exploration uses plain zero-mean Gaussian noise instead of an
        # Ornstein-Uhlenbeck process.
        if add_noise:
            # TODO: Add noise via normal distribution (np.random.normal) times the parameter
            #       `self.exploration_noise_scale`.
            action += None
        # TODO: Return the action, making sure to clip it based on ACTION_SCALE.
        return None

    def step(self, state, action, reward, next_state, done):
        """Save experience in replay memory, and use random sample from buffer to learn."""
        self.t_step += 1

        # TODO: Save experience / reward in the memory.
        # ...

        # TODO: Learn, if enough samples are available in memory.
        if len(self.memory) > self.start_mem_size:
            pass

    def learn(self, experiences):
        states, actions, rewards, next_states, dones = experiences

        # UPDATE TWIN CRITICS.
        with torch.no_grad():
            # TODO: Get the next actions using the actor target.
            actions_next = None

            # TD3 trick n.3: target policy smoothing.
            # TODO: Get the noise torch.randn_like(actions_next) times the policy_noise.
            noise = None
            # TODO: Clamp the noise based on noise_clamp and scale it by ACTION_SCALE
            noise = None
            # TODO: Add the noise to the actions_next.
            actions_next += None
            # TODO: Clamp the actions to be in the correct interval [-ACTION_SCALE, ACTION_SCALE].
            actions_next = None

            # TODO: Compute Q_targets_next_1 and Q_targets_next_2 via the twin critic *target*
            #       networks using the next states and (smoothed) next actions.
            Q_targets_next_1 = None
            Q_targets_next_2 = None
            # TODO: Pick Q_targets_next as the minimum between those two.
            Q_targets_next = None
            # TODO: Compute Q_targets as DDPG :)
            Q_targets = None

        # TODO: Compute the Q_values of both twin critics.
        Q_values_1 = None
        Q_values_2 = None
        # TODO: Compute the critic loss as sum of mse_loss of the two critics.
        critic_loss = None

        # TODO: Minimize the critic loss.
        # ...

        # UPDATE ACTOR.
        # TD3 trick n.2: delayed policy updates.
        if self.t_step % self.policy_freq == 0:
            # TODO: Get actions from the actor.
            actions_pred = None
            # TODO: Compute the Q_values via the first twin critic.
            Q_values = None
            # TODO: Compute the actor loss as negative mean of the Q_values.
            actor_loss = None

            # TODO: Minimize the actor loss.
            # ...

            # TODO: Update all target networks.
            # ...


In [None]:
with init_random(gym.make('Pendulum-v1')) as env:
    agent_td3 = AgentTD3(env.observation_space.shape[0], env.action_space.shape[0])
    scores_td3 = train_actor_critic(env, agent_td3)
plot_scores(scores_td3)

In [None]:
gym_simulation("Pendulum-v1", agent_td3)

## SAC (Optional)

The _Soft Actor Critic_ (SAC) algorithm is an off-policy algorithm similar to DDPG and TD3, but it
learns a stochastic policy instead. It adopts many of the techniques used in TD3, and it stems from
the _Maximum Entropy Formulation_ of reinforcement learning.

For an in-depth understanding of both max-ent and SAC, I suggest watching
[Lecture 1](https://www.youtube.com/watch?v=2GwBez0D20A&list=PLwRJQ4m4UJjNymuBM9RdmB3Z9N5-0IlY0&index=2)
of Pieter Abbeel's Deep RL course, and reading the OpenAI SpinningUp SAC summary.

In short, in this formulation of RL the optimization objective is to maximize the expected return
plus the _entropy_ of the policy $H[ \pi (\cdot | s_t)]$, a term that intuitively balances
exploration and exploitation and is weighted by a coefficient $\beta$ (called `alpha` in the code
below):

$$
\max_{\pi} \mathbb{E}\Bigl[ \sum_{t=0}^{T} \bigl( r_t + \beta H[ \pi (\cdot | s_t)] \bigr) \Bigr]
$$

The _entropy_ in fact "measures" how uncertain a policy is (i.e., a nearly deterministic policy has
a very low entropy, while a random one has high entropy).
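
As a quick check of that intuition, the entropy of a one-dimensional Gaussian policy has the closed
form $H = \frac{1}{2} \log(2 \pi e \sigma^2)$, which only depends on the standard deviation. The
helper name `gaussian_entropy` and the sample values below are assumptions for this illustration.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


# A narrow (nearly deterministic) policy versus progressively wider ones.
for std in (0.01, 0.1, 1.0):
    print(f"std={std:<5} entropy={gaussian_entropy(std):+.3f}")
```

The entropy grows with the standard deviation. Note that this is a _differential_ entropy, so it
can be negative for narrow distributions; only the ordering matters here.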

While the _critic_ network learns the action-value as usual, the _actor_ network outputs the mean
and the log standard deviation (`logstd`) of a Gaussian representing the stochastic policy, so both
depend on the network parameters. `tanh` is used to "squash" both the `logstd` and the sampled
action into an acceptable range.
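
Squashing a Gaussian sample with `tanh` changes its density, so the log-probability needs a
change-of-variables correction: $\log p(a) = \log \mathcal{N}(u) - \log[c\,(1 - \tanh^2 u)]$ for
$a = c \tanh(u)$. The sketch below checks numerically, in plain Python, that the corrected density
integrates to one. The function names and the values `mean=0.3`, `std=0.5` are assumptions for
this example; unlike the network code, the sketch omits the small epsilon inside the logarithm.

```python
import math


def squashed_gaussian_log_prob(action, mean, std, action_scale=2.0):
    """Log-density of a = action_scale * tanh(u), with u ~ Normal(mean, std)."""
    t = action / action_scale   # tanh(u), strictly inside (-1, 1)
    u = math.atanh(t)           # invert the squashing
    log_normal = -0.5 * ((u - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))
    # Change of variables: da/du = action_scale * (1 - tanh(u)^2).
    return log_normal - math.log(action_scale * (1.0 - t * t))


def integrate_density(mean, std, action_scale=2.0, n=20_000):
    """Midpoint-rule integral of the squashed density over (-action_scale, action_scale)."""
    width = 2.0 * action_scale / n
    total = 0.0
    for i in range(n):
        a = -action_scale + (i + 0.5) * width
        total += math.exp(squashed_gaussian_log_prob(a, mean, std, action_scale)) * width
    return total


total = integrate_density(mean=0.3, std=0.5)
print(f"integral of the squashed density: {total:.4f}")
```

Without the correction term, the "density" over actions would not integrate to one, and the entropy
term computed from it would be biased.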

In [13]:
class ActorNetworkSAC(nn.Module):
    """The actor network for SAC. It is provided given its technicalities."""

    def __init__(self, state_size, action_size, action_scale=ACTION_SCALE, action_bias=0.0):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc_mean = nn.Linear(256, action_size)
        self.fc_logstd = nn.Linear(256, action_size)
        self.action_scale = action_scale
        self.action_bias = action_bias

    def forward(self, x):
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        mean    = self.fc_mean(x)
        log_std = self.fc_logstd(x)
        log_std = torch.tanh(log_std)
        return mean, self.adjust_log_std(log_std)

    def get_action(self, x):
        mean, log_std = self(x)
        std = log_std.exp()
        normal = torch.distributions.Normal(mean, std)
        sample = normal.rsample() # Reparameterization trick: (mean + std * N(0,1))
        output = torch.tanh(sample)
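        log_prob = normal.log_prob(sample)
        # Enforcing action bounds (and non-zero log)
        action = output * self.action_scale + self.action_bias
        log_prob -= torch.log(self.action_scale * (1 - output.pow(2)) + 1e-6)
        # Sum over action dimensions for the joint log-probability (no-op when action_size=1).
        log_prob = log_prob.sum(-1, keepdim=True)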
        return action, log_prob

    def adjust_log_std(self, log_std):
        log_std_min, log_std_max = (-5, 2) # From SpinUp / Denis Yarats
        return log_std_min + 0.5 * (log_std_max - log_std_min) * (log_std + 1)

In [14]:
class AgentSAC:
    def __init__(self, state_size, action_size, start_mem_size=128,
                 gamma=0.99, lr_actor=1e-4, lr_critic=1e-3, policy_freq=2):
        self.state_size = state_size
        self.action_size = action_size
        self.start_mem_size = start_mem_size
        self.gamma = gamma
        self.policy_freq = policy_freq
        self.t_step = 0

        # TODO: Build the actor network with the ActorNetworkSAC (no target!)
        self.actor = None
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=lr_actor)

        # TODO: Build twin critics like in TD3.
        # ...

        # TODO: Make the replay buffer.
        self.memory = None

    @torch.no_grad()
    def act(self, state, add_noise=False): # SAC doesn't really have noise, but for consistency...
        """Returns actions for given state as per current policy."""
        # TODO: Convert the numpy state to a tensor.
        state = None
        # TODO: Get the action calling `get_action` from the actor.
        action, _ = None
        # TODO: Return the numpy action.
        return None

    def step(self, state, action, reward, next_state, done):
        """Save experience in replay memory, and use random sample from buffer to learn."""
        self.t_step += 1

        # TODO: Save experience / reward in the memory.
        # ...

        # TODO: Learn, if enough samples are available in memory.
        if len(self.memory) > self.start_mem_size:
            pass

    def learn(self, experiences, alpha=0.2):
        states, actions, rewards, next_states, dones = experiences

        # UPDATE TWIN CRITICS
        with torch.no_grad():
            # TODO: Get next action and log-prob via actor.get_action(next_states).
            actions_next, log_pi_next_st = None
            # TODO: Compute the entropy term as: alpha * log_pi_next_st
            entropy_term = None

            # TODO: Compute Q_targets like in TD3, BUT subtract entropy_term from Q_targets_next.
            # ...

        # TODO: Compute and minimize critic loss like in TD3.
        # ...

        # UPDATE ACTOR.
        if self.t_step % self.policy_freq == 0:
            # TODO: Get action and log-prob via actor.get_action(...)
            action, log_pi = None
            # TODO: Compute the entropy term as above
            entropy_term = None

            # TODO: Use the min of the twin critic computed Q values.
            Q_values_1 = None
            Q_values_2 = None
            Q_values = None
            # TODO: actor loss is the (entropy_term - Q_values).mean()
            actor_loss = None

            # TODO: Minimize the actor loss.
            # ...

            # TODO: Update the twin critic target networks.
            # ...

In [None]:
with init_random(gym.make('Pendulum-v1')) as env:
    agent_sac = AgentSAC(env.observation_space.shape[0], env.action_space.shape[0])
    scores_sac = train_actor_critic(env, agent_sac)
plot_scores(scores_sac)

In [None]:
gym_simulation("Pendulum-v1", agent_sac)