# Continuous control
We will now move on to continuous control tasks. Policy gradients are well suited to these because the policy can take the form of a continuous distribution. The two most common choices are a deterministic policy that maps a state directly to a continuous action, i.e. $\pi(s) = a$, and a stochastic policy that learns the parameters of a Normal distribution. We will first look at the Deep Deterministic Policy Gradient (DDPG) algorithm, and then compare it to the Soft Actor-Critic (SAC) algorithm.

Before we do that, let's re-import the packages and helper functions from the previous notebook that we will need.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random
import gymnasium as gym
from collections import deque
import copy


class ReplayBuffer:
    def __init__(self, capacity, batch_size=128):
        self.buffer = deque(maxlen=capacity)
        self.batch_size = batch_size

    def push(self, data):
        self.buffer.append(data)

    def sample(self):
        return random.sample(self.buffer, self.batch_size)

    def __len__(self):
        return len(self.buffer)
    

def run_test(alg, env):  # used to run an eval episode for any algorithm with a `greedy_act` method.
    state, _ = env.reset()
    done = False
    score = 0
    while not done:
        action = alg.greedy_act(state)
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        score += reward
    return score

---
# Deep Deterministic Policy Gradient
The deterministic policy gradient theorem states that the gradient for a deterministic policy is given by $\mathbb{E}_{s \sim \rho^\beta}\left[\nabla_{\theta} Q(s, \pi_\theta(s)) \right] = \mathbb{E}_{s \sim \rho^\beta}\left[\nabla_a Q(s, a)\big|_{a = \pi_\theta(s)} \nabla_{\theta}\pi_\theta(s) \right]$, where $\rho^\beta$ is the state distribution under some exploratory policy $\beta$. This is just an application of the chain rule. Thankfully, due to the wonders of autograd, we again don't need to work through a long chain-rule calculation by hand, as it will handle it all for us! So to obtain the policy gradient in PyTorch it is as simple as evaluating the policy $a = \pi(s)$, passing $a$ into the critic $Q$ along with the state, and calling `.backward()` as we do with a normal loss.
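To make the chain-rule claim concrete, here is a minimal sketch that compares the gradient autograd produces with the one we get by applying the chain rule by hand. The linear `actor` and `critic` are toy stand-ins chosen for illustration (they are not the networks used later), picked so that the manual gradient is easy to write down.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions for illustration): a linear actor and a linear critic.
actor = nn.Linear(3, 1)        # pi_theta(s): state (dim 3) -> action (dim 1)
critic = nn.Linear(3 + 1, 1)   # Q(s, a): concatenated state-action -> scalar

states = torch.randn(5, 3)     # a small batch of states

# Autograd route: evaluate a = pi(s), feed it to Q, and call backward on the mean.
actions = actor(states)
q_mean = critic(torch.cat((states, actions), dim=1)).mean()
q_mean.backward()
autograd_grad = actor.weight.grad.clone()

# Manual chain rule: for this linear critic, dQ/da is the critic weight on the action
# input, and d pi / d W is the state itself, so the batch-mean gradient is dQ/da * mean(s).
dq_da = critic.weight[0, 3].detach()
manual_grad = dq_da * states.mean(dim=0, keepdim=True)

grads_match = torch.allclose(autograd_grad, manual_grad, atol=1e-5)
print(grads_match)
```

With real (non-linear) networks the manual route quickly becomes tedious, which is exactly why we lean on `.backward()`.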

Before we look at implementing the algorithm, we will take a quick look at how the action space changes in a gym environment when it is continuous. 

In [2]:
env = gym.make('Pendulum-v1')
print(env.action_space)
print(env.action_space.sample())

Box(-2.0, 2.0, (1,), float32)
[-1.6669251]


So we can see that the action space is now a `Box` object, similar to the state spaces we have dealt with. In the Pendulum environment the action is one-dimensional and lies in the range $[-2, 2]$. Just something to note: *most* continuous control benchmark environments also have a symmetric action space, though it is usually $[-1, 1]$, and the action usually has more dimensions than a single scalar.
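Since policies often produce values in $[-1, 1]$ while environments can have other bounds, it can help to see how one range maps onto another. The sketch below uses a hypothetical helper, `rescale_action`, that I've made up for illustration; the bounds are passed in by hand here, but for a `Box` space you would typically read them from `env.action_space.low` and `env.action_space.high`.

```python
import numpy as np


def rescale_action(unit_action, low, high):
    """Map an action in [-1, 1] (per dimension) linearly onto [low, high]."""
    unit_action = np.clip(unit_action, -1.0, 1.0)
    return low + 0.5 * (unit_action + 1.0) * (high - low)


# Pendulum-like bounds, matching the Box printed above
low, high = np.array([-2.0]), np.array([2.0])
pendulum_action = rescale_action(np.array([0.5]), low, high)
print(pendulum_action)  # [1.]

# A made-up asymmetric two-dimensional space
low2, high2 = np.array([0.0, -3.0]), np.array([1.0, 3.0])
asymmetric_action = rescale_action(np.array([-1.0, 1.0]), low2, high2)
print(asymmetric_action)  # [0. 3.]
```

For a symmetric space such as Pendulum's this reduces to multiplying by the upper bound.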

Let's move on to the code for DDPG! We can use the same replay buffer that we used for the DQN, so we just need to write the code for the actor and critic networks, and then the actual algorithm itself.

In [3]:
class DDPGActor(nn.Module):
    def __init__(self, input_size, hidden_dim, output_size, action_upper_bound):
        super(DDPGActor, self).__init__()
        self.input_layer = nn.Linear(input_size, hidden_dim)
        self.h1 = nn.Linear(hidden_dim, hidden_dim)
        self.h2 = nn.Linear(hidden_dim, hidden_dim)
        self.output_layer = nn.Linear(hidden_dim, output_size)

        self.ub = action_upper_bound

    def forward(self, x):
        x = torch.relu(self.input_layer.forward(x))
        x = torch.relu(self.h1.forward(x))
        x = torch.relu(self.h2.forward(x))
        x = self.ub * torch.tanh(self.output_layer.forward(x))
        return x


class DDPGCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim):
        super(DDPGCritic, self).__init__()
        self.input_layer = nn.Linear(state_dim + action_dim, hidden_dim)
        self.h1 = nn.Linear(hidden_dim, hidden_dim)
        self.h2 = nn.Linear(hidden_dim, hidden_dim)
        self.output_layer = nn.Linear(hidden_dim, 1)

    def forward(self, state, action):
        x = torch.cat((state, action), dim=1)
        x = torch.relu(self.input_layer.forward(x))
        x = torch.relu(self.h1.forward(x))
        x = torch.relu(self.h2.forward(x))
        x = self.output_layer.forward(x)
        return x

The actor here is similar to the networks we have seen already, with the exception that the output is passed through a `tanh` activation. This gives a symmetric output in $(-1, 1)$, which we then scale by the upper bound of the environment's action space (for Pendulum this is 2.0). You don't have to use a `tanh` activation: you could, for instance, clamp the outputs to $[-1, 1]$ and then rescale (test this if you like). If the bounds of your action space differ between dimensions, i.e. the min/max values aren't the same for each dimension, then you are probably better off clamping each dimension to its own range manually. The critic is a mix of the PPO and DQN critics we used previously. In DQN we estimate $Q$ over a discrete set of actions, whereas here the action is continuous, so it must be given to the network as an input alongside the state, and the network returns a single scalar. That scalar output is more like the PPO critic (though that is a value function, not a $Q$-function). Let's now test out a trained DDPG agent on the Pendulum environment.

In [4]:
env = gym.make('Pendulum-v1', render_mode='human')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
action_upper_bound = env.action_space.high.max()
net = DDPGActor(state_dim, 512, action_dim, action_upper_bound)
net.load_state_dict(torch.load('networks/Pendulum_ddpg_actor'))
state, _ = env.reset()
done = False
score = 0
while not done:
    with torch.no_grad():
        state = torch.from_numpy(state).float()
        action = net.forward(state).flatten().numpy()  # we should make sure that the action is a numpy array, not a tensor
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    score += reward
print(f"The trained DDPG score was {score}")

The trained DDPG score was -126.32883407238033


Let's now have a go at implementing this ourselves. I will provide the entire code for DDPG, **then as an exercise you can have a go at implementing a child class for TD3, an algorithm built on top of DDPG which improves performance. You can follow the pseudocode in Algorithm 1 of the [paper](https://proceedings.mlr.press/v80/fujimoto18a.html). If you are pushed for time and would rather look at SAC first, you can come back to this later.**

In [5]:
class DDPG:
    def __init__(self, state_dim, hidden_size, num_actions, action_ub, gamma=0.99, exploration_noise=0.2, memory_size=100000, batch_size=128, tau=0.0025,
                 critic_lr=1e-3, actor_lr=1e-4):
        self.actor = DDPGActor(state_dim, hidden_size, num_actions, action_ub)
        self.actor_target = copy.deepcopy(self.actor)
        self.actor_optimiser = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)

        self.critic = DDPGCritic(state_dim, num_actions, hidden_size)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimiser = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)
        self.critic_loss_fn = torch.nn.MSELoss()

        self.memory = ReplayBuffer(memory_size, batch_size)
        self.batch_size = batch_size
        self.tau = tau
        self.gamma = gamma
        self.num_actions = num_actions
        self.exploration_noise = exploration_noise
        self.ub = action_ub
        self.lb = -self.ub
        self.alg_name = 'ddpg'

        self.grad_steps = 0

    def remember(self, state, action, reward, next_state, done):
        self.memory.push((state.flatten(), action.flatten(), reward, next_state.flatten(), done))

    def get_batch(self):
        batch = self.memory.sample()

        states = torch.from_numpy(np.array([s for s, _, _, _, _ in batch])).float()
        actions = torch.from_numpy(np.array([a for _, a, _, _, _ in batch])).float()
        rewards = torch.FloatTensor([[r] for _, _, r, _, _ in batch])
        next_states = torch.from_numpy(np.array([ns for _, _, _, ns, _ in batch])).float()
        dones = torch.FloatTensor([[d] for _, _, _, _, d in batch])

        return states, actions, rewards, next_states, dones

    def act(self, state):
        with torch.no_grad():
            state = torch.FloatTensor(state)
            action = self.actor.forward(state).numpy().flatten()
            action += np.random.normal(0, self.exploration_noise, size=self.num_actions)
            return np.clip(action, self.lb, self.ub)

    def greedy_act(self, state):
        with torch.no_grad():
            state = torch.FloatTensor(state)
            action = self.actor.forward(state).numpy().flatten()
            return np.clip(action, self.lb, self.ub)

    def update_target(self):
        for target_param, param in zip(self.actor_target.parameters(), self.actor.parameters()):
            target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)

        for target_param, param in zip(self.critic_target.parameters(), self.critic.parameters()):
            target_param.data.copy_(self.tau * param.data + (1.0 - self.tau) * target_param.data)

    def experience_replay(self):
        if len(self.memory) < self.batch_size:
            return

        states, actions, rewards, next_states, dones = self.get_batch()

        q_vals = self.critic.forward(states, actions)
        with torch.no_grad():
            next_actions = self.actor_target.forward(next_states)
            next_q_vals = self.critic_target.forward(next_states, next_actions)
            target = rewards + self.gamma * (1 - dones) * next_q_vals
        self.critic_optimiser.zero_grad()
        critic_loss = self.critic_loss_fn(q_vals, target)
        critic_loss.backward()
        self.critic_optimiser.step()

        self.actor_optimiser.zero_grad()
        actor_loss = -self.critic.forward(states, self.actor.forward(states)).mean()
        actor_loss.backward()
        self.actor_optimiser.step()

        self.grad_steps += 1
        self.update_target()

    def save_model(self, optional_path=""):
        actor = copy.deepcopy(self.actor)
        critic = copy.deepcopy(self.critic)
        torch.save(critic.state_dict(), f"{optional_path}_{self.alg_name}_critic")
        torch.save(actor.state_dict(), f"{optional_path}_{self.alg_name}_actor")

    def load_model(self, optional_path=""):
        self.actor.load_state_dict(torch.load(f"{optional_path}_{self.alg_name}_actor"))
        self.critic.load_state_dict(torch.load(f"{optional_path}_{self.alg_name}_critic"))

The class is very similar to the DQN class we made earlier. The major change in `__init__` is that we now provide an `exploration_noise` parameter, which is the standard deviation of a zero-mean normal distribution that we sample from and add to the action as exploratory noise. In the `get_batch` method, note that because our actions are now continuous, they are made into a tensor with the `Float` data type rather than `Long`.

The critic is trained in a similar way to what we have seen before. We perform a forward pass of the current critic to get the estimate of the $Q$-value, and then compute the target value using the observed reward plus the bootstrapped value of the next state-action pair, where the next action is generated by the target actor network. The actor loss is very simple to calculate: as mentioned at the start of the section, we simply evaluate the critic at the current states with actions coming from the actor network. As with PPO, **remember to take the negative so that we have a 'loss' to minimise, since we actually want to maximise the critic values!**
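As a quick sanity check of the two pieces of arithmetic in `experience_replay` and `update_target`, here is a small sketch with made-up numbers. The values of `next_q_vals`, `online_param` and `target_param` are placeholders for what the target critic and the network parameters would hold in practice.

```python
import torch

gamma, tau = 0.99, 0.0025

# Bootstrapped critic target: r + gamma * (1 - done) * Q_target(s', pi_target(s'))
rewards = torch.tensor([[1.0], [2.0]])
dones = torch.tensor([[0.0], [1.0]])           # the second transition was terminal
next_q_vals = torch.tensor([[10.0], [10.0]])   # placeholder target-critic outputs
target = rewards + gamma * (1 - dones) * next_q_vals
print(target)  # first row bootstraps (1 + 0.99 * 10 = 10.9); second row is just the reward

# Soft (Polyak) target update: the target moves a fraction tau towards the online value
online_param = torch.tensor([1.0])
target_param = torch.tensor([0.0])
target_param = tau * online_param + (1.0 - tau) * target_param
print(target_param)  # 0.0025 of the way from 0 towards 1
```

The `(1 - dones)` factor is why the training loop stores `terminated` rather than `done`: a time-limit truncation should still bootstrap from the next state.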

And there we have it, our first continuous control algorithm! Let's see how it performs on the Pendulum environment.

In [6]:
env = gym.make('Pendulum-v1')
test_env = gym.make('Pendulum-v1')
state_dim = env.observation_space.shape[0]
num_actions = env.action_space.shape[0]
action_upper_bound = env.action_space.high.max()
ddpg = DDPG(state_dim, 512, num_actions, action_upper_bound, memory_size=100000, batch_size=256)

while len(ddpg.memory) < 10_000:
    state, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        ddpg.remember(state, action, reward, next_state, terminated)
        state = next_state

episode = 0
ddpg_test_scores = []
while ddpg.grad_steps < 15000:
    state, _ = env.reset()
    done = False
    score = 0
    episode += 1
    while not done:
        action = ddpg.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        ddpg.remember(state, action, reward, next_state, terminated)
        ddpg.experience_replay()
        state = next_state
        score += reward
        if ddpg.grad_steps % 1000 == 0:
            test_score = []
            for test in range(10):
                test_score.append(run_test(ddpg, test_env))
            ddpg_test_scores.append((ddpg.grad_steps, np.mean(test_score)))
            print(f"Grad steps: {ddpg.grad_steps}. Score: {np.mean(test_score)}.")

Grad steps: 1000. Score: -1421.5344259725166.
Grad steps: 2000. Score: -1216.0460907242204.
Grad steps: 3000. Score: -369.42793752268545.
Grad steps: 4000. Score: -175.785608671645.
Grad steps: 5000. Score: -108.39953607617625.
Grad steps: 6000. Score: -193.68439201631.
Grad steps: 7000. Score: -107.26759802689318.
Grad steps: 8000. Score: -151.43578176546862.
Grad steps: 9000. Score: -149.29236506829002.
Grad steps: 10000. Score: -112.66637990201427.
Grad steps: 11000. Score: -124.2061166889783.
Grad steps: 12000. Score: -112.1826415557251.
Grad steps: 13000. Score: -125.03645093694186.
Grad steps: 14000. Score: -96.61743854802293.
Grad steps: 15000. Score: -145.33077361268525.


Now you should have a go at writing a child class for the TD3 algorithm, as previously mentioned, in the cell below. The only changes needed should be any extra hyperparameters in `__init__` and a new `experience_replay` method.

--- 

# Soft Actor-Critic
Soft Actor-Critic (SAC) is a deep reinforcement learning algorithm that uses a maximum-entropy framework to optimise the policy and the value function simultaneously. We learn the policy parameters by *minimising* $\mathbb{E}_{s \sim \mathcal{B}}\left[ \mathbb{E}_{a \sim \pi_\phi}\left[ \alpha \log \pi_\phi(a|s) - Q_\theta(s, a) \right] \right]$, where $\phi$ are the parameters of the policy, $\theta$ are the parameters of the $Q$-function, $\alpha$ is the temperature hyper-parameter which balances out how much we care about maximising the entropy, and $\mathcal{B}$ is our replay buffer. We learn the parameters of the $Q$-function by minimising 
$$\mathbb{E}_{(s, a) \sim \mathcal{B}} \left[ \left(Q_\theta(s, a) - \left(r(s, a) + \gamma \mathbb{E}_{a'\sim \pi_\phi, s' \sim p} \left[Q_{\theta'}(s', a') - \alpha \log \pi_\phi(a'|s')\right] \right)\right)^2\right]\;,$$
where $\theta'$ are the parameters of the target critic network and $p$ is the transition function of the MDP. 

In the original paper, they treat $\alpha$ as a hyperparameter, but in an updated work [(found here)](https://arxiv.org/abs/1812.05905) they provide information on how we can automatically tune it. This is beyond the scope of the lab, but I will provide code on how to do this as part of the implementation!
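To give a feel for what the automatic tuning does, here is a rough sketch of a single update of $\log \alpha$ using the same form of loss that appears in the `SAC` class later in this notebook. Two things are assumptions for illustration: the helper `alpha_step` is made up, and I use plain SGD instead of Adam so the numbers are easy to follow. The negative mean log-probability acts as an estimate of the policy's entropy.

```python
import torch

target_entropy = -1.0   # -num_actions for a one-dimensional action space


def alpha_step(mean_log_prob, lr=0.1):
    """One SGD step on log(alpha) for a batch whose log-probs all equal mean_log_prob."""
    log_alpha = torch.zeros(1, requires_grad=True)
    optimiser = torch.optim.SGD([log_alpha], lr=lr)
    log_probs = torch.full((4, 1), mean_log_prob)
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    optimiser.zero_grad()
    alpha_loss.backward()
    optimiser.step()
    return log_alpha.item()


low_entropy = alpha_step(2.0)    # entropy estimate -2.0 is below the target of -1.0
high_entropy = alpha_step(0.5)   # entropy estimate -0.5 is above the target of -1.0
print(low_entropy, high_entropy)  # log(alpha) goes up in the first case, down in the second
```

So when the policy is less random than the target, $\alpha$ grows and the entropy term matters more; when it is more random than the target, $\alpha$ shrinks.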

You may have also noticed that we are required to evaluate the critic at an action sampled from the distribution we are learning. This is different to the DDPG case, where the network directly outputs the action. Here we are learning the *parameters* of a distribution (e.g. the mean and variance of a Normal distribution), so we need to ensure that the sampled action is differentiable with respect to the network weights (I say weights here to distinguish them from the parameters of the distribution). This is non-trivial to do, and it largely limits us to distributions that are amenable to the reparameterisation trick [(this stack exchange answer summarises things well)](https://stats.stackexchange.com/questions/199605/how-does-the-reparameterization-trick-for-vaes-work-and-why-is-it-important). Luckily for us, we will largely be concerned with learning the parameters of a Normal distribution, which can be reparameterised by noting that the random variable $Y = \mu + \sigma \epsilon$, where $\epsilon \sim N(0, 1)$, has a $N(\mu, \sigma^2)$ distribution. So we just need to learn the mean and standard deviation, sample the noise from a unit Normal distribution, and then we can differentiate (w.r.t. our network weights) through our sampled action.
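Here is a small sketch of the reparameterisation trick in isolation. The leaf tensors `mu` and `log_sd` stand in for the outputs of a policy network (an assumption for illustration); the point is only to show that gradients reach them through the sampled value.

```python
import torch
from torch.distributions.normal import Normal

torch.manual_seed(0)

mu = torch.tensor([0.5], requires_grad=True)
log_sd = torch.tensor([0.0], requires_grad=True)   # sd = exp(0) = 1

# Manual reparameterisation: Y = mu + sd * eps, with eps ~ N(0, 1)
eps = torch.randn(1)
y = mu + log_sd.exp() * eps
y.sum().backward()
print(mu.grad)      # dY/dmu = 1
print(log_sd.grad)  # dY/dlog_sd = sd * eps, which equals eps here because sd = 1

# PyTorch distributions expose the same idea through rsample()
dist = Normal(mu, log_sd.exp())
rsample_has_grad = dist.rsample().requires_grad   # True: still attached to mu and log_sd
sample_has_grad = dist.sample().requires_grad     # False: sample() is detached
print(rsample_has_grad, sample_has_grad)
```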

Now, the final thing to note is that the support of the Normal distribution is the entire real line, whereas our action space for the Pendulum environment is $[-2, 2]$. So, how do we deal with this? We use a transformation! If we learn the parameters of a Normal distribution, then we can squash it into our desired range by using a $\mbox{tanh}$ transformation. That is, if we have $X \sim N(\mu, \sigma^2)$, then the transformed random variable $Y = a \times \mbox{tanh}(X)$ (where in the case of Pendulum $a=2$) will have support in our desired range. You can calculate the pdf of $Y$ analytically, if you like (it is easiest in the univariate case), by using the fact that $f_Y(y) = f_X(x) \left|\frac{dx}{dy}\right|$, evaluated at $x = g^{-1}(y)$ (where $g(x) = a \times \mbox{tanh}(x)$ in this case). You might be relieved to know, though, that PyTorch has built-in functionality for transformations of random variables, so we don't need to worry about implementing this ourselves. We'll show a quick example of this below.

In [7]:
from torch.distributions.normal import Normal
from torch.distributions.transformed_distribution import TransformedDistribution
from torch.distributions.transforms import TanhTransform
from torch.distributions.independent import Independent
from torch.distributions.transforms import AffineTransform

means = torch.FloatTensor([[0, 5, 10, 40]])
sds = torch.ones_like(means)  # Normal takes standard deviations (its scale argument), which must be positive
normal_distribution = Independent(Normal(means, sds), 1)  # we typically will assume independence between the different dimensions of the action space
print(normal_distribution.sample())  # sample from the distribution

transformed_distribution = TransformedDistribution(Independent(Normal(means, sds), 1), [TanhTransform(cache_size=1),
                                                                                          AffineTransform(0, 2, cache_size=1)])
print(transformed_distribution.sample())

tensor([[ 1.4328,  4.3316, 10.5660, 38.5486]])
tensor([[-1.4290,  1.9998,  2.0000,  2.0000]])


The initial distribution was 4 independent Normals with means 0, 5, 10, 40, each with a standard deviation (and so variance) of 1. When we sample from them we get the numbers we would expect, all roughly centred around their means. In the transformed distribution the samples have a maximum value of 2 (and a minimum value of $-2$), which is what we would expect after the transformation. The `TransformedDistribution` class takes in a base distribution, followed by a list of transformations that are applied in the order they are given. We first give the `TanhTransform`, which applies the $\mbox{tanh}$ transformation, followed by an `AffineTransform`, which in this case shifts by 0 and scales by 2. If you're lucky enough to work on problems with an action space that has values in $[-1, 1]^d$, then you will not need the `AffineTransform`.
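If you want to convince yourself that the library is applying the change-of-variables formula from earlier, the following sketch compares a hand-computed log-density with `log_prob` from the transformed distribution in the univariate case. The particular values of `mu`, `sd` and the test points `x` are arbitrary choices for illustration.

```python
import torch
from torch.distributions.normal import Normal
from torch.distributions.transformed_distribution import TransformedDistribution
from torch.distributions.transforms import TanhTransform, AffineTransform

mu, sd, a = torch.tensor(0.3), torch.tensor(0.8), 2.0
base = Normal(mu, sd)
squashed = TransformedDistribution(base, [TanhTransform(), AffineTransform(0.0, a)])

x = torch.tensor([-1.0, 0.0, 0.5, 1.5])   # pre-squash values
y = a * torch.tanh(x)                      # corresponding actions in (-2, 2)

# log f_Y(y) = log f_X(x) - log|dy/dx|, with dy/dx = a * (1 - tanh(x)^2)
manual_log_prob = base.log_prob(x) - torch.log(a * (1 - torch.tanh(x) ** 2))
library_log_prob = squashed.log_prob(y)

log_probs_agree = torch.allclose(manual_log_prob, library_log_prob, atol=1e-4)
print(log_probs_agree)
```

In the multivariate case with `Independent`, the same correction is applied per dimension and then summed.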

Now that we have seen how to transform the distribution, let's get on with writing the code for the algorithm. I will provide most of the code for you, including the automatic tuning of $\alpha$ and where to use it, but I will leave the optimisation of the policy and $Q$ for you to fill in. First, we need the Actor and Critic networks.

In [8]:
class SACActor(nn.Module):
    def __init__(self, input_dim, num_actions, hidden_size=256, sd_min=-20, sd_max=10):
        super(SACActor, self).__init__()
        self.input_layer = nn.Linear(input_dim, hidden_size)
        self.h1 = nn.Linear(hidden_size, hidden_size)
        self.h2 = nn.Linear(hidden_size, hidden_size)
        self.mean_layer = nn.Linear(hidden_size, num_actions)
        self.sd_layer = nn.Linear(hidden_size, num_actions)

        self.log_sd_min = sd_min
        self.log_sd_max = sd_max

    def forward(self, state):
        x = torch.relu(self.input_layer(state))
        x = torch.relu(self.h1(x))
        x = torch.relu(self.h2(x))
        mean = self.mean_layer(x)
        log_sd = self.sd_layer(x)
        log_sd = log_sd.clamp(self.log_sd_min, self.log_sd_max)
        return mean, log_sd

class SACCritic(nn.Module):
    def __init__(self, input_dim, hidden_size, action_dim):
        super(SACCritic, self).__init__()
        self.input_layer = nn.Linear(input_dim + action_dim, hidden_size)
        self.h1 = nn.Linear(hidden_size, hidden_size)
        self.h2 = nn.Linear(hidden_size, hidden_size)
        self.output_layer = nn.Linear(hidden_size, 1)

        self.input_layer2 = nn.Linear(input_dim + action_dim, hidden_size)
        self.h21 = nn.Linear(hidden_size, hidden_size)
        self.h22 = nn.Linear(hidden_size, hidden_size)
        self.output_layer2 = nn.Linear(hidden_size, 1)

    def forward(self, state, action):
        x_ = torch.cat((state, action), dim=1)
        x = torch.relu(self.input_layer(x_))
        x = torch.relu(self.h1(x))
        x = torch.relu(self.h2(x))
        x = self.output_layer(x)

        x1 = torch.relu(self.input_layer2(x_))
        x1 = torch.relu(self.h21(x1))
        x1 = torch.relu(self.h22(x1))
        x1 = self.output_layer2(x1)
        return x, x1

The actor is similar to the PPO actor, but we now have two output layers, one for the mean and the other for the (log) standard deviation. We clamp `log_sd` to a fixed range, which ensures that the standard deviation does not collapse to 0 (no exploration) or grow too large (too much exploration). The critic is similar to that of DDPG, taking in a state and an action, but you'll note that we essentially maintain two critics. This is to try to overcome maximisation bias: in both the policy update and the critic update we take the minimum of the two critics.
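To see what the clamp and the minimum actually do, here is a tiny sketch with made-up tensors in place of network outputs. The bounds are the defaults from `SACActor` above; everything else is a placeholder.

```python
import math
import torch

log_sd_min, log_sd_max = -20.0, 10.0   # the SACActor defaults
raw_log_sd = torch.tensor([-50.0, -1.0, 0.0, 3.0, 40.0])   # made-up network outputs
sd = raw_log_sd.clamp(log_sd_min, log_sd_max).exp()
print(sd)  # the smallest possible sd is exp(-20), about 2e-9; the largest is exp(10), about 22026

# Element-wise minimum of the two critic heads (made-up values)
q1 = torch.tensor([[1.0], [5.0]])
q2 = torch.tensor([[2.0], [3.0]])
min_q = torch.min(q1, q2)
print(min_q)  # rows: 1.0 and 3.0
```

Note how permissive the upper bound is: if you find the policy is too noisy, a smaller `sd_max` is one thing you could experiment with.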

Onto the SAC class!

In [None]:
class SAC:
    def __init__(self, state_dim, num_actions, hidden_size=256, gamma=0.99, batch_size=256, max_memory=100000, tau=0.0025, critic_lr=1e-3, actor_lr=1e-4, action_scale=1):

        self.actor = SACActor(state_dim, num_actions, hidden_size)  # argument order must match SACActor.__init__
        self.actor_optimiser = optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic = SACCritic(state_dim, hidden_size, num_actions)
        self.critic_target = copy.deepcopy(self.critic)
        self.critic_optimiser = optim.Adam(self.critic.parameters(), lr=critic_lr)

        self.alg_name = "SAC"

        self.log_alpha = torch.zeros(1, requires_grad=True, device='cpu')
        self.alpha_optimiser = optim.Adam([self.log_alpha], lr=critic_lr)

        self.memory = ReplayBuffer(max_memory, batch_size)
        self.action_scale = action_scale
        self.gamma = gamma
        self.batch_size = batch_size
        self.critic_loss_function = nn.MSELoss()
        self.target_entropy = -num_actions
        self.tau = tau
        self.grad_steps = 0
        self.actor_loss = deque(maxlen=10000)
        self.critic_loss = deque(maxlen=10000)
        self.alpha_loss = deque(maxlen=10000)
        self.alpha_gradient = []
        self.num_actions = num_actions

    def get_batch(self):
        batch = self.memory.sample()

        states = torch.from_numpy(np.array([s for s, _, _, _, _ in batch])).float()
        actions = torch.from_numpy(np.array([a for _, a, _, _, _ in batch])).float()
        rewards = torch.FloatTensor([[r] for _, _, r, _, _ in batch])
        next_states = torch.from_numpy(np.array([ns for _, _, _, ns, _ in batch])).float()
        dones = torch.FloatTensor([[d] for _, _, _, _, d in batch])

        return states, actions, rewards, next_states, dones

    def remember(self, state, action, reward, next_state, done):
        action = action.flatten()
        self.memory.push((state, action, reward, next_state, done))

    def update_target(self):
        for real, target in zip(self.critic.parameters(), self.critic_target.parameters()):
            target.data.copy_(real.data * self.tau + target.data * (1 - self.tau))

    def experience_replay(self):
        if len(self.memory) < self.batch_size * 10:
            return
        states, old_actions, rewards, next_states, dones = self.get_batch()

        alpha = self.log_alpha.exp().detach()

        # critic loss
        # fill in here

        self.critic_optimiser.zero_grad()
        critic_loss.backward()
        self.critic_optimiser.step()

        # actor loss
        # fill in here

        alpha_loss = -(self.log_alpha * (log_probs + self.target_entropy).detach()).mean()
        self.alpha_loss.append(alpha_loss.item())

        self.actor_optimiser.zero_grad()
        actor_loss.backward()
        self.actor_optimiser.step()

        self.alpha_optimiser.zero_grad()
        alpha_loss.backward()
        self.alpha_optimiser.step()

        self.update_target()
        self.grad_steps += 1

    def get_actions_log_probs(self, states):
        dist = self.get_distribution(states)
        actions = dist.rsample()
        log_probs = dist.log_prob(actions)
        return actions, log_probs.unsqueeze(dim=-1)

    def get_distribution(self, states):
        mean, log_sd = self.actor.forward(states)
        tanh_dist = TransformedDistribution(Independent(Normal(mean, log_sd.exp()), 1), [TanhTransform(cache_size=1),
                                                                                         AffineTransform(0, self.action_scale, cache_size=1)])
        return tanh_dist

    def act(self, state):
        with torch.no_grad():
            state = torch.FloatTensor(state).unsqueeze(dim=0)
            dist = self.get_distribution(state)
            action = dist.rsample()
            return action.numpy().flatten()

    def greedy_act(self, state):
        with torch.no_grad():
            state = torch.FloatTensor(state).unsqueeze(dim=0)
            mean, log_sd = self.actor.forward(state)
            action = torch.tanh(mean) * self.action_scale  # scale to the environment's range, as in get_distribution
            return action.numpy().flatten()

    def save_model(self, optional_path=""):
        actor = copy.deepcopy(self.actor)
        critic = copy.deepcopy(self.critic)
        torch.save(critic.state_dict(), f"{optional_path}_{self.alg_name}_critic")
        torch.save(actor.state_dict(), f"{optional_path}_{self.alg_name}_actor")

    def load_model(self, optional_path=""):
        self.actor.load_state_dict(torch.load(f"{optional_path}_{self.alg_name}_actor"))
        self.critic.load_state_dict(torch.load(f"{optional_path}_{self.alg_name}_critic"))

I've left the critic loss and actor loss for you to complete yourselves (any questions, please ask). You will want to make use of the `get_actions_log_probs` method. It takes in a state (or batch of states), performs a forward pass of the actor to get the parameters of the distribution, and then samples from the distribution, returning the sampled actions and their corresponding log-probabilities. **Note:** in PPO we saw that we can sample from a distribution using `.sample()`. This is fine if we only want the action, but the action returned from this *will not be differentiable*. To obtain a differentiable action, we need to sample using `.rsample()`, which is what we use in the aforementioned method. Some PyTorch distributions will not have `.rsample()` implemented if the reparameterisation trick cannot be used.

Now, run your implementation of SAC with the code below and then compare it to DDPG, like we did with PPO and DQN.

In [None]:
env = gym.make('Pendulum-v1')
test_env = gym.make('Pendulum-v1')
state_dim = env.observation_space.shape[0]
num_actions = env.action_space.shape[0]
action_upper_bound = env.action_space.high.max()
sac = SAC(state_dim, num_actions, 512, action_scale=action_upper_bound, max_memory=100000, batch_size=256)

while len(sac.memory) < 10_000:
    state, _ = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        sac.remember(state, action, reward, next_state, terminated)
        state = next_state

episode = 0
sac_test_scores = []
while sac.grad_steps < 15000:
    state, _ = env.reset()
    done = False
    score = 0
    episode += 1
    while not done:
        action = sac.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        sac.remember(state, action, reward, next_state, terminated)
        sac.experience_replay()
        state = next_state
        score += reward
        if sac.grad_steps % 1000 == 0:
            test_score = []
            for test in range(10):
                test_score.append(run_test(sac, test_env))
            sac_test_scores.append((sac.grad_steps, np.mean(test_score)))
            print(f"Grad steps: {sac.grad_steps}. Score: {np.mean(test_score)}.")

In [None]:
import matplotlib.pyplot as plt

plt.plot([x[0] for x in sac_test_scores], [x[1] for x in sac_test_scores], label='SAC')
plt.plot([x[0] for x in ddpg_test_scores], [x[1] for x in ddpg_test_scores], label='DDPG')
# plt.plot([x[0] for x in TD3_test_scores], [x[1] for x in TD3_test_scores], label='TD3')  # uncomment if you were able to implement TD3
plt.legend()
plt.ylabel("Test Score")
plt.xlabel("Grad Steps")
plt.show()