In this practical, we implement the Soft Actor-Critic (SAC) algorithm to train an agent in the Inverted Pendulum environment. SAC is an off-policy reinforcement learning method designed for continuous action spaces. It maximizes both the expected reward and the entropy of the policy, which encourages exploration and tends to make learning more stable.

The Inverted Pendulum is a classic control problem where the goal is to balance a pole on a moving cart using continuous control inputs. You will complete key functions in the SAC implementation, gaining hands-on experience with policy updates, Q-learning, and entropy-based exploration.

For more details on SAC, refer to:
[🔗 Spinning Up: SAC](https://spinningup.openai.com/en/latest/algorithms/sac.html)

# Useful installs and imports

In [None]:
pip install gymnasium[mujoco]

In [None]:
pip install stable_baselines3

In [None]:
from tqdm import tqdm
import os
import base64
import matplotlib.pyplot as plt
from IPython.display import HTML
from google.colab import output
import matplotlib.animation as animation
import gymnasium as gym
import cv2
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Useful function

In [None]:
def make_env(env_id, seed, idx):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        env.action_space.seed(seed)
        return env

    return thunk

# Defining the actor-critic models

The `SoftQNetwork` class is a neural network that estimates the Q-value (action-value) function in SAC. It takes both the state `x` and the action `a` as input and passes them through three fully connected layers to produce a single scalar Q-value. In the `SoftQNetwork` class, complete the `forward` function to compute the Q-value for a given state-action pair. Use ReLU as the activation function.

In [None]:
class SoftQNetwork(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.fc1 = nn.Linear(
            np.array(env.single_observation_space.shape).prod() + np.prod(env.single_action_space.shape),
            256,
        )
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 1)

    def forward(self, x, a):
        <YOUR CODE HERE>
        return x
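
The width of `fc1` is the observation size plus the action size, which suggests that state and action are joined into one vector per transition. Below is a minimal sketch of that shape arithmetic only, with made-up sizes (4 observation features, 1 action dimension, batch of 8); it is not the full `forward` method.

```python
import torch

# Hypothetical sizes for illustration: 8 transitions, 4 observation
# features, 1 action dimension.
obs = torch.randn(8, 4)
act = torch.randn(8, 1)

# Joining along the feature axis gives one row per transition whose width
# is obs_dim + act_dim, the input width that fc1 expects.
q_input = torch.cat([obs, act], dim=1)
print(q_input.shape)
```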

In many reinforcement learning environments, the **action space** is bounded, meaning that the actions the agent can take are constrained within a specific range, such as between -1 and 1. However, the **actor network** may output actions that are not within this range, which can cause problems when interacting with the environment.

To handle this, we use **rescaling**: the unconstrained output of the actor network (which can range from negative to positive infinity) is first squashed into [-1, 1] with `tanh`, then mapped linearly onto the environment's valid action range.

How rescaling works:

- **`action_scale`**: Half the width of the action range, `(high - low) / 2`. The squashed output is multiplied by this factor to stretch it to the size of the environment's action space.
- **`action_bias`**: The midpoint of the action range, `(high + low) / 2`. Adding it centers the scaled actions between the action bounds.
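
The rescaling described above can be sketched on a few raw values. The bounds `low` and `high` below are hypothetical and chosen only for illustration; in the notebook they come from `env.single_action_space`.

```python
import torch

# Hypothetical bounds for a one-dimensional action space.
low = torch.tensor([-3.0])
high = torch.tensor([3.0])

action_scale = (high - low) / 2.0  # half-width of the valid range
action_bias = (high + low) / 2.0   # midpoint of the valid range

# Unbounded values such as a network could produce, including extremes.
raw = torch.tensor([[-100.0], [-1.0], [0.0], [1.0], [100.0]])

# tanh squashes into [-1, 1]; scale and bias map that onto [low, high].
actions = torch.tanh(raw) * action_scale + action_bias
print(actions.squeeze(1))
```

However large the raw value, the resulting action stays inside the bounds, and a raw value of zero lands on the midpoint of the range.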



In the `Actor` class, complete the `forward` and `get_action` methods.


In [None]:
LOG_STD_MAX = 2
LOG_STD_MIN = -5


class Actor(nn.Module):
    def __init__(self, env):
        super().__init__()
        self.fc1 = nn.Linear(np.array(env.single_observation_space.shape).prod(), 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc_mean = nn.Linear(256, np.prod(env.single_action_space.shape))
        self.fc_logstd = nn.Linear(256, np.prod(env.single_action_space.shape))
        # action rescaling
        self.register_buffer(
            "action_scale",
            torch.tensor(
                (env.single_action_space.high - env.single_action_space.low) / 2.0,
                dtype=torch.float32,
            ),
        )
        self.register_buffer(
            "action_bias",
            torch.tensor(
                (env.single_action_space.high + env.single_action_space.low) / 2.0,
                dtype=torch.float32,
            ),
        )

    def forward(self, x):
        <YOUR CODE HERE>
        log_std = LOG_STD_MIN + 0.5 * (LOG_STD_MAX - LOG_STD_MIN) * (log_std + 1)  # From SpinUp

        return mean, log_std

    def get_action(self, x):
        mean, log_std = <YOUR CODE HERE>
        std = <YOUR CODE HERE>
        normal = torch.distributions.Normal(mean, std)
        x_t = <YOUR CODE HERE> # Sample x_t for the normal distribution N(mean,std)
        y_t = torch.tanh(x_t)
        action = y_t * self.action_scale + self.action_bias
        log_prob = normal.log_prob(x_t)
        # Enforcing Action Bound on log probability and mean
        log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + 1e-6)
        log_prob = log_prob.sum(1, keepdim=True)
        mean = torch.tanh(mean) * self.action_scale + self.action_bias
        return action, log_prob, mean
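
The affine formula applied to `log_std` in `forward` only yields values between `LOG_STD_MIN` and `LOG_STD_MAX` if its input lies in [-1, 1]. The sketch below assumes, following the Spinning Up convention cited in the code comment, that the raw log-std has already been squashed with `tanh`; the input values are made up.

```python
import torch

LOG_STD_MAX = 2
LOG_STD_MIN = -5

# Assumption: values already squashed into [-1, 1] (for example by tanh).
squashed = torch.tensor([-1.0, 0.0, 1.0])

# Linear map from [-1, 1] onto [LOG_STD_MIN, LOG_STD_MAX].
log_std = LOG_STD_MIN + 0.5 * (LOG_STD_MAX - LOG_STD_MIN) * (squashed + 1)
std = log_std.exp()  # standard deviation, always positive

print(log_std)
print(std)
```

The endpoints -1 and 1 map to -5 and 2, and 0 maps to the middle of the interval, -1.5.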

# Training the agent

In [None]:
# Training hyperparameters
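total_timesteps = 30000
num_envs = 1  # assumed value: not defined elsewhere in the notebook; the loops below treat the vector env as a single environment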
buffer_size = int(1e6)
gamma = 0.99
tau = 0.005
learning_starts = 5000
batch_size = 256
policy_lr = 3e-4
q_lr = 1e-3
policy_frequency = 2
target_network_frequency = 1
alpha = 0.2
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

**Entropy Adjustment in SAC**

In **Soft Actor-Critic (SAC)**, entropy encourages exploration, and its weight is adjusted during training. The **target entropy** is set to `target_entropy = -torch.prod(torch.Tensor(envs.single_action_space.shape).to(device)).item()`, the negative of the number of action dimensions, which serves as a lower bound on the average entropy the policy should maintain. The temperature parameter `alpha` controls the trade-off between maximizing reward and maintaining entropy. Instead of keeping the fixed value set in the hyperparameters, **SAC learns alpha**: `log_alpha = torch.zeros(1, requires_grad=True, device=device)` creates a trainable parameter, and `alpha = log_alpha.exp().item()` recovers the temperature, which is positive by construction. Finally, `a_optimizer = optim.Adam([log_alpha], lr=q_lr)` sets up an Adam optimizer that updates `log_alpha` during training, so the agent tunes its own level of exploration.


In [None]:
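# `envs` must exist before its spaces can be read; the training cell further down rebuilds it before the loop.
envs = gym.vector.SyncVectorEnv([make_env("InvertedPendulum-v4", i, i) for i in range(num_envs)])
envs.single_observation_space.dtype = np.float32
target_entropy = -torch.prod(torch.Tensor(envs.single_action_space.shape).to(device)).item()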
log_alpha = torch.zeros(1, requires_grad=True, device=device)
alpha = log_alpha.exp().item()
a_optimizer = optim.Adam([log_alpha], lr=q_lr)
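
To see in which direction the temperature moves, here is a minimal sketch of a single temperature update on made-up data. The constant `log_pi` values and the one-dimensional target entropy are assumptions for illustration, and the three optimizer calls are the usual PyTorch pattern rather than necessarily the notebook's intended solution.

```python
import torch
import torch.optim as optim

target_entropy = -1.0  # assumed: one action dimension
log_alpha = torch.zeros(1, requires_grad=True)
a_optimizer = optim.Adam([log_alpha], lr=1e-3)

# Made-up log-probabilities: the entropy estimate -log_pi = -2.0 is below
# the target of -1.0, so the policy is "too deterministic".
log_pi = torch.full((256, 1), 2.0)

alpha_before = log_alpha.exp().item()
alpha_loss = (-log_alpha.exp() * (log_pi + target_entropy)).mean()

a_optimizer.zero_grad()
alpha_loss.backward()
a_optimizer.step()

alpha_after = log_alpha.exp().item()
print(alpha_before, alpha_after)
```

When the policy's entropy is below the target, the step increases `alpha`, which puts more weight on the entropy term; when the entropy is above the target, the sign flips and `alpha` decreases.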

We now create the agent's actor network, its two critic networks with their target copies, and the corresponding optimizers.

In [None]:
actor = Actor(envs).to(device)
qf1 = SoftQNetwork(envs).to(device)
qf2 = SoftQNetwork(envs).to(device)
qf1_target = SoftQNetwork(envs).to(device)
qf2_target = SoftQNetwork(envs).to(device)
qf1_target.load_state_dict(qf1.state_dict())
qf2_target.load_state_dict(qf2.state_dict())
q_optimizer = optim.Adam(list(qf1.parameters()) + list(qf2.parameters()), lr=q_lr)
actor_optimizer = optim.Adam(list(actor.parameters()), lr=policy_lr)

In [None]:
from stable_baselines3.common.buffers import ReplayBuffer

rb = ReplayBuffer(
        buffer_size,
        envs.single_observation_space,
        envs.single_action_space,
        device,
        n_envs=num_envs,
        handle_timeout_termination=False,
    )

In [None]:
# For reproducibility
seed = 0
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True

The code below is a skeleton of the **Soft Actor-Critic (SAC) training loop** for the **Inverted Pendulum** environment. Complete it as follows:

1. **Action Selection:**  
   - Complete the line `actions, _, _ = <YOUR CODE HERE>` to ensure actions are chosen by the **Actor network** after the initial exploration phase.  
   
2. **Target Q-Value Computation:**  
   - Fill in `next_state_actions, next_state_log_pi, _ = <YOUR CODE HERE>` to obtain the next state’s action and log probability.  
   - Implement `qf1_next_target`, `qf2_next_target`, and `min_qf_next_target`, ensuring you **include the entropy term** when computing `next_q_value`.  

3. **Critic Loss Calculation:**  
   - Complete `<YOUR CODE HERE>` to compute `qf1_loss`, `qf2_loss`, and `qf_loss` using **Mean Squared Error (MSE)** between the predicted Q-values and `next_q_value`.  
   - Optimize the critic networks accordingly.  

4. **Policy Update:**  
   - Implement `pi, log_pi, _ = <YOUR CODE HERE>` to sample actions from the policy.  
   - Compute `min_qf_pi` and `actor_loss`, ensuring the **entropy term (α * log_pi)** is included.  

5. **Alpha Update:**  
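   - Complete the code to update `log_alpha` using the **alpha loss** equation:  

     $$J(\alpha) = \mathbb{E}_{a \sim \pi}\left[-\alpha \left(\log \pi(a \mid s) + \bar{\mathcal{H}}\right)\right]$$

     where $\bar{\mathcal{H}}$ is `target_entropy` and $\alpha$ is `log_alpha.exp()`.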

6. **Target Networks Update:**  
   - Explain the purpose of the **target network update step** and how the soft update formula helps stabilize training.
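
For reference, steps 2 to 4 correspond to the standard SAC expressions below. They are written in the notation of the Spinning Up page linked in the introduction, not taken from this notebook: $d$ is the termination flag, $\bar{Q}_i$ are the target critics, and $a'$ and $\tilde{a}$ are actions freshly sampled from the current policy.

$$y = r + \gamma (1 - d) \left( \min_{i=1,2} \bar{Q}_i(s', a') - \alpha \log \pi(a' \mid s') \right), \qquad a' \sim \pi(\cdot \mid s')$$

$$L_Q = \sum_{i=1,2} \mathbb{E}\left[ \left( Q_i(s, a) - y \right)^2 \right]$$

$$L_\pi = \mathbb{E}\left[ \alpha \log \pi(\tilde{a} \mid s) - \min_{i=1,2} Q_i(s, \tilde{a}) \right], \qquad \tilde{a} \sim \pi(\cdot \mid s)$$

The term in parentheses in $y$ plays the role of `min_qf_next_target` in the code, and $L_\pi$ matches the `actor_loss` line.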


In [None]:
envs = gym.vector.SyncVectorEnv(
        [make_env("InvertedPendulum-v4", i, i) for i in range(num_envs)]
    )
envs.single_observation_space.dtype = np.float32

episodic_returns = []
episode_steps = []

obs, _ = envs.reset(seed=seed)

for global_step in range(total_timesteps):
    if global_step < learning_starts:
        actions = np.array([envs.single_action_space.sample() for _ in range(envs.num_envs)])
    else:
        actions, _, _ = <YOUR CODE HERE> # Use the actions selected by the Actor
        actions = actions.detach().cpu().numpy()

    next_obs, rewards, terminations, truncations, infos = envs.step(actions)

    if "episode" in infos:
      episodic_returns.append(infos['episode']['r'])
      episode_steps.append(global_step)
      print(f"global_step={global_step}, episodic_return={infos['episode']['r']}")

    # Add transition to replay buffer
    rb.add(obs, next_obs.copy(), actions, rewards, terminations, infos)

    # TRY NOT TO MODIFY: CRUCIAL step easy to overlook
    obs = next_obs

    # ALGO LOGIC: training.
    if global_step > learning_starts:
        data = rb.sample(batch_size)
        with torch.no_grad():
            next_state_actions, next_state_log_pi, _ = <YOUR CODE HERE>
            qf1_next_target = <YOUR CODE HERE>
            qf2_next_target = <YOUR CODE HERE>
            min_qf_next_target = <YOUR CODE HERE> # DO NOT FORGET THE ENTROPY TERM
            next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * gamma * (min_qf_next_target).view(-1)

        <YOUR CODE HERE>
        qf1_loss = <YOUR CODE HERE>
        qf2_loss = <YOUR CODE HERE>
        qf_loss = qf1_loss + qf2_loss

        # optimize the critics
        <YOUR CODE HERE>

        if global_step % policy_frequency == 0:
            for _ in range(policy_frequency):
                pi, log_pi, _ = <YOUR CODE HERE>
                qf1_pi = <YOUR CODE HERE>
                qf2_pi = <YOUR CODE HERE>
                min_qf_pi = <YOUR CODE HERE>
                actor_loss = ((alpha * log_pi) - min_qf_pi).mean()

                actor_optimizer.zero_grad()
                actor_loss.backward()
                actor_optimizer.step()

                # Update alpha
                with torch.no_grad():
                    _, log_pi, _ = actor.get_action(data.observations)
                alpha_loss = (-log_alpha.exp() * (log_pi + target_entropy)).mean()

                <YOUR CODE HERE>
                alpha = log_alpha.exp().item()

        # update the target networks
        if global_step % target_network_frequency == 0:
            for param, target_param in zip(qf1.parameters(), qf1_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
            for param, target_param in zip(qf2.parameters(), qf2_target.parameters()):
                target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
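
The soft update at the end of the loop can be tried in isolation. This is a small sketch on two hypothetical `nn.Linear` layers standing in for a critic and its target; the artificial shift of +1.0 replaces a real gradient step.

```python
import torch
import torch.nn as nn

tau = 0.005
torch.manual_seed(0)

# Stand-ins for a critic and its target copy (hypothetical sizes).
online = nn.Linear(3, 1)
target = nn.Linear(3, 1)
target.load_state_dict(online.state_dict())

# Pretend training moved every online weight by +1.0.
with torch.no_grad():
    for p in online.parameters():
        p.add_(1.0)

old_target_w = target.weight.detach().clone()

# Same soft update formula as in the training loop.
for param, target_param in zip(online.parameters(), target.parameters()):
    target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

shift = (target.weight - old_target_w).abs().max().item()
print(shift)
```

The target weights move by only a fraction `tau` of the change in the online weights, so the targets used in the critic loss drift slowly.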

The code snippet below generates a plot to visualize the agent’s performance during training. Analyze the plot after running your training and provide a short interpretation of the results.

In [None]:
plt.plot(episodic_returns)
plt.xlabel("Training episode")
plt.ylabel("Episodic return")
plt.show()
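
Raw episodic returns are often noisy, so a smoothed curve can make the trend easier to read. The helper below is a hypothetical addition (`moving_average` is not part of the notebook) and is shown on made-up numbers.

```python
import numpy as np


def moving_average(values, window=10):
    """Hypothetical helper: mean of `values` over a sliding window."""
    # ravel() also flattens entries stored as length-1 arrays.
    values = np.asarray(values, dtype=float).ravel()
    if len(values) < window:
        return values  # not enough points to smooth
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")


returns = [1, 2, 3, 4, 5, 6]  # made-up returns
smoothed = moving_average(returns, window=3)
print(smoothed)
```

After training, the same helper could be applied to the recorded returns, for example `plt.plot(moving_average(episodic_returns, window=20))`.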

# Evaluating the actor

The function `test_actor` evaluates a trained SAC policy by running it in the **Inverted Pendulum** environment for `num_episodes` episodes and recording the episodic returns. Complete the missing code, run the function with a trained actor, and interpret the results.


In [None]:
def test_actor(actor, num_episodes=10):
  episodic_returns = np.zeros(num_episodes)

  for i in range(num_episodes):
    ep_return = 0  # accumulated reward for the current episode
    envs = gym.vector.SyncVectorEnv(
        [make_env("InvertedPendulum-v4", i, i) for i in range(num_envs)]
    )
    envs.single_observation_space.dtype = np.float32
    obs, _ = envs.reset()
    done = False
    while not done:
      <YOUR CODE HERE>
      done = terminated or truncated

    episodic_returns[i] = ep_return
    print(f"Episode {i+1}: Return = {ep_return}")
  return {"mean return":episodic_returns.mean(), "std return":episodic_returns.std()}

In [None]:
test_actor(actor)