<a href="https://colab.research.google.com/github/ddcreating/RL_code/blob/main/lab12_SAC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 12: Soft Actor-Critic

Train a Soft Actor-Critic (SAC) agent on HalfCheetah-v4,
track its learning curve, and compare the result with your previous RL algorithm.

We will learn to use the **stable_baselines3** library.


## To start
Run the following code, which solves **the Pendulum problem**.

In [1]:
# Install dependencies
!pip install "stable-baselines3[extra]" "gymnasium[classic_control]"


Collecting stable-baselines3[extra]
  Downloading stable_baselines3-2.7.0-py3-none-any.whl.metadata (4.8 kB)
Downloading stable_baselines3-2.7.0-py3-none-any.whl (187 kB)
Installing collected packages: stable-baselines3
Successfully installed stable-baselines3-2.7.0


The following code trains a full SAC agent (with two critics, entropy tuning, and a replay buffer) out of the box.

In [2]:
import gymnasium as gym
from stable_baselines3 import SAC

# Create environment
env = gym.make("Pendulum-v1")

# Create SAC model
model = SAC("MlpPolicy", env, verbose=1)

# Train the agent with N time steps
N = 4000
model.learn(total_timesteps=N)

Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.48e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 41        |
|    time_elapsed    | 19        |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 23.1      |
|    critic_loss     | 0.255     |
|    ent_coef        | 0.813     |
|    ent_coef_loss   | -0.344    |
|    learning_rate   | 0.0003    |
|    n_updates       | 699       |
----------------------------------


----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.49e+03 |
| time/              |           |
|    episodes        | 8         |
|    fps             | 38        |
|    time_elapsed    | 42        |
|    total_timesteps | 1600      |
| train/             |           |
|    actor_loss      | 47.6      |
|    critic_loss     | 0.16      |
|    ent_coef        | 0.647     |
|    ent_coef_loss   | -0.606    |
|    learning_rate   | 0.0003    |
|    n_updates       | 1499      |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.36e+03 |
| time/              |           |
|    episodes        | 12        |
|    fps             | 37        |
|    time_elapsed    | 64        |
|    total_timesteps | 2400      |
| train/             |           |
|    actor_loss      | 67.4      |
|    critic_loss    

<stable_baselines3.sac.sac.SAC at 0x7f48ede11c70>
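
To make the two critics and the entropy term mentioned above more concrete, here is a minimal pure-Python sketch of the soft Bellman target that SAC-style critics are trained toward. The function name `soft_td_target` and the sample numbers are illustrative assumptions and are not part of stable_baselines3; the library works on batched tensors internally, so treat this as a simplified picture rather than its actual code.

```python
def soft_td_target(reward, done, q1_next, q2_next, next_log_prob,
                   gamma=0.99, ent_coef=0.2):
    """Soft Bellman target for one transition (illustrative helper)."""
    # Clipped double-Q: use the more pessimistic of the two target critics.
    min_q_next = min(q1_next, q2_next)
    # Entropy bonus: subtracting ent_coef * log pi(a'|s') favors stochastic policies.
    soft_value = min_q_next - ent_coef * next_log_prob
    # Do not bootstrap past a terminal state.
    return reward + (1.0 - float(done)) * gamma * soft_value


# Hypothetical numbers for a single transition.
target = soft_td_target(reward=-1.0, done=False, q1_next=-10.0,
                        q2_next=-12.0, next_log_prob=-1.5)
print(target)
```

Taking the minimum of the two critics counteracts value overestimation, and a larger `ent_coef` raises the value assigned to states where the policy remains uncertain.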

## Visualize the result

In [3]:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import gymnasium as gym
from stable_baselines3 import SAC
from IPython.display import HTML

env = gym.make("Pendulum-v1", render_mode="rgb_array")


N_steps = 500

frames = []
obs, _ = env.reset()
for _ in range(N_steps):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    frame = env.render()
    frames.append(frame)
    if terminated or truncated:
        obs, _ = env.reset()

env.close()

# Create matplotlib animation
fig = plt.figure()
img = plt.imshow(frames[0])

def animate(i):
    img.set_data(frames[i])
    return [img]

ani = animation.FuncAnimation(fig, animate, frames=len(frames), interval=30)

plt.close()
HTML(ani.to_html5_video())




## Exercise 1: Soft Actor-Critic on HalfCheetah
**Goal:**
Adapt the existing Soft Actor-Critic (SAC) implementation from the Pendulum-v1 environment to train and evaluate a SAC agent on the more challenging HalfCheetah-v4 task.

### Instructions

1. Adapt your code: Modify your previous SAC implementation so that it runs on the HalfCheetah-v4 environment from MuJoCo.

2. Reward Logger:
Implement a custom RewardLogger callback to record episode rewards during training (you will need to look up how to write a custom callback class for stable_baselines3).
After training, use the logged data to plot the learning curve (episode reward vs. timesteps).

3. Experimentation:
    * Compare the performance of SAC on HalfCheetah-v4 with your previous algorithm.
    * Adjust the entropy temperature parameter (ent_coef or target_entropy) and observe how this affects exploration behavior, convergence speed, and final performance.
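
As background for item 3, the sketch below illustrates how automatic temperature tuning reacts to the policy's entropy. It is a pure-Python simplification under stated assumptions: the helper name `update_log_ent_coef` is hypothetical, the target entropy is assumed to be the negative action dimension (HalfCheetah has 6 actuators), and the log-probabilities are made-up values.

```python
import math


def update_log_ent_coef(log_ent_coef, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on the temperature loss (illustrative helper).

    Loss: -log_ent_coef * mean(log_prob + target_entropy)
    """
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -mean_term  # d(loss) / d(log_ent_coef)
    return log_ent_coef - lr * grad


# Assumed target for a 6-dimensional action space: -action_dim.
target_entropy = -6.0
log_alpha = math.log(0.2)

# Policy too deterministic (entropy estimate -8 < target -6): temperature rises.
too_peaked = update_log_ent_coef(log_alpha, [8.0, 8.0], target_entropy)
# Policy too random (entropy estimate -2 > target -6): temperature falls.
too_random = update_log_ent_coef(log_alpha, [2.0, 2.0], target_entropy)
print(too_peaked > log_alpha, too_random < log_alpha)  # prints: True True
```

With a fixed `ent_coef` this feedback loop is switched off, which is one reason fixed and automatic settings can produce different exploration behavior.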

Answer the following questions:
1. Did SAC reach a higher average reward or converge faster than your previous method?

2. How did changing the entropy temperature affect the performance and stability of learning?

3. Describe any differences you observed in exploration or motion behavior of the agent.


### Deliverables

* Python notebook or script containing:
    * The adapted SAC training code
    * The RewardLogger implementation
    * Learning curve plots for SAC and your previous algorithm
    * Comparison of different entropy temperature settings

* Answers to the questions above

In [4]:
!pip install "stable-baselines3[extra]" "gymnasium[mujoco]"

Collecting mujoco>=2.1.5 (from gymnasium[mujoco])
  Downloading mujoco-3.3.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (41 kB)
Collecting glfw (from mujoco>=2.1.5->gymnasium[mujoco])
  Downloading glfw-2.10.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38.p39.p310.p311.p312.p313-none-manylinux_2_28_x86_64.whl.metadata (5.4 kB)
Downloading mujoco-3.3.7-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.9 MB)
Downloading glfw-2.10.0-py2.py27.py3.py30.py31.py32.py33.py34.py35.py36.py37.py38.p39.p310.p311.p312.p313-none-manylinux_2_28_x86_64.whl (243 kB)
Installing collected pa

In [5]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines3 import SAC
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.env_util import make_vec_env


In [6]:
class RewardLogger(BaseCallback):
    """
    Custom callback: Records the total reward and corresponding time steps for each episode during training.
    """
    def __init__(self, verbose: int = 0):
        super().__init__(verbose)
        self.episode_rewards = []
        self.episode_lengths = []
        self.timesteps = []

    def _on_step(self) -> bool:
        infos = self.locals.get("infos", None)
        if infos is not None:
            for info in infos:
                if "episode" in info:
                    ep_info = info["episode"]
                    self.episode_rewards.append(ep_info["r"])
                    self.episode_lengths.append(ep_info["l"])
                    self.timesteps.append(self.num_timesteps)
                    if self.verbose > 0:
                        print(f"Episode done: R={ep_info['r']:.2f}, len={ep_info['l']}")
        return True


In [None]:
# Create a HalfCheetah environment
env_id = "HalfCheetah-v4"

env = make_vec_env(env_id, n_envs=1)

TOTAL_TIMESTEPS = 300_000  # alternatives: 100_000 or 500_000

logger_default = RewardLogger(verbose=0)

# ent_coef="auto" enables automatic tuning of the entropy temperature
model_default = SAC(
    "MlpPolicy",
    env,
    verbose=1,
    ent_coef="auto",
    tensorboard_log="./sac_halfcheetah_tb_default/"
)

model_default.learn(
    total_timesteps=TOTAL_TIMESTEPS,
    callback=logger_default
)


Using cpu device
Logging to ./sac_halfcheetah_tb_default/SAC_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | -239     |
| time/              |          |
|    episodes        | 4        |
|    fps             | 31       |
|    time_elapsed    | 128      |
|    total_timesteps | 4000     |
| train/             |          |
|    actor_loss      | -31.2    |
|    critic_loss     | 1.64     |
|    ent_coef        | 0.314    |
|    ent_coef_loss   | -10.6    |
|    learning_rate   | 0.0003   |
|    n_updates       | 3899     |
---------------------------------


In [None]:
# Train one model per fixed entropy temperature
ent_coefs = [0.1, 0.2]
results = {}

for ent in ent_coefs:
    print(f"\n==== Train SAC with ent_coef={ent} ====")
    env = make_vec_env(env_id, n_envs=1)

    logger = RewardLogger(verbose=0)
    model = SAC(
        "MlpPolicy",
        env,
        verbose=1,
        ent_coef=ent,   # Fixed entropy temperature
        tensorboard_log=f"./sac_halfcheetah_tb_ent{ent}/"
    )

    model.learn(
        total_timesteps=TOTAL_TIMESTEPS,
        callback=logger
    )

    results[ent] = {
        "logger": logger,
        "model": model
    }


In [None]:
plt.figure(figsize=(8, 5))

# The default automatic temperature setting
plt.plot(
    logger_default.timesteps,
    logger_default.episode_rewards,
    label="SAC (ent_coef='auto')"
)

# Curves with different ent_coef values
for ent, res in results.items():
    lg = res["logger"]
    plt.plot(
        lg.timesteps,
        lg.episode_rewards,
        label=f"SAC (ent_coef={ent})"
    )

plt.xlabel("Timesteps")
plt.ylabel("Episode Reward")
plt.title("SAC on HalfCheetah-v4: Learning Curves")
plt.legend()
plt.grid(True)
plt.show()
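
Raw per-episode rewards are usually noisy, which can make the curves above hard to compare. One option is to smooth them with a trailing moving average before plotting. The helper below is a minimal sketch; the name `moving_average`, the window size, and the sample rewards are assumptions chosen for illustration.

```python
def moving_average(values, window=10):
    """Trailing mean over up to `window` most recent values (illustrative helper)."""
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Hypothetical episode rewards standing in for logger_default.episode_rewards.
rewards = [-200.0, -100.0, 0.0, 100.0, 200.0]
print(moving_average(rewards, window=2))  # [-200.0, -150.0, -50.0, 50.0, 150.0]
```

Because the output has the same length as the input, it can be plotted directly against the logged timesteps, for example `plt.plot(logger_default.timesteps, moving_average(logger_default.episode_rewards))`.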


In [None]:
# Evaluate the model trained with automatic temperature tuning.
eval_env = gym.make(env_id, render_mode=None)  # Set render_mode="human" to watch the agent.
obs, info = eval_env.reset(seed=42)

episode_reward = 0.0
for step in range(1000):
    action, _ = model_default.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    episode_reward += reward
    if terminated or truncated:
        break

eval_env.close()
print("Eval episode reward (deterministic policy):", episode_reward)
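
A single evaluation episode gives a noisy estimate of performance, so averaging over several episodes is often more informative when comparing settings. The sketch below assumes an environment that follows the Gymnasium `reset`/`step` interface; the function `evaluate_episodes` and the stub class `ConstantRewardEnv` are hypothetical names introduced only so the example runs without MuJoCo. stable_baselines3 also ships an `evaluate_policy` helper in `stable_baselines3.common.evaluation` that serves a similar purpose.

```python
import statistics


def evaluate_episodes(predict, env, n_episodes=5, seed=0):
    """Return (mean, std) of episode returns over several episodes."""
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)  # a different seed per episode
        total, done = 0.0, False
        while not done:
            action = predict(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    return statistics.mean(returns), statistics.pstdev(returns)


class ConstantRewardEnv:
    """Hypothetical stand-in for a Gymnasium env: reward 1.0 per step, 3 steps."""

    def reset(self, seed=None):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        # obs, reward, terminated, truncated, info
        return 0.0, 1.0, False, self.t >= 3, {}


mean_ret, std_ret = evaluate_episodes(lambda obs: 0, ConstantRewardEnv(), n_episodes=4)
print(mean_ret, std_ret)  # 3.0 0.0
```

With the objects from this notebook, the call could look like `evaluate_episodes(lambda o: model_default.predict(o, deterministic=True)[0], gym.make(env_id))`.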
