<a href="https://colab.research.google.com/github/aem226/Reinforcement-Learning-Projects/blob/main/lab12_SAC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 12: Soft Actor-Critic

Train a Soft Actor-Critic (SAC) agent on HalfCheetah-v4,
track its learning curve, and compare the result with your previous RL algorithm.

We will learn to use the **stable_baselines3** library.


## To start
Run the following code, which trains a SAC agent on **the Pendulum problem**.

In [1]:
# Install dependencies
!pip install "stable-baselines3[extra]" "gymnasium[classic_control]"




The following code trains a full SAC agent, with two critics, entropy tuning, and a replay buffer, out of the box:

In [2]:
import gymnasium as gym
from stable_baselines3 import SAC

# Create environment
env = gym.make("Pendulum-v1")

# Create SAC model
model = SAC("MlpPolicy", env, verbose=1)

# Train the agent with N time steps
N = 4000
model.learn(total_timesteps=N)


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.45e+03 |
| time/              |           |
|    episodes        | 4         |
|    fps             | 41        |
|    time_elapsed    | 19        |
|    total_timesteps | 800       |
| train/             |           |
|    actor_loss      | 21.7      |
|    critic_loss     | 0.199     |
|    ent_coef        | 0.812     |
|    ent_coef_loss   | -0.336    |
|    learning_rate   | 0.0003    |
|    n_updates       | 699       |
----------------------------------
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.46e+03 |
| time/              |           |
|    episodes        | 8         |
|    fps             | 38        |
|    time_elapsed    | 41        |
|    total_timesteps | 1600      |
| train/             |           |
|    actor_loss      | 48.1      |
|    critic_loss    

<stable_baselines3.sac.sac.SAC at 0x7f37cad85a00>
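
The `ent_coef` row in the training log is the entropy temperature (usually written as alpha). As a rough illustration of how the two critics and the temperature enter the critic's learning target, the sketch below computes the standard soft Bellman target for a single transition in plain Python. The function name `soft_q_target` and all numbers are made up for illustration; this is not Stable-Baselines3 code.

```python
def soft_q_target(reward, done, q1_next, q2_next, log_prob_next, alpha, gamma=0.99):
    """Soft Bellman target for one transition (illustrative sketch only)."""
    # Clipped double-Q: use the more pessimistic of the two target critics.
    min_q_next = min(q1_next, q2_next)
    # Entropy bonus: subtracting alpha * log pi(a'|s') favours stochastic policies.
    soft_value_next = min_q_next - alpha * log_prob_next
    # Do not bootstrap from terminal states.
    return reward + gamma * (1.0 - float(done)) * soft_value_next


target = soft_q_target(
    reward=-1.0, done=False,
    q1_next=-10.0, q2_next=-12.0,
    log_prob_next=-0.5, alpha=0.2,
)
terminal_target = soft_q_target(
    reward=-1.0, done=True,
    q1_next=-10.0, q2_next=-12.0,
    log_prob_next=-0.5, alpha=0.2,
)
print(target, terminal_target)
```

A larger `alpha` gives more weight to the entropy term, which is why the temperature controls how strongly the agent is pushed to keep exploring.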

## Visualize the result

In [3]:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import gymnasium as gym
from stable_baselines3 import SAC
from IPython.display import HTML

env = gym.make("Pendulum-v1", render_mode="rgb_array")


N_steps = 500

frames = []
obs, _ = env.reset()
for _ in range(N_steps):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = env.step(action)
    frame = env.render()
    frames.append(frame)
    if done or truncated:
        obs, _ = env.reset()

env.close()

# Create matplotlib animation
fig = plt.figure()
img = plt.imshow(frames[0])

def animate(i):
    img.set_data(frames[i])
    return [img]

ani = animation.FuncAnimation(fig, animate, frames=len(frames), interval=30)

plt.close()
HTML(ani.to_html5_video())




## Exercise 1: Soft Actor-Critic on HalfCheetah
**Goal:**
Adapt the existing Soft Actor-Critic (SAC) implementation from the Pendulum-v1 environment to train and evaluate a SAC agent on the more challenging HalfCheetah-v4 task.

### Instructions

1. Adapt your code: Modify your previous SAC implementation so that it runs on the HalfCheetah-v4 environment from MuJoCo.

2. Reward Logger:
Implement a custom `RewardLogger` callback to record episode rewards during training (you will need to look up how to write a custom callback class).
After training, use the logged data to plot the learning curve (episode reward vs. timesteps).

3. Experimentation:
    * Compare the performance of SAC on HalfCheetah-v4 with your previous algorithm.
    * Adjust the entropy temperature parameter (`ent_coef` or `target_entropy`) and observe how this affects exploration behavior, convergence speed, and final performance.

Answer the following questions:
1. Did SAC reach a higher average reward or converge faster than your previous method?

2. How did changing the entropy temperature affect the performance and stability of learning?

3. Describe any differences you observed in exploration or motion behavior of the agent.


### Deliverables

* Python notebook or script containing:
    * The adapted SAC training code
    * The RewardLogger implementation
    * Learning curve plots for SAC and your previous algorithm
    * Comparison of different entropy temperature settings

* Answers to the questions above

In [4]:
!pip install "gymnasium[mujoco]"



In [6]:
# Track each episode's total reward and the timestep at which the episode ended

from stable_baselines3.common.callbacks import BaseCallback

class RewardLogger(BaseCallback):
    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_rewards = []
        self.episode_timesteps = []

    def _on_step(self) -> bool:
        infos = self.locals.get("infos", [])
        for info in infos:
            if "episode" in info:
                self.episode_rewards.append(info["episode"]["r"])
                self.episode_timesteps.append(self.num_timesteps)
        return True
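
The callback relies on the `Monitor` wrapper, which adds an `"episode"` entry to `info` on the step that ends an episode, with the episode return under `"r"` and its length under `"l"`. The sketch below replays the same logic on hand-written `info` dictionaries so it can be checked without training. `collect_episode_returns` and the sample values are hypothetical.

```python
def collect_episode_returns(info_stream):
    """Mirror the RewardLogger._on_step logic on a list of per-step info dicts."""
    rewards, timesteps = [], []
    for step, info in enumerate(info_stream, start=1):
        if "episode" in info:  # present only on the final step of an episode
            rewards.append(info["episode"]["r"])
            timesteps.append(step)
    return rewards, timesteps


# Three ordinary steps, an episode end, one more step, then another episode end.
infos = [
    {}, {}, {},
    {"episode": {"r": -235.0, "l": 4}},
    {},
    {"episode": {"r": -120.5, "l": 2}},
]
rewards, timesteps = collect_episode_returns(infos)
print(rewards, timesteps)
```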


In [None]:
# Train SAC on HalfCheetah

import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("HalfCheetah-v4")

sac_logger = RewardLogger()

model = SAC(
    "MlpPolicy",
    env,
    verbose=1,
    ent_coef="auto",
)

total_timesteps = 300_000
model.learn(total_timesteps=total_timesteps, callback=sac_logger)


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | -235     |
| time/              |          |
|    episodes        | 4        |
|    fps             | 35       |
|    time_elapsed    | 111      |
|    total_timesteps | 4000     |
| train/             |          |
|    actor_loss      | -31.6    |
|    critic_loss     | 1.05     |
|    ent_coef        | 0.313    |
|    ent_coef_loss   | -11.2    |
|    learning_rate   | 0.0003   |
|    n_updates       | 3899     |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | -235     |
| time/              |          |
|    episodes        | 8        |
|    fps             | 35       |
|    time_elapsed    | 227      |
|    total_timesteps | 8000     |
| train/             |

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(7,4))
plt.plot(sac_logger.episode_timesteps, sac_logger.episode_rewards)
plt.xlabel("Timestep")
plt.ylabel("Episode Reward")
plt.title("SAC on HalfCheetah-v4 — Learning Curve")
plt.grid(True)
plt.show()
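
Raw episode rewards are noisy, which can make it hard to judge convergence speed by eye. One option, sketched below under the assumption that a simple trailing mean is adequate, is to smooth the curve before plotting. The helper `moving_average` and its default window are illustrative choices.

```python
def moving_average(values, window=10):
    """Trailing mean over at most `window` recent values; same length as the input."""
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


demo = moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=2)
print(demo)
```

Because the output has the same length as the input, it can be plotted against the logged timesteps directly, for example `plt.plot(sac_logger.episode_timesteps, moving_average(sac_logger.episode_rewards))`.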


In [None]:
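def run_sac(ent_coef_setting, steps=200_000):
    """Train SAC on HalfCheetah-v4 with the given ent_coef and return the reward log."""
    env = gym.make("HalfCheetah-v4")
    logger = RewardLogger()
    model = SAC("MlpPolicy", env, verbose=0, ent_coef=ent_coef_setting)
    model.learn(total_timesteps=steps, callback=logger)
    env.close()  # release the simulator once training is done
    return logger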

log_low  = run_sac(0.01)
log_auto = run_sac("auto")
log_high = run_sac(0.5)

plt.figure(figsize=(8,5))

plt.plot(log_low.episode_timesteps, log_low.episode_rewards, label="ent_coef=0.01 (low)")
plt.plot(log_auto.episode_timesteps, log_auto.episode_rewards, label="ent_coef=auto (default)")
plt.plot(log_high.episode_timesteps, log_high.episode_rewards, label="ent_coef=0.5 (high)")

plt.xlabel("Timestep")
plt.ylabel("Episode Reward")
plt.title("Entropy Temperature Comparison — SAC on HalfCheetah")
plt.legend()
plt.grid(True)
plt.show()


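# Compare SAC with the previous algorithm.
# NOTE: prev_timesteps and prev_rewards are not defined in this notebook.
# Load them from the logs of your previous algorithm before running this part.
plt.figure(figsize=(7,4))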
plt.plot(prev_timesteps, prev_rewards, label="Previous Algorithm")
plt.plot(sac_logger.episode_timesteps, sac_logger.episode_rewards, label="SAC")
plt.xlabel("Timesteps")
plt.ylabel("Episode Reward")
plt.title("Comparison: SAC vs Previous RL Algorithm")
plt.legend()
plt.grid(True)
plt.show()
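
To back the questions on final performance with numbers rather than plots alone, one possibility is to summarize the last few logged episodes of each run. The sketch below is a minimal helper under that assumption. The name `final_performance`, the choice of `last_k`, and the sample rewards are illustrative.

```python
from statistics import mean, stdev


def final_performance(episode_rewards, last_k=10):
    """Mean and sample standard deviation of the last `last_k` episode rewards."""
    tail = [float(r) for r in episode_rewards][-last_k:]
    if not tail:
        raise ValueError("no episodes logged")
    spread = stdev(tail) if len(tail) > 1 else 0.0
    return mean(tail), spread


avg, spread = final_performance([-200.0, 100.0, 300.0, 500.0], last_k=2)
print(avg, spread)
```

Applied to the logs above, this would be called as, for example, `final_performance(log_auto.episode_rewards)` for each temperature setting.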
