<a href="https://colab.research.google.com/github/ZahraAlharz/Oxford-AI-Summer-School/blob/main/Humanoid_Walker_SAC_HW_Oxford.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!apt install -y python3-opengl
!apt install -y ffmpeg
!apt install -y xvfb
!pip3 install pyvirtualdisplay

In [None]:
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1400, 900))
display.start()

<pyvirtualdisplay.display.Display at 0x7c83bb431ed0>

In [None]:
%pip install gymnasium[mujoco]



# Task

We will use the Soft Actor-Critic (SAC) algorithm to train an agent in the **Walker2d** environment.

You can implement SAC yourself or use the Stable-Baselines3 implementation.

The Walker2d environment is a two-legged planar robot, and the agent's actions are torques applied to its joints. The goal is to make the robot walk forward.

You can read more about the actions, observations, and rewards [here](https://gymnasium.farama.org/environments/mujoco/walker2d/).

![Walker Image](https://gymnasium.farama.org/_images/walker2d.gif)
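
As a rough illustration of the reward described on the linked page, each step combines a bonus for staying alive, a reward for forward velocity, and a penalty on large actions. The sketch below assumes the default weights listed in the Gymnasium documentation (`healthy_reward=1.0`, `forward_reward_weight=1.0`, `ctrl_cost_weight=1e-3`) and a timestep of `dt=0.008`. The function `walker2d_step_reward` is a hypothetical helper written for this example, not part of Gymnasium.

```python
def walker2d_step_reward(x_before, x_after, action, dt=0.008,
                         healthy_reward=1.0, forward_reward_weight=1.0,
                         ctrl_cost_weight=1e-3):
    """Approximate per-step Walker2d reward (assumed default weights)."""
    # Reward for forward velocity of the torso along the x axis
    forward_reward = forward_reward_weight * (x_after - x_before) / dt
    # Penalty that grows with the squared magnitude of the joint torques
    ctrl_cost = ctrl_cost_weight * sum(a * a for a in action)
    # Constant bonus for each step the episode is still running
    return healthy_reward + forward_reward - ctrl_cost


# Hypothetical step: the torso moves 0.008 m forward with moderate torques
example_reward = walker2d_step_reward(0.0, 0.008, [0.5] * 6)
print(example_reward)
```

Under these assumptions, standing still with zero torque still earns the alive bonus each step, which is why an agent can collect a sizeable return without walking well.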


In [None]:
!pip install stable_baselines3[extra]



In [None]:
import gymnasium as gym
import numpy as np

from stable_baselines3 import SAC
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.noise import NormalActionNoise

from IPython.display import clear_output

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

In [None]:
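# Cap each episode at 2000 steps (the Walker2d-v4 default is 1000)
make_env = lambda: gym.make('Walker2d-v4', render_mode='rgb_array', max_episode_steps=2000)
# Run 4 copies of the environment sequentially in a single process
env = DummyVecEnv([make_env for _ in range(4)])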

In [None]:
# Gaussian exploration noise added to the actions:
# zero mean and a standard deviation of 0.2 for every action dimension
num_actions = env.action_space.shape[0]
action_mean = np.zeros(num_actions)
action_std = np.full(num_actions, 0.2)

print(action_mean)
print(action_std)
noise = NormalActionNoise(action_mean, action_std)

[0. 0. 0. 0. 0. 0.]
[0.2 0.2 0.2 0.2 0.2 0.2]
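
To see what this noise does during exploration, the sketch below draws one zero-mean Gaussian sample per action dimension, adds it to an action, and clips the result to the action bounds. It uses only the standard library and illustrates the idea; it is not Stable-Baselines3's internal code. The helper `add_gaussian_noise` and the example action are assumptions made for this illustration; the `[-1, 1]` bounds match the Walker2d action space.

```python
import random


def add_gaussian_noise(action, std=0.2, low=-1.0, high=1.0, rng=random):
    """Add independent N(0, std^2) noise to each action and clip to bounds."""
    noisy = [a + rng.gauss(0.0, std) for a in action]
    return [min(high, max(low, a)) for a in noisy]


rng = random.Random(0)  # fixed seed so the example is reproducible
policy_action = [0.0, 0.5, -0.5, 0.9, -0.9, 1.0]  # hypothetical policy output
noisy_action = add_gaussian_noise(policy_action, rng=rng)
print(noisy_action)
```

A larger standard deviation explores more aggressively but pushes more actions onto the clipping bounds, which is worth keeping in mind when tuning the value.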


In [None]:
model = SAC("MlpPolicy",
            env,
            verbose=1,
            action_noise=noise,  # noise for exploration
            learning_rate=1e-3)

Using cuda device


In [None]:
print("Action Shape:", env.action_space.shape)
print("Observation Shape:", env.observation_space.shape)
#print("Policy Output Shape:", model.predict(observation)[0].shape) # Assuming a single observation

Action Shape: (6,)
Observation Shape: (17,)


In [None]:
model.learn(total_timesteps=int(5e5), progress_bar=True)

Output()

Streaming output truncated to the last 5000 lines.
|    n_updates       | 10162    |
---------------------------------
---------------------------------
| time/              |          |
|    episodes        | 252      |
|    fps             | 242      |
|    time_elapsed    | 171      |
|    total_timesteps | 41552    |
| train/             |          |
|    actor_loss      | -67.7    |
|    critic_loss     | 7.78     |
|    ent_coef        | 0.0292   |
|    ent_coef_loss   | 0.552    |
|    learning_rate   | 0.001    |
|    n_updates       | 10362    |
---------------------------------
---------------------------------
| time/              |          |
|    episodes        | 256      |
|    fps             | 242      |
|    time_elapsed    | 174      |
|    total_timesteps | 42220    |
| train/             |          |
|    actor_loss      | -71.2    |
|    critic_loss     | 9.99     |
|    ent_coef        | 0.0304   |
|    ent_coef_loss   | -0.684   |
|    learning_rat

<stable_baselines3.sac.sac.SAC at 0x7c82462b53f0>
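
The logged numbers can be sanity-checked with a little arithmetic. The sketch below takes the values from the log above (242 fps, 41552 timesteps, 4 environments) to estimate the wall-clock time of the full 5e5-step run. It also assumes SAC's default of roughly one gradient update per vectorized environment step to approximate `n_updates`. This is a rough estimate, not an exact accounting of what Stable-Baselines3 does.

```python
fps = 242                  # from the training log
timesteps_so_far = 41552   # from the training log
total_timesteps = 500_000  # value passed to model.learn
n_envs = 4                 # number of environments in the DummyVecEnv

# Elapsed time implied by the log (the log reports time_elapsed = 171)
elapsed_s = timesteps_so_far / fps

# Projected duration of the whole run at the same throughput
estimated_total_min = total_timesteps / fps / 60

# Assumption: about one gradient update per vectorized step,
# so the count should be close to the logged n_updates of 10362
approx_updates = timesteps_so_far // n_envs

print(f"elapsed: {elapsed_s:.0f} s")
print(f"estimated total: {estimated_total_min:.1f} min")
print(f"approximate updates: {approx_updates}")
```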

In [None]:
mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=15)
print(f"Mean reward: {mean_reward:.2f}")



Mean reward: 1937.30
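
`evaluate_policy` returns both the mean and the standard deviation of the per-episode rewards; the cell above discards the second value. As a sketch of why the spread matters, the example below summarizes a made-up list of episode returns using the standard library. The numbers are hypothetical and are not results from this run.

```python
import statistics

# Hypothetical episode returns: four good episodes and one early fall
episode_returns = [2100.0, 1950.0, 400.0, 2200.0, 2050.0]

mean_return = statistics.fmean(episode_returns)
# Population standard deviation, the same convention as NumPy's default np.std
std_return = statistics.pstdev(episode_returns)

print(f"Mean reward: {mean_return:.2f} +/- {std_return:.2f}")
```

A single fall pulls the mean down noticeably and produces a large standard deviation, so reporting both values gives a clearer picture of how stable the walking gait is. In the notebook the same information is available by unpacking both return values, for example `mean_reward, std_reward = evaluate_policy(...)`.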


In [None]:
def frames_to_video(frames, fps=24):
    fig = plt.figure(figsize=(frames[0].shape[1] / 100, frames[0].shape[0] / 100), dpi=100)
    ax = plt.axes()
    ax.set_axis_off()

    if len(frames[0].shape) == 2:  # Grayscale image
        im = ax.imshow(frames[0], cmap='gray')
    else:  # Color image
        im = ax.imshow(frames[0])

    def init():
        # set_data() takes only the image array; the colormap was already set by imshow
        im.set_data(frames[0])
        return im,

    def update(frame):
        # Swap in the pixel data for the current frame index
        im.set_data(frames[frame])
        return im,

    interval = 1000 / fps
    anim = FuncAnimation(fig, update, frames=len(frames), init_func=init, blit=True, interval=interval)
    plt.close()
    return HTML(anim.to_html5_video())

In [None]:
t_env = DummyVecEnv([lambda: gym.make('Walker2d-v4', render_mode="rgb_array")])
state = t_env.reset()
frames = []

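# Roll out one episode and collect a rendered frame after every step
while True:
    # Use the deterministic action, matching evaluate_policy's default
    action, _ = model.predict(state, deterministic=True)
    state, r, done, info = t_env.step(action)
    frames.append(t_env.render())
    if done.all():
        break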

t_env.close()

In [None]:
frames_to_video(frames)

In [None]:
model.save("model")



- stability
- walking
- standing
- average reward
- hyperparameter
- evaluate policy