<div style="display: flex; align-items: center; justify-content: center;">
    <img style="float: right;" src="imgs/OST.png" width=260, height=130>
</div>
<div style="text-align: center;">
    <h1>Learning a Humanoid to Standup with Reinforcement Learning</h1>
    <h2>Christoph Landolt</h2>
    <h3>20. June 2024</h3>
</div>

## 1. Introduction

**Task Description**

This notebook tackles the HumanoidStandup task: training a reinforcement learning agent to control a humanoid model consisting of 15 body parts so that it moves from a horizontal lying position to a standing posture.
The task is especially challenging because the agent acts in a high-dimensional action space of 17 torques applied at the hinge joints and observes a state space of 376 dimensions, which includes information about the Humanoid's position, velocity, and angular velocity.
The task is considered solved when the Humanoid stands up without falling over, which demonstrates the agent's ability to handle complex, high-dimensional control tasks.

**Project Scope**

The aim of this project is to identify the most suitable reinforcement learning algorithm for the HumanoidStandup task and then to tune the parameters of this algorithm.
Subsequently, reward shaping is applied and it is shown how several agents can be trained in parallel.

## 2. Objective and approach

**Objective**

This project aims to train a humanoid robot, simulated in the MuJoCo physics engine and initially lying horizontally, to stand up independently.

**Approach**

The project is carried out in the following steps:
1) Comparison of the RL algorithms PPO, SAC, TD3 and A2C with their default parameters, based on the reward achieved during training. The choice of the RL algorithm is checked against the current literature.
2) Tuning the parameters of the selected RL algorithm
3) Creating and registering a custom environment with adjusted reward
4) Parallel training of the agent
5) Discussion of the results

## 3. Python Library Requirements

This project requires the following Python libraries:

- mujoco
- gymnasium
- stable-baselines3

## 4. Documentation of the MuJoCo Humanoid simulation (Adapted from [Gymnasium Documentation](https://gymnasium.farama.org/environments/mujoco/humanoid_standup/))
The environment, which is described in this [paper](https://ieeexplore.ieee.org/document/6386025), simulates a humanoid consisting of two legs, two arms and a torso.

The environment is designed to teach a humanoid to stand up or to walk.

### Overview
| Description | Information |
|----------|----------|
| Action Space   | ``` Box(-0.4, 0.4, (17,), float32) ```   |
| Observation Space    | ```Box(-inf, inf, (376,), float64)```     |
| Create the walking task    | ```gymnasium.make("Humanoid-v4")```     |
| Create the stand-up task   | ```gymnasium.make("HumanoidStandup-v4")```     |


### Action Space
An action represents the torques applied at the hinge joints.

![imgs/humanoid.png](imgs/humanoid.png)

| Num | Name  | Joint | Type (Unit) | 
|----------|----------|----------|----------|
| 0 | abdomen_y | hinge | torque (N m) |
| 1 | abdomen_z | hinge | torque (N m) | 
| 2 | abdomen_x | hinge | torque (N m) |
| 3 | right_hip_x | hinge | torque (N m) |
| 4 | right_hip_z | hinge | torque (N m) |
| 5 | right_hip_y | hinge | torque (N m) |
| 6 | right_knee | hinge | torque (N m) | 
| 7 | left_hip_x  | hinge | torque (N m) |
| 8 | left_hip_z | hinge | torque (N m) |
| 9 | left_hip_y | hinge | torque (N m) | 
| 10 | left_knee | hinge | torque (N m) | 
| 11 | right_shoulder1 | hinge | torque (N m) | 
| 12 | right_shoulder2 | hinge | torque (N m) | 
| 13 | right_elbow | hinge | torque (N m) | 
| 14 | left_shoulder1 | hinge | torque (N m) | 
| 15 | left_shoulder2 | hinge | torque (N m) | 
| 16 | left_elbow | hinge | torque (N m) | 

### Observation Space
The observation space for the Gymnasium HumanoidStandup task is a 376-dimensional continuous space. This high-dimensional space captures a wide range of information about the state of the humanoid model, which consists of 15 body parts.

The state vector includes the following information:

- Joint positions and velocities (`qpos`, `qvel`)
- Mass and inertia of each body part relative to the center of mass (`cinert`)
- Linear and angular velocity of each body part (`cvel`)
- Actuator forces at the joints (`qfrc_actuator`)
- External contact forces acting on each body part (`cfrc_ext`)

This rich state representation allows the reinforcement learning agent to have a comprehensive understanding of the current state of the humanoid and its environment, which is crucial for learning to balance and coordinate the humanoid's movements to stand up from a prone position.

A detailed overview can be found under this [Link](https://gymnasium.farama.org/environments/mujoco/humanoid_standup/).
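
The 376 dimensions can be accounted for by adding up the individual blocks of the observation vector. The following sketch does this arithmetic; the block sizes are an assumption taken from the Gymnasium documentation of the v4 environment and are not read from the simulator here.

```python
# Block sizes of the HumanoidStandup-v4 observation vector.
# Assumption: the sizes follow the Gymnasium documentation for the v4 environment.
OBSERVATION_BLOCKS = {
    "qpos (joint positions, root x/y excluded)": 22,
    "qvel (joint velocities)": 23,
    "cinert (mass and inertia of the bodies)": 140,
    "cvel (linear and angular body velocities)": 84,
    "qfrc_actuator (actuator forces)": 23,
    "cfrc_ext (external contact forces)": 84,
}


def observation_size(blocks):
    """Return the length of the flattened observation vector."""
    return sum(blocks.values())


print(observation_size(OBSERVATION_BLOCKS))
```

If the environment is installed, the same number should be visible in `env.observation_space.shape`.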

### Rewards
**Standup-Task**

The total reward is: `reward = uph_cost + 1 - quad_ctrl_cost - quad_impact_cost`.
- uph_cost: A reward for moving upward.
- quad_ctrl_cost: A cost that penalizes the Humanoid for taking actions that are too large.
- quad_impact_cost: A cost that penalizes the Humanoid if the external contact forces are too large.

A detailed overview can be found under this [Link](https://gymnasium.farama.org/environments/mujoco/humanoid_standup/).
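
To make the reward formula concrete, the following sketch recomputes it from raw quantities. The coefficients (0.1 for the control cost, 0.5e-6 and a cap of 10 for the impact cost) and the simulator timestep of 0.003 s are assumptions based on the Gymnasium implementation of the v4 environment; the function name `standup_reward` is only illustrative.

```python
def standup_reward(z_after, ctrl, contact_forces, timestep=0.003):
    """Recompute the stand-up reward from its three components.

    z_after:        height (z coordinate) of the torso after the action
    ctrl:           the 17 torques applied in this step
    contact_forces: flattened external contact forces
    timestep:       simulator timestep (assumed value, see text)
    """
    uph_cost = (z_after - 0.0) / timestep
    quad_ctrl_cost = 0.1 * sum(u * u for u in ctrl)
    # The impact cost is capped so that single hard contacts do not dominate.
    quad_impact_cost = min(0.5e-6 * sum(f * f for f in contact_forces), 10.0)
    return uph_cost + 1.0 - quad_ctrl_cost - quad_impact_cost


# Lying still at a torso height of 0.3 m without contact forces
print(standup_reward(0.3, [0.0] * 17, [0.0] * 84))
```

The upward term is two to three orders of magnitude larger than the two cost terms for typical values, which is relevant for the reward shaping experiments later in this notebook.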

**Learning to Walk-Task**

The reward function for the Humanoid learning to walk task is composed of four parts:

1. Healthy Reward: A fixed reward given at every timestep that the humanoid is alive.

2. Forward Reward: A reward for moving forward, calculated as `forward_reward_weight * (x-coordinate of the center of mass after the action - x-coordinate of the center of mass before the action) / dt`. Here, `dt` is the time between actions, which depends on the `frame_skip` parameter.

3. Control Cost: A penalty for large control forces, calculated as `ctrl_cost_weight * sum(control^2)`.

4. Contact Cost: A penalty for large external contact forces.

The total reward is calculated as `reward = healthy_reward + forward_reward - ctrl_cost - contact_cost`.

A detailed overview can be found under this [Link](https://gymnasium.farama.org/environments/mujoco/humanoid/#).

## 5. Description of the used Algorithms
The following four agents are evaluated in this project:

### 5.1 Soft Actor-Critic (SAC) (Theory adapted from [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/algorithms/sac.html#id2) implemented with [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html))
The Soft Actor-Critic was introduced in the following Paper: [Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor](https://arxiv.org/abs/1801.01290)

The Soft Actor-Critic (SAC) is an off-policy reinforcement learning algorithm that optimizes a stochastic policy in an entropy-regularized framework. The objective function is given by `J(π) = 𝔼[∑_t γ^t (r(s_t, a_t) + α H(π(·|s_t)))]`, where `r(s_t, a_t)` is the reward, `H(π(·|s_t))` is the entropy of the policy `π` at state `s_t`, `γ` is the discount factor, and `α` is the temperature parameter that determines the trade-off between exploration and exploitation.

The SAC algorithm uses two Q-functions to reduce overestimation bias, and learns the policy by minimizing the KL divergence to the normalized exponential of the Q-function. The policy update rule is `π^* = argmin_π D_KL(π(·|s_t) || exp(Q(s_t, ·)/α) / Z(s_t))`, where `Z(s_t)` is the normalizing partition function.

The Q-functions are updated by minimizing the Bellman residual, `𝔼[(Q(s_t, a_t) - (r(s_t, a_t) + γ 𝔼[Q(s_{t+1}, a_{t+1}) - α log π(a_{t+1}|s_{t+1})]))^2]`, where the next action `a_{t+1}` is sampled from `π(·|s_{t+1})`.

SAC has been shown to achieve state-of-the-art performance on a range of continuous control tasks due to its stability and efficiency.
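
The Bellman residual above can be illustrated for a single transition. This is a minimal sketch with hypothetical function names; taking the minimum of the two Q-functions and masking terminal transitions with `done` are common implementation details that are assumed here and are not part of the formula as written.

```python
def soft_q_target(reward, done, next_q1, next_q2, next_log_prob,
                  gamma=0.99, alpha=0.2):
    """Target value r + gamma * (min(Q1, Q2) - alpha * log pi) for one transition."""
    # Entropy-regularised value of the next state, using the smaller Q estimate
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    # No bootstrapping from terminal states
    return reward + gamma * (1.0 - float(done)) * soft_value


def bellman_residual(q_value, target):
    """Squared difference between the current Q estimate and its target."""
    return (q_value - target) ** 2


target = soft_q_target(reward=1.0, done=False, next_q1=10.0, next_q2=12.0,
                       next_log_prob=-2.0, gamma=0.5, alpha=0.5)
print(target, bellman_residual(6.0, target))
```

A more negative `next_log_prob` (an unlikely action) increases the target, which is how the entropy bonus rewards exploration.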


### 5.2 Proximal Policy Optimization (PPO) (Theory adapted from [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/algorithms/ppo.html) implemented with [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html))
Proximal Policy Optimization was introduced in the following paper: [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)

Proximal Policy Optimization (PPO) is a type of policy gradient method for reinforcement learning that aims to improve sample efficiency and ease of implementation. It uses a surrogate objective function to take multiple optimization steps on the same batch of data, while limiting the policy update to prevent it from deviating too much from the current policy. In the variant shown here, this is achieved by clipping the probability ratio between the new and the old policy, which removes the incentive for large policy changes.

The objective function in PPO is:

$$L^{CLIP}(\theta) = \hat{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio of the new and old policies, $\hat{A}_t$ is the estimated advantage at time $t$, and $\epsilon$ is a hyperparameter that controls the degree of policy change.

PPO strikes a balance between sample complexity, ease of implementation, and computational cost, making it a popular choice for many reinforcement learning tasks.
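
The effect of the clipping is easiest to see on a few numbers. The following sketch evaluates the clipped objective for lists of probability ratios and advantages; the function name and the value `epsilon=0.2` are illustrative assumptions.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon
print(clipped_surrogate([1.5], [2.0]))
# Negative advantage: the unclipped (more pessimistic) term is kept
print(clipped_surrogate([1.5], [-1.0]))
```

Because the minimum is taken, clipping only ever removes an incentive to move further away from the old policy; it never hides a deterioration.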

### 5.3 Advantage Actor Critic (A2C) (Theory adapted from [Hugging Face](https://huggingface.co/blog/deep-rl-a2c) implemented with [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html))

Advantage Actor-Critic is the synchronous variant of the asynchronous method (A3C) introduced in the following paper: [Asynchronous Methods for Deep Reinforcement Learning](https://arxiv.org/abs/1602.01783)

Advantage Actor-Critic (A2C) is a policy gradient method for reinforcement learning that combines an actor-critic architecture with an advantage estimate. The actor is the policy that the agent uses to interact with the environment, while the critic estimates the value function to help the actor improve its policy. The advantage function, defined as $A(s, a) = Q(s, a) - V(s)$, measures how much better an action is compared to the average action at that state.

The objective function in A2C is:

$$L(\theta) = \hat{E}_t[\log \pi_\theta(a_t|s_t) \hat{A}_t]$$

where $\pi_\theta(a_t|s_t)$ is the policy (actor), $\hat{A}_t$ is the estimated advantage at time $t$, and $\theta$ are the parameters of the policy.

A2C uses synchronous updates for efficiency and stability. It is an on-policy algorithm: it learns from data collected by the current policy while the agent interacts with the environment. It is widely used due to its balance between computational efficiency and learning performance.
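
As a small illustration of the objective, the sketch below estimates the advantage with a one-step temporal-difference error and turns the objective into a loss to be minimized. The one-step estimate and the function names are assumptions made for this example; library implementations typically use multi-step or generalized advantage estimates.

```python
import math


def one_step_advantage(reward, value, next_value, done, gamma=0.99):
    """Estimate A(s, a) as r + gamma * V(s') - V(s)."""
    return reward + gamma * (1.0 - float(done)) * next_value - value


def actor_loss(log_probs, advantages):
    """Negative of the A2C objective, averaged over the batch."""
    weighted = [lp * adv for lp, adv in zip(log_probs, advantages)]
    return -sum(weighted) / len(weighted)


advantage = one_step_advantage(reward=1.0, value=2.0, next_value=3.0,
                               done=False, gamma=0.5)
print(advantage)
print(actor_loss([math.log(0.5)], [advantage]))
```

Minimizing this loss raises the log-probability of actions with a positive advantage and lowers it for actions with a negative one.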

### 5.4 TD3: Twin Delayed DDPG (Deep Deterministic Policy Gradient) (Theory adapted from [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/algorithms/td3.html#id1) implemented with [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/modules/td3.html))

Twin Delayed DDPG was introduced in the following paper: [Addressing Function Approximation Error in Actor-Critic Methods](https://arxiv.org/pdf/1802.09477)

Twin Delayed Deep Deterministic Policy Gradient (TD3) is an algorithm for reinforcement learning that improves upon the Deep Deterministic Policy Gradient (DDPG) method. TD3 addresses the overestimation bias in Q-value estimation by maintaining two Q-functions and using the smaller of the two Q-values as the learning target. It also delays the policy update relative to the Q-function updates to reduce the variance of the policy and Q-value updates.

The objective function for the Q-functions in TD3 is:

$$L(\theta_i) = \hat{E}_{s,a,r,s'\sim D}\left[\left(Q_i(s,a|\theta_i) - \left(r + \gamma \min_{j=1,2} Q_j(s',\pi(s'|\phi_{old})|\theta_{j_{old}})\right)\right)^2\right]$$

where $Q_i(s,a|\theta_i)$ are the Q-functions, $r$ is the reward, $\gamma$ is the discount factor, $\pi(s'|\phi_{old})$ is the old policy, and $\theta_{j_{old}}$ are the old parameters of the Q-functions.

The objective function for the policy is:

$$L(\phi) = \hat{E}_{s\sim D}[-Q_1(s,\pi(s|\phi)|\theta_1)]$$

where $Q_1(s,\pi(s|\phi)|\theta_1)$ is one of the Q-functions.

TD3 is an off-policy algorithm: it learns from transitions stored in a replay buffer while the agent interacts with the environment. It is known for its stability and efficiency in learning continuous control tasks.
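
The two ideas that give TD3 its name can be sketched in a few lines: the target uses the smaller of the two Q-values, and the policy is only updated every few critic updates. The function names are hypothetical, and `policy_delay=2` is an assumed value for illustration.

```python
def td3_target(reward, done, next_q1, next_q2, gamma=0.99):
    """Target r + gamma * min(Q1, Q2), evaluated at the next state."""
    return reward + gamma * (1.0 - float(done)) * min(next_q1, next_q2)


def policy_update_steps(total_critic_updates, policy_delay=2):
    """Return the critic update steps after which the policy is also updated."""
    return [step for step in range(1, total_critic_updates + 1)
            if step % policy_delay == 0]


print(td3_target(reward=1.0, done=False, next_q1=4.0, next_q2=6.0, gamma=0.5))
print(policy_update_steps(6))
```

Using the minimum makes the target deliberately pessimistic, which counteracts the tendency of a single Q-function to overestimate.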

### Load the Required Libraries

In [1]:
from __future__ import annotations

import os
import multiprocessing
import numpy as np


import gymnasium as gym
from gymnasium.wrappers.monitoring.video_recorder import VideoRecorder

from stable_baselines3 import SAC, TD3, A2C, PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecVideoRecorder, DummyVecEnv


### Choice of the best RL Algorithm
1. The basic idea is to test several RL algorithms with the standard parameters in parallel and to monitor the training progress using the reward.
2. Subsequently, the algorithm that makes the fastest training progress is to be implemented and tuned.

In [None]:
# Create directories to hold models and logs for the tensor board
model_dir = "models"
log_dir = "logs"
os.makedirs(model_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

In [None]:
### Define training parameters
TIMESTEPS = 25000

In [None]:
def train(env, humanoid_training_algo):
    match humanoid_training_algo:
        case 'PPO':
            model = PPO('MlpPolicy', env, verbose=1, device='cuda', tensorboard_log=log_dir)
        case 'SAC':
            model = SAC('MlpPolicy', env, verbose=1, device='cuda', tensorboard_log=log_dir)
        case 'TD3':
            model = TD3('MlpPolicy', env, verbose=1, device='cuda', tensorboard_log=log_dir)
        case 'A2C':
            model = A2C('MlpPolicy', env, verbose=1, device='cuda', tensorboard_log=log_dir)
        case _:
            print('Algorithm not found')
            return
    iters = 0
    while True:
        iters += 1

        model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False)
        model.save(f"{model_dir}/{humanoid_training_algo}_{TIMESTEPS*iters}")

**Train the Model**

Once one algorithm clearly outperforms the others, interrupt the training in the Jupyter Notebook.

In [None]:
env = gym.make("HumanoidStandup-v4", render_mode=None)  # no rendering during training
algorithms = ['PPO', 'SAC', 'TD3', 'A2C']

processes = []
for algorithm in algorithms:
    p = multiprocessing.Process(target=train, args=(env, algorithm,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()

# Start Tensorboard
!tensorboard --logdir ./logs/

### Choice of the best Algorithm and Parameter Tuning
![Training Performance](./imgs/Algo_Training.JPG)

In [2]:
total_episodes = 2000
max_timesteps_per_episode = 1000

# Calculate total timesteps
total_timesteps = total_episodes * max_timesteps_per_episode

log_dir = "SAC_logs"
video_dir = "SAC_videos"

In [None]:
# Create and wrap the environment
env = gym.make("HumanoidStandup-v4", render_mode="rgb_array", max_episode_steps=1_000)
env = gym.wrappers.RecordVideo(env, video_dir)
observation, info = env.reset(seed=42)

#model = SAC("MlpPolicy", env, verbose=1, buffer_size=buffer_size, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, tensorboard_log=log_dir)
model = SAC("MlpPolicy", env, verbose=1, tensorboard_log=log_dir)
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)
# Save the agent
model.save("sac_humanoid")
vec_env = model.get_env()

del model  # delete trained model to demonstrate loading

# Load the trained agent
model = SAC.load("sac_humanoid", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(100000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

**Results of default Soft Actor-Critic (SAC) training**

An NVIDIA RTX 6000 GPU was used for the training. After 2,000 epochs and approx. 3 hours, 1,540,000 timesteps were performed and the reward converged:

![Reward Function](./imgs/SAC_Reward_Situp.JPG)

Unfortunately, with this model and these parameters, it was only possible to train stable sitting and not complete standing up:
| Episode 0 | Episode 343 |
|:---:|:---:|
| ![SAC_RL_Episode_0](./imgs/SAC_RL_Episode_0.gif) | ![SAC_RL_Episode_343](./imgs/SAC_RL_Episode_343.gif) |

| Episode 1000 | Episode 2000 |
|:---:|:---:|
| ![SAC_RL_Episode_1000](./imgs/SAC_RL_Episode_1000.gif) | ![SAC_RL_Episode_2000](./imgs/SAC_RL_Episode_2000.gif) |

**Interpretation of the result:**

The Humanoid Standup Task is very complex and provides a huge observation and action space.
I suspect that sitting is a saddle point that needs to be "overcome".
To achieve this, the agent could be made a little more exploratory by increasing entropy or action noise.
An alternative would be to adjust the reward function to give more weight to the upward movement.
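
To give a feeling for the magnitude of such action noise, the following sketch perturbs an action with Gaussian noise. It assumes that the noise is added in a normalized action range of [-1, 1] and that the result is then rescaled to the torque limits of ±0.4; this mirrors my understanding of how Stable-Baselines3 handles action noise for off-policy algorithms and should be treated as an assumption. The function name is hypothetical.

```python
import random


def noisy_env_action(scaled_action, sigma=0.1, low=-0.4, high=0.4, rng=random):
    """Add Gaussian noise in [-1, 1] space and rescale to the torque limits."""
    noisy = [min(1.0, max(-1.0, a + rng.gauss(0.0, sigma))) for a in scaled_action]
    # Map [-1, 1] linearly onto [low, high]
    return [low + 0.5 * (a + 1.0) * (high - low) for a in noisy]


rng = random.Random(0)  # fixed seed so that the example is reproducible
torques = noisy_env_action([0.0] * 17, sigma=0.1, rng=rng)
print(min(torques), max(torques))
```

Under these assumptions, a standard deviation of 0.1 corresponds to roughly 0.04 in torque units, which is a fairly mild perturbation.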

In [None]:
total_episodes = 2000
max_timesteps_per_episode = 1000

# Calculate total timesteps
total_timesteps = total_episodes * max_timesteps_per_episode

buffer_size = 1000000 # replay buffer size
learning_rate = 0.02 # How much to change the model due to the error at each iteration
batch_size = 64 # number of experiences sampled from the replay buffer for each update of the model
tau=0.005
gamma = 0.99
ent_coef=0.5 # Entropy 

In [None]:
# Create and wrap the environment
env = gym.make("HumanoidStandup-v4", render_mode="rgb_array", max_episode_steps=1_000)
env = gym.wrappers.RecordVideo(env, video_dir)
observation, info = env.reset(seed=42)

# Define the action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = SAC("MlpPolicy", env, verbose=1, buffer_size=buffer_size, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, action_noise=action_noise, ent_coef=ent_coef, tensorboard_log=log_dir)
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)
# Save the agent
model.save("sac_humanoid")
vec_env = model.get_env()

del model  # delete trained model to demonstrate loading

# Load the trained agent
model = SAC.load("sac_humanoid", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(100000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

**Results of Soft Actor-Critic (SAC) training with action noise and increased entropy**

An NVIDIA RTX 6000 GPU was used for the training. After 2,000 epochs and approx. 3.9 hours, 2,000,000 timesteps were performed and the reward converged:

![Reward Function](./imgs/SAC_Reward_Situp_param_set_1.JPG)

Unfortunately, with this model and these parameters, it was only possible to train stable sitting and not complete standing up:
| Episode 0 | Episode 343 |
|:---:|:---:|
| ![SAC_RL_Episode_0](./imgs/SAC_RL_Episode_0_para_tuning_1.gif) | ![SAC_RL_Episode_343](./imgs/SAC_RL_Episode_343_para_tuning_1.gif) |

| Episode 1000 | Episode 2000 |
|:---:|:---:|
| ![SAC_RL_Episode_1000](./imgs/SAC_RL_Episode_1000_para_tuning_1.gif) | ![SAC_RL_Episode_2000](./imgs/SAC_RL_Episode_2000_para_tuning_1.gif) |

**Interpretation of the result:**

By adding action noise and increasing the entropy, the humanoid learns new strategies for getting up. However, even with these parameter changes, the saddle point of sitting cannot be overcome and the reward converges.

In [None]:
# Start Tensorboard
!tensorboard --logdir ./SAC_logs/

### Change the Reward Function

As the reward converged in the previous attempts at sitting, an attempt is now made to give greater weight to the upward movement when standing up at the start of the training.
For this purpose, a wrapper is used to implement a separate reward function.
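
The idea behind the shaped reward can be shown independently of the environment. The sketch below recombines the reward components that the environment reports in its `info` dictionary; the example values are made up for illustration and the function name is hypothetical.

```python
def shaped_reward(info, upward_weight):
    """Recombine the reward components with a custom weight on the upward term."""
    return (upward_weight * info["reward_linup"]
            + info["reward_quadctrl"]
            + info["reward_impact"]
            + 1.0)


# Made-up component values; the two cost terms are reported as negative numbers
example_info = {"reward_linup": 50.0, "reward_quadctrl": -0.25, "reward_impact": -0.5}

print(shaped_reward(example_info, upward_weight=1.0))  # original reward
print(shaped_reward(example_info, upward_weight=2.0))  # upward movement counts double
```

With a weight of 1 the function reproduces the original reward, so any other weight changes only the relative importance of the upward movement.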

In [None]:
class CustomRewardWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.current_step = 0

    def step(self, action):
        # Call the step method of the wrapped environment
        observation, reward, terminated, truncated, info = super().step(action)

        # Individual reward components reported by the environment
        reward_linup = info.get('reward_linup')
        reward_quadctrl = info.get('reward_quadctrl')
        reward_impact = info.get('reward_impact')

        # Weight the upward movement higher at the beginning of the training and then decrease it
        reward_term = max(1, 1000 - self.current_step * 0.01)
        reward = reward_term * reward_linup + reward_quadctrl + reward_impact + 1

        # Counts all steps since the wrapper was created (not reset per episode)
        self.current_step += 1

        return observation, reward, terminated, truncated, info

In [None]:
total_episodes = 2000
max_timesteps_per_episode = 1000

# Calculate total timesteps
total_timesteps = total_episodes * max_timesteps_per_episode

buffer_size = 1000000 # replay buffer size
learning_rate = 0.02 # How much to change the model due to the error at each iteration
batch_size = 512 # number of experiences sampled from the replay buffer for each update of the model
tau=0.005
gamma = 0.99
ent_coef=0.4 # Entropy 

In [None]:
# Create and wrap the environment
env = gym.make("HumanoidStandup-v4", render_mode="rgb_array", max_episode_steps=1_000)
env = CustomRewardWrapper(env)
env = gym.wrappers.RecordVideo(env, video_dir)
observation, info = env.reset(seed=42)

# Define the action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = SAC("MlpPolicy", env, verbose=1, buffer_size=buffer_size, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, action_noise=action_noise, ent_coef=ent_coef, tensorboard_log=log_dir)
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)
# Save the agent
model.save("sac_humanoid")
vec_env = model.get_env()

del model  # delete trained model to demonstrate loading

# Load the trained agent
model = SAC.load("sac_humanoid", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(100000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

**Results of Soft Actor-Critic (SAC) training with decreasing reward weighting**

An NVIDIA RTX 6000 GPU was used for the training. At the beginning the reward decreased due to the decreasing reward function.
After 2,000 epochs and approx. 4 hours, 2,000,000 timesteps were performed and the reward converged at a lower level than the default reward:

![Reward Function](./imgs/rl_shaping/SAC_Reward_reward_shaping_1.JPG)

Unfortunately, with this model and these parameters, it was only possible to train stable sitting and not complete standing up:
| Episode 0 | Episode 343 |
|:---:|:---:|
| ![SAC_RL_Episode_0](./imgs/rl_shaping/SAC_RL_Episode_0_reward_shaping_1.gif) | ![SAC_RL_Episode_343](./imgs/rl_shaping/SAC_RL_Episode_343_reward_shaping_1.gif) |

| Episode 1000 | Episode 2000 |
|:---:|:---:|
| ![SAC_RL_Episode_1000](./imgs/rl_shaping/SAC_RL_Episode_1000_reward_shaping_1.gif) | ![SAC_RL_Episode_2000](./imgs/rl_shaping/SAC_RL_Episode_2000_reward_shaping_1.gif) |

**Interpretation of the result:**

By changing the reward function, the agent made no real progress and it was not possible to start learning new actions after the decline phase.
An attempt is now being made to make the decrease in rewards slower at the beginning.

### Change the Reward Function with slower decrease

Training the agent with a rapidly decreasing reward function did not improve the training. An attempt is therefore now being made to reduce the reward more slowly.
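
The two weighting schedules can be compared directly. The sketch below uses the constants from the two wrapper cells of this notebook (start 1000 with decay 0.01, and start 10 with decay 0.0001); the function names are hypothetical.

```python
def upward_weight(step, start, decay):
    """Linearly decreasing weight of the upward reward, never below 1."""
    return max(1.0, start - step * decay)


def steps_until_floor(start, decay):
    """Number of training steps after which the weight has reached 1."""
    return (start - 1.0) / decay


for name, start, decay in [("fast", 1000, 0.01), ("slow", 10, 0.0001)]:
    print(name,
          upward_weight(0, start, decay),
          upward_weight(50_000, start, decay),
          round(steps_until_floor(start, decay)))
```

Both schedules reach the floor of 1 after a similar number of steps (roughly 100,000 and 90,000), so the main difference is the size of the initial weight rather than the duration of the decay.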

In [None]:
class CustomRewardWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.current_step = 0

    def step(self, action):
        # Call the step method of the wrapped environment
        observation, reward, terminated, truncated, info = super().step(action)

        # Individual reward components reported by the environment
        reward_linup = info.get('reward_linup')
        reward_quadctrl = info.get('reward_quadctrl')
        reward_impact = info.get('reward_impact')

        # Weight the upward movement higher at the beginning of the training and then decrease it slowly
        reward_term = max(1, 10 - self.current_step * 0.0001)
        reward = reward_term * reward_linup + reward_quadctrl + reward_impact + 1

        # Counts all steps since the wrapper was created (not reset per episode)
        self.current_step += 1

        return observation, reward, terminated, truncated, info

In [None]:
total_episodes = 2000
max_timesteps_per_episode = 1000

# Calculate total timesteps
total_timesteps = total_episodes * max_timesteps_per_episode

buffer_size = 1000000 # replay buffer size
learning_rate = 0.02 # How much to change the model due to the error at each iteration
batch_size = 256 # number of experiences sampled from the replay buffer for each update of the model
tau=0.005
gamma = 0.99
ent_coef=0.3 # Entropy 

In [None]:
# Create and wrap the environment
env = gym.make("HumanoidStandup-v4", render_mode="rgb_array", max_episode_steps=1_000)
env = CustomRewardWrapper(env)
env = gym.wrappers.RecordVideo(env, video_dir)
observation, info = env.reset(seed=42)

# Define the action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = SAC("MlpPolicy", env, verbose=1, buffer_size=buffer_size, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, action_noise=action_noise, ent_coef=ent_coef, tensorboard_log=log_dir)
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)
# Save the agent
model.save("sac_humanoid")
vec_env = model.get_env()

del model  # delete trained model to demonstrate loading

# Load the trained agent
model = SAC.load("sac_humanoid", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(100000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

**Results of Soft Actor-Critic (SAC) training with slower decreasing reward weighting**

Due to the converging rewards, the training was stopped after 1,020,000 timesteps.

An NVIDIA RTX 6000 GPU was used for the training. At the beginning the reward decreased due to the decreasing reward function.
The reward converged at a lower level than the default reward:

![Reward Function](./imgs/rl_shaping/SAC_Reward_reward_shaping_2.JPG)

Unfortunately, with this model and these parameters, it was only possible to train stable sitting and not complete standing up:
| Episode 0 | Episode 64 |
|:---:|:---:|
| ![SAC_RL_Episode_0](./imgs/rl_shaping/SAC_RL_Episode_0_reward_shaping_2.gif) | ![SAC_RL_Episode_64](./imgs/rl_shaping/SAC_RL_Episode_64_reward_shaping_2.gif) |

| Episode 343 | Episode 1000 |
|:---:|:---:|
| ![SAC_RL_Episode_343](./imgs/rl_shaping/SAC_RL_Episode_343_reward_shaping_2.gif) | ![SAC_RL_Episode_1000](./imgs/rl_shaping/SAC_RL_Episode_1000_reward_shaping_2.gif) |

**Interpretation of the result:**

Once again, the increased emphasis on standing up at the beginning of the training did not bring any improvement and the agent was not able to learn anything.
We are now trying to give more weight to standing up but keep the weighting stable.

### Change the Reward Function with a constant leverage of the upward movement

Training the agent with a more slowly decreasing reward weighting did not improve the training. An attempt is therefore now being made to keep the weighting of the upward movement constant.

In [None]:
class CustomRewardWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)

    def step(self, action):
        # Call the step method of the wrapped environment
        observation, reward, terminated, truncated, info = super().step(action)

        # Individual reward components reported by the environment
        reward_linup = info.get('reward_linup')
        reward_quadctrl = info.get('reward_quadctrl')
        reward_impact = info.get('reward_impact')

        # Weight the upward movement with a constant factor during the whole training
        reward_term = 1.1
        reward = reward_term * reward_linup + reward_quadctrl + reward_impact + 1

        return observation, reward, terminated, truncated, info

In [None]:
total_episodes = 2000
max_timesteps_per_episode = 1000

# Calculate total timesteps
total_timesteps = total_episodes * max_timesteps_per_episode

buffer_size = 1000000 # replay buffer size
learning_rate = 0.02 # How much to change the model due to the error at each iteration
batch_size = 256 # number of experiences sampled from the replay buffer for each update of the model
tau=0.005
gamma = 0.99
ent_coef=0.5 # Entropy 

In [None]:
# Create and wrap the environment
env = gym.make("HumanoidStandup-v4", render_mode="rgb_array", max_episode_steps=1_000)
env = CustomRewardWrapper(env)
env = gym.wrappers.RecordVideo(env, video_dir)
observation, info = env.reset(seed=42)

# Define the action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = SAC("MlpPolicy", env, verbose=1, buffer_size=buffer_size, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma, action_noise=action_noise, ent_coef=ent_coef, tensorboard_log=log_dir)
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)
# Save the agent
model.save("sac_humanoid")
vec_env = model.get_env()

del model  # delete trained model to demonstrate loading

# Load the trained agent
model = SAC.load("sac_humanoid", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(100000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

**Results of Soft Actor-Critic (SAC) training with constant reward weighting**

Due to the decreasing reward, the training was stopped after 1,020,000 timesteps.

An NVIDIA RTX 6000 GPU was used for the training. At the beginning the reward decreased due to the decreasing reward function.
The reward converged at a lower level than the default reward:

![Reward Function](./imgs/rl_shaping/SAC_Reward_reward_shaping_3.JPG)

Unfortunately, with this model and these parameters, it was only possible to train stable sitting and not complete standing up:
| Episode 0 | Episode 64 |
|:---:|:---:|
| ![SAC_RL_Episode_0](./imgs/rl_shaping/SAC_RL_Episode_0_reward_shaping_3.gif) | ![SAC_RL_Episode_64](./imgs/rl_shaping/SAC_RL_Episode_64_reward_shaping_3.gif) |

| Episode 343 | Episode 1000 |
|:---:|:---:|
| ![SAC_RL_Episode_343](./imgs/rl_shaping/SAC_RL_Episode_343_reward_shaping_3.gif) | ![SAC_RL_Episode_1000](./imgs/rl_shaping/SAC_RL_Episode_1000_reward_shaping_3.gif) |

**Interpretation of the result:**

With this configuration, the seated position was achieved very quickly. Unfortunately, the sitting position could not be overcome this time either.

An attempt is now made to increase the constant weighting of the upward movement again and to initially set the entropy parameter high and then learn it adaptively.
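
Learning the entropy coefficient adaptively means that it is adjusted by gradient descent so that the policy entropy approaches a target value. The following sketch shows a single update step on the logarithm of the coefficient. It is a simplification under stated assumptions: plain gradient descent instead of an adaptive optimizer, a target entropy of -17 (the negative action dimension, a common heuristic), and a hypothetical function name.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.001):
    """One gradient step on the loss -log_alpha * mean(log_prob + target_entropy)."""
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    # Derivative of the loss with respect to log_alpha
    gradient = -mean_gap
    return log_alpha - lr * gradient


log_alpha = math.log(0.8)   # start value 0.8
target_entropy = -17.0      # assumed: negative number of action dimensions

# Policy is more deterministic than the target (entropy -20 < -17): alpha rises
raised = update_log_alpha(log_alpha, [20.0, 20.0], target_entropy, lr=0.1)
# Policy is more random than the target (entropy -10 > -17): alpha falls
lowered = update_log_alpha(log_alpha, [10.0, 10.0], target_entropy, lr=0.1)
print(math.exp(raised), math.exp(lowered))
```

A high start value therefore only encourages exploration at the beginning; the coefficient subsequently moves to whatever value keeps the policy entropy near the target.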

### Change the Reward Function with a higher constant leverage of the upward movement and higher entropy


In [None]:
class CustomRewardWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)

    def step(self, action):
        # Call the step method of the wrapped environment
        observation, reward, terminated, truncated, info = super().step(action)

        # Reward components reported by the environment in the info dict
        reward_linup = info.get('reward_linup')
        reward_quadctrl = info.get('reward_quadctrl')
        reward_impact = info.get('reward_impact')

        # Weight the upward movement with a constant factor
        reward_term = 1.7
        reward = reward_term*reward_linup + reward_quadctrl + reward_impact + 1

        return observation, reward, terminated, truncated, info

In [None]:
total_episodes = 2000
max_timesteps_per_episode = 1000

# Calculate total timesteps
total_timesteps = total_episodes * max_timesteps_per_episode

buffer_size = 1000000 # replay buffer size
learning_rate = 0.02 # How much to change the model due to the error at each iteration
batch_size = 64 # number of experiences sampled from the replay buffer for each update of the model
tau = 0.005 # soft update coefficient of the target networks
gamma = 0.99 # discount factor
ent_coef = 'auto_0.8' # Entropy coefficient, learned automatically from the start value 0.8

In [None]:
# Create and wrap the environment
env = gym.make("HumanoidStandup-v4", render_mode="rgb_array", max_episode_steps=1_000)
env = CustomRewardWrapper(env)
env = gym.wrappers.RecordVideo(env, video_dir)
observation, info = env.reset(seed=42)

# Define the action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

# Pass tau explicitly so that the value defined above is actually used
model = SAC("MlpPolicy", env, verbose=1, buffer_size=buffer_size, learning_rate=learning_rate,
            batch_size=batch_size, tau=tau, gamma=gamma, action_noise=action_noise,
            ent_coef=ent_coef, tensorboard_log=log_dir)
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)
# Save the agent
model.save("sac_humanoid")
vec_env = model.get_env()

del model  # delete trained model to demonstrate loading

# Load the trained agent
model = SAC.load("sac_humanoid", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward} +/- {std_reward}")

# Enjoy trained agent
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(100000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

**Results of the training**

Because the reward decreased and the agent converged to a crouching position, the training was stopped after 1,012,000 timesteps.

An NVIDIA RTX 6000 GPU was used for the training.
The reward converged at a lower level than the default reward:

![Reward Function](./imgs/rl_shaping/SAC_Reward_reward_shaping_4.JPG)

Unfortunately, with this model and these parameters, the agent only learned a crouching position and did not stand up completely:

| Episode 0 | Episode 64 |
|:---:|:---:|
| ![SAC_RL_Episode_0](./imgs/rl_shaping/SAC_RL_Episode_0_reward_shaping_4.gif) | ![SAC_RL_Episode_64](./imgs/rl_shaping/SAC_RL_Episode_64_reward_shaping_4.gif) |

| Episode 343 | Episode 1000 |
|:---:|:---:|
| ![SAC_RL_Episode_343](./imgs/rl_shaping/SAC_RL_Episode_343_reward_shaping_4.gif) | ![SAC_RL_Episode_1000](./imgs/rl_shaping/SAC_RL_Episode_1000_reward_shaping_4.gif) |

**Interpretation of the result:**

With this configuration, the agent only managed to remain in a crouching position.

Next, the upward movement is weighted with a constant factor of 1.1, and exploration is increased further by raising the standard deviation of the action noise from 0.1 to 0.2.
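To make the effect of the constant weight concrete, the sketch below recomputes the shaped reward of a single step with the two weights tried here, 1.7 and 1.1. The function `shaped_reward` is a hypothetical stand-alone restatement of the formula in `CustomRewardWrapper`, and the component values are invented for illustration; they are not measurements from the training runs.

```python
def shaped_reward(reward_linup: float, reward_quadctrl: float,
                  reward_impact: float, weight: float) -> float:
    """Same formula as the wrapper: weighted upward term, two costs, +1 bonus."""
    return weight * reward_linup + reward_quadctrl + reward_impact + 1


# Invented example values for a single step (the two cost terms are negative)
components = dict(reward_linup=50.0, reward_quadctrl=-0.5, reward_impact=-0.25)

for weight in (1.7, 1.1):
    value = shaped_reward(**components, weight=weight)
    print(f"weight={weight}: reward={value:.2f}")
```

Because only the upward term is scaled, lowering the weight reduces the reward for the same motion while leaving the control and impact costs unchanged, so the costs gain relative influence.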

### Change the Reward Function with a constant leverage of the upward movement and high exploration

In [None]:
class CustomRewardWrapper(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)

    def step(self, action):
        # Call the step method of the wrapped environment
        observation, reward, terminated, truncated, info = super().step(action)

        # Reward components reported by the environment in the info dict
        reward_linup = info.get('reward_linup')
        reward_quadctrl = info.get('reward_quadctrl')
        reward_impact = info.get('reward_impact')

        # Weight the upward movement with a constant factor
        reward_term = 1.1
        reward = reward_term*reward_linup + reward_quadctrl + reward_impact + 1

        return observation, reward, terminated, truncated, info

In [None]:
total_episodes = 2000
max_timesteps_per_episode = 1000

# Calculate total timesteps
total_timesteps = total_episodes * max_timesteps_per_episode

buffer_size = 1000000 # replay buffer size
learning_rate = 0.02 # How much to change the model due to the error at each iteration
batch_size = 64 # number of experiences sampled from the replay buffer for each update of the model
tau = 0.01 # soft update coefficient of the target networks
gamma = 0.99 # discount factor
ent_coef = 'auto_0.8' # Entropy coefficient, learned automatically from the start value 0.8

In [None]:
# Create and wrap the environment
env = gym.make("HumanoidStandup-v4", render_mode="rgb_array", max_episode_steps=1_000)
env = CustomRewardWrapper(env)
env = gym.wrappers.RecordVideo(env, video_dir)
observation, info = env.reset(seed=42)

# Define the action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions))

model = SAC("MlpPolicy", env, verbose=1, buffer_size=buffer_size, learning_rate=learning_rate, batch_size=batch_size, tau=tau, gamma=gamma, action_noise=action_noise, ent_coef=ent_coef, tensorboard_log=log_dir)
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)
# Save the agent
model.save("sac_humanoid")
vec_env = model.get_env()

del model  # delete trained model to demonstrate loading

# Load the trained agent
model = SAC.load("sac_humanoid", env=env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward} +/- {std_reward}")

# Enjoy trained agent
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(100000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

### Parallel training with default reward, default parameters and custom neural network
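With a vectorized environment, step counts need some care. The sketch below assumes the Stable-Baselines3 convention that `total_timesteps` counts transitions summed over all parallel environments, and that the `VecVideoRecorder` trigger receives the vectorized step index (one increment per joint step of all environments). Under these assumptions, it works out how many vectorized steps the run configured below takes and how many recordings the trigger `x % 2000 == 0` would start. The helper name `parallel_budget` is hypothetical.

```python
from typing import Tuple


def parallel_budget(total_timesteps: int, num_envs: int,
                    trigger_every: int) -> Tuple[int, int]:
    """Return (vectorized steps, number of recordings started).

    Assumes each vectorized step advances every environment once, so it
    consumes `num_envs` timesteps of the training budget.
    """
    vec_steps = -(-total_timesteps // num_envs)  # ceiling division
    # The trigger fires at step indices 0, trigger_every, 2*trigger_every, ...
    recordings = (vec_steps - 1) // trigger_every + 1 if vec_steps > 0 else 0
    return vec_steps, recordings


vec_steps, recordings = parallel_budget(
    total_timesteps=5000 * 1000, num_envs=50, trigger_every=2000
)
print(f"{vec_steps} vectorized steps, {recordings} recordings started")
```

Under these assumptions, the budget of 5,000,000 timesteps is shared by the 50 environments, so each environment contributes only a fiftieth of the total.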

In [3]:
total_episodes = 5000
max_timesteps_per_episode = 1000

# Calculate total timesteps
total_timesteps = total_episodes * max_timesteps_per_episode

buffer_size = 1400000 # replay buffer size
learning_rate = 0.02 # How much to change the model due to the error at each iteration
batch_size = 128 # number of experiences sampled from the replay buffer for each update of the model
tau = 0.01 # soft update coefficient of the target networks
gamma = 0.99 # discount factor
ent_coef = 'auto_0.4' # Entropy coefficient, learned automatically from the start value 0.4

# Define the policy kwargs with custom network architecture
policy_kwargs = dict(
    net_arch=[256, 256, 256]  # Increased depth of the Network
)

In [4]:
# Create and wrap the environment
num_envs = 50
env = make_vec_env('HumanoidStandup-v4', n_envs=num_envs, env_kwargs={"render_mode": "rgb_array", "max_episode_steps": 1000})

# Wrap the environment to record videos
env = VecVideoRecorder(env, video_dir, record_video_trigger=lambda x: x % 2000 == 0, video_length=1000)

# Define the action noise
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

# Initialize the SAC model with the custom network architecture
model = SAC("MlpPolicy", env, verbose=1, buffer_size=buffer_size, learning_rate=learning_rate, 
            batch_size=batch_size, tau=tau, gamma=gamma, action_noise=action_noise, 
            ent_coef=ent_coef, tensorboard_log=log_dir, policy_kwargs=policy_kwargs)

# Train the agent
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)

# Save the agent
model.save("sac_humanoid")
vec_env = model.get_env()

# Delete the trained model to demonstrate loading
del model

# Load the trained agent
model = SAC.load("sac_humanoid", env=vec_env)

# Evaluate the agent
mean_reward, std_reward = evaluate_policy(model, vec_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward} +/- {std_reward}")

# Enjoy the trained agent
obs = vec_env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

# Close the video recorder
vec_env.close()

Using cuda device
Logging to SAC_logs/SAC_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | 6e+04    |
| time/              |          |
|    episodes        | 4        |
|    fps             | 1360     |
|    time_elapsed    | 73       |
|    total_timesteps | 100000   |
| train/             |          |
|    actor_loss      | -367     |
|    critic_loss     | 1.07e+03 |
|    ent_coef        | 0.212    |
|    ent_coef_loss   | 0.0988   |
|    learning_rate   | 0.02     |
|    n_updates       | 998      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | 6e+04    |
| time/              |          |
|    episodes        | 8        |
|    fps             | 1360     |
|    time_elapsed    | 73       |
|    total_timesteps | 100000   |
---------------------------------
---------------------------------
| ro

KeyboardInterrupt: 