<div style="display: flex; align-items: center; justify-content: center;">
    <img style="float: right;" src="imgs/ost.png" width="260" height="130">
</div>
<div style="text-align: center;">
    <h1>Teaching a Humanoid to Stand Up with Reinforcement Learning</h1>
    <h2>Christoph Landolt</h2>
    <h3>June 2024</h3>
</div>

## Python Library Requirements

This project requires the following Python libraries:

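- mujoco
- gymnasium
- stable-baselines3
- tensorboard (for the training logs written via `tensorboard_log`)
- comet_ml (for the experiment logging in the final training run)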

## Documentation of the MuJoCo Humanoid Simulation (Adapted from the [Gymnasium Documentation](https://gymnasium.farama.org/environments/mujoco/humanoid_standup/))
The environment, described in this [paper](https://ieeexplore.ieee.org/document/6386025), simulates a humanoid consisting of a torso, two legs, and two arms.

The environment is designed to teach the humanoid either to stand up or to walk.

### Overview
| Description | Information |
|----------|----------|
| Action Space   | ``` Box(-0.4, 0.4, (17,), float32) ```   |
| Observation Space    | ```Box(-inf, inf, (376,), float64)``` (`v4` environments; `v5` uses 348 values)     |
| Create the walking environment    | ```gymnasium.make("Humanoid-v4")```     |
| Create the stand-up environment   | ```gymnasium.make("HumanoidStandup-v4")```     |


### Action Space
An action represents the torques applied at the hinge joints.

![imgs/humanoid.png](imgs/humanoid.png)

| Num | Name  | Joint | Type (Unit) | 
|----------|----------|----------|----------|
| 0 | abdomen_y | hinge | torque (N m) |
| 1 | abdomen_z | hinge | torque (N m) | 
| 2 | abdomen_x | hinge | torque (N m) |
| 3 | right_hip_x | hinge | torque (N m) |
| 4 | right_hip_z | hinge | torque (N m) |
| 5 | right_hip_y | hinge | torque (N m) |
| 6 | right_knee | hinge | torque (N m) | 
| 7 | left_hip_x  | hinge | torque (N m) |
| 8 | left_hip_z | hinge | torque (N m) |
| 9 | left_hip_y | hinge | torque (N m) | 
| 10 | left_knee | hinge | torque (N m) | 
| 11 | right_shoulder1 | hinge | torque (N m) | 
| 12 | right_shoulder2 | hinge | torque (N m) | 
| 13 | right_elbow | hinge | torque (N m) | 
| 14 | left_shoulder1 | hinge | torque (N m) | 
| 15 | left_shoulder2 | hinge | torque (N m) | 
| 16 | left_elbow | hinge | torque (N m) | 
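
Because an action is a flat 17-element vector, it can be convenient to address actuators by name. The following is a minimal sketch: `ACTUATOR_NAMES` copies the order of the table above, while `make_action` and its clipping to the ±0.4 bounds of the action space are hypothetical helpers for illustration, not part of Gymnasium.

```python
# Actuator order copied from the table above (index = position in the action vector).
ACTUATOR_NAMES = [
    "abdomen_y", "abdomen_z", "abdomen_x",
    "right_hip_x", "right_hip_z", "right_hip_y", "right_knee",
    "left_hip_x", "left_hip_z", "left_hip_y", "left_knee",
    "right_shoulder1", "right_shoulder2", "right_elbow",
    "left_shoulder1", "left_shoulder2", "left_elbow",
]
ACTION_LOW, ACTION_HIGH = -0.4, 0.4  # bounds of Box(-0.4, 0.4, (17,), float32)


def make_action(torques: dict[str, float]) -> list[float]:
    """Build a 17-element action (hypothetical helper); unnamed actuators get zero."""
    unknown = set(torques) - set(ACTUATOR_NAMES)
    if unknown:
        raise KeyError(f"Unknown actuators: {sorted(unknown)}")
    action = [0.0] * len(ACTUATOR_NAMES)
    for name, value in torques.items():
        # Clip each value to the action-space bounds so the vector stays valid.
        action[ACTUATOR_NAMES.index(name)] = max(ACTION_LOW, min(ACTION_HIGH, value))
    return action


# The request for 1.0 on the left knee is clipped to the upper bound of 0.4.
action = make_action({"right_knee": 0.3, "left_knee": 1.0})
print(action[6], action[10])
```

Before passing such a list to `env.step`, it would typically be converted to a `float32` NumPy array to match the declared action space.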

### Observation Space

### Rewards
**Standup-Task**

The total reward is `reward = uph_cost + 1 - quad_ctrl_cost - quad_impact_cost`, where:
- `uph_cost`: A positive reward for moving upward.
- `quad_ctrl_cost`: A penalty for actions that are too large.
- `quad_impact_cost`: A penalty for external contact forces that are too large.
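
The reward formula above can be sketched in plain Python. The weights used here (`0.1` for the control cost, `0.5e-6` for the impact cost, a cap of `10` on the impact cost, and `uph_cost` as torso height divided by the simulation timestep) are assumptions based on the Gymnasium `v4` implementation and should be checked against the installed version. The function name `standup_reward` and the input values are illustrative only.

```python
def standup_reward(z_after, timestep, ctrl, contact_forces):
    """Sketch of the stand-up reward; all weights are assumptions (see text)."""
    # Reward for height: torso z-coordinate divided by the simulation timestep.
    uph_cost = z_after / timestep
    # Quadratic penalty on the control signal (assumed weight 0.1).
    quad_ctrl_cost = 0.1 * sum(u * u for u in ctrl)
    # Quadratic penalty on contact forces (assumed weight 0.5e-6, capped at 10).
    quad_impact_cost = min(0.5e-6 * sum(f * f for f in contact_forces), 10.0)
    return uph_cost + 1 - quad_ctrl_cost - quad_impact_cost


# Illustrative inputs: torso at 0.15 m, timestep 0.003 s, maximum torque on
# all 17 actuators, and no contact forces.
reward = standup_reward(0.15, 0.003, [0.4] * 17, [])
print(round(reward, 3))
```

With these weights the height term dominates unless the torques or contact forces are large, which is why the reward curve mostly tracks how high the torso gets.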

**Learning to Walk-Task**

### Load the Required Libraries

In [None]:
from __future__ import annotations

import os
import multiprocessing


import gymnasium as gym

from stable_baselines3 import SAC, TD3, A2C, PPO
from stable_baselines3.common.evaluation import evaluate_policy

### Choice of the best RL Algorithm
1. Train several RL algorithms in parallel with their default hyperparameters and monitor the training progress via the episode reward.
2. Then select the algorithm that makes the fastest training progress, and implement and tune it further.
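
One way to make "fastest training progress" concrete is to compare the slope of each reward curve. The sketch below is only one possible criterion, not the method used in this notebook: `reward_slope` and `fastest_learner` are hypothetical helpers, and the curves are placeholder numbers rather than measured results.

```python
def reward_slope(rewards):
    """Least-squares slope of the reward against the evaluation index."""
    n = len(rewards)
    if n < 2:
        raise ValueError("Need at least two reward values to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(rewards) / n
    cov = sum((i - mean_x) * (r - mean_y) for i, r in enumerate(rewards))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def fastest_learner(curves):
    """Return the name of the curve with the steepest reward slope."""
    return max(curves, key=lambda name: reward_slope(curves[name]))


# Placeholder curves for illustration only; these are not measured results.
curves = {
    "algo_a": [10.0, 20.0, 30.0, 40.0],
    "algo_b": [10.0, 30.0, 50.0, 70.0],
    "algo_c": [10.0, 10.0, 10.0, 10.0],
}
best = fastest_learner(curves)
print(best)
```

In practice the reward values would come from the TensorBoard logs, and a slope over a short window is noisy, so visual inspection of the curves remains a useful cross-check.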

In [None]:
# Create directories to hold the models and the TensorBoard logs
model_dir = "models"
log_dir = "logs"
os.makedirs(model_dir, exist_ok=True)
os.makedirs(log_dir, exist_ok=True)

In [None]:
# Training parameters: number of timesteps per learn() call between checkpoints
TIMESTEPS = 25_000

In [None]:
def train(env, humanoid_training_algo):
    """Train the chosen algorithm on env, saving a checkpoint every TIMESTEPS steps."""
    algorithms = {'PPO': PPO, 'SAC': SAC, 'TD3': TD3, 'A2C': A2C}
    if humanoid_training_algo not in algorithms:
        print('Algorithm not found')
        return

    # device='auto' uses the GPU when one is available and the CPU otherwise
    model = algorithms[humanoid_training_algo](
        'MlpPolicy', env, verbose=1, device='auto', tensorboard_log=log_dir)

    iters = 0
    # Runs until the process is interrupted; each pass adds TIMESTEPS steps
    while True:
        iters += 1

        # reset_num_timesteps=False keeps one continuous timestep counter across calls
        model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False)
        model.save(f"{model_dir}/{humanoid_training_algo}_{TIMESTEPS*iters}")

def test(env, humanoid_training_algo, path_to_model):
    """Load a saved model and run it until the episode ends, plus 500 extra steps."""
    algorithms = {'PPO': PPO, 'SAC': SAC, 'TD3': TD3, 'A2C': A2C}
    if humanoid_training_algo not in algorithms:
        print('Algorithm not found')
        return

    model = algorithms[humanoid_training_algo].load(path_to_model, env=env)

    obs, _ = env.reset()
    done = False
    extra_steps = 500
    while True:
        action, _ = model.predict(obs)
        obs, _, terminated, truncated, _ = env.step(action)
        # An episode can end by termination or by truncation (time limit).
        # HumanoidStandup never terminates, so truncation must be checked too.
        done = terminated or truncated

        if done:
            extra_steps -= 1

            if extra_steps < 0:
                break
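
Since `train` writes checkpoints named `{algorithm}_{timesteps}`, a small helper can pick the newest one to pass to `test` as `path_to_model`. This is a sketch under two assumptions: that Stable-Baselines3 appends a `.zip` suffix when saving, and that the hypothetical helper `latest_checkpoint` is acceptable as a naming convention. The demonstration uses empty placeholder files in a temporary directory rather than real models.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(model_dir, algo):
    """Return the checkpoint path with the highest timestep count, or None."""
    pattern = re.compile(rf"^{re.escape(algo)}_(\d+)\.zip$")
    best_steps, best_path = -1, None
    for path in Path(model_dir).glob(f"{algo}_*.zip"):
        match = pattern.match(path.name)
        # Compare timesteps numerically; sorting the names as strings would
        # rank "PPO_50000" above "PPO_100000".
        if match and int(match.group(1)) > best_steps:
            best_steps, best_path = int(match.group(1)), path
    return best_path


# Demonstration on empty placeholder files (not real models).
with tempfile.TemporaryDirectory() as tmp:
    for steps in (25000, 50000, 100000):
        (Path(tmp) / f"PPO_{steps}.zip").touch()
    (Path(tmp) / "SAC_25000.zip").touch()
    newest_name = latest_checkpoint(tmp, "PPO").name
print(newest_name)
```

The returned path could then be used as `test(env, 'PPO', latest_checkpoint(model_dir, 'PPO'))`.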

In [None]:
# render_mode must be the value None (no rendering), not the string "None"
env = gym.make("HumanoidStandup-v4", render_mode=None)
algorithms = ['PPO', 'SAC', 'TD3', 'A2C']

processes = []
for algorithm in algorithms:
    p = multiprocessing.Process(target=train, args=(env, algorithm,))
    p.start()
    processes.append(p)
for p in processes:
    p.join()

# Start TensorBoard on the directory that the models log to (log_dir = "logs")
!tensorboard --logdir ./logs/

In [None]:
total_timesteps = 10_000_000
n_steps = 5000
learning_rate = 0.02
batch_size = 512
gamma = 0.99
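
The values above imply a particular training schedule, which the following arithmetic makes explicit. It is a sketch that assumes a single environment (`n_envs = 1`) and that PPO collects `n_steps * n_envs` transitions per rollout before splitting them into minibatches of `batch_size`, which is how Stable-Baselines3 is understood to work.

```python
import math

total_timesteps = 10_000_000
n_steps = 5000
batch_size = 512
n_envs = 1  # assumption: one non-vectorized environment

# Transitions collected before each policy update.
rollout_size = n_steps * n_envs
# Number of collect-then-update cycles over the whole run.
n_rollouts = math.ceil(total_timesteps / rollout_size)
# Minibatches per pass over one rollout; a non-zero remainder means the
# last minibatch is smaller than batch_size.
full_minibatches, remainder = divmod(rollout_size, batch_size)
minibatches_per_epoch = full_minibatches + (1 if remainder else 0)

print(n_rollouts, full_minibatches, remainder)
```

Because 5000 is not a multiple of 512, each pass ends with a truncated minibatch of 392 samples. Choosing a rollout size that is a multiple of `batch_size`, for example `n_steps = 5120`, would avoid this.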

In [None]:
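from comet_ml import Experiment
from comet_ml.integration.gymnasium import CometLogger

# Read the Comet API key from the environment instead of hard-coding the secret
experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name="deeprl",
    workspace="clandolt",
)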

# Create and wrap the environment
env = gym.make("HumanoidStandup-v4", render_mode="rgb_array")
env = gym.wrappers.RecordVideo(env, 'test')
env = CometLogger(env, experiment)
observation, info = env.reset(seed=42)

model = PPO("MlpPolicy", env, verbose=1, n_steps=n_steps, learning_rate=learning_rate, batch_size=batch_size, gamma=gamma)
model.learn(total_timesteps=total_timesteps, reset_num_timesteps=False)
# Save the agent
model.save("ppo_humanoid")
vec_env = model.get_env()

del model  # delete trained model to demonstrate loading

# Load the trained agent
# NOTE: if you have loading issues, you can pass `print_system_info=True`
# to compare the system on which the model was trained with the current one
# model = PPO.load("ppo_humanoid", env=env, print_system_info=True)
model = PPO.load("ppo_humanoid", env=env)

# Evaluate the agent
# NOTE: If you use wrappers with your environment that modify rewards,
#       this will be reflected here. To evaluate with original rewards,
#       wrap environment in a "Monitor" wrapper before other wrappers.
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Enjoy trained agent
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(100000):
    action, _states = model.predict(obs, deterministic=True)
    obs, rewards, dones, info = vec_env.step(action)
    vec_env.render("human")

In [None]:
experiment.end()

In [None]:
experiment.display()