# Standing Behaviour
> This directory contains all the building blocks required to achieve the most basic learnable behaviour: standing still.  
> 
> ⚠️ **IMPORTANT WARNING** ⚠️  
> This notebook, together with the contents of the entire directory, is intended to serve as a tutorial and documentation of the results achieved so far.  
> It should **NOT** be modified.  
> 
> __To experiment with this task, please create a copy of this directory and make your changes there.__

## Contents
This directory includes:
- This notebook  
- The file `reward_function.py`, which defines a reward function tailored for achieving standing behaviour with deep reinforcement learning  
- `ppo_stand.zip`, a pretrained policy that demonstrates standing behaviour  
- Three config files (`env`, `agent`, and `terrain` configs) that specify all parameters for this experiment. In particular, note the definition of the **home positions** of each joint in `agentConfig.yaml`, since these determine the pose the robot assumes when resting  
- `stand.pkl`, a state file produced at the end of a training session. It is used by the environment to store necessary data such as the total number of training steps  

**Note:** The file `SpotmicroEnv.py` defines the custom training environment. For the notebook to work, it must be located in the *grandparent directory* of this one.

## Use
This notebook, along with the directory contents, demonstrates the basic workflow of training and testing a simple policy.  
It also serves as documentation, since this policy will be used as a foundation for later experiments.

The notebook is organized as follows:
- The first cell (imports) must be executed every time; it loads almost all dependencies  
- The first section covers training a custom policy.  
  - To experiment, copy this directory elsewhere and adjust the reward function or hyperparameters there  
  - Otherwise, you can stick to the provided base policy and skip directly to testing  
- The final section covers testing: exploring the results and analyzing the learned policy  


In [1]:
import time
import numpy as np
from stable_baselines3 import PPO

import sys
import os

# Start from the current working directory (where notebook is)
cwd = os.getcwd()

# Go two levels up (to the "grandparent")
grandparent_dir = os.path.abspath(os.path.join(cwd, "..", ".."))

# Add to sys.path if not already there
if grandparent_dir not in sys.path:
    sys.path.insert(0, grandparent_dir)

from SpotmicroEnv import SpotmicroEnv
from reward_function import reward_function, RewardState

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

pybullet build time: Apr  4 2025 18:56:19


# Training
The following cells launch a training session for a policy.
The first cell only loads the necessary assets and sets everything up, the second one loads TensorBoard to visualize useful data about the ongoing training, and the third one runs the training itself.

> NOTE: this directory contains a pre-trained policy, "stand". If you don't need anything specific, you can skip the training cells and jump to the testing section.

## Parameters
- You can set the name of the policy being trained by assigning it to the `run` variable.
- You can set the number of checkpoints that will be saved by changing the divisor in `save_freq` within `checkpoint_callback`.
- You can adjust the learning rate, entropy coefficient, clip range and the rest of the hyperparameters in the `PPO(...)` call in the last few lines of the setup cell.
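
To make the effect of these parameters concrete, the sketch below reuses the `clipped_linear_schedule` helper and the `TOTAL_STEPS` value from the setup cell and evaluates them in isolation. It assumes the Stable-Baselines3 convention that `progress_remaining` goes from 1.0 at the start of training to 0.0 at the end, and that training uses a single environment, so one callback call corresponds to one timestep.

```python
# Same schedule as in the setup cell: linear decay, clipped from below.
def clipped_linear_schedule(initial_value, min_value=1e-5):
    def schedule(progress_remaining):
        return max(progress_remaining * initial_value, min_value)
    return schedule

TOTAL_STEPS = 4_000_000
lr = clipped_linear_schedule(3e-4)

# Learning rate at the start, halfway through, and at the end of training.
for progress_remaining in (1.0, 0.5, 0.0):
    print(f"progress_remaining={progress_remaining}: lr={lr(progress_remaining):.2e}")

# Dividing by 10 yields ten checkpoints over the whole run (single-env assumption).
save_freq = TOTAL_STEPS // 10
print(f"checkpoint every {save_freq} steps")
```

The lower bound keeps the learning rate from reaching zero in the final updates; changing the divisor in `save_freq` changes how many checkpoints are written.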

## The reward function
The reward function is defined in another file, and is crucial to the success of the experiment. In this case, I have defined 5 different rewards/penalties to describe a good standing behaviour:
- _Uprightness_: this metric measures the posture of the robot and encourages it to stand upright. It is computed from roll and pitch
- _Height_: the closer the agent is to a target height set by the user, the bigger the reward it receives
- _Vertical velocity penalty_: any sudden, fast movement along the z-axis is heavily penalized, to encourage stillness
- _Joint deviation penalty_: the more each joint strays from its set position (the home position), the heavier the penalty. This encourages the agent to stick to a predefined resting pose
- _Action sparsity reward_: this metric rewards the agent for small actions, and should discourage large movements

The rewards and penalties are combined linearly, each with a weight that reflects its importance.
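
The weighted combination can be sketched as follows. This is only an illustration, not the code in `reward_function.py`: the function name `standing_reward`, the weights, and the exact shape of each term (exponentials and squared errors) are assumptions chosen to keep the example short.

```python
import math

# Hypothetical weights; penalties carry a negative sign.
WEIGHTS = {
    "uprightness": 1.0,
    "height": 1.0,
    "vertical_velocity": -0.5,
    "joint_deviation": -0.3,
    "action_sparsity": 0.2,
}

def standing_reward(roll, pitch, height, target_height,
                    z_velocity, joint_positions, home_positions, action):
    """Return (total reward, individual terms) for one timestep."""
    n_joints = len(joint_positions)
    terms = {
        # 1.0 when perfectly level, decaying as roll/pitch grow
        "uprightness": math.exp(-(roll ** 2 + pitch ** 2)),
        # 1.0 at the target height, decaying with the height error
        "height": math.exp(-abs(height - target_height)),
        # grows quadratically with vertical speed
        "vertical_velocity": z_velocity ** 2,
        # mean squared distance from the home positions
        "joint_deviation": sum(
            (q - h) ** 2 for q, h in zip(joint_positions, home_positions)
        ) / n_joints,
        # 1.0 for a zero action, decaying with action magnitude
        "action_sparsity": math.exp(-sum(a ** 2 for a in action)),
    }
    total = sum(WEIGHTS[name] * value for name, value in terms.items())
    return total, terms

# A perfectly still robot in its home pose versus a tilted, moving one.
home = [0.0, 0.5, -1.0]
ideal, _ = standing_reward(0.0, 0.0, 0.2, 0.2, 0.0, home, home, [0.0, 0.0, 0.0])
tilted, _ = standing_reward(0.3, 0.2, 0.15, 0.2, 0.4, [0.2, 0.3, -0.8], home, [0.5, 0.5, 0.5])
print(ideal, tilted)
```

Returning the individual terms alongside the total makes it easy to log each component separately and see which one dominates during training.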

In [None]:
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.logger import configure

# ========= CONFIG ==========
TOTAL_STEPS = 4_000_000
run = "stand"
log_dir = f"./logs/{run}"

def clipped_linear_schedule(initial_value, min_value=1e-5):
    def schedule(progress_remaining):
        return max(progress_remaining * initial_value, min_value)
    return schedule

checkpoint_callback = CheckpointCallback(
    save_freq=TOTAL_STEPS // 10,
    save_path=f"{run}_checkpoints",
    name_prefix=f"ppo_{run}"
)

# ========= ENV ==========
env = SpotmicroEnv(
    use_gui=False,
    reward_fn=reward_function, 
    reward_state=RewardState(), 
    dest_save_file=f"{run}.pkl"
)
check_env(env, warn=True)

# ========= MODEL ==========
model = PPO(
    "MlpPolicy", 
    env,
    verbose=0,   # no default printouts
    learning_rate=clipped_linear_schedule(3e-4),
    ent_coef=0.002,
    clip_range=0.1,
    tensorboard_log=log_dir,
)

# Custom logger: ONLY csv + tensorboard (no stdout table)
new_logger = configure(log_dir, ["csv", "tensorboard"])
model.set_logger(new_logger)

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs

In [None]:

# ========= TRAIN ==========
model.learn(
    total_timesteps=TOTAL_STEPS,
    reset_num_timesteps=False,
    callback=checkpoint_callback
)
model.save(f"policies/ppo_{run}")
env.close()

# Testing
The following cells let you test the policy we have just trained. All we have to do is assign the name of the trained policy to the `policy` variable.
You can then run the second-to-last cell as many times as you want, observing a single episode until termination each time. When you are done, execute the last cell to clean everything up.

> If you run into an unexpected error, try restarting the kernel of this Jupyter notebook first (pybullet is somewhat messy in its cleanup phase)
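
One way to reduce the need for kernel restarts is to make sure the environment is always closed, even when a rollout is interrupted. The sketch below shows the pattern with a `try`/`finally` block. `StubEnv` and `run_rollout` are hypothetical stand-ins written for this example; they are not part of `SpotmicroEnv` or pybullet.

```python
class StubEnv:
    """Hypothetical stand-in for an environment holding a physics-server connection."""

    def __init__(self):
        self.connected = True

    def step(self):
        if not self.connected:
            raise RuntimeError("Not connected to physics server.")

    def close(self):
        self.connected = False


def run_rollout(env, n_steps, disconnect_at=None):
    """Step the env up to n_steps times; always close it, even if a step fails."""
    steps_done = 0
    try:
        for i in range(n_steps):
            if i == disconnect_at:
                env.connected = False  # simulate the GUI window being closed
            env.step()
            steps_done += 1
    except RuntimeError as err:
        print(f"Rollout stopped after {steps_done} steps: {err}")
    finally:
        env.close()  # runs on success and on failure
    return steps_done


env = StubEnv()
completed = run_rollout(env, n_steps=10, disconnect_at=3)
```

With the real environment, the same structure would wrap the rollout loop so that the close call is not skipped when the GUI window is closed mid-episode.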

In [2]:
policy = "stand"

# === Build raw env ===
def make_env():
    return SpotmicroEnv(
        use_gui=True,
        reward_fn=reward_function,
        reward_state=RewardState(),
        src_save_file=f"{policy}.pkl",
    )

# DummyVecEnv wrapper
raw_env = DummyVecEnv([make_env])

# === Load VecNormalize stats ===
eval_env = VecNormalize.load(f"{policy}_vecnormalize.pkl", raw_env)

# Very important: disable training updates during evaluation
eval_env.training = False
eval_env.norm_reward = False

# === Load model ===
model = PPO.load(f"stand_checkpoints/ppo_{policy}_5000000_steps.zip")

print("Loaded policy and VecNormalize stats")

# === Run rollout ===
obs = eval_env.reset()
for _ in range(2000):  # run some steps
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = eval_env.step(action)
    eval_env.render()
    if done:
        obs = eval_env.reset()

    spot_env = eval_env.venv.envs[0]
    print(spot_env.agent.action)

    time.sleep(1/60.)

eval_env.close()

Loaded policy and VecNormalize stats
0.2388004007632255
0.3696849817743819
0.5167956919939182
0.6801194948781619
0.8278761917069666
1.0262515278346402
1.2097958684299193
1.4429800756029343
1.6140598054105366
1.7180859218920375
1.7639425292760773
1.771574255348582




1.7209127408395615
1.6336596928409026
1.4960493094777372
1.3188408360442327
1.087598300777501
0.8200915191933889
0.5176367972160534
0.15206790386413926
-0.19762640528386277
-0.5928322853929802
-0.9556706907456414
-1.3443500534680326
-1.7197123870687114
-2.0673091551761087
-2.400566616542124
-2.707979552787404
-3.001546057805224
-3.298978112706183
-3.552685311857992
-3.8106310449847887
-4.027002030741603
-4.247835792229207
-4.440841234041152
-4.634222590844813
-4.800352040154301
-4.961863146005966
-5.0951368959335195
-5.196981740553549
-5.283579650057621
-5.335560916729451
-5.3834835349622
-5.396303639729968
-5.3984737562749325
-5.374017301587189
-5.340837470155638
-5.283970828420545
-5.219075118540653
-5.160799922694551
-5.079580214168658
-4.985579103271806
-4.8685928540045085
-4.739359294385329
-4.589829012341637
-4.425444174795276
-4.240873863449409
-4.041610502970308
-3.8222466328893616
-3.5844277138754905
-3.3308151698887483
-3.1218609960770887
-2.931346551756417
-2.757401908994533

error: Not connected to physics server.

# What is next?
The next important step towards walking is convincing the policy to move at all. This is not a trivial task, since it involves designing a reward function that makes moving more attractive than both standing still and falling flat. This task is explored in the notebook inside the "shuffling" directory.