# Standing Behaviour
> This directory contains all the building blocks required to achieve the most basic learnable behaviour: standing still.  
> 
> ⚠️ **IMPORTANT WARNING** ⚠️  
> This notebook, together with the contents of the entire directory, is intended to serve as a tutorial and documentation of the results achieved so far.  
> It should **NOT** be modified.  
> 
> __To experiment with this task, please create a copy of this directory and make your changes there.__

## Contents
This directory includes:
- This notebook  
- The file `reward_function.py`, which defines a reward function tailored for achieving standing behaviour with deep reinforcement learning  
- `ppo_stand.zip`, a pretrained policy that demonstrates standing behaviour  
- Three config files (`env`, `agent`, and `terrain` configs) that specify all parameters for this experiment. In particular, note the definition of the **home positions** of each joint in `agentConfig.yaml`, since these determine the pose the robot assumes when resting  
- `stand.pkl`, a state file produced at the end of a training session. It is used by the environment to store necessary data such as the total number of training steps  

**Note:** The file `SpotmicroEnv.py` defines the custom training environment. For the notebook to work, it must be located in the *grandparent directory* of this one.

## Use
This notebook, along with the directory contents, demonstrates the basic workflow of training and testing a simple policy.  
It also serves as documentation, since this policy will be used as a foundation for later experiments.

The notebook is organized as follows:
- The first cell (imports) must be executed every time; it loads almost all dependencies  
- The first section covers training a custom policy.  
  - To experiment, copy this directory elsewhere and adjust the reward function or hyperparameters there  
  - Otherwise, you can stick to the provided base policy and skip directly to testing  
- The final section covers testing: exploring the results and analyzing the learned policy  


In [1]:
import time
import numpy as np
from stable_baselines3 import PPO

import sys
import os

# Start from the current working directory (where notebook is)
cwd = os.getcwd()


# Go two levels up (to the "grandparent")
grandparent_dir = os.path.abspath(os.path.join(cwd, "..", ".."))

# Add to sys.path if not already there
if grandparent_dir not in sys.path:
    sys.path.insert(0, grandparent_dir)

from SpotmicroEnv import SpotmicroEnv
from reward_function import reward_function, RewardState

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

pybullet build time: Apr  4 2025 18:56:19


# Training
The following cells launch a training session for a policy.
The first cell only loads the necessary assets and sets everything up, the second one starts TensorBoard to visualize useful data about the ongoing training, and the third one runs the training itself.

> NOTE: this directory contains a pretrained policy, `ppo_stand.zip`. If you don't need anything specific, you can skip the training cells and jump to the Evaluation section.

## Parameters
- You can set the name of the policy being trained by assigning it to the `run` variable.
- You can set the number of checkpoints that will be saved by changing the divisor in `save_freq=TOTAL_STEPS // 5` within `checkpoint_callback`.
- You can adjust the learning rate, entropy coefficient, clip range, and the other hyperparameters in the `PPO(...)` call near the end of the setup cell.
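
To get a feel for these settings, the sketch below re-declares `clipped_linear_schedule` as it appears in the setup cell and evaluates it at a few points. Stable-Baselines3 calls the schedule with a `progress_remaining` value that goes from 1 at the start of training to 0 at the end. The sketch also lists the checkpoint archives one would expect for the default values. The helper `checkpoint_paths` is hypothetical: it only mirrors the `<name_prefix>_<steps>_steps.zip` naming used by `CheckpointCallback` and assumes a single, non-vectorized environment.

```python
def clipped_linear_schedule(initial_value, min_value=1e-5):
    """Linear decay from initial_value, floored at min_value (same as the setup cell)."""
    def schedule(progress_remaining):
        return max(progress_remaining * initial_value, min_value)
    return schedule


def checkpoint_paths(run, total_steps, n_checkpoints):
    """Hypothetical helper: expected checkpoint archives for a single environment."""
    save_freq = total_steps // n_checkpoints
    return [
        f"{run}_checkpoints/ppo_{run}_{k * save_freq}_steps.zip"
        for k in range(1, n_checkpoints + 1)
    ]


lr = clipped_linear_schedule(3e-4)
for progress in (1.0, 0.5, 0.0):
    print(f"progress_remaining={progress:.1f} -> lr={lr(progress):.2e}")

for path in checkpoint_paths("stand3", 3_000_000, 5):
    print(path)
```

The floor `min_value` keeps the learning rate from reaching exactly zero at the very end of training.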

## The reward function
The reward function is defined in `reward_function.py` and is crucial to the success of the experiment. In this case, just two terms are sufficient to define the desired behaviour:
- The agent is given a reward inversely proportional to the average magnitude of the action: the closer the action average is to 0 (which corresponds to the home position), the higher the reward
- The agent is given a penalty proportional to the mean squared relative effort applied to the joints, where the effort on each joint is normalized by the maximum torque allowed for that joint

Together, these two terms "teach" the robot to stand in a comfortable position, without jittering or moving, using as little energy as possible.
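
The actual implementation lives in `reward_function.py` and is not reproduced here. As a rough illustration only, the two terms could be combined along the following lines. The function name `standing_reward_sketch`, its signature, the `1 / (1 + x)` shaping, the weights, and the joint count of 12 are all assumptions made for this example.

```python
import numpy as np


def standing_reward_sketch(action, torques, max_torques,
                           action_weight=1.0, effort_weight=0.5):
    """Illustrative two-term reward; NOT the function in reward_function.py.

    action      : normalized joint commands, where 0 means the home position
    torques     : torque currently applied to each joint
    max_torques : maximum torque allowed for each joint
    """
    action = np.asarray(action, dtype=float)
    torques = np.asarray(torques, dtype=float)
    max_torques = np.asarray(max_torques, dtype=float)

    # Term 1: larger when the mean action magnitude is close to 0 (assumed shaping).
    action_reward = 1.0 / (1.0 + np.mean(np.abs(action)))

    # Term 2: mean squared relative effort, each joint scaled by its own torque limit.
    effort = torques / max_torques
    effort_penalty = np.mean(effort ** 2)

    return action_weight * action_reward - effort_weight * effort_penalty


# Resting at home with no effort should score higher than a large, high-effort action.
n_joints = 12  # assumed joint count, for illustration only
at_rest = standing_reward_sketch(np.zeros(n_joints), np.zeros(n_joints), np.ones(n_joints))
straining = standing_reward_sketch(np.full(n_joints, 0.8), np.full(n_joints, 0.9), np.ones(n_joints))
print(at_rest, straining)
```

Dividing each torque by its own limit keeps joints with different torque ranges on the same scale, so no single joint dominates the penalty.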

In [1]:
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.logger import configure

# ========= CONFIG ==========
TOTAL_STEPS = 3_000_000
run = "stand3"
log_dir = f"./logs/{run}"

def clipped_linear_schedule(initial_value, min_value=1e-5):
    def schedule(progress_remaining):
        return max(progress_remaining * initial_value, min_value)
    return schedule

checkpoint_callback = CheckpointCallback(
    save_freq=TOTAL_STEPS // 5,
    save_path=f"{run}_checkpoints",
    name_prefix=f"ppo_{run}"
)

# ========= ENV ==========
env = SpotmicroEnv(
    use_gui=False,
    reward_fn=reward_function, 
    reward_state=RewardState(), 
    dest_save_file=f"{run}.pkl"
)
check_env(env, warn=True)

# ========= MODEL ==========
model = PPO(
    "MlpPolicy", 
    env,
    verbose=0,   # no default printouts
    learning_rate=clipped_linear_schedule(3e-4),
    ent_coef=0.001,
    clip_range=0.1,
    tensorboard_log=log_dir,
)

# Custom logger: ONLY csv + tensorboard (no stdout table)
new_logger = configure(log_dir, ["csv", "tensorboard"])
model.set_logger(new_logger)


In [3]:
%load_ext tensorboard
%tensorboard --logdir ./logs

In [4]:
# ========= TRAIN ==========
model.learn(
    total_timesteps=TOTAL_STEPS,
    reset_num_timesteps=False,
    callback=checkpoint_callback
)
model.save(f"ppo_{run}")
env.close()

# Evaluation
The following cells let you test the policy we have just trained. All you have to do is assign the name of the trained policy to the `policy` variable.
You can then run the second-to-last cell as many times as you want and observe a single episode until termination. When you are done, execute the last cell to clean everything up.

> If you run into an unexpected error, try restarting the kernel of this Jupyter notebook first (PyBullet's cleanup is not always reliable).
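
The single-episode loop used for evaluation can also be factored into a small helper. The sketch below assumes a Gymnasium-style environment (`reset()` returns `(obs, info)` and `step()` returns a 5-tuple). The names `run_episode` and `CountdownEnv` are hypothetical, and the stand-in environment exists only so that the example runs without PyBullet.

```python
def run_episode(env, predict, max_steps=3000):
    """Step a Gymnasium-style env until it terminates, truncates, or hits max_steps.

    predict : callable mapping an observation to an action
    Returns (steps_taken, total_reward, finished_early).
    """
    obs, _ = env.reset()
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        obs, reward, terminated, truncated, _ = env.step(predict(obs))
        total_reward += float(reward)
        if terminated or truncated:
            return step, total_reward, True
    return max_steps, total_reward, False


class CountdownEnv:
    """Hypothetical stand-in env that terminates after a fixed number of steps."""

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= self.episode_length, False, {}


steps, total, finished = run_episode(CountdownEnv(5), predict=lambda obs: 0)
print(steps, total, finished)
```

With a loaded `model` and a `SpotmicroEnv` instance, the equivalent call would be `run_episode(env, lambda obs: model.predict(obs, deterministic=True)[0])`.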

In [2]:
policy = "stand"

env = SpotmicroEnv(
    use_gui=True, 
    reward_fn=reward_function,
    reward_state=RewardState(),
    src_save_file=f"{policy}.pkl"
    )
obs, _ = env.reset()

# === Load model ===
model = PPO.load(f"ppo_{policy}")
#model = PPO.load(f"{policy}_checkpoints/ppo_{policy}_3000000_steps")
base_steps = env.num_steps

t0 = time.time()
for _ in range(3001):
    action, _ = model.predict(obs, deterministic=True)
    #action = np.array([j.from_position_to_action(hp) for j, hp in zip(env.agent.motor_joints, env.agent.homing_positions)])
    obs, reward, terminated, truncated, info = env.step(action)
    
    time.sleep(1/60.)
    if terminated or truncated:
        print("Terminated")
        env.plot_reward_components()  # plot per episode
        obs, _ = env.reset()
        print(f"Num steps: {env.num_steps - base_steps}")
        break
    
t1 = time.time()
print(f"Elapsed real time: {t1-t0}")

env.close()


# What is next?
The next important step towards walking is getting the policy to move at all. This is not a trivial task, since it involves designing a reward function that makes moving more attractive than both standing still and falling flat. This task is explored in the notebook inside the `shuffling` directory.