# Standing behaviour
> This directory gathers all the building blocks needed to achieve the most basic of learnable behaviours: standing still.
> This notebook, along with the contents of the whole directory, is intended to be read as a tutorial and as documentation of the results achieved so far, and as such it should NOT be modified. __To experiment with this task, I suggest creating a copy of this directory and modifying that instead.__


 ## Contents
 The directory contains:
 - This notebook
 - The file `reward_function.py`, which defines a reward function tailored towards achieving standing behaviour with deep reinforcement learning
 - `ppo_stand.zip`, a pretrained policy that exhibits standing behaviour
 - Three config files (env, agent and terrain config) that define all the parameters needed for this experiment. The homing positions of each joint, defined in `agentConfig.yaml`, are particularly important, since they determine the pose the robot assumes when resting
 - `stand.pkl`, a "state" file that is produced at the end of a training session and is used by the env to retain some necessary data, such as the length of the training session (in number of steps)

N.B.: the file `SpotmicroEnv.py` is a custom module that defines the whole training environment. It has to be located in the "grandparent" directory of this one for the notebook to be able to find it.

## Use
This notebook, along with the content of this directory, is intended to show the basic workflow of training and testing a simple policy.
It also serves as documentation, since this policy will be built upon in later experiments.

The notebook is organized as follows:
- The first cell has to be executed every time, and imports almost all the dependencies needed
- The first section is dedicated to training a custom policy. To experiment yourself, copy the contents of this directory into another one and try tweaking the reward function or the hyperparameters. Otherwise, I suggest sticking to the base policy and skipping to the next part
- The last section is dedicated to testing, i.e. exploring the results obtained and analyzing the resulting policy

In [7]:
import time
import numpy as np
from stable_baselines3 import PPO

import sys
import os

# Start from the current working directory (where notebook is)
cwd = os.getcwd()

# Go two levels up (to the "grandparent")
grandparent_dir = os.path.abspath(os.path.join(cwd, "..", ".."))

# Add to sys.path if not already there
if grandparent_dir not in sys.path:
    sys.path.insert(0, grandparent_dir)

from SpotmicroEnv import SpotmicroEnv
from reward_function import reward_function, RewardState

# Training
The following cells launch a training session for a policy.
The first cell only loads the necessary assets and sets everything up, the second one starts TensorBoard to visualize useful data about the ongoing training, and the third one runs the training itself.

> NOTE: this directory contains a pre-trained policy "stand". You can skip the training cells if you don't need anything specific, and jump to the testing section

## Parameters
- You can set the name of the policy being trained by assigning it to the `run` variable.
- You can set the number of checkpoints that will be saved by changing the divisor in `save_freq` inside `checkpoint_callback` (`TOTAL_STEPS // 10` saves about 10 checkpoints).
- You can adjust the learning rate, entropy coefficient, clip range and the rest of the hyperparameters in the `PPO(...)` call near the end of the setup cell.
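
The learning rate in the setup cell is not a constant but a schedule. As a rough, standalone sketch of how `clipped_linear_schedule` behaves (Stable-Baselines3 calls the schedule with `progress_remaining`, which goes from 1.0 at the start of training to 0.0 at the end), the snippet below repeats the same definition and prints a few sample values. The sample points are chosen only for illustration.

```python
def clipped_linear_schedule(initial_value, min_value=1e-5):
    """Linear decay from initial_value that never drops below min_value."""
    def schedule(progress_remaining):
        # progress_remaining: 1.0 at the start of training, 0.0 at the end
        return max(progress_remaining * initial_value, min_value)
    return schedule


lr = clipped_linear_schedule(3e-4)

# Illustrative sample points along the training run
for progress in (1.0, 0.5, 0.1, 0.0):
    print(f"progress_remaining={progress:.1f} -> learning_rate={lr(progress):.2e}")
```

The floor `min_value` keeps the learning rate from reaching zero in the last part of the run, so the final updates still have some effect.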

In [2]:
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.logger import configure

# ========= CONFIG ==========
TOTAL_STEPS = 3_000_000
run = "stand"
log_dir = f"./logs/{run}"

def clipped_linear_schedule(initial_value, min_value=1e-5):
    def schedule(progress_remaining):
        return max(progress_remaining * initial_value, min_value)
    return schedule

checkpoint_callback = CheckpointCallback(
    save_freq=TOTAL_STEPS // 10,
    save_path=f"./policies/{run}_checkpoints",
    name_prefix=f"ppo_{run}"
)

# ========= ENV ==========
env = SpotmicroEnv(
    use_gui=False,
    reward_fn=reward_function, 
    reward_state=RewardState(), 
    dest_save_file=f"states/{run}.pkl"
)
check_env(env, warn=True)

# ========= MODEL ==========
model = PPO(
    "MlpPolicy", 
    env,
    verbose=0,   # no default printouts
    learning_rate=clipped_linear_schedule(3e-4),
    ent_coef=0.002,
    clip_range=0.1,
    tensorboard_log=log_dir,
)

# Custom logger: ONLY csv + tensorboard (no stdout table)
new_logger = configure(log_dir, ["csv", "tensorboard"])
model.set_logger(new_logger)


In [3]:
%load_ext tensorboard
%tensorboard --logdir ./logs

In [None]:

# ========= TRAIN ==========
model.learn(
    total_timesteps=TOTAL_STEPS,
    reset_num_timesteps=False,
    callback=checkpoint_callback
)
model.save(f"policies/ppo_{run}")
env.close()
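
To get a feel for what the checkpoint settings produce, here is a small sketch that computes the checkpoint files one would expect after a full run. It assumes a single (non-vectorized) environment, so that `save_freq` counts environment steps directly, and it assumes the `<name_prefix>_<num_timesteps>_steps.zip` naming used by `CheckpointCallback`. The variable `n_checkpoints` is a hypothetical name introduced here for clarity; the setup cell hard-codes the value 10.

```python
import os

TOTAL_STEPS = 3_000_000
run = "stand"
n_checkpoints = 10  # hypothetical name; the setup cell uses TOTAL_STEPS // 10

save_freq = TOTAL_STEPS // n_checkpoints
save_path = f"./policies/{run}_checkpoints"

# Assumed naming pattern: <name_prefix>_<num_timesteps>_steps.zip
expected_checkpoints = [
    os.path.join(save_path, f"ppo_{run}_{step}_steps.zip")
    for step in range(save_freq, TOTAL_STEPS + 1, save_freq)
]

for path in expected_checkpoints:
    print(path)
```

Any of these files can then be passed to `PPO.load(...)` in place of the final policy, which is handy for comparing behaviour at different stages of training.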

# Testing
The following cells let you test the policy we have just trained. All we have to do is assign the name of the policy we have trained to the `policy` variable.
You can then run the second-to-last cell as many times as you want, and observe a single episode until it ends. When you are done, execute the last cell to clean everything up.

> If you run into an unexpected error (for example `Not connected to physics server`), try restarting the kernel of this Jupyter notebook first (pybullet is kind of messy in its cleanup phase)
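
One way to make cleanup less fragile is to guarantee that `close()` runs even when a cell raises midway through an episode. The sketch below shows the pattern with `contextlib.closing` from the standard library. `StubEnv` is a hypothetical stand-in used only to keep the example runnable without pybullet; it assumes, like `SpotmicroEnv`, that the environment exposes a `close()` method.

```python
from contextlib import closing


class StubEnv:
    """Hypothetical stand-in for SpotmicroEnv; only tracks whether close() ran."""

    def __init__(self):
        self.closed = False

    def step(self, action):
        # Simulate a failure in the middle of an episode
        raise RuntimeError("simulated failure mid-episode")

    def close(self):
        self.closed = True


env = StubEnv()
try:
    # closing() calls env.close() on exit, whether or not an exception occurred
    with closing(env):
        env.step(action=None)
except RuntimeError as err:
    print(f"episode aborted: {err}")

print("env closed:", env.closed)
```

This may not remove the need for an occasional kernel restart, but it avoids leaving an environment open after an exception.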

In [5]:
policy = "stand"

env = SpotmicroEnv(
    use_gui=True, 
    reward_fn=reward_function,
    src_save_file=f"{policy}.pkl"
    )
obs, _ = env.reset()

model = PPO.load(f"ppo_{policy}")
print("Loaded env")

startThreads creating 1 threads.
starting thread 0
started thread 0 
argc=2
argv[0] = --unused
argv[1] = --start_demo_name=Physics Server
ExampleBrowserThreadFunc started
X11 functions dynamically loaded using dlopen/dlsym OK!
X11 functions dynamically loaded using dlopen/dlsym OK!
Creating context
Created GL 3.3 context
Direct GLX rendering context obtained
Making context current
GL_VENDOR=AMD
GL_RENDERER=AMD Radeon Graphics (radeonsi, renoir, LLVM 19.1.1, DRM 3.59, 6.11.0-21-generic)
GL_VERSION=4.6 (Core Profile) Mesa 24.2.8-1ubuntu1~24.04.1
GL_SHADING_LANGUAGE_VERSION=4.60
pthread_getconcurrency()=0
Version = 4.6 (Core Profile) Mesa 24.2.8-1ubuntu1~24.04.1
Vendor = AMD
Renderer = AMD Radeon Graphics (radeonsi, renoir, LLVM 19.1.1, DRM 3.59, 6.11.0-21-generic)
b3Printf: Selected demo: Physics Server
startThreads creating 1 threads.
starting thread 0
started thread 0 
MotionThreadFunc thread started
ven = AMD
ven = AMD

b3Printf: No inertial data for link, using mass=1, localinertiadi

In [6]:
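terminated = False
truncated = False

# Run a single episode: stop on termination or on truncation (e.g. a time limit)
while not (terminated or truncated):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    time.sleep(1/60.)  # slow the loop down to roughly 60 steps per second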

#env.plot_reward_components()
obs, _ = env.reset()

error: Not connected to physics server.

In [None]:
env.close()