# Trotting behaviour
The purpose of this notebook is to teach a policy that has already learned to _stand_ how to move.
This is the first step towards walking, and it usually produces a policy that either drags its feet along the ground (shuffling) or moves with small rhythmic jumps (trotting).
In this case we obtained the latter, which we hope to refine into proper walking in the next notebook.
As before, we provide a training section and an evaluation section, along with the reward function and the "stand" policy.

In [3]:
import time
import numpy as np
from stable_baselines3 import PPO

import sys
import os

# Start from the current working directory (where notebook is)
cwd = os.getcwd()

# Go two levels up (to the "grandparent")
grandparent_dir = os.path.abspath(os.path.join(cwd, "..", ".."))

# Add to sys.path if not already there
if grandparent_dir not in sys.path:
    sys.path.insert(0, grandparent_dir)

from SpotmicroEnv import SpotmicroEnv
from reward_function import reward_function, RewardState

# Training
The training process is functionally the same as the one in the "standing" notebook. The reward function, however, is considerably more complex than the one used there, because the final goal is more demanding.

## Reward function
The reward function for this notebook was designed to get a policy that wants to stand still to start moving.
We want to reward following specific directions, but motion itself has to be rewarded above everything else. For this reason, the only positive term in the reward function comes from tracking the reference velocity. All other components are penalties, whose weights add up to slightly less than the weight of the reward. This prevents the robot from "committing suicide", i.e. settling into a local optimum in which it cuts all penalties short by terminating the episode early.

The penalties ensure that the robot:
- Follows the reference angular velocity
- Stays at a proper height and with a proper posture
- Does not drift from the target direction
- Uses actions that are as small as possible
- Stays as close to the homing position as possible
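
The weighting idea above can be sketched in a few lines. This is only an illustration, not the actual `reward_function` imported from `reward_function.py`: the weight values, the penalty names, the `sigma` parameter and the exponential tracking term are all assumptions made for the example. The point it shows is that, because the penalty weights sum to slightly less than the reward weight, a step with good velocity tracking stays positive even when every penalty is at its maximum.

```python
import math

# Hypothetical weights, chosen only for illustration.
REWARD_WEIGHT = 1.0
PENALTY_WEIGHTS = {
    "angular_velocity": 0.25,
    "height_and_posture": 0.25,
    "drift": 0.20,
    "action_magnitude": 0.10,
    "homing_distance": 0.15,
}  # sums to roughly 0.95, slightly less than REWARD_WEIGHT


def tracking_reward(velocity, target_velocity, sigma=0.25):
    """Assumed tracking term: 1.0 for perfect tracking, decaying with the error."""
    error = velocity - target_velocity
    return math.exp(-(error ** 2) / sigma)


def total_reward(velocity, target_velocity, penalties):
    """Combine the single positive term with the weighted penalties.

    `penalties` maps a penalty name to a value that is clamped to [0, 1].
    """
    reward = REWARD_WEIGHT * tracking_reward(velocity, target_velocity)
    for name, weight in PENALTY_WEIGHTS.items():
        value = min(max(penalties.get(name, 0.0), 0.0), 1.0)
        reward -= weight * value
    return reward


# Worst case for the penalties, perfect tracking: the step reward is still positive.
worst_case = {name: 1.0 for name in PENALTY_WEIGHTS}
print(total_reward(0.5, 0.5, worst_case) > 0)
```

With these assumed numbers, terminating early can only forfeit future positive reward as long as the velocity is tracked well, so ending the episode is not an attractive local optimum. When tracking is poor the total can still go negative, which is why motion has to carry the largest weight.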

In [4]:
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.logger import configure

# ========= CONFIG ==========
TOTAL_STEPS = 13_000_000
run = "trot"
base = "stand"

log_dir = f"./logs/{run}"

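def clipped_linear_schedule(initial_value, min_value=1e-5):
    """Linear decay from initial_value that never drops below min_value."""
    def schedule(progress_remaining):
        # progress_remaining is the fraction of training left; it decreases towards 0
        return max(progress_remaining * initial_value, min_value)
    return schedule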

checkpoint_callback = CheckpointCallback(
    save_freq=TOTAL_STEPS // 13,
    save_path=f"{run}_checkpoints",
    name_prefix=f"ppo_{run}"
)

# ========= ENV ==========
env = SpotmicroEnv(
    use_gui=False,
    reward_fn=reward_function, 
    reward_state=RewardState(),
    src_save_file=f"{base}.pkl",
    dest_save_file=f"{run}.pkl"
)
check_env(env, warn=True)

# ========= MODEL ==========
model = PPO.load(f"ppo_{base}")
model.set_env(env)
model.tensorboard_log = log_dir

# Custom logger: ONLY csv + tensorboard (no stdout table)
new_logger = configure(log_dir, ["csv", "tensorboard"])
model.set_logger(new_logger)



In [2]:
%load_ext tensorboard
%tensorboard --logdir ./logs

In [None]:
model.learn(
    total_timesteps=TOTAL_STEPS,
    reset_num_timesteps=False,
    callback=checkpoint_callback
)
model.save(f"ppo_{run}")
env.close()
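
The `clipped_linear_schedule` helper defined in the config cell is not passed to the model in the cells above, so its effect is easy to overlook. The sketch below repeats the helper and prints the learning rate it would return at a few points of training. The initial value of `3e-4` is an assumption for the example, not a value taken from this notebook.

```python
def clipped_linear_schedule(initial_value, min_value=1e-5):
    """Linear decay from initial_value that never drops below min_value."""
    def schedule(progress_remaining):
        return max(progress_remaining * initial_value, min_value)
    return schedule


# 3e-4 is an assumed initial learning rate, used only for illustration.
schedule = clipped_linear_schedule(3e-4)

# progress_remaining is the fraction of training still to go.
for progress_remaining in (1.0, 0.5, 0.0):
    print(progress_remaining, schedule(progress_remaining))
```

The clipping matters at the end of training: a plain linear schedule would reach a learning rate of zero, whereas this one bottoms out at `min_value`. To actually use such a schedule when fine-tuning a loaded model, one option might be the `custom_objects` argument of `PPO.load`, but that is not what this notebook does.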

# Evaluation
The evaluation process is exactly the same as shown in the walking notebook.

## Results
The resulting policy exhibits promising behaviour that closely resembles walking and only needs to be refined and made robust against rougher terrain.

In [2]:
policy = "trot"

env = SpotmicroEnv(
    use_gui=True, 
    reward_fn=reward_function,
    reward_state=RewardState(),
    src_save_file=f"{policy}.pkl"
    )
obs, _ = env.reset()

# === Load model ===
model = PPO.load(f"ppo_{policy}", device="cpu")
#model = PPO.load(f"{policy}_checkpoints/ppo_{policy}_16001216_steps")

# === Run rollout ===
for _ in range(3001):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        print("Terminated")
        env.plot_reward_components()
        obs, _ = env.reset()
    time.sleep(1/60)

env.close()

error: Cannot load URDF file.

# Next steps
The next step is to refine this policy so that the stepping movement becomes more natural and more robust. This will be done by modelling rough terrain to train the robot on, so that the policy has to lift its legs higher and be more cautious overall.