This repository now supports an end-to-end PPO workflow:
- Create/use Miniforge conda env biped311.
- Run a smoke test to validate physics + control.
- Train PPO for 100k timesteps.
- Evaluate the saved model.
- Render the learned gait in PyBullet GUI.
See NEXT_STEPS.md for a concrete execution plan after initial setup.
conda create -n biped311 python=3.11 -y
conda activate biped311
pip install -r requirements.txt
python -m ppo.smoke_test --episodes 2 --steps 300
python -m ppo.train --timesteps 100000 --run-name ppo_100k

Expected artifact:
artifacts/models/ppo_100k.zip
python -m ppo.eval --model artifacts/models/ppo_100k.zip --episodes 5 --max-steps 2000
python -m ppo.render --model artifacts/models/ppo_100k.zip --episodes 2 --max-steps 2000

If the humanoid is sideways, in-ground, or appears rigid:
- Pull latest code (ppo/env.py), where reset now rebuilds the simulation world each episode.
- Ensure default joint motors are disabled before torque control (see the sketch after this list).
- Re-run smoke test before training.
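For reference, PyBullet joints ship with default velocity motors that absorb applied torques. A minimal sketch of the disable step (the URDF path and `robot_id` setup here are illustrative; the actual code lives in ppo/env.py):

```python
import pybullet as p

# Illustrative setup; ppo/env.py handles connection and loading itself.
p.connect(p.DIRECT)
robot_id = p.loadURDF("humanoid.urdf")  # placeholder URDF path

# Release the default velocity motors so torque control takes effect.
for joint_index in range(p.getNumJoints(robot_id)):
    p.setJointMotorControl2(
        bodyUniqueId=robot_id,
        jointIndex=joint_index,
        controlMode=p.VELOCITY_CONTROL,
        force=0,  # zero max force disables the built-in motor
    )
```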
We train a bipedal humanoid controller in PyBullet using reinforcement learning. The current implementation focuses on PPO with configurable reward components for forward velocity, survival, energy usage, and fall penalties.
- ppo/env.py — Gymnasium-compatible PyBullet humanoid environment.
- ppo/config.py — training and reward hyperparameters.
- ppo/train.py — PPO training entrypoint.
- ppo/smoke_test.py — random-action environment verification.
- ppo/eval.py — deterministic model evaluation.
- ppo/render.py — GUI gait playback for trained models.
- TESTING.md — step-by-step local testing guide.

This project currently uses:
- Miniforge / conda environment biped311
- pybullet from conda-forge
- CPU torch
- OpenMP workaround on Windows
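For orientation, the hyperparameter file might be organized roughly as follows (all names and defaults below are illustrative placeholders, not the actual values in ppo/config.py):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Illustrative PPO hyperparameters; see ppo/config.py for real values.
    total_timesteps: int = 100_000
    learning_rate: float = 3e-4
    n_steps: int = 2048
    batch_size: int = 64
    gamma: float = 0.99

@dataclass
class RewardConfig:
    # Illustrative weights for the reward components described below.
    forward_velocity_weight: float = 1.0
    survival_bonus: float = 0.05
    energy_penalty_weight: float = 0.001
    fall_penalty: float = 10.0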
Walking on two legs is a difficult control problem, but solving it is essential for effective humanoid and terrain-capable robotics. This motion, called bipedal locomotion, is inherently unstable and requires continuous feedback control.
The challenge is amplified by:
- Nonlinear dynamics,
- High-dimensional state and action spaces, and
- Frequent contact changes with the ground.
Rather than hand-designing a controller, this project uses reinforcement learning (RL) in simulation. The agent learns to walk in a PyBullet environment by maximizing rewards for speed, stability, and energy efficiency.
✅ Initial PPO development scaffold is implemented:
- Gymnasium-compatible PyBullet humanoid environment (ppo/env.py),
- Configurable reward structure and training hyperparameters (ppo/config.py),
- PPO training entrypoint using Stable-Baselines3 (ppo/train.py).
For step-by-step pull/setup/smoke-test instructions, see TESTING.md.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m ppo.train --timesteps 100000 --run-name ppo_smoke

Artifacts are saved to:
- artifacts/models/
- artifacts/logs/
- artifacts/tensorboard/
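To inspect training curves, point TensorBoard at the log directory (assuming TensorBoard is installed):

tensorboard --logdir artifacts/tensorboard/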
We simulate a humanoid robot from a URDF model in PyBullet and wrap the simulator in a Gymnasium-compatible environment for use with standard RL libraries.
The policy receives:
- Torso position and orientation,
- Linear and angular velocity,
- Joint angles, and
- Joint velocities.
A continuous action vector specifies target joint torques for each actuated joint, subject to maximum torque limits.
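As a hedged illustration of how these pieces map to PyBullet calls (the actual implementation lives in ppo/env.py; function names here are illustrative):

```python
import numpy as np
import pybullet as p

def get_observation(robot_id, joint_indices):
    # Torso pose: world position (3) and orientation quaternion (4).
    pos, orn = p.getBasePositionAndOrientation(robot_id)
    # Torso velocities: linear (3) and angular (3).
    lin_vel, ang_vel = p.getBaseVelocity(robot_id)
    # Angle and velocity for each actuated joint.
    states = p.getJointStates(robot_id, joint_indices)
    joint_pos = [s[0] for s in states]
    joint_vel = [s[1] for s in states]
    return np.concatenate(
        [pos, orn, lin_vel, ang_vel, joint_pos, joint_vel]
    ).astype(np.float32)

def apply_action(robot_id, joint_indices, action, max_torque):
    # Clip the continuous action to the torque limits, then apply as torques.
    torques = np.clip(action, -max_torque, max_torque)
    for j, tau in zip(joint_indices, torques):
        p.setJointMotorControl2(robot_id, j, p.TORQUE_CONTROL, force=float(tau))
```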
The reward function combines:
- Forward-velocity tracking toward a target speed and direction,
- Survival bonus per timestep,
- Penalty for energy consumption,
- Large fall penalty with early episode termination.
Reward weights are treated as tunable hyperparameters.
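A minimal sketch of how these components could combine, with illustrative weights (the real weights live in ppo/config.py):

```python
import numpy as np

def compute_reward(forward_velocity, target_velocity, torques, fell,
                   w_vel=1.0, w_alive=0.05, w_energy=0.001, fall_penalty=10.0):
    # Forward-velocity tracking: penalize deviation from the target speed.
    r_vel = -w_vel * abs(forward_velocity - target_velocity)
    # Survival bonus per timestep rewards staying upright.
    r_alive = w_alive
    # Energy penalty discourages large torques.
    r_energy = -w_energy * float(np.sum(np.square(torques)))
    # Large fall penalty; the episode also terminates early on a fall.
    r_fall = -fall_penalty if fell else 0.0
    return r_vel + r_alive + r_energy + r_fall
```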
This project compares two policy-gradient methods implemented in Stable-Baselines3:
- PPO (on-policy): uses a clipped surrogate objective for stable policy updates.
- SAC (off-policy actor-critic): maximizes expected return and policy entropy to improve exploration.
Both methods use MLP policy/value networks trained with gradient descent, linking directly to optimization methods covered in course material.
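Both algorithms share the same Stable-Baselines3 interface, so swapping them is a one-line change. A minimal sketch (the environment class name is a hypothetical stand-in for whatever ppo/env.py exports):

```python
from stable_baselines3 import PPO, SAC

from ppo.env import HumanoidEnv  # hypothetical class name
env = HumanoidEnv()  # any Gymnasium-compatible env works here

model = PPO("MlpPolicy", env, verbose=1,
            tensorboard_log="artifacts/tensorboard/")
# model = SAC("MlpPolicy", env, verbose=1)  # drop-in off-policy alternative
model.learn(total_timesteps=100_000)
model.save("artifacts/models/ppo_demo")
```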
We will compare PPO and SAC using:
- Cumulative reward,
- Forward distance traveled,
- Timesteps before falling,
- Energy cost per meter,
- Wall-clock training time and sample efficiency.
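Several of these metrics can be collected in a single rollout. A sketch, assuming the env reports torso x-position through the step info dict (the "torso_x" key is hypothetical):

```python
import numpy as np

def rollout_metrics(model, env, max_steps=2000):
    obs, info = env.reset()
    start_x = info.get("torso_x", 0.0)
    total_reward, energy, steps = 0.0, 0.0, 0
    for _ in range(max_steps):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        energy += float(np.sum(np.square(action)))  # torque-based energy proxy
        steps += 1
        if terminated or truncated:
            break
    distance = info.get("torso_x", start_x) - start_x
    return {
        "cumulative_reward": total_reward,
        "forward_distance": distance,
        "steps_before_fall": steps,
        "energy_per_meter": energy / max(distance, 1e-6),
    }
```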
Key ethical risks include:
- Simulation-to-real transfer risk: policies that are safe in simulation may fail unpredictably on hardware.
- Mitigation: enforce torque/joint limits and safety constraints before real-world deployment.
- Potential misuse: bipedal robots may be deployed for surveillance, raising privacy concerns.
- Socioeconomic impacts: increased automation can contribute to job displacement in some sectors.
- Software: Python, PyBullet, Stable-Baselines3, Gymnasium, PyTorch, Matplotlib.
- Compute: personal machines with GPU acceleration.
- Expected training budget: approximately 1–5 million timesteps, typically several hours on modern GPUs.
- Coumans & Bai, PyBullet Physics Simulation.
- Fujimoto et al., TD3 (potential additional baseline).
- Haarnoja et al., Soft Actor-Critic (SAC).
- Raffin et al., Stable-Baselines3.
- Schulman et al., Proximal Policy Optimization (PPO).
- Sutton & Barto, Reinforcement Learning: An Introduction.
- Trained PPO and SAC policies that achieve stable bipedal walking,
- Training curves and comparative PPO vs. SAC analysis,
- Reward ablation study isolating reward-component effects,
- Video recordings of gait behavior at multiple training stages,
- Final report with ethical discussion of sim-to-real deployment risk.
