This repository now supports an end-to-end PPO workflow:
- Create/use Miniforge conda env biped311.
- Run a smoke test to validate physics + control.
- Train PPO for 100k timesteps.
- Evaluate the saved model.
- Render the learned gait in PyBullet GUI.
See NEXT_STEPS.md for a concrete execution plan after initial setup.
conda create -n biped311 python=3.11 -y
conda activate biped311
pip install -r requirements.txt
python -m ppo.smoke_test --episodes 2 --steps 300
python -m ppo.train --timesteps 100000 --run-name ppo_100k

Expected artifact:
artifacts/models/ppo_100k.zip
python -m ppo.eval --model artifacts/models/ppo_100k.zip --episodes 5 --max-steps 2000
python -m ppo.render --model artifacts/models/ppo_100k.zip --episodes 2 --max-steps 2000

If the humanoid is sideways, in-ground, or appears rigid:
- Pull latest code (ppo/env.py), where reset now rebuilds the simulation world each episode.
- Ensure default joint motors are disabled before torque control (see the sketch after this list).
- Re-run smoke test before training.
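For reference, PyBullet joints ship with default velocity motors that absorb applied torques. A minimal sketch of the disable step (the URDF path and `robot_id` setup here are illustrative; the actual code lives in ppo/env.py):

```python
import pybullet as p

# Illustrative setup; ppo/env.py handles connection and loading itself.
p.connect(p.DIRECT)
robot_id = p.loadURDF("humanoid.urdf")  # placeholder URDF path

# Release the default velocity motors so torque control takes effect.
for joint_index in range(p.getNumJoints(robot_id)):
    p.setJointMotorControl2(
        bodyUniqueId=robot_id,
        jointIndex=joint_index,
        controlMode=p.VELOCITY_CONTROL,
        force=0,  # zero max force disables the built-in motor
    )
```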
We train a bipedal humanoid controller in PyBullet using reinforcement learning. The current implementation focuses on PPO with configurable reward components for forward velocity, survival, energy usage, and fall penalties.
- ppo/env.py — Gymnasium-compatible PyBullet humanoid environment.
- ppo/config.py — training and reward hyperparameters.
- ppo/train.py — PPO training entrypoint.
- ppo/smoke_test.py — random-action environment verification.
- ppo/eval.py — deterministic model evaluation.
- ppo/render.py — GUI gait playback for trained models.
- TESTING.md — step-by-step local testing guide.

This project currently uses:
- Miniforge / conda environment biped311
- pybullet from conda-forge
- CPU torch
- OpenMP workaround on Windows
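For orientation, the hyperparameter file might be organized roughly as follows (all names and defaults below are illustrative placeholders, not the actual values in ppo/config.py):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Illustrative PPO hyperparameters; see ppo/config.py for real values.
    total_timesteps: int = 100_000
    learning_rate: float = 3e-4
    n_steps: int = 2048
    batch_size: int = 64
    gamma: float = 0.99

@dataclass
class RewardConfig:
    # Illustrative weights for the reward components described below.
    forward_velocity_weight: float = 1.0
    survival_bonus: float = 0.05
    energy_penalty_weight: float = 0.001
    fall_penalty: float = 10.0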
Walking on two legs is a difficult control problem, but solving it is essential for effective humanoid and terrain-capable robotics. This motion, called bipedal locomotion, is inherently unstable and requires continuous feedback control.
The challenge is amplified by:
- Nonlinear dynamics,
- High-dimensional state and action spaces, and
- Frequent contact changes with the ground.
Rather than hand-designing a controller, this project uses reinforcement learning (RL) in simulation. The agent learns to walk in a PyBullet environment by maximizing rewards for speed, stability, and energy efficiency.
✅ Initial PPO development scaffold is implemented:
- Gymnasium-compatible PyBullet humanoid environment (ppo/env.py),
- Configurable reward structure and training hyperparameters (ppo/config.py),
- PPO training entrypoint using Stable-Baselines3 (ppo/train.py).
For step-by-step pull/setup/smoke-test instructions, see TESTING.md.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m ppo.train --timesteps 100000 --run-name ppo_smoke

Artifacts are saved to:
- artifacts/models/
- artifacts/logs/
- artifacts/tensorboard/
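To inspect training curves, point TensorBoard at the log directory (assuming TensorBoard is installed):

tensorboard --logdir artifacts/tensorboard/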
We simulate a humanoid robot from a URDF model in PyBullet and wrap the simulator in a Gymnasium-compatible environment for use with standard RL libraries.
The policy receives:
- Torso position and orientation,
- Linear and angular velocity,
- Joint angles, and
- Joint velocities.
A continuous action vector specifies target joint torques for each actuated joint, subject to maximum torque limits.
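As a hedged illustration of how these pieces map to PyBullet calls (the actual implementation lives in ppo/env.py; function names here are illustrative):

```python
import numpy as np
import pybullet as p

def get_observation(robot_id, joint_indices):
    # Torso pose: world position (3) and orientation quaternion (4).
    pos, orn = p.getBasePositionAndOrientation(robot_id)
    # Torso velocities: linear (3) and angular (3).
    lin_vel, ang_vel = p.getBaseVelocity(robot_id)
    # Angle and velocity for each actuated joint.
    states = p.getJointStates(robot_id, joint_indices)
    joint_pos = [s[0] for s in states]
    joint_vel = [s[1] for s in states]
    return np.concatenate(
        [pos, orn, lin_vel, ang_vel, joint_pos, joint_vel]
    ).astype(np.float32)

def apply_action(robot_id, joint_indices, action, max_torque):
    # Clip the continuous action to the torque limits, then apply as torques.
    torques = np.clip(action, -max_torque, max_torque)
    for j, tau in zip(joint_indices, torques):
        p.setJointMotorControl2(robot_id, j, p.TORQUE_CONTROL, force=float(tau))
```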
The reward function combines:
- Forward-velocity tracking toward a target speed and direction,
- Survival bonus per timestep,
- Penalty for energy consumption,
- Large fall penalty with early episode termination.
Reward weights are treated as tunable hyperparameters.
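A minimal sketch of how these components could combine, with illustrative weights (the real weights live in ppo/config.py):

```python
import numpy as np

def compute_reward(forward_velocity, target_velocity, torques, fell,
                   w_vel=1.0, w_alive=0.05, w_energy=0.001, fall_penalty=10.0):
    # Forward-velocity tracking: penalize deviation from the target speed.
    r_vel = -w_vel * abs(forward_velocity - target_velocity)
    # Survival bonus per timestep rewards staying upright.
    r_alive = w_alive
    # Energy penalty discourages large torques.
    r_energy = -w_energy * float(np.sum(np.square(torques)))
    # Large fall penalty; the episode also terminates early on a fall.
    r_fall = -fall_penalty if fell else 0.0
    return r_vel + r_alive + r_energy + r_fall
```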
This project compares two policy-gradient methods implemented in Stable-Baselines3:
- PPO (on-policy): uses a clipped surrogate objective for stable policy updates.
- SAC (off-policy actor-critic): maximizes expected return and policy entropy to improve exploration.
Both methods use MLP policy/value networks trained with gradient descent, linking directly to optimization methods covered in course material.
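Both algorithms share the same Stable-Baselines3 interface, so swapping them is a one-line change. A minimal sketch (the environment class name is a hypothetical stand-in for whatever ppo/env.py exports):

```python
from stable_baselines3 import PPO, SAC

from ppo.env import HumanoidEnv  # hypothetical class name
env = HumanoidEnv()  # any Gymnasium-compatible env works here

model = PPO("MlpPolicy", env, verbose=1,
            tensorboard_log="artifacts/tensorboard/")
# model = SAC("MlpPolicy", env, verbose=1)  # drop-in off-policy alternative
model.learn(total_timesteps=100_000)
model.save("artifacts/models/ppo_demo")
```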
We will compare PPO and SAC using:
- Cumulative reward,
- Forward distance traveled,
- Timesteps before falling,
- Energy cost per meter,
- Wall-clock training time and sample efficiency.
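Several of these metrics can be collected in a single rollout. A sketch, assuming the env reports torso x-position through the step info dict (the "torso_x" key is hypothetical):

```python
import numpy as np

def rollout_metrics(model, env, max_steps=2000):
    obs, info = env.reset()
    start_x = info.get("torso_x", 0.0)
    total_reward, energy, steps = 0.0, 0.0, 0
    for _ in range(max_steps):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        energy += float(np.sum(np.square(action)))  # torque-based energy proxy
        steps += 1
        if terminated or truncated:
            break
    distance = info.get("torso_x", start_x) - start_x
    return {
        "cumulative_reward": total_reward,
        "forward_distance": distance,
        "steps_before_fall": steps,
        "energy_per_meter": energy / max(distance, 1e-6),
    }
```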
Key ethical risks include:
- Simulation-to-real transfer risk: policies that are safe in simulation may fail unpredictably on hardware.
- Mitigation: enforce torque/joint limits and safety constraints before real-world deployment.
- Potential misuse: bipedal robots may be deployed for surveillance, raising privacy concerns.
- Socioeconomic impacts: increased automation can contribute to job displacement in some sectors.
- Software: Python, PyBullet, Stable-Baselines3, Gymnasium, PyTorch, Matplotlib.
- Compute: personal machines with GPU acceleration.
- Expected training budget: approximately 1–5 million timesteps, typically several hours on modern GPUs.
- Coumans & Bai, PyBullet Physics Simulation.
- Fujimoto et al., TD3 (potential additional baseline).
- Haarnoja et al., Soft Actor-Critic (SAC).
- Raffin et al., Stable-Baselines3.
- Schulman et al., Proximal Policy Optimization (PPO).
- Sutton & Barto, Reinforcement Learning: An Introduction.
- Trained PPO and SAC policies that achieve stable bipedal walking,
- Training curves and comparative PPO vs. SAC analysis,
- Reward ablation study isolating reward-component effects,
- Video recordings of gait behavior at multiple training stages,
- Final report with ethical discussion of sim-to-real deployment risk.
