A reinforcement learning lab for robotic simulation on AMD GPUs, currently focused on the robosuite Panda + Lift baseline.
This project was developed, trained, and validated end-to-end on an AMD Ryzen AI Max+ 395 laptop, making full use of its integrated Radeon 8060S GPU for both OpenGL simulation rendering and ROCm / PyTorch AI compute. It has proven to be a very capable portable development platform for robotic RL work.
For the Chinese version of this README, see README_zh.md.
```
ROCm_Robotics_RL_Lab/
├── docs/                        # Blog posts and publishing assets
├── environments/
│   ├── gym_wrapper.py           # robosuite -> Gymnasium adapters
│   └── pick_cube_place_cup.py   # Custom pick-and-place environment prototype
├── scripts/
│   ├── quickstart.py            # Quickstart example
│   ├── train_sac.py             # SAC training script for Panda Lift
│   ├── train_ppo.py             # PPO training script for Panda Lift
│   ├── evaluate.py              # Evaluation and video-recording script
│   ├── seed_sweep.py            # Multi-seed runner
│   └── param_sweep.py           # Parameter sweep runner
├── model_loading.py             # SB3 model loading helpers
├── requirements.txt
├── README.md
└── README_zh.md
```
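The `gym_wrapper.py` adapter bridges robosuite's dict-of-arrays observations to the flat vectors Stable-Baselines3 expects. As a rough illustration only (not the actual implementation in `environments/gym_wrapper.py`), the core flattening step can look like this; the observation keys shown are robosuite's standard `robot0_proprio-state` and `object-state`:

```python
# Illustrative sketch only -- not the code in environments/gym_wrapper.py.
# robosuite returns observations as a dict of per-modality arrays; SB3
# expects a single flat vector, so an adapter concatenates chosen keys
# in a fixed order.
def flatten_obs(obs: dict, keys: list[str]) -> list[float]:
    flat: list[float] = []
    for key in keys:
        flat.extend(float(x) for x in obs[key])
    return flat

# Example with robosuite-style observation keys:
obs = {"robot0_proprio-state": [0.1, 0.2], "object-state": [0.3]}
vector = flatten_obs(obs, ["robot0_proprio-state", "object-state"])
```

The real adapter additionally has to expose a matching Gymnasium `Box` observation space so SB3 can size its networks; the sketch above only shows the data path.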
```bash
# Create virtual environment
uv venv .venv --python 3.12
source .venv/bin/activate

# Install the ROCm build of PyTorch (AMD GPU)
uv pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.1
```

Verify that the GPU is visible:

```bash
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
```

Install dependencies:

```bash
uv pip install -r requirements.txt
uv pip install "stable-baselines3[extra]"
```

```bash
# One-time robosuite initialization
python .venv/lib/python3.12/site-packages/robosuite/scripts/setup_macros.py

cd ROCm_Robotics_RL_Lab
python scripts/quickstart.py
```

This trains a small SAC policy on robosuite's Lift task with the Panda robot.
```bash
# SAC training
python scripts/train_sac.py --total-timesteps 500000 --n-envs 4

# PPO training
python scripts/train_ppo.py --total-timesteps 1000000 --n-envs 8

# Watch actions live during training (single environment)
python scripts/train_sac.py --total-timesteps 500000 --n-envs 1 --render

# Enable the Stable-Baselines3 progress bar explicitly if desired
python scripts/train_sac.py --total-timesteps 500000 --n-envs 4 --progress-bar
```

`train_sac.py` now includes more robust defaults for parallel environments:

- `--save-freq` and `--eval-freq` are automatically scaled by `n_envs` into SB3 callback frequencies, avoiding overly sparse checkpointing and evaluation in vectorized training
- Default hyperparameters are biased toward stable task success instead of quickly exploiting dense shaping rewards: `learning_rate=1e-4`, `batch_size=512`, `learning_starts=20000`, `tau=0.002`, and `gradient_steps=1`
- These defaults reduce critic / actor oscillation and make it less likely for the policy to overfit to shaping rewards before it truly solves Lift
- Training episodes terminate immediately on success, while time limits are treated as truncated, reducing value-learning bias caused by confusing timeouts with true terminal states
- Based on the first round of parameter sweeps, the current success-oriented default adds a `+100` terminal reward on successful episodes (`--success-bonus`) and does not apply an extra timeout penalty (`--timeout-penalty 0`); this configuration performed best on average across multiple seeds
- `best_success` now uses `20` evaluation episodes and a `20%` success threshold by default (`--n-eval-episodes 20 --min-best-success-rate 0.20`), balancing reliability and training speed
- `best_success/best_metrics.json` records the checkpoint `timestep`, `success_rate`, and `mean_reward`, and the same information is printed at the end of training
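The terminated-on-success / truncated-on-timeout split follows Gymnasium's `step()` convention of returning separate `terminated` and `truncated` flags. A minimal sketch of that logic (a hypothetical helper, not taken from `train_sac.py`):

```python
# Hypothetical sketch of the terminated/truncated split described above.
def split_done(success: bool, step: int, horizon: int) -> tuple[bool, bool]:
    terminated = success                          # true terminal state: task solved
    truncated = not success and step >= horizon   # time limit, not a real terminal
    return terminated, truncated
```

Keeping the two flags separate matters because a value function should still bootstrap through a time-limit cutoff but not through a genuine terminal state; collapsing both into one `done` flag is what causes the bias mentioned above.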
If you want a stricter threshold, set it explicitly:

```bash
python scripts/train_sac.py --total-timesteps 500000 --n-envs 4 --n-eval-episodes 20 --min-best-success-rate 0.20
```

If the policy still prefers dense shaping reward over actually finishing the task, you can further strengthen the success signal:

```bash
python scripts/train_sac.py --total-timesteps 500000 --n-envs 4 --n-eval-episodes 20 --min-best-success-rate 0.20 --success-bonus 100 --timeout-penalty 0
```

If you want to spell out the currently recommended stable training configuration, use:

```bash
python scripts/train_sac.py --total-timesteps 500000 --n-envs 4 --learning-rate 1e-4 --batch-size 512 --learning-starts 20000 --gradient-steps 1 --tau 0.002 --n-eval-episodes 20 --min-best-success-rate 0.20 --success-bonus 100 --timeout-penalty 0
```

If you suspect large seed sensitivity, you can batch-run multiple seeds directly:

```bash
python scripts/seed_sweep.py --seeds 42 123 456 --total-timesteps 500000 --n-envs 4
```

This script will:
- call the existing `scripts/train_sac.py` for each seed
- evaluate each run's `best_success` checkpoint first, falling back to the final model if needed
- summarize per-seed training and evaluation results in `models/seed_sweeps/<timestamp>/summary.json`
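Conceptually, the per-seed summary boils down to aggregating each run's evaluation metrics. A hypothetical sketch (the field names here are illustrative, not the exact `summary.json` schema):

```python
import statistics

def summarize_seeds(results: dict) -> dict:
    """Aggregate per-seed evaluation results (illustrative schema:
    seed -> {"success_rate": float}) into a summary-style dict."""
    rates = [r["success_rate"] for r in results.values()]
    return {
        "mean_success_rate": statistics.mean(rates),
        "best_seed": max(results, key=lambda s: results[s]["success_rate"]),
    }
```

Averaging across seeds like this is what lets the sweep distinguish a genuinely robust configuration from one that only happened to work for a lucky initialization.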
If you want to inspect the commands without actually launching the runs, use:

```bash
python scripts/seed_sweep.py --seeds 42 123 456 --dry-run
```

```bash
python scripts/evaluate.py --model models/sac_lift_final.zip --algo sac --n-episodes 10
python scripts/evaluate.py --model models/best/best_model.zip --algo sac --no-render
python scripts/evaluate.py --model <path> --algo ppo --record-video --video-dir videos/
```

The evaluation script counts success over the whole episode: if the task is completed at any step, the episode is marked successful instead of checking only the final frame.
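The whole-episode success rule can be expressed in a few lines. A sketch, assuming each step's `info` dict may carry a boolean success flag (the actual key used by the script may differ):

```python
def episode_success(step_infos: list[dict]) -> bool:
    # Success at ANY step marks the whole episode successful,
    # rather than checking only the final frame.
    return any(info.get("success", False) for info in step_infos)
```

This matters for tasks like Lift, where the cube can be lifted mid-episode and then dropped before the time limit: a final-frame check would wrongly count such an episode as a failure.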
The active, validated workflow in this repository is:
- robosuite Lift task
- Panda robot
- Stable-Baselines3 SAC / PPO
- AMD GPU + ROCm + PyTorch training
- OpenGL-based rendering, evaluation, and video capture
The custom `pick_cube_place_cup.py` environment remains in the repository as a prototype for future work, but the current documented and tested baseline is Panda Lift.
SAC defaults (recommended for stable Lift training):

| Parameter | Recommended value | Description |
|---|---|---|
| learning_rate | 1e-4 | Learning rate |
| buffer_size | 1,000,000 | Replay buffer size |
| batch_size | 512 | Batch size |
| learning_starts | 20,000 | Random sampling steps before learning starts |
| gradient_steps | 1 | Conservative update frequency |
| gamma | 0.99 | Discount factor |
| tau | 0.002 | Soft update coefficient |
PPO defaults:

| Parameter | Default | Description |
|---|---|---|
| learning_rate | 3e-4 | Learning rate |
| n_steps | 2048 | Steps per update |
| batch_size | 64 | Batch size |
| n_epochs | 10 | Training epochs |
| clip_range | 0.2 | Clipping range |
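The two tables above map directly onto Stable-Baselines3 constructor arguments. Collected as keyword dicts (values copied from the tables; passing them to `SAC(...)` / `PPO(...)` is the usual SB3 pattern):

```python
# Values from the tables above, in SB3 keyword-argument form.
SAC_KWARGS = dict(
    learning_rate=1e-4,
    buffer_size=1_000_000,
    batch_size=512,
    learning_starts=20_000,
    gradient_steps=1,
    gamma=0.99,
    tau=0.002,
)

PPO_KWARGS = dict(
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    clip_range=0.2,
)

# Usage sketch: model = SAC("MlpPolicy", env, **SAC_KWARGS)
```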
Use TensorBoard to inspect training curves:

```bash
tensorboard --logdir logs/
```

- The repository's validated baseline is Panda + Lift in robosuite.
- For live rendering during training, use `--render` together with `--n-envs 1`.
- For headless evaluation, prefer `--no-render` to avoid GLFW / DISPLAY issues.
- The training scripts disable SB3's rich progress bar by default to avoid cleanup tracebacks from `tqdm`/`rich` in some environments; add `--progress-bar` if you want it enabled.
MIT License