ForceVLA is built on the π₀ model, a flow-matching-based vision-language-action (VLA) model; both training and inference follow the π₀ pipeline.
To run the models in this repository, you will need an NVIDIA GPU with at least the following specifications. These estimates assume a single GPU; you can also use multiple GPUs with model parallelism to reduce per-GPU memory requirements by setting fsdp_devices in the training config. Note that the current training script does not yet support multi-node training.
| Mode | Memory Required | Example GPU |
|---|---|---|
| Inference | > 8 GB | RTX 4090 |
| Fine-Tuning (LoRA) | > 22.5 GB | RTX 4090 |
| Fine-Tuning (Full) | > 70 GB | A100 (80GB) / H100 |
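As a rough back-of-envelope illustration of how fsdp_devices trades GPU count against per-GPU memory: sharding the model state evenly across N devices divides the table's single-GPU requirement by N. The helper below is a hypothetical sketch (activation memory and communication buffers add real-world overhead, so treat the result as a lower bound).

```python
# Illustrative estimate only: FSDP-style sharding splits parameter and
# optimizer state across `fsdp_devices`, so per-GPU state memory shrinks
# roughly linearly. Activations add extra overhead in practice.

def per_gpu_memory_gb(total_gb: float, fsdp_devices: int) -> float:
    """Approximate per-GPU memory when model state is sharded evenly."""
    return total_gb / fsdp_devices

# Full fine-tuning needs > 70 GB on a single GPU (see table above);
# sharding across 4 GPUs lowers the per-GPU state requirement.
print(per_gpu_memory_gb(70, 4))  # 17.5
```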
The repo has been tested on Ubuntu 22.04; other operating systems are not currently supported.
The real-world dataset is available on Hugging Face: https://huggingface.co/datasets/qiaojunyu/ForceVLA-real-data
When cloning this repo, make sure to update submodules:
git submodule update --init --recursive
conda create -n forcevla python=3.11 -y
conda activate forcevla
python -m pip install --upgrade pip setuptools wheel
conda install -c nvidia cuda-toolkit=12.8
cd lerobot/
conda install ffmpeg=7.1.1 -c conda-forge
pip install -e .
cd ./openpi
pip install -e .
cd dlimp/
pip install -e .
cd packages/
cd openpi-client/
pip install -e .
cd flaxformer/
pip install -e .
export HF_LEROBOT_HOME="xxxxxx"
python scripts/compute_norm_stats.py --config-name forcevla_lora
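The compute_norm_stats.py step precomputes dataset normalization statistics before training. As a hypothetical sketch of what such statistics are used for (the function names and the exact normalization scheme below are illustrative assumptions, not the repo's implementation): per-dimension statistics are computed once over the training data, then applied to normalize states/actions at both train and inference time.

```python
import numpy as np

# Illustrative sketch: per-dimension mean/std computed once over the
# dataset, then reused to normalize inputs consistently everywhere.

def compute_norm_stats(data: np.ndarray) -> dict:
    """Per-dimension statistics; epsilon guards against zero std."""
    return {"mean": data.mean(axis=0), "std": data.std(axis=0) + 1e-6}

def normalize(x: np.ndarray, stats: dict) -> np.ndarray:
    return (x - stats["mean"]) / stats["std"]

actions = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
stats = compute_norm_stats(actions)
print(normalize(actions, stats)[1])  # → [0. 0.] (middle row equals the mean)
```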
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 python scripts/train.py forcevla_lora --exp-name=my_experiment --overwrite --batch_size 32 --save_interval 2000 --keep_period 10000
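A quick illustration of how the checkpoint flags above interact, under the assumed semantics that a checkpoint is written every save_interval steps and only checkpoints at multiples of keep_period are retained long-term (this is a sketch of typical behavior, not a guarantee about the script's internals):

```python
# Hypothetical illustration (assumed semantics): checkpoints are written
# every `save_interval` steps; those at multiples of `keep_period` are
# retained long-term while intermediate ones may be rotated out.

def saved_steps(total_steps: int, save_interval: int) -> list[int]:
    return list(range(save_interval, total_steps + 1, save_interval))

def kept_long_term(steps: list[int], keep_period: int) -> list[int]:
    return [s for s in steps if s % keep_period == 0]

steps = saved_steps(20_000, 2_000)
print(kept_long_term(steps, 10_000))  # [10000, 20000]
```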