What is MIKASA-Robo-VLA?
MIKASA-Robo-VLA extends the MIKASA-Robo memory benchmark to language-conditioned Vision-Language-Action research. It provides tabletop robotic manipulation environments that require an agent to retain and use information across delayed, occluded, temporal, or multi-stage interactions.
The canonical VLA benchmark contains 90 tasks with natural-language instructions, ManiSkill/Gymnasium environments, and released trajectory datasets for training and evaluation. The benchmark task manifest is mikasa_robo_vla_envs.csv.
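A minimal sketch of inspecting the task manifest with the Python standard library. The manifest's column names are not documented here, so the code reads whatever header the file has; the helper names `load_manifest` and `count_by`, and the file location in the usage section, are assumptions made for illustration.

```python
import csv
from collections import Counter
from pathlib import Path


def load_manifest(path):
    """Read the manifest CSV into a list of dicts, one per row.

    Keys come from the file's own header row; none are assumed here.
    """
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def count_by(rows, column):
    """Count rows per distinct value of `column`.

    `column` must be a header that actually exists in the manifest.
    """
    return Counter(row[column] for row in rows)


if __name__ == "__main__":
    # Assumed location: the manifest sits in the current directory.
    manifest = Path("mikasa_robo_vla_envs.csv")
    if manifest.exists():
        rows = load_manifest(manifest)
        print(f"{len(rows)} rows")
        if rows:
            print("columns:", list(rows[0].keys()))
    else:
        print(f"{manifest} not found; pass the correct path to load_manifest()")
```

Printing the header first shows which columns are available before grouping with `count_by`.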
What changed from MIKASA-Robo (RL release)
- Task set grows from 32 → 90 registered environments covering 10 memory types (vs 4 in the RL release).
- Every task ships a natural-language LANGUAGE_INSTRUCTION for VLA conditioning.
- Episodes are grouped into three horizon splits (Short / Medium / Long) so multi-task training and evaluation are tractable.
- 22,500 PPO / motion-planning oracle trajectories (6+ million transitions) are released on Hugging Face in RLDS and LeRobotDataset v3 formats, so no further conversion is needed.
- Dense and normalised-dense rewards are calibrated for every task, enabling both offline imitation learning and online RL.
- The original 32-task RL implementation is available from the mikasa-robo-rl branch and remains under mikasa_robo_suite/rl/ for backwards compatibility.
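As a rough sanity check on the dataset figures above, the following sketch derives per-task and per-trajectory averages. It assumes the trajectories are spread evenly across the 90 tasks, which the list above does not state, and it treats "6+ million" as a lower bound of exactly 6,000,000 transitions.

```python
# Figures quoted in the list above.
NUM_TASKS = 90
NUM_TRAJECTORIES = 22_500
MIN_TRANSITIONS = 6_000_000  # "6+ million", used here as a lower bound

# Assumption: trajectories are divided evenly among tasks.
trajectories_per_task = NUM_TRAJECTORIES / NUM_TASKS

# Lower bound on the mean trajectory length, in transitions.
min_mean_length = MIN_TRANSITIONS / NUM_TRAJECTORIES

print(f"trajectories per task (if even): {trajectories_per_task:.0f}")
print(f"mean transitions per trajectory: at least {min_mean_length:.0f}")
```

Under these assumptions each task has 250 trajectories, and the mean trajectory is at least about 267 transitions long.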
Installation
pip install mikasa-robo-suite