🚧 (Work in progress: the project code is being sorted and organized.) Prompt2Act: Mapping Prompts into Sequence of Robotic Actions with Large Foundation Models
Official implementation of our system Prompt2Act, which maps open-ended multi-modal prompts into real-world robotic actions via large vision-language models, mixed execution agents, and visual grounding modules.
- Multimodal Prompt Understanding: Supports vision-language prompts such as images, sketches, pointing gestures, and free-form instructions.
- Mixed Execution Agent: Combines predefined symbolic functions with on-the-fly code generation to execute diverse tasks (see the sketch after this list).
- Visual Grounding via VG-Marker: Automatically identifies objects and assigns semantic anchors from raw scenes using open-vocabulary segmentation + GPT-4o.
- Supports Novel Tasks: From "Arrange pens by reference photo" to "Pick the toy pointed at by hand", with no fine-tuning needed.
- Zero-shot generalization to occluded, unseen, and cluttered environments.
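For intuition, here is a minimal, hypothetical sketch of the mixed-execution idea referenced above: each planned step is routed to a predefined symbolic skill when one matches, and otherwise falls back to executing generated code. The names `Step`, `SKILLS`, and `run_generated_code` are illustrative assumptions, not the actual Prompt2Act interfaces.

```python
# Minimal sketch of mixed execution: route each planned step either to a
# predefined skill or to generated code. All names here are illustrative
# assumptions, not the actual Prompt2Act API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Step:
    action: str                      # e.g. "pick", "place", or "codegen"
    args: Dict[str, object] = field(default_factory=dict)
    code: str = ""                   # populated only for code-generation steps

# Registry of predefined symbolic skills (placeholders for real robot primitives).
SKILLS: Dict[str, Callable[..., None]] = {
    "pick":  lambda obj: print(f"picking {obj}"),
    "place": lambda obj, target: print(f"placing {obj} on {target}"),
}

def run_generated_code(code: str) -> None:
    """Fallback path: execute generated Python for steps no skill covers."""
    exec(code, {"SKILLS": SKILLS})   # sandboxing omitted in this sketch

def execute(plan: List[Step]) -> None:
    for step in plan:
        if step.action in SKILLS:
            SKILLS[step.action](**step.args)      # predefined symbolic skill
        else:
            run_generated_code(step.code)         # on-the-fly code generation

if __name__ == "__main__":
    execute([
        Step("pick", {"obj": "red pen"}),
        Step("place", {"obj": "red pen", "target": "pen holder"}),
        Step("codegen", code="SKILLS['pick']('blue pen')"),
    ])
```

In the real system the skills would call robot primitives and the generated code would come from the LLM planner; this sketch only shows the dispatch logic.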
Prompt2Act/
│
├── prompt2act/            # Core system modules
│   ├── planner/           # LLM-based sequence planner
│   ├── executor/          # Mixed Execution Agent (predefined + code-gen)
│   ├── visual_grounding/  # VG-Marker
│   └── utils/
│
├── data/                  # Demo trajectories & prompt configs
├── models/                # LLM/VLM interface wrappers
├── scripts/               # Evaluation scripts and launchers
└── README.md
We recommend Python 3.10 and Linux/Ubuntu.
# Clone repository
git clone https://github.com/Zero-coder/Prompt2Act.git
cd Prompt2Act

# Create virtual env (optional)
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Run a predefined task (e.g., Arrange Pens by Reference) in simulation:
python scripts/run_demo.py \
    --task arrange_pens \
    --prompt examples/prompts/arrange_pens.json

You can also modify the prompt file to test new tasks:
{
  "instruction": "Please arrange these pens as shown in the image.",
  "image": "reference_pens.jpg"
}

To fully enable Prompt2Act, you need:
- GPT-4o / GPT-4V (OpenAI API or local Azure proxy); a connectivity sketch follows this list
- SAM / Grounded-SAM for segmentation
- VG-Marker (included, no training needed)
- Predefined skills: pick, place, rotate, etc.
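As a quick way to verify GPT-4o access before wiring it into Prompt2Act, the following hypothetical snippet sends an image plus an instruction through the official OpenAI Python SDK (the SDK reads OPENAI_API_KEY from the environment). The file name and prompt text are illustrative; the actual integration inside Prompt2Act may differ.

```python
# Minimal connectivity check (not the Prompt2Act code): send an image and an
# instruction to GPT-4o via the OpenAI Python SDK.
import base64
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

# Illustrative reference image; replace with your own file.
with open("reference_pens.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Please arrange these pens as shown in the image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```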
We provide starter checkpoints and test prompts in data/.
To evaluate Prompt2Act under different generalization axes:
python scripts/eval_benchmark.py \
    --benchmark occlusion \
    --config configs/eval_occlusion.yaml

You can test the following axes (a sweep sketch follows this list):
- Visual generalization (new textures, occlusion)
- Reasoning (visual constraints, sketch understanding)
- Embodiment shift (sim vs real)
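To run several axes back to back, here is a small hypothetical helper built only on the eval_benchmark.py flags shown above. Only configs/eval_occlusion.yaml appears in this README; the other benchmark and config names are assumptions and should be adapted to the configs actually shipped in configs/.

```python
# Hypothetical sweep over generalization axes using the documented CLI flags.
# Benchmark/config names other than "occlusion" are assumptions.
import subprocess

BENCHMARKS = {
    "occlusion": "configs/eval_occlusion.yaml",
    "texture":   "configs/eval_texture.yaml",   # assumed config name
    "sketch":    "configs/eval_sketch.yaml",    # assumed config name
}

for benchmark, config in BENCHMARKS.items():
    print(f"=== Evaluating axis: {benchmark} ===")
    subprocess.run(
        ["python", "scripts/eval_benchmark.py",
         "--benchmark", benchmark,
         "--config", config],
        check=True,  # stop the sweep if any benchmark fails
    )
```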
If you find this project helpful, please consider citing our paper:
@article{jiang2025prompt2act,
  title={Prompt2Act: Mapping Prompts into Sequence of Robotic Actions with Large Foundation Models},
  author={Maowei Jiang and Qi Wang and Hongfeng Ai and Zhiyong Dong and Yusong Hu and Ao Liang and Yifan Wang and Ruiqi Li and Quangao Liu and Moquan Chen and Peter Buš and Long Zeng},
  journal={Information Fusion (under revision)},
  year={2025}
}

Maintained by @Zero-coder. Please open issues or pull requests for contributions.