Pengxin Xu1,2 · Xincheng Lin3 · Luping Xiao2,4 · Qing Jiang5 · Meishan Zhang1 · Hao Fei6,† · Shanghang Zhang7 · Xingyu Chen2,†
1HIT (Shenzhen), 2ZGCA, 3HUST, 4BUPT, 5SCUT, 6Oxford, 7PKU
†Corresponding author
SceneParser is a VLM-based hierarchical parser for physical scene understanding.
Given an RGB image and an object- or scene-level query, it generates a structured
JSON hierarchy that binds objects, parts, and affordance points into explicit
scene -> object -> part -> affordance chains. This repository provides the
training, evaluation, data conversion, and released checkpoint workflow needed to
reproduce SceneParser on SceneParser-Bench.
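To make the scene -> object -> part -> affordance binding concrete, the following is a minimal sketch of how such a hierarchy could be walked once it is parsed into Python. The field names (`scene`, `objects`, `parts`, `affordances`, `name`, `point`) and the example values are illustrative assumptions, not the released SceneParser-Bench schema.

```python
from typing import Iterator

# Hypothetical parsed hierarchy. The keys and values below are illustrative
# assumptions, not the schema used by the released annotations.
example = {
    "scene": "kitchen",
    "objects": [
        {
            "name": "mug",
            "parts": [
                {
                    "name": "handle",
                    "affordances": [{"name": "grasp", "point": [412, 305]}],
                },
                {"name": "body", "affordances": []},
            ],
        }
    ],
}


def iter_chains(hierarchy: dict) -> Iterator[tuple]:
    """Yield one (scene, object, part, affordance, point) tuple per affordance."""
    scene = hierarchy.get("scene", "")
    for obj in hierarchy.get("objects", []):
        for part in obj.get("parts", []):
            for aff in part.get("affordances", []):
                yield (scene, obj["name"], part["name"], aff["name"], tuple(aff["point"]))


chains = list(iter_chains(example))
for chain in chains:
    print(" -> ".join(str(item) for item in chain))
```

Parts without affordance points (the `body` part here) produce no chain in this sketch; whether incomplete chains should be reported is a design choice left open.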
conda create -n sceneparser python=3.10 -y
conda activate sceneparser
pip install torch==2.7.0 torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install -v -e .
Download the SceneParser JSONL annotations from the released SceneParser-Bench
HuggingFace dataset and place them under datasets:
mkdir -p datasets
# Download train.jsonl and val.jsonl from the SceneParser-Bench dataset release.
The JSONL annotations use relative image paths:
datasets/EgoObjects/images/<image_name>.jpg
Download the EgoObjects images from the official release:
https://github.com/facebookresearch/EgoObjects
https://ai.meta.com/datasets/egoobjects-downloads/
Download both image archives:
EgoObjectsV1_images.zip
images.zip
Place the extracted images under datasets/EgoObjects/images:
mkdir -p datasets/EgoObjects
unzip EgoObjectsV1_images.zip -d datasets/EgoObjects
unzip images.zip -d datasets/EgoObjects
After extraction, make sure this path exists:
datasets/EgoObjects/images/<image_name>.jpg
If the archives are extracted into a different folder layout, move or symlink
the combined image folder to datasets/EgoObjects/images.
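Because the annotations reference images by relative path, a quick existence check can catch a wrong folder layout before conversion. The helper below is a minimal sketch: `find_missing_images` is a hypothetical name, and it takes the relative paths as a plain list because the JSONL field that stores them is not specified here.

```python
from pathlib import Path


def find_missing_images(root, relative_paths):
    """Return the relative paths that do not resolve to a file under root."""
    root = Path(root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


# Placeholder file name for illustration; substitute paths taken from the
# annotations and run from the repository root.
missing = find_missing_images(".", ["datasets/EgoObjects/images/example.jpg"])
print(f"{len(missing)} missing image(s)")
```

An empty result means every listed path resolves from the chosen root; any returned path points at an archive that was extracted into a different layout.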
Training reads TSV files. Convert datasets/train.jsonl with:
python3 datasets/tools/convert__to_tsv_mp.py \
--json_file datasets/train.jsonl \
--save_image_tsv_path datasets/train_tsv/images.tsv \
--save_ann_tsv_path datasets/train_tsv/annotations.tsv \
--save_ann_lineidx_path datasets/train_tsv/annotations.tsv.lineidx \
--num_workers 32
Optional sanity check:
wc -l datasets/train.jsonl datasets/train_tsv/annotations.tsv.lineidx
The training pipeline uses a three-stage curriculum. By default, scripts read
training TSV files from datasets/train_tsv and write checkpoints to
finetuning/work_dirs.
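The curriculum shifts the share of pseudo-completed samples from 0% in Stage 1 to 30% in Stage 2 and 50% in Stage 3, as the stage descriptions in this section state. The sketch below illustrates one way such a mix could be drawn per batch. It is an assumption for illustration only: `sample_batch` and `PSEUDO_FRACTION` are hypothetical names, and the released scripts may mix data differently (for example at the dataset level rather than per sample).

```python
import random

# Pseudo-completed fraction per stage, taken from the stage descriptions.
PSEUDO_FRACTION = {1: 0.0, 2: 0.3, 3: 0.5}


def sample_batch(stage, no_pseudo, pseudo, batch_size, rng):
    """Draw a batch; each slot is pseudo-completed with the stage's probability."""
    p = PSEUDO_FRACTION[stage]
    batch = []
    for _ in range(batch_size):
        # rng.random() is in [0, 1), so p == 0.0 never selects the pseudo pool.
        pool = pseudo if rng.random() < p else no_pseudo
        batch.append(rng.choice(pool))
    return batch


# Toy sample identifiers, used only to show the call.
no_pseudo_pool = [f"np_{i}" for i in range(100)]
pseudo_pool = [f"ps_{i}" for i in range(100)]
rng = random.Random(0)
stage2_batch = sample_batch(2, no_pseudo_pool, pseudo_pool, 10, rng)
print(stage2_batch)
```

Sampling per slot gives the target ratio only in expectation; a fixed split per batch would be an equally plausible reading of the stage descriptions.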
Stage 1 trains from the base model using no-pseudo supervision:
MODEL_NAME_OR_PATH=IDEA-Research/SceneParser \
bash finetuning/scripts/sft_sceneparser_curriculum_stage1_nopseudo_when_available.sh
Stage 2 continues from Stage 1 and mixes 70% no-pseudo with 30% pseudo-completed samples:
bash finetuning/scripts/sft_sceneparser_curriculum_stage2_mixed70pseudo30_when_available.sh
Stage 3 continues from Stage 2 and mixes 50% no-pseudo with 50% pseudo-completed samples:
bash finetuning/scripts/sft_sceneparser_curriculum_stage3_mixed50pseudo50_when_available.sh
Useful overrides:
GPUS_PER_NODE=8
NNODES=1
SCENEPARSER_TSV_DIR=/path/to/train_tsv
OUTPUT_DIR=work_dirs/my_run
The evaluation flow has two steps:
- Run inference with a trained checkpoint to generate answer.jsonl.
- Run hierarchical metrics and export the four final report metrics.
Download the released SceneParser model checkpoint and use it as MODEL_PATH:
MODEL_PATH=/path/to/SceneParser-model \
TEST_JSONL=datasets/val.jsonl \
OUTPUT_DIR=evaluation/results/curriculum_stage3_eval \
NUM_SHARDS=8 \
bash evaluation/scripts/eval_sceneparser_obj_sharded.sh
The script writes:
evaluation/results/curriculum_stage3_eval/answer.jsonl
evaluation/results/curriculum_stage3_eval/eval_results_filtered.json
evaluation/results/curriculum_stage3_eval/final_metrics.json
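The NUM_SHARDS setting suggests that inference is split across several workers and the per-shard answers are merged into a single answer.jsonl. The sketch below shows one plausible scheme, round-robin sharding with an order-preserving merge. It is an assumption for illustration; `shard_records` and `merge_shards` are hypothetical helpers, and the evaluation script may partition the data differently.

```python
def shard_records(records, num_shards):
    """Split records round-robin into num_shards lists."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [records[i::num_shards] for i in range(num_shards)]


def merge_shards(shards):
    """Interleave round-robin shards back into the original record order."""
    merged = []
    longest = max((len(shard) for shard in shards), default=0)
    for i in range(longest):
        for shard in shards:
            if i < len(shard):
                merged.append(shard[i])
    return merged


# Toy record ids, used only to show that merging restores the input order.
records = list(range(10))
shards = shard_records(records, 3)
restored = merge_shards(shards)
print(shards)
print(restored)
```

Round-robin keeps shard sizes within one record of each other, which helps when every shard runs on its own GPU.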
final_metrics.json contains only the four public metrics:
- L1: object-level hierarchy score
- L2: object-part hierarchy score
- L3: object-part-affordance hierarchy score
- ParseRate: hierarchical completeness
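A small loader can confirm that a finished run produced all four metrics before the numbers are copied into a report. This is a minimal sketch: the key names in `EXPECTED_KEYS` mirror the metric names listed above and are an assumption, since the exact keys stored in final_metrics.json are not documented here, and `load_final_metrics` is a hypothetical helper.

```python
import json
from pathlib import Path

# Assumed key names, mirroring the metric names listed above; the actual
# keys in final_metrics.json may differ.
EXPECTED_KEYS = ("L1", "L2", "L3", "ParseRate")


def load_final_metrics(path):
    """Load a metrics JSON file and list any expected keys that are missing."""
    metrics = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in EXPECTED_KEYS if key not in metrics]
    return metrics, missing
```

For example, calling `load_final_metrics("evaluation/results/curriculum_stage3_eval/final_metrics.json")` after an evaluation run returns the parsed metrics together with a list of expected keys that were not found.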
This code release is licensed under IDEA License 1.0 for non-commercial research use; it builds on Rex-Omni and Qwen, so please comply with all upstream licenses.
This repository builds on Rex-Omni. We thank the authors for their excellent open-source work.
@article{sceneparser2026,
title = {SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding},
author = {Pengxin Xu and Xincheng Lin and Luping Xiao and Qing Jiang and Meishan Zhang and Hao Fei and Shanghang Zhang and Xingyu Chen},
journal = {arXiv preprint arXiv:2605.14923},
year = {2026}
}