SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Pengxin Xu¹,² · Xincheng Lin³ · Luping Xiao²,⁴ · Qing Jiang⁵ · Meishan Zhang¹ · Hao Fei⁶,† · Shanghang Zhang⁷ · Xingyu Chen²,†

¹HIT (Shenzhen), ²ZGCA, ³HUST, ⁴BUPT, ⁵SCUT, ⁶Oxford, ⁷PKU

†Corresponding author

📖 Abstract

SceneParser is a VLM-based hierarchical parser for physical scene understanding. Given an RGB image and an object- or scene-level query, it generates a structured JSON hierarchy that binds objects, parts, and affordance points into explicit scene -> object -> part -> affordance chains. This repository provides the training, evaluation, and data-conversion code, together with the released-checkpoint workflow, needed to reproduce SceneParser on SceneParser-Bench.
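To make the idea of a scene -> object -> part -> affordance chain concrete, here is a minimal sketch that flattens a nested hierarchy into explicit chains. The field names (`name`, `objects`, `parts`, `affordances`), the example values, and the function `iter_chains` are all hypothetical and chosen only for illustration; the actual output schema is defined by the released annotations and model, not by this snippet.

```python
def iter_chains(scene):
    """Yield (scene, object, part, affordance_point) tuples from a nested hierarchy.

    The keys used here are illustrative assumptions, not the SceneParser schema.
    """
    for obj in scene.get("objects", []):
        for part in obj.get("parts", []):
            for point in part.get("affordances", []):
                yield scene["name"], obj["name"], part["name"], tuple(point)


# A made-up hierarchy with one object, one part, and one affordance point.
example = {
    "name": "kitchen",
    "objects": [
        {"name": "mug", "parts": [{"name": "handle", "affordances": [[412, 230]]}]}
    ],
}

for chain in iter_chains(example):
    print(" -> ".join(str(item) for item in chain))
```

Objects without parts, or parts without affordance points, produce no chain in this sketch; a real consumer would need to decide how to treat such incomplete branches.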

⚙️ Installation

conda create -n sceneparser python=3.10 -y
conda activate sceneparser
pip install torch==2.7.0 torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install -v -e .

📦 Data Preparation

Download the SceneParser JSONL annotations from the released SceneParser-Bench HuggingFace dataset and place them under datasets:

mkdir -p datasets
# Download train.jsonl and val.jsonl from the SceneParser-Bench dataset release.

The JSONL annotations use relative image paths:

datasets/EgoObjects/images/<image_name>.jpg

Download the EgoObjects images from the official release:

https://github.com/facebookresearch/EgoObjects
https://ai.meta.com/datasets/egoobjects-downloads/

Download both image archives:

EgoObjectsV1_images.zip
images.zip

Place the extracted images under datasets/EgoObjects/images:

mkdir -p datasets/EgoObjects
unzip EgoObjectsV1_images.zip -d datasets/EgoObjects
unzip images.zip -d datasets/EgoObjects

After extraction, make sure this path exists:

datasets/EgoObjects/images/<image_name>.jpg

If the archives are extracted into a different folder layout, move or symlink the combined image folder to datasets/EgoObjects/images.
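Before converting, it can help to confirm that every image path referenced by the annotations resolves on disk. The sketch below is not part of the repository. It makes no assumption about the JSONL schema and simply collects every string value that ends in an image extension; the function names `iter_image_paths` and `find_missing_images` are hypothetical.

```python
import json
from pathlib import Path

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


def iter_image_paths(value):
    """Yield every string inside a parsed JSON value that looks like an image path."""
    if isinstance(value, str):
        if value.lower().endswith(IMAGE_SUFFIXES):
            yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_image_paths(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_image_paths(item)


def find_missing_images(jsonl_file, root="."):
    """Return the sorted referenced image paths that do not exist under root."""
    missing = set()
    with open(jsonl_file, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            for rel_path in iter_image_paths(json.loads(line)):
                if not (Path(root) / rel_path).is_file():
                    missing.add(rel_path)
    return sorted(missing)
```

Calling `find_missing_images("datasets/train.jsonl")` from the repository root should return an empty list once the images are in place, assuming the relative paths are resolved against the repository root.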

🔁 Convert JSONL To TSV

Training reads TSV files. Convert datasets/train.jsonl with:

python3 datasets/tools/convert__to_tsv_mp.py \
  --json_file datasets/train.jsonl \
  --save_image_tsv_path datasets/train_tsv/images.tsv \
  --save_ann_tsv_path datasets/train_tsv/annotations.tsv \
  --save_ann_lineidx_path datasets/train_tsv/annotations.tsv.lineidx \
  --num_workers 32

Optional sanity check (the two line counts are expected to match if every JSONL record was converted):

wc -l datasets/train.jsonl datasets/train_tsv/annotations.tsv.lineidx
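The same comparison can be scripted so that it fails loudly inside a pipeline. This is a minimal sketch that assumes the line-index file holds one line per converted annotation record, which is how such index files are commonly laid out but is not confirmed here; `count_lines` and `check_conversion` are hypothetical helper names.

```python
def count_lines(path):
    """Count newline characters in a file, which is what `wc -l` reports."""
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large annotation files are not loaded at once.
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))


def check_conversion(jsonl_file, lineidx_file):
    """Return (jsonl_count, lineidx_count, counts_match)."""
    n_json = count_lines(jsonl_file)
    n_idx = count_lines(lineidx_file)
    return n_json, n_idx, n_json == n_idx
```

For example, `check_conversion("datasets/train.jsonl", "datasets/train_tsv/annotations.tsv.lineidx")` returns both counts and a flag; a mismatch suggests that some records were skipped or that the conversion was interrupted.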

🏗️ Training

The training pipeline uses a three-stage curriculum. By default, scripts read training TSV files from datasets/train_tsv and write checkpoints to finetuning/work_dirs.

Stage 1 trains from the base model using no-pseudo supervision:

MODEL_NAME_OR_PATH=IDEA-Research/SceneParser \
bash finetuning/scripts/sft_sceneparser_curriculum_stage1_nopseudo_when_available.sh

Stage 2 continues from Stage 1 and mixes 70% no-pseudo with 30% pseudo-completed samples:

bash finetuning/scripts/sft_sceneparser_curriculum_stage2_mixed70pseudo30_when_available.sh

Stage 3 continues from Stage 2 and mixes 50% no-pseudo with 50% pseudo-completed samples:

bash finetuning/scripts/sft_sceneparser_curriculum_stage3_mixed50pseudo50_when_available.sh
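The mixing ratios of the three stages can be illustrated with a small sampler. This is only a sketch of the idea of drawing from two pools with a stage-dependent probability; the training scripts may mix data differently (for example at the dataset level rather than per sample), and `STAGE_NO_PSEUDO_FRACTION` and `sample_mixture` are hypothetical names.

```python
import random

# No-pseudo fraction per stage, taken from the stage descriptions above.
STAGE_NO_PSEUDO_FRACTION = {1: 1.0, 2: 0.7, 3: 0.5}


def sample_mixture(no_pseudo, pseudo, stage, num_samples, seed=0):
    """Draw num_samples items, choosing the no-pseudo pool with the stage's probability."""
    rng = random.Random(seed)
    fraction = STAGE_NO_PSEUDO_FRACTION[stage]
    batch = []
    for _ in range(num_samples):
        # rng.random() is in [0, 1), so a fraction of 1.0 always selects no_pseudo.
        pool = no_pseudo if rng.random() < fraction else pseudo
        batch.append(rng.choice(pool))
    return batch
```

With two labelled pools, `sample_mixture(["clean"], ["pseudo"], stage=2, num_samples=10000)` yields roughly 70% `"clean"` items, while `stage=1` yields only `"clean"` items.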

Useful environment-variable overrides, set before the script in the same way as MODEL_NAME_OR_PATH above:

GPUS_PER_NODE=8
NNODES=1
SCENEPARSER_TSV_DIR=/path/to/train_tsv
OUTPUT_DIR=work_dirs/my_run
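When launching a stage from Python rather than from the shell, the overrides can be passed through the process environment. The sketch below assumes the scripts read these variables from the environment, as the shell examples suggest; `build_env` and `run_stage` are hypothetical helpers, and `dry_run=True` only returns the command without executing anything.

```python
import os
import subprocess


def build_env(overrides, base_env=None):
    """Merge overrides into a copy of the environment, converting values to strings."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({key: str(value) for key, value in overrides.items()})
    return env


def run_stage(script, overrides, dry_run=True):
    """Launch one stage script with the overrides applied; return the command used."""
    command = ["bash", script]
    if not dry_run:
        # check=True raises CalledProcessError if the stage exits with a non-zero status.
        subprocess.run(command, env=build_env(overrides), check=True)
    return command
```

For example, `run_stage("finetuning/scripts/sft_sceneparser_curriculum_stage1_nopseudo_when_available.sh", {"GPUS_PER_NODE": 8, "NNODES": 1}, dry_run=False)` would start Stage 1 on a single 8-GPU node. How later stages locate the previous stage's checkpoint is determined by the scripts themselves and should be checked before reusing one OUTPUT_DIR across stages.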

📊 Evaluation

The evaluation flow has two steps:

1. Run inference with a trained checkpoint to generate answer.jsonl.
2. Run hierarchical metrics and export the four final report metrics.

Download the released SceneParser model checkpoint and use it as MODEL_PATH:

MODEL_PATH=/path/to/SceneParser-model \
TEST_JSONL=datasets/val.jsonl \
OUTPUT_DIR=evaluation/results/curriculum_stage3_eval \
NUM_SHARDS=8 \
bash evaluation/scripts/eval_sceneparser_obj_sharded.sh

The script writes:

evaluation/results/curriculum_stage3_eval/answer.jsonl
evaluation/results/curriculum_stage3_eval/eval_results_filtered.json
evaluation/results/curriculum_stage3_eval/final_metrics.json

final_metrics.json contains only the four public metrics:

L1        object-level hierarchy score
L2        object-part hierarchy score
L3        object-part-affordance hierarchy score
ParseRate hierarchical completeness
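The report can be read back for logging or comparison across runs. This sketch assumes final_metrics.json is a flat JSON object keyed by the metric names listed above (`L1`, `L2`, `L3`, `ParseRate`); that layout is an assumption, and `load_final_metrics` and `format_metrics` are hypothetical helpers.

```python
import json

METRIC_NAMES = ("L1", "L2", "L3", "ParseRate")  # assumed JSON keys


def load_final_metrics(path):
    """Load the report and return the public metrics that are present, in a fixed order."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    return {name: report[name] for name in METRIC_NAMES if name in report}


def format_metrics(metrics):
    """Render one 'name value' line per metric, with names padded to ten characters."""
    return "\n".join(f"{name:<10}{value}" for name, value in metrics.items())
```

For example, `print(format_metrics(load_final_metrics("evaluation/results/curriculum_stage3_eval/final_metrics.json")))` prints one line per metric found in the file.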

📜 License

This code release is licensed under IDEA License 1.0 for non-commercial research use; it builds on Rex-Omni and Qwen, so please comply with all upstream licenses.

🙏 Acknowledgement

This repository builds on Rex-Omni. We thank the authors for their excellent open-source work.

📖 Citation

@article{sceneparser2026,
  title   = {SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding},
  author  = {Pengxin Xu and Xincheng Lin and Luping Xiao and Qing Jiang and Meishan Zhang and Hao Fei and Shanghang Zhang and Xingyu Chen},
  journal = {arXiv preprint arXiv:2605.14923},
  year    = {2026}
}
