ZGCA-HMI-Lab/SceneParser

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

Pengxin Xu1,2 · Xincheng Lin3 · Luping Xiao2,4 · Qing Jiang5 · Meishan Zhang1 · Hao Fei6,† · Shanghang Zhang7 · Xingyu Chen2,†

1HIT (Shenzhen), 2ZGCA, 3HUST, 4BUPT, 5SCUT, 6Oxford, 7PKU

† Corresponding author

📖 Abstract

SceneParser is a VLM-based hierarchical parser for physical scene understanding. Given an RGB image and an object- or scene-level query, it generates a structured JSON hierarchy that binds objects, parts, and affordance points into explicit scene -> object -> part -> affordance chains. This repository provides the training, evaluation, data conversion, and released checkpoint workflow needed to reproduce SceneParser on SceneParser-Bench.
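
To make the scene -> object -> part -> affordance chain concrete, the sketch below flattens a nested hierarchy into explicit chains. The field names (`scene`, `objects`, `parts`, `affordances`, `name`, `label`, `point`) and the sample values are hypothetical placeholders chosen for illustration; the actual output schema is whatever the released annotations and model define.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical example: field names and values are illustrative,
# not the released SceneParser schema.
EXAMPLE = {
    "scene": "kitchen",
    "objects": [
        {
            "name": "mug",
            "parts": [
                {
                    "name": "handle",
                    "affordances": [{"label": "grasp", "point": [412, 305]}],
                },
                {"name": "rim", "affordances": []},
            ],
        }
    ],
}

Chain = Tuple[str, str, str, str, Tuple[int, ...]]


def iter_chains(hierarchy: Dict[str, Any]) -> Iterator[Chain]:
    """Yield one (scene, object, part, affordance, point) tuple per affordance point."""
    scene = hierarchy["scene"]
    for obj in hierarchy.get("objects", []):
        for part in obj.get("parts", []):
            for aff in part.get("affordances", []):
                yield scene, obj["name"], part["name"], aff["label"], tuple(aff["point"])


chains = list(iter_chains(EXAMPLE))
for chain in chains:
    print(" -> ".join(chain[:4]), "@", chain[4])
```

Parts without affordance points (the `rim` entry here) produce no chain, which is one simple way to see which branches of a hierarchy are complete.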

⚙️ Installation

conda create -n sceneparser python=3.10 -y
conda activate sceneparser
pip install torch==2.7.0 torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install -v -e .

📦 Data Preparation

Download the SceneParser JSONL annotations from the released SceneParser-Bench HuggingFace dataset and place them under datasets:

mkdir -p datasets
# Download train.jsonl and val.jsonl from the SceneParser-Bench dataset release.

The JSONL annotations use relative image paths:

datasets/EgoObjects/images/<image_name>.jpg

Download the EgoObjects images from the official release:

https://github.com/facebookresearch/EgoObjects
https://ai.meta.com/datasets/egoobjects-downloads/

Download both image archives:

EgoObjectsV1_images.zip
images.zip

Place the extracted images under datasets/EgoObjects/images:

mkdir -p datasets/EgoObjects
unzip EgoObjectsV1_images.zip -d datasets/EgoObjects
unzip images.zip -d datasets/EgoObjects

After extraction, make sure this path exists:

datasets/EgoObjects/images/<image_name>.jpg

If the archives are extracted into a different folder layout, move or symlink the combined image folder to datasets/EgoObjects/images.
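
Before converting or training, it may help to confirm that every image path referenced in the JSONL files resolves on disk. The following is a minimal sketch, assuming each JSONL record stores its relative image path under a key named `image`; that key name is an assumption, so pass the real one via `image_key` after inspecting a record.

```python
import json
from pathlib import Path
from typing import List


def find_missing_images(jsonl_path: str, image_key: str = "image", root: str = ".") -> List[str]:
    """Return the referenced image paths that do not exist under `root`.

    `image_key` is an assumed field name; check one JSONL record to confirm it.
    """
    missing = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            rel_path = record.get(image_key)
            if rel_path and not (Path(root) / rel_path).is_file():
                missing.append(rel_path)
    return missing
```

Run from the repository root, `find_missing_images("datasets/train.jsonl")` would return an empty list when the layout matches datasets/EgoObjects/images/<image_name>.jpg.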

🔁 Convert JSONL To TSV

Training reads TSV files. Convert datasets/train.jsonl with:

python3 datasets/tools/convert__to_tsv_mp.py \
  --json_file datasets/train.jsonl \
  --save_image_tsv_path datasets/train_tsv/images.tsv \
  --save_ann_tsv_path datasets/train_tsv/annotations.tsv \
  --save_ann_lineidx_path datasets/train_tsv/annotations.tsv.lineidx \
  --num_workers 32

Optional sanity check (compare the line counts of the source JSONL and the generated line index):

wc -l datasets/train.jsonl datasets/train_tsv/annotations.tsv.lineidx
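
The same check can be scripted. The sketch below assumes the converter writes one line-index entry per JSONL record, which the README does not confirm; if the converter expands or filters records, the two counts can legitimately differ.

```python
def count_lines(path: str) -> int:
    """Count non-empty lines in a text file."""
    with open(path, "rb") as f:
        return sum(1 for line in f if line.strip())


def counts_match(jsonl_path: str, lineidx_path: str) -> bool:
    """Return True when both files have the same number of non-empty lines."""
    return count_lines(jsonl_path) == count_lines(lineidx_path)
```

For example, `counts_match("datasets/train.jsonl", "datasets/train_tsv/annotations.tsv.lineidx")` returns `True` when the counts agree. Unlike `wc -l`, this counts non-empty lines rather than newline characters, so a file without a trailing newline is still counted fully.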

🏗️ Training

The training pipeline uses a three-stage curriculum. By default, scripts read training TSV files from datasets/train_tsv and write checkpoints to finetuning/work_dirs.

Stage 1 trains from the base model using no-pseudo supervision:

MODEL_NAME_OR_PATH=IDEA-Research/SceneParser \
bash finetuning/scripts/sft_sceneparser_curriculum_stage1_nopseudo_when_available.sh

Stage 2 continues from Stage 1 and mixes 70% no-pseudo with 30% pseudo-completed samples:

bash finetuning/scripts/sft_sceneparser_curriculum_stage2_mixed70pseudo30_when_available.sh

Stage 3 continues from Stage 2 and mixes 50% no-pseudo with 50% pseudo-completed samples:

bash finetuning/scripts/sft_sceneparser_curriculum_stage3_mixed50pseudo50_when_available.sh
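
The mixing ratios across the three stages can be pictured with a small sampler. This is an illustrative sketch only: it samples with replacement at a fixed ratio, and the function name `mix_samples`, the `STAGE_PSEUDO_RATIO` table, and the sampling strategy are assumptions, not the logic used by the training scripts.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")

# Fraction of pseudo-completed samples per stage, as described above.
STAGE_PSEUDO_RATIO: Dict[int, float] = {1: 0.0, 2: 0.3, 3: 0.5}


def mix_samples(
    no_pseudo: Sequence[T],
    pseudo: Sequence[T],
    pseudo_ratio: float,
    total: int,
    seed: int = 0,
) -> List[T]:
    """Draw `total` samples, a `pseudo_ratio` fraction of them from `pseudo`.

    Sampling is with replacement (an assumption made for simplicity).
    """
    rng = random.Random(seed)
    n_pseudo = round(total * pseudo_ratio)
    n_clean = total - n_pseudo
    mixed = [rng.choice(no_pseudo) for _ in range(n_clean)]
    mixed += [rng.choice(pseudo) for _ in range(n_pseudo)]
    rng.shuffle(mixed)
    return mixed
```

With `total=10`, Stage 2 yields 7 no-pseudo and 3 pseudo-completed samples, and Stage 3 yields 5 of each.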

Useful overrides (set as environment variables before the script, in the same way as MODEL_NAME_OR_PATH above):

GPUS_PER_NODE=8
NNODES=1
SCENEPARSER_TSV_DIR=/path/to/train_tsv
OUTPUT_DIR=work_dirs/my_run
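
When launching stages from Python instead of a shell, the overrides can be passed through the process environment. The helper below is a hypothetical sketch (the names `build_env` and `run_stage` are not part of the repository); it assumes the scripts read these variables from the environment and are started from the repository root. It defaults to a dry run so nothing is executed by accident.

```python
import os
import subprocess
from typing import Dict, List, Optional

# Script paths as listed in the Training section.
STAGE_SCRIPTS = [
    "finetuning/scripts/sft_sceneparser_curriculum_stage1_nopseudo_when_available.sh",
    "finetuning/scripts/sft_sceneparser_curriculum_stage2_mixed70pseudo30_when_available.sh",
    "finetuning/scripts/sft_sceneparser_curriculum_stage3_mixed50pseudo50_when_available.sh",
]


def build_env(overrides: Dict[str, str], base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the base environment (os.environ by default) and apply overrides."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def run_stage(script: str, overrides: Dict[str, str], dry_run: bool = True) -> List[str]:
    """Build the command for one stage and run it unless dry_run is set."""
    cmd = ["bash", script]
    if dry_run:
        print("would run:", " ".join(cmd), "with overrides", overrides)
    else:
        subprocess.run(cmd, env=build_env(overrides), check=True)
    return cmd
```

A call such as `run_stage(STAGE_SCRIPTS[0], {"GPUS_PER_NODE": "8", "NNODES": "1"})` only prints the command; pass `dry_run=False` to execute it.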

📊 Evaluation

The evaluation flow has two steps:

  1. Run inference with a trained checkpoint to generate answer.jsonl.
  2. Compute the hierarchical metrics and export the four final reported metrics.

Download the released SceneParser model checkpoint and use it as MODEL_PATH:

MODEL_PATH=/path/to/SceneParser-model \
TEST_JSONL=datasets/val.jsonl \
OUTPUT_DIR=evaluation/results/curriculum_stage3_eval \
NUM_SHARDS=8 \
bash evaluation/scripts/eval_sceneparser_obj_sharded.sh

The script writes:

evaluation/results/curriculum_stage3_eval/answer.jsonl
evaluation/results/curriculum_stage3_eval/eval_results_filtered.json
evaluation/results/curriculum_stage3_eval/final_metrics.json

final_metrics.json contains only the four public metrics:

L1        object-level hierarchy score
L2        object-part hierarchy score
L3        object-part-affordance hierarchy score
ParseRate hierarchical completeness
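
To pull the four metrics into a script or a results table, something like the following may be enough. It assumes final_metrics.json is a flat JSON object keyed by `L1`, `L2`, `L3`, and `ParseRate`; the exact file layout is an assumption, so adjust the keys if the file nests them differently.

```python
import json
from typing import Dict

METRIC_KEYS = ("L1", "L2", "L3", "ParseRate")


def load_final_metrics(path: str) -> Dict[str, float]:
    """Load the four public metrics, assuming a flat JSON object with these keys."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    missing = [k for k in METRIC_KEYS if k not in data]
    if missing:
        raise KeyError(f"missing metrics: {missing}")
    return {k: float(data[k]) for k in METRIC_KEYS}


def format_metrics(metrics: Dict[str, float]) -> str:
    """Render the metrics on one line in a fixed order."""
    return " | ".join(f"{k}: {metrics[k]:.4f}" for k in METRIC_KEYS)
```

For example, `print(format_metrics(load_final_metrics("evaluation/results/curriculum_stage3_eval/final_metrics.json")))` prints one line per run that is easy to paste into a comparison table.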

📜 License

This code release is licensed under IDEA License 1.0 for non-commercial research use. It builds on Rex-Omni and Qwen, so please also comply with all upstream licenses.

🙏 Acknowledgement

This repository builds on Rex-Omni. We thank the authors for their excellent open-source work.

📖 Citation

@article{sceneparser2026,
  title   = {SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding},
  author  = {Pengxin Xu and Xincheng Lin and Luping Xiao and Qing Jiang and Meishan Zhang and Hao Fei and Shanghang Zhang and Xingyu Chen},
  journal = {arXiv preprint arXiv:2605.14923},
  year    = {2026}
}
