aim-uofa/TVRBench

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Liyang Li*, Muzhi Zhu*, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen
Zhejiang University
*Equal contribution   Corresponding author


Overview

Target Viewpoint Reproduction (TVR) is a closed-loop active perception task: the agent receives a target image and an initial observation in a 3D indoor environment, then acts (translate, rotate, adjust head pitch) until its observation matches the target viewpoint.
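
As a rough illustration of the closed loop described above, the sketch below runs an episode in a toy, pose-only world. Everything in it is an assumption made for illustration: the `Pose` class, the action names, the step sizes (0.25 m, 90°, 30°), and the pose-based match test stand in for the benchmark's rendered observations, its actual action space, and its success criterion.

```python
import math
from dataclasses import dataclass, replace
from typing import Callable, Tuple

# Hypothetical step sizes; the benchmark's real values are not stated here.
STEP_M, TURN_DEG, TILT_DEG = 0.25, 90, 30


@dataclass(frozen=True)
class Pose:
    x: float
    z: float
    yaw: int    # degrees, 0 faces +z, increasing clockwise
    pitch: int  # degrees, positive looks down


def apply_action(pose: Pose, action: str) -> Pose:
    """Apply one discrete action (translate, rotate, or adjust head pitch)."""
    if action == "move_ahead":
        rad = math.radians(pose.yaw)
        return replace(
            pose,
            x=round(pose.x + STEP_M * math.sin(rad), 6),
            z=round(pose.z + STEP_M * math.cos(rad), 6),
        )
    if action == "rotate_left":
        return replace(pose, yaw=(pose.yaw - TURN_DEG) % 360)
    if action == "rotate_right":
        return replace(pose, yaw=(pose.yaw + TURN_DEG) % 360)
    if action == "look_up":
        return replace(pose, pitch=pose.pitch - TILT_DEG)
    if action == "look_down":
        return replace(pose, pitch=pose.pitch + TILT_DEG)
    raise ValueError(f"unknown action: {action}")


def matches(pose: Pose, target: Pose, tol_m: float = 0.05) -> bool:
    """Toy success test: same heading and pitch, position within tol_m."""
    close = math.hypot(pose.x - target.x, pose.z - target.z) <= tol_m
    return close and pose.yaw == target.yaw and pose.pitch == target.pitch


def run_episode(
    start: Pose,
    target: Pose,
    policy: Callable[[Pose, Pose, int], str],
    max_steps: int = 20,
) -> Tuple[bool, int]:
    """Observe, act, re-observe until the view matches or steps run out."""
    pose = start
    for t in range(max_steps):
        if matches(pose, target):
            return True, t
        pose = apply_action(pose, policy(pose, target, t))
    return matches(pose, target), max_steps
```

In the real task the policy would be a foundation model that sees only images; here a scripted policy such as `lambda obs, tgt, t: ["move_ahead", "rotate_right"][t]` is enough to exercise the loop.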

TVRBench is an indoor-simulation benchmark built on AI2-THOR, covering single-room (iTHOR) and multi-room (ProcTHOR) scenes with diagnostics for exploration efficiency, spatial memory, and perception-to-action mapping.

Installation

conda create -n tvrbench python=3.10 -y
conda activate tvrbench
pip install -r requirements.txt

# Vulkan driver (Linux, required for AI2-THOR CloudRendering)
apt-get -y install libvulkan1 vulkan-tools
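
A quick way to confirm the Vulkan tooling landed on `PATH` is sketched below. It assumes that the `vulkan-tools` package provides the `vulkaninfo` binary, as it does on common Debian/Ubuntu releases; the helper name `missing_tools` is made up for this example.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("vulkaninfo",)) -> List[str]:
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Vulkan tools found.")
```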

Data

Benchmark data is included in the repo:

data/
├── scene_splits.json              # scene split assignments (SFT / Eval / RL)
├── procthor-10k/                  # ProcTHOR house definitions
│   ├── train.jsonl.gz
│   ├── val.jsonl.gz
│   └── test.jsonl.gz
└── tasks/
    ├── sft.json                   # 1,600 SFT tasks (40 scenes)
    ├── eval.json                  # 500 eval tasks (80 scenes)
    └── rl.json                    # 4,800 RL tasks (120 scenes)
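
The `.jsonl.gz` files are gzip-compressed JSON Lines, so they can be streamed one record at a time without unpacking them first. The reader below is a minimal sketch; the helper name `iter_jsonl_gz` is hypothetical, and the fields inside each house record are not assumed.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For example, `next(iter_jsonl_gz("data/procthor-10k/val.jsonl.gz"))` would return the first record of the validation file.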

6,900 tasks across 240 scenes (120 iTHOR + 120 ProcTHOR), split into 4 difficulty categories:

| Category | Visual Complexity | Navigation |
|---|---|---|
| iTHOR easy (SR-easy) | high (seg >= 9) | short (2–8 steps) |
| iTHOR hard (SR-hard) | low (seg 3–6) | short (2–8 steps) |
| ProcTHOR easy (LR-easy) | high (seg >= 9) | cross-room (10–20 steps) |
| ProcTHOR hard (LR-hard) | low (seg 3–6) | cross-room (10–20 steps) |
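
The thresholds in the table can be restated as a small lookup. This is only a sketch: the function name `categorize` is hypothetical, `seg` is taken to be the integer visual-complexity measure listed in the table, and `steps` the navigation length. Values that fall between the listed ranges return `None` rather than being guessed.

```python
from typing import Optional


def categorize(seg: int, steps: int) -> Optional[str]:
    """Map (seg, steps) to SR-easy / SR-hard / LR-easy / LR-hard, else None."""
    if 2 <= steps <= 8:
        scope = "SR"   # short navigation (iTHOR rows)
    elif 10 <= steps <= 20:
        scope = "LR"   # cross-room navigation (ProcTHOR rows)
    else:
        return None

    if seg >= 9:
        level = "easy"  # high visual complexity
    elif 3 <= seg <= 6:
        level = "hard"  # low visual complexity
    else:
        return None

    return f"{scope}-{level}"
```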

Models & Datasets

Available on HuggingFace:

| Resource | Link | Description |
|---|---|---|
| TVR-Qwen3.5-9B-VA-SFT-RL | Model | Best model (51.4% SR): VA-SFT + Online GRPO |
| TVR-SFT-VA | Dataset | Visual-Action SFT data (1,600 trajectories) |
| TVR-SFT-VA-CoT | Dataset | VA-SFT with Chain-of-Thought reasoning |

Evaluation

1. Start vLLM Server

conda activate vllm

CUDA_VISIBLE_DEVICES=0,1 vllm serve TVRBench/tvr-qwen3.5-9b-va-sft-rl \
    --tensor-parallel-size 2 --max-model-len 16384 \
    --trust-remote-code
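
Before launching the evaluation, it can help to confirm that the server is up. The sketch below queries the `/v1/models` route of the OpenAI-compatible API that vLLM serves and returns the model IDs it reports; the helper names are hypothetical, and the base URL matches the `--api_base` value used in the next step.

```python
import json
import urllib.request
from typing import List


def models_url(api_base: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return api_base.rstrip("/") + "/models"


def list_served_models(api_base: str, timeout: float = 5.0) -> List[str]:
    """Return served model IDs, or an empty list if the server is unreachable."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):  # connection refused, timeout, HTTP error, bad JSON
        return []
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    print(list_served_models("http://localhost:8000/v1"))
```

An empty list means the server is not ready yet (model loading can take a while) or is listening on a different port.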

2. Run Evaluation

conda activate tvrbench

PYTHONPATH=. python scripts/eval_va.py \
    --model_name TVRBench/tvr-qwen3.5-9b-va-sft-rl \
    --api_base http://localhost:8000/v1 \
    --task_file data/tasks/eval.json \
    --output_dir outputs/eval_results \
    --temperature 0.0 \
    --gpu_ids 2 3 --procs_per_gpu 5 \
    --resume

  • --gpu_ids: GPUs for AI2-THOR rendering (separate from the vLLM GPUs)
  • --procs_per_gpu: number of parallel evaluation workers per GPU
  • --resume: skip completed tasks
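
To make the interaction of these flags concrete, here is a hedged sketch of how such a run could be planned: `--gpu_ids 2 3 --procs_per_gpu 5` yields ten workers, and resuming filters out tasks that already have results. This is not the logic of `scripts/eval_va.py`; the function names and the round-robin assignment are assumptions chosen for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple

Worker = Tuple[int, int]  # (gpu_id, slot index on that GPU)


def build_workers(gpu_ids: Sequence[int], procs_per_gpu: int) -> List[Worker]:
    """One worker per (GPU, slot) pair."""
    return [(gpu, slot) for gpu in gpu_ids for slot in range(procs_per_gpu)]


def pending_tasks(task_ids: Iterable[Hashable], completed: Iterable[Hashable]) -> list:
    """Resume behaviour: keep only tasks without a saved result, in order."""
    done = set(completed)
    return [t for t in task_ids if t not in done]


def assign_round_robin(tasks: Sequence[Hashable], workers: Sequence[Worker]) -> Dict[Worker, list]:
    """Spread tasks evenly over workers."""
    plan: Dict[Worker, list] = {w: [] for w in workers}
    for i, task in enumerate(tasks):
        plan[workers[i % len(workers)]].append(task)
    return plan
```

With 500 evaluation tasks and ten workers, an even split gives each worker 50 tasks on a fresh run.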

Training

SFT with LLaMA-Factory

Tested with LLaMA-Factory v0.9.5. Download SFT data from HuggingFace, then update paths in configs/train_va_sft.yaml:

pip install -e LLaMA-Factory
llamafactory-cli train configs/train_va_sft.yaml

Online RL with verl

Tested with verl v0.8.0.

MODEL_PATH=/path/to/va-sft-checkpoint \
bash scripts/run_online_rl.sh full

See configs/online_rl/ for RL training configuration.
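
For readers unfamiliar with GRPO, which the model table lists as the online RL method, its core idea is to score each rollout relative to the other rollouts sampled for the same task. The sketch below shows that group-relative normalization in isolation. It is a generic illustration, not the verl implementation or this project's reward; the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with rewards `[1, 0, 0, 1]`, successful rollouts receive advantages close to +1 and failed ones close to -1; a group where every rollout gets the same reward yields all zeros and therefore no learning signal.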

License

This project is licensed under the Apache License 2.0.

Acknowledgments

This project uses AI2-THOR, LLaMA-Factory, verl, and vLLM.
