Liyang Li*, Muzhi Zhu*, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen†
Zhejiang University
*Equal contribution †Corresponding author
Target Viewpoint Reproduction (TVR) is a closed-loop active perception task: the agent receives a target image and an initial observation in a 3D indoor environment, then acts (translate, rotate, adjust head pitch) until its observation matches the target viewpoint.
TVRBench is an indoor-simulation benchmark built on AI2-THOR, covering single-room (iTHOR) and multi-room (ProcTHOR) scenes with diagnostics for exploration efficiency, spatial memory, and perception-to-action mapping.
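The closed-loop structure of TVR (observe, act, re-observe until the view matches) can be illustrated with a toy sketch. Everything here is hypothetical and not part of TVRBench: the `Pose` grid world, the action names, and `oracle_policy`, which reads the target pose directly and stands in for a model that would have to infer it from images.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Pose:
    x: int      # grid cell (toy stand-in for world position)
    z: int
    yaw: int    # degrees, one of 0 / 90 / 180 / 270
    pitch: int  # degrees, one of -30 / 0 / 30

# Hypothetical heading -> grid displacement for one forward step.
FORWARD = {0: (0, 1), 90: (1, 0), 180: (0, -1), 270: (-1, 0)}

def step(pose: Pose, action: str) -> Pose:
    """Apply one action (translate, rotate, or adjust head pitch)."""
    if action == "forward":
        dx, dz = FORWARD[pose.yaw]
        return replace(pose, x=pose.x + dx, z=pose.z + dz)
    if action == "turn_right":
        return replace(pose, yaw=(pose.yaw + 90) % 360)
    if action == "turn_left":
        return replace(pose, yaw=(pose.yaw - 90) % 360)
    if action == "look_down":
        return replace(pose, pitch=min(pose.pitch + 30, 30))
    if action == "look_up":
        return replace(pose, pitch=max(pose.pitch - 30, -30))
    raise ValueError(f"unknown action: {action}")

def oracle_policy(cur: Pose, tgt: Pose) -> str:
    """Toy policy with privileged access to the target pose."""
    if (cur.x, cur.z) != (tgt.x, tgt.z):
        if cur.x != tgt.x:
            want = 90 if tgt.x > cur.x else 270
        else:
            want = 0 if tgt.z > cur.z else 180
        return "forward" if cur.yaw == want else "turn_right"
    if cur.yaw != tgt.yaw:
        return "turn_right"
    if cur.pitch != tgt.pitch:
        return "look_down" if tgt.pitch > cur.pitch else "look_up"
    return "done"

def run_episode(start: Pose, target: Pose, max_steps: int = 50):
    """Closed loop: pick an action from the current state, apply it, repeat.

    Returns (success, number_of_actions_taken).
    """
    pose = start
    for t in range(max_steps):
        action = oracle_policy(pose, target)
        if action == "done":
            return pose == target, t
        pose = step(pose, action)
    return False, max_steps
```

In the real task the agent sees only images, so the comparison inside `oracle_policy` is exactly the part a vision-language model has to learn.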
conda create -n tvrbench python=3.10 -y
conda activate tvrbench
pip install -r requirements.txt
# Vulkan driver (Linux, required for AI2-THOR CloudRendering)
apt-get -y install libvulkan1 vulkan-tools

Benchmark data is included in the repo:
data/
├── scene_splits.json        # scene split assignments (SFT / Eval / RL)
├── procthor-10k/            # ProcTHOR house definitions
│   ├── train.jsonl.gz
│   ├── val.jsonl.gz
│   └── test.jsonl.gz
└── tasks/
    ├── sft.json             # 1,600 SFT tasks (40 scenes)
    ├── eval.json            # 500 eval tasks (80 scenes)
    └── rl.json              # 4,800 RL tasks (120 scenes)
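A quick sanity check of a local checkout is to count the tasks in each split and compare against the numbers in the tree above. The sketch below assumes each file under `data/tasks/` is a JSON array with one element per task; the README does not state the schema, so treat that as an assumption. The helper names `count_tasks` and `mismatched_splits` are made up for this example.

```python
import json
from pathlib import Path

# Task counts stated in the directory listing.
EXPECTED = {"sft": 1600, "eval": 500, "rl": 4800}

def count_tasks(task_dir):
    """Return {split: number of top-level entries} for each task file.

    Assumes each <split>.json is a JSON array of tasks; if the top level
    is an object instead, len() counts its keys.
    """
    counts = {}
    for split in EXPECTED:
        with (Path(task_dir) / f"{split}.json").open() as f:
            counts[split] = len(json.load(f))
    return counts

def mismatched_splits(counts):
    """List the splits whose count differs from the documented number."""
    return [s for s, n in EXPECTED.items() if counts.get(s) != n]
```

Calling `mismatched_splits(count_tasks("data/tasks"))` from the repo root would return an empty list if the files match the documented counts, which total 6,900.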
6,900 tasks across 240 scenes (120 iTHOR + 120 ProcTHOR), split into 4 difficulty categories:
| Category | Visual Complexity | Navigation |
|---|---|---|
| iTHOR easy (SR-easy) | high (seg >= 9) | short (2–8 steps) |
| iTHOR hard (SR-hard) | low (seg 3–6) | short (2–8 steps) |
| ProcTHOR easy (LR-easy) | high (seg >= 9) | cross-room (10–20 steps) |
| ProcTHOR hard (LR-hard) | low (seg 3–6) | cross-room (10–20 steps) |
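The table can be read as a two-key lookup: the scene source picks the prefix and navigation range, and the `seg` value picks easy or hard. The sketch below encodes that reading. The function name and arguments are hypothetical, and `seg` is used exactly as the table's threshold without assuming how it is computed.

```python
# Navigation step ranges per scene source, as listed in the table.
STEP_RANGE = {"SR": (2, 8), "LR": (10, 20)}

# Scene source -> category prefix used in the table.
PREFIX = {"ithor": "SR", "procthor": "LR"}

def difficulty_category(source, seg):
    """Map (scene source, seg value) to a category label from the table.

    Returns None when seg falls outside both documented bands
    (below 3, or 7 to 8), since the table assigns no category there.
    """
    prefix = PREFIX[source.lower()]
    if seg >= 9:
        return f"{prefix}-easy"   # high visual complexity
    if 3 <= seg <= 6:
        return f"{prefix}-hard"   # low visual complexity
    return None
```

Note that difficulty here tracks visual complexity rather than path length: within one scene source, the easy and hard categories share the same step range.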
Available on HuggingFace:
| Resource | Link | Description |
|---|---|---|
| TVR-Qwen3.5-9B-VA-SFT-RL | Model | Best model (51.4% SR) — VA-SFT + Online GRPO |
| TVR-SFT-VA | Dataset | Visual-Action SFT data (1,600 trajectories) |
| TVR-SFT-VA-CoT | Dataset | VA-SFT with Chain-of-Thought reasoning |
conda activate vllm
CUDA_VISIBLE_DEVICES=0,1 vllm serve TVRBench/tvr-qwen3.5-9b-va-sft-rl \
  --tensor-parallel-size 2 --max-model-len 16384 \
  --trust-remote-code

conda activate tvrbench
PYTHONPATH=. python scripts/eval_va.py \
  --model_name TVRBench/tvr-qwen3.5-9b-va-sft-rl \
  --api_base http://localhost:8000/v1 \
  --task_file data/tasks/eval.json \
  --output_dir outputs/eval_results \
  --temperature 0.0 \
  --gpu_ids 2 3 --procs_per_gpu 5 \
  --resume

- --gpu_ids: GPUs for AI2-THOR rendering (separate from vLLM GPUs)
- --procs_per_gpu: parallel evaluation workers per GPU
- --resume: skip completed tasks
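With `--gpu_ids 2 3 --procs_per_gpu 5`, the flag descriptions imply 2 × 5 = 10 parallel workers. How `scripts/eval_va.py` actually assigns tasks to workers is not documented here; the sketch below only illustrates one plausible arrangement (round-robin sharding after filtering out completed tasks), and all function names are hypothetical.

```python
def worker_slots(gpu_ids, procs_per_gpu):
    """Enumerate (gpu_id, slot) pairs, one per evaluation worker."""
    return [(gpu, slot) for gpu in gpu_ids for slot in range(procs_per_gpu)]

def pending_tasks(task_ids, completed_ids):
    """Resume behaviour: keep only tasks without a saved result."""
    done = set(completed_ids)
    return [t for t in task_ids if t not in done]

def shard(tasks, n_workers):
    """Round-robin split so shard sizes differ by at most one task."""
    return [tasks[i::n_workers] for i in range(n_workers)]
```

For the 500 eval tasks and 10 workers this gives 50 tasks per worker on a fresh run; after an interruption, only the remaining tasks are redistributed.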
Tested with LLaMA-Factory v0.9.5. Download SFT data from HuggingFace, then update paths in configs/train_va_sft.yaml:
pip install -e LLaMA-Factory
llamafactory-cli train configs/train_va_sft.yaml

Tested with verl v0.8.0.
MODEL_PATH=/path/to/va-sft-checkpoint \
  bash scripts/run_online_rl.sh full

See configs/online_rl/ for RL training configuration.
This project is licensed under the Apache License 2.0.
This project uses AI2-THOR, LLaMA-Factory, verl, and vLLM.
