Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Liyang Li^*, Muzhi Zhu^*, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen^†
Zhejiang University
^*Equal contribution ^†Corresponding author

Overview

Target Viewpoint Reproduction (TVR) is a closed-loop active perception task: the agent receives a target image and an initial observation in a 3D indoor environment, then acts (translate, rotate, adjust head pitch) until its observation matches the target viewpoint.

TVRBench is an indoor-simulation benchmark built on AI2-THOR, covering single-room (iTHOR) and multi-room (ProcTHOR) scenes with diagnostics for exploration efficiency, spatial memory, and perception-to-action mapping.

Installation

conda create -n tvrbench python=3.10 -y
conda activate tvrbench
pip install -r requirements.txt

# Vulkan driver (Linux, required for AI2-THOR CloudRendering)
apt-get -y install libvulkan1 vulkan-tools

Data

Benchmark data is included in the repo:

data/
├── scene_splits.json              # scene split assignments (SFT / Eval / RL)
├── procthor-10k/                  # ProcTHOR house definitions
│   ├── train.jsonl.gz
│   ├── val.jsonl.gz
│   └── test.jsonl.gz
└── tasks/
    ├── sft.json                   # 1,600 SFT tasks (40 scenes)
    ├── eval.json                  # 500 eval tasks (80 scenes)
    └── rl.json                    # 4,800 RL tasks (120 scenes)

6,900 tasks across 240 scenes (120 iTHOR + 120 ProcTHOR), split into 4 difficulty categories:

Category	Visual Complexity	Navigation
iTHOR easy (SR-easy)	high (seg >= 9)	short (2–8 steps)
iTHOR hard (SR-hard)	low (seg 3–6)	short (2–8 steps)
ProcTHOR easy (LR-easy)	high (seg >= 9)	cross-room (10–20 steps)
ProcTHOR hard (LR-hard)	low (seg 3–6)	cross-room (10–20 steps)

Models & Datasets

Available on HuggingFace:

Resource	Link	Description
TVR-Qwen3.5-9B-VA-SFT-RL	Model	Best model (51.4% SR) — VA-SFT + Online GRPO
TVR-SFT-VA	Dataset	Visual-Action SFT data (1,600 trajectories)
TVR-SFT-VA-CoT	Dataset	VA-SFT with Chain-of-Thought reasoning

Evaluation

1. Start vLLM Server

conda activate vllm

CUDA_VISIBLE_DEVICES=0,1 vllm serve TVRBench/tvr-qwen3.5-9b-va-sft-rl \
    --tensor-parallel-size 2 --max-model-len 16384 \
    --trust-remote-code

2. Run Evaluation

conda activate tvrbench

PYTHONPATH=. python scripts/eval_va.py \
    --model_name TVRBench/tvr-qwen3.5-9b-va-sft-rl \
    --api_base http://localhost:8000/v1 \
    --task_file data/tasks/eval.json \
    --output_dir outputs/eval_results \
    --temperature 0.0 \
    --gpu_ids 2 3 --procs_per_gpu 5 \
    --resume

--gpu_ids: GPUs for AI2-THOR rendering (separate from vLLM GPUs)
--procs_per_gpu: parallel evaluation workers per GPU
--resume: skip completed tasks

Training

SFT with LLaMA-Factory

Tested with LLaMA-Factory v0.9.5. Download SFT data from HuggingFace, then update paths in configs/train_va_sft.yaml:

pip install -e LLaMA-Factory
llamafactory-cli train configs/train_va_sft.yaml

Online RL with verl

Tested with verl v0.8.0.

MODEL_PATH=/path/to/va-sft-checkpoint \
bash scripts/run_online_rl.sh full

See configs/online_rl/ for RL training configuration.

License

This project is licensed under the Apache License 2.0.

Acknowledgments

This project uses AI2-THOR, LLaMA-Factory, verl, and vLLM.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
assets		assets
configs		configs
data		data
scripts		scripts
tvrbench		tvrbench
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Overview

Installation

Data

Models & Datasets

Evaluation

1. Start vLLM Server

2. Run Evaluation

Training

SFT with LLaMA-Factory

Online RL with verl

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

Overview

Installation

Data

Models & Datasets

Evaluation

1. Start vLLM Server

2. Run Evaluation

Training

SFT with LLaMA-Factory

Online RL with verl

License

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages