LagerNVS is a feed-forward model for novel view synthesis (NVS). Given one or more input images of a scene, it synthesizes new views from arbitrary camera viewpoints in a single forward pass, without per-scene optimization. The model exploits 3D inductive biases without maintaining explicit 3D representations. The architecture pairs a large 3D-aware encoder (initialized from VGGT pre-training), which extracts scene tokens, with a transformer-based renderer that conditions on these tokens via cross-attention to render novel views.
```bash
# Clone the repository
git clone https://github.com/facebookresearch/lagernvs.git
cd lagernvs

# Create conda environment
conda create -n lagernvs python=3.10
conda activate lagernvs

# Install PyTorch (CUDA 12.6)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126

# Install remaining dependencies
pip install -r requirements.txt
```

The model checkpoints are hosted on HuggingFace as gated repositories. You must authenticate before downloading:
- Create a HuggingFace account at https://huggingface.co if you don't have one.
- Request access by visiting the model page (e.g., facebook/lagernvs_general_512) and clicking "Agree and access repository".
- Create an access token at https://huggingface.co/settings/tokens with at least `Read` scope.
- Set the token as an environment variable:

```bash
export HF_TOKEN=hf_your_token_here
```

To persist this across sessions, add the export to your ~/.bashrc or write the token to the HuggingFace cache:

```bash
mkdir -p ~/.cache/huggingface
echo "hf_your_token_here" > ~/.cache/huggingface/token
```

You can verify access with:

```bash
python -c "from huggingface_hub import list_repo_files; print('\n'.join(list_repo_files('facebook/lagernvs_general_512')))"
```

Run inference with the general model on your own images:
```bash
python minimal_inference.py --images path/to/img1.png path/to/img2.png
```

To use a different checkpoint (see Available Checkpoints below):

```bash
# Re10k model (256px, 2-view, posed)
python minimal_inference.py \
    --images path/to/img1.png path/to/img2.png \
    --model_repo facebook/lagernvs_re10k_2v_256 \
    --attention_type full_attention \
    --target_size 256 \
    --mode square_crop

# DL3DV model (256px, 2-6 views, posed)
python minimal_inference.py \
    --images path/to/img1.png path/to/img2.png path/to/img3.png \
    --model_repo facebook/lagernvs_dl3dv_2-6_v_256 \
    --attention_type bidirectional_cross_attention \
    --target_size 256 \
    --mode square_crop
```

Run `python minimal_inference.py --help` for all options (`--video_length`, `--output`, etc.).
See minimal_inference.py for the fully commented source of truth. For interactive step-by-step exploration with visualization of intermediate results (loaded images, camera trajectories, sampled output frames), see the inference.ipynb notebook.
All models use the EncDecVitB/8 architecture (VGGT encoder + 12-layer renderer, patch size 8). Three checkpoints are available on HuggingFace.
We recommend using the general model for most use cases, as it is trained on a large dataset of scenes and can handle a wide range of input conditions.
The Re10k and DL3DV models are shared primarily for benchmarking and reproducibility.
| Checkpoint | HuggingFace Repo | Training Data | Resolution | Train Cond. Views | Camera Poses |
|---|---|---|---|---|---|
| General | facebook/lagernvs_general_512 | 13 datasets | 512 (longer side) | 1-10 | Posed and unposed |
| Re10k | facebook/lagernvs_re10k_2v_256 | Re10k only | 256x256 | 2 | Posed only |
| DL3DV | facebook/lagernvs_dl3dv_2-6_v_256 | DL3DV only | 256x256 | 2-6 | Posed only |
Checkpoints are auto-downloaded from HuggingFace when using hf:// paths in config files.
- Download the Re10k data prepared by pixelSplat (CVPR 2024), hosted here, and unzip it.
- Set up your data root directory, then download and run the preprocessing script process_data.py from LVSM:

```bash
export LAGERNVS_DATA_ROOT=/path/to/your/data

# Preprocess test split
python process_data.py \
    --base_path /path/to/downloaded_and_unzipped/re10k_from_pixelsplat \
    --output_dir $LAGERNVS_DATA_ROOT/re10k \
    --mode test
```

The expected dataset organization after preprocessing is:
```
$LAGERNVS_DATA_ROOT/re10k/
└── test/
    ├── images/
    │   └── <sequence_id>/
    │       ├── 00000.png
    │       ├── 00001.png
    │       └── ...
    ├── metadata/
    │   ├── <sequence_id>.json
    │   └── ...
    └── full_list.txt
```
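A quick structural check can catch preprocessing problems before a full evaluation run. The sketch below (the function and its checks are our own, not part of the repo) verifies that every image sequence has matching metadata and that the split listing exists:

```python
from pathlib import Path

def verify_re10k_split(split_dir):
    """Check one split directory (e.g. $LAGERNVS_DATA_ROOT/re10k/test).

    Returns a list of human-readable problems; an empty list means the
    layout looks consistent. Illustrative helper, not a repo utility.
    """
    split_dir = Path(split_dir)
    problems = []
    image_seqs = {p.name for p in (split_dir / "images").iterdir() if p.is_dir()}
    meta_seqs = {p.stem for p in (split_dir / "metadata").glob("*.json")}
    for seq in sorted(image_seqs - meta_seqs):
        problems.append(f"missing metadata for sequence {seq}")
    for seq in sorted(meta_seqs - image_seqs):
        problems.append(f"metadata without images: {seq}")
    if not (split_dir / "full_list.txt").exists():
        problems.append("full_list.txt not found")
    return problems
```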
- Run the evaluation:

```bash
# Evaluate on Re10k (posed, 2-view, 256x256)
torchrun --nproc_per_node=8 run_eval.py \
    -c config/eval_re10k.yaml \
    -e re10k_eval
```

The script defaults to 8 GPUs with a global batch size of 512. By default it saves images and renders videos as part of evaluation; this can be slow and consume substantial memory and storage. Adjust the batch size and GPU count to your hardware, and optionally remove visualization saving from run_eval.py.

- Verify the expected scores: PSNR 31.39, SSIM 0.928, LPIPS 0.078.
- Download the DL3DV benchmark subset required for evaluation:

```bash
export LAGERNVS_DATA_ROOT=/path/to/your/data
cd data_prep/dl3dv
python download_eval.py \
    --output_dir $LAGERNVS_DATA_ROOT/dl3dv \
    --view_indices_path ../../assets/dl3dv_6v.json
```

This script automatically downloads scenes from the correct HuggingFace repositories:

- Most scenes come from DL3DV/DL3DV-ALL-960P (scenes with an XK prefix like "2K/...", "3K/...").
- A few (5) benchmark scenes are not included in the 'ALL' version and therefore have to be downloaded separately from DL3DV/DL3DV-10K-Benchmark.

Note: you need to request access to the DL3DV datasets on HuggingFace and authenticate via `huggingface-cli login`, or add your token to your environment variables, before running the download script.
- The data structure should look like:

```
$LAGERNVS_DATA_ROOT/dl3dv/
├── <batch>/<sequence_id>/
│   ├── images_4/
│   │   ├── frame_00001.png
│   │   └── ...
│   └── transforms.json
├── full_list_train.txt
└── full_list_test.txt
```
- Run the evaluation:

```bash
# Evaluate on DL3DV (posed, 6-view, 256x256)
torchrun --nproc_per_node=8 run_eval.py \
    -c config/eval_dl3dv.yaml \
    -e dl3dv_eval
```

The script defaults to 8 GPUs with a global batch size of 512. By default it saves images and renders videos as part of evaluation; this can be slow and consume substantial memory and storage. Adjust the batch size and GPU count to your hardware, and optionally remove visualization saving from run_eval.py.

We provide view indices for 2-view, 4-view, and 6-view evaluation in assets/dl3dv_2v.json, assets/dl3dv_4v.json, and assets/dl3dv_6v.json. To evaluate with a different number of views, update the download command and modify config/eval_dl3dv.yaml to point to the appropriate JSON file and set num_cond_views accordingly.

- Verify the expected scores (6-view): PSNR 29.45, SSIM 0.904, LPIPS 0.068.
The general model can be evaluated on any dataset that has been preprocessed in the format described above (Re10k or DL3DV format). Here we show unposed evaluation on DL3DV at 512x512 resolution.
- Ensure DL3DV data is prepared as described in the DL3DV section above.
- Run the evaluation:

```bash
# Evaluate general model on DL3DV (unposed, 6-view, 512x512)
torchrun --nproc_per_node=8 run_eval.py \
    -c config/eval_dl3dv_general.yaml \
    -e dl3dv_general_eval
```

- Customizing evaluation settings: the config file config/eval_dl3dv_general.yaml can be modified for different evaluation scenarios:
  - Posed vs. unposed: set `zero_out_cam_cond_p: 0.0` for posed evaluation (ground-truth poses) or `zero_out_cam_cond_p: 1.0` for unposed evaluation (source cameras unavailable; ground-truth poses are still used for scale normalization and for specifying the target camera).
  - Number of views: update `num_cond_views` and `test_view_indices_path` to use assets/dl3dv_2v.json, assets/dl3dv_4v.json, or assets/dl3dv_6v.json.
  - Resolution: adjust `data.im_size_hw` to change the input/output resolution. Images are center-cropped to match the target aspect ratio, then resized to the specified dimensions. Both dimensions must be divisible by 8, and the longer side must be 512.
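To make the cropping rule concrete, the sketch below computes the centered crop box for a source image targeting a given `im_size_hw`. This mirrors the behavior described above but is our own illustration, not the repo's dataloader code:

```python
def center_crop_box(w, h, target_hw):
    """Largest centered crop of a (w x h) image matching the target aspect
    ratio; the caller then resizes the crop to (th, tw). Sketch only."""
    th, tw = target_hw
    assert th % 8 == 0 and tw % 8 == 0, "both dims must be divisible by 8"
    target_ar = tw / th
    if w / h > target_ar:                      # source too wide: crop width
        new_w, new_h = round(h * target_ar), h
    else:                                      # source too tall: crop height
        new_w, new_h = w, round(w / target_ar)
    left, top = (w - new_w) // 2, (h - new_h) // 2
    return left, top, left + new_w, top + new_h

# A 1920x1080 source with a square 512x512 target keeps the central 1080x1080.
print(center_crop_box(1920, 1080, (512, 512)))  # (420, 0, 1500, 1080)
```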
For evaluation on other datasets, organize your data in a similar format (images in a directory, camera parameters in JSON) and adapt the dataloader accordingly. Camera parameters should follow OpenCV conventions (right-handed coordinate system, camera looks along +Z axis, +Y points down).
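For reference, a world point projects into a camera under these OpenCV conventions as sketched below (`project_point` is our own illustrative helper, not a repo function):

```python
def project_point(p_world, w2c, fxfycxcy):
    """Project a 3D world point with OpenCV conventions (+Z forward, +Y down).

    `w2c` is a 4x4 world-to-camera matrix given as nested lists; intrinsics
    are [fx, fy, cx, cy] in pixels. Illustrative helper only.
    """
    fx, fy, cx, cy = fxfycxcy
    x, y, z = p_world
    cam = [w2c[i][0] * x + w2c[i][1] * y + w2c[i][2] * z + w2c[i][3] for i in range(3)]
    assert cam[2] > 0, "visible points have positive depth along +Z"
    return fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy

# Identity pose: camera at the world origin looking down +Z;
# a point on the optical axis lands at the principal point (cx, cy).
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(project_point((0.0, 0.0, 2.0), identity, (500.0, 500.0, 256.0, 256.0)))  # (256.0, 256.0)
```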
Training LagerNVS requires preparing the training data first. We provide data loaders and training configs for Re10k and DL3DV datasets.
- Download the Re10k data prepared by pixelSplat (CVPR 2024) from here and unzip it.
- Set up your data root directory and run preprocessing:

```bash
export LAGERNVS_DATA_ROOT=/path/to/your/data

# Download the preprocessing script from LVSM
# https://github.com/Haian-Jin/LVSM/blob/main/process_data.py

# Preprocess train split
python process_data.py \
    --base_path /path/to/downloaded_and_unzipped/re10k_from_pixelsplat \
    --output_dir $LAGERNVS_DATA_ROOT/re10k \
    --mode train

# Preprocess test split
python process_data.py \
    --base_path /path/to/downloaded_and_unzipped/re10k_from_pixelsplat \
    --output_dir $LAGERNVS_DATA_ROOT/re10k \
    --mode test
```

The expected structure after preprocessing:
```
$LAGERNVS_DATA_ROOT/re10k/
├── train/
│   ├── images/<sequence_id>/*.png
│   ├── metadata/<sequence_id>.json
│   └── full_list.txt
└── test/
    ├── images/<sequence_id>/*.png
    ├── metadata/<sequence_id>.json
    └── full_list.txt
```
- Request access to the DL3DV dataset on HuggingFace:
  - Go to https://huggingface.co/datasets/DL3DV/DL3DV-ALL-960P
  - Request access and authenticate via `huggingface-cli login`
- Download the DL3DV metadata CSV:

```bash
wget https://raw.githubusercontent.com/DL3DV-10K/Dataset/main/cache/DL3DV-valid.csv
```

- Download training data using our provided script:

```bash
export LAGERNVS_DATA_ROOT=/path/to/your/data

# Download a batch of DL3DV training data (e.g., batch 2K with 1000 scenes)
cd data_prep/dl3dv
python download_train.py \
    --output_dir $LAGERNVS_DATA_ROOT/dl3dv \
    --batch 2K \
    --csv_path /path/to/DL3DV-valid.csv \
    --max_scenes 1000 \
    --test_view_indices_paths ../../assets/dl3dv_2v.json ../../assets/dl3dv_4v.json ../../assets/dl3dv_6v.json
# --max_scenes: adjust based on your needs
# --test_view_indices_paths: excludes eval scenes from the train download and the generated .txt lists

# For full training, download multiple batches (1K through 10K)
# Each batch contains approximately 1000 scenes
```

- Download evaluation data:

```bash
python download_eval.py \
    --output_dir $LAGERNVS_DATA_ROOT/dl3dv \
    --view_indices_path ../../assets/dl3dv_6v.json
```

The expected structure:
```
$LAGERNVS_DATA_ROOT/dl3dv/
├── <batch>/<sequence_id>/
│   ├── images_4/
│   │   ├── frame_00001.png
│   │   └── ...
│   └── transforms.json
├── full_list_train.txt
└── full_list_test.txt
```
Train on Re10k or DL3DV from scratch:

```bash
# Train on Re10k (256x256, 2-view conditioning)
torchrun --nproc_per_node=8 train.py \
    -c config/train_re10k.yaml \
    -e re10k_train

# Train on DL3DV (256x256, 2-6 view conditioning)
torchrun --nproc_per_node=8 train.py \
    -c config/train_dl3dv.yaml \
    -e dl3dv_train

# Train general model on Re10k + DL3DV (512x512, 1-10 view conditioning)
torchrun --nproc_per_node=8 train.py \
    -c config/train_base.yaml \
    -e base_train
```

Training hyperparameters can be overridden via the command line:

```bash
# Reduce batch size for fewer GPUs
torchrun --nproc_per_node=4 train.py \
    -c config/train_re10k.yaml \
    -e re10k_train_4gpu \
    --opt.batch_size 256

# Single-GPU training for testing
torchrun --nproc_per_node=1 train.py \
    -c config/train_re10k.yaml \
    -e re10k_test \
    --opt.batch_size 4 \
    --opt.num_iter_total 100
```

To train on your own datasets:
- Prepare data in Re10k format:
  - Images: `images/<sequence_id>/NNNNN.png`
  - Camera parameters: `metadata/<sequence_id>.json` with per-frame entries containing:
    - `fxfycxcy`: [fx, fy, cx, cy] intrinsics in pixels
    - `w2c`: 4x4 world-to-camera matrix (OpenCV convention: +Y down, +Z forward)
  - See data_root/re10k/test/metadata/ for examples of the expected JSON format.
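As a concrete illustration of those fields (the values and the exact nesting of frames in the JSON are hypothetical; check the shipped examples in data_root/re10k/test/metadata/ for the authoritative schema), a single frame's entry could be written as:

```python
import json

# Hypothetical single-frame metadata entry following the field list above.
frame = {
    "fxfycxcy": [500.0, 500.0, 256.0, 256.0],  # intrinsics in pixels
    "w2c": [                                    # world-to-camera, OpenCV convention
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 2.0],  # world origin sits 2 units in front of the camera
        [0.0, 0.0, 0.0, 1.0],
    ],
}
print(json.dumps(frame, indent=2))
```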
- Create a dataset class in data/sources/ following re10k_dataset.py as a template.
- Register your dataset in data/dataset_factory.py:

```python
_dataset_registry = {
    "re10k": (re10k_dataset.Re10kDataset, SEQUENTIAL),
    "dl3dv": (dl3dv_dataset.Dl3dvDataset, SEQUENTIAL),
    "my_dataset": (my_dataset.MyDataset, SEQUENTIAL),  # Add this
}
```
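The shape of such a class depends on the base interface in data/sources/re10k_dataset.py, which is not reproduced here. As a rough standalone sketch (all names hypothetical), a reader for the Re10k layout might index sequences like this:

```python
import json
from pathlib import Path

class MyDataset:
    """Hypothetical sketch of a Re10k-format reader; follow
    data/sources/re10k_dataset.py for the real base-class interface."""

    def __init__(self, root, split="train"):
        self.split_dir = Path(root) / split
        # One sequence per metadata file; images live in a sibling directory.
        self.seq_ids = sorted(p.stem for p in (self.split_dir / "metadata").glob("*.json"))

    def __len__(self):
        return len(self.seq_ids)

    def __getitem__(self, idx):
        seq = self.seq_ids[idx]
        meta = json.loads((self.split_dir / "metadata" / f"{seq}.json").read_text())
        frames = sorted((self.split_dir / "images" / seq).glob("*.png"))
        return {"sequence_id": seq, "frame_paths": frames, "metadata": meta}
```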
- Add it to your config:

```yaml
data:
  subdataset_list:
    - name: my_dataset
      view_sampler_range: [10, 30]
      expansion_factor: 1.0
      equalization_length: 50000
```

- Checkpoints are saved every 5000 iterations to output/<exp_name>/checkpoints/.
- Training logs are written to TensorBoard in output/<exp_name>/.
- The included training dataloaders normalize scene scales based on camera positions (no depth maps required).
- For multi-dataset training, use the base config template and add multiple subdatasets.
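To illustrate the idea behind camera-based scale normalization (a concept sketch only; the exact statistic the shipped dataloaders use may differ), one can rescale camera centers so the farthest camera sits at unit distance from their centroid, requiring no depth information:

```python
def normalize_camera_centers(centers):
    """Rescale 3D camera centers so the scene fits a unit radius around
    their centroid. Illustrative helper, not repo code."""
    n = len(centers)
    centroid = [sum(c[i] for c in centers) / n for i in range(3)]
    radius = max(
        sum((c[i] - centroid[i]) ** 2 for i in range(3)) ** 0.5 for c in centers
    )
    radius = max(radius, 1e-8)  # guard against a degenerate single-camera rig
    return [[c[i] / radius for i in range(3)] for c in centers]

# Two cameras 4 units apart are rescaled to 2 units apart (unit radius).
print(normalize_camera_centers([(0, 0, 0), (4, 0, 0)]))  # [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
```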
The training configs for Re10k and DL3DV are provided in config/train_re10k.yaml and config/train_dl3dv.yaml. For the general model, the checkpoint is available at facebook/lagernvs_general_512. Full details necessary for reproducing the general model, including hyperparameters and datasets used, are available in our paper.
If you use this code, please cite:

```bibtex
@InProceedings{lagernvs2026,
    title={LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis},
    author={Stanislaw Szymanowicz and Minghao Chen and Jianyuan Wang and Christian Rupprecht and Andrea Vedaldi},
    booktitle={CVPR},
    year={2026}
}
```

This project is licensed under the FAIR Noncommercial Research License. See LICENSE for details.
- The model is not expected to perform well on dynamic inputs, humans or animals.
- The model was trained on data where all source and target images share the same intrinsics. Deviating from this, or using images with large distortion, will degrade performance.
- Performance on very high-frequency regions such as grass or tree leaves is poor.
- The model does not hallucinate unobserved regions, because it is not a generative model.
This release is intended to support the open source research community.