Alexander Veicht · Sunghwan Hong · Dániel Baráth · Marc Pollefeys
ETH Zürich · Microsoft
ZipSplat decouples Gaussian placement from the pixel grid, reconstructing a scene from a few
unposed images in under a second, with higher quality from far fewer Gaussians.
ZipSplat is a feed-forward model for 3D Gaussian Splatting: it reconstructs a scene from a few unposed images in a single forward pass, with no poses, intrinsics, or per-scene optimization. Instead of emitting one Gaussian per pixel, it compresses the input views into a compact set of scene tokens and decodes each into a small group of Gaussians, reaching state-of-the-art quality with far fewer Gaussians and a single knob to trade quality for size. This repository hosts the inference, evaluation, and training code for ZipSplat, together with our pretrained weights.
We provide a standalone inference package zipsplat that requires only minimal
dependencies and Python >= 3.10. First clone the repository and install it:
git clone https://github.com/cvg/ZipSplat.git && cd ZipSplat
python -m pip install -e .
# optional: xformers' fused SwiGLU kernel the released checkpoint was trained with (small speedup)
python -m pip install -e ".[xformers]"
Rendering uses gsplat and requires a CUDA GPU; the pretrained weights are downloaded automatically on first use.
The easiest way to try ZipSplat is the interactive viewer: reconstruct a scene and explore it live in your browser. Drag the compression slider to trade Gaussians for quality in real time, and toggle token coloring to see how each token's group of Gaussians is placed:
pip install "zipsplat[viewer]"
python -m zipsplat.viewer assets/examples/drone.mp4 # or assets/examples/office/, or your own dir/glob/video
Here is a minimal usage example:
import math, torch
from zipsplat import ZipSplat, Camera, Pose, load_image, viz
model = ZipSplat(weights="zipsplat").cuda().eval()
images = [load_image(p) for p in paths] # raw images, any size (auto-resized)
gaussians = model(images)[0] # feed-forward 3D Gaussians
# render a novel view
camera = Camera.from_fov(math.radians(60), w=512, h=512)
pose = Pose.from_Rt(torch.eye(3), torch.zeros(3)) # identity pose
rgb, info = gaussians.render(camera, pose)
# export
gaussians.save_ply("scene.ply") # open in any 3DGS viewer
viz.turntable(gaussians, "turntable.mp4", sweep_deg=None) # wiggle orbit video
gaussians is a Gaussians object; gaussians.save_ply(...) writes a standard 3DGS .ply you can
drop into SuperSplat or any Gaussian-splat viewer.
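
To sanity-check an exported file without opening a viewer, the sketch below reads only the text header that every PLY file starts with and returns the vertex count, which in the usual 3DGS layout is one vertex per Gaussian. `read_ply_vertex_count` is a hypothetical helper, not part of zipsplat, and it makes no assumption about which per-Gaussian properties save_ply writes.

```python
def read_ply_vertex_count(path):
    """Return N from the 'element vertex N' line of a PLY header."""
    with open(path, "rb") as f:
        # Every PLY file starts with the magic line "ply".
        if f.readline().strip() != b"ply":
            raise ValueError("not a PLY file")
        for raw in f:
            line = raw.strip()
            if line.startswith(b"element vertex"):
                return int(line.split()[2])
            if line == b"end_header":
                break  # header finished without a vertex element
    raise ValueError("no 'element vertex' line in header")
```

Calling `read_ply_vertex_count("scene.ply")` on the file exported above should then report how many Gaussians were written, assuming the export follows the one-vertex-per-Gaussian convention.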
Input: images, a video, or a single image
from zipsplat import load_image, load_video
# multiple images (any sizes; center-cropped to square and resized internally)
gaussians = model([load_image(p) for p in paths])[0]
# a single image
gaussians = model([load_image("photo.jpg")])[0]
# a video clip (evenly samples num_frames across the clip)
gaussians = model(load_video("assets/examples/drone.mp4", num_frames=24))[0]
The model handles 1-24+ views; more views give wider coverage, fewer give a tighter reconstruction.
Compression: fewer Gaussians
compression is the query-sampling ratio in (0, 1]. 1.0 (default) uses every token; lower values
run k-means to keep a subset, shrinking the Gaussian count with graceful quality falloff.
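
As a rough picture of what keeping a k-means subset means, the sketch below clusters toy 2D points with plain Lloyd iterations and keeps the point nearest each centroid. It is a generic illustration only: `kmeans_select`, the toy points, the seeding, and the nearest-to-centroid rule are assumptions, and how ZipSplat actually clusters and selects its tokens is not described here.

```python
import random


def kmeans_select(points, ratio, iters=10, seed=0):
    """Pick about round(ratio * N) representative indices via Lloyd k-means."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    n = len(points)
    k = max(1, round(ratio * n))
    if k == n:
        return list(range(n))  # ratio 1.0 keeps every point
    rng = random.Random(seed)
    centers = [points[i] for i in rng.sample(range(n), k)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centers[c]))
            clusters[j].append(p)
        # Update step: move each center to its cluster mean
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    # Keep the point nearest each centroid; duplicates collapse, so
    # the result can contain fewer than k indices.
    keep = {min(range(n), key=lambda i: dist2(points[i], c)) for c in centers}
    return sorted(keep)
```

With four points forming two tight pairs and `ratio=0.5`, the function returns one index from each pair, which is the intuition behind the graceful quality falloff: nearby, redundant entries are merged first.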
gaussians = model(images, compression=1.0) # full quality
gaussians = model(images, compression=0.25) # ~4x fewer Gaussians
print(gaussians.num_gaussians)
Rendering
# a single view (scalar camera/pose) or a batch of views ([V])
rgb, info = gaussians.render(camera, pose) # mode="RGB" by default
rgbd, _ = gaussians.render(camera, pose, mode="RGB+ED") # append expected depth
Cameras/poses are moved to the scene's device automatically. info["alphas"] holds the rendered
opacity. Build cameras with Camera.from_fov, Camera.from_focal, or Camera.from_K.
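
The three constructors describe the same pinhole model in different terms. As a sketch of how a field of view maps to a focal length and a 3x3 intrinsics matrix, the hypothetical helper below (not part of zipsplat) assumes the angle is the horizontal field of view and the principal point sits at the image center; whether Camera.from_fov uses the horizontal or vertical angle is not stated here, and for square images the two coincide.

```python
import math


def intrinsics_from_fov(fov_rad, w, h):
    """Pinhole K for a horizontal FOV with the principal point at the center."""
    # Half the image width subtends half the field of view.
    f = 0.5 * w / math.tan(0.5 * fov_rad)
    return [
        [f, 0.0, 0.5 * w],
        [0.0, f, 0.5 * h],
        [0.0, 0.0, 1.0],
    ]
```

For a 60 degree, 512x512 camera this gives a focal length of roughly 443 px.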
Turntable video
from zipsplat import viz
viz.turntable(gaussians, "orbit.mp4") # full 360-degree turntable
viz.turntable(gaussians, "wiggle.mp4", sweep_deg=None) # gentle wiggle (best for front-facing scenes)
Camera priors (optional)
If you have calibrated cameras and poses, pass them with use_priors=True to inject them into the
backbone. Intrinsics are adjusted automatically for the internal resize.
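
For intuition about what such an adjustment involves, the sketch below applies the standard pinhole update for a center crop to square followed by a uniform resize. `adjust_intrinsics` is a hypothetical helper, the crop-then-resize order and the target size are assumptions, and half-pixel center conventions are ignored; the package does this for you.

```python
def adjust_intrinsics(fx, fy, cx, cy, w, h, size):
    """Update pinhole intrinsics for a center crop to square, then a resize."""
    side = min(w, h)
    # Cropping removes equal margins, which shifts the principal point.
    cx -= (w - side) / 2.0
    cy -= (h - side) / 2.0
    # Resizing scales every pixel-unit quantity by the same factor.
    s = size / side
    return fx * s, fy * s, cx * s, cy * s
```

For a 640x360 image with `fx = fy = 500` and a centered principal point, cropping to 360x360 and resizing to 252 px yields a focal length of about 350 px and a principal point at (126, 126).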
gaussians = model(images, cameras=cameras, poses=poses, use_priors=True)

| Name | Backbone | Train res | Notes |
|---|---|---|---|
| zipsplat | DA3-Giant | 252 px | Default release checkpoint (RE10K + DL3DV) |
Weights are hosted on the Hugging Face Hub and fetched on
first use. You can also pass a local path or URL: ZipSplat(weights="path/to/checkpoint.tar").
The training and evaluation code lives in splatfactory, a general library for
feed-forward Gaussian Splatting. It streams large multi-view datasets, composes models and
losses through Hydra configs, trains across multiple GPUs, and benchmarks novel-view synthesis on a
suite of standard datasets. ZipSplat is the model we train with it; the recipes below reproduce the
released checkpoint and the paper tables.
splatfactory needs a CUDA toolchain to build two compiled extensions
(gsplat for rasterization and
fused-ssim for the SSIM loss), so install in this order:
# from the repo root
conda create -n splatfactory python=3.12 -y && conda activate splatfactory
conda install -c nvidia cuda-toolkit -y # provides nvcc for the kernels below
# 1) torch first (CUDA build), then the training package + its core deps
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -e splatfactory/
# 2) CUDA extensions need a built torch present, so disable build isolation
pip install --no-build-isolation --use-pep517 \
git+https://github.com/JoannaCCJH/gsplat.git \
git+https://github.com/rahul-goel/fused-ssim/
python -c "from gsplat.cuda._backend import _C" # sanity check
# optional: xformers' fused SwiGLU kernel the released checkpoint was trained with
pip install -e "splatfactory/[xformers]"
We use a gsplat fork rather than upstream because it exposes
a per-Gaussian activated flag from the rasterizer. Training uses it to detach Gaussians that were
already rendered from the geometry (chamfer) loss, so that term only supervises the Gaussians the
images don't already constrain. Inference works with stock gsplat; this fork only matters for
training.
Preparing the data
Every dataset is converted into a common format: uncompressed tar shards of scenes (images,
cameras, poses, and optional depth) written under data/<dataset>/<HxW>/{train,test}-scenes/. The
shipped configs point data.dataset_dir at these paths, so the default commands below produce
exactly what training and evaluation expect. Each dataset has a download step (fetch the raw data)
and a convert step (pack it into shards); add --num-workers N to parallelize either. For
evaluation you only need the test split, so pass --split test to convert where available.
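
Because the shards are uncompressed tar files, they can be inspected with the standard library alone. The sketch below tallies the files in one shard by extension, which is a quick way to confirm that a conversion produced images and metadata. `summarize_shard` is a hypothetical helper, and since the file naming inside a shard is not documented here, nothing beyond the extension is interpreted.

```python
import tarfile
from collections import Counter


def summarize_shard(path):
    """Count regular files in an uncompressed tar shard by file extension."""
    counts = Counter()
    # Mode "r:" opens the archive for reading without any compression.
    with tarfile.open(path, "r:") as tar:
        for member in tar:
            if member.isfile():
                name = member.name.rsplit("/", 1)[-1]
                ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
                counts[ext] += 1
    return counts
```

Running it on a shard before and after the depth extraction step is one way to check that depth files were added in place.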
RE10K
A single archive (pixelSplat/MVSplat format) packed into shards at 360x640:
# download -> data/re10k/raw/re10k.zip
python -m splatfactory.datasets.scripts.re10k.download
# convert -> data/re10k/360x640/{train,test}-scenes
python -m splatfactory.datasets.scripts.re10k.convert --num-workers 8
# evaluation only: just the test split
python -m splatfactory.datasets.scripts.re10k.convert --split test
DL3DV
Gated on Hugging Face, so accept the dataset terms and run huggingface-cli login first. We use the
960P release (540x960). Redistribution is not permitted; the script pulls directly from Hugging Face
per the DL3DV terms.
# download -> data/dl3dv/raw/960P/<batch>/<hash>.zip (one zip per scene)
python -m splatfactory.datasets.scripts.dl3dv.download --resolution 960P --num-workers 4
# convert -> data/dl3dv/540x960/{train,test}-scenes
python -m splatfactory.datasets.scripts.dl3dv.convert --resolution 960P --num-workers 8
# evaluation only: just the test split
python -m splatfactory.datasets.scripts.dl3dv.convert --resolution 960P --split test
MipNeRF360 (evaluation only)
A single ~3.6 GB archive, converted from COLMAP poses into one shard per scene at 822x1236 (the 7 public scenes). The whole dataset is an evaluation set, so there is no train/test split to select.
# download -> data/mipnerf360/raw/360_v2.zip
python -m splatfactory.datasets.scripts.mipnerf360.download
# convert -> data/mipnerf360/822x1236/test-scenes
python -m splatfactory.datasets.scripts.mipnerf360.convert
Two optional post-processing tools operate on any shard directory:
repack re-buckets shards to a uniform size, and
resize writes a fixed-resolution copy (e.g. --size 252) to a new directory.
Depth. Every dataset needs per-view depth maps, for both training and evaluation: training uses them for
the depth and chamfer loss, and the data pipeline uses them to normalize each scene's scale. The
shards from convert carry images and poses but no depth; add it in place with
the extract_depth script, which runs DA3 on each
scene (needs a GPU and the depth-anything-3 package). Run it on every shard directory you train or
evaluate on:
python -m splatfactory.datasets.scripts.extract_depth data/re10k/360x640/train-scenes
python -m splatfactory.datasets.scripts.extract_depth data/re10k/360x640/test-scenes
python -m splatfactory.datasets.scripts.extract_depth data/dl3dv/540x960/train-scenes
python -m splatfactory.datasets.scripts.extract_depth data/dl3dv/540x960/test-scenes
python -m splatfactory.datasets.scripts.extract_depth data/mipnerf360/822x1236/test-scenes
It can be parallelized across GPUs with torchrun:
torchrun --nproc_per_node=4 -m splatfactory.datasets.scripts.extract_depth <shard-dir>
With the data prepared, evaluate the released model with the zipsplat_eval config; its weights
are downloaded automatically, so no local checkpoint is needed:
python -m splatfactory.eval.run <benchmark> --conf zipsplat_eval eval.num_views=<N>
Each benchmark renders held-out target views and reports PSNR, SSIM, and LPIPS, writing results to
outputs/results/. eval.num_views=N sets how many context views the model sees, while the scenes
and context/target views are fixed by the indices in splatfactory/eval/indices/, so runs are
exactly reproducible. To score one of your own training runs instead, pass --checkpoint <exp> in
place of --conf zipsplat_eval. The per-benchmark commands below cover the protocols we report:
DL3DV (6 / 12 / 24 context views)
Evaluate the released model (append model.eval_use_priors=true to feed ground-truth camera priors):
python -m splatfactory.eval.run dl3dv_benchmark --conf zipsplat_eval eval.num_views=6
python -m splatfactory.eval.run dl3dv_benchmark --conf zipsplat_eval eval.num_views=12
python -m splatfactory.eval.run dl3dv_benchmark --conf zipsplat_eval eval.num_views=24
# with ground-truth camera priors:
python -m splatfactory.eval.run dl3dv_benchmark --conf zipsplat_eval eval.num_views=6 model.eval_use_priors=true
Evaluate your own checkpoint:
python -m splatfactory.eval.run dl3dv_benchmark --checkpoint <exp_name> eval.num_views=6
Results (PSNR / SSIM / LPIPS, pose-free; DA3 and YoNoSplat for reference):
| Method | 6 views | 12 views | 24 views |
|---|---|---|---|
| DA3 | 23.77 / 0.795 / 0.165 | 22.38 / 0.736 / 0.208 | 21.69 / 0.711 / 0.229 |
| YoNoSplat | 24.10 / 0.783 / 0.160 | 22.73 / 0.736 / 0.200 | 22.01 / 0.710 / 0.223 |
| ZipSplat | 25.24 / 0.804 / 0.172 | 24.27 / 0.767 / 0.197 | 24.14 / 0.768 / 0.198 |
| ZipSplat + priors | 25.34 / 0.810 / 0.169 | 24.37 / 0.773 / 0.194 | 24.23 / 0.773 / 0.194 |
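
For readers less familiar with the first metric in the table above, PSNR is a log-scaled mean squared error, where higher is better. The sketch below computes it for two flat sequences of pixel values in [0, 1]. It shows the standard definition only: `psnr` is a hypothetical helper, and how the benchmark aggregates over views and scenes is not specified here.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A uniform error of 0.1 on every pixel gives an MSE of 0.01 and therefore about 20 dB, so a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.
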
RE10K (6 context views)
Evaluate the released model (append model.eval_use_priors=true to feed ground-truth camera priors):
python -m splatfactory.eval.run re10k_benchmark --conf zipsplat_eval eval.num_views=6
# with ground-truth camera priors:
python -m splatfactory.eval.run re10k_benchmark --conf zipsplat_eval eval.num_views=6 model.eval_use_priors=true
Evaluate your own checkpoint:
python -m splatfactory.eval.run re10k_benchmark --checkpoint <exp_name> eval.num_views=6
Results (pose-free; DA3 and YoNoSplat for reference):
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| DA3 | 20.90 | 0.724 | 0.234 |
| YoNoSplat | 24.99 | 0.835 | 0.151 |
| ZipSplat | 26.20 | 0.842 | 0.158 |
| ZipSplat + priors | 27.19 | 0.872 | 0.143 |
MipNeRF360 (32 / 64 / 128 context views)
Evaluate the released model (append model.eval_use_priors=true to feed ground-truth camera priors):
python -m splatfactory.eval.run mipnerf360_benchmark --conf zipsplat_eval eval.num_views=32
python -m splatfactory.eval.run mipnerf360_benchmark --conf zipsplat_eval eval.num_views=64
python -m splatfactory.eval.run mipnerf360_benchmark --conf zipsplat_eval eval.num_views=128
# with ground-truth camera priors:
python -m splatfactory.eval.run mipnerf360_benchmark --conf zipsplat_eval eval.num_views=32 model.eval_use_priors=true
Evaluate your own checkpoint:
python -m splatfactory.eval.run mipnerf360_benchmark --checkpoint <exp_name> eval.num_views=32
Results (PSNR / SSIM / LPIPS, pose-free; DA3 and YoNoSplat for reference):
| Method | 32 views | 64 views | 128 views |
|---|---|---|---|
| DA3 | 20.94 / 0.577 / 0.295 | 20.30 / 0.554 / 0.311 | 20.19 / 0.568 / 0.306 |
| YoNoSplat | 17.62 / 0.409 / 0.465 | 17.77 / 0.413 / 0.466 | 17.16 / 0.409 / 0.509 |
| ZipSplat | 21.72 / 0.594 / 0.325 | 22.18 / 0.615 / 0.298 | 22.29 / 0.624 / 0.290 |
| ZipSplat + priors | 22.95 / 0.655 / 0.276 | 23.31 / 0.675 / 0.260 | 23.37 / 0.683 / 0.255 |
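
To run every reported protocol in one go, the per-benchmark commands above can be generated rather than typed. The sketch below only assembles the command lines already shown in this section (benchmark names, view counts, and the model.eval_use_priors=true override); `PROTOCOLS` and `eval_commands` are hypothetical names, and nothing is executed unless you pass the lists to something like `subprocess.run`.

```python
import shlex

# Benchmark name -> context-view counts reported in the tables above.
PROTOCOLS = {
    "dl3dv_benchmark": (6, 12, 24),
    "re10k_benchmark": (6,),
    "mipnerf360_benchmark": (32, 64, 128),
}


def eval_commands(use_priors=False):
    """Build one argument list per (benchmark, num_views) pair."""
    cmds = []
    for bench, views in PROTOCOLS.items():
        for n in views:
            cmd = ["python", "-m", "splatfactory.eval.run", bench,
                   "--conf", "zipsplat_eval", f"eval.num_views={n}"]
            if use_priors:
                cmd.append("model.eval_use_priors=true")
            cmds.append(cmd)
    return cmds


if __name__ == "__main__":
    for cmd in eval_commands():
        print(shlex.join(cmd))  # print for review instead of running
```

Printing first and running second keeps long multi-benchmark sweeps easy to audit.
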
ZipSplat is initialized from the DA3-Giant backbone, so first download those weights into weights/
and convert them to the .pth layout the backbone expects:
huggingface-cli download depth-anything/DA3-GIANT model.safetensors --local-dir weights
mv weights/model.safetensors weights/da3-giant.safetensor
python -m splatfactory.models.encoders.dav3_encoder --size giant # -> weights/da3-giant.pth
With the data prepared (including the depth maps) and the backbone in place, start training from the repo root:
# single GPU
python -m splatfactory.train zipsplat --conf zipsplat
# multiple GPUs on one node
torchrun --nproc_per_node=4 -m splatfactory.train zipsplat --conf zipsplat --distributed
zipsplat is the experiment name (checkpoints are written to outputs/training/zipsplat/); use any
name you like. The released model trains on the 50/50 RE10K + DL3DV mix at 252px for 450K steps; our
run used 16 GPUs (4 nodes x 4) at a global batch size of 384, so scale data.batch_size to your
hardware. Add --restore to resume an interrupted run. Configs are Hydra, so
anything can be overridden from the command line, e.g. a smaller backbone:
python -m splatfactory.models.encoders.dav3_encoder --size small # -> weights/da3-small.pth
python -m splatfactory.train zipsplat-da3s --conf zipsplat model/backbone=da3s data.batch_size=96
To log training to TensorBoard or
Weights & Biases, set train.writer:
torchrun --nproc_per_node=4 -m splatfactory.train zipsplat --conf zipsplat --distributed train.writer=tensorboard
The trained model can then be evaluated by its experiment name (see Evaluation):
python -m splatfactory.eval.run dl3dv_benchmark --checkpoint zipsplat --tag zipsplat-retrained eval.num_views=6

@article{veicht2026zipsplat,
title = {ZipSplat: Fewer Gaussians, Better Splats},
author = {Veicht, Alexander and Hong, Sunghwan and Bar{\'a}th, D{\'a}niel and Pollefeys, Marc},
journal = {arXiv preprint arXiv:2606.05102},
year = {2026}
}

The code in this repository (the zipsplat and splatfactory packages) is released under the
Apache-2.0 License.
The pretrained weights are released separately under CC BY-NC 4.0, non-commercial use only. This is required by their dependencies: the released checkpoint is initialized from DA3-Giant (CC BY-NC 4.0) and trained on DL3DV-10K (CC BY-NC 4.0). See the Hugging Face model card for the full weights license and dataset attribution.
splatfactory is built mainly on the excellent glue-factory
training framework. Our model is initialized from the
Depth Anything 3 backbone, and rendering uses
gsplat. We thank the authors of these projects for
releasing their code and models.