ZipSplat: Fewer Gaussians, Better Splats

Alexander Veicht · Sunghwan Hong · Dániel Baráth · Marc Pollefeys
ETH Zürich  ·  Microsoft

Figure: ZipSplat teaser. ZipSplat decouples Gaussian placement from the pixel grid, reconstructing a scene from a few unposed images in under a second and reaching higher quality with far fewer Gaussians.


ZipSplat is a feed-forward model for 3D Gaussian Splatting: it reconstructs a scene from a few unposed images in a single forward pass, with no poses, intrinsics, or per-scene optimization. Instead of emitting one Gaussian per pixel, it compresses the input views into a compact set of scene tokens and decodes each into a small group of Gaussians, reaching state-of-the-art quality with far fewer Gaussians and a single knob to trade quality for size. This repository hosts the inference, evaluation, and training code for ZipSplat, together with our pretrained weights.

Setup and demo

We provide a standalone inference package zipsplat that requires only minimal dependencies and Python >= 3.10. First clone the repository and install it:

git clone https://github.com/cvg/ZipSplat.git && cd ZipSplat
python -m pip install -e .
# optional: xformers' fused SwiGLU kernel the released checkpoint was trained with (small speedup)
python -m pip install -e ".[xformers]"

Rendering uses gsplat and requires a CUDA GPU; the pretrained weights are downloaded automatically on first use.

The easiest way to try ZipSplat is the interactive viewer: reconstruct a scene and explore it live in your browser. Drag the compression slider to trade Gaussians for quality in real time, and toggle token coloring to see how each token's group of Gaussians is placed:

pip install "zipsplat[viewer]"
python -m zipsplat.viewer assets/examples/drone.mp4   # or assets/examples/office/, or your own dir/glob/video

Inference

Here is a minimal usage example:

import math, torch
from zipsplat import ZipSplat, Camera, Pose, load_image, viz

model = ZipSplat(weights="zipsplat").cuda().eval()
images = [load_image(p) for p in paths]  # paths: your list of image files; raw images, any size (auto-resized)
gaussians = model(images)[0] # feed-forward 3D Gaussians

# render a novel view
camera = Camera.from_fov(math.radians(60), w=512, h=512)
pose = Pose.from_Rt(torch.eye(3), torch.zeros(3)) # identity pose
rgb, info = gaussians.render(camera, pose)

# export
gaussians.save_ply("scene.ply")                             # open in any 3DGS viewer
viz.turntable(gaussians, "turntable.mp4", sweep_deg=None)   # wiggle orbit video

gaussians is a Gaussians object; gaussians.save_ply(...) writes a standard 3DGS .ply you can drop into SuperSplat or any Gaussian-splat viewer.

Input: images, a video, or a single image

from zipsplat import load_image, load_video

# multiple images (any sizes; center-cropped to square and resized internally)
gaussians = model([load_image(p) for p in paths])[0]

# a single image
gaussians = model([load_image("photo.jpg")])[0]

# a video clip (evenly samples num_frames across the clip)
gaussians = model(load_video("assets/examples/drone.mp4", num_frames=24))[0]

The model handles anything from a single view to 24 or more; more views cover more of the scene, while fewer views give a tighter reconstruction.
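As a rough illustration of what evenly sampling num_frames across a clip can look like, the sketch below picks frame indices spread uniformly from the first to the last frame. The helper name evenly_spaced_indices and its rounding rule are assumptions made for this example; load_video may sample differently.

```python
def evenly_spaced_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return up to num_frames indices spread evenly over [0, total_frames - 1].

    Hypothetical helper: illustrates uniform sampling, not zipsplat's load_video.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        return list(range(total_frames))   # short clip: keep every frame
    if num_frames == 1:
        return [(total_frames - 1) // 2]   # single frame: take the middle one
    last = total_frames - 1
    # Integer arithmetic: the first index is 0 and the last is total_frames - 1.
    return [(i * last) // (num_frames - 1) for i in range(num_frames)]


print(evenly_spaced_indices(100, 5))  # [0, 24, 49, 74, 99]
```

Pinning the first and last frame keeps the sampled views spanning the whole clip, which matters when coverage is the reason for adding views.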

Compression: fewer Gaussians

compression is the query-sampling ratio, in (0, 1]. The default, 1.0, uses every token; lower values run k-means to keep a subset, which shrinks the Gaussian count while quality degrades gracefully.

gaussians = model(images, compression=1.0)[0]   # full quality
gaussians = model(images, compression=0.25)[0]  # ~4x fewer Gaussians
print(gaussians.num_gaussians)
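To make the idea of keeping a k-means subset concrete, here is a minimal standalone sketch: it clusters a set of points into round(ratio * N) groups and keeps the input point nearest to each centroid. What ZipSplat actually clusters and how it initializes k-means is not described here, so the function name kmeans_subset, the evenly spaced initialization, and the toy 2D points are all assumptions.

```python
import math


def kmeans_subset(points, ratio, iters=10):
    """Keep about ratio * N points: run Lloyd's k-means, then keep the input
    point closest to each centroid. Illustrative only, not ZipSplat's code."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    n = len(points)
    k = max(1, round(ratio * n))
    if k >= n:
        return list(range(n))              # ratio 1.0: keep everything
    # Deterministic init (an assumption): evenly spaced input points.
    centers = [points[(i * n) // k] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            buckets[j].append(p)
        # Recompute each centroid; an empty cluster keeps its old center.
        centers = [
            tuple(sum(x) / len(b) for x in zip(*b)) if b else centers[j]
            for j, b in enumerate(buckets)
        ]
    # Two centroids may share a nearest point, so fewer than k can be returned.
    kept = {min(range(n), key=lambda i: math.dist(points[i], c)) for c in centers}
    return sorted(kept)


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(kmeans_subset(pts, 0.25))  # [0, 4]: one representative per cluster
```

Keeping one representative per cluster is what lets the count drop without leaving whole regions uncovered, which is one plausible reading of the graceful quality falloff.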
Rendering

# a single view (scalar camera/pose) or a batch of views ([V])
rgb, info = gaussians.render(camera, pose)               # mode="RGB" by default
rgbd, _ = gaussians.render(camera, pose, mode="RGB+ED")  # append expected depth

Cameras/poses are moved to the scene's device automatically. info["alphas"] holds the rendered opacity. Build cameras with Camera.from_fov, Camera.from_focal, or Camera.from_K.
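For readers unfamiliar with how a field of view maps to camera intrinsics, the sketch below builds a pinhole matrix from a horizontal field of view and an image size. It assumes the usual pinhole convention with the principal point at the image center; whether Camera.from_fov treats the angle as horizontal, vertical, or diagonal is not stated here, and intrinsics_from_fov is a hypothetical name.

```python
import math


def intrinsics_from_fov(fov_rad: float, w: int, h: int) -> list[list[float]]:
    """3x3 pinhole matrix K for a horizontal field of view (assumed convention)."""
    focal = 0.5 * w / math.tan(0.5 * fov_rad)  # focal length in pixels
    cx, cy = 0.5 * w, 0.5 * h                  # principal point at the image center
    return [[focal, 0.0, cx],
            [0.0, focal, cy],
            [0.0, 0.0, 1.0]]


K = intrinsics_from_fov(math.radians(60), w=512, h=512)
print(round(K[0][0], 1))  # 443.4
```

For a square image the horizontal and vertical fields of view coincide, so the ambiguity only matters for non-square renders.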

Turntable video

from zipsplat import viz

viz.turntable(gaussians, "orbit.mp4")                  # full 360-degree turntable
viz.turntable(gaussians, "wiggle.mp4", sweep_deg=None) # gentle wiggle (best for front-facing scenes)

Camera priors (optional)

If you have calibrated cameras and poses, pass them with use_priors=True to inject them into the backbone. Intrinsics are adjusted automatically for the internal resize.

gaussians = model(images, cameras=cameras, poses=poses, use_priors=True)[0]
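The intrinsics adjustment mentioned above can be sketched as follows, assuming the internal preprocessing is a center crop to a square followed by a resize. The helper adjust_intrinsics and the target size of 252 are illustrative assumptions, and half-pixel conventions are ignored; the package may handle these details differently.

```python
def adjust_intrinsics(fx, fy, cx, cy, width, height, target):
    """Update pinhole intrinsics for a center crop to a square, then a resize.

    Hypothetical helper; ignores half-pixel offset conventions.
    """
    side = min(width, height)
    # The crop shifts the principal point by the width of the removed border.
    cx -= (width - side) / 2.0
    cy -= (height - side) / 2.0
    # The resize scales every quantity that is measured in pixels.
    scale = target / side
    return fx * scale, fy * scale, cx * scale, cy * scale


# A 640x360 image cropped to 360x360 and resized to 252x252.
print(adjust_intrinsics(500.0, 500.0, 320.0, 180.0, 640, 360, 252))
```

Cropping changes only the principal point, while resizing rescales both the focal lengths and the principal point, which is why the two steps are applied in that order.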

Models

| Name | Backbone | Train res | Notes |
|---|---|---|---|
| zipsplat | DA3-Giant | 252 px | Default release checkpoint (RE10K + DL3DV) |

Weights are hosted on the Hugging Face Hub and fetched on first use. You can also pass a local path or URL: ZipSplat(weights="path/to/checkpoint.tar").

Training and evaluation

The training and evaluation code lives in splatfactory, a general library for feed-forward Gaussian Splatting. It streams large multi-view datasets, composes models and losses through Hydra configs, trains across multiple GPUs, and benchmarks novel-view synthesis on a suite of standard datasets. ZipSplat is the model we train with it; the recipes below reproduce the released checkpoint and the paper tables.

Setup

splatfactory needs a CUDA toolchain to build two compiled extensions (gsplat for rasterization and fused-ssim for the SSIM loss), so install in this order:

# from the repo root
conda create -n splatfactory python=3.12 -y && conda activate splatfactory
conda install -c nvidia cuda-toolkit -y                      # provides nvcc for the kernels below

# 1) torch first (CUDA build), then the training package + its core deps
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -e splatfactory/

# 2) CUDA extensions need a built torch present, so disable build isolation
pip install --no-build-isolation --use-pep517 \
    git+https://github.com/JoannaCCJH/gsplat.git \
    git+https://github.com/rahul-goel/fused-ssim/
python -c "from gsplat.cuda._backend import _C"              # sanity check

# optional: xformers' fused SwiGLU kernel the released checkpoint was trained with
pip install -e "splatfactory/[xformers]"

We use a gsplat fork rather than upstream because it exposes a per-Gaussian activated flag from the rasterizer. Training uses it to detach Gaussians that were already rendered from the geometry (chamfer) loss, so that term only supervises the Gaussians the images don't already constrain. Inference works with stock gsplat; this fork only matters for training.

Preparing the data

Every dataset is converted into a common format: uncompressed tar shards of scenes (images, cameras, poses, and optional depth) written under data/<dataset>/<HxW>/{train,test}-scenes/. The shipped configs point data.dataset_dir at these paths, so the default commands below produce exactly what training and evaluation expect. Each dataset has a download step (fetch the raw data) and a convert step (pack it into shards); add --num-workers N to parallelize either. For evaluation you only need the test split, so pass --split test to convert where available.
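The directory layout described above can be captured in a small helper, which is handy when scripting over several datasets. This is only a sketch: shard_dir is a hypothetical function, not part of splatfactory, and it simply reproduces the data/<dataset>/<HxW>/<split>-scenes pattern.

```python
from pathlib import PurePosixPath


def shard_dir(dataset: str, height: int, width: int, split: str, root: str = "data") -> str:
    """Build <root>/<dataset>/<HxW>/<split>-scenes (hypothetical helper)."""
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    return str(PurePosixPath(root) / dataset / f"{height}x{width}" / f"{split}-scenes")


print(shard_dir("re10k", 360, 640, "train"))  # data/re10k/360x640/train-scenes
print(shard_dir("dl3dv", 540, 960, "test"))   # data/dl3dv/540x960/test-scenes
```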

RE10K

A single archive (pixelSplat/MVSplat format) packed into shards at 360x640:

# download -> data/re10k/raw/re10k.zip
python -m splatfactory.datasets.scripts.re10k.download
# convert -> data/re10k/360x640/{train,test}-scenes
python -m splatfactory.datasets.scripts.re10k.convert --num-workers 8
# evaluation only: just the test split
python -m splatfactory.datasets.scripts.re10k.convert --split test

DL3DV

Gated on Hugging Face, so accept the dataset terms and run huggingface-cli login first. We use the 960P release (540x960). Redistribution is not permitted; the script pulls directly from Hugging Face per the DL3DV terms.

# download -> data/dl3dv/raw/960P/<batch>/<hash>.zip  (one zip per scene)
python -m splatfactory.datasets.scripts.dl3dv.download --resolution 960P --num-workers 4
# convert -> data/dl3dv/540x960/{train,test}-scenes
python -m splatfactory.datasets.scripts.dl3dv.convert --resolution 960P --num-workers 8
# evaluation only: just the test split
python -m splatfactory.datasets.scripts.dl3dv.convert --resolution 960P --split test

MipNeRF360 (evaluation only)

A single ~3.6 GB archive, converted from COLMAP poses into one shard per scene at 822x1236 (the 7 public scenes). The whole dataset is an evaluation set, so there is no train/test split to select.

# download -> data/mipnerf360/raw/360_v2.zip
python -m splatfactory.datasets.scripts.mipnerf360.download
# convert -> data/mipnerf360/822x1236/test-scenes
python -m splatfactory.datasets.scripts.mipnerf360.convert

Two optional post-processing tools operate on any shard directory: repack re-buckets shards to a uniform size, and resize writes a fixed-resolution copy (e.g. --size 252) to a new directory.

Depth. Every dataset needs per-view depth maps, for both training and evaluation: training uses them for the depth and chamfer losses, and the data pipeline uses them to normalize each scene's scale. The shards from convert carry images and poses but no depth; add it in place with the extract_depth script, which runs DA3 on each scene (needs a GPU and the depth-anything-3 package). Run it on every shard directory you train or evaluate on:

python -m splatfactory.datasets.scripts.extract_depth data/re10k/360x640/train-scenes
python -m splatfactory.datasets.scripts.extract_depth data/re10k/360x640/test-scenes
python -m splatfactory.datasets.scripts.extract_depth data/dl3dv/540x960/train-scenes
python -m splatfactory.datasets.scripts.extract_depth data/dl3dv/540x960/test-scenes
python -m splatfactory.datasets.scripts.extract_depth data/mipnerf360/822x1236/test-scenes

It can be parallelized across GPUs with torchrun: torchrun --nproc_per_node=4 -m splatfactory.datasets.scripts.extract_depth <shard-dir>.
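If you prefer to drive the depth extraction from a script rather than typing each command, a small wrapper along these lines may help. It only assembles the commands shown above; the helper depth_command and the SHARD_DIRS list are conveniences invented for this sketch, not part of splatfactory.

```python
import shlex

# Shard directories from the commands above.
SHARD_DIRS = [
    "data/re10k/360x640/train-scenes",
    "data/re10k/360x640/test-scenes",
    "data/dl3dv/540x960/train-scenes",
    "data/dl3dv/540x960/test-scenes",
    "data/mipnerf360/822x1236/test-scenes",
]


def depth_command(shard_dir: str, gpus: int = 1) -> list[str]:
    """Return the extract_depth command for one shard directory (hypothetical helper)."""
    module = "splatfactory.datasets.scripts.extract_depth"
    if gpus > 1:
        # Multi-GPU variant, launched through torchrun.
        return ["torchrun", f"--nproc_per_node={gpus}", "-m", module, shard_dir]
    return ["python", "-m", module, shard_dir]


for d in SHARD_DIRS:
    print(shlex.join(depth_command(d, gpus=4)))
```

Each returned list can be passed to subprocess.run(cmd, check=True) to execute it; printing first makes it easy to review the commands before committing GPU time.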

Evaluation

With the data prepared, evaluate the released model with the zipsplat_eval config; its weights are downloaded automatically, so no local checkpoint is needed:

python -m splatfactory.eval.run <benchmark> --conf zipsplat_eval eval.num_views=<N>

Each benchmark renders held-out target views and reports PSNR, SSIM, and LPIPS, writing results to outputs/results/. eval.num_views=N sets how many context views the model sees, while the scenes and context/target views are fixed by the indices in splatfactory/eval/indices/, so runs are exactly reproducible. To score one of your own training runs instead, pass --checkpoint <exp> in place of --conf zipsplat_eval. The per-benchmark commands below cover the protocols we report:
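As a reminder of what the first reported metric measures, here is a minimal PSNR computation on flat sequences of pixel values in [0, 1]. It is a textbook sketch, not the benchmark's implementation: details such as clamping, color handling, and whether scores are averaged per image or per scene are not covered here and may differ.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf                     # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 gives an MSE of 0.01, i.e. 20 dB.
print(round(psnr([0.5, 0.5, 0.5], [0.6, 0.6, 0.6]), 2))  # 20.0
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.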

DL3DV (6 / 12 / 24 context views)

Evaluate the released model (append model.eval_use_priors=true to feed ground-truth camera priors):

python -m splatfactory.eval.run dl3dv_benchmark --conf zipsplat_eval eval.num_views=6
python -m splatfactory.eval.run dl3dv_benchmark --conf zipsplat_eval eval.num_views=12
python -m splatfactory.eval.run dl3dv_benchmark --conf zipsplat_eval eval.num_views=24
# with ground-truth camera priors:
python -m splatfactory.eval.run dl3dv_benchmark --conf zipsplat_eval eval.num_views=6 model.eval_use_priors=true

Evaluate your own checkpoint:

python -m splatfactory.eval.run dl3dv_benchmark --checkpoint <exp_name> eval.num_views=6

Results (PSNR / SSIM / LPIPS, pose-free; DA3 and YoNoSplat for reference):

| Method | 6 views | 12 views | 24 views |
|---|---|---|---|
| DA3 | 23.77 / 0.795 / 0.165 | 22.38 / 0.736 / 0.208 | 21.69 / 0.711 / 0.229 |
| YoNoSplat | 24.10 / 0.783 / 0.160 | 22.73 / 0.736 / 0.200 | 22.01 / 0.710 / 0.223 |
| ZipSplat | 25.24 / 0.804 / 0.172 | 24.27 / 0.767 / 0.197 | 24.14 / 0.768 / 0.198 |
| ZipSplat + priors | 25.34 / 0.810 / 0.169 | 24.37 / 0.773 / 0.194 | 24.23 / 0.773 / 0.194 |

RE10K (6 context views)

Evaluate the released model (append model.eval_use_priors=true to feed ground-truth camera priors):

python -m splatfactory.eval.run re10k_benchmark --conf zipsplat_eval eval.num_views=6
# with ground-truth camera priors:
python -m splatfactory.eval.run re10k_benchmark --conf zipsplat_eval eval.num_views=6 model.eval_use_priors=true

Evaluate your own checkpoint:

python -m splatfactory.eval.run re10k_benchmark --checkpoint <exp_name> eval.num_views=6

Results (pose-free; DA3 and YoNoSplat for reference):

| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| DA3 | 20.90 | 0.724 | 0.234 |
| YoNoSplat | 24.99 | 0.835 | 0.151 |
| ZipSplat | 26.20 | 0.842 | 0.158 |
| ZipSplat + priors | 27.19 | 0.872 | 0.143 |

MipNeRF360 (32 / 64 / 128 context views)

Evaluate the released model (append model.eval_use_priors=true to feed ground-truth camera priors):

python -m splatfactory.eval.run mipnerf360_benchmark --conf zipsplat_eval eval.num_views=32
python -m splatfactory.eval.run mipnerf360_benchmark --conf zipsplat_eval eval.num_views=64
python -m splatfactory.eval.run mipnerf360_benchmark --conf zipsplat_eval eval.num_views=128
# with ground-truth camera priors:
python -m splatfactory.eval.run mipnerf360_benchmark --conf zipsplat_eval eval.num_views=32 model.eval_use_priors=true

Evaluate your own checkpoint:

python -m splatfactory.eval.run mipnerf360_benchmark --checkpoint <exp_name> eval.num_views=32

Results (PSNR / SSIM / LPIPS, pose-free; DA3 and YoNoSplat for reference):

| Method | 32 views | 64 views | 128 views |
|---|---|---|---|
| DA3 | 20.94 / 0.577 / 0.295 | 20.30 / 0.554 / 0.311 | 20.19 / 0.568 / 0.306 |
| YoNoSplat | 17.62 / 0.409 / 0.465 | 17.77 / 0.413 / 0.466 | 17.16 / 0.409 / 0.509 |
| ZipSplat | 21.72 / 0.594 / 0.325 | 22.18 / 0.615 / 0.298 | 22.29 / 0.624 / 0.290 |
| ZipSplat + priors | 22.95 / 0.655 / 0.276 | 23.31 / 0.675 / 0.260 | 23.37 / 0.683 / 0.255 |

Training

ZipSplat is initialized from the DA3-Giant backbone, so first download those weights into weights/ and convert them to the .pth layout the backbone expects:

huggingface-cli download depth-anything/DA3-GIANT model.safetensors --local-dir weights
mv weights/model.safetensors weights/da3-giant.safetensor
python -m splatfactory.models.encoders.dav3_encoder --size giant   # -> weights/da3-giant.pth

With the data prepared (including the depth maps) and the backbone in place, start training from the repo root:

# single GPU
python -m splatfactory.train zipsplat --conf zipsplat
# multiple GPUs on one node
torchrun --nproc_per_node=4 -m splatfactory.train zipsplat --conf zipsplat --distributed

zipsplat is the experiment name (checkpoints are written to outputs/training/zipsplat/); use any name you like. The released model trains on the 50/50 RE10K + DL3DV mix at 252px for 450K steps; our run used 16 GPUs (4 nodes x 4) at a global batch size of 384, so scale data.batch_size to your hardware. Add --restore to resume an interrupted run. Configs are Hydra, so anything can be overridden from the command line, e.g. a smaller backbone:

python -m splatfactory.models.encoders.dav3_encoder --size small   # -> weights/da3-small.pth
python -m splatfactory.train zipsplat-da3s --conf zipsplat model/backbone=da3s data.batch_size=96

To log training to TensorBoard or Weights & Biases, set train.writer:

torchrun --nproc_per_node=4 -m splatfactory.train zipsplat --conf zipsplat --distributed train.writer=tensorboard

The trained model can then be evaluated by its experiment name (see Evaluation):

python -m splatfactory.eval.run dl3dv_benchmark --checkpoint zipsplat --tag zipsplat-retrained eval.num_views=6

BibTeX

@article{veicht2026zipsplat,
  title   = {ZipSplat: Fewer Gaussians, Better Splats},
  author  = {Veicht, Alexander and Hong, Sunghwan and Bar{\'a}th, D{\'a}niel and Pollefeys, Marc},
  journal = {arXiv preprint arXiv:2606.05102},
  year    = {2026}
}

License

The code in this repository (the zipsplat and splatfactory packages) is released under the Apache-2.0 License.

The pretrained weights are released separately under CC BY-NC 4.0, non-commercial use only. This is required by their dependencies: the released checkpoint is initialized from DA3-Giant (CC BY-NC 4.0) and trained on DL3DV-10K (CC BY-NC 4.0). See the Hugging Face model card for the full weights license and dataset attribution.

Acknowledgements

splatfactory is built mainly on the excellent glue-factory training framework. Our model is initialized from the Depth Anything 3 backbone, and rendering uses gsplat. We thank the authors of these projects for releasing their code and models.
