Siyuan Bian*, Congrong Xu*, Jun Gao
This repository is the official code for MDA, from the paper "Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation" (arXiv · project page).
Common feed-forward depth models predict one depth value per pixel. At object edges this fails: the pixel covers both foreground and background, so its depth is ambiguous, and a single value falls between the two surfaces — a flying point that corrupts the reconstruction.
MDA replaces the single value with a mixture density: each pixel predicts a few depth hypotheses with probabilities, then picks one instead of averaging. This largely eliminates flying points, stays robust to input blur, adds negligible overhead, and works across backbones — both DA3 and VGGT.
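The difference between averaging and picking can be seen on a single edge pixel. The snippet below is a toy illustration, not code from this repository: the depth values, probabilities, and function names are hypothetical, and it assumes "picking" means taking the most probable hypothesis.

```python
# Toy illustration only: values and function names are hypothetical,
# not taken from the MDA codebase.

def mean_depth(depths, probs):
    """Probability-weighted average of the depth hypotheses."""
    return sum(d * p for d, p in zip(depths, probs))


def pick_depth(depths, probs):
    """Return the hypothesis with the highest probability (assumed selection rule)."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return depths[best]


# An edge pixel straddling a foreground surface at 1.0 m and a background at 5.0 m.
depths = [1.0, 5.0]
probs = [0.6, 0.4]

averaged = mean_depth(depths, probs)  # falls between the two surfaces
picked = pick_depth(depths, probs)    # stays on the foreground surface
print(averaged, picked)
```

The averaged value lies on neither surface, which is the flying point described above. The picked value always coincides with one of the predicted hypotheses.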
- 2026-06-02: Public release. Training code, evaluation scripts, and the mda_mog_sky_l2 and vggt_mog_l2 checkpoints are now available.
MDA installs in two passes. The core package covers inference and the mixture-density head. A few extras are needed only for training, evaluation, or Gaussian-splatting export.
conda create -n mda python=3.10 -y
conda activate mda
# Install PyTorch for your CUDA version. This is one example.
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
# Core package (mixture-density head + inference).
pip install -e .
# Optional: all inference extras (point-cloud viewers, format converters).
pip install -e ".[all]"
Training and evaluation extras. The Hydra/Lightning training launcher and the benchmark eval scripts use a few libraries that pyproject.toml does not declare. Install them when you need to train or evaluate:
# Training stack.
pip install hydra-core lightning lightning-bolts torchmetrics rootutils \
accelerate peft pyyaml tqdm
# Eval and visualization utilities (src/testing/eval_cut3r/*, src/training/da3_wrapper.py).
pip install matplotlib scipy sympy
ffmpeg. The bash src/testing/run_demo.sh <video> flow extracts frames from a video, so ffmpeg must be on your PATH:
- Debian / Ubuntu: sudo apt-get install ffmpeg
- macOS (Homebrew): brew install ffmpeg
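To confirm that ffmpeg is visible before running the video flow, a quick check with the Python standard library may help. This is a minimal sketch. The helper name find_ffmpeg is hypothetical and not part of this repository.

```python
import shutil


def find_ffmpeg():
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


path = find_ffmpeg()
if path is None:
    print("ffmpeg not found on PATH; install it before running the video flow")
else:
    print(f"ffmpeg found at {path}")
```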
The pretrained MDA checkpoints are on the Hugging Face Hub at sy000/MDA. Download them into checkpoints/MDA/, the path src/testing/utils/model_choice.py expects:
hf download sy000/MDA --local-dir checkpoints/MDA
This places two checkpoints:
| --model_name | Backbone | Checkpoint file | Notes |
|---|---|---|---|
| mda_mog_sky_l2 (default) | DA3 Giant + Gaussian mixture + sky | checkpoints/MDA/DA3_MOG_Sky_LogL2.ckpt | Default model; main results in the paper. |
| vggt_mog_l2 | VGGT-1B + MDA head | checkpoints/MDA/VGGT_MOG_LogL2.ckpt | Same head on a VGGT-1B backbone. |
(hf ships with huggingface_hub. Run pip install -U huggingface_hub if the command is missing.)
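After the download, it can be useful to verify that both files landed where the loader looks for them. The sketch below uses only the file names listed in the table. The helper missing_checkpoints is hypothetical and does not exist in this repository.

```python
from pathlib import Path

# File names come from the checkpoint table; the helper itself is hypothetical.
EXPECTED_CHECKPOINTS = {
    "mda_mog_sky_l2": "DA3_MOG_Sky_LogL2.ckpt",
    "vggt_mog_l2": "VGGT_MOG_LogL2.ckpt",
}


def missing_checkpoints(root="checkpoints/MDA"):
    """Return the model names whose checkpoint file is absent under root."""
    root = Path(root)
    return [
        name
        for name, filename in EXPECTED_CHECKPOINTS.items()
        if not (root / filename).is_file()
    ]


missing = missing_checkpoints()
if missing:
    print("Missing checkpoints for:", ", ".join(missing))
else:
    print("Both checkpoints are in place.")
```

Run it from the repository root so that the relative path checkpoints/MDA resolves correctly.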
demo.py takes a folder of images, a single video file (frames extracted with ffmpeg), or a single image (monocular inference). All settings live in the DemoConfig at the top of the file; every field is also a CLI flag.
# 1. Bundled multi-view examples (video frames or unordered indoor stills).
python demo.py assets/examples/dolomiti
python demo.py assets/examples/diode_indoor
# 2. Single-image (monocular) example.
python demo.py assets/examples/mono/painting/painting.jpeg
# 3. Your own data.
python demo.py path/to/video.mp4 --fps 5
python demo.py path/to/image_folder --image_stride 10 # keep every 10th image
python demo.py path/to/image_folder --model_name vggt_mog_l2
The default model is mda_mog_sky_l2. Override it with --model_name (see the table above, or src/testing/utils/model_choice.py for all names). Outputs go to --output_dir (default eval_results/demo/<input_basename>/<model_name>/).
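The default output location follows the pattern eval_results/demo/<input_basename>/<model_name>/. The helper below sketches how such a path could be composed. It is not the code in demo.py: the function name is hypothetical, and it assumes the basename is the last path component with any file extension removed.

```python
from pathlib import Path


def default_output_dir(input_path, model_name="mda_mog_sky_l2"):
    """Compose eval_results/demo/<input_basename>/<model_name> for a given input."""
    # Assumption: the basename is the last path component without its extension.
    basename = Path(input_path).stem
    return Path("eval_results") / "demo" / basename / model_name


print(default_output_dir("assets/examples/dolomiti").as_posix())
print(default_output_dir("path/to/video.mp4", "vggt_mog_l2").as_posix())
```

Knowing the pattern makes it easier to locate results from several runs, since each input and model combination gets its own folder.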
After inference, an interactive viser point-cloud viewer launches automatically (disable with --no-viewer). To browse several finished runs in one viewer with a dropdown:
python view.py --data_dir eval_results/demo --method mda_mog_sky_l2
The original shell pipeline (src/testing/run_demo.sh wrapping src/testing/run_inference_video.py) is still available for the .ply-export flow.
Training uses Hydra to compose an experiment config under configs/experiment/mda/ (the .yaml extension is implicit). Each config finetunes a pretrained DA3 or VGGT checkpoint with K = 4 mixture components for 10k steps on 4 × RTX Pro 6000, learning rate 1e-4, batch size 48 (paper §5.1.1).
# Default: DA3 + Gaussian mixture + sky component.
python src/training/train.py experiment=mda/da3_mog_sky_full
Other recipes under configs/experiment/mda/:
| Config | Description |
|---|---|
| da3_mog_sky_full | DA3 + Gaussian mixture + sky component (default) |
| da3_mog_sky_full_l1 | DA3 + Laplacian mixture (paper Table 1, "LMM" row) |
| vggt_mog_full | VGGT backbone + MDA head |
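The recipes differ in the component family of the mixture (Gaussian or Laplacian). As a generic illustration of what a K = 4 mixture density assigns to a depth value, the sketch below evaluates the negative log-likelihood of one pixel under both families. It is a textbook formulation with hypothetical parameter values and function names. The loss actually used by the training configs may differ.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, mu, b):
    """Density of a Laplace distribution with location mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


def mixture_nll(x, weights, locs, scales, pdf):
    """Negative log-likelihood of x under a K-component mixture."""
    likelihood = sum(w * pdf(x, m, s) for w, m, s in zip(weights, locs, scales))
    return -math.log(likelihood)


# Hypothetical K = 4 prediction for one pixel: weights sum to 1.
weights = [0.5, 0.3, 0.1, 0.1]
locs = [1.0, 5.0, 2.0, 8.0]      # depth hypotheses in metres
scales = [0.1, 0.2, 0.5, 1.0]    # per-component spread

for name, pdf in [("gaussian", gaussian_pdf), ("laplace", laplace_pdf)]:
    on_surface = mixture_nll(1.0, weights, locs, scales, pdf)
    in_between = mixture_nll(3.0, weights, locs, scales, pdf)
    print(name, round(on_surface, 3), round(in_between, 3))
```

In both families a depth near a confident hypothesis scores a lower negative log-likelihood than a depth lying between hypotheses, which is the behavior a mixture representation relies on at object edges.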
Override any Hydra field on the command line:
python src/training/train.py experiment=mda/da3_mog_sky_full \
trainer.devices=4 data.num_views=8 logger=wandb
Training data. The synthetic training mix follows the DA3 recipe: AriaSyntheticENV, HyperSim, MvsSynth, OmniWorld, PointOdyssey, TartanAir, vKitti2, DynamicReplica, UnrealStereo4K (paper §5.1.1).
Two launcher scripts cover the two benchmark tracks in the paper. They use the same checkpoints as the demo. Each script selects the model by name through src/testing/utils/model_choice.py, and both default to mda_mog_sky_l2. To evaluate a different model, edit the model_names array at the top of the script (for example, set it to vggt_mog_l2).
# Boundary-quality benchmark (NRGBD, 7Scenes, HiRoom) — paper Table 1.
bash src/testing/eval_cut3r/mv_recon/run_mv_recon.sh
# Video-depth benchmark (Sintel, Bonn, KITTI, DIODE) — paper Table 2.
bash src/testing/eval_cut3r/video_depth/run_video_depth.sh
Both scripts write per-dataset and per-model outputs under eval_results/.
If you build on MDA, please cite:
@misc{bian2026modeling,
title = {Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation},
author = {Siyuan Bian and Congrong Xu and Jun Gao},
year = {2026},
eprint = {2606.02552},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2606.02552}
}
This codebase builds on these open-source releases:
- Depth Anything 3 — one of the two backbones, and the source of the DINOv2-based encoder, DPT head, and inference code.
- Stream3R — the Hydra/Lightning training launcher, multi-view DUSt3R data modules, and streaming VGGT-style sequence wrapper.
We thank the authors for their work.