Boosting Monocular Metric Depth Estimation via Bokeh Rendering

ICML 2026

Hangwei Zhang^1,2 Armando Fortes¹ Tianyi Wei¹ Xingang Pan¹

¹S-Lab, Nanyang Technological University ²Beihang University

BokehDepth decouples bokeh synthesis from depth prediction and uses lens-aware defocus as a supervision-free geometric cue to improve the accuracy and physical consistency of monocular metric depth estimation. Left: conventional pipelines predict depth from a single sharp image and render bokeh from a noisy depth map. Right: our two-stage framework — Stage-1 generates a calibrated bokeh stack from a single image, and Stage-2 fuses the induced defocus cues to produce sharper, more reliable metric depth.

Method

Bokeh and monocular depth are two sides of the same lens geometry, but existing methods use this link in only one direction. Conventional bokeh pipelines depend on a predicted depth map, so any depth error turns into a wrong blur radius or a broken occlusion edge. Monocular metric models, in turn, struggle on textureless and distant regions, which is exactly where defocus carries the strongest geometric signal. BokehDepth closes this loop. It treats synthetic defocus as a supervision-free geometric cue and feeds it back into the depth estimator through two stages.

(a) Standard monocular depth estimation maps a single RGB image to a depth map. (b) Classical bokeh rendering needs both the image and a depth map. (c) BokehDepth instead generates a calibrated bokeh stack from the image and uses the resulting defocus cues to sharpen depth estimation.

Stage-1 — Physically Grounded Bokeh Generation. We build on FLUX.1-Kontext, a rectified-flow MMDiT backbone, and add a lightweight bokeh cross-attention adapter. Heterogeneous optical settings (focal length, aperture, focus distance) collapse into a single calibrated scalar K from the thin-lens circle-of-confusion model, which captures the near-linear relation between blur radius and disparity offset (r ≈ K · Δdisp). Conditioned on K, Stage-1 turns one sharp image into a multi-strength bokeh stack with no depth map at any point. A unified data pipeline aligns real defocused photos, synthetic renderings, and paired datasets onto this shared K axis.

Stage-2 — Bokeh Stack Fusion for Depth. A Divided Space Focus Attention (DSFA) module is inserted into the ViT depth encoder. It first runs spatial attention inside each frame, conditioned on that frame's strength K_f, and then runs focus attention across frames at matching spatial locations, modulated by FiLM. Each location can therefore read how its blur grows with K, which is the physical depth-from-defocus cue. Only the reference-frame tokens are passed on, so the original DPT decoder and metric head stay untouched. DSFA is a plug-and-play addition that drops into strong depth foundations such as Depth Anything V2 and UniDepthV2.

A Depth-from-Bokeh Sweep proposition shows this cue is principled: under calibrated control, regressing the bokeh radius on K across the stack gives an unbiased and consistent estimate of each pixel's inverse-depth offset, which recovers metric depth.

Installation

We ship a one-shot installer that mirrors the bokehdepth conda environment we used during development. Everything (CUDA runtime included) is fetched from PyPI through the +cu128 wheels, so the host machine only needs a working NVIDIA driver and conda.

git clone https://github.com/<your-org>/BokehDepth.git
cd BokehDepth
bash env/install.sh            # creates the `bokehdepth` env and wires LD_LIBRARY_PATH
conda activate bokehdepth       # ready to use

env/install.sh does three things:

Creates / refreshes the conda environment from env/environment.yml (Python 3.10, ffmpeg, and an optional GCC toolchain).
Installs every pip package listed in env/requirements.txt, which includes torch==2.8.0+cu128, xformers, diffusers, transformers, accelerate, and the rest of the BokehDepth stack.
Drops two scripts into ${CONDA_PREFIX}/etc/conda/{activate,deactivate}.d/ so that activating the env automatically prepends the right CUDA wheel directories (and the conda env's lib/) to LD_LIBRARY_PATH. This is what lets run_inference.sh stay clean.

Override the env name with CONDA_ENV=myenv bash env/install.sh.

If you also need the optional UniDepth CUDA extensions (only used for training losses, not for inference), build them after activation:
cd UniDepth/unidepth/ops/knn            && python setup.py install && cd -
cd UniDepth/unidepth/ops/extract_patches && python setup.py install && cd -

Weights

Pretrained checkpoints live on the fogradio/BokehDepth Hugging Face model card. Place them under weights/ so that the inference script can pick them up with its defaults:

mkdir -p weights
huggingface-cli download fogradio/BokehDepth --local-dir weights/

The Stage-1 base model (black-forest-labs/FLUX.1-Kontext-dev) is downloaded on first use through diffusers; make sure your Hugging Face token has accepted the FLUX license.

Inference

With the env activated and the weights in place, run:

bash run_inference.sh

run_inference.sh reads its defaults from environment variables (override any of them inline). The interesting ones:

Variable	Default	Purpose
`REF_IMAGE`	`examples/ref.png`	Input RGB image
`K_VALUES`	`10.0 20.0 30.0`	Bokeh strengths used to build the defocus stack
`ADAPTER_CKPT`	`weights/bokeh_lora.bin`	Stage-1 LoRA adapter
`WEIGHTS_PATH`	`weights/UDv2_dsfa_release.pth`	Stage-2 UniDepthV2-DSFA checkpoint
`CONFIG_PATH`	`UniDepth/configs/config_v2_vitl14_DSFA_inference.json`	Stage-2 model config
`OUTPUT_ROOT`	`examples/`	Where per-run subdirectories are created

Each run produces a timestamped directory under OUTPUT_ROOT/ containing the Stage-1 defocus stack, the Stage-2 metric depth (depth.npy + depth_color.png), and a pipeline_summary.json recording every argument that was used.

TODO

Release model
Release inference code
Release training code

Citation

If you find our work useful, please consider citing:

@article{zhang2025bokehdepth,
  title={Boosting Monocular Metric Depth Estimation via Bokeh Rendering},
  author={Zhang, Hangwei and Fortes, Armando Teles and Wei, Tianyi and Pan, Xingang},
  journal={arXiv preprint arXiv:2512.12425},
  year={2025}
}

Acknowledgement

We build upon BokehDiffusion, FLUX.1-Kontext, Depth Anything V2, and UniDepthV2. We would like to thank these projects that made this work possible.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
UniDepth		UniDepth
assets		assets
bokeh-generation		bokeh-generation
env		env
examples		examples
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pipeline.py		pipeline.py
run_inference.sh		run_inference.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Boosting Monocular Metric Depth Estimation via Bokeh Rendering

Method

Installation

Weights

Inference

TODO

Citation

Acknowledgement

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Boosting Monocular Metric Depth Estimation via Bokeh Rendering

Method

Installation

Weights

Inference

TODO

Citation

Acknowledgement

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages