ICML 2026
Hangwei Zhang1,2 Armando Fortes1 Tianyi Wei1 Xingang Pan1
1S-Lab, Nanyang Technological University 2Beihang University
BokehDepth decouples bokeh synthesis from depth prediction and uses lens-aware defocus as a supervision-free geometric cue to improve the accuracy and physical consistency of monocular metric depth estimation. Left: conventional pipelines predict depth from a single sharp image and render bokeh from a noisy depth map. Right: our two-stage framework — Stage-1 generates a calibrated bokeh stack from a single image, and Stage-2 fuses the induced defocus cues to produce sharper, more reliable metric depth.
Bokeh and monocular depth are two sides of the same lens geometry, but existing methods use this link in only one direction. Conventional bokeh pipelines depend on a predicted depth map, so any depth error turns into a wrong blur radius or a broken occlusion edge. Monocular metric models, in turn, struggle on textureless and distant regions, which is exactly where defocus carries the strongest geometric signal. BokehDepth closes this loop. It treats synthetic defocus as a supervision-free geometric cue and feeds it back into the depth estimator through two stages.
(a) Standard monocular depth estimation maps a single RGB image to a depth map. (b) Classical bokeh rendering needs both the image and a depth map. (c) BokehDepth instead generates a calibrated bokeh stack from the image and uses the resulting defocus cues to sharpen depth estimation.
Stage-1 — Physically Grounded Bokeh Generation.
We build on FLUX.1-Kontext, a rectified-flow MMDiT backbone, and add a lightweight bokeh cross-attention adapter. Heterogeneous optical settings (focal length, aperture, focus distance) collapse into a single calibrated scalar K from the thin-lens circle-of-confusion model, which captures the near-linear relation between blur radius and disparity offset (r ≈ K · Δdisp). Conditioned on K, Stage-1 turns one sharp image into a multi-strength bokeh stack with no depth map at any point. A unified data pipeline aligns real defocused photos, synthetic renderings, and paired datasets onto this shared K axis.
Stage-2 — Bokeh Stack Fusion for Depth.
A Divided Space Focus Attention (DSFA) module is inserted into the ViT depth encoder. It first runs spatial attention inside each frame, conditioned on that frame's strength K_f, and then runs focus attention across frames at matching spatial locations, modulated by FiLM. Each location can therefore read how its blur grows with K, which is the physical depth-from-defocus cue. Only the reference-frame tokens are passed on, so the original DPT decoder and metric head stay untouched. DSFA is a plug-and-play addition that drops into strong depth foundations such as Depth Anything V2 and UniDepthV2.
A Depth-from-Bokeh Sweep proposition shows this cue is principled: under calibrated control, regressing the bokeh radius on
Kacross the stack gives an unbiased and consistent estimate of each pixel's inverse-depth offset, which recovers metric depth.
We ship a one-shot installer that mirrors the bokehdepth conda environment we used during development. Everything (CUDA runtime included) is fetched from PyPI through the +cu128 wheels, so the host machine only needs a working NVIDIA driver and conda.
git clone https://github.com/<your-org>/BokehDepth.git
cd BokehDepth
bash env/install.sh # creates the `bokehdepth` env and wires LD_LIBRARY_PATH
conda activate bokehdepth # ready to useenv/install.sh does three things:
- Creates / refreshes the conda environment from
env/environment.yml(Python 3.10,ffmpeg, and an optional GCC toolchain). - Installs every pip package listed in
env/requirements.txt, which includestorch==2.8.0+cu128,xformers,diffusers,transformers,accelerate, and the rest of the BokehDepth stack. - Drops two scripts into
${CONDA_PREFIX}/etc/conda/{activate,deactivate}.d/so that activating the env automatically prepends the right CUDA wheel directories (and the conda env'slib/) toLD_LIBRARY_PATH. This is what letsrun_inference.shstay clean.
Override the env name with
CONDA_ENV=myenv bash env/install.sh.If you also need the optional UniDepth CUDA extensions (only used for training losses, not for inference), build them after activation:
cd UniDepth/unidepth/ops/knn && python setup.py install && cd - cd UniDepth/unidepth/ops/extract_patches && python setup.py install && cd -
Pretrained checkpoints live on the fogradio/BokehDepth Hugging Face model card. Place them under weights/ so that the inference script can pick them up with its defaults:
mkdir -p weights
huggingface-cli download fogradio/BokehDepth --local-dir weights/The Stage-1 base model (black-forest-labs/FLUX.1-Kontext-dev) is downloaded on first use through diffusers; make sure your Hugging Face token has accepted the FLUX license.
With the env activated and the weights in place, run:
bash run_inference.shrun_inference.sh reads its defaults from environment variables (override any of them inline). The interesting ones:
| Variable | Default | Purpose |
|---|---|---|
REF_IMAGE |
examples/ref.png |
Input RGB image |
K_VALUES |
10.0 20.0 30.0 |
Bokeh strengths used to build the defocus stack |
ADAPTER_CKPT |
weights/bokeh_lora.bin |
Stage-1 LoRA adapter |
WEIGHTS_PATH |
weights/UDv2_dsfa_release.pth |
Stage-2 UniDepthV2-DSFA checkpoint |
CONFIG_PATH |
UniDepth/configs/config_v2_vitl14_DSFA_inference.json |
Stage-2 model config |
OUTPUT_ROOT |
examples/ |
Where per-run subdirectories are created |
Each run produces a timestamped directory under OUTPUT_ROOT/ containing the Stage-1 defocus stack, the Stage-2 metric depth (depth.npy + depth_color.png), and a pipeline_summary.json recording every argument that was used.
- Release model
- Release inference code
- Release training code
If you find our work useful, please consider citing:
@article{zhang2025bokehdepth,
title={Boosting Monocular Metric Depth Estimation via Bokeh Rendering},
author={Zhang, Hangwei and Fortes, Armando Teles and Wei, Tianyi and Pan, Xingang},
journal={arXiv preprint arXiv:2512.12425},
year={2025}
}We build upon BokehDiffusion, FLUX.1-Kontext, Depth Anything V2, and UniDepthV2. We would like to thank these projects that made this work possible.


