Implementation of DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models (arXiv:2604.06161v1).
DiffHDR fine-tunes Wan-2.1-VACE-14B with LoRA adapters to reconstruct plausible HDR radiance in over- and underexposed regions of LDR video. It uses Log-Gamma color mapping to compress HDR dynamic range into the pretrained VAE's operational range, luminance-based exposure mask detection, and a Context-Focused Cross-Attention (CFA) module at inference for spatially-guided HDR recovery.
- Hardware Requirements
- Installation
- Downloading the Base Model
- Training
- Inference
- Evaluation
- Project Structure
- Key Design Decisions
- Troubleshooting
Training (paper configuration): 8 × A100 80 GB GPUs (effective batch size 32 = 1 sample/GPU × 8 GPUs × 4 gradient accumulation steps).
Practical minimum: 4 × A100 80 GB or equivalent. The 14B-parameter backbone alone requires ~28 GB at bfloat16; factor in activations, LoRA gradients, VAE (float32), and optimizer state.
Inference: 1 × GPU with ≥ 24 GB VRAM. The VAE always runs in float32; the DiT in bfloat16.
# 1. Clone
git clone <repo_url> diffhdr
cd diffhdr
# 2. Create a virtual environment (Python 3.10 recommended)
python3.10 -m venv .venv
source .venv/bin/activate
# 3. Install PyTorch (match your CUDA version; example for CUDA 12.4)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# 4. Install project dependencies
pip install -r requirements.txt
# 5. Install OpenEXR system libraries (required for EXR I/O)
# Ubuntu/Debian:
sudo apt-get install -y libopenexr-dev
# CentOS/RHEL/AlmaLinux:
sudo dnf install -y OpenEXR-devel
# 6. Verify the install
python -c "import diffhdr; print('OK')"Note on
decord: Ifdecordfails to install via pip, build it from source or replace video decoding withimageio-ffmpeg— the inference script has an automatic fallback.
DiffHDR uses Wan-2.1-VACE-14B as its backbone. Download it from HuggingFace and store it locally so training can run offline:
# Create a local model directory
mkdir -p models
# Download the diffusers-compatible model
pip install huggingface-hub
python - <<'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="Wan-AI/Wan2.1-VACE-14B-diffusers",
local_dir="models/Wan2.1-VACE-14B-diffusers",
ignore_patterns=["*.msgpack", "flax_model*"],
)
EOF
After downloading, point the config at the local path (see Step 3: Configure Training).
Training data must be organized as a flat directory of sequence folders, each containing exactly 81 frames as OpenEXR files in linear Rec.709 color space:
<dataset_root>/
├── sequence_000/
│ ├── frame_00000.exr
│ ├── frame_00001.exr
│ ├── ...
│ ├── frame_00080.exr # 81 frames total (indices 0–80)
│ └── caption.txt # generated in Step 1
├── sequence_001/
│ ├── frame_00000.exr
│ ...
│ └── caption.txt
...
File naming: The default pattern is frame_%05d.exr (zero-padded to 5 digits). This is configurable via data.frame_pattern in the YAML config. .hdr / .pic (Radiance) files are also supported.
Frame format: Linear Rec.709, no gamma encoding. Blender EXR exports are already in this format — do not apply gamma correction when loading. Use at least half-float precision in the EXR files.
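As a quick sanity check that your frames are linear floating-point data (highlights above 1.0, no gamma applied), inspect one with any EXR-capable reader. The snippet below assumes an OpenCV build with EXR support, which is not part of this project's requirements:

```python
# Sanity-check one frame: should be float data with highlights that can exceed 1.0.
# Assumes an OpenCV build with EXR support (not a project dependency).
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"  # recent OpenCV builds gate EXR I/O behind this
import cv2
import numpy as np

frame = cv2.imread("sequence_000/frame_00000.exr", cv2.IMREAD_UNCHANGED)  # float32, BGR
assert frame is not None and frame.dtype == np.float32
print("shape:", frame.shape, "min/max:", float(frame.min()), float(frame.max()))
```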
Validation split: By default, the last 50 sequences (sorted alphabetically) are held out for validation. Adjust data.val_split in the config. If your dataset is small (e.g., fewer than 50 sequences), reduce this number accordingly.
Minimum dataset size: The paper trains on ~5,400 Polyhaven-sourced sequences. For experimentation, dozens of sequences work, but quality will be limited.
Every sequence must have a caption.txt describing its over- and underexposed regions. This is a required preprocessing step — training falls back to the placeholder caption [Overexposed: none]; [Underexposed: none] for sequences without one, which degrades text conditioning.
Captions are generated automatically using Qwen3-VL-2B-Instruct:
python scripts/generate_captions.py \
--dataset_root /path/to/dataset \
--device cuda \
--frame_stride 8 # sample every 8th frame per sequence
What it does: For each sequence, the script loads every 8th HDR frame, Reinhard tone-maps them to LDR, feeds them to Qwen3-VL with a structured prompt, and writes the result to caption.txt.
Caption format:
[Overexposed: bright sky and sun reflections on water]; [Underexposed: dark interior and shadowed corners]
The parser is case-insensitive, so [overexposed: ...] is equally valid.
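For reference, a minimal parser for this format could look like the sketch below; diffhdr/prompting.py is the authoritative implementation.

```python
# Illustrative caption parser (case-insensitive); the real one lives in diffhdr/prompting.py.
import re

def parse_caption(text: str) -> dict:
    over = re.search(r"\[overexposed:\s*(.*?)\]", text, re.IGNORECASE)
    under = re.search(r"\[underexposed:\s*(.*?)\]", text, re.IGNORECASE)
    return {
        "overexposed": over.group(1).strip() if over else "none",
        "underexposed": under.group(1).strip() if under else "none",
    }

print(parse_caption("[Overexposed: bright sky]; [Underexposed: dark interior]"))
# {'overexposed': 'bright sky', 'underexposed': 'dark interior'}
```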
Options:
- --overwrite — regenerate captions even if caption.txt already exists
- --frame_stride N — sample every Nth frame (default 8)
Cost: Each sequence takes a few GPU-seconds. For ~5,400 sequences, budget several GPU-hours on a single GPU. Already-captioned sequences are skipped automatically on re-runs.
VAE encoding the HDR ground-truth frames is the single most expensive operation during training (~12–15 s per clip in float32). Since the HDR frames do not change between steps, the latents can be computed once and cached:
python scripts/precompute_hdr_latents.py \
--dataset_root /path/to/dataset \
--model_path models/Wan2.1-VACE-14B-diffusers \
--gamma 2.2 \
--M 1000.0 \
--sequence_frames 81 \
--clip_frames 33 \
--resolution 720 1280 \
--device cuda:0 \
--all_starts # cache every possible temporal offset (49 per sequence)
What it does: For every sequence and every possible 33-frame temporal starting offset (0 through 48), it:
- Loads the HDR frames
- Applies a deterministic center-crop to the target resolution
- Log-Gamma encodes the HDR
- Runs the Wan VAE encoder (float32)
- Saves the resulting latent to
<sequence_dir>/latents/clip_<start:04d>_latent.pt
Expected speedup: ~35–45% reduction in wall-clock time per training step (from 2 VAE calls to 1 per micro-step).
Critical: --gamma and --M must match the values in configs/train.yaml. They are saved to <dataset_root>/latent_config.json for cross-checking. If you change these hyperparameters, delete the latents/ directories and recompute.
Disk usage: Each cached latent is a (16, 9, 90, 160) float32 tensor ≈ 830 KB. With 49 offsets per sequence and 5,400 sequences: ~220 GB total. For experimentation, omit --all_starts to cache only start=0 (saves ~50× disk space at the cost of less temporal diversity).
The dataset loader picks up cached latents automatically — no flag is needed in train.py.
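The lookup amounts to checking for the cached file and falling back to on-the-fly VAE encoding when it is missing. A sketch of that behavior (illustrative; diffhdr/dataset.py is authoritative):

```python
# Illustrative cache lookup matching the path layout above; dataset.py is the source of truth.
from pathlib import Path
import torch

def load_cached_latent(sequence_dir: Path, start: int):
    cache_file = sequence_dir / "latents" / f"clip_{start:04d}_latent.pt"
    if cache_file.exists():
        return torch.load(cache_file, map_location="cpu")  # precomputed HDR latent
    return None  # caller falls back to Log-Gamma encoding + VAE encode for this clip
```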
Edit configs/train.yaml to point at your data and model:
# Key fields to adjust:
model:
backbone: "models/Wan2.1-VACE-14B-diffusers" # local path after downloading
data:
dataset_root: "/path/to/your/dataset"
val_split: 50 # sequences held out for validation; reduce for small datasets
training:
batch_size: 1 # per GPU — keep at 1 for 80 GB GPUs
gradient_accumulation: 4
num_steps: 10000
learning_rate: 1.0e-4
color_mapping:
gamma: 2.2 # Log-Gamma compression strength — requires empirical tuning
M: 1000.0 # Maximum representable radiance — requires empirical tuning
output_dir: "checkpoints"
use_wandb: false # set to true and log in with `wandb login` to enable W&B
Log-Gamma hyperparameters (gamma, M): These are not specified in the paper. gamma=2.2 (matching standard perceptual gamma encoding) and M=1000.0 (≈10 stops above the SDR white point) are physically motivated defaults. Validate them empirically on your data distribution.
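The exact mapping is implemented in diffhdr/color_mapping.py; as an illustration of what gamma and M control, one plausible (assumed) parameterization is a log compression to [0, 1] followed by a perceptual gamma:

```python
# Assumed Log-Gamma parameterization for illustration only; see diffhdr/color_mapping.py
# for the actual formula used in training and inference.
import torch

GAMMA, M = 2.2, 1000.0

def log_gamma_encode(x: torch.Tensor) -> torch.Tensor:
    """Compress linear radiance in [0, M] into [0, 1]."""
    x = x.clamp(0.0, M)
    y = torch.log1p(x) / torch.log1p(torch.tensor(M))  # log compression, M maps to 1.0
    return y ** (1.0 / GAMMA)                          # perceptual gamma

def log_gamma_decode(y: torch.Tensor) -> torch.Tensor:
    """Invert the encoding back to linear radiance."""
    return torch.expm1(y.clamp(0.0, 1.0) ** GAMMA * torch.log1p(torch.tensor(M)))
```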
initial_checkpoint: Set to true to save a step-0 checkpoint before any gradient updates. Useful for verifying that checkpoint saving and the resume path work before committing to a long run.
accelerate manages multi-GPU training. Run the interactive setup once:
accelerate config
Recommended answers for 4–8 × A100 training:
- Compute environment: This machine
- Distributed type: multi-GPU
- Number of processes (GPUs): 4 (or 8)
- Mixed precision: bf16
- Gradient accumulation: leave blank (set in train.yaml instead)
The config is saved to ~/.cache/huggingface/accelerate/default_config.yaml.
# Standard multi-GPU launch
accelerate launch train.py --config configs/train.yaml
# Single GPU (for debugging / small-scale experiments)
python train.py --config configs/train.yaml
On startup, the script prints a pre-flight summary:
[DiffHDR] LoRA targets: 384 Linear modules
[DiffHDR] Transformer: 186,499,072 trainable / 14,232,576,000 total (1.310%)
Verify that:
- The trainable fraction is between 0% and 5% (confirms LoRA freeze is correct)
- LoRA targets include attention projections (to_q, to_k, to_v, to_out.0) and FFN layers (ffn.net.0.proj, ffn.net.2)
- The VAE is not in the trainable parameters
The training loop logs every 50 steps and saves checkpoints every 500 steps (configurable via training.log_every and training.checkpoint_every).
Checkpoints saved during training contain only the LoRA adapter weights (not the full 14B model):
checkpoints/
└── checkpoint-005000/
├── adapter_config.json
├── adapter_model.safetensors # LoRA delta weights only (~750 MB)
├── color_mapping_config.json # gamma, M, and base_model path
└── step.txt # global step for resume
To resume training from a checkpoint:
accelerate launch train.py \
--config configs/train.yaml \
--resume_from_checkpoint checkpoints/checkpoint-005000
The script reads step.txt and resumes from the correct global step. No changes to the YAML are needed.
Two SLURM job scripts are provided:
Precompute latents (run first, 1 GPU, ~12h):
sbatch precompute_latents.slurm
Training (4 GPUs, 48h):
sbatch train.slurm
To resume from a checkpoint via SLURM:
sbatch --export=ALL,RESUME_CHECKPOINT=checkpoints/checkpoint-005000 train.slurm
The scripts assume:
- Working directory: /scratch/eli/disney/diffhdr
- Virtual environment: ${WORKDIR}/.venv
- Accelerate config: ~/.cache/huggingface/accelerate/default_config.yaml
- Modules: cuda12.4/toolkit
Adjust #SBATCH directives (--partition, --nodelist, --gres) and path variables for your cluster.
Console logs: The training loop prints loss and learning rate every log_every steps (default 50):
12:34:56 INFO diffhdr.train step=100/10000 loss=0.04213 lr=1.00e-04
W&B: Set use_wandb: true in the YAML and authenticate with wandb login before launching. Logs loss/flow, lr, and step.
Checkpoints: Saved every checkpoint_every steps (default 500) to output_dir/checkpoint-<step:06d>/. A checkpoint-final/ is written at the end of training. Only LoRA weights are saved (~750 MB each vs. ~28 GB for the full model).
python inference.py \
--input_video path/to/ldr_video.mp4 \
--output_dir path/to/output/ \
--lora_checkpoint checkpoints/checkpoint-final \
--prompt "[Overexposed: bright sky and sun]; [Underexposed: dark shadows under trees]" \
--num_inference_steps 50 \
--guidance_scale 7.5
The --prompt should describe the specific exposure conditions in the video. Write separate descriptions for overexposed and underexposed regions. If you don't know them in advance, the exposure mask detector will handle spatial routing regardless; the prompt controls high-level semantic guidance.
Prompt format:
[Overexposed: <describe bright/blown-out areas>]; [Underexposed: <describe dark/shadowed areas>]
Use [Overexposed: none] or [Underexposed: none] when a region is not present.
Loading the base model: The checkpoint's color_mapping_config.json stores the base model path used during training. Inference loads it automatically. Override with --base_model if the path has changed:
python inference.py \
... \
--base_model models/Wan2.1-VACE-14B-diffusers
A reference image provides explicit appearance guidance for overexposed regions. It can be any LDR image (e.g., a correctly-exposed photo of the same scene, or an edited version of a key frame):
python inference.py \
--input_video path/to/ldr_video.mp4 \
--output_dir path/to/output/ \
--lora_checkpoint checkpoints/checkpoint-final \
--prompt "[Overexposed: bright interior window]; [Underexposed: dark corner furniture]" \
--reference_image path/to/reference.png
The reference image is Log-Gamma encoded and prepended to the VACE context sequence along the temporal dimension.
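Conceptually, this prepending is a temporal concatenation; the shapes below are assumptions for illustration, not the exact tensors used in inference.py.

```python
# Illustrative only: prepend one encoded reference frame along the temporal axis.
import torch

context_clip = torch.randn(3, 33, 720, 1280)  # (C, T, H, W) Log-Gamma-encoded LDR context
reference = torch.randn(3, 1, 720, 1280)      # single Log-Gamma-encoded reference frame
context_with_ref = torch.cat([reference, context_clip], dim=1)  # T: 33 -> 34
```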
Context-Focused Cross-Attention (CFA) routes different text embeddings to overexposed and underexposed spatial regions at inference. It is enabled by default.
python inference.py \
... \
--alpha_over 3.0 # amplify HDR recovery in overexposed regions (paper tests 1, 3, 5)
--alpha_under 3.0 # amplify HDR recovery in underexposed regions
Higher alpha values produce stronger regional correction but may introduce artifacts. Start at 1.0 (default) and increase if HDR recovery in specific regions is insufficient.
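The regional routing relies on the exposure masks. A minimal luminance-threshold sketch is shown below (threshold values are assumptions; diffhdr/mask_detection.py implements the actual detector, including its temporal EMA smoothing):

```python
# Illustrative exposure-mask detection; threshold values are assumptions.
import torch

def exposure_masks(ldr: torch.Tensor, t_over: float = 0.95, t_under: float = 0.05):
    """ldr: (..., 3, H, W) in [0, 1]. Returns (overexposed, underexposed) boolean masks."""
    r, g, b = ldr[..., 0, :, :], ldr[..., 1, :, :], ldr[..., 2, :, :]
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b  # Rec.709 luminance
    return luma >= t_over, luma <= t_under
```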
To disable CFA and use standard attention (faster, slightly lower quality):
python inference.py \
... \
--no_cfa
Temporal windowing: Videos longer than 33 frames are processed in overlapping windows (33-frame windows with 8 frames of overlap by default). Blending is done in latent space before decoding. Adjust the overlap with --window_overlap N.
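A sketch of how the window starts can be derived is shown below; the latent-space blending weights themselves live in inference.py, so treat this as illustrative.

```python
# Illustrative computation of overlapping window starts (pixel-frame indices);
# overlapping latent regions are then blended (e.g., averaged) before decoding.
def window_starts(num_frames: int, window: int = 33, overlap: int = 8) -> list[int]:
    if num_frames <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, num_frames - window, stride))
    starts.append(num_frames - window)  # last window flush with the final frame
    return sorted(set(starts))

print(window_starts(81))  # [0, 25, 48] -> windows [0, 33), [25, 58), [48, 81)
```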
Inference writes two outputs to --output_dir:
output/
├── exr/
│ ├── frame_00000.exr # HDR output, half-float EXR, linear Rec.709
│ ├── frame_00001.exr
│ └── ...
└── preview.mp4 # 8-bit tone-mapped preview (Reinhard, per-video)
- EXR frames are in linear Rec.709 with values in [0, ∞). Small negative values from the VAE decoder are clamped to zero. Load them in Blender, Nuke, DaVinci Resolve, or any application that handles EXR.
- preview.mp4 uses global Reinhard tone mapping across all frames for temporal brightness consistency. It is for quick visual inspection only — not a ground-truth HDR representation.
Evaluation computes perceptual quality metrics on the predicted EXR frames.
Reference-based metrics (requires ground-truth EXR frames):
# SI-HDR benchmark (Table 1): HDR-VDP3, PU21-PIQE, FID
python evaluate.py \
--pred_dir output/exr/ \
--gt_dir groundtruth/exr/ \
--metrics all \
--benchmark sihdr
# Video benchmarks (Tables 2–3): FovVideoVDP + no-reference
python evaluate.py \
--pred_dir output/exr/ \
--gt_dir groundtruth/exr/ \
--metrics fovvideovdp,dover,musiq,clipiqa \
--benchmark cinematic
No-reference only (no ground truth needed):
python evaluate.py \
--pred_dir output/exr/ \
--metrics dover,musiq,clipiqaAvailable metrics:
| Metric | Type | Requires GT |
|---|---|---|
| hdrvdp3 | Image perceptual quality | Yes |
| pu21_piqe | Perceptual uniformity × PIQE | No |
| fid | Fréchet Inception Distance | Yes |
| fovvideovdp | Video perceptual quality | Yes |
| dover | No-reference video quality | No |
| musiq | Multi-scale image quality | No |
| clipiqa | CLIP-based image quality | No |
FID note: FID requires pre-saved PNG/JPEG files rather than EXR. Use pytorch-fid directly after tone-mapping:
# Tone-map EXR to PNG, then:
python -m pytorch_fid pred_png_dir/ gt_png_dir/
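A hypothetical tone-mapping pass (per-pixel Reinhard plus a simple gamma encode, assuming an OpenCV build with EXR support) could look like:

```python
# Hypothetical EXR -> PNG conversion for FID; adapt paths and tone mapping to your setup.
import glob, os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"
import cv2
import numpy as np

for path in glob.glob("output/exr/*.exr"):
    hdr = cv2.imread(path, cv2.IMREAD_UNCHANGED).astype(np.float32)  # linear Rec.709
    ldr = hdr / (1.0 + hdr)                                          # Reinhard tone mapping
    srgb = np.clip(ldr, 0.0, 1.0) ** (1.0 / 2.2)                     # gamma encode
    cv2.imwrite(path.replace(".exr", ".png"), (srgb * 255.0 + 0.5).astype(np.uint8))
```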
diffhdr/
├── configs/
│ └── train.yaml # training configuration
├── diffhdr/
│ ├── __init__.py
│ ├── color_mapping.py # Log-Gamma color mapping and sRGB↔linear helpers
│ ├── mask_detection.py # luminance-based exposure mask detection + EMA
│ ├── data_augmentation.py # LDR synthesis: exposure shift, noise, quantization
│ ├── dataset.py # HDRVideoDataset (EXR loading, latent cache)
│ ├── model.py # DiffHDRModel: Wan-2.1-VACE-14B + LoRA
│ ├── attention.py # CFA hooks (inference only)
│ ├── prompting.py # structured caption parsing and text encoding
│ └── losses.py # rectified flow-matching loss
├── scripts/
│ ├── generate_captions.py # one-time: Qwen3-VL caption generation
│ └── precompute_hdr_latents.py # one-time: HDR latent caching for speed
├── train.py # training script (accelerate / DDP)
├── inference.py # LDR video → HDR video
├── evaluate.py # evaluation metrics
├── train.slurm # SLURM job for training
├── precompute_latents.slurm # SLURM job for latent precomputation
└── requirements.txt
VAE is always float32 and always frozen. BF16 VAE decoding produces visible banding in smooth HDR gradients due to insufficient mantissa precision (Appendix C). VAE finetuning over-smooths latent representations and degrades downstream generation quality (Appendix B). Every VAE call is wrapped in torch.autocast(device_type='cuda', enabled=False) to force float32 even inside accelerate's BF16 context.
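In sketch form, assuming the diffusers-style encode API (the helper name is illustrative; the actual call sites are in train.py and inference.py):

```python
# Force float32 for the frozen VAE even inside accelerate's bf16 autocast context.
import torch

def vae_encode_fp32(vae, frames: torch.Tensor) -> torch.Tensor:
    with torch.autocast(device_type="cuda", enabled=False):
        return vae.encode(frames.float()).latent_dist.sample()
```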
LoRA targets only the main denoising branch. The VACE context branch (vace_blocks.*) remains fully frozen. LoRA targets are discovered at runtime by inspecting all nn.Linear modules — never hardcoded — to accommodate naming convention differences across diffusers versions.
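In sketch form (the keyword list and exact filtering are assumptions; the repo's discover_lora_targets() is the source of truth):

```python
# Illustrative runtime discovery of LoRA target modules; keywords are assumptions.
import torch.nn as nn

def discover_lora_targets(transformer: nn.Module) -> list[str]:
    keywords = ("to_q", "to_k", "to_v", "to_out.0", "ffn.net.0.proj", "ffn.net.2")
    targets = []
    for name, module in transformer.named_modules():
        if "vace_blocks" in name:  # VACE context branch stays fully frozen
            continue
        if isinstance(module, nn.Linear) and any(k in name for k in keywords):
            targets.append(name)
    return targets
```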
CFA is strictly inference-only. During training, a single text embedding from the full [Overexposed: ...]; [Underexposed: ...] caption is used. CFA's three-embedding decomposition (null, over, under) is applied only at inference. train.py asserts that no CFA hooks are registered before the training loop begins.
Log-Gamma hyperparameters are saved with every checkpoint. gamma and M are stored in color_mapping_config.json alongside the LoRA adapter files. Inference reads these values rather than relying on defaults, ensuring exact training/inference consistency.
Latent caching uses deterministic center-crop. When using precomputed HDR latents, the pixel frames at training time are center-cropped (not randomly cropped) to match the spatial region encoded into the cached latent. Random horizontal flip is applied consistently to both pixel frames and the latent.
"No trainable parameters" / assert pct > 0.0 fails
PEFT's get_peft_model handles parameter freezing internally. Do not call transformer.requires_grad_(False) after applying LoRA — this freezes the LoRA weights too, making training a no-op.
assert pct < 5.0 fails (trainable fraction > 5%)
LoRA is being applied to too many layers. Check discover_lora_targets() output in the logs. The VACE context branch (vace_blocks.*) must be excluded — verify the exclusion filter is working correctly.
ImportError: diffusers>=0.31.0 required or pipeline class not found
Upgrade diffusers: pip install -U 'diffusers>=0.37.0'. The pipeline class WanVACEPipeline was introduced in diffusers 0.37. To check what's available: python -c 'import diffusers; print([x for x in dir(diffusers) if "Wan" in x])'.
VACE context parameter not found (model runs without context conditioning)
The log line "VACE context parameter not found. Running DiT without context conditioning." indicates a diffusers version mismatch. The expected parameter name is control_hidden_states (diffusers ≥ 0.37). Check the parameter list in model._dit_forward() and add your version's name to ctx_param_candidates.
Training step time is > 120s on 4 × A100
This is expected without the latent cache. Run scripts/precompute_hdr_latents.py first. Without caching: ~168s/step → ~19 days for 10,000 steps. With caching: ~90–100s/step → ~10 days.
caption.txt missing for many sequences
Run scripts/generate_captions.py before training. The fallback caption [Overexposed: none]; [Underexposed: none] is used when caption.txt is absent, but text conditioning will not be meaningful for those sequences.
EXR frames all black or all white in preview
The preview uses Reinhard tone mapping. If the HDR output contains extreme values (very large or very small), the tone mapping key may need adjustment. Check the raw EXR values: linear HDR for a well-lit scene in Rec.709 should have most pixels in the range [0, 100] (with highlights up to ~1000).
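A quick way to inspect the raw value distribution (assuming an OpenCV build with EXR support):

```python
# Print percentile statistics of one output frame to spot degenerate value ranges.
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"
import cv2
import numpy as np

img = cv2.imread("output/exr/frame_00000.exr", cv2.IMREAD_UNCHANGED)
print("percentiles (1/50/99/99.9):", np.percentile(img, [1, 50, 99, 99.9]), "max:", float(img.max()))
```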
Cached latents produce spatial misalignment with LDR inputs
Ensure --gamma and --M in precompute_hdr_latents.py exactly match color_mapping.gamma and color_mapping.M in configs/train.yaml. Also verify --resolution H W matches training.resolution (in T H W order in the YAML, but H W order in the precompute script). If in doubt, delete the latents/ directories and recompute.