Implementation of DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models (arXiv:2604.06161v1).
DiffHDR fine-tunes Wan-2.1-VACE-14B with LoRA adapters to reconstruct plausible HDR radiance in over- and underexposed regions of LDR video. It uses Log-Gamma color mapping to compress HDR dynamic range into the pretrained VAE's operational range, luminance-based exposure mask detection, and a Context-Focused Cross-Attention (CFA) module at inference for spatially-guided HDR recovery.
- Hardware Requirements
- Installation
- Downloading the Base Model
- Training
- Inference
- Evaluation
- Project Structure
- Key Design Decisions
- Troubleshooting
Training (paper configuration): 8 × A100 80 GB GPUs (effective batch size 32 = 1 sample/GPU × 8 GPUs × 4 gradient accumulation steps).
Practical minimum: 4 × A100 80 GB or equivalent. The 14B-parameter backbone alone requires ~28 GB at bfloat16; factor in activations, LoRA gradients, VAE (float32), and optimizer state.
Inference: 1 × GPU with ≥ 24 GB VRAM. The VAE always runs in float32; the DiT in bfloat16.
# 1. Clone
git clone <repo_url> diffhdr
cd diffhdr
# 2. Create a virtual environment (Python 3.10 recommended)
python3.10 -m venv .venv
source .venv/bin/activate
# 3. Install PyTorch (match your CUDA version; example for CUDA 12.4)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
# 4. Install project dependencies
pip install -r requirements.txt
# 5. Install OpenEXR system libraries (required for EXR I/O)
# Ubuntu/Debian:
sudo apt-get install -y libopenexr-dev
# CentOS/RHEL/AlmaLinux:
sudo dnf install -y OpenEXR-devel
# 6. Verify the install
python -c "import diffhdr; print('OK')"Note on
decord: Ifdecordfails to install via pip, build it from source or replace video decoding withimageio-ffmpeg— the inference script has an automatic fallback.
DiffHDR uses Wan-2.1-VACE-14B as its backbone. Download it from HuggingFace and store it locally so training can run offline:
# Create a local model directory
mkdir -p models
# Download the diffusers-compatible model
pip install huggingface-hub
python - <<'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="Wan-AI/Wan2.1-VACE-14B-diffusers",
local_dir="models/Wan2.1-VACE-14B-diffusers",
ignore_patterns=["*.msgpack", "flax_model*"],
)
EOF
After downloading, point the config at the local path (see Step 3: Configure Training).
Training data must be organized as a flat directory of sequence folders, each containing exactly 81 frames as OpenEXR files in linear Rec.709 color space:
<dataset_root>/
├── sequence_000/
│ ├── frame_00000.exr
│ ├── frame_00001.exr
│ ├── ...
│ ├── frame_00080.exr # 81 frames total (indices 0–80)
│ └── caption.txt # generated in Step 1
├── sequence_001/
│ ├── frame_00000.exr
│ ...
│ └── caption.txt
...
File naming: The default pattern is frame_%05d.exr (zero-padded to 5 digits). This is configurable via data.frame_pattern in the YAML config. .hdr / .pic (Radiance) files are also supported.
Frame format: Linear Rec.709, no gamma encoding. Blender EXR exports are already in this format — do not apply gamma correction when loading. Use at least half-float precision in the EXR files.
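As a quick sanity check that your frames are linear floating-point data (highlights above 1.0, no gamma applied), inspect one with any EXR-capable reader. The snippet below assumes an OpenCV build with EXR support, which is not part of this project's requirements:

```python
# Sanity-check one frame: should be float data with highlights that can exceed 1.0.
# Assumes an OpenCV build with EXR support (not a project dependency).
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"  # recent OpenCV builds gate EXR I/O behind this
import cv2
import numpy as np

frame = cv2.imread("sequence_000/frame_00000.exr", cv2.IMREAD_UNCHANGED)  # float32, BGR
assert frame is not None and frame.dtype == np.float32
print("shape:", frame.shape, "min/max:", float(frame.min()), float(frame.max()))
```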
Validation split: By default, the last 50 sequences (sorted alphabetically) are held out for validation. Adjust data.val_split in the config. If your dataset is small (e.g., fewer than 50 sequences), reduce this number accordingly.
Minimum dataset size: The paper trains on ~5,400 Polyhaven-sourced sequences. For experimentation, dozens of sequences work, but quality will be limited.
Every sequence must have a caption.txt describing its over- and underexposed regions. This is a required preprocessing step — training falls back to the placeholder caption [Overexposed: none]; [Underexposed: none] for sequences without one, which degrades text conditioning.
Captions are generated automatically using Qwen3-VL-2B-Instruct:
python scripts/generate_captions.py \
--dataset_root /path/to/dataset \
--device cuda \
--frame_stride 8 # sample every 8th frame per sequence
What it does: For each sequence, the script loads every 8th HDR frame, Reinhard tone-maps them to LDR, feeds them to Qwen3-VL with a structured prompt, and writes the result to caption.txt.
Caption format:
[Overexposed: bright sky and sun reflections on water]; [Underexposed: dark interior and shadowed corners]
The parser is case-insensitive, so [overexposed: ...] is equally valid.
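For reference, a minimal parser for this format could look like the sketch below; diffhdr/prompting.py is the authoritative implementation.

```python
# Illustrative caption parser (case-insensitive); the real one lives in diffhdr/prompting.py.
import re

def parse_caption(text: str) -> dict:
    over = re.search(r"\[overexposed:\s*(.*?)\]", text, re.IGNORECASE)
    under = re.search(r"\[underexposed:\s*(.*?)\]", text, re.IGNORECASE)
    return {
        "overexposed": over.group(1).strip() if over else "none",
        "underexposed": under.group(1).strip() if under else "none",
    }

print(parse_caption("[Overexposed: bright sky]; [Underexposed: dark interior]"))
# {'overexposed': 'bright sky', 'underexposed': 'dark interior'}
```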
Options:
- --overwrite — regenerate captions even if caption.txt already exists
- --frame_stride N — sample every Nth frame (default 8)
Cost: Each sequence takes a few GPU-seconds. For ~5,400 sequences, budget several GPU-hours on a single GPU. Already-captioned sequences are skipped automatically on re-runs.
VAE encoding the HDR ground-truth frames is the single most expensive operation during training (~12–15 s per clip in float32). Since the HDR frames do not change between steps, the latents can be computed once and cached:
python scripts/precompute_hdr_latents.py \
--dataset_root /path/to/dataset \
--model_path models/Wan2.1-VACE-14B-diffusers \
--gamma 2.2 \
--M 1000.0 \
--sequence_frames 81 \
--clip_frames 33 \
--resolution 720 1280 \
--device cuda:0 \
--all_starts # cache every possible temporal offset (49 per sequence)
What it does: For every sequence and every possible 33-frame temporal starting offset (0 through 48), it:
- Loads the HDR frames
- Applies a deterministic center-crop to the target resolution
- Log-Gamma encodes the HDR
- Runs the Wan VAE encoder (float32)
- Saves the resulting latent to
<sequence_dir>/latents/clip_<start:04d>_latent.pt
Expected speedup: ~35–45% reduction in wall-clock time per training step (from 2 VAE calls to 1 per micro-step).
Critical: --gamma and --M must match the values in configs/train.yaml. They are saved to <dataset_root>/latent_config.json for cross-checking. If you change these hyperparameters, delete the latents/ directories and recompute.
Disk usage: Each cached latent is a (16, 9, 90, 160) float32 tensor ≈ 830 KB. With 49 offsets per sequence and 5,400 sequences: ~220 GB total. For experimentation, omit --all_starts to cache only start=0 (saves ~50× disk space at the cost of less temporal diversity).
The dataset loader picks up cached latents automatically — no flag is needed in train.py.
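The lookup amounts to checking for the cached file and falling back to on-the-fly VAE encoding when it is missing. A sketch of that behavior (illustrative; diffhdr/dataset.py is authoritative):

```python
# Illustrative cache lookup matching the path layout above; dataset.py is the source of truth.
from pathlib import Path
import torch

def load_cached_latent(sequence_dir: Path, start: int):
    cache_file = sequence_dir / "latents" / f"clip_{start:04d}_latent.pt"
    if cache_file.exists():
        return torch.load(cache_file, map_location="cpu")  # precomputed HDR latent
    return None  # caller falls back to Log-Gamma encoding + VAE encode for this clip
```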
Edit configs/train.yaml to point at your data and model:
# Key fields to adjust:
model:
backbone: "models/Wan2.1-VACE-14B-diffusers" # local path after downloading
data:
dataset_root: "/path/to/your/dataset"
val_split: 50 # sequences held out for validation; reduce for small datasets
training:
batch_size: 1 # per GPU — keep at 1 for 80 GB GPUs
gradient_accumulation: 4
num_steps: 10000
learning_rate: 1.0e-4
color_mapping:
gamma: 2.2 # Log-Gamma compression strength — requires empirical tuning
M: 1000.0 # Maximum representable radiance — requires empirical tuning
output_dir: "checkpoints"
use_wandb: false # set to true and log in with `wandb login` to enable W&B
Log-Gamma hyperparameters (gamma, M): These are not specified in the paper. gamma=2.2 (matching standard perceptual gamma encoding) and M=1000.0 (≈10 stops above the SDR white point) are physically motivated defaults. Validate them empirically on your data distribution.
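The exact mapping is implemented in diffhdr/color_mapping.py; as an illustration of what gamma and M control, one plausible (assumed) parameterization is a log compression to [0, 1] followed by a perceptual gamma:

```python
# Assumed Log-Gamma parameterization for illustration only; see diffhdr/color_mapping.py
# for the actual formula used in training and inference.
import torch

GAMMA, M = 2.2, 1000.0

def log_gamma_encode(x: torch.Tensor) -> torch.Tensor:
    """Compress linear radiance in [0, M] into [0, 1]."""
    x = x.clamp(0.0, M)
    y = torch.log1p(x) / torch.log1p(torch.tensor(M))  # log compression, M maps to 1.0
    return y ** (1.0 / GAMMA)                          # perceptual gamma

def log_gamma_decode(y: torch.Tensor) -> torch.Tensor:
    """Invert the encoding back to linear radiance."""
    return torch.expm1(y.clamp(0.0, 1.0) ** GAMMA * torch.log1p(torch.tensor(M)))
```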
initial_checkpoint: Set to true to save a step-0 checkpoint before any gradient updates. Useful for verifying that checkpoint saving and the resume path work before committing to a long run.
accelerate manages multi-GPU training. Run the interactive setup once:
accelerate config
Recommended answers for 4–8 × A100 training:
- Compute environment: This machine
- Distributed type: multi-GPU
- Number of processes (GPUs): 4 (or 8)
- Mixed precision: bf16
- Gradient accumulation: leave blank (set in train.yaml instead)
The config is saved to ~/.cache/huggingface/accelerate/default_config.yaml.
# Standard multi-GPU launch
accelerate launch train.py --config configs/train.yaml
# Single GPU (for debugging / small-scale experiments)
python train.py --config configs/train.yaml
On startup, the script prints a pre-flight summary:
[DiffHDR] LoRA targets: 384 Linear modules
[DiffHDR] Transformer: 186,499,072 trainable / 14,232,576,000 total (1.310%)
Verify that:
- The trainable fraction is between 0% and 5% (confirms LoRA freeze is correct)
- LoRA targets include attention projections (to_q, to_k, to_v, to_out.0) and FFN layers (ffn.net.0.proj, ffn.net.2)
- The VAE is not in the trainable parameters
The training loop logs every 50 steps and saves checkpoints every 500 steps (configurable via training.log_every and training.checkpoint_every).
Checkpoints saved during training contain only the LoRA adapter weights (not the full 14B model):
checkpoints/
└── checkpoint-005000/
├── adapter_config.json
├── adapter_model.safetensors # LoRA delta weights only (~750 MB)
├── color_mapping_config.json # gamma, M, and base_model path
└── step.txt # global step for resume
To resume training from a checkpoint:
accelerate launch train.py \
--config configs/train.yaml \
--resume_from_checkpoint checkpoints/checkpoint-005000
The script reads step.txt and resumes from the correct global step. No changes to the YAML are needed.
Two SLURM job scripts are provided:
Precompute latents (run first, 1 GPU, ~12h):
sbatch precompute_latents.slurm
Training (4 GPUs, 48h):
sbatch train.slurm
To resume from a checkpoint via SLURM:
sbatch --export=ALL,RESUME_CHECKPOINT=checkpoints/checkpoint-005000 train.slurm
The scripts assume:
- Working directory: /scratch/eli/disney/diffhdr
- Virtual environment: ${WORKDIR}/.venv
- Accelerate config: ~/.cache/huggingface/accelerate/default_config.yaml
- Modules: cuda12.4/toolkit
Adjust #SBATCH directives (--partition, --nodelist, --gres) and path variables for your cluster.
Console logs: The training loop prints loss and learning rate every log_every steps (default 50):
12:34:56 INFO diffhdr.train step=100/10000 loss=0.04213 lr=1.00e-04
W&B: Set use_wandb: true in the YAML and authenticate with wandb login before launching. Logs loss/flow, lr, and step.
Checkpoints: Saved every checkpoint_every steps (default 500) to output_dir/checkpoint-<step:06d>/. A checkpoint-final/ is written at the end of training. Only LoRA weights are saved (~750 MB each vs. ~28 GB for the full model).
python inference.py \
--input_video path/to/ldr_video.mp4 \
--output_dir path/to/output/ \
--lora_checkpoint checkpoints/checkpoint-final \
--prompt "[Overexposed: bright sky and sun]; [Underexposed: dark shadows under trees]" \
--num_inference_steps 50 \
--guidance_scale 7.5
The --prompt should describe the specific exposure conditions in the video. Write separate descriptions for overexposed and underexposed regions. If you don't know them in advance, the exposure mask detector will handle spatial routing regardless; the prompt controls high-level semantic guidance.
Prompt format:
[Overexposed: <describe bright/blown-out areas>]; [Underexposed: <describe dark/shadowed areas>]
Use [Overexposed: none] or [Underexposed: none] when a region is not present.
Loading the base model: The checkpoint's color_mapping_config.json stores the base model path used during training. Inference loads it automatically. Override with --base_model if the path has changed:
python inference.py \
... \
--base_model models/Wan2.1-VACE-14B-diffusers
A reference image provides explicit appearance guidance for overexposed regions. It can be any LDR image (e.g., a correctly-exposed photo of the same scene, or an edited version of a key frame):
python inference.py \
--input_video path/to/ldr_video.mp4 \
--output_dir path/to/output/ \
--lora_checkpoint checkpoints/checkpoint-final \
--prompt "[Overexposed: bright interior window]; [Underexposed: dark corner furniture]" \
--reference_image path/to/reference.png
The reference image is Log-Gamma encoded and prepended to the VACE context sequence along the temporal dimension.
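Conceptually, this prepending is a temporal concatenation; the shapes below are assumptions for illustration, not the exact tensors used in inference.py.

```python
# Illustrative only: prepend one encoded reference frame along the temporal axis.
import torch

context_clip = torch.randn(3, 33, 720, 1280)  # (C, T, H, W) Log-Gamma-encoded LDR context
reference = torch.randn(3, 1, 720, 1280)      # single Log-Gamma-encoded reference frame
context_with_ref = torch.cat([reference, context_clip], dim=1)  # T: 33 -> 34
```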
Context-Focused Cross-Attention (CFA) routes different text embeddings to overexposed and underexposed spatial regions at inference. It is enabled by default.
python inference.py \
... \
--alpha_over 3.0 # amplify HDR recovery in overexposed regions (paper tests 1, 3, 5)
--alpha_under 3.0 # amplify HDR recovery in underexposed regions
Higher alpha values produce stronger regional correction but may introduce artifacts. Start at 1.0 (default) and increase if HDR recovery in specific regions is insufficient.
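The regional routing relies on the exposure masks. A minimal luminance-threshold sketch is shown below (threshold values are assumptions; diffhdr/mask_detection.py implements the actual detector, including its temporal EMA smoothing):

```python
# Illustrative exposure-mask detection; threshold values are assumptions.
import torch

def exposure_masks(ldr: torch.Tensor, t_over: float = 0.95, t_under: float = 0.05):
    """ldr: (..., 3, H, W) in [0, 1]. Returns (overexposed, underexposed) boolean masks."""
    r, g, b = ldr[..., 0, :, :], ldr[..., 1, :, :], ldr[..., 2, :, :]
    luma = 0.2126 * r + 0.7152 * g + 0.0722 * b  # Rec.709 luminance
    return luma >= t_over, luma <= t_under
```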
To disable CFA and use standard attention (faster, slightly lower quality):
python inference.py \
... \
--no_cfa
Temporal windowing: Videos longer than 33 frames are processed in overlapping windows (33-frame windows with 8 frames of overlap by default). Blending is done in latent space before decoding. Adjust the overlap with --window_overlap N.
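A sketch of how the window starts can be derived is shown below; the latent-space blending weights themselves live in inference.py, so treat this as illustrative.

```python
# Illustrative computation of overlapping window starts (pixel-frame indices);
# overlapping latent regions are then blended (e.g., averaged) before decoding.
def window_starts(num_frames: int, window: int = 33, overlap: int = 8) -> list[int]:
    if num_frames <= window:
        return [0]
    stride = window - overlap
    starts = list(range(0, num_frames - window, stride))
    starts.append(num_frames - window)  # last window flush with the final frame
    return sorted(set(starts))

print(window_starts(81))  # [0, 25, 48] -> windows [0, 33), [25, 58), [48, 81)
```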
Inference writes two outputs to --output_dir:
output/
├── exr/
│ ├── frame_00000.exr # HDR output, half-float EXR, linear Rec.709
│ ├── frame_00001.exr
│ └── ...
└── preview.mp4 # 8-bit tone-mapped preview (Reinhard, per-video)
- EXR frames are in linear Rec.709 with values in [0, ∞). Small negative values from the VAE decoder are clamped to zero. Load them in Blender, Nuke, DaVinci Resolve, or any application that handles EXR.
- preview.mp4 uses global Reinhard tone mapping across all frames for temporal brightness consistency. It is for quick visual inspection only — not a ground-truth HDR representation.
Evaluation computes perceptual quality metrics on the predicted EXR frames.
Reference-based metrics (requires ground-truth EXR frames):
# SI-HDR benchmark (Table 1): HDR-VDP3, PU21-PIQE, FID
python evaluate.py \
--pred_dir output/exr/ \
--gt_dir groundtruth/exr/ \
--metrics all \
--benchmark sihdr
# Video benchmarks (Tables 2–3): FovVideoVDP + no-reference
python evaluate.py \
--pred_dir output/exr/ \
--gt_dir groundtruth/exr/ \
--metrics fovvideovdp,dover,musiq,clipiqa \
--benchmark cinematic
No-reference only (no ground truth needed):
python evaluate.py \
--pred_dir output/exr/ \
--metrics dover,musiq,clipiqaAvailable metrics:
| Metric | Type | Requires GT |
|---|---|---|
| hdrvdp3 | Image perceptual quality | Yes |
| pu21_piqe | Perceptual uniformity × PIQE | No |
| fid | Fréchet Inception Distance | Yes |
| fovvideovdp | Video perceptual quality | Yes |
| dover | No-reference video quality | No |
| musiq | Multi-scale image quality | No |
| clipiqa | CLIP-based image quality | No |
FID note: FID requires pre-saved PNG/JPEG files rather than EXR. Use pytorch-fid directly after tone-mapping:
# Tone-map EXR to PNG, then:
python -m pytorch_fid pred_png_dir/ gt_png_dir/
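A hypothetical tone-mapping pass (per-pixel Reinhard plus a simple gamma encode, assuming an OpenCV build with EXR support) could look like:

```python
# Hypothetical EXR -> PNG conversion for FID; adapt paths and tone mapping to your setup.
import glob, os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"
import cv2
import numpy as np

for path in glob.glob("output/exr/*.exr"):
    hdr = cv2.imread(path, cv2.IMREAD_UNCHANGED).astype(np.float32)  # linear Rec.709
    ldr = hdr / (1.0 + hdr)                                          # Reinhard tone mapping
    srgb = np.clip(ldr, 0.0, 1.0) ** (1.0 / 2.2)                     # gamma encode
    cv2.imwrite(path.replace(".exr", ".png"), (srgb * 255.0 + 0.5).astype(np.uint8))
```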
diffhdr/
├── configs/
│ └── train.yaml # training configuration
├── diffhdr/
│ ├── __init__.py
│ ├── color_mapping.py # Log-Gamma color mapping and sRGB↔linear helpers
│ ├── mask_detection.py # luminance-based exposure mask detection + EMA
│ ├── data_augmentation.py # LDR synthesis: exposure shift, noise, quantization
│ ├── dataset.py # HDRVideoDataset (EXR loading, latent cache)
│ ├── model.py # DiffHDRModel: Wan-2.1-VACE-14B + LoRA
│ ├── attention.py # CFA hooks (inference only)
│ ├── prompting.py # structured caption parsing and text encoding
│ └── losses.py # rectified flow-matching loss
├── scripts/
│ ├── generate_captions.py # one-time: Qwen3-VL caption generation
│ └── precompute_hdr_latents.py # one-time: HDR latent caching for speed
├── train.py # training script (accelerate / DDP)
├── inference.py # LDR video → HDR video
├── evaluate.py # evaluation metrics
├── train.slurm # SLURM job for training
├── precompute_latents.slurm # SLURM job for latent precomputation
└── requirements.txt
VAE is always float32 and always frozen. BF16 VAE decoding produces visible banding in smooth HDR gradients due to insufficient mantissa precision (Appendix C). VAE finetuning over-smooths latent representations and degrades downstream generation quality (Appendix B). Every VAE call is wrapped in torch.autocast(device_type='cuda', enabled=False) to force float32 even inside accelerate's BF16 context.
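In sketch form, assuming the diffusers-style encode API (the helper name is illustrative; the actual call sites are in train.py and inference.py):

```python
# Force float32 for the frozen VAE even inside accelerate's bf16 autocast context.
import torch

def vae_encode_fp32(vae, frames: torch.Tensor) -> torch.Tensor:
    with torch.autocast(device_type="cuda", enabled=False):
        return vae.encode(frames.float()).latent_dist.sample()
```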
LoRA targets only the main denoising branch. The VACE context branch (vace_blocks.*) remains fully frozen. LoRA targets are discovered at runtime by inspecting all nn.Linear modules — never hardcoded — to accommodate naming convention differences across diffusers versions.
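In sketch form (the keyword list and exact filtering are assumptions; the repo's discover_lora_targets() is the source of truth):

```python
# Illustrative runtime discovery of LoRA target modules; keywords are assumptions.
import torch.nn as nn

def discover_lora_targets(transformer: nn.Module) -> list[str]:
    keywords = ("to_q", "to_k", "to_v", "to_out.0", "ffn.net.0.proj", "ffn.net.2")
    targets = []
    for name, module in transformer.named_modules():
        if "vace_blocks" in name:  # VACE context branch stays fully frozen
            continue
        if isinstance(module, nn.Linear) and any(k in name for k in keywords):
            targets.append(name)
    return targets
```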
CFA is strictly inference-only. During training, a single text embedding from the full [Overexposed: ...]; [Underexposed: ...] caption is used. CFA's three-embedding decomposition (null, over, under) is applied only at inference. train.py asserts that no CFA hooks are registered before the training loop begins.
Log-Gamma hyperparameters are saved with every checkpoint. gamma and M are stored in color_mapping_config.json alongside the LoRA adapter files. Inference reads these values rather than relying on defaults, ensuring exact training/inference consistency.
Latent caching uses deterministic center-crop. When using precomputed HDR latents, the pixel frames at training time are center-cropped (not randomly cropped) to match the spatial region encoded into the cached latent. Random horizontal flip is applied consistently to both pixel frames and the latent.
"No trainable parameters" / assert pct > 0.0 fails
PEFT's get_peft_model handles parameter freezing internally. Do not call transformer.requires_grad_(False) after applying LoRA — this freezes the LoRA weights too, making training a no-op.
assert pct < 5.0 fails (trainable fraction > 5%)
LoRA is being applied to too many layers. Check discover_lora_targets() output in the logs. The VACE context branch (vace_blocks.*) must be excluded — verify the exclusion filter is working correctly.
ImportError: diffusers>=0.31.0 required or pipeline class not found
Upgrade diffusers: pip install -U 'diffusers>=0.37.0'. The pipeline class WanVACEPipeline was introduced in diffusers 0.37. To check what's available: python -c 'import diffusers; print([x for x in dir(diffusers) if "Wan" in x])'.
VACE context parameter not found (model runs without context conditioning)
The log line "VACE context parameter not found. Running DiT without context conditioning." indicates a diffusers version mismatch. The expected parameter name is control_hidden_states (diffusers ≥ 0.37). Check the parameter list in model._dit_forward() and add your version's name to ctx_param_candidates.
Training step time is > 120s on 4 × A100
This is expected without the latent cache. Run scripts/precompute_hdr_latents.py first. Without caching: ~168s/step → ~19 days for 10,000 steps. With caching: ~90–100s/step → ~10 days.
caption.txt missing for many sequences
Run scripts/generate_captions.py before training. The fallback caption [Overexposed: none]; [Underexposed: none] is used when caption.txt is absent, but text conditioning will not be meaningful for those sequences.
EXR frames all black or all white in preview
The preview uses Reinhard tone mapping. If the HDR output contains extreme values (very large or very small), the tone mapping key may need adjustment. Check the raw EXR values: linear HDR for a well-lit scene in Rec.709 should have most pixels in the range [0, 100] (with highlights up to ~1000).
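A quick way to inspect the raw value distribution (assuming an OpenCV build with EXR support):

```python
# Print percentile statistics of one output frame to spot degenerate value ranges.
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"
import cv2
import numpy as np

img = cv2.imread("output/exr/frame_00000.exr", cv2.IMREAD_UNCHANGED)
print("percentiles (1/50/99/99.9):", np.percentile(img, [1, 50, 99, 99.9]), "max:", float(img.max()))
```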
Cached latents produce spatial misalignment with LDR inputs
Ensure --gamma and --M in precompute_hdr_latents.py exactly match color_mapping.gamma and color_mapping.M in configs/train.yaml. Also verify --resolution H W matches training.resolution (in T H W order in the YAML, but H W order in the precompute script). If in doubt, delete the latents/ directories and recompute.