Add INT8 VAE decode engine by ryanontheinside · Pull Request #1 · daydreamlive/DEMON

ryanontheinside · 2026-05-01T20:57:12Z

Summary

Adds an INT8-quantized TensorRT (TRT) engine for the VAE decoder, built with TensorRT entropy calibration. It is 1.64x faster than fp16 on a standalone 60s decode (33ms vs 54ms), reaches 32.5 dB PSNR against the fp16 output, and costs +88 MB VRAM over fp16 (TRT retains kernels for both precisions).
Worth shipping for workloads where standalone VAE decode dominates (large vae_window, batch generation). For tight streaming with short vae_window and skip-cache reuse, the speedup dilutes to ~1.7% end-to-end because the diffusion decoder dominates the tick. See docs/int8_vae_decode.md for the full decision guide.
Full research narrative including hybrid + autotune negative results lives on branch archive/ryanontheinside/int8-research-2026-05-01.
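
The gap between the 1.64x standalone speedup and the ~1.7% end-to-end gain follows from Amdahl's law. The sketch below inverts that law to estimate what share of a streaming tick the VAE decode occupies. It assumes that "~1.7% end-to-end" means a 1.017x throughput speedup and that the standalone 1.64x ratio also holds inside the tick; neither is stated in this PR, and the function names are illustrative only.

```python
def end_to_end_speedup(fraction: float, component_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of the time gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


def accelerated_fraction(component_speedup: float, overall_speedup: float) -> float:
    """Invert Amdahl's law to recover the accelerated share of total time."""
    # 1 / S_total = (1 - p) + p / s   =>   p = (1 - 1/S_total) / (1 - 1/s)
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / component_speedup)


# Figures from the PR summary: 54 ms (fp16) vs 33 ms (int8) standalone decode.
vae_speedup = 54.0 / 33.0
# Assumption: "~1.7% end-to-end" is read as a 1.017x overall speedup.
vae_share = accelerated_fraction(vae_speedup, 1.017)
print(f"implied VAE share of a streaming tick: {vae_share:.1%}")
```

Under those assumptions the VAE decode accounts for only a few percent of the tick, which is consistent with the diffusion decoder dominating.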

What's in this PR

acestep/engine/trt/vae_decode_int8.py: INT8 builder with entropy/minmax calibrators and optional first/last conv pinning
scripts/build_vae_int8_engine.py: canonical CLI driver
scripts/collect_vae_calibration.py: regenerate calibration latents (32 prompt-diverse latents, 1500 frames each)
tests/benchmarks/bench_vae_int8_regular.py: regression bench harness (synthetic latent + wav round-trip)
demos/test_stream_cover_graph.py: adds --vae-decode-engine and --source-audio flags so streaming runs can A/B engines
docs/int8_vae_decode.md: full variant comparison report (matrix, streaming bench, VRAM analysis, decision guide, what-didn't-work)
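
For readers unfamiliar with the two calibrator types named above, the following is a minimal sketch of symmetric INT8 quantization with a minmax-derived scale. It is not the code in vae_decode_int8.py and not TensorRT's implementation; the helper names are hypothetical. A minmax calibrator takes the dynamic range from the largest absolute activation seen during calibration, while an entropy calibrator may choose a tighter clipping threshold by minimizing the KL divergence between the original and quantized distributions.

```python
def minmax_scale(samples, qmax=127):
    """Per-tensor scale so the largest |activation| maps to qmax."""
    amax = max(abs(x) for x in samples)
    return amax / qmax if amax > 0 else 1.0


def quantize(x, scale, qmax=127):
    """Round to the nearest INT8 step and clip to the symmetric range."""
    q = round(x / scale)
    return max(-qmax, min(qmax, q))


def dequantize(q, scale):
    """Map an INT8 code back to an approximate float value."""
    return q * scale


# Toy "calibration" activations; real calibration would use decoder latents.
calib = [-0.5, 0.25, 2.0]
scale = minmax_scale(calib)
codes = [quantize(x, scale) for x in calib]
restored = [dequantize(q, scale) for q in codes]
```

Values outside the calibrated range saturate at +/-127, which is why the choice of range (minmax vs entropy) trades clipping error against rounding error.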

Test plan

Generate calibration latents: uv run python scripts/collect_vae_calibration.py --frames 1500 --output-dir ~/.daydream-scope/models/demon/calibration/vae_latents_60s
Build engine: uv run python scripts/build_vae_int8_engine.py --engine-name vae_decode_int8_60s --opt-frames 1500 --max-frames 1500
Run bench: uv run python tests/benchmarks/bench_vae_int8_regular.py --int8-engine vae_decode_int8_60s --fp16-engine vae_decode_fp16_60s and confirm >=1.5x speedup, >=32 dB fp16-vs-int8 PSNR
Streaming sanity: uv run python demos/test_stream_cover_graph.py --vae-decode-engine vae_decode_int8_60s --vae-window 5 and confirm output is audibly comparable to the fp16 baseline run
A/B listen to bench_outputs/vae_int8/vae_decode_int8_60s/synth_trt_int8.wav vs synth_trt_fp16.wav (synthetic latent reveals the decoder difference; wav round-trip masks it because the encoder dominates)
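
The acceptance thresholds in the bench step (>=1.5x speedup, >=32 dB PSNR) can be expressed as a small gate. This is a hedged sketch, not the logic in bench_vae_int8_regular.py: the PR does not say how PSNR is computed, so the peak value of 1.0 (full-scale float audio) and the function names are assumptions.

```python
import math


def psnr_db(reference, candidate, peak=1.0):
    """PSNR in dB of `candidate` against `reference` (assumed peak = 1.0)."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("signals must be non-empty and equal length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


def passes_gate(fp16_ms, int8_ms, psnr, min_speedup=1.5, min_psnr_db=32.0):
    """True when both test-plan thresholds are met."""
    return fp16_ms / int8_ms >= min_speedup and psnr >= min_psnr_db


# Figures reported in this PR: 54 ms vs 33 ms, 32.5 dB.
ok = passes_gate(fp16_ms=54.0, int8_ms=33.0, psnr=32.5)
```

With the reported numbers the gate passes; a regression to, say, 40 ms int8 latency would fail the speedup check.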

Quantizes the ACE-Step Oobleck VAE decoder to INT8 via TensorRT's native PTQ flow. On RTX 5090, decoding 60s of audio takes ~33ms vs fp16's ~54ms (1.64x speedup) at a PSNR vs fp16 of 32.5 dB.

Worth shipping for workloads where standalone VAE decode dominates (large vae_window, batch generation). For tight streaming pipelines with short vae_window and skip-cache reuse, the speedup dilutes to ~1.7% end-to-end because the diffusion decoder dominates the tick. Costs +88 MB VRAM vs fp16 (TRT retains kernels for both precisions).

Components:

acestep/engine/trt/vae_decode_int8.py - INT8 builder with calibrator
scripts/build_vae_int8_engine.py - canonical CLI driver
scripts/collect_vae_calibration.py - regenerate calibration latents
tests/benchmarks/bench_vae_int8_regular.py - regression bench harness
demos/test_stream_cover_graph.py - --vae-decode-engine, --source-audio
docs/int8_vae_decode.md - full variant comparison + decision guide

Full research narrative (variants B-I, hybrid + autotune negative results, per-variant wavs and JSONs) preserved on branch archive/ryanontheinside/int8-research-2026-05-01.

ryanontheinside wants to merge 1 commit into main from ryanontheinside/feat/vae-decode-int8
