Add INT8 VAE decode engine #1

Open
ryanontheinside wants to merge 1 commit into main from ryanontheinside/feat/vae-decode-int8
@ryanontheinside
Collaborator

Summary

  • Adds an INT8-quantized TensorRT (TRT) engine for the VAE decoder, built with TensorRT entropy calibration. Standalone 60s decode is 1.64x faster than fp16 (33 ms vs 54 ms) at 32.5 dB PSNR relative to fp16, and costs +88 MB VRAM vs fp16 (TRT retains kernels for both precisions).
  • Worth shipping for workloads where standalone VAE decode dominates (large vae_window, batch generation). For tight streaming with short vae_window and skip-cache reuse, the speedup dilutes to ~1.7% end-to-end because the diffusion decoder dominates the tick. See docs/int8_vae_decode.md for the full decision guide.
  • Full research narrative including hybrid + autotune negative results lives on branch archive/ryanontheinside/int8-research-2026-05-01.
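
The gap between the 1.64x standalone speedup and the ~1.7% end-to-end gain is what Amdahl's law predicts when the accelerated stage is a small share of the tick. The sketch below is a back-of-envelope check, not a measurement: it assumes "~1.7% end-to-end" means an overall speedup of about 1.017x, and the function names are hypothetical.

```python
def end_to_end_speedup(fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: overall speedup when only `fraction` of the time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


def implied_fraction(overall: float, stage_speedup: float) -> float:
    """Invert Amdahl's law to recover the share of time spent in the accelerated stage."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / stage_speedup)


# Standalone figures from the summary: 54 ms (fp16) vs 33 ms (INT8).
VAE_SPEEDUP = 54.0 / 33.0

# Assumption: "~1.7% end-to-end" is read as an overall speedup of 1.017x.
share = implied_fraction(1.017, VAE_SPEEDUP)
print(f"VAE decode share of a streaming tick implied by these numbers: {share:.1%}")
```

Under that reading, VAE decode would account for only a few percent of each streaming tick, which is consistent with the diffusion decoder dominating.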

What's in this PR

  • acestep/engine/trt/vae_decode_int8.py: INT8 builder with entropy/minmax calibrators and optional first/last conv pinning
  • scripts/build_vae_int8_engine.py: canonical CLI driver
  • scripts/collect_vae_calibration.py: regenerate calibration latents (32 prompt-diverse latents, 1500 frames each)
  • tests/benchmarks/bench_vae_int8_regular.py: regression bench harness (synthetic latent + wav round-trip)
  • demos/test_stream_cover_graph.py: adds --vae-decode-engine and --source-audio flags so streaming runs can A/B engines
  • docs/int8_vae_decode.md: full variant comparison report (matrix, streaming bench, VRAM analysis, decision guide, what-didn't-work)
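
For readers unfamiliar with what a calibrator decides, the following is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization with a min-max scale. It is an illustration only, not the code in vae_decode_int8.py: the helper names and the sample activations are made up, and TensorRT performs this internally. An entropy calibrator differs in how it picks the clipping range (it chooses a threshold from activation histograms to minimize information loss rather than using the absolute maximum), but the quantize/dequantize step has the same shape.

```python
def minmax_scale(values):
    """Symmetric per-tensor scale: the largest magnitude maps to 127."""
    amax = max(abs(v) for v in values)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to the symmetric range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in quantized]


# Hypothetical activations, chosen only to make the arithmetic easy to follow.
acts = [0.02, -0.5, 0.31, 1.27, -0.9]
scale = minmax_scale(acts)
q = quantize(acts, scale)
recon = dequantize(q, scale)
max_err = max(abs(a - r) for a, r in zip(acts, recon))
```

With a min-max scale nothing is clipped, so the reconstruction error of every value is bounded by half a quantization step (`scale / 2`); a tighter clipping range trades clipping error on outliers for finer resolution elsewhere.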

Test plan

  • Generate calibration latents: uv run python scripts/collect_vae_calibration.py --frames 1500 --output-dir ~/.daydream-scope/models/demon/calibration/vae_latents_60s
  • Build engine: uv run python scripts/build_vae_int8_engine.py --engine-name vae_decode_int8_60s --opt-frames 1500 --max-frames 1500
  • Run bench: uv run python tests/benchmarks/bench_vae_int8_regular.py --int8-engine vae_decode_int8_60s --fp16-engine vae_decode_fp16_60s and confirm >=1.5x speedup, >=32 dB fp16-vs-int8 PSNR
  • Streaming sanity: uv run python demos/test_stream_cover_graph.py --vae-decode-engine vae_decode_int8_60s --vae-window 5 and confirm output is audibly comparable to the fp16 baseline run
  • A/B listen to bench_outputs/vae_int8/vae_decode_int8_60s/synth_trt_int8.wav vs synth_trt_fp16.wav (synthetic latent reveals the decoder difference; wav round-trip masks it because the encoder dominates)
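
The acceptance thresholds in the bench step can be expressed as a small check. This is a sketch under stated assumptions, not the logic of bench_vae_int8_regular.py: `psnr_db` and `passes_acceptance` are hypothetical names, and the peak value of 1.0 assumes audio normalized to [-1, 1], which the bench may or may not use.

```python
import math


def psnr_db(reference, test, peak=1.0):
    """PSNR in dB between two equal-length signals; `peak` is an assumed full-scale value."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def passes_acceptance(fp16_ms, int8_ms, psnr, min_speedup=1.5, min_psnr=32.0):
    """True when both test-plan thresholds hold: >=1.5x speedup and >=32 dB PSNR."""
    return fp16_ms / int8_ms >= min_speedup and psnr >= min_psnr


# Figures reported in the summary: 54 ms fp16, 33 ms INT8, 32.5 dB PSNR.
ok = passes_acceptance(54.0, 33.0, 32.5)
```

Here `psnr_db` would be applied to the fp16 and INT8 decoder outputs for the same latent, with the fp16 output as the reference.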

Quantizes the ACE-Step Oobleck VAE decoder to INT8 via TensorRT's native
PTQ flow. On RTX 5090, decoding 60s of audio takes ~33ms vs fp16's ~54ms
(1.64x speedup) at a PSNR vs fp16 of 32.5 dB.

Worth shipping for workloads where standalone VAE decode dominates (large
vae_window, batch generation). For tight streaming pipelines with
short vae_window and skip-cache reuse, the speedup dilutes to ~1.7%
end-to-end because the diffusion decoder dominates the tick. Costs
+88 MB VRAM vs fp16 (TRT retains kernels for both precisions).

Components:
  acestep/engine/trt/vae_decode_int8.py - INT8 builder with calibrator
  scripts/build_vae_int8_engine.py - canonical CLI driver
  scripts/collect_vae_calibration.py - regenerate calibration latents
  tests/benchmarks/bench_vae_int8_regular.py - regression bench harness
  demos/test_stream_cover_graph.py - --vae-decode-engine, --source-audio
  docs/int8_vae_decode.md - full variant comparison + decision guide

Full research narrative (variants B-I, hybrid + autotune negative results,
per-variant wavs and JSONs) preserved on branch
archive/ryanontheinside/int8-research-2026-05-01.