Add INT8 VAE decode engine #1
Open
ryanontheinside wants to merge 1 commit into
Quantizes the ACE-Step Oobleck VAE decoder to INT8 via TensorRT's native PTQ flow. On RTX 5090, decoding 60s of audio takes ~33ms vs fp16's ~54ms (1.64x speedup) at a PSNR vs fp16 of 32.5 dB.

Worth shipping for workloads where standalone VAE decode dominates (large vae_window, batch generation). For tight streaming pipelines with short vae_window and skip-cache reuse, the speedup dilutes to ~1.7% end-to-end because the diffusion decoder dominates the tick. Costs +88 MB VRAM vs fp16 (TRT retains kernels for both precisions).

Components:
acestep/engine/trt/vae_decode_int8.py - INT8 builder with calibrator
scripts/build_vae_int8_engine.py - canonical CLI driver
scripts/collect_vae_calibration.py - regenerate calibration latents
tests/benchmarks/bench_vae_int8_regular.py - regression bench harness
demos/test_stream_cover_graph.py - --vae-decode-engine, --source-audio
docs/int8_vae_decode.md - full variant comparison + decision guide

Full research narrative (variants B-I, hybrid + autotune negative results, per-variant wavs and JSONs) preserved on branch archive/ryanontheinside/int8-research-2026-05-01.
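The gap between the 1.64x standalone speedup and the ~1.7% end-to-end gain follows the usual Amdahl's-law pattern. The sketch below is a rough back-of-the-envelope check, not something taken from the PR's bench harness. It assumes the 1.7% figure is a reduction in tick time and that the standalone speedup carries over unchanged to short windows. The function names are hypothetical.

```python
def end_to_end_saving(vae_fraction: float, vae_speedup: float) -> float:
    """Fraction of total tick time saved when only the VAE decode gets faster."""
    return vae_fraction * (1.0 - 1.0 / vae_speedup)


def implied_vae_fraction(saving: float, vae_speedup: float) -> float:
    """Invert the relation: share of the tick the VAE decode must occupy."""
    return saving / (1.0 - 1.0 / vae_speedup)


# Standalone 60 s decode timings quoted in the commit message.
fp16_ms, int8_ms = 54.0, 33.0
speedup = fp16_ms / int8_ms

# Assumption: 1.7% is a tick-time reduction and the speedup is window-independent.
vae_share = implied_vae_fraction(0.017, speedup)
print(f"speedup={speedup:.2f}x, implied VAE share of tick={vae_share:.1%}")
```

Under those assumptions the VAE decode would account for only a few percent of each streaming tick, which is consistent with the statement that the diffusion decoder dominates.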
Summary
- Worth shipping for workloads where standalone VAE decode dominates (large `vae_window`, batch generation). For tight streaming with short `vae_window` and skip-cache reuse, the speedup dilutes to ~1.7% end-to-end because the diffusion decoder dominates the tick. See `docs/int8_vae_decode.md` for the full decision guide.
- Full research narrative preserved on branch `archive/ryanontheinside/int8-research-2026-05-01`.

What's in this PR
- `acestep/engine/trt/vae_decode_int8.py`: INT8 builder with entropy/minmax calibrators and optional first/last conv pinning
- `scripts/build_vae_int8_engine.py`: canonical CLI driver
- `scripts/collect_vae_calibration.py`: regenerate calibration latents (32 prompt-diverse, 1500-frame each)
- `tests/benchmarks/bench_vae_int8_regular.py`: regression bench harness (synthetic latent + wav round-trip)
- `demos/test_stream_cover_graph.py`: adds `--vae-decode-engine` and `--source-audio` flags so streaming runs can A/B engines
- `docs/int8_vae_decode.md`: full variant comparison report (matrix, streaming bench, VRAM analysis, decision guide, what-didn't-work)

Test plan
- `uv run python scripts/collect_vae_calibration.py --frames 1500 --output-dir ~/.daydream-scope/models/demon/calibration/vae_latents_60s`
- `uv run python scripts/build_vae_int8_engine.py --engine-name vae_decode_int8_60s --opt-frames 1500 --max-frames 1500`
- `uv run python tests/benchmarks/bench_vae_int8_regular.py --int8-engine vae_decode_int8_60s --fp16-engine vae_decode_fp16_60s` and confirm >=1.5x speedup, >=32 dB fp16-vs-int8 PSNR
- `uv run python demos/test_stream_cover_graph.py --vae-decode-engine vae_decode_int8_60s --vae-window 5` and confirm output is audibly comparable to the fp16 baseline run
- `bench_outputs/vae_int8/vae_decode_int8_60s/synth_trt_int8.wav` vs `synth_trt_fp16.wav` (synthetic latent reveals the decoder difference; wav round-trip masks it because the encoder dominates)
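For reviewers who want to sanity-check the acceptance thresholds by hand, here is a minimal sketch of a PSNR calculation and a pass/fail gate. It is not the code in `bench_vae_int8_regular.py`: the helper names `psnr_db` and `passes_gate` are hypothetical, and taking the fp16 signal's peak absolute value as the PSNR reference level is an assumption (the harness may normalize differently). It uses plain Python sequences so it runs anywhere. For real waveforms a vectorized NumPy version would be the natural choice.

```python
import math
from typing import Sequence


def psnr_db(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """PSNR in dB of candidate against reference.

    Assumption: the peak is the reference's maximum absolute sample value.
    """
    if len(reference) == 0 or len(reference) != len(candidate):
        raise ValueError("signals must be non-empty and of equal length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    peak = max(abs(r) for r in reference)
    if peak == 0.0:
        raise ValueError("reference signal is silent; PSNR is undefined")
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


def passes_gate(fp16_ms: float, int8_ms: float, psnr: float,
                min_speedup: float = 1.5, min_psnr_db: float = 32.0) -> bool:
    """Apply the test-plan thresholds: >=1.5x speedup and >=32 dB PSNR."""
    return fp16_ms / int8_ms >= min_speedup and psnr >= min_psnr_db


# Figures reported in this PR: ~54 ms fp16, ~33 ms INT8, 32.5 dB PSNR.
print(passes_gate(fp16_ms=54.0, int8_ms=33.0, psnr=32.5))
```

In practice `reference` would hold the samples decoded by the fp16 engine and `candidate` those from the INT8 engine for the same latent.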