
feat: FP8 quantization & TensorRT build infrastructure#6

Open
forkni wants to merge 9 commits into pr3/ipadapter-vram-deps from pr2/fp8-tensorrt-build

Conversation


forkni commented Apr 4, 2026

Summary

  • Adds complete FP8 quantization pipeline via nvidia-modelopt: ONNX export → FP16 optimize → FP8 Q/DQ annotation → TRT STRONGLY_TYPED engine
  • Adds fp8=True parameter flow from StreamDiffusionWrapper.__init__() through _load_model(), compile_unet(), and EngineBuilder.build()
  • Engine path gains --fp8 suffix for separate cache (e.g. ...--controlnet--fp8--mode-img2img/unet.engine)
  • Two-pass cleanup in builder.py prevents ~14 GB intermediate file bloat on Windows (ONNX weights + onnx__* tensor files)
  • Fixes 4 TRT 10.12 compatibility bugs in FP8 build path (STRONGLY_TYPED, version-aware precision flags)
  • Patches ByteSize() for >2GB ONNX models in modelopt calibration (required for SDXL UNet)
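The engine-path cache separation described above can be sketched as follows; the helper name and path layout are illustrative (the real `get_engine_path()` lives in the wrapper/engine manager), but the `--fp8` suffix rule matches the summary:

```python
import os

def get_engine_path(base_dir: str, model_tag: str, mode: str, fp8: bool = False) -> str:
    """Build a UNet engine cache path; fp8=True adds a --fp8 suffix so
    FP8 and FP16 engines never collide in the cache."""
    parts = [model_tag]
    if fp8:
        parts.append("fp8")  # separate cache directory for FP8 engines
    parts.append(f"mode-{mode}")
    return os.path.join(base_dir, "--".join(parts), "unet.engine")

# e.g. get_engine_path("engines", "sd15--controlnet", "img2img", fp8=True)
# yields a path containing "sd15--controlnet--fp8--mode-img2img"
```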

Stacking

Stacks on pr3/ipadapter-vram-deps (which stacks on pr1/inference-performance). PR3 provides onnx/onnxruntime/modelopt dependency pins that FP8 imports require.

Commits (8)

  1. 61bcf86 — patch ByteSize() for >2GB ONNX in modelopt FP8 quantization
  2. 7a02ae2 — reduce FP8 calibration batches 128→8 (KVO cache OOM)
  3. 402d619 — merge calibration list-of-dicts into stacked dict for modelopt
  4. 7c05f59 — add NVIDIA DLLs to PATH and retry without quantize_mha on ORT EP failure
  5. 5ba15af — use single calibration batch, cleanup intermediates on retry
  6. 6b0b99d — resolve 4 FP8 bugs for TRT 10.12 (STRONGLY_TYPED network, version-aware flags)
  7. 24da142 — prevent intermediate file bloat on Windows (two-pass cleanup, onnx__* early deletion)
  8. 670aec4 — add FP8 parameter flow: wrapper → compile_unet → engine path prefix
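Commit 402d619's calibration reshaping can be sketched like this; the function name is hypothetical, but the transform (list of per-batch feed dicts into one dict of batch-stacked arrays) is the shape modelopt's calibrator consumes:

```python
import numpy as np

def stack_calibration_batches(batches):
    """Merge a list of per-batch feed dicts into a single dict of arrays
    concatenated along the batch axis."""
    keys = batches[0].keys()
    return {k: np.concatenate([b[k] for b in batches], axis=0) for k in keys}

# Four batches of 2 samples each become one stacked array of 8 samples:
batches = [{"sample": np.zeros((2, 4, 64, 64), np.float16)} for _ in range(4)]
merged = stack_calibration_batches(batches)
# merged["sample"].shape == (8, 4, 64, 64)
```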

Files Modified

| File | Changes |
| --- | --- |
| `acceleration/tensorrt/fp8_quantize.py` | New file (464 lines): calibration data generation, `quantize_onnx_fp8()` |
| `acceleration/tensorrt/builder.py` | FP8 build stage, two-pass cleanup, build stats ordering fix |
| `acceleration/tensorrt/utilities.py` | `_build_fp8()` raw TRT builder, `onnx__*` early cleanup, external data support |
| `acceleration/tensorrt/models/models.py` | `get_dynamic_axes()` output axes for FP8 compatibility |
| `acceleration/tensorrt/__init__.py` | `compile_unet()` extracts `fp8`/`calibration_data_fn` from `engine_build_options` |
| `acceleration/tensorrt/engine_manager.py` | `--fp8` suffix in UNet engine path |
| `wrapper.py` | `fp8=` param, `self.fp8`, passthrough to `_load_model()` and `get_engine_path()` |

Test plan

  • FP16 baseline engine builds without --fp8 suffix (no regression)
  • StreamDiffusionWrapper(fp8=True) generates engine path with --fp8 suffix
  • FP8 ONNX (*.fp8.onnx) preserved after build; intermediates cleaned
  • Engine directory contains only unet.engine, unet.fp8.onnx, build_stats.json
  • FP16 fallback triggers if modelopt not installed (graceful degradation)
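The graceful-degradation case in the last test-plan item boils down to an import probe; a minimal sketch (the function name and call site are hypothetical, and the real quantization entry point is elided):

```python
import logging

def quantize_or_fallback(onnx_path: str) -> str:
    """Return the path of an FP8-quantized ONNX model when nvidia-modelopt is
    importable; otherwise warn and fall back to the FP16 model unchanged."""
    try:
        import modelopt.onnx.quantization  # noqa: F401  (optional dependency)
    except ImportError:
        logging.warning("nvidia-modelopt not installed; falling back to FP16 engine build")
        return onnx_path
    # ... invoke the actual FP8 quantization here ...
    return onnx_path.replace(".onnx", ".fp8.onnx")
```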

🤖 Generated with Claude Code

INTER-NYC and others added 9 commits April 4, 2026 01:37
…ngine build

- Bug 1: KVO cache batch dim mismatch (kvo_calib_batch=2 vs sample=4)
  Set kvo_calib_batch=effective_batch to match ONNX shared axis '2B'
- Bug 2: BuilderFlag.STRONGLY_TYPED removed in TRT 10.12
  Guard with hasattr() fallback
- Bug 3: Precision flags (FP8/FP16/TF32) incompatible with STRONGLY_TYPED
  Skip precision flags when STRONGLY_TYPED is network-level only
- Bug 4: ModelOpt override_shapes bakes static dims into FP8 ONNX
  Add _restore_dynamic_axes() to restore dim_param after quantization
- Fix IHostMemory.nbytes (no len()) in TRT 10.12 engine save logging
- Default disable_mha_qdq=True (MHA stays FP16, 17min vs 3hr+ build)
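Bugs 2 and 3 above are both instances of version-aware feature detection. A TensorRT-free sketch of the pattern (the stub class below stands in for `tensorrt.BuilderFlag`; the real build path applies the same `hasattr()` guard to the live enum):

```python
def select_precision_flags(builder_flags, strongly_typed: bool):
    """Pick builder precision flags defensively: skip them entirely when the
    network is STRONGLY_TYPED (they conflict with it), and request only the
    flags this TensorRT version actually exposes (hasattr guard)."""
    if strongly_typed:
        return []  # precision comes from Q/DQ node types, not builder flags
    wanted = ("FP8", "FP16", "TF32")
    return [getattr(builder_flags, name) for name in wanted
            if hasattr(builder_flags, name)]

# Stub standing in for tensorrt.BuilderFlag on a release without FP8:
class FakeBuilderFlag:
    FP16 = 1
    TF32 = 2

select_precision_flags(FakeBuilderFlag, strongly_typed=True)   # -> []
select_precision_flags(FakeBuilderFlag, strongly_typed=False)  # -> [1, 2]
```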

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move build_stats.json write before cleanup to prevent accidental deletion.
Add two-pass cleanup with gc.collect() between passes to release Python-held
file handles that cause Windows lock failures. Delete onnx__* tensor files
immediately after repacking into weights.pb during ONNX export (~4 GB freed
before quantize stage starts). Adds actionable warning with manual cleanup
instructions when file locks persist.

Root cause: builder.py cleanup ran os.remove() once with silent except OSError,
leaving ~14.5 GB of intermediates (onnx_data, weights.pb, onnx__* tensors,
model weight dumps) when Windows file locks prevented deletion.
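A minimal sketch of the two-pass cleanup described in this commit message (the helper name is illustrative; the real logic lives in builder.py and covers more file patterns):

```python
import gc
import logging
import os

def two_pass_cleanup(paths):
    """Delete intermediate build files in two passes, calling gc.collect()
    between them so Python-held file handles (a common Windows lock cause)
    are released before the retry. Files that still resist deletion are
    reported with a warning instead of being silently left behind."""
    remaining = list(paths)
    for _attempt in range(2):
        still_locked = []
        for path in remaining:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # already gone, nothing to do
            except OSError:
                still_locked.append(path)  # likely a Windows file lock
        if not still_locked:
            return []
        remaining = still_locked
        gc.collect()  # drop lingering references before the second pass
    logging.warning("Could not delete %d intermediate file(s); remove manually: %s",
                    len(remaining), remaining)
    return remaining
```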

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ne path

Extract FP8-specific changes from 00cf0c7 (without reformatting).
Adds fp8 parameter flow from StreamDiffusionWrapper through to
engine compilation with calibration data callback and --fp8 engine
path suffix for cache separation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
