
feat: FP8 quantization & TensorRT build infrastructure#6

Open
forkni wants to merge 9 commits into pr3/ipadapter-vram-deps from pr2/fp8-tensorrt-build

Conversation


forkni commented Apr 4, 2026

Summary

  • Adds complete FP8 quantization pipeline via nvidia-modelopt: ONNX export → FP16 optimize → FP8 Q/DQ annotation → TRT STRONGLY_TYPED engine
  • Adds fp8=True parameter flow from StreamDiffusionWrapper.__init__() through _load_model(), compile_unet(), and EngineBuilder.build()
  • Engine path gains --fp8 suffix for separate cache (e.g. ...--controlnet--fp8--mode-img2img/unet.engine)
  • Two-pass cleanup in builder.py prevents ~14 GB intermediate file bloat on Windows (ONNX weights + onnx__* tensor files)
  • Fixes 4 TRT 10.12 compatibility bugs in FP8 build path (STRONGLY_TYPED, version-aware precision flags)
  • Patches ByteSize() for >2GB ONNX models in modelopt calibration (required for SDXL UNet)
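The engine-path cache separation described above can be sketched as follows; the helper name and path layout are illustrative (the real `get_engine_path()` lives in the wrapper/engine manager), but the `--fp8` suffix rule matches the summary:

```python
import os

def get_engine_path(base_dir: str, model_tag: str, mode: str, fp8: bool = False) -> str:
    """Build a UNet engine cache path; fp8=True adds a --fp8 suffix so
    FP8 and FP16 engines never collide in the cache."""
    parts = [model_tag]
    if fp8:
        parts.append("fp8")  # separate cache directory for FP8 engines
    parts.append(f"mode-{mode}")
    return os.path.join(base_dir, "--".join(parts), "unet.engine")

# e.g. get_engine_path("engines", "sd15--controlnet", "img2img", fp8=True)
# yields a path containing "sd15--controlnet--fp8--mode-img2img"
```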

Stacking

Stacks on pr3/ipadapter-vram-deps (which stacks on pr1/inference-performance). PR3 provides onnx/onnxruntime/modelopt dependency pins that FP8 imports require.

Commits (8)

  1. 61bcf86 — patch ByteSize() for >2GB ONNX in modelopt FP8 quantization
  2. 7a02ae2 — reduce FP8 calibration batches 128→8 (KVO cache OOM)
  3. 402d619 — merge calibration list-of-dicts into stacked dict for modelopt
  4. 7c05f59 — add NVIDIA DLLs to PATH and retry without quantize_mha on ORT EP failure
  5. 5ba15af — use single calibration batch, cleanup intermediates on retry
  6. 6b0b99d — resolve 4 FP8 bugs for TRT 10.12 (STRONGLY_TYPED network, version-aware flags)
  7. 24da142 — prevent intermediate file bloat on Windows (two-pass cleanup, onnx__* early deletion)
  8. 670aec4 — add FP8 parameter flow: wrapper → compile_unet → engine path prefix
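Commit 402d619's calibration reshaping can be sketched like this; the function name is hypothetical, but the transform (list of per-batch feed dicts into one dict of batch-stacked arrays) is the shape modelopt's calibrator consumes:

```python
import numpy as np

def stack_calibration_batches(batches):
    """Merge a list of per-batch feed dicts into a single dict of arrays
    concatenated along the batch axis."""
    keys = batches[0].keys()
    return {k: np.concatenate([b[k] for b in batches], axis=0) for k in keys}

# Four batches of 2 samples each become one stacked array of 8 samples:
batches = [{"sample": np.zeros((2, 4, 64, 64), np.float16)} for _ in range(4)]
merged = stack_calibration_batches(batches)
# merged["sample"].shape == (8, 4, 64, 64)
```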

Files Modified

| File | Changes |
| --- | --- |
| `acceleration/tensorrt/fp8_quantize.py` | New file (464 lines): calibration data generation, `quantize_onnx_fp8()` |
| `acceleration/tensorrt/builder.py` | FP8 build stage, two-pass cleanup, build stats ordering fix |
| `acceleration/tensorrt/utilities.py` | `_build_fp8()` raw TRT builder, `onnx__*` early cleanup, external data support |
| `acceleration/tensorrt/models/models.py` | `get_dynamic_axes()` output axes for FP8 compatibility |
| `acceleration/tensorrt/__init__.py` | `compile_unet()` extracts `fp8`/`calibration_data_fn` from `engine_build_options` |
| `acceleration/tensorrt/engine_manager.py` | `--fp8` suffix in UNet engine path |
| `wrapper.py` | `fp8=` param, `self.fp8`, passthrough to `_load_model()` and `get_engine_path()` |

Test plan

  • FP16 baseline engine builds without --fp8 suffix (no regression)
  • StreamDiffusionWrapper(fp8=True) generates engine path with --fp8 suffix
  • FP8 ONNX (*.fp8.onnx) preserved after build; intermediates cleaned
  • Engine directory contains only unet.engine, unet.fp8.onnx, build_stats.json
  • FP16 fallback triggers if modelopt not installed (graceful degradation)
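The graceful-degradation case in the last test-plan item boils down to an import probe; a minimal sketch (the function name and call site are hypothetical, and the real quantization entry point is elided):

```python
import logging

def quantize_or_fallback(onnx_path: str) -> str:
    """Return the path of an FP8-quantized ONNX model when nvidia-modelopt is
    importable; otherwise warn and fall back to the FP16 model unchanged."""
    try:
        import modelopt.onnx.quantization  # noqa: F401  (optional dependency)
    except ImportError:
        logging.warning("nvidia-modelopt not installed; falling back to FP16 engine build")
        return onnx_path
    # ... invoke the actual FP8 quantization here ...
    return onnx_path.replace(".onnx", ".fp8.onnx")
```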

🤖 Generated with Claude Code

INTER-NYC and others added 9 commits April 4, 2026 01:37
…ngine build

- Bug 1: KVO cache batch dim mismatch (kvo_calib_batch=2 vs sample=4)
  Set kvo_calib_batch=effective_batch to match ONNX shared axis '2B'
- Bug 2: BuilderFlag.STRONGLY_TYPED removed in TRT 10.12
  Guard with hasattr() fallback
- Bug 3: Precision flags (FP8/FP16/TF32) incompatible with STRONGLY_TYPED
  Skip precision flags when STRONGLY_TYPED is network-level only
- Bug 4: ModelOpt override_shapes bakes static dims into FP8 ONNX
  Add _restore_dynamic_axes() to restore dim_param after quantization
- Fix IHostMemory.nbytes (no len()) in TRT 10.12 engine save logging
- Default disable_mha_qdq=True (MHA stays FP16, 17min vs 3hr+ build)
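Bugs 2 and 3 above are both instances of version-aware feature detection. A TensorRT-free sketch of the pattern (the stub class below stands in for `tensorrt.BuilderFlag`; the real build path applies the same `hasattr()` guard to the live enum):

```python
def select_precision_flags(builder_flags, strongly_typed: bool):
    """Pick builder precision flags defensively: skip them entirely when the
    network is STRONGLY_TYPED (they conflict with it), and request only the
    flags this TensorRT version actually exposes (hasattr guard)."""
    if strongly_typed:
        return []  # precision comes from Q/DQ node types, not builder flags
    wanted = ("FP8", "FP16", "TF32")
    return [getattr(builder_flags, name) for name in wanted
            if hasattr(builder_flags, name)]

# Stub standing in for tensorrt.BuilderFlag on a release without FP8:
class FakeBuilderFlag:
    FP16 = 1
    TF32 = 2

select_precision_flags(FakeBuilderFlag, strongly_typed=True)   # -> []
select_precision_flags(FakeBuilderFlag, strongly_typed=False)  # -> [1, 2]
```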

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move build_stats.json write before cleanup to prevent accidental deletion.
Add two-pass cleanup with gc.collect() between passes to release Python-held
file handles that cause Windows lock failures. Delete onnx__* tensor files
immediately after repacking into weights.pb during ONNX export (~4 GB freed
before quantize stage starts). Adds actionable warning with manual cleanup
instructions when file locks persist.

Root cause: builder.py cleanup ran os.remove() once with silent except OSError,
leaving ~14.5 GB of intermediates (onnx_data, weights.pb, onnx__* tensors,
model weight dumps) when Windows file locks prevented deletion.
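A minimal sketch of the two-pass cleanup described in this commit message (the helper name is illustrative; the real logic lives in builder.py and covers more file patterns):

```python
import gc
import logging
import os

def two_pass_cleanup(paths):
    """Delete intermediate build files in two passes, calling gc.collect()
    between them so Python-held file handles (a common Windows lock cause)
    are released before the retry. Files that still resist deletion are
    reported with a warning instead of being silently left behind."""
    remaining = list(paths)
    for _attempt in range(2):
        still_locked = []
        for path in remaining:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # already gone, nothing to do
            except OSError:
                still_locked.append(path)  # likely a Windows file lock
        if not still_locked:
            return []
        remaining = still_locked
        gc.collect()  # drop lingering references before the second pass
    logging.warning("Could not delete %d intermediate file(s); remove manually: %s",
                    len(remaining), remaining)
    return remaining
```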

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ne path

Extract FP8-specific changes from 00cf0c7 (without reformatting).
Adds fp8 parameter flow from StreamDiffusionWrapper through to
engine compilation with calibration data callback and --fp8 engine
path suffix for cache separation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
