
perf: Inference performance & pipeline correctness #4

Open

forkni wants to merge 11 commits into SDTD_031_dev from pr1/inference-performance

Conversation

Collaborator

@forkni forkni commented Apr 3, 2026

Summary

A systematic optimization pass that eliminates all per-frame CUDA memory allocations in the inference hot path, removes unnecessary GPU synchronization, and fixes pipeline precision and observability issues.

Key Changes

  • Zero per-frame CUDA allocations: Pre-allocated reusable buffers (_latent_cache, _noise_buf, _image_decode_buf, _prev_image_buf) with .copy_() / .normal_() replacing all .clone() / torch.randn_like() calls
  • O(1) KVO cache updates: Circular buffer write replacing O(n) shift+clone (attention reads all slots as unordered K/V bag)
  • Removed unnecessary GPU sync: ControlNet/UNet stream.synchronize() removed (same-stream ordering guarantees), global torch.cuda.synchronize() replaced with event-scoped sync
  • L2 cache persistence: New cuda_l2_cache.py module pins UNet attention weights in GPU L2 cache (Ampere+, env-gated via SDTD_L2_PERSIST)
  • Precision fixes: VAE autocast respects self.dtype instead of hardcoded float16; scheduler division upcast to float32 at early timesteps
  • FPS observability: Separate inference FPS vs output FPS when similar image filter inflates frame counts
  • TRT build fix: SDXL engine build on Windows — external data detection, ByteSize overflow guard, file-lock tolerance
  • Compatibility: cuda-python 13.x import path fix, cached engine IO metadata
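The circular-buffer update in the second bullet can be sketched in plain Python (`CircularKVCache` and its fields are illustrative stand-ins; the real code writes tensors into a pre-allocated GPU buffer):

```python
class CircularKVCache:
    """Fixed-capacity cache written as a ring: each update overwrites one
    slot in place instead of shifting every entry (the O(n) shift+clone
    this PR removes). Attention reads all slots as an unordered K/V bag,
    so slot order does not matter."""

    def __init__(self, capacity):
        self.slots = [None] * capacity  # pre-allocated once, reused forever
        self._head = 0

    def update(self, kv):
        self.slots[self._head] = kv  # O(1) in-place write
        self._head = (self._head + 1) % len(self.slots)

    def read_all(self):
        return [s for s in self.slots if s is not None]
```

With capacity 3, a fourth update simply overwrites the oldest slot; no entry is ever copied or moved.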

Files Modified

| File | Changes |
| --- | --- |
| pipeline.py | Buffer pre-allocation, precision fixes, FPS tracking |
| cuda_l2_cache.py | New (386 lines) — L2 cache persistence |
| wrapper.py | L2 cache init hook |
| unet_engine.py | Pre-allocated dicts, lazy KVO names |
| controlnet_engine.py | Removed redundant stream sync |
| utilities.py | Cached _allowed_inputs, event sync, ByteSize fix |
| builder.py | SDXL Windows build hardening |
| td_manager.py | Separate inference/output FPS |

Impact

| Metric | Before | After |
| --- | --- | --- |
| Per-frame CUDA mallocs | 4-6 | 0 |
| KVO cache update | O(n) shift+clone | O(1) circular write |
| GPU sync | Global (all streams) | Event-scoped |
| VAE autocast | Hardcoded fp16 | Respects pipeline dtype |
| FPS reporting | Inflated by cached frames | Separate inference vs output |

Test plan

  • Run img2img inference at 512x512 for 60s, verify dual FPS display
  • Run with SDTD_L2_PERSIST=0, confirm pipeline still runs
  • Run with bfloat16 pipeline, verify no VAE dtype errors
  • Build TRT engine for SDXL on Windows, verify success
  • Run on cuda-python 13.x, verify cudart import

🤖 Generated with Claude Code

dotsimulate and others added 9 commits April 2, 2026 21:59
- Fix external data detection in optimize_onnx to check .data/.onnx.data extensions (not just .pb)
- Handle torch.onnx.export creating external sidecar files with non-.pb names for >2GB SDXL models
- Normalize all external data to weights.pb for consistent downstream handling
- Add ByteSize check before single-file ONNX save to prevent silent >2GB serialization failure
- Add pre-build verification: check .opt.onnx exists and is non-empty before TRT engine build
- Tolerate Windows file-lock failures during post-build ONNX cleanup instead of crashing
- Add diagnostic logging for file sizes throughout export/optimize/build pipeline
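The ByteSize guard above exists because protobuf serializes a message into a single buffer capped at 2 GiB, so a larger ONNX model fails (sometimes silently) when saved as one file. The decision logic can be sketched as follows; the function name is illustrative, and the actual save calls are only referenced in comments:

```python
ONNX_SINGLE_FILE_LIMIT = 2**31 - 1  # protobuf's 2 GiB message cap

def needs_external_data(model_byte_size):
    """Return True when the model must spill weights to a sidecar file.

    model_byte_size is what model.ByteSize() reports on the protobuf.
    When True, the commit saves with external data (e.g. onnx.save with
    save_as_external_data=True) and normalizes the sidecar to weights.pb;
    otherwise a plain single-file save is safe."""
    return model_byte_size > ONNX_SINGLE_FILE_LIMIT
```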
Adds .clone() immediately after VAE decode in the __call__ (img2img) and txt2img
inference paths. Prevents the TRT VAE output buffer from being silently reused on
the next decode call while prev_image_result is still read downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- utilities.py: clean allocate_buffers, simplified ONNX external data
  handling with ByteSize() check, simplified optimize_onnx with .pb
  extension detection
- postprocessing_orchestrator.py: preserve HEAD docstring for
  _should_use_sync_processing (correctly describes temporal coherence
  and feedback loop behavior)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In cuda-python 13.x, the 'cudart' module was moved to 'cuda.bindings.runtime'.
Add try/except import that prefers the new location and falls back to the
legacy 'cuda.cudart' path for cuda-python 12.x compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
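The try/except pattern above can be generalized into a small helper (the name `import_first` is hypothetical; it is built on stdlib `importlib` so the sketch is self-contained):

```python
import importlib

def import_first(*candidates):
    """Import and return the first available module from dotted paths,
    trying them in order and raising ImportError if none resolve."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate importable: " + "; ".join(errors))

# Usage matching the commit (module paths taken from the commit message):
# cudart = import_first("cuda.bindings.runtime", "cuda.cudart")
```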
…tence

Pre-allocate latent and noise buffers to eliminate per-frame CUDA malloc:
- Replace prev_latent_result = x_0_pred_out.clone() with lazy-allocated
  _latent_cache buffer + copy_() in __call__, txt2img, and txt2img_sd_turbo
- Replace torch.randn_like() in TCD non-batched noise loop with lazy-allocated
  _noise_buf + .normal_() — eliminates per-step allocation on TCD path
- Both buffers allocate on first use (shape is fixed per pipeline instance)

Port cuda_l2_cache.py from CUDA 0.2.99 fork (PLAN_5 Feature 2):
- New file: src/streamdiffusion/tools/cuda_l2_cache.py
- Reserves GPU L2 cache for UNet attention weight tensors (mid_block, up_blocks.1)
- Gated by SDTD_L2_PERSIST=1 env var (default on), requires Ampere+ GPU
- Integrated at end of wrapper._load_model() with silent fallback on failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
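The allocate-on-first-use pattern from this commit can be illustrated with a pure-Python stand-in (the real code holds torch tensors and writes with .copy_() / .normal_(), but the allocate-once / write-in-place shape is the same; the class name and counter are for this demo only):

```python
class FrameBuffers:
    """Stand-in for the lazily allocated _latent_cache buffer."""

    def __init__(self):
        self._latent_cache = None  # allocated on first use; shape is fixed
        self.allocations = 0       # demo-only counter of real allocations

    def store_latent(self, values):
        if self._latent_cache is None:
            self._latent_cache = [0.0] * len(values)  # one-time allocation
            self.allocations += 1
        self._latent_cache[:] = values  # in-place write, like tensor.copy_()
        return self._latent_cache
```

Every frame after the first reuses the same buffer, which is exactly how the per-frame CUDA malloc count drops to zero.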
…division

- encode_image / decode_image: replace hardcoded torch.float16 autocast with
  self.dtype so the pipeline correctly honors the torch_dtype constructor param
  (e.g. bfloat16 would still get fp16 VAE without this fix)

- scheduler_step_batch: upcast numerator and alpha_prod_t_sqrt to float32
  before the F_theta division, then cast back to original dtype. When
  alpha_prod_t_sqrt is small (early timesteps), fp16 division can accumulate
  rounding error; fp32 upcast eliminates this at negligible cost (~1-3us/call).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
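The upcast-divide-downcast step can be sketched with NumPy as a stand-in for the torch tensors (function and variable names are illustrative, not the scheduler's actual identifiers):

```python
import numpy as np

def upcast_div(numerator, denominator):
    """Divide in float32, then cast back to the input dtype, so a small
    denominator (alpha_prod_t_sqrt at early timesteps) does not amplify
    fp16 rounding error while downstream buffers keep pipeline precision."""
    out32 = numerator.astype(np.float32) / denominator.astype(np.float32)
    return out32.astype(numerator.dtype)
```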
setup_l2_persistence() calls pin_hot_unet_weights(persist_mb=0) after
reserving L2. But pin_hot_unet_weights unconditionally called
reserve_l2_persisting_cache(0), which set the persisting L2 size to
0 bytes — undoing the first reservation entirely.

Fix: skip the Tier 1 reserve call in pin_hot_unet_weights when persist_mb=0,
since the caller (setup_l2_persistence) has already handled it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
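The fixed control flow can be sketched as follows (the function name is from the commit; the reserve callable is injected as a parameter only to keep the example self-contained):

```python
def pin_hot_unet_weights(persist_mb, reserve_l2_persisting_cache):
    """The fix: skip the Tier 1 reserve when persist_mb == 0. The caller
    (setup_l2_persistence) has already reserved L2, and re-reserving with
    0 bytes would set the persisting L2 size to zero and undo it."""
    if persist_mb > 0:
        reserve_l2_persisting_cache(persist_mb)
    # ... Tier 2 weight pinning would follow here ...
```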
…ge filter is active

The FPS counter was inflated because skipped frames (cached results from the
similar image filter) returned in ~1ms instead of ~30ms, but were still counted
as processed frames. This caused reported FPS to be ~2x actual GPU inference
rate (e.g., 60 FPS reported while GPU at 50% utilization).

Added `last_frame_was_skipped` flag to pipeline and `inference_fps` tracking
to td_manager. Status line now shows: "FPS: 28.3 (out: 57.1)" separating
real inference rate from output rate. OSC now sends inference FPS as the
primary metric.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
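The dual-rate bookkeeping can be sketched in a few lines (the class and field names here are hypothetical, not td_manager's actual identifiers; only `last_frame_was_skipped` comes from the commit):

```python
class DualFpsTracker:
    """Skipped frames (cached results from the similar-image filter) count
    toward output FPS but not inference FPS, giving the
    "FPS: 28.3 (out: 57.1)" split described above."""

    def __init__(self):
        self.output_frames = 0
        self.inference_frames = 0

    def record(self, last_frame_was_skipped):
        self.output_frames += 1
        if not last_frame_was_skipped:
            self.inference_frames += 1

    def rates(self, elapsed_s):
        return (self.inference_frames / elapsed_s,
                self.output_frames / elapsed_s)
```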
INTER-NYC and others added 2 commits April 4, 2026 04:06
UNet2DConditionModelEngine has no named_parameters() — the previous code
crashed with AttributeError when TRT acceleration was enabled.

Two-path dispatch based on UNet type:
- PyTorch nn.Module: existing Tier 2 cudaStreamSetAttribute weight-pinning path
- TRT engine wrapper: new set_trt_persistent_cache() using
  IExecutionContext.persistent_cache_limit for activation caching in L2

TRT's persistent_cache_limit checks cudaLimitPersistingL2CacheSize at
assignment time (not context-creation time), so Tier 1 reservation must
precede the set call — which is the existing execution order.

Adds hasattr guard in pin_hot_unet_weights() so TRT engines short-circuit
cleanly without attempting named_parameters() iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
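The hasattr dispatch can be sketched as follows (the function name and return labels are illustrative; the real paths call CUDA and TensorRT APIs):

```python
class TrtEngineWrapper:
    """Stand-in for UNet2DConditionModelEngine: no named_parameters()."""

def select_l2_path(unet):
    """Two-path dispatch: PyTorch nn.Modules expose named_parameters() and
    take the Tier 2 weight-pinning path; TRT engine wrappers do not, so they
    short-circuit to the persistent_cache_limit path without crashing."""
    if hasattr(unet, "named_parameters"):
        return "weight-pinning"        # PyTorch: cudaStreamSetAttribute path
    return "persistent-cache-limit"    # TRT: IExecutionContext-based path
```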