
perf: Inference performance & pipeline correctness #4

Open

forkni wants to merge 11 commits into SDTD_031_dev from pr1/inference-performance

Conversation

Collaborator

@forkni forkni commented Apr 3, 2026

Summary

A systematic optimization pass that eliminates all per-frame CUDA memory allocations in the inference hot path, removes unnecessary GPU synchronization, and fixes pipeline precision and observability issues.

Key Changes

  • Zero per-frame CUDA allocations: Pre-allocated reusable buffers (_latent_cache, _noise_buf, _image_decode_buf, _prev_image_buf) with .copy_() / .normal_() replacing all .clone() / torch.randn_like() calls
  • O(1) KVO cache updates: Circular buffer write replacing O(n) shift+clone (attention reads all slots as unordered K/V bag)
  • Removed unnecessary GPU sync: ControlNet/UNet stream.synchronize() removed (same-stream ordering guarantees), global torch.cuda.synchronize() replaced with event-scoped sync
  • L2 cache persistence: New cuda_l2_cache.py module pins UNet attention weights in GPU L2 cache (Ampere+, env-gated via SDTD_L2_PERSIST)
  • Precision fixes: VAE autocast respects self.dtype instead of hardcoded float16; scheduler division upcast to float32 at early timesteps
  • FPS observability: Separate inference FPS vs output FPS when similar image filter inflates frame counts
  • TRT build fix: SDXL engine build on Windows — external data detection, ByteSize overflow guard, file-lock tolerance
  • Compatibility: cuda-python 13.x import path fix, cached engine IO metadata
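The circular-buffer update in the second bullet can be sketched in plain Python (`CircularKVCache` and its fields are illustrative stand-ins; the real code writes tensors into a pre-allocated GPU buffer):

```python
class CircularKVCache:
    """Fixed-capacity cache written as a ring: each update overwrites one
    slot in place instead of shifting every entry (the O(n) shift+clone
    this PR removes). Attention reads all slots as an unordered K/V bag,
    so slot order does not matter."""

    def __init__(self, capacity):
        self.slots = [None] * capacity  # pre-allocated once, reused forever
        self._head = 0

    def update(self, kv):
        self.slots[self._head] = kv  # O(1) in-place write
        self._head = (self._head + 1) % len(self.slots)

    def read_all(self):
        return [s for s in self.slots if s is not None]
```

With capacity 3, a fourth update simply overwrites the oldest slot; no entry is ever copied or moved.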

Files Modified

| File | Changes |
| --- | --- |
| pipeline.py | Buffer pre-allocation, precision fixes, FPS tracking |
| cuda_l2_cache.py | New (386 lines) — L2 cache persistence |
| wrapper.py | L2 cache init hook |
| unet_engine.py | Pre-allocated dicts, lazy KVO names |
| controlnet_engine.py | Removed redundant stream sync |
| utilities.py | Cached _allowed_inputs, event sync, ByteSize fix |
| builder.py | SDXL Windows build hardening |
| td_manager.py | Separate inference/output FPS |

Impact

| Metric | Before | After |
| --- | --- | --- |
| Per-frame CUDA mallocs | 4-6 | 0 |
| KVO cache update | O(n) shift+clone | O(1) circular write |
| GPU sync | Global (all streams) | Event-scoped |
| VAE autocast | Hardcoded fp16 | Respects pipeline dtype |
| FPS reporting | Inflated by cached frames | Separate inference vs output |

Test plan

  • Run img2img inference at 512x512 for 60s, verify dual FPS display
  • Run with SDTD_L2_PERSIST=0, confirm pipeline still runs
  • Run with bfloat16 pipeline, verify no VAE dtype errors
  • Build TRT engine for SDXL on Windows, verify success
  • Run on cuda-python 13.x, verify cudart import

🤖 Generated with Claude Code

dotsimulate and others added 9 commits April 2, 2026 21:59
- Fix external data detection in optimize_onnx to check .data/.onnx.data extensions (not just .pb)
- Handle torch.onnx.export creating external sidecar files with non-.pb names for >2GB SDXL models
- Normalize all external data to weights.pb for consistent downstream handling
- Add ByteSize check before single-file ONNX save to prevent silent >2GB serialization failure
- Add pre-build verification: check .opt.onnx exists and is non-empty before TRT engine build
- Tolerate Windows file-lock failures during post-build ONNX cleanup instead of crashing
- Add diagnostic logging for file sizes throughout export/optimize/build pipeline
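The ByteSize guard above exists because protobuf serializes a message into a single buffer capped at 2 GiB, so a larger ONNX model fails (sometimes silently) when saved as one file. The decision logic can be sketched as follows; the function name is illustrative, and the actual save calls are only referenced in comments:

```python
ONNX_SINGLE_FILE_LIMIT = 2**31 - 1  # protobuf's 2 GiB message cap

def needs_external_data(model_byte_size):
    """Return True when the model must spill weights to a sidecar file.

    model_byte_size is what model.ByteSize() reports on the protobuf.
    When True, the commit saves with external data (e.g. onnx.save with
    save_as_external_data=True) and normalizes the sidecar to weights.pb;
    otherwise a plain single-file save is safe."""
    return model_byte_size > ONNX_SINGLE_FILE_LIMIT
```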
Adds .clone() immediately after VAE decode in the __call__ (img2img) and txt2img
inference paths. Prevents the TRT VAE output buffer from being silently reused on
the next decode call while prev_image_result is still read downstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- utilities.py: clean allocate_buffers, simplified ONNX external data
  handling with ByteSize() check, simplified optimize_onnx with .pb
  extension detection
- postprocessing_orchestrator.py: preserve HEAD docstring for
  _should_use_sync_processing (correctly describes temporal coherence
  and feedback loop behavior)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In cuda-python 13.x, the 'cudart' module was moved to 'cuda.bindings.runtime'.
Add try/except import that prefers the new location and falls back to the
legacy 'cuda.cudart' path for cuda-python 12.x compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
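The try/except pattern above can be generalized into a small helper (the name `import_first` is hypothetical; it is built on stdlib `importlib` so the sketch is self-contained):

```python
import importlib

def import_first(*candidates):
    """Import and return the first available module from dotted paths,
    trying them in order and raising ImportError if none resolve."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate importable: " + "; ".join(errors))

# Usage matching the commit (module paths taken from the commit message):
# cudart = import_first("cuda.bindings.runtime", "cuda.cudart")
```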
…tence

Pre-allocate latent and noise buffers to eliminate per-frame CUDA malloc:
- Replace prev_latent_result = x_0_pred_out.clone() with lazy-allocated
  _latent_cache buffer + copy_() in __call__, txt2img, and txt2img_sd_turbo
- Replace torch.randn_like() in TCD non-batched noise loop with lazy-allocated
  _noise_buf + .normal_() — eliminates per-step allocation on TCD path
- Both buffers allocate on first use (shape is fixed per pipeline instance)

Port cuda_l2_cache.py from CUDA 0.2.99 fork (PLAN_5 Feature 2):
- New file: src/streamdiffusion/tools/cuda_l2_cache.py
- Reserves GPU L2 cache for UNet attention weight tensors (mid_block, up_blocks.1)
- Gated by SDTD_L2_PERSIST=1 env var (default on), requires Ampere+ GPU
- Integrated at end of wrapper._load_model() with silent fallback on failure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
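The allocate-on-first-use pattern from this commit can be illustrated with a pure-Python stand-in (the real code holds torch tensors and writes with .copy_() / .normal_(), but the allocate-once / write-in-place shape is the same; the class name and counter are for this demo only):

```python
class FrameBuffers:
    """Stand-in for the lazily allocated _latent_cache buffer."""

    def __init__(self):
        self._latent_cache = None  # allocated on first use; shape is fixed
        self.allocations = 0       # demo-only counter of real allocations

    def store_latent(self, values):
        if self._latent_cache is None:
            self._latent_cache = [0.0] * len(values)  # one-time allocation
            self.allocations += 1
        self._latent_cache[:] = values  # in-place write, like tensor.copy_()
        return self._latent_cache
```

Every frame after the first reuses the same buffer, which is exactly how the per-frame CUDA malloc count drops to zero.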
…division

- encode_image / decode_image: replace hardcoded torch.float16 autocast with
  self.dtype so the pipeline correctly honors the torch_dtype constructor param
  (e.g. bfloat16 would still get fp16 VAE without this fix)

- scheduler_step_batch: upcast numerator and alpha_prod_t_sqrt to float32
  before the F_theta division, then cast back to original dtype. When
  alpha_prod_t_sqrt is small (early timesteps), fp16 division can accumulate
  rounding error; fp32 upcast eliminates this at negligible cost (~1-3us/call).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
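The upcast-divide-downcast step can be sketched with NumPy as a stand-in for the torch tensors (function and variable names are illustrative, not the scheduler's actual identifiers):

```python
import numpy as np

def upcast_div(numerator, denominator):
    """Divide in float32, then cast back to the input dtype, so a small
    denominator (alpha_prod_t_sqrt at early timesteps) does not amplify
    fp16 rounding error while downstream buffers keep pipeline precision."""
    out32 = numerator.astype(np.float32) / denominator.astype(np.float32)
    return out32.astype(numerator.dtype)
```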
setup_l2_persistence() calls pin_hot_unet_weights(persist_mb=0) after
reserving L2. But pin_hot_unet_weights unconditionally called
reserve_l2_persisting_cache(0), which set the persisting L2 size to
0 bytes — undoing the first reservation entirely.

Fix: skip the Tier 1 reserve call in pin_hot_unet_weights when persist_mb=0,
since the caller (setup_l2_persistence) has already handled it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
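The fixed control flow can be sketched as follows (the function name is from the commit; the reserve callable is injected as a parameter only to keep the example self-contained):

```python
def pin_hot_unet_weights(persist_mb, reserve_l2_persisting_cache):
    """The fix: skip the Tier 1 reserve when persist_mb == 0. The caller
    (setup_l2_persistence) has already reserved L2, and re-reserving with
    0 bytes would set the persisting L2 size to zero and undo it."""
    if persist_mb > 0:
        reserve_l2_persisting_cache(persist_mb)
    # ... Tier 2 weight pinning would follow here ...
```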
…ge filter is active

The FPS counter was inflated because skipped frames (cached results from the
similar image filter) returned in ~1ms instead of ~30ms, but were still counted
as processed frames. This caused reported FPS to be ~2x actual GPU inference
rate (e.g., 60 FPS reported while GPU at 50% utilization).

Added `last_frame_was_skipped` flag to pipeline and `inference_fps` tracking
to td_manager. Status line now shows: "FPS: 28.3 (out: 57.1)" separating
real inference rate from output rate. OSC now sends inference FPS as the
primary metric.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
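The dual-rate bookkeeping can be sketched in a few lines (the class and field names here are hypothetical, not td_manager's actual identifiers; only `last_frame_was_skipped` comes from the commit):

```python
class DualFpsTracker:
    """Skipped frames (cached results from the similar-image filter) count
    toward output FPS but not inference FPS, giving the
    "FPS: 28.3 (out: 57.1)" split described above."""

    def __init__(self):
        self.output_frames = 0
        self.inference_frames = 0

    def record(self, last_frame_was_skipped):
        self.output_frames += 1
        if not last_frame_was_skipped:
            self.inference_frames += 1

    def rates(self, elapsed_s):
        return (self.inference_frames / elapsed_s,
                self.output_frames / elapsed_s)
```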
INTER-NYC and others added 2 commits April 4, 2026 04:06
UNet2DConditionModelEngine has no named_parameters() — the previous code
crashed with AttributeError when TRT acceleration was enabled.

Two-path dispatch based on UNet type:
- PyTorch nn.Module: existing Tier 2 cudaStreamSetAttribute weight-pinning path
- TRT engine wrapper: new set_trt_persistent_cache() using
  IExecutionContext.persistent_cache_limit for activation caching in L2

TRT's persistent_cache_limit checks cudaLimitPersistingL2CacheSize at
assignment time (not context-creation time), so Tier 1 reservation must
precede the set call — which is the existing execution order.

Adds hasattr guard in pin_hot_unet_weights() so TRT engines short-circuit
cleanly without attempting named_parameters() iteration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
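The hasattr dispatch can be sketched as follows (the function name and return labels are illustrative; the real paths call CUDA and TensorRT APIs):

```python
class TrtEngineWrapper:
    """Stand-in for UNet2DConditionModelEngine: no named_parameters()."""

def select_l2_path(unet):
    """Two-path dispatch: PyTorch nn.Modules expose named_parameters() and
    take the Tier 2 weight-pinning path; TRT engine wrappers do not, so they
    short-circuit to the persistent_cache_limit path without crashing."""
    if hasattr(unet, "named_parameters"):
        return "weight-pinning"        # PyTorch: cudaStreamSetAttribute path
    return "persistent-cache-limit"    # TRT: IExecutionContext-based path
```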