⚡️ Speed up method HunyuanVideoResnetBlockCausal3D.forward by 9% #148
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-HunyuanVideoResnetBlockCausal3D.forward-mbdzwar4
Conversation
📄 9% (0.09x) speedup for HunyuanVideoResnetBlockCausal3D.forward in src/diffusers/models/autoencoders/autoencoder_kl_hunyuan_video.py

⏱️ Runtime: 2.60 milliseconds → 2.38 milliseconds (best of 38 runs)

📝 Explanation and details

**Main changes explained:**

- **Removed unnecessary `.contiguous()`:** it is only needed when downstream ops require contiguous memory, and most standard PyTorch layers do not. Keeping the input as-is avoids a possible memory reallocation.
- **Direct call for the residual add:** `torch.add()` is used instead of the `+` operator, which gives an opportunity for memory reuse. The in-place version via `out=` is unsafe for autograd here, so the call is left out-of-place, but the direct function call avoids some Python operator overhead.
- **Removed redundant else-blocks** and preserved the streamlined logic.
- **Kept activation and normalization tightly chained** as in the original, eliminating unnecessary intermediate assignments. No further fusion is possible since standard PyTorch layers are used.
- **Did not micro-optimize GroupNorm/Dropout/Conv:** these are critical ops whose speed is dictated by their PyTorch/CUDA implementations.
- **Kept the signature and logic identical.** All function results and edge cases are unchanged.

A sketch of the resulting forward pass follows this list. This rewrite preserves correctness while minimizing Python overhead; in high-performance situations the underlying operators will still dominate runtime. Further acceleration would require tuning the lower-level convolution implementation, mixed precision (`autocast`), `torch.compile`/tracing, or a fused custom norm+act+conv kernel (see the second snippet below).
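For concreteness, here is a minimal sketch of the optimized forward pass. The class name `ResnetBlockCausal3DSketch`, the plain `Conv3d` layers, and the constructor arguments are illustrative stand-ins inferred from the description above, not the verbatim diffusers implementation (which uses causal 3D convolutions):

```python
import torch
import torch.nn as nn


class ResnetBlockCausal3DSketch(nn.Module):
    """Illustrative stand-in for HunyuanVideoResnetBlockCausal3D; the layer
    names and structure are assumptions based on the PR description."""

    def __init__(self, in_channels: int, out_channels: int, dropout: float = 0.0,
                 groups: int = 32, eps: float = 1e-6):
        super().__init__()
        self.nonlinearity = nn.SiLU()
        self.norm1 = nn.GroupNorm(groups, in_channels, eps=eps)
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(groups, out_channels, eps=eps)
        self.dropout = nn.Dropout(dropout)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1)
        self.conv_shortcut = (
            nn.Conv3d(in_channels, out_channels, kernel_size=1)
            if in_channels != out_channels
            else None
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Before: hidden_states = hidden_states.contiguous(). The copy is
        # dropped here because GroupNorm/Conv3d accept non-contiguous inputs,
        # so the extra reallocation buys nothing.
        residual = hidden_states

        # Norm -> activation -> conv kept tightly chained, with no
        # intermediate assignments between the calls.
        hidden_states = self.conv1(self.nonlinearity(self.norm1(hidden_states)))
        hidden_states = self.conv2(self.dropout(self.nonlinearity(self.norm2(hidden_states))))

        if self.conv_shortcut is not None:
            residual = self.conv_shortcut(residual)

        # torch.add(...) instead of `+`: same semantics, slightly less Python
        # operator dispatch. The out= (in-place) form is avoided because it
        # would break autograd on tensors that require grad.
        return torch.add(hidden_states, residual)
```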
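And one way to apply the further-acceleration suggestions from the closing paragraph: wrapping the block in `torch.compile` and running under `autocast`. The tensor shapes, the CUDA device, and the reuse of the sketch class above are assumptions, not part of the PR:

```python
import torch

# Assumes the ResnetBlockCausal3DSketch class from the previous snippet
# and a CUDA-capable machine; both are illustrative assumptions.
block = ResnetBlockCausal3DSketch(64, 64).cuda().eval()
x = torch.randn(1, 64, 8, 32, 32, device="cuda")  # (B, C, T, H, W)

# torch.compile can fuse the norm/act/conv chain that eager mode cannot.
compiled_block = torch.compile(block)

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    out = compiled_block(x)
```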
✅ Correctness verification report:
🌀 Generated Regression Tests Details
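The generated tests themselves are collapsed in the PR view. A representative test in that spirit might look like the following; the constructor arguments and the assumption that the block preserves the temporal and spatial dimensions are guesses about the diffusers API, not guarantees:

```python
import pytest
import torch

# Import path taken from the file named in this PR; the constructor
# signature used below is an assumption and may not match the real class.
from diffusers.models.autoencoders.autoencoder_kl_hunyuan_video import (
    HunyuanVideoResnetBlockCausal3D,
)


@pytest.mark.parametrize("in_channels,out_channels", [(32, 32), (32, 64)])
def test_forward_basic(in_channels, out_channels):
    torch.manual_seed(0)
    block = HunyuanVideoResnetBlockCausal3D(
        in_channels=in_channels, out_channels=out_channels
    ).eval()
    # 5D video input: (batch, channels, frames, height, width).
    x = torch.randn(2, in_channels, 4, 8, 8)
    with torch.no_grad():
        out = block(x)
    # Only the channel dimension should change; values must stay finite.
    assert out.shape[1] == out_channels
    assert torch.isfinite(out).all()
```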
To edit these changes, run `git checkout codeflash/optimize-HunyuanVideoResnetBlockCausal3D.forward-mbdzwar4` and push.