⚡️ Speed up method `HunyuanVideoDownsampleCausal3D.forward` by 5% by codeflash-ai[bot] · Pull Request #147 · aseembits93/diffusers

codeflash-ai · 2025-06-01T18:17:52Z

📄 5% (0.05x) speedup for `HunyuanVideoDownsampleCausal3D.forward` in `src/diffusers/models/autoencoders/autoencoder_kl_hunyuan_video.py`

⏱️ Runtime : 2.41 milliseconds → 2.29 milliseconds (best of 233 runs)
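The 5% figure can be reproduced from the two runtimes above. The sketch below is a minimal illustration; the helper names are hypothetical, and the source does not say which convention (time saved relative to the optimized runtime, or relative to the original runtime) the bot uses. For these numbers both conventions round to 5%.

```python
def relative_speedup(original_ms: float, optimized_ms: float) -> float:
    """Time saved as a fraction of the optimized runtime (assumed convention)."""
    return (original_ms - optimized_ms) / optimized_ms


def relative_reduction(original_ms: float, optimized_ms: float) -> float:
    """Time saved as a fraction of the original runtime (alternative convention)."""
    return (original_ms - optimized_ms) / original_ms


# Runtimes reported in this PR, in milliseconds.
original_ms, optimized_ms = 2.41, 2.29

print(f"speedup:   {relative_speedup(original_ms, optimized_ms):.1%}")
print(f"reduction: {relative_reduction(original_ms, optimized_ms):.1%}")
```

The absolute saving is 0.12 ms per call, so measurement noise is worth keeping in mind when reading the percentage.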

📝 Explanation and details

Optimization notes:

- Most of the runtime is spent in `self.conv(hidden_states)`. For inference-only use, running the call under `torch.no_grad()` makes it faster and more memory-efficient.
- Avoided unnecessary variable assignments in the forward pass.
- Removed the unused `torch.utils.checkpoint` import.
- Added a fast path guarded by `is_contiguous()`: it is taken only when the input is already contiguous, which avoids internal tensor copying/allocation in some `nn` modules (helpful for 3D data), and it applies only during inference.
- No structural changes that affect the return value or output.
- Preserved all comments except those tied to the lines that were slightly altered.
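The notes above do not include the diff itself, so the following is only a sketch of what a guarded inference fast path could look like. It assumes a plain `nn.Conv3d` as a stand-in for `HunyuanVideoCausalConv3d`, and the class name `DownsampleSketch` is hypothetical; it is not the code from this PR.

```python
from typing import Optional

import torch
import torch.nn as nn


class DownsampleSketch(nn.Module):
    """Hypothetical stand-in module; nn.Conv3d replaces the real causal conv."""

    def __init__(self, channels: int, out_channels: Optional[int] = None):
        super().__init__()
        self.conv = nn.Conv3d(
            channels, out_channels or channels, kernel_size=3, stride=2, padding=1
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Fast path: only in eval mode and only for already-contiguous input.
        # Autograd bookkeeping is skipped, so the result carries no graph.
        if not self.training and hidden_states.is_contiguous():
            with torch.no_grad():
                return self.conv(hidden_states)
        # Default path: training mode or non-contiguous input.
        return self.conv(hidden_states)


model = DownsampleSketch(2)
x = torch.randn(1, 2, 8, 8, 8)

model.train()
y_train = model(x)  # tracked by autograd

model.eval()
y_eval = model(x)  # fast path, no autograd graph
```

One design caveat with this kind of guard: a caller that needs gradients while the module is in eval mode (for example, gradients with respect to the input) would silently get a detached output, so the guard condition deserves review before merging.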

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 37 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests Details

from typing import Optional

# imports
import pytest  # used for our unit tests
import torch
import torch.nn as nn
from src.diffusers.models.autoencoders.autoencoder_kl_hunyuan_video import \
    HunyuanVideoDownsampleCausal3D

# function to test
# Copyright 2024 The Hunyuan Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Dummy implementation of HunyuanVideoCausalConv3d for testing purposes
class HunyuanVideoCausalConv3d(nn.Conv3d):
    # For this test, we assume causal convolution is just Conv3d with padding
    # In reality, causal convolution would have a more complex implementation
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, bias=True):
        super().__init__(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            bias=bias
        )

# unit tests

# ----------------------- BASIC TEST CASES -----------------------

def test_forward_basic_shape_and_type():
    # Test that output shape and dtype are correct for simple input
    model = HunyuanVideoDownsampleCausal3D(channels=3, out_channels=4, kernel_size=3, stride=2, padding=1)
    x = torch.randn(2, 3, 8, 16, 16)  # (batch, channels, depth, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_default_out_channels():
    # Test that out_channels defaults to channels if not specified
    model = HunyuanVideoDownsampleCausal3D(channels=5, kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 5, 6, 8, 8)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_bias_false():
    # Test that the model works when bias=False
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, bias=False)
    x = torch.randn(1, 2, 4, 4, 4)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_stride_one():
    # Test with stride=1 (no downsampling)
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, stride=1)
    x = torch.randn(1, 2, 5, 5, 5)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_kernel_size_tuple():
    # Test with kernel_size and stride as tuples
    model = HunyuanVideoDownsampleCausal3D(
        channels=2, out_channels=2, kernel_size=(3, 3, 3), stride=(2, 2, 2), padding=(1, 1, 1)
    )
    x = torch.randn(1, 2, 8, 8, 8)
    codeflash_output = model.forward(x); y = codeflash_output

# ----------------------- EDGE TEST CASES -----------------------

def test_forward_minimal_input():
    # Test minimal possible input (single batch, single channel, single voxel)
    model = HunyuanVideoDownsampleCausal3D(channels=1, out_channels=1, kernel_size=1, stride=1, padding=0)
    x = torch.randn(1, 1, 1, 1, 1)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_zero_padding():
    # Test with zero padding, kernel size 3, stride 1
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, kernel_size=3, stride=1, padding=0)
    x = torch.randn(1, 2, 5, 5, 5)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_kernel():
    # Test with kernel size equal to input size
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, kernel_size=5, stride=1, padding=0)
    x = torch.randn(1, 2, 5, 5, 5)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_non_divisible_stride():
    # Test when input size is not divisible by stride
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 2, 7, 7, 7)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_singleton_batch_channel():
    # Test with batch size 1 and channel size 1
    model = HunyuanVideoDownsampleCausal3D(channels=1, out_channels=1, kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 1, 8, 8, 8)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_invalid_input_shape():
    # Test that an error is raised for invalid input shape (not 5D)
    model = HunyuanVideoDownsampleCausal3D(channels=2)
    x = torch.randn(2, 2, 8, 8)  # 4D instead of 5D
    with pytest.raises(RuntimeError):
        model.forward(x)

def test_forward_channels_mismatch():
    # Test that an error is raised if input channels do not match model channels
    model = HunyuanVideoDownsampleCausal3D(channels=3, out_channels=2)
    x = torch.randn(1, 2, 8, 8, 8)  # input channels=2, model expects 3
    with pytest.raises(RuntimeError):
        model.forward(x)

def test_forward_negative_stride():
    # Test that negative stride raises an error
    with pytest.raises(ValueError):
        HunyuanVideoDownsampleCausal3D(channels=2, stride=-1)

def test_forward_zero_stride():
    # Test that zero stride raises an error
    with pytest.raises(ValueError):
        HunyuanVideoDownsampleCausal3D(channels=2, stride=0)

def test_forward_empty_tensor():
    # Test with empty tensor (zero batch size)
    model = HunyuanVideoDownsampleCausal3D(channels=2)
    x = torch.randn(0, 2, 8, 8, 8)
    codeflash_output = model.forward(x); y = codeflash_output

# ----------------------- LARGE SCALE TEST CASES -----------------------

def test_forward_large_batch():
    # Test with large batch size
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2)
    x = torch.randn(64, 2, 8, 8, 8)  # 64 batches
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_spatial():
    # Test with large spatial dimensions, but <100MB tensor
    model = HunyuanVideoDownsampleCausal3D(channels=1, out_channels=1)
    x = torch.randn(1, 1, 16, 32, 32)  # 1*1*16*32*32 = 16,384 floats * 4 bytes = 65,536 bytes
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_channels():
    # Test with large number of channels, but <100MB tensor
    model = HunyuanVideoDownsampleCausal3D(channels=32, out_channels=64)
    x = torch.randn(2, 32, 8, 8, 8)  # 2*32*8*8*8 = 32,768 floats * 4 bytes = 131,072 bytes
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_gradient_flow():
    # Test that gradients flow through the module
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2)
    x = torch.randn(2, 2, 8, 8, 8, requires_grad=True)
    codeflash_output = model.forward(x); y = codeflash_output
    loss = y.sum()
    loss.backward()

def test_forward_different_dtypes():
    # Test with float32 and float64
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2)
    x32 = torch.randn(1, 2, 8, 8, 8, dtype=torch.float32)
    x64 = torch.randn(1, 2, 8, 8, 8, dtype=torch.float64)
    codeflash_output = model.forward(x32); y32 = codeflash_output
    codeflash_output = model.forward(x64); y64 = codeflash_output

def test_forward_device_cpu_cuda():
    # Test that the model works on both CPU and CUDA (if available)
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2)
    x = torch.randn(1, 2, 8, 8, 8)
    codeflash_output = model.forward(x); y_cpu = codeflash_output
    if torch.cuda.is_available():
        model_cuda = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2).cuda()
        x_cuda = x.cuda()
        codeflash_output = model_cuda.forward(x_cuda); y_cuda = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
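The large-scale tests above aim to keep each input tensor under 100 MB. As a quick check of that arithmetic, the sketch below computes the size of a float32 tensor from its shape; the helper name `float32_nbytes` is hypothetical and only illustrates the calculation.

```python
import math

BYTES_PER_FLOAT32 = 4


def float32_nbytes(shape) -> int:
    """Bytes needed to store a dense float32 tensor with the given shape."""
    return math.prod(shape) * BYTES_PER_FLOAT32


# Shapes taken from the large-scale test cases above.
for shape in [(64, 2, 8, 8, 8), (1, 1, 16, 32, 32), (2, 32, 8, 8, 8)]:
    print(shape, math.prod(shape), "elements,", float32_nbytes(shape), "bytes")
```

All three inputs are well below the 100 MB limit; the largest is 262,144 bytes.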

from typing import Optional

# imports
import pytest  # used for our unit tests
import torch
import torch.nn as nn
from src.diffusers.models.autoencoders.autoencoder_kl_hunyuan_video import \
    HunyuanVideoDownsampleCausal3D

# function to test
# Copyright 2024 The Hunyuan Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


class HunyuanVideoCausalConv3d(nn.Conv3d):
    """
    A simple causal 3D convolution implementation for testing purposes.
    This implementation ensures causality by zeroing out weights for future frames in the temporal dimension.
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, bias=True):
        if isinstance(kernel_size, int):
            kernel_size = (kernel_size, kernel_size, kernel_size)
        if isinstance(stride, int):
            stride = (stride, stride, stride)
        if isinstance(padding, int):
            padding = (padding, padding, padding)
        super().__init__(
            in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=bias
        )
        self.kernel_size = kernel_size

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # For a causal convolution, zero out weights for future frames in the temporal dimension
        # Only allow access to current and past frames
        # This is a simplified version for testing
        weight = self.weight.clone()
        t_center = self.kernel_size[0] // 2
        # Zero out weights that correspond to future frames
        if self.kernel_size[0] > 1:
            weight[:, :, t_center + 1 :, :, :] = 0
        # Use F.conv3d directly to avoid recursion
        return nn.functional.conv3d(
            input, weight, self.bias, self.stride, self.padding, self.dilation, self.groups
        )

# unit tests

# --------- Basic Test Cases ---------

def test_forward_basic_shape_and_type():
    # Test that output shape and type are correct for a simple input
    batch, channels, time, height, width = 2, 3, 8, 16, 16
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output
    # Output shape: batch, channels, downsampled time, downsampled height, downsampled width
    # Default kernel_size=3, stride=2, padding=1
    # Output size formula: floor((input + 2*pad - kernel) / stride) + 1
    def out_dim(i, k=3, s=2, p=1):
        return (i + 2*p - k)//s + 1

def test_forward_out_channels():
    # Test that out_channels argument changes output shape
    batch, channels, time, height, width = 1, 4, 10, 10, 10
    out_channels = 6
    model = HunyuanVideoDownsampleCausal3D(channels, out_channels=out_channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_kernel_and_stride_variants():
    # Test with different kernel_size and stride
    batch, channels, time, height, width = 1, 2, 7, 12, 12
    model = HunyuanVideoDownsampleCausal3D(channels, kernel_size=5, stride=3, padding=2)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output
    # Output shape calculation
    def out_dim(i, k=5, s=3, p=2):
        return (i + 2*p - k)//s + 1

def test_forward_bias_false():
    # Test that disabling bias works
    batch, channels, time, height, width = 1, 2, 5, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels, bias=False)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_grad():
    # Test that gradients flow through the module
    batch, channels, time, height, width = 1, 2, 6, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width, requires_grad=True)
    codeflash_output = model.forward(x); y = codeflash_output
    loss = y.sum()
    loss.backward()

# --------- Edge Test Cases ---------

def test_forward_singleton_dimensions():
    # Test with singleton batch, channel, spatial and temporal dimensions
    x = torch.randn(1, 1, 1, 1, 1)
    model = HunyuanVideoDownsampleCausal3D(1)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_minimum_input_size():
    # Test with minimum input size that allows a single convolution
    # For kernel_size=3, padding=1, stride=2: input size 3
    x = torch.randn(1, 1, 3, 3, 3)
    model = HunyuanVideoDownsampleCausal3D(1)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_odd_even_dimensions():
    # Test with odd and even input sizes
    x = torch.randn(1, 2, 5, 6, 7)
    model = HunyuanVideoDownsampleCausal3D(2)
    codeflash_output = model.forward(x); y = codeflash_output
    # Output shape calculation
    def out_dim(i, k=3, s=2, p=1):
        return (i + 2*p - k)//s + 1


def test_forward_causality():
    # Test that the convolution is causal in the temporal dimension
    # Changing a future frame should not affect the output at the current/past time
    batch, channels, time, height, width = 1, 1, 6, 4, 4
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    x2 = x.clone()
    # Change the last time frame (future frame for earlier outputs)
    x2[:, :, -1] += 1000
    codeflash_output = model.forward(x); y1 = codeflash_output
    codeflash_output = model.forward(x2); y2 = codeflash_output

def test_forward_invalid_input_shape():
    # Test that invalid input shape raises an error
    model = HunyuanVideoDownsampleCausal3D(2)
    x = torch.randn(1, 2, 8, 8)  # missing one dimension
    with pytest.raises(RuntimeError):
        model.forward(x)

def test_forward_channel_mismatch():
    # Test that channel mismatch raises an error
    model = HunyuanVideoDownsampleCausal3D(3)
    x = torch.randn(1, 2, 8, 8, 8)  # input channels != model channels
    with pytest.raises(RuntimeError):
        model.forward(x)

# --------- Large Scale Test Cases ---------

def test_forward_large_batch():
    # Test with large batch size
    batch, channels, time, height, width = 32, 2, 8, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_spatial():
    # Test with large spatial dimensions (but <100MB)
    batch, channels, time, height, width = 1, 2, 4, 64, 64
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output
    # Check output shape
    def out_dim(i, k=3, s=2, p=1):
        return (i + 2*p - k)//s + 1

def test_forward_large_temporal():
    # Test with large temporal dimension (but <100MB)
    batch, channels, time, height, width = 1, 2, 64, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_channels():
    # Test with large number of channels (but <100MB)
    batch, channels, time, height, width = 1, 64, 4, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_multiple_large_inputs():
    # Test with multiple large inputs in a loop (but keep total memory small)
    batch, channels, time, height, width = 2, 8, 8, 16, 16
    model = HunyuanVideoDownsampleCausal3D(channels)
    for _ in range(5):
        x = torch.randn(batch, channels, time, height, width)
        codeflash_output = model.forward(x); y = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
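Several tests above define an `out_dim` helper but never compare it with the actual output shape. The sketch below shows the expected shapes that helper implies, using the standard `Conv3d` output-size formula with symmetric padding. The function names are hypothetical, and the source does not confirm that the real causal convolution, which may pad the temporal axis differently, produces the same sizes.

```python
def conv_out_dim(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard convolution output size: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_out_shape(dims, kernel: int = 3, stride: int = 2, padding: int = 1):
    """Apply conv_out_dim to each of the (time, height, width) dimensions."""
    return tuple(conv_out_dim(d, kernel, stride, padding) for d in dims)


# (time, height, width) inputs taken from the tests above.
print(conv_out_shape((8, 16, 16)))                              # default kernel/stride/padding
print(conv_out_shape((7, 12, 12), kernel=5, stride=3, padding=2))
print(conv_out_shape((5, 6, 7)))                                # odd and even sizes
```

In a test, the result could be compared against `tuple(y.shape[2:])` once the padding behaviour of the real module is confirmed.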

To edit these changes, run `git checkout codeflash/optimize-HunyuanVideoDownsampleCausal3D.forward-mbdzh935` and push.


codeflash-ai bot added the `⚡️ codeflash` label (Optimization PR opened by Codeflash AI) on Jun 1, 2025

codeflash-ai bot requested a review from aseembits93 June 1, 2025 18:17

codeflash-ai[bot] wants to merge 1 commit into `main` from `codeflash/optimize-HunyuanVideoDownsampleCausal3D.forward-mbdzh935`
