# Benchmarking Fused vs Unfused Models

In this notebook, we compare the inference time of an unfused and a fused Conv2d → BatchNorm2d → ReLU block using a simple benchmarking function. The goal is to measure how much latency is saved by fusing these three layers.

Note: In practice, graph fusion is usually applied automatically by compilers such as `torch.compile`. The demonstration below isolates the performance effect of one specific fusion and does not cover deployment tooling.


## Define a dummy unfused network

In [None]:
import torch
import torch.nn as nn

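class UnfusedConvBNReLU(nn.Module):
    """Conv2d -> BatchNorm2d -> ReLU kept as three separate modules (no fusion)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=True)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Each layer runs as its own operation at inference time
        return self.relu(self.bn(self.conv(x)))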


## Fusion of Layers
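The `torch.quantization.fuse_modules` function fuses the convolution, batch normalization, and ReLU layers. With the model in eval mode, it folds the BatchNorm running statistics and affine parameters into the convolution's weights and bias, so batch normalization no longer runs as a separate pass over the activations. The folded convolution and the ReLU are wrapped together in a single `ConvReLU2d` module, and the original `bn` and `relu` entries are replaced by `nn.Identity`. In floating point, the saving comes mainly from removing the BatchNorm operation. `ConvReLU2d` still executes the convolution and the ReLU one after the other unless a quantized backend or a compiler fuses them further.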
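To make the folding step concrete, here is a minimal hand-written sketch of the arithmetic. It assumes an eval-mode `BatchNorm2d` with `affine=True` and a zero-padded `Conv2d`. The helper `fold_bn_into_conv` is hypothetical and is not part of PyTorch. It rescales each output channel's weights by `scale = gamma / sqrt(running_var + eps)` and sets the new bias to `(b - running_mean) * scale + beta`, which is the standard Conv+BN folding identity.

```python
import torch
import torch.nn as nn

def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Hypothetical helper: return a Conv2d equivalent to bn(conv(x)) for an eval-mode BN.

    Assumes bn.affine=True (weight/bias exist) and conv.padding_mode='zeros'.
    """
    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding,
        dilation=conv.dilation, groups=conv.groups, bias=True,
    )
    with torch.no_grad():
        # Per-output-channel scale: gamma / sqrt(running_var + eps)
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        # W' = W * scale, broadcast over (in_channels, kH, kW)
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        # Treat a missing conv bias as zero
        b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        # b' = (b - running_mean) * scale + beta
        fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)
    return fused

torch.manual_seed(0)
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(16)
with torch.no_grad():
    # Random statistics stand in for trained ones so the folding is non-trivial
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)
conv.eval()
bn.eval()

x = torch.randn(2, 3, 32, 32)
with torch.no_grad():
    reference = torch.relu(bn(conv(x)))
    folded = torch.relu(fold_bn_into_conv(conv, bn)(x))
max_err = (reference - folded).abs().max().item()
print(f"max abs difference: {max_err:.2e}")
```

PyTorch ships its own version of this step as `torch.nn.utils.fuse_conv_bn_eval`. The sketch only exists to show the arithmetic behind it.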




In [None]:
# Conv+BN folding is only valid in eval mode, so switch before fusing
model = UnfusedConvBNReLU().eval()
fused = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]], inplace=False).eval()

# Run both models on the same random input and compare the outputs
input_tensor = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    output1 = model(input_tensor)
    output2 = fused(input_tensor)

print(torch.allclose(output1, output2, atol=1e-5))  # Should print: True

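## Validation of Fusion
The cell above compares the outputs of the unfused and fused models with `torch.allclose`. This returns `True` when every element satisfies `|output1 - output2| <= atol + rtol * |output2|` (here `atol=1e-5` and the default `rtol=1e-5`). Bit-exact equality is not expected, because folding BatchNorm into the convolution changes the order of floating-point operations. Note also that `bn` is freshly initialized (running mean 0, running variance 1, weight 1, bias 0). This check therefore only exercises a near-identity normalization, and a model with trained statistics gives a stronger test.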
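The sketch below tests the fusion with non-trivial BatchNorm statistics and is a self-contained variant of the cells above. It fills `running_mean`, `running_var`, `weight`, and `bias` with random values as a stand-in for trained ones, which is an assumption for illustration only. It then inspects which modules `fuse_modules` produced and reports the largest absolute difference instead of a single boolean.

```python
import torch
import torch.nn as nn

class UnfusedConvBNReLU(nn.Module):
    # Same architecture as the notebook's model, redefined so this cell runs on its own
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=True)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

torch.manual_seed(0)
model = UnfusedConvBNReLU()
with torch.no_grad():
    # Assumption: random values replace trained BN statistics
    model.bn.running_mean.uniform_(-1.0, 1.0)
    model.bn.running_var.uniform_(0.5, 2.0)
    model.bn.weight.uniform_(0.5, 1.5)
    model.bn.bias.uniform_(-0.5, 0.5)
model.eval()

fused = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]], inplace=False).eval()

# Inspect what fusion produced: conv becomes ConvReLU2d, bn and relu become Identity
print(type(fused.conv).__name__, type(fused.bn).__name__, type(fused.relu).__name__)

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    max_err = (model(x) - fused(x)).abs().max().item()
print(f"max abs difference: {max_err:.2e}")
```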


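## Benchmarking
The next cell moves both models and the input to the GPU, runs 100 warm-up passes, and then times 1000 forward passes of each model.

In [None]: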
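import time

# Use the GPU when available; fall back to CPU so the cell still runs
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
fused = fused.to(device)
input_tensor = input_tensor.to(device)

def sync():
    # CUDA kernels launch asynchronously; block until queued work has finished
    if device.type == "cuda":
        torch.cuda.synchronize()

def benchmark(model, name, x, warmup=100, iters=1000):
    model.eval()
    with torch.no_grad():
        # Warmup: absorb one-time costs such as memory allocation and kernel selection
        for _ in range(warmup):
            model(x)
        sync()  # keep leftover warmup work out of the timed region
        # Timing
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        sync()  # wait for the last kernels before stopping the clock
        end = time.perf_counter()
    total_ms = (end - start) * 1000
    print(f"{name}: {total_ms:.2f} ms for {iters} iterations ({total_ms / iters:.3f} ms/iter)")

benchmark(model, "Unfused", input_tensor)
benchmark(fused, "Fused", input_tensor)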
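Hand-written timing loops are easy to get subtly wrong. As an alternative, PyTorch provides `torch.utils.benchmark.Timer`, which (per its documentation) synchronizes CUDA when a GPU is present. Its `blocked_autorange` method picks the number of repetitions automatically. The sketch below redefines the model so it runs on its own. The helper `run` and the `min_run_time=0.5` budget are illustrative choices, not values from the original notebook.

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as tbench

class UnfusedConvBNReLU(nn.Module):
    # Same block as in the notebook, redefined so this cell is self-contained
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=True)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
unfused = UnfusedConvBNReLU().eval()
fused = torch.quantization.fuse_modules(unfused, [["conv", "bn", "relu"]], inplace=False).eval()
unfused, fused = unfused.to(device), fused.to(device)
x = torch.randn(1, 3, 224, 224, device=device)

@torch.no_grad()
def run(m, inp):
    # Hypothetical helper: a single inference call without autograd tracking
    return m(inp)

results = {}
for name, m in [("Unfused", unfused), ("Fused", fused)]:
    timer = tbench.Timer(
        stmt="run(m, inp)",
        globals={"run": run, "m": m, "inp": x},
        label="Conv-BN-ReLU",
        description=name,
    )
    # Repeats the statement in blocks until at least min_run_time seconds have elapsed
    results[name] = timer.blocked_autorange(min_run_time=0.5)
    print(f"{name}: median {results[name].median * 1e3:.3f} ms per call")
```

`Timer` defaults to `num_threads=1`, so its CPU timings are not directly comparable to the multi-threaded loop above. Pass `num_threads=torch.get_num_threads()` to match.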
