# GPU Tutor - 1 - 1_Conv2D_ReLU_BiasAdd

This notebook runs a Conv2D → ReLU → bias-add layer in PyTorch on the CPU and on a CUDA GPU, then compares execution time and memory usage.

## Instructions
1. Run each cell in order
2. Make sure GPU is enabled: Runtime → Change runtime type → GPU
3. The notebook will test both CPU and GPU performance using PyTorch

In [None]:
# Install required packages (psutil is used by the memory comparison cell)
!pip install torch torchvision torchaudio
!pip install numpy psutil

# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("No GPU available - will only run CPU tests")

## CPU-Optimized PyTorch Code
Below is the CPU implementation:

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F


def module_fn(
    x: torch.Tensor,
    conv_weight: torch.Tensor,
    conv_bias: torch.Tensor,
    bias: torch.Tensor,
) -> torch.Tensor:
    """
    Functional implementation of a neural network layer that:
    1. Applies a 2D convolution with learnable weights and biases
    2. Applies ReLU activation function
    3. Adds a learnable bias term

    Args:
        x (Tensor): Input tensor of shape (N, C_in, H, W)
        conv_weight (Tensor): Convolution weights of shape (C_out, C_in, kernel_size, kernel_size)
        conv_bias (Tensor): Convolution bias of shape (C_out)
        bias (Tensor): Additional bias term of shape (C_out, 1, 1)

    Returns:
        Tensor: Output tensor of shape (N, C_out, H_out, W_out)
    """
    x = F.conv2d(x, conv_weight, conv_bias)
    x = torch.relu(x)
    x = x + bias
    return x


class Model(nn.Module):
    """
    Simple model that performs a convolution, applies ReLU, and adds a bias term.
    """

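    def __init__(self, in_channels, out_channels, kernel_size, bias_shape):
        super().__init__()
        # The temporary Conv2d is used only to initialize the weight and bias.
        # Note: padding=1 is not passed on to F.conv2d in module_fn, which uses
        # its default padding of 0, so a 32x32 input produces a 30x30 output.
        conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=1)
        self.conv_weight = nn.Parameter(conv.weight)
        self.conv_bias = nn.Parameter(conv.bias)
        self.bias = nn.Parameter(torch.randn(bias_shape) * 0.02)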

    def forward(self, x, fn=module_fn):
        return fn(x, self.conv_weight, self.conv_bias, self.bias)


batch_size = 128
in_channels = 3
out_channels = 16
height, width = 32, 32
kernel_size = 3
bias_shape = (out_channels, 1, 1)


def get_inputs():
    return [torch.randn(batch_size, in_channels, height, width)]


def get_init_inputs():
    return [in_channels, out_channels, kernel_size, bias_shape]
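
The docstring gives the output shape as `(N, C_out, H_out, W_out)` without stating the numbers. Since `module_fn` calls `F.conv2d` with its default stride of 1 and padding of 0, the spatial size can be worked out with the standard convolution size formula. The following is a minimal sketch of that arithmetic; `conv2d_out_size` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
def conv2d_out_size(size: int, kernel_size: int, stride: int = 1,
                    padding: int = 0, dilation: int = 1) -> int:
    """Output length along one spatial dimension of a 2D convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Values from this notebook: 32x32 input, 3x3 kernel, F.conv2d defaults
h_out = conv2d_out_size(32, 3)
w_out = conv2d_out_size(32, 3)
expected_shape = (128, 16, h_out, w_out)
print(expected_shape)  # (128, 16, 30, 30)

# For comparison: padding=1 would keep the 32x32 spatial size
print(conv2d_out_size(32, 3, padding=1))  # 32
```

So the model as written is expected to return a tensor of shape `(128, 16, 30, 30)`.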


## GPU-Accelerated PyTorch Code
Below is the GPU implementation generated from the CPU code:

In [None]:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def module_fn(
    x: torch.Tensor,
    conv_weight: torch.Tensor,
    conv_bias: torch.Tensor,
    bias: torch.Tensor,
) -> torch.Tensor:
    """
    Functional implementation of a neural network layer that:
    1. Applies a 2D convolution with learnable weights and biases
    2. Applies ReLU activation function
    3. Adds a learnable bias term

    Args:
        x (Tensor): Input tensor of shape (N, C_in, H, W)
        conv_weight (Tensor): Convolution weights of shape (C_out, C_in, kernel_size, kernel_size)
        conv_bias (Tensor): Convolution bias of shape (C_out)
        bias (Tensor): Additional bias term of shape (C_out, 1, 1)

    Returns:
        Tensor: Output tensor of shape (N, C_out, H_out, W_out)
    """
    x = F.conv2d(x, conv_weight, conv_bias)
    x = torch.relu(x)
    x = x + bias
    return x

class Model(nn.Module):
    """
    Simple model that performs a convolution, applies ReLU, and adds a bias term.
    """

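    def __init__(self, in_channels, out_channels, kernel_size, bias_shape):
        super().__init__()
        # The temporary Conv2d is used only to initialize the weight and bias.
        # Note: padding=1 is not passed on to F.conv2d in module_fn, which uses
        # its default padding of 0, so a 32x32 input produces a 30x30 output.
        conv = nn.Conv2d(in_channels, out_channels, kernel_size, padding=1)
        self.conv_weight = nn.Parameter(conv.weight)
        self.conv_bias = nn.Parameter(conv.bias)
        self.bias = nn.Parameter(torch.randn(bias_shape) * 0.02)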

    def forward(self, x, fn=module_fn):
        return fn(x, self.conv_weight, self.conv_bias, self.bias)

batch_size = 128
in_channels = 3
out_channels = 16
height, width = 32, 32
kernel_size = 3
bias_shape = (out_channels, 1, 1)

def get_inputs():
    return [torch.randn(batch_size, in_channels, height, width).to(device)]

def get_init_inputs():
    return [in_channels, out_channels, kernel_size, bias_shape]

# Initialize the model and move it to the selected device (GPU if available)
model = Model(*get_init_inputs()).to(device)

# get_inputs() already creates the input on the same device
inputs = get_inputs()

# Forward pass
output = model(*inputs)

# Print the output shape
print(output.shape)
```

### Summary
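This code places the model parameters and the input tensor on the same device with `.to(device)`, which the forward pass requires. Because `device` falls back to `"cpu"` when CUDA is unavailable, the same cell also runs on machines without a GPU. Whether the GPU is actually faster for a layer this small is measured in the performance tests that follow.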
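
One way to check device placement is to assert that every parameter and the output share the input's device. The following is a minimal sketch under stated assumptions: it uses a plain `nn.Conv2d` as a stand-in for the notebook's `Model` so that it runs on its own, and `all_on_device` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

# Same fallback pattern as the notebook: GPU if present, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def all_on_device(module: nn.Module, dev: torch.device) -> bool:
    """Return True if every parameter of `module` is on the same device type as `dev`."""
    return all(p.device.type == dev.type for p in module.parameters())


layer = nn.Conv2d(3, 16, 3).to(device)        # stand-in for the notebook's Model
x = torch.randn(2, 3, 32, 32, device=device)  # created directly on the device

with torch.no_grad():
    y = layer(x)

print(all_on_device(layer, device))   # parameters follow .to(device)
print(y.device.type == device.type)   # output stays on the input's device
print(tuple(y.shape))
```

Creating the input with `device=device` avoids allocating it on the CPU first and copying it afterwards.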

## Generate Test Inputs
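This cell creates the model and one input batch for the Conv2D + ReLU + BiasAdd layer on the CPU. Adjust the shapes defined above as needed for your operation.

In [None]:
import torch
import numpy as np
import time

# module_fn takes (x, conv_weight, conv_bias, bias), so the test inputs are
# a model holding the three parameter tensors plus one input batch.
torch.manual_seed(0)  # reproducible weights and inputs
model_cpu = Model(*get_init_inputs()).cpu()
x_cpu = get_inputs()[0].cpu()  # get_inputs() may return a GPU tensor


## CPU Performance Test
Run and time the CPU code.

In [None]:
with torch.no_grad():
    model_cpu(x_cpu)  # warm-up run, excluded from the timing
    start = time.perf_counter()
    result_cpu = model_cpu(x_cpu)
    cpu_time = time.perf_counter() - start
print(f'CPU result shape: {result_cpu.shape}')
print(f'CPU execution time: {cpu_time:.6f} seconds')


## GPU Performance Test
Run and time the GPU code.

In [None]:
if torch.cuda.is_available():
    # Copy the same weights and input to the GPU so both runs do identical work
    model_gpu = Model(*get_init_inputs()).cuda()
    model_gpu.load_state_dict(model_cpu.state_dict())
    x_gpu = x_cpu.cuda()
    with torch.no_grad():
        model_gpu(x_gpu)  # warm-up: the first call pays CUDA start-up costs
        torch.cuda.synchronize()
        start = time.perf_counter()
        result_gpu = model_gpu(x_gpu)
        torch.cuda.synchronize()  # wait for the kernels before stopping the clock
        gpu_time = time.perf_counter() - start
    print(f'GPU result shape: {result_gpu.shape}')
    print(f'GPU execution time: {gpu_time:.6f} seconds')
    print(f'Speedup: {cpu_time/gpu_time:.2f}x')
    max_diff = (result_gpu.cpu() - result_cpu).abs().max().item()
    print(f'Max abs difference vs CPU: {max_diff:.2e}')
else:
    print('CUDA not available - skipping GPU test')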
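
The timing cells above measure a single call, which can be noisy. A common refinement is to run a few warm-up calls and then average over several timed repetitions, synchronizing before each clock reading. The sketch below shows one way to do that; `time_fn` and its parameters are hypothetical names rather than part of PyTorch, and the workload is a pure-Python stand-in so the block runs without a GPU.

```python
import time
from typing import Callable, Optional


def time_fn(fn: Callable[[], object], warmup: int = 3, iters: int = 10,
            sync: Optional[Callable[[], None]] = None) -> float:
    """Return the mean wall-clock seconds per call of `fn` over `iters` runs."""
    for _ in range(warmup):      # warm-up calls are not timed
        fn()
    if sync is not None:         # drain pending asynchronous work first
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:         # wait for queued work before stopping the clock
        sync()
    return (time.perf_counter() - start) / iters


# Stand-in workload; replace with the model call being measured
mean_s = time_fn(lambda: sum(i * i for i in range(10_000)))
print(f"mean time per call: {mean_s:.6f} s")
```

For a GPU run, this could be called as `time_fn(lambda: model(x), sync=torch.cuda.synchronize)`, assuming `model` and `x` are already on the GPU.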


## Memory Usage Comparison
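Report memory usage for the CPU and GPU runs. The two numbers measure different things (resident memory of the whole Python process versus peak tensor allocations on the device), so treat them as rough indicators rather than a like-for-like comparison.

In [None]:
import psutil
import os

# CPU memory usage: resident set size of the whole Python process,
# which includes the interpreter and PyTorch itself, not only the tensors
process = psutil.Process(os.getpid())
cpu_mem = process.memory_info().rss / 1e6
print(f'CPU memory usage: {cpu_mem:.2f} MB')

# GPU memory usage: peak bytes allocated to tensors during one forward pass
if torch.cuda.is_available():
    # module_fn takes (x, conv_weight, conv_bias, bias); Model supplies the parameters
    model_gpu = Model(*get_init_inputs()).cuda()
    x_gpu = get_inputs()[0].cuda()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()  # measure the peak from this point on
    with torch.no_grad():
        result_gpu = model_gpu(x_gpu)
    torch.cuda.synchronize()
    # The peak includes the model and input already resident on the GPU
    gpu_mem = torch.cuda.max_memory_allocated() / 1e6
    print(f'GPU memory usage: {gpu_mem:.2f} MB')
    del result_gpu
    torch.cuda.empty_cache()
else:
    print('CUDA not available for memory analysis')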
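
As a sanity check on the GPU figure, the size of the main tensors can be worked out by hand, since a float32 tensor takes 4 bytes per element. The sketch below uses the shapes from this notebook and assumes the 30×30 output that `F.conv2d` produces with its default padding of 0; `tensor_mb` is a hypothetical helper. The measured peak may be higher because of intermediate results and allocator rounding.

```python
from math import prod

BYTES_PER_FLOAT32 = 4


def tensor_mb(shape: tuple) -> float:
    """Size in MB (10^6 bytes) of a float32 tensor with the given shape."""
    return prod(shape) * BYTES_PER_FLOAT32 / 1e6


input_mb = tensor_mb((128, 3, 32, 32))    # 393,216 elements
weight_mb = tensor_mb((16, 3, 3, 3))      # 432 elements
output_mb = tensor_mb((128, 16, 30, 30))  # 1,843,200 elements

print(f"input:  {input_mb:.2f} MB")   # input:  1.57 MB
print(f"weight: {weight_mb:.4f} MB")  # weight: 0.0017 MB
print(f"output: {output_mb:.2f} MB")  # output: 7.37 MB
```

The output tensor dominates, at a little under five times the size of the input.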
