# GPU Tutor - 1 - 1: Square Matrix Multiplication

This notebook compares a square matrix multiplication on the CPU and on the GPU in PyTorch, measuring execution time and memory usage.

## Instructions
1. Run each cell in order
2. Make sure GPU is enabled: Runtime → Change runtime type → GPU
3. The notebook will test both CPU and GPU performance using PyTorch

In [1]:
# Install required packages
!pip install torch torchvision torchaudio
!pip install numpy

# Check GPU availability
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("No GPU available - will only run CPU tests")


## CPU-Optimized PyTorch Code
Below is the CPU implementation:

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F


def module_fn(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """
    Performs a single square matrix multiplication (C = A @ B).

    Args:
        A (torch.Tensor): Input matrix A of shape (N, N).
        B (torch.Tensor): Input matrix B of shape (N, N).

    Returns:
        torch.Tensor: Output matrix C of shape (N, N).
    """
    return torch.matmul(A, B)


class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A @ B).
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor, fn=module_fn) -> torch.Tensor:
        return fn(A, B)


N = 2048


def get_inputs():
    A = torch.randn(N, N)
    B = torch.randn(N, N)
    return [A, B]


def get_init_inputs():
    return []  # No special initialization inputs needed
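
`module_fn` computes a matrix product, which in PyTorch is `torch.matmul` (equivalently the `@` operator); the `*` operator is elementwise multiplication and gives a different result. A minimal sketch of the difference, using small illustrative 2 × 2 matrices that are not part of the notebook:

```python
import torch

# Illustrative 2 x 2 inputs (assumed values, not from the original notebook).
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

matrix_product = torch.matmul(A, B)  # row-by-column product, same as A @ B
elementwise = A * B                  # entry-by-entry (Hadamard) product

print(matrix_product)  # rows: [19, 22] and [43, 50]
print(elementwise)     # rows: [5, 12] and [21, 32]
```

For the square inputs used here both operations return an (N, N) tensor, so a mix-up would not raise a shape error; only the values differ.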


## GPU-Accelerated PyTorch Code
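Below is the GPU implementation generated from the CPU code. `module_fn` is unchanged; the differences are that the inputs are created directly on `device` and the output is copied back to the CPU at the end: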

In [2]:
### Full GPU-Accelerated Code
import torch
import torch.nn as nn
import torch.nn.functional as F

# Check if GPU is available and set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def module_fn(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """
    Performs a single square matrix multiplication (C = A @ B).

    Args:
        A (torch.Tensor): Input matrix A of shape (N, N).
        B (torch.Tensor): Input matrix B of shape (N, N).

    Returns:
        torch.Tensor: Output matrix C of shape (N, N).
    """
    return torch.matmul(A, B)

class Model(nn.Module):
    """
    Simple model that performs a single square matrix multiplication (C = A @ B).
    """

    def __init__(self):
        super().__init__()

    def forward(self, A: torch.Tensor, B: torch.Tensor, fn=module_fn) -> torch.Tensor:
        return fn(A, B)

N = 2048

def get_inputs():
    # Initialize inputs and move them to the appropriate device
    A = torch.randn(N, N, device=device)
    B = torch.randn(N, N, device=device)
    return [A, B]

def get_init_inputs():
    return []  # No special initialization inputs needed

# Initialize the model and move it to the device
model = Model().to(device)

# Get inputs
inputs = get_inputs()

# Perform the forward pass
output = model(*inputs)

# If needed, move the output back to CPU (e.g., for further processing or storage)
output = output.cpu()
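
Before comparing speed, it can be worth confirming that the result on `device` agrees numerically with a trusted reference. The sketch below compares a float32 product against a float64 CPU reference. The helper name `max_abs_error`, the size `n = 256`, and the seed are assumptions made for illustration, not part of the notebook:

```python
import torch

# Same device selection as the GPU cell; falls back to the CPU without CUDA.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def max_abs_error(n: int = 256, seed: int = 0) -> float:
    """Largest absolute difference between a float32 matmul on `device`
    and a float64 reference computed on the CPU (hypothetical helper)."""
    gen = torch.Generator().manual_seed(seed)  # reproducible CPU inputs
    A = torch.randn(n, n, generator=gen)
    B = torch.randn(n, n, generator=gen)

    reference = torch.matmul(A.double(), B.double())
    result = torch.matmul(A.to(device), B.to(device)).cpu().double()
    return (result - reference).abs().max().item()


err = max_abs_error()
print(f"max |error| vs float64 reference on {device}: {err:.2e}")
```

A small but nonzero error is expected, since float32 accumulation rounds differently from float64 and the summation order can differ between CPU and GPU kernels.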

## Generate Test Inputs
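This cell creates the CPU test inputs used by the timing and memory cells that follow. Note that it rebinds `N` to 512, so those cells work on 512 × 512 matrices rather than the 2048 × 2048 matrices defined earlier. Adjust as needed for your operation.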

In [3]:
import torch
import numpy as np
import time

# Example for matrix multiplication (adjust as needed)
N = 512
A = torch.randn(N, N)
B = torch.randn(N, N)

## CPU Performance Test
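Run and time the CPU code. This is a single wall-clock measurement with no warm-up run, so treat the result as a rough estimate.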

In [4]:
start = time.time()
result_cpu = module_fn(A, B)  # Adjust arguments as needed
cpu_time = time.time() - start
print(f'CPU result shape: {result_cpu.shape}')
print(f'CPU execution time: {cpu_time:.6f} seconds')

CPU result shape: torch.Size([512, 512])
CPU execution time: 0.040392 seconds
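
A single timed call can be dominated by one-off costs and timer noise. A common remedy is to discard a few warm-up runs and report the median of several repeats. The helper below is a sketch of that idea; `time_matmul`, `warmup`, and `repeats` are names introduced here for illustration and are not part of the notebook:

```python
import statistics
import time

import torch


def time_matmul(A: torch.Tensor, B: torch.Tensor,
                warmup: int = 3, repeats: int = 10) -> float:
    """Median wall-clock time in seconds of torch.matmul(A, B) (hypothetical helper)."""
    on_gpu = A.is_cuda

    def run_once() -> float:
        if on_gpu:
            torch.cuda.synchronize()  # finish pending GPU work before starting the clock
        start = time.perf_counter()
        torch.matmul(A, B)
        if on_gpu:
            torch.cuda.synchronize()  # wait for the kernel itself to finish
        return time.perf_counter() - start

    for _ in range(warmup):  # warm-up runs are timed but discarded
        run_once()
    return statistics.median(run_once() for _ in range(repeats))


N = 512
A = torch.randn(N, N)
B = torch.randn(N, N)

cpu_median = time_matmul(A, B)
print(f"CPU median time: {cpu_median:.6f} seconds")

if torch.cuda.is_available():
    gpu_median = time_matmul(A.cuda(), B.cuda())
    print(f"GPU median time: {gpu_median:.6f} seconds")
```

The median is used rather than the mean so that one slow outlier does not skew the reported time.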


## GPU Performance Test
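Run and time the GPU code. `torch.cuda.synchronize()` is called before and after the multiplication because CUDA kernels are launched asynchronously; without it, the timer could stop before the kernel has finished. The host-to-device copies (`A.cuda()`, `B.cuda()`) happen before the timer starts and are not included in the reported time.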

In [5]:
if torch.cuda.is_available():
    A_gpu = A.cuda()
    B_gpu = B.cuda()
    torch.cuda.synchronize()
    start = time.time()
    result_gpu = module_fn(A_gpu, B_gpu)  # Adjust arguments as needed
    torch.cuda.synchronize()
    gpu_time = time.time() - start
    print(f'GPU result shape: {result_gpu.shape}')
    print(f'GPU execution time: {gpu_time:.6f} seconds')
    print(f'Speedup: {cpu_time/gpu_time:.2f}x')
else:
    print('CUDA not available - skipping GPU test')

GPU result shape: torch.Size([512, 512])
GPU execution time: 0.001342 seconds
Speedup: 30.09x
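
To put the two timings on a common scale, they can be converted to arithmetic throughput. The sketch below assumes the usual convention of counting a dense N × N matrix product as 2·N³ floating-point operations, and uses the times printed by the two timing cells:

```python
# Times printed by the CPU and GPU timing cells (seconds), for N = 512.
N = 512
cpu_time = 0.040392
gpu_time = 0.001342

# Assumed convention: about N^3 multiplies plus N^3 adds per N x N matmul.
flops = 2 * N**3

cpu_gflops = flops / cpu_time / 1e9
gpu_gflops = flops / gpu_time / 1e9
speedup = cpu_time / gpu_time

print(f"FLOPs per multiplication: {flops:,}")
print(f"CPU throughput: {cpu_gflops:.1f} GFLOP/s")
print(f"GPU throughput: {gpu_gflops:.1f} GFLOP/s")
print(f"Speedup from the printed times: {speedup:.2f}x")
```

For these single-run times this works out to roughly 6.6 GFLOP/s on the CPU and about 200 GFLOP/s on the GPU. The recomputed speedup can differ in the last digit from the notebook's printed value because the times used here are already rounded to six decimals.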


## Memory Usage Comparison
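Compare memory usage for CPU and GPU. The two numbers measure different things and are not directly comparable: the CPU figure is the resident set size (RSS) of the whole Python process, including the interpreter and loaded libraries, while `torch.cuda.max_memory_allocated()` reports the peak tensor memory allocated on the GPU since the start of the program (or since the peak statistics were last reset), so it can include allocations made by earlier cells.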

In [6]:
import psutil
import os
# CPU memory usage
process = psutil.Process(os.getpid())
cpu_mem = process.memory_info().rss / 1e6
print(f'CPU memory usage: {cpu_mem:.2f} MB')

# GPU memory usage
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    A_gpu = A.cuda()
    B_gpu = B.cuda()
    torch.cuda.synchronize()
    result_gpu = module_fn(A_gpu, B_gpu)
    torch.cuda.synchronize()
    gpu_mem = torch.cuda.max_memory_allocated() / 1e6
    print(f'GPU memory usage: {gpu_mem:.2f} MB')
    torch.cuda.empty_cache()
else:
    print('CUDA not available for memory analysis')

CPU memory usage: 647.47 MB
GPU memory usage: 58.85 MB
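
To isolate the GPU memory used by a single multiplication, the peak counter can be reset immediately before the operation and compared with a back-of-the-envelope estimate. The sketch below assumes float32 inputs (4 bytes per element) and counts only the three matrices A, B, and C; `expected_matmul_bytes` and `measured_peak_bytes` are hypothetical helper names introduced here:

```python
import torch


def expected_matmul_bytes(n: int, bytes_per_element: int = 4) -> int:
    """Bytes held by A, B and C for an n x n matmul, ignoring any library workspace."""
    return 3 * n * n * bytes_per_element


def measured_peak_bytes(n: int):
    """Peak CUDA tensor memory added by one n x n matmul, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()      # restart the peak counter at the current level
    baseline = torch.cuda.memory_allocated()  # tensors still alive from earlier cells
    A = torch.randn(n, n, device="cuda")
    B = torch.randn(n, n, device="cuda")
    C = torch.matmul(A, B)                    # kept alive so it counts toward the peak
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() - baseline


for n in (512, 2048):
    expected_mb = expected_matmul_bytes(n) / 1e6
    print(f"N={n}: expected {expected_mb:.2f} MB, measured peak bytes: {measured_peak_bytes(n)}")
```

The three-matrix estimate is about 3.15 MB for N = 512 and about 50.33 MB for N = 2048. The latter is close to the 58.85 MB reported above, which is consistent with (though does not prove) the peak having been set by the earlier 2048 × 2048 cell rather than by the 512 × 512 test.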
