### ME2


#### Import necessary packages

This notebook demonstrates custom deep learning inference using CuPy for GPU acceleration.

- **CuPy** is the core library enabling fast, GPU-based array operations and custom neural network layers. All major computations (convolution, pooling, linear layers) are performed on the GPU using CuPy arrays.
- We also use PyTorch and torchvision for data loading and to access pretrained AlexNet weights, but the model itself is implemented from scratch with CuPy.
- Other packages (einops, tqdm, etc.) support tensor algebra and progress monitoring.

The imports below are grouped to highlight standard libraries, third-party utilities, and GPU/CuPy-specific tools.


In [2]:
# Standard library imports
from typing import Tuple

# Third-party imports
import torch
from torch import nn
from torch.utils.data import DataLoader
from einops import einsum, rearrange
from tqdm.auto import tqdm
from torchvision.datasets import ImageNet
from torchvision.models import AlexNet_Weights

# GPU/CuPy imports
import cupy as cp
from cupy.lib.stride_tricks import sliding_window_view



#### Load the weights and biases of AlexNet

In this section, we extract the pretrained weights and biases from torchvision's AlexNet model.

- The weights and biases are loaded using the default configuration from `AlexNet_Weights`.
- These parameters are stored in a dictionary and printed to verify the available keys.
- Our custom layers load these pretrained values for inference, so each custom layer must have the same parameter shapes as the corresponding torchvision AlexNet layer.

This step is essential for initializing all custom layers with correct parameters before running inference on the validation set.


In [3]:
weights_and_biases = AlexNet_Weights.DEFAULT.get_state_dict()
print(weights_and_biases.keys())

odict_keys(['features.0.weight', 'features.0.bias', 'features.3.weight', 'features.3.bias', 'features.6.weight', 'features.6.bias', 'features.8.weight', 'features.8.bias', 'features.10.weight', 'features.10.bias', 'classifier.1.weight', 'classifier.1.bias', 'classifier.4.weight', 'classifier.4.bias', 'classifier.6.weight', 'classifier.6.bias'])
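
The key names follow a `<block>.<index>.<weight|bias>` pattern, where the index is the layer's position inside torchvision's `features` or `classifier` sequence. As a small sketch (the `keys` list is copied from the output above; the `layers` dict is only for illustration), grouping the keys by prefix shows the eight parameterised layers the custom model has to cover:

```python
# Key names copied from the printed state dict above.
keys = [
    "features.0.weight", "features.0.bias",
    "features.3.weight", "features.3.bias",
    "features.6.weight", "features.6.bias",
    "features.8.weight", "features.8.bias",
    "features.10.weight", "features.10.bias",
    "classifier.1.weight", "classifier.1.bias",
    "classifier.4.weight", "classifier.4.bias",
    "classifier.6.weight", "classifier.6.bias",
]

# Group parameter kinds by layer prefix, e.g. "features.0" -> ["weight", "bias"].
layers = {}
for key in keys:
    prefix, kind = key.rsplit(".", 1)
    layers.setdefault(prefix, []).append(kind)

for prefix, kinds in layers.items():
    print(prefix, kinds)
```

Five prefixes start with `features` (the convolutions) and three with `classifier` (the linear layers), which is the set of `weight_loc`/`bias_loc` pairs the custom layers need.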


#### Load the data

In this section, we load the ImageNet validation dataset and prepare it for inference with our custom AlexNet implementation.

- We use torchvision's `ImageNet` class to access the validation split, applying the standard AlexNet preprocessing transforms.
- The DataLoader is configured to efficiently batch and serve images for evaluation.
- A custom `default_collate` function is used to convert PyTorch tensors to CuPy arrays, enabling GPU-accelerated inference with our custom layers.
- `num_workers=0` keeps data loading in the main process, which avoids multiprocessing issues with GPU contexts because the collate function creates CuPy arrays; the batch size is set to 512.

This setup allows us to run high-throughput inference on the validation set using a fully custom, CuPy-based AlexNet model.
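
As a quick sanity check on the loader settings, the number of batches can be worked out by hand. This sketch assumes the standard ImageNet-1k validation split of 50,000 images; the batch size is the one passed to the `DataLoader` in this section.

```python
import math

num_val_images = 50_000   # assumption: standard ImageNet-1k validation split
batch_size = 512          # matches the DataLoader configuration in this section

# Number of batches when the last, smaller batch is kept.
num_batches = math.ceil(num_val_images / batch_size)
last_batch = num_val_images - (num_batches - 1) * batch_size

print(num_batches, last_batch)
```

Under that assumption the loader yields 98 batches, which agrees with the `0/98` shown by the progress bar in the inference cell, and the final batch holds 336 images.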


In [4]:
def default_collate(batch):
    """
    Collate a list of (image, label) samples into one batch on the GPU.
    Each torch tensor is converted to a CuPy array (via NumPy) because the
    custom layers use CuPy functions such as cp.pad, which do not accept tensors.
    """
    imgs, labels = zip(*batch)  # imgs: tuple[torch.Tensor], labels: tuple[int]
    imgs = [cp.asarray(img.numpy(), dtype=cp.float32) for img in imgs]
    imgs = cp.stack(imgs, axis=0)                      # [B,3,224,224]
    labels = cp.asarray(labels, dtype=cp.int64)        # [B]
    return imgs, labels


# ImageNet validation split with the preprocessing bundled with the pretrained weights
imagenet_val = ImageNet(
    root="data/ImageNet1k",
    split="val",
    transform=AlexNet_Weights.IMAGENET1K_V1.transforms()
)

# default_collate returns each batch as an (images, labels) pair of CuPy arrays
val_dataloader = DataLoader(
    imagenet_val,
    batch_size=512,
    shuffle=False,
    num_workers=0,
    collate_fn=default_collate
)

#### Define the custom Conv2d


In [5]:
class PatchMixin:
    """Mixin for extracting patches from input arrays with a given kernel size and stride.
    Used for convolution and pooling operations on CuPy arrays."""

    def __init__(self, kernel_size: int, stride: int) -> None:
        """Initialize patch extraction parameters.
        Args:
            kernel_size (int): Size of the square kernel.
            stride (int): Stride for patch extraction.
        """
        super().__init__()
        self.kernel_size = kernel_size
        self.stride = stride

    def _patch_with_stride(self, x_pad: cp.ndarray) -> cp.ndarray:
        """Extract k x k patches from the input array with the given stride.
        Args:
            x_pad (cp.ndarray): Input array of shape (b, c, h, w).
        Returns:
            cp.ndarray: Array of shape (b, c, out_h, out_w, k, k) containing the extracted patches,
                where out_h = (h - k) // stride + 1 and out_w = (w - k) // stride + 1.
        """
        windows = sliding_window_view(  # type: ignore
            x_pad,
            window_shape=(self.kernel_size, self.kernel_size),
            axis=(-2, -1)  # type: ignore
        )
        return windows[:, :, ::self.stride, ::self.stride, :, :]


class WeightsAndBiasMixin:
    """Mixin for loading pretrained weights and biases from a state dict and converting to CuPy arrays."""

    def __init__(self, *args, **kwargs) -> None:
        """Initialize the mixin (calls super)."""
        super().__init__(*args, **kwargs)

    def init_weights_and_bias(self, weight_loc: str, bias_loc: str) -> Tuple[cp.ndarray, cp.ndarray]:
        """Load weights and biases from the state dict and convert to CuPy arrays.
        Args:
            weight_loc (str): Key for weights in the state dict.
            bias_loc (str): Key for biases in the state dict.
        Returns:
            Tuple[cp.ndarray, cp.ndarray]: CuPy arrays for weights and biases.
        """
        weight_np = weights_and_biases[weight_loc].detach().cpu().numpy()
        bias_np = weights_and_biases[bias_loc].detach().cpu().numpy()
        weight = cp.asarray(weight_np)
        bias = cp.asarray(bias_np)
        return weight, bias


class CustomConv2d(WeightsAndBiasMixin, PatchMixin, nn.Module):
    """
    Custom 2D Convolution layer using CuPy, Einops, and einsum.
    Performs convolution on CuPy arrays for GPU acceleration.
    Limited shape flexibility for demonstration/inference.
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: int,
        stride: int = 1,
        padding: int = 0,
        weight_loc: str = '',
        bias_loc: str = '',
    ) -> None:
        """Initialize the convolution layer with parameters and pretrained weights/biases.
        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            kernel_size (int): Size of the convolution kernel.
            stride (int, optional): Stride for convolution. Defaults to 1.
            padding (int, optional): Padding for input. Defaults to 0.
            weight_loc (str, optional): Key for weights in state dict.
            bias_loc (str, optional): Key for biases in state dict.
        """
        super().__init__(kernel_size, stride)
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding

        self.weight, self.bias = self.init_weights_and_bias(
            weight_loc, bias_loc)
        self.reset_parameters()

    def reset_parameters(self) -> None:
        """No-op for pretrained weights. Provided for API compatibility."""
        pass

    def _apply_padding(self, x: cp.ndarray) -> cp.ndarray:
        """Apply zero padding to the input array if required.
        Args:
            x (cp.ndarray): Input array of shape (b, c, h, w).
        Returns:
            cp.ndarray: Padded input array.
        """
        if self.padding == 0:
            return x
        return cp.pad(
            x,
            pad_width=((0, 0), (0, 0), (self.padding, self.padding),
                       (self.padding, self.padding)),
            mode='constant',
            constant_values=0,
        )

    def forward(self, x: cp.ndarray) -> cp.ndarray:
        """Perform the forward pass of the convolution layer.
        Args:
            x (cp.ndarray): Input array of shape (b, c, h, w).
        Returns:
            cp.ndarray: Output array after convolution and bias addition.
        """
        x_pad = self._apply_padding(x)
        patched_windows = self._patch_with_stride(x_pad)
        # Axis names follow the (b, c, h, w) layout: contract channels and kernel offsets.
        pre_activation = einsum(
            patched_windows, self.weight, 'b c h w kh kw, o c kh kw -> b o h w')
        return pre_activation + self.bias[None, :, None, None]  # type: ignore
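
To make the pad, patch, and contract steps concrete without a GPU, here is a minimal pure-Python sketch for a single channel. It is not the CuPy implementation above: `conv2d_single` is a hypothetical helper, plain nested lists stand in for arrays, and the bias term is left out.

```python
def conv2d_single(image, kernel, stride=1, padding=0):
    """Single-channel 2D cross-correlation on nested lists (illustrative only)."""
    k = len(kernel)
    h, w = len(image), len(image[0])

    # Zero-pad the input on all four sides.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    out_h = (h + 2 * padding - k) // stride + 1
    out_w = (w + 2 * padding - k) // stride + 1

    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Multiply one k x k patch by the kernel and sum, as the einsum does per output pixel.
            acc = sum(
                padded[top + di][left + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d_single(image, kernel))                       # no padding, stride 1
print(conv2d_single(image, kernel, stride=2, padding=1))  # padded, stride 2
```

The multi-channel layer does the same thing per output channel, with an extra sum over input channels.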

#### Custom ReLU


In [6]:
class CustomReLU(nn.Module):
    """
    Custom ReLU activation layer using CuPy for GPU acceleration.

    This layer applies the Rectified Linear Unit (ReLU) function to its input, setting all negative values to zero.
    All operations are performed on CuPy arrays for efficient GPU computation.
    Compatible with PyTorch's nn.Module interface for easy integration into custom models.
    """

    def forward(self, x: cp.ndarray) -> cp.ndarray:
        """
        Apply the ReLU activation function to the input tensor.

        Args:
            x (cp.ndarray): Input tensor (CuPy array) of any shape.

        Returns:
            cp.ndarray: Output tensor with negative values set to zero, same shape as input.
        """
        return cp.maximum(x, 0.0)

#### Custom MaxPool2d

This section introduces our custom Max Pooling layer, implemented from scratch for GPU acceleration using CuPy.


In [7]:
class CustomMaxPool2d(PatchMixin, nn.Module):
    """
    Custom Max Pooling layer using CuPy for GPU acceleration.

    This layer performs max pooling by extracting patches from the input tensor and selecting the maximum value from each patch.
    All operations are performed on CuPy arrays for efficient GPU computation.
    Compatible with PyTorch's nn.Module interface for easy integration into custom models.
    """

    def __init__(self, kernel_size: int, stride: int) -> None:
        super().__init__(kernel_size, stride)

    def forward(self, x: cp.ndarray) -> cp.ndarray:
        """
        Apply max pooling to the input tensor using the specified kernel size and stride.

        Args:
            x (cp.ndarray): Input tensor (CuPy array) of shape (batch, channels, height, width).

        Returns:
            cp.ndarray: Output tensor after max pooling, reduced spatial dimensions.
        """
        patched_windows = self._patch_with_stride(x)
        return cp.max(patched_windows, axis=(-2, -1))
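
The same patch idea can be checked on CPU with plain lists. The sketch below is only an illustration: `max_pool2d_single` is a hypothetical helper that handles one channel of one image and mirrors the kernel size and stride used in the model.

```python
def max_pool2d_single(image, kernel_size, stride):
    """Max pooling over one 2D channel stored as nested lists (illustrative only)."""
    h, w = len(image), len(image[0])
    out_h = (h - kernel_size) // stride + 1
    out_w = (w - kernel_size) // stride + 1
    return [
        [
            # Maximum over the kernel_size x kernel_size window starting at (i*stride, j*stride).
            max(
                image[i * stride + di][j * stride + dj]
                for di in range(kernel_size)
                for dj in range(kernel_size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[r * 5 + c for c in range(5)] for r in range(5)]  # values 0..24, row-major
pooled = max_pool2d_single(image, kernel_size=3, stride=2)
print(pooled)
```

A 5 x 5 input with a 3 x 3 window and stride 2 gives a 2 x 2 output, and neighbouring windows overlap by one row or column.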

#### Custom Adaptive AvgPool2d

This section introduces our custom Adaptive Average Pooling layer, implemented for GPU acceleration using CuPy.


In [8]:
class CustomAdaptiveAvgPool2d(nn.Module):
    """
    This layer reduces the spatial dimensions of the input tensor to a fixed output size by averaging over dynamically computed regions.
    All operations are performed on CuPy arrays for efficient GPU computation.
    Compatible with PyTorch's nn.Module interface for easy integration into custom models.
    """

    def __init__(self, output_size: int | tuple[int, int]) -> None:
        super().__init__()
        if isinstance(output_size, int):
            self.output_size = (output_size, output_size)
        else:
            self.output_size = output_size

    def forward(self, x: cp.ndarray) -> cp.ndarray:
        """
        Apply adaptive average pooling to the input tensor to produce a fixed output size.

        Args:
            x (cp.ndarray): Input tensor (CuPy array) of shape (batch, channels, height, width).

        Returns:
            cp.ndarray: Output tensor of shape (batch, channels, output_height, output_width) after adaptive average pooling.
        """
        b, c, h, w = x.shape
        out_h, out_w = self.output_size
        out = cp.zeros((b, c, out_h, out_w), dtype=x.dtype)
        for i in range(out_h):
            # Region bounds are plain Python ints, so index arithmetic stays on the host.
            h_start = (i * h) // out_h
            h_end = -(-(i + 1) * h // out_h)  # ceiling division
            for j in range(out_w):
                w_start = (j * w) // out_w
                w_end = -(-(j + 1) * w // out_w)  # ceiling division
                region = x[:, :, h_start:h_end, w_start:w_end]
                out[:, :, i, j] = region.mean(axis=(-2, -1))
        return out
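
The region boundaries used by the loop run from floor(i * size / out) to ceil((i + 1) * size / out) along each axis. A small sketch of that rule, using a hypothetical helper `adaptive_bins` and integer arithmetic only:

```python
def adaptive_bins(size, out_size):
    """Return the (start, end) index pairs that adaptive pooling averages over."""
    bins = []
    for i in range(out_size):
        start = (i * size) // out_size              # floor(i * size / out_size)
        end = -((-(i + 1) * size) // out_size)      # ceil((i + 1) * size / out_size)
        bins.append((start, end))
    return bins


print(adaptive_bins(6, 6))  # input already matches the target size
print(adaptive_bins(7, 3))  # input not divisible by the target size
```

When the input size equals the target size, every bin covers exactly one cell, so the layer returns its input unchanged. That is the case for 224 x 224 images in this notebook, where the last max pool already produces a 6 x 6 map. When the sizes do not divide evenly, neighbouring bins overlap.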

#### Custom Linear Module


In [9]:
class EinopsLinear(WeightsAndBiasMixin, nn.Module):
    """
    Custom Linear (fully connected) layer using CuPy and einops for GPU-accelerated inference.

    This layer performs matrix multiplication between input features and learned weights, followed by bias addition. All operations are performed on CuPy arrays for efficient GPU computation.
    The implementation uses einops's einsum for concise and flexible tensor algebra, making the code readable and efficient.
    Pretrained weights and biases from AlexNet are loaded and used to initialize the layer, ensuring compatibility with the original architecture.
    Compatible with PyTorch's nn.Module interface for easy integration into custom models.
    """

    def __init__(self, in_features: int, out_features: int, weight_loc: str, bias_loc: str):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight, self.bias = self.init_weights_and_bias(
            weight_loc, bias_loc)

    def forward(self, x: cp.ndarray) -> cp.ndarray:
        """
        Perform the forward pass of the linear layer using einsum for matrix multiplication and bias addition.

        Args:
            x (cp.ndarray): Input tensor of shape (batch, in_features), CuPy array.

        Returns:
            cp.ndarray: Output tensor of shape (batch, out_features) after linear transformation and bias addition.
        """
        y = einsum(x, self.weight, "b i, o i -> b o")
        if self.bias is not None:
            y = y + self.bias
        return y
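
The einsum pattern `"b i, o i -> b o"` is the usual `x @ W.T`: each output feature is the dot product of the input row with one row of the weight matrix. A pure-Python sketch with made-up numbers (the helper `linear` and the values are illustrative, not taken from AlexNet):

```python
def linear(x, weight, bias):
    """y[b][o] = sum_i x[b][i] * weight[o][i] + bias[o], on nested lists."""
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in x
    ]


x = [[1.0, 2.0]]                                # shape (batch=1, in_features=2)
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # shape (out_features=3, in_features=2)
bias = [0.5, -0.5, 0.0]

y = linear(x, weight, bias)
print(y)
```

The weight is stored as (out_features, in_features), matching the layout of the pretrained `classifier.*.weight` tensors, which is why the pattern sums over the second axis of both operands.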

#### AlexNet Class Implementation


In [10]:
class AlexNet(nn.Module):
    """
    Custom AlexNet implementation using CuPy for GPU-accelerated inference.

    This class replicates the original AlexNet architecture with all major layers (convolution, pooling, linear, activation) implemented from scratch using CuPy arrays for fast GPU computation.
    The model loads pretrained weights and biases from torchvision's AlexNet, so its predictions are intended to match the reference model at inference time.
    The pipeline consists of feature extraction (convolutions, activations, pooling), adaptive average pooling, and a classifier (fully connected layers).
    All layers are compatible with PyTorch's nn.Module interface, but computations are performed on CuPy arrays.
    """

    def __init__(self, num_classes: int = 1000) -> None:
        super().__init__()
        self.features = nn.Sequential(
            CustomConv2d(3, 64, kernel_size=11, stride=4, padding=2,
                         weight_loc='features.0.weight', bias_loc='features.0.bias'),
            CustomReLU(),
            CustomMaxPool2d(kernel_size=3, stride=2),
            CustomConv2d(64, 192, kernel_size=5, padding=2,
                         weight_loc='features.3.weight', bias_loc='features.3.bias'),
            CustomReLU(),
            CustomMaxPool2d(kernel_size=3, stride=2),
            CustomConv2d(192, 384, kernel_size=3, padding=1,
                         weight_loc='features.6.weight', bias_loc='features.6.bias'),
            CustomReLU(),
            CustomConv2d(384, 256, kernel_size=3, padding=1,
                         weight_loc='features.8.weight', bias_loc='features.8.bias'),
            CustomReLU(),
            CustomConv2d(256, 256, kernel_size=3, padding=1,
                         weight_loc='features.10.weight', bias_loc='features.10.bias'),
            CustomReLU(),
            CustomMaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = CustomAdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            EinopsLinear(256 * 6 * 6, 4096, weight_loc='classifier.1.weight',
                         bias_loc='classifier.1.bias'),
            CustomReLU(),
            EinopsLinear(4096, 4096, weight_loc='classifier.4.weight',
                         bias_loc='classifier.4.bias'),
            CustomReLU(),
            EinopsLinear(4096, num_classes, weight_loc='classifier.6.weight',
                         bias_loc='classifier.6.bias'),
        )

    def forward(self, x: cp.ndarray) -> cp.ndarray:
        """
        Perform the forward pass of the AlexNet model on CuPy arrays.

        Args:
            x (cp.ndarray): Input tensor of shape (batch, 3, 224, 224), CuPy array.

        Returns:
            cp.ndarray: Output tensor of shape (batch, num_classes) with class scores.

        Workflow:
            1. Feature extraction using custom convolution, activation, and pooling layers.
            2. Adaptive average pooling to reduce spatial dimensions.
            3. Flattening and classification using custom linear layers.
        """
        x = self.features(x)
        x = self.avgpool(x)
        b = x.shape[0]
        x = x.reshape(b, -1)  # flatten
        x = self.classifier(x)
        return x
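
The `256 * 6 * 6` input size of the first linear layer follows from the layer settings above. This sketch traces the spatial size of a 224 x 224 input through the feature extractor using the standard output-size formula; the stage names are labels chosen here for readability.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) in the order used by AlexNet.features
stages = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 224
trace = []
for name, kernel, stride, padding in stages:
    size = out_size(size, kernel, stride, padding)
    trace.append((name, size))
    print(f"{name}: {size} x {size}")

flattened = 256 * size * size  # 256 channels after the last convolution
print("flattened features:", flattened)
```

The trace ends at 6 x 6, so flattening the 256-channel map gives 9216 features, the `in_features` of the first `EinopsLinear`.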

#### Inference


In [11]:
model = AlexNet()
model.eval()

total = 0
correct = 0

# Estimate total batches for progress bar length
try:
    total_batches = len(val_dataloader)
except TypeError:
    total_batches = None

for images, labels in tqdm(val_dataloader, total=total_batches, desc="Evaluating", leave=True):
    # ensure CuPy arrays
    if isinstance(images, torch.Tensor):
        images = cp.asarray(images.numpy())
    if isinstance(labels, torch.Tensor):
        labels = cp.asarray(labels.numpy())

    outputs = model(images)  # (b, num_classes) cp.ndarray
    predicted = cp.argmax(outputs, axis=1)
    total += int(labels.shape[0])
    correct += int((predicted == labels).sum())
    running_acc = correct / total if total else 0.0
    tqdm.write(f"Running Acc: {running_acc:.4f}")

accuracy = correct / total if total else 0.0
print(f"Validation Accuracy: {accuracy:.4f}")

Evaluating:   0%|          | 0/98 [00:01<?, ?it/s]


CompileException: In file included from /tmp/comgr-e79635/input/tmp/tmpihyv30dr/74f19f81b237cf4c57963c764cfafe03926df9ca.hsaco.cu:2:
In file included from /home/aleisley/Documents/mengai/ai231/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:40:
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/cstddef:52:10: fatal error: 'stddef.h' file not found
   52 | #include <stddef.h>
      |          ^~~~~~~~~~
1 error generated when compiling for gfx1101.
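
Because the run above stopped before finishing the first batch, the accuracy bookkeeping itself was never exercised. As a CPU-only sketch, the same accumulation can be checked with plain lists; the scores and labels below are made up for illustration, and `argmax` is a small hypothetical helper.

```python
def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


# Two toy batches of (class scores, true labels); values are illustrative only.
batches = [
    ([[0.1, 0.9], [0.8, 0.2]], [1, 1]),
    ([[0.3, 0.7]], [1]),
]

total = 0
correct = 0
for outputs, labels in batches:
    predicted = [argmax(row) for row in outputs]
    total += len(labels)
    correct += sum(p == t for p, t in zip(predicted, labels))

accuracy = correct / total if total else 0.0
print(f"Validation Accuracy: {accuracy:.4f}")
```

Counting correct predictions and samples across batches, rather than averaging per-batch accuracies, keeps the result right even when the last batch is smaller than the others.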

In [1]:
import sys
import os
import cupy as cp
import cupy.cuda.runtime as rt

print("Python:", sys.executable)
print("CuPy:", cp.__version__)
print("is_hip:", rt.is_hip)

try:
    n = rt.getDeviceCount()
    print("device_count:", n)
    for i in range(n):
        props = rt.getDeviceProperties(i)
        print(f"dev {i}:", props["name"].decode())
except Exception as e:
    import traceback
    traceback.print_exc()
    print("ENV HINTS:")
    print("  ROCM_PATH=", os.environ.get("ROCM_PATH"))
    print("  LD_LIBRARY_PATH=", os.environ.get("LD_LIBRARY_PATH"))

Python: /home/aleisley/Documents/mengai/ai231/.venv/bin/python
CuPy: 13.6.0
is_hip: True
device_count: 1
dev 0: AMD Radeon RX 7800 XT


In [2]:
import cupy as cp
# import os
# os.environ["CUPY_HIPRTC_EXTRA_OPTIONS"] = (
#     "--std=c++17 "
#     "--include-path=/usr/lib/gcc/x86_64-redhat-linux/15/include "
#     "--include-path=/usr/lib/gcc/x86_64-redhat-linux/15/include-fixed"
# )

print(cp.arange(10).sum())

CompileException: In file included from /tmp/comgr-301093/input/tmp/tmp450x8gjs/3b99542c99c6c5d523766e96f43cf1bf3bea5a43.hsaco.cu:2:
In file included from /home/aleisley/Documents/mengai/ai231/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:40:
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/cstddef:52:10: fatal error: 'stddef.h' file not found
   52 | #include <stddef.h>
      |          ^~~~~~~~~~
1 error generated when compiling for gfx1101.
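
The compiler message says that `stddef.h` could not be found while including `cstddef` from the GCC 15 headers. One possible first diagnostic step is to check where that header actually exists on disk. The sketch below only lists matching files; it does not change how CuPy compiles kernels. The helper `find_header` and the choice of search roots are assumptions made for illustration.

```python
import os


def find_header(name, roots):
    """Return every path under the given roots whose file name equals `name`."""
    hits = []
    for root in roots:
        if not root or not os.path.isdir(root):
            continue  # skip unset or missing directories
        for dirpath, _dirnames, filenames in os.walk(root):
            if name in filenames:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


# Candidate roots (assumption): the GCC directory named in the error, and ROCM_PATH if set.
candidate_roots = ["/usr/lib/gcc", os.environ.get("ROCM_PATH")]
for path in find_header("stddef.h", candidate_roots):
    print(path)
```

If nothing is printed, the header is missing from those locations; if paths are printed, they show which include directories would need to be visible to the compiler.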

In [3]:
import cupy
# x = cupy.array([1., 2., 3.])
y = cupy.arange(10)

CompileException: In file included from /tmp/comgr-9b34e6/input/tmp/tmpqr2ikmde/69c23a85c53435a76e586b8bcb294b7e7120df4e.hsaco.cu:2:
In file included from /home/aleisley/Documents/mengai/ai231/.venv/lib/python3.12/site-packages/cupy/_core/include/cupy/carray.cuh:40:
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/cstddef:52:10: fatal error: 'stddef.h' file not found
   52 | #include <stddef.h>
      |          ^~~~~~~~~~
1 error generated when compiling for gfx1101.

In [4]:
print("CuPy:", cupy.__version__)
print("is_hip:", rt.is_hip)

CuPy: 13.6.0
is_hip: True
