Okay, so here I try to build AlexNet (from scratch, with torch), kinda breaking it down into smaller blocks for my own understanding.

This is gonna get messy, so if someone else or future me is reading this, apologies in advance.

# The Building Blocks

First let's build the core building blocks of AlexNet.

## Understanding the convolution layer

A conv layer slides a small window over the image and learns 'filters' that detect edges, blobs, curves and so on.

In [None]:
import torch
import torch.nn as nn

In [None]:
conv = nn.Conv2d(
    in_channels=3,   # RGB input
    out_channels=8,  # number of filters
    kernel_size=3,   # filter size (3x3)
    stride=1,        # step size
    padding=1,       # 1-pixel border to keep the spatial size
)

# Dummy image data with batch of 1, 3 channels and 32x32px

x = torch.randn(1,3,32,32)

out = conv(x)

print("output shape:", out.shape)

output shape: torch.Size([1, 8, 32, 32])


What’s happening here?

- in_channels: number of channels in the input (3 for RGB images).

- out_channels: how many filters to learn (more filters → more feature types).

- kernel_size: filter’s width/height.

- stride: how far the filter jumps each step.

- padding: how many pixels you add around the edges to preserve size.
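
To make those parameters a bit more concrete, here's a small sketch that peeks at what the layer actually learns. It rebuilds the same toy layer as above; `n_params` is just a name I picked for the count.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)

# One 3x3 kernel per (output channel, input channel) pair
print("weight shape:", conv.weight.shape)  # (out_channels, in_channels, kH, kW)
print("bias shape:", conv.bias.shape)      # one bias per filter

# Total learnable parameters: 8*3*3*3 weights + 8 biases
n_params = sum(p.numel() for p in conv.parameters())
print("parameters:", n_params)
```

So this layer has 216 weights plus 8 biases, 224 parameters in total. Note that stride and padding change the output size but add no parameters.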


## Adding Non-linearity with ReLU
Without non-linear activation, your network is basically a fancy linear equation. AlexNet uses ReLU after each convolution to let the network model complex shapes.

In [None]:
relu = nn.ReLU()
out = relu(out)

print("output shape after ReLU", out.shape)

output shape after ReLU torch.Size([1, 8, 32, 32])
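
A quick sketch of the "fancy linear equation" point, under my own toy setup (the layer sizes 4, 5, 3 and the names `f1`, `f2`, `merged` are arbitrary): ReLU clamps negatives to zero, and two linear layers stacked without an activation collapse into a single linear map.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# ReLU zeroes out negatives and passes positives through unchanged
t = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(nn.ReLU()(t).tolist())  # [0.0, 0.0, 0.0, 1.5]

# Two linear layers with no activation in between...
f1 = nn.Linear(4, 5, bias=False)
f2 = nn.Linear(5, 3, bias=False)
x = torch.randn(2, 4)
stacked = f2(f1(x))

# ...give the same result as one linear map with weight W2 @ W1
merged = x @ (f2.weight @ f1.weight).T
print(torch.allclose(stacked, merged, atol=1e-5))
```

Putting a ReLU between `f1` and `f2` breaks that equivalence, which is what lets depth actually buy something.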


## Pooling (downsampling)

Pooling shrinks the feature map size. AlexNet uses MaxPooling with a kernel of 3 and stride 2 (so it overlaps).
Pooling keeps only the strongest activations, making the network less sensitive to small shifts.

In [None]:
pool = nn.MaxPool2d(kernel_size=3, stride=2)
out = pool(out)
print("After pooling:", out.shape)

After pooling: torch.Size([1, 8, 15, 15])
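
To see the overlap, here's a tiny sketch on a 5×5 map I made up (values 0 to 24), small enough to check by hand. `fmap` and `pooled` are just illustration names.

```python
import torch
import torch.nn as nn

# 5x5 map holding the values 0..24, shaped (batch, channels, H, W)
fmap = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)

pool = nn.MaxPool2d(kernel_size=3, stride=2)
pooled = pool(fmap)

# Windows cover rows/cols 0-2 and 2-4, so row 2 and column 2 are shared
print(pooled.squeeze().tolist())  # [[12.0, 14.0], [22.0, 24.0]]
```

Each output is the max of one 3×3 window, and because the stride (2) is smaller than the kernel (3), neighbouring windows share a row or column.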


Conv → ReLU → Pool: this is the fundamental block of AlexNet.
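
The three steps can be bundled into one module. This is only a sketch using the toy sizes from the cells above (8 filters, 32×32 input), not AlexNet's real sizes; `block` is my own name for it.

```python
import torch
import torch.nn as nn

# Conv -> ReLU -> Pool, chained in order
block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.randn(1, 3, 32, 32)
out = block(x)
print("block output:", out.shape)  # torch.Size([1, 8, 15, 15])
```

`nn.Sequential` just calls the layers one after another, so the result matches running the three cells by hand.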

# Understanding the Conv stages

We’ll build the first two stages following the paper: a big 11×11 stride-4 opener, overlapping max-pool, then a 5×5 block and another pool.

In [1]:
import torch
import torch.nn as nn

In [7]:
# Create a dummy image: batch of 1, 3 channels, height and width of 224

x = torch.zeros(1, 3, 224, 224)

In [8]:
conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)
relu1 = nn.ReLU(inplace=True)
lrn1 = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0) # historical AlexNet quirk for normalization
pool1 = nn.MaxPool2d(kernel_size=3, stride=2) # overlapping pooling

In [9]:
conv2 = nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2)
relu2 = nn.ReLU(inplace=True)
lrn2 = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)
pool2 = nn.MaxPool2d(kernel_size=3, stride=2)

In [10]:
with torch.no_grad():
    y = conv1(x)
    print("conv1:", y.shape)
    y = relu1(y)  # ReLU follows conv1 as well; it does not change the shape
    y = lrn1(y)
    y = pool1(y)
    print("pool1:", y.shape)
    y = conv2(y)
    print("conv2:", y.shape)
    y = relu2(y)
    y = lrn2(y)
    y = pool2(y)
    print("pool2:", y.shape)

conv1: torch.Size([1, 96, 55, 55])
pool1: torch.Size([1, 96, 27, 27])
conv2: torch.Size([1, 256, 27, 27])
pool2: torch.Size([1, 256, 13, 13])


conv1 turns 224×224 into 55×55 because stride 4 jumps four pixels at a time. The formula is out = floor((W − K + 2P)/S) + 1 = floor((224 − 11 + 4)/4) + 1 = floor(54.25) + 1 = 55.

pool1 with 3×3, stride-2 reduces 55→27 ((55−3)/2 + 1 = 27).

conv2 keeps it 27×27 (padding 2 on a 5×5 kernel preserves size).

pool2 reduces 27→13 ((27−3)/2 + 1 = 13).
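
The same arithmetic as a few lines of plain Python, so the sizes can be checked without building any layers. `out_size` and `stages` are helper names I made up; the kernel, stride and padding values are the ones used in the cells above.

```python
def out_size(w, k, s=1, p=0):
    """Spatial output size of a conv or pool layer: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

# (name, kernel, stride, padding) for the first two stages
stages = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
]

w = 224
trace = []
for name, k, s, p in stages:
    w = out_size(w, k, s, p)
    trace.append(w)
    print(f"{name}: {w}x{w}")
```

This gives 55, 27, 27, 13, matching the shapes printed by the real layers.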

The 3rd stage consists of 3 more conv layers with ReLU after each, and ends with a max-pooling layer. (We will skip it here and use it in the next part when building the entire structure.)

After the convolution layers, AlexNet uses 3 fully connected (FC) layers with dropout and ReLU, ending in 1000 outputs (the number of classes).

Next we will build the full module class: define the layers, write the forward pass, and then a training loop.
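
Just as a shape check before the full build, here's a rough sketch of what the third stage does to pool2's 13×13 output and how big the flattened vector going into the first FC layer ends up. Assumptions on my part: the channel counts (384, 384, 256) and the 3×3 kernels with padding 1 are taken from the paper rather than from anything built above, and `stage3` is just my name for the group.

```python
import torch
import torch.nn as nn

# Three 3x3 convs (each followed by ReLU), then one overlapping max-pool
stage3 = nn.Sequential(
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

with torch.no_grad():
    y = stage3(torch.zeros(1, 256, 13, 13))  # same shape as pool2's output
    flat = torch.flatten(y, start_dim=1)     # keep the batch dim, flatten the rest

print("stage 3:", y.shape)
print("flattened:", flat.shape)
```

The padded 3×3 convs keep the map at 13×13, the pool takes it to 6×6, and 256 × 6 × 6 = 9216 is the input width the first FC layer has to accept.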

# Building the AlexNet module