## Dataset

### Dataset Loading and Preprocessing

This section handles the loading and initial preparation of the MNIST dataset. MNIST contains 28x28 pixel grayscale images of handwritten digits (0-9).

**Key Operations:**

1.  **Data Loading (`load_mnist_images`, `load_mnist_labels`):**
    *   These functions read the MNIST dataset from its specific binary file format. `struct.unpack` is used to parse metadata (like image dimensions and count) from the file headers.
    *   Image data is read as a flat byte array and then reshaped to `(num_images, rows, cols)`.

2.  **One-Hot Encoding Labels:**
    *   For multi-class classification with a softmax output and categorical cross-entropy loss, integer labels (e.g., digit `5`) are converted into a one-hot vector format (e.g., `[0,0,0,0,0,1,0,0,0,0]` for 10 classes).
    *   This represents the true label as a probability distribution where the correct class has a probability of 1.
    *   **Formula:** For a label $y_i$ and $K$ classes, the one-hot vector $Y_i$ has elements $Y_{i,j}$:
        $$ Y_{i,j} = \begin{cases} 1 & \text{if } j = y_i \\ 0 & \text{otherwise} \end{cases} $$
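
The one-hot formula above can also be written without an explicit Python loop. The following is a minimal sketch, assuming the labels are integers in `[0, K)`; the helper name `one_hot` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row j of the identity matrix is exactly the one-hot vector for class j,
    # so indexing the identity with the labels builds all rows at once.
    return np.eye(num_classes, dtype=np.float64)[labels]

example = one_hot(np.array([5, 0, 9], dtype=np.uint8))
print(example[0])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```

Indexing an identity matrix yields the same result as filling a zero array row by row, which is what the loading cell does.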

In [1]:
import numpy as np
import struct
import matplotlib.pyplot as plt
from tqdm import tqdm
import math
def load_mnist_images(filename):
    with open(filename, 'rb') as f:
        # Read the header: magic number, number of images, rows, columns
        magic, num_images, rows, cols = struct.unpack(">IIII", f.read(16))
        # Read all pixels and convert them to a NumPy array
        images = np.frombuffer(f.read(), dtype=np.uint8)
        # Reshape the array to (num_images, rows, cols)
        images = images.reshape((num_images, rows, cols))
    return images

def load_mnist_labels(filename):
    with open(filename, 'rb') as f:
        magic, num_labels = struct.unpack(">II", f.read(8))
        labels = np.frombuffer(f.read(), dtype=np.uint8)
    return labels

#-------------- Data Extraction ---------------------------
train_images = load_mnist_images('MNIST/train-images-idx3-ubyte')
train_labels = load_mnist_labels('MNIST/train-labels-idx1-ubyte')

test_images = load_mnist_images('MNIST/t10k-images.idx3-ubyte')
test_labels = load_mnist_labels('MNIST/t10k-labels.idx1-ubyte')

#--------------- Train data manipulation ------------------
print(train_images.shape)  # (60000, 28, 28)
print(train_labels.shape)  # (60000,)

one_hot_labels = np.zeros(train_labels.shape[0]*10).reshape((train_labels.shape[0]),10)
for i in range(len(train_labels)):
    one_hot_labels[i][train_labels[i]]=1
train_labels = one_hot_labels

print(train_labels.shape) # (60000,10)

#--------------- Test data manipulation -------------------
print(test_images.shape)  # (10000, 28, 28)
print(test_labels.shape)  # (10000,)

one_hot_labels = np.zeros(test_labels.shape[0]*10).reshape((test_labels.shape[0]),10)
for i in range(len(test_labels)):
    one_hot_labels[i][test_labels[i]]=1
test_labels = one_hot_labels

print(test_labels.shape) # (10000,10)

(60000, 28, 28)
(60000,)
(60000, 10)
(10000, 28, 28)
(10000,)
(10000, 10)
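
The loaders above read the header but do not validate it. As a hedged sketch of how the IDX header could be checked, the example below builds a tiny synthetic image file in memory and parses it. The function name `parse_idx_images` and the synthetic data are assumptions for illustration; the magic numbers 2051 (images) and 2049 (labels) are the ones used by the MNIST IDX files.

```python
import io
import struct
import numpy as np

IMAGES_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABELS_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension

def parse_idx_images(stream):
    """Parse an IDX image stream and validate its header (illustrative helper)."""
    # Four big-endian unsigned 32-bit integers: magic, count, rows, cols
    magic, num_images, rows, cols = struct.unpack(">IIII", stream.read(16))
    if magic != IMAGES_MAGIC:
        raise ValueError(f"unexpected magic number {magic}, expected {IMAGES_MAGIC}")
    pixels = np.frombuffer(stream.read(), dtype=np.uint8)
    if pixels.size != num_images * rows * cols:
        raise ValueError("pixel count does not match the header")
    return pixels.reshape(num_images, rows, cols)

# Synthetic file: 2 images of 2x3 pixels with values 0..11
header = struct.pack(">IIII", IMAGES_MAGIC, 2, 2, 3)
payload = bytes(range(12))
images = parse_idx_images(io.BytesIO(header + payload))
print(images.shape)  # (2, 2, 3)
```

Checking the magic number catches the common mistake of passing a label file to the image loader, which would otherwise fail later with a less obvious reshape error.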


## PyTorch CNN Model Architecture

A Convolutional Neural Network (CNN) is defined using PyTorch's `nn.Module` to serve as a reference and source of pre-trained weights.

**Architecture (defined as `SimpleCNN` class):**

1.  **Conv1 + ReLU1:** `nn.Conv2d(in_channels=1, out_channels=32, kernel_size=2, stride=2, padding=0)`
    *   Input: `(B, 1, 28, 28)`
    *   Output dimension: $O = \lfloor \frac{(I - K + 2P)}{S} \rfloor + 1 = \lfloor \frac{(28 - 2 + 0)}{2} \rfloor + 1 = 14$
    *   Output: `(B, 32, 14, 14)`

2.  **Conv2 + ReLU2:** `nn.Conv2d(in_channels=32, out_channels=64, kernel_size=2, stride=2, padding=1)`
    *   Input: `(B, 32, 14, 14)`
    *   Padded input dim: $14 + 2*1 = 16$
    *   Output dimension: $O = \lfloor \frac{(16 - 2 + 0)}{2} \rfloor + 1 = 8$
    *   Output: `(B, 64, 8, 8)`

3.  **Conv3 + ReLU3:** `nn.Conv2d(in_channels=64, out_channels=128, kernel_size=2, stride=2, padding=0)`
    *   Input: `(B, 64, 8, 8)`
    *   Output dimension: $O = \lfloor \frac{(8 - 2 + 0)}{2} \rfloor + 1 = 4$
    *   Output: `(B, 128, 4, 4)`

4.  **Flatten:** `nn.Flatten()`
    *   Input: `(B, 128, 4, 4)`
    *   Output: `(B, 128 * 4 * 4)` which is `(B, 2048)`

5.  **FC1 + ReLU4:** `nn.Linear(in_features=2048, out_features=250)`
    *   Input: `(B, 2048)`
    *   Operation: $Y = XW^T + b$
    *   Output: `(B, 250)`

6.  **FC2:** `nn.Linear(in_features=250, out_features=10)` (Output layer)
    *   Input: `(B, 250)`
    *   Output: `(B, 10)` (logits for 10 classes)

<figure style="text-align:center;">
    <img src="cnn.png" style="border-radius:20px;" height="300">
    <figcaption>CNN Architecture (B: Batch size)</figcaption>
</figure>
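
The output sizes listed in the architecture can be reproduced with a few lines of arithmetic. This is a small sketch of the formula $O = \lfloor (I - K + 2P)/S \rfloor + 1$ applied to the three convolutions; the helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size - kernel + 2 * padding) // stride + 1

size = 28
sizes = []
# (kernel, stride, padding) for conv1, conv2, conv3 as declared in SimpleCNN
for kernel, stride, padding in [(2, 2, 0), (2, 2, 1), (2, 2, 0)]:
    size = conv_output_size(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)              # [14, 8, 4]
print(128 * size * size)  # 2048, the in_features of FC1
```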

### Model and Dataset Declaration with Training

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import time
from tqdm import tqdm

# 1.------------------ CNN declaration -------------------

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()

        # --------- Convolutional Layers ------------
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=2, stride=2, padding=0)
        self.relu1 = nn.ReLU()

        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=2, stride=2, padding=1)
        self.relu2 = nn.ReLU()

        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=2, stride=2, padding=0)
        self.relu3 = nn.ReLU()
        # ---------- Flatten to become MLP's input -----------
        self.flatten = nn.Flatten()
        fc_input_size = 128 * 4 * 4
        # ---------- Multi Layer Perceptron ---------------
        # Only one hidden layer for classification
        self.fc1 = nn.Linear(in_features=fc_input_size, out_features=250)
        self.relu4 = nn.ReLU()
        self.fc2 = nn.Linear(in_features=250, out_features=num_classes)

    def forward(self, x):
        # First convolution: from 1x1x28x28 to 1x32x14x14
        x = self.conv1(x)
        x = self.relu1(x)
        # Second Convolution: from 1x32x14x14 to 1x64x8x8
        x = self.conv2(x)
        x = self.relu2(x)
        # Third Convolution: from 1x64x8x8 to 1x128x4x4
        x = self.conv3(x)
        x = self.relu3(x)
        # Flatten
        x = self.flatten(x)
        # MLP
        x = self.fc1(x)
        x = self.relu4(x)
        x = self.fc2(x)

        return x

# # 2.------------------ CNN's Dataset declaration ----------------------

# class CNNDataset(Dataset):
#     def __init__(self, digits, labels, transform=None):
#         assert len(digits) == len(labels), "Number of digits and labels doesn't match"
#         self.digits = digits
#         self.labels = labels

#     def __len__(self):
#         return len(self.digits)

#     def __getitem__(self, idx):
#         digit = self.digits[idx]
#         label = self.labels[idx]
#         digit = digit.unsqueeze(0) # Needed operation to add the dimension of greyscale images (28,28) -> (1,28,28)
#         return digit, label

# tri = torch.from_numpy(train_images).float() / 255
# trl = torch.from_numpy(train_labels).float()
# tsi = torch.from_numpy(test_images).float() / 255
# tsl = torch.from_numpy(test_labels).float()

# train_dataset = CNNDataset(tri,trl)
# test_dataset = CNNDataset(tsi,tsl)

# batch_size = 128
# train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# # 3.------ Training Setup ---------------

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# print(f"device: {device}")

# model = SimpleCNN(num_classes=10).to(device)

# # Loss definition
# criterion = nn.CrossEntropyLoss() 

# # Optimisation definition
# learning_rate = 0.001
# optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# num_epochs = 5 

# # 4.------- cycle training ------

# print("\nStarting Training...")
# for epoch in range(num_epochs):

#     model.train() 

#     running_loss = 0.0
#     start_time = time.time()
#     # tqdm provides a progress bar
#     progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}", leave=False)

#     for inputs, labels in progress_bar:

#         # move data on the device
#         inputs, labels = inputs.to(device), labels.to(device)

#         # make all gradients zero to avoid learning on gradients of previous steps
#         optimizer.zero_grad()

#         # Forward pass
#         outputs = model(inputs) 
#         # loss computation
#         loss = criterion(outputs, labels)

#         # Backward pass: compute the gradients
#         loss.backward()

#         # Weights update
#         optimizer.step()

#         # Update the loss
#         running_loss += loss.item() * inputs.size(0) # multiply for batch size to obtain the correct mean

#         # Update the progress bar
#         progress_bar.set_postfix(loss=f"{loss.item():.4f}")

#     # Epochs' mean loss computation
#     epoch_loss = running_loss / len(train_loader.dataset)
#     epoch_time = time.time() - start_time

#     print(f"Epoch {epoch+1}/{num_epochs} - Time: {epoch_time:.2f}s - Training Loss: {epoch_loss:.4f}")

#     # --- Test evaluation (after every epoch) ---
#     model.eval()
#     test_loss = 0.0
#     correct = 0
#     total = 0

#     with torch.no_grad(): # Disable gradient computation (no weight update happens during evaluation)
#         for inputs, labels in test_loader:
#             inputs, labels = inputs.to(device), labels.to(device)
#             outputs = model(inputs)
#             loss = criterion(outputs, labels)
#             test_loss += loss.item() * inputs.size(0)
#             _, predicted = torch.max(outputs, 1) # Index of the largest logit is the predicted class
#             _, targets = torch.max(labels, 1) # Recover the class index from the one-hot test labels
#             total += targets.size(0)
#             correct += (predicted == targets).sum().item()

#     avg_test_loss = test_loss / len(test_loader.dataset)
#     accuracy = 100 * correct / total
#     print(f"Epoch {epoch+1}/{num_epochs} - Test Loss: {avg_test_loss:.4f} - Test Accuracy: {accuracy:.2f}%")


# print("\nTraining Complete.")
# # about 2 min 9.4 s per epoch with CUDA
# # save the model
# torch.save(model.state_dict(), 'simple_cnn_mnist.pth')

### Weights extraction

### Extracting Pre-trained Weights from PyTorch Model

This section loads weights from a pre-trained PyTorch model (`simple_cnn_mnist.pth`) and converts them into NumPy arrays. These NumPy weights will be used for our custom CNN implementations to ensure consistency for inference comparisons.

**Process:**

1.  **Load State Dictionary:** `model.load_state_dict(torch.load(...))` populates the instantiated `SimpleCNN` model with saved parameters. `map_location=torch.device('cpu')` ensures CPU loading.
2.  **Evaluation Mode:** `model.eval()` sets the model to evaluation mode, which is good practice (disables layers like dropout if present).
3.  **NumPy Conversion:** For each layer, weights (`.weight`) and biases (`.bias`) are extracted:
    *   `.data.detach().numpy()`: Converts PyTorch tensors to NumPy arrays, detaching them from the computation graph.

**Shape Conventions and Transpositions:**

*   **Convolutional Kernels (`k1`, `k2`, `k3`):**
    *   PyTorch: `(out_channels, in_channels, kernel_height, kernel_width)`.
    *   Stored directly in `numpy_weights` with this shape, as our NumPy convolution functions expect this.
*   **Convolutional Biases (`b_conv1`, etc.):**
    *   PyTorch: `(out_channels,)`. Stored directly.
*   **Fully Connected Weights (`w1`, `w2`):**
    *   PyTorch: `(out_features, in_features)`.
    *   For NumPy $XW+b$ where $X$ is `(batch, in_features)`, $W$ must be `(in_features, out_features)`. Thus, PyTorch weights are transposed (`.T`).
*   **Fully Connected Biases (`b1`, `b2`):**
    *   PyTorch: `(out_features,)`. Reshaped to `(1, out_features)` for NumPy broadcasting.
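
The transposition rule for the fully connected layers can be checked with NumPy alone. The sketch below uses random stand-in arrays in PyTorch's `(out_features, in_features)` layout (not the real trained weights) and confirms that `X @ W.T_layout + b` matches the per-element definition $y_{n,o} = \sum_i W_{o,i}\, x_{n,i} + b_o$.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, in_features, out_features = 4, 6, 3

# Stand-ins for model.fc.weight and model.fc.bias (PyTorch layout)
w_torch_layout = rng.standard_normal((out_features, in_features))
b_torch_layout = rng.standard_normal(out_features)
x = rng.standard_normal((batch, in_features))

# Layout used by the NumPy implementation
w_np = w_torch_layout.T               # (in_features, out_features)
b_np = b_torch_layout.reshape(1, -1)  # (1, out_features), broadcasts over the batch

# Definition of a linear layer, written element by element with einsum
y_definition = np.einsum("oi,ni->no", w_torch_layout, x) + b_torch_layout
# Matrix form used after the conversion
y_numpy = x @ w_np + b_np

print(y_numpy.shape)                       # (4, 3)
print(np.allclose(y_definition, y_numpy))  # True
```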

In [3]:
model = SimpleCNN(num_classes=10)
model.load_state_dict(torch.load('simple_cnn_mnist.pth', map_location=torch.device('cpu'),weights_only=True)) # Load on CPU

model.eval() # good practice is to set model in evaluation when you want to extract weights

# --- Parameters Extraction ⛏️ and Numpy Conversion ---

# Weights container
numpy_weights = {}

# Move model on cpu
model.to('cpu')

print("⛏️ Weights and Bias Extraction ⛏️\n")

# Layer Conv1
# PyTorch weight shape: (out_channels, in_channels, kernel_height, kernel_width) -> (32, 1, 2, 2)
# The NumPy convolution functions expect the same layout, so no transpose is needed
pytorch_weights_of_kernels_in_layer_1 = model.conv1.weight.data.detach().numpy()
numpy_weights['k1'] = pytorch_weights_of_kernels_in_layer_1

# PyTorch bias shape: (out_channels,)
numpy_weights['b_conv1'] = model.conv1.bias.data.detach().numpy() # Shape (32,)
print(f"k1: PyTorch Shape={pytorch_weights_of_kernels_in_layer_1.shape}, NumPy Shape={numpy_weights['k1'].shape}")
print(f"b_conv1: NumPy Shape={numpy_weights['b_conv1'].shape}")

# Layer Conv2
# PyTorch weight shape: (64, 32, 2, 2)
# NumPy expected: same layout (64, 32, 2, 2), stored without transposing
pytorch_weights_of_kernels_in_layer_2 = model.conv2.weight.data.detach().numpy()
numpy_weights['k2'] = pytorch_weights_of_kernels_in_layer_2
numpy_weights['b_conv2'] = model.conv2.bias.data.detach().numpy() # Shape (64,)
print(f"k2: PyTorch Shape={pytorch_weights_of_kernels_in_layer_2.shape}, NumPy Shape={numpy_weights['k2'].shape}")
print(f"b_conv2: NumPy Shape={numpy_weights['b_conv2'].shape}")

# Layer Conv3
# PyTorch weight shape: (128, 64, 2, 2)
# NumPy expected: same layout (128, 64, 2, 2), stored without transposing
pytorch_weights_of_kernels_in_layer_3 = model.conv3.weight.data.detach().numpy()
numpy_weights['k3'] = pytorch_weights_of_kernels_in_layer_3
numpy_weights['b_conv3'] = model.conv3.bias.data.detach().numpy() # Shape (128,)
print(f"k3: PyTorch Shape={pytorch_weights_of_kernels_in_layer_3.shape}, NumPy Shape={numpy_weights['k3'].shape}")
print(f"b_conv3: NumPy Shape={numpy_weights['b_conv3'].shape}")

# Layer FC1
# PyTorch weight shape: (out_features, in_features) -> (250, 2048)
# NumPy expected (for input @ W): (in_features, out_features) -> (2048, 250)
pytorch_fc1_layer_weights = model.fc1.weight.data.detach().numpy()
numpy_weights['w1'] = pytorch_fc1_layer_weights.T # Transpose
# PyTorch bias shape: (out_features,) -> (250,)
# NumPy expected (for direct addition): (1, out_features) -> (1, 250)
pytorch_fc1_layer_biases = model.fc1.bias.data.detach().numpy()
numpy_weights['b1'] = pytorch_fc1_layer_biases.reshape(1, -1) # Make it (1, 250)
print(f"w1: PyTorch Shape={pytorch_fc1_layer_weights.shape}, NumPy Shape={numpy_weights['w1'].shape}")
print(f"b1: PyTorch Shape={pytorch_fc1_layer_biases.shape}, NumPy Shape={numpy_weights['b1'].shape}")

# Layer FC2
# PyTorch weight shape: (num_classes, 250) -> (10, 250)
# NumPy expected: (250, num_classes) -> (250, 10)
pytorch_fc2_layer_weights = model.fc2.weight.data.detach().numpy()
numpy_weights['w2'] = pytorch_fc2_layer_weights.T # Transpose
# PyTorch bias shape: (num_classes,) -> (10,)
# NumPy expected: (1, num_classes) -> (1, 10)
pytorch_fc2_layer_biases = model.fc2.bias.data.detach().numpy()
numpy_weights['b2'] = pytorch_fc2_layer_biases.reshape(1, -1) # Make it (1, 10)
print(f"w2: PyTorch Shape={pytorch_fc2_layer_weights.shape}, NumPy Shape={numpy_weights['w2'].shape}")
print(f"b2: PyTorch Shape={pytorch_fc2_layer_biases.shape}, NumPy Shape={numpy_weights['b2'].shape}")

print("\nExtraction complete. Numpy weights are in the dictionary 'numpy_weights'.")

# Access Example:
np_k1 = numpy_weights['k1']
np_b_conv1 = numpy_weights['b_conv1']
np_k2 = numpy_weights['k2']
np_b_conv2 = numpy_weights['b_conv2']
np_k3 = numpy_weights['k3']
np_b_conv3 = numpy_weights['b_conv3']
np_w1 = numpy_weights['w1']
np_b1 = numpy_weights['b1']
np_w2 = numpy_weights['w2']
np_b2 = numpy_weights['b2']




⛏️ Weights and Bias Extraction ⛏️

k1: PyTorch Shape=(32, 1, 2, 2), NumPy Shape=(32, 1, 2, 2)
b_conv1: NumPy Shape=(32,)
k2: PyTorch Shape=(64, 32, 2, 2), NumPy Shape=(64, 32, 2, 2)
b_conv2: NumPy Shape=(64,)
k3: PyTorch Shape=(128, 64, 2, 2), NumPy Shape=(128, 64, 2, 2)
b_conv3: NumPy Shape=(128,)
w1: PyTorch Shape=(250, 2048), NumPy Shape=(2048, 250)
b1: PyTorch Shape=(250,), NumPy Shape=(1, 250)
w2: PyTorch Shape=(10, 250), NumPy Shape=(250, 10)
b2: PyTorch Shape=(10,), NumPy Shape=(1, 10)

Extraction complete. Numpy weights are in the dictionary 'numpy_weights'.


## CNN - NumPy

### Padding

### Zero-Padding in Convolutions

Zero-padding adds a border of zeros around an input image or feature map before convolution. For example:

$$
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{bmatrix}
\quad \xrightarrow{\textcolor{lightgreen}{\textnormal{zero padding}}} \quad
\begin{bmatrix}
\textcolor{lightgreen}{0} & \textcolor{lightgreen}{0} & \textcolor{lightgreen}{0} & \textcolor{lightgreen}{0} & \textcolor{lightgreen}{0} \\
\textcolor{lightgreen}{0} & 1 & 2 & 3 & \textcolor{lightgreen}{0} \\
\textcolor{lightgreen}{0} & 4 & 5 & 6 & \textcolor{lightgreen}{0} \\
\textcolor{lightgreen}{0} & 7 & 8 & 9 & \textcolor{lightgreen}{0} \\
\textcolor{lightgreen}{0} & \textcolor{lightgreen}{0} & \textcolor{lightgreen}{0} & \textcolor{lightgreen}{0} & \textcolor{lightgreen}{0}
\end{bmatrix}
$$

It's crucial for:

1.  **Controlling Output Spatial Dimensions:** Padding can be used to maintain or control the reduction in height/width of feature maps. The output dimension (e.g., height $O_H$) is given by:
    $$ O_H = \left\lfloor \frac{I_H - K_H + 2P_H}{S_H} \right\rfloor + 1 $$
    where $I_H$ is input height, $K_H$ kernel height, $P_H$ padding on one side of height, and $S_H$ stride.
2.  **Improving Feature Extraction at Borders:** Allows the kernel to process edge pixels more effectively.

**`np.pad()` Usage:**
For a 4D tensor (`BATCH, CHANNELS, HEIGHT, WIDTH`), `np.pad(array, ((0,0),(0,0),(p,p),(p,p)))` adds `p` zeros to the top/bottom of HEIGHT and left/right of WIDTH, leaving BATCH and CHANNELS unpadded.

In [4]:
img9 = np.arange(1,37).reshape(2,2,3,3)
pad_img9 = np.pad(img9,((0,0),(0,0),(1,1),(1,1)))
print(img9)
print(pad_img9)

[[[[ 1  2  3]
   [ 4  5  6]
   [ 7  8  9]]

  [[10 11 12]
   [13 14 15]
   [16 17 18]]]


 [[[19 20 21]
   [22 23 24]
   [25 26 27]]

  [[28 29 30]
   [31 32 33]
   [34 35 36]]]]
[[[[ 0  0  0  0  0]
   [ 0  1  2  3  0]
   [ 0  4  5  6  0]
   [ 0  7  8  9  0]
   [ 0  0  0  0  0]]

  [[ 0  0  0  0  0]
   [ 0 10 11 12  0]
   [ 0 13 14 15  0]
   [ 0 16 17 18  0]
   [ 0  0  0  0  0]]]


 [[[ 0  0  0  0  0]
   [ 0 19 20 21  0]
   [ 0 22 23 24  0]
   [ 0 25 26 27  0]
   [ 0  0  0  0  0]]

  [[ 0  0  0  0  0]
   [ 0 28 29 30  0]
   [ 0 31 32 33  0]
   [ 0 34 35 36  0]
   [ 0  0  0  0  0]]]]


### Dilating

### Dilation for Backpropagation (Gradient w.r.t. Input)

The `delateOne` function "dilates" an input tensor by inserting a single row and a single column of zeros between adjacent rows and columns of its last two (spatial) dimensions. In general, dilation inserts $S-1$ zeros for a forward convolution stride $S$; `delateOne` covers the case $S=2$.

**Relevance in Backpropagation for $\frac{\partial L}{\partial X}$:**

This dilation operation is a critical step when computing the gradient of the loss with respect to the input of a convolutional layer ($\frac{\partial L}{\partial X}$), especially if the forward pass utilized a stride $S > 1$. Here is why:
* When a forward convolution uses a stride $S > 1$, it effectively downsamples the input, resulting in an output feature map $Z$ with smaller spatial dimensions.
* To calculate $\frac{\partial L}{\partial X}$, we need to use the gradient flowing back from the subsequent layer, $\frac{\partial L}{\partial Z}$ (where $Z$ is the output of the strided convolution). **Since the original input $X$ has larger spatial dimensions than $Z$, the gradient $\frac{\partial L}{\partial Z}$ must be "upsampled" or "spread out" before it can be convolved with the kernel weights to produce a gradient of the correct shape for $X$.**

**Dilation Step:** This upsampling is achieved by inserting $S-1$ rows and columns of zeros between the elements of $\frac{\partial L}{\partial Z}$. The `delateOne` function in this notebook performs this specific operation for a stride $S=2$ (inserting one row/column of zeros).

After $\frac{\partial L}{\partial Z}$ is dilated to form $\left(\frac{\partial L}{\partial Z}\right)_{dilated}$, it is then typically padded (with $K-1$ zeros where $K$ is the kernel dimension, adjusted for any original padding) and subsequently convolved with the 180-degree rotated (or flipped) kernel ($W_{rot180}$). This entire sequence of operations (padding the dilated output gradient and convolving it with the flipped kernel) is what yields $\frac{\partial L}{\partial X}$ and is often referred to as a "full convolution" in this context (see "A guide to convolution arithmetic for deep learning" by Dumoulin and Visin, or the provided articles by Mayank Kaushik).

The image below illustrates the concept of dilating an output gradient tensor. This dilation is a preparatory step before the gradient is used in further convolution operations during backpropagation for layers that had striding in their forward pass.

<figure style="text-align: center;">
    <img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*luRORFyTmj9mJ7rVhzlbZA.png" height="250" style="border-radius:20px;">
</figure>

Illustrative example of dilating an output gradient tensor (like $\frac{\partial L}{\partial Z}$) by inserting $S-1$ (stride minus one) zeros. For `delateOne`, $S=2$, so one zero is inserted between elements.
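
To make the dilation concrete, here is a small worked example on a 2x2 gradient. It is a sketch that mirrors the slicing idea described above; the name `dilate_sketch` is hypothetical and kept separate from the notebook's own functions.

```python
import numpy as np

def dilate_sketch(grad, stride):
    """Insert stride-1 zeros between elements of the last two dimensions."""
    if stride == 1:
        return grad
    n, c, h, w = grad.shape
    out = np.zeros(
        (n, c, h + (h - 1) * (stride - 1), w + (w - 1) * (stride - 1)),
        dtype=grad.dtype,
    )
    # Every stride-th position receives an original element; the rest stay zero
    out[:, :, ::stride, ::stride] = grad
    return out

grad = np.array([[1, 2], [3, 4]], dtype=np.float32).reshape(1, 1, 2, 2)
print(dilate_sketch(grad, 2)[0, 0])
# [[1. 0. 2.]
#  [0. 0. 0.]
#  [3. 0. 4.]]
print(dilate_sketch(grad, 3).shape)  # (1, 1, 4, 4)
```

The dilated size along each dimension is $H + (H-1)(S-1)$, so a 2x2 gradient becomes 3x3 for $S=2$ and 4x4 for $S=3$.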

In [5]:
def delateOne(matrix):
    indix = np.arange(1,matrix.shape[3])
    matrix = np.insert(matrix,indix,0,3)
    indix = np.arange(-(matrix.shape[-2]-1),0)
    matrix = np.insert(matrix,indix,0,-2)
    return matrix

def dilate(tensor, stride):
    if stride == 1:
        return tensor

    batch_size, num_channels, height, width = tensor.shape
    
    dilated_height = height + (height - 1) * (stride - 1)
    dilated_width = width + (width - 1) * (stride - 1)
    
    dilated_tensor = np.zeros((batch_size, num_channels, dilated_height, dilated_width))
    dilated_tensor[:, :, ::stride, ::stride] = tensor
    return dilated_tensor

### Slow Convolution Layer: Forward

### "Slow" Convolution Forward Pass (Explicit Loops)

`Slow_ReLU_Conv` implements a 2D convolution followed by ReLU activation using explicit nested Python loops. This is educationally valuable for clarity but computationally inefficient.

**Process:**

1.  **Inputs:**
    *   `img`: Batch of images `(N, C_in, H_in, W_in)`.
    *   `ker`: Filters `(C_out, C_in, K_H, K_W)`.
    *   `bias`: Per-filter biases `(C_out,)`.
2.  **Padding & Output Size:** Input `img` is padded. Output dimensions $(O_H, O_W)$ are calculated using the standard formula (see Padding section).
3.  **Convolution:** For each output element $(n, f, y_{out}, x_{out})$:
    $$ \text{Output}(n, f, y_{out}, x_{out}) = \left( \sum_{c=0}^{C_{in}-1} \sum_{k_y=0}^{K_H-1} \sum_{k_x=0}^{K_W-1} \text{Img}_{pad}(n, c, y_{out}S + k_y, x_{out}S + k_x) \cdot \text{Ker}(f, c, k_y, k_x) \right) + \text{Bias}(f) $$
4.  **ReLU Activation:** If `applyReLU=True`: $\text{ActivatedOutput} = \max(0, \text{Output})$. A binary `mask` (1 where Output > 0, else 0) is also returned for backpropagation.

<figure style="text-align:center;">
    <img src="https://raw.githubusercontent.com/iamaaditya/iamaaditya.github.io/refs/heads/master/images/conv_arithmetic/full_padding_no_strides_transposed.gif" height="250" style="border-radius:20px;"/>
    <figcaption>Convolution of a 4x4 image (blue) with a 2x2 kernel (green)</figcaption>
</figure>
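
The summation in step 3 can also be expressed in vectorized form, which is handy as a second reference when testing the loop version. The sketch below assumes NumPy 1.20 or newer (for `sliding_window_view`); the function name `conv2d_reference` is hypothetical. It uses the same input, kernel, bias, padding and stride as the example in the next cell.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_reference(img, ker, bias=None, pad=0, stride=1):
    """Convolution + ReLU for img (N, C_in, H, W) and ker (C_out, C_in, K_H, K_W)."""
    img = np.pad(img, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    k_h, k_w = ker.shape[2:]
    # All stride-1 windows: shape (N, C_in, H_out1, W_out1, K_H, K_W)
    windows = sliding_window_view(img, (k_h, k_w), axis=(2, 3))
    # Keep only the windows that start at multiples of the stride
    windows = windows[:, :, ::stride, ::stride]
    # Sum over input channels and kernel positions for every filter f
    out = np.einsum("nchwyx,fcyx->nfhw", windows, ker)
    if bias is not None:
        out = out + bias.reshape(1, -1, 1, 1)
    return np.maximum(out, 0)

img = np.arange(1, 13, dtype=np.float32).reshape(1, 1, 3, 4)
ker = np.arange(1, 9).reshape(2, 1, 2, 2)
bias = np.array([1, 2])
res = conv2d_reference(img, ker, bias, pad=1, stride=2)
print(res.shape)  # (1, 2, 2, 3)
print(res[0, 0])
# [[ 5. 19. 13.]
#  [47. 95. 45.]]
```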

In [6]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# PyTorch reference convolution, used to check that both the slow and the fast convolution implementations are correct

class CustomConv(nn.Module):
    def __init__(self, kernel: torch.Tensor, bias: torch.Tensor = None, 
                 stride=1, padding=0):
        super().__init__()
        out_ch, in_ch, k_h, k_w = kernel.shape
        self.stride = stride
        self.padding = padding
        
        self.conv = nn.Conv2d(in_channels=in_ch,
                              out_channels=out_ch,
                              kernel_size=(k_h, k_w),
                              stride=stride,
                              padding=padding,
                              bias=(bias is not None))
        with torch.no_grad():
            self.conv.weight.copy_(kernel)
            if bias is not None:
                self.conv.bias.copy_(bias)

        self.conv.weight.requires_grad_(False)
        if bias is not None:
            self.conv.bias.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.conv(x))

def Slow_ReLU_Conv(img,ker,bias=np.array(0),pad=0,stride=1,applyReLU=True):
    if applyReLU: # Forward case
        out_ch, in_ch, k_height, k_width = ker.shape
        nk_channel = out_ch
    else: # Backward case
        in_ch, out_ch, k_height, k_width = ker.shape
        nk_channel = in_ch

    # bias has shape out_ch, 1, 1. It's a scalar value for each channel broadcasted to the kernel's width and height
    # number of channels taken in input by the kernel 'in_ch' 
    # must be the same as the number of channels of the image 'channels'

    img = np.pad(img,((0,0),(0,0),(pad,pad),(pad,pad)))
    n_images, channels, i_height, i_width  = img.shape
    ni_height = int(((i_height - k_height) / stride) + 1) # new image height # Padding is already added
    ni_width = int(((i_width - k_width) / stride) + 1) # new image width
    ni = np.zeros((n_images, out_ch, ni_height, ni_width)).astype(np.float32) # new image

    if in_ch != channels:
        raise ValueError(f"number of channels taken in input by the kernel ({in_ch}) must be the same as the number of channels of the image ({channels})")

    for one_img in range(n_images):
        for one_k_channel in range(nk_channel):
            for i_nih in range(ni_height): # which cycles row by row of the new image
                for i_niw in range(ni_width): # which cycles column by column of the new image
                    current_sum = 0.0 # convolution sum for the specific output cell
                    # Convolution cycles
                    for channel in range(channels): # channels == in_ch
                        for i_kh in range(k_height):
                            input_y = (i_nih * stride) + i_kh # get the y location, the height
                            for i_kw in range(k_width):
                                input_x = (i_niw * stride) + i_kw # get the x location, the width
                                # check that everything stays in the measures
                                if 0 <= input_y < i_height and 0 <= input_x < i_width:
                                    input_val = img[one_img, channel, input_y, input_x]
                                    kernel_val = ker[one_k_channel, channel, i_kh, i_kw]
                                    current_sum += (input_val * kernel_val).astype(np.float32)
                    ni[one_img, one_k_channel, i_nih, i_niw] = current_sum
    if bias.any(): # add the bias unless it is entirely zero (the default np.array(0) means "no bias")
        bias = bias.reshape(bias.shape[0],1,1)
        if bias.shape[0] != out_ch:
            raise ValueError(f"bias dimension ({bias.shape[0]}) doesn't match kernel's number of channels ({out_ch})")
        ni = ni + bias
    ni = ni.astype(np.float32)
    if applyReLU:
        ni = np.maximum(0, ni)
        mask = ni.copy()
        mask[mask > 0] = 1
        return ni,mask
    else:
        return ni
#-------------------------------------------- Examples --------------------------------------------------------
img = np.arange(1,3*4+1).reshape(1,1,3,4).astype(np.float32)
print("-------img-------")
print(img)
ker = np.arange(1,8+1).reshape(2,1,2,2)
print("-------ker-------")
print(ker)
bias = np.array([1,2]).reshape(2,1,1)
res,mask = Slow_ReLU_Conv(img,ker,bias,pad=1,stride=2)
print("-------Conv Slow-------")
print(res)
# print("------mask-------")
# print(mask)


my_kernel = torch.from_numpy(ker).float()

my_bias = torch.from_numpy(np.array([1,2])).float()

modelC = CustomConv(kernel=my_kernel,bias=my_bias, stride=2, padding=1)

# test input (batch=1, channels=1, H=3, W=4)
x = torch.from_numpy(img)
y = modelC(x)
print("-------Conv PyTorch-------")
print(y)

-------img-------
[[[[ 1.  2.  3.  4.]
   [ 5.  6.  7.  8.]
   [ 9. 10. 11. 12.]]]]
-------ker-------
[[[[1 2]
   [3 4]]]


 [[[5 6]
   [7 8]]]]
-------Conv Slow-------
[[[[  5.  19.  13.]
   [ 47.  95.  45.]]

  [[ 10.  40.  30.]
   [104. 232. 126.]]]]
-------Conv PyTorch-------
tensor([[[[  5.,  19.,  13.],
          [ 47.,  95.,  45.]],

         [[ 10.,  40.,  30.],
          [104., 232., 126.]]]])


### Slow Convolution Layer: Backward

**Actors:**
1. W is the kernel
2. $\delta$ is the gradient
3. x is the input to the convolution layer during forward
4. b is the bias

**Steps:**

- **Derive delta**

Backpropagating $\delta$ through the ReLU activation is the Hadamard (element-wise) product of the incoming gradient $\delta$ and the mask saved during the forward pass. The mask is 1 wherever the convolved output was greater than zero and 0 elsewhere.
$$
\delta^{(i)} = \delta_{\text{flat reshaped}} \odot \text{mask}
$$

- **Gradient with respect to W**:

$$
\frac{\partial L}{\partial W^{(i)}} = \text{Convolution}(x^{(i)}, \delta)
$$
This convolution produces one matrix for every pair of input channel of $x^{(i)}$ and output channel of $\delta$, so the result has the same number of channels as the kernel $W^{(i)}$.

- **Gradient w.r.t. the input $x$** (to pass back to the preceding layer):

$$
\delta^{(i-1)} = \text{Full\_Convolution}(\delta^{(i)}, W^{(i)})
$$

- **Gradient w.r.t the bias**

Since the bias is added equally across the spatial dimensions of each output channel, the gradient is the sum of all elements in each output channel:

$$
\frac{\partial L}{\partial b^{(i)}_c} = \sum_{h,w} \delta^{(i)}_{c,h,w}
$$

For batched inputs, sum also across the batch dimension:

$$
\frac{\partial L}{\partial b^{(i)}_c} = \sum_{n,h,w} \delta^{(i)}_{n,c,h,w}
$$
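
The ReLU step and the bias gradient can be illustrated with a few lines of NumPy. This is a toy sketch with made-up values for $\delta$ and the mask, only meant to show which axes are multiplied and summed.

```python
import numpy as np

# Toy upstream gradient with shape (N, C, H, W) = (2, 2, 2, 2)
delta = np.arange(-4, 12, dtype=np.float32).reshape(2, 2, 2, 2)
# Toy ReLU mask: 1 where the forward output was positive, 0 elsewhere
mask = (np.arange(16).reshape(2, 2, 2, 2) % 2 == 0).astype(np.float32)

# Backward ReLU: element-wise (Hadamard) product
delta_z = delta * mask

# Bias gradient for a single sample: sum over the spatial axes only
grad_bias_per_sample = delta_z.sum(axis=(-1, -2))  # shape (N, C)
# Bias gradient for the whole batch: sum over batch and spatial axes
grad_bias = delta_z.sum(axis=(0, 2, 3))            # shape (C,)

print(grad_bias_per_sample.shape)  # (2, 2)
print(grad_bias)                   # [ 4. 20.]
```

Summing the per-sample result over its first axis gives the batched gradient, which is the difference between the two bias formulas above.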

### "Slow" Convolution Backward Pass (Explicit Loops)

`Slow_ReLU_Gradient` computes gradients $\frac{\partial L}{\partial X}$ (input gradient, `gi`), $\frac{\partial L}{\partial W}$ (kernel gradient, `gk`), and $\frac{\partial L}{\partial b}$ (bias gradient, `gb`), given $\frac{\partial L}{\partial A}$ (output activation gradient, `d_img`).

**Steps based on the formulas in the preceding markdown cell:**

1.  **Backward ReLU:** $\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial A} \odot \text{mask}$. (`d_img = np.multiply(d_img,mask)`)
2.  **Gradient w.r.t. Bias (`gb`):** $\frac{\partial L}{\partial b_f} = \sum_{n,h,w} (\frac{\partial L}{\partial Z})_{n,f,h,w}$. The code uses `gb = d_img.sum((-1,-2))`, which sums only over the spatial dimensions; for `batch_s > 1` the sum over the batch dimension still has to be applied, unless the per-sample gradients are accumulated externally.
3.  **Gradient w.r.t. Kernel (`gk`):** $\frac{\partial L}{\partial W} = \text{Conv}(X_{padded}, \frac{\partial L}{\partial Z})$.
    *   The code iterates over each element of `gk` and computes it as the sum of products between the corresponding elements of `img` (the original input, padded with the forward-pass `pad`) and `d_img` (that is, $\frac{\partial L}{\partial Z}$). For `stride == 2` the gradient must be dilated for this calculation too; the code reuses the `d_img` already dilated in the `gi` section, whose size gives the `dimg_height` and `dimg_width` loop bounds.
4.  **Gradient w.r.t. Input (`gi`):** $\frac{\partial L}{\partial X} = \text{FullConv}(\left(\frac{\partial L}{\partial Z}\right)_{dilated}, W_{rot180})$.
    *   $\frac{\partial L}{\partial Z}$ is dilated if `stride==2` (`d_img = delateOne(d_img)`).
    *   It's then padded: `d_imgPadded = np.pad(d_img, ..., (k_height-1-pad, ...))`.
    *   Kernel `ker` is rotated 180 degrees (`ker180`).
    *   The code then performs the convolution using nested loops to compute `gi`.

The "NEW APPROACH" note in the docstring refers to this direct loop-based implementation of the gradient convolutions.
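
The dilation step can be illustrated in isolation. `delateOne` is defined elsewhere in the notebook and is not shown here, so the helper below, `dilate_by_one`, is a hypothetical stand-in that assumes the same behaviour: one zero inserted between neighbouring elements along height and width.

```python
import numpy as np

def dilate_by_one(grad):
    """Insert one zero between neighbouring elements along the last two axes."""
    n, c, h, w = grad.shape
    out = np.zeros((n, c, 2 * h - 1, 2 * w - 1), dtype=grad.dtype)
    out[:, :, ::2, ::2] = grad   # original values land on the even positions
    return out

g = np.arange(1, 5, dtype=np.float32).reshape(1, 1, 2, 2)
dilated = dilate_by_one(g)
print(dilated[0, 0])
# [[1. 0. 2.]
#  [0. 0. 0.]
#  [3. 0. 4.]]
```

A gradient of spatial size $h \times w$ becomes $(2h-1) \times (2w-1)$, which undoes the subsampling introduced by a forward stride of 2.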

In [7]:
def Slow_ReLU_Gradient(img,d_img,ker,mask,pad=0,stride=1):
    """
    NEW APPROACH !
    Performs the backward pass of the convolution layer. It takes the original image, 
    the gradient image, and then the kernel, padding and stride used in the convolution. Also the mask is needed to perform the ReLU operation.
    It returns the gradient w.r.t. the Original Image to back propagate and the gradient of the kernel
    """ 
    ############################################# Gradient of Input Image ####################################
    # The computation is a convolution whose "image" is the gradient of the output image, dilated with stride-1 zeros
    # between neighbouring elements and padded by kernel-1-pad on every side,
    # and whose kernel is the original kernel rotated by 180 degrees (flipped vertically and then horizontally):
    # FullConvolution(dilated d_img, Rotated180Deg(kernel)) with stride 1
    out_ch, in_ch, k_height, k_width = ker.shape
    batch_s, in_ch, img_height, img_width = img.shape

    # backward ReLU
    d_img = np.multiply(d_img,mask)

    # Dilating the gradient of the output
    if stride == 2:
        d_img = delateOne(d_img)
    elif stride > 2:
        raise ValueError("Stride greater than 2 is not acceptable")
    d_imgPadded = np.pad(d_img,((0,0),(0,0),(k_height-1-pad,k_height-1-pad),(k_width-1-pad,k_width-1-pad)))
    batch_s, out_ch, dimg_height, dimg_width = d_img.shape
    
    # flipping the kernel
    ker180 = np.rot90(ker,2,(-2,-1))

    # Computation: gi[n, c_in, h, w] = sum over (c_out, kh, kw) of d_imgPadded * ker180
    gi = np.zeros(img.shape, dtype=np.float32)  # float output even when img holds integers
    pad_height, pad_width = d_imgPadded.shape[-2:]
    for bs in range(batch_s):
        for i_gih in range(img_height):
            for i_giw in range(img_width):
                for i_inch in range(in_ch):
                    # every input channel collects the contributions of all output channels
                    current_sum = 0.0
                    for i_outch in range(out_ch):
                        for i_kh in range(k_height):
                            y = i_gih + i_kh
                            for i_kw in range(k_width):
                                x = i_giw + i_kw  # the column index follows the width loop
                                # positions outside the padded gradient count as zeros
                                if y < pad_height and x < pad_width:
                                    current_sum += d_imgPadded[bs,i_outch,y,x]*ker180[i_outch,i_inch,i_kh,i_kw]
                    gi[bs,i_inch,i_gih,i_giw] = current_sum

    ############################################# Gradient of Kernel ####################################
    # The computation consists in a convolution between the padded original image and the (dilated) gradient
    # of the output image, accumulated over the batch
    gk = np.zeros(ker.shape, dtype=np.float32)  # float output even when ker holds integers
    img = np.pad(img,((0,0),(0,0),(pad,pad),(pad,pad)))
    pimg_height, pimg_width = img.shape[-2:]
    for bs in range(batch_s):
        for i_gih in range(k_height):
            for i_giw in range(k_width):
                for i_inch in range(in_ch):
                    for i_outch in range(out_ch):
                        current_sum = 0.0
                        for i_kh in range(dimg_height):
                            y = i_gih + i_kh
                            for i_kw in range(dimg_width):
                                x = i_giw + i_kw  # the column index follows the width loop
                                # bounds are checked against the padded image
                                if y < pimg_height and x < pimg_width:
                                    current_sum += img[bs,i_inch,y,x]*d_img[bs,i_outch,i_kh,i_kw]
                        # accumulate over the batch instead of overwriting
                        gk[i_outch,i_inch,i_gih,i_giw] += current_sum

    ############################################# Gradient of Bias ####################################
    # Sum the gradient of the output image over height and width to get the bias gradient of every channel.
    # The result has shape (batch_s, out_ch): the batch dimension is not summed here.
    gb = d_img.sum((-1,-2)) # sum over height and width
    
    ################################################### Return Results ###############################################
    return gi,gk,gb

in_ch = 1
out_ch = 2
idim = 7
kdim = 2
s = 2
p = 1
imAge = np.arange(1,1*in_ch*idim*idim+1).reshape(1,in_ch,idim,idim)
kerNel = np.arange(1,out_ch*in_ch*(kdim**2)+1).reshape(out_ch,in_ch,kdim,kdim)
dimAge,mask = Slow_ReLU_Conv(imAge,kerNel,stride=s,pad=p) 
dimAge = dimAge/np.mean(dimAge)
ggi,ggk,ggb = Slow_ReLU_Gradient(imAge,dimAge,kerNel,mask,stride=s,pad=p)
print(f"imAge: {imAge.shape}")
print(f"kerNel: {kerNel.shape}")
print(f"dimAge: {dimAge.shape}")
print(f"ggi: {ggi.shape}")
print(f"ggk: {ggk.shape}")

imAge: (1, 1, 7, 7)
kerNel: (2, 1, 2, 2)
dimAge: (1, 2, 4, 4)
ggi: (1, 1, 7, 7)
ggk: (2, 1, 2, 2)


### "Transposed" Convolution Layer: Forward

In [19]:
def Fast_ReLU_Conv(batch_of_images,kernels,biases=np.array(0),padding=0,stride=1,applyReLU=True):
    kernels_number, kernel_channels, kernel_height, kernel_width = kernels.shape

    # im2col: Window creation
    batch_of_images = np.pad(batch_of_images,((0,0),(0,0),(padding,padding),(padding,padding)))
    batch_size, input_channels, image_height, image_width = batch_of_images.shape

    sliding_windows = np.lib.stride_tricks.sliding_window_view(batch_of_images,(1,input_channels,kernel_height,kernel_width))[:,:,::stride,::stride]
    sliding_windows = sliding_windows.reshape((-1,(kernel_height * kernel_width * input_channels)))

    # Convolution
    kernels = kernels.reshape((-1,(kernel_height*kernel_width*input_channels))).transpose(1,0)
    image_dot_kernel = (sliding_windows @ kernels).astype(np.float32) # convolved image matrix

    # Output spatial dimensions (padding was already added to the image)
    output_height = (image_height-kernel_height) // stride + 1
    output_width = (image_width-kernel_width) // stride + 1

    # First operate a reshape keeping spatial ordering (height, then width), which has channels at the end
    output = image_dot_kernel.reshape(batch_size, output_height, output_width, kernels_number)

    # Transpose to have input in shapes (batch, output_channel, height, width)
    output = output.transpose(0,3,1,2).astype(np.float32)

    if biases.any() != 0:
        output = (output + biases.reshape(1,-1,1,1))

    if applyReLU:
        output = np.maximum(0,output)

    mask = np.copy(output)
    mask[mask > 0] = 1

    return output, mask



img = np.arange(1,2*3*3+1).reshape(1,2,3,3).astype(np.float32)
# print("-------img-------")
# print(img)
ker = np.arange(1,16+1).reshape(2,2,2,2)
# print("-------ker-------")
# print(ker)
bias = np.array([1,2]).reshape(2,1,1)
res,mask = Slow_ReLU_Conv(img,ker,bias,pad=0,stride=1)
print("-------Conv Slow-------")
print(res)
X_c,mask = Fast_ReLU_Conv(img,ker,bias,padding = 0,stride=1)
print("-------Conv Fast-------")
print(X_c)
res,mask = Slow_ReLU_Conv(res,ker,bias,pad=0,stride=1)
print("-------Conv Slow-------")
print(res)
X_c,mask = Fast_ReLU_Conv(X_c,ker,bias,padding = 0,stride=1)
print("-------Conv Fast-------")
print(X_c)

-------Conv Slow-------
[[[[ 357.  393.]
   [ 465.  501.]]

  [[ 838.  938.]
   [1138. 1238.]]]]
-------Conv Fast-------
[[[[ 357.  393.]
   [ 465.  501.]]

  [[ 838.  938.]
   [1138. 1238.]]]]
-------Conv Slow-------
[[[[32231.]]

  [[79176.]]]]
-------Conv Fast-------
[[[[32231.]]

  [[79176.]]]]


### Transposed Convolution Forward Pass (Im2Col alternative approach)



In [20]:
def im2col_convolutional_matrix(batch_of_images, kernels, biases=None, padding=0, stride=1, applyReLU=True):
    # kernels.shape = (kernels_number, kernel_channels, kernel_height, kernel_width)
    # batch_of_images.shape = (batch_size, input_channels, image_height, image_width)

    kernels_number, kernel_channels, kernel_height, kernel_width = kernels.shape
    batch_size, input_channels, image_height, image_width = batch_of_images.shape

    if kernel_channels != input_channels:
        raise ValueError(f"Number of kernel channels ({kernel_channels}) does not match the number of input channels ({input_channels})")

    # Compute the output dimensions
    output_height = (image_height - kernel_height + 2 * padding) // stride + 1
    output_width = (image_width - kernel_width + 2 * padding) // stride + 1

    # Padding
    if padding > 0:
        batch_of_images_padded = np.pad(batch_of_images, ((0, 0), (0, 0), (padding, padding), (padding, padding)), mode='constant')
    else:
        batch_of_images_padded = batch_of_images

    # Extract the patches with sliding_window_view (im2col)
    # The window has the size of the kernel and slides over the padded image
    # The window shape is (1, input_channels, kernel_height, kernel_width):
    # the leading 1 keeps every window inside a single image of the batch,
    # and input_channels makes it span all the channels at once.
    patches = np.lib.stride_tricks.sliding_window_view(
        batch_of_images_padded,
        (1, input_channels, kernel_height, kernel_width)
    )
    # patches.shape is now (batch_size, 1, H_out_stride1, W_out_stride1, 1, input_channels, kernel_height, kernel_width)
    # H_out_stride1 and W_out_stride1 are the output dimensions if the stride were 1

    # Apply the stride to the selected patches
    patches_strided = patches[:, :, ::stride, ::stride, :, :, :, :]
    # patches_strided.shape: (batch_size, 1, output_height, output_width, 1, input_channels, kernel_height, kernel_width)

    # Reshape the patches for the matrix multiplication (X_col)
    # Each row of X_col is a flattened patch
    # X_col shape: (batch_size * output_height * output_width, input_channels * kernel_height * kernel_width)
    X_col = patches_strided.transpose(0, 2, 3, 5, 6, 7, 1, 4).reshape(
        batch_size * output_height * output_width,
        input_channels * kernel_height * kernel_width
    )

    # Reshape the kernels for the matrix multiplication (W_col)
    # Each column of W_col is a flattened kernel
    # W_col shape: (input_channels * kernel_height * kernel_width, kernels_number)
    W_col = kernels.reshape(kernels_number, -1).T
    # kernels.shape (OC, IC, KH, KW) -> (OC, IC*KH*KW) -> .T -> (IC*KH*KW, OC)

    # Matrix multiplication that yields the unrolled output
    # output_col shape: (batch_size * output_height * output_width, kernels_number)
    output_col = X_col @ W_col

    # Reshape the output into the desired 4D form
    # (batch_size, output_height, output_width, kernels_number)
    output = output_col.reshape(batch_size, output_height, output_width, kernels_number)
    # Transpose to get (batch_size, kernels_number, output_height, output_width)
    output = output.transpose(0, 3, 1, 2)

    # Add the biases
    if biases is not None and biases.any() != 0:
        output = output + biases.reshape(1, -1, 1, 1) # biases.shape = (kernels_number,)

    mask = None # Initialize the mask

    if applyReLU:
        output_activated = np.maximum(0, output)
        mask = (output_activated > 0).astype(output.dtype) # Binary mask (0 or 1)
        output = output_activated
    else: # Without ReLU the mask can be taken as all ones (for the backward pass)
        mask = np.ones_like(output)


    return output.astype(np.float32), mask.astype(np.float32)

img = np.arange(1,2*3*3+1).reshape(1,2,3,3).astype(np.float32)
print("-------img-------")
print(img)
ker = np.arange(1,16+1).reshape(2,2,2,2)
print("-------ker-------")
print(ker)
bias = np.array([1,2]).reshape(2,1,1)
print("-------bias-------")
print(bias)
res, mask = Slow_ReLU_Conv(img,ker,bias,pad=0,stride=1)
print("-------Conv Slow-------")
print(res)
print("-------Conv Mat-------")
res, mask = im2col_convolutional_matrix(img, ker, bias, padding=0, stride=1)
print(res)
# res,mask = Slow_ReLU_Conv(res,ker,bias,pad=0,stride=1)
# print("-------Conv Slow-------")
# print(res)
X_c, mask = Fast_ReLU_Conv(img,ker,bias,padding = 0,stride=1)
print("-------Conv Fast-------")
print(X_c)

-------img-------
[[[[ 1.  2.  3.]
   [ 4.  5.  6.]
   [ 7.  8.  9.]]

  [[10. 11. 12.]
   [13. 14. 15.]
   [16. 17. 18.]]]]
-------ker-------
[[[[ 1  2]
   [ 3  4]]

  [[ 5  6]
   [ 7  8]]]


 [[[ 9 10]
   [11 12]]

  [[13 14]
   [15 16]]]]
-------bias-------
[[[1]]

 [[2]]]
-------Conv Slow-------
[[[[ 357.  393.]
   [ 465.  501.]]

  [[ 838.  938.]
   [1138. 1238.]]]]
-------Conv Mat-------
[[[[ 357.  393.]
   [ 465.  501.]]

  [[ 838.  938.]
   [1138. 1238.]]]]
-------Conv Fast-------
[[[[ 357.  393.]
   [ 465.  501.]]

  [[ 838.  938.]
   [1138. 1238.]]]]


Dimensional reshape test: it checks that the reshapes used to compute the kernel gradient as a matrix product put every element in the right place (they do).

In [21]:

import numpy as np
s = 1
p = 0
in_ch = 3
out_ch = 4
i_dim = 3
k_dim = 2
img = np.arange(1,i_dim*in_ch*i_dim+1).reshape(1,in_ch,i_dim,i_dim)
ker = np.arange(1,out_ch*k_dim*in_ch*k_dim+1).reshape(out_ch,in_ch,k_dim,k_dim)
d_img, mask = im2col_convolutional_matrix(img,ker,stride=s,padding=p)
print(d_img)
_,_,dimg_height,dimg_width = d_img.shape
windom_pimg = np.lib.stride_tricks.sliding_window_view(img,(1,1,dimg_height,dimg_width)).reshape(-1,dimg_height*dimg_width)
print(windom_pimg)
print(d_img.reshape(-1,dimg_height*dimg_width))
d_img = d_img.reshape(-1,dimg_height*dimg_width).transpose(1,0)
print(d_img)
iop = windom_pimg @ d_img
print(iop)
print(iop.transpose(1,0).reshape(out_ch,in_ch,k_dim,k_dim))

[[[[1245. 1323.]
   [1479. 1557.]]

  [[2973. 3195.]
   [3639. 3861.]]

  [[4701. 5067.]
   [5799. 6165.]]

  [[6429. 6939.]
   [7959. 8469.]]]]
[[ 1  2  4  5]
 [ 2  3  5  6]
 [ 4  5  7  8]
 [ 5  6  8  9]
 [10 11 13 14]
 [11 12 14 15]
 [13 14 16 17]
 [14 15 17 18]
 [19 20 22 23]
 [20 21 23 24]
 [22 23 25 26]
 [23 24 26 27]]
[[1245. 1323. 1479. 1557.]
 [2973. 3195. 3639. 3861.]
 [4701. 5067. 5799. 6165.]
 [6429. 6939. 7959. 8469.]]
[[1245. 2973. 4701. 6429.]
 [1323. 3195. 5067. 6939.]
 [1479. 3639. 5799. 7959.]
 [1557. 3861. 6165. 8469.]]
[[ 17592.  43224.  68856.  94488.]
 [ 23196.  56892.  90588. 124284.]
 [ 34404.  84228. 134052. 183876.]
 [ 40008.  97896. 155784. 213672.]
 [ 68028. 166236. 264444. 362652.]
 [ 73632. 179904. 286176. 392448.]
 [ 84840. 207240. 329640. 452040.]
 [ 90444. 220908. 351372. 481836.]
 [118464. 289248. 460032. 630816.]
 [124068. 302916. 481764. 660612.]
 [135276. 330252. 525228. 720204.]
 [140880. 343920. 546960. 750000.]]
[[[[ 17592.  23196.]
   [ 34404.  4

### Fast Convolution Layer: Backward

In [22]:
def Fast_ReLU_Gradient(img,d_img,ker,mask,pad=0,stride=1):
    """
    NEW APPROACH !
    Performs the backward pass of the convolution layer. It takes the original image, 
    the gradient image, and then the kernel, padding and stride used in the convolution. Also the mask is needed to perform the ReLU operation.
    It returns the gradient w.r.t. the Original Image to back propagate and the gradient of the kernel
    """ 
    ############################################# Gradient of Input Image ####################################
    # The computation is a convolution whose "image" is the gradient of the output image, dilated with stride-1 zeros
    # between neighbouring elements and padded by kernel-1-pad on every side,
    # and whose kernel is the original kernel rotated by 180 degrees (flipped vertically and then horizontally):
    # FullConvolution(dilated d_img, Rotated180Deg(kernel)) with stride 1
    out_ch, in_ch, k_height, k_width = ker.shape
    batch_s, in_ch, img_height, img_width = img.shape

    # backward ReLU
    d_img = np.multiply(d_img,mask)

    # Dilating the gradient of the output
    if stride == 2:
        d_img = delateOne(d_img)
    elif stride > 2:
        raise ValueError("Stride greater than 2 is not acceptable")
    d_imgPadded = np.pad(d_img,((0,0),(0,0),(k_height-1-pad,k_height-1-pad),(k_width-1-pad,k_width-1-pad)))

    batch_s, out_ch, dimg_height, dimg_width = d_img.shape
    
    # flipping the kernel
    ker180 = np.rot90(ker,2,(-2,-1))

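    # Zero-pad bottom/right so that there is exactly one window per input pixel
    extra_h = img_height + k_height - 1 - d_imgPadded.shape[-2]
    extra_w = img_width + k_width - 1 - d_imgPadded.shape[-1]
    d_imgPadded = np.pad(d_imgPadded,((0,0),(0,0),(0,extra_h),(0,extra_w)))
    # Window matrix: one row per input pixel, columns ordered as (out_ch, k_height, k_width)
    window_dPad = np.lib.stride_tricks.sliding_window_view(d_imgPadded,(1,out_ch,k_height,k_width)).reshape(-1,(k_height*k_width*out_ch))
    # Convolution with the rotated kernel; input/output channels are swapped to match the window columns
    ker_flat = ker180.transpose(1,0,2,3).reshape(in_ch,-1).transpose(1,0)
    gi = (window_dPad @ ker_flat).astype(np.float32).reshape(batch_s,img_height,img_width,in_ch).transpose(0,3,1,2)
    
    ############################################# Gradient of Kernel ####################################
    # The computation consists in a convolution between the padded original image and the (dilated) gradient
    # of the output image, summed over the batch
    imgPad = np.pad(img,((0,0),(0,0),(pad,pad),(pad,pad)))
    # Keep only the k_height x k_width window positions: (batch, in_ch, kh, kw, dimg_height, dimg_width)
    windom_pimg = np.lib.stride_tricks.sliding_window_view(imgPad,(1,1,dimg_height,dimg_width))[:,:,:k_height,:k_width,0,0]
    # (batch, in_ch, kh, kw, dh, dw) -> (in_ch*kh*kw, batch*dh*dw)
    windom_pimg = windom_pimg.transpose(1,2,3,0,4,5).reshape(in_ch*k_height*k_width,-1)
    # (batch, out_ch, dh, dw) -> (batch*dh*dw, out_ch); d_img itself stays 4D for the bias gradient
    d_flat = d_img.transpose(0,2,3,1).reshape(-1,out_ch)
    gk = (windom_pimg @ d_flat).astype(np.float32).transpose(1,0).reshape(out_ch,in_ch,k_height,k_width)
    ############################################# Gradient of Bias ####################################
    # Sum the gradient of the output image over height and width to get the bias gradient of every channel
    gb = d_img.sum((-1,-2)) # sum over height and width, shape (batch_s, out_ch)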
    
    ################################################### Return Results ###############################################
    return gi,gk,gb

def im2col_gradient_convolutional_matrix(
    original_forward_input, # X
    d_image, # dL/dA (gradient w.r.t. the activated output of the current layer)
    kernels, # W
    mask,    # ReLU mask
    padding, # Padding used in the forward pass
    stride   # Stride used in the forward pass
    ):

    # 0. Backward ReLU: dL/dZ = dL/dA * mask
    # (dL/dZ is the gradient w.r.t. the layer output BEFORE the ReLU)
    gradient_pre_activation = np.multiply(d_image, mask) # dL/dZ

    # --- Parametri dimensionali ---
    kernels_number, input_channels_kernel, kernel_height, kernel_width = kernels.shape # OC, IC, KH, KW
    batch_size, input_channels_orig, input_height, input_width = original_forward_input.shape # BS, IC, IH, IW
    # _, _, output_height_grad, output_width_grad = gradient_pre_activation.shape # BS, OC, OH, OW

    # --- 1. Gradient w.r.t. the biases (gradient_wrt_biases = dL/db) ---
    # Sum `gradient_pre_activation` (dL/dZ) over the batch, height and width dimensions.
    # gradient_pre_activation has shape (batch_size, kernels_number, oh_grad, ow_grad)
    # gradient_wrt_biases must have shape (kernels_number,)
    gradient_wrt_biases = np.sum(gradient_pre_activation, axis=(0, 2, 3))

    # --- 2. Gradient w.r.t. the weights (gradient_wrt_kernels = dL/dW) ---
    # dL/dW = conv(X_padded, dL/dZ dilated if needed) or, more precisely, X_col.T @ dL/dZ_col
    # X_col holds the patches of the original input that produced the output.
    # dL/dZ_col is the flattened pre-activation gradient.

    # Pad the original input (as in the forward pass)
    if padding > 0:
        X_padded_for_dW = np.pad(original_forward_input, ((0,0), (0,0), (padding,padding), (padding,padding)), mode='constant')
    else:
        X_padded_for_dW = original_forward_input

    # Extract X_col from the padded original input
    # The window has the size of the original kernel
    patches_X_for_dW = np.lib.stride_tricks.sliding_window_view(
        X_padded_for_dW,
        (1, input_channels_orig, kernel_height, kernel_width)
    )[:, :, ::stride, ::stride, :, :, :, :] # Apply the original stride
    # Shape: (bs, 1, oh, ow, 1, ic, kh, kw)

    X_col_for_dW = patches_X_for_dW.transpose(0, 2, 3, 5, 6, 7, 1, 4).reshape(
        batch_size * gradient_pre_activation.shape[2] * gradient_pre_activation.shape[3], # bs * oh * ow
        input_channels_orig * kernel_height * kernel_width # ic * kh * kw
    )

    # Reshape dL/dZ (gradient_pre_activation)
    # Shape: (bs, oc, oh, ow) -> (bs * oh * ow, oc)
    dLdZ_col = gradient_pre_activation.transpose(0, 2, 3, 1).reshape(
        batch_size * gradient_pre_activation.shape[2] * gradient_pre_activation.shape[3],
        kernels_number # oc
    )

    # Compute dL/dW
    # (ic*kh*kw, bs*oh*ow) @ (bs*oh*ow, oc) -> (ic*kh*kw, oc)
    gradient_wrt_kernels_flat = X_col_for_dW.T @ dLdZ_col

    # Reshape into the shape of the original kernels (oc, ic, kh, kw)
    gradient_wrt_kernels = gradient_wrt_kernels_flat.reshape(
        input_channels_orig, kernel_height, kernel_width, kernels_number
    ).transpose(3, 0, 1, 2)


    # --- 3. Gradient w.r.t. the input (gradient_wrt_input = dL/dX) ---
    # This is a transposed convolution (or "full convolution")
    # dL/dX = FullConv(dL/dZ_dilated, W_rot180)

    # a. Dilate gradient_pre_activation (dL/dZ) if the forward stride was > 1
    dilated_grad_pre_activation = dilate(gradient_pre_activation, stride)

    # b. Kernels for dL/dX: rotated by 180 degrees and with channels swapped
    # kernels.shape = (OC, IC, KH, KW)
    # For the transposed convolution the input/output channels of the kernel are swapped
    # flipped_kernels_for_dX.shape must be (IC_new, OC_new, KH, KW)
    # where IC_new = OC (channels of dL/dZ), OC_new = IC (channels of dL/dX)
    # Hence the final shape (IC, OC, KH, KW)
    flipped_kernels_for_dX = np.rot90(kernels, 2, axes=(2, 3)).transpose(1, 0, 2, 3)

    # c. Padding for the inner convolution of dL/dX (transposed convolution)
    # Effective_transposed_padding = KernelDim - 1 - Original_Forward_Padding
    # This padding is applied to dilated_grad_pre_activation
    padding_for_dX_conv = kernel_height - 1 - padding # padding is the padding of the original forward pass

    # d. Run the convolution (the inner stride is ALWAYS 1)
    # We reuse our efficient convolution function
    gradient_wrt_input_raw, _ = im2col_convolutional_matrix(
        dilated_grad_pre_activation, # Input image for this convolution
        flipped_kernels_for_dX,      # Kernels for this convolution
        biases=None,                 # No bias in the input-gradient computation
        padding=padding_for_dX_conv,
        stride=1,                    # The stride is always 1 here
        applyReLU=False              # No activation here
    )

    # e. Final crop/slice to ensure the correct dimensions of dL/dX
    # The output of the transposed convolution may be slightly larger than the original input
    # depending on the padding formulas and the dimensions.
    # We want it to have the dimensions of original_forward_input: (batch_size, input_channels_orig, input_height, input_width)
    gradient_wrt_input = gradient_wrt_input_raw[:, :, :input_height, :input_width]

    return gradient_wrt_input, gradient_wrt_kernels, gradient_wrt_biases

In [23]:
# def Fast_ReLU_Gradient(batch_of_images,d_image,kernel,mask,pad=0,stride=1):
#     out_ch, in_ch, kh, kw = kernel.shape # number of kernels, number of input channels, kernel width and kernel height
#     bs, nc, i_height,i_width = batch_of_images.shape # batch of images' number of images, number of channels, single image's width, single images's height

#     batchSize, out_ch, dh, dw = d_image.shape # number of kernels, number of input channels, kernel width and kernel height
#     ni_height = int(((i_height-1)*stride)+kh) # new image height
#     ni_width =  int(((i_width-1)*stride)+kw) # new image width
#     height_to_pad = (ni_height-dh)
#     width_to_pad = (ni_width-dw)

#     half_htp = height_to_pad//2
#     half_wtp = width_to_pad//2

#     d_image = np.multiply(d_image,mask)
#     d_imgP = np.pad(d_image,((0,0),(0,0),(half_htp,half_htp),(half_wtp,half_wtp)))

#     batch_of_images = np.pad(batch_of_images,((0,0),(0,0),(pad,pad),(pad,pad)))
#     bs, nc, iw, ih = batch_of_images.shape # batch of images' number of images, number of channels, single image's width, single images's height
    
#     ############################## Computing the gradient of the bias ##############################################
#     gb = d_image.sum((0,-1,-2)) # sum over batch, height and width

#     ########################################## Gradient of Kernel ###################################################
#     window_boi = np.lib.stride_tricks.sliding_window_view(batch_of_images,(1,1,dh,dw))[:,:,::stride,::stride].reshape((-1,(dw*dh*1))) # window matrix
#     d_image = d_image.reshape((-1,(dw*dh*1))).transpose(1,0)
#     gk = (window_boi @ d_image).transpose(1,0).reshape(out_ch, in_ch, kh, kw,).astype(np.float32) # convolved image matrix

#     ########################################## Gradient of Image ###################################################
#     gi,_ = Fast_ReLU_Conv(d_imgP,kernel.transpose(1,0,2,3),stride = stride,pad=pad,applyReLU=False)
#     # window_dboi = np.lib.stride_tricks.sliding_window_view(d_imgP,(1,out_ch,kh,kw))[:,:,::stride,::stride].reshape((-1,(kw*kh*out_ch))) # window matrix
#     # kernel = kernel.reshape((-1,(kw*kh*out_ch))).transpose(1,0)
#     # gi = (window_dboi @ kernel).reshape(bs, i_height, i_width, nc).transpose(0,3,1,2).astype(np.float32)

#     ################################################### Return Results ###############################################
#     return gi,gk,gb

# s = 2
# p = 1
# in_ch = 3
# out_ch = 32
# i_dim = 28
# k_dim = 3
# img = np.arange(1,i_dim*in_ch*i_dim+1).reshape(1,in_ch,i_dim,i_dim)
# ker = np.arange(1,out_ch*k_dim*in_ch*k_dim+1).reshape(out_ch,in_ch,k_dim,k_dim)
# bias = np.ones(out_ch)
# d_img,mask = Fast_ReLU_Conv(img,ker,bias,stride=s,pad=p)

# print("-------------img--------------")
# #print(img)
# print(img.shape)
# print("-------------ker--------------")
# #print(ker)
# print(ker.shape)
# print("################################")
# print("-------------d_img--------------")
# #print(d_img-2)
# print(d_img.shape)

# print("************************************")
# # a,b,c = Fast_ReLU_Gradient(img,d_img-2,ker,mask,stride=s,pad=p)
# # print("-------------gi-----------------")
# # print(a.shape)
# # print("--------------gk-----------------")
# # print(b.shape)
# # print("--------------gb-----------------")
# # print(c.shape)

# ####################### Expected result ##########################
# # [[[[ 1  2  3  4]
# #    [ 5  6  7  8]
# #    [ 9 10 11 12]
# #    [13 14 15 16]]]]
# # -------------d_img--------------
# # [[[[ 43.  53.  63.]
# #    [ 83.  93. 103.]
# #    [123. 133. 143.]]

# #   [[ 99. 125. 151.]
# #    [203. 229. 255.]
# #    [307. 333. 359.]]]]
# # (1, 2, 3, 3)
# # --------------------------------
# # [[[[ 964. 2034. 2494. 1246.]
# #    [2636. 5268. 6044. 2912.]
# #    [4332. 8372. 9148. 4320.]
# #    [2088. 3922. 4238. 1938.]]]]
# # [[[[ 6042.  6879.]
# #    [ 9390. 10227.]]]


# #  [[[15018. 17079.]
# #    [23262. 25323.]]]]
# # [ 837. 2061.]
# ##################################################

### MLP Layer: Forward

### MLP Forward Pass

`ReLU_SoftMax_FullyConnected` executes the forward pass of a two-layer Multi-Layer Perceptron (one hidden layer, one output layer), typically used for classification after feature extraction by convolutional layers.

**Operations:**

1.  **Input:** `input_array` ($X_{mlp}$), the flattened output from conv layers.
2.  **Hidden Layer (fc1):**
    *   Linear: $Z_1 = X_{mlp} W_1 + b_1$ (output `fl`)
    *   ReLU Activation: $A_1 = \max(0, Z_1)$ (output `fa`)
3.  **Output Layer (fc2):**
    *   Linear: $Z_2 = A_1 W_2 + b_2$ (output `sl`, logits)
    *   Softmax Activation: $P = \text{Softmax}(Z_2)$ (output `sa`, probabilities)
        $$ \text{Softmax}(z)_j = \frac{e^{z_j - \max(z)}}{\sum_{k} e^{z_k - \max(z)}} $$
        (The subtraction of $\max(z)$ aids numerical stability).

Returns `fl, fa, sl, sa` (pre-activations, activations, logits, probabilities).
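
To see why the $\max(z)$ subtraction matters, the sketch below compares a naive softmax with the shifted one on a logit vector containing a large value. `softmax_naive` and `softmax_stable` are illustrative helpers written for this comparison; the logits are made up.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(z):
    # Subtracting the row maximum leaves the result unchanged mathematically
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([1.0, 2.0, 1000.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(z)     # exp(1000) overflows to inf, so inf/inf gives nan
stable = softmax_stable(z)

print(naive)
print(stable)

# On moderate logits the two versions agree
z_small = np.array([0.5, -1.0, 2.0])
print(np.allclose(softmax_naive(z_small), softmax_stable(z_small)))
```

The shifted version exponentiates only non-positive numbers, so the largest term is exactly $e^0 = 1$ and no overflow can occur.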

In [24]:
def softmax(x):
    e_x = np.exp(x - np.max(x,axis=-1,keepdims=True))  # for numerical stability
    return e_x / np.sum(e_x,axis=-1,keepdims=True)

def ReLU_SoftMax_FullyConnected(input_array,w1,b1,w2,b2):
    fl = (input_array @ w1)+b1 # first layer
    fa = np.maximum(0,fl) # first activation: ReLU
    sl = (fa @ w2)+b2 # second layer
    sa = softmax(sl) # second activation: SoftMax
    return fl,fa,sl,sa

#print(softmax([1,2,3,100000]))
#print(softmax_no_NS([1,2,3,1000]))
#r = np.array(np.array([1,2,777,2]))
#print(softmax(r))
#r = np.array((np.array([1,2,777,2]),np.array([1,2,777,2]),np.array([1,2,777,2])))
#print(softmax(r))

### MLP Layer: Backward

### MLP Backward Pass

`ReLU_SoftMax_FC_Backward` computes gradients for the MLP. Inputs: batch size `bs`, predictions `pred` ($P$), true `labels` ($Y$), weights $W_1, W_2$, hidden activation `fa` ($A_1$), hidden pre-activation `fl` ($Z_1$), MLP input `i_mlp` ($X_{mlp}$).

**Gradients (from output layer backwards):**

1.  $\frac{\partial L}{\partial Z_2} = P - Y$ (`dL_dz2`)
2.  $\frac{\partial L}{\partial W_2} = A_1^T \frac{\partial L}{\partial Z_2}$ (`dL_dw2`)
3.  $\frac{\partial L}{\partial b_2} = \sum_{bs} \frac{\partial L}{\partial Z_2}$ (`dL_db2`)
4.  $\frac{\partial L}{\partial A_1} = \frac{\partial L}{\partial Z_2} W_2^T$ (`dL_dfa`)
5.  $\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial A_1} \odot \text{ReLU}'(Z_1)$ (`dL_dfl`, where $\text{ReLU}'(Z_1)$ is 1 if $Z_1 > 0$, else 0)
6.  $\frac{\partial L}{\partial W_1} = X_{mlp}^T \frac{\partial L}{\partial Z_1}$ (`dL_dw1`)
7.  $\frac{\partial L}{\partial b_1} = \sum_{bs} \frac{\partial L}{\partial Z_1}$ (`dL_db1`)
8.  $\frac{\partial L}{\partial X_{mlp}} = \frac{\partial L}{\partial Z_1} W_1^T$ (`dL_i_mlp`) (gradient to pass to conv layers)
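
Step 1 is the result every other step builds on, so it is worth verifying. The sketch below checks $\frac{\partial L}{\partial Z_2} = P - Y$ against central finite differences. It assumes the loss is the cross-entropy summed (not averaged) over the batch; the helper names, logits and labels are illustrative only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(z, y):
    """Cross-entropy of softmax(z) against one-hot y, summed over the batch."""
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=(2, 4))    # logits Z_2 for 2 samples and 4 classes
y = np.eye(4)[[1, 3]]          # one-hot labels Y

analytic = softmax(z) - y      # step 1: P - Y

eps = 1e-6
numeric = np.zeros_like(z)
for idx in np.ndindex(*z.shape):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[idx] += eps
    z_minus[idx] -= eps
    numeric[idx] = (cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The remaining steps are plain applications of the chain rule through the two linear layers and the ReLU.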

In [25]:
def ReLU_SoftMax_FC_Backward(bs,pred,labels,w1,w2,fa,fl,i_mlp):
    dL_dz2 = pred-labels[0:bs]
    dL_dw2 = fa.T @ dL_dz2
    dL_db2 = np.sum(dL_dz2, axis=0)
    dL_dfa = dL_dz2 @ w2.T
    dReLU = (fl > 0).astype(float)
    dL_dfl = dL_dfa * dReLU
    dL_dw1 = i_mlp.reshape(bs, -1).T @ dL_dfl
    dL_db1 = np.sum(dL_dfl, axis=0)
    dL_i_mlp = dL_dfl @ w1.T
    return dL_i_mlp,dL_dw1,dL_db1,dL_dw2,dL_db2

### Loss Function: Categorical Cross-Entropy

`crossEntropy` calculates the Categorical Cross-Entropy loss for a single sample.

**Formula:** Given predicted probabilities $P=(p_1, ..., p_K)$ and one-hot true label $Y=(y_1, ..., y_K)$:
$$ L(P, Y) = - \sum_{k=1}^{K} y_k \log(p_k) $$
If class $c$ is the true class ($y_c=1$), $L = - \log(p_c)$.
A small epsilon (`1/100000`) is added to $p$ to prevent $\log(0)$.
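
A short worked example, with made-up probabilities, shows that the sum collapses to the true-class term and that the epsilon keeps the loss finite when the predicted probability of the true class is exactly zero.

```python
import numpy as np

eps = 1e-5                       # same value as 1/100000
p = np.array([0.7, 0.2, 0.1])    # hypothetical softmax output
t = np.array([1.0, 0.0, 0.0])    # one-hot label: the true class is 0

loss_full = -np.sum(t * np.log(p + eps))   # full sum over the classes
loss_true = -np.log(p[0] + eps)            # only the true-class term survives
print(loss_full, loss_true)

# Worst case: zero probability on the true class
p_zero = np.array([0.0, 0.5, 0.5])
loss_zero = -np.sum(t * np.log(p_zero + eps))
print(loss_zero)                           # finite, equal to -log(1e-5)
```

The epsilon therefore caps the per-sample loss at $-\log(10^{-5}) \approx 11.51$.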

In [26]:
def crossEntropy(p,t):
    # p stands for prediction and t stands for true label
    # p = [0,0,1] and t = [1,0,0]
    p = p+(1/100000) # for numerical stability
    return -np.dot(t,np.log(p).T)

#c = [1,1000000000000000,1,1]
#c = softmax(c)
#print(c)
#c = crossEntropy(c,[0,1,0,0])
#print(c)

## Inference

## Inference: Comparing Implementations

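This section compares the inference (prediction) speed and agreement of four CNN implementations: PyTorch, "Slow" NumPy (loop-based), "Fast" NumPy (Im2Col-based), and a convolutional-matrix variant (`im2col_convolutional_matrix`). All use identical pre-trained weights.

**Objectives:**
1.  **Correctness:** Verify that the models yield the same predictions. The `correct` counter compares PyTorch against the Slow and Fast NumPy versions; the convolutional-matrix prediction is stored in `dict_pred["ccm"]` but is not part of that check.
2.  **Speed:** Compare average inference time per image.

The loop iterates over the first 100 test images, runs each model, and records predictions and times. The `correct_predictions` value in the progress bar is therefore the agreement rate between implementations, not the accuracy against `test_labels`. The timings show the gain of vectorized NumPy (Fast, Convolutional Matrix) over naive loops (Slow); in this one-image-at-a-time run both vectorized versions were also faster than the PyTorch forward pass. The padding (`pad` for `Slow_ReLU_Conv`, `padding` for the other NumPy functions) and `stride` arguments are set to match the PyTorch model's architecture.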

In [33]:
import time
from tqdm import tqdm

np_k1 = numpy_weights['k1'].astype(np.float32)
np_b_conv1 = numpy_weights['b_conv1'].astype(np.float32)
np_k2 = numpy_weights['k2'].astype(np.float32)
np_b_conv2 = numpy_weights['b_conv2'].astype(np.float32)
np_k3 = numpy_weights['k3'].astype(np.float32)
np_b_conv3 = numpy_weights['b_conv3'].astype(np.float32)
np_w1 = numpy_weights['w1'].astype(np.float32)
np_b1 = numpy_weights['b1'].astype(np.float32)
np_w2 = numpy_weights['w2'].astype(np.float32)
np_b2 = numpy_weights['b2'].astype(np.float32)

dict_times={}
dict_times["ctorch"]=[]
dict_times["cslow"]=[]
dict_times["cfast"]=[]
dict_times["ccm"]=[]

dict_pred={}
dict_pred["ctorch"]=[]
dict_pred["cslow"]=[]
dict_pred["cfast"]=[]
dict_pred["ccm"]=[]

#length = test_labels.shape[0]
length = 100
correct = 0
skip = True
loop = tqdm(range(length),desc=" Inferring...")
for i in loop:
    c0 = test_images[i].reshape(1,1,28,28).astype(np.float32)
    torch_c0 = torch.from_numpy(c0).float()
    ############### CNN PyTorch Implementation ##################
    start_time = time.time()
    outputs = model(torch_c0)
    end_time = time.time()
    _, predicted1 = torch.max(outputs.data, 1)
    dict_times["ctorch"].append(end_time-start_time)
    dict_pred["ctorch"].append(np.array(predicted1))
    ############### CNN Slow Implementation #####################
    start_time = time.time()
    c1s,mask1s = Slow_ReLU_Conv(c0.astype(np.float32),np_k1,np_b_conv1,pad=0,stride=2)
    c2s,mask2s = Slow_ReLU_Conv(c1s.astype(np.float32),np_k2,np_b_conv2,pad=1,stride=2)
    c3s,mask3s = Slow_ReLU_Conv(c2s.astype(np.float32),np_k3,np_b_conv3,pad=0,stride=2)
    imlps = c3s.reshape(1,-1)
    _,_,_,res = ReLU_SoftMax_FullyConnected(imlps,np_w1,np_b1,np_w2,np_b2)
    predicted2 = np.argmax(res,1)
    end_time = time.time()
    dict_times["cslow"].append(end_time-start_time)
    dict_pred["cslow"].append(np.array(predicted2))
    ############### CNN Fast Implementation #####################
    start_time = time.time()
    c1f,mask1f = Fast_ReLU_Conv(c0.astype(np.float32),np_k1,np_b_conv1,padding=0,stride=2)
    c2f,mask2f = Fast_ReLU_Conv(c1f.astype(np.float32),np_k2,np_b_conv2,padding=1,stride=2)
    c3f,mask3f = Fast_ReLU_Conv(c2f.astype(np.float32),np_k3,np_b_conv3,padding=0,stride=2)
    imlpf = c3f.reshape(1,-1)
    _,_,_,res = ReLU_SoftMax_FullyConnected(imlpf,np_w1,np_b1,np_w2,np_b2)
    predicted3 = np.argmax(res,1)
    end_time = time.time()
    dict_times["cfast"].append(end_time-start_time)
    dict_pred["cfast"].append(np.array(predicted3))
    ############## CNN Convolutional Matrix Implementation ###########
    start_time = time.time()
    c1c,mask1c = im2col_convolutional_matrix(c0.astype(np.float32),np_k1,np_b_conv1,padding=0,stride=2)
    c2c,mask2c = im2col_convolutional_matrix(c1c.astype(np.float32),np_k2,np_b_conv2,padding=1,stride=2)
    c3c,mask3c = im2col_convolutional_matrix(c2c.astype(np.float32),np_k3,np_b_conv3,padding=0,stride=2)
    imlpc = c3c.reshape(1,-1)
    _,_,_,res = ReLU_SoftMax_FullyConnected(imlpc,np_w1,np_b1,np_w2,np_b2)
    predicted4 = np.argmax(res,1)
    end_time = time.time()
    dict_times["ccm"].append(end_time-start_time)
    dict_pred["ccm"].append(np.array(predicted4))
    #####################################################################################
    #### Check that outputs of Slow Approach and Fast Approach have the same results ###
    t = int(predicted1[0])
    s = int(predicted2[0])
    f = int(predicted3[0])
    c = int(predicted4[0])
    if t == s and t == f:
        correct+=1
    #####################################################################################
    ### Keep track of the times #########################################################
    tat = round(sum(dict_times['ctorch'])/(i+1),10)
    sat = round(sum(dict_times['cslow'])/(i+1),10)
    fat = round(sum(dict_times['cfast'])/(i+1),10)
    cat = round(sum(dict_times['ccm'])/(i+1),10)
    loop.set_postfix(average_times =f"t: {tat} s, s: {sat} s, f: {fat} s, c: {cat} s" , correct_predictions=f"{100*correct/(i+1)}%")
tat = round(sum(dict_times['ctorch'])/length,10)
sat = round(sum(dict_times['cslow'])/length,10)
fat = round(sum(dict_times['cfast'])/length,10)
cat = round(sum(dict_times['ccm'])/length,10)
print(f"Average forward execution time in seconds: \nPyTorch: {tat} s, \nSlow: {sat} s, \nFast: {fat} s, \nConvolutional Matrix: {cat} s")

 Inferring...: 100%|██████████| 100/100 [01:39<00:00,  1.01it/s, average_times=t: 0.0032586622 s, s: 0.9865179253 s, f: 0.0012524176 s, c: 0.0005700207 s, correct_predictions=100.0%]

Average forward execution time in seconds: 
PyTorch: 0.0032586622 s, 
Slow: 0.9865179253 s, 
Fast: 0.0012524176 s, 
Convolutional Matrix: 0.0005700207 s
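
The measurement pattern used above (time each implementation on the same input, then check that the predictions agree) can be condensed into a small helper. This is only a sketch under assumptions: `benchmark`, `loop_predict` and `vec_predict` are hypothetical names, the two toy predictors stand in for the real models, and `time.perf_counter` is used instead of `time.time` because it is the clock Python recommends for measuring short intervals.

```python
import time
import numpy as np

def benchmark(predictors, inputs, warmup=1):
    # predictors: dict mapping a name to a callable that returns a class index
    times = {name: [] for name in predictors}
    preds = {name: [] for name in predictors}
    # Untimed warm-up calls so one-off setup costs do not skew the averages
    for fn in predictors.values():
        for x in inputs[:warmup]:
            fn(x)
    for x in inputs:
        for name, fn in predictors.items():
            start = time.perf_counter()
            y = fn(x)
            times[name].append(time.perf_counter() - start)
            preds[name].append(int(y))
    avg = {name: sum(t) / len(t) for name, t in times.items()}
    # Fraction of inputs on which every implementation predicts the same class
    agree = sum(len({preds[n][i] for n in preds}) == 1 for i in range(len(inputs)))
    return avg, agree / len(inputs)

# Toy stand-ins: the same linear classifier, once with loops, once vectorized
rng = np.random.default_rng(0)
W = rng.normal(size=(784, 10))

def loop_predict(x):
    scores = [sum(x[i] * W[i, k] for i in range(784)) for k in range(10)]
    return int(np.argmax(scores))

def vec_predict(x):
    return int(np.argmax(x @ W))

inputs = [rng.normal(size=784) for _ in range(5)]
avg, agreement = benchmark({"loop": loop_predict, "vectorized": vec_predict}, inputs)
print(avg, agreement)
```

Comparing all predictions in a set, as done here, also avoids silently leaving one implementation out of the agreement check.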





## Training

### Test for Slow approach

In this panel the Slow approach is tested to see whether it learns. The test first uses a single image, then the first 100 images in each epoch, to check that the loss decreases during training.

#### Weights Initialization

### NumPy Model Training: Weights Initialization

For training our NumPy CNNs from scratch, weights and biases are initialized randomly.
The shapes are taken from `numpy_weights` (derived from the PyTorch model) to maintain architectural consistency. `np.random.rand()` provides initial values (uniform in [0,1)). While more advanced initializers exist, this suffices for observing basic learning.

In [None]:
# Uniform [0, 1) initial values with the same shapes as the PyTorch-derived weights
k1 = np.random.rand(*numpy_weights['k1'].shape)
bc1 = np.random.rand(*numpy_weights['b_conv1'].shape)
k2 = np.random.rand(*numpy_weights['k2'].shape)
bc2 = np.random.rand(*numpy_weights['b_conv2'].shape)
k3 = np.random.rand(*numpy_weights['k3'].shape)
bc3 = np.random.rand(*numpy_weights['b_conv3'].shape)
w1 = np.random.rand(*numpy_weights['w1'].shape)
b1 = np.random.rand(*numpy_weights['b1'].shape)
w2 = np.random.rand(*numpy_weights['w2'].shape)
b2 = np.random.rand(*numpy_weights['b2'].shape)
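
As an illustration of the "more advanced initializers" mentioned above, here is a sketch of He (Kaiming) normal initialization, which is often used with ReLU layers. It is not what the notebook does: `he_normal` is a hypothetical helper, and the shapes and layouts below are assumptions chosen for the example, not the values stored in `numpy_weights`.

```python
import numpy as np

def he_normal(shape, fan_in, rng):
    # Zero-mean Gaussian with variance 2 / fan_in (He initialization)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

rng = np.random.default_rng(0)

# Hypothetical shapes, for illustration only
kernel_shape = (8, 1, 3, 3)  # assumed layout: (out_channels, in_channels, kh, kw)
dense_shape = (2048, 128)    # assumed layout: (in_features, out_features)

k = he_normal(kernel_shape, fan_in=1 * 3 * 3, rng=rng)
w = he_normal(dense_shape, fan_in=2048, rng=rng)
b = np.zeros(128)            # biases are commonly started at zero

print(k.shape, w.shape, round(float(w.std()), 4))
```

Unlike uniform values in [0, 1), which are all positive with mean 0.5, these weights are centered on zero and scaled by the number of inputs, so pre-activations do not grow with the layer width.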

In [None]:
def avgList(listA):
    sum_li = sum(listA)
    length_li = len(listA)
    return round(sum_li/length_li,4)

#### Same Image

### Training the "Slow" NumPy CNN (Single Image Test)

This tests training the loop-based `Slow_ReLU_Conv` and `Slow_ReLU_Gradient` implementations on a single image.

**Per-Epoch Steps:**
1.  **Forward Pass:**
    *   `c0 -> Slow_ReLU_Conv (k1,bc1,p=0,s=2) -> c1s`
    *   `c1s -> Slow_ReLU_Conv (k2,bc2,p=1,s=2) -> c2s`
    *   `c2s -> Slow_ReLU_Conv (k3,bc3,p=0,s=2) -> c3s`
    *   `c3s -> flatten -> imlps -> ReLU_SoftMax_FullyConnected -> sa` (probabilities)
2.  **Loss:** `loss = crossEntropy(sa, true_label)`
3.  **Backward Pass:** Gradients are computed using `ReLU_SoftMax_FC_Backward` for MLP, then `Slow_ReLU_Gradient` is called sequentially for conv layers, propagating gradients backward.
4.  **Weight Update:** Parameters updated via $W_{new} = W_{old} - \eta \cdot \frac{\partial L}{\partial W_{old}}$.

The loss is plotted to observe learning. The padding and stride parameters in `Slow_ReLU_Conv` calls are set to match the PyTorch model architecture, ensuring the flattened features `imlps` have the correct dimension (2048) for the MLP.
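
How padding and stride determine the feature dimension can be checked with the standard output-size formula $\lfloor (H + 2p - k)/s \rfloor + 1$. In the sketch below `conv_output_size` is a hypothetical helper, and the kernel sizes are placeholders (the real ones are not shown in this section), so the resulting sizes are not those of the notebook's model.

```python
def conv_output_size(size, kernel, pad, stride):
    # Spatial output size of a convolution along one dimension
    return (size + 2 * pad - kernel) // stride + 1

# (kernel, pad, stride) per layer; pads and strides follow the calls above,
# kernel sizes are assumed values for illustration
layers = [(3, 0, 2), (3, 1, 2), (3, 0, 2)]

size = 28  # MNIST images are 28x28
sizes = []
for kernel, pad, stride in layers:
    size = conv_output_size(size, kernel, pad, stride)
    sizes.append(size)

print(sizes)  # [13, 7, 3]
```

The flattened feature count is then `channels * size * size` for the last layer, and it has to equal the first dimension of `w1`.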

In [None]:
import matplotlib.pyplot as plt
ToBeTrained = True
if ToBeTrained:
    avg_loss = []
    forward_time = []
    backward_time = []
    numEpochs = 20
    bs = 1
    lr = 0.001
    loop = tqdm(range(numEpochs))
    for i in loop:
        c0 = train_images[0].reshape(1,1,28,28).astype(np.float32)
        
        # Forward
        sfts = time.time() # slow forward time start
        c1s,mask1s = Slow_ReLU_Conv(c0.astype(np.float32),k1,bc1,pad=0,stride=2)
        c2s,mask2s = Slow_ReLU_Conv(c1s.astype(np.float32),k2,bc2,pad=1,stride=2)
        c3s,mask3s = Slow_ReLU_Conv(c2s.astype(np.float32),k3,bc3,pad=0,stride=2)

        imlps = c3s.reshape(1,-1)
        fl,fa,sl,sa = ReLU_SoftMax_FullyConnected(imlps,w1,b1,w2,b2)
        sfte = time.time() # slow forward time end
        sft = sfte - sfts
        forward_time.append(sft)
        
        # Loss
        loss = crossEntropy(sa,train_labels[0])
        avg_loss.append(loss)

        # Backward
        sbts = time.time() # slow backward time start
        dL_i_mlp,dL_dw1,dL_db1,dL_dw2,dL_db2 = ReLU_SoftMax_FC_Backward(bs,sa,train_labels[0],w1,w2,fa,fl,imlps)
        dL_i_mlp = dL_i_mlp.reshape(c3s.shape)

        gi3,gk3,gb3 = Slow_ReLU_Gradient(c2s,dL_i_mlp,k3,mask3s,pad=0,stride=2)

        gi2,gk2,gb2 = Slow_ReLU_Gradient(c1s,gi3,k2,mask2s,pad=1,stride=2)
        gi1,gk1,gb1 = Slow_ReLU_Gradient(c0,gi2,k1,mask1s,pad=0,stride=2)
        sbte = time.time() # slow backward time end
        sbt = sbte - sbts
        backward_time.append(sbt)

        # Weights update
        w1 -= lr*dL_dw1
        b1 -= lr*dL_db1
        w2 -= lr*dL_dw2
        b2 -= lr*dL_db2
        k3 -= lr*gk3
        k2 -= lr*gk2
        k1 -= lr*gk1
        bc3 -= lr*gb3.reshape(-1)
        bc2 -= lr*gb2.reshape(-1)
        bc1 -= lr*gb1.reshape(-1)
        
        if len(avg_loss) >= 2:
            loop.set_postfix(pendence=f" {avg_loss[i]-avg_loss[i-1]}",avgForward=f"{avgList(forward_time)} s", avgBackward=f"{avgList(backward_time)} s" )

    plt.plot(avg_loss)
    plt.show()
# 2.64135 <-> 2.64095
# 2.64055 <-> 2.64020
# 2.64015 <-> 2.63980
# 2.63910 <-> 2.63840

These are the results for 20 epochs of one image:
- average forward time : 3.6265 s
- average backward time : 9.8262 s

Plot of the loss:

<img src="IMAGES\Slow Approach.png">


### Test for Fast approach

In this panel the Fast approach is tested to see whether it learns. The test first uses a single image, then the first 100 images in each epoch, to check that the loss decreases during training.

#### Weights Initialization

In [None]:
# Uniform [0, 1) initial values with the same shapes as the PyTorch-derived weights
k1 = np.random.rand(*numpy_weights['k1'].shape)
bc1 = np.random.rand(*numpy_weights['b_conv1'].shape)
k2 = np.random.rand(*numpy_weights['k2'].shape)
bc2 = np.random.rand(*numpy_weights['b_conv2'].shape)
k3 = np.random.rand(*numpy_weights['k3'].shape)
bc3 = np.random.rand(*numpy_weights['b_conv3'].shape)
w1 = np.random.rand(*numpy_weights['w1'].shape)
b1 = np.random.rand(*numpy_weights['b1'].shape)
w2 = np.random.rand(*numpy_weights['w2'].shape)
b2 = np.random.rand(*numpy_weights['b2'].shape)

In [None]:
def avgList(listA):
    sum_li = sum(listA)
    length_li = len(listA)
    return round(sum_li/length_li,4)

#### Same Image

### Training the "Fast" NumPy CNN (Single Image Test)

This tests training using the Im2Col-based `Fast_ReLU_Conv` and the revised `Fast_ReLU_Gradient` (from cell `c808bdb6`) on a single image.

**Per-Epoch Steps (differences from "Slow" are conv/grad functions):**
1.  **Forward Pass:**
    *   `c0 -> Fast_ReLU_Conv (k1,bc1,p=0,s=2) -> c1s`
    *   `c1s -> Fast_ReLU_Conv (k2,bc2,p=1,s=2) -> c2s`
    *   `c2s -> Fast_ReLU_Conv (k3,bc3,p=0,s=2) -> c3s`
    *   `c3s -> flatten -> imlps -> ReLU_SoftMax_FullyConnected -> sa`
2.  **Loss:** `loss = crossEntropy(sa, true_label)`
3.  **Backward Pass:** `ReLU_SoftMax_FC_Backward` for MLP, then `Fast_ReLU_Gradient` (using `sliding_window_view` for `gi` and `gk`) for conv layers.
4.  **Weight Update:** Standard gradient descent.

The loss is plotted. Consistent padding/stride ensures correct feature dimensions for the MLP. This setup tests the learning capability and performance of the more optimized NumPy convolution functions.
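
To show the windowing idea behind `sliding_window_view`, the sketch below applies it to a plain strided forward convolution and checks the result against a loop-based reference. It is not the notebook's `Fast_ReLU_Conv` or `Fast_ReLU_Gradient`: `conv_loops` and `conv_windows` are hypothetical helpers, the ReLU is omitted, and the layouts `(N, C, H, W)` for inputs and `(F, C, kh, kw)` for kernels are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_loops(x, k, b, pad, stride):
    # Reference implementation: one output position per loop iteration
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    n, c, h, w = xp.shape
    f, _, kh, kw = k.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    out = np.zeros((n, f, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = xp[:, :, i * stride:i * stride + kh, j * stride:j * stride + kw]
            # Contract channels and kernel window: (N,C,kh,kw) x (F,C,kh,kw) -> (N,F)
            out[:, :, i, j] = np.tensordot(patch, k, axes=([1, 2, 3], [1, 2, 3])) + b
    return out

def conv_windows(x, k, b, pad, stride):
    # Vectorized version: build all windows as a view, then one einsum
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    kh, kw = k.shape[2:]
    windows = sliding_window_view(xp, (kh, kw), axis=(2, 3))  # (N,C,H',W',kh,kw)
    windows = windows[:, :, ::stride, ::stride]               # keep strided positions
    out = np.einsum("nchwij,fcij->nfhw", windows, k)
    return out + b[None, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 9, 9))
k = rng.normal(size=(4, 3, 3, 3))
b = rng.normal(size=4)

ref = conv_loops(x, k, b, pad=1, stride=2)
fast = conv_windows(x, k, b, pad=1, stride=2)
print(ref.shape, np.allclose(ref, fast))
```

Because `sliding_window_view` returns a view rather than a copy, the windows cost no extra memory until the `einsum` contracts them.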

In [None]:
import matplotlib.pyplot as plt
avg_loss = []
forward_time = []
backward_time = []
numEpochs = 20
bs = 1
lr = 0.001
loop = tqdm(range(numEpochs))
for i in loop:
    c0 = train_images[0].reshape(1,1,28,28).astype(np.float32)
    
    # Forward
    sfts = time.time() # forward time start
    c1s,mask1s = Fast_ReLU_Conv(c0.astype(np.float32),k1,bc1,padding=0,stride=2)
    c2s,mask2s = Fast_ReLU_Conv(c1s.astype(np.float32),k2,bc2,padding=1,stride=2)
    c3s,mask3s = Fast_ReLU_Conv(c2s.astype(np.float32),k3,bc3,padding=0,stride=2)
    imlps = c3s.reshape(1,-1)
    fl,fa,sl,sa = ReLU_SoftMax_FullyConnected(imlps,w1,b1,w2,b2)
    sfte = time.time() # forward time end
    sft = sfte - sfts
    forward_time.append(sft)
    
    # Loss
    loss = crossEntropy(sa,train_labels[0])
    avg_loss.append(loss)

    # Backward
    sbts = time.time() # backward time start
    dL_i_mlp,dL_dw1,dL_db1,dL_dw2,dL_db2 = ReLU_SoftMax_FC_Backward(bs,sa,train_labels[0],w1,w2,fa,fl,imlps)
    dL_i_mlp = dL_i_mlp.reshape(c3s.shape)

    gi3,gk3,gb3 = Fast_ReLU_Gradient(c2s,dL_i_mlp,k3,mask3s,pad=0,stride=2)
    gi2,gk2,gb2 = Fast_ReLU_Gradient(c1s,gi3,k2,mask2s,pad=1,stride=2)
    gi1,gk1,gb1 = Fast_ReLU_Gradient(c0,gi2,k1,mask1s,pad=0,stride=2)
    sbte = time.time() # backward time end
    sbt = sbte - sbts
    backward_time.append(sbt)

    # Weights update
    w1 -= lr*dL_dw1
    b1 -= lr*dL_db1
    w2 -= lr*dL_dw2
    b2 -= lr*dL_db2
    k3 -= lr*gk3
    k2 -= lr*gk2
    k1 -= lr*gk1
    bc3 -= lr*gb3
    bc2 -= lr*gb2
    bc1 -= lr*gb1
    
    if len(avg_loss) > 2:
        loop.set_postfix(pendence=f" {avg_loss[i]-avg_loss[i-1]}",avgForward=f"{avgList(forward_time)} s", avgBackward=f"{avgList(backward_time)} s" )

plt.plot(avg_loss)
plt.show()

These are the results for 20 epochs of one image:
- average forward time : 0.0022 s
- average backward time : 0.0097 s
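
For a rough sense of scale, the averages reported for the two single-image tests can be turned into speedup factors. The numbers below are copied from the Slow and Fast results in this notebook; since they are rounded to four decimals, the ratios are approximate.

```python
# Reported averages in seconds (20 epochs, one image)
slow = {"forward": 3.6265, "backward": 9.8262}
fast = {"forward": 0.0022, "backward": 0.0097}

speedup = {phase: slow[phase] / fast[phase] for phase in slow}
for phase, factor in speedup.items():
    print(f"{phase}: about {factor:.0f}x faster")
```

This works out to roughly three orders of magnitude for both the forward and the backward pass.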

Plot of the loss:

<img src="IMAGES\Fast Approach.png">
