# Building a Neural Network for Image Classification: A Step-by-Step Guide

Image classification is one of the fundamental deep learning tasks. While modern frameworks like PyTorch, JAX, Keras, and TensorFlow offer a convenient abstraction to build and train neural networks, crafting one from scratch provides a more comprehensive understanding of the nuances involved.

In this notebook, we will implement in Python the essential modules required to build and train a multilayer perceptron that classifies garment images. In particular, we will delve into the fundamentals of approximation, non-linearity, regularization, optimizers, gradients, and backpropagation. Additionally, we explore the significance of random parameter initialization and the benefits of training in mini-batches.

By the end, you will be able to construct the fundamental building blocks of a neural network from scratch, understand how it learns, and deploy it to HuggingFace to classify real-world garment images.

### The Intuition behind our Neural Network

Our goal is to classify garment images by approximating a large mathematical function based on a training dataset of such images. We will begin this process by randomly initializing the parameters of our mathematical function, and then adjust them to combine input pixel values until we obtain favorable outputs (in the form of class predictions). This iterative method seeks to identify features in the training dataset that differentiate between classes, facilitating more accurate predictions.

The foundation for this approach is the [Universal Approximation Theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem), which highlights the significance of combining linear operations and non-linear functions to approximate complex patterns, such as those needed for computer vision.

The principle of teaching computers through examples, rather than explicit programming, dates back to [Arthur Samuel](https://ieeexplore.ieee.org/document/5392560) in 1949. Samuel further suggested the concept of using weights as function parameters that can be adjusted to influence the program's behavior and outputs, and underscored the need for an automatic method to test and optimize these weights based on their performance in real tasks.
 
We will implement this approximation method and adjust the weights automatically by applying [Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) in mini-batches, which in practice involves the following steps:

1. Initialize parameters (the weights and biases of our model).
2. Calculate predictions on a mini-batch.
3. Calculate the average loss between the predictions and the targets.
4. Calculate the gradients, which provide an indication of how the parameters need to change to minimize the loss.
5. Update the weights based on the gradients and a learning rate.
6. Repeat from step 2.
7. Stop the process once a condition is met, such as a time constraint or when the training/validation losses and metrics cease to improve.

*A mini-batch refers to a randomly selected subset of the training dataset that is used to calculate the loss and update the weights in each iteration*.

*Gradients are a measure inferred from the derivative of a function that indicates how the output of the function would change by modifying its parameters. Within the context of neural networks, they indicate the direction and magnitude in which we need to change each weight to improve our model*.
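
The steps above can be sketched on a toy problem before applying them to images. The following is a minimal illustration, not the notebook's training loop: the data (`y = 3x + 2` plus noise), the learning rate, and the fixed budget of 200 epochs are all assumptions chosen for the example.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: y = 3x + 2 plus a little noise.
X = torch.rand(256, 1)
Y = 3 * X + 2 + 0.01 * torch.randn(256, 1)

# Step 1: initialize parameters.
w = torch.randn(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr, batch_size = 0.1, 32

losses = []
for epoch in range(200):                       # Step 7: stop after a fixed budget.
    perm = torch.randperm(len(X))              # Shuffle to draw random mini-batches.
    for i in range(0, len(X), batch_size):     # Step 6: repeat for every mini-batch.
        idx = perm[i:i + batch_size]
        preds = X[idx] @ w + b                 # Step 2: predictions on a mini-batch.
        loss = ((preds - Y[idx]) ** 2).mean()  # Step 3: average loss.
        loss.backward()                        # Step 4: gradients.
        with torch.no_grad():                  # Step 5: update the parameters.
            w -= lr * w.grad
            b -= lr * b.grad
        w.grad.zero_()                         # Reset gradients for the next step.
        b.grad.zero_()
    losses.append(loss.item())                 # Loss of the last mini-batch in the epoch.

print(f"w={w.item():.2f}, b={b.item():.2f}, final loss={losses[-1]:.5f}")
```

If the loop works as intended, `w` and `b` should end up close to 3 and 2, the values used to generate the data.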

--------

### Architecture

In the following sections, we dive into the implementation details of the required components to build and train a multilayer perceptron that classifies garment images. For simpler integration with advanced functionality like computing gradients, these components will be defined as custom PyTorch modules.

#### Linear Layer

At the heart of our neural network are linear functions. These linear functions perform two key operations: (i) transformation of input values by their weights and biases through matrix multiplication, and (ii) dimensionality reduction (or augmentation in some cases).

This transformation projects input values into a different space, which, along with the use of stacked linear layers, enables the network to progressively learn more abstract and complex patterns.

Dimensionality reduction is achieved when the number of output units in a linear layer is smaller than the number of inputs. This compression forces the layer to capture the most salient features of the higher-dimensional input.
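
As a quick illustration of both operations, the sketch below applies a single matrix multiplication to a batch of flattened images. The batch size of 32 and the 784 to 256 reduction are assumptions picked to match the shapes used later in this notebook.

```python
import torch

torch.manual_seed(0)

batch = torch.rand(32, 784)            # 32 flattened 28x28 images
weight = torch.randn(784, 256) * 0.1   # shape: (in_features, out_features)
bias = torch.zeros(256)

out = batch @ weight + bias            # the bias is broadcast across the batch
print(out.shape)
```

Each of the 32 inputs is compressed from 784 values to 256, which is the dimensionality reduction described above.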

In [None]:
import torch
import torch.nn as nn

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('GPU State:', device)

In [None]:
class Linear(nn.Module):
    def __init__(self, in_features: int, out_features: int, std: float = 0.1):
        """
        Initializes linear layer with random weights. 
        Weights and biases are registered as parameters, allowing for 
        gradient computation and update during backpropagation.
        """
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features

        weight = torch.randn(in_features, out_features, requires_grad=True) * std
        bias = torch.zeros(out_features, requires_grad=True)
        
        self.weight = nn.Parameter(weight)
        self.bias = nn.Parameter(bias)
        self.to(device=device)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Perform linear transformation by multiplying the input tensor
        with the weight matrix, and adding the bias.
        """
        return x @ self.weight + self.bias

    def __repr__(self) -> str:
        """
        String representation of the linear layer.

        Returns:
            str: A string containing the number of input and output features, and whether the layer has a bias.
        """
        return f'in_features={self.in_features}, out_features={self.out_features}, bias={self.bias is not None}'

It is important to note that the weights are randomly initialized to break symmetry and enable effective learning. If all parameters were initialized to the same value, such as zero, the units in a layer would compute the same outputs and receive the same gradients during backpropagation, leading to identical weight updates and slow convergence, or none at all. This symmetry would prevent the network from learning distinct patterns in the data.

Furthermore, scaling the weights is a common practice in initialization. It helps control the variance of the activations and can have a big impact on the training dynamics. Note that a large scaling value can lead to gradients becoming excessively large during backpropagation, resulting in the exploding gradients problem, wherein weights grow rapidly and overflow to NaN values.
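
To get a feel for how sensitive a deep stack is to the scaling value, the sketch below pushes random inputs through ten linear + ReLU layers and measures the spread of the final activations. It only looks at the forward pass; gradients flow back through the same weight matrices, so they tend to shrink or grow in a similar way. The helper `final_activation_std`, the depth, and the width are hypothetical choices for this illustration.

```python
import torch

torch.manual_seed(0)

def final_activation_std(std: float, depth: int = 10, width: int = 256) -> float:
    """Push random inputs through `depth` linear + ReLU layers and return the output spread."""
    x = torch.randn(64, width)
    for _ in range(depth):
        w = torch.randn(width, width) * std   # weights scaled by `std`
        x = torch.clip(x @ w, 0.)             # linear transformation followed by ReLU
    return x.std().item()

small = final_activation_std(0.01)
large = final_activation_std(1.0)
print(f"std=0.01 -> {small:.3e}, std=1.0 -> {large:.3e}")
```

With a very small scale the signal all but vanishes, while with a scale of 1.0 it grows by orders of magnitude, which is why the scale is usually chosen with the layer width in mind.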

#### Introducing non-linearity

Without non-linearity, no matter how many layers our neural network has, it would still behave like a single-layer perceptron. This is because the composition of multiple linear functions is itself another linear function, which would prevent the model from approximating complex patterns.

To overcome this limitation, we adhere to the Universal Approximation Theorem and introduce non-linearity by implementing the rectified linear unit (ReLU), a widely used and simple activation function that sets negative values to zero while preserving positive values.
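
The collapse of stacked linear layers can be checked numerically. The sketch below uses small, arbitrary shapes (an assumption for readability) and compares two stacked linear layers against a single layer built from the merged weights, with and without a ReLU in between.

```python
import torch

torch.manual_seed(0)

x = torch.randn(8, 4)
w1, b1 = torch.randn(4, 5), torch.randn(5)
w2, b2 = torch.randn(5, 3), torch.randn(3)

# Two stacked linear layers ...
stacked = (x @ w1 + b1) @ w2 + b2
# ... are equivalent to one linear layer with merged weights and bias.
merged = x @ (w1 @ w2) + (b1 @ w2 + b2)
print(torch.allclose(stacked, merged, atol=1e-4))

# Placing a ReLU between the layers breaks this equivalence.
with_relu = torch.clip(x @ w1 + b1, 0.) @ w2 + b2
print(torch.allclose(with_relu, merged, atol=1e-4))
```

The first comparison is expected to hold and the second is not, which is what allows a network with activations to represent more than a single linear map.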

In [None]:
class ReLU(nn.Module):
    """
    Rectified Linear Unit (ReLU) activation function.
    """

    @staticmethod
    def forward(x: torch.Tensor) -> torch.Tensor:
        return torch.clip(x, 0.)

#### Regularization

Regularization is a fundamental technique used to prevent overfitting in neural networks, ensuring that models generalize well to unseen data. One effective method of regularization is the implementation of the dropout function. Dropout works by randomly deactivating a subset of neurons in the network during training, which prevents the model from becoming overly reliant on any single neuron or feature.

In [None]:
class Dropout(nn.Module):
    """
    Applies the dropout regularization technique to the input tensor.
    During training, randomly sets a fraction of input units to 0 with probability `p`,
    scaling the remaining values by `1 / (1 - p)` to maintain the same expected output sum.
    During evaluation, no dropout is applied.
    """

    def __init__(self, p=0.2):
        super(Dropout, self).__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # rand_like creates the mask directly on x's device with x's shape
            mask = (torch.rand_like(x) > self.p).to(x.dtype) / (1 - self.p)
            return x * mask
        return x
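
The effect of the `1 / (1 - p)` scaling can be verified with plain tensor operations. This is a standalone sketch that mirrors the masking logic rather than calling the class above; the input of ones and its size are assumptions that make the statistics easy to read.

```python
import torch

torch.manual_seed(0)

p = 0.2
x = torch.ones(100_000)

# Keep each unit with probability 1 - p, then rescale the survivors.
mask = (torch.rand_like(x) > p).float() / (1 - p)
dropped = x * mask

zero_fraction = (dropped == 0).float().mean().item()
print(f"zeroed: {zero_fraction:.3f}, mean after dropout: {dropped.mean().item():.3f}")
```

Roughly 20% of the units should be zeroed while the mean stays close to 1, so the next layer sees inputs of a similar magnitude in training and evaluation.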

#### Transformation Layer

Since we are working with the Fashion MNIST dataset, where images need to be flattened, we include a view transformation.

In [None]:
class Flatten(nn.Module):
    """
    Reshape the input tensor by flattening all dimensions except the first dimension.
    """

    @staticmethod
    def forward(x: torch.Tensor) -> torch.Tensor:
        """
        x.view(x.size(0), -1) reshapes the x tensor to (x.size(0), N)
        where N is the product of the remaining dimensions.
        E.g. (batch_size, 28, 28) -> (batch_size, 784)
        """
        return x.view(x.size(0), -1)
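
As a small usage illustration, the reshape can be tried on a random batch. The sketch assumes the `(batch_size, channels, height, width)` layout that `ToTensor` produces for grayscale images.

```python
import torch

batch = torch.rand(32, 1, 28, 28)      # (batch_size, channels, height, width)
flat = batch.view(batch.size(0), -1)   # keep the batch dimension, merge the rest
print(flat.shape)

# torch.flatten with start_dim=1 gives the same result.
same = torch.equal(flat, torch.flatten(batch, start_dim=1))
print(same)
```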

#### Sequential Layer

To construct the full neural network architecture, we need a way to connect the individual linear operations and activation functions in a sequential manner, forming a feedforward path from the inputs to the outputs. This is achieved by using a sequential layer, which allows us to define the specific order and composition of the various layers in our network.

In [None]:
class Sequential(nn.Module):
    """
    Sequential container for stacking multiple modules,
    passing the output of one module as input to the next.
    """

    def __init__(self, *layers):
        super(Sequential, self).__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x

    def __repr__(self) -> str:
        """
        String representation of the Sequential module.

        Returns:
            str: A string containing the layers in the Sequential container.
        """
        layer_str = '\n'.join([f' ({i}): {layer}' for i, layer in enumerate(self.layers)])
        return f'{self.__class__.__name__}(\n{layer_str}\n)'

#### Classifier Model

After flattening the input images, we stack linear operations with non-linear functions, enabling the network to learn hierarchical representations and approximate complex patterns in the data. This is essential for tasks like image classification, where the network needs to capture visual features to distinguish between various classes.

In [None]:
class Classifier(nn.Module):
    """
    Classifier model consisting of a sequence of linear layers and ReLU activations,
    followed by a final linear layer that outputs logits (unnormalized scores)
    for each of the 10 garment classes.
    """

    def __init__(self):
        """
        The output logits of the last layer can be passed directly to
        a loss function like CrossEntropyLoss, which will apply the 
        softmax function internally to calculate a probability distribution.
        """
        super(Classifier, self).__init__()
        self.labels = ['T-shirt/Top', 'Trouser/Jeans', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle-Boot']
        
        self.main = Sequential(
            Flatten(),
            Linear(in_features=784, out_features=256),
            ReLU(),
            Dropout(0.2),
            Linear(in_features=256, out_features=64),
            ReLU(),
            Dropout(0.2),
            Linear(in_features=64, out_features=10),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.main(x)
    
    def predictions(self, x):
        with torch.no_grad():
            logits = self.forward(x)
            probs = torch.nn.functional.softmax(logits, dim=1)
            predictions = dict(zip(self.labels, probs.cpu().detach().numpy().flatten()))    
        return predictions
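
The docstring above notes that `CrossEntropyLoss` applies softmax internally. The sketch below checks this on random logits; the logits and target indices are made up for the illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)            # raw scores for 4 images and 10 classes
targets = torch.tensor([0, 3, 9, 1])   # hypothetical class indices

# CrossEntropyLoss takes the logits directly ...
loss = nn.CrossEntropyLoss()(logits, targets)

# ... which matches log-softmax followed by the mean negative log-probability of the targets.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), targets].mean()

# Softmax turns each row of logits into a probability distribution.
probs = F.softmax(logits, dim=1)
print(loss.item(), manual.item(), probs.sum(dim=1))
```

This is why the classifier returns logits from `forward` and only applies softmax when probabilities are needed, as in `predictions`.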

In [None]:
model = Classifier().to(device)
model

Verify that the parameters have been properly registered.

In [None]:
list(model.parameters())[0]

#### Backpropagation

We implement a basic optimizer to automatically adjust the neural network's parameters (weights and biases) based on gradients. Computed during backpropagation, gradients indicate how to update these parameters to minimize the loss function. Using these gradients, the optimizer updates the parameters in a stepwise manner, with the step size determined by the learning rate.

In [None]:
class Optimizer:
    """
    Update model parameters during training.
    
    It performs a simple gradient descent step by updating the parameters
    based on their gradients and the specified learning rate (lr).
    """

    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            p.data -= p.grad.data * self.lr

    def zero_grad(self):
        """
        Reset the gradients of all parameters to zero.
        Since PyTorch accumulates gradients, this method ensures that
        the gradients from previous optimization steps do not interfere
        with the current step.
        """
        for p in self.params:
            p.grad = None
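
To see that this update rule is plain gradient descent, the sketch below performs one step by hand and compares it with PyTorch's built-in `torch.optim.SGD` (without momentum). The tiny regression problem is an assumption used only to produce a gradient.

```python
import torch

torch.manual_seed(0)

w = torch.randn(3, requires_grad=True)
x, y = torch.randn(5, 3), torch.randn(5)
lr = 0.1

loss = ((x @ w - y) ** 2).mean()
loss.backward()                          # backpropagation fills in w.grad

# Manual update: move against the gradient, scaled by the learning rate.
expected = w.detach() - lr * w.grad

# The same step using PyTorch's built-in optimizer.
opt = torch.optim.SGD([w], lr=lr)
opt.step()
print(torch.allclose(w.detach(), expected))
```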

*In 1974, Paul Werbos introduced the concept of backpropagation for neural networks. This development was almost entirely ignored for decades, but today it is considered one of the most important AI foundations.*

#### Init Config

Feel free to experiment with different learning rates and batch sizes (32, 64, 128). Keep in mind that if the batch size is too large, it might exceed the GPU's memory capacity, causing an out-of-memory error.

In [None]:
from dataclasses import dataclass

@dataclass
class LearnerConfig:
    """
    Configuration class for the Learner.

    This class holds the hyperparameters and settings for training the model.
    """

    model: nn.Module
    criterion: nn.Module
    epochs: int
    batch_size: int
    lr: float
    device: str

# Example configuration (defined at module level, after the class)
config = LearnerConfig(
    model=model,
    criterion=nn.CrossEntropyLoss(),
    epochs=25,
    batch_size=32,
    lr=0.005,
    device=device,
)

--------

## Training

#### Data Loaders

In the training process, we need to efficiently handle the loading and preprocessing of the dataset. For this purpose, we will use torch.utils.data.DataLoader, a utility class provided by PyTorch that helps with batching, shuffling, and loading data in parallel.

Using mini-batches instead of the entire dataset brings three benefits: (i) computational efficiency, as GPUs tend to perform better when they have a larger amount of work to process in parallel; (ii) better generalization, as randomly shuffling the mini-batches on every epoch introduces variance in the updates, which helps reduce overfitting; and (iii) reduced memory usage, as the GPU's memory never has to hold the entire dataset at once.
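
Before loading the real dataset, the batching behavior can be illustrated on a synthetic stand-in. The tensor shapes and the 100-sample size below are assumptions for the example, not properties of Fashion MNIST.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in for an image dataset: 100 samples of shape 1x28x28.
images = torch.rand(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# Iterating the loader yields (inputs, targets) mini-batches.
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(len(loader), batch_shapes)
```

Since 100 is not a multiple of 32, the last mini-batch is smaller than the others, which is why the training loop later weights each batch loss by its actual size.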

In [None]:
from torchvision.transforms import ToTensor
from torchvision import datasets
from torch.utils.data import DataLoader

train_data = datasets.FashionMNIST(root = 'data', train = True, transform = ToTensor(), download = True)
test_data = datasets.FashionMNIST(root = 'data', train = False, transform = ToTensor())
num_workers = 1

loaders = {'train' : DataLoader(train_data, batch_size=config.batch_size, shuffle=True, num_workers=num_workers),
           'test'  : DataLoader(test_data, batch_size=config.batch_size, shuffle=False, num_workers=num_workers)}

In [None]:
train_data.data.size(), test_data.data.size()

In [None]:
import matplotlib.pyplot as plt 

label_names = ['T-shirt/Top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
figure = plt.figure(figsize=(8, 8))
cols, rows = 5, 5
for i in range(1, cols * rows + 1):
    sample_idx = torch.randint(len(train_data), size=(1,)).item()
    img, label = train_data[sample_idx]
    figure.add_subplot(rows, cols, i)
    plt.xticks([])
    plt.yticks([])
    plt.imshow(img.squeeze())
    plt.xlabel(label_names[label])
plt.subplots_adjust(hspace=0.3)
plt.show()

In [None]:
img.size()

#### Fitting the Model

With the neural network architecture and data loaders in place, we can now focus on the process of training the model, also known as fitting the model to the data. This involves implementing a training loop that iterates over the dataset, computes the predictions and loss, and updates the model's parameters using backpropagation and an optimization algorithm.

The training process can be divided into two main components: the training loop and the validation loop. The training loop is responsible for feeding the mini-batches of data to the model, computing the loss, and updating the model's parameters using backpropagation and the optimizer. This loop is typically run for a fixed number of epochs or until a certain stopping criterion is met.

On the other hand, the validation loop is used to evaluate the model's performance on a separate dataset that is not used for training. In this notebook, the Fashion MNIST test split plays that role. This helps monitor the model's generalization performance and detect overfitting to the training data.

In the following code, we implement a Learner class that encapsulates this logic and provides a convenient interface for fitting the model to the data and monitoring its performance.

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

class Learner:
    """
    Learner class for training and evaluating a model.

    This class encapsulates the training and validation loops, as well as
    utility methods for prediction, exporting the model, and calculating
    accuracy.
    """

    def __init__(self, config, loaders):
        """
        Initialize the Learner.

        Args:
            config (LearnerConfig): Configuration for the Learner.
            loaders (dict): Dictionary of data loaders for training and testing.
        """
        self.model = config.model
        self.loaders = loaders
        self.optimizer = Optimizer(self.model.parameters(), config.lr)
        self.criterion = config.criterion
        self.epochs = config.epochs
        self.device = config.device
        self.labels = ['T-shirt/Top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle-Boot']
        self.model.to(self.device)

    def train_epoch(self, epoch):
        """
        Train the model for one epoch.
        """
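        # Switch back to training mode: valid_loss() puts the model in eval mode,
        # which would otherwise leave dropout disabled from the second epoch on.
        self.model.train()
        epoch_loss = 0.0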
        for x, y in self.loaders["train"]:
            x, y = x.to(self.device), y.to(self.device)
            batch_size = x.size(0)

            # Zero out the gradients - otherwise, they will accumulate.
            self.optimizer.zero_grad()
   
            # Forward pass, loss calculation, and backpropagation
            output = self.model(x)
            loss = self.criterion(output, y)
            loss.backward()
            self.optimizer.step()

            epoch_loss += loss.item() * batch_size

        train_loss = epoch_loss / len(self.loaders['train'].dataset)
        return train_loss
    
    def valid_loss(self):
        """
        Calculate the validation loss.
        """
        self.model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for x, y in self.loaders["test"]:
                x, y = x.to(self.device), y.to(self.device)
                output = self.model(x)
                val_loss += self.criterion(output, y).item() * y.size(0)
        val_loss /= len(self.loaders["test"].dataset)
        return val_loss

    def batch_accuracy(self, x, y):
        """
        Calculate the accuracy for a batch of inputs (x) and targets (y).
        """        
        _, preds = torch.max(x.data, 1)
        return (preds == y).sum().item() / x.size(0)

    def validate_epoch(self):
        """
        Evaluate the model on the test dataset after an epoch.
        """        
        accs = [self.batch_accuracy(self.model(x.to(self.device)), y.to(self.device))
                for x, y in self.loaders["test"]]
        return sum(accs) / len(accs)
            
    def fit(self):
        """
        Train the model for the specified number of epochs.
        """
        print('epoch\ttrain_loss\tval_loss\ttest_accuracy')
        for epoch in range(self.epochs):
            train_loss = self.train_epoch(epoch)
            valid_loss = self.valid_loss()
            batch_accuracy = self.validate_epoch()
            print(f'{epoch+1}\t{train_loss:.6f}\t{valid_loss:.6f}\t{batch_accuracy:.6f}')

        metrics = self.evaluate()
        return metrics
            
    def predict(self, x):
        with torch.no_grad():
            outputs = self.model(x.to(self.device))
            _, preds = torch.max(outputs.data, 1)
        return preds
    
    def export(self, path):
        torch.save(self.model, path)
                
    def evaluate(self):
        self.model.eval()
        all_preds = []
        all_targets = []

        with torch.no_grad():
            for x, y in self.loaders["test"]:
                x, y = x.to(self.device), y.to(self.device)
                outputs = self.model(x)
                _, preds = torch.max(outputs, 1)
                all_preds.extend(preds.cpu().numpy())
                all_targets.extend(y.cpu().numpy())

        class_precision = precision_score(all_targets, all_preds, average=None)
        class_recall = recall_score(all_targets, all_preds, average=None)
        class_f1 = f1_score(all_targets, all_preds, average=None)

        metrics = {label: {"precision": prec, "recall": rec, "f1": f1}
                   for label, prec, rec, f1 in zip(self.labels, class_precision, class_recall, class_f1)}

        return metrics
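
For intuition on what the per-class metrics returned by `evaluate` measure, the sketch below computes precision, recall, and F1 by hand on a tiny made-up example. The helper `precision_recall_f1` is hypothetical and assumes every class is predicted at least once, so no division by zero occurs.

```python
import torch

# Hypothetical predictions and targets for a 3-class problem.
targets = torch.tensor([0, 0, 1, 1, 2, 2])
preds = torch.tensor([0, 1, 1, 1, 2, 0])

def precision_recall_f1(cls: int):
    tp = ((preds == cls) & (targets == cls)).sum().item()  # correct predictions of cls
    predicted = (preds == cls).sum().item()                # everything predicted as cls
    actual = (targets == cls).sum().item()                 # everything that really is cls
    precision, recall = tp / predicted, tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (preds == targets).float().mean().item()
per_class = {cls: precision_recall_f1(cls) for cls in range(3)}
print(accuracy, per_class)
```

Precision answers "of the items predicted as this class, how many were right?", while recall answers "of the items that belong to this class, how many were found?".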

In [None]:
learner = Learner(config, loaders)

In [None]:
metrics = learner.fit()  # keep the per-class metrics for the report below

In [None]:
labels = ['T-shirt/Top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle-Boot']

# Determine the maximum label length for padding
max_label_len = max(len(label) for label in labels)

# Print header row
header = "Label".ljust(max_label_len + 2) + "Precision".ljust(12) + "Recall".ljust(12) + "F1-score"
print(header)
print("-" * len(header))

# Print metrics for each class in a row
for label, metric in zip(labels, metrics.values()):
    row = label.ljust(max_label_len + 2) + \
        f"{metric['precision']:.6f}".ljust(12) + \
        f"{metric['recall']:.6f}".ljust(12) + \
        f"{metric['f1']:.6f}"
    print(row)

--------

### Inference

After training the model, we can use it for inference, which involves making predictions on new, unseen data. The inference process is relatively straightforward: we pass the input data (e.g., new garment images) through the trained neural network, and it will output the predicted class probabilities or labels.

#### From Model

In [None]:
import torchvision.transforms as transforms

images, _ = next(iter(loaders['test']))
i = torch.randint(len(images), size=(1,)).item()
img = images[i]

plt.figure(figsize=(2, 2))
plt.imshow(img.squeeze())
plt.colorbar()
plt.show()

In [None]:
predictions = learner.model.predictions(img.to(device))
dict(sorted(predictions.items(), key=lambda item: item[1], reverse=True))

#### From Exported Model using a real Image

In [None]:
learner.export('fashion_mnist.pt')

In [None]:
from PIL import Image
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np

transform = transforms.Compose(
    [
        transforms.Resize((28, 28)),
        transforms.Grayscale(),
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),  # Normalize
        transforms.Lambda(lambda x: 1.0 - x),  # Invert colors
        transforms.Lambda(lambda x: x[0]),
        transforms.Lambda(lambda x: x.unsqueeze(0)),
    ]
)

In [None]:
img = Image.open('fashion/dress.png')
plt.figure(figsize=(4, 4))
plt.imshow(img)
plt.show()

In [None]:
img = transform(img)
plt.figure(figsize=(4, 4))
plt.imshow(img.squeeze())
plt.colorbar()
plt.show()

In [None]:
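# Load the model exported above. The file stores the full pickled module, so
# weights_only=False is needed on PyTorch releases that default to weights_only=True.
model = torch.load('fashion_mnist.pt', map_location=device, weights_only=False)
model.eval()  # make sure dropout is disabled for inference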
predictions = model.predictions(img.to(device))
dict(sorted(predictions.items(), key=lambda item: item[1], reverse=True))