# ResNet

## Theoretical Introduction

### ResNet: Residual Connections and Extremely Deep Networks

The **ResNet** (Residual Network) architecture, introduced by Kaiming He and collaborators at Microsoft Research in 2015, represents a turning point in the design of deep neural networks for computer vision. Whereas LeNet-5 inaugurates the systematic use of convolutions and VGG consolidates depth as a key performance factor, ResNet introduces a structural mechanism that enables the effective and stable training of extremely deep networks (with more than one hundred layers).

ResNet acquires particular relevance by winning the **ImageNet 2015** challenge with a **152-layer** variant, a depth that was previously considered practically unattainable from a training standpoint, due to optimization difficulties and numerical instability.

## The Degradation Problem in Deep Networks

Before ResNet, it was reasonable to assume that increasing the number of layers should lead, at least in principle, to models with greater representational capacity and improved performance. However, empirical studies showed that beyond a certain threshold (around twenty or thirty layers), adding more depth not only failed to improve performance but actually **degraded it**, even on the training set itself.

This phenomenon is known as **degradation**. It is not a simple overfitting effect, since the error also increases during training, but rather a structural optimization difficulty. In very deep networks, the error signal that propagates backward tends to vanish or become numerically unstable. Layers close to the input receive very small or noisy gradients, so their parameters barely update, preventing the model from exploiting its potential capacity.

### Experimental Evidence of Degradation

Experiments with plain (non-residual) networks, reported in the ResNet paper itself, show that a 56-layer network can exhibit a **higher training error** than a 20-layer network with a comparable architecture. This observation contradicts a basic theoretical expectation: a deeper model should be able, at a minimum, to reproduce the performance of a shallower one, by simple hypothesis inclusion (it would suffice for the extra layers to implement the identity function).

The fact that the error increases even on the training set demonstrates that the problem does not lie in representational capacity, but in the **difficulty of optimizing** very deep networks using standard gradient-based training mechanisms.

## Residual Connections and Shortcut Blocks

The central innovation of ResNet is the introduction of **shortcut connections** (_skip connections_) that give rise to the **residual block**. The idea is conceptually simple: instead of forcing each group of layers to learn a complete transformation, the architecture allows that group to learn only the **difference** (or residue) between the input and the desired output.

### Mathematical Formulation of the Residual Block

In a conventional network, a layer or group of layers receives an input $x$ and learns a function $F(x)$. The output of the block is simply $F(x)$.

In ResNet, the output of a residual block is defined as

$$
y = F(x) + x,
$$

where:

- $x$ is the input to the residual block.
- $F(x)$ is the transformation learned by the internal layers (convolutions, normalization, activations).
- $y$ is the block output, obtained as the sum of the original input and the residual term $F(x)$.

This sum creates an **explicit identity path**, akin to a “highway” that traverses the network, along which information can flow without being modified, in parallel to the conventional nonlinear transformations.
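The formula $y = F(x) + x$ can be illustrated with a minimal sketch in plain Python. The functions `zero_residual` and `small_residual` are hypothetical stand-ins for the learned transformation $F$; they are not part of the notebook's implementation and only serve to show how the identity path behaves.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return y = F(x) + x, computed element-wise."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        # The identity shortcut requires F(x) and x to have the same shape.
        raise ValueError("F(x) must have the same shape as x")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x: Vector) -> Vector:
    """Hypothetical F whose output is zero everywhere: F(x) = 0."""
    return [0.0 for _ in x]


def small_residual(x: Vector) -> Vector:
    """Hypothetical F that applies a small correction: F(x) = 0.1 * x."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
y_identity = residual_block(x, zero_residual)    # the block behaves as the identity
y_corrected = residual_block(x, small_residual)  # the block adds a small correction
print(y_identity)
print(y_corrected)
```

When $F(x) = 0$ the block returns its input unchanged, which is the behavior the identity path is meant to make easy to reach.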

### Advantages of Residual Learning

Residual learning provides several fundamental advantages.

First, it **facilitates optimization**. If the optimal transformation in a certain block is close to the identity, it is easier for the network to learn a residual function with $F(x) \approx 0$ than to learn a complete transformation $H(x) \approx x$ from scratch. The function space of residuals tends to be “closer” to the origin and is therefore more accessible to gradient-based optimization methods.

Second, **gradient propagation** improves significantly. During backpropagation, the gradient of the loss $L$ with respect to the input $x$ of a residual block satisfies, in simplified form,

$$
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \left( \frac{\partial F}{\partial x} + I \right),
$$

where $I$ is the identity matrix. This expression guarantees that, even in the limiting case where $\frac{\partial F}{\partial x}$ tends to zero, there is always a direct gradient path through the identity term $I$. In practice, this prevents the signal from vanishing completely and contributes to stabilizing the training of very deep networks.

Third, residual connections introduce a form of **adaptive depth**. If a block is not necessary for the task, it can approximate $F(x) \approx 0$ and effectively behave as an identity transformation, so that $y \approx x$. The network thus retains the ability to “neutralize” blocks that do not provide improvements, without compromising the information flow.

Taken together, these mechanisms make it possible to train networks with more than one hundred layers without suffering the severe degradation that affected previous architectures. The input signal can traverse the network without being distorted, and the gradient has alternative paths that mitigate vanishing effects.
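The effect of the identity term on gradient propagation can be sketched with a scalar caricature. It assumes every block has the same local derivative $\partial F / \partial x = 0.01$ and replaces the Jacobian matrices with scalars; both are simplifications introduced here for illustration, not properties of a trained network.

```python
def plain_gradient(local_derivative: float, depth: int) -> float:
    """Gradient factor through `depth` stacked plain layers: (dF/dx)^depth."""
    return local_derivative ** depth


def residual_gradient(local_derivative: float, depth: int) -> float:
    """Gradient factor through `depth` residual blocks: (dF/dx + 1)^depth."""
    return (local_derivative + 1.0) ** depth


DEPTH = 50                 # assumed number of stacked blocks
LOCAL_DERIVATIVE = 0.01    # assumed (small) derivative of each F

plain = plain_gradient(LOCAL_DERIVATIVE, DEPTH)
residual = residual_gradient(LOCAL_DERIVATIVE, DEPTH)
print(f"plain chain:    {plain:.3e}")
print(f"residual chain: {residual:.3e}")
```

In this toy setting the plain chain shrinks the gradient to a numerically negligible value, while the residual chain keeps it at order one because each factor contains the identity term.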

## Architectural Variants of ResNet

The ResNet family includes several configurations that differ mainly in depth and in the type of residual block used. Summarizing:

|      Model | Layers | Parameters | Blocks per stage | Block type |
| ---------: | -----: | ---------: | ---------------- | ---------- |
|  ResNet-18 |     18 |      ~11 M | [2, 2, 2, 2]     | Basic      |
|  ResNet-34 |     34 |      ~21 M | [3, 4, 6, 3]     | Basic      |
|  ResNet-50 |     50 |      ~25 M | [3, 4, 6, 3]     | Bottleneck |
| ResNet-101 |    101 |      ~44 M | [3, 4, 23, 3]    | Bottleneck |
| ResNet-152 |    152 |      ~60 M | [3, 8, 36, 3]    | Bottleneck |
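The "Layers" column can be recovered from the "Blocks per stage" column. The sketch below assumes the usual counting convention: one initial convolution, two convolutions per basic block or three per bottleneck block, and one final fully connected layer, with the $1 \times 1$ projection convolutions on the shortcut branch left out. The helper name `count_weighted_layers` is hypothetical.

```python
from typing import Dict, List, Tuple

# Convolutions on the main branch of each block type.
CONVS_PER_BLOCK: Dict[str, int] = {"basic": 2, "bottleneck": 3}

# (blocks per stage, block type), taken from the table above.
VARIANTS: Dict[str, Tuple[List[int], str]] = {
    "ResNet-18": ([2, 2, 2, 2], "basic"),
    "ResNet-34": ([3, 4, 6, 3], "basic"),
    "ResNet-50": ([3, 4, 6, 3], "bottleneck"),
    "ResNet-101": ([3, 4, 23, 3], "bottleneck"),
    "ResNet-152": ([3, 8, 36, 3], "bottleneck"),
}


def count_weighted_layers(blocks_per_stage: List[int], block_type: str) -> int:
    """Initial conv + convs inside the residual blocks + final FC layer."""
    return 1 + CONVS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1


layer_counts = {
    name: count_weighted_layers(blocks, block_type)
    for name, (blocks, block_type) in VARIANTS.items()
}
for name, depth in layer_counts.items():
    print(f"{name}: {depth} layers")
```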

### Basic Block versus Bottleneck Block

In the shallower versions, such as **ResNet-18** and **ResNet-34**, the **basic block** is used, with the structure:

```text
x → [Conv 3×3] → [BN] → [ReLU] → [Conv 3×3] → [BN] → (+) → [ReLU]
↘______________________________________________________________|
```

This block consists of two $3 \times 3$ convolutions, each followed by batch normalization. ReLU is applied after the first normalization and again after the residual sum with the identity branch. Within a stage, the number of output channels matches that of the input (expansion factor 1).

In the deeper variants, such as **ResNet-50**, **ResNet-101**, and **ResNet-152**, the **bottleneck block** is used, whose purpose is to reduce computational cost while preserving representational capacity. Its structure is:

```text
x → [Conv 1×1] → [BN] → [ReLU]
  → [Conv 3×3] → [BN] → [ReLU]
  → [Conv 1×1] → [BN] → (+) → [ReLU]
↘______________________________________________________________|
```

This block combines three consecutive convolutions. The first, of size $1 \times 1$, reduces the channel dimensionality (for example, from 256 to 64 channels). The second, of size $3 \times 3$, performs the main processing on a reduced number of channels. The third, again of size $1 \times 1$, restores the original dimensionality (for example, from 64 back to 256 channels). The typical expansion factor is 4: the number of output channels is four times the number of intermediate channels.

### Computational Efficiency Analysis

The advantage of the bottleneck design becomes clear when comparing the number of parameters of equivalent configurations.

Consider two $3 \times 3$ convolutional layers with 256 input and output channels; the total number of parameters is

$$
\text{Parameters} = 2 \times (256 \times 3 \times 3 \times 256) = 1\,179\,648.
$$

In contrast, for a bottleneck block with pattern $256 \rightarrow 64 \rightarrow 256$:

First $1 \times 1$ convolution:

$$
256 \times 1 \times 1 \times 64 = 16\,384.
$$

Central $3 \times 3$ convolution:

$$
64 \times 3 \times 3 \times 64 = 36\,864.
$$

Last $1 \times 1$ convolution:

$$
64 \times 1 \times 1 \times 256 = 16\,384.
$$

The complete block therefore contains approximately

$$
16\,384 + 36\,864 + 16\,384 = 69\,632
$$

parameters, which represents a **reduction of about 94 %** compared with two $3 \times 3$ layers with 256 channels. The use of $1 \times 1$ layers to reduce and restore dimensionality allows the architecture to maintain representational capacity while drastically reducing computational cost.
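The arithmetic above can be checked with a few lines of Python. The sketch counts only convolution weights, assuming no bias terms and ignoring batch-normalization parameters; the helper `conv_params` is a hypothetical name used for illustration.

```python
def conv_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights of a square convolution without bias."""
    return in_channels * kernel_size * kernel_size * out_channels


# Two stacked 3x3 convolutions with 256 channels.
two_3x3 = 2 * conv_params(256, 256, 3)

# Bottleneck pattern 256 -> 64 -> 64 -> 256.
bottleneck = (
    conv_params(256, 64, 1)    # 1x1 reduction
    + conv_params(64, 64, 3)   # 3x3 processing
    + conv_params(64, 256, 1)  # 1x1 restoration
)

reduction_pct = 100.0 * (1.0 - bottleneck / two_3x3)
print(f"Two 3x3 layers: {two_3x3:,} parameters")
print(f"Bottleneck:     {bottleneck:,} parameters")
print(f"Reduction:      {reduction_pct:.1f} %")
```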

## ResNet in Historical Perspective

The comparison between LeNet-5, VGG-16, and ResNet-50 illustrates the evolution of deep convolutional architecture design:

| Feature                | LeNet-5 (1998)     | VGG-16 (2014)  | ResNet-50 (2015)                 |
| ---------------------- | ------------------ | -------------- | -------------------------------- |
| Layers                 | 7                  | 16             | 50                               |
| Parameters             | ~60 K              | ~138 M         | ~25 M                            |
| Key innovation         | Convolutions       | Uniform depth  | Skip connections                 |
| Main filters           | $5 \times 5$       | $3 \times 3$   | $1 \times 1$, $3 \times 3$       |
| Target problem         | Handwritten digits | Complex vision | Very deep networks               |
| Top-5 error (ImageNet) | n/a                | ~7.3 %         | ~3.6 % (ResNet ensemble entry)   |

LeNet-5 introduces convolution and pooling as essential mechanisms to exploit the spatial structure of images. VGG-16 increases depth to sixteen layers using a homogeneous architecture based on $3 \times 3$ filters, but at the cost of a very high number of parameters (on the order of 138 million). ResNet-50, in contrast, reaches fifty layers with about twenty-five million parameters, thanks to the systematic use of bottleneck blocks and residual connections, achieving a highly favorable combination of depth, stability, and efficiency.

## Current Impact and Importance of ResNet

ResNet is currently regarded as a reference architecture in both academic and industrial contexts. The balance between depth, training stability, and efficiency makes it the backbone of numerous systems for face recognition, autonomous driving, medical imaging diagnosis, and large-scale visual analysis in multiple domains.

Its advantages include efficient parameter usage (for example, ResNet-50 uses approximately five times fewer parameters than VGG-16), the ability to train networks with more than one hundred layers without severe gradient degradation, its suitability as a base structure for transfer learning, numerical stability during training, and versatility, which has inspired variants in vision, natural language processing, and other modalities.

Beyond solving a specific technical problem, ResNet redefines deep architecture design by explicitly incorporating identity paths that facilitate the flow of information and gradients through the network.

## Practical Implementation of ResNet for CIFAR-10

The following is a complete and functional implementation of ResNet (ResNet-18, ResNet-34, and ResNet-50) for the **CIFAR-10** dataset using PyTorch. The code is organized so that it can be directly converted into a Jupyter Notebook and executed sequentially, from data loading to training, evaluation, and visualization of results.

### Importing Libraries

The first step is to import the modules required for defining the architecture, managing data, training, and visualization.

In [None]:
# Standard libraries
from typing import Any, List, Type, Union
import time

# Third-party libraries
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
from torch import nn
from torch.utils.data import DataLoader
from torchinfo import summary
from torchvision import datasets, transforms
from tqdm import tqdm

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

### Global Hyperparameter Configuration

The constants and hyperparameters that will be used throughout the experiment are defined next.

In [None]:
# Global configuration
BATCH_SIZE: int = 128
NUM_EPOCHS: int = 100
LEARNING_RATE: float = 0.1
WEIGHT_DECAY: float = 1e-4
MOMENTUM: float = 0.9
NUM_CLASSES: int = 10
INPUT_SIZE: int = 32

# CIFAR-10 class names
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck"
]

print("Configuration:")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Initial learning rate: {LEARNING_RATE}")
print(f"  Momentum: {MOMENTUM}")
print(f"  Weight decay: {WEIGHT_DECAY}")
print(f"  Number of classes: {NUM_CLASSES}")

### Visualization Helper Function

The `show_images` function allows visualization of CIFAR-10 images together with their ground-truth labels and, optionally, model predictions.

In [None]:
def show_images(
    images,
    labels,
    predictions=None,
    classes=CIFAR10_CLASSES,
    mean=(0.4914, 0.4822, 0.4465),
    std=(0.2470, 0.2435, 0.2616),
):
    """
    Visualize a set of images with their labels and predictions.

    Args:
        images: Image tensor [N, C, H, W].
        labels: Label tensor [N].
        predictions: Optional tensor of predictions [N].
        classes: List of class names.
        mean: Per-channel mean used when the images were normalized.
        std: Per-channel standard deviation used when the images were normalized.
    """
    n_images = min(len(images), 8)
    fig, axes = plt.subplots(1, n_images, figsize=(2 * n_images, 3))
    if n_images == 1:
        axes = [axes]

    for idx in range(n_images):
        label = int(labels[idx])
        ax = axes[idx]

        # Undo the per-channel normalization (x = x_norm * std + mean)
        # and clip to the valid display range [0, 1]
        img = images[idx].cpu().numpy().transpose((1, 2, 0))
        img = np.clip(img * np.array(std) + np.array(mean), 0.0, 1.0)
        ax.imshow(img)

        title = f"True: {classes[label]}"
        if predictions is not None:
            pred = int(predictions[idx])
            color = "green" if pred == label else "red"
            title += f"\nPred: {classes[pred]}"
            ax.set_title(title, fontsize=9, color=color, fontweight="bold")
        else:
            ax.set_title(title, fontsize=9, fontweight="bold")
        ax.axis("off")

    plt.tight_layout()
    plt.show()

print("Visualization function defined correctly")

### Preparing the CIFAR-10 Dataset

CIFAR-10 is then loaded and the preprocessing and data augmentation transformations for training and validation are defined.

In [None]:
# CIFAR-10 normalization statistics
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

# Training transformations with data augmentation
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD)
])

# Validation/test transformations
transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD)
])

print("Downloading CIFAR-10 dataset...")
train_dataset = datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=transform_train
)
test_dataset = datasets.CIFAR10(
    root="./data",
    train=False,
    download=True,
    transform=transform_test
)

print("\nDataset statistics:")
print(f"  Training samples: {len(train_dataset):,}")
print(f"  Test samples: {len(test_dataset):,}")
print(f"  Number of classes: {len(train_dataset.classes)}")
print("  Image size: 32×32 pixels (RGB)")

Data augmentation introduces variations in cropping and horizontal flipping that help improve generalization. Normalization with the CIFAR-10 mean and standard deviation centers and scales each channel, which typically favors faster and more stable convergence.
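The per-channel normalization and its inverse can be sketched on a single RGB pixel in plain Python. The helpers `normalize_pixel` and `denormalize_pixel` are hypothetical and only mirror the arithmetic $(x - \mu) / \sigma$ that a per-channel normalization applies; the statistics are the CIFAR-10 values used in this notebook.

```python
from typing import Tuple

Pixel = Tuple[float, float, float]

CIFAR10_MEAN: Pixel = (0.4914, 0.4822, 0.4465)
CIFAR10_STD: Pixel = (0.2470, 0.2435, 0.2616)


def normalize_pixel(rgb: Pixel, mean: Pixel = CIFAR10_MEAN, std: Pixel = CIFAR10_STD) -> Pixel:
    """Center and scale each channel: (x - mean) / std."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


def denormalize_pixel(rgb_norm: Pixel, mean: Pixel = CIFAR10_MEAN, std: Pixel = CIFAR10_STD) -> Pixel:
    """Invert the normalization: x_norm * std + mean."""
    return tuple(v * s + m for v, m, s in zip(rgb_norm, mean, std))


gray = (0.5, 0.5, 0.5)               # mid-gray pixel in the [0, 1] range
normalized = normalize_pixel(gray)
restored = denormalize_pixel(normalized)
print(normalized)
print(restored)
```

The inverse operation is what a visualization routine needs before displaying normalized images, since plotting expects values back in the $[0, 1]$ range.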

### Creating DataLoaders

The training and test `DataLoader` objects are then defined, configuring the number of worker processes and other performance-oriented options.

In [None]:
train_dataloader = DataLoader(
    dataset=train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True
)
test_dataloader = DataLoader(
    dataset=test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True
)

print("DataLoaders configured:")
print(f"  Training batches: {len(train_dataloader)}")
print(f"  Test batches: {len(test_dataloader)}")
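As a quick sanity check on the batch counts printed by the cell above, the number of batches per epoch can be computed by hand. The sketch assumes the standard CIFAR-10 split sizes (50,000 training and 10,000 test images) and that the last, smaller batch is kept rather than dropped; `batches_per_epoch` is a hypothetical helper.

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)


BATCH_SIZE = 128
train_batches = batches_per_epoch(50_000, BATCH_SIZE)
test_batches = batches_per_epoch(10_000, BATCH_SIZE)
last_train_batch = 50_000 - (train_batches - 1) * BATCH_SIZE
print(f"Training batches: {train_batches} (last batch has {last_train_batch} samples)")
print(f"Test batches:     {test_batches}")
```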

### Visual Exploration of the Dataset

Before training the model, it is useful to inspect a batch of images to verify that loading and preprocessing are correctly configured.

In [None]:
# Get one training batch
data_iter = iter(train_dataloader)
train_images, train_labels = next(data_iter)

print("\nBatch dimensions:")
print(f"  Images: {train_images.shape}")
print(f"  Labels: {train_labels.shape}")

print("\nDisplaying first 8 samples...")
show_images(train_images[:8], train_labels[:8])

## ResNet Implementation

The basic and bottleneck residual blocks are implemented first, followed by the main `ResNet` class adapted to CIFAR-10 image size.

In [None]:
class BasicBlock(nn.Module):
    """
    Basic residual block for ResNet-18 and ResNet-34.

    Structure:
        x → [Conv 3×3] → [BN] → [ReLU]
          → [Conv 3×3] → [BN] → (+) → [ReLU]
        ↘__________________________________________________|

    Expansion: 1 (number of output channels = number of input channels).
    """

    expansion: int = 1

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        stride: int = 1,
        downsample: nn.Module = None
    ) -> None:
        super().__init__()

        # First 3×3 convolution
        self.conv1 = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=3,
            stride=stride,
            padding=1,
            bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)

        # Second 3×3 convolution
        self.conv2 = nn.Conv2d(
            out_channels,
            out_channels,
            kernel_size=3,
            stride=1,
            padding=1,
            bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut branch (dimensionality adjustment if needed)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = F.relu(out)
        return out

In [None]:
class Bottleneck(nn.Module):
    """
    Bottleneck block for ResNet-50, ResNet-101, and ResNet-152.

    Structure:
        x → [Conv 1×1] → [BN] → [ReLU]
          → [Conv 3×3] → [BN] → [ReLU]
          → [Conv 1×1] → [BN] → (+) → [ReLU]
        ↘__________________________________________________|

    Expansion: 4 (output channels = 4 × intermediate channels).
    """

    expansion: int = 4

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        stride: int = 1,
        downsample: nn.Module = None
    ) -> None:
        super().__init__()

        # 1×1 conv to reduce dimensionality
        self.conv1 = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=1,
            bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)

        # 3×3 conv for main processing
        self.conv2 = nn.Conv2d(
            out_channels,
            out_channels,
            kernel_size=3,
            stride=stride,
            padding=1,
            bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        # 1×1 conv to restore dimensionality
        self.conv3 = nn.Conv2d(
            out_channels,
            out_channels * self.expansion,
            kernel_size=1,
            bias=False
        )
        self.bn3 = nn.BatchNorm2d(out_channels * self.expansion)

        self.downsample = downsample
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = F.relu(out)
        return out

In [None]:
class ResNet(nn.Module):
    """
    ResNet implementation adapted for CIFAR-10.

    Differences with respect to the original ImageNet version:
      - First layer: Conv 3×3 instead of Conv 7×7.
      - No initial MaxPooling (images are 32×32).
      - Final classification layer adapted to CIFAR-10.

    Args:
        block: Block type (BasicBlock or Bottleneck).
        layers: List with the number of blocks per stage.
        num_classes: Number of output classes.
    """

    def __init__(
        self,
        block: Type[Union[BasicBlock, Bottleneck]],
        layers: List[int],
        num_classes: int = 10
    ) -> None:
        super().__init__()
        self.in_channels = 64

        # Initial layer adapted to CIFAR-10 (32×32)
        self.conv1 = nn.Conv2d(
            3, 64, kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)

        # Four stages of residual blocks
        self.layer1 = self._make_layer(block, 64,  layers[0], stride=1)
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        # Global pooling and final classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Weight initialization
        self._initialize_weights()

    def _make_layer(
        self,
        block: Type[Union[BasicBlock, Bottleneck]],
        out_channels: int,
        num_blocks: int,
        stride: int = 1
    ) -> nn.Sequential:
        """
        Build one stage of residual blocks.

        Args:
            block: Residual block type.
            out_channels: Number of output channels.
            num_blocks: Number of blocks in the stage.
            stride: Stride of the first block (downsampling).

        Returns:
            nn.Sequential containing the stage blocks.
        """
        downsample = None

        # Dimensionality adjustment in the shortcut branch
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(
                    self.in_channels,
                    out_channels * block.expansion,
                    kernel_size=1,
                    stride=stride,
                    bias=False
                ),
                nn.BatchNorm2d(out_channels * block.expansion)
            )

        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels * block.expansion
        for _ in range(1, num_blocks):
            layers.append(block(self.in_channels, out_channels))

        return nn.Sequential(*layers)

    def _initialize_weights(self) -> None:
        """
        Initialize weights using He (Kaiming) initialization.
        """
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(
                    m.weight,
                    mode="fan_out",
                    nonlinearity="relu"
                )
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        ResNet forward pass.

        Args:
            x: Input tensor [B, 3, 32, 32].

        Returns:
            Classification logits [B, num_classes].
        """
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)

        x = self.layer1(x)  # 32×32 → 32×32
        x = self.layer2(x)  # 32×32 → 16×16
        x = self.layer3(x)  # 16×16 →  8×8
        x = self.layer4(x)  #  8×8 →  4×4

        x = self.avgpool(x)  # 4×4 → 1×1
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

    def get_features(self, x: torch.Tensor) -> torch.Tensor:
        """
        Extract features before the classification layer.
        Useful for visualization of embeddings and transfer learning.
        """
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return x

Convenience functions are then defined to instantiate the main variants:

In [None]:
def resnet18(num_classes: int = 10) -> ResNet:
    """Construct a ResNet-18 for the given number of classes."""
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)


def resnet34(num_classes: int = 10) -> ResNet:
    """Construct a ResNet-34 for the given number of classes."""
    return ResNet(BasicBlock, [3, 4, 6, 3], num_classes)


def resnet50(num_classes: int = 10) -> ResNet:
    """Construct a ResNet-50 for the given number of classes."""
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)


print("ResNet architecture defined correctly")

This implementation clearly separates the fundamental components: `BasicBlock` for ResNet-18/34, `Bottleneck` for ResNet-50/101/152, and the `ResNet` class, which assembles the stages, manages downsampling, and applies global average pooling before the final classification.
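The spatial sizes annotated in the forward pass (32 → 32 → 16 → 8 → 4) follow from the standard convolution output-size formula $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below traces resolution and channel width through the stem and the four stages, assuming the configuration used in this notebook (a $3 \times 3$ stem with stride 1, stage widths 64/128/256/512, and a strided $3 \times 3$ convolution with padding 1 in the first block of each stage). The helper names are hypothetical.

```python
from typing import List, Tuple


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(input_size: int = 32, expansion: int = 1) -> List[Tuple[int, int]]:
    """Return (spatial size, channels) after the stem and after each stage."""
    size = conv_output_size(input_size, kernel=3, stride=1, padding=1)
    shapes = [(size, 64)]  # the stem always outputs 64 channels
    for width, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        # Only the first block of a stage changes the resolution.
        size = conv_output_size(size, kernel=3, stride=stride, padding=1)
        shapes.append((size, width * expansion))
    return shapes


basic_shapes = trace_shapes(expansion=1)       # BasicBlock (ResNet-18/34)
bottleneck_shapes = trace_shapes(expansion=4)  # Bottleneck (ResNet-50)
print("BasicBlock: ", basic_shapes)
print("Bottleneck: ", bottleneck_shapes)
```

The final channel count (512 or 2048) is the input width of the classifier, which is why the fully connected layer is sized as `512 * block.expansion`.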

### Model Instantiation and Analysis

A ResNet-18 instance adapted to CIFAR-10 is created and its structure and parameter count are examined using `torchinfo.summary`.

In [None]:
# Create ResNet-18
model = resnet18(num_classes=NUM_CLASSES)

# Select device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
print(f"Device used: {device}")

print(f"\n{'='*70}")
print("RESNET-18 ARCHITECTURE SUMMARY")
print(f"{'='*70}\n")
summary(model, input_size=(BATCH_SIZE, 3, 32, 32), device=str(device))


def count_parameters(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


print(f"\n{'='*70}")
print("PARAMETER ANALYSIS BY COMPONENT")
print(f"{'='*70}")
print(f"  Initial conv:     {count_parameters(model.conv1):>12,} parameters")
print(f"  Layer 1 (64 ch.): {count_parameters(model.layer1):>12,} parameters")
print(f"  Layer 2 (128 ch.):{count_parameters(model.layer2):>12,} parameters")
print(f"  Layer 3 (256 ch.):{count_parameters(model.layer3):>12,} parameters")
print(f"  Layer 4 (512 ch.):{count_parameters(model.layer4):>12,} parameters")
print(f"  FC classifier:    {count_parameters(model.fc):>12,} parameters")
print(f"  {'-'*66}")

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"  TOTAL:            {total_params:>12,} parameters")
print(f"  Trainable:        {trainable_params:>12,} parameters")
print(f"  Memory (float32): {total_params * 4 / (1024**2):>10.2f} MB")
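The totals reported by the cell above can be cross-checked analytically. The sketch below assumes the CIFAR-10 ResNet-18 defined in this notebook: bias-free convolutions, two learnable parameters per batch-normalization channel, a $1 \times 1$ projection plus batch normalization on the shortcut whenever stride or width changes, and a linear classifier with bias. All helper names are hypothetical.

```python
def conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Convolution weights (bias=False)."""
    return c_in * c_out * kernel * kernel


def bn_params(channels: int) -> int:
    """Learnable scale and shift; running statistics are not parameters."""
    return 2 * channels


def basic_block_params(c_in: int, c_out: int, stride: int) -> int:
    total = conv_params(c_in, c_out, 3) + bn_params(c_out)
    total += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if stride != 1 or c_in != c_out:
        # Projection shortcut: 1x1 convolution followed by batch normalization.
        total += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return total


def resnet18_cifar_params(num_classes: int = 10) -> int:
    total = conv_params(3, 64, 3) + bn_params(64)  # stem
    c_in = 64
    for c_out, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        total += basic_block_params(c_in, c_out, stride)  # first block of the stage
        total += basic_block_params(c_out, c_out, 1)      # second block of the stage
        c_in = c_out
    total += 512 * num_classes + num_classes  # fully connected layer with bias
    return total


total_params_estimate = resnet18_cifar_params()
memory_mb = total_params_estimate * 4 / (1024 ** 2)  # float32 = 4 bytes
print(f"Estimated parameters: {total_params_estimate:,}")
print(f"Estimated memory:     {memory_mb:.2f} MB")
```

The count is lower than the roughly 11.7 million usually quoted for the ImageNet ResNet-18, mainly because the classifier has 10 outputs instead of 1000 and the stem uses a $3 \times 3$ rather than a $7 \times 7$ convolution.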

### Training Configuration

Stochastic Gradient Descent (SGD) with Nesterov momentum is used as the optimizer, together with a `MultiStepLR` learning rate schedule, which is a standard configuration for ResNet on CIFAR-10.

In [None]:
print("TRAINING CONFIGURATION")
print(f"{'='*70}")
print(f"  Epochs: {NUM_EPOCHS}")
print(f"  Initial learning rate: {LEARNING_RATE}")
print(f"  Momentum: {MOMENTUM}")
print(f"  Weight decay (L2): {WEIGHT_DECAY}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"{'='*70}\n")

# Optimizer: SGD with Nesterov momentum
optimizer = torch.optim.SGD(
    params=model.parameters(),
    lr=LEARNING_RATE,
    momentum=MOMENTUM,
    weight_decay=WEIGHT_DECAY,
    nesterov=True
)

# Scheduler: MultiStepLR
# (the `verbose` argument is deprecated in recent PyTorch releases and is
# omitted here; the training loop prints the learning rate every epoch)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[60, 80],
    gamma=0.1
)

# Loss function
loss_function = nn.CrossEntropyLoss()

print("Optimizer: SGD with Nesterov momentum")
print("  Nesterov momentum adds a 'look-ahead' in the descent direction")
print("\nScheduler: MultiStepLR")
print("  Reduces learning rate ×0.1 at epochs 60 and 80")
print("  Classic strategy for training ResNet on CIFAR-10")
print("\nLoss function: CrossEntropyLoss")

In this setting, with milestones at 60 and 80 and counting epochs from zero, the `MultiStepLR` schedule is:

- Epochs 0–59: $\text{LR} = 0.1$ (exploration phase).
- Epochs 60–79: $\text{LR} = 0.01$ (refinement phase).
- Epochs 80–99: $\text{LR} = 0.001$ (fine-tuning phase).
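The schedule can be written in closed form: the learning rate at a given epoch is the initial rate multiplied by `gamma` once for every milestone already reached. The sketch below reproduces that rule in plain Python, with epochs counted from zero and one scheduler step per epoch; `multistep_lr` is a hypothetical helper, not a PyTorch function.

```python
from bisect import bisect_right
from typing import Sequence


def multistep_lr(
    epoch: int,
    base_lr: float = 0.1,
    milestones: Sequence[int] = (60, 80),
    gamma: float = 0.1,
) -> float:
    """Learning rate used during `epoch` (zero-indexed)."""
    # bisect_right counts how many milestones are <= epoch.
    milestones_reached = bisect_right(sorted(milestones), epoch)
    return base_lr * gamma ** milestones_reached


schedule = [multistep_lr(epoch) for epoch in range(100)]
for epoch in (0, 59, 60, 79, 80, 99):
    print(f"epoch {epoch:>2}: lr = {schedule[epoch]:.4f}")
```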

### Training and Validation Loop

The training loop is implemented with metric tracking and saving of the best model according to test accuracy.

In [None]:
# Metric storage
train_losses, train_accuracies = [], []
test_losses, test_accuracies = [], []
learning_rates = []

# Variables for saving the best model
best_test_acc = 0.0
best_epoch = 0


def calculate_accuracy(outputs: torch.Tensor, labels: torch.Tensor):
    _, predicted = torch.max(outputs, 1)
    correct = (predicted == labels).sum().item()
    total = labels.size(0)
    return correct, total


print("STARTING TRAINING\n")
print(f"{'='*70}\n")
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    epoch_start_time = time.time()

    # Training phase
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    train_loop = tqdm(
        train_dataloader,
        desc=f"Epoch {epoch + 1}/{NUM_EPOCHS} [TRAIN]",
        leave=False
    )
    for batch_image, batch_label in train_loop:
        batch_image = batch_image.to(device)
        batch_label = batch_label.to(device)

        optimizer.zero_grad()
        outputs = model(batch_image)
        loss = loss_function(outputs, batch_label)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        batch_correct, batch_total = calculate_accuracy(outputs, batch_label)
        correct += batch_correct
        total += batch_total
        train_loop.set_postfix({
            "loss": f"{loss.item():.4f}",
            "acc": f"{100 * correct / total:.2f}%"
        })

    epoch_train_loss = running_loss / len(train_dataloader)
    epoch_train_acc = 100 * correct / total
    train_losses.append(epoch_train_loss)
    train_accuracies.append(epoch_train_acc)

    # Validation phase
    model.eval()
    test_loss, correct_test, total_test = 0.0, 0, 0
    test_loop = tqdm(
        test_dataloader,
        desc=f"Epoch {epoch + 1}/{NUM_EPOCHS} [TEST]",
        leave=False
    )
    with torch.no_grad():
        for images, labels in test_loop:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            loss = loss_function(outputs, labels)

            test_loss += loss.item()
            batch_correct, batch_total = calculate_accuracy(outputs, labels)
            correct_test += batch_correct
            total_test += batch_total
            test_loop.set_postfix({
                "loss": f"{loss.item():.4f}",
                "acc": f"{100 * correct_test / total_test:.2f}%"
            })

    epoch_test_loss = test_loss / len(test_dataloader)
    epoch_test_acc = 100 * correct_test / total_test
    test_losses.append(epoch_test_loss)
    test_accuracies.append(epoch_test_acc)

    # Update scheduler; the value read afterwards is the learning rate
    # that will be used in the next epoch
    scheduler.step()
    current_lr = optimizer.param_groups[0]["lr"]
    learning_rates.append(current_lr)

    # Save best model according to test accuracy
    if epoch_test_acc > best_test_acc:
        best_test_acc = epoch_test_acc
        best_epoch = epoch + 1
        torch.save({
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "test_acc": best_test_acc,
        }, "resnet18_cifar10_best.pth")

    epoch_time = time.time() - epoch_start_time
    print(f"Epoch [{epoch + 1}/{NUM_EPOCHS}] - Time: {epoch_time:.2f}s")
    print(f"  Train → Loss: {epoch_train_loss:.4f} | Acc: {epoch_train_acc:.2f}%")
    print(f"  Test  → Loss: {epoch_test_loss:.4f} | Acc: {epoch_test_acc:.2f}%")
    print(f"  LR: {current_lr:.6f} | Best test acc: {best_test_acc:.2f}% (epoch {best_epoch})")
    print(f"  {'─'*66}\n")

total_time = time.time() - start_time
print(f"\n{'='*70}")
print("TRAINING COMPLETED")
print(f"{'='*70}")
print(f"  Total time: {total_time / 60:.2f} minutes")
print(f"  Average time per epoch: {total_time / NUM_EPOCHS:.2f} seconds")
print(f"  Final test accuracy: {test_accuracies[-1]:.2f}%")
print(f"  Best test accuracy: {best_test_acc:.2f}% at epoch {best_epoch}")

# Save final model and metrics
torch.save({
    "epoch": NUM_EPOCHS,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "train_losses": train_losses,
    "train_accuracies": train_accuracies,
    "test_losses": test_losses,
    "test_accuracies": test_accuracies,
    "best_test_acc": best_test_acc,
    "best_epoch": best_epoch,
}, "resnet18_cifar10_final.pth")

print("\nSaved models:")
print("  - resnet18_cifar10_best.pth (best model)")
print("  - resnet18_cifar10_final.pth (final model + metrics)")

Compared to architectures such as VGG, ResNet training on CIFAR-10 tends to be more stable and efficient at similar depths, owing to residual connections and moderate parameter usage.

### Visualization of Training Metrics

The evolution of loss, accuracy, and learning rate is visualized, along with the difference between training and test accuracy.

In [None]:
epochs_range = range(1, NUM_EPOCHS + 1)

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

# Loss
ax1.plot(epochs_range, train_losses, "o-", label="Train Loss",
         linewidth=2, markersize=3, alpha=0.7)
ax1.plot(epochs_range, test_losses, "s-", label="Test Loss",
         linewidth=2, markersize=3, alpha=0.7)
ax1.axvline(
    x=best_epoch,
    color="green",
    linestyle="--",
    alpha=0.5,
    label=f"Best epoch ({best_epoch})"
)
ax1.set_xlabel("Epoch", fontsize=12, fontweight="bold")
ax1.set_ylabel("Loss", fontsize=12, fontweight="bold")
ax1.set_title("Loss Evolution", fontsize=14, fontweight="bold")
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Accuracy
ax2.plot(epochs_range, train_accuracies, "o-", label="Train Accuracy",
         linewidth=2, markersize=3, alpha=0.7)
ax2.plot(epochs_range, test_accuracies, "s-", label="Test Accuracy",
         linewidth=2, markersize=3, alpha=0.7)
ax2.axvline(
    x=best_epoch,
    color="green",
    linestyle="--",
    alpha=0.5,
    label=f"Best epoch ({best_epoch})"
)
ax2.axhline(
    y=best_test_acc,
    color="red",
    linestyle="--",
    alpha=0.5,
    label=f"Best acc: {best_test_acc:.2f}%"
)
ax2.set_xlabel("Epoch", fontsize=12, fontweight="bold")
ax2.set_ylabel("Accuracy (%)", fontsize=12, fontweight="bold")
ax2.set_title("Accuracy Evolution", fontsize=14, fontweight="bold")
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

# Learning rate
ax3.plot(epochs_range, learning_rates, "o-", color="red",
         linewidth=2, markersize=3, alpha=0.7)
ax3.set_xlabel("Epoch", fontsize=12, fontweight="bold")
ax3.set_ylabel("Learning Rate", fontsize=12, fontweight="bold")
ax3.set_title("Learning Rate Schedule", fontsize=14, fontweight="bold")
ax3.set_yscale("log")
ax3.grid(True, alpha=0.3)
ax3.axvline(x=60, color="orange", linestyle="--", alpha=0.5, label="LR decay")
ax3.axvline(x=80, color="orange", linestyle="--", alpha=0.5)
ax3.legend(fontsize=10)

# Train–test gap
gap = np.array(train_accuracies) - np.array(test_accuracies)
ax4.plot(epochs_range, gap, "o-", color="purple",
         linewidth=2, markersize=3, alpha=0.7)
ax4.axhline(y=0, color="black", linestyle="-", linewidth=0.5)
ax4.axhline(
    y=5,
    color="red",
    linestyle="--",
    alpha=0.5,
    label="Overfitting threshold (5%)"
)
ax4.set_xlabel("Epoch", fontsize=12, fontweight="bold")
ax4.set_ylabel("Train–Test Gap (%)", fontsize=12, fontweight="bold")
ax4.set_title("Train–Test Accuracy Difference", fontsize=14, fontweight="bold")
ax4.legend(fontsize=10)
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("resnet18_training_history.png", dpi=300, bbox_inches="tight")
plt.show()

print("\nResult analysis:")
final_gap = train_accuracies[-1] - test_accuracies[-1]
print(f"  Overfitting detected: {'YES' if final_gap > 10 else 'NO'}")
print(f"  Final train–test gap: {final_gap:.2f}%")
print(f"  Best epoch: {best_epoch}")
print(f"  Improvement from epoch 1: {test_accuracies[-1] - test_accuracies[0]:.2f}%")

Moderate values of the train–test accuracy gap (a few percentage points) indicate a good balance between fit and generalization. The learning rate drops at epochs 60 and 80 are typically followed by a visible decrease in loss and an improvement in test accuracy, since smaller update steps let the optimizer settle into a lower region of the loss surface.
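
As a rough illustration of the two quantities discussed here, the following plain-Python sketch reproduces a step-decay schedule and the per-epoch accuracy gap. Only the milestone epochs (60 and 80) come from the text; the function names, the base learning rate of 0.1, and the decay factor of 0.1 are assumptions chosen for the example, and the accuracy values are hypothetical.

```python
def step_decay_lr(epoch, base_lr=0.1, milestones=(60, 80), gamma=0.1):
    """Return the learning rate used while training a 1-indexed epoch.

    base_lr and gamma are illustrative assumptions; only the milestone
    epochs (60 and 80) come from the text.
    """
    # Count how many milestones have already been passed before this epoch.
    drops = sum(1 for milestone in milestones if epoch > milestone)
    return base_lr * gamma ** drops


def accuracy_gap(train_acc, test_acc):
    """Per-epoch train-test accuracy gap, in percentage points."""
    return [tr - te for tr, te in zip(train_acc, test_acc)]


if __name__ == "__main__":
    for epoch in (1, 60, 61, 80, 81, 100):
        print(f"epoch {epoch:3d}: lr = {step_decay_lr(epoch):.4f}")

    # Hypothetical accuracies, for illustration only.
    print(accuracy_gap([90.0, 95.0], [88.0, 91.5]))
```

Note that the training loop reads the learning rate after calling `scheduler.step()`, so the value logged for a given epoch is the one that will be applied in the following epoch.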

### Visualization of Model Predictions

Finally, model predictions on the test set are visualized, including both correctly classified examples and some errors, allowing qualitative inspection of model behavior.

In [None]:
print("\nVisualizing predictions of the best model...")

# Restore the best checkpoint saved during training; without this step,
# `model` would still hold the weights from the final epoch.
checkpoint = torch.load("resnet18_cifar10_best.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])

# Get one test batch
data_iter = iter(test_dataloader)
test_images, test_labels = next(data_iter)

model.eval()
with torch.no_grad():
    test_images_device = test_images.to(device)
    outputs = model(test_images_device)
    _, predictions = torch.max(outputs, 1)
    predictions = predictions.cpu()

print("\nFirst 8 predictions:")
show_images(test_images[:8], test_labels[:8], predictions[:8])

# Examples of misclassifications
incorrect_indices = (predictions != test_labels).nonzero(as_tuple=True)[0]
if len(incorrect_indices) >= 8:
    print("\nExamples of incorrect predictions:")
    error_indices = incorrect_indices[:8]
    show_images(
        test_images[error_indices],
        test_labels[error_indices],
        predictions[error_indices]
    )
else:
    print(f"\nOnly {len(incorrect_indices)} misclassifications in this batch")

This end-to-end implementation illustrates in practice how the theoretical ideas of residual learning translate into a stable and efficient architecture for training deep convolutional networks on CIFAR-10.