<a href="https://colab.research.google.com/github/garylau1/model_training/blob/main/ResNet_from_scatach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Deep learning has revolutionized the field of computer vision, and one of its cornerstone architectures is the Residual Network (ResNet). ResNet-50, a specific variant of this architecture, is widely recognized for its exceptional ability to train deep networks by addressing the vanishing gradient problem. In this project, I aim to implement ResNet-50 from scratch to gain a deeper understanding of its inner workings, layer-by-layer construction, and the overall design principles that make it so effective.

The implementation process begins with building the foundational components of ResNet-50, including the Bottleneck blocks, which are the core building blocks of the network. These blocks allow ResNet-50 to achieve remarkable depth while maintaining computational efficiency. Subsequently, I will assemble the other essential layers, such as the convolutional layers, downsampling modules, and fully connected layers, to complete the architecture.

Once the architecture is fully constructed, I will demonstrate how to integrate pretrained weights into the custom model. By using pretrained weights, the model can leverage prior knowledge gained from training on large datasets, significantly enhancing its performance and reducing the training time required for new tasks. This final step not only validates the accuracy of the implementation but also showcases the versatility of ResNet-50 when applied to practical problems.

Through this project, I aim to develop a thorough understanding of ResNet-50 and its components, while also exploring the practical aspects of transferring knowledge using pretrained weights.

As a further experiment, I will evaluate the effect of torch.compile on training speed. torch.compile can optimize model execution, and the aim is to measure how much it changes training time on the GPU. The experiment uses the CIFAR-10 dataset and focuses on the difference in training speed rather than accuracy.



In [None]:
# During the torch.compile experiment, a runtime error reported that Triton was
# either not installed or outdated. Triton is a required dependency of PyTorch's
# default Inductor backend, which torch.compile uses for backend optimizations.
#
# Installing the nightly Triton build resolved the issue:
%pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

In [None]:
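# Import torch, torch.nn, and torchvision
import torch
from torch import nn
import torchvision

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Set the device globally
#torch.set_default_device(device)

# Allow TF32 matrix multiplications on GPUs that support them
torch.backends.cuda.matmul.allow_tf32 = True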

# The Bottleneck block

The Bottleneck block is a critical component of the ResNet-50 architecture, designed to enhance computational efficiency while maintaining expressive power. This block uses a three-layer structure of 1x1, 3x3, and 1x1 convolutional layers: the first 1x1 convolution reduces the number of channels, the 3x3 convolution processes the narrower feature maps, and the last 1x1 convolution restores the channel count. Running the expensive 3x3 convolution on fewer channels lowers the computational cost while retaining the ability to extract complex features.

In this implementation, the Bottleneck block supports downsampling and flexible stride configurations, making it adaptable for different stages of the ResNet architecture. Additionally, the shortcut connection allows the network to learn residual mappings, addressing the degradation problem in deep networks.

Despite its name, the `change_kernel` flag does not alter any kernel size: it sets the stride of the 3x3 convolution and of the shortcut projection to 2, which halves the spatial resolution at the start of a new stage.
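To make the efficiency argument concrete, the following sketch counts convolution weights for a bottleneck with the default sizes used in this notebook (256 -> 64 -> 64 -> 256). The comparison block, two full-width 3x3 convolutions, is a hypothetical alternative chosen only for illustration, and the helper `conv_params` is not part of the notebook.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Number of weights in a bias-free 2D convolution."""
    return in_ch * out_ch * kernel * kernel


# Bottleneck with the notebook's default sizes: 256 -> 64 -> 64 -> 256
bottleneck = (
    conv_params(256, 64, 1)    # 1x1 convolution: reduce channels
    + conv_params(64, 64, 3)   # 3x3 convolution: spatial processing
    + conv_params(64, 256, 1)  # 1x1 convolution: restore channels
)

# Hypothetical alternative: two 3x3 convolutions at the full 256 channels
plain = 2 * conv_params(256, 256, 3)

print(bottleneck)                    # 69632
print(plain)                         # 1179648
print(round(plain / bottleneck, 1))  # 16.9
```

BatchNorm parameters and the shortcut projection are left out of both counts, so the ratio is only a rough indication of the savings.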

In [None]:
class Bottleneck(nn.Module):
    """
    Implementation of the Bottleneck block for ResNet.

    The Bottleneck block is a three-layer residual block used in ResNet architectures.
    It performs dimensionality reduction and restoration using `1x1` convolutions
    while applying spatial processing with a `3x3` convolution. A shortcut
    connection is added to facilitate residual learning.

    Args:
        in_channel (int): Number of input channels.
        hidden_ (int): Number of intermediate channels (reduced dimension).
        out_channel (int): Number of output channels.
        kernel_sizes (int): Kernel size for the `1x1` convolutions and the shortcut projection (default: 1).
        stride (int): Stride passed to the convolutional layers (default: 1).
        downsample (bool): Whether the shortcut path applies a `1x1` convolution and BatchNorm projection (default: True).
        change_kernel (bool): Whether to use stride 2 in the `3x3` convolution and in the shortcut projection (default: False).

    Attributes:
        conv1 (nn.Conv2d): First `1x1` convolution layer for dimensionality reduction.
        bn1 (nn.BatchNorm2d): BatchNorm layer for the first convolution.
        conv2 (nn.Conv2d): Second `3x3` convolution layer for spatial processing.
        bn2 (nn.BatchNorm2d): BatchNorm layer for the second convolution.
        conv3 (nn.Conv2d): Third `1x1` convolution layer for dimensionality restoration.
        bn3 (nn.BatchNorm2d): BatchNorm layer for the third convolution.
        relu (nn.ReLU): ReLU activation function.
        downsample (nn.Sequential): Optional downsampling shortcut connection.

    Methods:
        forward(x):
            Defines the forward pass of the Bottleneck block.
    """
    def __init__(self,
                 in_channel=256, hidden_=64, out_channel=256,
                 kernel_sizes=1, stride=1,
                 downsample=True, change_kernel=False):
        super().__init__()

        self.downsamples = downsample  # Flag for applying downsampling

        # First 1x1 convolution: reduces the number of channels (dimensionality reduction)
        self.conv1 = nn.Conv2d(
            in_channels=in_channel, out_channels=hidden_,
            kernel_size=kernel_sizes, stride=stride, bias=False
        )
        self.bn1 = nn.BatchNorm2d(hidden_, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

        # Second 3x3 convolution: applies spatial processing
        self.conv2 = nn.Conv2d(
            in_channels=hidden_, out_channels=hidden_,
            kernel_size=3, padding=1, stride=stride, bias=False
        )
        if change_kernel:  # Modify stride for downsampling in the second convolution
            self.conv2 = nn.Conv2d(
                in_channels=hidden_, out_channels=hidden_,
                kernel_size=3, padding=1, stride=2, bias=False
            )
        self.bn2 = nn.BatchNorm2d(hidden_, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

        # Third 1x1 convolution: restores the number of channels (dimensionality restoration)
        self.conv3 = nn.Conv2d(
            in_channels=hidden_, out_channels=out_channel,
            kernel_size=kernel_sizes, stride=stride, bias=False
        )
        self.bn3 = nn.BatchNorm2d(out_channel, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

        # ReLU activation: introduces non-linearity
        self.relu = nn.ReLU(inplace=True)

        # Downsampling shortcut if specified
        if self.downsamples:
            self.downsample = nn.Sequential(
                nn.Conv2d(
                    in_channels=in_channel, out_channels=out_channel,
                    kernel_size=kernel_sizes, stride=stride, bias=False
                ),
                nn.BatchNorm2d(out_channel, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
            )
            if change_kernel:  # Modify stride for downsampling in the shortcut path
                self.downsample = nn.Sequential(
                    nn.Conv2d(
                        in_channels=in_channel, out_channels=out_channel,
                        kernel_size=kernel_sizes, stride=2, bias=False
                    ),
                    nn.BatchNorm2d(out_channel, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
                )

    def forward(self, x):
        """
        Forward pass through the Bottleneck block.

        Args:
            x (torch.Tensor): Input tensor with shape (batch_size, in_channel, height, width).

        Returns:
            torch.Tensor: Output tensor after applying the Bottleneck operations.
        """
        skip_x = x  # Store the original input for the residual connection

        # 1x1 reduce -> 3x3 spatial -> 1x1 restore, with ReLU after the first two BatchNorms
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.relu(self.bn2(self.conv2(x)))
        x = self.bn3(self.conv3(x))

        # Project the shortcut when requested; otherwise keep the identity shortcut
        if self.downsamples:
            skip_x = self.downsample(skip_x)

        # Add the residual (shortcut) connection, then apply the final ReLU
        x = self.relu(x + skip_x)
        return x

# ResNet-50 architecture from scratch

In this section, we implement the ResNet-50 architecture from scratch. ResNet-50 is a widely used deep convolutional neural network designed for image classification tasks. It is known for its ability to achieve high performance on complex datasets due to the use of residual connections that mitigate the vanishing gradient problem in deep networks.

Our implementation follows these key steps:

- Initial Layers: The model begins with a convolutional layer, followed by batch normalization, ReLU activation, and max pooling, which reduce the input's spatial dimensions while capturing low-level features.

- Residual Layers: The core of the model consists of four stages (layer1 to layer4). Each stage is built from Bottleneck blocks, whose shortcut connections add the block's input to the output of its stack of convolutional layers. The number of filters increases from stage to stage, allowing the model to learn hierarchical feature representations.

- Global Pooling and Classification: After the residual layers, the model applies adaptive average pooling to reduce each feature map to 1x1. A fully connected layer then maps the 2048 pooled features to class scores (logits).

This design reflects the structure of the original ResNet-50 architecture. By implementing it step by step, we not only replicate its functionality but also gain a deeper understanding of its inner workings. Finally, we prepare the model to load pretrained weights, which enhances its performance on various tasks without the need for training from scratch.
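As a quick sanity check on the steps above, the spatial size at each stage can be traced with the standard convolution output-size formula, assuming a 224x224 input and the kernel, stride, and padding values used in this notebook. The helper `conv_out` is introduced here only for illustration. The same sketch also counts the weighted layers that give ResNet-50 its name.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a conv/pool window (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


sizes = {"input": 224}
sizes["conv1"] = conv_out(sizes["input"], kernel=7, stride=2, padding=3)
sizes["maxpool"] = conv_out(sizes["conv1"], kernel=3, stride=2, padding=1)
sizes["layer1"] = sizes["maxpool"]  # layer1 uses stride 1 throughout

# layer2 to layer4 each start with a stride-2 3x3 convolution
previous = "layer1"
for stage in ("layer2", "layer3", "layer4"):
    sizes[stage] = conv_out(sizes[previous], kernel=3, stride=2, padding=1)
    previous = stage

print(sizes)
# {'input': 224, 'conv1': 112, 'maxpool': 56, 'layer1': 56, 'layer2': 28, 'layer3': 14, 'layer4': 7}

# Weighted layers: conv1 + three convolutions per Bottleneck block + fc
blocks_per_stage = [3, 4, 6, 3]
weighted_layers = 1 + 3 * sum(blocks_per_stage) + 1
print(weighted_layers)  # 50
```

The shortcut projections are not included in the count of 50, following the usual convention.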

In [None]:
class ResNeT_copy(nn.Module):
    """
    Implementation of the ResNet-50 architecture from scratch.

    This class builds the ResNet-50 model step by step using the following components:
    - Initial convolutional layer with BatchNorm, ReLU, and max pooling.
    - Four sequential layers (layer1 to layer4) comprising Bottleneck blocks,
      with increasing channel dimensions as the network deepens.
    - Adaptive average pooling to reduce the spatial dimensions to 1x1.
    - Fully connected (linear) layer for classification.

    Args:
        None. Default settings are used to build ResNet-50.

    Attributes:
        conv1 (nn.Conv2d): Initial convolutional layer with 64 filters of size 7x7.
        bn1 (nn.BatchNorm2d): Batch normalization layer for the initial convolution.
        relu (nn.ReLU): ReLU activation function.
        maxpool (nn.MaxPool2d): Max pooling layer to reduce spatial dimensions.
        layer1-4 (nn.Sequential): Stacked Bottleneck blocks forming the ResNet layers.
        avgpool (nn.AdaptiveAvgPool2d): Adaptive average pooling to produce a fixed-size feature map.
        fc (nn.Linear): Fully connected layer for classification into 1000 classes.

    Methods:
        forward(x):
            Defines the forward pass through the entire ResNet-50 model.
    """
    def __init__(self):
        super().__init__()

        # Initial convolutional layer: captures basic image features
        self.conv1 = nn.Conv2d(
            in_channels=3, out_channels=64,
            kernel_size=7, stride=2, padding=3, bias=False
        )
        self.bn1 = nn.BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        self.relu = nn.ReLU(inplace=True)  # Adds non-linearity
        self.maxpool = nn.MaxPool2d(3, 2, 1, dilation=1, ceil_mode=False)

        # First layer: 64 input channels, expanded to 256 in the bottleneck blocks
        self.layer1 = nn.Sequential(
            Bottleneck(64),  # First block: projection shortcut maps 64 -> 256 channels (no spatial downsampling)
            *[Bottleneck(downsample=False) for _ in range(2)]  # Two additional 256 -> 64 -> 256 blocks
        )

        # Second layer: Expands from 256 to 512 channels
        self.layer2 = nn.Sequential(
            Bottleneck(256, 128, 512, change_kernel=True),  # First block with stride 2
            *[Bottleneck(512, 128, 512, downsample=False) for i in range(3)]  # Additional blocks
        )

        # Third layer: Expands from 512 to 1024 channels
        self.layer3 = nn.Sequential(
            Bottleneck(512, 256, 1024, change_kernel=True),  # First block with stride 2
            *[Bottleneck(1024, 256, 1024, downsample=False) for i in range(5)]  # Additional blocks
        )

        # Fourth layer: Expands from 1024 to 2048 channels
        self.layer4 = nn.Sequential(
            Bottleneck(1024, 512, 2048, change_kernel=True),  # First block with stride 2
            *[Bottleneck(2048, 512, 2048, downsample=False) for i in range(2)]  # Additional blocks
        )

        # Adaptive average pooling: Reduces each feature map to 1x1
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))

        # Fully connected layer: Maps the 2048 features to 1000 classes
        self.fc = nn.Linear(2048, 1000, bias=True)

    def forward(self, x):
        """
        Forward pass through the ResNet-50 model.

        Args:
            x (torch.Tensor): Input tensor with shape (batch_size, 3, height, width).

        Returns:
            torch.Tensor: Output tensor with shape (batch_size, 1000), representing class scores.
        """
        # Initial convolutional block
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        # Pass through the ResNet layers
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        # Global average pooling
        x = self.avgpool(x)

        # Flatten and apply the fully connected layer
        x = torch.flatten(x, start_dim=1)
        x = self.fc(x)

        return x

# Verification:

To verify the functionality of the ResNet-50 implementation, we perform a simple forward pass using a test tensor. This tensor simulates an image batch with the following characteristics:

Shape: (1, 3, 224, 224)

- Batch size = 1 (single image).

- Channels = 3 (RGB image).

- Height and Width = 224 pixels (standard input size for ResNet models).

The goal of this test is to ensure that the network processes the input tensor correctly through all layers and outputs a tensor with the expected shape (1, 1000), representing predictions for 1000 classes (as per ImageNet classification).

In [None]:
# Test the ResNet-50 implementation with a dummy input tensor.
test_tensor = torch.ones((1, 3, 224, 224))  # Create a tensor simulating a batch of RGB images.

# Instantiate the ResNet-50 model and pass the test tensor through it.
output = ResNeT_copy()(test_tensor)

# Print the shape of the output tensor.
print(output.shape)

torch.Size([1, 1000])


In [None]:
!pip install -q torchinfo

In [None]:
# Print out the model
from torchinfo import summary
summary(model=ResNeT_copy(),input_size=(1,3,224,224),col_names=["input_size","output_size","num_params","trainable"],
        col_width=15,
        row_settings=["var_names"])

Layer (type (var_name))                  Input Shape     Output Shape    Param #         Trainable
ResNeT_copy (ResNeT_copy)                [1, 3, 224, 224] [1, 1000]       --              True
├─Conv2d (conv1)                         [1, 3, 224, 224] [1, 64, 112, 112] 9,408           True
├─BatchNorm2d (bn1)                      [1, 64, 112, 112] [1, 64, 112, 112] 128             True
├─ReLU (relu)                            [1, 64, 112, 112] [1, 64, 112, 112] --              --
├─MaxPool2d (maxpool)                    [1, 64, 112, 112] [1, 64, 56, 56] --              --
├─Sequential (layer1)                    [1, 64, 56, 56] [1, 256, 56, 56] --              True
│    └─Bottleneck (0)                    [1, 64, 56, 56] [1, 256, 56, 56] --              True
│    │    └─Conv2d (conv1)               [1, 64, 56, 56] [1, 64, 56, 56] 4,096           True
│    │    └─BatchNorm2d (bn1)            [1, 64, 56, 56] [1, 64, 56, 56] 128             True
│    │    └─Conv2d (conv2)               [1

# Load pretrained weights into our custom ResNet-50 model

In this part, we load pretrained weights into our custom ResNet-50 model. Using pretrained weights allows the model to leverage knowledge learned from large datasets (like ImageNet) without requiring extensive training from scratch. This greatly improves performance for tasks such as image classification.

Here’s what happens step by step:

- Retrieve Pretrained Weights: The pretrained weights for ResNet-50 are obtained using torchvision.models.ResNet50_Weights.DEFAULT.get_state_dict(). These weights represent the parameters learned by the model during training on ImageNet.

- Load Weights: We load these weights into our ResNeT_copy model using load_state_dict(). The strict=True argument requires every parameter and buffer name in our model to match the pretrained state dict exactly, and raises an error otherwise. A successful load confirms that names and shapes agree; it does not by itself confirm that the forward pass is computed in the same way.
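The effect of `strict=True` can be seen on a small scale. The sketch below uses two hypothetical toy modules (`Reference` and `Renamed`, not part of this notebook) that differ only in the name of one layer, and shows how `load_state_dict` reacts in strict and non-strict mode.

```python
import torch
from torch import nn


class Reference(nn.Module):
    """Hypothetical toy model standing in for the pretrained network."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=1, bias=False)
        self.fc = nn.Linear(4, 2)


class Renamed(nn.Module):
    """Same layers, but the classifier is called `head` instead of `fc`."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=1, bias=False)
        self.head = nn.Linear(4, 2)


state = Reference().state_dict()

# Matching names: the load succeeds
print(Reference().load_state_dict(state, strict=True))  # <All keys matched successfully>

# Mismatched names: strict=True raises a RuntimeError
try:
    Renamed().load_state_dict(state, strict=True)
    strict_failed = False
except RuntimeError:
    strict_failed = True
print(strict_failed)  # True

# strict=False loads what matches and reports the rest
report = Renamed().load_state_dict(state, strict=False)
print(sorted(report.missing_keys))     # ['head.bias', 'head.weight']
print(sorted(report.unexpected_keys))  # ['fc.bias', 'fc.weight']
```

This is why the attribute names in the custom model (conv1, bn1, layer1, downsample, fc, and so on) have to follow the torchvision naming for the pretrained weights to load.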

In [None]:
pretrained_weights = torchvision.models.ResNet50_Weights.DEFAULT.get_state_dict()

# strict=True raises an error if any parameter or buffer name does not match
ResNeT_copy().load_state_dict(pretrained_weights, strict=True)

<All keys matched successfully>

We also define a function to load and prepare the custom ResNet-50 model with pretrained weights and adjustments for the desired number of output classes. This function performs several key tasks:

- Load Model: We instantiate our custom ResNet-50 model and load the pretrained weights from the official ResNet-50 model.

- Set Transformation: The transformation function corresponding to the pretrained model is applied, with an adjustment to resize the input image to the required size of 224x224 pixels.

- Freeze Parameters: The weights for all layers, except the final fully connected layer, are frozen by setting requires_grad=False. This means the model will not update these parameters during training, allowing us to fine-tune only the last layer.

- Modify Output Layer: The final fully connected layer is replaced with a new one that has `classes` output units, so the output size can be set per task (for example, 10 for CIFAR-10). The new layer is created after the freeze, so its parameters remain trainable.
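The freeze-then-replace pattern described above can be checked on a small stand-in model. `TinyNet` below is a hypothetical toy network used only for illustration; the point is that parameters created after the freeze keep the default `requires_grad=True`.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained backbone with an `fc` head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(8),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(8, 1000)

    def forward(self, x):
        return self.fc(torch.flatten(self.features(x), start_dim=1))


model = TinyNet()

# Freeze every existing parameter
for param in model.parameters():
    param.requires_grad = False

# Replace the head; the new layer's parameters are trainable by default
model.fc = nn.Linear(8, 10)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)    # ['fc.weight', 'fc.bias']
print(n_trainable)  # 90

# Optionally hand only the trainable parameters to the optimizer
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=0.003
)
```

Note that `requires_grad=False` does not stop BatchNorm layers from updating their running statistics while the model is in training mode.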

In [None]:
# Function to load and prepare the custom ResNet-50 model with pretrained weights.
def load_model(classes=1000):
    # Step 1: Instantiate the custom ResNet-50 model.
    our_model = ResNeT_copy()

    # Step 2: Retrieve the pretrained weights for ResNet-50.
    pretrained_weights = torchvision.models.ResNet50_Weights.DEFAULT.get_state_dict()

    # Step 3: Apply the transformation function for the pretrained model.
    transform_ = torchvision.models.ResNet50_Weights.DEFAULT.transforms()

    # Resize the input images to 224x224 as expected by ResNet-50.
    transform_.resize_size = 224

    # Step 4: Load the pretrained weights into the model.
    our_model.load_state_dict(pretrained_weights, strict=True)

    # Step 5: Freeze all parameters (no gradient updates during training).
    for i in our_model.parameters():
        i.requires_grad = False

    # Step 6: Replace the final fully connected layer to match the number of output classes.
    our_model.fc = nn.Linear(2048, classes)

    return our_model, transform_



In [None]:
our_model,transform_=load_model()

# Load our Dataset for testing

In this part, we load the CIFAR-10 dataset and apply the pretrained model's preprocessing transform to both the training and test splits.

In [None]:
# Create train and test datasets
train_dataset = torchvision.datasets.CIFAR10(root='.',
                                             train=True,
                                             download=True,
                                             transform=transform_)

test_dataset = torchvision.datasets.CIFAR10(root='.',
                                            train=False, # want the test split
                                            download=True,
                                            transform=transform_)

# Get the lengths of the datasets
train_len = len(train_dataset)
test_len = len(test_dataset)

from torch.utils.data import DataLoader

# Create DataLoaders
import os
# Use all available CPU cores as DataLoader workers
NUM_WORKERS = os.cpu_count()


train_dataloader = DataLoader(dataset=train_dataset,
                              batch_size=64,
                              shuffle=True,
                              num_workers=NUM_WORKERS)

test_dataloader = DataLoader(dataset=test_dataset,
                              batch_size=64,
                              shuffle=False,
                              num_workers=NUM_WORKERS)




Files already downloaded and verified
Files already downloaded and verified
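With a batch size of 64, the number of batches per epoch follows directly from the dataset sizes. The short calculation below assumes the standard CIFAR-10 split of 50,000 training and 10,000 test images; `batches_per_epoch` is a helper introduced only for this sketch.

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a DataLoader yields for one pass over the data."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


train_batches = batches_per_epoch(50_000, 64)
test_batches = batches_per_epoch(10_000, 64)
last_train_batch = 50_000 % 64  # size of the final, incomplete batch

print(train_batches)     # 782
print(test_batches)      # 157
print(last_train_batch)  # 16
```

These values should equal `len(train_dataloader)` and `len(test_dataloader)` for the loaders defined above.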


# Training and testing function:

In this section, we implement the training and testing loops used to evaluate the ResNet model. The training loop runs the forward pass on each training batch, computes the loss, and updates the trainable parameters through backpropagation and an optimizer step. The testing loop evaluates the model on held-out data without updating any parameters, reporting loss and accuracy as a measure of how well the model generalizes. Both loops record loss and accuracy for each epoch, along with the time taken by each training and testing phase.


Part of the code for these loops is borrowed and adapted from the tutorial on Learn PyTorch, which provides an excellent introduction to implementing basic training and evaluation loops in PyTorch.

Reference: https://www.learnpytorch.io/pytorch_2_intro/
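One detail of these loops is worth spelling out: they average the per-batch accuracies, which is not quite the same as the fraction of correctly classified samples when the last batch is smaller than the others. The sketch below illustrates the difference with made-up counts; the numbers and both helper functions are hypothetical.

```python
def mean_of_batch_accuracies(correct, sizes):
    """Average of per-batch accuracies (what the loops in this notebook report)."""
    return sum(c / n for c, n in zip(correct, sizes)) / len(sizes)


def sample_weighted_accuracy(correct, sizes):
    """Fraction of all samples that were classified correctly."""
    return sum(correct) / sum(sizes)


# Hypothetical epoch: two full batches of 64 and a final batch of 16
sizes = [64, 64, 16]
correct = [48, 48, 16]

print(round(mean_of_batch_accuracies(correct, sizes), 4))  # 0.8333
print(round(sample_weighted_accuracy(correct, sizes), 4))  # 0.7778
```

With hundreds of batches per epoch the gap is small, so the simpler per-batch average is a reasonable choice when the focus is on timing.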

In [None]:
import time
from tqdm.auto import tqdm
from typing import Dict, List, Tuple
# Create the generator on the selected device so this cell also runs without a GPU
generator = torch.Generator(device=device)

def train_epoch(epoch: int,
                model: torch.nn.Module,
                data_loader: torch.utils.data.DataLoader,
                loss_function: torch.nn.Module,
                optimizer: torch.optim.Optimizer,
                device: torch.device,
                disable_progress_bar: bool = False) -> Tuple[float, float]:
  """Executes a single epoch of training for the model.

  This function sets the model to training mode and processes the training data
  in batches. It performs the forward pass, computes the loss, and updates the
  model parameters using the optimizer.

  Args:
    epoch: Index of the current epoch, shown in the progress bar.
    model: The model to be trained.
    data_loader: A DataLoader instance that supplies batches of training data.
    loss_function: The loss function to calculate the discrepancy between predictions and true labels.
    optimizer: Optimizer that adjusts model parameters based on computed gradients.
    device: The device on which computations are performed (e.g., "cpu" or "cuda").
    disable_progress_bar: A flag to disable the progress bar.

  Returns:
    A tuple containing the average training loss and accuracy for the epoch.
  """
  # Set the model to training mode
  model.train()

  # Initialize variables to accumulate loss and accuracy metrics
  total_loss, total_accuracy = 0, 0

  # Set up a progress bar that shows how many batches have been processed out
  # of the total; it can be hidden with disable_progress_bar.
  progress_bar = tqdm(
        enumerate(data_loader),
        desc=f"Training Epoch {epoch}",
        total=len(data_loader),
        disable=disable_progress_bar
    )

  for batch_idx, (inputs, targets) in progress_bar:
      # Transfer inputs and targets to the target device
      inputs, targets = inputs.to(device), targets.to(device)

      # 1. Perform forward pass
      predictions = model(inputs)

      # 2. Calculate the loss
      loss = loss_function(predictions, targets)
      total_loss += loss.item()

      # 3. Zero the gradients for the optimizer
      optimizer.zero_grad()

      # 4. Backpropagate the loss
      loss.backward()

      # 5. Update the model parameters based on gradients
      optimizer.step()

      # 6. Calculate and accumulate accuracy (argmax over the logits; softmax would not change the result)
      predicted_classes = torch.argmax(predictions, dim=1)
      total_accuracy += (predicted_classes == targets).sum().item() / len(predicted_classes)

      # Update progress bar with current metrics
      progress_bar.set_postfix(
            {
                "loss": total_loss / (batch_idx + 1),
                "accuracy": total_accuracy / (batch_idx + 1),
            }
        )

  # Compute the average loss and accuracy for the epoch
  avg_loss = total_loss / len(data_loader)
  avg_accuracy = total_accuracy / len(data_loader)

  return avg_loss, avg_accuracy

def test_epoch(epoch: int,
               model: torch.nn.Module,
               data_loader: torch.utils.data.DataLoader,
               loss_function: torch.nn.Module,
               device: torch.device,
               disable_progress_bar: bool = False) -> Tuple[float, float]:
  """Executes a single epoch of testing for the model.

  This function sets the model to evaluation mode and processes the test data in
  batches to compute the loss and accuracy without updating model parameters.

  Args:
    epoch: Index of the current epoch, shown in the progress bar.
    model: The model to be evaluated.
    data_loader: A DataLoader instance that supplies batches of test data.
    loss_function: The loss function used to compute the error on test data.
    device: The device on which computations are performed (e.g., "cpu" or "cuda").
    disable_progress_bar: A flag to disable the progress bar.

  Returns:
    A tuple containing the average test loss and accuracy for the epoch.
  """
  # Set the model to evaluation mode
  model.eval()

  # Initialize variables to accumulate loss and accuracy metrics
  total_loss, total_accuracy = 0, 0

  # Initialize a progress bar for the testing loop
  progress_bar = tqdm(
      enumerate(data_loader),
      desc=f"Testing Epoch {epoch}",
      total=len(data_loader),
      disable=disable_progress_bar
  )

  # Disable gradient calculations for testing
  with torch.no_grad():
      for batch_idx, (inputs, targets) in progress_bar:
          # Transfer inputs and targets to the target device
          inputs, targets = inputs.to(device), targets.to(device)

          # 1. Perform forward pass
          logits = model(inputs)

          # 2. Calculate the loss
          loss = loss_function(logits, targets)
          total_loss += loss.item()

          # 3. Calculate and accumulate accuracy
          predicted_labels = logits.argmax(dim=1)
          total_accuracy += ((predicted_labels == targets).sum().item() / len(predicted_labels))

          # Update progress bar with current metrics
          progress_bar.set_postfix(
              {
                  "loss": total_loss / (batch_idx + 1),
                  "accuracy": total_accuracy / (batch_idx + 1),
              }
          )

  # Compute the average loss and accuracy for the epoch
  avg_loss = total_loss / len(data_loader)
  avg_accuracy = total_accuracy / len(data_loader)

  return avg_loss, avg_accuracy

def train_and_evaluate(model: torch.nn.Module,
                       train_loader: torch.utils.data.DataLoader,
                       test_loader: torch.utils.data.DataLoader,
                       optimizer: torch.optim.Optimizer,
                       loss_function: torch.nn.Module,
                       num_epochs: int,
                       device: torch.device,
                       disable_progress_bar: bool = False) -> Dict[str, List]:
  """Trains and evaluates a model over multiple epochs.

  This function alternates between training and testing the model for each epoch,
  and stores various performance metrics such as loss and accuracy for both training
  and testing. These metrics are accumulated and printed for each epoch.

  Args:
    model: The model to be trained and evaluated.
    train_loader: A DataLoader instance that provides training data.
    test_loader: A DataLoader instance that provides test data.
    optimizer: The optimizer used to adjust model parameters.
    loss_function: The loss function used to compute loss during training and testing.
    num_epochs: The number of epochs to train the model.
    device: The device used for computation (e.g., "cpu" or "cuda").
    disable_progress_bar: A flag to disable progress bar display.

  Returns:
    A dictionary containing lists of loss and accuracy for both training and
    testing for each epoch. The dictionary also includes time spent on each epoch.
  """
  # Initialize a dictionary to store the results for each epoch
  results = {
      "train_loss": [],
      "train_accuracy": [],
      "test_loss": [],
      "test_accuracy": [],
      "train_time": [],
      "test_time": []
  }

  # Loop through the epochs
  for epoch in tqdm(range(num_epochs), disable=disable_progress_bar):
      # Perform the training step and record the time taken
      start_train_time = time.time()
      train_loss, train_accuracy = train_epoch(epoch=epoch,
                                               model=model,
                                               data_loader=train_loader,
                                               loss_function=loss_function,
                                               optimizer=optimizer,
                                               device=device,
                                               disable_progress_bar=disable_progress_bar)
      end_train_time = time.time()
      train_time = end_train_time - start_train_time

      # Perform the testing step and record the time taken
      start_test_time = time.time()
      test_loss, test_accuracy = test_epoch(epoch=epoch,
                                             model=model,
                                             data_loader=test_loader,
                                             loss_function=loss_function,
                                             device=device,
                                             disable_progress_bar=disable_progress_bar)
      end_test_time = time.time()
      test_time = end_test_time - start_test_time

      # Print the results for the current epoch
      print(
          f"Epoch {epoch + 1}/{num_epochs} | "
          f"Train Loss: {train_loss:.4f} | "
          f"Train Accuracy: {train_accuracy:.4f} | "
          f"Test Loss: {test_loss:.4f} | "
          f"Test Accuracy: {test_accuracy:.4f} | "
          f"Train Time: {train_time:.4f}s | "
          f"Test Time: {test_time:.4f}s"
      )

      # Store the metrics for the current epoch
      results["train_loss"].append(train_loss)
      results["train_accuracy"].append(train_accuracy)
      results["test_loss"].append(test_loss)
      results["test_accuracy"].append(test_accuracy)
      results["train_time"].append(train_time)
      results["test_time"].append(test_time)

  # Return the accumulated results
  return results

# Experiment 1: Without using torch.compile

In this experiment, we measure the training speed of our custom ResNet-50 model without torch.compile. torch.compile is a PyTorch feature that captures the model's operations as a graph and passes it to a compiler backend, which can fuse operations and generate faster kernels. Training without it establishes a baseline for training speed.

This baseline is later compared against the same setup with torch.compile enabled, to see whether compilation significantly reduces training time. We focus on training duration rather than accuracy, so that the comparison reflects the raw speed of the model.

In [None]:
# Set the number of epochs as a constant
NUM_EPOCHS = 4

# Set the learning rate as a constant (this can be changed to get better results but for now we're just focused on time)
LEARNING_RATE = 0.003

# Create model and move it to the target device
our_model, transform_ = load_model()
our_model.to(device)

# Create loss function and optimizer
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(our_model.parameters(),
                             lr=LEARNING_RATE)

# Train model and track results
single_run_no_compile_results = train_and_evaluate(model=our_model,
                                      train_loader=train_dataloader,
                                      test_loader=test_dataloader,
                                      loss_function=loss_fn,
                                      optimizer=optimizer,
                                      num_epochs=NUM_EPOCHS,
                                      device=device)





Epoch 1/4 | Train Loss: 4.6687 | Train Accuracy: 0.1115 | Test Loss: 3.2232 | Test Accuracy: 0.1147 | Train Time: 185.5349s | Test Time: 35.3446s


Epoch 2/4 | Train Loss: 2.6961 | Train Accuracy: 0.1144 | Test Loss: 2.5095 | Test Accuracy: 0.1210 | Train Time: 183.6579s | Test Time: 36.3600s


Epoch 3/4 | Train Loss: 2.4033 | Train Accuracy: 0.1135 | Test Loss: 2.5593 | Test Accuracy: 0.1102 | Train Time: 183.6039s | Test Time: 36.0656s


Epoch 4/4 | Train Loss: 2.3467 | Train Accuracy: 0.1137 | Test Loss: 2.3847 | Test Accuracy: 0.1144 | Train Time: 184.4749s | Test Time: 35.1311s


# Experiment 2: Using torch.compile

In this part of the experiment, we evaluate training performance with PyTorch’s torch.compile feature enabled. torch.compile compiles the model through a backend that can optimize its execution, which may shorten training time. As before, the goal is to compare training speed with and without torch.compile rather than accuracy, and to determine whether the compiled model trains significantly faster.
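One practical detail when timing a compiled model: torch.compile compiles lazily, so any one-time cost is expected to show up in the first call of the model rather than in the `torch.compile(...)` call itself. The following is a minimal sketch of separating the first call from later ones. `time_calls` and `LazySetup` are hypothetical names introduced for illustration, and `LazySetup` only simulates a one-time setup cost with `time.sleep` so that the example runs without a GPU.

```python
import time


def time_calls(fn, n_calls, sync=None):
    """Return wall-clock durations (seconds) of n_calls consecutive calls to fn.

    sync is an optional zero-argument callable run before each clock read,
    so that pending asynchronous work is included in the measurement.
    """
    durations = []
    for _ in range(n_calls):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    return durations


class LazySetup:
    """Hypothetical stand-in for a lazily compiled model: the first call pays a one-time cost."""

    def __init__(self, setup_seconds, step_seconds):
        self.setup_seconds = setup_seconds
        self.step_seconds = step_seconds
        self.ready = False

    def __call__(self):
        if not self.ready:
            time.sleep(self.setup_seconds)  # simulated one-time setup cost
            self.ready = True
        time.sleep(self.step_seconds)  # simulated per-call work


durations = time_calls(LazySetup(setup_seconds=0.2, step_seconds=0.005), n_calls=4)
first, later = durations[0], durations[1:]
print(f"first call: {first:.3f}s | later calls (mean): {sum(later) / len(later):.3f}s")
```

For real GPU timings, one would pass the model step as `fn` and `torch.cuda.synchronize` as `sync`, since CUDA kernels run asynchronously and the clock could otherwise be read before the work has finished.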

In [None]:
import time

# Set the number of epochs as a constant
NUM_EPOCHS = 4

# Set the learning rate as a constant (this can be changed to get better results but for now we're just focused on time)
LEARNING_RATE = 0.003

# Create model and move it to the target device
our_model_2, transform_ = load_model()
our_model_2.to(device)

# Create loss function and optimizer
loss_fn_2 = torch.nn.CrossEntropyLoss()
optimizer_2 = torch.optim.Adam(our_model_2.parameters(),
                               lr=LEARNING_RATE)

# Wrap the model with torch.compile. The call returns almost immediately because
# compilation is lazy: the actual compilation happens on the first forward pass,
# so the time printed below covers only the wrapping step, not the compilation.
start = time.time()
compile_model = torch.compile(our_model_2)
compile_model.to(device)
end = time.time()

compile_time = end - start
print(f"Our compile time is {compile_time}")

# Train the compiled model and track results
# (note: despite its name, this variable holds the results of the compiled run)
single_run_no_compile_results_2 = train_and_evaluate(model=compile_model,
                                      train_loader=train_dataloader,
                                      test_loader=test_dataloader,
                                      loss_function=loss_fn_2,
                                      optimizer=optimizer_2,
                                      num_epochs=NUM_EPOCHS,
                                      device=device)


Our compile time is 0.007426738739013672


Epoch 1/4 | Train Loss: 4.6626 | Train Accuracy: 0.1084 | Test Loss: 3.1933 | Test Accuracy: 0.1084 | Train Time: 221.2563s | Test Time: 51.7011s


Epoch 2/4 | Train Loss: 2.6940 | Train Accuracy: 0.1113 | Test Loss: 2.5396 | Test Accuracy: 0.1050 | Train Time: 173.5156s | Test Time: 32.8960s


Epoch 3/4 | Train Loss: 2.4038 | Train Accuracy: 0.1124 | Test Loss: 2.4299 | Test Accuracy: 0.1118 | Train Time: 173.7537s | Test Time: 33.5924s


Epoch 4/4 | Train Loss: 2.3456 | Train Accuracy: 0.1134 | Test Loss: 2.3772 | Test Accuracy: 0.1190 | Train Time: 174.5483s | Test Time: 37.4367s


# Our Results and Discussion

In [None]:
single_run_no_compile_results,single_run_no_compile_results_2

({'train_loss': [4.668694774208166,
   2.696133101382829,
   2.4032718001119315,
   2.3467337167476447],
  'train_accuracy': [0.11147298593350384,
   0.11441016624040921,
   0.11347106777493605,
   0.11365089514066497],
  'test_loss': [3.2232171395781695,
   2.509476212179585,
   2.55926466899313,
   2.38466168209246],
  'test_accuracy': [0.11474920382165606,
   0.12101910828025478,
   0.11017117834394904,
   0.11435111464968153],
  'train_time': [185.53492140769958,
   183.6578814983368,
   183.60390734672546,
   184.4749050140381],
  'test_time': [35.344587564468384,
   36.36003136634827,
   36.06556224822998,
   35.13107228279114]},
 {'train_loss': [4.662568461864501,
   2.693975373607157,
   2.403793570635569,
   2.345617717489257],
  'train_accuracy': [0.1083959398976982,
   0.11127317774936062,
   0.11241208439897699,
   0.11337116368286446],
  'test_loss': [3.1932707865526724,
   2.5395904544052805,
   2.429929541934068,
   2.3772401293371894],
  'test_accuracy': [0.108379777070

We can display our results as DataFrames:

In [None]:

import pandas as pd
no_compile_results_df = pd.DataFrame(single_run_no_compile_results)
compile_results_df = pd.DataFrame(single_run_no_compile_results_2)

In [None]:
no_compile_results_df

| | train_loss | train_accuracy | test_loss | test_accuracy | train_time | test_time |
|---|---|---|---|---|---|---|
| 0 | 4.668695 | 0.111473 | 3.223217 | 0.114749 | 185.534921 | 35.344588 |
| 1 | 2.696133 | 0.11441 | 2.509476 | 0.121019 | 183.657881 | 36.360031 |
| 2 | 2.403272 | 0.113471 | 2.559265 | 0.110171 | 183.603907 | 36.065562 |
| 3 | 2.346734 | 0.113651 | 2.384662 | 0.114351 | 184.474905 | 35.131072 |


In [None]:
compile_results_df

| | train_loss | train_accuracy | test_loss | test_accuracy | train_time | test_time |
|---|---|---|---|---|---|---|
| 0 | 4.662568 | 0.108396 | 3.193271 | 0.10838 | 221.256316 | 51.701094 |
| 1 | 2.693975 | 0.111273 | 2.53959 | 0.104996 | 173.51556 | 32.896035 |
| 2 | 2.403794 | 0.112412 | 2.42993 | 0.111764 | 173.753655 | 33.592386 |
| 3 | 2.345618 | 0.113371 | 2.37724 | 0.119029 | 174.548302 | 37.436663 |


### Result

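Training Time and Testing Time

The tables above summarize the per-epoch training and testing times of the two models: one without torch.compile and one with it. Epochs are numbered from 0, as in the tables.

Observations:

Training Time:

The compiled model was slower only in the first epoch and faster in every epoch after it. In Epoch 0, the compiled model took approximately 221.26 seconds, while the non-compiled model required only 185.53 seconds, a gap of about 36 seconds. This overhead comes from compilation, which torch.compile defers until the model is first called; that is why the "compile time" of roughly 0.007 seconds printed earlier does not capture it.

In Epochs 1 to 3, the compiled model needed about 173.5 to 174.5 seconds per epoch, compared with 183.6 to 184.5 seconds for the non-compiled model, which is roughly 10 seconds (about 5%) less per epoch. Summed over all four epochs, the compiled run was still slightly slower: about 743.07 seconds against 737.27 seconds.

Testing Time:

The compiled model showed mixed testing times. It was much slower in Epoch 0 (51.70 seconds compared to 35.34 seconds), most likely because the first evaluation pass triggers additional compilation. It was faster in Epoch 1 (32.90 seconds compared to 36.36 seconds) and Epoch 2 (33.59 seconds compared to 36.07 seconds), but slower again in Epoch 3 (37.44 seconds compared to 35.13 seconds). With a single run of four epochs, differences of this size may partly reflect run-to-run noise.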
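The per-epoch training times in the tables above can be condensed into totals and steady-state averages. The following is a minimal sketch in plain Python, with the training times copied from the tables and rounded to two decimals. `summarize` is a hypothetical helper, and treating every epoch after the first as "steady state" is an assumption made for illustration.

```python
# Per-epoch training times in seconds, copied from the two results tables
# and rounded to two decimal places.
no_compile_train = [185.53, 183.66, 183.60, 184.47]
compile_train = [221.26, 173.52, 173.75, 174.55]


def summarize(times):
    """Return the total time, the first-epoch time, and the mean of the remaining epochs."""
    rest = times[1:]
    return {
        "total": sum(times),
        "first": times[0],
        "steady_mean": sum(rest) / len(rest),
    }


baseline = summarize(no_compile_train)
compiled = summarize(compile_train)

# Relative per-epoch saving of the compiled model once the first epoch is excluded.
steady_saving = 1 - compiled["steady_mean"] / baseline["steady_mean"]

print(f"total train time:  {baseline['total']:.2f}s (no compile) vs {compiled['total']:.2f}s (compile)")
print(f"first epoch:       {baseline['first']:.2f}s vs {compiled['first']:.2f}s")
print(f"later epochs mean: {baseline['steady_mean']:.2f}s vs {compiled['steady_mean']:.2f}s")
print(f"steady-state saving: {steady_saving:.1%}")
```

Separating the first epoch from the rest keeps the one-time compilation overhead from masking the per-epoch difference between the two runs.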

### Discussion

Short Training Runs:

In short training runs, the benefit of compilation is limited because the one-time compilation cost is spread over only a few epochs. In our four-epoch experiment, the faster later epochs did not quite recover the first-epoch overhead, so the non-compiled model finished training marginally sooner in total (about 737 seconds against 743 seconds).

Potential for Longer Training Runs:

A longer training run would likely favor the compiled model, because the compilation overhead is paid once while the per-epoch saving recurs in every subsequent epoch. If the saving of roughly 10 seconds per epoch observed here were to persist, the compiled model would overtake the non-compiled one in total training time at around the fifth epoch. Whether testing time would improve as well is less clear from these results.

Overall, the compiled model has the potential to provide meaningful benefits in prolonged training sessions or computationally intensive workloads. For short runs such as this one, however, the non-compiled model may still hold an edge in total training time.
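The trade-off between one-time overhead and per-epoch saving can be made concrete with a simple break-even estimate. The sketch below assumes a deliberately simple cost model that is not part of the original experiment: every non-compiled epoch costs the same, the compiled model pays its overhead entirely in the first epoch, and later compiled epochs all cost the same. `break_even_epochs` is a hypothetical helper, and the inputs are the rounded training times from the tables above.

```python
import math


def break_even_epochs(baseline_epoch, compiled_first, compiled_steady):
    """Smallest epoch count n for which the compiled run is faster in total.

    Assumed cost model:
        baseline total(n) = n * baseline_epoch
        compiled total(n) = compiled_first + (n - 1) * compiled_steady
    Returns None if the compiled model never catches up under this model.
    """
    saving_per_epoch = baseline_epoch - compiled_steady
    if saving_per_epoch <= 0:
        return None
    one_time_cost = compiled_first - compiled_steady
    # The compiled run wins once n * saving_per_epoch exceeds one_time_cost.
    return max(1, math.floor(one_time_cost / saving_per_epoch) + 1)


# Rounded per-epoch training times (seconds) from the results tables.
no_compile_train = [185.53, 183.66, 183.60, 184.47]
compile_train = [221.26, 173.52, 173.75, 174.55]

baseline_epoch = sum(no_compile_train) / len(no_compile_train)
compiled_steady = sum(compile_train[1:]) / len(compile_train[1:])

n = break_even_epochs(baseline_epoch, compile_train[0], compiled_steady)
print(f"Estimated break-even after {n} epochs")
```

Under these assumptions the estimate is consistent with the measured totals, where the compiled run was still slightly behind after four epochs. It is an extrapolation from a single run and not a measured result.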