<a href="https://colab.research.google.com/github/bishara74/GPU-Training-Tutorial/blob/main/GPU_vs_CPU_Benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: How to Accelerate PyTorch Model Training with a GPU

In this notebook, we will train a simple Convolutional Neural Network (CNN) to classify sneaker images.

We will first train it on the **CPU** and time it. Then, we will train the *exact same model* on a **GPU** to see the performance difference.

This project demonstrates the basics of:
* Loading image data in PyTorch
* Building a simple CNN
* Writing a training loop
* **GPU Resource Handling** to accelerate training

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
import time
import os
import zipfile

zip_file_name = 'sneakers.zip'
data_dir = 'sneakers_data'

if not os.path.exists(data_dir):
    print(f"'{data_dir}' folder not found. Extracting {zip_file_name}...")
    with zipfile.ZipFile(zip_file_name, 'r') as zip_ref:
        zip_ref.extractall(data_dir)
    print(f"Data extracted to '{data_dir}'")
else:
    print(f"Data directory '{data_dir}' already exists.")



data_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])


image_dataset_dir = os.path.join(data_dir, 'sneakers_dataset')

image_dataset = datasets.ImageFolder(image_dataset_dir, transform=data_transform)
dataloader = DataLoader(image_dataset, batch_size=32, shuffle=True)

class_names = image_dataset.classes
num_classes = len(class_names)

print(f"\nSuccessfully loaded {len(image_dataset)} images.")
print(f"Found {num_classes} classes: {class_names}")

'sneakers_data' folder not found. Extracting sneakers.zip...
Data extracted to 'sneakers_data'

Successfully loaded 2207 images.
Found 4 classes: ['Nike Air Force 1', 'Nike Air Jordan 1 High', 'Nike Air Max 1', 'Nike Dunk Low']


## Part 1: Training on the CPU ⏱️

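Now we'll build our model. We'll start from **ResNet-18**, a widely used CNN pre-trained on ImageNet, and replace its final layer so that it predicts our sneaker classes. It is the smallest of the standard ResNet variants, which keeps training time manageable even on a CPU.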

To create our benchmark, we will **force** PyTorch to only use the CPU:
1.  We define our device as `"cpu"`.
2.  We send the `model` to the `device`.
3.  In our training loop, we send every *batch* of data to the `device`.

We will use `time.time()` to measure the wall-clock time of the whole training run.
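
The timing pattern can also be sketched in isolation. The helper below is a hypothetical illustration and not part of this notebook's code: the names `wall_clock` and `timings` are made up for the example, and it uses `time.perf_counter()` (a monotonic clock meant for measuring intervals) instead of `time.time()`. A plain Python loop stands in for the training loop.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_clock(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with wall_clock("demo", timings):
    # Stand-in workload; a real run would put the training loop here.
    total = sum(i * i for i in range(100_000))

print(f"demo took {timings['demo']:.4f} seconds")
```

Wrapping the measurement in a context manager keeps the start and stop calls paired, so the elapsed time is still recorded if the enclosed block raises an exception.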

In [2]:

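# Load ResNet-18 with its ImageNet weights.
# Note: `pretrained=True` is deprecated since torchvision 0.13 in favor of `weights=`.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Replace the final fully connected layer so it outputs one score per sneaker class
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, num_classes)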


device = torch.device("cpu")
model = model.to(device)
print(f"--- Starting training on {device} ---")


criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)


start_time = time.time()


num_epochs = 3

for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    print("-" * 10)


    model.train()

    for inputs, labels in dataloader:
        inputs = inputs.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        loss.backward()

        optimizer.step()

end_time = time.time()
cpu_time = end_time - start_time

print("\n--- Training Finished ---")
print(f"Total CPU Training Time: {cpu_time:.2f} seconds")



Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth


100%|██████████| 44.7M/44.7M [00:00<00:00, 128MB/s]


--- Starting training on cpu ---
Epoch 1/3
----------
Epoch 2/3
----------
Epoch 3/3
----------

--- Training Finished ---
Total CPU Training Time: 1425.24 seconds


## Part 2: Training on the GPU
Now, we will do the *exact same training*, but we'll tell PyTorch to use the GPU. The key steps are:
1.  Check if a **"cuda" (NVIDIA GPU)** device is available.
2.  Move the `model` to the `device`.
3.  Inside the training loop, move our `inputs` and `labels` to the `device` on every batch.

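These three steps are the core of **GPU resource handling** in PyTorch: the model's parameters and the data it processes must live on the same device.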
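
As a minimal sketch of the three steps above, the example below uses a tiny `nn.Linear` layer and random tensors as stand-ins for ResNet-18 and the sneaker batches. All `toy_*` names are hypothetical and exist only for this illustration. It falls back to the CPU when no CUDA device is visible.

```python
import torch
import torch.nn as nn

# Step 1: pick the GPU when PyTorch can see one, otherwise stay on the CPU.
toy_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Step 2: move the model's parameters to that device.
# Hypothetical stand-in model: 16 input features, 4 output classes.
toy_model = nn.Linear(16, 4).to(toy_device)

# A batch created on the CPU, as a DataLoader would yield it.
toy_inputs = torch.randn(8, 16)
toy_labels = torch.randint(0, 4, (8,))

# Step 3: move the batch to the same device before the forward pass.
toy_inputs = toy_inputs.to(toy_device)
toy_labels = toy_labels.to(toy_device)

toy_outputs = toy_model(toy_inputs)
toy_loss = nn.CrossEntropyLoss()(toy_outputs, toy_labels)

# Confirm that parameters, inputs, and outputs all share one device.
print(next(toy_model.parameters()).device, toy_inputs.device, toy_outputs.device)
```

If the model were on the GPU and the batch stayed on the CPU, the forward pass would fail with a device mismatch error, which is why the batch is moved on every iteration.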

In [3]:
# Set up the GPU device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"--- GPU is available! Starting training on {device} ---")
else:
    # This is a fallback in case the GPU is not connected
    device = torch.device("cpu")
    print(f"--- GPU not found, falling back to {device} ---")


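# Rebuild the model from the same ImageNet weights so both runs start from an identical state.
# Note: `pretrained=True` is deprecated since torchvision 0.13 in favor of `weights=`.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)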
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, num_classes)

model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

start_time = time.time()

num_epochs = 3

for epoch in range(num_epochs):
    print(f"Epoch {epoch+1}/{num_epochs}")
    print("-" * 10)

    model.train()

    for inputs, labels in dataloader:
        inputs = inputs.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

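# CUDA kernels run asynchronously, so wait for all queued GPU work before stopping the timer
if device.type == "cuda":
    torch.cuda.synchronize()
end_time = time.time()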
gpu_time = end_time - start_time

print("\n--- Training Finished ---")
print(f"Total GPU Training Time: {gpu_time:.2f} seconds")

--- GPU is available! Starting training on cuda ---
Epoch 1/3
----------
Epoch 2/3
----------
Epoch 3/3
----------

--- Training Finished ---
Total GPU Training Time: 107.13 seconds


## Conclusion: The Power of GPU Acceleration

Let's compare our final results:

* **Total CPU Training Time:** 1425.24 seconds (~23.8 minutes)
* **Total GPU Training Time:** 107.13 seconds (~1.8 minutes)

**Result:** By moving the model and every batch to the GPU, we cut training time from 1425.24 seconds to 107.13 seconds, a **13.3x speedup**.
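
The speedup is the ratio of the two measured times. As a quick check, the snippet below recomputes it from the timings reported above; the variable names `cpu_seconds` and `gpu_seconds` are chosen for this illustration only.

```python
# Timings copied from the two training runs reported in this notebook.
cpu_seconds = 1425.24
gpu_seconds = 107.13

# Speedup = time of the slow run divided by time of the fast run.
speedup = cpu_seconds / gpu_seconds
print(f"Speedup: {speedup:.1f}x")  # prints "Speedup: 13.3x"
```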

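The same three-step pattern (select a device, move the model, move each batch) applies to any PyTorch training loop, so this notebook can serve as a **shared resource** and **tutorial** for developers getting started with GPU-accelerated deep learning.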