<a href="https://colab.research.google.com/github/furk4neg3/Aerial-Object-Detection/blob/main/Aerial_Object_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aerial Object Detection

🐈 In this project, I made AI models for aerial object detection task. One is
CNN model created by me from scratch, and the other is fine tuned Fast RCNN model.

🐈 SkyFusion aerial object detection data is used, you can reach it inside the notebook.

🐈 This was the first PyTorch project I designed myself (I worked on AI projects before but they were on Tensorflow, I'm trying to learn PyTorch too right now), other than the ones done in courses, and object detection is a completely new topic for me, so I'm open to criticisms and improvements.

🐈 Result in Fast RCNN model is pretty good, which has loss of 0.8746. In my researchs, I found that loss between 0 to 1 is considered good, so I can say that this model was successful.

🐈 That's the torchvision version I used creating this notebook, to ensure that it will run well on your system too, there's installation.

In [None]:
!pip install torchvision==0.15.1

## Taking the Data

🐈 As mentioned above, data is from Kaggle.

In [2]:
! pip install -q kaggle

In [None]:
from google.colab import files
files.upload()

In [4]:
! mkdir ~/.kaggle

In [5]:
! cp kaggle.json ~/.kaggle/

In [6]:
! chmod 600 ~/.kaggle/kaggle.json

In [7]:
! kaggle datasets download -d kailaspsudheer/tiny-object-detection

Dataset URL: https://www.kaggle.com/datasets/kailaspsudheer/tiny-object-detection
License(s): apache-2.0
Downloading tiny-object-detection.zip to /content
 99% 177M/179M [00:01<00:00, 193MB/s]
100% 179M/179M [00:01<00:00, 150MB/s]


In [None]:
! unzip tiny-object-detection.zip

## Data Preparation


🐈 Features of images were in coco format, so install pycocotools.

In [None]:
!pip install pycocotools

🐈 Install necessary libraries

In [10]:
import pandas as pd
from pycocotools.coco import COCO
import json

import matplotlib.pyplot as plt
import seaborn as sns

### Receiving Features Data

In [11]:
SkyFusion_ann = open(r"SkyFusion/train/_annotations.coco.json")
SkyFusion_COCO = json.load(SkyFusion_ann)
SkyFusion_train = pd.DataFrame(SkyFusion_COCO['annotations'])

SkyFusion_ann = open(r"SkyFusion/test/_annotations.coco.json")
SkyFusion_COCO = json.load(SkyFusion_ann)
SkyFusion_test = pd.DataFrame(SkyFusion_COCO['annotations'])

SkyFusion_ann = open(r"SkyFusion/valid/_annotations.coco.json")
SkyFusion_COCO = json.load(SkyFusion_ann)
SkyFusion_val = pd.DataFrame(SkyFusion_COCO['annotations'])

### A Short Description of Features Data

🫒 I will only talk about the columns that will be used.

🐈 First, image_id column shows which image the line is linked to (because there can be different number of boxes in images).

🐈 Then, category_id is the label column. There are 3 labels that which car, ship and plane. Those are represented as 1, 2 and 3.

🐈 Finally, there's bbox column. That's used for drawing border boxes around objects.

In [12]:
SkyFusion_train.head()

Unnamed: 0,id,image_id,category_id,bbox,area,segmentation,iscrowd
0,0,0,3,"[259, 49, 4.8, 9.6]",46.08,"[[264, 48.8, 259.2, 48.8, 259.2, 58.4, 264, 58...",0
1,1,0,3,"[284, 630, 4.8, 8.8]",42.24,"[[288.8, 630.4, 284, 630.4, 284, 639.2, 288.8,...",0
2,2,0,3,"[281, 568, 4, 8.8]",35.2,"[[284.8, 568, 280.8, 568, 280.8, 576.8, 284.8,...",0
3,3,0,3,"[288, 570, 4.8, 10.4]",49.92,"[[292.8, 569.6, 288, 569.6, 288, 580, 292.8, 5...",0
4,4,0,3,"[303, 553, 4.8, 9.6]",46.08,"[[308, 552.8, 303.2, 552.8, 303.2, 562.4, 308,...",0


🐈 Drop unnecessary columns.

In [13]:
SkyFusion_train = SkyFusion_train.drop(["area", "id", "iscrowd", "segmentation"], axis=1)
SkyFusion_test = SkyFusion_test.drop(["area", "id", "iscrowd", "segmentation"], axis=1)
SkyFusion_val = SkyFusion_val.drop(["area", "id", "iscrowd", "segmentation"], axis=1)

In [14]:
SkyFusion_train.head()

Unnamed: 0,image_id,category_id,bbox
0,0,3,"[259, 49, 4.8, 9.6]"
1,0,3,"[284, 630, 4.8, 8.8]"
2,0,3,"[281, 568, 4, 8.8]"
3,0,3,"[288, 570, 4.8, 10.4]"
4,0,3,"[303, 553, 4.8, 9.6]"


🐈 Check labels.

In [15]:
value_counts_A = SkyFusion_train['category_id'].value_counts()
print("Counts of unique values of labels:")
print(value_counts_A)

Counts of unique values of labels:
category_id
3    33396
1     8696
2     1483
Name: count, dtype: int64


## Receiving Images Data

🐈 Assigning images to variables for later use.

In [16]:
import os
from PIL import Image
import numpy as np

# Base directory
base_dir = "SkyFusion"

# Directories for train, test, and valid
train_dir = os.path.join(base_dir, "train")
test_dir = os.path.join(base_dir, "test")
valid_dir = os.path.join(base_dir, "valid")

# Function to load images from a given directory
def load_images(directory):
    images = []
    valid_extensions = ('.jpg', '.jpeg', '.png', '.bmp', '.tiff')

    # Traverse the directory and load image files
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.lower().endswith(valid_extensions):
                img_path = os.path.join(root, file)
                img = Image.open(img_path).convert("RGB")  # Load and convert to RGB
                images.append(img)
    return images

# Load images from train, test, and valid directories
train_images = load_images(train_dir)
test_images = load_images(test_dir)
valid_images = load_images(valid_dir)

# Convert to NumPy arrays for model input
train_images_np = [np.array(img) for img in train_images]
test_images_np = [np.array(img) for img in test_images]
valid_images_np = [np.array(img) for img in valid_images]

# Print to see distribution between train, test and validation
print(f"Loaded {len(train_images_np)} train images, {len(test_images_np)} test images, and {len(valid_images_np)} valid images.")

Loaded 2094 train images, 449 test images, and 449 valid images.


## Matching Features with Images

🐈 As mentioned before, the number of boxes may vary depending on the images. So we should match features with images using image_id column. The function below does exactly that.

In [17]:
def match_bboxes_with_images(df):
  boxes_dict = {}

  for _, row in df.iterrows():
      image_id = row['image_id']
      bbox = row['bbox']
      # Turn bboxes into [x_min, y_min, x_max, y_max] format
      x_min, y_min, width, height = bbox
      x_max = x_min + width
      y_max = y_min + height

      if image_id not in boxes_dict:
          boxes_dict[image_id] = []
      boxes_dict[image_id].append([x_min, y_min, x_max, y_max, row['category_id']])  # [x_min, y_min, x_max, y_max, class]

  images = []
  boxes = []

  for image_id in df['image_id'].unique():
      if image_id < len(train_images_np):
          images.append(train_images_np[image_id])
          boxes.append(boxes_dict.get(image_id, []))

  return images, boxes

In [18]:
# Create train, test and validation boxes
train_images, train_boxes = match_bboxes_with_images(SkyFusion_train)
test_images, test_boxes = match_bboxes_with_images(SkyFusion_test)
val_images, val_boxes = match_bboxes_with_images(SkyFusion_val)

In [19]:
# Check if image and box counts match
len(train_boxes), len(train_images), len(test_boxes), len(test_images), len(val_boxes), len(val_images)

(2094, 2094, 449, 449, 449, 449)

🐈 Because there may be a different number of boxes in an image, the length of the elements in the boxes will also vary.

In [20]:
# Check the structure of boxes
test_boxes[0]

[[411, 566, 429.4, 576.4, 2],
 [199, 546, 211, 554, 2],
 [205, 532, 217.8, 537.6, 2],
 [330, 554, 341.2, 570, 2],
 [242, 566, 252.4, 572.4, 2],
 [202, 540, 213.2, 544.8, 2],
 [198, 557, 206, 561.8, 2],
 [320, 554, 333.6, 571.6, 2],
 [313, 552, 325, 568.8, 2],
 [240, 534, 249.6, 540.4, 2],
 [204, 537, 214.4, 540.2, 2],
 [226, 570, 237.2, 579.6, 2],
 [198, 553, 206, 557.8, 2]]

In [21]:
val_boxes[0]

[[256, 131, 289.5, 162, 1],
 [261, 93, 299, 121, 1],
 [275, 53, 314, 83.5, 1],
 [300, 16, 338.5, 45, 1],
 [185, 5, 217.5, 34.5, 1],
 [165, 46, 197, 88, 1],
 [123, 38, 152.5, 76.5, 1],
 [87, 22, 117, 57.5, 1],
 [607, 188, 638, 220, 1],
 [529, 150, 569.5, 178, 1],
 [511, 98, 559, 138.5, 1],
 [444, 60, 475, 101.5, 1],
 [411, 95, 446.5, 132.5, 1],
 [520, 293, 550, 330, 1],
 [483, 285, 506.5, 321.5, 1],
 [373, 281, 403.5, 312, 1],
 [363, 380, 400, 419.5, 1],
 [450, 562, 489.5, 604, 1],
 [370, 511, 391.5, 543.5, 1],
 [325, 478, 355, 520, 1],
 [280, 453, 311.5, 499.5, 1]]

## Preparing Images to Model

🐈 Creating transformation for images. Because the pictures are taken from far away, objects look small. So I chose a big size of (640, 640) to be able to detect small looking objects.

In [None]:
import torchvision.transforms as transforms

image_transforms = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((640, 640)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Code below normalizes border box coordinates. This can improve the performance but
# that was not the case for me, so I commented this out.
"""def transform_bounding_boxes(boxes, size):
    # Adjust bounding boxes if resizing the image
    # size is (width, height) of the original image, and the new size is fixed.
    # This assumes you're resizing to (640, 640) for example.
    new_width, new_height = 640, 640
    width_ratio = new_width / size[0]
    height_ratio = new_height / size[1]

    transformed_boxes = []
    for box in boxes:
        x_min, y_min, x_max, y_max, category_id = box
        x_min *= width_ratio
        x_max *= width_ratio
        y_min *= height_ratio
        y_max *= height_ratio
        transformed_boxes.append([x_min, y_min, x_max, y_max, category_id])

    return transformed_boxes"""

## Creating Datasets

In [47]:
from torch.utils.data import Dataset
class CustomObjectDetectionDataset(Dataset):
    def __init__(self, images, boxes, transform=None):
        self.images = images
        self.boxes = boxes
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        boxes = self.boxes[idx]

        # Store the original size for bounding box transformation
        original_size = (image.shape[1], image.shape[0])  # (width, height)

        if self.transform:
            image = self.transform(image)  # Apply image transformations

        return image, torch.tensor(boxes, dtype=torch.float32)  # Return the transformed image and boxes

# Create the dataset instance with transformations
train_dataset = CustomObjectDetectionDataset(train_images, train_boxes, transform=image_transforms)
test_dataset = CustomObjectDetectionDataset(test_images, test_boxes, transform=image_transforms)
val_dataset = CustomObjectDetectionDataset(val_images, val_boxes, transform=image_transforms)

🐈 Check if dataset is shaped correctly.

In [48]:
import torch
train_dataset[0][0].shape

torch.Size([3, 640, 640])

In [49]:
train_dataset[0][1].shape

torch.Size([7, 5])

## Creating Dataloader

🐈 Zipping features and images inside dataloader. Batch size can change.

In [50]:
import torch
from torch.utils.data import DataLoader

# Set your batch size
batch_size = 64  # Adjust based on your memory constraints

# Create the DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=lambda x: tuple(zip(*x)))
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=lambda x: tuple(zip(*x)))
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=lambda x: tuple(zip(*x)))

In [51]:
len(train_dataloader)

33

🐈 Checking shapes again.

In [52]:
train_images_batch, train_labels_batch = next(iter(train_dataloader))
train_images_batch[0].shape, train_labels_batch[0].shape

(torch.Size([3, 640, 640]), torch.Size([14, 5]))

## First Model (Custom CNN Model)

🫒 A simple CNN model created from scratch by me. Normally, I know that fine tuning existing model is a better choice, but I wanted to create a model from scratch too.

🐈 Backbone model is created from convolutional layers. After that, there's 2 heads:

🐈 First head is box_head. That outputs a box that contains the object that have been detected by the model.

🐈 Other head is class_head. This one outputs class of the object. Because this one makes classification, I decided to use linear layers too, and it gives better resutls than pure CNN.

In [53]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CustomObjectDetector(nn.Module):
    def __init__(self, num_classes):
        super(CustomObjectDetector, self).__init__()
        # Feature Extractor (simple CNN)
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # Output: (16, 320, 320)
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # Output: (32, 160, 160)
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # Output: (64, 80, 80)
        )

        # Bounding Box Regressor
        self.box_head = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 4, kernel_size=1),  # 4 coordinates for each box
        )

        # Classification Head with Dense Layers
        self.class_head = nn.Sequential(
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),  # Added additional conv layer
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),  # Reduce each feature map to a single value
            nn.Flatten(),  # Flatten the output for the dense layer
            nn.Linear(256, 128),  # First dense layer
            nn.ReLU(),
            nn.Linear(128, num_classes)  # Final dense layer outputting class scores
        )

    def forward(self, x):
        # Check input shape
        if x.ndim != 4 or x.shape[1] != 3:
            raise ValueError(f"Expected input of shape (batch_size, 3, H, W), got {x.shape}")

        # Pass input through the feature extractor
        features = self.feature_extractor(x)

        # Predict bounding boxes and class scores
        boxes = self.box_head(features)  # Shape: (batch_size, 4, height, width)
        class_scores = self.class_head(features)  # Shape: (batch_size, num_classes, height, width)

        return boxes, class_scores

### Loss Function

🐈 Had to use a custom loss function here because model gives 2 different outputs. For boxes, I use smooth L1 loss and for classes, cross entropy loss.

In [54]:
def compute_loss(pred_boxes, true_boxes, pred_classes, true_classes):
    # Box regression loss (smooth L1)
    box_loss = F.smooth_l1_loss(pred_boxes, true_boxes)

    # Classification loss (Cross Entropy)
    class_loss = F.cross_entropy(pred_classes, true_classes)

    total_loss = box_loss + class_loss
    return total_loss

### Training the First Model

🐈 This block of code was a real challenge for me, that's why it's that long. Had to do so many things to make it work. Tried to explain it well with comment lines, I recommend you to review the codes.

In [56]:
from torch.optim.lr_scheduler import StepLR
import time

def compute_loss(pred_boxes, true_boxes, pred_classes, true_classes):
    # Flatten the predicted boxes
    pred_boxes = pred_boxes.view(-1, 4)  # Shape: [batch_size * height * width, 4] 4 here is len([x_min, y_min, x_max, y_max])

    # Check shape of true_boxes and handle padding if necessary
    if true_boxes.shape[0] == 0:
        raise ValueError("No true boxes available for loss computation.")

    # Get the number of boxes in the batch
    num_boxes = true_boxes.size(0)

    max_boxes = 100  # Maximum number of boxes we expect in any image
    padded_true_boxes = torch.zeros((max_boxes, 4), device=pred_boxes.device)  # create a zero tensor for padding

    # Fill with true boxes (assuming it's not exceeding max_boxes)
    padded_true_boxes[:num_boxes, :] = true_boxes[:max_boxes, :]

    # Compute Smooth L1 loss for bounding boxes
    box_loss = F.smooth_l1_loss(pred_boxes[:num_boxes], padded_true_boxes[:num_boxes])

    # For classification, ensure pred_classes is properly shaped
    pred_classes = pred_classes.view(-1, pred_classes.shape[1])  # Flatten to [batch_size * height * width, num_classes]

    # Use true_classes in a similar way, making sure they match the correct number of predictions
    true_classes = true_classes.view(-1)  # Flatten to [total_boxes]

    # Check for out-of-bounds classes
    if true_classes.max() >= pred_classes.shape[1]:
        raise ValueError(f"True class index {true_classes.max()} is out of bounds for the number of classes {pred_classes.shape[1]}.")

    # Compute Cross-Entropy loss for classes
    class_loss = F.cross_entropy(pred_classes[:num_boxes], true_classes[:num_boxes])

    # Combine the losses
    total_loss = box_loss + class_loss
    return total_loss


model = CustomObjectDetector(num_classes=3)  # We have 3 classes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) # On my experiments I decided that best performing lr is this.
scheduler = StepLR(optimizer, step_size=1, gamma=0.8) # Because starting lr is a big value, I want to reduce it while model trains
num_epochs = 10

# Training loop
for epoch in range(num_epochs):

    start_time = time.time() # I want to see how much does it take to train the model

    for data in train_dataloader:
        images = torch.stack([img for img in data[0]])  # Stack images into a single tensor
        targets = data[1]

        # Predict bounding boxes and class scores
        pred_boxes, pred_classes = model(images)

        # Prepare true_boxes and true_classes for each image in the batch
        true_boxes = []
        true_classes = []

        for i in range(images.size(0)):  # Loop through each image in the batch
            # Extract the box and class for each image
            box = targets[i][0][:4]  # Extracting the first box (coordinates)
            cls = targets[i][0][4]   # Extracting the class label

            true_boxes.append(box)
            true_classes.append(int(cls))  # Ensure cls is an integer and append to the list

        # Convert lists to tensors
        true_boxes = torch.stack(true_boxes).to(images.device)  # Ensure the boxes are on the same device

        # Convert true_classes to tensor and ensure it's a long tensor
        true_classes = torch.tensor(true_classes, dtype=torch.long, device=images.device)  # Ensure class labels are on the same device

        true_classes -= 1  # Classes are in format 1, 2, 3 but we want them to be 0 indexed, so substract 1

        # Compute loss
        loss = compute_loss(pred_boxes, true_boxes, pred_classes, true_classes)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    scheduler.step()
    end_time = time.time()  # Record the end time of the epoch
    epoch_duration = end_time - start_time  # Calculate the duration
    epoch_duration = epoch_duration / 60

    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}, LR: {scheduler.get_last_lr()[0]:.6f}, Duration: {epoch_duration:.2f} minutes")

Epoch [1/10], Loss: 181.4098, LR: 0.008000, Duration: 3.49 minutes
Epoch [2/10], Loss: 178.6821, LR: 0.006400, Duration: 3.54 minutes
Epoch [3/10], Loss: 172.1516, LR: 0.005120, Duration: 3.52 minutes
Epoch [4/10], Loss: 158.2260, LR: 0.004096, Duration: 3.50 minutes
Epoch [5/10], Loss: 167.4919, LR: 0.003277, Duration: 3.53 minutes
Epoch [6/10], Loss: 180.1522, LR: 0.002621, Duration: 3.54 minutes
Epoch [7/10], Loss: 186.1154, LR: 0.002097, Duration: 3.55 minutes
Epoch [8/10], Loss: 173.8505, LR: 0.001678, Duration: 3.55 minutes
Epoch [9/10], Loss: 181.7435, LR: 0.001342, Duration: 3.57 minutes
Epoch [10/10], Loss: 178.8379, LR: 0.001074, Duration: 3.58 minutes


## Creating Dataset and Dataloader for Fine Tuned Fast RCNN Model

🐈 Preparing datasets and dataloaders for fast rcnn model.

In [62]:
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

class CustomObjectDetectionDataset(Dataset):
    def __init__(self, images, boxes):
        self.images = images  # NumPy arrays of shape (N, H, W, C)
        self.boxes = boxes  # List of arrays with shape (num_boxes, 5)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        img = img / 255.0  # Normalize to [0, 1]
        img = img.transpose((2, 0, 1))  # Change to (C, H, W)

        # Load bounding boxes and classes
        boxes = np.array(self.boxes[idx])  # Ensure boxes is a NumPy array
        target = {
            "boxes": torch.tensor(boxes[:, :4], dtype=torch.float32),  # x1, y1, x2, y2
            "labels": torch.tensor(boxes[:, 4] - 1, dtype=torch.int64)  # class labels
        }

        return torch.tensor(img, dtype=torch.float32), target


In [63]:
from torch.utils.data import DataLoader

finetune_train_dataset = CustomObjectDetectionDataset(train_images, train_boxes)
finetune_val_dataset = CustomObjectDetectionDataset(val_images, val_boxes)
finetune_test_dataset = CustomObjectDetectionDataset(test_images, test_boxes)

train_loader = DataLoader(finetune_train_dataset, batch_size=batch_size, shuffle=True, collate_fn=lambda x: tuple(zip(*x)))
val_loader = DataLoader(finetune_val_dataset, batch_size=batch_size, shuffle=False, collate_fn=lambda x: tuple(zip(*x)))
test_loader = DataLoader(finetune_test_dataset, batch_size=batch_size, shuffle=False, collate_fn=lambda x: tuple(zip(*x)))

## Creating Fast RCNN Model

In [64]:
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import torchvision.models as models
import torchvision.transforms as T

num_classes = 3

# Load a pre-trained Faster R-CNN model
model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Get the number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features

# Replace the pre-trained head with a new one (with num_classes)
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)



In [65]:
model

FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(

## Fine Tuning Fast RCNN Model

🐈 Normally, code below doesn't (shouldn't) have any problem. When I run it, it runs without problems for ~10 minutes, then I lose access to device on cloud. I searched it, and it says it's because I'm using too much processing power. I have Colab Pro and I'm using TPU with 300+ GB RAM, I don't understand how it's able to run this model normally but not the fine tuned version.

In [None]:
"""import torch
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Number of classes (including background)
num_classes = 4  # For example, we have 3 classes + background

# Load the pre-trained Faster R-CNN model
model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Get the number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features

# Define a new custom classifier head with additional layers
class CustomPredictor(torch.nn.Module):
    def __init__(self, in_channels, num_classes):
        super(CustomPredictor, self).__init__()
        # Define additional layers
        self.fc1 = torch.nn.Linear(in_channels, 512)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(512, 256)
        self.fc3 = torch.nn.Linear(256, num_classes)  # Final classification layer

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Replace the classifier head with the custom predictor
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

class CustomRCNNHead(torch.nn.Module):
    def __init__(self, base_predictor, num_classes):
        super(CustomRCNNHead, self).__init__()
        self.base_predictor = base_predictor
        # Add more custom layers here
        self.additional_fc = torch.nn.Sequential(
            torch.nn.Linear(num_classes, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.base_predictor(x)

        # x is a tuple of (class_logits, box_regression)
        class_logits = x[0]

        # Apply additional layers only to class logits
        class_logits = self.additional_fc(class_logits)

        # Return the modified logits and the original box regression
        return class_logits, x[1]

# Replace the existing predictor with this custom model
model.roi_heads.box_predictor = CustomRCNNHead(model.roi_heads.box_predictor, num_classes)"""

## Training Fast RCNN Model

🐈 1 epoch takes more than an hour, so I couldn't train it for more than 1 epoch. Ideal condition is when you train it for more epochs.

🐈 As you can see, that model gives a way WAY better result. As in my researchs I found that loss between 0 to 1 is considered good, so our result is nice even in first epoch. On top of that, because objects are small in our images, it's even harder to achieve an acceptable loss value, so this result is really good.

In [68]:
import time
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
num_epochs = 1  # Set this based on your needs

for epoch in range(num_epochs):
    start_time = time.time()
    model.train()
    for images, targets in train_loader:
        images = [image.to(device) for image in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        optimizer.zero_grad()
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        losses.backward()
        optimizer.step()

    end_time = time.time()  # Record the end time of the epoch
    epoch_duration = end_time - start_time  # Calculate the duration
    epoch_duration = epoch_duration / 60
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {losses.item():.4f},  Duration: {epoch_duration:.2f} minutes")

Epoch [1/1], Loss: 0.8746,  Duration: 77.64 minutes
