![image.png](https://i.imgur.com/a3uAqnb.png)

# **🚀 Object Detection with Faster R-CNN**
In this lab, we will:

✅ Load the **Pascal VOC dataset** with torchvision's built-in loader and a **custom `collate_fn`**  
✅ Use a **pretrained Faster R-CNN model** for object detection  
✅ Train and evaluate the model  

---

## **1️⃣ Dataset Class**
We use the **Pascal VOC 2007 dataset**, which contains images with **bounding boxes and labels**.  
torchvision provides a built-in dataset loader: `torchvision.datasets.VOCDetection`.

### **🔹 Load & Transform Dataset**

In [None]:
import torchvision
from torchvision.datasets import VOCDetection
from torch.utils.data import DataLoader
import torch

# Define dataset path and transformations
data_path = "./VOC_data/"
transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])

# Load Pascal VOC dataset (train & test)
train_dataset = VOCDetection(root=data_path, year='2007', image_set='train', download=True, transform=transform)
test_dataset = VOCDetection(root=data_path, year='2007', image_set='test', download=True, transform=transform)

# Custom collate function to handle a variable number of bounding boxes per image
def custom_collate_fn(data):  # Instead of stacking, regroup the samples into a tuple of images and a tuple of targets
    return tuple(zip(*data))

# Create DataLoaders (tip: lower the batch size if you hit an out-of-memory (OOM) error during training)
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=2, collate_fn=custom_collate_fn)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=2, collate_fn=custom_collate_fn)


Downloading http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar to ./VOC_data/VOCtrainval_06-Nov-2007.tar


100%|██████████| 460M/460M [00:21<00:00, 21.6MB/s]


Extracting ./VOC_data/VOCtrainval_06-Nov-2007.tar to ./VOC_data/
Downloading http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar to ./VOC_data/VOCtest_06-Nov-2007.tar


100%|██████████| 451M/451M [00:23<00:00, 19.6MB/s]


Extracting ./VOC_data/VOCtest_06-Nov-2007.tar to ./VOC_data/


In [None]:
# Map label names (as found in the dataset annotations) to the integer class IDs we feed to the model.
# IDs start at 1 because torchvision detection models reserve label 0 for the background class.
voc_classes = {
    "aeroplane": 1,
    "bicycle": 2,
    "bird": 3,
    "boat": 4,
    "bottle": 5,
    "bus": 6,
    "car": 7,
    "cat": 8,
    "chair": 9,
    "cow": 10,
    "diningtable": 11,
    "dog": 12,
    "horse": 13,
    "motorbike": 14,
    "person": 15,
    "pottedplant": 16,
    "sheep": 17,
    "sofa": 18,
    "train": 19,
    "tvmonitor": 20,
}

# Reverse mapping (class ID -> label name). The model predicts IDs, so we need this to show label names when visualizing.
reverse_voc_classes = {v: k for k, v in voc_classes.items()}


### **🔹 Why Do We Need a Custom `collate_fn`?**
Unlike classification datasets, where each image has a **fixed shape and label**, object detection images have **variable numbers of bounding boxes**.

- The default `collate_fn` (which applies `torch.stack`) **doesn't work**, since the annotations of different images hold different numbers of objects.  
- Instead, we **return a tuple** that **keeps individual image-label pairs separate**.

##### Before using custom collate_fn:
```python
data = [
    (image1, dict1),
    (image2, dict2),
    ...
]
```
##### After:
```python
images_tuple = (image1, image2, ...)
targets_tuple = (dict1, dict2, ...)
```

## **2️⃣ Model Class**
We use a **pretrained Faster R-CNN with a MobileNetV3-Large backbone**.

### **🔹 Modify the Model**
- The default model is trained on the **COCO dataset** and has **91 output classes**.
- We replace the box predictor so that it outputs the **20 Pascal VOC classes plus one background class** (21 outputs in total).


In [None]:
import torchvision

# Load the pretrained Faster R-CNN model (a ready-made detection model)
# Note: torchvision >= 0.13 prefers weights="DEFAULT" over the deprecated pretrained=True
model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)

# Change the number of output classes to match Pascal VOC by replacing the last layer.
# torchvision counts the background as a class, so we need 20 object classes + 1 background.
num_classes = len(voc_classes) + 1  # 21
in_features = model.roi_heads.box_predictor.cls_score.in_features  # Input feature size of the current predictor, needed to build its replacement

# Replace final layer with new predictor
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes)


# Freeze the pretrained model and fine-tune only the new box predictor, the task-specific head.
# (You can fine-tune the whole model, but it takes more time and resources.)
# The frozen layers act as a fixed feature extractor, so their weights need no updates.
model.requires_grad_(False)  # Freeze every parameter (backbone, RPN, and box head)
model.roi_heads.box_predictor.requires_grad_(True)  # The new predictor changes the output, so it needs gradients


# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

model  # Display the model to inspect its layers

Downloading: "https://download.pytorch.org/models/fasterrcnn_mobilenet_v3_large_fpn-fb6a3cc7.pth" to /root/.cache/torch/hub/checkpoints/fasterrcnn_mobilenet_v3_large_fpn-fb6a3cc7.pth
100%|██████████| 74.2M/74.2M [00:00<00:00, 99.8MB/s]


FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (0): Conv2dNormActivation(
        (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
        (1): FrozenBatchNorm2d(16, eps=1e-05)
        (2): Hardswish()
      )
      (1): InvertedResidual(
        (block): Sequential(
          (0): Conv2dNormActivation(
            (0): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=16, bias=False)
            (1): FrozenBatchNorm2d(16, eps=1e-05)
            (2): ReLU(inplace=True)
          )
          (1): Conv2dNormActivation(
            (0): Conv2d(16, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (1): FrozenBatchNorm2d(16, eps=1e-05)
          )
        )
      )
      (2): InvertedResidual(
        (block):

## **3️⃣ Training and Validation Loops**
### **🔹 Training Loop**
- In training mode, the model takes **images and targets** (the ground-truth bounding boxes and labels) and computes the **losses internally**, so we don't need to define a loss function.
- We only need to **backpropagate** the summed loss and **step the optimizer**.
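
As a rough illustration of the target conversion performed inside the training loop, the sketch below turns a VOC-style annotation dictionary into plain lists of boxes and label IDs. The sample annotation, the helper name `annotation_to_boxes_and_labels`, and the two-entry `toy_classes` mapping are made up for this example; the notebook itself looks labels up in `voc_classes` and wraps the results in tensors.

```python
# Hypothetical VOC-style annotation, shaped like the dicts VOCDetection returns.
# Coordinates are strings because they are read from XML.
sample_target = {
    "annotation": {
        "object": [
            {"name": "dog", "bndbox": {"xmin": "48", "ymin": "240", "xmax": "195", "ymax": "371"}},
            {"name": "person", "bndbox": {"xmin": "8", "ymin": "12", "xmax": "352", "ymax": "498"}},
        ]
    }
}

# Toy mapping for this sketch only (not the notebook's voc_classes).
toy_classes = {"dog": 1, "person": 2}


def annotation_to_boxes_and_labels(target, class_ids):
    """Return ([xmin, ymin, xmax, ymax] boxes, integer labels) for one image."""
    boxes, labels = [], []
    for obj in target["annotation"]["object"]:
        bndbox = obj["bndbox"]
        boxes.append([int(bndbox[k]) for k in ("xmin", "ymin", "xmax", "ymax")])
        labels.append(class_ids[obj["name"]])
    return boxes, labels


boxes, labels = annotation_to_boxes_and_labels(sample_target, toy_classes)
print(boxes)   # one [xmin, ymin, xmax, ymax] list per object
print(labels)  # one integer ID per object
```

In the training loop these lists become a float tensor of shape `[N, 4]` under `target['boxes']` and an `int64` tensor of shape `[N]` under `target['labels']`.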


In [None]:
from tqdm import tqdm

def train_one_epoch(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0

    for images, targets in tqdm(dataloader):
        images = [img.to(device) for img in images]  # The model expects a list of image tensors, one per image in the batch

        # Convert each VOC annotation into the 'boxes' and 'labels' tensors the model expects
        for target in targets:  # Loop over the images in the batch
            boxes = []
            labels = []
            for obj in target['annotation']['object']:  # Loop over the annotated objects in this image
                label = obj['name']  # Class name of this object
                box = obj['bndbox']  # Bounding box of this object
                xmin, ymin, xmax, ymax = [int(box[k]) for k in ['xmin', 'ymin', 'xmax', 'ymax']]  # Coordinates are stored as strings, so cast them to int
                boxes.append(torch.Tensor([xmin, ymin, xmax, ymax]).to(device))
                labels.append(voc_classes[label])  # Look up the integer ID in voc_classes (defined above)

            target['boxes'] = torch.stack(boxes)
            target['labels'] = torch.Tensor(labels).type(torch.int64).to(device)

        # Compute losses
        loss_dict = model(images, targets)  # In training mode the model returns a dict of losses (classification, box regression, ...)
        losses = sum(loss for loss in loss_dict.values())  # Sum all losses

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        total_loss += losses.item()  # .item() converts the scalar loss tensor to a plain Python number

    avg_loss = total_loss / len(dataloader)
    return avg_loss

### **🔹 Validation Loop**
#### **🔹 How Do We Evaluate Object Detection Models?**
To evaluate object detection models like **Faster R-CNN**, we need to measure **how well the predicted bounding boxes match the ground truth boxes**.

![image.png](https://i.imgur.com/MDFxFMX.png)

---

#### **📌 Intersection over Union (IoU)**
✅ We consider a detection **correct** if the predicted box **overlaps significantly** with the ground truth box.  
✅ This is measured using **Intersection over Union (IoU)**, which calculates the **ratio of overlap** between the two boxes.

$$
IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
$$

🚀 **Higher IoU = Better Detection!**  
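
To make the formula concrete, here is a minimal pure-Python sketch of IoU for two boxes in `(xmin, ymin, xmax, ymax)` format. The function name `iou_xyxy` and the two example boxes are assumptions chosen for illustration; for batches of boxes, `torchvision.ops.box_iou` offers a tensor-based version.

```python
def iou_xyxy(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at 0 so disjoint boxes give zero overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 5, 15, 15)
# Overlap area is 25 and union area is 175
print(f"IoU = {iou_xyxy(ground_truth, prediction):.3f}")
```

Here the boxes share only a quarter of their area, so the IoU falls well below the 0.5 threshold and the detection would not count as correct.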


![image.png](https://i.imgur.com/yNNhjwr.png)

---

#### **📌 What is mAP@0.5:0.95?**
mAP (**mean Average Precision**) is the most commonly used **metric for object detection**.

🔹 **mAP@0.5:0.95** means we compute the **average precision** at **different IoU thresholds** from **0.5 to 0.95**, increasing in steps of **0.05**.

- **IoU ≥ 0.5** → Loose match  
- **IoU ≥ 0.75** → Stricter match  
- **IoU ≥ 0.95** → Extremely strict match  

**mAP@0.5:0.95** takes the **average of all these values**, giving us a single number that represents how well the model performs **across different difficulty levels**.
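
The threshold averaging can be sketched in a few lines of plain Python. This is only an illustration of the averaging step: the AP values below are hypothetical, the helper names are made up, and the real metric also computes each AP from a precision-recall curve and averages over classes.

```python
# The ten IoU thresholds used by mAP@0.5:0.95 (0.50, 0.55, ..., 0.95)
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou_value, thresholds):
    """Thresholds at which a detection with this IoU still counts as a match."""
    return [t for t in thresholds if iou_value >= t]


def mean_over_thresholds(ap_per_threshold):
    """Average the per-threshold AP values into a single score."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


# A box with IoU 0.72 is a match at the looser thresholds only
print(thresholds_passed(0.72, iou_thresholds))

# Hypothetical AP values, one per threshold (AP usually drops as the threshold rises)
example_aps = [0.80, 0.78, 0.75, 0.70, 0.64, 0.55, 0.44, 0.30, 0.15, 0.03]
print(f"mAP@0.5:0.95 = {mean_over_thresholds(example_aps):.3f}")
```

A model that localizes loosely can score well at 0.5 yet poorly on the averaged metric, which is why mAP@0.5:0.95 is the stricter number to report.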


In [None]:
from IPython.display import clear_output
# torchmetrics' mAP metric needs a COCO evaluation backend such as pycocotools
!pip install torchmetrics pycocotools
clear_output()  # Hide the pip install logs

In [None]:
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Initialize metric
metric = MeanAveragePrecision(iou_thresholds=[0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95])  # AP is computed at each IoU threshold, then averaged

def validate(model, dataloader, device):
    """Evaluates the model using mAP@0.5:0.95."""
    model.eval()
    metric.reset()  # Clear the state from the previous epoch before accumulating new predictions

    with torch.no_grad():
        for images, targets in tqdm(dataloader):
            images = [img.to(device) for img in images]
            preds = model(images)

            # Convert predictions to the format the metric expects
            processed_preds = []
            for pred in preds:
                processed_preds.append({
                    "boxes": pred["boxes"].cpu(),
                    "scores": pred["scores"].cpu(),
                    "labels": pred["labels"].cpu()
                })

            # Convert ground truth targets
            processed_targets = []
            for target in targets:
                gt_boxes = []
                gt_labels = []
                for obj in target['annotation']['object']:
                    label = obj['name']
                    box = obj['bndbox']
                    xmin, ymin, xmax, ymax = [int(box[k]) for k in ['xmin', 'ymin', 'xmax', 'ymax']]
                    gt_boxes.append([xmin, ymin, xmax, ymax])
                    gt_labels.append(voc_classes[label])

                processed_targets.append({
                    "boxes": torch.tensor(gt_boxes, dtype=torch.float32),  # Created on the CPU, matching the predictions above
                    "labels": torch.tensor(gt_labels, dtype=torch.int64)
                })

            # Update metric
            metric.update(processed_preds, processed_targets)

    return metric.compute()  # Compute final mAP scores

## **4️⃣ Running Training & Validation**

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001)
num_epochs = 10  # Set number of epochs

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer, device)  # The model returns its own losses, so we use them directly
    mAP_results = validate(model, test_loader, device)

    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}")
    print(f"mAP@0.5:0.95 for Test: {mAP_results['map']:.4f}")  # Higher is better: predictions are closer to the ground truth in both localization and classification

100%|██████████| 2501/2501 [49:43<00:00,  1.19s/it]
 10%|▉         | 482/4952 [10:18<1:34:45,  1.27s/it]

## **5️⃣ Visualizing Predictions vs. Ground Truth**


In [None]:
import random
import torchvision.transforms.functional as F
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Helper function to overlay Ground Truth & Predicted Boxes
def visualize_gt_pred(img, gt_boxes, gt_annotations, pred_boxes, pred_annotations, title=""):
    fig, ax = plt.subplots(1, figsize=(8, 8))
    ax.imshow(img)

    # Plot Ground Truth in RED
    for bbox, annotation in zip(gt_boxes, gt_annotations):
        x_min, y_min, x_max, y_max = bbox
        width = x_max - x_min
        height = y_max - y_min

        rect = patches.Rectangle((x_min, y_min), width, height, linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        plt.text(x_min, y_min - 5, annotation, color='r', fontsize=8, bbox=dict(facecolor='white', alpha=0.7))

    # Plot Predictions in GREEN
    for bbox, annotation in zip(pred_boxes, pred_annotations):
        x_min, y_min, x_max, y_max = bbox
        width = x_max - x_min
        height = y_max - y_min

        rect = patches.Rectangle((x_min, y_min), width, height, linewidth=2, edgecolor='g', facecolor='none')
        ax.add_patch(rect)
        plt.text(x_min, y_min - 5, annotation, color='g', fontsize=8, bbox=dict(facecolor='white', alpha=0.7))

    plt.axis('off')
    plt.title(title)
    plt.show()

In [None]:
# Select 3 random test images (indices must be smaller than len(test_dataset))
test_indices = random.sample(range(len(test_dataset)), 3)

# Set model to evaluation mode
model.eval()

# Create a figure with 1×3 subplots (one per image)
fig, axes = plt.subplots(1, 3, figsize=(30, 10))
axes = axes.ravel()  # Flatten axes for easy iteration (already one row here, but it matters for grid layouts)

for i, idx in enumerate(test_indices):  # Iterate over the 3 selected indices
    test_img, test_target = test_dataset[idx]  # Fetch the image tensor and its annotation

    # Extract Ground Truth Boxes & Labels
    gt_boxes = []  # Ground-truth boxes (the correct values from the dataset)
    gt_annotations = []  # Ground-truth label names
    # One image can contain several objects, so loop over all of them
    for obj in test_target['annotation']['object']:
        box = obj['bndbox']
        xmin, ymin, xmax, ymax = [int(box[k]) for k in ['xmin', 'ymin', 'xmax', 'ymax']]
        gt_boxes.append([xmin, ymin, xmax, ymax])
        gt_annotations.append(obj['name'])

    # Run Model on Test Image
    with torch.no_grad():
        pred = model([test_img.to(device)])

    pred = pred[0]  # The model returns one result per image; we passed a single image

    # Extract Predictions
    pred_boxes = pred['boxes'].cpu()
    pred_annotations = pred['labels'].cpu()
    pred_scores = pred['scores'].cpu()  # Confidence scores, aligned with the boxes and labels

    # Apply Confidence Threshold (only keep predictions with score ≥ 0.8)
    valid_mask = pred_scores >= 0.8  # Boolean mask: True where the score passes the threshold
    pred_annotations = pred_annotations[valid_mask]  # Keep only the entries where the mask is True
    pred_boxes = pred_boxes[valid_mask]

    # Convert Predicted Labels from Numeric to Class Names
    pred_annotations = [reverse_voc_classes[val.item()] for val in pred_annotations]  # IDs are numbers, so map them back to strings for plotting

    # Overlay GT & Predictions on Image (the image is a tensor, so convert it back to a PIL image first)
    img = F.to_pil_image(test_img)
    ax = axes[i]
    ax.imshow(img)

    # Plot Ground Truth in RED (boxes and labels)
    for bbox, annotation in zip(gt_boxes, gt_annotations):  # zip iterates over boxes and labels together
        x_min, y_min, x_max, y_max = bbox
        width = x_max - x_min
        height = y_max - y_min

        # matplotlib's Rectangle patch draws the box; linewidth is the line thickness, facecolor is the fill color
        rect = patches.Rectangle((x_min, y_min), width, height, linewidth=2, edgecolor='r', facecolor='none')
        ax.add_patch(rect)
        ax.text(x_min, y_min - 5, annotation, color='r', fontsize=24, bbox=dict(facecolor='white', alpha=0.7))  # Print the label just above the top-left corner

    # Plot Predictions in GREEN
    for bbox, annotation in zip(pred_boxes, pred_annotations):
        x_min, y_min, x_max, y_max = bbox
        width = x_max - x_min
        height = y_max - y_min

        rect = patches.Rectangle((x_min, y_min), width, height, linewidth=2, edgecolor='g', facecolor='none')
        ax.add_patch(rect)
        ax.text(x_min, y_min - 5, annotation, color='g', fontsize=24, bbox=dict(facecolor='white', alpha=0.7))

    ax.axis('off')
    ax.set_title(f"Image {idx}")

plt.tight_layout()
plt.show()

### Contributed by: Mohamed Eltayeb
