In [None]:
#!pip install transformers evaluate

In [None]:
#!wget https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/helper_functions.py
from helper_functions import get_image_from_url, save_images, show, xml_to_csv

## 11.4 Lab 5 / Case 5: Fine-Tuning Object Detection Models

In this lab, you'll build a dataset, including data augmentation, and fine-tune a custom object detection model by replacing its standard backbone with a different computer vision model. In the end, you'll evaluate the model using metrics from the COCO challenge.

### 11.4.1 Oxford-IIIT Pet Dataset

You'll build a dataset using the images and annotations from the [Oxford-IIIT Pet dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/):

"_We have created a 37 category pet dataset with roughly 200 images for each class. The images have a large variations in scale, pose and lighting. All images have an associated ground truth annotation of breed, head ROI, and pixel level trimap segmentation._"

You will load the data using [PyTorch's built-in class](https://pytorch.org/vision/stable/generated/torchvision.datasets.OxfordIIITPet.html), but you're tasked with preprocessing the annotations and building a dataset that is compatible with V2 transforms for data augmentation (without wrapping the built-in dataset, that is).

First, load the data into a folder of your choice (e.g. `./pets`), making sure to retrieve the `trainval` split (which has annotations), and choose both target types, `category` and `segmentation`, since you'll be fine-tuning a model to detect pets in images.

In [None]:
from torchvision.datasets import OxfordIIITPet

root_folder = './pets'
# write the arguments to create an instance of the dataset
pets = OxfordIIITPet(...)

### 11.4.2 Annotations

The annotations follow the Pascal VOC challenge format, and are stored as individual XML files, one for each annotated image, inside the `oxford-iiit-pet/annotations/xmls` subfolder. Use the `xml_to_csv()` helper function to convert all these files into a Pandas dataframe and inspect its contents.

In [None]:
xml_df = ...
xml_df

The annotations contain the box coordinates in the Pascal VOC system (`[xmin, ymin, xmax, ymax]`), but they only have two main classes, cats and dogs, instead of the expected 37 classes found in the description. As it turns out, there are more files in the `annotations` folder, namely, `list.txt`, `trainval.txt`, and `test.txt`:

In [None]:
# if you chose a different root folder, change it accordingly
!ls -l ./pets/oxford-iiit-pet/annotations

Let's take a look at the `list.txt` file:

In [None]:
!head ./pets/oxford-iiit-pet/annotations/list.txt

It contains a list of all images in the dataset, organized in four columns separated by spaces: Image, CLASS-ID, SPECIES, BREED ID. As it turns out, the "class" from the XML file is actually the species. We're interested in the true class ids, from 1 to 37, as stated in the description.

Now, let's take a look at the file corresponding to the data you loaded, the `trainval` split:

In [None]:
!head ./pets/oxford-iiit-pet/annotations/trainval.txt

It clearly follows the same structure as the previous file, but it does not contain any headers, and it lists only the images that belong to the original train and validation split.

We can load it in Pandas for easier visualization:

In [None]:
import pandas as pd

trainval_df = pd.read_csv('./pets/oxford-iiit-pet/annotations/trainval.txt', sep=' ', header=None, names=['filename', 'class_id', 'species', 'breed_id'])
trainval_df

Each filename has its own class index (`class_id`), but the label itself, that is, the descriptive name of the category, is only available as part of the filename. We can easily extract it, though:

In [None]:
trainval_df['category'] = trainval_df['filename'].apply(lambda v: ' '.join([w.capitalize()
                                                                            for w in v.split('_')[:-1]]))
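
To see what the lambda above does to a single value, here is a minimal standalone sketch. The file stems are made up in the dataset's naming style, and `filename_to_category()` is a hypothetical name used only for illustration:

```python
def filename_to_category(stem):
    # Drop the trailing image number, then capitalize each remaining word
    words = stem.split('_')[:-1]
    return ' '.join(w.capitalize() for w in words)

# Hypothetical file stems (breed name followed by an image number)
examples = ['Abyssinian_100', 'american_pit_bull_terrier_12', 'yorkshire_terrier_7']
categories = [filename_to_category(name) for name in examples]
print(categories)
```

Everything before the last underscore is treated as the breed name, so multi-word breeds are kept whole.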

Moreover, there are 3,680 rows, one for each image, but there are 3,687 annotations retrieved from the XML files. Why? It is important to highlight that:
- some images may have more than one annotation/box - you saw that already in the Penn-Fudan dataset
- some images probably have no annotations/boxes (you'll see that soon)

You may use the custom dataset class `ObjDetectionDataset` once again, since it is prepared to take a CSV file or Pandas dataframe containing the annotations (filename, labels, xmin, ymin, xmax, and ymax columns), but keep in mind that only the filenames in the file/dataframe are going to be considered by it.

So, if you choose to use the same class, either:
- your file/dataframe also includes the filenames that have no annotations, or
- images without annotations won't be part of the dataset

It is better to keep images without annotations as negative cases, so merge both dataframes and make sure that:
- every filename is kept, so there are still 3,680 unique filenames after merging
- the resulting dataframe has, at least, the following columns: `filename`, `label`, `category`, `xmin`, `ymin`, `xmax`, and `ymax`

Besides, use the resulting dataframe to build an `id2label` dictionary that maps each class id to the corresponding category.

In [None]:
trainval_df['filename'] = trainval_df['filename'].apply(lambda v: f'{v}.jpg')
annotations_df = trainval_df.merge(xml_df, how='left', on='filename')

In [None]:
colnames = ['filename', 'label', 'category', 'width', 'height', 'xmin', 'ymin', 'xmax', 'ymax']
annotations_df = annotations_df.rename(columns={'class_id': 'label'})[colnames]
annotations_df

In [None]:
id2label = dict(annotations_df[['label', 'category']].drop_duplicates().values)
id2label

In [None]:
assert len(annotations_df['filename'].unique()) == 3680
assert len(id2label.values()) == 37
assert len(annotations_df) == 3681

Shouldn't it be 3,687? Perhaps even more, since it should also include images without any annotations? It actually should, but some of the annotated images were excluded from the `trainval.txt` list of files for some unknown reason. In case you're curious, these are the images:

In [None]:
extra_annotations = set(xml_df['filename'].unique()).difference(set(annotations_df['filename'].unique()))
extra_annotations

The whole point of this apparent detour from our main job here - fine-tuning an object detection model - is to illustrate the fact that every dataset has its issues, and you should always take your time to investigate how it's organized, if there are quality issues, and ensure it's in the right shape to be loaded into an instance of your dataset class.

By the way, PyTorch's built-in dataset class for the Oxford-IIIT Pet Dataset handles this preprocessing (splitting filenames, building the `id2label` dictionary, etc.) in its [constructor method](https://pytorch.org/vision/main/_modules/torchvision/datasets/oxford_iiit_pet.html), in case you'd like to check it out.

### 11.4.3 Train-Validation Split

The original list of files does not give any indication regarding the split between training and validation sets, so you'll have to do it yourself.

Our suggestion is to shuffle the filenames, and take a large part of them (e.g. 3,000) as training set, and the remaining files as validation set.

Split the annotations dataframe in two, as the filenames in each dataframe determine which files are going to be part of each dataset (assuming you're using our `ObjDetectionDataset`):

In [None]:
import numpy as np

np.random.seed(11)

# Get all (unique) file names from the annotations dataframe
fnames = ...
np.random.shuffle(fnames)

# Create a boolean pandas series to determine if a given annotation belongs
# to the training set
# Tip: don't forget that images may have multiple annotations - make sure
# two annotations of the same image don't end up in different sets
is_train = ...

annotations = {}
# Use the boolean series to slice the annotations dataframe
annotations['train'] = ...
annotations['val'] = ...
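
The key idea in the tip above is to split by filename, not by row. As a rough illustration only, here is the same idea on a toy list of rows using plain Python instead of Pandas. The rows, the seed, and the split size are all made up for this sketch:

```python
import random

# Toy annotations as (filename, label) rows; 'b.jpg' has two boxes
rows = [('a.jpg', 1), ('b.jpg', 2), ('b.jpg', 2), ('c.jpg', 3), ('d.jpg', 1)]

# Shuffle the unique filenames, not the rows themselves
rng = random.Random(11)
fnames = sorted({fname for fname, _ in rows})
rng.shuffle(fnames)

# The first n_train filenames go to training, the rest to validation
n_train = 3
train_files = set(fnames[:n_train])

# Every row follows its filename, so boxes of one image stay together
train_rows = [row for row in rows if row[0] in train_files]
val_rows = [row for row in rows if row[0] not in train_files]
```

Because membership is decided per filename, both annotations of `b.jpg` always land in the same set.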

### 11.4.4 Loading Model's Weights

You're using a new backbone for your Faster R-CNN model, so you need to pick one that's different from ResNet50. You could, for example, choose a smaller model from the ResNet family, but it's likely more fun to choose a completely different model instead. We suggest you use MobileNet V2 as the new backbone.

Once you choose the model, load its pretrained weights and the prescribed transformations that come with it. 

In [None]:
from torchvision.models import get_weight

weights = ...
transforms_fn = ...
transforms_fn

This is its `forward()` method (of the MobileNet V2 transform, that is). Take a good look at the sequence of transformations it performs because, as you probably already guessed, this function is not compatible with V2 transforms, so you'll have to include them yourself - if needed - in your data augmentation pipeline (the next section).

```python
def forward(self, img: Tensor) -> Tensor:
    img = F.resize(img, self.resize_size, interpolation=self.interpolation, antialias=self.antialias)
    img = F.center_crop(img, self.crop_size)
    if not isinstance(img, Tensor):
        img = F.pil_to_tensor(img)
    img = F.convert_image_dtype(img, torch.float)
    img = F.normalize(img, mean=self.mean, std=self.std)
    return img
```

### 11.4.5 Data Augmentation

It is time to write your own `get_transform()` function that takes one argument, namely, if it is performing transformations on the training or the validation set:
- if it is the validation set, it should stick to the basics (hint: check the prescribed transformations to assess these points)
  - make sure the image is in the right size/shape for the backbone of your choice
  - convert, if needed, PIL images to tensors
  - normalize the values
- if it is the training set, it may perform data augmentation as well:
  - choose one or more data augmenting transformations
  - sanitize bounding boxes, just in case

Pay special attention to the order in which transformations will happen, to make sure the transformed image at the end of the pipeline does indeed match the requirements of the backbone model.

In [None]:
import torch
from collections import defaultdict
from torchvision.transforms import v2 as transforms

augmenting = [
    # Choose one (or more) augmentation transform(s), such as RandomHorizontalFlip, for example
    # write your code here
    ...
]

basic = [
    # Include required transformations here, such as transforming PIL images into tensors
    # and normalizing pixel values
    # write your code here
    ...
]

def get_transform(train):
    ops = [
        # Include resizing transformations here, to make images the right size for the chosen model
        # write your code here
        ...
    ]
    # Only does augmenting in training mode
    if train:
        ops.extend(augmenting)
    # Basic transforms: to tensor, sanitizing, and normalizing
    ops.extend(basic)
    return transforms.Compose(ops)

### 11.4.6 Datasets and DataLoaders

Create two datasets, one for training, and one for validation, and assign the corresponding transformations to each one of them:

In [None]:
import os
import pandas as pd
import torch
from torchvision.io import read_image, ImageReadMode
from torchvision.datapoints import Image, BoundingBox, BoundingBoxFormat, Mask
from torchvision.ops import masks_to_boxes, box_area
from torchvision.datasets import VisionDataset

class ObjDetectionDataset(VisionDataset):
    def __init__(self, image_folder, annotations=None, mask_folder=None, transforms=None):
        super().__init__(image_folder, transforms, None, None)
        self.image_folder = image_folder
        self.annotations = annotations
        self.mask_folder = mask_folder
        self.transforms = transforms

        self.images = list(sorted(os.listdir(image_folder)))

        self.df_boxes = None
        assert (annotations is not None) or (mask_folder is not None), "At least one, annotations or masks, must be supplied"
        if annotations is not None:
            if isinstance(annotations, str):
                self.df_boxes = pd.read_csv(annotations)
            else:
                self.df_boxes = annotations
            assert len(set(self.df_boxes.columns).intersection({'filename', 'xmin', 'ymin', 'xmax', 'ymax'})) == 5, "Missing columns in CSV"
            self.images = self.df_boxes['filename'].unique().tolist()

        self.masks = None
        if mask_folder is not None:
            self.masks = list(sorted(os.listdir(mask_folder)))
            assert len(self.masks) == len(self.images), "Every image must have one, and only one, mask"

    def __getitem__(self, idx):
        image_filename = os.path.join(self.image_folder, self.images[idx])
        image_tensor = read_image(image_filename, mode=ImageReadMode.RGB)
        image_hw = image_tensor.shape[-2:]

        labels = None
        # If there are masks, we work with them
        if self.masks is not None:
            mask_filename = os.path.join(self.mask_folder, self.masks[idx])
            merged_mask = read_image(mask_filename)
            instances = merged_mask.unique()[1:]

            masks = (merged_mask == instances.view(-1, 1, 1))
            boxes = masks_to_boxes(masks)

            wrapped_masks = Mask(masks)
        # No masks, so we fall back to a DF of annotated boxes
        else:
            annots = self.df_boxes.query(f'filename == "{self.images[idx]}"')
            boxes = torch.as_tensor(annots.dropna()[['xmin', 'ymin', 'xmax', 'ymax']].values)
            if 'label' in annots.columns:
                labels = torch.as_tensor(annots.dropna()['label'].values)
            wrapped_masks = None

        wrapped_boxes = BoundingBox(boxes, format=BoundingBoxFormat.XYXY, spatial_size=image_hw)
        num_objs = len(boxes)

        if len(boxes):
            if labels is None:
                # if there are no labels, we assume every instance is of
                # the same, and only, class
                labels = torch.ones((num_objs,), dtype=torch.int64)
            area = box_area(wrapped_boxes)
        else:
            # Only background, no boxes
            labels = torch.zeros((0,), dtype=torch.int64)
            area = torch.tensor([0.], dtype=torch.float32)

        target = {
            'boxes': wrapped_boxes,
            'area': area,
            'labels': labels,
            'image_id': torch.tensor([idx+1]),
            'iscrowd': torch.zeros((num_objs,), dtype=torch.int64)
        }
        if wrapped_masks is not None:
            target['masks'] = wrapped_masks

        image = Image(image_tensor)

        if self.transforms is not None:
            image, target = self.transforms(image, target)

        return image, target

    def __len__(self):
        return len(self.images)

In [None]:
datasets = {}
datasets['train'] = ...
datasets['val'] = ...

len(datasets['train']), len(datasets['val'])

Next, create two data loaders, one for each dataset. You should shuffle the training set, but not the validation one. Also, keep the batch size small (e.g. two) to avoid out-of-memory issues on the GPU.

In [None]:
from torch.utils.data import DataLoader

dataloaders = {}
dataloaders['train'] = ...
dataloaders['val'] = ...

Try fetching a mini-batch from your training set:

In [None]:
next(iter(dataloaders['train']))

Did you get an error? No? Consider yourself lucky! At some point, it will raise an error, whenever an image with either zero or more than one annotation is included in the mini-batch.

The collate function is the function used by the data loader to patch together multiple data points into a mini-batch. If your dataset is nothing but tensors, that's trivial: it only has to stack them up. Stacking them up, though, assumes every data point has exactly the same shape for its features.

In object detection models, though, this is not guaranteed to be the case: one image may have no boxes, another one may have three boxes, and yet another one may have only one. Those cannot be stacked together.

The solution, fortunately, is pretty easy, and it looks like this:

```python
lambda batch: tuple(zip(*batch))
```

Pass the lambda function above as the `collate_fn` argument of your data loaders, and try again:

In [None]:
dataloaders = {}
dataloaders['train'] = ...
dataloaders['val'] = ...

next(iter(dataloaders['train']))
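
If the `zip(*batch)` trick looks a bit cryptic, this small sketch may help. It uses plain strings and lists as stand-ins for image and box tensors (an assumption made only to keep the example free of dependencies), and shows that the function regroups the pairs without trying to stack anything:

```python
# Each data point is an (image, target) pair; the number of boxes differs
batch = [
    ('image_0', {'boxes': [[10, 20, 50, 80]], 'labels': [3]}),
    ('image_1', {'boxes': [], 'labels': []}),
]

collate_fn = lambda batch: tuple(zip(*batch))

# Result: one tuple with all images, another with all targets
images, targets = collate_fn(batch)
print(images)
print([len(t['boxes']) for t in targets])
```

Since the targets are kept as separate dictionaries inside a tuple, it doesn't matter that one image has a box and the other has none.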

### 11.4.7 Model

Now, the fun part begins: replacing the backbone of a pretrained object detection model!

You will have to create a brand new instance of the `FasterRCNN` class using the required arguments to make your model work:
   - `backbone`: your feature extractor
   - `rpn_anchor_generator`: the new anchor generator
   - `box_roi_pool`: the new ROI pooler
   - `num_classes`: the number of classes for your task

You already know the number of classes - but don't forget another one for the negative case, that is, whenever there's no object in the image. This class (for the background, if you will) is usually assigned the zero index (and that's why the class indices from the dataset start at one).

You already have the weights for the backbone model, but you need to create a model that returns its features only (the "headless" model, as seen in Chapter 2). The model must return either a feature map dictionary (if you're extracting features from multiple layers of your backbone) or a single tensor (if you're extracting a single set of features). Also, keep in mind that:
   - some models (like [MobileNet V2](https://pytorch.org/hub/pytorch_vision_mobilenet_v2/), our suggested choice of new backbone) can have their features extracted easily by accessing a single attribute (`features` in the case of MobileNet)
   - for more complex models, you can use `create_feature_extractor()` or `IntermediateLayerGetter` to build your backbone

Use the weights you already loaded to create an instance of your backbone model, and use one of the alternatives above to make it return its features only:

In [None]:
from torchvision.models.detection import FasterRCNN
from torchvision.models import mobilenet_v2

num_classes = len(id2label) + 1

weights = ...
mobilenet = ...
new_backbone = mobilenet.features

Double-check if your model is returning what you expect of it by feeding it a random tensor in the shape of a mini-batch (make sure the height and width of your random images match the expected input of your model):

In [None]:
dummy_x = torch.randn(2, 3, 224, 224)
dummy_output = new_backbone(dummy_x)

You shouldn't get any errors, and your dummy output must be either a single tensor, or a feature map dictionary. Check the shape of each returned tensor (one or more), and make sure they all have the same number of output channels. This is required by the Faster R-CNN architecture.

In [None]:
out_channels = dummy_output.shape[1]
out_channels

Assign the number of output channels to the instance of your backbone as an `out_channels` attribute:

In [None]:
new_backbone.out_channels = ...

Create an instance of the `AnchorGenerator` class, and make sure each argument - `sizes` and `aspect_ratios` - is a tuple containing as many elements as the number of feature maps returned by your backbone.

Each element is a tuple itself, and may have as many elements as you wish. For more details, refer to the "Region Proposal Network" subsection.

In [None]:
from torchvision.models.detection.rpn import AnchorGenerator

sizes = ((32, 64, 128, 256, 512),)
aspect_ratios = ((0.5, 1.0, 2.0),)

anchor_generator = ...
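
To get a feeling for what these two tuples imply, the sketch below enumerates the anchor shapes for the single feature map by hand. It assumes that an aspect ratio means height divided by width and that each anchor keeps the area of a `size` by `size` square, which is our understanding of how `AnchorGenerator` builds its base anchors. The function `base_anchor_shapes()` is a hypothetical helper, not part of torchvision:

```python
import math

sizes = ((32, 64, 128, 256, 512),)
aspect_ratios = ((0.5, 1.0, 2.0),)

def base_anchor_shapes(sizes_for_map, ratios_for_map):
    # Assumption: ratio = height / width, and area stays at size * size
    shapes = []
    for ratio in ratios_for_map:
        h_ratio = math.sqrt(ratio)
        w_ratio = 1.0 / h_ratio
        for size in sizes_for_map:
            shapes.append((size * w_ratio, size * h_ratio))
    return shapes

# One inner tuple per feature map; there is only one here
shapes = base_anchor_shapes(sizes[0], aspect_ratios[0])
anchors_per_location = len(shapes)
print(anchors_per_location)
```

Five sizes times three aspect ratios give 15 candidate anchors at every location of the feature map.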

Create an instance of the `MultiScaleRoIAlign` class, and make sure it points to at least one valid feature map as returned by our backbone model. For more details, refer to the "Regions of Interest" subsection.

In [None]:
from torchvision.ops import MultiScaleRoIAlign

output_size = 7
sampling_ratio = 2

# Tip: when the backbone returns a single tensor instead of a dictionary,
# Faster R-CNN wraps it in a dictionary under the key "0"
roi_pooler = MultiScaleRoIAlign(featmap_names=['0'], output_size=output_size, sampling_ratio=sampling_ratio)

Now, put everything together as your own Faster R-CNN model:

In [None]:
model = ...

There you go!

#### 11.4.7.1 Double-Checking the Model

To make sure your configuration is working fine, you can feed your new Faster R-CNN model a random tensor representing a dummy mini-batch once again. If you don't get any errors back, you're likely good to go!

Don't forget to send each tensor in your mini-batch, individually, to the device. You cannot simply send them all at once as you used to do before.

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)
model.train()

images, targets = next(iter(dataloaders['train']))

# Send images and targets to device
# write your code here
...

# Make predictions using your model - you should get a dict of losses back
output = ...
output

#### 11.4.7.2 Recap

In order to replace the backbone of a Faster R-CNN model, you need to:

1. Choose a computer vision model (e.g. ResNet, MobileNet)
2. Create a backbone by extracting the model features as a feature map (dict) or a single tensor:
   - if there are multiple feature maps, make sure they all produce the same number of output channels
   - some models (like MobileNet) can have their features extracted easily by accessing a single attribute
   - for more complex models, you can use `create_feature_extractor()` or `IntermediateLayerGetter` to build your backbone
   - create and/or set the `out_channels` attribute of your backbone
3. Create and configure an `AnchorGenerator` so it has as many `sizes` and `aspect_ratios` as feature maps returned by your backbone
4. Create and configure a ROI pooler `MultiScaleRoIAlign` so it points to the correct feature maps returned by your backbone
5. Create an instance of `FasterRCNN` with the proper arguments:
   - `backbone`: your feature extractor
   - `rpn_anchor_generator`: the new anchor generator
   - `box_roi_pool`: the new ROI pooler
   - `num_classes`: the number of classes for your task


### 11.4.8 Learning Rate Schedulers

We need to talk about learning rates. So far, we've always used an optimizer with a fixed learning rate. The Adam optimizer actually makes adjustments under the hood but, as far as we're concerned, the learning rate was defined as a single value for the whole training loop.

Unfortunately, this straightforward approach does not always work. In some cases, and training a Faster R-CNN model is one of those cases, it may be necessary to gradually "warm up" the learning rate to the desired level. Conversely, sometimes you may need to actually decrease the learning rate after a few epochs or updates. Or, perhaps, combine both approaches: warm up at the start, run at a certain level for a while, and then start decreasing it.

Luckily, it's not hard to accomplish this at all: there's a learning rate scheduler for every need. Let's say you'd like to reduce the learning rate by one order of magnitude (that is, multiplying it by 0.1) every T epochs, such that training is faster at the beginning and slows down after a while to try to avoid convergence problems. Set up a scheduler to do that.

Now, let's say you'd like to warm-up the learning rate from zero all the way up to a predefined level over the course of the first epoch (you'll actually do that). Set up another scheduler to do that too.

The learning rate scheduler does modify the underlying learning rate set in the optimizer, so it should be no surprise that one of the scheduler's arguments is the optimizer itself. The learning rate set for the optimizer will be the initial learning rate of the scheduler. The scheduler also has a `step()` method, just like the optimizer does, and you should call the scheduler's `step()` method after calling the optimizer's.

The schedulers may be divided into two main groups, according to where their `step()` method is expected to be called during the training loop.

#### 11.4.8.1 Epoch Schedulers

These schedulers will have their `step()` method called at the end of every epoch. But each scheduler has its own rules for updating the learning rate. Here are a few examples from [PyTorch's learning rate schedulers](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate):

- `StepLR`: it multiplies the learning rate by a factor `gamma` every `step_size` epochs.
- `LinearLR`: it starts with a learning rate that's a fraction (`start_factor`) of the optimizer's learning rate, and modifies it linearly over `total_iters` updates until it reaches the end learning rate (`end_factor`, which defaults to 1.0, times the optimizer's learning rate).
- `MultiStepLR`: it multiplies the learning rate by a factor `gamma` at the epochs indicated in the list of `milestones`.
- `ExponentialLR`: it multiplies the learning rate by a factor `gamma` every epoch, no exceptions.
- `LambdaLR`: it takes your own customized function that should take the epoch as an argument and returns the corresponding multiplicative factor (with respect to the initial learning rate).
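
To make the first two rules concrete, here is a small sketch that computes the learning rates by hand, without calling PyTorch. The functions `step_lr()` and `linear_lr()` are hypothetical helpers written from the descriptions above; they are not the schedulers themselves:

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    # Multiply by gamma once for every completed block of step_size epochs
    return base_lr * gamma ** (epoch // step_size)

def linear_lr(base_lr, update, total_iters, start_factor, end_factor=1.0):
    # Move linearly from start_factor to end_factor, then stay there
    progress = min(update, total_iters) / total_iters
    return base_lr * (start_factor + (end_factor - start_factor) * progress)

decayed = [step_lr(0.005, epoch, step_size=3) for epoch in range(6)]
warmup = [linear_lr(0.005, t, total_iters=4, start_factor=0.001) for t in range(6)]
print(decayed)
print(warmup)
```

The first list stays flat for three epochs and then drops ten-fold, while the second one climbs for four updates and stays at the optimizer's learning rate afterwards.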

#### 11.4.8.2 Validation Loss Scheduler

The `ReduceLROnPlateau` scheduler should also have its `step()` method called at the end of every epoch, but it has its own group here because it does not follow a predefined schedule. Ironic, right?

The `step()` method takes the validation loss as an argument, and the scheduler can be configured to tolerate a lack of improvement in the loss (to a threshold, of course) up to a given number of epochs (the aptly named `patience` argument). After the scheduler runs out of patience, it updates the learning rate, multiplying it by the `factor` argument (for the schedulers listed in the last section, this factor was named `gamma`).
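
The patience mechanism is easier to follow with numbers. The sketch below is a simplified, hand-written version of the logic just described: it ignores the threshold and any other option of the real scheduler, and `plateau_schedule()` is a hypothetical function used only for illustration:

```python
def plateau_schedule(val_losses, lr=0.005, factor=0.1, patience=2):
    best = float('inf')
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the counter
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            # Out of patience: shrink the learning rate and start over
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history

# Made-up validation losses that stall after the second epoch
lrs_on_plateau = plateau_schedule([1.0, 0.8, 0.81, 0.82, 0.83, 0.7])
print(lrs_on_plateau)
```

With a patience of two, the learning rate is only reduced at the third consecutive epoch without improvement.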

#### 11.4.8.3 Mini-Batch Schedulers

Some schedulers should have their `step()` method called at the end of every mini-batch, like cyclical schedulers:

- `CyclicLR`: This cycles between `base_lr` and `max_lr` (so it disregards the initial learning rate set in the optimizer), using `step_size_up` updates to go from the base to the max learning rate, and `step_size_down` updates to go back. This behavior corresponds to `mode='triangular'`. Additionally, it is possible to shrink the amplitude using different modes: `triangular2` halves the amplitude after each cycle, while `exp_range` shrinks the amplitude exponentially, using `gamma` as the base and the number of updates so far as the exponent.
- `OneCycleLR`: This uses a method called annealing to update the learning rate from its initial value up to a defined maximum learning rate (`max_lr`) and then down to a much lower learning rate over a `total_steps` number of updates, thus performing a single cycle.
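
As an illustration of the triangular mode, here is the usual closed-form expression for a symmetric cycle, computed by hand. It assumes `step_size_up` and `step_size_down` are equal (called `step_size` below), and `triangular_lr()` is a hypothetical helper rather than PyTorch's own code:

```python
import math

def triangular_lr(update, base_lr, max_lr, step_size):
    # Which cycle are we in? Each cycle takes 2 * step_size updates
    cycle = math.floor(1 + update / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1]
    x = abs(update / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

cycle_lrs = [triangular_lr(t, base_lr=0.001, max_lr=0.005, step_size=4)
             for t in range(9)]
print(cycle_lrs)
```

The learning rate starts at `base_lr`, peaks at `max_lr` after four updates, and is back at `base_lr` after eight.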

The fact that the `LinearLR` scheduler belongs to the "Epoch Schedulers" group doesn't mean you're not allowed to use it as a mini-batch scheduler. You're free to update the learning rate at your convenience, and you'll be doing exactly that but, first, let's take a couple of schedulers for a spin!

Start by creating an optimizer:

In [None]:
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, LinearLR

optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)

You'll want to track how its learning rate is modified over the course of the mini-batches and epochs, so the helper function below is, well, helping you with that by unpacking the internal learning rate of the optimizer:

In [None]:
def get_lr(optimizer):
    return list(map(lambda d: d['lr'], optimizer.state_dict()['param_groups']))

get_lr(optimizer)

You'll have to create two schedulers now:
- one that reduces the learning rate by a ten-fold factor after three epochs
- another one that increases the learning rate from zero to the value set in the optimizer over the course of one epoch (hint: it has to update the learning rate as many times as there are mini-batches in your training set)

In [None]:
lr_scheduler = ...

warmup_factor = 1.0 / 1000
warmup_iters = min(1000, len(dataloaders['train']) - 1)
lr_scheduler2 = ...

Now, place each scheduler's `step()` method in the dummy training loop below so they behave as described above:

In [None]:
# Recreating everything here, so you don't have to re-run the previous code
# if you want to try different configurations or places in the loop
optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = ...
lr_scheduler2 = ...

num_epochs = 5

lrs = []

for epoch in range(num_epochs):
    for i in range(len(dataloaders['train'])):
        # write your code here?
        ...

        lrs.append(get_lr(optimizer)[0])

    # write your code here?
    ...

You should get a "curve" like this:

In [None]:
from matplotlib import pyplot as plt
plt.plot(lrs)
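
If you're unsure what shape to expect, the sketch below computes an idealized version of the curve by hand, assuming the warm-up and the step decay simply multiply each other. It uses a made-up number of mini-batches per epoch, and `expected_lr()` is a hypothetical helper; the values logged by your own loop may be shifted by one update depending on where you call `step()`:

```python
def expected_lr(base_lr, epoch, batch, batches_per_epoch,
                warmup_iters, warmup_factor, step_size=3, gamma=0.1):
    # Linear warm-up over the first warmup_iters updates
    update = epoch * batches_per_epoch + batch
    progress = min(update, warmup_iters) / warmup_iters
    warmup = warmup_factor + (1.0 - warmup_factor) * progress
    # Ten-fold decay every step_size epochs
    decay = gamma ** (epoch // step_size)
    return base_lr * warmup * decay

batches_per_epoch = 10  # made-up value, just for the sketch
curve = [expected_lr(0.005, epoch, batch, batches_per_epoch,
                     warmup_iters=batches_per_epoch - 1,
                     warmup_factor=1.0 / 1000)
         for epoch in range(5) for batch in range(batches_per_epoch)]
print(curve[0], curve[9], curve[30])
```

The result is a ramp during the first epoch, a plateau at 0.005 for the next two, and a drop to 0.0005 from the fourth epoch onwards.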

### 11.4.9 Training Loop

It is time to write a real training loop now! You can use the dummy loop as a template and build on top of it, once you're happy with your schedulers.

Don't forget to send every tensor, individually, to the same device as the model. Also, keep in mind that the model returns a dictionary with many separate losses. It is your job to sum them all up to compute gradients based on the total.
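
As a reminder of what "summing them all up" means, here is a tiny sketch using plain floats as stand-ins for the scalar loss tensors. The keys are the ones torchvision's Faster R-CNN returns in training mode, but the values are made up:

```python
# Stand-in for the dict returned by the model in training mode;
# real values are scalar tensors, floats are used here for illustration
loss_dict = {
    'loss_classifier': 0.42,
    'loss_box_reg': 0.17,
    'loss_objectness': 0.05,
    'loss_rpn_box_reg': 0.03,
}

# A single value to call backward() on (when using tensors)
total_loss = sum(loss for loss in loss_dict.values())
print(total_loss)
```

The same `sum()` expression works on tensors, and the result is a single scalar that still carries the gradients of every individual loss.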

In [None]:
optimizer = optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# Decreases the learning rate by 10x every 3 epochs
lr_scheduler = ...

# Warms-up the learning rate from zero to 0.005 over one epoch
warmup_factor = 1.0 / 1000
warmup_iters = min(1000, len(dataloaders['train']) - 1)

lr_scheduler2 = ...

In [None]:
num_epochs = 5

model.to(device)

for epoch in range(num_epochs):
    for i, (images, targets) in enumerate(dataloaders['train']):
        # Send images and targets to device
        # write your code here
        images = ...
        targets = ...

        # Set your model's mode
        # write your code here
        ...

        # Call the model to get a loss dict back
        loss_dict = ...
        
        if not (i % 50):
            print([(k, v.item()) for k, v in loss_dict.items()])

        # You have many losses in the dict, but you can only
        # call backward on a single value, so you must
        # add them up
        losses = ...

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        # Make a call to a scheduler - maybe here?
        # write your code here
        ...

    # Make a call to a scheduler - maybe here?
    # write your code here
    ...

Training this model takes quite a while...

In [None]:
torch.save(model.state_dict(), 'mobilenet_v2_pets.pth')

In [None]:
# # mobilenet_v2_pets.pth
# !wget https://github.com/dvgodoy/assets/releases/download/model/mobilenet_v2_pets.pth
# state = torch.load('./mobilenet_v2_pets.pth', map_location='cpu')
# model.load_state_dict(state)
# model.to(device)

### 11.4.10 Trying It Out

Let's take an image from our validation set and see if our model can correctly detect an object, cat or dog, in the image:

In [None]:
i = 550
x, y = datasets['val'][i]
model.eval()
images = list(x.unsqueeze(0).to(device))
targets = [{k: v.to(device) for k, v in y.items()}]

pred = model(images, targets)
pred, y['boxes']

It predicted a box indeed, but is it any good?

In [None]:
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms import ToPILImage

boxes = pred[0]['boxes'][0].detach().cpu().unsqueeze(0)
img = read_image(f"./pets/oxford-iiit-pet/images/{datasets['val'].images[i]}")
result = draw_bounding_boxes(img, boxes, colors=['red'], width=3)
ToPILImage()(result)

Not at all! It is awfully misplaced! Does it mean we should throw the model away and start from scratch?

Not so fast... let's dig a little bit deeper and try figuring it out first.

#### 11.4.10.1 Predicted vs Original Boxes

First, we need to rule out an annotation issue. Who knows if the box was correctly drawn in the first place, right?

So, let's start by retrieving the annotated bounding box:

In [None]:
fname = datasets['val'].images[i]
orig_box = torch.as_tensor(annotations['val'].query(f'filename == "{fname}"').values[0][-4:].astype(np.float32)).unsqueeze(0)
orig_box

Then, let's plot the two boxes, the annotated one (in green) and the predicted one (in red), on top of the same image:

In [None]:
boxes = pred[0]['boxes'][0].detach().cpu().unsqueeze(0)
img = read_image(f"./pets/oxford-iiit-pet/images/{datasets['val'].images[i]}")
result = draw_bounding_boxes(img, torch.cat([boxes, orig_box]), colors=['red', 'green'], width=3)
ToPILImage()(result)

Clearly, the annotation is good, so what else could be the issue here?

#### 11.4.10.2 Predicted vs Transformed Boxes

Remember that, even if we didn't perform data augmentation on the validation set, it still goes through a few transformations to match the input shape required by the MobileNet V2 model.

So, instead of drawing the (raw) annotated box on top of the original image, let's take both the box and image after these transformations:

In [None]:
boxes = pred[0]['boxes'][0].detach().cpu().unsqueeze(0)
img = read_image(f"./pets/oxford-iiit-pet/images/{datasets['val'].images[i]}")
img = transforms.Compose([transforms.Resize(232, antialias=True), transforms.CenterCrop(224)])(img)
result = draw_bounding_boxes(img, torch.cat([boxes, y['boxes']]), colors=['red', 'green'], width=3)
ToPILImage()(result)

Much, much better! The predicted and (transformed) annotated boxes are very similar!

So, what's happening here?

The original Faster R-CNN model used a ResNet50 backbone and included both pre- and post-processing routines out-of-the-box. By replacing the backbone with our own, we had to handle the preprocessing transformations ourselves, so it's only logical we do the same for the post-processing too, right? That's the missing piece in our workflow.
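
To make the effect of those preprocessing steps concrete, the sketch below approximates, in plain Python, how resizing the shorter side to 232 pixels and center cropping to 224 pixels move the coordinates of a box. The `transform_box()` helper, the image size, and the box are all made up for illustration. The actual transforms also round sizes to whole pixels, so their results may differ slightly.

```python
def transform_box(box, orig_hw, resize_to=232, crop_to=224):
    """Approximate how resizing and center cropping move an
    (x1, y1, x2, y2) box. Hypothetical helper, for illustration only."""
    height, width = orig_hw
    # Resize: the shorter side becomes `resize_to`, keeping the aspect ratio
    scale = resize_to / min(height, width)
    new_h, new_w = height * scale, width * scale
    # Center crop: an equal margin is dropped on each side
    left = (new_w - crop_to) / 2
    top = (new_h - crop_to) / 2
    x1, y1, x2, y2 = (coord * scale for coord in box)
    shifted = [x1 - left, y1 - top, x2 - left, y2 - top]
    # Anything that falls outside the crop is clipped to its border
    return [min(max(coord, 0.0), float(crop_to)) for coord in shifted]

# A made-up image, 400 pixels high and 600 pixels wide, with a made-up box
print(transform_box((150, 100, 450, 300), orig_hw=(400, 600)))
```

The box ends up at roughly (25, 54, 199, 170) in the 224x224 input, which is why drawing a prediction made in that space on top of the original image looks so misplaced.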

### 11.4.11 Postprocessing Boxes

Most preprocessing transformations for computer vision models include resizing and center cropping, so that's where we're focusing here. We'll resize and shift the boxes according to the estimated resizing and cropping. The `restore_boxes()` helper function does exactly that:

In [None]:
from torchvision.models.detection.transform import resize_boxes

def restore_boxes(orig_image_dims, transf_image_dims, boxes):
    # dims are (height, width) and boxes are (x1, y1, x2, y2)
    is_wide = (orig_image_dims[1] >= orig_image_dims[0])
    shortest = min(orig_image_dims)
    longest = max(orig_image_dims)
    # margin removed from each end of the longest side by center cropping
    shift = (longest-shortest)/2
    if is_wide:
        shift = torch.as_tensor([shift, 0, shift, 0])
    else:
        shift = torch.as_tensor([0, shift, 0, shift])

    # if it's a perfect square, we assume it was cropped to that size
    is_cropped = (transf_image_dims[0] == transf_image_dims[1])
    boxes = resize_boxes(boxes, transf_image_dims, (shortest, shortest) if is_cropped else orig_image_dims)

    ratio = longest/shortest
    is_square = (abs(ratio - 1) < 0.01)
    if not is_square:
        boxes += shift
    return boxes

What does the (restored) predicted box look like when drawn on top of the original image?

In [None]:
img = read_image(f"./pets/oxford-iiit-pet/images/{datasets['val'].images[i]}")
orig_image_dims = img.shape[1:]
transf_image_dims = x.shape[1:]
boxes = pred[0]['boxes'][0].detach().cpu().unsqueeze(0)
boxes = restore_boxes(orig_image_dims, transf_image_dims, boxes)

result = draw_bounding_boxes(img, boxes, colors=['red']*boxes.shape[0], width=3)
ToPILImage()(result)

It looks awesome now!
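
If you'd like to check the arithmetic by hand, here is a minimal plain-Python sketch that mirrors what `restore_boxes()` does for a single box. The `restore_box_plain()` name, the image size, and the box are made up for illustration; it assumes dimensions are given as (height, width), as in the cells above.

```python
def restore_box_plain(box, orig_hw, transf_hw):
    """Mirror the arithmetic of restore_boxes() for a single
    (x1, y1, x2, y2) box, using plain Python numbers."""
    height, width = orig_hw
    shortest, longest = min(orig_hw), max(orig_hw)
    # A square input is assumed to be a center crop of the shortest side
    is_cropped = (transf_hw[0] == transf_hw[1])
    target_h, target_w = (shortest, shortest) if is_cropped else orig_hw
    ratio_h = target_h / transf_hw[0]
    ratio_w = target_w / transf_hw[1]
    x1, y1, x2, y2 = box
    restored = [x1 * ratio_w, y1 * ratio_h, x2 * ratio_w, y2 * ratio_h]
    # Add back the margin that cropping removed from the longest side
    margin = (longest - shortest) / 2
    if abs(longest / shortest - 1) >= 0.01:
        if width >= height:
            restored[0] += margin
            restored[2] += margin
        else:
            restored[1] += margin
            restored[3] += margin
    return restored

# A made-up box in the 224x224 input space of a 400x600 (height x width) image
restored = restore_box_plain((25, 54, 199, 170), orig_hw=(400, 600), transf_hw=(224, 224))
print([round(coord, 1) for coord in restored])
```

For this made-up example, the box in the 224x224 space corresponds to roughly (150, 100, 450, 300) in the original image, and the restored coordinates land within a few pixels of it. The small gap is expected, since the restoration only estimates the resizing and cropping that were applied.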

Congratulations! You have successfully gone through the development and training of your first customized Faster R-CNN model, including data cleaning and organizing, and using learning rate schedulers. That's quite an accomplishment!

But judging a single box by eye is quite subjective, right? It's not a true evaluation. Let's talk about that now.

### 11.4.12 Evaluation

The evaluation of object detection models is quite involved, so we're starting from the top (the actual results), and then we'll dig a bit into how you can arrive at these results.

The COCO challenge includes a particular way of evaluating the predictions, and there's an API available at its [cocoapi](https://github.com/cocodataset/cocoapi) GitHub repository. That code isn't very user-friendly, so we'll be using [Torchvision's reference scripts for object detection](https://github.com/pytorch/vision/tree/main/references/detection) instead.

The commands below retrieve the necessary files so we can run the evaluation function:

In [None]:
# !wget https://raw.githubusercontent.com/pytorch/vision/v0.15.2/references/detection/coco_eval.py
# !wget https://raw.githubusercontent.com/pytorch/vision/v0.15.2/references/detection/transforms.py
# !wget https://raw.githubusercontent.com/pytorch/vision/v0.15.2/references/detection/utils.py && mv utils.py detection_utils.py
# !wget https://raw.githubusercontent.com/pytorch/vision/v0.15.2/references/detection/engine.py
# !wget https://raw.githubusercontent.com/pytorch/vision/v0.15.2/references/detection/coco_utils.py
# !sed -i 's/import utils/import detection_utils as utils/' engine.py
# !sed -i 's/import utils/import detection_utils as utils/' coco_eval.py

In [None]:
from engine import evaluate as torch_coco_eval
torch_coco_eval(model, dataloaders['val'], device=device)

Let's focus on the first two results for Average Precision. The first one (IoU=0.50:0.95) is the key metric of the COCO challenge: it averages the precision over ten IoU thresholds, from 0.50 to 0.95 in steps of 0.05. The second one (IoU=0.50) is the key metric of the Pascal VOC challenge. The latter is easier to understand, so let's start with that one.

Its value, 0.608, means that the model detects boxes correctly with a precision of 60.8%, that is, out of all the detected boxes, roughly three in five were considered correct. How do we know if a box is correctly detected or not? What about checking the overlap between the true (annotated) box and the predicted one? That's the idea behind the "IoU" metric (the intersection over union) we'll discuss now.
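
As a quick preview, here is a minimal plain-Python sketch of that idea: it computes the IoU for a few made-up pairs of predicted and annotated boxes, and counts a prediction as correct if its IoU is at least 0.5. The `iou()` helper and the boxes are illustrative assumptions, and the calculation is a simplification: the actual COCO evaluation also ranks the detections by score and averages the precision over several recall levels.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Made-up (predicted, annotated) pairs, one pair per image
pairs = [
    ((10, 10, 110, 110), (20, 20, 120, 120)),  # large overlap
    ((0, 0, 50, 50), (40, 40, 140, 140)),      # tiny overlap
    ((30, 30, 130, 130), (30, 30, 130, 130)),  # perfect match
]
ious = [iou(pred_box, true_box) for pred_box, true_box in pairs]
# A detection counts as correct if its IoU reaches the 0.5 threshold
correct = [value >= 0.5 for value in ious]
precision = sum(correct) / len(correct)
print([round(value, 3) for value in ious], round(precision, 3))
```

In this toy example, two out of the three predictions reach the threshold. For tensors of boxes, `torchvision.ops.box_iou` computes the same pairwise quantity.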