# TorchVision Faster R-CNN Finetuning Tutorial

Based on [TorchVision Object Detection Finetuning Tutorial](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html).

For this tutorial, we will be finetuning a pre-trained Faster R-CNN model on the [*Penn-Fudan Database for Pedestrian Detection and Segmentation*](https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip). It contains 170 images with 345 instances of pedestrians, and we will use it to illustrate how to use torchvision to train an object detection model on a custom dataset. To run the notebook, download the dataset, unzip it, and move it into the `data/` directory.

## Defining the Dataset

The dataset should inherit from the standard `torch.utils.data.Dataset` class, and implement `__len__` and `__getitem__`.

For the torchvision reference scripts to work, the dataset's `__getitem__` should return a tuple `(image, target)`, with:

* `image`: a PIL Image of size (H, W)
* `target`: a dictionary containing the following fields
    * `boxes` (`FloatTensor[N, 4]`): the coordinates of the `N` bounding boxes in `[x0, y0, x1, y1]` format, ranging from `0` to `W` and `0` to `H`
    * `labels` (`Int64Tensor[N]`): the label for each bounding box
    * `image_id` (`Int64Tensor[1]`): an image identifier. It should be unique across all the images in the dataset, and is used during evaluation
    * `area` (`Tensor[N]`): The area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.
    * `iscrowd` (`UInt8Tensor[N]`): instances with `iscrowd=True` will be ignored during evaluation.
    * (optionally) `masks` (`UInt8Tensor[N, H, W]`): The segmentation masks for each one of the objects
    * (optionally) `keypoints` (`FloatTensor[N, K, 3]`): for each one of the `N` objects, the `K` keypoints in `[x, y, visibility]` format that define the object. `visibility=0` means that the keypoint is not visible. Note that for data augmentation, the notion of flipping a keypoint depends on the data representation, so you will likely need to adapt `references/detection/transforms.py` for a new keypoint representation

A note on the labels: the model treats class 0 as background. If your dataset does not contain the background class, you should not have 0 in your labels. For example, assuming you have just two classes, cat and dog, you can define 1 (not 0) to represent cats and 2 to represent dogs. If one of the images contains both classes, its labels tensor should look like `[1, 2]`.
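
As a worked example of the fields above, here is a minimal sketch in plain Python that derives `area` and `iscrowd` from a list of boxes and rejects the background index 0 as a label. The helper name `build_target_fields` and the sample boxes are illustrative assumptions, not part of `lib.penn_fundan`. In an actual dataset each list would be converted to a tensor of the dtype listed above.

```python
def build_target_fields(boxes, labels, image_id):
    """Assemble the target fields as plain Python lists (hypothetical helper).

    boxes: list of [x0, y0, x1, y1] in pixel coordinates
    labels: list of positive integer class ids (0 is reserved for background)
    """
    if len(boxes) != len(labels):
        raise ValueError("need exactly one label per box")
    if any(label == 0 for label in labels):
        raise ValueError("label 0 is reserved for the background class")
    # area of each box: width times height
    area = [(x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes]
    return {
        "boxes": boxes,
        "labels": labels,
        "image_id": [image_id],
        "area": area,
        "iscrowd": [0] * len(boxes),  # assumption: no crowd annotations
    }


# two made-up boxes: a cat (label 1) and a dog (label 2)
target = build_target_fields(
    boxes=[[10.0, 20.0, 110.0, 220.0], [50.0, 40.0, 90.0, 100.0]],
    labels=[1, 2],
    image_id=0,
)
print(target["area"])
```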

Additionally, if you want to use aspect ratio grouping during training (so that each batch only contains images with similar aspect ratios), it is recommended to also implement a `get_height_and_width` method, which returns the height and the width of an image. If this method is not provided, all elements of the dataset are queried via `__getitem__`, which loads each image into memory and is slower than a custom method.
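
To see why only the image size is needed, here is a simplified sketch of aspect ratio grouping: each image is assigned to a bin based on its width divided by its height, and batches are then drawn from within one bin. The function `aspect_ratio_group_ids` and the single bin edge at 1.0 are assumptions for illustration; the sampler in the reference scripts may use more bins.

```python
import bisect


def aspect_ratio_group_ids(sizes, bin_edges=(1.0,)):
    """Map (height, width) pairs to group ids (hypothetical helper).

    With the default single edge at 1.0, portrait images (width < height)
    get id 0 and square or landscape images get id 1.
    """
    edges = sorted(bin_edges)
    return [bisect.bisect_right(edges, width / height) for height, width in sizes]


# sizes in the order a get_height_and_width method returns them: (height, width)
sizes = [(400, 200), (300, 300), (200, 500)]
print(aspect_ratio_group_ids(sizes))
```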

**See [`lib.penn_fundan`](./lib/penn_fundan.py) for the dataset implementation.**

## Defining your model

In this tutorial, we will be using [Faster R-CNN](https://arxiv.org/abs/1506.01497) with a ResNet-50 FPN backbone. Faster R-CNN is a model that predicts both bounding boxes and class scores for potential objects in the image.

In [None]:
from torchvision.models.detection.faster_rcnn import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

# load Faster R-CNN with a ResNet-50 FPN backbone and weights pre-trained on COCO
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)

## Training and evaluation functions

We will be using a modified version of the [torchvision reference scripts for training object detection](https://github.com/pytorch/vision/tree/v0.3.0/references/detection), which are included in the torchvision repository. They include code for the training loop, evaluation with COCO metrics, and image transformation utilities, so we don't need to write these ourselves.

The scripts are in the directory `lib/detection/`.

### Putting everything together

We now have the dataset class, the model and the data transforms. Let's instantiate the transforms and the datasets:

In [None]:
import lib.detection.transforms as T
from lib.penn_fundan import PennFudanDataset

# Define data transforms for training batches
train_tfm = T.Compose([
    T.ToTensor(),  # converts the image, a PIL image, into a PyTorch Tensor
    T.RandomHorizontalFlip(0.5)  # randomly flip the training images
])

# Define data transforms for validation batches
val_tfm = T.ToTensor()

# Define datasets
dataset_train = PennFudanDataset('data/PennFudanPed/', train_tfm)
dataset_val = PennFudanDataset('data/PennFudanPed/', val_tfm)
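
Note that both datasets above are built from the same folder, so validation runs on images that are also used for training. A minimal sketch of a holdout split follows. The helper `split_indices`, the 50-image validation size, and the seed are assumptions, not part of the original notebook.

```python
import random


def split_indices(num_items, num_val, seed=0):
    """Shuffle indices reproducibly and split them into train and validation lists."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    return indices[:-num_val], indices[-num_val:]


# Penn-Fudan has 170 images; holding out 50 of them is an arbitrary choice
train_idx, val_idx = split_indices(170, num_val=50)
print(len(train_idx), len(val_idx))
```

The index lists could then be passed to `torch.utils.data.Subset`, for example `Subset(dataset_train, train_idx)` and `Subset(dataset_val, val_idx)`, so that each split keeps its own transforms.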

In [None]:
from torch.utils.data import DataLoader
from lib.detection.utils import collate_fn


# define training and validation data loaders
data_loader_train = DataLoader(
    dataset_train, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=collate_fn
)

data_loader_val = DataLoader(
    dataset_val, batch_size=2, shuffle=False, num_workers=4,
    collate_fn=collate_fn
)
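
A custom `collate_fn` is needed because the images in a detection batch can differ in size and each target holds a different number of boxes, so the default stacking into a single tensor does not apply. In the upstream torchvision reference scripts the function simply regroups the samples into a tuple of images and a tuple of targets. Assuming the copy in `lib/detection/utils.py` is unchanged, it behaves like this sketch (named `collate_fn_sketch` here to avoid confusion with the real one):

```python
def collate_fn_sketch(batch):
    """Turn [(image, target), ...] into ((image, ...), (target, ...))."""
    return tuple(zip(*batch))


# stand-in samples: strings instead of image tensors, small dicts as targets
batch = [("img_a", {"labels": [1]}), ("img_b", {"labels": [1, 1]})]
images, targets = collate_fn_sketch(batch)
print(images)
print(targets)
```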

Now let's move the model to the right device and construct the optimizer and the learning rate scheduler:

In [None]:
import torch
from torch.optim.lr_scheduler import StepLR
from torch.optim import SGD


device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# our dataset has two classes only - background and pedestrian
num_classes = 2

# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = StepLR(optimizer,
                      step_size=3,
                      gamma=0.1)

## Start TensorBoard for logging

To start TensorBoard on VSC OnDemand, go to [the dashboard](https://ondemand.hpc.kuleuven.be/pun/sys/dashboard/) and click on "TensorBoard". Use the following settings:

- Number of cores: 1
- Account: lp_edu_maibi_anndl
- Partition: interactive
- Project/Log folder: maibi_cv/3_detection/runs
- Number of hours: 4
- Number of gpu's: 0

Leave the other settings at their default values.

Now let's train the model for 10 epochs, evaluating at the end of every epoch.

In [None]:
from lib.detection.engine import train_one_epoch, evaluate
from torch.utils.tensorboard import SummaryWriter

num_epochs = 10

# write TensorBoard logs; SummaryWriter() logs to ./runs/ by default
writer = SummaryWriter()

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, data_loader_train,
                    device, epoch, writer=writer)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the validation dataset
    evaluate(model, data_loader_val, device, epoch, writer=writer)

Now that training has finished, let's have a look at what the model actually predicts on an image from the validation set.

In [None]:
# pick one image from the validation set
img, _ = dataset_val[0]
# put the model in evaluation mode
model.eval()
with torch.no_grad():
    prediction = model([img.to(device)])

Printing the prediction shows that we have a list of dictionaries. Each element of the list corresponds to a different image. As we have a single image, there is a single dictionary in the list.
The dictionary contains the predictions for the image we passed. In this case, we can see that it contains `boxes`, `labels` and `scores` as fields.

In [None]:
prediction

Let's inspect the image and the predicted detection boxes.

For that, we need to convert the image back to a PIL image. It was rescaled to the range 0-1 and rearranged into `[C, H, W]` format, so we scale it back to 0-255 and permute it to `[H, W, C]`. Next, we iterate over the predicted boxes and draw them on the image.

In [None]:
from PIL import Image, ImageDraw

im = Image.fromarray(img.mul(255).permute(1, 2, 0).byte().numpy())
draw = ImageDraw.Draw(im)

for box in prediction[0]['boxes'].cpu().numpy():
    draw.rectangle(box, width=5)

im
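
The loop above draws every predicted box, including low-confidence ones. A common refinement is to keep only the boxes whose score reaches a threshold. The helper `filter_by_score` and the 0.5 threshold below are assumptions for illustration. The sketch works on plain lists, such as those obtained with `prediction[0]['boxes'].tolist()` and `prediction[0]['scores'].tolist()`.

```python
def filter_by_score(boxes, scores, threshold=0.5):
    """Keep only the boxes whose confidence score is at least `threshold`."""
    return [box for box, score in zip(boxes, scores) if score >= threshold]


# made-up detections in [x0, y0, x1, y1] format with their scores
boxes = [[10, 20, 110, 220], [50, 40, 90, 100], [0, 0, 30, 30]]
scores = [0.98, 0.42, 0.75]
print(filter_by_score(boxes, scores))
```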