# Torchvision Object Detection Finetuning Tutorial

**References**: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html


For this tutorial, we will be finetuning a pre-trained [Mask R-CNN](https://arxiv.org/abs/1703.06870) model on the [Penn-Fudan Database for Pedestrian Detection and Segmentation](https://www.cis.upenn.edu/~jshi/ped_html/). It contains 170 images with 345 instances of pedestrians, and we will use it to illustrate how to use the new features in torchvision to train an instance segmentation model on a custom dataset.

## Define the Dataset
The reference scripts for training object detection, instance segmentation and person keypoint detection make it easy to add support for new custom datasets. The dataset should inherit from the standard `torch.utils.data.Dataset` class and implement `__len__` and `__getitem__`.

The only specific requirement is that the dataset’s `__getitem__` should return:

- image: a PIL Image of size `(H,W)`
- target: a dict containing the following fields
    - `boxes (FloatTensor[N, 4])`: the coordinates of the `N` bounding boxes in `[x0, y0, x1, y1]` format, ranging from `0` to `W` and `0` to `H`
    - `labels (Int64Tensor[N])`: the label for each bounding box. `0` always represents the background class.
    - `image_id (Int64Tensor[1])`: an image identifier. It should be unique across all the images in the dataset, and is used during evaluation.
    - `area (Tensor[N])`: the area of each bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.
    - `iscrowd (UInt8Tensor[N])`: instances with `iscrowd=True` will be ignored during evaluation.
    - (optionally) `masks (UInt8Tensor[N, H, W])`: the segmentation masks for each one of the objects.
    - (optionally) `keypoints (FloatTensor[N, K, 3])`: for each one of the `N` objects, it contains the `K` keypoints in `[x, y, visibility]` format, defining the object. `visibility=0` means that the keypoint is not visible. Note that for data augmentation, the notion of flipping a keypoint depends on the data representation, and you should probably adapt `references/detection/transforms.py` for your new keypoint representation.
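
To make the role of `area` more concrete, here is a minimal sketch of how boxes could be sorted into size buckets. It assumes the COCO convention of 32² and 96² pixels as the thresholds between small, medium and large objects. The helper names `box_area` and `size_bucket` are hypothetical, and the handling of values that fall exactly on a threshold is an assumption.

```python
def box_area(box):
    """Area of a box given as [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def size_bucket(area):
    """Assign an area (in pixels) to a COCO-style size bucket.

    Assumption: a value exactly on a threshold goes to the larger bucket.
    """
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Hypothetical boxes in [x0, y0, x1, y1] format.
boxes = [[10, 10, 30, 40], [0, 0, 64, 64], [5, 5, 205, 305]]
areas = [box_area(b) for b in boxes]
buckets = [size_bucket(a) for a in areas]
print(areas)    # [600, 4096, 60000]
print(buckets)  # ['small', 'medium', 'large']
```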
    

One note on the `labels`. The model considers class `0` as background. If your dataset does not contain the background class, you should not have `0` in your `labels`.
For example, assuming you have just two classes, cat and dog, you can define `1` (not `0`) to represent cats and `2` to represent dogs. So, for instance, if one of the images has both classes, your labels tensor should look like `[1, 2]`.
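
The cat and dog example can be sketched as a small mapping that reserves index `0` for the background. The names `CLASS_NAMES`, `CLASS_TO_LABEL` and `encode_labels` are hypothetical; in a real dataset the resulting list would then be converted with `torch.as_tensor(labels, dtype=torch.int64)`.

```python
# Hypothetical class list for the cat/dog example. Index 0 is reserved
# for the background class, so real classes start at 1.
CLASS_NAMES = ["cat", "dog"]
CLASS_TO_LABEL = {name: i + 1 for i, name in enumerate(CLASS_NAMES)}


def encode_labels(names):
    """Map class names to integer labels, never producing 0."""
    return [CLASS_TO_LABEL[name] for name in names]


labels = encode_labels(["cat", "dog"])
print(labels)  # [1, 2]

# The number of classes given to the model also counts the background.
num_classes = len(CLASS_NAMES) + 1
print(num_classes)  # 3
```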

Additionally, if you want to use aspect ratio grouping during training (so that each batch only contains images with similar aspect ratios), then it is recommended to also implement a `get_height_and_width` method, which returns the height and the width of the image. If this method is not provided, we query all elements of the dataset via `__getitem__`, which loads each image into memory and is slower than a custom method.
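
As a rough illustration of what aspect ratio grouping means, the sketch below splits sample indices into "wide" and "tall" groups and builds batches that never mix the two. This is a deliberately simplified stand-in, not the grouping logic of the reference scripts. The function names and the `(height, width)` values are hypothetical.

```python
from collections import defaultdict


def group_by_orientation(sizes):
    """Group sample indices by whether the image is wide or tall.

    `sizes` is a list of (height, width) pairs, e.g. what a
    `get_height_and_width` method would return for each index.
    """
    groups = defaultdict(list)
    for idx, (height, width) in enumerate(sizes):
        key = "wide" if width / height >= 1.0 else "tall"
        groups[key].append(idx)
    return dict(groups)


def make_batches(groups, batch_size):
    """Form batches that never mix indices from different groups."""
    batches = []
    for indices in groups.values():
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    return batches


# Hypothetical (height, width) values for five images.
sizes = [(400, 600), (600, 400), (300, 500), (500, 300), (480, 640)]
groups = group_by_orientation(sizes)
batches = make_batches(groups, batch_size=2)
print(groups)   # {'wide': [0, 2, 4], 'tall': [1, 3]}
print(batches)  # [[0, 2], [4], [1, 3]]
```

Because only the height and width are needed here, a cheap `get_height_and_width` avoids loading every image just to build the groups.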


## 1. Write a custom dataset for PennFudan

Let’s write a dataset for the PennFudan dataset. After [downloading and extracting the zip file](https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip), we have the following folder structure:

```console
PennFudanPed/
  PedMasks/
    FudanPed00001_mask.png
    FudanPed00002_mask.png
    FudanPed00003_mask.png
    FudanPed00004_mask.png
    ...
  PNGImages/
    FudanPed00001.png
    FudanPed00002.png
    FudanPed00003.png
    FudanPed00004.png
```

Here is one example of an image and its segmentation mask:

<img src="https://pytorch.org/tutorials/_static/img/tv_tutorial/tv_image01.png">
<img src="https://pytorch.org/tutorials/_static/img/tv_tutorial/tv_image02.png">

So each image has a corresponding segmentation mask, where each color corresponds to a different instance. Let’s write a `torch.utils.data.Dataset` class for this dataset.

In [1]:
import os
import numpy as np
import torch
from PIL import Image

In [2]:
class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        #load all image files, sorting them to
        #ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))
        
    def __getitem__(self, idx):
        #load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        #convert the PIL Image into a numpy array
        mask = np.array(mask)
        #instances are encoded as different colors
        obj_ids = np.unique(mask)
        #first id is the background, so remove it
        obj_ids = obj_ids[1:]
        
        #split the color-encoded mask into a set
        #of binary masks
        masks =  mask == obj_ids[:, None, None]
        
        #get bounding box coordinates for each mask 
        num_objs = len(obj_ids)
        boxes = []
        
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
            
        #convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        #there is only one class
        labels = torch.ones((num_objs, ), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1])*(boxes[:, 2] - boxes[:, 0])
        #suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs, ), dtype=torch.int64)
        
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["iscrowd"] = iscrowd
        target["area"] = area
        target["image_id"]  = image_id
        
        if self.transforms is not None:
            img, target = self.transforms(img, target)
            
        return img, target
    
    def __len__(self):
        return len(self.imgs)
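
To see what the mask-splitting and bounding-box steps in `__getitem__` do, here is a small NumPy-only sketch on a toy mask. The 4x6 array is made up for illustration. As in the dataset class, the sketch assumes that the background id `0` is present, so dropping the first unique value removes it.

```python
import numpy as np

# Toy 4x6 mask: 0 is background, 1 and 2 are two instances.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
])

obj_ids = np.unique(mask)[1:]            # drop the background id
masks = mask == obj_ids[:, None, None]   # one boolean mask per instance

boxes = []
for m in masks:
    ys, xs = np.where(m)                 # row and column indices of the instance
    boxes.append([int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())])

print(masks.shape)  # (2, 4, 6)
print(boxes)        # [[1, 1, 2, 2], [4, 2, 5, 3]]
```

The comparison broadcasts the `(H, W)` mask against the `(num_objs, 1, 1)` ids, which yields one binary mask per instance without an explicit loop.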

## 2. Defining the model

In this tutorial, we will be using [Mask R-CNN](https://arxiv.org/abs/1703.06870), which is built on top of [Faster R-CNN](https://arxiv.org/abs/1506.01497). Faster R-CNN is a model that predicts both bounding boxes and class scores for potential objects in the image.

<img src="https://pytorch.org/tutorials/_static/img/tv_tutorial/tv_image03.png">

Mask R-CNN adds an extra branch to Faster R-CNN, which also predicts segmentation masks for each instance.

<img src="https://pytorch.org/tutorials/_static/img/tv_tutorial/tv_image04.png">

There are two common situations where one might want to modify one of the available models in the torchvision model zoo. The first is when we want to start from a pre-trained model and just finetune the last layer. The other is when we want to replace the backbone of the model with a different one (for faster predictions, for example).


Let’s see how we would do one or the other in the following sections.

### 2.1 Finetuning from a pretrained model
Let’s suppose that you want to start from a model pre-trained on COCO and want to finetune it for your particular classes. Here is a possible way of doing it:

In [3]:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

#load a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

#replace the classifier with a new one that has a user-defined num_classes
num_classes = 2 #1 class (person) + background
#get the number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
#replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes) 

### 2.2 Modifying the model to add a different backbone

In [4]:
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

#load a pre-trained model for classification and return
#only the features
backbone = torchvision.models.mobilenet_v2(pretrained=True).features
#FasterRCNN needs to know the number of output channels in the backbone.
#For mobilenet_v2, it's 1280, so we need to add it here
backbone.out_channels = 1280

#let's make the RPN generate 5x3 anchors per spatial location,
#with 5 different sizes and 3 different aspect ratios.
#We have a Tuple[Tuple[int]] because each feature map could
#potentially have different sizes and aspect ratios
anchor_generator = AnchorGenerator(sizes=((32,64,128,256,512),),
                                  aspect_ratios=((0.5,1.0,2.0),))


# let's define what are the feature maps that we will
# use to perform the region of interest cropping, as well as
# the size of the crop after rescaling.
# if your backbone returns a Tensor, featmap_names is expected to
# be [0] (recent torchvision releases expect the string name, ['0']).
# More generally, the backbone should return an
# OrderedDict[Tensor], and in featmap_names you can choose which
# feature maps to use.
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=[0],
                                               output_size=7,
                                               sampling_ratio=2)

#put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                  num_classes=2,
                  rpn_anchor_generator=anchor_generator,
                  box_roi_pool=roi_pooler)
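
The "5x3 anchors per spatial location" in the comments above can be worked out by hand. The sketch below enumerates the anchor shapes in plain Python, assuming that each anchor keeps an area of `size ** 2` and that an aspect ratio means height divided by width. Both points are assumptions about the convention, so check them against your torchvision version.

```python
import math

sizes = (32, 64, 128, 256, 512)
aspect_ratios = (0.5, 1.0, 2.0)  # assumed to mean height / width

# One (size, ratio, width, height) entry per anchor shape.
anchor_shapes = []
for ratio in aspect_ratios:
    for size in sizes:
        height = size * math.sqrt(ratio)
        width = size / math.sqrt(ratio)
        anchor_shapes.append((size, ratio, width, height))

print(len(anchor_shapes))  # 15

# Show the three shapes generated for the smallest size.
for size, ratio, width, height in anchor_shapes[::5]:
    print(size, ratio, round(width, 1), round(height, 1))
```

With a single feature map, as in the MobileNetV2 backbone above, every spatial location therefore gets 15 candidate boxes.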

## An Instance segmentation model for PennFudan Dataset

In our case, we want to fine-tune from a pre-trained model, given that our dataset is very small, so we will be following approach number 1.

Here we want to also compute the instance segmentation masks, so we will be using Mask R-CNN:

In [5]:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def get_model_instance_segmentation(num_classes):
    #load an instance segmentation model pre-trained on COCO
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    
    #get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    #replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    
    #get the number of input features for the mask classifier
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    #replace the mask predictor with a new one
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask,
                                                      hidden_layer,
                                                      num_classes)
    return model

That’s it, this will make `model` ready to be trained and evaluated on your custom dataset.


## Putting everything together
In `references/detection/`, we have a number of helper functions to simplify training and evaluating detection models. Here, we will use `references/detection/engine.py`, `references/detection/utils.py` and `references/detection/transforms.py`. Just copy them to your folder and use them here.

Let’s write some helper functions for data augmentation / transformation:

In [6]:
from references import transforms as T

def get_transform(train):
    transforms = []
    transforms.append(T.ToTensor())
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
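
The detection transforms take both the image and the target because a geometric augmentation has to update the annotations too. The NumPy sketch below shows the idea for a horizontal flip. The helper `hflip` is hypothetical, and mirroring box coordinates as `width - x` is an assumed convention rather than a copy of `references/detection/transforms.py`.

```python
import numpy as np


def hflip(image, boxes):
    """Flip an (H, W) image array and its [x0, y0, x1, y1] boxes horizontally."""
    width = image.shape[1]
    flipped_image = image[:, ::-1]
    flipped_boxes = boxes.copy()
    # x0 and x1 swap roles after mirroring: x0' = W - x1 and x1' = W - x0.
    flipped_boxes[:, 0] = width - boxes[:, 2]
    flipped_boxes[:, 2] = width - boxes[:, 0]
    return flipped_image, flipped_boxes


image = np.arange(12).reshape(3, 4)        # toy image with H=3, W=4
boxes = np.array([[0.0, 0.0, 1.0, 2.0]])   # one box on the left side
flipped_image, flipped_boxes = hflip(image, boxes)
print(flipped_image[0])   # [3 2 1 0]
print(flipped_boxes)      # [[3. 0. 4. 2.]]
```

Masks would be flipped along the same axis as the image, which is why an image-only transform is not enough for detection and segmentation targets.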

## Testing `forward()` method (Optional)

Before iterating over the dataset, it’s good to see what the model expects during training and inference time on sample data.

In [7]:
from references import utils

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
data_loader = torch.utils.data.DataLoader(
 dataset, batch_size=2, shuffle=True, num_workers=4,
 collate_fn=utils.collate_fn)
# For training
images, targets = next(iter(data_loader))
images = list(image for image in images)
targets = [{k: v for k, v in t.items()} for t in targets]
output = model(images, targets)   # In training mode, returns a dict of losses
# For inference
model.eval()
x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
predictions = model(x)           # Returns predictions

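
The `collate_fn` argument passed to the `DataLoader` matters because detection images usually differ in size and each target holds a different number of boxes, so samples cannot simply be stacked into one tensor. Here is a minimal sketch of such a function, assuming it behaves like the `utils.collate_fn` helper from the reference scripts (check your copy of `utils.py`). Strings and dicts stand in for real tensors.

```python
def collate_fn(batch):
    """Turn [(img0, tgt0), (img1, tgt1)] into ((img0, img1), (tgt0, tgt1))."""
    return tuple(zip(*batch))


# Stand-ins for (image, target) samples of different sizes.
batch = [
    ("image_300x400", {"labels": [1]}),
    ("image_500x400", {"labels": [1, 1]}),
]
images, targets = collate_fn(batch)
print(images)   # ('image_300x400', 'image_500x400')
print(targets)  # ({'labels': [1]}, {'labels': [1, 1]})
```

The model then receives a list of images and a list of target dicts, which matches how `model(images, targets)` is called in the cell above.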
