<H2 style="text-align: center">Object Detection on PASCAL VOC 2007</H2>
<H3 style="text-align: center">Multi-Class Setting</H3>

For this exercise, we will fine-tune a pre-trained [Faster R-CNN](https://arxiv.org/abs/1506.01497) model on the [PASCAL VOC 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/) dataset for multi-object detection. The dataset contains 9,963 images covering 20 object classes: *aeroplane*, *bicycle*, *bird*, *boat*, *bottle*, *bus*, *car*, *cat*, *chair*, *cow*, *diningtable*, *dog*, *horse*, *motorbike*, *person*, *pottedplant*, *sheep*, *sofa*, *train*, *tvmonitor*. However, for this illustration we will use only a subset of the dataset. We also need `pycocotools`, which computes the evaluation metrics following the [Microsoft COCO](https://cocodataset.org/#home) protocol based on intersection over union (IoU). A more detailed and fully featured object detection library can be found [here](https://github.com/facebookresearch/detectron2).
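
To make the intersection-over-union criterion concrete, here is a minimal pure-Python sketch for two boxes in `[xmin, ymin, xmax, ymax]` format. The function name `iou_xyxy` is ours and the code is only an illustration; `pycocotools` computes IoU internally with its own conventions.

```python
def iou_xyxy(box_a, box_b):
    """IoU of two axis-aligned boxes given as [xmin, ymin, xmax, ymax]."""
    # corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# two 10x10 boxes overlapping in a 5x5 square: IoU = 25 / 175
print(iou_xyxy([0, 0, 10, 10], [5, 5, 15, 15]))
```

A detection is typically counted as correct when its IoU with a ground-truth box of the same class exceeds a chosen threshold.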

###Imports

In [None]:
import os
import numpy as np
import torch
import torch.utils.data
from PIL import Image
import pandas as pd
import torchvision
import pycocotools
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

###Dataset
We download both the `trainval` and the `test` archives. PASCAL VOC provides a separate `test` split, which is generally used for comparing detection methods, so we train on `trainval` and evaluate on `test`. Please have a look at the [PASCAL VOC 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/) dataset webpage for more details.

In [None]:
if not os.path.exists('VOCtrainval_06-Nov-2007.tar'):
    !wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
if not os.path.exists('VOCtest_06-Nov-2007.tar'):
    !wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
# start from a clean directory (-f: no error if it does not exist yet)
!rm -rf VOC2007/
!mkdir -p VOC2007/train/ VOC2007/test/
!tar -xf VOCtrainval_06-Nov-2007.tar --directory VOC2007/train/
!tar -xf VOCtest_06-Nov-2007.tar --directory VOC2007/test/
!mv VOC2007/train/VOCdevkit/VOC2007/* VOC2007/train/
!rm -r VOC2007/train/VOCdevkit/
!mv VOC2007/test/VOCdevkit/VOC2007/* VOC2007/test/
!rm -r VOC2007/test/VOCdevkit/
class_str2num = {'aeroplane': 1, 'bicycle': 2, 'bird': 3, 'boat': 4, 'bottle': 5,
                 'bus': 6, 'car': 7, 'cat': 8, 'chair': 9, 'cow': 10, 
                 'diningtable': 11, 'dog': 12, 'horse': 13, 'motorbike': 14, 
                 'person': 15, 'pottedplant': 16, 'sheep': 17, 'sofa': 18,
                 'train': 19, 'tvmonitor': 20}
class_num2str = {v: k for k, v in class_str2num.items()}
antns = sorted(os.listdir('VOC2007/train/Annotations'))

###Annotations
Detection and segmentation datasets come with annotation files that describe each object, for example its location (bounding box) or its pixel-wise mask. Below is an example of a PASCAL VOC annotation file.
```
<annotation>
	<folder>VOC2007</folder>
	<filename>000005.jpg</filename>
	<source>
		<database>The VOC2007 Database</database>
		<annotation>PASCAL VOC2007</annotation>
		<image>flickr</image>
		<flickrid>325991873</flickrid>
	</source>
	<owner>
		<flickrid>archintent louisville</flickrid>
		<name>?</name>
	</owner>
	<size>
		<width>500</width>
		<height>375</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>chair</name>
		<pose>Rear</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<bndbox>
			<xmin>263</xmin>
			<ymin>211</ymin>
			<xmax>324</xmax>
			<ymax>339</ymax>
		</bndbox>
	</object>
	...
	<object>
		<name>chair</name>
		<pose>Unspecified</pose>
		<truncated>1</truncated>
		<difficult>1</difficult>
		<bndbox>
			<xmin>277</xmin>
			<ymin>186</ymin>
			<xmax>312</xmax>
			<ymax>220</ymax>
		</bndbox>
	</object>
</annotation>
```

###XML Parser
The following parser function reads an `XML` annotation file and returns the image filename, the bounding boxes, and the class labels.

In [None]:
import xml.etree.ElementTree as ET
def parse_xml(xml_file):
    """Return (filename, bboxes, labels) parsed from a PASCAL VOC annotation file."""
    tree = ET.parse(xml_file)
    root = tree.getroot()
    # read the filename once, so it is defined even if the file lists no objects
    filename = root.find('filename').text
    bboxes = []
    labels = []
    for obj in root.iter('object'):
        ymin = int(obj.find("bndbox/ymin").text)
        xmin = int(obj.find("bndbox/xmin").text)
        ymax = int(obj.find("bndbox/ymax").text)
        xmax = int(obj.find("bndbox/xmax").text)
        # boxes are stored as [xmin, ymin, xmax, ymax]
        bboxes.append([xmin, ymin, xmax, ymax])
        # map the class name to its integer label
        labels.append(class_str2num[obj.find("name").text])
    return filename, bboxes, labels
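
The annotation above also carries `difficult` and `truncated` flags that `parse_xml` ignores. As an illustration only, the sketch below parses a shortened in-memory annotation with `ET.fromstring` and optionally skips objects marked `difficult`. The helper name `parse_objects` and the `skip_difficult` parameter are hypothetical and not part of this notebook.

```python
import xml.etree.ElementTree as ET

# shortened annotation modelled on the example shown earlier
SAMPLE_XML = """<annotation>
    <filename>000005.jpg</filename>
    <object>
        <name>chair</name>
        <difficult>0</difficult>
        <bndbox><xmin>263</xmin><ymin>211</ymin><xmax>324</xmax><ymax>339</ymax></bndbox>
    </object>
    <object>
        <name>chair</name>
        <difficult>1</difficult>
        <bndbox><xmin>277</xmin><ymin>186</ymin><xmax>312</xmax><ymax>220</ymax></bndbox>
    </object>
</annotation>"""

def parse_objects(xml_text, skip_difficult=False):
    """Return a list of (class_name, [xmin, ymin, xmax, ymax]) tuples."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter('object'):
        # the flag is stored as the text "0" or "1"
        difficult = int(obj.find('difficult').text)
        if skip_difficult and difficult:
            continue
        box = [int(obj.find('bndbox/' + tag).text)
               for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
        objects.append((obj.find('name').text, box))
    return objects

print(parse_objects(SAMPLE_XML))                       # both chairs
print(parse_objects(SAMPLE_XML, skip_difficult=True))  # only the first chair
```

Whether to train on `difficult` objects is a design choice; this notebook keeps all of them.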

###Example of Images

In [None]:
from PIL import Image, ImageDraw
idx = np.random.randint(len(antns))
filename, boxes, labels = parse_xml(os.path.join('VOC2007/train/Annotations', antns[idx]))
image = Image.open(os.path.join('VOC2007/train/JPEGImages',filename))
draw = ImageDraw.Draw(image)
for i, ibox in enumerate(boxes):
    draw.rectangle([(ibox[0], ibox[1]), (ibox[2], ibox[3])], outline='red', width=3)
    draw.text((ibox[0], ibox[1]), text = class_num2str[labels[i]])
image

Clone the PyTorch vision repository and copy the reference detection scripts, which implement the training and evaluation loops, the transformations, and related helper utilities.

In [None]:
!git clone https://github.com/pytorch/vision.git
!cp vision/references/detection/utils.py ./
!cp vision/references/detection/transforms.py ./
!cp vision/references/detection/coco_eval.py ./
!cp vision/references/detection/engine.py ./
!cp vision/references/detection/coco_utils.py ./
from engine import train_one_epoch, evaluate
import utils
import transforms as T

###Data Loader
An example of a PyTorch dataset class for object detection can be found [here](https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection/blob/master/datasets.py) or [here](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html). A simplified version is given below.

In [None]:
class PASCALVOCDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        self.antns = sorted(os.listdir(os.path.join(root, 'Annotations')))

    def __getitem__(self, idx):
        # load annotation
        filename, boxes, labels = parse_xml(os.path.join(self.root, 'Annotations', self.antns[idx]))
        # load image
        img_path = os.path.join(self.root, 'JPEGImages', filename)
        img = Image.open(img_path).convert('RGB')
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        num_objs = boxes.shape[0]
        # classes
        labels = torch.tensor(labels, dtype=torch.int64)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
        target = {}
        target['boxes'] = boxes
        target['labels'] = labels
        target['image_id'] = image_id
        target['area'] = area
        target['iscrowd'] = iscrowd
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target
        
    def __len__(self):
        return len(self.antns)

###Data Instance

In [None]:
dataset = PASCALVOCDataset(root='VOC2007/train')
dataset[0]

###Model
In this workshop, we will use [Faster R-CNN](https://arxiv.org/abs/1506.01497), a model that predicts both bounding boxes and class scores for potential objects in the image. The detection models available in PyTorch can be found [here](https://pytorch.org/vision/stable/models.html#object-detection-instance-segmentation-and-person-keypoint-detection).

![Faster R-CNN](https://raw.githubusercontent.com/pytorch/vision/temp-tutorial/tutorials/tv_image03.png)

Check the PyTorch implementation of Faster R-CNN with a ResNet-50 backbone [here](https://github.com/pytorch/vision/blob/master/torchvision/models/detection/faster_rcnn.py).

In [None]:
def get_model(num_classes):
    # load an object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

###Transformations
We only use the random horizontal flip transformation. Other transformations, such as affine transformations, can also be used.

In [None]:
def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
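
A horizontal flip for detection has to mirror the ground-truth boxes together with the image. The sketch below shows only the underlying coordinate arithmetic on plain Python lists; the function name `hflip_boxes` is ours, and this is not the reference implementation of `T.RandomHorizontalFlip`.

```python
def hflip_boxes(boxes, image_width):
    """Mirror [xmin, ymin, xmax, ymax] boxes around the vertical centre line."""
    flipped = []
    for xmin, ymin, xmax, ymax in boxes:
        # the old right edge becomes the new left edge and vice versa,
        # which keeps xmin <= xmax after the flip
        flipped.append([image_width - xmax, ymin, image_width - xmin, ymax])
    return flipped

# the first chair of the example annotation, in a 500-pixel-wide image
print(hflip_boxes([[263, 211, 324, 339]], image_width=500))
```

The y-coordinates are unchanged, and flipping twice returns the original boxes.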

###Initialization of Dataset and Data Loader

In [None]:
# use our dataset and defined transformations
train_dataset = PASCALVOCDataset(root= 'VOC2007/train', transforms = get_transform(train=True))
test_dataset = PASCALVOCDataset(root= 'VOC2007/test', transforms = get_transform(train=False))
# shuffle the indices of both datasets with a fixed seed
torch.manual_seed(1)
train_indices = torch.randperm(len(train_dataset)).tolist()
test_indices = torch.randperm(len(test_dataset)).tolist()
# keep only the first n shuffled examples of each split
# because of limited computational resources, we use just 100 samples per split.
# Please feel free to use more samples if you have enough resources
n = 100
train_dataset = torch.utils.data.Subset(train_dataset, train_indices[:n])
test_dataset = torch.utils.data.Subset(test_dataset, test_indices[:n])
# define training and test data loaders
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True, num_workers=2, collate_fn=utils.collate_fn)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, shuffle=False, num_workers=2, collate_fn=utils.collate_fn)
print("We use {} examples: {} for training and {} for testing".format(len(train_dataset) + len(test_dataset), len(train_dataset), len(test_dataset)))
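
The loaders above pass `utils.collate_fn` because the images in a detection batch have different sizes and different numbers of boxes, so the default collation cannot stack them into one tensor. As far as we can tell, the reference helper simply transposes the batch with `tuple(zip(*batch))`. The stand-alone sketch below reproduces that idea on plain Python objects; the name `collate_as_tuples` and the toy samples are ours.

```python
def collate_as_tuples(batch):
    """Turn [(img1, tgt1), (img2, tgt2), ...] into ((img1, img2, ...), (tgt1, tgt2, ...))."""
    return tuple(zip(*batch))

# toy samples: strings stand in for image tensors of different sizes
batch = [
    ("image_375x500", {"boxes": [[263, 211, 324, 339]], "labels": [9]}),
    ("image_333x500", {"boxes": [[10, 20, 30, 40], [50, 60, 70, 80]], "labels": [15, 12]}),
]
images, targets = collate_as_tuples(batch)
print(images)                    # a tuple holding the two images
print(len(targets[1]["boxes"]))  # the second image keeps its two boxes
```

Keeping images and targets as tuples lets the model receive a list of differently sized images, which matches how `model([img])` is called later in this notebook.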

###Initialization of Model and Optimizer
A learning rate scheduler is recommended for this training; more details can be found [here](https://pytorch.org/docs/stable/optim.html).

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# PASCAL VOC has 20 object classes; add 1 for the background class (label 0)
num_classes = len(class_str2num) + 1
# get the model using our helper function
model = get_model(num_classes)
# move model to the right device
model.to(device)
# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
# and a learning rate scheduler which decreases the learning rate by 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
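
To see what this step schedule implies, the following pure-Python sketch computes the learning rate in effect during each epoch, assuming the scheduler is stepped once at the end of every epoch and ignoring any warm-up. The function name `step_lr` is ours; it mirrors the arithmetic only and does not call PyTorch.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate in effect during `epoch` (0-based) under a step decay schedule."""
    # the rate is multiplied by gamma once for every completed block of step_size epochs
    return base_lr * gamma ** (epoch // step_size)

for epoch in range(10):
    print(epoch, step_lr(0.005, epoch))
```

With the values used here, epochs 0 to 2 run at 0.005, epochs 3 to 5 at roughly 0.0005, and so on, so the last of 10 epochs uses a rate about 1000 times smaller than the first.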

###Train and Test Loop
Read more about the `train_one_epoch()` and `evaluate()` functions [here](https://github.com/pytorch/vision/tree/master/references/detection). Details on the evaluation metrics, their parameters, and the printed output can be found [here](https://cocodataset.org/#detection-eval). Learning rate warm-up is also recommended for this training; more details can be found [here](https://pytorch.org/docs/stable/optim.html).

In [None]:
# let's train it for 10 epochs
num_epochs = 10
for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, train_loader, device, epoch, print_freq=10)
    # evaluate on the test dataset
    evaluate(model, test_loader, device=device)
    # update the learning rate
    lr_scheduler.step()
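
Learning rate warm-up ramps the learning rate up from a small fraction of its target value over the first iterations, which tends to stabilise the start of fine-tuning. Below is a minimal sketch of a linear ramp. The function name and the example values (`start_factor=0.001`, `warmup_iters=50`) are our assumptions for illustration, not settings taken from this notebook.

```python
def linear_warmup_factor(step, warmup_iters, start_factor=0.001):
    """Multiplier applied to the base learning rate at a given iteration."""
    if step >= warmup_iters:
        # warm-up finished: use the full base learning rate
        return 1.0
    # linear interpolation between start_factor and 1.0
    alpha = step / warmup_iters
    return start_factor + (1.0 - start_factor) * alpha

base_lr = 0.005
for step in (0, 25, 50, 100):
    print(step, base_lr * linear_warmup_factor(step, warmup_iters=50))
```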

###Qualitative Results

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10.0, 10.0)
plt.rcParams['figure.dpi'] = 72
# move the model to the CPU and put it in evaluation mode
model = model.cpu()
model.eval()
for i in range(len(test_dataset)):
    img, target = test_dataset[i]
    label_boxes = target['boxes'].numpy()
    # run inference without tracking gradients
    with torch.no_grad():
        prediction = model([img])
    image = Image.fromarray(img.mul(255).permute(1, 2, 0).byte().numpy())
    draw = ImageDraw.Draw(image)
    # draw the ground-truth boxes in green
    for gt_box in label_boxes:
        draw.rectangle([(gt_box[0], gt_box[1]), (gt_box[2], gt_box[3])], outline='green', width=3)
    # draw the predicted boxes with a score above 0.6 in red
    for element in range(len(prediction[0]['boxes'])):
        box = prediction[0]['boxes'][element].cpu().numpy()
        label = prediction[0]['labels'][element].cpu().item()
        score = np.round(prediction[0]['scores'][element].cpu().numpy(), decimals=4)
        if score > 0.6:
            draw.rectangle([(box[0], box[1]), (box[2], box[3])], outline='red', width=3)
            draw.text((box[0], box[1]), text=class_num2str[label] + ', ' + str(score))
    plt.imshow(image)
    plt.show()
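
The loop above draws only predictions whose score exceeds 0.6. The same filtering step can be sketched on its own, here with plain Python lists standing in for the tensors the model returns. The helper name `filter_by_score` and the prediction values are made up for illustration.

```python
def filter_by_score(prediction, threshold=0.6):
    """Keep only the detections whose score exceeds `threshold`."""
    keep = [i for i, score in enumerate(prediction["scores"]) if score > threshold]
    # apply the same selection to boxes, labels and scores
    return {key: [values[i] for i in keep] for key, values in prediction.items()}

# made-up prediction; labels follow class_str2num (9: chair, 15: person)
prediction = {
    "boxes": [[263, 211, 324, 339], [5, 244, 67, 374], [100, 50, 200, 150]],
    "labels": [9, 9, 15],
    "scores": [0.93, 0.41, 0.75],
}
print(filter_by_score(prediction))
```

Raising the threshold shows fewer but more confident boxes, while lowering it shows more boxes at the cost of more false positives.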