LINK TO THE NOTEBOOK:

[shorturl.at/aQSTY](shorturl.at/aQSTY)

Seyed Majid Azimi, DLR, IMF-PBA
seyedmajid.azimi@dlr.de

# Object Detection Tutorial using Deep Learning Methods with PyTorch
#### Honorable mentions and credit: Machine-Vision Research Group @Fractal Analytics, Ross Girshick, Kaiming He, Detectron2

In this tutorial, we explain recent milestones in another computer vision application: object detection. We cover the core ideas of recent algorithms and the implementation of some of their parts. At the end, we apply one of them to several test images.


# Image Classification vs. Object Detection


Image Classification is a problem where we assign a class label to an input image. For example, given an input image of a cat, the output of an image classification algorithm is the label “Cat”.

In object detection, we are not only interested in what objects are in the input image, but we are also interested in where they are located.

![imageclassificationvsobjectdetection](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/image-classification-vs-object-detection.png)

The figure above illustrates the difference between image classification and object detection.


# 1.1. Image Classification vs Object Detection : Which one to use?


In most applications where there is more than one object in the input image, we need to find the locations of the objects and then classify them. We use an object detection algorithm in such cases.

Object detection can be hundreds of times slower than image classification, and therefore, in applications where the location of the object in the image is not important, we use image classification.

# Object Detection

The object detection task is composed of two steps, object localization and object classification. 

Object localization draws a bounding box around the region of the object in the image, and object classification assigns a class to the object inside that bounding box. The output of an object detection algorithm is therefore, for each detected object, 4 floating-point values describing the bounding box coordinates and the corresponding class from a list of known classes.
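To make the output format concrete, here is a minimal sketch of how one detection could be stored. The `Detection` container, its field names and the example values are hypothetical and only serve as an illustration; real frameworks usually return tensors instead.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical container, not a library type)."""
    x1: float   # left edge of the bounding box, in pixels
    y1: float   # top edge
    x2: float   # right edge
    y2: float   # bottom edge
    label: str  # class name taken from the list of known classes

    def area(self) -> float:
        return (self.x2 - self.x1) * (self.y2 - self.y1)


# Made-up example values: two objects found in one image.
detections = [
    Detection(48.0, 30.0, 208.0, 180.0, "cat"),
    Detection(220.0, 60.0, 300.0, 140.0, "dog"),
]
areas = [d.area() for d in detections]
print(areas)
```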
  



## Sliding Window Approach
Traditionally, object detection was implemented with simple template-matching techniques. In these methods, target objects were cropped and described with hand-engineered descriptors such as HOG and SIFT. A sliding window was then moved over the image, and each location was compared with a database of object feature vectors. Later algorithms replaced these databases with trained classifiers such as SVMs. Since objects appear at different sizes, different window sizes and image sizes (image pyramids) were used. These complex pipelines partly solved the object detection problem but had many drawbacks: they were computationally expensive, and hand-engineered features such as HOG and SIFT were not highly accurate.
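The sliding-window and image-pyramid idea can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions: the helper names, the window size and the stride are made up, and the pyramid uses naive subsampling without smoothing.

```python
import numpy as np


def sliding_windows(image, window=(64, 64), stride=32):
    """Yield (x, y, crop) for every window position that fits inside the image."""
    win_h, win_w = window
    img_h, img_w = image.shape[:2]
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]


def image_pyramid(image, scale=2, min_size=64):
    """Yield the image at successively smaller sizes (naive subsampling)."""
    while min(image.shape[:2]) >= min_size:
        yield image
        image = image[::scale, ::scale]


image = np.zeros((256, 256), dtype=np.float32)
# Number of crops a classifier would have to score at each pyramid level.
counts = [sum(1 for _ in sliding_windows(level)) for level in image_pyramid(image)]
print(counts)
```

Every crop would have to be described and classified separately, which is why these pipelines were slow.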



## Region Proposals with CNN (R-CNN)
With the advent of deep learning in machine vision and the staggering results these algorithms achieved in image classification challenges in 2012, researchers started looking at deep-learning solutions for object detection. One of the first important algorithms to solve object detection using deep learning was R-CNN (Region proposals with CNN).

R-CNN was the first paper to show that a CNN can lead to dramatically higher object detection performance. It achieved 58.5% mAP on the VOC 2007 dataset.

In R-CNN, an image processing technique is used to make a list of proposed regions in the input image, which are then sent through the network for classification. This is computationally more efficient than the sliding-window approach, because the network only classifies a comparatively small number of crops that may contain an object.
  
  ![R-CNN](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/RCNN.png)

  Image Source :  [Ross Girshick et al](https://arxiv.org/pdf/1311.2524.pdf)


To build an R-CNN object-detection pipeline, the following approach can be used:

*    Start with a standard network such as VGG or ResNet pre-trained on ImageNet; this network acts as a feature extractor for the image. The class-specific classification layer is removed and the bottleneck layer is used to extract features. This gives a 4096-dim vector for each region proposal. The original R-CNN paper used AlexNet and also reports results with VGG-16.

![RCNN WARPED OBJECTS](https://miro.medium.com/proxy/1*0S3dZGri0zAbt4tFSn5k0w.png)

*    The network is fine-tuned on these warped region images. For a dataset such as Pascal VOC, a 20+1 way (n_class + background) classification layer is placed at the end. Each mini-batch contains 32 positive windows and 96 background windows. A region proposal is positive if it has an IoU ≥ 0.5 with a ground-truth box. These 128 (32+96) proposals are usually drawn from 2 images (batch size). Since selective search generates ~2k proposals per image, positive and background proposals are sampled separately.
*    After fine-tuning, each proposal is sent through the network to obtain a 4096-dim vector. For SVM training, the authors ran a grid search over the IoU threshold that separates negative samples; an IoU of 0.3 worked well for them.
*    For each object class, a one-versus-rest SVM classifier is trained. Hard negative mining can be used to improve the classification accuracy.
*    Bounding box regressors are also trained to reduce localization errors. They are applied to a proposal once the class-specific SVM has classified the object.
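The IoU (intersection over union) criterion used above to pick positive proposals can be sketched as follows. This is a minimal NumPy illustration, assuming boxes in (x1, y1, x2, y2) format; the box values are made up.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Width and height of the intersection, clipped at zero for disjoint boxes.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


gt_box = np.array([50., 50., 150., 150.])
proposals = np.array([[60., 60., 160., 160.],   # shifted copy of the ground truth
                      [0., 0., 40., 40.],       # no overlap at all
                      [50., 50., 150., 150.]])  # perfect match
overlaps = iou(gt_box, proposals)
is_positive = overlaps >= 0.5  # fine-tuning rule described above
print(overlaps, is_positive)
```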
    
    
Testing the algorithm

*    For testing, R-CNN generates around 2000 category-independent region proposals from the input image, extracts a fixed-length feature vector for each proposal using the CNN, and then classifies each region with category-specific linear SVMs. This gives a class-specific score for each proposal; the scored proposals are then passed to a non-maximum suppression algorithm.
*    Test-time inference takes 13 s per image on a GPU and 53 s per image on a CPU.

The major problems with this approach are:

*    Sending ~2000 proposals through the neural network, which brings test time to 13 s per image on a GPU.
*    A complex pipeline for training and inference, with no end-to-end training. The neural network is trained separately, and the SVM classifiers are trained individually.
    
    
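  R-CNN is better than the sliding-window approach, but it is still computationally expensive because the network has to classify every region proposal separately. Inference on a single image takes many seconds: about 13 s with the setup above, and several tens of seconds with deeper backbones.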
  



  
## Fast R-CNN

This paper provided break-through results and has set the standards for the approaches that followed it. Major differences brought by Fast R-CNN:

![Fast R-CNN](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/Fast-RCNN.png)
  Image Source : [Ross Girshick](https://arxiv.org/pdf/1504.08083.pdf)
  
*    Removed the SVM classifiers and attached a classification layer and a regression layer to the network, trained with a multi-task loss function. Training and inference thus become a single stage.
*    Replaced the ~2000 CNN forward passes (one per object proposal) with a single pass over the whole image, followed by an RoI pooling layer.
*    Used VGGNet as the backbone rather than AlexNet.
*    Introduced a loss function that is less sensitive to outliers.

The Fast R-CNN network works in the following way:

*    The network processes the whole image to produce a convolutional feature map. Then, for each object proposal, the RoI pooling layer extracts a fixed-length feature vector, which is passed to the subsequent FC layers.
*    The FC layers branch into two sibling output layers: one estimates softmax probabilities over K+1 object classes, and the other produces the refined bounding box positions.
*    The 2000 proposals of an image are not passed through the network one by one as in R-CNN. Instead, the image is passed only once and the computed features are shared across the ~2000 proposals, in the same way as in SPP-Net.
*    The RoI pooling layer does max pooling in each sub-window of approximate size h/H x w/W, where h x w is the size of the RoI and H and W are hyper-parameters (the size of the output grid). It is a special case of the SPP layer with one pyramid level.
*    The outputs of the two sibling layers are used to calculate a multi-task loss on each labeled RoI, so that classification and bounding-box regression are trained jointly.
*    A smooth L1 loss is used for bounding box regression, as opposed to the L2 loss in R-CNN and SPP-Net, which is more sensitive to outliers.
    
    
In short, in Fast R-CNN, rather than classifying each region proposal separately, the input image is sent through the CNN once to produce a feature map. Region proposals are still used, but they are now projected onto this feature map, and the features inside each projected region are classified. This reduces computation because the convolutional layers are shared across all proposals of the image.

Credit to [SPP](https://arxiv.org/abs/1512.02325) for the improvement on R-CNN


### RoI Pooling Network
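
RoI pooling, as described above, max-pools each region of interest into a fixed H x W grid. The following is a minimal NumPy sketch for a single RoI, not the implementation used in Fast R-CNN. The helper name `roi_max_pool` and the toy feature map are assumptions for illustration, and the RoI is assumed to be at least as large as the output grid.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a (C, H, W) feature map into a (C, out_h, out_w) grid.

    roi is (x1, y1, x2, y2) in feature-map coordinates, end-exclusive.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    region = feature_map[:, y1:y2, x1:x2]
    # Split the region into out_h x out_w sub-windows of roughly h/H x w/W cells.
    y_edges = np.linspace(0, region.shape[1], out_h + 1).astype(int)
    x_edges = np.linspace(0, region.shape[2], out_w + 1).astype(int)
    pooled = np.zeros((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = region[:, y_edges[i]:y_edges[i + 1], x_edges[j]:x_edges[j + 1]]
            pooled[:, i, j] = window.max(axis=(1, 2))
    return pooled


# Toy 1-channel 6x6 feature map with values 0..35.
feature_map = np.arange(36, dtype=np.float32).reshape(1, 6, 6)
pooled = roi_max_pool(feature_map, roi=(0, 0, 4, 6))
print(pooled[0])
```

Whatever the size of the RoI, the output always has the same shape, which is what allows the FC layers to follow.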




## Multi-task Loss Function

## Loss Functions

Regression uses the smooth L1 loss, which is less sensitive to outliers than the L2 loss. The same loss is used in all the frameworks discussed in this tutorial.

![smooth L1 loss](https://miro.medium.com/proxy/1*ct5e8rEJYIK4SJxPTYEFWA.png)


The classification loss in Fast R-CNN is the cross entropy loss. The focal loss, introduced later in the RetinaNet paper, is a reshaped version of the cross entropy loss that addresses class imbalance.

Let's look at how the focal loss is designed. We first look at the binary cross entropy loss for single-object classification.

Cross entropy loss:

<center>
  
![cross-entropy](https://miro.medium.com/proxy/1*0bxc7T4lRrcRlYIFeiIbeg.png)
  
</center>

Further, more advanced information:

Focal loss:

<center>

  ![Focal loss](https://miro.medium.com/proxy/1*iD5yJGfE_odwUaYEgAqOrA.png)

</center>


*    Here p_{t} is p if y=1 and 1-p otherwise, where p is the predicted probability of the object class. alpha is called the balancing parameter and gamma is called the focusing parameter: a larger gamma down-weights the loss of well-classified examples more strongly.
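
To see the effect of the two parameters, here is a small NumPy sketch of the binary cross entropy and the focal loss, written from the definition of p_{t} above. The defaults alpha=0.25 and gamma=2 are the values reported in the RetinaNet paper; the function names and the example probabilities are made up.

```python
import numpy as np


def binary_cross_entropy(p, y):
    """CE(p_t) = -log(p_t), with p_t = p if y == 1 and 1 - p otherwise."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return -np.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


# Two positive examples: one easy (p = 0.9) and one hard (p = 0.1).
p = np.array([0.9, 0.1])
y = np.array([1, 1])
ce = binary_cross_entropy(p, y)
fl = focal_loss(p, y)
print(fl / ce)  # how strongly each example is down-weighted relative to CE
```

The easy example is down-weighted far more than the hard one, so training focuses on the hard examples.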

### Final step: 
In RetinaNet, the detector trained with the focal loss, at most 1000 top-scoring predictions from each feature scale are kept after thresholding detector confidence at 0.05. The top predictions from all levels are merged and non-maximum suppression with a threshold of 0.5 is applied to yield the final detections.


In [0]:
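## Smooth L1 Loss
import torch
import torch.nn.functional as F

def _smooth_l1_loss(bbox_pred, bbox_targets, bbox_inside_weights, bbox_outside_weights, sigma=1.0, dim=[1]):
    # Quadratic for |x| < 1/sigma^2 and linear otherwise, so large errors
    # (outliers) contribute smaller gradients than with an L2 loss.
    # `dim` is unused here; it is kept so that existing calls keep working.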
    sigma_2 = sigma ** 2
    box_diff = bbox_pred - bbox_targets
    in_box_diff = bbox_inside_weights * box_diff
    abs_in_box_diff = torch.abs(in_box_diff)
    
    
    smoothL1_sign = (abs_in_box_diff < 1. / sigma_2).detach().float()
    
    in_loss_box = torch.pow(in_box_diff, 2) * (sigma_2 / 2.) * smoothL1_sign \
                  + (abs_in_box_diff - (0.5 / sigma_2)) * (1. - smoothL1_sign)
    
    
    out_loss_box = bbox_outside_weights * in_loss_box
    loss_box = out_loss_box

    s = loss_box.size(0)
    loss_box = loss_box.view(s, -1).sum(1).mean()
    # for i in sorted(dim, reverse=True):
    #   loss_box = loss_box.sum(i)
    # loss_box = loss_box.mean()
    return loss_box

# ROI Network
# cls_score, bbox_pred and the rois_* tensors come from the RoI head of the network.
# loss (cross entropy) for object classification
RCNN_loss_cls = F.cross_entropy(cls_score, rois_label)

# loss (smooth L1) for bounding box regression
RCNN_loss_bbox = _smooth_l1_loss(bbox_pred, rois_target, rois_inside_ws, rois_outside_ws)

# multi-task loss: classification and regression are trained jointly
Fast_RCNN_loss = RCNN_loss_cls.mean() + RCNN_loss_bbox.mean()



  

## Faster R-CNN

Faster R-CNN remains a widely used choice among researchers and has achieved strong results. When it was published, it surpassed the previous results in terms of both accuracy and speed. It also introduced techniques that became standard in the frameworks that followed. Let's have a deeper look at these methods.

Changes to the Fast R-CNN

*    Removed Selective search and added a Deep Neural network for generating proposals (RPN network)
*    Introduced anchor boxes

Anchor boxes have been common in most frameworks since then. The RPN can work as a single-class object detector on its own or generate proposals for the Fast R-CNN network. With it, the traditional computer vision techniques are removed completely, leaving a fully fledged deep neural network that is trained end to end.


  The idea of Faster R-CNN is to use a CNN to propose potential regions of interest; this network is called the Region Proposal Network (RPN). After the region proposals are obtained, the rest works just like Fast R-CNN: every proposed region is classified.
![Faster R-CNN](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/Faster-RCNN.png)

![Anchor](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/anchor.png)
Image Source : [Shaoqing Ren et al](https://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf)

# Putting it all together
![-Fast-er- R-CNN FPN](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/whole2.jpg)

## Overall procedure

In [0]:
## For information - do not execute
# `optimizer`, `lr`, `dataset`, `sampler_batch` and the input tensors
# (im_data, im_info, gt_boxes, num_boxes) are assumed to be created beforehand.
# 1. getting input arguments for Pascal VOC dataset
args = parse_args()
# 2. getting VOC dataset names and region of interests and image database.
imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdb_name)
# 3. data loader that yields batches of images with their ground-truth boxes
dataloader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size,
                                         sampler=sampler_batch, num_workers=args.num_workers)
# 4. faster rcnn backbone of ResNet101
fasterRCNN = resnet(imdb.classes, 101, pretrained=True, class_agnostic=args.class_agnostic)
# 5. training loop
for epoch in range(args.start_epoch, args.max_epochs + 1):
    # 5.1 switch the network to training mode
    fasterRCNN.train()
    # 5.2 decay the learning rate every (lr_decay_step + 1) epochs
    if epoch % (args.lr_decay_step + 1) == 0:
        adjust_learning_rate(optimizer, args.lr_decay_gamma)
        lr *= args.lr_decay_gamma
    # 5.3 iterate over the mini-batches
    for data in dataloader:
        # 5.3.1 copy the batch into the pre-allocated input tensors
        im_data.resize_(data[0].size()).copy_(data[0])
        im_info.resize_(data[1].size()).copy_(data[1])
        gt_boxes.resize_(data[2].size()).copy_(data[2])
        num_boxes.resize_(data[3].size()).copy_(data[3])

        fasterRCNN.zero_grad()
        # 5.3.2 forward pass: returns proposals, predictions and the four losses
        rois, cls_prob, bbox_pred, \
        rpn_loss_cls, rpn_loss_box, \
        RCNN_loss_cls, RCNN_loss_bbox, \
        rois_label = fasterRCNN(im_data, im_info, gt_boxes, num_boxes)
        # 5.3.3 total loss = RPN losses + RoI head losses
        loss = rpn_loss_cls.mean() + rpn_loss_box.mean() \
            + RCNN_loss_cls.mean() + RCNN_loss_bbox.mean()
        # 5.3.4 backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


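def combined_roidb(imdb_names, training=True):
    # one roidb (list of per-image annotations) per dataset name
    roidbs = [get_roidb(s) for s in imdb_names.split('+')]
    roidb = roidbs[0] # len(roidb) = 1002

    if len(roidbs) > 1:
        # merging several datasets is left out of this walkthrough
        raise NotImplementedError("combining several datasets is not covered here")
    else:
        imdb = get_imdb(imdb_names)

    if training:
        roidb = filter_roidb(roidb)

    ratio_list, ratio_index = rank_roidb_ratio(roidb)

    return imdb, roidb, ratio_list, ratio_index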

def get_roidb(imdb_name):
    imdb = get_imdb(imdb_name)

    imdb.set_proposal_method(cfg.TRAIN.PROPOSAL_METHOD)

    roidb = get_training_roidb(imdb)
    return roidb

def get_imdb(name):
    return __sets[name]()
    #>>> imdb: <datasets.pascal_voc.pascal_voc object at 0x7f60e737d860>


## Faster RCNN Loss Function

In [0]:
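## For information - do not execute
# RPN: returns the proposals and the two RPN losses
rois, rpn_loss_cls, rpn_loss_bbox = \
    self.RCNN_rpn(base_feat, im_info, gt_boxes, num_boxes)

# inside the RPN: objectness (object vs. background) classification loss
self.rpn_loss_cls = F.cross_entropy(rpn_cls_score, rpn_label)
# inside the RPN: smooth L1 loss on the anchor offsets
self.rpn_loss_box = _smooth_l1_loss(rpn_bbox_pred, rpn_bbox_targets,
                                    rpn_bbox_inside_weights,
                                    rpn_bbox_outside_weights, sigma=3, dim=[1,2,3])

# total loss: RPN losses plus the Fast R-CNN (RoI head) losses
Faster_RCNN_loss = rpn_loss_cls.mean() + rpn_loss_bbox.mean() \
           + RCNN_loss_cls.mean() + RCNN_loss_bbox.mean()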


## Region Proposal Network (RPN)
  Take an input image of 800x800 and send it through the network. For a VGG network with a sub-sampling ratio of 16, the output feature map has shape [512, 50, 50]. The RPN is applied to this feature map and produces regression and classification scores for 50*50*9 boxes. The regression output therefore has 50*50*9*4 values (x, y, w, h) and the classification output has 50*50*9*2 values (object present or not). Here 9 is the number of anchor boxes at each location. The procedure for generating anchor boxes on the image is described below. (Note: with slight modifications, this is how anchor boxes are applied in most of the frameworks that followed.)
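
The sizes quoted above follow from simple arithmetic, which the short check below reproduces. The variable names are chosen for illustration only.

```python
image_size = 800
sub_sample = 16            # sub-sampling ratio of the backbone
anchors_per_location = 9   # 3 scales x 3 aspect ratios

fe_size = image_size // sub_sample                       # side of the feature map
num_anchors = fe_size * fe_size * anchors_per_location   # one box per anchor
reg_outputs = num_anchors * 4                            # (x, y, w, h) per anchor
cls_outputs = num_anchors * 2                            # object / no object per anchor
print(fe_size, num_anchors, reg_outputs, cls_outputs)
```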
    


    
Anchor boxes

  At each location, we have 9 anchor boxes with different scales and aspect ratios. They are defined in the cells below.



## Anchor Boxes

In [0]:
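import torch.nn as nn
# `dummy_img` (a 1x3x800x800 tensor) and `fe` (the list of backbone layers)
# are assumed to be defined in an earlier cell.
req_features = []
k = dummy_img.clone()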
for i in fe:
    k = i(k)
    if k.size()[2] < 800//16:
        break
    req_features.append(i)
    out_channels = k.size()[1]

print(len(req_features)) #7
print(out_channels) # 256

faster_rcnn_fe_extractor = nn.Sequential(*req_features)

out_map = faster_rcnn_fe_extractor(dummy_img)
print(out_map.size()) #Out: torch.Size([1, 256, 50, 50])

7
256
torch.Size([1, 256, 50, 50])


In [0]:
anchors_boxes_per_location = 9
scales = [8, 16, 32]
ratios = [0.5, 1, 2]
ctr_x, ctr_y = 16/2, 16/2 #(at (1,1) location)

Feature map location (1, 1) is mapped to the box [0, 0, 16, 16] on the image, which has its center at (8, 8). We now need to draw the 9 anchor boxes around this center using the scales and ratios above. Here is a look at all the centers on the image:

![anchor points](https://miro.medium.com/proxy/1*f-AxsYA9ys5wtiY9NDZh9Q.png)

In [0]:
import numpy as np

# one row (y1, x1, y2, x2) per anchor box at a single location
anchor_base = np.zeros((len(ratios) * len(scales), 4), dtype=np.float32)
print(anchor_base)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [0]:
sub_sample = 16

for i in range(len(ratios)):
  for j in range(len(scales)):
    h = sub_sample * scales[j] * np.sqrt(ratios[i])
    w = sub_sample * scales[j] * np.sqrt(1./ ratios[i])
    index = i * len(scales) + j
    anchor_base[index, 0] = ctr_y - h / 2.
    anchor_base[index, 1] = ctr_x - w / 2.
    anchor_base[index, 2] = ctr_y + h / 2.
    anchor_base[index, 3] = ctr_x + w / 2.
    
print(anchor_base)


[[ -37.254833  -82.50967    53.254833   98.50967 ]
 [ -82.50967  -173.01933    98.50967   189.01933 ]
 [-173.01933  -354.03867   189.01933   370.03867 ]
 [ -56.        -56.         72.         72.      ]
 [-120.       -120.        136.        136.      ]
 [-248.       -248.        264.        264.      ]
 [ -82.50967   -37.254833   98.50967    53.254833]
 [-173.01933   -82.50967   189.01933    98.50967 ]
 [-354.03867  -173.01933   370.03867   189.01933 ]]


This is for one location; we now need to do the same for all the anchor centers. Since there are 50*50 anchor centers and each has 9 anchors, we get 22500 anchors in total.

In [0]:
fe_size = 800 // sub_sample  # 50
# anchor centers on the image: 8, 24, ..., 792 along each axis
centers = np.arange(sub_sample // 2, fe_size * sub_sample, sub_sample)
anchors = np.zeros((fe_size * fe_size * 9, 4))
index = 0
for ctr_y in centers:
  for ctr_x in centers:
    for i in range(len(ratios)):
      for j in range(len(scales)):
        h = sub_sample * scales[j] * np.sqrt(ratios[i])
        w = sub_sample * scales[j] * np.sqrt(1./ ratios[i])
        anchors[index, 0] = ctr_y - h / 2.
        anchors[index, 1] = ctr_x - w / 2.
        anchors[index, 2] = ctr_y + h / 2.
        anchors[index, 3] = ctr_x + w / 2.
        index += 1

print(anchors.shape)

(22500, 4)


The final two matrices are

    anchor_locations [N, 4] — [22500, 4]
    anchor_labels [N,] — [22500]

A look at anchors at (400, 400) on image

![anchors](https://miro.medium.com/proxy/1*cPidpSRVUVgv3YeY9Fc11Q.png)

*   The RPN produces object proposals with respect to each anchor box, along with their objectness scores. An anchor box is labeled positive if it has the highest IoU with a ground-truth object or an IoU greater than 0.7 with any ground-truth object. In the Faster R-CNN paper, an anchor box is labeled negative if its IoU is below 0.3 for all ground-truth boxes. Anchor boxes that are neither positive nor negative are ignored, as are anchor boxes that fall outside the image.
*   Since the vast majority of anchors are negative, the same sampling strategy as in Fast R-CNN is used: 128 positive and 128 negative anchors (256 in total) are sampled for training. If there are fewer than 128 positive anchors, the rest is filled with negative ones.
*   Smooth L1 loss and cross entropy loss can be used for regression and classification. The regression targets are offsets relative to the anchor box locations, computed with the following formulas:



t_{x} = (x - x_{a})/w_{a}

t_{y} = (y - y_{a})/h_{a}

t_{w} = log(w/ w_a)

t_{h} = log(h/ h_a)

x, y, w and h are the ground-truth box center coordinates, width and height. x_a, y_a, w_a and h_a are the anchor box center coordinates, width and height.
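
As a worked example of the formulas above, the sketch below computes the regression targets for one anchor and one ground-truth box. The helper name `encode` and the box values are made up for illustration; both boxes are given as (center_x, center_y, width, height).

```python
import numpy as np


def encode(anchor, gt):
    """Offsets (t_x, t_y, t_w, t_h) of a ground-truth box relative to an anchor."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return np.array([(x - xa) / wa,
                     (y - ya) / ha,
                     np.log(w / wa),
                     np.log(h / ha)])


anchor = (100.0, 100.0, 128.0, 128.0)
gt = (116.0, 92.0, 256.0, 64.0)   # shifted, twice as wide, half as high
t = encode(anchor, gt)
print(t)
```

The shifts are expressed in units of the anchor size and the size changes on a log scale, so the targets stay in a similar numeric range for small and large anchors.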

*    Once the RPN outputs are generated, they need to be processed before being sent to the RoI pooling layer (i.e. the Fast R-CNN network). The Faster R-CNN paper notes that RPN proposals overlap heavily with each other. To reduce redundancy, non-maximum suppression (NMS) is applied to the proposal regions based on their cls scores. The IoU threshold for NMS is fixed at 0.7, which leaves about 2000 proposal regions per image. An ablation study by the authors shows that NMS does not harm the final detection accuracy but substantially reduces the number of proposals. After NMS, the top-N ranked proposal regions are used for detection. Fast R-CNN is trained with 2000 RPN proposals; at test time only 300 proposals are evaluated, a number the authors chose after testing several values.
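
Non-maximum suppression as described above can be sketched with a simple greedy loop. This NumPy version is an illustration, not the implementation used in the Faster R-CNN code base; boxes are assumed to be in (x1, y1, x2, y2) format and the example values are made up.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS. Returns the indices of the kept boxes, best score first."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # IoU of the best remaining box with all other remaining boxes.
        inter_w = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        inter_h = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = inter_w * inter_h
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop everything that overlaps the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep


boxes = np.array([[0., 0., 100., 100.],
                  [5., 5., 105., 105.],       # near-duplicate of the first box
                  [200., 200., 300., 300.]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```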

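Once the ~2000 proposals are generated, they are converted from offsets to boxes with the following formulas, where the subscript p denotes the offsets predicted by the RPN and the subscript a denotes the anchor box:

x = (w_{a} * ctr_x_{p}) + ctr_x_{a}

y = (h_{a} * ctr_y_{p}) + ctr_y_{a}

h = np.exp(h_{p}) * h_{a}

w = np.exp(w_{p}) * w_{a}

The boxes are later converted to the [y1, x1, y2, x2] format.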
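
The conversion from predicted offsets back to a box can be sketched as follows. The helper name `decode` and the numbers are made up for illustration; the anchor is given as (center_x, center_y, width, height) and the result is returned in the [y1, x1, y2, x2] format.

```python
import numpy as np


def decode(anchor, offsets):
    """Apply predicted offsets (t_x, t_y, t_w, t_h) to an anchor box."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    x = wa * tx + xa          # predicted center x
    y = ha * ty + ya          # predicted center y
    w = np.exp(tw) * wa       # predicted width
    h = np.exp(th) * ha       # predicted height
    return np.array([y - h / 2, x - w / 2, y + h / 2, x + w / 2])


anchor = (100.0, 100.0, 128.0, 128.0)
offsets = (0.125, -0.0625, np.log(2.0), np.log(0.5))
box = decode(anchor, offsets)
print(box)
```

This is the inverse of the target computation used during training, so a perfect prediction reproduces the ground-truth box.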

## Feature Pyramid Networks

Researchers have observed two problems with Faster R-CNN. First, it struggles to detect small objects. Second, class imbalance is not handled properly: randomly sampling 256 anchors and calculating the loss on them is not ideal. Two new concepts were introduced to address these problems. The focal loss described earlier addresses the second; the first is addressed by:

*   Feature Pyramid networks [FPN](https://arxiv.org/abs/1612.03144)  

The idea of FPN is to process the images in multi-scale fashion rather than one single scale.


![Faster R-CNN FPN](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/FPN-1.png)


Image Source:[Tsung-Yi Lin et al](https://arxiv.org/abs/1612.03144)
 
Faster R-CNN often misses small objects in the image. Most of the winning teams in the COCO and ILSVRC challenges addressed this with image pyramids: the image is scaled to different sizes and each size is sent through the network. Once detections are obtained at each scale, all the predictions are combined. Although this method works, inference is costly because each image has to be processed at several scales independently.


A deep ConvNet computes a feature hierarchy layer by layer, and with sub-sampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions. First let's see how FPN works; later we will move on to the intuition.

*    First take a standard ResNet architecture (e.g. ResNet50). In Faster R-CNN, as discussed above, only the feature map with a sub-sampling ratio of 16 was used to compute region proposals, and the RPN outputs were then passed to Fast R-CNN. With FPN it is done in the following way.

![FPN-module2](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/fpn-3.png)

### ResNet Backbone

In [0]:
## For information - do not execute
      
class resnet_backbone(nn.Module):
    def __init__(self, num_class = 1, input_channel = 3, output_stride=32, layer=101):
        super(resnet_backbone, self).__init__()
        if layer == 101:
            self.resnet = resnet101(pretrained=True, output_stride=output_stride)
        elif layer == 152:
            self.resnet = resnet152(pretrained=True, output_stride=output_stride)
        elif layer == 50:
            self.resnet = resnet50(pretrained=True, output_stride=output_stride)
        else:
            raise ValueError("only ResNet50, ResNet101 and ResNet152 are supported")

        if input_channel == 1:
            self.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding = 3, bias=False)
        elif input_channel == 3:
            self.conv1 = self.resnet.conv1
        else:
            raise ValueError("input channel should be 3 or 1")

    def forward(self, x):

        c1 = self.conv1(x) #1, 320*320
        c1 = self.resnet.bn1(c1)
        c1 = self.resnet.relu(c1)
        c1 = self.resnet.maxpool(c1) #4, 80*80

        c2 = self.resnet.layer1(c1)
        c3 = self.resnet.layer2(c2) #8, 40*40
        c4 = self.resnet.layer3(c3) #16, 20*20
        c5 = self.resnet.layer4(c4) #32, 10*10

        return c1, c2, c3, c4, c5



*    The problem with excessive sub-sampling is that it leads to poor localization. If, on the other hand, we use features from earlier layers (say, sub-sampling ratio 8), the semantics of the object are not captured very clearly. So we have to take the best of both worlds. With this intuition, the feature pyramid network (FPN) was first designed on top of Faster R-CNN. The lateral layers (latlayers) reduce the channel dimension (number of feature maps) of the bottom-up outputs; the coarser top-down map is up-sampled and added to them. Since up-sampling is done with bilinear interpolation, another conv layer is applied to this sum to reduce the aliasing effect of up-sampling. p1 is ignored because of the computational cost: it would generate 400*400*3 = 480k proposals for the single feature map p1.

![FPN-module](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/fpn-2.png)
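
The top-down pathway with lateral connections can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the 1x1 lateral convolutions are replaced by random channel-mixing matrices, nearest-neighbour up-sampling is used instead of bilinear interpolation, and the 3x3 smoothing convolutions are omitted. The helper names are hypothetical; the channel counts are those of a ResNet-50 with 256 FPN channels.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def lateral(x, out_channels, rng):
    """Stand-in for a 1x1 convolution: mixes channels with random weights."""
    weight = rng.standard_normal((out_channels, x.shape[0])).astype(x.dtype)
    return np.einsum('oc,chw->ohw', weight, x)


rng = np.random.default_rng(0)
# Bottom-up feature maps for a 320x320 input at sub-sampling ratios 8, 16 and 32.
c3 = rng.standard_normal((512, 40, 40)).astype(np.float32)
c4 = rng.standard_normal((1024, 20, 20)).astype(np.float32)
c5 = rng.standard_normal((2048, 10, 10)).astype(np.float32)

# Top-down pathway: start from the coarsest map and merge in finer ones.
p5 = lateral(c5, 256, rng)
p4 = upsample2x(p5) + lateral(c4, 256, rng)
p3 = upsample2x(p4) + lateral(c3, 256, rng)
print(p3.shape, p4.shape, p5.shape)
```

All pyramid levels end up with the same number of channels, so the same RPN head can be applied to every level.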


*    Anchor boxes are designed on each feature map separately. Since scale is taken care of by the FPN, at each location we take anchors of 3 aspect ratios [1:2, 2:1, 1:1]. In total we get 9 anchors at each location over the feature map scale.


#### Results:

*    Some ablation studies suggest that, using FPN, average recall (AR) improved by 8.0 points over a single-scale RPN.
*    Lateral connections improved the AP by 10 points.
*    On small objects, the COCO-style AP improved from 9.6% to 17.8%.
*    Inference on each image takes 176 ms when using ResNet101 as the backbone.
    
  



### FPN Base

In [0]:
## For information - do not execute

def _upsample_add(self, x, y):
        '''Upsample and add two feature maps.
        Args:
          x: (Variable) top feature map to be upsampled.
          y: (Variable) lateral feature map.
        Returns:
          (Variable) added feature map.
        '''
        _,_,H,W = y.size()
        # F.upsample is deprecated; F.interpolate is its replacement
        return F.interpolate(x, size=(H,W), mode='bilinear', align_corners=False) + y


class _FPN(nn.Module):
    """ FPN """
    def __init__(self, classes, class_agnostic):
        super(_FPN, self).__init__()
        self.classes = classes
        self.n_classes = len(classes)
        self.class_agnostic = class_agnostic
        # loss
        self.RCNN_loss_cls = 0
        self.RCNN_loss_bbox = 0

        self.maxpool2d = nn.MaxPool2d(1, stride=2)
        # define rpn
        self.RCNN_rpn = _RPN_FPN(self.dout_base_model)
        self.RCNN_proposal_target = _ProposalTargetLayer(self.n_classes)

        # NOTE: the original paper used pool_size = 7 for cls branch, and 14 for mask branch, to save the
        # computation time, we first use 14 as the pool_size, and then do stride=2 pooling for cls branch.
        self.RCNN_roi_pool = _RoIPooling(cfg.POOLING_SIZE, cfg.POOLING_SIZE, 1.0/16.0)
        self.RCNN_roi_align = RoIAlignAvg(cfg.POOLING_SIZE, cfg.POOLING_SIZE, 1.0/16.0)
        self.grid_size = cfg.POOLING_SIZE * 2 if cfg.CROP_RESIZE_WITH_MAX_POOL else cfg.POOLING_SIZE
        self.RCNN_roi_crop = _RoICrop()
        
    def forward(self, im_data, im_info, gt_boxes, num_boxes):
        batch_size = im_data.size(0)

        im_info = im_info.data
        gt_boxes = gt_boxes.data
        num_boxes = num_boxes.data

        # feed image data to base model to obtain base feature map
        # Bottom-up
        c1 = self.RCNN_layer0(im_data)
        c2 = self.RCNN_layer1(c1)
        c3 = self.RCNN_layer2(c2)
        c4 = self.RCNN_layer3(c3)
        c5 = self.RCNN_layer4(c4)
        # Top-down
        p5 = self.RCNN_toplayer(c5)
        p4 = self._upsample_add(p5, self.RCNN_latlayer1(c4))
        p4 = self.RCNN_smooth1(p4)
        p3 = self._upsample_add(p4, self.RCNN_latlayer2(c3))
        p3 = self.RCNN_smooth2(p3)
        p2 = self._upsample_add(p3, self.RCNN_latlayer3(c2))
        p2 = self.RCNN_smooth3(p2)

        p6 = self.maxpool2d(p5)

        rpn_feature_maps = [p2, p3, p4, p5, p6]
        mrcnn_feature_maps = [p2, p3, p4, p5]

        rois, rpn_loss_cls, rpn_loss_bbox = self.RCNN_rpn(rpn_feature_maps, im_info, gt_boxes, num_boxes)


In [0]:
# import necessary libraries
from PIL import Image
import matplotlib.pyplot as plt
import torch
import torchvision
import torchvision.transforms as T
import numpy as np
import cv2

# get the pretrained model from torchvision.models
# Note: pretrained=True will get the pretrained weights for the model.
# model.eval() to use the model for inference
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Class labels from official PyTorch documentation for the pretrained model
# Note that there are some N/A's 
# for complete list check https://tech.amikelive.com/node-718/what-object-categories-labels-are-in-coco-dataset/
# we will use the same list for this notebook
COCO_INSTANCE_CATEGORY_NAMES = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]


def get_prediction(img_path, threshold):
  """
  get_prediction
    parameters:
      - img_path - path of the input image
      - threshold - threshold value for prediction score
    method:
      - Image is obtained from the image path
      - the image is converted to image tensor using PyTorch's Transforms
      - image is passed through the model to get the predictions
      - class, box coordinates are obtained, but only prediction score > threshold
        are chosen.
    
  """
  img = Image.open(img_path).convert('RGB')
  transform = T.Compose([T.ToTensor()])
  img = transform(img)
  with torch.no_grad():  # inference only, no gradients needed
    pred = model([img])
  labels = pred[0]['labels'].numpy()
  boxes = pred[0]['boxes'].numpy()
  scores = pred[0]['scores'].numpy()
  # keep only predictions above the threshold (also works if none pass)
  keep = scores > threshold
  pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in labels[keep]]
  # OpenCV expects integer pixel coordinates
  pred_boxes = [[(int(b[0]), int(b[1])), (int(b[2]), int(b[3]))] for b in boxes[keep]]
  return pred_boxes, pred_class
  


def object_detection_api(img_path, threshold=0.5, rect_th=3, text_size=3, text_th=3):
  """
  object_detection_api
    parameters:
      - img_path - path of the input image
      - threshold - threshold value for prediction score
      - rect_th - thickness of bounding box
      - text_size - size of the class label text
      - text_th - thickness of the text
    method:
      - prediction is obtained from get_prediction method
      - for each prediction, bounding box is drawn and text is written 
        with opencv
      - the final image is displayed
  """
  boxes, pred_cls = get_prediction(img_path, threshold)
  img = cv2.imread(img_path)
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
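  for i in range(len(boxes)):
    # OpenCV drawing functions expect integer pixel coordinates
    pt1 = tuple(int(v) for v in boxes[i][0])
    pt2 = tuple(int(v) for v in boxes[i][1])
    cv2.rectangle(img, pt1, pt2, color=(255, 255, 0), thickness=rect_th)
    cv2.putText(img, pred_cls[i], pt1, cv2.FONT_HERSHEY_SIMPLEX, text_size, (0, 255, 255), thickness=text_th)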
  plt.figure(figsize=(20,30))
  plt.imshow(img)
  plt.xticks([])
  plt.yticks([])
  plt.show()

Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /root/.cache/torch/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
100%|██████████| 167502836/167502836 [00:02<00:00, 59903861.15it/s]


In [0]:
# download an image for inference
!wget https://www.wsha.org/wp-content/uploads/banner-diverse-group-of-people-2.jpg -O people.jpg

# use the api pipeline for object detection
# the threshold is set manually: the model sometimes predicts
# random structures as objects, so we use a score threshold to keep
# only the more confident predictions.
object_detection_api('./people.jpg', threshold=0.8)

--2019-09-24 20:47:02--  https://www.wsha.org/wp-content/uploads/banner-diverse-group-of-people-2.jpg
Resolving www.wsha.org (www.wsha.org)... 104.198.7.33
Connecting to www.wsha.org (www.wsha.org)|104.198.7.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1923610 (1.8M) [image/jpeg]
Saving to: ‘people.jpg’


2019-09-24 20:47:02 (5.77 MB/s) - ‘people.jpg’ saved [1923610/1923610]
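
The effect of the score threshold can be illustrated in isolation. The snippet below is a minimal sketch, not the notebook's own code: `filter_by_score` is a hypothetical helper, and the boxes, labels and scores are made-up values rather than real model output.

```python
def filter_by_score(boxes, labels, scores, threshold=0.5):
    """Keep only the detections whose confidence score exceeds `threshold`."""
    kept = [(b, l) for b, l, s in zip(boxes, labels, scores) if s > threshold]
    kept_boxes = [b for b, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_boxes, kept_labels


# Toy detections (illustrative values only, not real model output)
boxes = [((10, 20), (110, 220)), ((30, 40), (90, 100)), ((5, 5), (50, 60))]
labels = ['person', 'dog', 'kite']
scores = [0.98, 0.85, 0.30]

# A higher threshold keeps fewer, more confident detections
print(filter_by_score(boxes, labels, scores, threshold=0.8))
print(filter_by_score(boxes, labels, scores, threshold=0.2))
```

Raising the threshold trades recall for precision: fewer spurious boxes are drawn, but weakly detected objects may be dropped as well.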



Let's try a more complex example.


In [0]:
!wget https://cdn.pixabay.com/photo/2013/07/05/01/08/traffic-143391_960_720.jpg -O traffic.jpg

object_detection_api('./traffic.jpg', rect_th=2, text_th=1, text_size=1)

In [0]:
!wget https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/10best-cars-group-cropped-1542126037.jpg -O cars.jpg
  
object_detection_api('./cars.jpg', rect_th=6, text_th=5, text_size=5)

In [0]:
!wget https://images.unsplash.com/photo-1458169495136-854e4c39548a -O traffic_scene2.jpg
  
object_detection_api('./traffic_scene2.jpg', rect_th=15, text_th=7, text_size=5, threshold=0.8)  

# Comparing the inference time of the model on CPU & GPU


In [0]:
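import time
import torch

def check_inference_time(image_path, gpu=False):
  """Return the wall-clock time (in seconds) of a single forward pass."""
  model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
  model.eval()
  img = Image.open(image_path).convert('RGB')
  transform = T.Compose([T.ToTensor()])
  img = transform(img)
  if gpu:
    model.cuda()
    img = img.cuda()
  else:
    model.cpu()
    img = img.cpu()
  with torch.no_grad():  # inference only, no gradients needed
    if gpu:
      torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start_time = time.time()
    pred = model([img])
    if gpu:
      torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait until they finish
    end_time = time.time()
  return end_time-start_time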

## Inference time for Object Detection

In [0]:
cpu_time = sum([check_inference_time('./traffic_scene2.jpg', gpu=False) for _ in range(5)])/5.0
gpu_time = sum([check_inference_time('./traffic_scene2.jpg', gpu=True) for _ in range(5)])/5.0


print('\n\nAverage time taken by the model with GPU = {}s\nAverage time taken by the model with CPU = {}s'.format(gpu_time, cpu_time))
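
When comparing timings like this, the first call often includes one-off costs (for example, lazy initialization), so it can help to discard a few warm-up runs before averaging. The helper below is a minimal sketch of that idea using only the standard library. `average_time` and its `sync` parameter are hypothetical names introduced here; for GPU timing one would pass something like `torch.cuda.synchronize` as `sync`.

```python
import time


def average_time(fn, warmup=1, repeats=5, sync=None):
    """Average wall-clock time of fn() over `repeats` runs, after `warmup` untimed runs.

    sync: optional callable invoked right before each clock reading, so that
    asynchronous work (e.g. GPU kernels) is finished when the time is taken.
    """
    for _ in range(warmup):
        fn()  # untimed warm-up run
    total = 0.0
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        total += time.perf_counter() - start
    return total / repeats


# Usage with a stand-in workload (replace the lambda with a model forward pass)
elapsed = average_time(lambda: sum(range(100000)), warmup=2, repeats=5)
print('Average time = {:.6f}s'.format(elapsed))
```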

# Single-stage Object Detection: YOLO & SSD

All the object detection techniques we have seen so far are two-stage: an RPN generates object proposals, and Fast R-CNN then classifies those proposals and regresses their bounding boxes. A pertinent question is: can we build single-stage detectors?

### [RetinaNet (Focal Loss and FPN combined)](https://arxiv.org/pdf/1708.02002.pdf) as a Single-stage FPN

When evaluating 10⁴ to 10⁵ anchor boxes in a Faster R-CNN (FPN) network, most of the boxes do not contain objects, leading to extreme class imbalance. To counter this imbalance, we sample 256 proposals from each mini-batch (128 positive and 128 negative). However, this is not a robust approach, and the authors of this paper proposed a loss function called focal loss, which addresses class imbalance more efficiently by down-weighting the loss of easy, well-classified examples.
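
As a rough illustration, the focal loss for a single binary prediction can be written as FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The snippet below is a minimal scalar sketch in plain Python, not the paper's or any library's implementation. The function name `focal_loss` is an assumption made here, and the defaults α = 0.25 and γ = 2 follow the values the paper reports as working well.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one binary prediction.

    p: predicted probability of the positive (object) class
    y: ground-truth label, 1 for object and 0 for background
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy background anchor (p close to 0) contributes almost nothing,
# while a misclassified background anchor (p close to 1) keeps a large loss.
print(focal_loss(0.01, 0))
print(focal_loss(0.90, 0))
```

With γ = 0 the modulating factor disappears and the expression reduces to α-weighted cross-entropy, so γ controls how strongly the many easy negatives are suppressed.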




### YOLO (You Only Look Once)

YOLO is an important framework that has led to improved speed and accuracy for object detection tasks. YOLO has evolved over time, with three papers published so far. Though most of the ideas are similar to the ones we have discussed above, YOLO provides targeted enhancements that specifically reduce processing time.

*    Three YOLO papers have been written to date (YOLOv1, YOLOv2 and YOLOv3), each showing improvements in accuracy.
*    YOLOv1 was published independently and is, to my knowledge, the first paper to discuss single-stage detection. RetinaNet takes ideas from YOLO and SSD.



![SSD and YOLO](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/ssd-yolo.png)


# YOLOv3

![Yolov3](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/yolov3.png)
![YOLOv3, another description](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/yolov3_2.png)

# RefineDet & SSD

![SSD and RefineDet](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/ssd-refinedet.png)

# A fairly recent benchmark

![Benchmark](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/benchmark_2.PNG)

![Benchmark](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/1*QOGcvHbrDZiCqTG6THIQ_w.png)


# Next step: instance segmentation, Mask R-CNN

![Mask R-CNN](https://raw.githubusercontent.com/smajida/DeepLearningWorkshop-ISPRS2019/master/1*vMiMMU6sIfb7aUFXerUIWw.png)
