# PyTorch Object Detection

**What is Object Detection?**

Predicting the location of an object along with its class is called Object Detection.

<img src="https://cdn-images-1.medium.com/max/1600/1*cpe4z05DgTJvm0SG0MsrjA.png" width="60%">

In its simplest form, Object Detection is modeled as a classification problem: we take windows of fixed size from the input image at all possible locations and feed these patches to an image classifier.

**Goals of this Session**

Understanding:

1. Two common Datasets for object detection
2. Object detection metrics
3. A short history of object detectors

Followed by a hands-on example.

------

## Datasets

While various datasets exist for Object Detection, here we focus on the two you will encounter most in research and in the validation of state-of-the-art approaches. Most other datasets serve a specific use case (e.g. KITTI for annotated traffic scenes).

### Pascal VOC (2005)

Name:
- Pascal Visual Object Classes

Data:
- The 2012 dataset has 11,530 images containing 27,450 annotated objects
- 20 classes (covered in the follow-up hands-on example)
- Images originally obtained from Flickr

Source:
- Pascal VOC challenges 2005-2012
- http://host.robots.ox.ac.uk/pascal/VOC/
- http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf

Annotation sample:
```
<annotation>
	<folder>GeneratedData_Train</folder>
	<filename>000001.png</filename>
	<path>/my/path/GeneratedData_Train/000001.png</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>224</width>
		<height>224</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>21</name>
		<pose>Frontal</pose>
		<truncated>0</truncated>
		<difficult>0</difficult>
		<occluded>0</occluded>
		<bndbox>
			<xmin>82</xmin>
			<xmax>172</xmax>
			<ymin>88</ymin>
			<ymax>146</ymax>
		</bndbox>
	</object>
</annotation>
```
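
An annotation like the one above can be read with Python's standard library. The following is a minimal sketch, not the loader used in the hands-on example: `SAMPLE_XML` is a shortened copy of the sample, and `parse_voc_annotation` is a hypothetical helper name. Note that the coordinate tags are looked up by name, since the sample lists them as `xmin, xmax, ymin, ymax`.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotation sample above (illustrative only).
SAMPLE_XML = """
<annotation>
  <filename>000001.png</filename>
  <size><width>224</width><height>224</height><depth>3</depth></size>
  <object>
    <name>21</name>
    <difficult>0</difficult>
    <bndbox><xmin>82</xmin><xmax>172</xmax><ymin>88</ymin><ymax>146</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc_annotation(xml_text):
    """Return file name, image size and the labelled boxes of one annotation."""
    root = ET.fromstring(xml_text.strip())
    size = root.find("size")
    objects = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        # Look tags up by name: their order inside <bndbox> is not guaranteed.
        coords = [int(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        objects.append({
            "label": obj.find("name").text,
            "difficult": obj.find("difficult").text == "1",
            "box": coords,  # [xmin, ymin, xmax, ymax] in pixels
        })
    return {
        "filename": root.find("filename").text,
        "width": int(size.find("width").text),
        "height": int(size.find("height").text),
        "objects": objects,
    }


annotation = parse_voc_annotation(SAMPLE_XML)
print(annotation["objects"])
```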

We will look into Pascal VOC during the follow-up hands-on example.

Labels in VOC.

['aeroplane','bicycle','bird','boat','bottle','bus','car','cat','chair','cow','diningtable','dog', 'horse',
 'motorbike','person','pottedplant','sheep','sofa','train','tvmonitor']

VOC Data example from paper:

<img src='images/VOC_example.png' width="65%">

### COCO (2015)

Name:
- Common Objects in Context

Data:
- Everyday scenes of "common objects"
- Large-scale object detection, segmentation, and captioning dataset
- Over 200K labeled images with 1.5M object instances
- 80 object categories
- Additional annotations for 91 "Stuff" classes available
- Stuff classes are background materials that are defined by homogeneous or repetitive patterns
- COCO dataset reader in `torchvision.datasets.CocoDetection`

Source:
- Microsoft
- http://cocodataset.org/
- https://arxiv.org/abs/1405.0312
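
One practical difference to Pascal VOC is the box format: COCO annotations store a box as `[x, y, width, height]` (top-left corner plus size), whereas VOC stores the two corners. The sketch below converts between the two. The `sample` record is made up for illustration, and `coco_to_corners` is a hypothetical helper name.

```python
def coco_to_corners(bbox):
    """Convert a COCO box [x, y, width, height] to [xmin, ymin, xmax, ymax]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def corners_to_coco(box):
    """Convert [xmin, ymin, xmax, ymax] back to the COCO layout."""
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]


# Made-up annotation record with only the fields used here.
sample = {"image_id": 1, "category_id": 1, "bbox": [82.0, 88.0, 90.0, 58.0]}

corners = coco_to_corners(sample["bbox"])
print(corners)
```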

Labels in COCO.

['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 
 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 
 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 
 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 
 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 
 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 
 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 
 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']

## Metrics

Evaluation in Object Detection needs to measure 2 tasks:

1. Object existence (classification)
2. Object location (regression)

Both have to be measured over a non-uniform distribution of classes and at various levels of confidence.

This is non-trivial. Mean Average Precision (mAP) was adopted as the evaluation metric for object detection. In order to break down AP (Average Precision), we will first (re-)visit the Precision and Recall metrics.

### Classification

**Precision**
  - Penalizes false positives: `TP / (TP + FP)`
  - `true object detections / total number of objects predicted`
  - How many selected items are relevant?

**Recall**
  - Penalizes false negatives: `TP / (TP + FN)`
  - `true object detections / total number of objects in dataset`
  - How many relevant items are selected?
  
<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" width="35%">
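
As a small worked example of the two ratios: assume a dataset with 10 objects and a detector that outputs 8 boxes, 6 of which are correct. The counts and the function name are illustrative assumptions.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from detection counts."""
    predicted = true_positives + false_positives   # everything the detector output
    actual = true_positives + false_negatives      # everything present in the dataset
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# 8 predictions, 6 correct -> 2 false positives; 10 objects -> 4 missed.
precision, recall = precision_recall(true_positives=6, false_positives=2, false_negatives=4)
print(precision, recall)
```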

**Average Precision (AP)**
  - The Precision-Recall curve p(r) is computed per class by varying the confidence threshold and recording the (precision, recall) pair at each threshold
  - AP is the average value of p(r) over recall, i.e. the area under the curve:

$$\int_0^1 p(r)dr$$
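
In practice the integral is approximated from a ranked list of detections. The sketch below uses one common choice, all-point interpolation (precision is first made non-increasing, then rectangle areas are summed). It assumes each detection has already been marked as true or false positive. The function name and the toy inputs are illustrative.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Approximate the area under the precision-recall curve for one class."""
    if num_ground_truth == 0 or not scores:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolation: precision at recall r becomes the best precision at any recall >= r.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangles between consecutive recall values.
    ap, previous_recall = 0.0, 0.0
    for rec, prec in zip(recalls, precisions):
        ap += (rec - previous_recall) * prec
        previous_recall = rec
    return ap


# Four detections for a class with four ground truth objects (toy values).
ap = average_precision(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_true_positive=[True, False, True, True],
    num_ground_truth=4,
)
print(ap)
```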

### Regression

**Intersection over Union (IoU)**

- AP is usually reported at a given IoU threshold: a detection only counts as correct if its IoU with a ground truth box reaches that threshold
- IoU measures the overlap between ground truth and prediction
- `area of intersection / area of union`
- In the COCO challenge, 10 different IoU thresholds are considered (0.5 to 0.95 in steps of 0.05)

<img src="https://github.com/rafaelpadilla/Object-Detection-Metrics/blob/master/aux_images/iou.png?raw=true">
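
The ratio can be sketched in a few lines for axis-aligned boxes given as `[xmin, ymin, xmax, ymax]`. The function name and the example boxes are illustrative assumptions. The last lines build the 10 COCO thresholds mentioned above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in [xmin, ymin, xmax, ymax] format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = [0, 0, 10, 10]
prediction = [5, 5, 15, 15]
overlap = iou(ground_truth, prediction)  # 25 / 175

# COCO-style thresholds: 0.5 to 0.95 in steps of 0.05.
coco_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(overlap, [overlap >= t for t in coco_thresholds])
```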

### Mean Average Precision (mAP)

Bringing together classification and localization.

- AP over all classes and/or IoU thresholds (minimum IoU)
- Pascal VOC2007
  - mAP over 20 classes at IoU=0.5
- COCO2017
  - mAP over 80 classes and all 10 IoU thresholds
- If averaged over IoU thresholds, precise localization tends to be better rewarded
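
The difference between the two averaging schemes can be illustrated with a toy table of per-class AP values. The numbers, class names and the `mean_ap` helper are made up for illustration, and only two thresholds are used instead of COCO's 10.

```python
def mean_ap(ap_table, iou_thresholds):
    """Average AP over all classes and the given IoU thresholds."""
    values = [ap_table[cls][t] for cls in ap_table for t in iou_thresholds]
    return sum(values) / len(values)


# Made-up AP values per class and IoU threshold.
ap_table = {
    "cat": {0.5: 0.80, 0.75: 0.60},
    "dog": {0.5: 0.60, 0.75: 0.40},
}

voc_style = mean_ap(ap_table, [0.5])         # single threshold, as in Pascal VOC
coco_style = mean_ap(ap_table, [0.5, 0.75])  # averaged over several thresholds
print(voc_style, coco_style)
```

Because AP drops at stricter thresholds unless boxes are tightly localized, the score averaged over thresholds is lower for a detector with loose boxes.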

------

## Short History of Object Detectors

Two types of detectors exist:

1. One-Stage Detectors: localization and classification are solved in a single stage
2. Two-Stage Detectors: region proposals are generated first and then classified in a second stage

Here we walk through existing model architectures for Object Detection, starting with the two-stage methods, which were introduced earlier.

### R-CNN (2014)

- Original application of CNN to the Object Detection problem
- CNNs are computationally expensive
- Running a CNN on every possible window of an image is infeasible
- R-CNN uses a **proposal algorithm called Selective Search to reduce the number of candidate bounding boxes** through features like texture, intensity, color and/or a measure of insideness
- Around 2000 region proposals per image
- All selected boxes are resized to a fixed size and fed to the CNN
- An SVM predicts the class of each proposed bounding box
- https://arxiv.org/abs/1311.2524

<img src="images/R-CNN.png" width="65%">

### SPP-net (2014)

- SPP = Spatial Pyramid Pooling
- R-CNN was very slow (CNN on 2000 proposed regions takes time)
- SPP-net: **Calculate the CNN representation of the image only once** and reuse it to compute the representation of each region proposed by Selective Search (via pooling)
- **Spatial Pyramid Pooling instead of Max Pooling** after the last convolutional layer: a region of arbitrary size is divided into a constant number of bins
- That way, the input to the fully connected layers has a fixed size
- Drawback: Backpropagation through the Spatial Pyramid Pooling layer is not trivial
- Step towards more popular Fast R-CNN
- https://arxiv.org/abs/1406.4729
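
The binning idea can be sketched in plain Python for a single-channel feature map stored as a nested list. This is a simplified illustration, not the SPP-net implementation: the pyramid levels `(1, 2, 4)`, the bin boundary rule and the function name are assumptions. The point is that regions of different sizes produce vectors of the same length.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into n x n bins for every pyramid level n."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for row in range(n):
            # Bin boundaries cover the whole map and are never empty.
            top = (row * height) // n
            bottom = math.ceil((row + 1) * height / n)
            for col in range(n):
                left = (col * width) // n
                right = math.ceil((col + 1) * width / n)
                pooled.append(max(
                    feature_map[y][x]
                    for y in range(top, bottom)
                    for x in range(left, right)
                ))
    return pooled


# Two regions of different size (toy values).
small_region = [[r * 5 + c for c in range(5)] for r in range(3)]    # 3 x 5
large_region = [[r * 9 + c for c in range(9)] for r in range(13)]   # 13 x 9

small_vector = spatial_pyramid_pool(small_region)
large_vector = spatial_pyramid_pool(large_region)
print(len(small_vector), len(large_vector))  # both 1 + 4 + 16 = 21
```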

### Fast R-CNN (2015)

- Fixes the key problem in SPP-net: end-to-end training by **propagating gradients through the spatial pooling layer**
- 25x faster than R-CNN
- **Added Bounding Box Regression to the neural network training**
- The network now has 2 heads: classification and bounding box regression
- Separate training for classification and bounding box regression is no longer needed
- https://arxiv.org/abs/1504.08083

<img src="images/Fast-R-CNN.png" width="65%">

### Faster R-CNN (2016)

- 10x faster than Fast R-CNN
- **Replaces Selective Search in previous approaches with a small CNN called Region Proposal Network (RPN)**
- Uses the idea of **Anchor Boxes** - predicting offsets instead of coordinates for fixed aspect ratio boxes
- https://arxiv.org/abs/1506.01497
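
The anchor idea can be sketched as follows: anchors are generated at a location for a few scales and aspect ratios, and the network predicts offsets that are decoded relative to an anchor. The parametrization below follows the one described in the Faster R-CNN paper (center shift scaled by anchor size, log-space width and height). The function names and example values are illustrative assumptions.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Create anchors (cx, cy, w, h) with area scale**2 and aspect ratio h / w."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors


def decode_box(anchor, offsets):
    """Turn predicted offsets (tx, ty, tw, th) into a box relative to an anchor."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    x = xa + tx * wa        # shift the center by a fraction of the anchor size
    y = ya + ty * ha
    w = wa * math.exp(tw)   # scale width and height in log space
    h = ha * math.exp(th)
    return (x, y, w, h)


anchors = make_anchors(cx=300, cy=200)
print(len(anchors))  # 3 scales x 3 ratios = 9 anchors per location

# Zero offsets reproduce the anchor itself; small offsets nudge it.
print(decode_box((100, 100, 50, 80), (0.1, 0.0, 0.0, 0.0)))
```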

## One-Stage Object Detectors

All methods discussed so far treat object detection as a classification problem solved in 2 stages. A few methods instead treat object detection as a regression problem solved in one stage. A one-stage detector uses a fixed grid of boxes, while a two-stage detector uses a proposal stage (Selective Search or a proposal network) to generate box proposals.

### YOLO

- **YOLO v1 (2015)**
  - Detection as a regression problem: Take an input image and directly learn class probabilities and bounding box coordinates
  - YOLO divides the image into a grid and predicts bounding boxes for each grid cell
  - A confidence score per box reflects how likely the box contains an object and how accurate the box is
  - It also predicts a classification score for each box
  - Low confidence boxes are discarded
  - CNN is run only once (YOLO is super fast and can be run in real-time)
  - YOLO sees the entire image and not just regions -> additional context can help reduce false positives
  - Downside: at most one class per grid cell, struggles with small objects, localization errors, low recall
  - https://arxiv.org/abs/1506.02640
- **YOLO v2 (2016)**
  - Focus on improving recall and localization
  - **Batch normalization added to YOLO** leading to 2% improvement in mAP
  - Pre-trained on **higher resolution images** leading to 4% increase in mAP
  - **Use Anchor Boxes** to predict bounding boxes (predicting offsets instead of coordinates for fixed aspect ratio anchor boxes) - more boxes predicted per image (decrease in mAP, improvement in recall)
  - darknet-19 architecture (19 + 11 = 30 layers)
  - https://arxiv.org/abs/1612.08242
- **YOLO v3 (2018)**
  - Small design changes
  - "As accurate as SSD, but three times faster"
  - Some speed has been traded off for boosts in accuracy in YOLO v3
  - This is due to **changes to the underlying CNN architecture** (Darknet-53)
  - 53 layers pre-trained on ImageNet + 53 more layers stacked onto the network for detection (106 layers total)
  - **Prediction at 3 scales** (generated through downsampling)
  - https://arxiv.org/abs/1804.02767
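
The grid idea of YOLO v1 can be sketched in a few lines: the cell containing the center of an object is responsible for it, each cell outputs a fixed number of values, and low-confidence boxes are dropped. The defaults (7x7 grid, 2 boxes per cell, 20 classes) are the values used in the YOLO v1 paper for Pascal VOC. The function names and the 0.5 threshold are illustrative assumptions.

```python
def responsible_cell(box, image_size, grid_size=7):
    """Return (row, col) of the grid cell that contains the box center."""
    xmin, ymin, xmax, ymax = box
    width, height = image_size
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    col = min(int(cx / width * grid_size), grid_size - 1)
    row = min(int(cy / height * grid_size), grid_size - 1)
    return row, col


def output_size(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Number of predicted values: each box has x, y, w, h and a confidence."""
    return grid_size * grid_size * (boxes_per_cell * 5 + num_classes)


def keep_confident(detections, threshold=0.5):
    """Discard boxes whose confidence is below the (assumed) threshold."""
    return [d for d in detections if d["confidence"] >= threshold]


# Box from the Pascal VOC annotation sample, in a 224 x 224 image.
cell = responsible_cell([82, 88, 172, 146], image_size=(224, 224))
print(cell, output_size())

# Made-up detections for illustration.
detections = [{"label": "dog", "confidence": 0.9}, {"label": "cat", "confidence": 0.2}]
print(keep_confident(detections))
```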

YOLOv1 architecture:

<img src="images/YOLO_CNN.png" width="60%">

YOLO process:

<img src="images/YOLO.png" width="65%">

### Single-Shot-Detector - SSD (2016)

- Balance between speed and accuracy
- Runs a convolutional network on image once to calculate feature map
- Then **run a small 3x3 kernel on feature map to predict bounding boxes and classification probability**
- Uses **Anchor Boxes** and learns offsets
- Predictions are made from several convolutional layers, each operating at a different scale
- https://arxiv.org/abs/1512.02325

<img src="images/SSD.png" width="65%">

### RetinaNet (2017)

- Main contribution is a new loss function called **Focal Loss** which significantly increases accuracy
- Essentially a Feature Pyramid Network trained with Focal Loss in place of the standard cross-entropy loss
- Focal Loss lowers the loss for well-classified examples and thereby puts the focus on hard-to-classify examples
- https://arxiv.org/abs/1708.02002
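
For a single binary prediction, the focal loss from the paper is `-alpha_t * (1 - p_t)^gamma * log(p_t)`, where `p_t` is the predicted probability of the true class. The sketch below compares it with plain cross-entropy. `gamma=2` and `alpha=0.25` are the defaults reported in the paper. The function names and example probabilities are illustrative assumptions.

```python
import math


def cross_entropy(p, target, eps=1e-12):
    """Binary cross-entropy for one prediction p = P(class 1)."""
    p_t = p if target == 1 else 1.0 - p
    return -math.log(max(p_t, eps))


def focal_loss(p, target, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss: cross-entropy scaled down for well-classified examples."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy = focal_loss(0.9, target=1)   # confident and correct
hard = focal_loss(0.1, target=1)   # confident and wrong
print(easy, hard)

# Relative to cross-entropy, the easy example is down-weighted far more strongly.
print(easy / cross_entropy(0.9, 1), hard / cross_entropy(0.1, 1))
```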

### Mask R-CNN (2018)

- Covered in Segmentation tutorial
- Extends Faster R-CNN (a two-stage detector) by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition
- https://arxiv.org/abs/1703.06870

### Model Architectures: Speed and Accuracy comparison

AP for Object Detection algorithms as reported in YOLO v3.

<img src="images/YOLO_v3_performance.png" width="60%">

<img src="images/YOLO_v3_comparison.png" width="60%">

Overview from https://zsc.github.io/megvii-pku-dl-course/slides/Lecture6(Object%20Detection).pdf:

<img src="images/accuracy-speed.png" width="60%">

**Overall Status**

2017
- Faster R-CNN with best accuracy numbers
- SSD with good speed / precision tradeoff
- YOLO for speed  

2018
- RetinaNet for accuracy
- YOLO v3 for speed

-----

**Sources**
- https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd/
- https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/
- https://blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4
- http://cs231n.stanford.edu/slides/2018/cs231n_2018_ds06.pdf
- https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
- https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b
- https://datascience.stackexchange.com/questions/25119/how-to-calculate-map-for-detection-task-for-the-pascal-voc-challenge
- https://medium.com/@umerfarooq_26378/from-r-cnn-to-mask-r-cnn-d6367b196cfd
- https://mc.ai/yolo3-a-huge-improvement/
- https://medium.com/@smallfishbigsea/notes-on-focal-loss-and-retinanet-9c614a2367c6
- https://medium.com/@timothycarlen/understanding-the-map-evaluation-metric-for-object-detection-a07fe6962cf3
- https://zsc.github.io/megvii-pku-dl-course/slides/Lecture6(Object%20Detection).pdf
- https://hackernoon.com/understanding-yolo-f5a74bbc7967
- http://christopher5106.github.io/object/detectors/2017/08/10/bounding-box-object-detectors-understanding-yolo.html
- https://towardsdatascience.com/evolution-of-object-detection-and-localization-algorithms-e241021d8bad