In [None]:
%%HTML
<link rel="stylesheet" type="text/css" href="../../css/custom.css">

# Object detection

![footer_logo](../../images/logo.png)


## Goal

Classification is not the only computer vision task that can be performed by deep learning algorithms. 

In this notebook we shall discuss object detection.

## Program

- [Object detection]()
- [Problem definition]()
- [Object detection pre-deep learning]()
- [Two stage methods]()
- [Single-shot methods]()

# Object detection

- Classification is not the only computer vision task that can be performed by deep learning algorithms

- Localizing objects is a crucial task for the real world:
    - autonomous driving, 
    - personal robots, 
    - industrial robotics, ...

### Example


<center><img src="../../images/object_detection/gemeente.png" width="800"></center>

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('75H9EAvYN80', width=600, height=450)

# Problem definition

Given an input image, predict the locations of a certain class of objects in the image. The general workflow is represented by the diagram below:

![](../../images/object_detection/object-detection.png)

## Bounding boxes
Locations are usually represented using bounding boxes.

![](../../images/object_detection/bounding-boxes.png)
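
Bounding boxes are commonly stored either as two corners or as one corner plus a width and height. The sketch below converts between the two; the function names and tuple layouts are illustrative assumptions, not a convention taken from this notebook.

```python
def xywh_to_xyxy(box):
    """Convert (x_min, y_min, width, height) to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (x_min, y_min, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)


corner_box = xywh_to_xyxy((10, 20, 30, 40))
print(corner_box)                # (10, 20, 40, 60)
print(xyxy_to_xywh(corner_box))  # (10, 20, 30, 40)
```

Whichever layout is used, it is worth checking which one a dataset or library expects, since mixing them up silently produces wrong boxes.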

## Metrics

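Detections are typically evaluated with the intersection over union (IoU) metric: the area of overlap between the predicted box and the ground-truth box, divided by the area of their union.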

![](../../images/object_detection/IoU1.png)
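
As a minimal sketch, IoU for two axis-aligned boxes can be computed as follows. The `(x_min, y_min, x_max, y_max)` layout and the function name `iou` are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)  # overlaps half of the ground-truth box
print(iou(ground_truth, prediction))  # 50 / 150, roughly 0.333
```

Identical boxes give an IoU of 1 and disjoint boxes give 0.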

## Metrics

Although this is by no means a perfect metric, it is the de facto standard.

<!-- The box below would be flagged as having detected the image, even though it only has found half of the horse. -->

![](../../images/object_detection/IoU2.png)

# Object detection pre-deep learning

## Handcrafted features

Early object detectors were based on handcrafted features.

For example, Haar-like features:

![](../../images/object_detection/haar.png)

[Viola & Jones 2001](https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf)

## Sliding windows

A window (at several sizes) is slid across the image, and at each position the detector checks whether the feature response is strong enough.

![](../../images/object_detection/sliding_window.gif)

However, checking every possible window is time-consuming.
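
To get a feel for the cost, the sketch below enumerates window positions for a hypothetical 640x480 image. The window sizes and stride are arbitrary choices made for illustration.

```python
def sliding_windows(image_width, image_height, window_size, stride):
    """Yield (x_min, y_min, x_max, y_max) for each window position."""
    win_w, win_h = window_size
    for y in range(0, image_height - win_h + 1, stride):
        for x in range(0, image_width - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


# Hypothetical setup: a 640x480 image, three square window sizes, stride 8.
total = 0
for size in (32, 64, 128):
    total += sum(1 for _ in sliding_windows(640, 480, (size, size), stride=8))
print(total)  # 11183 windows, each of which must be scored by the detector
```

Even this coarse grid yields over ten thousand candidate windows per image, and finer strides or more aspect ratios increase the count quickly.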

## Selective search

Instead, selective search can be used to propose regions that have high “objectness”. These regions are identified by hierarchically grouping the segments obtained from oversegmenting the image:

1. oversegment the image based on pixel intensities
2. add all bounding boxes corresponding to segmented parts to the list of region proposals
3. merge similar regions together
4. repeat from step 2
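
The grouping loop in the steps above can be sketched with a toy example. This is a strong simplification and not the algorithm from the paper: similarity here is just the difference in mean intensity, adjacency between regions is ignored, and the starting segments are hand-written rather than computed from an image. All names are hypothetical.

```python
from itertools import combinations


def merge(a, b):
    """Merge two regions: union bounding box, size-weighted mean intensity."""
    size = a["size"] + b["size"]
    return {
        "box": (min(a["box"][0], b["box"][0]), min(a["box"][1], b["box"][1]),
                max(a["box"][2], b["box"][2]), max(a["box"][3], b["box"][3])),
        "mean": (a["mean"] * a["size"] + b["mean"] * b["size"]) / size,
        "size": size,
    }


def propose_regions(regions):
    """Greedily merge the most similar pair until one region remains.

    Returns the bounding box of every region seen along the way.
    """
    regions = list(regions)
    proposals = [r["box"] for r in regions]  # step 2: boxes of initial segments
    while len(regions) > 1:
        # Step 3: pick the pair with the closest mean intensity.
        i, j = min(
            combinations(range(len(regions)), 2),
            key=lambda p: abs(regions[p[0]]["mean"] - regions[p[1]]["mean"]),
        )
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged["box"])  # step 4: back to step 2
    return proposals


# Hypothetical oversegmentation (step 1) of a tiny 8x8 image.
segments = [
    {"box": (0, 0, 4, 4), "mean": 10.0, "size": 16},
    {"box": (4, 0, 8, 4), "mean": 12.0, "size": 16},
    {"box": (0, 4, 8, 8), "mean": 200.0, "size": 32},
]
print(propose_regions(segments))
```

Starting from N segments, this loop produces 2N - 1 proposals, which is far fewer than an exhaustive sliding-window search.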


<img src="../../images/object_detection/selective-search.png" width="800">


[Uijlings et al. 2013](https://ivi.fnwi.uva.nl/isis/publications/bibtexbrowser.php?key=UijlingsIJCV2013&bib=all.bib)

# Deep Learning era

With the advent of deep learning, the performance of object detectors has improved dramatically.

<img src="../../images/object_detection/object-detection-history.png" width="500">

[Ground AI](https://www.groundai.com/project/object-detection-in-20-years-a-survey/1)



# Two stage methods

![](../../images/object_detection/two-stage.png)

1. Generate region proposals (instead of sliding window).
2. Classify each proposed region; if the feature response is strong enough, output a detection.

In addition to predicting the presence of an object within each region proposal, the algorithm also predicts corrections to the proposal's bounding box so that it fits the object more precisely.
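
One common way to express such a bounding-box correction, used in the R-CNN paper, is as four offsets relative to the proposal: two that shift the centre and two that rescale the width and height on a log scale. The sketch below applies such offsets; the function name and the `(cx, cy, w, h)` layout are assumptions made for this example.

```python
import math


def apply_offsets(proposal, offsets):
    """Refine a (cx, cy, w, h) proposal with predicted (tx, ty, tw, th)."""
    cx, cy, w, h = proposal
    tx, ty, tw, th = offsets
    return (
        cx + w * tx,       # shift the centre by a fraction of the width
        cy + h * ty,       # shift the centre by a fraction of the height
        w * math.exp(tw),  # rescale the width (always stays positive)
        h * math.exp(th),  # rescale the height
    )


proposal = (50.0, 50.0, 20.0, 40.0)
print(apply_offsets(proposal, (0.0, 0.0, 0.0, 0.0)))  # zero offsets: unchanged
print(apply_offsets(proposal, (0.1, -0.05, math.log(1.5), 0.0)))
```

Expressing the shift relative to the proposal size and the scale as a logarithm keeps the regression targets small and independent of the absolute box size.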

The R-CNN family of methods generates accurate results but is computationally heavy.

## R-CNN 

- Extract proposals via selective search [(Uijlings et al. 2013)](https://ivi.fnwi.uva.nl/isis/publications/bibtexbrowser.php?key=UijlingsIJCV2013&bib=all.bib)
- Scale each proposal to a common size, then extract features with a CNN
- Classify with an SVM

![](../../images/object_detection/r-cnn.png)

[Girshick et al. 2014](https://arxiv.org/pdf/1311.2524.pdf)

## Fast R-CNN

- Extract proposals via selective search.
- Extract features with a single CNN pass over the whole image, then classify each proposal with the same network.

![](../../images/object_detection/fast-r-cnn.png)

[Girshick 2015](https://arxiv.org/abs/1504.08083)

## Faster R-CNN

- Faster R-CNN does away with selective search.
- Instead it uses a separate network to predict the region proposals. 
- The predicted region proposals are then brought to a fixed size by an RoI pooling layer; the pooled features are used to classify the object within each proposed region and to predict offset values for its bounding box.

<img src="../../images/object_detection/faster-r-cnn.png" width="400" align="center"/>

[Ren et al. 2015](https://arxiv.org/pdf/1506.01497.pdf)
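
The RoI pooling step can be sketched as follows: whatever the size of the proposed region on the feature map, it is divided into a fixed grid and the maximum of each cell is kept. This is a simplified, single-channel illustration in plain Python rather than the layer from any particular library. It assumes integer RoI coordinates with exclusive maximum edges and a region at least as large as the output grid.

```python
def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool the region roi=(x_min, y_min, x_max, y_max) to a fixed grid."""
    x_min, y_min, x_max, y_max = roi
    out_h, out_w = output_size
    height, width = y_max - y_min, x_max - x_min

    # Bin edges that split the region into out_h x out_w roughly equal cells.
    row_edges = [y_min + (i * height) // out_h for i in range(out_h + 1)]
    col_edges = [x_min + (j * width) // out_w for j in range(out_w + 1)]

    return [
        [
            max(
                feature_map[r][c]
                for r in range(row_edges[i], row_edges[i + 1])
                for c in range(col_edges[j], col_edges[j + 1])
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 6x6 feature map whose entries increase from left to right, top to bottom.
feature_map = [[6 * r + c for c in range(6)] for r in range(6)]
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[7, 9], [19, 21]]
print(roi_max_pool(feature_map, (1, 2, 6, 5)))  # [[14, 17], [26, 29]]
```

Both regions, although of different shapes, produce a 2x2 output, which is what lets the subsequent fully connected layers accept proposals of any size.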


## Real-time object detection

Faster R-CNN is fast enough to approach real-time object detection.

![](../../images/object_detection/real-time.png)

[source](https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e)

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('zebSqDt6oMM', width=600, height=450)

# Single-shot methods

![](../../images/object_detection/single-shot.png)

- A single convolutional network predicts the bounding boxes and the class probabilities for these boxes.
- As these methods have only a single stage, they are typically faster than two-stage methods.



## YOLO: You Only Look Once

- Split the image into an SxS grid; for each grid cell, the network produces m bounding boxes.
- For each of the bounding boxes, the network outputs a class probability.
- For the bounding boxes with the highest class probability, it performs a regression to improve the boxes' precision.

<img src="../../images/object_detection/yolo.png" width="1000">

[Redmon et al. 2016](https://arxiv.org/abs/1506.02640)
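
As a rough illustration of the grid idea, the sketch below finds the grid cell that contains an object's centre and computes the shape of the network's output tensor. It follows the formulation in the original paper, where each box carries five numbers (x, y, w, h, confidence) and the class probabilities are shared per cell, which is slightly more specific than the simplified bullets above. The function names are hypothetical.

```python
def responsible_cell(box_center, image_size, grid_size):
    """Return (row, col) of the grid cell containing the box centre."""
    cx, cy = box_center
    width, height = image_size
    # Clamp so that a centre on the right or bottom edge stays inside the grid.
    col = min(int(cx / width * grid_size), grid_size - 1)
    row = min(int(cy / height * grid_size), grid_size - 1)
    return row, col


def output_shape(grid_size, boxes_per_cell, num_classes):
    """Prediction tensor shape: 5 numbers per box plus per-cell class scores."""
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)


print(responsible_cell((300, 100), image_size=(448, 448), grid_size=7))  # (1, 4)
print(output_shape(grid_size=7, boxes_per_cell=2, num_classes=20))       # (7, 7, 30)
```

With S=7, two boxes per cell and 20 classes, the whole image is described by a single 7x7x30 tensor produced in one forward pass.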

# Summary

- Object detection was originally performed using hand-crafted features
- Deep learning methods learn the features
    - Two stage methods e.g. R-CNN
    - Single-shot methods e.g. YOLO
- Deep learning has resulted in a significant improvement for object detection algorithms


##### Reference
[Kerola, T. 2019]()
