# 1 - Object Detection

Object detection involves locating multiple objects in an image by drawing boxes around them and assigning each one a class label. It combines image classification (categorizing the whole image) and localization (identifying the location of one object).

- **Image classification:** Labels the entire image with one class (cat or not-cat).
- **Localization:** Draws a bounding box around one object in the image.
- **Classification with localization:** Labels one object in the image and localizes it with a bounding box.
- **Object detection:** Labels multiple objects in the image and localizes each with a bounding box.

<br> 

<div style="text-align:center">
    <img src="media/object_detection.png" width=800>
</div>

## 1.1 - Classification with Localization
- **Technique:** Uses a Convolutional Neural Network (CNN) with a softmax layer for classification and additional output units $(b_x, b_y, b_H, b_W)$ for bounding box localization.

The target label $Y$ includes:

$$Y = \left[ \begin{matrix}
p_c \\
b_x \\
b_y \\
b_H \\
b_W \\
c_1 \\
c_2 \\
\dots \\
c_n \\
\end{matrix} \right]
$$

Where:
- $p_c$ is the probability that an object is present.
- $(b_x, b_y)$ is the center of the bounding box in $x,y$ coordinates relative to the image. By convention, the top-left corner of the image is $(0,0)$ and the bottom-right corner is $(1,1)$.
- $(b_H, b_W)$ are the height and width of the bounding box relative to the image dimensions.
- $c_i$ are the class indicators, one per class (excluding the default background class): $c_i = 1$ if the object belongs to class $i$ and $0$ otherwise.

For instance, if we have 3 classes plus a background class, $[\text{Cat, Dog, Human, Background}]$, where $\text{Background}$ is true when none of the other classes is detected (think of it as the default class):

- If a human is detected, $Y$ would look like this:

$$Y = \left[ \begin{matrix}
1 & \text{probability of object being present} \\
0.5 & \text{x-coordinate is at center} \\
0.7 & \text{y-coordinate is 70% from top} \\
0.3 & \text{height of bounding box is 30% of the image} \\
0.4 & \text{width of bounding box is 40% of the image} \\
0 & \text{not a cat} \\
0 & \text{not a dog} \\
1 & \text{human is detected}
\end{matrix} \right]
$$

- If none of the three classes are detected:

$$Y = \left[ \begin{matrix}
0 & \text{no object detected} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
\end{matrix} \right]
$$

The **loss function** could be defined as:

$$\mathcal{L} = 
\begin{cases}
\sum_{i=1}^{5+n} \left( \hat{y}_i - y_i \right)^2 & y_1=1 \\
\\
\left( \hat{y}_1 - y_1 \right)^2 & y_1 = 0\\
\end{cases}
$$

Where $y_1 = p_c$ and $5+n$ is the number of entries in $Y$:
- If an object is present ($y_1 = 1$), the loss is the sum of squared differences between predictions and ground truth over all entries (including the bounding box coordinates).
- If no object is present ($y_1 = 0$), the loss is just the squared difference of the first entry, as we don't care about the remaining entries.
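The two cases of the loss can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `localization_loss`, the example prediction `pred`, and the use of zeros as placeholders for the "don't care" entries are not part of the original notes.

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error loss for one example with label [p_c, b_x, b_y, b_H, b_W, c_1, ..., c_n]."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    if y[0] == 1:
        # Object present: every entry of the label contributes to the loss.
        return float(np.sum((y_hat - y) ** 2))
    # No object: only the p_c entry is penalized, the rest are "don't care".
    return float((y_hat[0] - y[0]) ** 2)

y_human = [1, 0.5, 0.7, 0.3, 0.4, 0, 0, 1]   # the "human detected" label from above
y_none = [0, 0, 0, 0, 0, 0, 0, 0]            # zeros stand in for the "don't care" entries
pred = [0.9, 0.5, 0.6, 0.3, 0.4, 0.1, 0.0, 0.8]  # hypothetical network output

print(localization_loss(pred, y_human))  # sums over all 8 entries
print(localization_loss(pred, y_none))   # only compares the first entry
```

In practice, different loss terms are often used for the different parts of the label (for example, a log-likelihood loss for the class entries); squared error on everything is the simplification used in this section.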

## 1.2 - Landmark Detection

Landmark detection involves identifying and localizing important key points on objects in an image.

<br>

<div style="text-align:center">
    <img src="media/landmark_detection.png" width=700>
    <caption><font color="purple"><br><u>Landmark Detection:</u> detecting landmarks (key points) on a human face. 129 neurons correspond to:<br>Face detected? (1 neuron)<br>For 64 landmarks, there are 64 (x,y) pairs, i.e. 128 coordinate values in total (128 neurons)</font></caption>
</div>


**Face Recognition:**
- Detect landmarks like the corners of the eyes, corners of the mouth, edges of the nose, etc. This allows detecting pose and emotions and can enable graphics filters.
- Needs consistent labeling across images (e.g. $l_1$ is always the left corner of the left eye).

**Human Pose Estimation:**
- Detect landmarks like the midpoint of the chest, shoulders, elbows, wrists, etc.
- Again, labeling needs to be consistent across images.

## 1.3 - Sliding Windows

The sliding window algorithm is used to detect objects in images by passing cropped regions of the image through a ConvNet.

<br>

<div style="text-align:center">
    <img src="media/sliding-windows.png" width=900>
</div>

- Create a labeled dataset of tightly cropped positive and negative images of the object (e.g. cars)
- Train a ConvNet on this dataset to recognize the object in small image regions
- To detect objects in a new image:
    - Define a window size (e.g. 100x100 pixels)
    - Crop out the region enclosed by the window
    - Feed this small image region into the ConvNet to classify if it contains the object
    - Slide the window across the image with some defined stride
    - Repeat this process with larger window sizes to detect larger objects

This allows scanning an image with "sliding windows" of different sizes to find objects.

- Main disadvantage:
    - Extremely computationally expensive to crop and run ConvNet on so many sub-images
    - Lower stride means more windows and better localization but even slower
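The scanning procedure can be sketched as a generator that yields every crop. The image size, window sizes and stride below are arbitrary assumptions chosen for illustration; each crop would then be passed to the trained ConvNet, which is not shown here.

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (top, left, crop) for every window position that fits inside the image."""
    height, width = image.shape[:2]
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left, image[top:top + window, left:left + window]

image = np.zeros((400, 600, 3))   # placeholder image (height, width, channels)
for window in (100, 200):         # repeat with a larger window for larger objects
    crops = list(sliding_windows(image, window, stride=50))
    print(window, len(crops))     # number of ConvNet evaluations for this window size
```

Halving the stride roughly quadruples the number of crops, which is where the computational cost of the method comes from.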

## 1.4 - Convolutional Implementation of Sliding Windows

A better implementation of a sliding window is its convolutional implementation. This involves turning fully connected layers into convolutional layers. To do this, consider the following example of a ConvNet with two FC layers:

<br> 

<div style="text-align:center">
    <img src="media/sliding-windows-conv-before.png" width=800>
</div>

We can replace each of these FC layers with an equivalent CONV layer by applying filters that have the same dimensions as the layer's input volume. By using several of these filters (in this example: 400), we create a CONV layer that is equivalent to the FC layer and has the same size in the third dimension.

<br>

<div style="text-align:center">
    <img src="media/sliding-windows-conv-after.png" width=800>
</div>
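The equivalence can be checked numerically. The sketch below assumes an input volume of $5 \times 5 \times 16$ feeding an FC layer with 400 units (the shapes are an assumption, only the 400 is stated in the text) and uses random weights purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5, 16))          # input volume (assumed shape)
W = rng.normal(size=(400, 5 * 5 * 16))   # FC weights: 400 units
b = rng.normal(size=400)                 # FC biases

# Fully connected layer: flatten the volume, then a matrix-vector product.
fc_out = W @ x.reshape(-1) + b           # shape (400,)

# Equivalent CONV layer: 400 filters of size 5x5x16. Each filter covers the
# whole input, so there is exactly one valid position and the output is 1x1x400.
filters = W.reshape(400, 5, 5, 16)
conv_out = (np.einsum("fhwc,hwc->f", filters, x) + b).reshape(1, 1, 400)

print(np.allclose(fc_out, conv_out.ravel()))
```

The weights are the same numbers in both cases; only their arrangement changes, which is why the CONV version can later be applied to larger inputs.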

This method of replacing FC layers with equivalent CONV layers can be used to implement the sliding windows method by using convolution. Consider the following example of a ConvNet with an input image of size $14 \times 14 \times 3$ (note that the third dimension has been left out in the drawing for simplicity). This image is processed by a $5 \times 5$ convolution, max-pooling and further convolution layers.

<br>

<div style="text-align:center">
    <img src="media/sliding-windows-conv-1.png" width=800>
</div>

We can now implement the sliding window method by enlarging the input by two pixels in each dimension, which results in an input image of dimensions $16 \times 16 \times 3$. We apply the same $5 \times 5$ convolution, pooling and further convolution layers as before. However, because the input dimensions have changed, the dimensions of the intermediate matrices are also different.

<br>

<div style="text-align:center">
    <img src="media/sliding-windows-conv-2.png" width=800>
</div>

It turns out, however, that the upper-left corner of the resulting matrix (blue square) is the result of the upper-left area of the input image (blue area). The other squares in the output matrix correspond to the other areas of the input image accordingly. This mapping applies to the intermediate matrices too. The advantage of calculating the result convolutionally is that the forward propagations of all areas are combined into one pass, sharing much of the computation in the regions of the image that are common. In contrast, with the sliding window method as described above, forward propagation would be done several times independently (once for each position of the sliding window). A convolutional implementation of sliding windows therefore has considerably lower computational cost than a sequential implementation, because all positions of the sliding window are computed in a single forward pass.
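The output sizes can be traced with a little arithmetic. The sketch below assumes the layer stack is a $5 \times 5$ convolution, a $2 \times 2$ max-pooling with stride 2, and a $5 \times 5$ convolution replacing the first FC layer, followed by $1 \times 1$ convolutions; the pooling size and the helper names are assumptions, not taken from the text.

```python
def out_size(n, f, stride=1):
    """Spatial output size of an unpadded convolution or pooling with an f x f window."""
    return (n - f) // stride + 1

def network_out(n):
    """Spatial size of the final output for an n x n input image."""
    n = out_size(n, 5)            # 5x5 convolution
    n = out_size(n, 2, stride=2)  # 2x2 max-pooling (assumed)
    n = out_size(n, 5)            # former FC layer as a 5x5 convolution
    return n                      # subsequent 1x1 convolutions keep the size

for n in (14, 16, 28):
    print(n, network_out(n))
```

Under these assumptions a $14 \times 14$ input gives a $1 \times 1$ output, while a $16 \times 16$ input gives $2 \times 2$, one output cell per window position, with the pooling stride of 2 acting as the stride of the sliding window.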

## 1.5 - YOLO

One disadvantage of the sliding windows method (even with a convolutional implementation) is that no window position may line up well with an object. That is, at every position of the sliding window, parts of the object may lie outside the window, so the ConvNet detects the object poorly or not at all. One approach to solve this is the **YOLO algorithm** <i>(You Only Look Once)</i>. This algorithm divides the image into a grid of equally sized cells.

<br>

<div style="text-align:center">
    <img src="media/yolo.png" width=500>
</div>

The basic idea is to then apply the classification-with-localization approach described above to each of the cells (9 cells above). For this, there must be a label for each grid cell, encoded as a label vector $Y$. Whether an object belongs to a specific grid cell or not is determined by the coordinates of the center of its bounding box. That way, an object can only ever belong to exactly one cell.

The output for YOLO-networks is a volume, not a two-dimensional matrix. If we have for example 3 classes, the label vector would consist of 8 elements $(p_c,b_x,b_y,b_H,b_W,c_1,c_2,c_3)$. If we divide the image into 9 segments as in the picture above, we can detect one object per segment resulting in a label matrix (and therefore also the output matrix $Y$ of the CNN) of dimensions $(3 \times 3 \times 8)$ as for each cell in the $3 \times 3$ grid, we have the vector $Y$ of size 8.


The following picture shows an example of a CNN that uses YOLO to divide an image into $19 \times 19$ segments and distinguishes between 80 classes and shows how it encodes an image:

<br>

<div style="text-align:center">
    <img src="media/yolo-architecture.png">
    <caption><font color="purple">Spatial dimensions are reduced from 608 by a factor of 32 (pooling layers, strides, etc.) to 608/32 = 19. For each cell in the grid (19, 19), the network has 5 anchor boxes. Each anchor box is encoded by 5 bounding-box values $(p_c,b_x,b_y,b_H,b_W)$ together with 80 class probabilities, resulting in 85 values (5 + 80) per anchor box.</font></caption>
</div>

Note that the coordinates $b_x, b_y$ are relative to the grid cell, not the image. Therefore, their values lie between 0 and 1. Likewise, the size of the bounding box (specified by $b_H$ and $b_W$) is relative to the size of the grid cell, but since the bounding box may extend over several grid cells, those values can become greater than 1 (whereas in an algorithm without a grid, the size of the bounding box cannot be greater than the image).
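As a rough sketch of how such a label volume could be built for a $3 \times 3$ grid with 3 classes and one object per cell: the function name `encode_labels`, the input format (image-relative center, height, width and a class index) and the zeros used for empty cells are assumptions made for illustration.

```python
import numpy as np

def encode_labels(objects, grid=3, n_classes=3):
    """Build a (grid, grid, 5 + n_classes) label volume.

    Each object is (x, y, h, w, class_index), with x, y, h, w relative to the image (0..1).
    """
    Y = np.zeros((grid, grid, 5 + n_classes))
    for x, y, h, w, cls in objects:
        col = min(int(x * grid), grid - 1)   # cell that contains the box center
        row = min(int(y * grid), grid - 1)
        Y[row, col, 0] = 1.0                 # p_c
        Y[row, col, 1] = x * grid - col      # b_x relative to the cell, in [0, 1)
        Y[row, col, 2] = y * grid - row      # b_y relative to the cell, in [0, 1)
        Y[row, col, 3] = h * grid            # b_H in units of cell height (may exceed 1)
        Y[row, col, 4] = w * grid            # b_W in units of cell width (may exceed 1)
        Y[row, col, 5 + cls] = 1.0           # one-hot class entry
    return Y

# One object centered at (0.5, 0.7), 30% of the image high and 40% wide, class index 2.
Y = encode_labels([(0.5, 0.7, 0.3, 0.4, 2)])
print(Y.shape)
print(Y[2, 1])   # the only cell with p_c = 1
```

Here the width becomes 1.2 in cell units, which illustrates how $b_W$ can exceed 1 even though the box is smaller than the image.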

The YOLO algorithm is a convolutional implementation of the sliding window method which makes it performant by sharing a lot of the computation. In fact, the YOLO algorithm is so performant that it can be used for real-time object detection.

### 1.5.1 - Improvements to YOLO

There are a few ways to improve the performance of the YOLO algorithm.

- Intersection over union
- Non-max suppression
- Anchor boxes

### Intersection over union
An improvement of the YOLO algorithm can be achieved by calculating how much a predicted bounding box overlaps with the actual bounding box.

<br>

<div style="text-align:center">
    <img src="media/iou.png" width=600>
</div>

This value is called Intersection over Union (IoU) and can be more formally defined as:

$$IoU = \frac{\text{size of the overlapping part (intersection)}}{\text{size of the combined bounding boxes (union)}}$$

The IoU can be used to evaluate the accuracy of the localization. When $IoU > 0.5$ (or any reasonable value) the localization could be marked as correct.
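A minimal sketch of the computation, assuming boxes are given by their corner coordinates `(x1, y1, x2, y2)` rather than the center/size encoding used elsewhere in these notes:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to 0 when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union area 7
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```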

### Non-max suppression
YOLO can further be improved by using non-max suppression (NMS). NMS prevents YOLO from detecting the same object multiple times, e.g. if the YOLO algorithm calculates several bounding boxes for the same object with their center coordinates assigned to different cells. It does so by assigning the first component $p_c$ of a label vector $Y$ a value between 0 and 1 (whereas before this value was binary, i.e. either 0 or 1). The value of $p_c$ corresponds to the confidence of the algorithm that the bounding box contains the respective object. NMS then searches for overlapping boxes and removes all but the one with the highest value of $p_c$. The following image illustrates this (the blue boxes are the ones that are removed, the green boxes are the ones that are retained):

<br>

<div style="text-align:center">
    <img src="media/nms.png" width=600>
</div>

The algorithm for NMS is as follows:

```python
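def non_max_suppression(boxes, iou, pc_min=0.6, iou_max=0.5):
    """boxes: list of (pc, box) pairs; iou: function returning the IoU of two boxes."""
    # Delete all bounding boxes with pc <= pc_min and sort the rest by descending pc
    boxes = sorted((b for b in boxes if b[0] > pc_min), key=lambda b: b[0], reverse=True)
    predictions = []
    while boxes:                          # while there are bounding boxes
        best = boxes.pop(0)               # take the bounding box with the largest pc
        predictions.append(best)          # output this as a prediction
        boxes = [b for b in boxes if iou(best[1], b[1]) <= iou_max]  # drop overlapping boxes
    return predictions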
```

### Anchor boxes
To enable the YOLO algorithm to detect more than one object per cell, we can use anchor boxes. Anchor boxes are predefined box shapes (e.g. one tall and one wide), so that objects whose midpoints fall into the same cell can still be told apart by their shape. An object is assigned to the grid cell that contains its midpoint and, within that cell, to the anchor box with the highest IoU with the object's bounding box (whereas previously the object was assigned only to the grid cell). To predict more than one object per cell, the label vectors are stacked on top of each other in the output. The following picture illustrates this for two anchor boxes (i.e. a YOLO network that can detect two objects per cell).

<br>

<div style="text-align:center">
    <img src="media/anchor-boxes.png" width=500>
</div>
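The anchor assignment can be sketched by comparing shapes only, i.e. computing the IoU of two boxes as if they shared the same center. The anchor shapes and the helper names `shape_iou` and `best_anchor` are hypothetical and chosen for illustration.

```python
def shape_iou(hw_a, hw_b):
    """IoU of two boxes with the same center, each given as (height, width)."""
    (ha, wa), (hb, wb) = hw_a, hw_b
    inter = min(ha, hb) * min(wa, wb)
    return inter / (ha * wa + hb * wb - inter)

def best_anchor(box_hw, anchors):
    """Index of the anchor whose shape has the highest IoU with the box."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_hw, anchors[i]))

anchors = [(2.0, 1.0), (1.0, 2.0)]       # hypothetical tall and wide anchors (height, width)
print(best_anchor((1.8, 0.7), anchors))  # a tall box, e.g. a pedestrian
print(best_anchor((0.6, 1.9), anchors))  # a wide box, e.g. a car
```

With two anchors, the label for a cell holds two stacked vectors, and each object is written into the slot of its best-matching anchor.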

### YOLO Algorithm

Assume we train a ConvNet using YOLO by dividing the images with a $3 \times 3$ grid, where the ConvNet should be able to detect 2 objects per grid cell and we have 3 possible classes of objects. The dimensions of the output matrix of such a ConvNet would therefore be $3 \times 3 \times (2 \cdot 8) = 3 \times 3 \times 16$.

Generally speaking the dimensions of the output $y$ of a ConvNet using YOLO with a $k \times l$ grid, $n_c$ classes and $m$ anchor boxes can be calculated as follows:

$$dim(y) = k \times l \times (m \cdot (5 + n_c))$$

The term $(5+n_c)$ stems from the five components needed to indicate the confidence and the coordinates/size of a bounding box $(p_c,b_x,b_y,b_H,b_W)$ together with the $n_c$ components needed to form a one-hot vector of a class.
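The formula is easy to check against the two configurations mentioned in these notes. The function name `yolo_output_shape` is only an illustrative choice.

```python
def yolo_output_shape(k, l, n_classes, n_anchors):
    """Shape of the YOLO output for a k x l grid: k x l x (m * (5 + n_c))."""
    return (k, l, n_anchors * (5 + n_classes))

print(yolo_output_shape(3, 3, n_classes=3, n_anchors=2))     # the 3x3 grid example
print(yolo_output_shape(19, 19, n_classes=80, n_anchors=5))  # the 19x19 grid with 80 classes
```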

A CNN that uses YOLO can therefore do the following:

1. Divide the picture into a grid of equally sized segments by using the YOLO-algorithm
2. Detect and locate several objects in an image by applying a convolutional implementation of a sliding window
3. Calculate the bounding boxes by training on the labelled data
4. Use anchor boxes to detect multiple different objects within the same grid cell
5. Apply non-max suppression to keep only one bounding box per object
6. Output a matrix of dimensions $dim(y)$ which represents the encoding of the information

If we visualize the output matrix from the last step it might look something like this:

<br>

<div style="text-align:center">
    <img src="media/yolo-output-example.png" width=600>
</div>

## 1.6 - R-CNN (Regions with CNN)

R-CNN first selects region proposals that may contain objects, then runs a CNN classifier on each region. This allows focusing on promising object regions instead of scanning the full image.

One of the downsides of YOLO is that it processes a lot of areas where no objects are present. R-CNN stands for regions with ConvNets. R-CNN tries to pick a few windows (region proposals) and run a ConvNet on top of them. The algorithm R-CNN uses to pick windows is called a `segmentation algorithm` and it outputs something like this:

<br>

<div style="text-align:center">
    <img src="media/segmentation.png" width=700>
</div>


If, for example, the segmentation algorithm produces 2000 blobs, then we run our classifier/CNN on each of these blobs. There has been a lot of work on making R-CNN faster:


**R-CNN**
- Propose regions. Classify proposed regions one at a time. Output label + bounding box.
- Downside is that it's slow.
- [Girshick et al., 2013. Rich feature hierarchies for accurate object detection and semantic segmentation]


**Fast R-CNN**
- Propose regions. Use a convolutional implementation of sliding windows to classify all the proposed regions.
- [Girshick, 2015. Fast R-CNN]


**Faster R-CNN**
- Use a convolutional network to propose regions.
- [Ren et al., 2016. Faster R-CNN: Towards real-time object detection with region proposal networks]


**Mask R-CNN**
- https://arxiv.org/abs/1703.06870

Most implementations of Faster R-CNN are still slower than YOLO. According to the YOLO authors:
> Our model has several advantages over classifier-based systems. It looks at the whole image at test time so its predictions are informed by global context in the image. It also makes predictions with a single network evaluation unlike systems like R-CNN which require thousands for a single image. This makes it extremely fast, more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.