# Convolutional Neural Networks
## Object Detection

Author: Binghen Wang

Last Updated: 19 Dec, 2022

<nav>
    <b>Deep learning navigation:</b> <a href="./Deep Learning Basics.ipynb">Deep Learning Basics</a> |
    <a href="./Deep Learning Optimization.ipynb">Optimization</a>
    <br>
    <b>CNN navigation:</b> <a href="./Convolutional Neural Networks.ipynb">CNN Basics</a>
</nav>

---
<nav>
    <a href="../Machine%20Learning.ipynb">Machine Learning</a> |
    <a href="../Supervised Learning/Supervised%20Learning.ipynb">Supervised Learning</a>
</nav>

---

## Contents

- [Object Classification with Localization](#OCL)
- [Object Detection](#OD)
    - [Sliding Windows](#SW)
- [YOLO Algorithm](#YOLO)
    - [Bounding Boxes](#BB)
    - [Intersection Over Union](#IoU)
    - [Non-max Suppression](#NMS)
    - [Anchor Boxes](#AB)
- [Semantic Segmentation](#SS)
    - [Transposed Convolution](#TC)
    - [U-Net](#UN)

The models discussed so far mainly address **object classification**: given an input image, the network assigns it to one of the known classes. CNNs can also be used for more complicated tasks.

<a name = 'OCL'></a>
## Object Classification with Localization
An extension of the classification algorithm is to also output a rectangle (bounding box) surrounding the identified object. This process is called **localization**, and it can be achieved by modifying the output layer of the CNN.

<div style = "text-align: center;">
    <img src="./images/Object classification with localization.png" style="width:90%;" >
</div>


#### Example
Consider a classification with localization problem with 4 classes (pedestrian, car, motorcycle, background). The output/target is an $8\times1$ vector of the following form:
$$
y = \left[\begin{array}{c}
p_c \\
b_x \\
b_y \\
b_h \\
b_w \\
c_1 \\
c_2 \\
c_3 \\
\end{array}\right] \begin{array}{l}
\cdots \text{is there any object?} \\
\cdots \text{bounding box location x} \\
\cdots \text{bounding box location y} \\
\cdots \text{bounding box height} \\
\cdots \text{bounding box width} \\
\cdots \text{is the object a pedestrian?} \\
\cdots \text{is the object a car?} \\
\cdots \text{is the object a motorcycle?} 
\end{array}
$$
The loss function can be defined as:
$$
L(\hat y, y) = \text{BinaryCrossEntropy}(\hat y_1, y_1) + \mathbb{1}\{y_1 = 1\} \cdot \sum_{j=2}^{j=5}{(\hat y_j - y_j)}^2 + \mathbb{1}\{y_1 = 1\} \cdot \text{CrossEntropy}\left(\left[\begin{array}{c} \hat y_6 \\ \hat y_7 \\ \hat y_8 \end{array}\right], \left[\begin{array}{c} y_6 \\ y_7 \\ y_8 \end{array} \right]\right)
$$
Therefore, <br>
if $y_1 = 1$,
$$
L(\hat y, y) = \text{BinaryCrossEntropy}(\hat y_1, y_1) + \sum_{j=2}^{j=5}{(\hat y_j - y_j)}^2 + \text{CrossEntropy}\left(\left[\begin{array}{c} \hat y_6 \\ \hat y_7 \\ \hat y_8 \end{array}\right], \left[\begin{array}{c} y_6 \\ y_7 \\ y_8 \end{array} \right]\right)
$$
if $y_1 = 0$,
$$
L(\hat y, y) = \text{BinaryCrossEntropy}(\hat y_1, y_1)
$$
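
The loss above can be sketched in plain Python. This is a minimal illustration rather than a training-ready implementation: the function names, the `eps` clipping constant, and the assumption that `y_hat[5:8]` are already softmax probabilities are all choices made here for the example.

```python
import math

def binary_cross_entropy(p_hat, y, eps=1e-12):
    # Clip the prediction so that log() never receives 0.
    p_hat = min(max(p_hat, eps), 1 - eps)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

def cross_entropy(p_hat, y, eps=1e-12):
    # p_hat: predicted class probabilities, y: one-hot target.
    return -sum(t * math.log(max(p, eps)) for p, t in zip(p_hat, y))

def localization_loss(y_hat, y):
    """y and y_hat are length-8 sequences [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]."""
    loss = binary_cross_entropy(y_hat[0], y[0])
    if y[0] == 1:
        # Box and class terms only count when an object is present.
        loss += sum((a - b) ** 2 for a, b in zip(y_hat[1:5], y[1:5]))
        loss += cross_entropy(y_hat[5:8], y[5:8])
    return loss

# With no object, only the p_c term contributes, whatever the other outputs are.
no_object = localization_loss([0.5, 9, 9, 9, 9, 0.3, 0.3, 0.4], [0] * 8)
```

When $y_1 = 0$ the remaining seven outputs are "don't care" values, which is why they can be arbitrary in the last line.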

<a name ='OD'></a>
## Object Detection
<a name ='SW'></a>
### Sliding Windows
One simple yet computationally costly way to turn an object classification algorithm into an object detection algorithm is through the use of sliding windows. To save on computation, one could use the convolutional implementation of sliding windows.

<div style = "text-align: center;">
    <img src="./images/sliding windows.png" style="width:100%;" >
</div>

A **drawback** of the sliding windows approach is that the size of sliding windows is **fixed** (by the original classification algorithm input). To achieve better localization, we want windows of different sizes.
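
To make the fixed-size issue concrete, the following sketch enumerates the window positions of a naive (non-convolutional) sliding-window pass. The function name and the 16×16 image with a 14×14 window and stride 2 are illustrative assumptions, not values taken from the figure.

```python
def sliding_windows(height, width, window, stride):
    """Yield the (top, left) corner of every window that fits in the image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left

# A 14x14 classifier slid over a 16x16 image with stride 2 gives a 2x2 grid of crops.
corners = list(sliding_windows(16, 16, window=14, stride=2))
print(corners)  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Every crop has the same `window` size, so each one would be classified separately in the naive version; the convolutional implementation shares the computation across overlapping crops.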

<a name ='YOLO'></a>
## YOLO Algorithm

YOLO stands for 'You Only Look Once'. It was invented by <a href = 'https://arxiv.org/pdf/1506.02640.pdf'>Redmon et al. (2015)</a>, making use of the **convolutional implementation of sliding windows** and **object classification with localization**.

<div style = "text-align: center;">
    <img src="./images/YOLO Algorithm.png" style="width:50%;" ><br>
    Source: <a href = 'https://arxiv.org/pdf/1506.02640.pdf'>Redmon et al. (2015)</a>
</div>

<a name ='BB'></a>
### Bounding Box Predictions for Grid Cells
The algorithm works by:
- dividing the input into an $S\times S$ grid
- for each grid cell, predicting $B$ bounding boxes, confidences for those boxes, and $C$ class probabilities.
- labeling a grid cell as containing an object if the midpoint of the object falls into it.

#### Example
**Three classes**: {pedestrian, car, motorcycle}. <br>
**The target label for each grid cell** is of length 8: ${\left[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3 \right]}^{\mathsf{T}}$
<div style = "text-align: center;">
    <img src="./images/bounding box predictions.png" style="width:80%;" >
</div>
The diagonal-stripes-shaded cell has the label
$$
{\left[1, 0.4, 0.15, 0.5, 0.85, 0, 1, 0 \right]}^{\mathsf{T}},
$$
whereas the checkered cell has label
$$
{\left[0, ?, ?, ?, ?, ?, ?, ? \right]}^{\mathsf{T}}.
$$

<br>
<div class = "alert alert-block alert-success"><b>Note:</b> the bounding box can exceed the boundaries of a grid cell. </div>
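
The target label of a grid cell can be built as in the sketch below. It rests on assumptions that the notes do not spell out: the object's box is given as a centre `(cx, cy)` and size `(h, w)` in image coordinates normalized to $[0, 1]$, $b_x$ and $b_y$ are measured from the top-left corner of the cell, and $b_h$ and $b_w$ are expressed in units of the cell size. The function name `encode_label` is hypothetical.

```python
def encode_label(cx, cy, h, w, class_id, S=3, num_classes=3):
    """Return ((row, col), label) for the grid cell that contains the box midpoint."""
    col = min(int(cx * S), S - 1)  # the cell is chosen by the midpoint only
    row = min(int(cy * S), S - 1)
    b_x = cx * S - col             # midpoint offset inside the cell, in [0, 1)
    b_y = cy * S - row
    b_h = h * S                    # size relative to the cell; may exceed 1
    b_w = w * S
    one_hot = [1 if k == class_id else 0 for k in range(num_classes)]
    return (row, col), [1, b_x, b_y, b_h, b_w] + one_hot

# A car (class index 1) centred in the image, wider than a single cell.
cell, label = encode_label(0.5, 0.5, h=0.2, w=0.4, class_id=1)
```

Here `cell` is the centre cell `(1, 1)`, and the width entry of `label` is larger than 1 because the box is wider than the cell.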

<a name ='IoU'></a>
### Intersection Over Union
A predicted bounding box will generally not coincide exactly with the labelled bounding box. To evaluate the quality of the localization, we use **intersection over union** (IoU): the ratio of the area of the intersection of the two bounding boxes to the area of their union. By convention, an $\text{IoU} \geq 0.5$ is considered a **correct localization**.

<div style = "text-align: center;">
    <img src="./images/IoU.png" style="width:20%;" >
</div>
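
A minimal IoU sketch, assuming each box is given by its corner coordinates `(x1, y1, x2, y2)` with `x1 < x2` and `y1 < y2`. This corner format is chosen for the example and differs from the midpoint/size encoding used in the labels.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
value = iou((0, 0, 2, 2), (1, 1, 3, 3))
```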

<a name ='NMS'></a>
### Non-max Suppression
In the inference phase, multiple bounding boxes could be predicted for the same object. (This happens when multiple grid cells 'think' they contain the midpoint of the object.) To make sure that only one bounding box (the best one) is predicted for each object, the **non-max suppression algorithm** can be used. It works as follows:

<blockquote>
    For each object class (e.g. $c_1, c_2, c_3$):
    <blockquote>
    Rank all the bounding boxes by confidence $p_c$. <br>
    Repeat while there are any remaining bounding boxes:
    <blockquote>
        <ul>
            <li> Pick the box with the current largest confidence and output it as a prediction.
            <li> Delete all boxes that have an IoU $\geq 0.5$ with the selected box.
        </ul>
    </blockquote>
    </blockquote>
</blockquote>
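
The inner loop of the procedure above, for a single class, can be sketched as follows. The corner format `(x1, y1, x2, y2)` for boxes and the function names are assumptions made for this example; in a multi-class setting the function would be called once per class.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that are kept, highest confidence first."""
    # Rank all boxes by confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # box with the current largest confidence
        keep.append(best)
        # Delete every remaining box whose IoU with it is >= the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of $81/119 \approx 0.68$, so it is suppressed, while the third box does not overlap and survives.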

<a name ='AB'></a>
### Anchor Boxes
While the chance of the midpoints of two objects falling into the same grid cell is small when grid cells are small, **anchor boxes** were introduced to deal with this possibility. The approach pre-specifies a few anchor boxes of different shapes and then assigns each object to the anchor box that has the **largest IoU** with the object's bounding box.

One way to pick the shapes of anchor boxes is through **K-means** clustering of the bounding-box shapes of commonly observed objects. Anchor boxes also allow the detection algorithm to specialize better in detecting objects of different shapes.
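
The assignment step can be sketched as below. It assumes that only shapes are compared, that is, the object box and each anchor box are treated as if they shared the same midpoint, and that shapes are given as `(height, width)` pairs. The two anchor shapes are made-up values, one tall and one wide.

```python
def shape_iou(hw_a, hw_b):
    """IoU of two boxes with a common midpoint, each given as (height, width)."""
    inter = min(hw_a[0], hw_b[0]) * min(hw_a[1], hw_b[1])
    union = hw_a[0] * hw_a[1] + hw_b[0] * hw_b[1] - inter
    return inter / union

def assign_anchor(object_hw, anchors):
    """Return the index of the anchor box with the largest IoU with the object."""
    return max(range(len(anchors)), key=lambda k: shape_iou(object_hw, anchors[k]))

anchors = [(0.9, 0.3), (0.3, 0.9)]                   # hypothetical: tall, then wide
pedestrian_slot = assign_anchor((0.8, 0.25), anchors)  # tall object
car_slot = assign_anchor((0.4, 0.75), anchors)         # wide object
print(pedestrian_slot, car_slot)  # 0 1
```

The returned index decides which 8-entry slot of the grid cell's label the object is written into.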

#### Example

<div style = "text-align: center;">
    <img src="./images/anchor boxes.png" style="width:50%;" >
</div>

The output shape is $3\times 3\times 16$ or $3 \times 3 \times 2 \times 8$. For this input photo, grid cell 8 has the following label:

$$
{\left[\underbrace{0, ?, ?, ?, ?, ?, ?, ?}_{\text{anchor box 1}}, \underbrace{1, 0.3, 0.3, 0.4, 0.75, 0, 1, 0}_{\text{anchor box 2}} \right]}^{\mathsf{T}}
$$

Grid cell 9 has the following label:

$$
{\left[\underbrace{1, 0.7, 0.4, 0.6, 0.25, 1, 0, 0}_{\text{anchor box 1}}, \underbrace{1, 0.18, 0.3, 0.4, 0.75, 0, 1, 0}_{\text{anchor box 2}} \right]}^{\mathsf{T}}
$$

<a name = 'SS'></a>
## Semantic Segmentation

**Semantic segmentation** is the labelling of **each pixel** with the class it belongs to. It allows for finer boundaries between different objects within an image and has been applied widely in medical image analysis and autonomous driving.

<a name ='TC'></a>
### Transposed Convolution
The previous classification and detection algorithms follow the general pattern of decreasing height and width as an input passes through the network. Yet to label each pixel with a class, the output of the network must have the same height and width as the input. This is achieved by using **transposed convolutional layers**.

<div style = "text-align: center;">
    <img src="./images/transposed convolution.png" style="width:75%;" >
</div>

A good explanation of transposed convolution: <a href ="https://naokishibuya.medium.com/up-sampling-with-transposed-convolution-9ae4f2df52d0">here</a>.
<blockquote>
    [W]e <b>can emulate the transposed convolution using a convolution</b>. We up-sample the input by adding zeros between the values in the input matrix in a way that the direct convolution produces the same effect as the transposed convolution. You may find some article explains the transposed convolution in this way. However, it is less efficient due to the need to add zeros to up-sample the input before the convolution. <br>
    <div style = "text-align: right">– Up-sampling with Transposed Convolution, Naoki (2017)</div>
</blockquote>
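
A direct (non-emulated) transposed convolution can be sketched in plain Python as follows. The sketch assumes a single channel and no padding, so the output size is $(n - 1)\cdot s + f$ for input size $n$, stride $s$, and filter size $f$; the function name is chosen for this example.

```python
def transposed_conv2d(x, kernel, stride=1):
    """Single-channel transposed convolution without padding."""
    n_h, n_w = len(x), len(x[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (n_h - 1) * stride + k_h
    out_w = (n_w - 1) * stride + k_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(n_h):
        for j in range(n_w):
            # Each input value scales the whole kernel and is "stamped" onto the
            # output at an offset of stride * (i, j); overlapping stamps add up.
            for a in range(k_h):
                for b in range(k_w):
                    out[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return out

x = [[1, 2], [3, 4]]
ones = [[1, 1], [1, 1]]
upsampled = transposed_conv2d(x, ones, stride=2)  # 2x2 input -> 4x4 output
overlapped = transposed_conv2d(x, ones, stride=1)  # 2x2 input -> 3x3 output
```

With stride 2 the stamps do not overlap, so each input value is simply repeated in a 2×2 block; with stride 1 the overlapping contributions are summed.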

<a name ='UN'></a>
### U-Net

<div style = "text-align: center;">
    <img src="./images/U-net.png" style="width:75%;" > <br>
    Source: <a href = "https://arxiv.org/pdf/1505.04597.pdf">Olaf Ronneberger, Philipp Fischer, and Thomas Brox (2015)</a>
</div>

**Note**:
- The up-conv in the picture refers to the **transposed convolution**.
- The **skip connection** here *differs from* that in ResNets in that it takes the early activations and **concatenates** them with the activations from the normal strand. In ResNets, skip connections take the early activations and *add* them to the linear part before feeding the sum to the activation function.
- The **skip connection** helps provide high-resolution low-level feature information, while the usual strand provides the low-resolution high-level contextual information.
- The final **out_channels** equals the number of classes $n_c$ of the classification problem, which happens to be 2 in the above illustration.
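
The difference between the two kinds of skip connection shows up in the channel count. The sketch below represents a feature map as a nested list indexed `[channel][row][column]`; this layout, the helper `make_maps`, and the channel counts are assumptions made for illustration.

```python
def concat_skip(encoder_maps, decoder_maps):
    """U-Net style: stack the channels, so the channel counts add up."""
    return encoder_maps + decoder_maps

def residual_add(shortcut_maps, main_maps):
    """ResNet style: element-wise sum, so the channel count is unchanged."""
    return [
        [[s + m for s, m in zip(s_row, m_row)] for s_row, m_row in zip(s_ch, m_ch)]
        for s_ch, m_ch in zip(shortcut_maps, main_maps)
    ]

def make_maps(channels, value, size=2):
    """Hypothetical helper: `channels` constant feature maps of size x size."""
    return [[[value] * size for _ in range(size)] for _ in range(channels)]

merged = concat_skip(make_maps(2, 1.0), make_maps(3, 2.0))   # 2 + 3 = 5 channels
summed = residual_add(make_maps(2, 1.0), make_maps(2, 2.0))  # still 2 channels
print(len(merged), len(summed))  # 5 2
```

Concatenation requires matching height and width but allows different channel counts, whereas addition requires all three dimensions to match.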