# 1 - Object Detection

Object detection involves locating multiple objects in an image by drawing boxes around them and assigning each one a class label. It combines image classification (categorizing the whole image) and localization (identifying the location of one object).

- **Image classification:** Labels the entire image with one class (cat or not-cat).
- **Localization:** Draws a bounding box around one object in the image.
- **Classification with localization:** Labels one object in the image and localizes it with a bounding box.
- **Object detection:** Labels multiple objects in the image and localizes each with a bounding box.

<br> 

<div style="text-align:center">
    <img src="media/object_detection.png" width=800>
</div>

## 1.1 - Classification with Localization
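- **Technique:** Uses a Convolutional Neural Network (CNN) with a softmax layer for classification and four additional outputs $(bx, by, bh, bw)$ for bounding box localization. By convention, $(bx, by)$ is the center of the bounding box, and all four values are expressed as fractions of the image size, measured from the top-left corner of the image.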

The target label $Y$ includes:

$$Y = \left[ \begin{matrix}
Pc \\
bx \\
by \\
bh \\
bw \\
c_1 \\
c_2 \\
\vdots \\
c_n \\
\end{matrix} \right]
$$

Where:
- $Pc$ is the probability that an object is present.
- $(bx, by)$ are the $x,y$ coordinates of the center of the bounding box.
- $(bh, bw)$ are the height and width of the bounding box.
- $c_i$ are the class indicators, one per class (excluding the default background class).

For instance, suppose we have 3 classes, $[\text{Cat, Dog, Human}]$, plus a default $\text{Background}$ class that applies when none of the three is detected:

- If a human is detected, $Y$ would look like this:

$$Y = \left[ \begin{matrix}
1 & \text{probability of object being present} \\
0.5 & \text{x-coordinate is at center} \\
0.7 & \text{y-coordinate is 70% from top} \\
0.3 & \text{height of bounding box is 30% of the image height} \\
0.4 & \text{width of bounding box is 40% of the image width} \\
0 & \text{not a cat} \\
0 & \text{not a dog} \\
1 & \text{human is detected}
\end{matrix} \right]
$$

- If none of the three classes are detected:

$$Y = \left[ \begin{matrix}
0 & \text{no object detected} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
? & \text{don't care value} \\
\end{matrix} \right]
$$
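As a minimal sketch of how such target vectors could be assembled, the snippet below builds both cases. The helper name `make_target`, the class order in `CLASSES`, and the use of `None` for "don't care" entries are assumptions made for illustration, not part of the original notes.

```python
# Hypothetical class order, matching the Cat/Dog/Human example.
CLASSES = ["cat", "dog", "human"]


def make_target(label=None, box=None):
    """Build Y = [Pc, bx, by, bh, bw, c_1, ..., c_n] as a flat list.

    label: class name, or None when only background is present.
    box:   (bx, by, bh, bw) as fractions of the image size.
    "Don't care" entries are represented by None.
    """
    n = len(CLASSES)
    if label is None:
        # No object: Pc = 0 and every other entry is a don't-care value.
        return [0.0] + [None] * (4 + n)
    one_hot = [1.0 if name == label else 0.0 for name in CLASSES]
    return [1.0, *box, *one_hot]


y_human = make_target("human", (0.5, 0.7, 0.3, 0.4))
y_background = make_target()
print(y_human)       # [1.0, 0.5, 0.7, 0.3, 0.4, 0.0, 0.0, 1.0]
print(y_background)  # [0.0, None, None, None, None, None, None, None]
```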

The **Loss Function** could be defined as follows, where the sum runs over all $n + 5$ entries of $Y$ and $y_1 = Pc$:

$$\mathcal{L} = 
\begin{cases}
\sum_{i=1}^{n+5} \left( \hat{y}_i - y_i \right)^2 & y_1=1 \\
\\
\left( \hat{y}_1 - y_1 \right)^2 & y_1 = 0\\
\end{cases}
$$

Where:
- If an object is present ($y_1 = 1$), the loss is the sum of squared differences between predictions and ground truth over all entries (including the bounding box values).
- If no object is present ($y_1 = 0$), the loss is just the squared difference of the first entry, as we don't care about the remaining entries.
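The two cases can be sketched in a few lines of plain Python. The function name `localization_loss` and the example predictions are made up for illustration. A real training setup would typically work on batches of tensors rather than on lists.

```python
def localization_loss(y_hat, y):
    """Squared-error loss for one example; entry 0 of each vector is Pc."""
    if y[0] == 1:
        # Object present: every entry contributes to the loss.
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    # No object: only the Pc entry is penalized, the rest are don't-cares.
    return (y_hat[0] - y[0]) ** 2


# Ground truth for "human detected" and a made-up prediction.
y_true = [1, 0.5, 0.7, 0.3, 0.4, 0, 0, 1]
y_pred = [0.9, 0.5, 0.6, 0.3, 0.4, 0.1, 0.0, 0.8]
loss_object = localization_loss(y_pred, y_true)        # approximately 0.07

# Ground truth for "no object": don't-care entries are never read.
y_none = [0, None, None, None, None, None, None, None]
loss_background = localization_loss(y_pred, y_none)    # approximately 0.81
```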

## 1.2 - Landmark Detection

Landmark detection involves identifying and localizing important key points on objects in an image.

<br>

<div style="text-align:center">
    <img src="media/landmark_detection.png" width=700>
    <caption><font color="purple"><br><u>Landmark Detection:</u> detecting landmarks (key points) on a human face. The 129 output neurons correspond to:<br>Face detected? (1 neuron)<br>64 landmarks, each an (x,y) pair, i.e. 128 coordinate values in total (128 neurons)</font></caption>
</div>


**Face Recognition:**
- Detect landmarks like the corners of the eyes, the corners of the mouth, the edges of the nose, etc. This makes it possible to estimate pose and emotion, and enables graphics filters.
- Labels must be consistent across images (e.g. $l_1$ is always the left corner of the left eye).

**Human Pose Estimation:**
- Detect landmarks like midpoint of chest, shoulders, elbows, wrists etc.
- Again, labels must be consistent across images.
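To make the output size from the face example concrete, here is a small sketch that flattens 64 landmark pairs into a 129-entry target vector. The helper `landmark_target` and the placeholder coordinates are hypothetical. The fixed ordering of the list is what keeps labels consistent across images.

```python
def landmark_target(face_present, landmarks):
    """Flatten [(x1, y1), ..., (xk, yk)] into [face?, x1, y1, ..., xk, yk]."""
    flat = [coord for point in landmarks for coord in point]
    return [1.0 if face_present else 0.0] + flat


# 64 placeholder landmarks (made-up coordinates, for illustration only).
points = [(i / 64, 0.5) for i in range(64)]
target = landmark_target(True, points)
print(len(target))  # 129 = 1 + 2 * 64
```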

## 1.3 - Sliding Windows

The sliding window algorithm is used to detect objects in images by passing cropped regions of the image through a ConvNet.

<br>

<div style="text-align:center">
    <img src="media/sliding-windows.png" width=900>
</div>

- Create a labeled dataset of tightly cropped positive and negative images of the object (e.g. cars)
- Train a ConvNet on this dataset to recognize the object in small image regions
- To detect objects in a new image:
    - Define a window size (e.g. 100x100 pixels)
    - Crop out the region enclosed by the window
    - Feed this small image region into the ConvNet to classify if it contains the object
    - Slide the window across the image with some defined stride
    - Repeat this process with larger window sizes to detect larger objects

This allows scanning an image with "sliding windows" of different sizes to find objects.
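The detection loop above can be sketched as follows. This is a simplified illustration on a nested-list "image". The `classify` argument is a hypothetical stand-in for the trained ConvNet, and the toy classifier below only checks pixel values.

```python
def sliding_windows(height, width, window, stride):
    """Yield (top, left, bottom, right) for each square window position."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left, top + window, left + window


def detect(image, classify, window_sizes, stride):
    """Classify every crop and return the windows flagged as positive.

    image:    list of pixel rows.
    classify: stand-in for the ConvNet; returns True if the crop
              contains the object.
    """
    height, width = len(image), len(image[0])
    hits = []
    for window in window_sizes:
        for top, left, bottom, right in sliding_windows(height, width, window, stride):
            crop = [row[left:right] for row in image[top:bottom]]
            if classify(crop):
                hits.append((top, left, bottom, right))
    return hits


# Toy 6x6 image with a 2x2 "object" of ones in the middle.
image = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 1


def is_all_ones(crop):
    return all(value == 1 for row in crop for value in row)


hits = detect(image, is_all_ones, window_sizes=[2, 4], stride=2)
print(hits)  # [(2, 2, 4, 4)]
```

Note that each crop is classified independently, which is the source of the cost discussed next.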

- Main disadvantage:
    - It is extremely computationally expensive to crop so many sub-images and run the ConvNet on each one
    - A smaller stride means more windows and better localization, but even slower detection; a larger stride is faster but localizes objects less precisely
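A quick count shows how fast the number of ConvNet evaluations grows as the stride shrinks. The image size of 600x800 pixels is an assumed example, and only a single window size is counted.

```python
def count_windows(height, width, window, stride):
    """Number of positions of a square window that fit inside the image."""
    rows = (height - window) // stride + 1
    cols = (width - window) // stride + 1
    return rows * cols


# Assumed 600x800 image, 100x100 window.
for stride in (50, 10, 1):
    print(stride, count_windows(600, 800, 100, stride))
# 50 165
# 10 3621
# 1 351201
```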

## 1.4 - Convolutional Implementation of Sliding Windows

A better implementation of sliding windows is the convolutional implementation. This involves turning fully connected (FC) layers into convolutional (CONV) layers. To do this, consider the following example of a ConvNet with two FC layers:

<br> 

<div style="text-align:center">
    <img src="media/sliding-windows-conv-before.png" width=800>
</div>

We can replace each of these FC layers with an equivalent CONV layer by applying filters that have the same height, width, and depth as the layer's input volume, so that each filter produces a single value. Using as many filters as the FC layer has units (in this example: 400) yields a $1 \times 1 \times 400$ output volume that holds the same values as the FC layer's output.

<br>

<div style="text-align:center">
    <img src="media/sliding-windows-conv-after.png" width=800>
</div>
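The equivalence can be checked numerically. The sketch below assumes a $5 \times 5 \times 16$ input volume (an assumed shape, since the text only fixes the 400 units) and uses random weights. It reshapes the FC weight matrix into 400 filters and compares both outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: a 5x5x16 input volume feeding an FC layer with 400 units.
x = rng.standard_normal((5, 5, 16))
W_fc = rng.standard_normal((400, 5 * 5 * 16))
b = rng.standard_normal(400)

# Fully connected layer: flatten the volume, then one matrix product.
fc_out = W_fc @ x.reshape(-1) + b                   # shape (400,)

# Equivalent CONV layer: each weight row becomes one 5x5x16 filter.
filters = W_fc.reshape(400, 5, 5, 16)
# Each filter covers the whole input, so there is exactly one valid
# position per filter and the output volume is 1x1x400.
conv_out = (np.einsum("fhwc,hwc->f", filters, x) + b).reshape(1, 1, 400)

print(np.allclose(fc_out, conv_out[0, 0]))  # True
```

The weights are identical in both versions; only their arrangement changes, which is why a trained FC layer can be converted without retraining.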

This method of replacing FC layers with equivalent CONV layers can be used to implement the sliding windows method convolutionally. Consider the following example of a ConvNet with an input image of size $14 \times 14 \times 3$ (note that the third dimension has been left out of the drawing for simplicity). This image is convolved with a $5 \times 5$ filter, followed by max-pooling and further convolution layers.

<br>

<div style="text-align:center">
    <img src="media/sliding-windows-conv-1.png" width=800>
</div>

We can now implement the sliding window method by adding a border of two pixels to the input, which results in an input image of dimensions $16 \times 16 \times 3$. We can apply the same $5 \times 5$ convolution, max-pooling, and further convolution layers as before. However, because of the changed dimensions of the input image, the dimensions of the intermediate matrices are also different.

<br>

<div style="text-align:center">
    <img src="media/sliding-windows-conv-2.png" width=800>
</div>

It turns out, however, that the upper-left corner of the resulting matrix (blue square) is the result of the upper-left area of the input image (blue area). The other squares in the output matrix correspond to the other areas of the input image in the same way, and this mapping applies to the intermediate matrices too.

The advantage of calculating the result convolutionally is that the forward propagations of all areas are combined into one, sharing much of the computation in the regions of the image that the windows have in common. In contrast, with the sliding window method as described above, forward propagation would be done several times independently (once for each position of the sliding window). A convolutional implementation of sliding windows therefore has considerably lower computational cost than a sequential implementation, because all positions of the sliding window are computed in a single forward pass.
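The following sketch checks this correspondence numerically on a scaled-down network. The filter counts (4, 8, and 4), the random weights, and the helper names are assumptions chosen to keep the example small. Only the $14 \times 14 \times 3$ window, the $16 \times 16 \times 3$ input, the $5 \times 5$ convolution, and the max-pooling come from the text.

```python
import numpy as np


def conv2d(x, w):
    """Valid convolution, stride 1. x: (H, W, C), w: (k, k, C, F)."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((out_h, out_w, w.shape[3]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=3)
    return out


def max_pool(x, size=2):
    """Non-overlapping max pooling; H and W must be divisible by size."""
    h, w, c = x.shape
    return x.reshape(h // size, size, w // size, size, c).max(axis=(1, 3))


def network(x, params):
    w1, w2, w3 = params
    a = np.maximum(conv2d(x, w1), 0)   # 5x5 CONV + ReLU
    a = max_pool(a)                    # 2x2 max-pooling
    a = np.maximum(conv2d(a, w2), 0)   # former FC layer as a 5x5 CONV
    return conv2d(a, w3)               # former FC layer as a 1x1 CONV


rng = np.random.default_rng(0)
params = (
    rng.standard_normal((5, 5, 3, 4)),
    rng.standard_normal((5, 5, 4, 8)),
    rng.standard_normal((1, 1, 8, 4)),
)
image = rng.standard_normal((16, 16, 3))

# One forward pass over the whole 16x16 image gives a 2x2 grid of outputs.
full = network(image, params)

# Compare each grid cell with a separate pass over the matching 14x14 crop.
matches = []
for i in range(2):
    for j in range(2):
        crop = image[2 * i:2 * i + 14, 2 * j:2 * j + 14, :]
        single = network(crop, params)          # shape (1, 1, 4)
        matches.append(np.allclose(full[i, j], single[0, 0]))

print(full.shape, all(matches))  # (2, 2, 4) True
```

In this sketch the windows are two pixels apart because of the $2 \times 2$ max-pooling layer: one step in the output grid corresponds to two pixels in the input.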