From Andrew Ng

## Classification with Localization

A conv net outputs a softmax over the classes plus a bounding box $(b_x, b_y, b_h, b_w)$. The training set contains both class labels and bounding boxes.

Assuming 1 object in an image, label $y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$ where:

* $p_c$ is the probability that there is an object of interest
* $c_1, c_2, c_3$ are class labels.
* $(b_x, b_y, b_h, b_w)$ are the bounding box coordinates in the image: $(b_x, b_y)$ is the centre of the box, $b_h$ its height and $b_w$ its width.
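As a minimal sketch of how such a label could be assembled: the helper `make_label` and its arguments are hypothetical names used only for illustration, and filling the unused entries with 0 when there is no object is an arbitrary choice, since those entries are ignored.

```python
def make_label(has_object, box=None, class_index=None, num_classes=3):
    """Build y = (p_c, b_x, b_y, b_h, b_w, c_1, ..., c_n) as a flat list."""
    if not has_object:
        # No object: p_c = 0. The other entries are "don't care" values,
        # filled with 0.0 here purely as a placeholder (assumption).
        return [0.0] * (5 + num_classes)
    b_x, b_y, b_h, b_w = box
    # One-hot encoding of the class: exactly one of c_1..c_n is 1.
    one_hot = [1.0 if i == class_index else 0.0 for i in range(num_classes)]
    return [1.0, b_x, b_y, b_h, b_w] + one_hot


car = make_label(True, box=(0.5, 0.7, 0.3, 0.4), class_index=1)
background = make_label(False)
print(car)         # [1.0, 0.5, 0.7, 0.3, 0.4, 0.0, 1.0, 0.0]
print(background)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```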

### Loss Function

Typically:

* **logistic regression loss** for $p_c$,
* **mean squared error** for the bounding box parameters,
* **softmax loss** for the class labels $c_1, c_2, c_3$.

Example, using squared error for every component of the 8-element label:

$$
\begin{aligned}
\mathcal{L}(\hat{y}, y) &= \begin{cases}
\sum_{i=1}^{8} \big(\hat{y}_i - y_i \big)^2 & \text{if } y_1 = 1 \\
\big(\hat{y}_1 - y_1 \big)^2 & \text{if } y_1 = 0
\end{cases}
\end{aligned}
$$

When there is no object ($y_1 = p_c = 0$), the entries other than the first one in the label are not relevant, so only $p_c$ contributes to the loss.
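The piecewise squared-error loss can be sketched in plain Python. `localization_loss` is a hypothetical name, and this follows the simplified all-squared-error example rather than the mix of logistic, squared-error and softmax losses listed under "Typically".

```python
def localization_loss(y_hat, y):
    """Squared-error loss for y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)."""
    if y[0] == 1:
        # Object present: every component of the label contributes.
        return sum((p - t) ** 2 for p, t in zip(y_hat, y))
    # No object: only the p_c entry is compared, the rest is ignored.
    return (y_hat[0] - y[0]) ** 2


y_object = [1.0, 0.5, 0.5, 0.2, 0.2, 0.0, 1.0, 0.0]
y_hat_object = [0.9, 0.5, 0.5, 0.2, 0.2, 0.0, 1.0, 0.0]
loss_object = localization_loss(y_hat_object, y_object)

y_empty = [0.0] * 8
y_hat_empty = [0.2, 0.9, 0.9, 0.9, 0.9, 0.3, 0.3, 0.4]  # box/class values ignored
loss_empty = localization_loss(y_hat_empty, y_empty)

print(round(loss_object, 4))  # 0.01
print(round(loss_empty, 4))   # 0.04
```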

Coordinate landmarks can also be used to classify the pose of a person in an image, provided that the coordinates of the body landmarks are labelled.

## Object Detection

### Sliding window detection

Use a variety of window sizes: slide each window across the image and classify every windowed crop for the possible classes.

**Disadvantage**: high computational cost due to the large number of windows. A larger stride reduces the cost but may miss objects.
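To get a feel for how quickly the number of windows grows, here is a small sketch that enumerates window positions. The function name `sliding_windows`, the square windows, the absence of padding and the example sizes are all assumptions made for illustration.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Return the (top, left) corner of every square window that fits."""
    return [
        (top, left)
        for top in range(0, image_h - window + 1, stride)
        for left in range(0, image_w - window + 1, stride)
    ]


# A 14x14 window with stride 2 fits into a 16x16 image at 4 positions.
small = sliding_windows(16, 16, window=14, stride=2)
print(small)  # [(0, 0), (0, 2), (2, 0), (2, 2)]

# Several window sizes on a larger image: each crop would need its own
# classifier pass in the naive approach.
total = sum(len(sliding_windows(256, 256, w, stride=8)) for w in (32, 64, 128))
print(total)  # 1755
```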

### CNN Implementation of Sliding Windows

Typically you have input -> CONV -> Max Pool -> FC -> FC -> Softmax

However, you can replace each FC layer with a CONV layer whose number of filters equals the width of the FC layer. E.g. if the FC layer has 400 neurons, replace it with 400 filters, each with the same size as the output of the previous layer, which gives a 1x1x400 output.

This can be followed by a 1x1 CONV layer with 400 filters (output 1x1x400), and then by a 1x1 CONV layer with 4 filters (output 1x1x4) if the softmax layer has 4 classes. The max pool size is (2, 2).
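The equivalence between an FC layer and a CONV layer whose filters cover the whole previous layer can be checked numerically. The sketch below uses a tiny 2x2x2 feature map and 3 neurons instead of 400 to keep it readable; `fc_layer` and `conv_full_size` are hypothetical helper names, and biases and activations are left out.

```python
def fc_layer(feature_map, weights):
    """Flatten an h x w x c feature map and apply one weight row per neuron."""
    flat = [v for row in feature_map for pixel in row for v in pixel]
    return [sum(wt * x for wt, x in zip(neuron, flat)) for neuron in weights]


def conv_full_size(feature_map, weights):
    """Apply one h x w x c filter per neuron at the single valid position."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    out = []
    for neuron in weights:
        total = 0
        for i in range(h):
            for j in range(w):
                for k in range(c):
                    # Same weights as the FC neuron, read as an h x w x c filter.
                    total += neuron[(i * w + j) * c + k] * feature_map[i][j][k]
        out.append(total)  # one value per filter: a 1x1xN output
    return out


feature_map = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2x2x2
weights = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
print(fc_layer(feature_map, weights))        # [1, 36, 20]
print(conv_full_size(feature_map, weights))  # [1, 36, 20]
```

Both functions do the same arithmetic; the CONV form just keeps the spatial layout, which is what lets the same filters slide over a larger input.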

Cropping every window and running it through the network separately results in lots of duplicated computation. Therefore, instead of feeding 14x14x3 crops and getting a 1x1x4 softmax output for each, you can feed a 16x16x3 image through the same filters and get a 2x2x4 softmax output at the end.

Each position of the 2x2x4 output corresponds to one 14x14x3 window of the 16x16x3 image. Therefore all 4 windows are classified in one pass.
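The output sizes can be verified with a little shape arithmetic. This sketch assumes layer sizes the notes do not spell out: a first 5x5 CONV with stride 1 and no padding, a 2x2 max pool with stride 2, and a 5x5 CONV with 400 filters standing in for the first FC layer. `conv_out` and `output_shape` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of a CONV or pooling layer without padding."""
    return (size - kernel) // stride + 1


def output_shape(size, num_classes=4):
    """Trace the spatial size of a square input through the assumed network."""
    size = conv_out(size, 5)      # 5x5 CONV (assumed)
    size = conv_out(size, 2, 2)   # 2x2 max pool, stride 2
    size = conv_out(size, 5)      # 5x5 CONV, 400 filters (replaces first FC)
    # The two 1x1 CONV layers that follow leave the spatial size unchanged.
    return (size, size, num_classes)


print(output_shape(14))  # (1, 1, 4)
print(output_shape(16))  # (2, 2, 4)
```

Under these assumptions the four output positions correspond to 14x14 windows offset by 2 pixels, the offset being set by the stride of the max pool.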

