From Andrew Ng

## Classification with Localization

A conv net outputs a softmax class prediction together with a bounding box $(b_x, b_y, b_h, b_w)$. The training set contains both class labels and bounding boxes.

Assuming 1 object in an image, label $y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$ where:

* $p_c$ is the probability that there is an object of interest
* $c_1, c_2, c_3$ are the class indicators, one per class.
* $(b_x, b_y)$ is the midpoint of the bounding box, and $(b_h, b_w)$ are its height and width.

### Loss Function

Typically:

* **Logistic regression loss** for $p_c$,
* **mean squared error** for the bounding box parameters,
* **softmax loss** for the class labels $c_1, c_2, c_3$.

Example, using squared error for every component of the 8-element label:

$$
\begin{aligned}
\mathcal{L}(\hat{y}, y) &= \begin{cases}
\sum_{i=1}^{8} \big(\hat{y}_i - y_i \big)^2 & \text{if } y_1 = 1 \\
\big(\hat{y}_1 - y_1 \big)^2 & \text{if } y_1 = 0
\end{cases}
\end{aligned}
$$

When there is no object ($y_1 = p_c = 0$), only the first component of the label matters; the remaining terms are not relevant.
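A minimal Python sketch of the squared-error example loss, assuming labels are plain lists in the order $(p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$. The function name `localization_loss` and the sample values are illustrative, not from the course.

```python
def localization_loss(y_hat, y):
    """Squared-error loss for y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)."""
    if y[0] == 1:
        # Object present: every component of the label contributes.
        return sum((yh - yt) ** 2 for yh, yt in zip(y_hat, y))
    # No object: only the p_c term counts, the rest is "don't care".
    return (y_hat[0] - y[0]) ** 2


# Illustrative values only.
pred = [0.5, 0.5, 0.5, 0.25, 0.5, 0, 1, 0]
y_obj = [1, 0.5, 0.5, 0.5, 0.5, 0, 1, 0]   # object present
y_bg = [0, 0, 0, 0, 0, 0, 0, 0]            # background only

print(localization_loss(pred, y_obj))  # 0.3125
print(localization_loss(pred, y_bg))   # 0.25
```

With `y_bg`, the bounding box and class entries of `pred` have no effect on the loss.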

Coordinate landmarks can also be used to classify the pose of a person in an image, provided that the coordinates of the body landmarks are labelled.

## Object Detection

### Sliding window detection

Slide windows of a variety of sizes across the image and classify each cropped window for a possible class.

**Disadvantage**: high computational cost due to the large number of windows. Objects may also be missed if the stride is too coarse.

### CNN Implementation of Sliding Windows

Typically you have input -> CONV -> Max Pool -> FC -> FC -> Softmax

However, you can replace each FC layer with a CONV layer that has as many filters as the FC layer has neurons. For example, if the FC layer has 400 neurons, replace it with 400 filters whose spatial size matches the output of the previous layer, which gives a 1x1x400 volume.

This can be followed by a 1x1 CONV layer with 400 filters (in place of the second FC layer), then a 1x1 CONV layer with 4 filters giving a 1x1x4 output if there are 4 classes for the softmax layer. Max pool size is (2, 2).

Running the network separately on every window results in lots of duplicated computation. Instead of feeding 14x14x3 crops and getting a 1x1x4 softmax output each time, you can feed a 16x16x3 image through the same filters and get a 2x2x4 output.

Each position of the 2x2x4 output corresponds to one 14x14x3 window of the 16x16x3 image. Therefore 4 windows can be classified in one pass.
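The 14x14 versus 16x16 comparison can be checked with simple shape arithmetic. The sketch below assumes a first CONV layer with 5x5 filters and no padding (the notes do not state the filter size), followed by the 2x2 max pool and the converted FC layers. `conv_out` and `output_grid` are hypothetical helper names.

```python
def conv_out(n, f, stride=1):
    """Output width of an unpadded convolution or pooling layer."""
    return (n - f) // stride + 1


def output_grid(n):
    """Spatial size of the final class-score volume for an n x n input."""
    n = conv_out(n, f=5)            # CONV 5x5 (assumed filter size)
    n = conv_out(n, f=2, stride=2)  # MAX POOL 2x2
    n = conv_out(n, f=5)            # former FC layer: 400 filters of 5x5
    n = conv_out(n, f=1)            # 1x1 CONV, 400 filters
    n = conv_out(n, f=1)            # 1x1 CONV, 4 filters (class scores)
    return n


print(output_grid(14))  # 1  -> 1x1x4, a single window
print(output_grid(16))  # 2  -> 2x2x4, four windows in one pass
print(output_grid(28))  # 8  -> 8x8x4
```

Because of the 2x2 max pool, neighbouring output positions correspond to windows shifted by 2 pixels in the input.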



## YOLO (You Only Look Once) 2015

Split an image into grids. For each grid cell, the label is $y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$.

For example, an image is split into a 3x3 grid (in practice a finer grid such as 19x19 is typical). Each object is assigned to the grid cell that contains its midpoint.

Input 100x100x3 -> CONV -> MAX POOL -> ... -> Output 3x3x8.

$(b_x, b_y, b_h, b_w)$ are expressed relative to the size of the grid cell. $(b_x, b_y)$ must lie in $[0, 1]$, while $(b_h, b_w)$ can be greater than 1 when the object is larger than one cell.
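As a rough illustration of this encoding, the sketch below converts a box given in pixels into the cell index and the 8-element label for that cell. The function `encode_object`, its parameters, and the sample box are assumptions made for the example.

```python
def encode_object(x_center, y_center, width, height, class_index,
                  image_size=100, grid=3, num_classes=3):
    """Return (row, col, label) for one object given in pixel units."""
    cell = image_size / grid
    # The object belongs to the cell that contains its midpoint.
    col = min(int(x_center // cell), grid - 1)
    row = min(int(y_center // cell), grid - 1)
    # Midpoint relative to the cell's top-left corner, in [0, 1].
    b_x = x_center / cell - col
    b_y = y_center / cell - row
    # Height and width in units of the cell size; may exceed 1.
    b_h = height / cell
    b_w = width / cell
    classes = [1 if i == class_index else 0 for i in range(num_classes)]
    return row, col, [1, b_x, b_y, b_h, b_w] + classes


row, col, label = encode_object(50, 50, width=60, height=20, class_index=1)
print(row, col)                       # 1 1
print([round(v, 2) for v in label])   # [1, 0.5, 0.5, 0.6, 1.8, 0, 1, 0]
```

Here the box is wider than one cell, so $b_w$ comes out greater than 1 while $b_x$ and $b_y$ stay inside $[0, 1]$.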

## Intersection Over Union (IOU)

Used to evaluate object detection. Computes: **IOU = size of intersection / size of union**.

By convention a detection counts as correct if IOU >= 0.5. Stricter thresholds such as 0.6 or 0.7 are also used.

It measures the overlap of two boxes.
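A small sketch of the IOU computation. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` rather than the midpoint and size form used elsewhere in these notes; that representation is a choice made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 4, 4), (0, 0, 4, 2)))  # 0.5
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0
```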

## Non-Max Suppression

Ensure your algo detects an object only once. 

Multiple cells in the grid may report a high $p_c$ (probability of detection) for the same object.

Non-Max suppression algo:

* For a cluster of boxes with high $p_c$, choose the one with the highest $p_c$.
* Check all other boxes that have a high IOU with it and suppress them.

Example, for a 19x19 grid and 1 object class:

1. discard all boxes with $p_c \leq 0.6$
2. while there are remaining bounding boxes:
    * pick the box with the largest $p_c$, output that as a prediction
    * discard any remaining box with IOU >= 0.5 with the box output in the previous step (this drops boxes that are potentially flagging the same object).
    
For multiple object classes, run non-max suppression separately for each class.
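The two steps can be sketched in Python as follows, for a single class. Boxes are assumed to be tuples `(p_c, x1, y1, x2, y2)` in corner format, and the thresholds 0.6 and 0.5 are the ones from the example; the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, score_threshold=0.6, iou_threshold=0.5):
    """boxes: list of (p_c, x1, y1, x2, y2). Returns the boxes that are kept."""
    # Step 1: discard boxes with p_c <= score_threshold.
    remaining = [b for b in boxes if b[0] > score_threshold]
    remaining.sort(key=lambda b: b[0], reverse=True)
    kept = []
    # Step 2: repeatedly output the best box and drop its strong overlaps.
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining
                     if iou(best[1:], b[1:]) < iou_threshold]
    return kept


boxes = [
    (0.9, 0, 0, 4, 4),
    (0.8, 0, 0, 4, 3),       # overlaps the first box with IOU 0.75
    (0.7, 10, 10, 14, 14),   # a separate object
    (0.5, 10, 10, 14, 13),   # below the p_c threshold
]
print(non_max_suppression(boxes))
# [(0.9, 0, 0, 4, 4), (0.7, 10, 10, 14, 14)]
```

For several classes, one option is to call `non_max_suppression` once per class on the boxes predicted for that class.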

## Anchor Boxes

For dealing with **overlapping objects**, i.e. those whose midpoints fall in the **same grid cell**.

For the case of 2 objects: change the encoding from $y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$ to:

$$y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3, p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$$

Use the first group for anchor box 1, and the second group for anchor box 2. 

Each object in a training image is assigned to the grid cell that contains the object's midpoint and, within that cell, to the anchor box that has the highest IOU with the object's shape.
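One way to pick the anchor is to compare only the shapes, as if the object and each anchor shared the same midpoint. The sketch below makes that assumption; the anchor shapes and the names `shape_iou` and `best_anchor` are made up for illustration.

```python
def shape_iou(wh_a, wh_b):
    """IOU of two (width, height) shapes placed on the same midpoint."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(obj_wh, anchors):
    """Index of the anchor whose shape has the highest IOU with the object."""
    return max(range(len(anchors)),
               key=lambda i: shape_iou(obj_wh, anchors[i]))


# Hypothetical anchors: one tall (pedestrian-like), one wide (car-like).
anchors = [(1.0, 2.0), (2.0, 1.0)]

print(best_anchor((0.9, 2.2), anchors))  # 0 -> first group of the label
print(best_anchor((2.5, 1.0), anchors))  # 1 -> second group of the label
```

The returned index selects which group of 8 values in the 16-element label is filled in for that object.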

## YOLO

To detect 3 classes (pedestrian/car/motorcycle) with 2 anchor boxes on a 3x3 grid, the **dimension** of $y$ (the output of your CNN) is 3x3x2x8, or equivalently 3x3x16.

Then run non-max suppression, for each object class.
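The output dimension follows from the grid size, the number of anchor boxes, and the number of classes: each anchor contributes $p_c$, four box values, and one entry per class. A short sketch of that arithmetic, with a hypothetical helper `yolo_output_shape`:

```python
def yolo_output_shape(grid, num_anchors, num_classes):
    """Return the output shape in both the split and the flattened form."""
    per_anchor = 5 + num_classes  # p_c, b_x, b_y, b_h, b_w, then classes
    split = (grid, grid, num_anchors, per_anchor)
    flat = (grid, grid, num_anchors * per_anchor)
    return split, flat


print(yolo_output_shape(3, 2, 3))   # ((3, 3, 2, 8), (3, 3, 16))
print(yolo_output_shape(19, 2, 3))  # ((19, 19, 2, 8), (19, 19, 16))
```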

## Region Proposal

R-CNN (2013): run a **segmentation algorithm** to propose regions, then run the CNN classifier only on those regions. This reduces the no. of regions to run through the CNN.

Fast R-CNN speeds this up quite a bit. However, the region proposal algo is still slow.

Faster R-CNN runs faster, but still slower than YOLO.