# Object detection

## Object localization
* classification
    * label
    * one object
* classification with localization
    * label AND bounding box
    * one object
* detection
    * labels AND bounding boxes
    * multiple objects

* solution
    * adding output units for bounding box
        * $b_x, b_y, b_h, b_w$
        * midpoint, height and width of the box
        * by convention these are expressed as fractions of the input image size
    * target vector
        * is there an object?
        * bounding box
        * classes
        * examples $y = \begin{bmatrix} p_c \\ b_x \\ b_y \\ b_h \\ b_w \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}$, $\begin{bmatrix} 1 \\ b_x \\ b_y \\ b_h \\ b_w \\ 0 \\ 1 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \\ ? \end{bmatrix}$
    * example of squared loss
        * $L(\hat y,y) = (\hat y_1-y_1)^2+(\hat y_2-y_2)^2+...+(\hat y_8-y_8)^2$, if $y_1=1$
        * $L(\hat y,y) = (\hat y_1-y_1)^2$, if $y_1=0$
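        * in practice, different loss types can be used for different elements (e.g. log loss for $p_c$, softmax loss for the classes, squared error for the box)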

* landmark detection
    * detect multiple important points in the image (e.g. eye positions or the silhouette in a person's image, pose detection)
    * output: whether the object is present, plus the coordinates of each point
    * labels need to be consistent across images
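
Returning to the squared loss for classification with localization: the two cases can be sketched in a few lines of NumPy. This is a minimal sketch, not a training-ready loss. The function name `localization_loss` is hypothetical, and the 8-element layout $[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$ is taken from the target vector above. When no object is present ($y_1 = 0$), only the $p_c$ term is penalised.

```python
import numpy as np

def localization_loss(y_hat, y):
    """Squared-error loss for a target [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    if y[0] == 1:
        # Object present: every component contributes to the loss.
        return float(np.sum((y_hat - y) ** 2))
    # No object: only the objectness term p_c is penalised,
    # the remaining entries are "don't care" values.
    return float((y_hat[0] - y[0]) ** 2)
```

With an empty target, wildly wrong box or class outputs do not change the loss; only the predicted $p_c$ matters.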

## Object detection
* sliding window algorithm
    * algorithm
        * slide a window of a fixed size across the image
        * classify whether the current window contains an object
        * pick a larger window
        * slide the larger window across the image
        * classify again whether each window contains the object
        * ...
    * computationally intensive
    * can be implemented convolutionally!
* conv sliding window algorithm
    * [paper](https://arxiv.org/abs/1312.6229)
    * conv net
        * input (14 x 14 x 3)
        * conv layer (5 x 5) with 16 filters
        * max pool layer (2 x 2)
        * FC with 400 units
        * FC with 400 units
        * output layer with softmax
    * transforming FC layer to conv
        * input (14 x 14 x 3)
        * conv layer (5 x 5) with 16 filters
        * max pool layer (2 x 2)
        * conv layer (5 x 5) with 400 filters, output (1 x 1 x 400)     
        * conv layer (1 x 1) with 400 filters, output (1 x 1 x 400)
        * conv layer (1 x 1) with 4 filters, output fed into softmax
    * further example (base)
        * input (14 x 14 x 3)
        * conv layer 16 filters (5 x 5), output (10 x 10 x 16), stride 1, no padding
        * max pool (2 x 2), output (5 x 5 x 16)
        * conv layer 400 filters (5 x 5), output (1 x 1 x 400)
        * conv layer 400 filters (1 x 1), output (1 x 1 x 400)
        * conv layer 4 filters (1 x 1), softmax on top, output (1 x 1 x 4)
    * further example (extended)
        * input (16 x 16 x 3), slightly larger image with 4 sliding windows (14 x 14 x 3), with stride 2
        * conv layer 16 filters (5 x 5) with stride 1, output (12 x 12 x 16)
        * max pool (2 x 2), output (6 x 6 x 16)
            * this step is responsible for the stride of 2 in the original image
        * conv layer 400 filters (5 x 5), output (2 x 2 x 400)
        * conv layer 400 filters (1 x 1), output (2 x 2 x 400)
        * conv layer 4 filters (1 x 1), softmax on top, output (2 x 2 x 4)
            * every position in the output corresponds to one (14 x 14) window of the input image (i.e. the upper-left entry corresponds to the upper-left window)
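
The spatial sizes in the base and extended examples can be checked with the usual output-size formula $\lfloor (n + 2p - f)/s \rfloor + 1$. The sketch below assumes the layer sequence listed above (5 x 5 conv, 2 x 2 max pool with stride 2, 5 x 5 conv, then two 1 x 1 convs, all without padding); the helper names `conv_out` and `trace` are hypothetical.

```python
def conv_out(n, f, stride=1, pad=0):
    """Spatial output size of a conv or pooling layer with an f x f window."""
    return (n + 2 * pad - f) // stride + 1

def trace(n):
    """Spatial size after each layer of the example network."""
    sizes = [n]
    # (window size, stride) for: conv 5x5, max pool 2x2, conv 5x5, conv 1x1, conv 1x1
    for f, s in [(5, 1), (2, 2), (5, 1), (1, 1), (1, 1)]:
        sizes.append(conv_out(sizes[-1], f, s))
    return sizes

print(trace(14))  # [14, 10, 5, 1, 1, 1]
print(trace(16))  # [16, 12, 6, 2, 2, 2]
```

The 16 x 16 input ends in a 2 x 2 map, i.e. four window predictions from a single forward pass instead of four separate passes.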

## Bounding box predictions
* YOLO (You only look once) [paper](https://arxiv.org/abs/1506.02640)
* detects object midpoint within the image grid + bounding box around the object
* target again $p_c$, $b_x$, $b_y$, $b_h$, $b_w$, $c_1$, $c_2$, $c_3$ (object, mid point, box, class)
* example
    * dividing the input image into a 3 x 3 grid, the target then has shape (3 x 3 x 8): one 8-element vector, as defined above, per grid cell
    * input (100 x 100 x 3)
    * conv net (has to match i/o sizes)
    * output (3 x 3 x 8)
* it is sensible to use finer grids to avoid multiple objects in one grid cell
* bounding boxes might use specific activations to ensure valid ranges
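
Building the (3 x 3 x 8) target from a list of labelled objects can be sketched as below. This is an illustration under stated assumptions: the function name `encode_target` and the input format are hypothetical, and the midpoint is stored relative to its grid cell with height and width in units of the cell size, which is one common convention rather than something these notes prescribe.

```python
import numpy as np

def encode_target(objects, grid=3, n_classes=3):
    """Build a (grid, grid, 5 + n_classes) target from a list of objects.

    Each object is (x, y, h, w, class_id) with x, y, h, w given as
    fractions of the image size (0..1).
    """
    target = np.zeros((grid, grid, 5 + n_classes))
    for x, y, h, w, class_id in objects:
        # Grid cell that contains the object's midpoint.
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        # Assumption: midpoint relative to the cell's top-left corner,
        # height and width measured in cell sizes.
        target[row, col, 0] = 1.0              # p_c
        target[row, col, 1] = x * grid - col   # b_x
        target[row, col, 2] = y * grid - row   # b_y
        target[row, col, 3] = h * grid         # b_h
        target[row, col, 4] = w * grid         # b_w
        target[row, col, 5 + class_id] = 1.0   # one-hot class
    return target
```

Cells without an object keep $p_c = 0$; their remaining entries are ignored by the loss.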

## Intersection over union
* used for evaluating object localization, prediction vs ground truth
* (size of the intersection of prediction and ground truth) / (the size of the union of those), "correct" if IoU >= 0.5
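
The ratio above can be computed directly from box corners. A minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates (a representation chosen here for convenience; the notes use midpoint, height and width):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width or height is clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2 x 2 boxes offset by one unit in each direction share an area of 1 and have a union of 7, so their IoU is 1/7 and the prediction would not count as "correct".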

## Non-max suppression
* multiple detection of the same object
* results in multiple midpoints, multiple bounding boxes
* high-level intuition
    * start with the detection with the highest $p_c$ (let's highlight that one)
    * non-max suppression looks at the other detections that overlap it strongly (high IoU) and suppresses them
    * then it continues with the next-highest remaining $p_c$ and again suppresses non-max predictions with high IoU
    * ...
* algorithm (for a single class)
    * discard all predictions with $p_c \leq 0.6$, deal only with the high-confidence ones
    * while there are remaining boxes:
        * pick the box with the highest $p_c$, output that as a prediction
        * discard any remaining box with IoU >= 0.5 with the box output in the previous step
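
The single-class algorithm above translates almost line by line into code. The sketch below is illustrative: the function names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` corners, and the thresholds 0.6 and 0.5 are the ones from the list above.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Return the indices of the kept boxes, highest score first."""
    # Discard low-confidence predictions (p_c <= threshold).
    remaining = [i for i, s in enumerate(scores) if s > score_thresh]
    remaining.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while remaining:
        # Output the most confident remaining box as a prediction.
        best = remaining.pop(0)
        kept.append(best)
        # Discard every remaining box that overlaps it too much.
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) < iou_thresh]
    return kept
```

For multiple classes, the same routine would be run once per class on that class's detections.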

## Anchor boxes
* addresses the problem of multiple objects in one grid cell
* multiple anchor boxes (shaped bounding boxes) can be selected
* forces network to specialize
* the target then consists of the per-anchor vectors stacked on top of each other, such as $y = \begin{bmatrix} p^1_c \\ b^1_x \\ b^1_y \\ b^1_h \\ b^1_w \\ c^1_1 \\ c^1_2 \\ c^1_3 \\ p^2_c \\ b^2_x \\ b^2_y \\ b^2_h \\ b^2_w \\ c^2_1 \\ c^2_2 \\ c^2_3 \end{bmatrix}$, where the superscripts index the anchor boxes
* previously
    * each object in image assigned to grid cell that contains the midpoint
    * target vector (3 x 3 x 8)
* with 2 anchor boxes
    * each object in the image is assigned to the grid cell that contains its midpoint and, within that cell, to the anchor box with the highest IoU with the object's shape, i.e. to a (grid cell, anchor box) pair
    * target vector (3 x 3 x 16)
    * does not handle more objects in one cell than there are anchor boxes (e.g. 3 objects with 2 anchors); two objects of the same shape in one cell are also problematic
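
Choosing the anchor box for an object can be sketched as an IoU comparison of shapes only, with both boxes imagined at the same midpoint. The helper names and the two anchor shapes below are hypothetical, picked only to illustrate a tall and a wide anchor.

```python
def shape_iou(h1, w1, h2, w2):
    """IoU of two boxes that share the same midpoint (only shape matters)."""
    inter = min(h1, h2) * min(w1, w2)
    return inter / (h1 * w1 + h2 * w2 - inter)

def best_anchor(box_hw, anchors):
    """Index of the anchor (h, w) whose shape overlaps the box the most."""
    h, w = box_hw
    ious = [shape_iou(h, w, ah, aw) for ah, aw in anchors]
    return max(range(len(anchors)), key=lambda i: ious[i])

# Hypothetical anchors as (height, width): one tall, one wide.
anchors = [(2.0, 1.0), (1.0, 2.0)]
print(best_anchor((1.8, 0.9), anchors))  # 0, the tall anchor
print(best_anchor((0.8, 2.2), anchors))  # 1, the wide anchor
```

The returned index selects which 8-element slice of the 16-element cell target receives the object.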

## YOLO algorithm

* train set construction
    * 3 classes, 2 anchor boxes, image grid 3 x 3
    * y is 3 x 3 (grid) x 2 (anchor boxes) x 8 ($p_c$ + 4 bounding box values + 3 classes)
    * y is constructed for each grid cell; if an object is present, it is assigned to the anchor box with the highest IoU
    * image input -> volume block output

* making predictions
    * where $p_c = 0$, the other values are ignored
    * where $p_c = 1$, all other values (bounding box and classes) are meaningful
    * algorithm
        * for each grid cell, predict 2 bounding boxes
        * get rid of low-prob predictions
        * for each class separately, run non-max suppression

## Region proposal

* regions with CNNs: the goal is to pick just a few windows to run through the CNN (ignoring empty regions)
* run image segmentation, place bounding boxes around interesting regions, feed those regions to the CNN
* R-CNN [paper](https://arxiv.org/pdf/1311.2524)
    * propose regions, classify regions one at a time, output label & bounding box
* Fast R-CNN [paper](https://arxiv.org/pdf/1504.08083)
    * propose regions, use conv implementation of sliding windows to classify proposed regions
* Faster R-CNN [paper](https://arxiv.org/pdf/1506.01497)
    * the segmentation-based proposal step is still slow, so a CNN is used to propose regions


## Semantic segmentation

* draw an outline around the detected object (pixel-wise)
* per pixel labeling
* architecture
    * input image
    * conv & pool layers (shrinking height and width, expanding channels)
    * conv & pool layers (expanding height and width, shrinking channels)

### Transpose convolution

* upsamples its input (the output is larger than the input)
* traditional conv (6 x 6 x 3) image, convolved with 5 filters of (3 x 3 x 3), resulting in (4 x 4 x 5)
* transpose conv (2 x 2) image, convolved with (3 x 3), resulting in (4 x 4)
    * padding p = 1, stride s = 2
    * multiply the filter by the upper-left entry of the input and paste the result into the upper-left corner of the padded output; values that fall into the padding are ignored
    * move to the second input value, multiply the filter by it and paste the result one stride (2) further along the output; where it overlaps the previous result the values are added, elsewhere they are simply written
    * continue with the remaining two input positions using the same process
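
The paste-and-add procedure above can be written out in NumPy. This is a minimal single-channel sketch: the function name `transpose_conv2d` is hypothetical, the output size is passed explicitly to match the (4 x 4) example rather than derived from a formula, and the input and filter values are made up for illustration.

```python
import numpy as np

def transpose_conv2d(x, kernel, out_size=4, stride=2, pad=1):
    """Transpose convolution of a square 2-D input with one square 2-D filter."""
    n, f = x.shape[0], kernel.shape[0]
    # Work on the output including its padding border.
    canvas = np.zeros((out_size + 2 * pad, out_size + 2 * pad))
    for i in range(n):
        for j in range(n):
            r, c = i * stride, j * stride
            # Paste the filter scaled by the input entry; overlaps are summed.
            canvas[r:r + f, c:c + f] += x[i, j] * kernel
    # Drop the padding border to obtain the final output.
    return canvas[pad:pad + out_size, pad:pad + out_size]

x = np.array([[2, 1],
              [3, 2]])
kernel = np.array([[1, 2, 1],
                   [2, 0, 1],
                   [0, 2, 1]])
print(transpose_conv2d(x, kernel))
# [[ 0.  4.  0.  1.]
#  [10.  7.  6.  3.]
#  [ 0.  7.  0.  2.]
#  [ 6.  3.  4.  2.]]
```

Each input entry scales the whole filter, which is the reverse of a regular convolution where the filter scales a patch of the input.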

### U-net

* [paper](https://arxiv.org/pdf/1505.04597)
* architecture intuition
    * input image
    * traditional convolution (detailed spatial info lost)
    * transpose convolution
    * skip connection from initial conv layers to the last conv layers (bringing detailed spatial info)
    * output layer

* architecture overview
    * input image (h x w x 3)
    * conv layer & ReLU
    * conv layer & ReLU
    * max-pooling to reduce h & w
        * conv layer & ReLU
        * conv layer & ReLU  
        * max-pooling to reduce h & w      
            * conv layer & ReLU
            * conv layer & ReLU  
            * max-pooling to reduce h & w 

                * trans conv layer & skip connects from prev layer
            * conv layer & ReLU
            * conv layer & ReLU
            * trans conv layer & skip connects from prev layer
        * conv layer & ReLU
        * conv layer & ReLU
        * trans conv layer & skip connects from prev layer                
    * conv layer & ReLU
    * conv layer & ReLU
    * trans conv layer & skip connects from prev (first) layer
    * conv layer & ReLU
    * conv layer & ReLU
    * (1 x 1) output layer, output (h x w x no of classes), argmax over classes dimension
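
Two steps of the overview are easy to show in isolation: merging a skip connection and turning the final class scores into a label map. The sketch below assumes activations are stored as (h, w, channels) arrays and that the skip connection is merged by concatenating along the channel axis, as in the U-Net paper; the function names are hypothetical.

```python
import numpy as np

def merge_skip(upsampled, skip):
    """Concatenate decoder activations with encoder activations along channels."""
    if upsampled.shape[:2] != skip.shape[:2]:
        raise ValueError("spatial sizes of both inputs must match")
    return np.concatenate([skip, upsampled], axis=-1)

def segmentation_mask(scores):
    """Turn (h, w, n_classes) per-pixel scores into an (h, w) label map."""
    return np.argmax(scores, axis=-1)

merged = merge_skip(np.zeros((4, 4, 8)), np.zeros((4, 4, 16)))
print(merged.shape)  # (4, 4, 24)
```

The merged volume carries both the high-level context from the decoder and the fine spatial detail from the encoder into the following conv layers.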
