### Abstract
Object detection is more difficult than image classification because it requires accurate localization of objects, which brings two major difficulties:
1. Numerous candidate object locations (often called “proposals”) must be processed.
2. These candidates provide only rough localization, which must be refined to achieve precise localization.


Previous approaches use a **multi-stage pipeline** (proposals → CNN → SVM → bounding-box regression),
which is **slow** (for R-CNN with VGG16 on VOC07: 2.5 GPU-days of training, hundreds of gigabytes of cached features on disk, and 47 s / image on a GPU at test time)
and inelegant.
Fast R-CNN instead uses a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.


R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation.

Spatial pyramid pooling networks (SPPnets) were proposed to speed up R-CNN by sharing computation. SPPnet accelerates 
R-CNN by 10 to 100$\times$ at test time. Training time is also reduced by 3$\times$ due to faster proposal feature extraction.

However, SPPnet also uses a multi-stage pipeline, in which the extracted features are written to disk.


### Contributions
>We propose a new training algorithm that fixes the disadvantages
>of R-CNN and SPPnet, while improving on their
>speed and accuracy. We call this method Fast R-CNN because
>it’s comparatively fast to train and test. The Fast R-CNN
>method has several advantages:
1. Higher detection quality (mAP) than R-CNN, SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching




### Overall structure
![](pic1.png)
![](diagram1.png)

- Because input images vary in size, the conv feature map does not have a fixed size.

```
graph LR
subgraph 
A[Image] -->|Deep ConvNet| B[Feature map]
C[proposals/RoIs]-->|Input| D((Extractor))
B-->|Input|D
D--> E[sub-Features for each RoI]
end
subgraph
E-->F(RoI pooling layers) 
F-->G(FC layers)

G-->|FC|H[label]
G-->|FC|I[position]
end
```

----

### RoI pooling layer 
 #### Purpose: 
 Convert the features inside any RoI into a fixed-size feature map.
 #### Input
 Features inside an RoI (which can have any spatial size)
 #### Output
 Fixed-size features
 #### Details
 Uses max pooling, applied independently to each channel.
 It is a special case of the spatial pyramid pooling layer used in
 SPPnets, with only one pyramid level.

 For more detailed sub-window calculation see: ***K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling
 in deep convolutional networks for visual recognition. In
 ECCV, 2014***
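
As a rough illustration of the pooling described above, here is a minimal NumPy sketch of RoI max pooling for a single RoI. The function name `roi_max_pool`, the inclusive `(x1, y1, x2, y2)` feature-map coordinates, and the floor/ceil bin boundaries are assumptions made for this example; they are not taken from the paper's implementation.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a (C, H, W) feature map to (C, out_h, out_w).

    roi = (x1, y1, x2, y2) in feature-map coordinates, inclusive
    (an assumed convention for this sketch).
    """
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    _, h, w = region.shape
    out = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        # floor/ceil bin edges; the max(...) keeps every bin non-empty
        r0 = int(np.floor(i * h / out_h))
        r1 = max(int(np.ceil((i + 1) * h / out_h)), r0 + 1)
        for j in range(out_w):
            c0 = int(np.floor(j * w / out_w))
            c1 = max(int(np.ceil((j + 1) * w / out_w)), c0 + 1)
            # max over the spatial bin, independently for each channel
            out[:, i, j] = region[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

fmap = np.arange(32).reshape(2, 4, 4)
pooled = roi_max_pool(fmap, (0, 0, 3, 3))   # pooled[0] is [[5, 7], [13, 15]]
```

Whatever the RoI's size, the output shape depends only on `out_h` and `out_w`, which is what lets the following fc layers accept it.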

https://github.com/deepsense-ai/roi-pooling

----
### Initializing with pretrained networks
 
 #### Three steps to modify a pretrained ImageNet network to Fast-R-CNN
 1. Replace the last max pooling layer with an RoI pooling layer.
 2. >The network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K+1 categories and **category-specific** bounding-box regressors).
 3. Modify the network to take two data inputs: a list of images and a list of RoIs in those images.

----
### Fine tuning for detection

#### Why SPPnet is unable to update weights below the spatial pyramid pooling layer
 1. Back-propagation through the SPP layer is highly inefficient when each training sample (RoI) comes from a different image.
 2. The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image, so the forward pass must process the whole receptive field.

#### Proposed method
##### Main idea
 Take advantage of feature sharing during training: RoIs from the same image share computation and memory in the forward and backward passes.
##### Details
 1. Hierarchical sampling: first sample N images, then sample R/N RoIs from each image (the paper uses N=2, R=128).
 2. Jointly optimize a softmax classifier and bounding-box regressors in one fine-tuning stage.

----
### Multi task loss
$$
  L(p,u,t^u,v) = L_{cls}(p,u) + \lambda[u\ge1]L_{loc}(t^u,v)
$$

$p=(p_0,p_1,...,p_K)$ is the probability distribution over K+1 categories.

$t^k = (t_x^k,t_y^k,t_w^k,t_h^k)$ is the output of bbox regression, for each of the K categories indexed by k.

$u$ is the ground-truth class.

$v = (v_x,v_y,v_w,v_h)$ is the ground truth bbox for class u. 

$L_{cls}(p,u) = -\log p_u$ is the log loss for the true class $u$.

$$
[u\ge1] = \begin{cases}
0 & u\lt1 \\
1 & \text{otherwise}
\end{cases}
$$

$$
L_{loc}(t^u,v)=\sum_{i\in \{x,y,w,h\}}{smooth_{L1}(t_i^u-v_i)}
$$

$$
smooth_{L1}(x) = \begin{cases}
0.5x^2 & \text{if } |x|<1 \\
|x|-0.5 & \text{otherwise}
\end{cases}
$$

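The paper notes that smooth $L_1$ is a robust loss that is less sensitive to outliers than the $L_2$ loss used in R-CNN and SPPnet.
With unbounded regression targets, $L_2$ training can need careful learning-rate tuning to avoid exploding gradients, so smooth $L_1$ makes the network easier to tune.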
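
To make the formula concrete, the following NumPy sketch evaluates the multi-task loss for a single RoI. The function names, the `(K+1, 4)` layout of `t` (with row 0 unused for the background class), and the default `lam=1.0` are assumptions chosen for this example.

```python
import numpy as np

def smooth_l1(x):
    """Element-wise smooth L1: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def multi_task_loss(p, u, t, v, lam=1.0):
    """Loss for one RoI.

    p   : (K+1,) class probabilities (softmax output)
    u   : int, ground-truth class, 0 = background
    t   : (K+1, 4) predicted offsets per class (row 0 unused; assumed layout)
    v   : (4,) ground-truth regression target for class u
    lam : weight of the localization term
    """
    l_cls = -np.log(p[u])
    # Iverson bracket [u >= 1]: background RoIs get no localization loss
    l_loc = smooth_l1(t[u] - v).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc

p = np.array([0.5, 0.25, 0.25])
t = np.zeros((3, 4))
t[1] = [0.5, 0.0, 2.0, 0.0]
loss_fg = multi_task_loss(p, 1, t, np.zeros(4))   # -log(0.25) + 0.125 + 1.5
loss_bg = multi_task_loss(p, 0, t, np.zeros(4))   # -log(0.5) only
```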

---

### Mini-batch sampling.
How are training samples chosen?
1. Each mini-batch is constructed from N=2 images, with 64 RoIs sampled from each image (R=128 in total).
2. 25% of the RoIs are taken from proposals that have $IoU \ge 0.5$ with a ground-truth box; these are labeled with the foreground class of that box.
3. The remaining 75% are taken from proposals whose maximum IoU with ground truth is in $[0.1,0.5)$; these are labeled 0 (background).

> During training, images are horizontally flipped with
probability 0.5. No other data augmentation is used
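
The sampling rule above can be sketched for one image as follows. The function name `sample_rois`, its arguments, and the choice to fill the batch with extra background RoIs when too few foreground proposals exist are assumptions of this sketch; the notes do not specify that fallback.

```python
import numpy as np

def sample_rois(max_iou, rois_per_image=64, fg_fraction=0.25, rng=None):
    """Pick foreground/background proposal indices for one image.

    max_iou : (P,) each proposal's maximum IoU with any ground-truth box
    Returns (fg_idx, bg_idx) as integer index arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    max_iou = np.asarray(max_iou)
    fg_pool = np.flatnonzero(max_iou >= 0.5)
    bg_pool = np.flatnonzero((max_iou >= 0.1) & (max_iou < 0.5))
    # Up to 25% foreground; the rest is background (assumed fallback rule)
    n_fg = min(int(round(fg_fraction * rois_per_image)), fg_pool.size)
    n_bg = min(rois_per_image - n_fg, bg_pool.size)
    fg_idx = rng.choice(fg_pool, size=n_fg, replace=False)
    bg_idx = rng.choice(bg_pool, size=n_bg, replace=False)
    return fg_idx, bg_idx

# 100 foreground-like, 100 background-like, 50 ignored proposals
ious = np.concatenate([np.full(100, 0.7), np.full(100, 0.3), np.full(50, 0.05)])
fg_idx, bg_idx = sample_rois(ious, rng=np.random.default_rng(0))
```

Calling this once per image for N=2 images gives the R=128 RoIs of one mini-batch; proposals with IoU below 0.1 are never selected.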

---
### Back propagation through RoI pooling layers
$$
\frac{\partial L}{\partial x_i} = \sum_{r}\sum_{j}{[i=i^*(r,j)]\frac{\partial L}{\partial y_{rj}}}
$$

$x_i \in \mathbb{R}$ is the $i$-th input activation to the RoI pooling layer.

$y_{rj} = x_{i^*(r,j)}$ is the $j$-th output of the pooling layer for the $r$-th RoI.

$i^*(r,j) = \arg\max_{i' \in R(r,j)} x_{i'}$ is the index of the input selected by max pooling.

$R(r,j)$ is the index set of inputs in the sub-window over which the output unit $y_{rj}$ pools.

>for each mini-batch RoI r and for each pooling
output unit $y_{rj}$ , the partial derivative $\partial L/\partial y_{rj}$ is accumulated
if i is the argmax selected for $y_{rj}$ by max pooling
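
A minimal sketch of this accumulation, assuming the forward pass has stored the argmax index $i^*(r,j)$ for every output unit. The function name `roi_pool_backward` and the flattened indexing of inputs are assumptions for illustration.

```python
import numpy as np

def roi_pool_backward(num_inputs, argmax, grad_y):
    """Route output gradients back to the inputs chosen by max pooling.

    num_inputs : number of input activations x_i (flattened)
    argmax     : (R, J) integer array, argmax[r, j] = i*(r, j)
    grad_y     : (R, J) array, grad_y[r, j] = dL/dy_rj
    Returns dL/dx with shape (num_inputs,).
    """
    grad_x = np.zeros(num_inputs, dtype=grad_y.dtype)
    # np.add.at accumulates correctly when one input is the argmax of
    # several output units (overlapping RoIs), unlike plain fancy indexing
    np.add.at(grad_x, argmax.ravel(), grad_y.ravel())
    return grad_x

argmax = np.array([[0, 2], [2, 4]])          # input 2 is selected twice
grad_y = np.array([[1.0, 2.0], [3.0, 4.0]])
grad_x = roi_pool_backward(5, argmax, grad_y)  # [1, 0, 5, 0, 4]
```

Because RoIs from the same image overlap, a single input can receive gradient from many output units, which is why the sum runs over both $r$ and $j$.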

---
### SGD hyper-parameters
**Initialization**
The fc layers for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations 0.01 and 0.001, respectively; biases are initialized to 0.

**Learning rate**
Per-layer learning rate: 1 for weights, 2 for biases.
Global learning rate: 0.001.

Learning rate schedule:
   - VOC07, VOC12: 30k iterations at 0.001, then 10k more iterations at 0.0001.
   - Larger datasets: SGD is run for more iterations.

**Momentum and weight decay**
A momentum of 0.9 and parameter decay of 0.0005 (on weights and biases) are used.
    
    
--- 
### Scale invariance
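**Two approaches**
1. Brute force: every image is processed at a pre-defined pixel size during both training and testing, and the network must learn scale-invariant detection directly from the training data.
2. Image pyramids:
    + provide approximate scale invariance to the network.
    + during training, a pyramid scale is randomly sampled each time an image is sampled, which acts as a form of data augmentation.

More details: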
***K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling
 in deep convolutional networks for visual recognition. In
 ECCV, 2014***
 
 --- 
### Fast-R-CNN detection.
After the network has been trained, the inputs to the network are:
   1. an image, or an image pyramid encoded as a list of images
   2. a list of pre-computed RoIs.
 
 
The outputs of the network are:
   1. For each test RoI $r$, a class posterior probability distribution $p$.
   2. A set of predicted bounding-box offsets relative to $r$.

After the forward pass there is one extra operation:
>We then perform non-maximum suppression independently for each
class using the algorithm and settings from R-CNN

***R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature
hierarchies for accurate object detection and semantic
segmentation. In CVPR, 2014***
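
For readers unfamiliar with non-maximum suppression, here is a minimal greedy NMS sketch for the boxes of one class. The function name `nms`, the `(x1, y1, x2, y2)` box format, and the default `iou_threshold=0.3` are assumptions for this example and are not claimed to reproduce the R-CNN settings.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS for one class. Returns the indices of the kept boxes.

    boxes  : (N, 4) array of (x1, y1, x2, y2) with positive area
    scores : (N,) confidence scores for this class
    """
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the top box with every remaining box
        iw = np.maximum(0.0, np.minimum(boxes[i, 2], boxes[rest, 2])
                        - np.maximum(boxes[i, 0], boxes[rest, 0]))
        ih = np.maximum(0.0, np.minimum(boxes[i, 3], boxes[rest, 3])
                        - np.maximum(boxes[i, 1], boxes[rest, 1]))
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop boxes that overlap the kept box too much
        order = rest[iou <= iou_threshold]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
kept = nms(boxes, np.array([0.9, 0.8, 0.7]))   # [0, 2]
```

Running this separately on each class's scored boxes corresponds to the "independently for each class" part of the quote.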

___

### Using truncated SVD to speed up detection
**When to use it**: object detection.
Unlike whole-image classification, detection has to process a large number of RoIs, so the network spends nearly half of its forward-pass time computing the fc layers.

**Technique**: truncated SVD
Basic idea: split a single fc layer into two layers, reducing the total number of parameters.

**Input**: fc layer $W, b$

**Output**: two consecutive fc layers $W_1, b_1, W_2, b_2$, with **no non-linearity between them**

**Process**:
 1. Perform SVD on $W$: $W = U_0\Sigma V_0^T$.
 2. Keep the top $t$ singular values to approximate $W$: $W \approx U\Sigma_t V^T$, where $U$ and $V$ hold the first $t$ left- and right-singular vectors.
 3. $W_1 = \Sigma_tV^T$, $b_1 = 0$
 4. $W_2 = U$, $b_2 = b$
 
**Analysis**:
 
  If $W$ is a $u \times v$ matrix, then $U$ is a $u \times t$ matrix and $\Sigma_tV^T$ is a $t \times v$ matrix.
 
  So the number of parameters is reduced from $uv$ to $t(u+v)$, which is a large saving when $t$ is much smaller than $\min(u,v)$.
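
The factorization steps can be sketched with NumPy's `numpy.linalg.svd`. The function name `compress_fc` and the convention that a layer computes `W @ x + b` are assumptions for this example.

```python
import numpy as np

def compress_fc(W, b, t):
    """Split one fc layer (W: u x v, b: u) into two via truncated SVD.

    Returns (W1, b1, W2, b2) such that W2 @ (W1 @ x + b1) + b2
    approximates W @ x + b, with no non-linearity in between.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = s[:t, None] * Vt[:t]     # Sigma_t V^T, shape (t, v)
    b1 = np.zeros(t)
    W2 = U[:, :t]                 # first t left-singular vectors, shape (u, t)
    b2 = b
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 8))  # rank-2, 6 x 8
b = rng.standard_normal(6)
W1, b1, W2, b2 = compress_fc(W, b, t=2)
n_params_before = W.size            # u * v     = 48
n_params_after = W1.size + W2.size  # t * (u+v) = 28
```

In this toy case $W$ has rank 2, so $t=2$ reproduces it up to floating-point error; for a real fc layer the truncation is only an approximation and $t$ trades accuracy for speed.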

---
### Which layers to finetune?
1. If only the fc layers are fine-tuned, mAP drops from 66.9% to 61.4%, which means that back-propagation through the RoI pooling layer is important for very deep networks.
2. It is only necessary to update layers from conv3_1 and up (9 of the 13 conv layers).

 Reasons:
  1. Updating from conv2_1 slows training by $1.3\times$ (12.5 vs. 9.5 hours) compared to learning from conv3_1.
  2. Updating from conv1_1 **over-runs GPU memory**.
  3. The difference in mAP when learning from conv2_1 up was only +0.3 points.


##### My thought: 

Would it be possible to make the model smaller if we freeze the bottom layers of the CNN?

---

### Scale invariance: to brute force or finesse?

|                  | SPPnet ZF   | S           | M           | L    |
| ---------------- |:-----------:|:-----------:|:-----------:|:----:|
| scales           | 1 / 5       | 1 / 5       | 1 / 5       | 1    |
| test rate (s/im) | 0.14 / 0.38 | 0.10 / 0.39 | 0.15 / 0.64 | 0.32 |
| VOC07 mAP        | 58.0 / 59.2 | 57.1 / 58.4 | 59.2 / 60.7 | 66.9 |

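S, M, and L denote the paper's small (CaffeNet), medium (VGG_CNN_M_1024), and large (VGG16) models; each cell lists the 1-scale / 5-scale result, and L is run at a single scale only.
The results show that multi-scale processing gives only slightly higher mAP than single-scale, at a much higher compute cost per image.
So single-scale processing offers the best trade-off between accuracy and speed, especially for very deep models.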


---

### Do SVMs outperform softmax?
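The answer is **no**.
>softmax slightly outperforming SVM for all three networks, by +0.1 to +0.8 mAP points

**Thoughts**
At first I expected the comparison to be between SVMs and fc layers. Why are SVMs compared to softmax?
(In R-CNN and SPPnet, one-vs-rest linear SVMs are trained post hoc on the fc features and act as the final classifier, which is the role softmax plays in Fast R-CNN.)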

### Are more proposals always better?
> We find evidence that the proposal-classifier
cascade also improves Fast R-CNN accuracy

>Classifying sparse proposals is a type of cascade [22] in
which the proposal mechanism first rejects a vast number of
candidates leaving the classifier with a small set to evaluate

So there are broadly two types of proposal methods:
1. those that produce a sparse set of object proposals
2. those that produce a dense set of proposals

The paper finds that using sparse proposals improves Fast R-CNN's accuracy compared with dense ones.

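- mAP: mean average precision,
a score borrowed from information retrieval:
$$ MAP = \frac{\sum^{Q}_{q=1}{avgP(q)}}{Q} $$
where $Q$ is the number of queries (for detection, the number of object classes) and $avgP(q)$ is the average precision for query $q$.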


- intersection over union (IoU)
$$
    IoU = \frac{|DetectionResult \cap GroundTruth|}{|DetectionResult\cup GroundTruth|}
$$
where $|\cdot|$ denotes the area of a region.
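
For axis-aligned boxes the IoU can be computed directly from corner coordinates. This is a minimal sketch; the function name `iou` and the `(x1, y1, x2, y2)` box format with continuous coordinates are assumptions for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width/height are clamped at 0 when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # 1 / 7
```

Two 2×2 boxes offset by one unit share an area of 1 out of a union of 7, so they would count as background under the $[0.1, 0.5)$ sampling rule.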


- Hard negative mining

http://cs.brown.edu/courses/cs143/2011/results/proj4/psastras/

>During the first training stage positive crops are used from the training data coupled with negative training crops randomly chosen from images without a face. To refine the SVM, the detector is run again on the non face scenes, and any detections (these are false positives) are used as new negative training examples to train another SVM, to decrease the false positive rate. This step can be repeated multiple times.

>The strategy used for mining hard negatives was relatively simple: for each non face scene image, any detected faces were sorted by confidence, with the top results used as hard negative examples. Hard negative mining usually improved performance by about 5-10%, but seemed much more effective for the RBF SVM.


 