### Mask R-CNN


#### Instance Segmentation problem
* Given an image, the model returns bounding boxes, classes, and masks for the objects in it. This goes beyond plain object detection.
* Object detection (Fast / Faster R-CNN) + semantic segmentation (FCN)

#### Object detection
* Sliding window
    * brute force approach
    * slide windows from left to right and from top to bottom, classifying each window to identify objects
    * varied sizes, and aspect ratios
* Selective search
    * start with each pixel as its own group
    * repeatedly merge the two groups that are closest (most similar) to form larger candidate regions
<img src="./selective-search.png">
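
The sliding-window idea above can be sketched in a few lines. This is only an illustration: the function name `sliding_windows`, the window sizes, and the stride are assumptions chosen for the example, and the classifier that would score each window is left out.

```python
def sliding_windows(image_w, image_h, sizes, stride):
    """Yield (x, y, w, h) windows, scanning left to right, then top to bottom."""
    for w, h in sizes:  # varied sizes and aspect ratios
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                yield (x, y, w, h)


# Hypothetical 8x6 image, two window shapes, stride of 2 pixels.
windows = list(sliding_windows(8, 6, sizes=[(4, 4), (6, 3)], stride=2))
print(len(windows), "windows to classify")
```

Even this tiny example produces many windows, which is why the approach is called brute force: every window has to be passed through the classifier.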

<!-- ### State of the art of object detection -->

<!-- #### R-CNN -->

#### Faster R-CNN

<img src="./faster-rcnn.jpeg">

* uses a CNN feature extractor to extract image features
* then uses a CNN region proposal network to create regions of interest (RoIs)
* then applies RoI pooling to warp each RoI to a fixed size
* feeds the result into fully connected layers for classification and bounding box prediction

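<pre><code># Pseudocode: region_proposal, roi_pooling and detector2 stand for the network stages
ROIs = region_proposal(feature_maps)          # RPN proposes candidate regions
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)    # warp each RoI to a fixed size
    results = detector2(patch)                # classify and refine the bounding box
</code></pre>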
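
To make the RoI pooling step more concrete, here is a minimal sketch on a single-channel feature map stored as nested lists. The function name `roi_max_pool`, the end-exclusive `(x0, y0, x1, y1)` RoI format, and the 2x2 output size are assumptions for illustration, not the layout of any particular library. The sketch assumes the RoI is at least `out_size` cells wide and tall.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one RoI of a 2-D feature map down to out_size x out_size."""
    x0, y0, x1, y1 = roi  # integer cell coordinates, end-exclusive
    region = [row[x0:x1] for row in feature_map[y0:y1]]
    h, w = len(region), len(region[0])
    # Integer (quantized) bin edges: bins can differ in size when the
    # RoI does not divide evenly into out_size parts.
    ys = [i * h // out_size for i in range(out_size + 1)]
    xs = [j * w // out_size for j in range(out_size + 1)]
    return [
        [
            max(v for row in region[ys[i]:ys[i + 1]] for v in row[xs[j]:xs[j + 1]])
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Toy 6x6 feature map whose value at (row r, column c) is 6*r + c.
fmap = [[6 * r + c for c in range(6)] for r in range(6)]
pooled = roi_max_pool(fmap, roi=(1, 1, 5, 4))  # a 4-wide, 3-tall RoI
print(pooled)
```

Here the 3-row RoI is split into a 1-row bin and a 2-row bin, so the output cells cover unequal areas of the input.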

#### What’s similar between Mask R-CNN and Faster R-CNN?
* Both Mask R-CNN and Faster R-CNN have a branch for classification and bounding box regression.
* Both use a CNN backbone such as ResNet-101 to extract features from the image.
* Both use a Region Proposal Network (RPN) to generate Regions of Interest (RoIs).

Faster R-CNN lays the groundwork for feature extraction and RoI proposals.


#### What's special about Mask RCNN?
* adds a mask branch (2 more convolution layers) that builds the mask for each RoI.
* uses ROI Align instead of ROI Pooling

#### ROI Align vs ROI Pool

<img src="./roi-align.png">

* Mask R-CNN uses ROI Align, which does not quantize (digitize) the cell boundaries (top right) and makes every target cell the same size (bottom right).
* It also applies bilinear interpolation to compute the feature map values inside each cell more accurately.
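
A rough sketch of the idea, under simplifying assumptions: the feature map has a single channel stored as nested lists, the value at integer position `(x, y)` is `feature_map[y][x]`, and each bin is sampled once at its centre (the paper samples several points per bin and aggregates them). The names `bilinear_sample` and `roi_align` are illustrative only.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Interpolate a 2-D feature map at a continuous point (x, y)."""
    h, w = len(feature_map), len(feature_map[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature_map, roi, out_size=2):
    """Simplified RoI Align: one bilinear sample at the centre of each bin."""
    x0, y0, x1, y1 = roi  # continuous coordinates, never rounded
    bin_w = (x1 - x0) / out_size  # every bin has the same size
    bin_h = (y1 - y0) / out_size
    return [
        [
            bilinear_sample(feature_map, x0 + (j + 0.5) * bin_w, y0 + (i + 0.5) * bin_h)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Toy 6x6 feature map whose value at (row r, column c) is 6*r + c.
fmap = [[6 * r + c for c in range(6)] for r in range(6)]
aligned = roi_align(fmap, roi=(0.5, 0.5, 3.5, 2.5))
print(aligned)
```

Because the RoI coordinates stay fractional and the bins are all the same size, no spatial information is lost to rounding, which matters when the output is a pixel-level mask.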


#### Mask R-CNN
<img src="./mask-rcnn.png">

* paper
* Two stages

<!--     * First: a light weight neural network called RPN scans all FPN top-bottom pathway (feature map) and proposes regions which may contain objects.
    * Second: another neural network takes proposed regions by the first stage and assign them to several specific areas of a feature map level, scans these areas, and generates objects classes(multi-categorical classified), bounding boxes and masks.
* backbone is a FPN (Feature Pyramid Networks) style deep neural net. -->

<!-- * WHERE TO PUT THIS?
    * Warped features are then fed into fully connected layers to make classification using softmax and boundary box prediction is further refined using the regression model
    * Warped features are also fed into Mask classifier, which consists of two CNN’s to output a binary mask for each RoI. Mask Classifier allows the network to generate masks for every class without competition among classes. -->

<!--     * consists of a bottom-up pathway, a top-bottom pathway and lateral connections.
        * bottom-up: extracts features (can be any ConvNet, usually ResNet or VGG)
        * top-bottom: generates freature pyramid map
        * lateral connection: convolution and adding operations between two pathways.
    * FPN outperformssingle ConvNets b/c it maintains strong sematically features at various resoultion scales.
     -->




#### IoU (Intersection over Union)
<img src="./iou_equation.png" width="200" height="200">

* measures the accuracy of an object detector on a particular dataset
* compares the predicted bounding box with the ground-truth (hand-labeled) bounding box
* an Intersection over Union score > 0.5 is normally considered a “good” prediction
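
As a small worked example, IoU for two axis-aligned boxes can be computed as below. The `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions made for this sketch.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (2, 0, 12, 10)
score = iou(prediction, ground_truth)
print(score > 0.5)  # intersection 80, union 120
```
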

What if I have multiple bounding boxes with IoU greater than 0.5?

#### NMS (Non Maximum Suppression)
<img src="./nms.png" width="200" height="200">

* It groups highly overlapping boxes for the same class and keeps only the most confident prediction.
* This avoids duplicates for the same object.
* It is applied per class.

<img src="./nms-2.png">

The dotted lines show the ROIs used for classification. The solid line is the refined bounding box.
<!-- * a class of algorithms to select one entity (e.g. bounding boxes) out of many overlapping entities. In essence it is a form of clustering algorithm.
* attempts have been made to use standard clustering algorithms such as k-means, Nearest Neighbor, DB Scan, etc. in object detection
* is used along with some form of overlap measure (e.g. IOU). -->
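
A minimal greedy NMS sketch, assuming each detection is a `(box, score, class_id)` tuple with boxes in `(x1, y1, x2, y2)` format and an IoU threshold of 0.5. These conventions and the names `iou` and `nms` are chosen for illustration; real implementations are usually vectorized.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Keep the highest-scoring box in each group of overlapping same-class boxes."""
    kept = []
    # Visit detections from most to least confident.
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, cls = det
        # Suppress only if a kept box of the SAME class overlaps too much.
        if all(k[2] != cls or iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    ((0, 0, 10, 10), 0.9, "cat"),
    ((1, 1, 11, 11), 0.8, "cat"),    # overlaps the first cat box heavily
    ((20, 20, 30, 30), 0.7, "cat"),  # a separate cat elsewhere
    ((0, 0, 10, 10), 0.6, "dog"),    # different class, so never compared to cats
]
kept = nms(detections)
print([(cls, score) for _, score, cls in kept])
```

The second cat box is dropped as a duplicate, while the dog box survives even though it coincides with a cat box, because suppression is done per class.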

#### Key improvements of Mask R-CNN
* ROI Align
    * Pixel-to-pixel alignment
    * for a single object, it outputs multiple candidate bounding boxes rather than one definite box, and warps each of them to a fixed size.
* The mask branch runs in parallel with the existing branches, so it has little effect on speed
* The feature map is reused across branches
* Eliminates competition between classes when generating the mask
    * FCN uses a softmax and a multinomial cross-entropy loss
    * Mask R-CNN uses a per-pixel sigmoid and a binary loss
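
The difference between the two output functions can be seen on a single pixel. The sketch below uses made-up logits for three classes and hypothetical helper names (`softmax`, `sigmoid`, `binary_mask_loss`). It is not numerically hardened for very large logits.

```python
import math


def softmax(logits):
    """Class probabilities that compete: they always sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent probability for one class."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_mask_loss(mask_logits, target_mask):
    """Mean per-pixel binary cross-entropy for a single class's mask."""
    losses = []
    for z, t in zip(mask_logits, target_mask):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)


pixel_logits = [2.0, 2.0, -1.0]  # one pixel, three classes (made-up values)
competing = softmax(pixel_logits)
independent = [sigmoid(z) for z in pixel_logits]
print(competing)    # two strong classes must split the probability mass
print(independent)  # each class can be confident on its own

loss = binary_mask_loss([0.0, 0.0], [1, 0])  # two pixels of one class's mask
print(loss)
```

With softmax, two equally strong classes are each pushed below 0.5. With sigmoid, both stay high, so the mask for one class is not penalized by the scores of the others.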

#### Terms
* Anchor boxes: 
     <img src="./anchor-boxes.png">
     
    * a set of predefined bounding boxes of fixed heights and widths.
    * used to detect multiple objects, objects at different scales, and overlapping objects in an image.
    * for the final detection, anchor boxes that belong to the background class are removed and the remaining ones are filtered by their confidence score.

* RPN: 
    * uses a small CNN with a lightweight binary classifier to generate multiple Regions of Interest (RoIs).
    * uses 9 anchor boxes at each sliding position over the image.
    * The classifier returns object / no-object scores. Non-maximum suppression is then applied to the anchors with high objectness scores.
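
One common way to arrive at 9 anchors is 3 scales times 3 aspect ratios at each position. The sketch below generates such a set. The function name `make_anchors`, the default scales and ratios, and the `(x1, y1, x2, y2)` output format are assumptions for illustration.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchors (x1, y1, x2, y2) centred at (cx, cy)."""
    anchors = []
    for s in scales:
        for r in ratios:  # r = height / width; the area stays close to s * s
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(cx=300, cy=200)
print(len(anchors))
for box in anchors[:3]:
    print([round(v, 1) for v in box])
```

The RPN scores each of these boxes as object or background at every position, and the high-scoring ones go on to non-maximum suppression.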

<!-- Comparing with YOLO 
 -->

<!-- ResNet -->