### Mask R-CNN architecture


### Mask R-CNN (He K. et al., 2017)

[Paper](https://arxiv.org/abs/1703.06870)

*...conceptually simple, flexible, and general framework for object instance segmentation...Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework.*

* Object detection
* Instance segmentation
* Keypoint detection


In [1]:
import torch
import torchvision

import plotly.offline as offline

assert torch.cuda.is_available() is True
%load_ext watermark

In [2]:
%watermark -p torch,torchvision

torch      : 1.10.2
torchvision: 0.11.3



Let's recall the Faster R-CNN inference scheme:

<img src="../assets/3_faster_rcnn.svg" width="450">

Faster R-CNN:

* Backbone / feature extractor
* Region Proposal Network (RPN)
* Fully connected layers with classifier and box-regressor heads

Mask R-CNN updates:

* A third (mask) branch with output of shape $N \times K \times [m \times m]$, where $N$ is the number of proposals, $K$ is the number of classes, and $m \times m$ is the resolution of the binary mask predicted for each class (default $m = 28$).
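
To make the $N \times K \times [m \times m]$ shape concrete, here is a minimal sketch of how one mask per proposal can be picked from such an output at inference time. The tensor is random and the values of `N`, `K`, and `labels` are assumptions chosen only for illustration; this is not the torchvision implementation.

```python
import torch

N, K, m = 4, 3, 28                      # proposals, classes, mask resolution (illustrative)
mask_logits = torch.randn(N, K, m, m)   # stand-in for the mask branch output
labels = torch.tensor([0, 2, 1, 2])     # hypothetical class predicted for each proposal

# Keep only the channel of the predicted class for every proposal: [N, m, m]
selected = mask_logits[torch.arange(N), labels]

# A per-pixel sigmoid gives foreground probabilities; threshold them to get binary masks
binary_masks = selected.sigmoid() > 0.5

print(mask_logits.shape, binary_masks.shape)
```

Because every class has its own channel, masks of different classes do not compete with each other; the class decision is left to the classification head.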


Mask head:
<img src="../assets/7_mask_rcnn.svg" width="650">

[Transposed convolution](https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html) (often called "deconvolution", which is a misleading name)
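
As a rough illustration of the upsampling step in the mask head, the sketch below uses `torch.nn.ConvTranspose2d` to double the spatial size of RoI features and a 1x1 convolution to produce one mask channel per class. The channel count (256), the input size (14x14), and the number of classes (3) are assumptions for the example.

```python
import torch
from torch import nn

# Assumed sizes for illustration: 4 RoIs, 256 channels, 14x14 RoI features, K = 3 classes
roi_features = torch.randn(4, 256, 14, 14)

upsample = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)  # doubles H and W
mask_predictor = nn.Conv2d(256, 3, kernel_size=1)                 # one channel per class

upsampled = upsample(roi_features)                    # [4, 256, 28, 28]
mask_logits = mask_predictor(torch.relu(upsampled))   # [4, 3, 28, 28]

print(upsampled.shape, mask_logits.shape)
```

With `kernel_size=2` and `stride=2` the output size is exactly twice the input size, which is how a 14x14 RoI feature map becomes a 28x28 mask.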

* The mask branch is fully convolutional: it follows the convolutional layers, so the spatial structure of the feature map is preserved.


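* RoIAlign replaces RoIPool: feature values are computed by bilinear interpolation at sampled points instead of quantizing RoI boundaries to the feature-map grid.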

<img src="https://erdem.pl/4482c23c18077e16a560fb6add556cc7/ROI-pooling.gif" width="450">


<img src="../assets/1_mask_rcnn.svg" width="950">

In [3]:
print(torchvision.ops.roi_align.__doc__)


    Performs Region of Interest (RoI) Align operator with average pooling, as described in Mask R-CNN.

    Args:
        input (Tensor[N, C, H, W]): The input tensor, i.e. a batch with ``N`` elements. Each element
            contains ``C`` feature maps of dimensions ``H x W``.
            If the tensor is quantized, we expect a batch size of ``N == 1``.
        boxes (Tensor[K, 5] or List[Tensor[L, 4]]): the box coordinates in (x1, y1, x2, y2)
            format where the regions will be taken from.
            The coordinate must satisfy ``0 <= x1 < x2`` and ``0 <= y1 < y2``.
            If a single Tensor is passed, then the first column should
            contain the index of the corresponding element in the batch, i.e. a number in ``[0, N - 1]``.
            If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i
            in the batch.
        output_size (int or Tuple[int, int]): the size of the output (in bins or pixels) after the pool

* Loss:

$$L = L_{cls} + L_{box} + L_{mask}$$

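$L_{mask}$ is the average binary cross-entropy loss over mask pixels (per-pixel sigmoid); for each RoI it is computed only on the mask of the ground-truth class.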
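
The mask term can be sketched as follows, assuming random logits and random target masks that are already cropped and resized to $m \times m$. The names `gt_labels` and `gt_masks` are hypothetical, and the snippet only illustrates the idea rather than reproducing any library's loss function.

```python
import torch
import torch.nn.functional as F

N, K, m = 4, 3, 28
mask_logits = torch.randn(N, K, m, m)            # stand-in for the mask branch output
gt_labels = torch.tensor([0, 2, 1, 2])           # hypothetical ground-truth class per RoI
gt_masks = (torch.rand(N, m, m) > 0.5).float()   # hypothetical m x m target masks

# Only the channel of the ground-truth class contributes to the loss
selected_logits = mask_logits[torch.arange(N), gt_labels]

# Per-pixel sigmoid + binary cross-entropy, averaged over all pixels and RoIs
loss_mask = F.binary_cross_entropy_with_logits(selected_logits, gt_masks)
print(loss_mask.item())
```

Channels of the other classes receive no gradient from this term, which decouples mask prediction from classification.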




* Multiple feature maps at different scales

* Feature Pyramid Network (FPN)

FPN general scheme:
<img src="../assets/6_mask_rcnn.png" width="450">

[PyTorch `Upsample`](https://pytorch.org/docs/stable/generated/torch.nn.Upsample.html)
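
One top-down merge step of an FPN can be sketched like this: a coarse map is upsampled and added to a finer map after both pass through 1x1 lateral convolutions. The tensors `c4` and `c5`, their channel counts, and the 256 output channels are assumptions for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Hypothetical backbone outputs at two neighbouring scales
c4 = torch.randn(1, 512, 32, 32)    # finer map
c5 = torch.randn(1, 1024, 16, 16)   # coarser map

lateral4 = nn.Conv2d(512, 256, kernel_size=1)    # 1x1 convs bring both maps to 256 channels
lateral5 = nn.Conv2d(1024, 256, kernel_size=1)
smooth4 = nn.Conv2d(256, 256, kernel_size=3, padding=1)

p5 = lateral5(c5)
# Upsample the coarse map to the finer resolution and add the lateral connection
top_down = F.interpolate(p5, size=c4.shape[-2:], mode="nearest")
p4 = smooth4(lateral4(c4) + top_down)

print(p5.shape, p4.shape)
```

Repeating this step down the pyramid yields feature maps of equal channel width at several resolutions, so proposals of different sizes can be pooled from a suitable level.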

Training scheme:
    
<img src="../assets/2_mask_rcnn.svg" width="650">
    

Inference scheme:

<img src="../assets/3_mask_rcnn.svg" width="450">



[Matterport](https://github.com/matterport/Mask_RCNN) training and inference implementation:

<img src="../assets/5_mask_rcnn.svg" width="950">

<img src="../assets/4_mask_rcnn.svg" width="1050">

In [4]:
print(torchvision.models.detection.mask_rcnn.MaskRCNN.__doc__)


    Implements Mask R-CNN.

    The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each
    image, and should be in 0-1 range. Different images can have different sizes.

    The behavior of the model changes depending if it is in training or evaluation mode.

    During training, the model expects both the input tensors, as well as a targets (list of dictionary),
    containing:
        - boxes (``FloatTensor[N, 4]``): the ground-truth boxes in ``[x1, y1, x2, y2]`` format, with
          ``0 <= x1 < x2 <= W`` and ``0 <= y1 < y2 <= H``.
        - labels (Int64Tensor[N]): the class label for each ground-truth box
        - masks (UInt8Tensor[N, H, W]): the segmentation binary masks for each instance

    The model returns a Dict[Tensor] during training, containing the classification and regression
    losses for both the RPN and the R-CNN, and the mask loss.

    During inference, the model requires only the input tensors, and returns the post-

#### Keypoint detection: Human Pose Estimation

* Each of the $K$ keypoint types of an instance is predicted as its own one-hot mask
* Only 1 pixel of each mask is foreground
* Bilinear upscaling of the output increases localization accuracy
* Cross-entropy loss is minimized over a $(2m)^2$-way softmax output
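
The points above can be sketched as a loss over flattened heatmaps: each keypoint heatmap becomes a single softmax over all $(2m)^2$ positions, and the target is the index of the one foreground pixel. The logits and keypoint positions are random, and `K = 17`, `gt_xy`, and `visible` are assumptions used only for illustration.

```python
import torch
import torch.nn.functional as F

N, K, size = 2, 17, 56   # instances, keypoint types, heatmap side (2m with m = 28)
heatmap_logits = torch.randn(N, K, size, size)

# Hypothetical ground-truth keypoints on the heatmap grid, stored as integer (x, y)
gt_xy = torch.randint(0, size, (N, K, 2))
visible = torch.ones(N, K, dtype=torch.bool)
visible[0, 3] = False    # pretend one keypoint is not visible

# One softmax per heatmap over size * size positions
logits = heatmap_logits.view(N * K, size * size)
# One-hot target expressed as the flat index of the single foreground pixel
target = (gt_xy[..., 1] * size + gt_xy[..., 0]).view(N * K)

keep = visible.view(N * K)   # only visible keypoints contribute
loss_keypoint = F.cross_entropy(logits[keep], target[keep])
print(loss_keypoint.item())
```

Unlike the per-pixel sigmoid used for masks, the softmax makes all positions of a heatmap compete, which encourages a single peak per keypoint.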

<img src="../assets/8_mask_rcnn.svg" width="750">

<img src="../assets/9_mask_rcnn.svg" width="450">

In [5]:
print(torchvision.models.detection.keypoint_rcnn.KeypointRCNN.__doc__)


    Implements Keypoint R-CNN.

    The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each
    image, and should be in 0-1 range. Different images can have different sizes.

    The behavior of the model changes depending if it is in training or evaluation mode.

    During training, the model expects both the input tensors, as well as a targets (list of dictionary),
    containing:

        - boxes (``FloatTensor[N, 4]``): the ground-truth boxes in ``[x1, y1, x2, y2]`` format, with
            ``0 <= x1 < x2 <= W`` and ``0 <= y1 < y2 <= H``.
        - labels (Int64Tensor[N]): the class label for each ground-truth box
        - keypoints (FloatTensor[N, K, 3]): the K keypoints location for each of the N instances, in the
          format [x, y, visibility], where visibility=0 means that the keypoint is not visible.

    The model returns a Dict[Tensor] during training, containing the classification and regression
    losses for both the RPN a

### The surprising impact of mask-head architecture on novel class segmentation (Birodkar V. et al., 2021)
[Paper](https://arxiv.org/abs/2104.00613v2)

*... we study Mask R-CNN and discover that instead of its default strategy of training the mask-head with a combination of proposals and groundtruth boxes, training the mask-head with only groundtruth boxes dramatically improves its performance on novel classes*

### Pointly-Supervised Instance Segmentation (Cheng B., Parkhi O., Kirillov A., 2021)
[Paper](https://arxiv.org/pdf/2104.06404v1.pdf)

*...existing instance segmentation models developed for full mask supervision, like Mask R-CNN, can be seamlessly trained with the point-based annotation without any major modifications*

#### References:

* https://www.youtube.com/watch?v=g7z4mkfRjI4
* https://www.youtube.com/watch?v=nDPWywWRIRo
* https://paperswithcode.com/paper/mask-r-cnn
* https://x-engineer.org/bilinear-interpolation/
* [FPN](https://arxiv.org/abs/1612.03144)
* [FCN](https://arxiv.org/abs/1411.4038)
* https://pytorch.org/vision/0.12/_modules/torchvision/models/detection/keypoint_rcnn.html