# Lab 06: Mask R-CNN

[Mask R-CNN](https://arxiv.org/abs/1703.06870) was originally released under the Caffe2 framework by the FAIR team as part of a project called "Detectron."

After Caffe2 and PyTorch merged, FAIR eventually released a ground-up reimplementation of Detectron with new models and
features. This framework is [Detectron2](https://github.com/facebookresearch/detectron2). If you're looking for a complete
approach to bounding box regression, mask regression, and skeleton point (pose) estimation, take a look.

In the meantime, however, other groups quickly implemented Mask R-CNN directly in TensorFlow and Keras.
Today we'll work with a [PyTorch implementation of Mask R-CNN](https://github.com/multimodallearning/pytorch-mask-rcnn) derived from those efforts.

In case the original code is inaccessible, we've archived the version of the project used to create this lab manual
[in the GitHub repository for this lab](https://github.com/dsai-asia/RTML/tree/main/Labs/06-Mask-R-CNN/torch_mask_rcnn_code).

The full Mask R-CNN structure with a ResNet-50 FPN backbone (later we will be using a ResNet-101 backbone) looks like this:

<img src="img/MaskRCNNArchitecture.png" title="FullMaskR-CNN" style="width: 860px;" />

For each instance of an object in an image, Mask R-CNN attempts to generate
 - A bounding box
 - Class scores
 - A segmentation mask

The backbone and neck of Mask R-CNN are based on
 - A feature pyramid network (FPN)
 - ResNet

## Feature Pyramid Networks

We've seen the idea of the feature pyramid network (FPN) in YOLOv3 and YOLOv4. It is a feature extractor using a pyramid concept.
We begin with ordinary progressive downsampling of the input to get a multiscale representation of the input, but rather
than using that "low-level" multiscale representation directly, we progressively upsample the coarse representation of
the input using input from the low-level feature maps. The idea is shown in the left-hand panel of the diagram above.
By incorporating information from both the low-level "bottom-up" fine-grained feature map and the upsampled coarser-grained feature
map in the pyramid, the fine-grained representation at the bottom of the pyramid contains much more useful, more "semantic"
information about the input.

Feature pyramids can, in principle, come in many different forms. Refer to the figure below, extracted
from [the original FPN paper](https://arxiv.org/abs/1612.03144) and discussed in a
[nice blog by Sik-Ho Tsang](https://towardsdatascience.com/review-fpn-feature-pyramid-network-object-detection-262fc7482610):

<img src="img/Pyramid.png" title="PyramidNets" style="width: 800px;" />

The classical image pyramid, used by many techniques aiming at multiscale detection, looks like (a).
Classifiers we built early on in the class, such as AlexNet, look like (b). (c) and (d) utilize multiple
feature maps derived from the input through progressive downscaling. The difference in the FPN (d) is
the inclusion of both bottom-up and top-down pathways:

<img src="img/bottomup.jpeg" title="Bottom-up" style="width: 500px;" />

We know that as we process the input in multiple progressively downsampled layers, we are increasingly analyzing higher-level features
with larger receptive fields with some invariance to imaging conditions (translation, scale, lighting, etc.).

The top-down representations use progressively *upsampled* layers in which we are increasingly analyzing the input at high resolution but
with all the benefits of the downsampled representation.

The main risk of the top-down upsampling is that we would lose information about the details of the original input in constructing the
fine-grained feature maps. For that reason, we add lateral connections to the bottom-up feature maps of the same size:

<img src="img/lateralconnection.png" title="lateralconnection" style="width: 500px;" />

## ResNet backbone

The bottom-up part of the FPN used in Mask R-CNN is ResNet. It is used in much the same way that
Darknet-53 is used in YOLO. We take the classifier structure as the bottom-up half (left side) of
the pyramid, then we add the top-down part (right side) to obtain the FPN.

Mask R-CNN taps into ResNet in 4 or 5 places, depending on the implementation, at the output of various residual blocks.

Here's a figure to explain this, this time from [Jonathan Hui's Medium site](https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c):

<img src="img/upanddown.png" title="upanddown" style="width: 500px;" />

(There would also be a P6 if we were extracting a 5-scale pyramid.)

## Region Proposal Network (RPN)

The Faster R-CNN RPN takes the output feature maps of the FPN as its input (in the implementation we study below, it is run on every pyramid level). It performs classification and bounding box regression for each possible proposal.

<img src="img/RPN.png" title="RPN" style="width: 600px;" />

## Detection network

The detection network uses the results of the RPN as well as the output of the FPN. With the RPN bounding box as input, we assign the box to one of the levels of the pyramid.
Specifically, we use

$$ k = \left\lfloor k_0 +\log_2\left( \frac{\sqrt{wh}}{224} \right) \right\rfloor $$

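Here $w$ and $h$ are the width and height of the ROI, and $k_0$ is the level to which an ROI of size 224$\times$224 is assigned ($k_0 = 4$ in the FPN paper). The ROIAlign block then interpolates the appropriate set of features from the best level (level $k$) of the pyramid. The region is aligned and scaled to a size of
56$\times$56, and the resulting representation is forwarded to the mask prediction head.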
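
As a quick check of the level assignment formula, here is a minimal sketch that evaluates it for a few square ROIs. The function name `fpn_level`, the default $k_0 = 4$, and the clamping range P2 to P5 are assumptions made for illustration. Note that the implementation examined later in this lab rounds instead of taking the floor, and works with normalized coordinates.

```python
import math

def fpn_level(w, h, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pyramid level for an ROI of width w and height h (in pixels)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    # Clamp to the levels that actually exist (assumed here: P2..P5).
    return max(k_min, min(k_max, k))

for size in (56, 112, 224, 448):
    print(size, "->", "P%d" % fpn_level(size, size))
```

Halving the ROI side length moves the ROI one level down the pyramid (towards higher resolution), and doubling it moves the ROI one level up.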

## Mask Head

The mask head is an FCN (fully convolutional network) that upsamples from the detection result; the resulting patch is finally rescaled back to the input size.

## Coding

So let's start investigating how Mask R-CNN works in detail.
First, clone the GitHub repository:

In [None]:
!git clone https://github.com/multimodallearning/pytorch-mask-rcnn.git

### MaskRCNN class (model.py)

First, let's visit the MaskRCNN class and its `__init__()` method:

In [None]:
    def __init__(self, config, model_dir):
        """
        config: A Sub-class of the Config class
        model_dir: Directory to save training logs and trained weights
        """
        super(MaskRCNN, self).__init__()
        self.config = config
        self.model_dir = model_dir
        self.set_log_dir()
        self.build(config=config)
        self.initialize_weights()
        self.loss_history = []
        self.val_loss_history = []

The most important action here is calling `build()`, which creates the network based on a
configuration object. `set_log_dir()` just sets up saving to a log file, and `initialize_weights()` sets the
initial values of the layer weights (pretrained weights are loaded separately). Let's take a look at the `build()` function:

In [None]:
def build(self, config):
    """Build Mask R-CNN architecture.
    """

    # Image size must be dividable by 2 multiple times
    h, w = config.IMAGE_SHAPE[:2]
    if h / 2**6 != int(h / 2**6) or w / 2**6 != int(w / 2**6):
        raise Exception("Image size must be dividable by 2 at least 6 times "
                        "to avoid fractions when downscaling and upscaling."
                        "For example, use 256, 320, 384, 448, 512, ... etc. ")

    # Build the shared convolutional layers.
    # Bottom-up Layers
    # Returns a list of the last layers of each stage, 5 in total.
    # Don't create the thead (stage 5), so we pick the 4th item in the list.
    resnet = ResNet("resnet101", stage5=True)
    C1, C2, C3, C4, C5 = resnet.stages()

    # Top-down Layers
    # TODO: add assert to varify feature map sizes match what's in config
    self.fpn = FPN(C1, C2, C3, C4, C5, out_channels=256)

    # Generate Anchors
    self.anchors = Variable(torch.from_numpy(utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES,
                                                                            config.RPN_ANCHOR_RATIOS,
                                                                            config.BACKBONE_SHAPES,
                                                                            config.BACKBONE_STRIDES,
                                                                            config.RPN_ANCHOR_STRIDE)).float(), requires_grad=False)
    if self.config.GPU_COUNT:
        self.anchors = self.anchors.cuda()

    # RPN
    self.rpn = RPN(len(config.RPN_ANCHOR_RATIOS), config.RPN_ANCHOR_STRIDE, 256)

    # FPN Classifier
    self.classifier = Classifier(256, config.POOL_SIZE, config.IMAGE_SHAPE, config.NUM_CLASSES)

    # FPN Mask
    self.mask = Mask(256, config.MASK_POOL_SIZE, config.IMAGE_SHAPE, config.NUM_CLASSES)

    # Fix batch norm layers
    def set_bn_fix(m):
        classname = m.__class__.__name__
        if classname.find('BatchNorm') != -1:
            for p in m.parameters(): p.requires_grad = False

    self.apply(set_bn_fix)
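
The divisibility check at the top of `build()` is easier to appreciate with numbers. The sketch below computes the feature map size at each pyramid level. The strides `(4, 8, 16, 32, 64)` are an assumption about `config.BACKBONE_STRIDES` (check `config.py` for the actual values), and `pyramid_shapes` is a hypothetical helper, not part of the repository.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32, 64)):
    """Return the (H, W) of each pyramid level for a given input size."""
    # Same condition as in build(): both sides must be divisible by 2**6.
    if height % 64 or width % 64:
        raise ValueError("height and width must both be multiples of 64")
    return [(height // s, width // s) for s in strides]

print(pyramid_shapes(1024, 1024))
print(pyramid_shapes(512, 384))
```

If a side were not a multiple of 64, the coarsest level would have a fractional size, and the upsampled maps in the top-down pathway would no longer line up with the lateral connections.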

### Backbone

The backbone is initialized then tapped into in the lines

    resnet = ResNet("resnet101", stage5=True)
    C1, C2, C3, C4, C5 = resnet.stages()

We see that this version uses ResNet-101 and extracts a 5-stage pyramid. How do we find out which stages are being used? Take
a look at the `__init__()` and `forward()` methods of the ResNet class itself. We see that C1-C5 are the major
blocks of the network. The first (C1) is a single 7$\times$7 convolution with batch norm, ReLU, and a max pooling operation. The others
(C2-C5) are stacks of ResNet residual blocks.

In [None]:
class ResNet(nn.Module):

    def __init__(self, architecture, stage5=False):
        super(ResNet, self).__init__()
        assert architecture in ["resnet50", "resnet101"]
        self.inplanes = 64
        self.layers = [3, 4, {"resnet50": 6, "resnet101": 23}[architecture], 3]
        self.block = Bottleneck
        self.stage5 = stage5

        self.C1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64, eps=0.001, momentum=0.01),
            nn.ReLU(inplace=True),
            SamePad2d(kernel_size=3, stride=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.C2 = self.make_layer(self.block, 64, self.layers[0])
        self.C3 = self.make_layer(self.block, 128, self.layers[1], stride=2)
        self.C4 = self.make_layer(self.block, 256, self.layers[2], stride=2)
        if self.stage5:
            self.C5 = self.make_layer(self.block, 512, self.layers[3], stride=2)
        else:
            self.C5 = None

    def forward(self, x):
        x = self.C1(x)
        x = self.C2(x)
        x = self.C3(x)
        x = self.C4(x)
        x = self.C5(x)
        return x

    def stages(self):
        return [self.C1, self.C2, self.C3, self.C4, self.C5]

    def make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion,
                          kernel_size=1, stride=stride),
                nn.BatchNorm2d(planes * block.expansion, eps=0.001, momentum=0.01),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)
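
To connect the backbone to the FPN, it helps to tabulate what each stage produces. The sketch below derives the block count, output channels, and cumulative stride of each stage from the constructor arguments above. It assumes `Bottleneck.expansion` is 4, as in the standard ResNet bottleneck block; `backbone_summary` is a hypothetical helper written for this lab.

```python
EXPANSION = 4  # assumed value of Bottleneck.expansion

def backbone_summary(architecture="resnet101"):
    """List (stage, number of blocks, output channels, cumulative stride)."""
    blocks = [3, 4, {"resnet50": 6, "resnet101": 23}[architecture], 3]
    planes = [64, 128, 256, 512]
    # C1 downsamples by 4 (stride-2 conv + stride-2 pool); C2 keeps stride 1,
    # and C3..C5 each halve the resolution again.
    strides = [4, 8, 16, 32]
    summary = [("C1", 1, 64, 4)]
    for i, (n, p, s) in enumerate(zip(blocks, planes, strides), start=2):
        summary.append(("C%d" % i, n, p * EXPANSION, s))
    return summary

for row in backbone_summary():
    print(row)
```

Under that assumption, C2 to C5 output 256, 512, 1024, and 2048 channels, which are exactly the input channel counts of the 1$\times$1 lateral convolutions in the FPN class.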

### FPN

Let's go back to the MaskRCNN class. The next stage is the top-down layers of the FPN:

    self.fpn = FPN(C1, C2, C3, C4, C5, out_channels=256)

The main idea of the FPN is in its `forward()` method. Let's take a look.

In [None]:
class FPN(nn.Module):
    def __init__(self, C1, C2, C3, C4, C5, out_channels):
        super(FPN, self).__init__()
        self.out_channels = out_channels
        self.C1 = C1
        self.C2 = C2
        self.C3 = C3
        self.C4 = C4
        self.C5 = C5
        self.P6 = nn.MaxPool2d(kernel_size=1, stride=2)
        self.P5_conv1 = nn.Conv2d(2048, self.out_channels, kernel_size=1, stride=1)
        self.P5_conv2 = nn.Sequential(
            SamePad2d(kernel_size=3, stride=1),
            nn.Conv2d(self.out_channels, self.out_channels, kernel_size=3, stride=1),
        )
        self.P4_conv1 =  nn.Conv2d(1024, self.out_channels, kernel_size=1, stride=1)
        self.P4_conv2 = nn.Sequential(
            SamePad2d(kernel_size=3, stride=1),
            nn.Conv2d(self.out_channels, self.out_channels, kernel_size=3, stride=1),
        )
        self.P3_conv1 = nn.Conv2d(512, self.out_channels, kernel_size=1, stride=1)
        self.P3_conv2 = nn.Sequential(
            SamePad2d(kernel_size=3, stride=1),
            nn.Conv2d(self.out_channels, self.out_channels, kernel_size=3, stride=1),
        )
        self.P2_conv1 = nn.Conv2d(256, self.out_channels, kernel_size=1, stride=1)
        self.P2_conv2 = nn.Sequential(
            SamePad2d(kernel_size=3, stride=1),
            nn.Conv2d(self.out_channels, self.out_channels, kernel_size=3, stride=1),
        )

    def forward(self, x):
        x = self.C1(x)
        x = self.C2(x)
        c2_out = x          # keep C2 output
        x = self.C3(x)
        c3_out = x          # keep C3 output
        x = self.C4(x)
        c4_out = x          # keep C4 output
        x = self.C5(x)
        p5_out = self.P5_conv1(x)       # top-most of pyramid
        p4_out = self.P4_conv1(c4_out) + F.upsample(p5_out, scale_factor=2)         # lateral connections, 2nd top output
        p3_out = self.P3_conv1(c3_out) + F.upsample(p4_out, scale_factor=2)         # lateral connections, 3rd top output
        p2_out = self.P2_conv1(c2_out) + F.upsample(p3_out, scale_factor=2)         # lateral connections, 4th top output

        p5_out = self.P5_conv2(p5_out)
        p4_out = self.P4_conv2(p4_out)
        p3_out = self.P3_conv2(p3_out)
        p2_out = self.P2_conv2(p2_out)

        # P6 is used for the 5th anchor scale in RPN. Generated by
        # subsampling from P5 with stride of 2.
        p6_out = self.P6(p5_out)        # max pooling for RPN

        return [p2_out, p3_out, p4_out, p5_out, p6_out]
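
The top-down merge in `forward()` can be traced by hand on a toy example. The sketch below uses 1-D Python lists as stand-ins for feature maps that have already passed through the 1$\times$1 lateral convolutions. It assumes nearest-neighbour upsampling, and the helper names are hypothetical.

```python
def upsample_nearest(row, factor=2):
    """Nearest-neighbour upsampling of a 1-D list."""
    return [v for v in row for _ in range(factor)]

def top_down_merge(lateral, coarser):
    """Add a lateral feature row to the upsampled coarser row."""
    up = upsample_nearest(coarser)
    assert len(up) == len(lateral), "levels must differ by exactly a factor of 2"
    return [a + b for a, b in zip(lateral, up)]

# Toy 1-D "feature maps" standing in for the lateral outputs of C5, C4, C3.
c5 = [10.0, 20.0]
c4 = [1.0, 2.0, 3.0, 4.0]
c3 = [0.1] * 8

p5 = c5
p4 = top_down_merge(c4, p5)   # [11.0, 12.0, 23.0, 24.0]
p3 = top_down_merge(c3, p4)
p6 = p5[::2]                  # kernel-1, stride-2 max pooling just subsamples
print(p4, p3, p6)
```

Each finer level therefore carries its own lateral features plus everything accumulated from the coarser levels above it.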

### Anchors

Anchor box sizes are set in the configuration file.
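
To get a feel for what those settings imply, here is a minimal sketch of how anchor shapes and the total anchor count follow from the scales, ratios, and strides. The values of `SCALES`, `RATIOS`, and `STRIDES` are assumptions standing in for `RPN_ANCHOR_SCALES`, `RPN_ANCHOR_RATIOS`, and `BACKBONE_STRIDES` (check `config.py` for the real ones), and the helpers are hypothetical rather than the repository's `generate_pyramid_anchors()`.

```python
import math

SCALES = (32, 64, 128, 256, 512)   # assumed: one scale per pyramid level P2..P6
RATIOS = (0.5, 1, 2)               # assumed: width/height ratios
STRIDES = (4, 8, 16, 32, 64)       # assumed: feature-map strides

def anchor_shapes(scale, ratios=RATIOS):
    """(height, width) of each anchor at one location; area stays scale**2."""
    return [(scale / math.sqrt(r), scale * math.sqrt(r)) for r in ratios]

def total_anchors(image_size, anchor_stride=1):
    """Total anchor count over all levels for a square image."""
    total = 0
    for s in STRIDES:
        side = image_size // s
        locations = len(range(0, side, anchor_stride)) ** 2
        total += locations * len(RATIOS)
    return total

print(anchor_shapes(32))
print(total_anchors(1024))   # 261888
```

With these assumed settings, a 1024$\times$1024 image yields over a quarter of a million anchors, which is why the proposal layer has to prune them aggressively.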

### RPN

Next, let's see how the RPN is produced.

    self.rpn = RPN(len(config.RPN_ANCHOR_RATIOS), config.RPN_ANCHOR_STRIDE, 256)

The RPN class is reproduced below. Note its three outputs: class logits, softmaxed class scores, and bounding box deltas.

In [None]:
class RPN(nn.Module):
    """Builds the model of Region Proposal Network.

    anchors_per_location: number of anchors per pixel in the feature map
    anchor_stride: Controls the density of anchors. Typically 1 (anchors for
                   every pixel in the feature map), or 2 (every other pixel).

    Returns:
        rpn_logits: [batch, H, W, 2] Anchor classifier logits (before softmax)
        rpn_probs: [batch, H, W, 2] Anchor classifier probabilities.
        rpn_bbox: [batch, H, W, (dy, dx, log(dh), log(dw))] Deltas to be
                  applied to anchors.
    """

    def __init__(self, anchors_per_location, anchor_stride, depth):
        super(RPN, self).__init__()
        self.anchors_per_location = anchors_per_location
        self.anchor_stride = anchor_stride
        self.depth = depth

        self.padding = SamePad2d(kernel_size=3, stride=self.anchor_stride)
        self.conv_shared = nn.Conv2d(self.depth, 512, kernel_size=3, stride=self.anchor_stride)
        self.relu = nn.ReLU(inplace=True)
        self.conv_class = nn.Conv2d(512, 2 * anchors_per_location, kernel_size=1, stride=1)     # class, score
        self.softmax = nn.Softmax(dim=2)
        self.conv_bbox = nn.Conv2d(512, 4 * anchors_per_location, kernel_size=1, stride=1)      # x,y,w,h

    def forward(self, x):
        # Shared convolutional base of the RPN
        x = self.relu(self.conv_shared(self.padding(x)))

        # Anchor Score. [batch, anchors per location * 2, height, width].
        rpn_class_logits = self.conv_class(x)

        # Reshape to [batch, anchors, 2]
        rpn_class_logits = rpn_class_logits.permute(0,2,3,1)
        rpn_class_logits = rpn_class_logits.contiguous()
        rpn_class_logits = rpn_class_logits.view(x.size()[0], -1, 2)

        # Softmax on last dimension of BG/FG.
        rpn_probs = self.softmax(rpn_class_logits)              # output class

        # Bounding box refinement. [batch, H, W, anchors per location, depth]
        # where depth is [x, y, log(w), log(h)]
        rpn_bbox = self.conv_bbox(x)

        # Reshape to [batch, anchors, 4]
        rpn_bbox = rpn_bbox.permute(0,2,3,1)
        rpn_bbox = rpn_bbox.contiguous()
        rpn_bbox = rpn_bbox.view(x.size()[0], -1, 4)

        return [rpn_class_logits, rpn_probs, rpn_bbox]
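
The `rpn_bbox` output holds deltas, not boxes. As a minimal sketch of how deltas in the `(dy, dx, log(dh), log(dw))` format from the docstring turn an anchor into a proposal, consider the function below. It is a hypothetical helper written for illustration; the repository's proposal layer may additionally rescale the deltas and clip boxes to the image, and those steps are omitted here.

```python
import math

def apply_box_deltas(box, deltas):
    """Apply (dy, dx, log(dh), log(dw)) deltas to a (y1, x1, y2, x2) box."""
    y1, x1, y2, x2 = box
    dy, dx, dlog_h, dlog_w = deltas
    h, w = y2 - y1, x2 - x1
    cy, cx = y1 + 0.5 * h, x1 + 0.5 * w
    # Shift the center by a fraction of the anchor size.
    cy += dy * h
    cx += dx * w
    # Scale height and width multiplicatively.
    h *= math.exp(dlog_h)
    w *= math.exp(dlog_w)
    return (cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w)

anchor = (10.0, 10.0, 30.0, 50.0)
print(apply_box_deltas(anchor, (0.0, 0.0, 0.0, 0.0)))
print(apply_box_deltas(anchor, (0.5, 0.0, math.log(2.0), 0.0)))
```

Zero deltas leave the anchor unchanged, while the second call moves the center down by half the anchor height and doubles the height.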

### Proposal classifier

The Faster R-CNN head contains the region proposal classifier.

    self.classifier = Classifier(256, config.POOL_SIZE, config.IMAGE_SHAPE, config.NUM_CLASSES)

The classifier is mainly composed of convolutional layers. The most
interesting process is the pyramidal ROI alignment at multiple scales.
In the Classifier class, the ROIs are assigned pyramid levels, then
ROIAlign is performed. Note that the `pyramid_roi_align()` method uses a
"crop and resize" method that is not quite identical to ROIAlign but does
do bilinear interpolation of the feature map.

In [None]:
def pyramid_roi_align(inputs, pool_size, image_shape):
    """Implements ROI Pooling on multiple levels of the feature pyramid.

    Params:
    - pool_size: [height, width] of the output pooled regions. Usually [7, 7]
    - image_shape: [height, width, channels]. Shape of input image in pixels

    Inputs:
    - boxes: [batch, num_boxes, (y1, x1, y2, x2)] in normalized
             coordinates.
    - Feature maps: List of feature maps from different levels of the pyramid.
                    Each is [batch, channels, height, width]

    Output:
    Pooled regions in the shape: [num_boxes, height, width, channels].
    The width and height are those specific in the pool_shape in the layer
    constructor.
    """

    # Currently only supports batchsize 1
    for i in range(len(inputs)):
        inputs[i] = inputs[i].squeeze(0)

    # Crop boxes [batch, num_boxes, (y1, x1, y2, x2)] in normalized coords
    boxes = inputs[0]

    # Feature Maps. List of feature maps from different level of the
    # feature pyramid. Each is [batch, height, width, channels]
    feature_maps = inputs[1:]

    # Assign each ROI to a level in the pyramid based on the ROI area.
    y1, x1, y2, x2 = boxes.chunk(4, dim=1)
    h = y2 - y1
    w = x2 - x1

    # Equation 1 in the Feature Pyramid Networks paper. Account for
    # the fact that our coordinates are normalized here.
    # e.g. a 224x224 ROI (in pixels) maps to P4
    image_area = Variable(torch.FloatTensor([float(image_shape[0]*image_shape[1])]), requires_grad=False)
    if boxes.is_cuda:
        image_area = image_area.cuda()
    roi_level = 4 + log2(torch.sqrt(h*w)/(224.0/torch.sqrt(image_area)))
    roi_level = roi_level.round().int()
    roi_level = roi_level.clamp(2,5)


    # Loop through levels and apply ROI pooling to each. P2 to P5.
    pooled = []
    box_to_level = []
    for i, level in enumerate(range(2, 6)):
        ix  = roi_level==level
        if not ix.any():
            continue
        ix = torch.nonzero(ix)[:,0]
        level_boxes = boxes[ix.data, :]

        # Keep track of which box is mapped to which level
        box_to_level.append(ix.data)

        # Stop gradient propogation to ROI proposals
        level_boxes = level_boxes.detach()

        # Crop and Resize
        # From Mask R-CNN paper: "We sample four regular locations, so
        # that we can evaluate either max or average pooling. In fact,
        # interpolating only a single value at each bin center (without
        # pooling) is nearly as effective."
        #
        # Here we use the simplified approach of a single value per bin,
        # which is how it's done in tf.crop_and_resize()
        # Result: [batch * num_boxes, pool_height, pool_width, channels]
        ind = Variable(torch.zeros(level_boxes.size()[0]),requires_grad=False).int()
        if level_boxes.is_cuda:
            ind = ind.cuda()
        feature_maps[i] = feature_maps[i].unsqueeze(0)  #CropAndResizeFunction needs batch dimension
        pooled_features = CropAndResizeFunction(pool_size, pool_size, 0)(feature_maps[i], level_boxes, ind)
        pooled.append(pooled_features)

    # Pack pooled features into one tensor
    pooled = torch.cat(pooled, dim=0)

    # Pack box_to_level mapping into one array and add another
    # column representing the order of pooled boxes
    box_to_level = torch.cat(box_to_level, dim=0)

    # Rearrange pooled features to match the order of the original boxes
    _, box_to_level = torch.sort(box_to_level)
    pooled = pooled[box_to_level, :, :]

    return pooled
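
The "single value per bin" idea in the comments above can be sketched in a few lines of plain Python. The code below bilinearly samples a 2-D feature map on a regular grid inside a normalized box. It assumes a corner-aligned sampling grid and a box that lies entirely inside the feature map; both are simplifications for illustration, and `bilinear_sample` and `crop_and_resize` are hypothetical stand-ins, not the repository's `CropAndResizeFunction`.

```python
import math

def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2-D list `fmap` at fractional position (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = (1 - fx) * fmap[y0][x0] + fx * fmap[y0][x1]
    bottom = (1 - fx) * fmap[y1][x0] + fx * fmap[y1][x1]
    return (1 - fy) * top + fy * bottom

def crop_and_resize(fmap, box, pool_size):
    """Sample a pool_size x pool_size grid inside a normalized (y1, x1, y2, x2) box."""
    h, w = len(fmap), len(fmap[0])
    y1, x1, y2, x2 = box
    out = []
    for i in range(pool_size):
        row = []
        for j in range(pool_size):
            # Fractional position of this sample inside the box (0..1).
            ty = i / (pool_size - 1) if pool_size > 1 else 0.5
            tx = j / (pool_size - 1) if pool_size > 1 else 0.5
            y = (y1 + ty * (y2 - y1)) * (h - 1)
            x = (x1 + tx * (x2 - x1)) * (w - 1)
            row.append(bilinear_sample(fmap, y, x))
        out.append(row)
    return out

fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
print(crop_and_resize(fmap, (0.0, 0.0, 1.0, 1.0), 2))
print(bilinear_sample(fmap, 0.5, 0.5))
```

Because every output value is interpolated at a fractional position, no box coordinate is ever rounded to the feature map grid, which is the key difference from the original ROI pooling.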

### Mask Head

The last step creates the mask head. We have convolutions and upsampling to the original image size.
The difference between the classifier head and the mask head is that the classifier head takes the proposal bounding boxes
and outputs refined final bounding boxes, whereas the mask head uses those final refined bounding boxes to create masks.

The mask creation step in the MaskRCNN class looks like this:

    self.mask = Mask(256, config.MASK_POOL_SIZE, config.IMAGE_SHAPE, config.NUM_CLASSES)

### Prediction function

Now let's look at the overall
`predict()` method. It has two modes: inference (evaluation/validation/test) mode and training mode.
Both modes share the same feature extraction and proposal steps, but during training we additionally generate detection targets from the ground truth so that the predicted outputs can be compared with it.

In [None]:
    def predict(self, input, mode):
        molded_images = input[0]
        image_metas = input[1]

        if mode == 'inference':
            self.eval()
        elif mode == 'training':
            self.train()

        # Set batchnorm always in eval mode during training
        def set_bn_eval(m):
            classname = m.__class__.__name__
            if classname.find('BatchNorm') != -1:
                m.eval()
                
        self.apply(set_bn_eval)

        # Feature extraction
        [p2_out, p3_out, p4_out, p5_out, p6_out] = self.fpn(molded_images)

        # Note that P6 is used in RPN, but not in the classifier heads.
        rpn_feature_maps = [p2_out, p3_out, p4_out, p5_out, p6_out]
        mrcnn_feature_maps = [p2_out, p3_out, p4_out, p5_out]

        # Loop through pyramid layers
        layer_outputs = []  # list of lists
        for p in rpn_feature_maps:
            layer_outputs.append(self.rpn(p))

        # Concatenate layer outputs
        # Convert from list of lists of level outputs to list of lists
        # of outputs across levels.
        # e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]]
        outputs = list(zip(*layer_outputs))
        outputs = [torch.cat(list(o), dim=1) for o in outputs]
        rpn_class_logits, rpn_class, rpn_bbox = outputs

        # Generate proposals
        # Proposals are [batch, N, (y1, x1, y2, x2)] in normalized coordinates
        # and zero padded.
        proposal_count = self.config.POST_NMS_ROIS_TRAINING if mode == "training" \
            else self.config.POST_NMS_ROIS_INFERENCE
        rpn_rois = proposal_layer([rpn_class, rpn_bbox],
                                  proposal_count=proposal_count,
                                  nms_threshold=self.config.RPN_NMS_THRESHOLD,
                                  anchors=self.anchors,
                                  config=self.config)

        if mode == 'inference':
            # Network Heads
            # Proposal classifier and BBox regressor heads
            mrcnn_class_logits, mrcnn_class, mrcnn_bbox = self.classifier(mrcnn_feature_maps, rpn_rois)

            # Detections
            # output is [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in image coordinates
            detections = detection_layer(self.config, rpn_rois, mrcnn_class, mrcnn_bbox, image_metas)

            # Convert boxes to normalized coordinates
            # TODO: let DetectionLayer return normalized coordinates to avoid
            #       unnecessary conversions
            h, w = self.config.IMAGE_SHAPE[:2]
            scale = Variable(torch.from_numpy(np.array([h, w, h, w])).float(), requires_grad=False)
            if self.config.GPU_COUNT:
                scale = scale.cuda()
            detection_boxes = detections[:, :4] / scale

            # Add back batch dimension
            detection_boxes = detection_boxes.unsqueeze(0)

            # Create masks for detections
            mrcnn_mask = self.mask(mrcnn_feature_maps, detection_boxes)

            # Add back batch dimension
            detections = detections.unsqueeze(0)
            mrcnn_mask = mrcnn_mask.unsqueeze(0)

            return [detections, mrcnn_mask]

        elif mode == 'training':

            gt_class_ids = input[2]
            gt_boxes = input[3]
            gt_masks = input[4]

            # Normalize coordinates
            h, w = self.config.IMAGE_SHAPE[:2]
            scale = Variable(torch.from_numpy(np.array([h, w, h, w])).float(), requires_grad=False)
            if self.config.GPU_COUNT:
                scale = scale.cuda()
            gt_boxes = gt_boxes / scale

            # Generate detection targets
            # Subsamples proposals and generates target outputs for training
            # Note that proposal class IDs, gt_boxes, and gt_masks are zero
            # padded. Equally, returned rois and targets are zero padded.
            rois, target_class_ids, target_deltas, target_mask = \
                detection_target_layer(rpn_rois, gt_class_ids, gt_boxes, gt_masks, self.config)

            if not rois.size():
                mrcnn_class_logits = Variable(torch.FloatTensor())
                mrcnn_class = Variable(torch.IntTensor())
                mrcnn_bbox = Variable(torch.FloatTensor())
                mrcnn_mask = Variable(torch.FloatTensor())
                if self.config.GPU_COUNT:
                    mrcnn_class_logits = mrcnn_class_logits.cuda()
                    mrcnn_class = mrcnn_class.cuda()
                    mrcnn_bbox = mrcnn_bbox.cuda()
                    mrcnn_mask = mrcnn_mask.cuda()
            else:
                # Network Heads
                # Proposal classifier and BBox regressor heads
                mrcnn_class_logits, mrcnn_class, mrcnn_bbox = self.classifier(mrcnn_feature_maps, rois)

                # Create masks for detections
                mrcnn_mask = self.mask(mrcnn_feature_maps, rois)

            return [rpn_class_logits, rpn_bbox, target_class_ids, mrcnn_class_logits, target_deltas, mrcnn_bbox, target_mask, mrcnn_mask]
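
The call to `proposal_layer()` above takes an `nms_threshold` and a `proposal_count`. As a rough sketch of what non-maximum suppression does with those two numbers, here is a greedy pure-Python version. The repository uses a compiled CUDA kernel instead, so the function names and the exact tie-breaking here are assumptions for illustration only.

```python
def iou(a, b):
    """Intersection over union of two (y1, x1, y2, x2) boxes."""
    ih = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iw = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ih * iw
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold, max_keep):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
            if len(keep) == max_keep:
                break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, threshold=0.5, max_keep=10))
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed at a threshold of 0.5, while the third box does not overlap anything and survives.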


### Configuration

Check `config.py` to see the available
configuration options, such as image size and number of classes.
A sample configuration and dataset setup can be seen in `coco.py`.
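
Configurations of this kind are usually customized by subclassing and overriding class attributes. The sketch below shows the pattern only. The `Config` class here is a hypothetical stand-in so that the example runs on its own, and `ShapesConfig` and its values are made up for illustration. The attribute names are the ones used by the model code above, but the real base class may derive some of them from other settings, so check `config.py` before relying on them.

```python
class Config:
    """Hypothetical stand-in for the repository's base Config class."""
    NAME = None
    GPU_COUNT = 1
    NUM_CLASSES = 1                 # background only
    IMAGE_SHAPE = (1024, 1024, 3)   # placeholder value for this sketch

class ShapesConfig(Config):
    """Hypothetical configuration for a small 3-class dataset."""
    NAME = "shapes"
    GPU_COUNT = 0                   # falsy, so the model code skips .cuda()
    NUM_CLASSES = 1 + 3             # background + 3 object classes
    IMAGE_SHAPE = (256, 256, 3)     # both sides divisible by 64

config = ShapesConfig()
print(config.NAME, config.NUM_CLASSES, config.IMAGE_SHAPE)
```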


## Mask R-CNN in TorchVision

OK, let's get it all working.

You can use the Mask R-CNN implementation from [the multimodallearning GitHub repository](https://github.com/multimodallearning/pytorch-mask-rcnn)
or [the PyTorch torchvision R-CNN implementation](https://pytorch.org/vision/stable/models.html#mask-r-cnn).

This is a quickstart on the torchvision version of Mask R-CNN.

For help with fine tuning, see [the PyTorch instance segmentation fine tuning tutorial](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html).

1. Clone this repository.

    git clone https://github.com/multimodallearning/pytorch-mask-rcnn.git
    
2. We use functions from two more repositories that need to be built with the right `--arch` option for CUDA support. The two functions are Non-Maximum Suppression from ruotianluo's pytorch-faster-rcnn repository and longcw's RoiAlign.

| GPU | arch |
|-----|------|
| TitanX | sm_52 |
| GTX 960M | sm_50 |
| GTX 1070 | sm_61 |
| GTX 1080 (Ti) | sm_61 |

    cd nms/src/cuda/
    nvcc -c -o nms_kernel.cu.o nms_kernel.cu -x cu -Xcompiler -fPIC -arch=[arch]
    cd ../../
    python build.py
    cd ../

    cd roialign/roi_align/src/cuda/
    nvcc -c -o crop_and_resize_kernel.cu.o crop_and_resize_kernel.cu -x cu -Xcompiler -fPIC -arch=[arch]
    cd ../../
    python build.py
    cd ../../

3. As we use the [COCO dataset](https://cocodataset.org/#home), install the [Python COCO API](https://github.com/cocodataset/cocoapi) and create a symlink.

    ln -s /path/to/coco/cocoapi/PythonAPI/pycocotools/ pycocotools
    
Download the pretrained models on COCO and ImageNet from https://drive.google.com/drive/folders/1LXUgC2IZUYNEoXr05tdqyKFZY0pZyPDc

## Demo

To test your installation, simply run the demo with

    python demo.py

If this doesn't work with your version of torchvision, you may need to reinstall torchvision.

## Training on COCO

Training and evaluation code is in `coco.py`. You can run it from the command line as follows:

    # Train a new model starting from pre-trained COCO weights
    python coco.py train --dataset=/path/to/coco/ --model=coco

    # Train a new model starting from ImageNet weights
    python coco.py train --dataset=/path/to/coco/ --model=imagenet

    # Continue training a model that you had trained earlier
    python coco.py train --dataset=/path/to/coco/ --model=/path/to/weights.h5

    # Continue training the last model you trained. This will find
    # the last trained weights in the model directory.
    python coco.py train --dataset=/path/to/coco/ --model=last

If you have not yet downloaded the COCO dataset you should run the command with the download option set, e.g.:

    # Train a new model starting from pre-trained COCO weights
    python coco.py train --dataset=/path/to/coco/ --model=coco --download=true

You can also run the COCO evaluation code with:

    # Run COCO evaluation on the last trained model
    python coco.py evaluate --dataset=/path/to/coco/ --model=last

## Running a pre-trained Mask R-CNN model on test images

First, let's copy some utility code from the torchvision library, load a pre-trained Mask R-CNN model,
and create a dataloader for the COCO validation images.

In [None]:
!cp /opt/pytorch/vision/references/detection/utils.py /home/jovyan/work/RTML/Mask\ R-CNN/
!cp /opt/pytorch/vision/references/detection/coco_utils.py /home/jovyan/work/RTML/Mask\ R-CNN/
!cp /opt/pytorch/vision/references/detection/transforms.py /home/jovyan/work/RTML/Mask\ R-CNN/
!cp /opt/pytorch/vision/references/detection/engine.py /home/jovyan/work/RTML/Mask\ R-CNN/
!cp /opt/pytorch/vision/references/detection/coco_eval.py /home/jovyan/work/RTML/Mask\ R-CNN/

In [None]:
import torch
import torchvision
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
from torchvision.datasets import CocoDetection

import utils
from coco_utils import get_coco
import transforms

# Load a model pre-trained on COCO and put it in inference mode

print('Loading pretrained model...')
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).cuda()
model.eval()

# Load the COCO 2017 train and val sets. We use the CocoDetection class definition
# from ./coco_utils.py, not the original torchvision.CocoDetection class. Also, we
# use transforms from ./transforms, not torchvision.transforms, because they need
# to transform the bboxes and masks along with the image.

coco_path = "/home/jovyan/work/COCO"

transform = transforms.Compose([
    transforms.ToTensor()
])

print('Loading COCO train, val datasets...')
coco_train_dataset = get_coco(coco_path, 'train', transform)
coco_val_dataset = get_coco(coco_path, 'val', transform)

def collate_fn(batch):
    return tuple(zip(*batch))

val_dataloader = torch.utils.data.DataLoader(coco_val_dataset, batch_size=8, shuffle=False, num_workers=4, collate_fn=collate_fn)

In [None]:
images, targets = next(iter(val_dataloader))
images = [ img.cuda() for img in images ]
with torch.no_grad():  # inference only: skip autograd bookkeeping
    predictions = model(images)

print('Prediction keys:', list(dict(predictions[0])))
print('Boxes shape:', predictions[0]['boxes'].shape)
print('Labels shape:', predictions[0]['labels'].shape)
print('Scores shape:', predictions[0]['scores'].shape)
print('Masks shape:', predictions[0]['masks'].shape)

The `predictions` list has one entry for each element of the batch. Each entry has the following keys:
1. `boxes`: A tensor containing $[x_1,y_1,x_2,y_2]$ coordinates for the top-scoring bounding boxes (at most 100 per image with the default settings).
2. `labels`: A tensor containing the integer label IDs corresponding to those bounding boxes.
3. `scores`: A tensor containing the scores of those bounding boxes, sorted from highest score to lowest.
4. `masks`: The mask corresponding to the most likely class for each of those bounding boxes. Each mask has the same height and width as the input image and holds values between 0 and 1.
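
As a short sketch of how these entries are typically consumed, the code below keeps the detections above a score threshold and binarizes their soft masks. NumPy arrays with made-up values stand in for the tensors returned by the model, and the function name `filter_prediction` is ours. The mask shape `(N, 1, H, W)` is an assumption based on the `squeeze(1)` call used later in this lab.

```python
import numpy as np

# Made-up prediction for one image: 3 detections on a 4x6 image
prediction = {
    "boxes": np.array([[0, 0, 3, 2], [1, 1, 5, 3], [2, 0, 4, 3]], dtype=np.float32),
    "labels": np.array([1, 18, 3]),
    "scores": np.array([0.98, 0.61, 0.07], dtype=np.float32),  # sorted high to low
    "masks": np.random.default_rng(0).random((3, 1, 4, 6)).astype(np.float32),
}

def filter_prediction(pred, score_threshold=0.5, mask_threshold=0.5):
    # Boolean index of the detections we want to keep
    keep = pred["scores"] >= score_threshold
    return {
        "boxes": pred["boxes"][keep],
        "labels": pred["labels"][keep],
        "scores": pred["scores"][keep],
        # Drop the channel axis and turn probabilities into binary masks
        "masks": pred["masks"][keep][:, 0] > mask_threshold,
    }

kept = filter_prediction(prediction)
print(kept["labels"].tolist())  # [1, 18]
print(kept["masks"].shape)      # (2, 4, 6)
```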

With that information, let's write some code to visualize a result. The `draw_segmentation_map()` function is
adapted from [Debugger Cafe's tutorial on Mask R-CNN](https://debuggercafe.com/instance-segmentation-with-pytorch-and-mask-r-cnn).

In [None]:
import numpy as np
import cv2
import random

# Array of labels for COCO dataset (91 elements)

coco_names = [
    '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
    'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
    'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
    'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
    'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
    'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
    'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
    'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
    'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
    'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Random colors to use for labeling objects

COLORS = np.random.uniform(0, 255, size=(len(coco_names), 3)).astype(np.uint8)

# Overlay masks, bounding boxes, and labels on input numpy image

def draw_segmentation_map(image, masks, boxes, labels):
    alpha = 1
    beta = 0.5 # transparency for the segmentation map
    gamma = 0 # scalar added to each sum
    # convert from RGB to OpenCV BGR format
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    for i in range(len(masks)):
        mask = masks[i,:,:]
        red_map = np.zeros_like(mask).astype(np.uint8)
        green_map = np.zeros_like(mask).astype(np.uint8)
        blue_map = np.zeros_like(mask).astype(np.uint8)
        # apply a random color mask to each object
        color = COLORS[random.randrange(0, len(COLORS))]
        red_map[mask > 0.5] = color[0]
        green_map[mask > 0.5] = color[1]
        blue_map[mask > 0.5] = color[2]
        # combine all the masks into a single image
        segmentation_map = np.stack([red_map, green_map, blue_map], axis=2)
        # apply colored mask to the image
        image = cv2.addWeighted(image, alpha, segmentation_map, beta, gamma)
        # draw the bounding box around each object
        p1 = (int(boxes[i][0]), int(boxes[i][1]))
        p2 = (int(boxes[i][2]), int(boxes[i][3]))
        color = (int(color[0]), int(color[1]), int(color[2]))
        cv2.rectangle(image, p1, p2, color, 2)
        # put the label text above the objects
        p = (int(boxes[i][0]), int(boxes[i][1]-10))
        cv2.putText(image, labels[i], p, cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2, cv2.LINE_AA)
    
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Overlay masks, bounding boxes, and labels of objects with scores greater than
# threshold on one of the images in the input tensor using the predictions output by Mask R-CNN.

def prediction_to_mask_image(images, predictions, img_index, threshold):
    scores = predictions[img_index]['scores']
    boxes_to_use = scores >= threshold
    img = (images[img_index].cpu().permute(1, 2, 0).numpy() * 255).astype(np.uint8)
    masks = predictions[img_index]['masks'][boxes_to_use, :, :].cpu().detach().squeeze(1).numpy()
    boxes = predictions[img_index]['boxes'][boxes_to_use, :].cpu().detach().numpy()
    labels = predictions[img_index]['labels'][boxes_to_use].cpu().numpy()
    labels = [ coco_names[l] for l in labels ]

    return draw_segmentation_map(img, masks, boxes, labels)
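
The overlay in `draw_segmentation_map()` relies on `cv2.addWeighted`, which computes the saturated weighted sum `image * alpha + segmentation_map * beta + gamma`. The following NumPy sketch reproduces that arithmetic on a tiny made-up image so you can see the effect of `beta`. It is an illustration of the formula, not a replacement for the OpenCV call, and the helper name `add_weighted` is ours.

```python
import numpy as np

def add_weighted(src1, alpha, src2, beta, gamma):
    # Weighted sum in floating point, then round and saturate to uint8
    out = src1.astype(np.float64) * alpha + src2.astype(np.float64) * beta + gamma
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

# 1x2 image with three channels: one darker pixel, one brighter pixel
image = np.array([[[100, 100, 100], [200, 200, 200]]], dtype=np.uint8)
# Color only the first pixel, as if it were inside an object mask
segmentation_map = np.array([[[0, 200, 0], [0, 0, 0]]], dtype=np.uint8)

blended = add_weighted(image, 1, segmentation_map, 0.5, 0)
print(blended[0, 0].tolist())  # [100, 200, 100]
print(blended[0, 1].tolist())  # [200, 200, 200]
```

Because `alpha` is 1, each overlay only adds brightness to the image, so pixels covered by several masks drift toward saturation.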


Let's use the code above to visualize the predictions for the first image
in the validation set (index 0), using a threshold of 0.5:

In [None]:
from matplotlib import pyplot as plt

masked_img = prediction_to_mask_image(images, predictions, 0, 0.5)
plt.figure(1, figsize=(12, 9), dpi=100)
plt.imshow(masked_img)
plt.title('Validation image result')
plt.show()

## Evaluate on the COCO validation set

Let's get predictions in a loop for the full COCO 2017 validation set:

In [None]:
from engine import evaluate

results = evaluate(model, val_dataloader, 'cuda:0')
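
COCO evaluation matches predictions to ground truth by intersection over union (IoU), and its main AP figure averages precision over IoU thresholds from 0.50 to 0.95. As a reminder of the underlying quantity, here is a minimal sketch of box IoU and mask IoU. The function names are ours and are not part of `engine.py` or pycocotools.

```python
import numpy as np

def box_iou(a, b):
    # Boxes are [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mask_iou(m1, m2):
    # Masks are arrays of the same shape; nonzero means "object"
    m1, m2 = m1.astype(bool), m2.astype(bool)
    union = np.logical_or(m1, m2).sum()
    return float(np.logical_and(m1, m2).sum() / union) if union > 0 else 0.0

# Two 10x10 boxes overlapping in a 5x10 strip: 50 / 150
print(round(box_iou([0, 0, 10, 10], [5, 0, 15, 10]), 3))  # 0.333
```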

## Independent Work

Do the following:

 1. Get the demo up and running.
 2. Evaluate the pretrained COCO model on the COCO validation set we used last week.
 3. Run Mask R-CNN on the Cityscapes dataset in inference mode. Report your results and see what errors you find. [Here is the Cityscapes link](https://www.cityscapes-dataset.com/), and we'll also provide a copy of the dataset on the lab server.
 4. Fine-tune the COCO Mask R-CNN on Cityscapes and report the results. Are they close to what's reported in the Mask R-CNN paper?


### Additional information you might want to know

The main difficulty with Cityscapes is the dataset itself: it is hard to augment. We can address this problem by modifying the dataset class.

In <code>coco_utils.py</code>, add a function <code>get_cityscapes</code> for getting the dataset.

In [None]:
def get_cityscapes(root, ann_file, transforms):
    t = [ConvertCocoPolysToMask()]

    if transforms is not None:
        t.append(transforms)
    transforms = T.Compose(t)
    dataset = CocoDetection(root, ann_file, transforms=transforms)

    return dataset

To use it, set the image path and the annotation file names in your program:

In [None]:
path = "/root/Cityscapes/"                        # path to last year's Cityscapes data
train_annotation_file = "[cityscapes_train].json" # Cityscapes training annotation file (JSON)
val_annotation_file = "[cityscapes_val].json"     # Cityscapes validation annotation file (JSON)
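
`get_cityscapes` reuses the COCO loading code, so the annotation files need to be in COCO JSON format. As a reminder, a minimal sketch of that structure follows. The file name, sizes, IDs, and coordinates are made up for illustration.

```python
import json

annotation = {
    "images": [
        {"id": 1, "file_name": "example_0001.png", "height": 1024, "width": 2048},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 2, "name": "car"},
    ],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,
            "category_id": 2,
            "bbox": [100.0, 200.0, 50.0, 40.0],  # [x, y, width, height]
            "area": 2000.0,
            "iscrowd": 0,
            # One polygon given as a flat list of x, y vertex coordinates
            "segmentation": [[100.0, 200.0, 150.0, 200.0, 150.0, 240.0, 100.0, 240.0]],
        },
    ],
}

# Group annotations by image, as a COCO-style loader would
by_image = {}
for ann in annotation["annotations"]:
    by_image.setdefault(ann["image_id"], []).append(ann)

print(sorted(annotation.keys()))  # ['annotations', 'categories', 'images']
print(len(by_image[1]))           # 1
```

Note that COCO boxes are stored as `[x, y, width, height]`, whereas the model outputs `[x1, y1, x2, y2]`.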

Create the transformation:

In [None]:
transform = transforms.Compose([
    transforms.ToTensor()])

Load the datasets:

In [None]:
train_dataset = get_cityscapes(path, train_annotation_file, transform)
val_dataset = get_cityscapes(path, val_annotation_file, transform)

For augmentation, the dataset class below is copied from my thesis work on a thyroid nodule dataset. If you like, you can modify the code to support augmentation on Cityscapes.

**_To use this dataset class, first convert the JSON mask annotations to mask images. Store the mask images in a separate directory, and give each mask image the same file name as its source image._**
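
Since each image is paired with the mask of the same file name, it can help to check the two directories before training. Here is a minimal sketch using only the standard library. The function name `find_unpaired` and the temporary files are hypothetical.

```python
import os
import tempfile

def find_unpaired(imgs_dir, masks_dir):
    """Return (images without a mask, masks without an image), ignoring dotfiles."""
    imgs = {f for f in os.listdir(imgs_dir) if not f.startswith('.')}
    masks = {f for f in os.listdir(masks_dir) if not f.startswith('.')}
    return sorted(imgs - masks), sorted(masks - imgs)

# Demonstration on throwaway directories
with tempfile.TemporaryDirectory() as root:
    imgs_dir = os.path.join(root, "images")
    masks_dir = os.path.join(root, "masks")
    os.makedirs(imgs_dir)
    os.makedirs(masks_dir)
    for name in ["a.png", "b.png", "c.png"]:
        open(os.path.join(imgs_dir, name), "w").close()
    for name in ["a.png", "c.png"]:
        open(os.path.join(masks_dir, name), "w").close()

    missing_masks, missing_images = find_unpaired(imgs_dir, masks_dir)

print(missing_masks)   # ['b.png']
print(missing_images)  # []
```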

In [None]:
import logging
import os
from os import listdir

import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
import matplotlib.pyplot as plt
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ThyroidNoduleDataset(Dataset):
    def __init__(self, imgs_dir, masks_dir, transform=None):
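        # directory containing the source images
        self.imgs_dir = imgs_dir
        # directory containing the mask images
        self.masks_dir = masks_dir
        # drop Normalize and ToTensorV2; __getitem__ converts to tensors itself
        if transform is not None:
            transform = A.Compose([t for t in transform if not isinstance(t, (A.Normalize, ToTensorV2))])
        self.transform = transform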

        list_masks = set(listdir(masks_dir))

        # keep only images that have a mask with the same file name, and skip hidden
        # files (calling remove() on a list while iterating over it skips elements)
        self.ids = [file for file in listdir(imgs_dir)
                    if file in list_masks and not file.startswith('.')]

        logging.info(f'Creating dataset with {len(self.ids)} examples')

    def __len__(self):
        return len(self.ids)

    def preprocess_mask(self, mask):
        if len(mask.shape) == 2:
            mask = np.expand_dims(mask, axis=2)
        #mask = mask.astype(np.float32)
        mask[mask <= 100] = 0
        mask[mask > 100] = 1
        return mask

    def __getitem__(self, i):
        idx = self.ids[i]
        # os.path.join works whether or not the directory names end with a slash
        mask_file = os.path.join(self.masks_dir, idx)
        img_file = os.path.join(self.imgs_dir, idx)

        image = cv2.imread(img_file)
        if len(image.shape) == 2:
            image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)
        elif len(image.shape) == 3 and image.shape[2] == 4:
            image = cv2.cvtColor(image, cv2.COLOR_BGRA2RGB)
        else:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        mask = cv2.imread(mask_file, cv2.IMREAD_GRAYSCALE)
        mask = self.preprocess_mask(mask)

        if self.transform is not None:
            transformed = self.transform(image=image, mask=mask)
            image = transformed["image"]
            mask = transformed["mask"]
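            if not isinstance(mask, np.ndarray):
                # mask is already a tensor: put the channel dimension first
                mask = torch.reshape(mask, [1, mask.shape[0], mask.shape[1]])
            else:
                # numpy_to_torch() is not defined in this manual; we assume it converts
                # an HWC numpy array to a CHW tensor, which is done inline here
                image = torch.from_numpy(np.ascontiguousarray(image)).permute(2, 0, 1).float() / 255.0
                mask = mask.reshape(mask.shape[0], mask.shape[1], -1)
                mask = torch.from_numpy(np.ascontiguousarray(mask)).permute(2, 0, 1)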

        return {
            'image': image,
            'mask': mask,
            'img_file': img_file,
            'mask_file': mask_file,
            'file_name' : idx
        }

image_size = 256  # assumed value; image_size is not defined elsewhere in this manual

train_transform = A.Compose(
    [
        A.Resize(image_size, image_size),
        A.ShiftScaleRotate(shift_limit=0.2, scale_limit=0.15, rotate_limit=30, p=0.6, border_mode=cv2.BORDER_CONSTANT, value=0, mask_value=0),
        #A.RandomCrop(image_size, image_size, always_apply=False, p=1.0),
        A.RGBShift(r_shift_limit=25, g_shift_limit=25, b_shift_limit=25, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ToTensorV2(),
    ]
)
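
The key property of the augmentation pipeline above is that geometric transforms are applied identically to the image and its mask, while color transforms touch only the image. The NumPy sketch below illustrates the idea with a horizontal flip and a brightness shift. It is a hand-written stand-in for explanation, not how Albumentations is implemented, and the function name is ours.

```python
import numpy as np

def flip_and_brighten(image, mask, flip, brightness):
    if flip:
        # Geometric change: apply to both so pixels stay aligned
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    # Photometric change: image only, the mask keeps its labels
    image = np.clip(image.astype(np.int16) + brightness, 0, 255).astype(np.uint8)
    return image, mask

image = np.zeros((2, 4, 3), dtype=np.uint8)
mask = np.zeros((2, 4), dtype=np.uint8)
image[:, 0] = 250   # bright object in the leftmost column
mask[:, 0] = 1      # its mask

aug_image, aug_mask = flip_and_brighten(image, mask, flip=True, brightness=10)
print(aug_mask[0].tolist())         # [0, 0, 0, 1]
print(aug_image[0, :, 0].tolist())  # [10, 10, 10, 255]
```

After the flip, the object and its mask both sit in the rightmost column, and only the image values have changed.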

Use the dataset class as follows:

In [None]:
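src_dir = 'Train/'
imgs_dir_name = 'images'
masks_dir_name = 'masks'
# keep the trailing slash: the dataset class builds file paths from these directories
dir_img = src_dir + imgs_dir_name + '/'
dir_mask = src_dir + masks_dir_name + '/'
# create_transform() is not shown in this manual; we assume train_transform from the cell above
transform = train_transform
dataset = ThyroidNoduleDataset(dir_img, dir_mask, transform)

# get a single sample
sample = dataset[15]
image, mask, img_file, mask_file, img_name = sample['image'], sample['mask'], sample['img_file'], sample['mask_file'], sample['file_name']

# iterate over shuffled batches
batch_size = 8  # example value
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True)
for batch in train_loader:
    pass  # training code goes here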

#### Converting a COCO JSON file to mask images (example)

In [None]:
#%% 
# import library
import sys
from pycocotools.coco import COCO
import numpy as np
import skimage.io as io
import matplotlib.pyplot as plt
import matplotlib
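# FileTreeMaker is a helper from the author's own utils module; it is not part of
# the torchvision reference utils.py copied earlier in this lab
from utils import FileTreeMaker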
import os
from os import walk
import cv2
#matplotlib.use('TkAgg')
plt.figure(num=None, figsize=(8, 6), dpi=80, facecolor='w', edgecolor='k')
#%% Import coco files
# annotation path with JSON files
# the folder must have this layout:
# annotation_dir ---- annotation_file.json
#                 |
#                 --- annotation_file ----        (image folder with the same name as the annotation file)
#                                      |
#                                      --- image1.jpg
#                                      --- image2.png

annotation_dir = 'annotation_dir'
file_tree = FileTreeMaker()
print(file_tree.make(root=annotation_dir, level=1, 
            exclude_names=['json', 'Label', 'txt', 'Crop'],
            include_names=[], output_name=""))
print(file_tree.dir_full_names)
# %%
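# get the list of annotation (JSON) files at the top level of annotation_dir
fs = []
for (dirpath, dirnames, filenames) in walk(annotation_dir):
    fs.extend(f for f in filenames if f.endswith('.json'))
    break
#cocos = []
show = False
for f in fs:
    # strip the '.json' extension to get the name of the image folder
    name = os.path.splitext(f)[0]
    print('At folder: ', name)
    annFile = annotation_dir + '/' + f
    img_dir = annotation_dir + '/' + name
    mask_dir = annotation_dir + '/' + name + '_Mask'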

    if not os.path.exists(mask_dir):
        print("Folder", mask_dir, "does not exist. Creating it")
        os.mkdir(mask_dir)
        print("Finished creating directory")

    coco=COCO(annFile)

    # display COCO categories and supercategories
    cats = coco.loadCats(coco.getCatIds())
    nms=[cat['name'] for cat in cats]
    #print('COCO categories: {}\n'.format(' '.join(nms)))
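    catIds = coco.getCatIds(catNms=nms)
    #print('catIds: ', catIds)
    # note: when several category IDs are given, getImgIds returns only the
    # images that contain all of those categories
    imgIds = coco.getImgIds(catIds=catIds)
    #print('imgIds: ', imgIds)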
    for id in imgIds:
        img = coco.loadImgs(id)[0]
        #print(img)

        # load the image from disk (OpenCV returns it in BGR order)
        I = cv2.imread(img_dir + '/' + img['file_name'])
        if show:
            plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
            plt.axis('off')
            plt.imshow(I)
            plt.show()

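        annIds = coco.getAnnIds(imgIds=img['id'], catIds=catIds, iscrowd=None)
        anns = coco.loadAnns(annIds)
        # start from an empty mask so that an image without annotations
        # does not reuse the mask of the previous image
        mask = np.zeros((img['height'], img['width']), dtype=np.uint8)
        for ann in anns:
            # union of all instance masks in this image
            mask = (mask | coco.annToMask(ann))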

        # load and display instance annotations
        if show:
            plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
            plt.axis('off')
            plt.imshow(I)
            coco.showAnns(anns)
            plt.show()

        im = np.array(mask) * 255
        if show:
            print(im.shape, im.max())
            plt.figure(num=None, figsize=(10, 8), dpi=80, facecolor='w', edgecolor='k')
            plt.imshow(im); plt.axis('off')
            plt.show()

        cv2.imwrite(mask_dir + '/' + img['file_name'], im)
# %%
print('finish')
# %%
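
The inner loop of the script merges all instance masks of an image into one binary mask and stores it as an 8-bit image with values 0 and 255. Here is the same step in isolation, with small made-up arrays standing in for the output of `coco.annToMask`. The function name is ours.

```python
import numpy as np

def merge_instance_masks(instance_masks, height, width):
    # Start from an empty mask so an image without annotations yields all zeros
    merged = np.zeros((height, width), dtype=np.uint8)
    for m in instance_masks:
        merged |= m.astype(np.uint8)
    # Scale {0, 1} to {0, 255} so the mask is visible when saved as an image
    return merged * 255

m1 = np.array([[1, 1, 0, 0], [0, 0, 0, 0]], dtype=np.uint8)
m2 = np.array([[0, 1, 1, 0], [0, 0, 0, 0]], dtype=np.uint8)

merged = merge_instance_masks([m1, m2], 2, 4)
print(merged.tolist())  # [[255, 255, 255, 0], [0, 0, 0, 0]]
```

Merging discards instance identity, so the resulting images suit foreground/background training rather than instance segmentation.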

Good luck!