# "Video Instance Segmentation - overview of my Master's Thesis"
> "Part [3/3] of my Road to Masters story"

- toc: true
- branch: master
- badges: false
- image: images/ipynb/vis_main.png
- comments: false
- author: Giaco Stantino
- categories: [computer vision, machine learning]
- hide: false
- search_exclude: true
- permalink: /blog/vis-masters


#  <center> Intro </center>
In the last two notebooks we built up some basic ideas behind computer vision. Today we are going to look at a state-of-the-art algorithm. It was proposed to solve one of the hardest tasks (at least for now): Video Instance Segmentation.

<center><img src="https://giacostantino.com/images/ipynb/vis_dogo.gif" width="400" height="600"/></center>

This post concludes the blog series Road to Masters. If you would like to see how the method I tested performs on your own video, check out this **[Google Colab](https://colab.research.google.com/drive/1NiHX13_zBII-ZCp1FxjAbqPHACU2GNH2?usp=sharing)**.

***

#  <center> Prelude </center>

The motivation for this master's thesis was the safety of people working in a robotic environment, and in particular improving it through better analysis of video recordings, which may come from a robot's camera. Hence, the aim of the work was to test the effectiveness of MaskTrackRCNN, a method for video instance segmentation, and to modify its architecture in order to improve the results. The YouTube-VIS 2021 dataset was used for this task.

The results obtained with the modifications were analyzed and compared with the original version of the architecture. The modified network produced better results: in the best case the average precision increased by 21% (6.4 percentage points).

***

#  <center> The problem of segmentation </center>

NVIDIA DRIVE's results on image segmentation for autonomous driving and the use of neural networks in medical research are inspiring. In the figure below, the left side shows different object classes and surfaces being distinguished; the right side shows an MRI scan with marked areas where changes in the tissue were detected. Segmentation is an important subject for many fields of science.

<center><img src="https://giacostantino.com/images/ipynb/vis_segmentations1.png" width=800></center>

Segmentation is a problem that an object detection pipeline has to solve before it can recognize individual object instances. In fact, it is a key step towards understanding the image. **Semantic segmentation** associates each pixel of an image with a class, e.g. human, bird, pavement, vehicle, etc. It treats all objects of the same class as one. Such image analysis is often used to understand the scene as a whole. **Instance segmentation**, in contrast, treats objects of the same class as separate, individual objects. It is used to distinguish objects of the same class and, for example, to track their position in a video recording. The figure below shows both kinds of segmentation: in the middle image all chairs are assigned to the same object (semantic segmentation), and in the right image to separate ones (instance segmentation).

<center><img src="https://giacostantino.com/images/ipynb/vis_segmentations2.png" width=800></center>
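
The difference between the two kinds of segmentation can also be shown on a toy example. The sketch below uses hand-made NumPy arrays (the array values and names are made up for illustration, they are not taken from the thesis): the semantic mask stores a class id per pixel, while the instance mask stores a separate id for each chair.

```python
import numpy as np

# Toy 4x6 "image": two chairs (class id 1) on a background (class id 0).
# Semantic segmentation: every chair pixel gets the same class id.
semantic_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: the same pixels, but each chair has its own id.
instance_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Collect the non-background labels found in each mask.
semantic_classes = np.unique(semantic_mask[semantic_mask > 0])
instance_ids = np.unique(instance_mask[instance_mask > 0])

print("classes present:", semantic_classes.tolist())
print("instances present:", instance_ids.tolist())
```

Both masks cover exactly the same foreground pixels; only the labelling differs, which is what makes it possible to follow one particular chair across frames.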

At this point, it is worth noting that object segmentation in a single image is different from **video object segmentation**. The former divides the image into regions that are homogeneous in terms of a certain feature, and it can produce very different results for two similar images. The latter requires that the segmentation of a given frame refers to the segmentation of the previous frame, so that the extracted segments can be assigned to the same objects. In this process, the consistency between the segmentations of consecutive frames is extremely important, e.g. in terms of color, texture and depth. **Deep neural networks** have been shown to surpass traditional approaches in computer vision tasks such as detection, recognition, and segmentation.
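
To make the idea of frame-to-frame consistency more concrete, here is a minimal sketch of a simple heuristic: linking masks in consecutive frames by their overlap (intersection over union). This is only an illustration and not how MaskTrackRCNN tracks objects; the function names and the 0.5 threshold are assumptions made for this example.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over union of two boolean masks of equal shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

def match_to_previous(prev_masks, curr_masks, threshold=0.5):
    """For each current mask, return the index of the best previous mask or -1."""
    matches = []
    for curr in curr_masks:
        ious = [mask_iou(prev, curr) for prev in prev_masks]
        best = int(np.argmax(ious)) if ious else -1
        # Accept the match only if the overlap is large enough.
        matches.append(best if best >= 0 and ious[best] >= threshold else -1)
    return matches

# Previous frame: one object occupying a 2x2 block.
prev = np.zeros((4, 6), dtype=bool)
prev[1:3, 1:3] = True

# Current frame: the same object, slightly larger, plus a new far-away object.
curr_same = np.zeros((4, 6), dtype=bool)
curr_same[1:3, 1:4] = True
curr_new = np.zeros((4, 6), dtype=bool)
curr_new[0:1, 5:6] = True

matches = match_to_previous([prev], [curr_same, curr_new])
print(matches)
```

A purely geometric rule like this breaks down when objects move fast or occlude each other, which is one reason why learned appearance features are used for tracking.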

***

# <center> Chosen method </center>

MaskTrackRCNN builds on the MaskRCNN object segmentation method, which in turn is based on FasterRCNN. A tracking head was added to MaskRCNN to make use of temporal information. All of the above belong to the family of region-based convolutional neural networks (R-CNN).

<center><img src="https://giacostantino.com/images/ipynb/vis_matrioszka.png" width=500></center>

## R-CNN Family

Object detection is an integral part of computer vision. The difference between object detection algorithms and classification algorithms is that in detection algorithms we try to draw a bounding box around one (or more) object of interest to locate it in the image. The main reason you cannot solve this problem by building a standard convolutional network is that the number of occurrences of objects of interest is not constant. Therefore, algorithms such as R-CNN have been developed to find regions of interest and classify them quickly.

<center><img src="https://giacostantino.com/images/ipynb/vis_rcnn.png" width=800></center>

Methods based on region proposals work according to the scheme shown in the figure above. First, candidate regions of interest are generated, then feature maps are computed for them and an attempt is made to classify them. How the regions are generated depends on the specific method.

> Note: In this blog post we only consider the parts of MaskRCNN that are crucial to understanding the idea behind the MaskTrackRCNN method. If you want to know more, these are helpful links: [FasterRCNN](https://towardsdatascience.com/faster-r-cnn-for-object-detection-a-technical-summary-474c5b857b46) and [MaskRCNN](https://engineering.matterport.com/splash-of-color-instance-segmentation-with-mask-r-cnn-and-tensorflow-7c761e238b46)

***

## Concepts behind MaskRCNN


MaskRCNN is a deep neural network whose task is to segment object instances, i.e. to distinguish between different objects of the same class in a photo. It is a two-step method built on the FasterRCNN architecture. First, region proposals (regions in which an object may be located) are generated based on the input image. Then the network predicts the object's class, refines the bounding boxes and generates a mask at the level of individual pixels.

<center><img src="https://giacostantino.com/images/ipynb/vis_mask_example.png" width=700></center>

In the first step, the region proposal network generates multiple RoIs; 9 anchor boxes are usually used for this purpose. In the second step, using the information from the feature maps and the RoI Align function, the object class and bounding box are predicted, and in parallel a binary mask is generated for each region of interest using convolutional layers.
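
As an illustration of where the number 9 comes from, the sketch below generates anchor boxes for one location as 3 scales times 3 aspect ratios, which is the default setup described in the Faster R-CNN paper. The concrete scale values, the `make_anchors` name and the `(x1, y1, x2, y2)` box layout are assumptions of this example, not settings taken from the thesis.

```python
import numpy as np

def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return an array of (x1, y1, x2, y2) anchors centred on one location."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # ratio = height / width, while the area stays equal to scale**2.
            width = scale / np.sqrt(ratio)
            height = scale * np.sqrt(ratio)
            anchors.append([
                center_x - width / 2, center_y - height / 2,
                center_x + width / 2, center_y + height / 2,
            ])
    return np.array(anchors)

anchors = make_anchors(300, 200)
print(anchors.shape)  # one row per anchor: 3 scales x 3 ratios
```

The region proposal network scores every such anchor at every feature-map location and regresses offsets that turn the best ones into RoIs.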

***

## MaskTrackRCNN 

This architecture draws heavily on the MaskRCNN network. It extends the original idea of the three heads of the model, i.e. object classification, bounding box regression and mask generation, with an additional fourth head, which is used to track object instances across the frames of the recording.

<center><img src="https://giacostantino.com/images/ipynb/vis_masktrack_schemat.png" width=800></center>

The proposed tracking branch has two fully connected layers that transform the feature maps obtained by RoIAlign into a feature vector (instead of a matrix map). Since the features of previously identified instances have already been computed, the architecture uses external memory to store them. This memory is updated dynamically when a candidate box is assigned to an instance that already exists in memory, and expanded when an object is judged to be new to the given video sequence.

In summary, the additional tracking branch compares the information in the current frame with the information from the earlier frames of the video sequence stored in memory. On this basis, it determines whether the object is an instance of a previously identified object; if so, it is assigned the same class and identifier.
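
The memory mechanism can be sketched in a few lines. The example below is a simplified illustration, not the thesis code: it assumes that similarity is a dot product between feature vectors, that a fixed logit of 0 stands for "new instance", and that a matched memory slot is simply overwritten with the latest features. The real tracking head works on learned features and may combine further cues, which is omitted here. The function names are hypothetical.

```python
import numpy as np

def assign_instance(feature, memory):
    """Return (index, probabilities); index 0 means 'new instance'."""
    # Logit 0 for a new instance, dot products for the stored instances.
    logits = np.array([0.0] + [float(feature @ stored) for stored in memory])
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs

def update_memory(feature, memory):
    """Assign an instance id to the feature and update the memory in place."""
    index, _ = assign_instance(feature, memory)
    if index == 0:
        memory.append(feature)       # unseen object: expand the memory
        return len(memory) - 1
    memory[index - 1] = feature      # known object: refresh its features
    return index - 1

memory = []
first_id = update_memory(np.array([2.0, 0.0]), memory)    # empty memory
second_id = update_memory(np.array([1.9, 0.1]), memory)   # similar features
third_id = update_memory(np.array([-2.0, 1.0]), memory)   # dissimilar features
print(first_id, second_id, third_id)
```

In this toy run the first two feature vectors end up with the same identifier, while the third one opens a new slot in memory.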

***

# <center> Modifications </center>

In the original implementation, the body of the MaskTrackRCNN architecture was the ResNet50 residual network. The body of the architecture is the neural network that generates the feature maps. These maps are used by the architecture's heads, such as the classification head and the tracking head, to generate results. Thus, the better the feature maps, the better the results. In order to improve the results of the MaskTrackRCNN network, two modifications of the original architecture are proposed in the experimental part:

* changing the body of the MaskRCNN model to a deeper one than in the original implementation, e.g. ResNet101 or ResNeXt101
* replacement of RoI Align with a newer equivalent that takes into account temporal clues in the recording.

> Note: In this blog post we are going to focus on the second modification. If you want to know more about the differences between ResNet and ResNeXt, check out this [post](https://medium.com/dataseries/enhancing-resnet-to-resnext-for-image-classification-3449f62a774c)


***

## Temporal RoI Align

**RoIAlign** is the operation of extracting a small feature map from each region of interest in which an object is searched for. It properly aligns the extracted features with the input data. To avoid quantization of the RoI boundaries or bins, RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, as shown in the figure below. Then the result is aggregated (using the maximum or the average value). This solution aligns features more accurately than earlier ones, such as RoIPool, and extracts feature maps for images well.

<center><img src="https://giacostantino.com/images/ipynb/vis_roialign.png" width=200></center>
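
The sampling step described above can be written down directly. The sketch below handles a single-channel feature map and a single RoI bin; the function names are made up for this post, the four sample points are placed at 1/4 and 3/4 of the bin along each axis, and all sample points are assumed to lie inside the feature map.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Interpolate a 2D feature map at a continuous (y, x) location."""
    height, width = feature_map.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align_bin(feature_map, y_start, x_start, y_end, x_end, reduce=np.mean):
    """Aggregate one RoI bin from four regularly spaced sample points."""
    ys = [y_start + (y_end - y_start) * f for f in (0.25, 0.75)]
    xs = [x_start + (x_end - x_start) * f for f in (0.25, 0.75)]
    samples = [bilinear_sample(feature_map, y, x) for y in ys for x in xs]
    return float(reduce(samples))

feature_map = np.arange(16, dtype=float).reshape(4, 4)
# The bin borders are not integers, and no rounding is applied to them.
value = roi_align_bin(feature_map, 0.5, 0.5, 2.5, 2.5)
print(value)
```

Passing `reduce=np.max` switches the aggregation from the average to the maximum value.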

In August 2021, this solution was significantly improved. The new approach, **Temporal RoI Align**, takes into account the temporal nature of the recordings. It can therefore outperform RoIAlign, which was designed to analyze single images, not sequences.


<center><img src="https://giacostantino.com/images/ipynb/vis_temporalroialign.png" width=600></center>

The illustration above shows two patterns of operation:

* RoI Align generates a feature map for a region of interest as described at the beginning of this section.
* Temporal RoI Align first extracts RoI features from support frames. A temporal attention algorithm is then used to aggregate the RoI features of the current frame with the most similar RoI features from the other frames. Finally, the aggregated feature map is generated.
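
The aggregation step can be illustrated with a heavily simplified sketch. It is not the formulation from the Temporal RoI Align paper: here every RoI is reduced to a single feature vector, similarity is assumed to be cosine similarity, and the attention weights are a plain softmax over the current and support features. The function name is hypothetical.

```python
import numpy as np

def temporal_aggregate(current, supports):
    """Blend the current RoI feature with support-frame features."""
    # Stack the current feature on top of the support features: (1 + S, D).
    candidates = np.vstack([current[None, :], supports])
    # Cosine similarity of every candidate to the current feature.
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(current)
    similarity = candidates @ current / np.maximum(norms, 1e-12)
    # Softmax turns similarities into attention weights that sum to 1.
    weights = np.exp(similarity - similarity.max())
    weights /= weights.sum()
    return weights @ candidates, weights

current = np.array([1.0, 0.0])
supports = np.array([[0.9, 0.1],    # looks like the current object
                     [0.0, 1.0]])   # looks different
aggregated, weights = temporal_aggregate(current, supports)
print(weights, aggregated)
```

Support features that resemble the current RoI receive larger weights, so the aggregated feature is dominated by views of the same object from other frames.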

In the experimental part, the results obtained with the use of both algorithms are compared.

***

#  <center> Experiments </center>

The first experiments concerned the modification of the body of the architecture. First, an attempt was made to reproduce the results of the original research paper using the ResNet50 network as the body. Then the MaskTrackRCNN method was trained in its modified form.

<center><img src="https://giacostantino.com/images/ipynb/vis_results1.png" width=450></center>

The attempt to reproduce the results of the paper presenting MaskTrackRCNN gave an AP 0.6 lower than the original, which may be due to the lower number of training epochs. The best results were achieved by the version of the architecture using the ResNeXt101 network with a cardinality of 64, an AP improvement of 1.9 compared to ResNet101. ResNeXt101 with a cardinality of 32 shows an improvement of 1.4. This is probably due to the additional dimension of the network (cardinality) in both cases, which allows it to generate better feature maps.

***

The second group of experiments concerned the use of the Temporal RoI Align function. In the thesis, results were obtained for the ResNet101 and ResNeXt101 (cardinality 64) architecture bodies.

<center><img src="https://giacostantino.com/images/ipynb/vis_results2.png" width=550></center>

It can be seen that the revised version of RoI Align using temporal features improves the results on all three metrics compared to the original version of the architecture. For ResNet101 the score improves by 2.9 AP, and for ResNeXt101 with a cardinality of 64 by 2.6 AP. Such results support the thesis that extracting features from the entire video, not just from one frame, facilitates the classification of regions of interest.

***

# <center> Sample frames </center>

The masks overlaid on example frames from fragments of the recordings are presented below. They show a man walking with a badminton racket and a skateboard.

<center><img src="https://giacostantino.com/images/ipynb/vis_sampleframes1.png" width=800></center>

The empirical results shown in the frames above reveal many problems related to the task of segmenting object instances in video. At the same time, it should be noted that this is a difficult task, for which one of the best methods, MaskTrackRCNN, achieves about 30% AP.

***

<center><img src="https://giacostantino.com/images/ipynb/vis_sampleframes2.png" width=500></center>

First, the quality of the mask is poor. In the picture above, a close-up of the man's mask from the first video is shown. There are visible inaccuracies in the binary pixel allocation. Especially to the right of the head, many background pixels have been identified as part of an object instance. On the other hand, in this example, the model correctly predicted the class with 100% certainty.

***

<center><img src="https://giacostantino.com/images/ipynb/vis_sampleframes3.png" width=500></center>

The second problem that was observed is the incorrect recognition of objects. The figure above shows a close-up of the dog in two example frames. The earlier frame shows low classification certainty, about 3%, for the tiger class, which is an incorrect classification. The subsequent frame shows that the object is assigned to the correct class, but to a different instance. The reason for such a result may be the insufficient influence of temporal information on the operation of the tracking head.

# <center> Summary </center>

My master's thesis discussed the problem of Video Instance Segmentation. MaskTrackRCNN was chosen as the benchmark for my experiments. I tried to improve the predictions made by the original model. To achieve this goal, I proposed two modifications: changing the backbone to a deeper architecture and using Temporal RoI Align in place of the previous function, which did not use temporal information. My modified model outscored the original architecture, but the quality of the produced masks was not satisfactory.

This sums up the blog post and the series. In the end, I am proud of the work I've done. It was without a doubt one of the most formative projects in my life.