# Video Object Segmentation with Mask R-CNN (OpenCV library)

In this notebook we explain, step by step, how to implement instance segmentation on video.   
The goal is to generate a mask for each object in our video, so that foreground objects can be separated from the background.   
Instance segmentation algorithms compute a pixel-wise mask for every detected object in a frame.   
We use the Mask R-CNN architecture as the instance segmentation algorithm. Mask R-CNN is built on Faster R-CNN, a popular object detection framework, and extends it with an additional branch that predicts a segmentation mask for each detected object.

## Importing Libraries

First, we import the OpenCV and NumPy libraries, plus the `Video` helper from IPython to display videos in the notebook.

In [1]:
import cv2 
import numpy as np
import random
from IPython.display import Video

## Video

Let's have a look at the video.

Link (click on "download" and watch it):   [https://github.com/buropas/Image_and_Video_Segmentation/blob/main/test.mp4](https://github.com/buropas/Image_and_Video_Segmentation/blob/main/test.mp4)

In [2]:
#Take a look at the input video
Video("test.mp4")

## Loading Model Configuration and Pre-trained Weights
We load:
- the Mask R-CNN model weights ("frozen_inference_graph_coco.pb"), which are pre-trained on the COCO dataset, and
- the Mask R-CNN model configuration ("mask_rcnn_inception_v2_coco_2018_01_28.pbtxt").

In [3]:
## Loading Mask R-CNN configuration file and pre-trained weights

net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph_coco.pb",                   # weights path
                                    "mask_rcnn_inception_v2_coco_2018_01_28.pbtxt")     # config path
                                    
print("MASK R-CNN LOADED SUCCESSFULLY")

MASK R-CNN LOADED SUCCESSFULLY


This is all we need in order to load the model configuration and pre-trained weights.  

## Classes in the COCO dataset

In [4]:
# classes in COCO dataset
classes = ["person","bicycle","car","motorcycle","airplane","bus","train","truck","boat","traffic light",
           "fire hydrant","street sign","stop sign","parking meter","bench","bird","cat","dog","horse",
           "sheep","cow","elephant","bear","zebra","giraffe","hat","backpack","umbrella","shoe","eye glasses",
           "handbag","tie","suitcase","frisbee","skis","snowboard","sports ball","kite","baseball bat",
           "baseball glove","skateboard","surfboard","tennis racket","bottle","plate","wine glass",
           "cup","fork","knife","spoon","bowl","banana","apple","sandwich","orange","broccoli","carrot",
           "hot dog","pizza","donut","cake","chair","couch","potted plant","bed","mirror","dining table",
           "window","desk","toilet","door","tv","laptop","mouse","remote","keyboard","cell phone","microwave",
           "oven","toaster","sink","refrigerator","blender","book","clock","vase","scissors","teddy bear",
           "hair drier","toothbrush"]

print("NUM CLASSES:", len(classes))

NUM CLASSES: 90


## Capturing video 
The next step is to load the video and create the VideoWriter object that saves our final video with object segmentation. The output video will be saved as "segm_out_video.avi".

In the VideoWriter object we specify:
- the output file name (segm_out_video.avi), 
- the FourCC code (a 4-byte code used to specify the video codec), 
- the number of frames per second (fps), 
- the frame size.
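
To make the FourCC code more concrete, here is a minimal sketch of how four characters can be packed into a single 32-bit integer, with the first character in the lowest byte. The helper names `fourcc_to_int` and `int_to_fourcc` are hypothetical and exist only for illustration; in the notebook itself, `cv2.VideoWriter_fourcc` builds the code.

```python
def fourcc_to_int(c1, c2, c3, c4):
    """Pack four ASCII characters into one integer (first character in the lowest byte)."""
    return ((ord(c1) & 255)
            | ((ord(c2) & 255) << 8)
            | ((ord(c3) & 255) << 16)
            | ((ord(c4) & 255) << 24))


def int_to_fourcc(code):
    """Recover the four characters from a packed FourCC integer."""
    return "".join(chr((code >> (8 * i)) & 255) for i in range(4))


mjpg_code = fourcc_to_int("M", "J", "P", "G")
print(mjpg_code)                  # integer form of the MJPG codec identifier
print(int_to_fourcc(mjpg_code))   # back to the four-character string
```

Decoding the integer back to a string can be handy when inspecting the codec of an existing file, since the codec is reported as a number rather than as text.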

In [5]:
## Loading video 
filename = "test.mp4"                         # filename
cap = cv2.VideoCapture(filename)              # loading video

# We get the resolution of our video (width and height) and we convert from float to integer
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))     # property id 3
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))   # property id 4

# We create VideoWriter object and define the codec. The output is stored in 'segm_out_video.avi' file.
out_video = cv2.VideoWriter("segm_out_video.avi",                        # output name
                            cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'),  # 4-byte code used to specify the video codec
                                                                         # (we pass MJPG)
                            10,                                          # number of frames per second (fps) 
                            (frame_width, frame_height)                  # frame size
                            )

# set font and color of text (to show class and confidence score)
font = cv2.FONT_HERSHEY_PLAIN     # font
text_color = (0,255,0)            # green color

# random colors to distinguish between different classes (one row per class, 3 channels)
colors = np.random.randint(0, 256, (len(classes), 3))   # upper bound is exclusive, so values are in 0..255

## Video Object Segmentation


The goal is to perform video object segmentation: we want to automatically generate a pixel-wise mask for every detected object in our video.   
Video object segmentation is commonly framed as a binary labeling problem that separates the foreground object(s) from the background region of a video; here each detected object additionally gets its own mask and class label.   
We process the video frame by frame.   
For each frame:
- we preprocess the frame and pass it as input to the network, then we run a forward pass to generate the network output,
- as output of the network, we obtain the detected objects with their bounding box coordinates, confidence scores, classes and predicted masks,
- we filter out objects detected with a confidence score lower than a specific threshold (0.5 in this notebook),
- for the remaining detected objects, we extract the bounding box and the associated mask.    
  The predicted mask is only 15 x 15 pixels, so we resize it to the size of the object's bounding box in the original frame   
  and threshold it to obtain a binary mask,
- finally, we find the contours of the masks and fill the area of each detected object with a color. The colors are generated randomly, one per class. The filled regions are drawn on an overlay image, which is then blended with the frame so that the colored areas appear semi-transparent.
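
The filtering and box-extraction steps above can be sketched on a synthetic detection array. This assumes the layout used in the processing loop of this notebook: `boxes` has shape (1, 1, N, 7), and each row stores the class id at index 1, the confidence score at index 2 and the normalized box coordinates (x1, y1, x2, y2) in the last four positions. The helper name `extract_boxes` and the sample values are made up for illustration.

```python
import numpy as np


def extract_boxes(boxes, width, height, threshold=0.5):
    """Return (class_id, confidence, (x1, y1, x2, y2)) for detections above the threshold."""
    results = []
    for det in boxes[0, 0]:
        class_id = int(det[1])
        confidence = float(det[2])
        if confidence <= threshold:
            continue  # discard low-confidence detections
        x1, y1, x2, y2 = det[3:7]
        # scale normalized coordinates to pixels and clip them to the frame
        x1 = int(np.clip(x1 * width, 0, width - 1))
        y1 = int(np.clip(y1 * height, 0, height - 1))
        x2 = int(np.clip(x2 * width, 0, width - 1))
        y2 = int(np.clip(y2 * height, 0, height - 1))
        results.append((class_id, confidence, (x1, y1, x2, y2)))
    return results


# two synthetic detections: one confident, one below the threshold
fake_boxes = np.array([[[
    [0, 0, 0.92, 0.125, 0.25, 0.5, 0.75],
    [0, 2, 0.30, 0.0, 0.0, 1.0, 1.0],
]]], dtype=np.float32)

detections = extract_boxes(fake_boxes, width=640, height=480)
for class_id, confidence, box in detections:
    print(class_id, round(confidence, 2), box)
```

Clipping the coordinates is a defensive choice in this sketch: a box that extends slightly past the frame border would otherwise produce an invalid or empty slice.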

At the end, we save the video with Object Segmentation.
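
The mask step can also be illustrated without the network. The sketch below takes a synthetic 15 x 15 probability mask, enlarges it to a bounding-box size and binarizes it at 0.5. Note the assumption: it uses plain nearest-neighbour indexing in NumPy as a stand-in for `cv2.resize`, which interpolates by default, so edges will look blockier than in the real pipeline. The function name `upsample_and_binarize` is hypothetical.

```python
import numpy as np


def upsample_and_binarize(mask, out_width, out_height, threshold=0.5):
    """Nearest-neighbour upsample of a low-resolution mask, then binarize to {0, 255}."""
    in_h, in_w = mask.shape
    # for every output row/column, pick the index of the source row/column it maps to
    rows = np.arange(out_height) * in_h // out_height
    cols = np.arange(out_width) * in_w // out_width
    resized = mask[rows[:, None], cols[None, :]]
    return np.where(resized > threshold, 255, 0).astype(np.uint8)


# synthetic 15 x 15 mask with a confident square in the middle
low_res = np.zeros((15, 15), dtype=np.float32)
low_res[5:10, 5:10] = 0.9

binary = upsample_and_binarize(low_res, out_width=60, out_height=30)
print(binary.shape, binary.dtype)
```

The resulting array has the same height and width as the bounding box, so it can be used directly to find contours inside the box region.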

In [None]:
while cap.isOpened():     # while the capture is correctly initialized...

    # We process the video frame-by-frame
    
    ret, img = cap.read()           # we read each frame (img) from the video
                                    # we also retrieve ret, which is a boolean value. 
                                    # ret is True if the frame is read correctly    
    
    if ret:    # if the frame is read correctly, go on...
        
        # make 2 copies of the frame image (the first will be the final output, while the second will be the 
        # overlay on top)
        output = img.copy()             # copy of the original frame 
        overlay = img.copy()            # copy of the original frame

        height, width, _ = img.shape    # retrieve shape from image (frame)
        

        ## IMAGE PREPROCESSING
        ## Using the blob function of OpenCV to preprocess the frame (image) 
        # (cv2.dnn.blobFromImage returns a 4D blob built from the input image; swapRB=True swaps the 
        # red and blue channels, converting OpenCV's BGR order to RGB)
        blob = cv2.dnn.blobFromImage(img, swapRB=True)  
        
        ## NETWORK PREDICTIONS (Output)
        net.setInput(blob)                                                     # set blob as input to the network
        boxes, masks = net.forward(["detection_out_final", "detection_masks"]) # runs a forward pass... 
                                                                               # ...to compute the net output 

        num_detections = boxes.shape[2]        # number of detected objects in the frame

        for i in range(num_detections):        # for each of the detected objects...

            box = boxes[0,0,i]                 # single detected object
            class_id = int(box[1])             # the class associated with the detected object is the second element
            confidence_score = box[2]          # the confidence score for the detected object is the third element

            if confidence_score > 0.5:       # if the confidence score of the detected object is above a 
                                             # specific threshold, we keep on extracting box coordinates and 
                                             # mask associated with that object

                label = str(classes[class_id])    # class associated with the object

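                x1, y1, x2, y2 = box[3:]          # normalized box coordinates (last 4 elements)

                # we scale the normalized coordinates to pixels and clip them to the frame
                x1 = max(0, min(int(x1 * width), width - 1))
                y1 = max(0, min(int(y1 * height), height - 1))
                x2 = max(0, min(int(x2 * width), width - 1))
                y2 = max(0, min(int(y2 * height), height - 1))

                if x2 <= x1 or y2 <= y1:          # skip empty boxes (cv2.resize fails on a zero size)
                    continue

                object_area = overlay[y1:y2, x1:x2]    # area of the detected object (a view into overlay)

                object_height, object_width, _ = object_area.shape   # height and width of the detected object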


                ## MASK ##
                ##########

                ## We extract the pixel-wise segmentation (mask) for the detected object, 
                ## we resize the mask such that it's the same dimensions of the bounding box of the detected object
                ## finally, we threshold to create a binary mask.

                mask = masks[i, class_id]     # mask associated with the detected object and its predicted class id

                # The predicted mask is only 15 x 15 pixels so we resize the mask back to the original input object 
                # dimensions. We need to adapt the mask to the size of the object in the original image
                mask = cv2.resize(mask, (object_width, object_height))

                # For every pixel in the mask, if the pixel value is greater than the threshold (0.5), it is set 
                # to the maximum value (255, white pixel); otherwise it is set to 0. 
                # The function cv2.threshold applies the thresholding; cv2.THRESH_BINARY selects binary thresholding. 
                _, mask = cv2.threshold(mask, 0.5, 255, cv2.THRESH_BINARY)

                mask = np.array(mask, np.uint8)     # convert to an array of 8-bit unsigned integers 

                # We find the contours of the mask (mask coordinates) 
                # Each individual contour is a NumPy array of (x,y) coordinates of boundary points of the object.
                # (Two return values as in OpenCV 4.x; OpenCV 3.x returns three.)
                contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

                # We fill the mask area of the detected object with the color assigned to its class
                # (the colors were generated randomly, one per class).
                # fillPoly() fills an area bounded by one or more polygonal contours.
                color = tuple(int(c) for c in colors[class_id])   # convert NumPy integers to plain Python ints
                for cnt in contours:     # for each contour...
                    cv2.fillPoly(object_area,            # area to fill (a view into overlay)
                                 [cnt],                  # contour bounding the area
                                 color)                  # fill color
                
                # We put text (class and confidence score) on top of each detected object
                cv2.putText(output, label + " " + str(round(confidence_score,2)), (x1, y1), font, 1.2, 
                            text_color, 2)   # text of the box 


        # Now, we apply the overlay.
        # Overlay is the image that we want to “overlay” on top of the original image using a supplied level 
        # of alpha transparency.
        alpha = 0.6
        cv2.addWeighted(overlay,  # image that we want to “overlay” on top of the original image
                        alpha,    # alpha transparency of the overlay (the closer to 0 the more transparent the 
                                  # overlay will appear)
                        output,   # original source image
                        1-alpha,  # beta parameter (1-alpha)
                        0,        # gamma value — a scalar added to the weighted sum (we set it to 0)
                        output    # our final output image
                        )
            
        cv2.imshow("out", output)      # display the current frame with detected objects and masks 

        out_video.write(output)        # the frame is saved for the final video

        key = cv2.waitKey(1)        # wait 1 millisecond for a key press between frames
        if key == 27:               # if the Esc key (code 27) is pressed, stop processing
            break
    
    
    else:   # if the frame is not read correctly, break...
        break
    
# Release everything when job is finished
cap.release()
out_video.release()
cv2.destroyAllWindows()
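
To clarify what the blending step computes, here is a small NumPy sketch of the weighted sum behind the overlay: each output pixel is `overlay * alpha + base * (1 - alpha) + gamma`, rounded and saturated to the 0..255 range. The function name `blend` and the tiny 2 x 2 test image are made up for illustration; the notebook itself relies on `cv2.addWeighted`.

```python
import numpy as np


def blend(overlay, base, alpha, gamma=0.0):
    """Weighted sum of two uint8 images, rounded and clipped to 0..255."""
    mixed = (overlay.astype(np.float32) * alpha
             + base.astype(np.float32) * (1.0 - alpha)
             + gamma)
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)


# a uniform gray "frame" and an overlay with one pixel painted green (BGR order)
base = np.full((2, 2, 3), 100, dtype=np.uint8)
overlay = base.copy()
overlay[0, 0] = (0, 200, 0)

blended = blend(overlay, base, alpha=0.6)
print(blended[0, 0])   # painted pixel: a mix of the fill color and the frame
print(blended[1, 1])   # untouched pixel: identical to the frame
```

Pixels that were not painted on the overlay are unchanged by the blend, because both inputs hold the same value there; only the filled object areas show the semi-transparent color.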

Now, let's have a look at the final result...

## Video with Object Segmentation

Link to the output video (click on "download" to download and watch it):   

[https://github.com/buropas/Image_and_Video_Segmentation/blob/main/segm_out_video.avi](https://github.com/buropas/Image_and_Video_Segmentation/blob/main/segm_out_video.avi)