# Computer Vision, Lab 4: Semantic Segmentation

Last time we used a simple generative model based on HSV color space features to segment our scene.

Today we'll continue experimenting with methods to segment a scene. Our setting is a ground robot in an indoor environment, and the task is to separate unoccupied ground-plane space from obstacles.

Today we'll move beyond a simple pixel-based classification to a more sophisticated CNN-based "semantic" segmentation model.

## Semantic Segmentation

We've seen that color-based segmentation in HSV space using a generative machine learning model can be effective, but it has limitations when the objects to be segmented have color distributions similar to the background. The best accuracy we could get for our floor/obstacle model last week was around 95%, but that was with occasional errors clustered in regions large enough to keep our robot from navigating safely in the indoor environment.

Semantic segmentation attempts to address this issue using a more sophisticated model to separate the scene into its constituent regions more effectively.

Semantic segmentation models typically use a lot of resources. For example, the state-of-the-art models published for the MIT ADE20K dataset in their [GitHub repository](https://github.com/CSAILVision/semantic-segmentation-pytorch) are very accurate and run at 2-17 FPS on an NVIDIA Pascal Titan Xp GPU. They also run well on a GTX 1080 Ti. But on an i7 CPU, I found that they take 9-11 SECONDS PER FRAME, and on an NVIDIA Jetson Nano's GPU, they take 12-42 SECONDS PER FRAME!

## Lighter semantic segmentation models

We would like to experiment with some semantic segmentation models that have a hope of running in real time on a small embedded system such as the Jetson Nano.

NVIDIA has published a very useful repository, [Jetson Inference](https://github.com/dusty-nv/jetson-inference), that contains versions of two semantic segmentation models: SegNet and UNet.

I ran the SegNet model in this repository using FCN-ResNet18 trained on Pascal VOC with 320x320 input images. It takes a while to load the model into memory and so on, but inference time once all is ready is very fast: less than 70 ms.

So it's fast! Unfortunately, it doesn't work particularly well out of the box:

<img src="img/lab04-1.png" width="600"/>

But we shouldn't expect it to, considering that the Pascal VOC dataset "only" contains 20 classes plus "background" (the black label). Two of those classes are "bottle" (the purple label) and "person" (the tan label).

That's on the NVIDIA Jetson Nano. The model itself was built with PyTorch, but it has been exported in ONNX (Open Neural Network Exchange) format, and the Jetson Nano executes it on TensorRT. You can download the [ONNX model from AIT](https://www.cs.ait.ac.th/~mdailey/class/vision/fcn_resnet18.onnx) (I just copied it from the excellent Jetson Inference repository).

If you'd like to understand the structure of the model represented in this ONNX file, download the [Netron 4.3.4 AppImage for Linux](https://github.com/lutzroeder/netron), run the AppImage, and load the file. You'll see that it

1. Takes as input a 320x320 3-channel image
2. Performs 64 7x7 convolutions
3. Does batch normalization, ReLU, and MaxPool
4. Runs a residual block with the following
    1. 64 3x3 convolutions then BN then ReLU
    2. 64 3x3 convolutions then BN
5. ReLU
6. Residual block (same structure as above)
7. ReLU
8. Residual block with the following
    1. 128 3x3 convolutions then BN then ReLU
    2. 128 3x3 convolutions then BN
    3. Residual add using 128 1x1 convs to expand feature map
9. Etc. (several more residual blocks and downscaling by a factor of 32)
10. Final 21 1x1 convolution to obtain output 1x21x10x10 tensor
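
The output shape in the last step follows directly from the input size and the total downscaling factor. The short sketch below just reproduces that arithmetic; the function name `output_shape` is made up for illustration and is not part of any library.

```python
# Spatial-size arithmetic for the fully convolutional model described above.
INPUT_SIZE = 320     # input height and width, from step 1
TOTAL_STRIDE = 32    # overall downscaling factor, from step 9
NUM_CLASSES = 21     # 20 Pascal VOC classes plus background


def output_shape(input_size, total_stride, num_classes, batch=1):
    """Return the NCHW shape of the score tensor (hypothetical helper)."""
    side = input_size // total_stride
    return (batch, num_classes, side, side)


shape = output_shape(INPUT_SIZE, TOTAL_STRIDE, NUM_CLASSES)
print(shape)  # (1, 21, 10, 10)
```

Each of the 10x10 output cells therefore summarizes a 32x32 pixel patch of the input, which is why the raw segmentation looks blocky before it is scaled back up to the image size.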

Since this model is relatively small and at least generates some output, let's see if we can get it running on our OpenCV video stream. A model that runs nice and fast on an Intel CPU and on a Jetson GPU, in C++/OpenCV as well as Python, would be perfect.

<img src="img/lab04-2.png" width="300"/>

## Load a segmentation model using OpenCV DNN

Let's see if we can get this ONNX model running in OpenCV with its DNN functionality.

First, let's initialize with the Pascal VOC classes and read an image:

### C++

In [None]:
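#include <iostream>
#include <string>
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>

// Pascal VOC class names, indexed by class ID
std::string aStringClasses[] = {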
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"
};

cv::Vec3b aColorClasses[] = {
        { 0, 0, 0 }, { 255, 0, 0 }, { 0, 255, 0 }, { 0, 255, 120 }, { 0, 0, 255 }, { 255, 0, 255 }, { 70, 70, 70 },
        { 102, 102, 156 }, { 190, 153, 153 }, { 180, 165, 180 }, { 150, 100, 100 }, { 153, 153, 153 },
        { 250, 170, 30 }, { 220, 220, 0 }, { 107, 142, 35 }, { 192, 128, 128 }, { 70, 130, 180 }, { 220, 20, 60 },
        { 0, 0, 142 }, { 0, 0, 70 }, { 119, 11, 32 }
};

// Element count: total array size divided by the size of one entry
int nClasses = sizeof(aColorClasses) / sizeof(aColorClasses[0]);

// Read CNN definition

auto net = cv::dnn::readNetFromONNX(ONNX_NETWORK_DEFINITION);

// Read input image

cv::Mat matFrame = cv::imread(IMAGE_FILE);
if (matFrame.empty()) {
    std::cerr << "Cannot open image file " << IMAGE_FILE << std::endl;
    return -1;
}

### Python

In [None]:
import cv2

# Pascal VOC class names, indexed by class ID
aStringClasses = [
    "background", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"
]

aColorClasses = [
        ( 0, 0, 0 ), ( 255, 0, 0 ), ( 0, 255, 0 ), ( 0, 255, 120 ), ( 0, 0, 255 ), ( 255, 0, 255 ), ( 70, 70, 70 ),
        ( 102, 102, 156 ), ( 190, 153, 153 ), ( 180, 165, 180 ), ( 150, 100, 100 ), ( 153, 153, 153 ),
        ( 250, 170, 30 ), ( 220, 220, 0 ), ( 107, 142, 35 ), ( 192, 128, 128 ), ( 70, 130, 180 ), ( 220, 20, 60 ),
        ( 0, 0, 142 ), ( 0, 0, 70 ), ( 119, 11, 32 )
]

nClasses = len(aColorClasses)

# Read CNN definition (ONNX_NETWORK_DEFINITION is the path to the .onnx file)
net = cv2.dnn.readNetFromONNX(ONNX_NETWORK_DEFINITION)

# Read input image
matFrame = cv2.imread(IMAGE_FILE)
if matFrame is None:
    print("Cannot open image file", IMAGE_FILE)
    raise SystemExit(-1)  # 'return' is only valid inside a function

## Run an image through the network

Once the input image is preprocessed to form a tensor suitable for input to our DNN model, we can just set the input layer of the network to point to the newly preprocessed image tensor, then we can do a forward pass through the network model:

### C++

In [None]:
// Propagate the matInputTensor through the FCN model

net.setInput(matInputTensor);
cv::Mat matScore = net.forward();

### Python

In [None]:
# Propagate the matInputTensor through the FCN model

net.setInput(matInputTensor)
matScore = net.forward()

OK, but how do we do that preprocessing before we feed the image to the model?

Here's the thing: most CNN models (even object detection and image segmentation models) are based on a classification model as the "backbone" of the model. In the case of FCN-ResNet-18, the backbone is of course ResNet-18.

Usually, training a model based on a classifier begins by loading weights trained for classification on ImageNet or another dataset then further training and/or "fine tuning" the model on a more specific dataset. We do this because we don't want to spend the week of GPU time it takes to get a model that analyzes image edges, puts them together into higher-level shapes, and gradually extracts a set of coarsely localized features that describe objects of interest.

When data scientists train a model on ImageNet, they almost always perform a few common steps:

1. Scale the input image's R, G, and B intensities (normally in the range 0-255) to the range 0-1.
2. Scale the input image to the size needed for the classifier, or sample a patch from the input with the size required by the classifier.
3. Subtract expected mean values for the R, G, and B channels. The magic values for ImageNet are 0.485 for R, 0.456 for G, and 0.406 for B.
4. Divide by expected standard deviations for the R, G, and B channels. The magic values for ImageNet are 0.229 for R, 0.224 for G, and 0.225 for B.
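
The arithmetic in steps 1, 3, and 4 can be sketched with NumPy on a synthetic image. This is only an illustration under some assumptions: the helper name `normalize_imagenet` is hypothetical, the image is assumed to be already resized (step 2), and it is assumed to be in RGB channel order (OpenCV loads images as BGR, so a channel swap would be needed first).

```python
import numpy as np

# ImageNet statistics quoted above, in R, G, B order
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_imagenet(rgb_image):
    """Hypothetical helper: HxWx3 uint8 RGB image -> 1x3xHxW float32 tensor."""
    x = rgb_image.astype(np.float32) / 255.0   # step 1: scale to 0-1
    x = (x - IMAGENET_MEAN) / IMAGENET_STD     # steps 3 and 4, per channel
    x = np.transpose(x, (2, 0, 1))             # HWC -> CHW layout
    return x[np.newaxis, ...]                  # add the batch dimension


# Synthetic all-white 320x320 image standing in for a real frame
image = np.full((320, 320, 3), 255, dtype=np.uint8)
tensor = normalize_imagenet(image)
print(tensor.shape)  # (1, 3, 320, 320)
```

The per-channel broadcast works because the channel axis is last when the mean and standard deviation are applied; the transpose to channel-first layout happens only afterwards.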

There is an OpenCV function <code>cv::dnn::blobFromImage()</code> that performs some of these steps but not all. Check the documentation and add the necessary code to prepare the image for presentation for the pre-trained network.

Next, you'll want to use the result of the semantic segmentation model to color the input image and display for examination by the user, and perhaps add some information about CPU time required for the inference:

### C++

In [None]:
// Colorize the image and display

cv::Mat matColored;
colorizeSegmentation(matFrame, matScore, matColored, aColorClasses, aStringClasses, nClasses);

// Add timing information

std::vector<double> layersTimes;
double freq = cv::getTickFrequency() / 1000;
double t = net.getPerfProfile(layersTimes) / freq;
std::string label = cv::format("Inference time: %.2f ms", t);
cv::putText(matColored, label, cv::Point(10,30),
        cv::FONT_HERSHEY_SIMPLEX, 1.0, cv::Scalar(0, 255, 0));

// Display

cv::namedWindow(WINDOW_NAME, WINDOW_FLAGS);
cv::imshow(WINDOW_NAME, matColored);
cv::waitKey(0);

return 0;

### Python

In [None]:
# Colorize the image and display
# colorizeSegmentation() is a function you write yourself; it is not part of cv2

matColored = colorizeSegmentation(matFrame, matScore, aColorClasses, aStringClasses, nClasses)

# Add timing information
# In Python, getPerfProfile() returns a tuple: (total ticks, per-layer ticks)
freq = cv2.getTickFrequency() / 1000
t, layersTimes = net.getPerfProfile()
label = "Inference time: %.2f ms" % (t / freq)
cv2.putText(matColored, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0))

# Display
cv2.namedWindow(WINDOW_NAME, WINDOW_FLAGS)
cv2.imshow(WINDOW_NAME, matColored)
cv2.waitKey(0)

You'll have to figure out `colorizeSegmentation` for yourself. See if you can get a result similar to the image above. It won't be exactly the same, as the Jetson Inference code scales the input image slightly differently from `blobFromImage()`. I got the following:

<img src="img/lab04-3.png" width="600"/>
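
As a hint for `colorizeSegmentation`, the core idea is to pick the highest-scoring class in each output cell and look up its color. The sketch below shows only that step on a tiny synthetic score tensor; the function name `scores_to_color_map` and the three-color palette are made up for illustration, and resizing the result to the frame size and blending it with the input image are left out.

```python
import numpy as np


def scores_to_color_map(scores, palette):
    """Hypothetical helper.

    scores:  1xCxHxW array of per-class scores
    palette: list of C color triples
    Returns an HxWx3 uint8 image painted with each cell's best class color.
    """
    class_ids = np.argmax(scores[0], axis=0)    # HxW map of winning class IDs
    lut = np.asarray(palette, dtype=np.uint8)   # Cx3 lookup table
    return lut[class_ids]                       # fancy indexing -> HxWx3


# Tiny synthetic example: 3 classes on a 2x2 grid
palette = [(0, 0, 0), (255, 0, 0), (0, 255, 0)]
scores = np.zeros((1, 3, 2, 2), dtype=np.float32)
scores[0, 1, 0, 0] = 5.0   # class 1 wins at the top-left cell
scores[0, 2, 1, 1] = 5.0   # class 2 wins at the bottom-right cell

color_map = scores_to_color_map(scores, palette)
print(color_map.shape)
```

With the real model, the score tensor is 1x21x10x10, so the resulting 10x10 color map still has to be scaled up to the size of the input frame before it can be overlaid on it.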

## Fine tuning on our own dataset

Since the stock model was trained on VOC 2012's 21 classes, it is unable to give a recognizable segmentation on our data set.
So here, we will want to fine-tune the FCN-ResNet-18 model on our own floor/obstacle dataset. We will load the existing weights, throw away the 21-class output layer of the existing model, and replace it with our own two-class output layer. We'll keep the weights for all but this last layer, then start training on our dataset.

Also, since we will be training on a medium-sized dataset (VOC), we should set up our machine learning model development environment to use a powerful GPU server rather than our poor little laptops.

## GPU server setup

This lab should work fine on the AIT DS&AI JupyterHub
server. [You can login here](https://puffer.cs.ait.ac.th/hub/login?next=).

You might instead want to create a custom environment according to
[the RTML GPU setup guide](https://github.com/dsai-asia/RTML/blob/main/Labs/01-Setup/01-Setup.ipynb),
but this requires quite a bit of work.

## Training Scripts

Download the [training scripts and data for this lab](https://drive.google.com/file/d/1ihaFWQLTsFPpzAfAHfhrt1cR4WEwhH7r/view).

Move the code and data into your project directory and take a look at `train.py` and the code it uses. The code is originally from torchvision but modified by Dustin Franklin at NVIDIA for some smaller models that will run on the Jetson Nano, especially FCN-ResNet-18.

## Train on Pascal VOC

Use train.py to learn an initial model from the Pascal VOC 2012 (21 class) dataset. You'll want to train for about 100 epochs and take the model with the best IoU on the validation set. Expect about 52% IoU.

If you configured the runtime environment as above, the model weights after each iteration as well as the best model will be saved to /workspace in the container, which is mapped to \$HOME/workspace on the host (puffer in our case).

## Transfer learning

Next, we fine-tune on our robot floor dataset.

Take a look at <code>retrain.py</code>. This script loads the pre-trained FCN-ResNet-18 we built in the previous step and fine-tunes it on the robot floor dataset. Take a look at the code and play around with it, either fine-tuning only the fresh output layer or also tuning the layers already trained on VOC. You'll also want to experiment with the relative weighting of floor vs. obstacle classes in the loss function. If you don't give obstacles (the rare class) a high weight, the model will learn to classify everything as floor!
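
To see why the class weighting matters, here is a small pure-Python sketch of a weighted cross-entropy over a handful of pixels. It is not the code from the training scripts: the function name, the toy probabilities, and the weights are all made up for illustration, and the loss is normalized by the sum of the weights, which is one common convention.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean cross-entropy over pixels (hypothetical helper).

    probs:  list of per-pixel [p_floor, p_obstacle] probabilities
    labels: list of true class IDs (0 = floor, 1 = obstacle)
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    return total / weight_sum


# Toy scene: 9 floor pixels, 1 obstacle pixel.
# The "lazy" model says floor with probability 0.9 everywhere.
labels = [0] * 9 + [1]
lazy_probs = [[0.9, 0.1]] * 10

unweighted = weighted_cross_entropy(lazy_probs, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(lazy_probs, labels, [1.0, 9.0])
print(unweighted, weighted)
```

With equal weights, the single misclassified obstacle pixel barely moves the loss, so the lazy model looks acceptable. Raising the obstacle weight makes the same mistake much more expensive, which pushes training away from the all-floor solution.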

I didn't put the code to save the model weights here. To get your model saved, read and understand the two scripts `train.py` and `fine_tune.py` and add the code to save your model to the fine tuning script.

Give some thought to what criteria you should use for the "best model so far." If you only cared about per-pixel accuracy, you would probably just classify all pixels as "floor." What you probably want to maximize is the mean of the IoU for the two classes.
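
The difference between the two criteria is easy to check on a toy example. The sketch below is a minimal pure-Python illustration, with a made-up helper name and made-up labels, that compares per-pixel accuracy with mean IoU for a model that labels everything as floor.

```python
def per_class_iou(pred, truth, num_classes):
    """Intersection over union for each class (hypothetical helper)."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        ious.append(inter / union if union else float("nan"))
    return ious


# Toy ground truth: 9 floor pixels (class 0), 1 obstacle pixel (class 1)
truth = [0] * 9 + [1]
all_floor = [0] * 10   # a model that always answers "floor"

accuracy = sum(p == t for p, t in zip(all_floor, truth)) / len(truth)
ious = per_class_iou(all_floor, truth, 2)
mean_iou = sum(ious) / len(ious)
print(accuracy, ious, mean_iou)  # 0.9 [0.9, 0.0] 0.45
```

The all-floor model scores 90% accuracy but only 0.45 mean IoU, because the obstacle class contributes an IoU of zero. That is the behavior you want from a "best model so far" criterion.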

## Export to ONNX and run under OpenCV DNN

Once that's all working, export the model to ONNX using the provided script and see if you can get it running on OpenCV DNN.

## Put it all together!

Finally, you should be able to display, frame by frame, the input image with floor/non-floor pixels identified and a bird's-eye view map of the obstacle-free space around the robot (using the homography you saved earlier).
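
As a reminder of what the homography does, the sketch below maps a single image point to the ground plane in pure Python. The matrix values are made up for illustration and the helper name is hypothetical; your own saved homography would take their place.

```python
def apply_homography(H, x, y):
    """Map image point (x, y) through a 3x3 homography H (nested lists)."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return (xh / w, yh / w)   # divide out the homogeneous coordinate


# Made-up homography, for illustration only
H_example = [
    [1.0, 0.0, 10.0],
    [0.0, 1.0, 20.0],
    [0.0, 0.0, 2.0],
]

ground_point = apply_homography(H_example, 30.0, 40.0)
print(ground_point)  # (20.0, 30.0)
```

For whole frames you would not loop over pixels like this; OpenCV's `warpPerspective` applies the same mapping to an entire image, including the floor mask produced by the segmentation model.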

Post a video of your result to Piazza before the next lab, and turn in a report describing your experiments and results.
