2. The Segmentation Network and Architecture

Semantic Segmentation

Semantic segmentation is the task of assigning meaning to parts of an object. This can be done at the pixel level, where we assign each pixel to a target class such as road, car, pedestrian, or sign. Semantic segmentation gives us information about every pixel in the image rather than just slicing the scene into bounding boxes. This task belongs to a field known as scene understanding, which is especially relevant to autonomous vehicles: full scene understanding improves perception, which enables vehicles to make better decisions.
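
As a concrete illustration of pixel-wise labeling (a hand-rolled example, not code from this project), a small label image of class indices can be one-hot encoded into the per-class mask a segmentation network typically predicts:

import numpy as np

# Hypothetical 2x3 label image: 0 = road, 1 = car, 2 = pedestrian
labels = np.array([[0, 1, 1],
                   [0, 2, 1]])

num_classes = 3
# One-hot encode to shape (height, width, num_classes): one channel per class
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)  # (2, 3, 3)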

Semantic Segmentation

Bounding Boxes

The bounding box is a simple method of scene understanding compared to segmentation. A neural network determines where an object is and draws a tight box around it. There are already significant open-source, state-of-the-art solutions, such as the YOLO and SSD models. These models perform extremely well even at high frame rates, and they're useful for detecting objects such as cars, people, and traffic lights in the scene.

However, bounding boxes have their limits. Imagine drawing a bounding box around a curvy road, a forest, or the sky; it quickly becomes problematic or even impossible to convey the true shape of an object. At best, bounding boxes can only achieve partial scene understanding, which is why we use semantic segmentation in this project.

Bounding Boxes not Working

A fully convolutional network (FCN) is used to train the semantic segmentation model. It contains three encoder blocks, a 1x1 convolution layer, and three symmetrical decoder blocks.

Fully Convolutional Network code snippet
def fcn_model(inputs, num_classes):

    # Add Encoder Blocks.
    filters = 96
    encoder_one = encoder_block(inputs, filters, 2)
    encoder_two = encoder_block(encoder_one, filters * 2, 2)
    encoder_three = encoder_block(encoder_two, filters * 4, 2)

    # Add 1x1 Convolution layer using conv2d_batchnorm().
    one_to_one_convolution = conv2d_batchnorm(encoder_three, filters * 4, kernel_size=1, strides=1)

    # Add the same number of Decoder Blocks as Encoder Blocks.
    decoder_three = decoder_block(one_to_one_convolution, encoder_two, filters * 4)
    decoder_two = decoder_block(decoder_three, encoder_one, filters * 2)
    decoder_one = decoder_block(decoder_two, inputs, filters)

    # Return the output layer of the model: a pixel-wise softmax over the
    # classes, applied to decoder_one, the output of the last decoder_block().
    return layers.Conv2D(num_classes, 3, activation='softmax', padding='same')(decoder_one)
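
A usage sketch (the input shape is an assumption for illustration; encoder_block, decoder_block, and conv2d_batchnorm are defined elsewhere in the notebook, and the import assumes a modern tf.keras setup rather than the project's exact 2018 imports):

from tensorflow.keras import layers, models

image_shape = (160, 160, 3)  # assumed input size for illustration
num_classes = 3              # the network segments three classes

inputs = layers.Input(shape=image_shape)
outputs = fcn_model(inputs, num_classes)
model = models.Model(inputs=inputs, outputs=outputs)
model.summary()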

1x1 Convolution Layer vs Fully Connected Layer

A typical convolutional neural network might consist of a series of convolution layers followed by fully connected layers and, ultimately, a softmax activation function. It's a great architecture for a classification task. For example: is this a picture of a hotdog?

Convolutional Neural Network

But the question we want to answer is: where in the picture is the hotdog? This question is much more difficult to answer, since fully connected layers don't preserve spatial information. However, if we change the C from connected to convolutional, we can integrate convolutions directly into the layer to create a fully convolutional network. This helps us answer the where-is-the-hotdog question, because convolutions preserve spatial information throughout the entire network. Additionally, since convolutional operations fundamentally don't care about the size of the input, a fully convolutional network will work on images of any size.
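
To make the difference concrete, here is a minimal sketch (a hand-rolled illustration with assumed shapes, not code from this project) showing that a 1x1 convolution acts like a fully connected layer applied at every spatial location, while keeping the height and width dimensions:

from tensorflow.keras import layers

# A feature map of shape (batch, 8, 8, 256); shapes assumed for illustration
feature_map = layers.Input(shape=(8, 8, 256))

# Fully connected: flattening destroys the spatial layout
flat = layers.Flatten()(feature_map)
dense_out = layers.Dense(64)(flat)  # shape: (batch, 64)

# 1x1 convolution: the same linear map per location, spatial layout preserved
conv_out = layers.Conv2D(64, kernel_size=1)(feature_map)  # shape: (batch, 8, 8, 64)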

Fully Convolutional Network

Fully convolutional networks have achieved state-of-the-art results in computer vision tasks such as semantic segmentation. FCNs take advantage of three special techniques:

  1. Replacing fully connected layers with 1x1 convolutional layers
  2. Up-sampling through the use of transposed convolutional layers (sketched below)
  3. Skip connections, which allow the network to use information from multiple resolution scales. As a result, the network is able to make more precise segmentation decisions.
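
As a minimal sketch of technique 2 (an illustration under assumed shapes, not this project's decoder, which uses bilinear upsampling instead), a transposed convolution learns to up-sample a coarse feature map:

from tensorflow.keras import layers

# A coarse feature map: (batch, 20, 20, 128); shapes assumed for illustration
coarse = layers.Input(shape=(20, 20, 128))

# Transposed convolution doubles the spatial resolution to (batch, 40, 40, 64)
upsampled = layers.Conv2DTranspose(64, kernel_size=2, strides=2, padding='same')(coarse)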

No Skip Connection

Skip Connection

Encoder and Decoder

Structurally, an FCN usually consists of two parts: an encoder and a decoder.

  • The encoder is a series of convolutional layers, such as those found in VGG or ResNet. The goal of the encoder is to extract features from the image.
  • The decoder up-scales the output of the encoder so that it's the same size as the original image. This yields a segmentation prediction for each individual pixel in the original image.
Encoder code snippet
def encoder_block(input_layer, filters, strides):

    # Create a separable convolution layer using the separable_conv2d_batchnorm() function.
    output_layer = separable_conv2d_batchnorm(input_layer, filters, strides)
    return output_layer

One separable convolution layer is used for each encoder. The encoding layers use convolution to help the network find objects in an image regardless of where they are located.
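
The separable_conv2d_batchnorm() helper is defined elsewhere in the notebook; a plausible sketch of it (an assumption, not the project's exact code) pairs a depthwise-separable convolution with batch normalization:

from tensorflow.keras import layers

def separable_conv2d_batchnorm(input_layer, filters, strides=1):
    # Depthwise-separable convolution: far fewer parameters than a standard convolution
    output_layer = layers.SeparableConv2D(filters, kernel_size=3, strides=strides,
                                          padding='same', activation='relu')(input_layer)
    # Batch normalization stabilizes and speeds up training
    output_layer = layers.BatchNormalization()(output_layer)
    return output_layer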

Decoder code snippet
def decoder_block(small_ip_layer, large_ip_layer, filters):

    # Upsample the small input layer using the bilinear_upsample() function.
    upsample = bilinear_upsample(small_ip_layer)
    # Concatenate the upsampled and large input layers using layers.concatenate.
    concatenate_upsample = layers.concatenate([upsample, large_ip_layer])
    # Add two separable convolution layers.
    output_layer = separable_conv2d_batchnorm(concatenate_upsample, filters, 1)
    output_layer = separable_conv2d_batchnorm(output_layer, filters, 1)
    return output_layer

One bilinear upsampling layer, a layer concatenation step, and two separable convolution layers are used for each decoder. The decoder layers help locate each identified object, all the way down to the pixel level.
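
Like the encoder helper, bilinear_upsample() is defined elsewhere in the notebook; a plausible sketch (an assumption for illustration) doubles the spatial resolution with non-learned bilinear interpolation:

from tensorflow.keras import layers

def bilinear_upsample(input_layer):
    # Double height and width via bilinear interpolation (no learned weights)
    return layers.UpSampling2D(size=(2, 2), interpolation='bilinear')(input_layer)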

Network Architecture

The goal of the encoder is to extract features from the image. It does this through a series of layers that find simple patterns first and then gradually learn to understand more and more complex structures and shapes in the deeper layers. Next, the 1x1 convolution layer implements the same function as a fully connected layer while preserving spatial information. The 1x1 convolution layer connects to the decoder, whose goal is to up-scale the output of the encoder so that it's the same size as the original image. In addition, there are skip connections, which connect non-adjacent layers together; here, for example, the output of the first encoder is connected to the input of the final decoder. Skip connections preserve information that might otherwise be lost during the encoding process, so the network is able to make more precise segmentation decisions. Finally, the last decoder stage feeds a convolutional output layer with softmax activation, which makes the final pixel-wise segmentation among the three classes. One bilinear upsampling layer, a layer concatenation step, and two separable convolution layers are used for each decoder, but they are not shown in the diagram to keep it simple.

Network Architecture

Hyperparameters

The hyperparameters used in this project are:

learning_rate = 0.01
batch_size = 15
num_epochs = 60
steps_per_epoch = 200
validation_steps = 50
workers = 2
  • learning_rate: set to 0.01; the network had no problem with that value.
  • batch_size: the number of training samples or images that get propagated through the network in a single pass. It is set to 15 because one of the two Nvidia GTX 1070 GPUs kept crashing from low memory at larger sizes.
  • num_epochs: the number of times the entire training dataset gets propagated through the network. This value is set to 60.
  • steps_per_epoch: the number of batches of training images that go through the network in one epoch. This value is set to 200.
  • validation_steps: the number of batches of validation images that go through the network in one epoch. This is similar to steps_per_epoch, except validation_steps is for the validation dataset. This value is set to 50.
  • workers: the maximum number of processes to spin up. This can affect training speed and depends on your hardware. This value is set to 2.
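
As a sketch of how these values might be wired into training (assuming a tf.keras 2.x workflow where fit() accepts generators and workers; train_gen and val_gen are hypothetical generators, not names from the project):

from tensorflow.keras import optimizers

model.compile(optimizer=optimizers.Adam(learning_rate=learning_rate),
              loss='categorical_crossentropy')

model.fit(train_gen,                          # hypothetical training generator
          steps_per_epoch=steps_per_epoch,
          epochs=num_epochs,
          validation_data=val_gen,            # hypothetical validation generator
          validation_steps=validation_steps,
          workers=workers)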

With two NVIDIA GTX 1070 GPUs and the hyperparameter settings above, it took almost three hours to train the model.

Note: See the model_training Jupyter Notebook located in the code folder, or the HTML page located in the html folder, for more information about the implementation of the segmentation network, and see the Results and Limitations page for results.