🏷️sec_fcn
As discussed in :numref:sec_semantic_segmentation
,
semantic segmentation
classifies images in pixel level.
A fully convolutional network (FCN)
uses a convolutional neural network to
transform image pixels to pixel classes :cite:Long.Shelhamer.Darrell.2015
.
Unlike the CNNs that we encountered earlier
for image classification
or object detection,
a fully convolutional network
transforms
the height and width of intermediate feature maps
back to those of the input image:
this is achieved by
the transposed convolutional layer
introduced in :numref:sec_transposed_conv
.
As a result,
the classification output
and the input image
have a one-to-one correspondence
in pixel level:
the channel dimension at any output pixel
holds the classification results
for the input pixel at the same spatial position.
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, image, init, np, npx
from mxnet.gluon import nn
npx.set_np()
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
Here we describe the basic design of the fully convolutional network model.
As shown in :numref:fig_fcn
,
this model first uses a CNN to extract image features,
then transforms the number of channels into
the number of classes
via a sec_transposed_conv
.
As a result,
the model output has the same height and width as the input image,
where the output channel contains the predicted classes
for the input pixel at the same spatial position.
Below, we [use a ResNet-18 model pretrained on the ImageNet dataset to extract image features]
and denote the model instance as pretrained_net
.
The last few layers of this model
include a global average pooling layer
and a fully-connected layer:
they are not needed
in the fully convolutional network.
pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
pretrained_net.features[-3:], pretrained_net.output
#@tab pytorch
pretrained_net = torchvision.models.resnet18(pretrained=True)
list(pretrained_net.children())[-3:]
Next, we [create the fully convolutional network instance net
].
It copies all the pretrained layers in the ResNet-18
except for the final global average pooling layer
and the fully-connected layer that are closest
to the output.
net = nn.HybridSequential()
for layer in pretrained_net.features[:-2]:
net.add(layer)
#@tab pytorch
net = nn.Sequential(*list(pretrained_net.children())[:-2])
Given an input with height and width of 320 and 480 respectively,
the forward propagation of net
reduces the input height and width to 1/32 of the original, namely 10 and 15.
X = np.random.uniform(size=(1, 3, 320, 480))
net(X).shape
#@tab pytorch
X = torch.rand(size=(1, 3, 320, 480))
net(X).shape
Next, we [use a sec_padding
.
Since
num_classes = 21
net.add(nn.Conv2D(num_classes, kernel_size=1),
nn.Conv2DTranspose(
num_classes, kernel_size=64, padding=16, strides=32))
#@tab pytorch
num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
kernel_size=64, padding=16, stride=32))
We already know that transposed convolutional layers can increase the height and width of feature maps. In image processing, we may need to scale up an image, i.e., upsampling. Bilinear interpolation is one of the commonly used upsampling techniques. It is also often used for initializing transposed convolutional layers.
To explain bilinear interpolation,
say that
given an input image
we want to
calculate each pixel
of the upsampled output image.
In order to calculate the pixel of the output image
at coordinate
Upsampling of bilinear interpolation
can be implemented by the transposed convolutional layer
with the kernel constructed by the following bilinear_kernel
function.
Due to space limitations, we only provide the implementation of the bilinear_kernel
function below
without discussions on its algorithm design.
def bilinear_kernel(in_channels, out_channels, kernel_size):
factor = (kernel_size + 1) // 2
if kernel_size % 2 == 1:
center = factor - 1
else:
center = factor - 0.5
og = (np.arange(kernel_size).reshape(-1, 1),
np.arange(kernel_size).reshape(1, -1))
filt = (1 - np.abs(og[0] - center) / factor) * \
(1 - np.abs(og[1] - center) / factor)
weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
weight[range(in_channels), range(out_channels), :, :] = filt
return np.array(weight)
#@tab pytorch
def bilinear_kernel(in_channels, out_channels, kernel_size):
factor = (kernel_size + 1) // 2
if kernel_size % 2 == 1:
center = factor - 1
else:
center = factor - 0.5
og = (torch.arange(kernel_size).reshape(-1, 1),
torch.arange(kernel_size).reshape(1, -1))
filt = (1 - torch.abs(og[0] - center) / factor) * \
(1 - torch.abs(og[1] - center) / factor)
weight = torch.zeros((in_channels, out_channels,
kernel_size, kernel_size))
weight[range(in_channels), range(out_channels), :, :] = filt
return weight
Let us [experiment with upsampling of bilinear interpolation]
that is implemented by a transposed convolutional layer.
We construct a transposed convolutional layer that
doubles the height and weight,
and initialize its kernel with the bilinear_kernel
function.
conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)
conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))
#@tab pytorch
conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));
Read the image X
and assign the upsampling output to Y
. In order to print the image, we need to adjust the position of the channel dimension.
img = image.imread('../img/catdog.jpg')
X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0) / 255
Y = conv_trans(X)
out_img = Y[0].transpose(1, 2, 0)
#@tab pytorch
img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()
As we can see, the transposed convolutional layer increases both the height and width of the image by a factor of two.
Except for the different scales in coordinates,
the image scaled up by bilinear interpolation and the original image printed in :numref:sec_bbox
look the same.
d2l.set_figsize()
print('input image shape:', img.shape)
d2l.plt.imshow(img.asnumpy());
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img.asnumpy());
#@tab pytorch
d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0));
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);
In a fully convolutional network, we [initialize the transposed convolutional layer with upsampling of bilinear interpolation. For the
W = bilinear_kernel(num_classes, num_classes, 64)
net[-1].initialize(init.Constant(W))
net[-2].initialize(init=init.Xavier())
#@tab pytorch
W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);
We read
the semantic segmentation dataset
as introduced in :numref:sec_semantic_segmentation
.
The output image shape of random cropping is
specified as
#@tab all
batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
Now we can train our constructed fully convolutional network. The loss function and accuracy calculation here are not essentially different from those in image classification of earlier chapters. Because we use the output channel of the transposed convolutional layer to predict the class for each pixel, the channel dimension is specified in the loss calculation. In addition, the accuracy is calculated based on correctness of the predicted class for all the pixels.
num_epochs, lr, wd, devices = 5, 0.1, 1e-3, d2l.try_all_gpus()
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
net.collect_params().reset_ctx(devices)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
{'learning_rate': lr, 'wd': wd})
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
#@tab pytorch
def loss(inputs, targets):
return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)
num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
When predicting, we need to standardize the input image in each channel and transform the image into the four-dimensional input format required by the CNN.
def predict(img):
X = test_iter._dataset.normalize_image(img)
X = np.expand_dims(X.transpose(2, 0, 1), axis=0)
pred = net(X.as_in_ctx(devices[0])).argmax(axis=1)
return pred.reshape(pred.shape[1], pred.shape[2])
#@tab pytorch
def predict(img):
X = test_iter.dataset.normalize_image(img).unsqueeze(0)
pred = net(X.to(devices[0])).argmax(dim=1)
return pred.reshape(pred.shape[1], pred.shape[2])
To [visualize the predicted class] of each pixel, we map the predicted class back to its label color in the dataset.
def label2image(pred):
colormap = np.array(d2l.VOC_COLORMAP, ctx=devices[0], dtype='uint8')
X = pred.astype('int32')
return colormap[X, :]
#@tab pytorch
def label2image(pred):
colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
X = pred.long()
return colormap[X, :]
Images in the test dataset vary in size and shape. Since the model uses a transposed convolutional layer with stride of 32, when the height or width of an input image is indivisible by 32, the output height or width of the transposed convolutional layer will deviate from the shape of the input image. In order to address this issue, we can crop multiple rectangular areas with height and width that are integer multiples of 32 in the image, and perform forward propagation on the pixels in these areas separately. Note that the union of these rectangular areas needs to completely cover the input image. When a pixel is covered by multiple rectangular areas, the average of the transposed convolution outputs in separate areas for this same pixel can be input to the softmax operation to predict the class.
For simplicity, we only read a few larger test images,
and crop a
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
crop_rect = (0, 0, 480, 320)
X = image.fixed_crop(test_images[i], *crop_rect)
pred = label2image(predict(X))
imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);
#@tab pytorch
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
crop_rect = (0, 0, 320, 480)
X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
pred = label2image(predict(X))
imgs += [X.permute(1,2,0), pred.cpu(),
torchvision.transforms.functional.crop(
test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);
- The fully convolutional network first uses a CNN to extract image features, then transforms the number of channels into the number of classes via a
$1\times 1$ convolutional layer, and finally transforms the height and width of the feature maps to those of the input image via the transposed convolution. - In a fully convolutional network, we can use upsampling of bilinear interpolation to initialize the transposed convolutional layer.
- If we use Xavier initialization for the transposed convolutional layer in the experiment, how does the result change?
- Can you further improve the accuracy of the model by tuning the hyperparameters?
- Predict the classes of all pixels in test images.
- The original fully convolutional network paper also uses outputs of some intermediate CNN layers :cite:
Long.Shelhamer.Darrell.2015
. Try to implement this idea.
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab: