In [1]:
import torch

# Data Manipulation Layers

There are other layer types that perform important functions in models, but don't participate in the learning process themselves.

**Max pooling** (and its twin, min pooling) reduces a tensor by combining cells and assigning the maximum value of the input cells to the output cell. (We saw this.) For example:

In [2]:
my_tensor = torch.rand(1, 6, 6)
print(my_tensor)

maxpool_layer = torch.nn.MaxPool2d(3)
print(maxpool_layer(my_tensor))

tensor([[[0.7477, 0.1444, 0.6932, 0.6736, 0.4804, 0.3036],
         [0.0364, 0.3282, 0.0795, 0.0682, 0.7859, 0.0991],
         [0.8795, 0.0813, 0.1679, 0.9059, 0.2357, 0.1281],
         [0.5930, 0.3626, 0.5190, 0.9228, 0.5395, 0.8048],
         [0.1691, 0.2372, 0.7259, 0.7065, 0.3816, 0.0192],
         [0.3010, 0.7777, 0.5007, 0.8880, 0.1327, 0.4646]]])
tensor([[[0.8795, 0.9059],
         [0.7777, 0.9228]]])


If you look closely at the values above, you'll see that each of the values in the maxpooled output is the maximum value of each quadrant of the 6x6 input.
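To check the quadrant claim by hand, here is a minimal sketch that compares `MaxPool2d(3)` against a manual maximum over each 3x3 block. It assumes the same `(1, 6, 6)` input shape as above; the fixed seed and the `blocks` and `manual` names are only for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so repeated runs use the same tensor
my_tensor = torch.rand(1, 6, 6)

# MaxPool2d(3) uses a 3x3 window; the stride defaults to the kernel size,
# so the windows do not overlap.
maxpool_layer = torch.nn.MaxPool2d(3)
pooled = maxpool_layer(my_tensor)

# Manual version: split each 6-long axis into 2 blocks of 3,
# then take the maximum inside each block.
blocks = my_tensor.reshape(1, 2, 3, 2, 3)
manual = blocks.amax(dim=(2, 4))

print(pooled.shape)                 # torch.Size([1, 2, 2])
print(torch.equal(pooled, manual))  # True
```

Because the stride defaults to the kernel size, the four windows tile the input exactly, which is why the output has one value per quadrant.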

**Normalization layers** re-center and normalize the output of one layer before feeding it to another. Centering and scaling the intermediate tensors has a number of beneficial effects, such as letting you use higher learning rates without exploding/vanishing gradients.

In [3]:
my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)

print(my_tensor.mean())

norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)

print(normed_tensor.mean())


tensor([[[19.2188, 19.5332,  8.7880, 22.1162],
         [ 6.2174,  9.4878, 20.9356, 24.5772],
         [ 6.2270, 20.1174, 13.2483, 11.8228],
         [ 7.7914, 20.9717,  6.3650, 22.1155]]])
tensor(14.9708)
tensor([[[ 0.2547,  0.2991, -1.2173,  0.6636],
         [-0.5002, -0.3202,  0.3100,  0.5105],
         [-0.7908,  0.8667,  0.0471, -0.1230],
         [-0.7107,  0.7261, -0.8662,  0.8508]]],
       grad_fn=<NativeBatchNormBackward>)
tensor(3.7253e-08, grad_fn=<MeanBackward0>)


In the cell above, we've applied a large scaling factor and offset to an input tensor; you should see the input tensor's `mean()` somewhere in the neighborhood of 15. After running it through the normalization layer, you can see that the values are smaller and grouped around zero - in fact, the mean should be very close to zero (on the order of 1e-8).

This is beneficial because many activation functions (discussed below) have their strongest gradients near 0, but sometimes suffer from vanishing or exploding gradients for inputs that drive them far away from zero. Keeping the data centered around the area of steepest gradient will tend to mean faster, better learning and higher feasible learning rates.
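The normalization itself can be reproduced by hand. The sketch below assumes a freshly constructed `BatchNorm1d` in training mode, where the learnable scale and shift start at 1 and 0, so the layer reduces to subtracting the per-channel mean and dividing by the per-channel standard deviation. The `manual` name and the fixed seed are illustrative choices.

```python
import torch

torch.manual_seed(0)
my_tensor = torch.rand(1, 4, 4) * 20 + 5  # shape: (batch, channels, length)

norm_layer = torch.nn.BatchNorm1d(4)      # a new module starts in training mode
normed = norm_layer(my_tensor)

# Manual version: per-channel statistics over the batch and length axes.
# Batch norm normalizes with the biased variance and adds a small eps
# to the denominator for numerical stability.
mean = my_tensor.mean(dim=(0, 2), keepdim=True)
var = my_tensor.var(dim=(0, 2), unbiased=False, keepdim=True)
manual = (my_tensor - mean) / torch.sqrt(var + norm_layer.eps)

print(torch.allclose(normed, manual, atol=1e-5))  # True
print(normed.mean(dim=(0, 2)))  # each channel's mean is close to zero
```

Note that the statistics are computed per channel, so each row of the normalized output is centered on zero, not just the tensor as a whole.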

**Dropout layers** are a tool for encouraging *sparse representations* in your model - that is, pushing it to do inference with less data.

Dropout layers work by randomly setting parts of the input tensor to zero *during training* - dropout should always be turned off for inference, which in PyTorch happens when the module is put into evaluation mode with `eval()`. This forces the model to learn against this masked or reduced dataset. For example:

In [4]:
my_tensor = torch.rand(1, 4, 4)

dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))

tensor([[[0.2627, 0.7288, 0.0000, 0.0000],
         [0.4889, 0.0000, 1.4819, 0.1551],
         [0.0000, 1.2785, 1.3865, 0.6584],
         [0.0000, 0.0000, 0.0000, 0.4532]]])
tensor([[[0.2627, 0.7288, 0.5708, 1.3240],
         [0.4889, 0.0000, 1.4819, 0.1551],
         [0.0000, 0.0000, 1.3865, 0.0000],
         [0.0000, 1.0171, 1.5114, 0.4532]]])


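Above, you can see the effect of dropout on a sample tensor. The surviving values are scaled up by `1 / (1 - p)`, which is why some outputs are larger than 1 even though the inputs came from `torch.rand()`. You can use the optional `p` argument to set the probability of an individual element dropping out; if you don't, it defaults to 0.5.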
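As a small sketch of the difference between training and inference, the cell below runs the same dropout module in both modes. It uses a tensor of ones (an illustrative choice, not part of the example above) so that the rescaling of surviving elements is easy to read off.

```python
import torch

torch.manual_seed(0)
my_tensor = torch.ones(1, 4, 4)

p = 0.4
dropout = torch.nn.Dropout(p=p)

# Training mode (the default for a new module): each element is zeroed
# with probability p, and the survivors are scaled by 1 / (1 - p).
dropout.train()
dropped = dropout(my_tensor)
print(dropped)

# Evaluation mode: dropout is a no-op and the input passes through unchanged.
dropout.eval()
print(torch.equal(dropout(my_tensor), my_tensor))  # True
```

In a full model you would normally call `train()` and `eval()` on the top-level module, which propagates the mode to every dropout layer it contains.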