# Layers

A layer in torch is a transformation that maps input tensors to output tensors.

In [1]:
import torch
from torch import nn
from math import prod

To perform a layer transformation on the tensor `X`, simply use the syntax `layer(X)`.
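As a minimal sketch of this calling convention, the snippet below uses `nn.ReLU`, chosen here only as a simple, parameter-free example layer. Calling the layer object on a tensor returns the transformed tensor.

```python
import torch
from torch import nn

# nn.ReLU is picked only as a simple example layer without parameters.
layer = nn.ReLU()
X = torch.tensor([-1.0, 0.0, 2.0])

# Calling the layer object applies its transformation to X.
result = layer(X)
print(result)
```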

## Linear

The `torch.nn.Linear` layer performs the following operation:

$$X_{n \times l} \cdot \omega_{l \times k} + b_k$$

Where:

- $l$ - number of inputs
- $k$ - number of outputs
- $n$ - number of input samples
- $X_{n \times l}$ - input tensor
- $\omega_{l \times k}$ - weight matrix of the layer
- $b_k$ - bias vector of the layer

---

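The following cell applies the layer to some data and manually performs the same transformation. Note that `linear.weight` stores the transposed matrix $\omega^T$ of shape $k \times l$, so for a single input vector the manual computation is `linear.weight @ X + linear.bias`. The results should be identical.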

In [2]:
in_features = 5
out_features = 3

linear = nn.Linear(
    in_features = in_features, 
    out_features = out_features
)

X = torch.rand(in_features)

print("Layer transformation")
print(linear(X).tolist())
print("X@w+b")
print((linear.weight@X + linear.bias).tolist())

Layer transformation
[-0.6235317587852478, -1.1335619688034058, -0.70122230052948]
X@w+b
[-0.6235317587852478, -1.1335619688034058, -0.70122230052948]
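The same check can be sketched for a batch of $n$ samples, which matches the formula $X_{n \times l} \cdot \omega_{l \times k} + b_k$ more directly. The sizes and the seed below are arbitrary assumptions made for illustration. Since `linear.weight` has shape $k \times l$, it is transposed before the multiplication.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

n, l, k = 4, 5, 3  # samples, inputs, outputs (illustrative sizes)
linear = nn.Linear(in_features=l, out_features=k)
X = torch.rand(n, l)

layer_out = linear(X)
# linear.weight has shape (k, l), so its transpose plays the role of omega.
manual_out = X @ linear.weight.T + linear.bias

print(layer_out.shape)
print(torch.allclose(layer_out, manual_out, atol=1e-6))
```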


### Define values

To set custom values, access the `weight` and `bias` attributes after creating the layer. Both are `torch.nn.Parameter` objects, so their values should be overwritten in place with the `copy_` method inside a `torch.no_grad()` context.

---

Here’s an example of how to do it:

In [3]:
linear_layer = nn.Linear(in_features=3, out_features=4)

default_weights = torch.ones_like(linear_layer.weight)
default_biases = torch.zeros_like(linear_layer.bias)

with torch.no_grad():
    linear_layer.weight.copy_(default_weights)
    linear_layer.bias.copy_(default_biases)

After completing the process, we have the `weight` tensor initialized with ones and the `bias` tensor initialized with zeros:

In [4]:
print(linear_layer.weight)
print(linear_layer.bias)

Parameter containing:
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], requires_grad=True)
Parameter containing:
tensor([0., 0., 0., 0.], requires_grad=True)
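As an alternative sketch, the in-place initializers from `torch.nn.init` can produce the same constant values without an explicit `copy_`. The layer sizes below simply mirror the example above.

```python
import torch
from torch import nn

init_layer = nn.Linear(in_features=3, out_features=4)

# The trailing underscore marks in-place initializers; they modify
# the existing parameters rather than creating new ones.
nn.init.ones_(init_layer.weight)
nn.init.zeros_(init_layer.bias)

print(init_layer.weight)
print(init_layer.bias)
```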


### dtype

By default, tensors used in `torch.nn.Linear` have a `float32` data type. This can lead to issues when processing tensors with different data types. 

---

The following cell defines a tensor with a `float16` data type. 

In [27]:
tensor_size = 3

tensor = torch.empty(
    size=(tensor_size, tensor_size), 
    dtype=torch.float16
)
tensor

tensor([[ 4.3904e+04,  3.6992e+00,  3.5763e-07],
        [ 0.0000e+00, -5.1200e+02,         nan],
        [-5.1200e+02,  4.3789e+00,  2.0000e+00]], dtype=torch.float16)

The following cell creates a layer and shows that its parameters have the `float32` data type.

In [28]:
layer = nn.Linear(tensor_size, tensor_size)
for p in layer.parameters():
    print(p.dtype)

torch.float32
torch.float32


Trying to apply the `layer` to the tensor will result in an error stating that the data types are incompatible.

In [29]:
layer(tensor)

RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float

The following cell demonstrates how to change the data type of tensors used in `nn.Linear`. After modifying the data types, you can successfully apply the `layer` to the input tensor.

In [30]:
for p in layer.parameters():
    p.data = p.data.to(torch.float16)

layer(tensor)

tensor([[-6652.0000, 24528.0000, -9664.0000],
        [       nan,        nan,        nan],
        [   76.9375,  -286.0000,   113.6875]], dtype=torch.float16,
       grad_fn=<AddmmBackward0>)
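Two other ways to resolve the mismatch can be sketched as follows: cast the input to the layer's data type, or convert the whole layer with `Module.to`. The tensor of ones is an assumption used here instead of `torch.empty` so that the values are deterministic.

```python
import torch
from torch import nn

half_input = torch.ones(3, 3, dtype=torch.float16)
float_layer = nn.Linear(3, 3)

# Option 1: cast the input to the data type of the layer's parameters.
out = float_layer(half_input.to(float_layer.weight.dtype))
print(out.dtype)

# Option 2: convert all parameters of a layer at once.
half_layer = nn.Linear(3, 3).to(torch.float16)
print([p.dtype for p in half_layer.parameters()])
```

Casting the input keeps the computation in `float32`, while converting the layer keeps it in `float16`. Which one is preferable depends on the precision and memory requirements.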

## Dropout

A dropout layer randomly sets some components of the input tensor to zero with a given probability $p$. During training, the remaining non-zero components are scaled by a factor of $ \frac{1}{1-p}$ to prevent signal attenuation. Formally, if we start with a tensor $x_i$, where $i \in \mathbb{N}^k$ represents the indices of the $k$-dimensional tensor, the output after applying dropout is given by:

$$
x'_i = x_i \cdot p_i \cdot \frac{1}{1-p},
$$

where $p_i$ is sampled from a Bernoulli distribution with parameter $1-p$, i.e., $p_i \sim \text{Bernoulli}(1-p)$, so that $p_i = 0$ (the component is dropped) with probability $p$.

---

This example demonstrates the transformation of a tensor after passing through a dropout layer. 

In [72]:
torch.manual_seed(111)

dropout_layer = nn.Dropout(p=0.3)
tensor = (torch.arange(3 * 3, dtype=float) + 1).reshape((3, 3))

print("Original tensor:")
print(tensor)
print()
print("Dropout result:")
print(dropout_layer(tensor))

Original tensor:
tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]], dtype=torch.float64)

Dropout result:
tensor([[ 1.4286,  2.8571,  0.0000],
        [ 5.7143,  0.0000,  8.5714],
        [10.0000,  0.0000, 12.8571]], dtype=torch.float64)
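The formula can be sketched manually with an explicit keep mask. This is an illustration under stated assumptions, not the internal implementation of `nn.Dropout`: the name `keep_mask` and the seed are made up for this example. The snippet also shows that the layer leaves the input unchanged in evaluation mode.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

p = 0.3
x = torch.arange(1, 10, dtype=torch.float32).reshape(3, 3)

# Each mask entry is 1 with probability 1 - p and 0 with probability p.
keep_mask = torch.bernoulli(torch.full_like(x, 1 - p))
manual = x * keep_mask / (1 - p)
print(manual)

# In evaluation mode dropout is switched off and returns the input as is.
eval_dropout = nn.Dropout(p=p)
eval_dropout.eval()
eval_out = eval_dropout(x)
print(eval_out)
```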


## Batch normalisation

This is a transformation that normalizes each feature across the batch dimension (the first dimension), which is assumed to separate different objects in the sample. It scales the data using standardization.

---

Consider a two-dimensional example where each object is represented by a row vector. 

In [3]:
objects_number = 5; features_number = 4
X = (
    torch.normal(0, 1, [objects_number, features_number]) +
    torch.arange(features_number)
)
X

tensor([[ 2.2821,  0.4937,  2.2798,  3.1782],
        [ 0.8915,  0.6284,  0.9562,  2.9096],
        [ 0.3376,  0.5322, -0.5936,  2.1784],
        [-0.2011, -0.0289,  2.5192,  3.6398],
        [ 0.0640,  0.2830,  1.4539,  3.7226]])

Let's examine standardization implemented manually: subtract the mean and divide by the standard deviation along each column.

In [48]:
(X - X.mean(axis=0))/(X.std(axis=0))

tensor([[ 1.6310,  0.4277,  0.7702,  0.0838],
        [ 0.2199,  0.9421, -0.2954, -0.3453],
        [-0.3422,  0.5747, -1.5431, -1.5133],
        [-0.8889, -1.5676,  0.9630,  0.8212],
        [-0.6199, -0.3769,  0.1053,  0.9536]])

The following cell demonstrates the application of the most straightforward normalization method, `nn.BatchNorm1d`.

In [55]:
bn = torch.nn.BatchNorm1d(num_features=features_number)
bn(X)

tensor([[ 1.8235,  0.4781,  0.8612,  0.0937],
        [ 0.2459,  1.0532, -0.3303, -0.3860],
        [-0.3826,  0.6425, -1.7252, -1.6919],
        [-0.9938, -1.7524,  1.0766,  0.9181],
        [-0.6930, -0.4213,  0.1177,  1.0661]],
       grad_fn=<NativeBatchNormBackward0>)

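We obtained results close to the manual normalization, but not identical. In training mode, `BatchNorm1d` divides by the biased standard deviation (normalized by $n$ rather than $n-1$, unlike `X.std`) and adds a small `eps` to the variance. It also applies a learnable affine transformation, which is the identity right after initialization. The following cell replicates the classical calculation: with `momentum=1`, the first forward pass stores the batch mean and the unbiased batch variance as running statistics, and after `bn.eval()` the layer normalizes with these stored values.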

In [45]:
bn = torch.nn.BatchNorm1d(num_features=features_number, eps=0, affine=False, momentum=1)
bn(X)
bn.eval()
bn(X)

tensor([[ 1.6310,  0.4277,  0.7702,  0.0838],
        [ 0.2199,  0.9421, -0.2954, -0.3453],
        [-0.3422,  0.5747, -1.5431, -1.5133],
        [-0.8889, -1.5676,  0.9630,  0.8212],
        [-0.6199, -0.3769,  0.1053,  0.9536]])

Therefore, we achieved identical numerical results to the manual implementation. 
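For the default layer in training mode, the output can be reproduced by a short sketch that uses the biased variance and the layer's `eps`. It assumes freshly initialized affine parameters (weight of ones, bias of zeros). The data is regenerated here with an arbitrary seed, so the numbers differ from the cells above.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

X = torch.randn(5, 4) + torch.arange(4)
bn = nn.BatchNorm1d(num_features=4)

# Training mode (the default): statistics come from the current batch.
bn_out = bn(X)

# Biased variance (divide by n) plus eps under the square root.
manual = (X - X.mean(dim=0)) / torch.sqrt(
    X.var(dim=0, unbiased=False) + bn.eps
)

print(torch.allclose(bn_out, manual, atol=1e-5))
```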

## Convolutional

Convolutional layers are implemented in Torch using the classes `torch.nn.Conv<D>d`, where `<D>` is the number of dimensions the kernel slides over (1, 2, or 3).

---

Consider the example of an `nn.Conv1d` layer. Suppose we want to perform convolutions with a two-element kernel on a one-channel sequence, producing a single-channel output. The following cell defines and prints the parameters of the `Conv1d` layer.

In [61]:
show_conv = nn.Conv1d(
    in_channels=1,
    out_channels=1,
    kernel_size=2
)

with torch.no_grad():
    show_conv.weight = nn.Parameter(
        torch.arange(
            prod(show_conv.weight.shape), dtype=torch.float
        ).reshape_as(show_conv.weight) + 1
    )
    show_conv.bias = nn.Parameter(torch.tensor([0.]))

print("Weight")
display(show_conv.weight)
print("Bias")
display(show_conv.bias)

Weight


Parameter containing:
tensor([[[1., 2.]]], requires_grad=True)

Bias


Parameter containing:
tensor([0.], requires_grad=True)

Here is an example of data that can be processed using the layer declared above.

In [62]:
samples_count = 5
channels_count = 1
sequence_length = 5

data = (
    torch.arange(
        samples_count * channels_count * sequence_length,
        dtype=torch.float
    )
    .reshape([samples_count, channels_count, sequence_length])
)

data

tensor([[[ 0.,  1.,  2.,  3.,  4.]],

        [[ 5.,  6.,  7.,  8.,  9.]],

        [[10., 11., 12., 13., 14.]],

        [[15., 16., 17., 18., 19.]],

        [[20., 21., 22., 23., 24.]]])

The data contains 5 samples, each a sequence of 5 elements with one input channel. The following cell applies the layer to it.

In [63]:
show_conv(data)

tensor([[[ 2.,  5.,  8., 11.]],

        [[17., 20., 23., 26.]],

        [[32., 35., 38., 41.]],

        [[47., 50., 53., 56.]],

        [[62., 65., 68., 71.]]], grad_fn=<ConvolutionBackward0>)

Check if the computation for some element matches our expectation:

$$x'_{2,3} = x_{2,3}w_{1} + x_{2,4}w_{2} + b = 7 \times 1 + 8 \times 2 + 0 = 23$$

Where:

- $x_{ij}$: $j$-th element of the sequence of the $i$-th sample.
- $x'_{ij}$: $j$-th element of the output of the $i$-th sample.
- $w_i$: $i$-th weight of the layer under consideration.
- $b$: bias of the layer under consideration.
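The check above can be extended to every output element with a short sketch that slides the window manually using `Tensor.unfold`. The layer and the data are rebuilt here so that the snippet is self-contained. Note that the formula uses indices starting at 1, while the code uses indices starting at 0.

```python
import torch
from torch import nn

conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=2)
with torch.no_grad():
    conv.weight.copy_(torch.tensor([[[1.0, 2.0]]]))
    conv.bias.zero_()

data = torch.arange(25, dtype=torch.float).reshape(5, 1, 5)

# All windows of length 2 with step 1: shape (5, 1, 4, 2).
windows = data.unfold(dimension=2, size=2, step=1)

# Weighted sum over each window plus the bias.
manual = (windows * conv.weight[0, 0]).sum(dim=-1) + conv.bias

print(torch.allclose(conv(data), manual))
# Second sample, third output element: 7 * 1 + 8 * 2 + 0
print(manual[1, 0, 2].item())
```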

## Pooling

Pooling layers aggregate different subsets of an array according to a specified function.

---

The following cell demonstrates the application of `torch.nn.MaxPool1d` to a vector. Pooling is primarily designed for convolutional networks, so the layer expects an input of shape `(C, L)` or `(N, C, L)` and pools along the last dimension within each channel. Therefore, an extra leading channel dimension is added to the input.

In [25]:
input = torch.arange(10, dtype=torch.float16)[None, :]
print("Input")
print(input)

output = torch.nn.MaxPool1d(kernel_size=3)(input)
print("Output")
print(output)

Input
tensor([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]], dtype=torch.float16)
Output
tensor([[2., 5., 8.]], dtype=torch.float16)


As a result, the values were computed as follows:

- $w_1' = \max(w_1, w_2, w_3) = 2$.
- $w_2' = \max(w_4, w_5, w_6) = 5$.
- $w_3' = \max(w_7, w_8, w_9) = 8$.

Where:

- $w'_i$ is the $i$-th element of the output.
- $w_i$ is the $i$-th element of the input.

**Note** that the last element was skipped because it couldn't form a complete kernel.
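The windows do not overlap because the `stride` of `MaxPool1d` defaults to `kernel_size`. A minimal sketch, assuming the same input in `float32`, reproduces the result by reshaping the first nine elements into windows and then shows the effect of `stride=1`.

```python
import torch
from torch import nn

x = torch.arange(10, dtype=torch.float32)[None, :]

# Default stride equals kernel_size, so windows do not overlap.
pooled = nn.MaxPool1d(kernel_size=3)(x)

# Manual version: drop the incomplete tail and take the max per window.
manual = x[:, :9].reshape(1, 3, 3).max(dim=-1).values

# With stride=1 the windows overlap and every element is covered.
overlapping = nn.MaxPool1d(kernel_size=3, stride=1)(x)

print(pooled)
print(manual)
print(overlapping)
```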

## Flatten/unflatten

The `torch.nn.Flatten` layer merges a range of dimensions into a single dimension. Conversely, `torch.nn.Unflatten` expands a specified dimension into several dimensions of a given shape. See the [dedicated page](layers/flatten_unflatten.ipynb) for more details.

---

The following cell creates an array that we'll use as an example. You can consider it as 3 three-dimensional samples.

In [26]:
input = torch.arange(81).reshape([3,3,3,3])
input.shape

torch.Size([3, 3, 3, 3])

Now let's apply the default `torch.nn.Flatten` to the array from the previous cell.

In [27]:
x = torch.nn.Flatten()(input)
x.shape

torch.Size([3, 27])

As expected, we get the same data, but each sample is now one-dimensional: by default, `Flatten` keeps the first dimension and merges all the others.

Now we can revert everything with `torch.nn.Unflatten`: we specify the second dimension (`dim=1`) to be unflattened and transform it back into three-dimensional samples.

In [30]:
output = torch.nn.Unflatten(dim=1, unflattened_size=(3,3,3))(x)
output.shape

torch.Size([3, 3, 3, 3])
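The range of merged dimensions can be controlled with the `start_dim` and `end_dim` arguments of `Flatten`. The sketch below, using the same kind of input, shows a few combinations and checks that the round trip through `Unflatten` restores the original values.

```python
import torch
from torch import nn

x = torch.arange(81).reshape(3, 3, 3, 3)

# Default: start_dim=1, end_dim=-1, so the first dimension is kept.
flat_default = nn.Flatten()(x)

# Merge everything, including the first dimension.
flat_all = nn.Flatten(start_dim=0)(x)

# Merge only dimensions 1 and 2.
flat_partial = nn.Flatten(start_dim=1, end_dim=2)(x)

restored = nn.Unflatten(dim=1, unflattened_size=(3, 3, 3))(flat_default)

print(flat_default.shape, flat_all.shape, flat_partial.shape)
print(torch.equal(restored, x))
```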