## Some important terms for PyTorch and deep learning with CNNs

**Tensor**: a multidimensional array used for computation. Inputs, outputs, and weights are all stored as tensors. A CNN takes an input tensor (a batch of one or more images) and produces an output tensor.

**Kernel**: a filter tensor or weight tensor, most often used in a convolution computation.

**Channel**: when we deal with 2D feature maps, the third-from-last dimension indexes the channel or depth dimension. A 2D image stored as a tensor has three dimensions: channel, row, and column.

**Feature**: depending on context, the result of a convolution operation (a feature "map"), a hand-crafted input to a model, or a unit in a linear/dense/fully-connected layer.

**Feature extraction**: the process of transforming raw data into numerical features that concentrate and/or preserve the useful information in the raw data.

**Stride**: the number of elements to step over when moving from one application of an operation to the next on an input tensor. A convolution with stride 2, for example, moves the kernel two positions at a time.

**Padding**: additional elements added around the boundaries of a tensor so that convolutions or other operations can preserve its size. Padding usually uses zeros, but other methods include copying the border values and mirror reflection.
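The padding methods mentioned above can be compared on a tiny tensor. The following is a minimal sketch using `torch.nn.functional.pad`; the 3×3 input and its values are arbitrary, chosen only so that the three modes produce visibly different borders.

```python
import torch
import torch.nn.functional as F

# A 1x1x3x3 tensor (batch, channel, row, column) holding the values 1..9.
x = torch.arange(1.0, 10.0).reshape(1, 1, 3, 3)

# Pad one element on the left, right, top, and bottom.
pad = (1, 1, 1, 1)

zero_padded = F.pad(x, pad, mode="constant", value=0.0)  # zero padding
copy_border = F.pad(x, pad, mode="replicate")            # copy the border values
mirrored = F.pad(x, pad, mode="reflect")                 # mirror reflection

# Each result has spatial size 5x5; only the border values differ.
print(zero_padded[0, 0])
print(copy_border[0, 0])
print(mirrored[0, 0])
```

In the top-left corner, zero padding gives 0, border copying repeats the nearest original value, and reflection mirrors the interior across the edge.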

In PyTorch, we do not have to specify the input size to a convolution operation, but we do need to specify the number of input and output channels and the kernel size (the stride defaults to 1 and the padding to 0 if not given). A given operation produces a specific output size that depends on the input size. For dense or fully-connected layers, however, we need to specify the number of input features as well as the number of output features, so it is necessary to understand how tensor operations affect the output tensor size.
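A short sketch may make the difference concrete. The layer sizes below (3 input channels, 8 output channels, a 10-way linear layer) are arbitrary choices for illustration, not taken from any particular network.

```python
import torch
import torch.nn as nn

# A convolution is defined by its channels and kernel, not by the input size.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, stride=1, padding=1)

# The same layer accepts inputs of different spatial sizes.
small = torch.randn(1, 3, 32, 32)
large = torch.randn(1, 3, 64, 48)
print(conv(small).shape)  # torch.Size([1, 8, 32, 32])
print(conv(large).shape)  # torch.Size([1, 8, 64, 48])

# A linear layer fixes its number of input features when it is created,
# so we must know the flattened size of the tensor feeding into it.
fc = nn.Linear(in_features=8 * 32 * 32, out_features=10)
flat = conv(small).flatten(start_dim=1)  # shape (1, 8192)
print(fc(flat).shape)  # torch.Size([1, 10])
```

Passing the flattened output of `conv(large)` to `fc` would fail, because it has 8 × 64 × 48 features rather than the 8 × 32 × 32 the linear layer was built for.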

## Calculating the number of parameters and output tensor size for tensor operations

### Convolutional layer parameters

The number of parameters $k$ in a kernel for a 2D convolution operation is

$$k = k_w k_h i_c,$$

where $k_w$ is the width of the kernel, $k_h$ is the height of the kernel, and $i_c$ is the number of input channels.
If we have $o_c$ kernels producing $o_c$ output channels, the total number of parameters/weights can be calculated as

$$n_p = k o_c = k_w k_h i_c o_c.$$

The bias weight in a convolution operation is optional. It is redundant if the convolution is followed by a normalization procedure such as
batch normalization (very common in modern networks), which applies its own learned shift, but it is important if you're not using batch normalization.
When biases are used, there is one bias per kernel, so the total number of parameters becomes

$$n_p = k_w k_h i_c o_c + o_c.$$
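The two formulas above translate directly into a small helper. This is a sketch only; the function name `conv2d_param_count` and the example layer (a 3×3 kernel, 3 input channels, 16 output channels) are made up for illustration.

```python
def conv2d_param_count(k_w: int, k_h: int, i_c: int, o_c: int, bias: bool = True) -> int:
    """Number of learnable parameters in a 2D convolution layer."""
    weights = k_w * k_h * i_c * o_c   # one k_w x k_h x i_c kernel per output channel
    biases = o_c if bias else 0       # one bias per kernel, if used
    return weights + biases


# Hypothetical layer: 3x3 kernels, 3 input channels, 16 output channels.
print(conv2d_param_count(3, 3, 3, 16, bias=False))  # 432
print(conv2d_param_count(3, 3, 3, 16, bias=True))   # 448
```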

### Fully connected layer parameters

PyTorch separates the linear operation from the nonlinear activation function in a fully connected layer. The linear operation
is called a "linear" layer, and the activation function has to be added separately. Other frameworks such as Keras use the term
"dense layer" for a fully connected layer that can include the nonlinear activation function.

The number of weights $s_w$ in a linear layer with $i_f$ input features and $o_f$ output features is

$$s_w = i_f o_f$$

or

$$s_w = i_f o_f + o_f$$

if we include a bias weight (again, not necessary if we are going to follow up with batch normalization).
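As with the convolutional case, the count is easy to compute in code. The helper name `linear_param_count` and the layer size (512 inputs, 10 outputs) are illustrative assumptions.

```python
def linear_param_count(i_f: int, o_f: int, bias: bool = True) -> int:
    """Number of learnable parameters in a linear (fully connected) layer."""
    weights = i_f * o_f          # one weight per input/output pair
    biases = o_f if bias else 0  # one bias per output feature, if used
    return weights + biases


# Hypothetical layer mapping 512 input features to 10 output features.
print(linear_param_count(512, 10, bias=False))  # 5120
print(linear_param_count(512, 10, bias=True))   # 5130
```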

It is useful to calculate the total number of parameters across all layers in a network to understand how statistically efficient it is likely to be: a model with more parameters generally needs more training data to fit well.
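In PyTorch, one common way to get this total is to sum `numel()` over a module's parameters. The tiny model below is a made-up example (one convolution followed by one linear layer), used only so the total can be checked by hand against the formulas in this section.

```python
import torch.nn as nn

# Hypothetical toy network: conv -> ReLU -> global average pool -> linear.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3*3*3*16 + 16 = 448 parameters
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # no parameters
    nn.Flatten(),
    nn.Linear(16, 10),                           # 16*10 + 10 = 170 parameters
)

# Sum the element counts of every parameter tensor in the model.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(total)      # 618
print(trainable)  # 618
```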

### Convolutional layer output tensor size

If we have an input tensor of size $w \times h \times c$ and want to perform a convolution with a $k_w \times k_h$ kernel with padding $p$ and stride $s$,
we can calculate the output tensor size as

$$\left\lfloor \frac{w+2p-k_w}{s} + 1 \right\rfloor \times \left\lfloor \frac{h+2p-k_h}{s} + 1 \right\rfloor.$$

For example, in AlexNet, the input image in the first layer is $224 \times 224 \times 3$.
A convolution of size $11 \times 11$ with padding 2 and stride 4 gives an output feature map width and height of

$$\left\lfloor \frac{w+2p-k_w}{s} + 1 \right\rfloor = \left\lfloor \frac{224+2(2)-11}{4} + 1 \right\rfloor = \lfloor 55.25 \rfloor = 55.$$
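The output-size formula can be wrapped in a small function and checked against the AlexNet example. This is a minimal sketch; the name `conv_output_size` is made up here, and the function handles one spatial dimension at a time with the same padding on both sides.

```python
def conv_output_size(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length along one spatial dimension of a convolution.

    Implements floor((size + 2*padding - kernel) / stride + 1) with integer
    floor division, which is equivalent because 1 is an integer.
    """
    return (size + 2 * padding - kernel) // stride + 1


# First AlexNet layer as described above: 224 input, 11x11 kernel, padding 2, stride 4.
out_w = conv_output_size(224, kernel=11, padding=2, stride=4)
out_h = conv_output_size(224, kernel=11, padding=2, stride=4)
print(out_w, out_h)  # 55 55
```

Calling the function once per dimension also covers non-square inputs or kernels, since width and height are computed independently.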