In [1]:
import torch
from torch import nn

In [2]:
# Conv2d(in_channels=1, out_channels=3): creates 3 filters, each with 4 rows and 5 columns
c = nn.Conv2d(1, 3, stride=1, kernel_size=(4, 5))
print(c.weight.shape)
print(c.weight)

torch.Size([3, 1, 4, 5])
Parameter containing:
tensor([[[[-0.1437,  0.1753, -0.0937, -0.1739, -0.0025],
          [ 0.1373, -0.0530,  0.1809, -0.0898,  0.1127],
          [ 0.0645,  0.0225,  0.1821,  0.0647,  0.0613],
          [ 0.1500, -0.0325, -0.0146, -0.0353, -0.2124]]],


        [[[-0.1619,  0.0244,  0.0756, -0.0175, -0.0116],
          [ 0.1618, -0.1654,  0.0500, -0.1326,  0.2008],
          [-0.2177,  0.0448,  0.2171,  0.1807,  0.1891],
          [-0.0334,  0.1107,  0.0766, -0.0709,  0.1484]]],


        [[[ 0.1024,  0.0317, -0.2038,  0.0486,  0.1355],
          [ 0.0859, -0.1986,  0.0337,  0.2104, -0.1045],
          [ 0.0785, -0.1585,  0.1806, -0.1438, -0.1261],
          [ 0.0785,  0.0375, -0.0441,  0.2103, -0.0319]]]], requires_grad=True)


## CNN
### Reminder:
- Input shape: $n_h\times n_w$
- Kernel shape: $k_h\times k_w$
- Output shape: $(n_h-k_h+1) \times (n_w-k_w+1)$. Since kernels are usually larger than $1 \times 1$, the output is smaller than the input
- Example: a $240 \times 240$ pixel input passed through $10$ layers of $5 \times 5$ convolutions shrinks to $200 \times 200$ pixels, because each layer removes $4$ pixels per dimension
- ``out_channels`` is the number of feature maps the convolution produces, i.e. the **number of filters**. It is a hyperparameter, usually chosen by intuition.
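
The shape rule can be checked directly. A minimal sketch, assuming a single-channel input, no padding and stride 1; `conv_out` is a hypothetical helper introduced only for this check.

```python
import torch
from torch import nn

def conv_out(n, k):
    """Output length along one axis for stride 1 and no padding."""
    return n - k + 1

# Ten stacked 5x5 convolutions each remove 4 pixels per axis: 240 -> 200
size = 240
for _ in range(10):
    size = conv_out(size, 5)
print(size)  # 200

# Same check with real layers (batch=1, channels=1)
x = torch.zeros(1, 1, 240, 240)
layers = nn.Sequential(*[nn.Conv2d(1, 1, kernel_size=5) for _ in range(10)])
with torch.no_grad():
    y = layers(x)
print(y.shape)  # torch.Size([1, 1, 200, 200])
```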

### Padding
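- Padding counteracts this shrinking and lets us control the output size
- It adds extra pixels (usually zeros) around the boundary of the input image
- With $p_h$ rows and $p_w$ columns of padding added in total (both sides combined), the output shape is
$$(n_h-k_h+p_h+1)\times(n_w-k_w+p_w+1).$$
- Example: padding=1 (one pixel on each side, so $p_h=p_w=2$), input=$3\times 3$, kernel_size=2, output=$4 \times 4$
- **For the kernel size we usually use odd numbers such as 1, 3, 5, 7: padding $(k-1)/2$ pixels on each side then preserves the spatial dimensions exactly.**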
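
Both padding bullets can be reproduced in PyTorch. A sketch under the assumption of zero padding and stride 1; the $8 \times 8$ test input is an arbitrary choice made here.

```python
import torch
from torch import nn

x = torch.zeros(1, 1, 3, 3)

# padding=1 adds one row/column on every side, i.e. 2 in total per axis
conv_pad = nn.Conv2d(1, 1, kernel_size=2, padding=1)
out_pad = conv_pad(x)
print(out_pad.shape)  # torch.Size([1, 1, 4, 4])

# An odd kernel k with (k - 1) // 2 pixels of padding per side keeps the size
x8 = torch.zeros(1, 1, 8, 8)
same_shapes = []
for k in (1, 3, 5, 7):
    conv = nn.Conv2d(1, 1, kernel_size=k, padding=(k - 1) // 2)
    same_shapes.append(tuple(conv(x8).shape[2:]))
print(same_shapes)  # [(8, 8), (8, 8), (8, 8), (8, 8)]
```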

### Stride
- Sometimes we want to move the kernel window by more than 1 element at a time
- This is done for computational efficiency or to downsample
- The first layer of ResNet uses a strided convolution, a good example of where striding pays off. This layer alone significantly reduces the computation required in the subsequent layers. It replaces a stack of three 3x3 convolutions with a single 7x7 convolution, which has exactly the same receptive field as the three stacked layers (even though it is less powerful in terms of what it can learn).
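
A minimal sketch of such a strided first layer. The $224 \times 224$ RGB input, 64 output channels, stride 2 and padding 3 are assumptions chosen here for illustration, and `strided_out` is a hypothetical helper that mirrors PyTorch's output-size rule.

```python
import torch
from torch import nn

def strided_out(n, k, pad, s):
    """Output length along one axis; pad is the padding on each side."""
    return (n + 2 * pad - k) // s + 1

x = torch.zeros(1, 3, 224, 224)
stem = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
out = stem(x)
print(out.shape)                   # torch.Size([1, 64, 112, 112])
print(strided_out(224, 7, 3, 2))   # 112

# Receptive field of three stacked 3x3, stride-1 convolutions:
# each layer widens the field by (k - 1) pixels
rf = 1
for _ in range(3):
    rf += 3 - 1
print(rf)  # 7
```

Halving the spatial size in the first layer cuts the number of positions every later layer has to process by a factor of four.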

### Multiple Input Channels
- When the input has multiple channels, we need to construct a conv kernel with the same number of input channels
<img src="https://d2l.ai/_images/conv-multi-in.svg">
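
A small check of how the kernel shape follows the input channels, assuming a 2-channel $3 \times 3$ input and a $2 \times 2$ kernel without bias (sizes chosen here for illustration). The layer output should match the sum of the per-channel cross-correlations.

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=2, bias=False)
print(conv.weight.shape)  # torch.Size([1, 2, 2, 2])

x = torch.arange(18, dtype=torch.float32).reshape(1, 2, 3, 3)
with torch.no_grad():
    y = conv(x)
    # Cross-correlate each input channel with its own kernel slice, then sum
    manual = sum(
        nn.functional.conv2d(x[:, c:c + 1], conv.weight[:, c:c + 1])
        for c in range(2)
    )
print(y.shape)  # torch.Size([1, 1, 2, 2])
print(torch.allclose(y, manual, atol=1e-4))
```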

### 1x1 conv
- Typically used to adjust the number of channels between network layers and to control model complexity.
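
A sketch of a 1x1 convolution used to shrink the channel dimension. The channel counts (64 to 16) and the $28 \times 28$ feature map are assumptions made for this example.

```python
import torch
from torch import nn

x = torch.zeros(1, 64, 28, 28)
conv1x1 = nn.Conv2d(64, 16, kernel_size=1)
y = conv1x1(x)
print(y.shape)  # torch.Size([1, 16, 28, 28])

# Parameters: 64 * 16 weights + 16 biases
n_params = sum(p.numel() for p in conv1x1.parameters())
print(n_params)  # 1040
```

The spatial size is unchanged; only the number of channels, and with it the cost of the following layers, is reduced.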

### Pooling
- Downsample feature maps
- Convolutional layers prove very effective, and stacking convolutional layers in deep models allows layers close to the input to learn low-level features (e.g. lines) and layers deeper in the model to learn high-order or more abstract features, like shapes or specific objects.
- A limitation of the feature map output of convolutional layers is that they record the precise position of features in the input. This means that small movements in the position of the feature in the input image will result in a different feature map. This can happen with re-cropping, rotation, shifting, and other minor changes to the input image.
- A common approach to addressing this problem from signal processing is called down sampling. This is where a lower resolution version of an input signal is created that still contains the large or important structural elements, without the fine detail that may not be as useful to the task.
- Down sampling can be achieved with convolutional layers by changing the stride of the convolution across the image. A more robust and common approach is to use a pooling layer.
- A pooling layer is added after the convolutional layer, specifically after a nonlinearity (e.g. ReLU) has been applied to the feature maps output by the convolutional layer. The figure below illustrates a pooling operation:
<img src="http://d2l.ai/_images/pooling.svg">
- Max pooling keeps the strongest activation in each window and so preserves salient features such as edges, whereas average pooling takes the mean of each window and yields a smoother feature map.
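
Max and average pooling compared on a small input. A sketch assuming a $2 \times 2$ window with stride 1 on a $3 \times 3$ feature map (values chosen here for illustration).

```python
import torch
from torch import nn

x = torch.tensor([[0., 1., 2.],
                  [3., 4., 5.],
                  [6., 7., 8.]]).reshape(1, 1, 3, 3)

max_pool = nn.MaxPool2d(kernel_size=2, stride=1)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=1)

y_max = max_pool(x)
y_avg = avg_pool(x)
print(y_max.squeeze())  # tensor([[4., 5.], [7., 8.]])
print(y_avg.squeeze())  # tensor([[2., 3.], [5., 6.]])
```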

### Dropout
- Faced with more features than examples, linear models tend to overfit
- Dropout is a regularization technique to prevent overfitting.
- Simply put, dropout means ignoring a randomly chosen set of units (i.e. neurons) during the training phase.
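
A minimal sketch of dropout behaviour in training and evaluation mode, assuming a drop probability of 0.5 and an all-ones input (both chosen here for illustration). In PyTorch the surviving units are scaled by $1/(1-p)$ during training, so the expected activation stays the same.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

drop.train()
y_train = drop(x)   # each entry is either 0 (dropped) or 2 (kept and rescaled)
print(y_train)

drop.eval()
y_eval = drop(x)    # dropout is disabled at evaluation time
print(torch.equal(y_eval, x))  # True
```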

### Batch Normalization
- Together with residual blocks—covered later in Section 7.6—batch normalization has made it possible for practitioners to routinely train networks with over 100 layers.
- We can apply batch normalization after the convolution and before the nonlinear activation function.
- An alternative view is that BN after ReLU makes more sense, because the weight matrix W of the next layer then sees mean-centered data.
- A common ordering: -> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->
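
The CONV -> BatchNorm -> ReLU -> Dropout -> CONV ordering can be sketched as an `nn.Sequential` block. Channel counts, kernel sizes, the dropout rate and the input size are assumptions chosen here for illustration.

```python
import torch
from torch import nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalizes each channel over the batch
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

x = torch.randn(4, 3, 32, 32)
y = block(x)
print(y.shape)  # torch.Size([4, 32, 32, 32])

# In training mode the BatchNorm output has per-channel mean close to 0
with torch.no_grad():
    bn_out = block[1](block[0](x))
channel_means = bn_out.mean(dim=(0, 2, 3))
print(channel_means.abs().max())
```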

### Overfitting
- The CNN fits itself too closely to the training data
- The model classifies training data very well but test data poorly
- The model generalizes poorly
- Can be detected when training accuracy is high but test accuracy is low
- Can be avoided by:
- - more training data
- - reducing the number of features (dropping unneeded ones), i.e. fewer dimensions
- - keeping the model simple
- - regularization (dropout, weight decay)
- - weight decay: keeps weights small by penalizing large weight values. The larger the weights are, the larger the L2 norm, and hence the penalty, will be
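
A minimal sketch of weight decay via the `weight_decay` argument of `torch.optim.SGD`. The linear model, learning rate and decay strength are assumptions chosen for this example. The input is all zeros so that the data gradient on the weights vanishes and only the decay term acts.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.01)

l2_before = model.weight.norm().item()

# Zero inputs: the loss gradient w.r.t. the weights is 0, so the update is
# w <- w - lr * weight_decay * w = 0.999 * w
x = torch.zeros(4, 10)
target = torch.zeros(4, 1)
loss = nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()

l2_after = model.weight.norm().item()
print(l2_before, l2_after)  # the L2 norm shrinks by a factor of 0.999
```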