# Model validation

- A validation set is used to measure how well a model generalizes during training.
- It can also tell us when to stop training: when the validation loss stops decreasing, and especially when the validation loss starts increasing while the training loss is still decreasing, which signals overfitting.
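The stopping rule above can be sketched in plain Python. This is a minimal illustration, not a framework API: the function name, the `patience` parameter and the loss values are all assumptions made for the example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: improving at first, then getting worse.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch)  # 4: two epochs in a row without improving on 0.60
```

In practice you would usually also keep a copy of the model weights from the epoch with the best validation loss (epoch 2 here).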

# CNNs vs MLPs

- CNNs shine when exposed to real-world messy data.
- CNNs use relationships between pixels that are close, whereas MLPs disregard these (spatial; 2D) relationships.
- MLPs only use fully connected layers, which leads to high model complexity. CNNs use more sparsely connected layers.
- In CNNs, hidden nodes focus on segments of the image (locally connected layers), which makes them less prone to overfitting.
- Multiple hidden nodes can focus on the same segment, so a segment can be examined multiple times.

# Frequency of images

- Frequency in images is a rate of change. Images change in space, and a high-frequency image is one where the intensity changes a lot, i.e. the level of brightness changes quickly from one pixel to the next. A low-frequency image may be one that is relatively uniform in brightness or changes very slowly.
- High-frequency components also correspond to the edges of objects in images, which can help us classify those objects.

# High-pass filters

- Sharpen an image
- Enhance high-frequency parts of an image

# Convolution kernels

- A kernel is a matrix of numbers that modifies an image
- Kernel convolution is an operation applied to the input image. It relies on centering the kernel on a pixel and looking at its surrounding neighbors. How to handle edges:
    - Extend the image by copying border pixels far enough such that the filtered image has the same size as the original
    - Pad the image with a border of 0's, i.e. black pixels
    - Crop the image and skip the edges.
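The three edge-handling options can be compared with a small pure-Python sketch. The function name `convolve2d`, its `mode` argument and the example kernel (a common Laplacian-style high-pass kernel) are assumptions made for illustration. As in most deep learning frameworks, the kernel is not flipped before it is applied, which makes no difference for the symmetric kernel used here.

```python
def convolve2d(image, kernel, mode="zero"):
    """Apply a square, odd-sized `kernel` to `image` (lists of lists).

    mode="zero":   pad with zeros, output has the same size as the input
    mode="extend": pad by copying the nearest border pixel, same size
    mode="crop":   skip the edges, output shrinks by (kernel size - 1)
    """
    height, width = len(image), len(image[0])
    k = len(kernel)
    r = k // 2  # number of neighbours on each side of the centre pixel

    def pixel(y, x):
        if 0 <= y < height and 0 <= x < width:
            return image[y][x]
        if mode == "zero":
            return 0
        # "extend": clamp the coordinates to the nearest border pixel
        return image[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

    margin = r if mode == "crop" else 0
    output = []
    for y in range(margin, height - margin):
        row = []
        for x in range(margin, width - margin):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * pixel(y + i - r, x + j - r)
            row.append(total)
        output.append(row)
    return output


# High-pass kernel: its weights sum to 0, so uniform regions map to 0.
high_pass = [[0, -1, 0],
             [-1, 4, -1],
             [0, -1, 0]]

# Dark left half, bright right half: a single vertical edge.
image = [[0, 0, 10, 10] for _ in range(4)]

print(convolve2d(image, high_pass, mode="crop"))    # [[-10, 10], [-10, 10]]
print(convolve2d(image, high_pass, mode="extend"))  # each row is [0, -10, 10, 0]
```

With `mode="extend"` the response is non-zero only next to the edge. With `mode="zero"` the black padding itself creates artificial edges along the bright border of the image.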

# Convolutional layer

- Produced by applying a series of many different image filters, also known as convolutional kernels, to an input image.
- 4 different filters produce 4 differently filtered output images. When we stack these images, we form a complete convolutional layer with a depth of 4.
- NNs learn the weights of these filters.
- Collections (stacks) of filtered outputs are called **feature maps** or **activation maps**.
- Convolutional layers are only locally connected, i.e. each node is connected to only a small subset of the previous layer's nodes.

# Stride and padding

- Stride and padding are hyperparameters which control the behaviour of the convolutional layer.
- **Stride** of the filter is the amount by which the filter slides over the image. A stride of 1 makes the filter output roughly the same size as the original image.
- **Padding** includes a region (e.g. of pixel values 0) around the edge of the image in order to enable producing filter output values around the edges of the image.
- **Window size** determines the size of the filter (i.e. kernel matrix)

# Pooling layers

- Pooling layers take convolutional layers as an input, with the purpose of reducing dimensionality.
- **Max pooling** takes a stack of feature maps as input, hovers over this map with a window with a certain window size and a certain stride and takes the maximum of each window. This reduces the size of the feature map. 
- **Average pooling** takes an average of the pixel values in the window, instead of the maximum. Max pooling is better at noticing the most important details about edges and other features in an image. In some cases, however, *smoothing* might be better.
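To make the difference between the two pooling types concrete, here is a minimal pure-Python sketch for a single feature map. The function name `pool2d` and the example values are assumptions made for illustration.

```python
def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Slide a window over a 2D feature map and reduce each window to one value."""
    height, width = len(feature_map), len(feature_map[0])
    output = []
    for y in range(0, height - window + 1, stride):
        row = []
        for x in range(0, width - window + 1, stride):
            values = [feature_map[y + i][x + j]
                      for i in range(window)
                      for j in range(window)]
            if mode == "max":
                row.append(max(values))
            else:  # "average"
                row.append(sum(values) / len(values))
        output.append(row)
    return output


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 9, 4],
    [2, 0, 3, 8],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[7, 2], [2, 9]]
print(avg_pooled)  # [[4.0, 1.0], [1.0, 6.0]]
```

Both outputs are 2x2 instead of 4x4. Max pooling keeps the strongest response in each window, while average pooling smooths the window into a single mean value.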

# Capsule networks

- Pooling operations throw away some image information. This can cause issues in applications where you do need spatial information, e.g. number of features and relative locations of these features (think of facial recognition and the distinguishing features).
- **Capsule networks** provide a way to detect parts of an object and represent spatial relationships between those parts. They can recognize the same object in a variety of poses and with the typical number of features, even if they have not seen that pose in training data.
- Capsule networks are made of parent and child nodes that build up a complete picture of an object. E.g. the face is the parent, and the individual features are the child nodes.
- **Capsules** are collections of nodes, where each node contains information about a specific part, like width, orientation, colors etcetera. Each capsule outputs a vector with some magnitude and orientation:
    - **Magnitude**: probability that a part exists (i.e. the length of the vector). Should stay very high even when an object is in a different orientation. It is a normalised function of the weighted inputs to a particular capsule; a nonlinear function called **squashing**.
    - **Orientation**: State of the part properties (i.e. the direction of the vector). It will only change if one of the part properties changes (e.g. position, orientation, shape).
- The fact that the output of a capsule is a vector, with some orientation, makes it possible to use a powerful **dynamic routing** mechanism to ensure that the output of a capsule gets sent to the appropriate parent capsule in the next layer of capsules.
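The squashing step can be sketched as follows. The notes above do not spell out the formula, so this assumes the one commonly given for capsule networks, v = (|s|² / (1 + |s|²)) · s / |s|, where s is the weighted input to the capsule. The helper names are made up for the example.

```python
import math


def squash(s):
    """Shrink vector `s` to a length in [0, 1) while keeping its direction."""
    squared_norm = sum(x * x for x in s)
    if squared_norm == 0:
        return [0.0 for _ in s]
    norm = math.sqrt(squared_norm)
    scale = squared_norm / (1.0 + squared_norm) / norm
    return [scale * x for x in s]


def length(v):
    """Euclidean length of a vector, read as the probability that the part exists."""
    return math.sqrt(sum(x * x for x in v))


weak = squash([0.1, 0.0])    # short input -> length close to 0
strong = squash([3.0, 4.0])  # long input  -> length close to 1

print(round(length(weak), 4))    # 0.0099
print(round(length(strong), 4))  # 0.9615
```

The direction of the vector is unchanged by `squash`, so the orientation still encodes the state of the part while the length can be read as a probability.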

# Dynamic routing

- Process for finding the *best* connections between the output of one capsule and the inputs of the next layer. 
- Imagine a capsule network with a face as the parent and fully connected subfeatures of the face below it (e.g. mouth & nose and eyes in the next layer, and each eye, the nose and the mouth in the layer after that). Dynamic routing iteratively changes *coupling coefficients*, which are the probabilities that the output of a capsule should go to a given parent capsule. It does so by *routing by agreement*: the agreement between the output vectors of the child and the parent is measured by the dot product of these vectors.
- This is useful for determining spatial relationships between the parts. Namely, the capsule network will check whether the pose of each part (i.e. position and orientation) is in agreement, based on the vector orientations of the child and parent capsules.
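Routing by agreement can be sketched in plain Python as below. This is a simplified illustration under stated assumptions: `predictions[i][j]` stands for the vector that child capsule `i` predicts for parent capsule `j` (in a real network these come from learned weight matrices), the squashing formula is the one commonly given for capsule networks, and all function names and numbers are made up for the example.

```python
import math


def squash(s):
    """Shrink vector `s` to a length in [0, 1) while keeping its direction."""
    squared_norm = sum(x * x for x in s)
    if squared_norm == 0:
        return [0.0] * len(s)
    scale = squared_norm / (1.0 + squared_norm) / math.sqrt(squared_norm)
    return [scale * x for x in s]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route(predictions, iterations=3):
    """Return (coupling coefficients, parent output vectors)."""
    n_children, n_parents = len(predictions), len(predictions[0])
    dim = len(predictions[0][0])
    logits = [[0.0] * n_parents for _ in range(n_children)]
    for _ in range(iterations):
        # Each child distributes its output over the parents (rows sum to 1).
        coupling = [softmax(row) for row in logits]
        parents = []
        for j in range(n_parents):
            weighted_sum = [
                sum(coupling[i][j] * predictions[i][j][d] for i in range(n_children))
                for d in range(dim)
            ]
            parents.append(squash(weighted_sum))
        # Agreement: raise the logit where prediction and parent output align.
        for i in range(n_children):
            for j in range(n_parents):
                logits[i][j] += dot(predictions[i][j], parents[j])
    return coupling, parents


# Two children, two parents. The children agree about parent 0
# and contradict each other about parent 1.
predictions = [
    [[1.0, 0.0], [0.0, 1.0]],   # child 0
    [[1.0, 0.1], [0.0, -1.0]],  # child 1
]
coupling, parents = route(predictions)
print(coupling)
```

After a few iterations both children send most of their output to parent 0, because their predictions for that parent point in nearly the same direction and therefore have a large dot product with its output.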

# Convolutional layers in PyTorch

CNN hyperparameters:
- **Depth** of the *input* (e.g. 3 for RGB images)
- **Depth** of the *convolutional layer* (e.g. 16 for 16 filters)
- **Kernel size** of filters (e.g. 4 for 4x4 filters). Typically ranges from 3 to 7 for larger images.
- **Stride** for hovering. Typically 1, which is also the default in many frameworks. Stride determines the size of the output of the layer, e.g. a stride of 2 roughly halves the width and height of the output.
- **Padding** for the number of pixels to be padded around the edges of the image. Common methods of padding are padding with all 0-pixels or padding with the nearest pixel value. Kernel size and padding are related: with a stride of 1 and an odd kernel size F, padding of (F - 1)/2 pixels keeps the output the same width and height as the input.

Max pooling layers are typically placed after convolutional layers to shrink the x,y dimensions of an input. At the same time, applying more filters to the filtered images of the previous convolutional layer makes each successive layer deeper.

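```python
import torch.nn as nn


class ModelName(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor: two stages of convolution -> max pooling -> ReLU
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 2, stride=2),    # 1 input channel, 16 filters of 2x2
            nn.MaxPool2d(2, 2),               # halves width and height
            nn.ReLU(True),

            nn.Conv2d(16, 32, 3, padding=1),  # 16 -> 32 filters of 3x3, size kept
            nn.MaxPool2d(2, 2),
            nn.ReLU(True),
        )

    def forward(self, x):
        # x has shape (batch, 1, height, width)
        return self.features(x)
```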

## Number of parameters in a Convolutional Layer

Variables:
- **K**: Number of filters in CL
- **F**: Height and width of the kernel/filter
- **D_in**: Depth of the previous layer

- Per filter there is one weight per value in the filter, such that there are F\*F\*D_in weights per filter.
- Across all filters there are then K\*F\*F\*D_in weights in the CL
- There is one bias term per filter.
- Total number of parameters is then K\*F\*F\*D_in + K
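As a quick check of the formula, the sketch below applies it to the two convolutional layers from the PyTorch snippet above, `nn.Conv2d(1, 16, 2, stride=2)` and `nn.Conv2d(16, 32, 3, padding=1)`. The function name is an assumption made for illustration.

```python
def conv_layer_parameters(K, F, D_in):
    """Number of learnable parameters in a convolutional layer."""
    weights = K * F * F * D_in  # one weight per value in every filter
    biases = K                  # one bias term per filter
    return weights + biases


first_layer = conv_layer_parameters(K=16, F=2, D_in=1)
second_layer = conv_layer_parameters(K=32, F=3, D_in=16)
print(first_layer)   # 16*2*2*1 + 16 = 80
print(second_layer)  # 32*3*3*16 + 32 = 4640
```

Note that neither the stride nor the padding appears in the count: they change the shape of the output, not the number of weights.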

## Shape of the convolutional layer

Variables:
- **S** stride of the CL
- **P** padding of the image
- **W_in** width/height (square) of the previous layer

- Depth is always equal to the number of filters K
- Spatial dimensions of a CL can be calculated as: (W_in - F+2P)/S + 1
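The formula can be traced through the layers of the PyTorch snippet above. This sketch assumes a hypothetical 28x28 single-channel input and rounds down when the division is not exact. The same formula is reused for the pooling layers, with the window size as F. The function name is made up for the example.

```python
def conv_output_size(W_in, F, S=1, P=0):
    """Width/height of the output for a square input: (W_in - F + 2P)/S + 1."""
    return (W_in - F + 2 * P) // S + 1


size = 28                                     # assumed input width/height
size = conv_output_size(size, F=2, S=2)       # Conv2d(1, 16, 2, stride=2)    -> 14
size = conv_output_size(size, F=2, S=2)       # MaxPool2d(2, 2)               -> 7
size = conv_output_size(size, F=3, S=1, P=1)  # Conv2d(16, 32, 3, padding=1)  -> 7
size = conv_output_size(size, F=2, S=2)       # MaxPool2d(2, 2)               -> 3
print(size)  # 3
```

With K = 32 filters in the last convolutional layer, the final output under these assumptions has shape 32 x 3 x 3.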

# Feature vector

- CLs in a CNN convert an image array to a representation that encodes only the content of the image.
- This representation is called a feature vector, or a feature-level representation of an image.
- As an image moves through the CLs, more detail such as style and texture is discarded, and the CNN is pushed towards answering key questions about the presence of unique patterns and shapes.
- The feature vector is then fed to one or more fully connected layers to determine what object is contained in the image.

# Invariant representation

- Algorithm should be able to detect whether an object is present irrespective of its size (**scale invariance**), rotation (**angle invariance**) or location in the image (**translation invariance**).
- CNNs have some translation invariance due to max pooling applied after each convolutional layer. Max pooling extracts the signal irrespective of where that signal is coming from.
- You can make the algorithm more invariant by including duplicates of samples in which the objects appear at different angles and locations. The model perceives these duplicates as different samples.
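Creating such duplicates is commonly called data augmentation. A minimal pure-Python sketch for images stored as lists of lists is shown below. The helper names and the choice of transformations are assumptions made for illustration.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def shift_right(image, pixels=1, fill=0):
    """Move the content to the right, filling the gap with `fill`."""
    width = len(image[0])
    return [[fill] * pixels + row[: width - pixels] for row in image]


def augment(image):
    """Return the original image plus three transformed duplicates."""
    return [
        image,
        flip_horizontal(image),
        rotate_90_clockwise(image),
        shift_right(image),
    ]


image = [[1, 2],
         [3, 4]]
samples = augment(image)
print(samples[1])  # [[2, 1], [4, 3]]
print(samples[2])  # [[3, 1], [4, 2]]
print(samples[3])  # [[0, 1], [0, 3]]
```

Each duplicate keeps the label of the original sample, so the model sees the same object at several angles and locations.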
