# My summary

## Overfitting and Underfitting

Understanding overfitting and underfitting is the key to building successful machine learning and deep learning models.

### Overfitting

Overfitting, or failing to generalize, is a common problem in machine learning and deep learning. We say an algorithm overfits when **it performs well on the training dataset but poorly on unseen data, such as the validation and test datasets**. This mostly happens because the algorithm picks up patterns that are too specific to the training dataset. Put simply, the algorithm finds a way to memorize the training dataset, so it performs very well there and fails on unseen data.
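One way to make this symptom concrete is to compare per-epoch training and validation losses. The snippet below is a minimal sketch; the helper names and the loss values are hypothetical and chosen only to illustrate the pattern of a falling training loss next to a rising validation loss.

```python
def generalization_gap(train_losses, val_losses):
    """Per-epoch gap: validation loss minus training loss."""
    return [v - t for t, v in zip(train_losses, val_losses)]


def best_validation_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


# Hypothetical loss curves: training loss keeps falling, while
# validation loss bottoms out and then starts rising again.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.95, 0.70, 0.55, 0.50, 0.58, 0.70]

gaps = generalization_gap(train_losses, val_losses)
best_epoch = best_validation_epoch(val_losses)

print(best_epoch)         # 3
print(gaps[-1] > gaps[0]) # True: the gap widens as training continues
```

A widening gap after the best validation epoch suggests the model has started to memorize the training set.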

Several techniques can help prevent overfitting:

- Getting more data.
- Reducing the size of the network.
- Applying weight regularization.
- Applying dropout.

#### Getting more data

If you can get more data for the algorithm to train on, the algorithm is pushed toward general patterns rather than patterns specific to a small number of data points, which helps it avoid overfitting. In many cases, however, getting more labeled data is a challenge.

For computer vision problems, techniques such as data augmentation can be used to generate more training data. Data augmentation means adjusting the existing images slightly, for example by rotating or cropping them, to produce additional training examples. With enough domain understanding, you can also create synthetic data if capturing actual data is expensive. When you are unable to get more data, there are other ways to avoid overfitting.
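As a rough illustration of data augmentation, the sketch below uses only `torch` to apply a random horizontal flip and a random shifted crop to one image tensor. The `augment` function and its `pad` parameter are hypothetical names; flipping is used here in place of rotation only because it is easy to express without extra dependencies. In practice, `torchvision.transforms` offers ready-made transforms such as `RandomRotation` and `RandomCrop`.

```python
import torch


def augment(image, pad=2):
    """Randomly flip a (C, H, W) image and take a random crop.

    The crop has the same size as the input and is taken from a
    zero-padded copy, so the content shifts by up to `pad` pixels.
    """
    _, height, width = image.shape
    if torch.rand(1).item() < 0.5:
        image = torch.flip(image, dims=[2])  # mirror along the width axis
    padded = torch.nn.functional.pad(image, (pad, pad, pad, pad))
    top = torch.randint(0, 2 * pad + 1, (1,)).item()
    left = torch.randint(0, 2 * pad + 1, (1,)).item()
    return padded[:, top:top + height, left:left + width]


torch.manual_seed(0)
image = torch.rand(1, 28, 28)  # stand-in for one grayscale image
augmented = [augment(image) for _ in range(4)]
print([tuple(a.shape) for a in augmented])
```

Each call returns a slightly different view of the same image, so the model sees more variety without any new labels being collected.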

#### Reducing the size of the network

One of the key principles for dealing with overfitting, that is, poor generalization, is to build simpler models. One technique for building simpler models is to reduce the complexity of the architecture by reducing its size.

The size of a network generally refers to its number of layers or its number of weight parameters. In the image classification example, we used a ResNet model with 18 layers. The torchvision library that accompanies PyTorch comes with ResNet models of different sizes, from 18 layers up to 152 layers. If, for example, we are using a 152-layer ResNet and the model is overfitting, we can try a ResNet with 101 or 50 layers. In the custom architectures we build, we can simply remove some intermediate linear layers, which limits the model's capacity to memorize the training dataset. Let's look at an example code snippet that demonstrates exactly what it means to reduce the network size:

```python
import torch
from torch import nn


class Architectura1(nn.Module):

    def __init__(self, input_size, hidden_size, num_classes):
        super(Architectura1, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()  # stateless, so one instance is reused
        # The middle layer must map hidden_size -> hidden_size so that
        # its output matches the input expected by fc3.
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        out = self.relu(out)
        out = self.fc3(out)
        return out
```

The preceding architecture has three linear layers, and let's say it overfits our training data. So, let's recreate the architecture with reduced capacity:

```python
class Architectura2(nn.Module):

    def __init__(self, input_size, hidden_size, num_classes):
        super(Architectura2, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out
```

The preceding architecture has only two linear layers, thus reducing the capacity and, in turn, potentially avoiding overfitting the training dataset.
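To see how much capacity is removed, we can count trainable parameters. The sketch below rebuilds both architectures with `nn.Sequential` so that it runs on its own; it assumes the middle layer of the three-layer network maps `hidden_size` to `hidden_size`, and the sizes 10, 20, and 2 as well as the `count_parameters` helper are illustrative choices.

```python
from torch import nn


def count_parameters(model):
    """Total number of trainable weights and biases in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


input_size, hidden_size, num_classes = 10, 20, 2

# Three linear layers, mirroring the larger architecture.
three_layer = nn.Sequential(
    nn.Linear(input_size, hidden_size), nn.ReLU(),
    nn.Linear(hidden_size, hidden_size), nn.ReLU(),
    nn.Linear(hidden_size, num_classes),
)

# Two linear layers, mirroring the reduced architecture.
two_layer = nn.Sequential(
    nn.Linear(input_size, hidden_size), nn.ReLU(),
    nn.Linear(hidden_size, num_classes),
)

print(count_parameters(three_layer))  # 682
print(count_parameters(two_layer))    # 262
```

Dropping the middle layer removes its 20 x 20 weights and 20 biases, which is 420 of the 682 parameters.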

#### Applying weight regularization

Another important consideration is keeping the weights of the network from taking large values. Regularization constrains the network by adding a term to the loss that grows with the magnitude of the weights, so the model is penalized whenever it uses large weights. There are two common types of regularization:

- **L1 regularization:** The sum of the absolute values of the weight coefficients is added to the cost. This sum is the L1 norm of the weights.
- **L2 regularization:** The sum of the squares of the weight coefficients is added to the cost. This sum is the squared L2 norm of the weights.
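The two penalties above can be computed directly from a model's weight tensors. The following is a minimal sketch, not a prescribed recipe: the helper names, the tiny `nn.Linear` model, the random data, and the coefficient `lam` are all hypothetical, and biases are left out to match the description in terms of weight coefficients.

```python
import torch
from torch import nn


def weight_tensors(model):
    """Weight tensors only; biases are excluded from the penalty."""
    return [p for name, p in model.named_parameters()
            if name.endswith("weight")]


def l1_penalty(model):
    """Sum of absolute values of all weights."""
    return sum(w.abs().sum() for w in weight_tensors(model))


def l2_penalty(model):
    """Sum of squares of all weights."""
    return sum(w.pow(2).sum() for w in weight_tensors(model))


torch.manual_seed(0)
model = nn.Linear(4, 2)
inputs = torch.randn(8, 4)
targets = torch.randint(0, 2, (8,))

data_loss = nn.functional.cross_entropy(model(inputs), targets)
lam = 1e-3  # hypothetical regularization strength
loss = data_loss + lam * l1_penalty(model)  # swap in l2_penalty for L2
loss.backward()  # gradients now include the penalty term
```

Adding the penalty by hand like this is one way to apply L1 regularization, since it is just an extra term in the loss.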

PyTorch provides an easy way to use L2 regularization by setting the `weight_decay` parameter in the optimizer:

```python
model = Architectura1(10, 20, 2)
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, weight_decay=1e-5)
```

By default, the weight decay parameter is set to zero. We can try different values for weight decay; a small value such as `1e-5` is a reasonable starting point, although the best value depends on the model and the dataset.
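To see what `weight_decay` does in a single update, the sketch below uses plain `torch.optim.SGD` rather than Adam, because its update rule is simple enough to check by hand. The gradient of the data loss is set to zero on purpose, so the only thing acting on the weights is the decay term; the values of `lr` and `weight_decay` are exaggerated for illustration.

```python
import torch

weight = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
optimizer = torch.optim.SGD([weight], lr=0.1, weight_decay=0.5)

# Pretend the data loss has zero gradient, so only weight decay acts.
weight.grad = torch.zeros_like(weight)
optimizer.step()

# SGD adds weight_decay * w to the gradient, so each weight is
# multiplied by (1 - lr * weight_decay) = 0.95.
print(weight.data)
```

Every step pulls each weight slightly toward zero, which for SGD corresponds to an L2 penalty of `weight_decay / 2` times the sum of squared weights.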

#### Dropout

Dropout is one of the most commonly used and most powerful regularization techniques in deep learning. It was developed by Hinton and his students at the University of Toronto. Dropout is applied to intermediate layers of the model during training: on each forward pass, it randomly sets a fraction of a layer's outputs to zero, so the network cannot rely on any single activation.

Dropout probabilities in the range of 0.2 to 0.5 are common, and dropout can be applied at several layers. Dropout is active only during training. In PyTorch, `nn.Dropout` scales the surviving activations by 1/(1 - p) during training, where p is the dropout probability, so no rescaling is needed at test time; calling `model.eval()` turns the layer into a pass-through. (The original formulation instead scaled the outputs down at test time.) PyTorch provides dropout as another layer, thus making it easier to use. The following code snippet shows how to use a dropout layer in PyTorch:

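```python
from torch import nn


class mnist_model(nn.Module):

    def __init__(self):
        super(mnist_model, self).__init__()
        self.feats = nn.Sequential(
            nn.Conv2d(1, 32, 5, 1, 1),
            nn.MaxPool2d(2, 2),
            nn.ReLU(True),
            nn.BatchNorm2d(32),

            nn.Conv2d(32, 64, 3, 1, 1),
            nn.ReLU(True),
            nn.BatchNorm2d(64),

            nn.Conv2d(64, 64, 3, 1, 1),
            nn.MaxPool2d(2, 2),
            nn.ReLU(True),
            nn.BatchNorm2d(64),

            nn.Conv2d(64, 128, 3, 1, 1),
            nn.ReLU(True),
            nn.BatchNorm2d(128)
        )

        self.classifier = nn.Conv2d(128, 10, 1)
        self.avgpool = nn.AvgPool2d(6, 6)
        self.dropout = nn.Dropout(0.5)  # drops half the activations in training

    # The original snippet shows only __init__. This forward pass is an
    # assumed sketch for 1 x 28 x 28 inputs, added to show where the
    # dropout layer would typically be applied.
    def forward(self, inputs):
        out = self.feats(inputs)
        out = self.dropout(out)
        out = self.classifier(out)
        out = self.avgpool(out)
        return out.view(-1, 10)
```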
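The difference between training and evaluation behavior can be checked on a standalone `nn.Dropout` layer. This is a small sketch with an all-ones input chosen so that the scaling is easy to read off; the tensor size and seed are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # each entry is either 0 or 1 / (1 - p) = 2

dropout.eval()
eval_out = dropout(x)   # in eval mode the layer is a pass-through

print(train_out.unique())        # only the values 0 and 2 appear
print(torch.equal(eval_out, x))  # True
```

This is why calling `model.train()` before training and `model.eval()` before validation or testing matters for any model that contains dropout layers.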

### Underfitting

There are times when our model fails to learn any patterns from the training data, which is evident when it performs poorly even on the dataset it was trained on. One thing to try when your model underfits is to acquire more data for the algorithm to train on, although more data alone rarely helps if the model lacks capacity. A more direct approach is to increase the complexity of the model by increasing the number of layers or the number of weights or parameters it uses. It is often good practice not to use any of the aforementioned regularization techniques until the model actually overfits the dataset.
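Increasing capacity is easier when depth and width are constructor arguments. The sketch below uses a hypothetical `make_mlp` helper, with illustrative sizes, to build fully connected networks of varying depth and width.

```python
from torch import nn


def make_mlp(input_size, hidden_size, num_classes, num_hidden_layers):
    """Build an MLP with the given number of hidden Linear + ReLU blocks."""
    layers = []
    in_features = input_size
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(in_features, hidden_size), nn.ReLU()]
        in_features = hidden_size
    layers.append(nn.Linear(in_features, num_classes))
    return nn.Sequential(*layers)


small = make_mlp(10, 20, 2, num_hidden_layers=1)
deeper = make_mlp(10, 20, 2, num_hidden_layers=3)  # more layers
wider = make_mlp(10, 64, 2, num_hidden_layers=1)   # more units per layer

for name, model in [("small", small), ("deeper", deeper), ("wider", wider)]:
    print(name, sum(p.numel() for p in model.parameters()))
```

If the small model underfits, the deeper or wider variant is a natural next candidate, with regularization added only once the training metrics show overfitting.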