# 6.2  Regularization and dropout

Regularization reduces the variance of a model in order to alleviate overfitting. The two most common forms are `L1` and `L2` regularization. Dropout is another widely used technique for the same purpose.

In this section, we will look at both approaches to handling overfitting: L2 regularization and dropout.

## Regularization

The loss function measures the difference between the output of the model and the real label. Based on this difference, we optimize the model by changing its parameters over time. However, if we leave the parameters free to take any values, they may fit the training data too well, making the model overfit. To prevent this, we add a new term to the loss function, called the regularization term, to impose a constraint on the parameters.
Specifically, this constraint pushes the parameters to be as small as possible (in other words, as close to zero as possible), which limits the values they can take and so discourages overfitting to the data.
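
The idea can be written compactly for the L2 case. The notation here is an assumption made for illustration and does not come from the original text: $\theta$ denotes all model parameters, $\mathcal{L}$ the original loss, and $\lambda \ge 0$ the regularization coefficient.

$$
\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2}\,\lVert \theta \rVert_2^2,
\qquad
\nabla_\theta \mathcal{L}_{\text{reg}}(\theta) = \nabla_\theta \mathcal{L}(\theta) + \lambda\,\theta
$$

The extra gradient term $\lambda\,\theta$ pulls every parameter toward zero at each update, and the pull is stronger for larger parameters.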

In this section, we will see how PyTorch implements L2 regularization, which is usually called `weight decay`.
Although regularization was introduced above as a term added to the loss, in PyTorch the weight decay is defined in the optimizer (as we will see below). When `weight_decay` is set, the optimizer adds `weight_decay * parameter` to the gradient of each parameter at every step. For SGD, this has the same effect as adding an L2 penalty to the loss, regardless of the loss function used.
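
The following is a minimal sketch, not part of the original notebook, that checks this equivalence for plain SGD (no momentum) on a throwaway linear model. The names `model_a`, `model_b`, `wd`, and the random data are assumptions made only for illustration.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

wd, lr = 1e-2, 0.1  # hypothetical coefficient and learning rate
x, y = torch.randn(8, 3), torch.randn(8, 1)  # random toy data

# Two identical copies of the same small model
model_a = nn.Linear(3, 1)
model_b = copy.deepcopy(model_a)

# (a) weight decay handled by the optimizer
opt_a = torch.optim.SGD(model_a.parameters(), lr=lr, weight_decay=wd)
loss_a = nn.functional.mse_loss(model_a(x), y)
opt_a.zero_grad()
loss_a.backward()
opt_a.step()

# (b) explicit penalty (wd / 2) * ||theta||^2 added to the loss, no weight_decay
opt_b = torch.optim.SGD(model_b.parameters(), lr=lr)
penalty = sum((p ** 2).sum() for p in model_b.parameters())
loss_b = nn.functional.mse_loss(model_b(x), y) + 0.5 * wd * penalty
opt_b.zero_grad()
loss_b.backward()
opt_b.step()

# After one step, both models should hold (numerically) the same parameters
same = all(
    torch.allclose(pa, pb, atol=1e-6)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print("updates match:", same)
```

Note that the optimizer applies the decay to every parameter it is given, biases included, which is why the explicit penalty above also sums over all parameters.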


In [None]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(1)

n_hidden = 200  # number of neurons for the hidden layers
max_iter = 2000  # maximum number of training iterations
disp_interval = 200  # interval (in iterations) for printing the losses
lr_init = 0.01  # learning rate initialization

Creating a simple synthetic dataset.

In [None]:
# Construct a small synthetic dataset: a noisy line y = w*x + noise
def gen_data(num_data=10, x_range=(-2, 2)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 1.0, size=test_x.size())

    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))

Defining the model.

In [None]:
class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

# Instantiate two fully connected networks built above for comparison
net_normal = MLP(neural_num=n_hidden)  # this network will be trained without weight decay
net_weight_decay = MLP(neural_num=n_hidden)  # this network will be WITH weight decay

Let's define a simple Mean Squared Error loss - [MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html)

Note that the problem here is regression - not classification!
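
As a quick sanity check of what `MSELoss` computes with its default settings, the sketch below compares it against the mean of squared differences written by hand. The tensors `pred` and `target` are made-up values used only for illustration.

```python
import torch
import torch.nn as nn

# Made-up predictions and targets, shaped like the (N, 1) tensors used in this section
pred = torch.tensor([[1.0], [2.0], [4.0]])
target = torch.tensor([[1.5], [2.0], [3.0]])

mse_builtin = nn.MSELoss()(pred, target)      # default reduction is the mean
mse_manual = ((pred - target) ** 2).mean()    # (0.25 + 0.0 + 1.0) / 3

print(mse_builtin.item(), mse_manual.item())
```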

In [None]:
loss_func = torch.nn.MSELoss()

Define the SGD optimizers (two in this example). You can check the SGD documentation [here](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html).

The first SGD does not define the `weight_decay` parameter, that is, does not use regularization.
On the other hand, the second optimizer explicitly defines the `weight_decay` parameter, setting a value of `1e-2` to it.

In [None]:
# net_normal: no regularization
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)  # parameters of the normal net

# net_weight_decay: with regularization, coefficient 1e-2
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9, weight_decay=1e-2)  # parameters of the net_weight_decay

Now, we have everything we need to train our model.

In [None]:
for epoch in range(max_iter):
    # forward
    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)  # both networks are being used
    # calculating loss based on the output of both networks
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    # backpropagation for both networks
    optim_normal.zero_grad()
    optim_wdecay.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_wdecay.step()

    if (epoch+1) % disp_interval == 0:
        print("Iteration " + str(epoch+1) + "/" + str(max_iter) + ": loss_normal: " + str(loss_normal.item()) + " vs loss_wdecay: " + str(loss_wdecay.item()))

So, we got better loss on the training set using the model **without** regularization, huh? Alright! Let's calculate the loss for the test set for comparison.

In [None]:
net_normal.eval()
net_weight_decay.eval()
test_pred_normal, test_pred_wdecay = net_normal(test_x), net_weight_decay(test_x)

print(loss_func(test_pred_normal, test_y), loss_func(test_pred_wdecay, test_y))

We got a better loss on the test set when using the model **with** regularization. That's weird, huh? Essentially, this is because the model trained without regularization overfitted, so it's producing good results for training (since it's very well-fitted on that data), but it can't generalize to new data like the test set, thus producing bad results on that set.
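
One way to see the effect of weight decay directly is to look at the size of the learned parameters. The helper below is a hypothetical utility, not part of the original notebook; it is demonstrated on a throwaway layer with hand-set values so that the result is easy to check.

```python
import torch
import torch.nn as nn

def param_l2_norm(model: nn.Module) -> float:
    """Return the L2 norm of all parameters of `model`, treated as one flat vector."""
    squared = sum((p.detach() ** 2).sum() for p in model.parameters())
    return float(torch.sqrt(squared))

# Tiny demonstration on a throwaway layer with known values
demo = nn.Linear(2, 1)
with torch.no_grad():
    demo.weight.fill_(3.0)  # two weights equal to 3
    demo.bias.fill_(1.0)    # one bias equal to 1

print(param_l2_norm(demo))  # sqrt(9 + 9 + 1)
```

Calling `param_l2_norm(net_normal)` and `param_l2_norm(net_weight_decay)` after training would be one way to compare the two networks. The network trained with weight decay would be expected to have the smaller norm, although that is not verified here.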

Let's plot the train and test data points and the learned curves from both networks for comparison.

In [None]:
# plot
plt.scatter(train_x.data.numpy(), train_y.data.numpy(), c='blue', s=50, alpha=0.3, label='train')
plt.scatter(test_x.data.numpy(), test_y.data.numpy(), c='red', s=50, alpha=0.3, label='test')
plt.plot(test_x.data.numpy(), test_pred_normal.data.numpy(), 'r-', lw=3, label='no weight decay')
plt.plot(test_x.data.numpy(), test_pred_wdecay.data.numpy(), 'b--', lw=3, label='weight decay')

plt.ylim((-2.5, 2.5))
plt.legend(loc='upper left')
plt.title("Epoch: {}".format(epoch+1))
plt.show()
plt.close()

In this plot, we can see that the red solid line (learned by the model **without** regularization) is passing through **all** the blue/training data points, showing how overfitted it is. On the other hand, the blue dashed line (learned by the model with regularization), while very similar to the red line, is slightly smoother and can generalize better to the test/red data points.

## Dropout

The concept of Dropout is simple to understand: during training, randomly deactivate neurons in the network.

**Random**: There is an inactivation probability (Dropout probability).

**Inactivation**: The output of a deactivated neuron is set to 0, that is, the neuron does not contribute to the network's outcome.

By preventing the network from relying too much on specific patterns/neurons, dropout helps it generalize better to unseen data, alleviating the overfitting problem.
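
The short sketch below, which is not part of the original notebook, shows how `nn.Dropout` behaves in training and in evaluation mode. The input of ones and the names `drop`, `out_train`, and `out_eval` are chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

drop.train()         # training mode: each element is zeroed with probability p
out_train = drop(x)  # surviving elements are scaled by 1 / (1 - p) = 2.0

drop.eval()          # evaluation mode: dropout does nothing
out_eval = drop(x)

print("train:", out_train)
print("eval: ", out_eval)
```

The scaling during training keeps the expected value of each activation unchanged, which is why no extra correction is needed once the layer is switched to evaluation mode with `eval()`.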

The following code mirrors the weight decay experiment above, this time comparing a network with Dropout against one without.

In [None]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

n_hidden = 200
max_iter = 2000
disp_interval = 400
lr_init = 0.01

In [None]:
def gen_data(num_data=20, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 1.0, size=test_x.size())

    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))

The network architecture is defined below. Note the [nn.Dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) layers.
These layers take one parameter, `p`, which is the "probability of an element to be zeroed".

In the code below, two networks are instantiated: one with dropout probability 0.5 and the other with probability 0.0 (meaning that the dropout layers do nothing in this case).
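
As a small check of the second case, the sketch below (not part of the original notebook, with a random input chosen only for illustration) confirms that a dropout layer with probability 0.0 leaves its input unchanged even in training mode.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(4, 5)
no_drop = nn.Dropout(p=0.0)
no_drop.train()  # training mode, where dropout would normally be active

out = no_drop(x)
print(torch.equal(out, x))  # nothing is zeroed and the scale factor is 1 / (1 - 0) = 1
```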

In [None]:
class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(

            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),  # dropout
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),  # dropout
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),  # dropout
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)

In [None]:
optim_normal = torch.optim.SGD(net_prob_0.parameters(), lr=lr_init, momentum=0.9)
optim_reglar = torch.optim.SGD(net_prob_05.parameters(), lr=lr_init, momentum=0.9)

loss_func = torch.nn.MSELoss()

In [None]:
for epoch in range(max_iter):

    pred_normal, pred_dropout = net_prob_0(train_x), net_prob_05(train_x)
    loss_normal, loss_dropout = loss_func(pred_normal, train_y), loss_func(pred_dropout, train_y)

    optim_normal.zero_grad()
    optim_reglar.zero_grad()

    loss_normal.backward()
    loss_dropout.backward()

    optim_normal.step()
    optim_reglar.step()

    if (epoch+1) % disp_interval == 0:
        print("Iteration " + str(epoch+1) + "/" + str(max_iter) + ": loss_normal: " + str(loss_normal.item()) + " vs loss_dropout: " + str(loss_dropout.item()))

Again, better loss on the training set with the model **without** dropout!

In [None]:
net_prob_0.eval()
net_prob_05.eval()

test_pred_prob_0, test_pred_prob_05 = net_prob_0(test_x), net_prob_05(test_x)

print(loss_func(test_pred_prob_0, test_y), loss_func(test_pred_prob_05, test_y))

But **again**, better loss on the test set using the model trained **with** dropout!!!

In [None]:
# plot
plt.clf()
plt.scatter(train_x.data.numpy(), train_y.data.numpy(), c='blue', s=50, alpha=0.3, label='train')
plt.scatter(test_x.data.numpy(), test_y.data.numpy(), c='red', s=50, alpha=0.3, label='test')
plt.plot(test_x.data.numpy(), test_pred_prob_0.data.numpy(), 'r-', lw=3, label='d_prob_0')
plt.plot(test_x.data.numpy(), test_pred_prob_05.data.numpy(), 'b--', lw=3, label='d_prob_05')

plt.ylim((-2.5, 2.5))
plt.legend(loc='upper left')
plt.show()
plt.close()

How overfitted is the red line (learned by the model without dropout)?