# Regularization in Machine Learning


Regularization is a key concept in machine learning used to prevent overfitting by introducing additional information or constraints to a model. 
This notebook explores different regularization methods and their implementation in **Vanilla Python** and **PyTorch**.
    
Topics Covered:
1. L2 Regularization (Ridge)
2. Weight Decay
3. Dropout
4. Batch Normalization

In [3]:
import gzip
import math
import pickle
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

In [4]:
MNIST_URL='https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/data/mnist.pkl.gz?raw=true'
path_data = Path('data')
path_data.mkdir(exist_ok=True)
path_gz = path_data/'mnist.pkl.gz'

In [5]:
from urllib.request import urlretrieve
if not path_gz.exists(): urlretrieve(MNIST_URL, path_gz)

In [6]:
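def load_data():
    """Load the MNIST train/validation splits from the gzipped pickle as float tensors."""
    with gzip.open(path_gz, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

    # Note: the labels are converted to float here too; train() casts them back with y.long().
    x_train, y_train, x_valid, y_valid = map(lambda x: torch.from_numpy(x).float(),
                                             (x_train, y_train, x_valid, y_valid))

    return (x_train, y_train, x_valid, y_valid)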

In [7]:
x_train, y_train, x_valid, y_valid = load_data()

In [8]:
x_train.shape

torch.Size([50000, 784])

# No Regularization

In [27]:
class Network(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Linear(128, 64),
        nn.ReLU(),
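        nn.Linear(64, 10),
        # Note: a ReLU on the output clamps the logits to be non-negative; the final layer is
        # usually left without an activation when training with nn.CrossEntropyLoss.
        nn.ReLU()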
        )
        
    def forward(self, x):
        return self.model(x)

In [28]:
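def train(model, x, y, criterion, optimizer, epochs=500):
    """Full-batch training: every epoch is a single gradient step on all of x."""
    model.train()  # keep Dropout/BatchNorm in training mode (calc_acc switches to eval)

    for epoch in range(epochs):
        preds = model(x)
        loss = criterion(preds, y.long())  # CrossEntropyLoss expects integer class labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Log the loss at epochs 0 and 1, then every 100 epochs
        if epoch == 1 or epoch % 100 == 0: print(loss)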

In [29]:
def calc_acc(model, x, y):
    
    model.eval() 
    correct = 0
    total = 0

    with torch.no_grad(): 

        outputs = model(x)
        _, predicted = torch.max(outputs, 1)
        total += y.size(0)
        correct += (predicted == y).sum().item() 

    accuracy = (correct / total) * 100
    return accuracy

In [30]:
model = Network()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [31]:
x_train, y_train, x_valid, y_valid = load_data()
small_x_train = x_train[:10000]
small_y_train = y_train[:10000]

In [32]:
%time train(model, x_train, y_train, criterion, optimizer)

tensor(2.3022, grad_fn=<NllLossBackward0>)
tensor(2.2453, grad_fn=<NllLossBackward0>)
tensor(0.2836, grad_fn=<NllLossBackward0>)
tensor(0.2490, grad_fn=<NllLossBackward0>)
tensor(0.2457, grad_fn=<NllLossBackward0>)
tensor(0.2451, grad_fn=<NllLossBackward0>)
CPU times: user 51.2 s, sys: 2min 9s, total: 3min
Wall time: 19.1 s


In [33]:
calc_acc(model, x_valid, y_valid)

86.72999999999999

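# L2 Regularization

L2 regularization adds the squared L2 norm of the parameters, scaled by a coefficient $\lambda$, to the loss: $L_{total} = L + \lambda \lVert w \rVert_2^2$.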

In [183]:
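lambda_reg = 0.001  # regularization strength (lambda)
loss = 0            # placeholder for the data loss, e.g. the value returned by criterion(preds, y)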

In [184]:
# Squared L2 norm summed over every parameter tensor (weights and biases)
l2_loss = sum(torch.norm(param) ** 2 for param in model.parameters())
total_loss = loss + lambda_reg * l2_loss

In [176]:
t = torch.tensor([3.,4.])

In [177]:
torch.norm(t)

tensor(5.)

In [180]:
math.sqrt(3**2 + 4**2)

5.0
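
The penalty above can be folded into a training step. The following is a minimal sketch, assuming a toy model and random data rather than MNIST; `toy_model`, `x_toy`, `y_toy`, and `l2_penalty` are illustrative names that do not appear elsewhere in this notebook. It uses `p.pow(2).sum()`, which equals `torch.norm(p) ** 2`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data: 64 samples, 20 features, 3 classes (an assumption for illustration)
x_toy = torch.randn(64, 20)
y_toy = torch.randint(0, 3, (64,))

toy_model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 3))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(toy_model.parameters(), lr=0.1)
lambda_reg = 0.001


def l2_penalty(model):
    """Sum of squared entries over all parameter tensors (squared L2 norm)."""
    return sum(p.pow(2).sum() for p in model.parameters())


history = []
for step in range(20):
    data_loss = criterion(toy_model(x_toy), y_toy)
    # The regularized objective: data loss plus lambda times the squared norm
    total_loss = data_loss + lambda_reg * l2_penalty(toy_model)

    optimizer.zero_grad()
    total_loss.backward()  # gradients now include 2 * lambda * w from the penalty
    optimizer.step()

    history.append((data_loss.item(), total_loss.item()))

print(history[0], history[-1])
```

Because the penalty is part of the loss, autograd differentiates it along with the data term, so no change to the optimizer is needed.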

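PyTorch optimizers implement L2 through the `weight_decay` argument, which adds `weight_decay * param` to each parameter's gradient (`optim.SGD` and `optim.Adam` both work this way).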
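
For plain SGD, the `weight_decay` argument and an explicit penalty in the loss should give the same update. The sketch below checks this on a toy linear model, assuming the penalty is written as `0.5 * wd * ||w||^2` so that its gradient is `wd * w`; `model_a`, `model_b`, and the random data are illustrative names only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(0)
x_toy = torch.randn(32, 5)
y_toy = torch.randn(32, 1)
wd, lr = 0.1, 0.05

# Two identical copies of the same model
model_a = nn.Linear(5, 1)
model_b = nn.Linear(5, 1)
model_b.load_state_dict(model_a.state_dict())

# (a) Explicit L2 penalty added to the loss, optimizer without weight_decay
opt_a = optim.SGD(model_a.parameters(), lr=lr)
penalty = sum(p.pow(2).sum() for p in model_a.parameters())
loss_a = F.mse_loss(model_a(x_toy), y_toy) + 0.5 * wd * penalty
opt_a.zero_grad()
loss_a.backward()
opt_a.step()

# (b) Plain loss, penalty delegated to the optimizer's weight_decay argument
opt_b = optim.SGD(model_b.parameters(), lr=lr, weight_decay=wd)
loss_b = F.mse_loss(model_b(x_toy), y_toy)
opt_b.zero_grad()
loss_b.backward()
opt_b.step()

# Largest absolute difference between the two sets of updated parameters
max_diff = max((pa - pb).abs().max().item()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(max_diff)
```

The difference should be at the level of floating-point rounding. Note that `weight_decay` applies to every parameter passed to the optimizer, biases included, which is why the explicit penalty here also sums over all parameters.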

# Weight Decay

In [167]:
model = Network()
criterion = nn.CrossEntropyLoss()
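# Note: weight_decay=1e-15 is so small that it has practically no effect on the update;
# values such as 1e-4 or 1e-2 are more typical starting points.
optimizer = optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-15)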

In [168]:
x_train, y_train, x_valid, y_valid = load_data()

In [169]:
%time train(model, x_train, y_train, criterion, optimizer)

tensor(2.3083, grad_fn=<NllLossBackward0>)
tensor(2.1972, grad_fn=<NllLossBackward0>)
tensor(0.2956, grad_fn=<NllLossBackward0>)
tensor(0.2461, grad_fn=<NllLossBackward0>)
tensor(0.2376, grad_fn=<NllLossBackward0>)
tensor(0.2362, grad_fn=<NllLossBackward0>)
CPU times: user 51 s, sys: 2min 3s, total: 2min 54s
Wall time: 18.6 s


In [170]:
calc_acc(model, x_valid, y_valid)

88.33

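When should I use which?

- If you're using plain SGD, it doesn't really matter: the two are equivalent up to a rescaling of the coefficient. Use whichever is easier in the framework you're using.
- If you're using an adaptive optimizer like Adam, use decoupled weight decay. In PyTorch this is `optim.AdamW`; the `weight_decay` argument of `optim.Adam` adds the L2 term to the gradient instead.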

Why is weight decay preferred over L2 with Adam?

Adam adapts each parameter's step size according to the size of its recent gradients. L2 changes the gradient directly, so the penalty passes through that adaptive scaling. Weight decay is applied to the weight update itself, not to the gradient, so Adam does not rescale it.

Because Adam scales the step size inversely with the gradient magnitude, a weight with a larger gradient (including the part contributed by L2) gets a smaller effective learning rate. In other words, the effect of L2 won't be the same for every weight, but it **will** be for weight decay.
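
The difference can be seen in a small experiment. This is a minimal sketch, assuming PyTorch's `optim.Adam` (whose `weight_decay` is added to the gradient, i.e. the L2 form) and `optim.AdamW` (decoupled weight decay). The data gradient is deliberately zero so that only the regularizer moves the weights; `w_l2`, `w_dec`, and the chosen values are illustrative.

```python
import torch
import torch.optim as optim

lr, wd = 0.1, 0.1
w0 = torch.tensor([1.0, 10.0])  # one small and one large weight

w_l2 = w0.clone().requires_grad_(True)   # updated by Adam with L2-style weight_decay
w_dec = w0.clone().requires_grad_(True)  # updated by AdamW with decoupled weight decay

opt_l2 = optim.Adam([w_l2], lr=lr, weight_decay=wd)
opt_dec = optim.AdamW([w_dec], lr=lr, weight_decay=wd)

for opt, w in ((opt_l2, w_l2), (opt_dec, w_dec)):
    opt.zero_grad()
    loss = (w * 0.0).sum()  # data loss with zero gradient everywhere
    loss.backward()
    opt.step()

# Fraction of each weight that remains after one step
ratio_l2 = (w_l2.detach() / w0).tolist()
ratio_dec = (w_dec.detach() / w0).tolist()
print("Adam + L2 :", ratio_l2)
print("AdamW     :", ratio_dec)
```

With the L2 form, Adam normalizes the penalty gradient, so both weights move by about the same absolute amount (`lr`) and the small weight shrinks by roughly 10% while the large one shrinks by roughly 1%. With decoupled decay, both shrink by the same factor `1 - lr * wd`.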

# Dropout

In [171]:
class NetworkWithDropout(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
        nn.Linear(784, 128),
        nn.ReLU(),
        nn.Linear(128, 64),
        nn.ReLU(),
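        nn.Dropout(),  # default p=0.5: zeroes each activation with probability 0.5 in training mode
        nn.Linear(64, 10),
        # Note: the ReLU and Dropout below act on the output logits themselves; dropout is
        # usually applied to hidden activations only.
        nn.ReLU(),
        nn.Dropout()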
        )
        
    def forward(self, x):
        return self.model(x)

In [152]:
model = NetworkWithDropout()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [153]:
x_train, y_train, x_valid, y_valid = load_data()

In [154]:
%time train(model, x_train, y_train, criterion, optimizer)

tensor(2.3049, grad_fn=<NllLossBackward0>)
tensor(2.2832, grad_fn=<NllLossBackward0>)
tensor(1.2152, grad_fn=<NllLossBackward0>)
tensor(1.1770, grad_fn=<NllLossBackward0>)
tensor(1.1625, grad_fn=<NllLossBackward0>)
tensor(1.1635, grad_fn=<NllLossBackward0>)
CPU times: user 1min 13s, sys: 3min 11s, total: 4min 24s
Wall time: 28.2 s


In [155]:
calc_acc(model, x_valid, y_valid)

97.42
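
For reference, what a dropout layer does can be sketched in vanilla NumPy. This is a minimal sketch of "inverted" dropout (survivors are scaled by `1 / (1 - p)` during training, and the layer is the identity at evaluation time), which is the behaviour `nn.Dropout` documents; `dropout_forward` and the toy input are illustrative names, and the backward pass is omitted.

```python
import numpy as np


def dropout_forward(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x  # evaluation mode: identity, no rescaling needed
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p  # True for the entries that are kept
    return x * mask / (1.0 - p)      # rescale so the expected value is unchanged


rng = np.random.default_rng(0)
x_toy = np.ones((1000, 64))

out_train = dropout_forward(x_toy, p=0.5, training=True, rng=rng)
out_eval = dropout_forward(x_toy, p=0.5, training=False)

print(np.unique(out_train))  # only zeros and rescaled survivors
print(out_train.mean())      # close to the input mean of 1.0
```

The rescaling during training is why no adjustment is needed at evaluation time, and it is also why `calc_acc` must call `model.eval()` before measuring accuracy.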

# BatchNorm

In [42]:
class NetworkWithBatchNorm(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
        nn.Linear(784, 128),
        nn.BatchNorm1d(128),  # normalizes each feature over the batch, then applies a learnable scale (gamma) and shift (beta)
        nn.ReLU(),
        nn.Linear(128, 64),
        nn.BatchNorm1d(64),
        nn.ReLU(),
        nn.Linear(64, 10),
        nn.BatchNorm1d(10),
        nn.ReLU(),
        )
        
    def forward(self, x):
        return self.model(x)

In [43]:
model = NetworkWithBatchNorm()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

In [44]:
x_train, y_train, x_valid, y_valid = load_data()

In [45]:
%time train(model, x_train, y_train, criterion, optimizer)

tensor(2.5129, grad_fn=<NllLossBackward0>)
tensor(2.1002, grad_fn=<NllLossBackward0>)
tensor(0.0871, grad_fn=<NllLossBackward0>)
tensor(0.0208, grad_fn=<NllLossBackward0>)
tensor(0.0060, grad_fn=<NllLossBackward0>)
tensor(0.0028, grad_fn=<NllLossBackward0>)
CPU times: user 1min 12s, sys: 2min 25s, total: 3min 38s
Wall time: 23.1 s


In [46]:
calc_acc(model, x_valid, y_valid)

97.09
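
What `nn.BatchNorm1d` computes in training mode can be sketched in vanilla NumPy. This is a minimal sketch of the forward pass only, assuming the biased batch variance and `eps=1e-5` that PyTorch uses for normalization; running statistics and the backward pass are left out, and `batchnorm_forward` and the toy inputs are illustrative names.

```python
import numpy as np


def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) over the batch, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature biased batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta              # learnable scale (gamma) and shift (beta)


rng = np.random.default_rng(0)
x_toy = rng.normal(loc=3.0, scale=2.0, size=(256, 4))  # batch of 256, 4 features
gamma = np.full(4, 2.0)
beta = np.full(4, 0.5)

out = batchnorm_forward(x_toy, gamma, beta)
print(out.mean(axis=0))  # close to beta
print(out.std(axis=0))   # close to gamma
```

After normalization, each feature's mean and standard deviation over the batch are set by `beta` and `gamma` rather than by the incoming activations, which is what the learned `layer.weight` and `layer.bias` of each `nn.BatchNorm1d` hold.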

In [None]:
model.model

In [51]:
for i, layer in enumerate(model.model):
    if isinstance(layer, nn.BatchNorm1d):
        print(f"BatchNorm Layer {i}:")
        print(f"  Gamma (Scale): {layer.weight.data}")
#         print(f"  Beta (Shift): {layer.bias.data}")

BatchNorm Layer 1:
  Gamma (Scale): tensor([ 1.1652,  0.3029,  0.5780,  0.8863,  2.1060,  0.7894,  1.9146,  0.8595,
         1.3479,  2.2594,  0.7499,  1.2502,  0.6055,  0.7897,  1.3201,  1.8231,
         0.3432,  1.7832,  1.3496,  1.3438,  0.2845,  1.8903,  1.2405,  1.8726,
         1.4783,  0.0918,  0.2287,  1.1246,  0.0831,  0.7382,  1.4862,  1.2834,
         1.0800,  0.1879,  1.7311,  0.0501,  0.0605,  0.8945,  0.8620,  1.5843,
         0.0808,  1.1216,  1.7470,  0.8454,  1.1101,  1.1589,  0.5654,  1.4216,
         1.6013,  2.1777, -0.5968,  1.1514,  0.5874,  1.7584,  1.9480,  1.7042,
         2.0037,  1.8315,  2.3074,  1.8558,  0.1167, -0.0719,  0.9678,  1.3273,
         0.9043,  0.6597,  1.9749,  1.6451,  1.2883,  1.9117,  2.0002,  1.6170,
         0.9194,  2.4162,  1.3838,  0.9666,  0.0900,  2.2153,  1.6092,  1.5673,
        -0.9424,  0.8346,  1.6571,  1.2399,  1.3433,  1.3593,  1.7042,  0.0706,
         1.2761,  1.4833,  0.1141,  0.7444,  1.4354,  1.4378,  1.5721,  0.0462,
    