# Model Selection, Underfitting and Overfitting

-------

The goal of machine learning is find a learning algorithm (algorithm that is able to learn from data) or simply model, train a model by adjusting the model parameters to get the best possible performance, both on the training (with minimum training error) and the test dat or new inputs (the trained model must be able to generalized well with minimum generalization error or test error) but the challenge in machine learning is how well does the trained model perform not just on the training data, but also on new unseen inputs (test inputs).This is a fundamental problem in machine learning between <b> optimization (the process of adjusting a model parameters to give the best possible performance on the training data) and generalization (how well does the trained model performs on newly unseen data) because a trained model can perform well well on the training dataset but performs poorly on newly unseen data points.</b>


# NOTE
------
<h3 style='color:blue'>The training error is the error of our model as calculated on the training dataset</h3>
<h3 style='color:blue'>The generalization error is the expected value of the error on a test or new data points drawn from the same underlying data distribution as our original sample</h3>


---
---

```
The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.

(source: From the book, Deep Learning by Ian Goodfellow,Yoshua Bengio and Aaron Courville, page 111) 
```

# Overftting

-------
When the complexity of the model is too high (highly flexible models) as compared to the underlying distribution of the data the model is trying to learn from, it tends to learn the noise present in data and is called overfitting. An Overfitted models has it training error much lower than validation error. <b>An overfitted model fails to Generalize well and has high Variance and Low Bias and the techniques used to combat overfitting are called regularization</b>.

# Underfitting
----
Underfitting occurs when the model can neither obtain sufficiently low error value on the training set nor generalize to new data and has low Variance and high Bias. Underfitted models are not able to reduce the training error. W

# Regularization
----

## Regularization are techniques used to combat overfitting  and this reduces the test error or generalization erro

```
 Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error 
 but not its training error
 
 (source: From the book, Deep Learning by Ian Goodfellow,Yoshua Bengio and Aaron Courville, page 120)
```

#  WEIGHT  REGULARIZATION
<img src='images/we.jpg'>
(source: From the book, Deep Learning by Ian Goodfellow,Yoshua Bengio and Aaron Courville, page 120)
<img src='images/weight.jpg'>
(source: From the book, Deep Learning with python by François Chollet, page 107)


# 1. Weight decay is also known as L2 regularization or ridge regression or Tikhonov regularization

<b>L2 regularization is also called weight decay in the context of neural networks</b> prevent the weights from growing too large unless it is really necessary. It can be realized by
adding a term to the cost OR objective function that penalizes large weights and is defined as

$$\tilde{\ell}(w)=\ell(w) + \frac{\lambda}{2}w^{2} $$

where $ \tilde{\ell}$ is the regularized cost fucbtion $\ell_{0}$ is an error measure (usually the sum of squared errors) and $\lambda$ is a hyperparameter chosen ahead of time that controls how weights are penalized. (weights the relative contribution of the norm penalty term $w^{2} $  relative to the standard objective function $\ell$)



with the corresponding parameter gradient
$$\bigtriangledown \tilde{\ell}_{w}(w)=\bigtriangledown \ell_{w}(w) + \lambda w $$

The new updated weight after an iteration can be expressed as
$$w=w-\eta \bigtriangledown\tilde{\ell}_{w}(w)=w-\eta(\lambda w +\bigtriangledown \ell_{w}(w)) $$

$$ w=(1- \eta\lambda) w -\eta \bigtriangledown \ell_{w}(w)) $$
where $\eta$ is the learning rate


The addition of the weight decay term has modified the learning rule of the weight vector by a constant factor on each step just before updating the weights

For linear regression, the objective function, sum of squared errors is defined as
$$e=(Xw-y)^{T}(Xw-y)$$

When L2 regularization is added, the objective function changes to
$$e=(Xw-y)^{T}(Xw-y)+ \frac{\lambda}{2}w^{2}$$

and this the solution $w$ from
$$ w=(XX^{T})^{-1}X^{T}y $$

$$ To$$

$$ w=(XX^{T}   + \lambda  I )^{-1}X^{T}y $$

Where the diagonal entries of this matrix $ \lambda  I $ correspond to the variance of each input feature

<img src='images/lp.jpg'>
 (source: From the book am using: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola page 155-156)

For more on the effects of weight regularization 
<a href='papers/563-a-simple-weight-decay-can-improve-generalization.pdf'>A Simple Weight Decay Can Improve Generalization by Anders Krogh and John A. Hertz</a>

# High-Dimensional Linear Regression

In [1]:
import torch
import numpy as np
import torch.nn as nn
from torch.autograd import Variable
import random
import torch.nn.functional as F
from torch.utils.data import TensorDataset,DataLoader
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
torch.manual_seed(1000)

<torch._C.Generator at 0x259031b2bd0>

<img src='images/highp.jpg'>
(source: From the book am using: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola page 156)

In [2]:
def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X =torch.normal(0,0.01,(num_examples, w.shape[0]))
    y=X@w+b + torch.zeros(((X@w+b).shape)).normal_(mean=0,std=0.01)
    return X, y.reshape(-1,1)

In [3]:
batch_size=5
lr=0.05
lambd=1e-5

In [4]:
n_train, n_test, num_inputs= 20, 100, 200
true_w, true_b=torch.zeros((num_inputs,1))*0.01,0.05

In [5]:
features, labels= synthetic_data(true_w, true_b, n_train)
x_test,y_test = synthetic_data(true_w, true_b, n_test)

In [6]:
labels[0:3]

tensor([[0.0458],
        [0.0567],
        [0.0482]])

In [7]:
def data_iter(features,labels,batch_size):
    dataset=TensorDataset(*(features,labels))
    data_loader=DataLoader(dataset=dataset,batch_size=batch_size,shuffle=True)
    return data_loader

In [8]:
train_iter=data_iter(features,labels,batch_size=batch_size)
test_iter=data_iter(x_test,y_test,batch_size=batch_size)

In [9]:
for x,y in train_iter:
    print(x.shape)
    print(y.shape)
    print(y)
    break

torch.Size([5, 200])
torch.Size([5, 1])
tensor([[0.0441],
        [0.0482],
        [0.0567],
        [0.0618],
        [0.0569]])


# Initializing Model Parameters

In [10]:
def init_params():
    w_1=torch.normal(0,0.01,(num_inputs,5))
    b_1= torch.zeros(5)
    w_2=torch.normal(0,0.01,(5,1))
    b_2= torch.zeros(1)
    w_1.requires_grad_(True)
    b_1.requires_grad_(True)
    w_2.requires_grad_(True)
    b_2.requires_grad_(True)
    return [w_1,w_2,b_1,b_2]
w_1,w_2, b_1,b_2 = init_params()

# L2 REGULARIZATION
$$\tilde{\ell}(w)=\ell(w) + \frac{\lambda}{2}w^{2} $$

In [11]:
def l2_regularizer(y_hat, y,w_1,w_2,lambd=lambd):
    square_error=(y_hat - y.reshape(y_hat.shape)) ** 2 
    l2_penalty=(lambd*((w_1**2).sum()+(w_2**2).sum()))/2 
    return square_error + l2_penalty  

# Defining the Model

In [12]:
def relu(x):
    a=torch.zeros_like(x)
    return torch.max(a,x)
def Linear(X,w_1,w_2,b_1,b_2):
    l1=torch.matmul(X,w_1)+b_1
    h1=relu(l1)
    output=torch.matmul(h1,w_2)+b_2
    return output
net=Linear

In [13]:
def sgd(params,lr,batch_size):
    for param in params:
        param.data.sub_(lr*param.grad/batch_size)
        param.grad.data.zero_()

In [14]:
num_epochs = 50# Number of iterations
for epoch in range(num_epochs):
    epoch+=1
    for X,y in train_iter:
        pred=net(X,w_1,w_2,b_1,b_2)
        loss=l2_regularizer(pred,y,w_1,w_2)
    loss.sum().backward()
    sgd([w_1,w_2,b_1,b_2],lr=lr,batch_size=batch_size)
    with torch.no_grad():
        train_l = l2_regularizer(net(features, w_1,w_2,b_1,b_2), labels,w_1,w_2)
        test_l = l2_regularizer(net(x_test, w_1,w_2,b_1,b_2),y_test,w_1,w_2)
    if epoch%5==0:
        print('epoch : %d, training loss :%f testing loss :%f' % (epoch + 1,
                                                                  train_l.mean().numpy(), 
                                                                  test_l.mean().numpy()))

epoch : 6, training loss :0.001005 testing loss :0.000904
epoch : 11, training loss :0.000455 testing loss :0.000394
epoch : 16, training loss :0.000223 testing loss :0.000190
epoch : 21, training loss :0.000126 testing loss :0.000114
epoch : 26, training loss :0.000105 testing loss :0.000101
epoch : 31, training loss :0.000100 testing loss :0.000099
epoch : 36, training loss :0.000097 testing loss :0.000099
epoch : 41, training loss :0.000097 testing loss :0.000099
epoch : 46, training loss :0.000097 testing loss :0.000099
epoch : 51, training loss :0.000097 testing loss :0.000099


# CONCISE IMPLEMENTATION
-----
------

In [15]:
class LR(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1=nn.Linear(200,5)
        self.relu=nn.ReLU()
        self.linear2=nn.Linear(5,1)
    def forward(self,x):
        h1=self.relu(self.linear1(x))
        output=self.linear2(h1)
        return output
nett=LR()

In [16]:
opt=torch.optim.SGD(nett.parameters(),lr=lr,weight_decay=lambd)

In [17]:
mse_err=nn.MSELoss()

In [18]:
for epoch in range(50):
    epoch +=1
    for XX,yy in train_iter:
        prd=nett(XX)
        #pred=float(pred)
        l2_l=mse_err(prd,yy)
    opt.zero_grad()
    l2_l.backward()
    opt.step()
    with torch.no_grad():
        train_l = mse_err(nett(features), labels)
        test_l = mse_err(nett(x_test),y_test)
    if epoch%5==0:
        print('epoch : %d, training loss :%f testing loss :%f' % (epoch + 1, train_l.mean().numpy(), test_l.mean().numpy()))

epoch : 6, training loss :0.022875 testing loss :0.022264
epoch : 11, training loss :0.006335 testing loss :0.006019
epoch : 16, training loss :0.001766 testing loss :0.001605
epoch : 21, training loss :0.000487 testing loss :0.000412
epoch : 26, training loss :0.000211 testing loss :0.000173
epoch : 31, training loss :0.000125 testing loss :0.000110
epoch : 36, training loss :0.000114 testing loss :0.000106
epoch : 41, training loss :0.000117 testing loss :0.000107
epoch : 46, training loss :0.000114 testing loss :0.000106
epoch : 51, training loss :0.000110 testing loss :0.000106


In [19]:
model=nn.Sequential(nn.Linear(num_inputs,5),
                    nn.ReLU(),
                    nn.Linear(5,1))

In [20]:
mse_loss=nn.MSELoss()

In [21]:
optm=torch.optim.SGD([{'params':model[0].weight,'weight_decay':lambd},{'params':model[0].bias}],lr=lr)

In [22]:
for epoch in range(150):
    epoch+=1
    for x,y in train_iter:
        model.train()
        pred=model(x)
        los=mse_loss(pred,y)
        optm.zero_grad()
        los.backward()
        optm.step()
    if epoch%5==0:
        print('epocgh %d, loss %f'%(epoch,los))

epocgh 5, loss 0.008473
epocgh 10, loss 0.006948
epocgh 15, loss 0.003955
epocgh 20, loss 0.002995
epocgh 25, loss 0.002086
epocgh 30, loss 0.001252
epocgh 35, loss 0.000875
epocgh 40, loss 0.000365
epocgh 45, loss 0.000823
epocgh 50, loss 0.000311
epocgh 55, loss 0.000140
epocgh 60, loss 0.000262
epocgh 65, loss 0.000145
epocgh 70, loss 0.000044
epocgh 75, loss 0.000255
epocgh 80, loss 0.000174
epocgh 85, loss 0.000055
epocgh 90, loss 0.000080
epocgh 95, loss 0.000042
epocgh 100, loss 0.000158
epocgh 105, loss 0.000071
epocgh 110, loss 0.000019
epocgh 115, loss 0.000103
epocgh 120, loss 0.000048
epocgh 125, loss 0.000186
epocgh 130, loss 0.000070
epocgh 135, loss 0.000134
epocgh 140, loss 0.000156
epocgh 145, loss 0.000147
epocgh 150, loss 0.000040


In [23]:
predicted=model(x_test)

In [24]:
y_true=y_test.numpy()
predicted=predicted.data.numpy()

In [25]:
predicted=np.array(predicted)
y_true=np.array(y_true)

In [26]:
d=np.concatenate((y_true,predicted),1)

In [27]:
data=pd.DataFrame(d,columns=['y_true','predicted'])

In [28]:
data.head(50)

Unnamed: 0,y_true,predicted
0,0.036797,0.051809
1,0.062678,0.049475
2,0.049847,0.052281
3,0.06247,0.052692
4,0.066164,0.049744
5,0.051938,0.052134
6,0.043664,0.05087
7,0.059681,0.051552
8,0.052711,0.048505
9,0.050714,0.052297
