# Model Selection, Underfitting and Overfitting

-------

The goal of machine learning is to find a learning algorithm (an algorithm that is able to learn from data), or simply a model, and to train it by adjusting its parameters so that it performs as well as possible both on the training data (minimum training error) and on test data or new inputs (minimum generalization error, also called test error). The challenge is that what matters is how well the trained model performs not just on the training data but also on new, unseen inputs. This is the fundamental tension in machine learning between <b>optimization (the process of adjusting a model's parameters to give the best possible performance on the training data) and generalization (how well the trained model performs on previously unseen data), because a trained model can perform well on the training dataset yet poorly on unseen data points.</b>


# NOTE
------
<h3 style='color:blue'>The training error is the error of our model as calculated on the training dataset</h3>
<h3 style='color:blue'>The generalization error is the expected value of the error on test or new data points drawn from the same underlying data distribution as our original sample</h3>


---
---

```
The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.

(source: From the book, Deep Learning by Ian Goodfellow,Yoshua Bengio and Aaron Courville, page 111) 
```

# Overfitting

-------
When the model is too complex (too flexible) relative to the underlying distribution of the data it is trying to learn from, it tends to learn the noise present in the data; this is called overfitting. An overfitted model has a training error much lower than its validation error. <b>An overfitted model fails to generalize well and has high variance and low bias. The techniques used to combat overfitting are called regularization</b>.

# Underfitting
----
Underfitting occurs when the model can neither obtain a sufficiently low error on the training set nor generalize to new data; such a model has low variance and high bias. Underfitted models are not able to reduce the training error.
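The two regimes can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration: the noisy sine data, the polynomial degrees 1, 3 and 9, and the helper names `noisy_sine` and `mse`. A degree-1 polynomial is too rigid to reduce the training error (underfitting), while a degree-9 polynomial passes through all 10 training points, so its training error is essentially zero and the gap to the test error is what remains (overfitting).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_sine(x):
    # Assumed data-generating process: a sine wave plus Gaussian noise.
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

x_train = np.linspace(0.0, 1.0, 10)    # small training set
y_train = noisy_sine(x_train)
x_test = np.linspace(0.0, 1.0, 200)    # new points from the same distribution
y_test = noisy_sine(x_test)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

The training error can only go down as the degree grows, so it cannot be used on its own to pick the model; the test (or validation) error has to be inspected as well.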

# Regularization
----

## Regularization refers to techniques used to combat overfitting, thereby reducing the test error (generalization error)

```
 Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error 
 but not its training error
 
 (source: From the book, Deep Learning by Ian Goodfellow,Yoshua Bengio and Aaron Courville, page 120)
```

#  WEIGHT  REGULARIZATION
<img src='images/we.jpg'>
(source: From the book, Deep Learning by Ian Goodfellow,Yoshua Bengio and Aaron Courville, page 120)
<img src='images/weight.jpg'>
(source: From the book, Deep Learning with python by François Chollet, page 107)


# 1. Weight decay is also known as L2 regularization or ridge regression or Tikhonov regularization

<b>L2 regularization, also called weight decay in the context of neural networks,</b> prevents the weights from growing too large unless it is really necessary. It can be realized by
adding a term to the cost (objective) function that penalizes large weights:

$$\tilde{\ell}(w)=\ell(w) + \frac{\lambda}{2}\|w\|^{2} $$

where $\tilde{\ell}$ is the regularized cost function, $\ell$ is an error measure (usually the sum of squared errors), and $\lambda$ is a hyperparameter chosen ahead of time that controls how strongly large weights are penalized (it weights the contribution of the norm penalty term $\|w\|^{2}=w^{T}w$ relative to the standard objective function $\ell$).



with the corresponding parameter gradient
$$\nabla_{w} \tilde{\ell}(w)=\nabla_{w} \ell(w) + \lambda w $$

The updated weight after one gradient step can be expressed as
$$w \leftarrow w-\eta \nabla_{w}\tilde{\ell}(w)=w-\eta(\lambda w +\nabla_{w} \ell(w)) $$

$$ w \leftarrow (1- \eta\lambda) w -\eta \nabla_{w} \ell(w) $$
where $\eta$ is the learning rate.


The addition of the weight decay term modifies the learning rule so that the weight vector is multiplicatively shrunk by a constant factor $(1-\eta\lambda)$ on each step, just before the usual gradient update is performed.
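As a quick numerical check of this equivalence, here is a minimal NumPy sketch. The function names and the example values of `w`, `grad`, `lr` and `lam` are hypothetical; `grad` stands in for the gradient of the unregularized loss.

```python
import numpy as np

def regularized_gradient_step(w, grad, lr, lam):
    """One gradient step on loss(w) + (lam / 2) * ||w||^2."""
    return w - lr * (grad + lam * w)

def weight_decay_step(w, grad, lr, lam):
    """The same step written as: shrink the weights, then apply the plain gradient."""
    return (1.0 - lr * lam) * w - lr * grad

w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, 0.0, -0.3])  # hypothetical gradient of the unregularized loss
lr, lam = 0.01, 4.0

step_a = regularized_gradient_step(w, grad, lr, lam)
step_b = weight_decay_step(w, grad, lr, lam)
print(np.allclose(step_a, step_b))  # True
```

With these values the shrink factor is `1 - 0.01 * 4 = 0.96`, so each step removes 4% of every weight before the gradient is applied.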

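For linear regression, the objective function is the sum of squared errors, here scaled by $\frac{1}{2}$ so that the constants cancel in the gradient:
$$e=\frac{1}{2}(Xw-y)^{T}(Xw-y)$$

When L2 regularization is added, the objective function changes to
$$e=\frac{1}{2}(Xw-y)^{T}(Xw-y)+ \frac{\lambda}{2}w^{T}w$$

and this changes the solution $w$ from
$$ w=(X^{T}X)^{-1}X^{T}y $$

to

$$ w=(X^{T}X + \lambda I )^{-1}X^{T}y $$

For centered features, the matrix $X^{T}X$ is proportional to the covariance matrix of the inputs, and its diagonal entries correspond to the variance of each input feature. Adding $\lambda I$ increases these diagonal entries, so L2 regularization makes the learning algorithm perceive the inputs as having higher variance, which shrinks the weights.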
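The closed-form solution can be checked numerically. The sketch below assumes the objective $\frac{1}{2}(Xw-y)^{T}(Xw-y)+\frac{\lambda}{2}w^{T}w$, whose gradient $X^{T}(Xw-y)+\lambda w$ vanishes at $w=(X^{T}X+\lambda I)^{-1}X^{T}y$. The data, sizes and the value of `lam` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5                       # more examples than features, so X^T X is invertible
X = rng.normal(size=(n, d))
w_true = rng.normal(size=(d, 1))
y = X @ w_true + 0.1 * rng.normal(size=(n, 1))
lam = 10.0

# Ordinary least squares: solve (X^T X) w = X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge regression: solve (X^T X + lam * I) w = X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient of the regularized objective, evaluated at the ridge solution
grad_at_ridge = X.T @ (X @ w_ridge - y) + lam * w_ridge

print("||w_ols||   =", np.linalg.norm(w_ols))
print("||w_ridge|| =", np.linalg.norm(w_ridge))
print("max |gradient| at ridge solution =", np.abs(grad_at_ridge).max())
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice; the ridge solution always has a norm no larger than the least-squares one.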

<img src='images/lp.jpg'>
(source: From the book I am using: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, pages 155-156)

For more on the effects of weight regularization, see
<a href='papers/563-a-simple-weight-decay-can-improve-generalization.pdf'>A Simple Weight Decay Can Improve Generalization by Anders Krogh and John A. Hertz</a>

# High-Dimensional Linear Regression

In [1]:
import d2l
from mxnet import gluon, npx,np,init,autograd
from mxnet.gluon import nn
import mxnet
%matplotlib inline
npx.set_np()

<img src='images/highp.jpg'>
(source: From the book I am using: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, page 156)

In [2]:
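n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
# np.zeros(...) * 0.01 is still all zeros, so every true weight is 0 here
# (np.ones(...) * 0.01 would give true weights of 0.01 instead).
true_w, true_b = np.zeros((num_inputs, 1)) * 0.01, 0.05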

In [3]:
features, y_train = d2l.synthetic_data(true_w, true_b, n_train)
train_iter = d2l.load_array((features, y_train), batch_size)
test_data = d2l.synthetic_data(true_w, true_b, n_test)
test_iter = d2l.load_array(test_data, batch_size, is_train=False)

# Defining our model

In [4]:
net=gluon.nn.Sequential()
#net.add(gluon.nn.Dense(20,activation='relu'))
net.add(gluon.nn.Dense(1))

# Initializing Model Parameters

In [5]:
net.initialize(mxnet.init.Normal())

# L2 regularization

# gluon.loss.L2Loss
L2Loss(Loss): Calculates the mean squared error between `label` and `pred`.

  $$ L = \frac{1}{2} \sum_i \vert {label}_i - {pred}_i \vert^2$$
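To make the formula concrete, here is a minimal NumPy sketch of it. It is not Gluon's implementation, and the helper name `half_squared_error` is hypothetical. It returns one value per example, which is why the training loop further down calls `.mean()` on the result.

```python
import numpy as np

def half_squared_error(pred, label):
    """Per-example 0.5 * |label - pred|^2, summed over any non-batch dimensions."""
    diff = np.asarray(label, dtype=float) - np.asarray(pred, dtype=float)
    return 0.5 * (diff ** 2).reshape(len(diff), -1).sum(axis=1)

pred = np.array([[1.0], [2.0], [4.0]])
label = np.array([[1.0], [0.0], [1.0]])
losses = half_squared_error(pred, label)
print(losses.tolist())  # [0.0, 2.0, 4.5]
```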

In [6]:
l2_loss=gluon.loss.L2Loss()

# Optimizer

In [7]:
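# Only parameters whose names match '.*weight' are passed to the trainer, so weight decay
# (wd) applies to the weights and the bias is left at its initial value in this run.
optimizer = gluon.Trainer(net.collect_params('.*weight'), 'sgd', {'learning_rate': 0.01, 'wd': 4})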

# TRAINING LOOP

In [8]:
num_epochs = 50  # Number of passes over the training data
batch_size = 5
for epoch in range(num_epochs):
    train_loss = 0
    for X, y in train_iter:
        with autograd.record():
            y_hat = net(X)
            loss_val = l2_loss(y_hat, y)  # Per-example loss on the minibatch X, y
        loss_val.backward()  # Compute gradients of the loss with respect to the parameters
        optimizer.step(batch_size=batch_size)
        train_loss += loss_val.mean().asnumpy()  # Accumulate the mean loss of each minibatch
    val_loss = 0
    for x_test, y_test in test_iter:
        pred = net(x_test)
        va_loss = l2_loss(pred, y_test)
        val_loss += va_loss.mean().asnumpy()
    # Both totals are sums of per-minibatch mean losses (4 training and 20 test minibatches)
    print('epoch %d, training loss: %f, validation loss: %f' % (epoch, train_loss, val_loss))

epoch 0, training loss: 0.027003, validation loss: 0.233873
epoch 1, training loss: 0.010031, validation loss: 0.175761
epoch 2, training loss: 0.004061, validation loss: 0.135716
epoch 3, training loss: 0.001936, validation loss: 0.107246
epoch 4, training loss: 0.001228, validation loss: 0.086421
epoch 5, training loss: 0.000903, validation loss: 0.071606
epoch 6, training loss: 0.000740, validation loss: 0.060605
epoch 7, training loss: 0.000696, validation loss: 0.052405
epoch 8, training loss: 0.000671, validation loss: 0.046444
epoch 9, training loss: 0.000657, validation loss: 0.042019
epoch 10, training loss: 0.000658, validation loss: 0.038791
epoch 11, training loss: 0.000649, validation loss: 0.036291
epoch 12, training loss: 0.000657, validation loss: 0.034500
epoch 13, training loss: 0.000634, validation loss: 0.033120
epoch 14, training loss: 0.000617, validation loss: 0.031939
epoch 15, training loss: 0.000641, validation loss: 0.031015
epoch 16, training loss: 0.000605,

In [9]:
net=gluon.nn.Sequential()
#net.add(gluon.nn.Dense(20,activation='relu'))
net.add(gluon.nn.Dense(1))

In [10]:
net.initialize(mxnet.init.Normal(sigma=0.01))

In [11]:
# Weight decay is applied to the weight parameters. Weight names generally end with "weight".
trainer_w = gluon.Trainer(net.collect_params('.*weight'), 'sgd', {'learning_rate': 0.01, 'wd': 4})
# Weight decay is not applied to the bias parameters. Bias names generally end with "bias".
# The biases require less data to fit accurately than the weights.
trainer_b = gluon.Trainer(net.collect_params('.*bias'), 'sgd', {'learning_rate': 0.01})
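The idea of decaying the weights but not the bias can also be sketched in plain NumPy, without MXNet. This is an illustrative stand-in, not the Gluon trainers above. The data setup (true weights of 0.01, bias 0.05, noise with standard deviation 0.01) and the helper names `make_data`, `half_mse` and `train` are all assumptions.

```python
import numpy as np

# Assumed setup: 20 training examples, 200 inputs, true weights of 0.01,
# true bias of 0.05, and Gaussian label noise with standard deviation 0.01.
n_train, n_test, num_inputs = 20, 100, 200
data_rng = np.random.default_rng(0)
true_w, true_b = np.full((num_inputs, 1), 0.01), 0.05

def make_data(n):
    X = data_rng.normal(size=(n, num_inputs))
    y = X @ true_w + true_b + data_rng.normal(scale=0.01, size=(n, 1))
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

def half_mse(X, y, w, b):
    return float(np.mean(0.5 * (X @ w + b - y) ** 2))

def train(wd, lr=0.01, num_epochs=50, batch_size=5):
    init_rng = np.random.default_rng(1)  # same initial weights for every run
    w = init_rng.normal(scale=0.01, size=(num_inputs, 1))
    b = 0.0
    for _ in range(num_epochs):
        for start in range(0, n_train, batch_size):
            Xb = X_train[start:start + batch_size]
            yb = y_train[start:start + batch_size]
            err = Xb @ w + b - yb
            grad_w = Xb.T @ err / len(Xb)  # gradient of the mean half squared error
            grad_b = float(err.mean())
            w = (1.0 - lr * wd) * w - lr * grad_w  # decay applied to the weights
            b = b - lr * grad_b                    # no decay on the bias
    return w, b

for wd in (0.0, 4.0):
    w, b = train(wd)
    print(f"wd={wd}: ||w||={np.linalg.norm(w):.4f}, "
          f"train={half_mse(X_train, y_train, w, b):.5f}, "
          f"test={half_mse(X_test, y_test, w, b):.5f}")
```

With far more inputs than training examples, the run without decay can fit the training set closely while keeping whatever the initial weights contained outside the span of the training inputs. The decayed run ends with a much smaller weight norm.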

In [12]:
num_epochs = 50  # Number of passes over the training data
batch_size = 5
for epoch in range(num_epochs):
    train_loss = 0
    for X, y in train_iter:
        with autograd.record():
            y_hat = net(X)
            loss_val = l2_loss(y_hat, y)  # Per-example loss on the minibatch X, y
        loss_val.backward()  # Compute gradients of the loss with respect to [w, b]
        trainer_w.step(batch_size=batch_size)  # Update the weights (with weight decay)
        trainer_b.step(batch_size=batch_size)  # Update the bias (without weight decay)
        train_loss += loss_val.mean().asnumpy()  # Accumulate the mean loss of each minibatch
    val_loss = 0
    for x_test, y_test in test_iter:
        pred = net(x_test)
        va_loss = l2_loss(pred, y_test)
        val_loss += va_loss.mean().asnumpy()
    # Both totals are sums of per-minibatch mean losses (4 training and 20 test minibatches)
    print('epoch %d, training loss: %f, validation loss: %f' % (epoch, train_loss, val_loss))

epoch 0, training loss: 0.025873, validation loss: 0.141982
epoch 1, training loss: 0.008845, validation loss: 0.107829
epoch 2, training loss: 0.003253, validation loss: 0.084527
epoch 3, training loss: 0.001566, validation loss: 0.067880
epoch 4, training loss: 0.000932, validation loss: 0.055731
epoch 5, training loss: 0.000678, validation loss: 0.046763
epoch 6, training loss: 0.000585, validation loss: 0.040098
epoch 7, training loss: 0.000507, validation loss: 0.035093
epoch 8, training loss: 0.000517, validation loss: 0.031386
epoch 9, training loss: 0.000473, validation loss: 0.028606
epoch 10, training loss: 0.000460, validation loss: 0.026327
epoch 11, training loss: 0.000463, validation loss: 0.024596
epoch 12, training loss: 0.000450, validation loss: 0.023218
epoch 13, training loss: 0.000450, validation loss: 0.022106
epoch 14, training loss: 0.000421, validation loss: 0.021108
epoch 15, training loss: 0.000430, validation loss: 0.020322
epoch 16, training loss: 0.000423,