# Model Selection, Underfitting and Overfitting

-------

The goal of machine learning is to find a learning algorithm (an algorithm that is able to learn from data), or simply a model, and to train it by adjusting its parameters to get the best possible performance on both the training data (minimum training error) and the test data or new inputs (minimum generalization error, also called test error). The challenge is how well the trained model performs not just on the training data but also on new, unseen inputs. This is the fundamental tension in machine learning between <b>optimization (the process of adjusting a model's parameters to give the best possible performance on the training data) and generalization (how well the trained model performs on previously unseen data), because a trained model can perform well on the training dataset yet poorly on unseen data points.</b>


# NOTE
------
<h3 style='color:blue'>The training error is the error of our model as calculated on the training dataset</h3>
<h3 style='color:blue'>The generalization error is the expected value of the error on test or new data points drawn from the same underlying data distribution as our original sample</h3>


---
---

```
The factors determining how well a machine learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting.

(source: Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, page 111)
```

# Overfitting

-------
When the complexity of the model is too high (a highly flexible model) compared to the underlying distribution of the data it is trying to learn from, the model tends to learn the noise present in the data; this is called overfitting. An overfitted model has a training error much lower than its validation error. <b>An overfitted model fails to generalize well and has high variance and low bias. The techniques used to combat overfitting are called regularization</b>.

# Underfitting
----
Underfitting occurs when the model can neither obtain a sufficiently low error on the training set nor generalize to new data; such a model has low variance and high bias. Underfitted models are not able to reduce the training error.
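
The contrast between the two failure modes can be sketched with polynomial fits of increasing degree. This is a minimal illustration and not taken from the cited books: the target function `sin(pi * x)`, the noise level, the sample sizes and the degrees are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Sample n noisy points from y = sin(pi * x) on [-1, 1] (hypothetical target)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=n)
    return x, y

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # fresh points from the same distribution

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # least-squares fit
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    train_err, test_err = errors[degree]
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}, "
          f"gap {test_err - train_err:.4f}")
```

A degree that is too low leaves the training error high (underfitting). A degree that is too high typically drives the training error down while the gap to the test error widens (overfitting). The exact numbers depend on the random seed.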

# Regularization
----

## Regularization refers to techniques used to combat overfitting, which reduces the test error or generalization error

```
 Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error 
 but not its training error
 
 (source: Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, page 120)
```

#  WEIGHT  REGULARIZATION
<img src='images/we.jpg'>
(source: Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville, page 120)
<img src='images/weight.jpg'>
(source: Deep Learning with Python by François Chollet, page 107)


# 1. Weight decay is also known as L2 regularization or ridge regression or Tikhonov regularization

<b>L2 regularization, called weight decay in the context of neural networks,</b> prevents the weights from growing too large unless it is really necessary. It can be realized by adding a term to the cost (objective) function that penalizes large weights, defined as

$$\tilde{\ell}(w)=\ell(w) + \frac{\lambda}{2}\|w\|_{2}^{2} $$

where $\tilde{\ell}$ is the regularized cost function, $\ell$ is an error measure (usually the sum of squared errors), and $\lambda$ is a hyperparameter chosen ahead of time that controls how strongly large weights are penalized (it weights the contribution of the norm penalty term $\|w\|_{2}^{2}=w^{T}w$ relative to the standard objective function $\ell$).



with the corresponding parameter gradient
$$\nabla_{w}\tilde{\ell}(w)=\nabla_{w}\ell(w) + \lambda w $$

The new updated weight after an iteration can be expressed as
$$w \leftarrow w-\eta \nabla_{w}\tilde{\ell}(w)=w-\eta\left(\lambda w +\nabla_{w}\ell(w)\right) $$

$$ w \leftarrow (1- \eta\lambda) w -\eta \nabla_{w}\ell(w) $$
where $\eta$ is the learning rate.


The weight decay term modifies the learning rule so that the weight vector is multiplicatively shrunk by the constant factor $(1-\eta\lambda)$ on each step, just before the usual gradient update is applied.
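
The equivalence of the two forms of the update can be checked numerically. The sketch below is only an illustration: the weight vector, the stand-in loss gradient `grad_loss` and the values of `eta` and `lam` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)            # current weights (arbitrary)
grad_loss = rng.normal(size=5)    # stand-in for the gradient of the unregularized loss at w
eta, lam = 0.1, 0.5               # learning rate and regularization strength (assumed values)

# Form 1: one gradient step on the regularized objective.
step_on_regularized = w - eta * (grad_loss + lam * w)

# Form 2: shrink the weights by (1 - eta * lam), then take the usual gradient step.
shrink_then_step = (1 - eta * lam) * w - eta * grad_loss

print(np.allclose(step_on_regularized, shrink_then_step))

# If the loss gradient were zero, repeated updates would shrink w geometrically.
w_k = w.copy()
for _ in range(10):
    w_k = (1 - eta * lam) * w_k
print(np.linalg.norm(w_k) / np.linalg.norm(w))   # equals (1 - eta * lam) ** 10
```

The second part shows where the name comes from: with no pull from the data, the weights decay toward zero at a constant rate per step.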

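For linear regression, the objective function (half the sum of squared errors) is defined as
$$e=\frac{1}{2}(Xw-y)^{T}(Xw-y)$$

The factor $\frac{1}{2}$ only rescales the objective and leaves its minimizer unchanged. When L2 regularization is added, the objective function changes to
$$e=\frac{1}{2}(Xw-y)^{T}(Xw-y)+ \frac{\lambda}{2}w^{T}w$$

and this changes the solution $w$ from
$$ w=(X^{T}X)^{-1}X^{T}y $$

to

$$ w=(X^{T}X + \lambda I)^{-1}X^{T}y $$

The matrix $X^{T}X$ is proportional to the covariance matrix of the inputs, and its diagonal entries correspond to the variance of each input feature. Adding $\lambda I$ raises those diagonal entries, so L2 regularization makes the learning algorithm perceive the inputs as having higher variance and shrinks the weights accordingly.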
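
The closed-form ridge solution can be sanity-checked numerically. The sketch below assumes the objective $\frac{1}{2}(Xw-y)^{T}(Xw-y)+\frac{\lambda}{2}w^{T}w$, whose gradient is $X^{T}(Xw-y)+\lambda w$; the data, the dimensions and the helper names `ridge_solution` and `objective_gradient` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5                                   # assumed sample count and feature count
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=0.1, size=n)

def ridge_solution(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y without forming an explicit inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def objective_gradient(X, y, w, lam):
    """Gradient of 0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2 with respect to w."""
    return X.T @ (X @ w - y) + lam * w

lam = 10.0
w_ols = ridge_solution(X, y, 0.0)              # lam = 0 recovers ordinary least squares
w_ridge = ridge_solution(X, y, lam)
grad_ridge = objective_gradient(X, y, w_ridge, lam)

print("gradient norm at ridge solution:", np.linalg.norm(grad_ridge))
print("||w_ols||  :", np.linalg.norm(w_ols))
print("||w_ridge||:", np.linalg.norm(w_ridge))
```

The gradient vanishes (up to rounding) at the ridge solution, and the ridge weights have a smaller norm than the unregularized ones, which is the shrinkage effect described above.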

<img src='../images/lp.jpg'>
 (source: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, pages 155-156)

For more on the effects of weight regularization, see
<a href='https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.465.1947&rep=rep1&type=pdf'>A Simple Weight Decay Can Improve Generalization by Anders Krogh and John A. Hertz</a>

# High-Dimensional Linear Regression

In [1]:
import d2l
import tensorflow as tf
import keras
from keras import layers
from keras import models

Using TensorFlow backend.


<img src='../images/highp.jpg'>
(source: Dive into Deep Learning by Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, page 156)

In [2]:
def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X = tf.zeros((num_examples, w.shape[0]))
    X += tf.random.normal(shape=X.shape)
    y=X@w+b
    y += tf.random.normal(shape=y.shape, stddev=0.01)
    y = tf.reshape(y, (-1, 1))
    return X, y

In [3]:
n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
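# tf.zeros(...) * 0.01 would give all-zero true weights; ones * 0.01 sets every true weight to 0.01
true_w, true_b = tf.ones((num_inputs, 1)) * 0.01, 0.05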

In [4]:
features, labels= synthetic_data(true_w, true_b, n_train)
x_test,y_test = synthetic_data(true_w, true_b, n_test)

In [5]:
features.shape

TensorShape([20, 200])

In [6]:
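class L2(keras.regularizers.Regularizer):
    """L2 penalty: strength * sum of squared weights."""
    def __init__(self, strength):
        self.strength = strength
    def __call__(self, w):
        # Keras invokes a regularizer through __call__; a method named `call` is never used
        return self.strength * tf.reduce_sum(tf.square(w))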
    
def mean_squared_error(y_true, y_pred):
    return tf.math.reduce_mean(tf.square(y_true - y_pred))

def mean_absolute_error(y_true,y_pred):
    absolute=keras.backend.abs(y_true-y_pred)
    return keras.backend.mean(absolute,axis=-1)


In [7]:
net=keras.models.Sequential()
net.add(keras.layers.Dense(10,activation='relu',input_shape=(200,),use_bias=True,bias_initializer='zeros',
                           kernel_regularizer=L2(0.05)))
net.add(keras.layers.Dense(5,activation='relu',kernel_regularizer=L2(0.01)))
net.add(keras.layers.Dense(1))
net.compile(optimizer='sgd',loss=mean_squared_error,metrics=[mean_absolute_error])

In [8]:
net.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 10)                2010      
_________________________________________________________________
dense_2 (Dense)              (None, 5)                 55        
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 6         
Total params: 2,071
Trainable params: 2,071
Non-trainable params: 0
_________________________________________________________________


In [9]:
net.fit(features,labels,steps_per_epoch=5,validation_data=(x_test,y_test),validation_steps=5,epochs=20)

Train on 20 samples, validate on 100 samples
Epoch 1/20
...
Epoch 20/20


<keras.callbacks.callbacks.History at 0x1723cc0bfd0>
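
To see what the penalty does in this high-dimensional setting without depending on a particular Keras version, the experiment can be sketched in plain NumPy with a single linear layer. This is an assumption-laden illustration, not the notebook's network: it uses full-batch gradient descent, random normal initial weights, a mean-squared-error objective, and the hypothetical names `synthetic_data_np` and `train`. The values `lam=3.0`, `lr=0.003` and `steps=2000` are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, num_inputs = 20, 100, 200
true_w, true_b = 0.01 * np.ones(num_inputs), 0.05

def synthetic_data_np(n):
    """Generate y = Xw + b + noise with standard normal features."""
    X = rng.normal(size=(n, num_inputs))
    y = X @ true_w + true_b + rng.normal(scale=0.01, size=n)
    return X, y

X_tr, y_tr = synthetic_data_np(n_train)
X_te, y_te = synthetic_data_np(n_test)

def train(lam, w0, lr=0.003, steps=2000):
    """Full-batch gradient descent on 0.5 * mean squared error + 0.5 * lam * ||w||^2."""
    w, b = w0.copy(), 0.0
    for _ in range(steps):
        err = X_tr @ w + b - y_tr
        grad_w = X_tr.T @ err / n_train + lam * w   # the penalty adds lam * w to the gradient
        grad_b = err.mean()                          # the bias is not penalized
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(w, b, X, y):
    return float(np.mean((X @ w + b - y) ** 2))

w0 = rng.normal(size=num_inputs)                     # same starting point for both runs
w_plain, b_plain = train(lam=0.0, w0=w0)
w_decay, b_decay = train(lam=3.0, w0=w0)

test_mse_plain = mse(w_plain, b_plain, X_te, y_te)
test_mse_decay = mse(w_decay, b_decay, X_te, y_te)
print("no decay  : ||w|| =", np.linalg.norm(w_plain), " test MSE =", test_mse_plain)
print("with decay: ||w|| =", np.linalg.norm(w_decay), " test MSE =", test_mse_decay)
```

With only 20 training examples for 200 inputs, most directions of the weight vector are not constrained by the data. Without the penalty, the random initial values in those directions survive training and hurt the test error, whereas weight decay shrinks them toward zero.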