One way to deal with overfitting is to constrain the model, a process known as regularization. In effect, we reduce the model's degrees of freedom, which makes it harder for it to overfit.
When the model is linear, we can achieve this by constraining the weights of the model.
If the model is polynomial, we can also constrain it by reducing the polynomial degree.
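
To make the effect of the polynomial degree concrete, here is a minimal sketch. It assumes a small synthetic dataset (the names `x_train`, `x_val`, and the noise level are illustrative, not taken from the notebook) and fits polynomials of increasing degree with NumPy's `Polynomial.fit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical noisy linear data, for illustration only
rng = np.random.default_rng(0)
x_train = 3 * rng.random(20)
y_train = 1 + 0.5 * x_train + rng.normal(scale=0.5, size=20)
x_val = 3 * rng.random(200)
y_val = 1 + 0.5 * x_val + rng.normal(scale=0.5, size=200)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

train_mse, val_mse = {}, {}
for degree in (1, 3, 10):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(model, x_train, y_train)
    val_mse[degree] = mse(model, x_val, y_val)
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.3f}  val MSE={val_mse[degree]:.3f}")
```

Training error can only go down as the degree grows, because every lower-degree polynomial is a special case of a higher-degree one. The validation error is the number to watch when choosing how far to constrain the model.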



#### Ridge Regression

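Ridge Regression is a regularized version of Linear Regression.
How it works:
During training (and only during training), a regularization term ($\alpha \sum_{i=1}^{n} \theta_i^2$) is added to the model's cost function.
This term forces the learning algorithm to fit the data while also keeping the model weights small. The hyperparameter $\alpha$ controls how strongly the model is regularized; with $\alpha = 0$, Ridge Regression reduces to plain Linear Regression.
After training, evaluate the model with the unregularized performance measure.
The Ridge Regression cost function can be defined as $J(\theta) = MSE(\theta) + \alpha {1 \over 2} \sum_{i=1}^{n}\theta_i^2$. The sum starts at $i = 1$, so the bias term $\theta_0$ is not regularized.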
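
As a rough illustration of this cost function, the sketch below evaluates $J(\theta)$ in NumPy. The function name `ridge_cost` and the convention that `X_b` carries a leading column of ones (so that `theta[0]` is the bias) are assumptions made for this example.

```python
import numpy as np

def ridge_cost(theta, X_b, y, alpha):
    """J(theta) = MSE(theta) + alpha * 1/2 * sum of theta_i^2 for i >= 1.

    X_b is assumed to include a leading column of ones, so theta[0] is the
    bias term and is left out of the penalty.
    """
    residuals = X_b @ theta - y
    mse = np.mean(residuals ** 2)
    penalty = alpha * 0.5 * np.sum(theta[1:] ** 2)
    return mse + penalty

# Tiny hand-checkable case: theta fits the two points exactly,
# so the cost is the penalty alone: 1.0 * 0.5 * 2.0**2 = 2.0
X_b_demo = np.array([[1.0, 0.0], [1.0, 1.0]])
y_demo = np.array([1.0, 3.0])
theta_demo = np.array([1.0, 2.0])
print(ridge_cost(theta_demo, X_b_demo, y_demo, alpha=1.0))
```

With `alpha=0` the same call returns the plain MSE, which is the unregularized measure to report after training.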

* Ridge Regression is sensitive to the scale of input features. Remember to scale the data first.
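
One common way to handle the scaling step is to chain a scaler and the regressor in a scikit-learn pipeline. The sketch below assumes two made-up features on very different scales (`X_raw`, `y_raw` are illustrative names) and uses `StandardScaler`; other scalers would work as well.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is about 1000x larger than the first
rng = np.random.default_rng(42)
X_raw = np.column_stack([rng.random(100), 1000 * rng.random(100)])
y_raw = 2 * X_raw[:, 0] + 0.003 * X_raw[:, 1] + rng.normal(scale=0.1, size=100)

# The scaler is fitted on the training data and applied again at predict time
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scaled_ridge.fit(X_raw, y_raw)
prediction = scaled_ridge.predict([[0.5, 500.0]])
print(prediction)
```

Because the penalty acts on the size of the weights, a feature measured in large units would otherwise need only a tiny weight and would be penalized far less than a feature measured in small units.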

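Ridge Regression can be performed in two ways:
* Computing the closed-form equation, which can be written as: $\hat\theta = {\biggr(\mathbf{X}^T \cdot \mathbf{X} + \alpha\mathbf{A} \biggr)}^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$, where $\mathbf{A}$ is the $(n+1) \times (n+1)$ identity matrix except for a 0 in the top-left cell, which corresponds to the bias term.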

In [7]:
import numpy as np
from sklearn.linear_model import Ridge

# Small noisy linear dataset: y = 1 + 0.5 * x + Gaussian noise
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)  # evenly spaced inputs in [0, 3]

# solver="cholesky" solves the closed-form equation using André-Louis Cholesky's matrix factorization
rregularizer = Ridge(alpha=1, solver="cholesky")
rregularizer.fit(X, y)
rregularizer.predict([[1.5]])

array([[ 1.39804895]])
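
The closed-form equation above can also be evaluated directly in NumPy. The following is a sketch under a few assumptions: the data are regenerated with a seeded generator (so the numbers differ from the cell above), and the names `X_cf`, `y_cf`, and `theta_hat` are introduced only for this example. scikit-learn's `Ridge` minimizes the sum of squared errors plus `alpha` times the squared weights, with the intercept left unpenalized, so it should agree with this calculation.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Seeded copy of the toy dataset (illustrative, not the notebook's exact values)
rng = np.random.default_rng(0)
m = 20
X_cf = 3 * rng.random((m, 1))
y_cf = 1 + 0.5 * X_cf + rng.normal(size=(m, 1)) / 1.5

alpha = 1.0
X_b = np.c_[np.ones((m, 1)), X_cf]  # prepend the bias feature x0 = 1
A = np.eye(X_b.shape[1])
A[0, 0] = 0                         # zero in the top-left: bias is not penalized

# Solve (X^T X + alpha A) theta = X^T y rather than forming the inverse
theta_hat = np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y_cf)

ridge = Ridge(alpha=alpha, solver="cholesky").fit(X_cf, y_cf)
print(theta_hat.ravel())
print(ridge.intercept_, ridge.coef_.ravel())
```

Using `np.linalg.solve` avoids computing an explicit matrix inverse, which is generally the more numerically stable choice.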

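* Using Gradient Descent. With `penalty='l2'`, `SGDRegressor` adds a regularization term based on the squared $\ell_2$ norm of the weight vector to the cost function, which is the Ridge penalty.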


In [8]:
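from sklearn.linear_model import SGDRegressor

# penalty='l2' selects the Ridge-style penalty; its strength is SGDRegressor's own alpha parameter
sgd_regularizer = SGDRegressor(penalty='l2', random_state=42)
sgd_regularizer.fit(X, y.ravel())  # SGDRegressor expects a 1-D target array
sgd_regularizer.predict([[1.5]])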

array([ 1.07209499])
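
To show what the gradient-descent route does under the hood, here is a minimal batch gradient descent sketch for the Ridge cost function $J(\theta) = MSE(\theta) + \alpha {1 \over 2} \sum_{i=1}^{n}\theta_i^2$. It is not how `SGDRegressor` is implemented (that class uses stochastic updates and its own learning-rate schedule); the function name, learning rate `eta`, and iteration count are assumptions chosen for this toy data.

```python
import numpy as np

def ridge_gradient_descent(X, y, alpha=1.0, eta=0.05, n_iterations=5000):
    """Batch gradient descent on MSE(theta) + alpha/2 * sum(theta[1:]**2)."""
    m = X.shape[0]
    X_b = np.c_[np.ones((m, 1)), X]  # prepend the bias feature x0 = 1
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_iterations):
        mse_grad = (2 / m) * X_b.T @ (X_b @ theta - y)
        penalty_grad = alpha * np.r_[0.0, theta[1:]]  # no penalty on the bias
        theta -= eta * (mse_grad + penalty_grad)
    return theta

# Seeded toy data (illustrative, not the notebook's exact values)
rng = np.random.default_rng(0)
X_gd = 3 * rng.random((20, 1))
y_gd = 1 + 0.5 * X_gd[:, 0] + rng.normal(size=20) / 1.5

theta_gd = ridge_gradient_descent(X_gd, y_gd)
print(theta_gd)  # [bias, weight]
```

The learning rate has to be small enough for the updates to converge; 0.05 is adequate for inputs in $[0, 3]$ but would need retuning for features on other scales.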