Regularization tackles overfitting by penalizing complex models. Common use cases include:

* Reduce `overfitting` and thus improve `generalization` and `stability`.
* `Feature selection`, when the L1 method is used.

Common regularization techniques are:

* L1 (`Lasso`) - adds the `absolute` values of the model weights to the cost function as a penalty. Tends to shrink `smaller coefficients to zero`, so it can be used for `feature selection`.
* L2 (`Ridge`) - adds the `squared` model weights to the cost function as a penalty. Tends to moderately shrink all coefficients.
* `Elastic Net` - combines the penalties of both L1 and L2 regularization.


Plain `OLS` (Ordinary Least Squares) has the following `cost function`  
        SSE  
        where SSE = Sum of Squared Errors

`L1` shrinks the coefficients by adding the sum of the absolute values of the slopes as a penalty to the cost function  
        *(SSE + λ\*Σ(|slopes|))* 
        
`L2` shrinks the coefficients by adding the sum of the squared slopes as a penalty to the cost function    
        *(SSE + λ\*Σ(slopes<sup>2</sup>))*  
        
and `Elastic Net` shrinks the coefficients by adding both penalties, each with its own weight, to the cost function    
        *(SSE + λ<sub>1</sub>\*Σ(|slopes|) + λ<sub>2</sub>\*Σ(slopes<sup>2</sup>))*       
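
As a minimal sketch of the three cost functions above, the NumPy snippet below evaluates them on a tiny made-up dataset. The function names, the toy numbers, and the `lam` values are illustrative assumptions and not part of the original notebook. The intercept is left unpenalized, which is the usual convention.

```python
import numpy as np

def sse(X, y, intercept, slopes):
    """Sum of squared errors of the linear model intercept + X @ slopes."""
    residuals = y - (intercept + X @ slopes)
    return float(np.sum(residuals ** 2))

def l1_cost(X, y, intercept, slopes, lam):
    """SSE + lam * sum(|slopes|)  (Lasso-style penalty)."""
    return sse(X, y, intercept, slopes) + lam * float(np.sum(np.abs(slopes)))

def l2_cost(X, y, intercept, slopes, lam):
    """SSE + lam * sum(slopes^2)  (Ridge-style penalty)."""
    return sse(X, y, intercept, slopes) + lam * float(np.sum(slopes ** 2))

def elastic_net_cost(X, y, intercept, slopes, lam1, lam2):
    """SSE + lam1 * sum(|slopes|) + lam2 * sum(slopes^2)."""
    return (sse(X, y, intercept, slopes)
            + lam1 * float(np.sum(np.abs(slopes)))
            + lam2 * float(np.sum(slopes ** 2)))

# Toy data (hypothetical): 3 observations, 2 predictors
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([5.0, 4.0, 8.0])
intercept = 1.0
slopes = np.array([2.0, -1.0])

print("SSE         :", sse(X, y, intercept, slopes))                         # 21.0
print("L1 cost     :", l1_cost(X, y, intercept, slopes, lam=0.5))            # 22.5
print("L2 cost     :", l2_cost(X, y, intercept, slopes, lam=0.5))            # 23.5
print("Elastic Net :", elastic_net_cost(X, y, intercept, slopes, 0.5, 0.5))  # 25.0
```

Note that scikit-learn's `Lasso`, `Ridge`, and `ElasticNet` scale these terms differently (for example, `Lasso` divides the SSE by twice the number of samples), so their `alpha` is not numerically interchangeable with the λ used here.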

#### Linear Regression

In [30]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Select predictors and target (`data` is assumed to be the DataFrame loaded earlier)
X = data.loc[:, ~data.columns.isin(['hp', 'model'])]  # predictors: every column except 'hp' and 'model'
y = data["hp"]  # target variable

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
print("Coefficients:", model.coef_)
print("Total absolute value of coefficients is ", round(sum([abs(coef) for coef in model.coef_]),1))

Coefficients: [ -1.17799003   5.34568682   0.20192979   5.34053483  10.05095453
 -24.6823462   43.33448042   4.57583899  -8.92920991   4.58366261]
Total absolute value of coefficients is  108.2


#### Lasso Regularization

In [34]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

estimator = Lasso()

param_grid = {"alpha":list(range(1,11))}
model_hp = GridSearchCV(estimator, param_grid, cv = 5)
model_hp.fit(X_train, y_train)
# print("Best alpha is ",model_hp.best_params_)
print("Best Lasso coefficients are ",model_hp.best_estimator_.coef_)
print("Total absolute value of coefficients is ", round(sum([abs(coef) for coef in model_hp.best_estimator_.coef_]),1))

Best Lasso coefficients are  [ -1.49718221   0.           0.28242638   0.          -0.
 -13.5543781    0.           0.           0.           3.10896649]
Total absolute value of coefficients is  18.4


The total of absolute values of coefficients is much smaller with Lasso (18.4) than with plain regression (108.2).  
Note that `Lasso` shrinks six of the ten coefficients to exactly `zero`. This makes `Lasso` useful for feature selection in datasets with many irrelevant or redundant features.
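
To make the feature-selection idea concrete, here is a small sketch on synthetic data rather than the notebook's dataset. The data, the feature names `f0` to `f4`, and `alpha=0.5` are all assumptions chosen for illustration. Only the first two features influence the target, so a Lasso fit would be expected to keep those and zero out the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 5 features, only f0 and f1 matter
rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)
feature_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical names

lasso = Lasso(alpha=0.5).fit(X, y)

# Features with a non-zero coefficient are the ones Lasso "selected"
kept = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0]
dropped = [name for name, coef in zip(feature_names, lasso.coef_) if coef == 0]

print("Coefficients:", np.round(lasso.coef_, 3))
print("Kept features   :", kept)
print("Dropped features:", dropped)
```

The same list comprehension can be applied to `model_hp.best_estimator_.coef_` together with `X_train.columns` to see which predictors survive in the grid-searched model.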

#### Ridge Regularization

In [35]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

estimator = Ridge()

param_grid = {"alpha":list(range(1,11))}
model_hp = GridSearchCV(estimator, param_grid, cv = 5)
model_hp.fit(X_train, y_train)
# print("Best alpha is ",model_hp.best_params_)
print("Best Ridge coefficients are ",model_hp.best_estimator_.coef_)
print("Total absolute value of coefficients is ", round(sum([abs(coef) for coef in model_hp.best_estimator_.coef_]),1))

Best Ridge coefficients are  [ -1.86705119   1.84606275   0.28744799   2.48396553  -2.39828212
 -10.62757636   2.22136648   1.39019163   1.67958646   5.46624545]
Total absolute value of coefficients is  30.3


The total of absolute values of coefficients is much smaller with Ridge (30.3) than with plain regression (108.2). Here, the coefficients are shrunk, but none is shrunk all the way to zero. So `Ridge` is effective in reducing overfitting by preventing extreme parameter values.
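
The shrinking behaviour can be sketched by refitting `Ridge` with increasingly large `alpha` values. The synthetic data and the `alpha` grid below are assumptions for illustration only. On data like this, the total absolute value of the coefficients should fall as `alpha` grows, while no coefficient becomes exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration): 4 features, all informative
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([4.0, -3.0, 2.0, 0.5]) + rng.normal(scale=0.5, size=50)

alphas = [0.01, 1, 10, 100, 1000]
totals = []             # total absolute value of coefficients, one entry per alpha
any_exact_zero = False  # becomes True if any coefficient is exactly 0

for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    totals.append(float(np.sum(np.abs(coef))))
    any_exact_zero = any_exact_zero or bool(np.any(coef == 0))
    print(f"alpha={alpha:>7}: total |coef| = {totals[-1]:.2f}, coef = {np.round(coef, 3)}")

print("Any coefficient exactly zero?", any_exact_zero)
```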

#### Elastic Net Regularization

In [38]:
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# The alpha and l1_ratio given here are placeholders; GridSearchCV overrides both
estimator = ElasticNet(alpha=2, l1_ratio=1)

param_grid = {"alpha": [0.1, 0.2, 1, 2, 3, 5, 10],
              "l1_ratio": [0.1, 0.5, 0.75, 0.9, 0.95, 1]}
model_hp = GridSearchCV(estimator, param_grid, cv=5)
model_hp.fit(X_train, y_train)
# print("Best parameters are ", model_hp.best_params_)
print("Best ElasticNet coefficients are ",model_hp.best_estimator_.coef_)
print("Total absolute value of coefficients is ", round(sum([abs(coef) for coef in model_hp.best_estimator_.coef_]),1))

Best ElasticNet coefficients are  [-1.56836374  1.27634404  0.3222228   1.44318271 -1.20846679 -6.9534298
 -0.02118137  1.01026995  1.62825564  3.58364342]
Total absolute value of coefficients is  19.0


The total of absolute values of coefficients for ElasticNet (19.0) is much smaller than for Ridge (30.3) but slightly larger than for Lasso (18.4).   
Lasso may shrink all but one of a group of correlated features to zero, so we may lose information, while Ridge keeps all the features even when they are not needed.   
`ElasticNet` finds a balance between Ridge and Lasso.
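
The behaviour with correlated features can be sketched with an extreme case: two predictors that are exact copies of each other. The data and the penalty strengths below are assumptions for illustration. Note also that `alpha` in `Ridge` is on a different scale from `alpha` in `Lasso` and `ElasticNet`, so the magnitudes are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Synthetic data (an assumption for illustration): two identical predictors
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])  # second column is an exact copy of the first
y = 6.0 * x + rng.normal(scale=0.1, size=200)

models = {
    "Lasso": Lasso(alpha=1.0),
    "Ridge": Ridge(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

# Fit each model and collect its coefficients
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}

for name, coef in coefs.items():
    print(f"{name:<10} coefficients: {np.round(coef, 3)}")

# Ridge splits the weight evenly between the two copies.
# ElasticNet's L2 part also pulls the two copies toward equal weights.
# Lasso typically concentrates the weight on one copy; which copy it keeps
# depends on solver details, so no specific output is claimed for it here.
```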