# Regularization

- Regularization in linear models is a technique used to prevent overfitting by adding a penalty term to the loss function, encouraging the model to favor simpler models with smaller coefficients. It helps to improve the generalization of the model by reducing variance while controlling bias to some extent. Regularization is particularly useful when dealing with high-dimensional data or when there is multicollinearity among the predictors.

- In linear regression, the standard least squares method aims to minimize the sum of squared residuals: $\min \sum_{i=1}^n (y_i-\hat{y}_i)^2$

- However, when the number of features is large relative to the number of observations or when the features are highly correlated, the model may become too flexible and fit the noise in the data, leading to overfitting. Regularization addresses this issue by adding a penalty term to the loss function.
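
- As a reference point, the penalized objectives can be written in a common textbook form. The notation is an assumption made here, not taken from the course material: $\beta_j$ are the $p$ coefficients, $\lambda \ge 0$ is the penalty strength, and $\rho \in [0, 1]$ mixes the two penalties. Libraries may scale the individual terms differently.

$$
\text{Ridge (L2):}\quad \min_{\beta} \sum_{i=1}^n (y_i-\hat{y}_i)^2 + \lambda \sum_{j=1}^p \beta_j^2
$$

$$
\text{Lasso (L1):}\quad \min_{\beta} \sum_{i=1}^n (y_i-\hat{y}_i)^2 + \lambda \sum_{j=1}^p |\beta_j|
$$

$$
\text{Elastic Net:}\quad \min_{\beta} \sum_{i=1}^n (y_i-\hat{y}_i)^2 + \lambda \left( \rho \sum_{j=1}^p |\beta_j| + (1-\rho) \sum_{j=1}^p \beta_j^2 \right)
$$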

## Types of Regularization

### Lasso (L1 Regularization):

- Adds the sum of the absolute values of the coefficients as the penalty term.
- Encourages sparsity by shrinking some coefficients to exactly zero, which effectively performs automatic feature selection.
- Useful when there are many irrelevant features in the dataset.

### Ridge (L2 Regularization):

- Adds the sum of the squared coefficients as the penalty term.
- Shrinks the coefficients towards zero, but they rarely become exactly zero.
- Useful when dealing with multicollinearity, as it tends to distribute the coefficient weights across correlated features.
- Helps in reducing the impact of multicollinearity by reducing the variance of the estimates.

### Elastic Net:

- Combines both L1 and L2 penalties.
- Allows for learning a sparse model in which only a few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge.
- Useful when there are many features and some of them are correlated.

### Regularization Parameter (Lambda or Alpha)

- The regularization parameter controls the strength of the penalty term.
- Larger values of the regularization parameter result in greater shrinkage of coefficients.
- The optimal value of the regularization parameter is typically determined using techniques like cross-validation.
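
The effect of the regularization parameter can be sketched on a small synthetic problem. Everything below (the data, the true coefficients, and the alpha values) is an assumption chosen for illustration and is unrelated to the advertising data used later. The sketch fits `Ridge` and `Lasso` for increasing `alpha` and reports how the Ridge coefficient norm shrinks and how many Lasso coefficients become exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (illustrative assumption): 10 features, only the first 3 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

alphas = [0.01, 0.1, 1.0, 10.0]
ridge_norms = []        # Euclidean norm of the Ridge coefficients per alpha
lasso_zero_counts = []  # number of exactly-zero Lasso coefficients per alpha

for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    ridge_norms.append(float(np.linalg.norm(ridge.coef_)))
    lasso_zero_counts.append(int(np.sum(lasso.coef_ == 0)))
    print(f"alpha={alpha:>5}: ridge ||coef||={ridge_norms[-1]:.3f}, "
          f"lasso zero coefficients={lasso_zero_counts[-1]}")
```

Note that `Ridge` and `Lasso` in scikit-learn scale their loss terms differently, so the same `alpha` value is not directly comparable between the two models.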

### Advantages of Regularization

- Prevents Overfitting: Regularization discourages overly complex models that fit the noise in the data.
- Improves Generalization: Regularized models tend to generalize better to unseen data.
- Handles Multicollinearity: Regularization methods can handle multicollinearity by shrinking the coefficients.

### Disadvantages of Regularization

- Loss of Interpretability: Shrunken coefficients are biased toward zero and depend on how the features were scaled, so they can no longer be read directly as effect sizes.
- Tuning Complexity: Choosing the optimal regularization parameter requires cross-validation, which can be computationally expensive.
- Bias-Variance Tradeoff: Regularization introduces bias to reduce variance, and finding the right balance is crucial.


## Imports

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data and Setup

In [4]:
df = pd.read_csv("https://raw.githubusercontent.com/erkansirin78/datasets/master/Advertising.csv")
df = df.drop("ID", axis=1)
X = df.drop('Sales',axis=1)
y = df['Sales']

### Polynomial Conversion

In [5]:
from sklearn.preprocessing import PolynomialFeatures

In [7]:
polynomial_converter = PolynomialFeatures(degree=4,include_bias=False)

In [8]:
poly_features = polynomial_converter.fit_transform(X)

### Train | Test Split

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

----

## Scaling the Data

While the raw features in our particular data set are all in the same order of magnitude (thousands of dollars spent), that typically won't be the case, and the polynomial terms created above span very different ranges. Since the penalty in regularized models sums the magnitudes of the coefficients together, it's important to standardize the features so that every coefficient is penalized on a comparable scale. Review the theory videos for more info, as well as a discussion on why we only **fit** to the training data, and **transform** both sets separately.
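
One way to make the "fit on train, transform both" rule hard to break is to wrap the steps in a scikit-learn `Pipeline`. This is an alternative to the manual cells below, not what the notebook itself does, and the data here is synthetic (an assumption for illustration). The sketch checks that the scaler's stored means come from the training rows only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: three spend-like features and a noisy linear target.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 300, size=(200, 3))
y_demo = 0.05 * X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# fit() learns the scaling statistics from the training rows only;
# predict() and score() reuse those statistics on new data.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
pipe.fit(Xd_train, yd_train)

fitted_scaler = pipe.named_steps["standardscaler"]
train_poly = pipe.named_steps["polynomialfeatures"].transform(Xd_train)
print("scaler means match training means:",
      np.allclose(fitted_scaler.mean_, train_poly.mean(axis=0)))
print("test R^2:", pipe.score(Xd_test, yd_test))
```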

In [11]:
from sklearn.preprocessing import StandardScaler

In [12]:
scaler = StandardScaler()

In [13]:
scaler.fit(X_train)

In [14]:
X_train = scaler.transform(X_train)

In [15]:
X_test = scaler.transform(X_test)

## Ridge Regression

Make sure to view the video lectures for a full explanation of Ridge Regression and of choosing an alpha.

In [16]:
from sklearn.linear_model import Ridge

In [23]:
ridge_model = Ridge(alpha=5)

In [24]:
ridge_model.fit(X_train,y_train)

In [25]:
test_predictions = ridge_model.predict(X_test)

In [26]:
from sklearn.metrics import mean_absolute_error,mean_squared_error

In [27]:
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [28]:
MAE

0.590270071394175

In [29]:
RMSE

0.8426977902539234

How did it perform on the training set? (This will be used later on for comparison)

In [30]:
# Training Set Performance
train_predictions = ridge_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.46795776176000015
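
The same MAE and RMSE computation is repeated for every model in this notebook. A small helper could cut that repetition; the function name `regression_report` and the synthetic data are hypothetical and only serve to keep the sketch self-contained.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error


def regression_report(model, X, y):
    """Return the MAE and RMSE of an already fitted model on (X, y)."""
    predictions = model.predict(X)
    mae = mean_absolute_error(y, predictions)
    rmse = np.sqrt(mean_squared_error(y, predictions))
    return {"MAE": float(mae), "RMSE": float(rmse)}


# Synthetic data for demonstration only; first 70 rows train, last 30 test.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, 0.5, 0.0, -2.0]) + rng.normal(scale=0.3, size=100)

demo_model = Ridge(alpha=5).fit(X_demo[:70], y_demo[:70])
train_report = regression_report(demo_model, X_demo[:70], y_demo[:70])
test_report = regression_report(demo_model, X_demo[70:], y_demo[70:])
print("train:", train_report)
print("test: ", test_report)
```

Comparing the two dictionaries side by side gives the train versus test comparison that the cells in this notebook build up by hand.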

### Choosing an alpha value with Cross-Validation

Review the theory video for full details.

In [31]:
from sklearn.linear_model import RidgeCV

In [32]:
# Choosing a scoring: https://scikit-learn.org/stable/modules/model_evaluation.html
# Negative MAE so all metrics follow convention "Higher is better"

# See all options: sklearn.metrics.get_scorer_names() (sklearn.metrics.SCORERS.keys() in older releases)
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0),scoring='neg_mean_absolute_error')

In [33]:
# The more alpha options you pass, the longer this will take.
# Fortunately our data set is still pretty small
ridge_cv_model.fit(X_train,y_train)

In [34]:
ridge_cv_model.alpha_

0.1
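
The selected value, 0.1, is the smallest alpha in the grid, which may mean the best value lies below the range that was searched. A possible follow-up is to search a wider, logarithmically spaced grid. The sketch below uses synthetic data and an arbitrary grid (both assumptions), so its selected alpha says nothing about the advertising data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic data for illustration only.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(120, 6))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 0.0, 1.0]) + rng.normal(size=120)

# 30 candidate alphas spread evenly on a log scale from 0.001 to 100.
alpha_grid = np.logspace(-3, 2, 30)
wide_ridge_cv = RidgeCV(alphas=alpha_grid, scoring="neg_mean_absolute_error")
wide_ridge_cv.fit(X_demo, y_demo)

print("selected alpha:", wide_ridge_cv.alpha_)
# If the selected alpha sits on either end of the grid, extend the grid and refit.
on_edge = bool(np.isclose(wide_ridge_cv.alpha_, [alpha_grid[0], alpha_grid[-1]]).any())
print("selected alpha on grid edge:", on_edge)
```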

In [35]:
test_predictions = ridge_cv_model.predict(X_test)

In [36]:
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [37]:
MAE

0.42422716289579987

In [38]:
RMSE

0.6242818489615023

In [39]:
# Training Set Performance
train_predictions = ridge_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.28345030778621877

In [40]:
ridge_cv_model.coef_

array([ 5.07665809,  0.69581004,  0.3719049 , -4.01391514,  3.8488346 ,
       -0.56376907, -0.84743483,  0.48029363, -0.43721227, -0.46487399,
       -1.74834461, -0.84928794,  2.18893967, -0.30026264,  0.41337395,
       -0.68655586,  0.28318356, -0.51573875,  0.87590086,  1.10213681,
        1.53908841,  0.88777377, -2.14518671,  0.23163482,  0.52256028,
        0.60556143, -0.37374377,  0.22760633, -0.78495661,  0.4796754 ,
       -0.07194254, -0.07792478,  0.12231745, -0.3446115 ])


-----

## Lasso Regression

In [41]:
from sklearn.linear_model import LassoCV

In [42]:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
lasso_cv_model = LassoCV(eps=0.1,n_alphas=100,cv=5)

In [43]:
lasso_cv_model.fit(X_train,y_train)

In [44]:
lasso_cv_model.alpha_

0.4943070909225832

In [45]:
test_predictions = lasso_cv_model.predict(X_test)

In [46]:
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [47]:
MAE

0.6541723168495853

In [48]:
RMSE

1.1308000958086055

In [49]:
# Training Set Performance
train_predictions = lasso_cv_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.6912807137865276

In [50]:
lasso_cv_model.coef_

array([1.00265103, 0.        , 0.        , 0.        , 3.79745277,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

## Elastic Net

Elastic Net combines the penalties of ridge regression and lasso in an attempt to get the best of both worlds!

In [51]:
from sklearn.linear_model import ElasticNetCV

In [52]:
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7,.9, .95, .99, 1],tol=0.01)

In [53]:
elastic_model.fit(X_train,y_train)

In [54]:
elastic_model.l1_ratio_

1.0
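
For reference, scikit-learn documents the Elastic Net objective in the form below, where $w$ is the coefficient vector, $n$ the number of training samples, $\alpha$ the penalty strength, and $\rho$ the `l1_ratio`:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1-\rho)}{2} \lVert w \rVert_2^2
$$

With the selected `l1_ratio_` of 1.0 the squared (L2) term drops out, so the model chosen here is effectively a Lasso.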

In [55]:
test_predictions = elastic_model.predict(X_test)

In [56]:
MAE = mean_absolute_error(y_test,test_predictions)
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [57]:
MAE

0.6130298726238844

In [58]:
RMSE

0.8020991123737414

In [59]:
# Training Set Performance
train_predictions = elastic_model.predict(X_train)
MAE = mean_absolute_error(y_train,train_predictions)
MAE

0.45282159833736707

In [60]:
elastic_model.coef_

array([ 3.73093257e+00,  1.30130044e+00,  2.82257606e-01, -7.30289664e-01,
        1.66741988e+00, -4.03061467e-01, -3.94578622e-01,  1.08664452e-01,
        0.00000000e+00, -7.29503996e-01,  4.15847815e-01, -2.02293806e-02,
        7.96359956e-01,  5.38485864e-04,  0.00000000e+00, -4.09572014e-01,
        9.20244085e-03,  1.49635282e-02, -2.26648010e-02, -3.86367871e-01,
       -2.07472587e-01,  7.41704188e-02,  1.61716964e-01,  1.00482305e-01,
        1.80207111e-02,  5.88703129e-01,  0.00000000e+00,  4.37429433e-02,
       -0.00000000e+00, -1.97121345e-01, -8.22178158e-03, -3.32018563e-02,
       -4.50389743e-03, -3.03644690e-02])

-----
---