# Regularization with SciKit-Learn

Previously we created a polynomial feature set and fit a standard linear regression to it. We can make a smarter model choice by using regularization.

Regularization minimizes the RSS (residual sum of squares) *plus* a penalty term. The penalty grows with the size of the coefficients, so it discourages models whose coefficients are too large. Some regularization methods can shrink the coefficients of non-useful features all the way to zero, in which case the model effectively ignores those features.
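As a minimal sketch of what "RSS plus a penalty" means, the functions below compute the two most common penalized objectives by hand. The function names and the toy data are made up for illustration. Note that scikit-learn's `Lasso` scales the squared-error term by `1 / (2 * n_samples)`, so its objective is not numerically identical to this simplified version.

```python
import numpy as np

def rss(X, y, coef, intercept=0.0):
    """Residual sum of squares for a linear model."""
    residuals = y - (X @ coef + intercept)
    return float(np.sum(residuals ** 2))

def ridge_objective(X, y, coef, alpha, intercept=0.0):
    """RSS plus an L2 penalty: alpha * sum of squared coefficients."""
    return rss(X, y, coef, intercept) + alpha * float(np.sum(coef ** 2))

def lasso_objective(X, y, coef, alpha, intercept=0.0):
    """RSS plus an L1 penalty: alpha * sum of absolute coefficients."""
    return rss(X, y, coef, intercept) + alpha * float(np.sum(np.abs(coef)))

# Toy data (hypothetical): y is exactly 2 * first column, so RSS is 0 here
X_toy = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
y_toy = np.array([2.0, 4.0, 6.0])
coef_toy = np.array([2.0, 0.0])

ridge_value = ridge_objective(X_toy, y_toy, coef_toy, alpha=1.0)  # 0 + 1 * (2**2)
lasso_value = lasso_objective(X_toy, y_toy, coef_toy, alpha=1.0)  # 0 + 1 * |2|
print(ridge_value, lasso_value)
```

Even with a perfect fit (RSS of zero), both objectives are non-zero because the penalty charges for the size of the coefficients.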

Let's explore two methods of regularization, Ridge Regression and Lasso, and then Elastic Net, which combines them. We'll apply these to the polynomial feature set, since regularization would be less effective on the small original feature set in X.

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Data and Setup

In [2]:
df = pd.read_csv("Advertising.csv")
X = df.drop('sales',axis=1)
y = df['sales']

### Polynomial Conversion

In [3]:
from sklearn.preprocessing import PolynomialFeatures

In [4]:
polynomial_converter = PolynomialFeatures(degree=3,include_bias=False)

In [5]:
poly_features = polynomial_converter.fit_transform(X)
poly_features.shape

(200, 19)
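The 19 columns are what one would expect if X has three feature columns, which is an assumption here inferred from the output shape. The sketch below counts the polynomial terms up to degree 3 and checks the count against `PolynomialFeatures` on placeholder data. The helper `n_poly_terms` is a hypothetical name.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_terms(n_features, degree):
    """Number of polynomial terms up to `degree`, excluding the bias column."""
    return comb(n_features + degree, degree) - 1

expected_columns = n_poly_terms(n_features=3, degree=3)

# Placeholder input with 3 columns (values are irrelevant for counting columns)
placeholder_X = np.zeros((5, 3))
converter = PolynomialFeatures(degree=3, include_bias=False)
actual_columns = converter.fit_transform(placeholder_X).shape[1]
print(expected_columns, actual_columns)
```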

### Train | Test Split

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(poly_features, y, test_size=0.3, random_state=101)

----

## Scaling the Data

While every feature in our particular data set is on the same order of magnitude (thousands of dollars spent), that typically won't be the case. Because the penalty in regularized models sums the coefficients (squared or in absolute value) across all features, it's important to standardize the features so the penalty treats them comparably. Review the theory videos for more info, including a discussion of why we **fit** the scaler only on the training data and then **transform** the training and test sets separately.

In [8]:
from sklearn.preprocessing import StandardScaler

In [9]:
# help(StandardScaler)

In [10]:
scaler = StandardScaler()

In [11]:
scaler.fit(X_train)

In [12]:
X_train = scaler.transform(X_train)

In [13]:
X_test = scaler.transform(X_test)
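To illustrate the fit-on-train, transform-both pattern, here is a small sketch on synthetic data (the arrays and variable names are stand-ins, not the advertising data). After scaling, the training columns have mean 0 and standard deviation 1 by construction, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

demo_scaler = StandardScaler()
demo_scaler.fit(X_tr)                      # learn mean/std from training rows only
X_tr_scaled = demo_scaler.transform(X_tr)  # apply training statistics to train
X_te_scaled = demo_scaler.transform(X_te)  # apply the same statistics to test

train_means = X_tr_scaled.mean(axis=0)
train_stds = X_tr_scaled.std(axis=0)
test_means = X_te_scaled.mean(axis=0)
print(train_means.round(6), train_stds.round(6))
print(test_means.round(3))
```

Fitting the scaler on the full data set would let information from the test rows leak into the training features.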

## Ridge Regression

In [14]:
from sklearn.linear_model import Ridge

In [15]:
from sklearn.metrics import mean_squared_error, r2_score

### Choosing an alpha value with Cross-Validation

Review the theory video for full details.

In [16]:
from sklearn.linear_model import RidgeCV

In [17]:
help(RidgeCV)

Help on class RidgeCV in module sklearn.linear_model._ridge:

class RidgeCV(sklearn.base.MultiOutputMixin, sklearn.base.RegressorMixin, _BaseRidgeCV)
 |  RidgeCV(alphas=(0.1, 1.0, 10.0), *, fit_intercept=True, scoring=None, cv=None, gcv_mode=None, store_cv_values=False, alpha_per_target=False)
 |  
 |  Ridge regression with built-in cross-validation.
 |  
 |  See glossary entry for :term:`cross-validation estimator`.
 |  
 |  By default, it performs efficient Leave-One-Out Cross-Validation.
 |  
 |  Read more in the :ref:`User Guide <ridge_regression>`.
 |  
 |  Parameters
 |  ----------
 |  alphas : array-like of shape (n_alphas,), default=(0.1, 1.0, 10.0)
 |      Array of alpha values to try.
 |      Regularization strength; must be a positive float. Regularization
 |      improves the conditioning of the problem and reduces the variance of
 |      the estimates. Larger values specify stronger regularization.
 |      Alpha corresponds to ``1 / (2C)`` in other linear models such as
 |

In [18]:
# Choosing a scoring: https://scikit-learn.org/stable/modules/model_evaluation.html
# Negative MSE so all metrics follow the convention "higher is better"

# See all options: sklearn.metrics.get_scorer_names()
# (older scikit-learn versions: sklearn.metrics.SCORERS.keys())
ridge_cv_model = RidgeCV(alphas=(0.1, 1.0, 10.0),scoring='neg_mean_squared_error')

In [19]:
# The more alpha options you pass, the longer this will take.
# Fortunately our data set is still pretty small
ridge_cv_model.fit(X_train,y_train)

In [20]:
# alpha value
ridge_cv_model.alpha_

0.1
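The selected alpha of 0.1 is the smallest candidate in the grid `(0.1, 1.0, 10.0)`. When the chosen value sits on the edge of the grid, it may be worth trying a wider and finer set of candidates. The sketch below shows one way to do that on synthetic data. The grid bounds and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in data
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=100)

# 20 log-spaced candidates between 0.001 and 100
alpha_grid = np.logspace(-3, 2, 20)
wide_ridge = RidgeCV(alphas=alpha_grid, scoring='neg_mean_squared_error')
wide_ridge.fit(X_demo, y_demo)

chosen_alpha = wide_ridge.alpha_
# True if the winner is the smallest or largest candidate
on_edge = bool(np.isclose(chosen_alpha, alpha_grid[0])
               or np.isclose(chosen_alpha, alpha_grid[-1]))
print(chosen_alpha, on_edge)
```

If `on_edge` is true, extending the grid further in that direction is a reasonable next step.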

In [21]:
test_predictions = ridge_cv_model.predict(X_test)

In [22]:
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [23]:
MSE

0.3820129881541673

In [24]:
RMSE

0.6180719926951611

In [25]:
# Training Set Performance
# Compare these training errors with the test errors above
train_predictions = ridge_cv_model.predict(X_train)
MSE = mean_squared_error(y_train,train_predictions)
MSE

0.22079858719844545

In [26]:
RMSE = np.sqrt(MSE)
RMSE

0.46989210165573697

In [27]:
ridge_cv_model.coef_

array([ 5.40769392,  0.5885865 ,  0.40390395, -6.18263924,  4.59607939,
       -1.18789654, -1.15200458,  0.57837796, -0.1261586 ,  2.5569777 ,
       -1.38900471,  0.86059434,  0.72219553, -0.26129256,  0.17870787,
        0.44353612, -0.21362436, -0.04622473, -0.06441449])


-----

## Lasso Regression

In [28]:
from sklearn.linear_model import LassoCV

In [29]:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
lasso_cv_model = LassoCV(eps=0.1,n_alphas=100,cv=5)

In [30]:
lasso_cv_model.fit(X_train,y_train)

In [31]:
lasso_cv_model.alpha_

0.4943070909225832
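In `LassoCV`, `eps` sets the ratio between the smallest and largest alpha in the automatically generated grid, so `eps=0.1` searches a fairly narrow range. The sketch below, on synthetic data with made-up variable names, compares the grid produced by `eps=0.1` with the one produced by the default `eps=0.001`.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in data
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=120)

narrow = LassoCV(eps=0.1, cv=5).fit(X_demo, y_demo)    # alpha_min / alpha_max = 0.1
wide = LassoCV(eps=0.001, cv=5).fit(X_demo, y_demo)    # alpha_min / alpha_max = 0.001

narrow_ratio = narrow.alphas_.min() / narrow.alphas_.max()
wide_ratio = wide.alphas_.min() / wide.alphas_.max()
print(narrow_ratio, wide_ratio)
print(narrow.alpha_, wide.alpha_)
```

If the selected `alpha_` equals the smallest value in `alphas_`, the grid may be cutting the search short, and a smaller `eps` could be worth trying.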

In [32]:
test_predictions = lasso_cv_model.predict(X_test)

In [33]:
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [34]:
MSE

1.2787088713079868

In [35]:
RMSE

1.1308001022762542

In [50]:
# Training Set Performance
# Compare these training errors with the test errors above
train_predictions = lasso_cv_model.predict(X_train)
MSE = mean_squared_error(y_train,train_predictions)
MSE

1.1983243366988934

In [51]:
RMSE = np.sqrt(MSE)
RMSE

1.0946800156661733

In [52]:
lasso_cv_model.coef_

array([1.002651  , 0.        , 0.        , 0.        , 3.79745279,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

## Elastic Net

Elastic Net combines the penalties of ridge regression and lasso in an attempt to get the best of both worlds!
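As a rough sketch of how the two penalties are blended, the function below follows the penalty form documented for scikit-learn's `ElasticNet`, where `l1_ratio` controls the mix. The function name and example coefficients are made up for illustration.

```python
import numpy as np

def elastic_net_penalty(coef, alpha, l1_ratio):
    """Blend of L1 and L2 penalties, weighted by l1_ratio."""
    l1 = np.sum(np.abs(coef))   # lasso-style term
    l2 = np.sum(coef ** 2)      # ridge-style term
    return float(alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2)

coef_demo = np.array([3.0, -4.0])  # |coef| sums to 7, squares sum to 25

pure_lasso = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=1.0)
pure_ridge = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=0.0)
blend = elastic_net_penalty(coef_demo, alpha=1.0, l1_ratio=0.5)
print(pure_lasso, pure_ridge, blend)
```

With `l1_ratio=1` the L2 term vanishes and the penalty reduces to the Lasso penalty, which is why a selected `l1_ratio_` of 1.0 means the cross-validation preferred a pure Lasso model.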

In [39]:
from sklearn.linear_model import ElasticNetCV

In [40]:
elastic_model = ElasticNetCV(l1_ratio=[.1, .5, .7,.9, .95, .99, 1],tol=0.01)

In [41]:
elastic_model.fit(X_train,y_train)

In [42]:
elastic_model.l1_ratio_

1.0

In [43]:
test_predictions = elastic_model.predict(X_test)

In [44]:
MSE = mean_squared_error(y_test,test_predictions)
RMSE = np.sqrt(MSE)

In [45]:
MSE

0.5603340214638837

In [46]:
RMSE

0.7485546215633724

In [47]:
# Training Set Performance
train_predictions = elastic_model.predict(X_train)
MSE = mean_squared_error(y_train,train_predictions)
MSE

0.4148865295703134

In [48]:
RMSE = np.sqrt(MSE)
RMSE

0.6441168601816858

In [49]:
elastic_model.coef_

array([ 3.78993643,  0.89232919,  0.28765395, -1.01843566,  2.15516144,
       -0.3567547 , -0.271502  ,  0.09741081,  0.        , -1.05563151,
        0.2362506 ,  0.07980911,  1.26170778,  0.01464706,  0.00462336,
       -0.39986069,  0.        ,  0.        , -0.05343757])
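The same train and test RMSE steps are repeated for each model above. One possible way to line the models up side by side is a small helper, sketched here on synthetic data. The helper `rmse_report`, the data, and the shortened parameter grids are all hypothetical stand-ins rather than the notebook's own setup.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def rmse_report(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return (train RMSE, test RMSE)."""
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return float(train_rmse), float(test_rmse)

# Synthetic stand-in data
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 1.5])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
demo_scaler = StandardScaler().fit(X_tr)  # fit on training rows only
X_tr, X_te = demo_scaler.transform(X_tr), demo_scaler.transform(X_te)

candidates = {
    'ridge': RidgeCV(alphas=(0.1, 1.0, 10.0), scoring='neg_mean_squared_error'),
    'lasso': LassoCV(cv=5),
    'elastic_net': ElasticNetCV(l1_ratio=[.1, .5, .9, 1], cv=5),
}
report = {name: rmse_report(m, X_tr, y_tr, X_te, y_te)
          for name, m in candidates.items()}

for name, (train_rmse, test_rmse) in report.items():
    print(f"{name:12s} train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")
```

A large gap between training and test RMSE for a given model would suggest overfitting, while similar values suggest the regularization strength is reasonable.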