## Regularization

One of the most common issues when building a machine learning model for prediction is "**overfitting**". To reduce the risk of overfitting, data scientists adjust the model in different ways so that it predicts unseen data better.

Overfitting:
When do we expect to see an overfitting issue in an ML model?

It usually happens when the model fits the training data very well (low bias) but predicts the testing data poorly (high variance). This is especially true for the OLS linear regression model: its objective is to minimize the sum of squared residuals (errors), so the estimated parameters are, by design, the ones that yield the lowest bias on the given dataset.
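
The symptom described above can be sketched on synthetic data. This is only an illustration: the data-generating process, the sample sizes, and the variable names are assumptions made for the example, not part of the dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): only 3 of 25 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 25))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2] + rng.normal(scale=1.0, size=60)

# Hold out half of the rows to play the role of "unseen" data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Plain OLS minimizes the squared residuals on the training rows only
ols = LinearRegression().fit(X_train, y_train)
train_r2 = ols.score(X_train, y_train)
test_r2 = ols.score(X_test, y_test)

print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")
```

A large gap between the training and testing R^2 is the sign of overfitting that the rest of this section tries to reduce.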

Now, the question is: if the OLS model fits the training dataset with such low bias, does it also have good predictive power for unseen data? In many cases, not necessarily. So, how can we use the bias-variance trade-off to build a better predictive model on the OLS linear regression foundation?

The answer is "Regularization"!!

1. Lasso Regression (L1)  
2. Ridge Regression (L2)  
3. ElasticNet (L1 + L2)  

The idea behind "**Regularization**" is to add a little more **bias** to our model, so that it has lower **variance** when predicting unseen data.

Note: Regularization is designed to improve the predictive power of the model, not its interpretability, so the estimated coefficients should not be interpreted directly or used for statistical inference (hypothesis tests).

### Lasso Regression

The general concept of Lasso Regression is to add a penalty to a well-fitted OLS model that favors smaller $\beta$s, so that the variance of the model decreases and it predicts unseen data better. Compare the loss functions of the OLS and Lasso regression models:

OLS:

$$SSE = \sum{(y_i - \beta_0 - \beta_1 x_i)^2}$$

Lasso:

$$SSE = \sum{(y_i - \beta_0 - \beta_1 x_i)^2} + \lambda |\beta_1|$$

where $\lambda$ represents the weight of the penalty on the estimated coefficient. The absolute value of $\beta_1$ is the term being penalized. Since the estimated slope coefficient $\beta_1$ can be positive or negative, we take its absolute value to ensure that the penalty always adds to the loss function.

The ultimate goal is to minimize the loss function by adjusting the slope coefficient, so that the model has lower variance. The trade-off is that this adjustment gives the model higher bias.
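
To make the two loss functions concrete, here is a small sketch that evaluates them by hand for a single predictor. The helper names `ols_loss` and `lasso_loss` and the toy data are assumptions made for this example; they are not library functions.

```python
import numpy as np

def ols_loss(beta0, beta1, x, y):
    """Sum of squared residuals for a one-predictor linear model."""
    residuals = y - beta0 - beta1 * x
    return np.sum(residuals ** 2)

def lasso_loss(beta0, beta1, x, y, lam):
    """OLS loss plus the L1 penalty lam * |beta1|."""
    return ols_loss(beta0, beta1, x, y) + lam * abs(beta1)

# Toy data that lies exactly on the line y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# At the OLS solution (slope 2) the SSE is zero, but the penalty is not
loss_at_ols_slope = lasso_loss(0.0, 2.0, x, y, lam=10.0)

# A slightly smaller slope adds a little SSE but saves more on the penalty
loss_at_smaller_slope = lasso_loss(0.0, 1.8, x, y, lam=10.0)

print(loss_at_ols_slope, loss_at_smaller_slope)
```

With this toy data and $\lambda = 10$, the slightly smaller slope gives the lower penalized loss. That is the sense in which the penalty pulls the coefficient away from the pure OLS fit.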


#### Validating the Model:

For Lasso regression, we use the cross-validation (CV) error to choose the value of $\lambda$, so that the penalty on the estimated parameters gives the balance of bias and variance that yields the lowest error.

Python Code:

Note: Because "*lambda*" is a reserved keyword in Python, the corresponding parameter in the Lasso, Ridge, and ElasticNet regression classes is named "*alpha*".

``` Python
# Import dependencies for lasso regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate

# Create the lasso regression object
lasso = Lasso(alpha=100, random_state=123456)

# Fit the data to the model object
lasso.fit(X, y)

# Predict with the training data
y_hat = lasso.predict(X)

# Calculate the goodness of fit of the model, R^2
lasso.score(X, y)

# Calculate the mean squared error of the model
mean_squared_error(y, y_hat)

# Calculate the cross-validation R^2 and mse
cv = cross_validate(X=X, y=y, estimator=lasso, cv=8,
                    scoring=('r2', 'neg_mean_squared_error'),
                    return_train_score=True)

# Check the CV train R^2
cv['train_r2']

# Check the CV test R^2
cv['test_r2']

# Create the regularized regression with cross-validation
lasso_cv = LassoCV(cv=8)

# Fit the model to the training data
lasso_cv.fit(X, y)

# Calculate the goodness of fit of the model, R^2
lasso_cv.score(X, y)

# Calculate the mean squared error of the model
mean_squared_error(y, lasso_cv.predict(X))
```
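
As a rough sketch of what choosing the penalty by CV error looks like when done by hand, the loop below scores a few candidate `alpha` values with 8-fold cross-validation. The synthetic data and the candidate grid are assumptions for illustration; in practice the grid would be tuned to the problem, or left to `LassoCV`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data (assumed for illustration): 2 useful features out of 8
rng = np.random.default_rng(1)
X = rng.normal(size=(160, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=160)

# Candidate penalty weights (a hypothetical grid)
alphas = [0.01, 0.1, 1.0, 10.0]

cv_mse = {}
for alpha in alphas:
    # scikit-learn reports negated MSE, so flip the sign back
    scores = cross_val_score(Lasso(alpha=alpha), X, y, cv=8,
                             scoring='neg_mean_squared_error')
    cv_mse[alpha] = -scores.mean()

# The alpha with the lowest average CV error
best_alpha = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("best alpha:", best_alpha)
```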

#### When to use Lasso Regression?

By design of the loss function, the estimated slope parameters can be driven to exactly 0 at the minimum-error solution, which means those parameters are eliminated from the model. Therefore, if you know that your dataset contains some useless predictors or features, you should consider using Lasso Regression as a feature-selection strategy to simplify your model specification.
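
A minimal sketch of this feature-selection behavior, assuming synthetic data in which only the first two of six features influence the target (the data and the choice `alpha=0.5` are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed): features 0 and 1 matter, features 2-5 are noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)

# Indices of the features whose coefficients were not driven to zero
kept = np.flatnonzero(lasso.coef_)

print("coefficients:", np.round(lasso.coef_, 3))
print("kept features:", kept)
```

The coefficients of the noise features should come out as exactly zero, while the useful features keep shrunken, non-zero coefficients.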

Try to build the Lasso Regression model with the following dataset.

In [2]:
# Import dependencies
import pandas as pd

In [7]:
# Loading data from the web into pd dataframe
path = 'https://jaredlander.com/data/manhattan_Train.csv'
manhattan = pd.read_csv(path)
manhattan = manhattan[['TotalValue', 'LotArea', 'NumFloors', 'UnitsTotal',
                       'LotFront', 'LotDepth', 'BldgFront', 'BldgDepth',
                       'BuiltFAR', 'ResidFAR', "CommFAR"]]
manhattan.head()

| | TotalValue | LotArea | NumFloors | UnitsTotal | LotFront | LotDepth | BldgFront | BldgDepth | BuiltFAR | ResidFAR | CommFAR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 327600.0 | 769 | 4.5 | 3 | 19.0 | 53.92 | 19.0 | 54.0 | 5.34 | 10.0 | 15.0 |
| 1 | 943650.0 | 1512 | 5.0 | 7 | 36.17 | 46.67 | 36.0 | 44.0 | 4.94 | 10.0 | 15.0 |
| 2 | 897300.0 | 2180 | 3.0 | 3 | 34.92 | 69.75 | 34.0 | 69.0 | 2.81 | 10.0 | 15.0 |
| 3 | 914400.0 | 2275 | 4.0 | 3 | 42.17 | 55.25 | 41.0 | 63.0 | 3.57 | 10.0 | 15.0 |
| 4 | 927900.0 | 1885 | 5.5 | 2 | 29.0 | 66.92 | 29.0 | 66.0 | 4.9 | 10.0 | 15.0 |


In [None]:
# Try to code the Lasso Regression here!



### Ridge Regression

Ridge Regression follows the same concept as Lasso Regression. The idea is to adjust the slope coefficient, adding more bias to the model in return for lower variance when predicting unseen data. The only difference is that, instead of using the absolute value of the slope coefficient, we square the slope coefficient in the loss function.

OLS:

$$SSE = \sum{(y_i - \beta_0 - \beta_1 x_i)^2}$$

Lasso:

$$SSE = \sum{(y_i - \beta_0 - \beta_1 x_i)^2} + \lambda |\beta_1|$$

Ridge:

$$SSE = \sum{(y_i - \beta_0 - \beta_1 x_i)^2} + \lambda \beta_1^2$$

where $\lambda$ again represents the penalty weight on the estimated slope coefficient, $\beta_1$.
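
As a worked step, the Ridge loss can be minimized by hand in the one-predictor case. Assume, for simplicity, that $x$ and $y$ have been centered so that $\beta_0 = 0$ (this centering is an assumption of the sketch, not something the loss function requires). Setting the derivative with respect to $\beta_1$ to zero:

$$\frac{d}{d\beta_1}\left[\sum{(y_i - \beta_1 x_i)^2} + \lambda \beta_1^2\right] = -2\sum{x_i (y_i - \beta_1 x_i)} + 2 \lambda \beta_1 = 0$$

and solving for $\beta_1$ gives

$$\hat{\beta}_1 = \frac{\sum{x_i y_i}}{\sum{x_i^2} + \lambda}$$

With $\lambda = 0$ this is the usual OLS slope. A larger $\lambda$ makes the denominator larger, so the slope shrinks toward 0, but it reaches exactly 0 only if $\sum{x_i y_i} = 0$.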

In most cases, Ridge Regression is preferred over Lasso Regression. Therefore, if you are unsure which one to use for building a predictive model, starting with Ridge Regression is a safe option.

#### Validating the Model:

For Ridge regression, we use the cross-validation (CV) error to choose the value of $\lambda$, so that the penalty on the estimated parameters gives the balance of bias and variance that yields the lowest error.

Python Code:

``` Python
# Import dependencies for ridge regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate

# Create the ridge regression object
ridge = Ridge(alpha=100, random_state=123456)

# Fit the data to the model object
ridge.fit(X, y)

# Predict with the training data
y_hat = ridge.predict(X)

# Calculate the goodness of fit of the model, R^2
ridge.score(X, y)

# Calculate the mean squared error of the model
mean_squared_error(y, y_hat)

# Calculate the cross-validation R^2 and mse
cv = cross_validate(X=X, y=y, estimator=ridge, cv=8,
                    scoring=('r2', 'neg_mean_squared_error'),
                    return_train_score=True)

# Check the CV train R^2
cv['train_r2']

# Check the CV test R^2
cv['test_r2']

# Create the regularized regression with cross-validation
ridge_cv = RidgeCV(cv=8)

# Fit the model to the training data
ridge_cv.fit(X, y)

# Calculate the goodness of fit of the model, R^2
ridge_cv.score(X, y)

# Calculate the mean squared error of the model
mean_squared_error(y, ridge_cv.predict(X))
```

#### When to use Ridge Regression?

By design of the loss function, the estimated parameters can be driven very close to 0, but not exactly to 0. Ridge Regression is therefore best suited to situations where you know that most of the predictors or features in the model are useful for predicting the target value, so that you only reduce the weight of the slope parameters instead of eliminating them from your model.
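
The contrast with Lasso can be sketched by fitting both models to the same synthetic data and counting the coefficients that are exactly zero. The data and the penalty weights below are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumed): features 0 and 1 matter, features 2-4 are noise
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=100.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Count the coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros (ridge, lasso):", n_zero_ridge, n_zero_lasso)
```

Ridge shrinks every coefficient but keeps all of them in the model, whereas Lasso removes the noise features entirely.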

Try the code for building the Ridge Regression with the same dataset.

In [None]:
# Try to build the Ridge Regression here!



### ElasticNet

Finally, ElasticNet is a combination of the Lasso and Ridge Regression models. The concept of ElasticNet is to minimize a loss function that is a weighted sum of the Lasso and Ridge loss functions. A feature specific to ElasticNet is the weight parameter $\rho$, which represents the weight of the Lasso Regression in the loss function.

Weight Parameter: $\rho$

$\rho = 0.1$ means that the Lasso Regression has a weight of 10% in the loss function and the remaining 90% comes from the Ridge Regression.

$\rho = 0.5$ means that the Lasso and Ridge Regressions both have 50% weight in the loss function.

ElasticNet:

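$$SSE = \rho \left[\sum{(y_i - \beta_0 - \beta_1 x_i)^2} + \lambda |\beta_1|\right] + (1 - \rho) \left[\sum{(y_i - \beta_0 - \beta_1 x_i)^2} + \lambda \beta_1^2\right]$$

Because the two weights sum to 1, the squared-error term appears exactly once, and the expression simplifies to

$$SSE = \sum{(y_i - \beta_0 - \beta_1 x_i)^2} + \lambda \left[\rho |\beta_1| + (1 - \rho) \beta_1^2\right]$$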
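
Reading the ElasticNet loss as a $\rho$-weighted blend of the Lasso and Ridge penalties, a hand-written version for one predictor might look like the sketch below. The function names and toy data are assumptions for this example. Note also that scikit-learn's `ElasticNet` scales the terms of its objective differently, so the numbers here are not directly comparable to its `alpha`.

```python
import numpy as np

def sse(beta0, beta1, x, y):
    """Sum of squared residuals for a one-predictor linear model."""
    return np.sum((y - beta0 - beta1 * x) ** 2)

def elastic_net_loss(beta0, beta1, x, y, lam, rho):
    """SSE plus lam * (rho * |beta1| + (1 - rho) * beta1^2)."""
    penalty = rho * abs(beta1) + (1.0 - rho) * beta1 ** 2
    return sse(beta0, beta1, x, y) + lam * penalty

# Toy data that lies exactly on the line y = 2x, so the SSE is 0 at slope 2
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

pure_lasso = elastic_net_loss(0.0, 2.0, x, y, lam=10.0, rho=1.0)  # only |beta1|
pure_ridge = elastic_net_loss(0.0, 2.0, x, y, lam=10.0, rho=0.0)  # only beta1^2
half_half = elastic_net_loss(0.0, 2.0, x, y, lam=10.0, rho=0.5)   # even blend

print(pure_lasso, pure_ridge, half_half)
```

Setting `rho=1.0` recovers the Lasso loss and `rho=0.0` recovers the Ridge loss; values in between mix the two penalties.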


Python Code:
    
``` Python
# Import dependencies for ElasticNet regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_validate

# Create the ElasticNet object, with rho = 0.5 (named l1_ratio in scikit-learn)
en = ElasticNet(alpha=100, l1_ratio=0.5, random_state=123456)

# Fit the data to the model object
en.fit(X, y)

# Predict with the training data
y_hat = en.predict(X)

# Calculate the goodness of fit of the model, R^2
en.score(X, y)

# Calculate the mean squared error of the model
mean_squared_error(y, y_hat)

# Calculate the cross-validation R^2 and mse
cv = cross_validate(X=X, y=y, estimator=en, cv=8,
                    scoring=('r2', 'neg_mean_squared_error'),
                    return_train_score=True)

# Check the CV train R^2
cv['train_r2']

# Check the CV test R^2
cv['test_r2']

# Create the regularized regression with cross-validation
en_cv = ElasticNetCV(cv=8)

# Fit the model to the training data
en_cv.fit(X, y)

# Calculate the goodness of fit of the model, R^2
en_cv.score(X, y)

# Calculate the mean squared error of the model
mean_squared_error(y, en_cv.predict(X))
```

#### When to use ElasticNet?

You can use ElasticNet in any situation, but setting $\rho$ is a challenge for beginners, and we often need to check the CV error to find the best $\rho$ for a particular problem.
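
One way to let the CV error pick $\rho$ is to pass several candidate values to `ElasticNetCV` through its `l1_ratio` parameter. The sketch below assumes synthetic data and a hypothetical candidate list; it is not a recommendation for any particular grid.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic data (assumed for illustration): 2 useful features out of 8
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=150)

# Candidate values for rho (l1_ratio); a hypothetical grid
candidate_ratios = [0.1, 0.5, 0.9, 1.0]

# For each l1_ratio, ElasticNetCV searches a path of alpha values and keeps
# the (l1_ratio, alpha) pair with the lowest cross-validated error
en_cv = ElasticNetCV(l1_ratio=candidate_ratios, cv=5).fit(X, y)

print("selected l1_ratio:", en_cv.l1_ratio_)
print("selected alpha:", en_cv.alpha_)
```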

Let's try to build the ElasticNet model and compare it to the previous models.

In [None]:
# Try to build the ElasticNet Regression here!

