### GLM - Ridge (L2) Regression

If we fit a model too closely to our training data, we risk a phenomenon called **overfitting**. Matching the training data as closely as possible may seem like a good thing, but the samples in our test set often differ from those in our training set, so a model that has effectively memorized the training data tends to predict poorly on new data. To guard against this, most models are paired with some form of regularization (or penalization) that discourages overly complex fits. This may reduce performance on our training data, but it can lead to better predictions on test data and improve overall generalization.


For linear regression models, one form of regularization is known as **Ridge (L2) regression**. Ordinary least squares minimizes the least squares loss (the loss function behind our MSE cost function):
$$ L(\beta) = \sum_{i=1}^n (y_i - \hat y_i)^2 $$

In ridge regression we additionally penalize the coefficients by adding a regularization term:

$$ L(\beta) = \sum_{i=1}^n (y_i - \hat y_i)^2 + \alpha \sum_{j=1}^p \beta_j^2 $$

This regularization term penalizes large coefficients (or weights), which discourages the model from relying too heavily on any small subset of features, a common cause of overfitting.
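
To make the two terms concrete, here is a minimal NumPy sketch of the ridge loss written above. The function name `ridge_loss` and the tiny arrays are made up for illustration, and the sketch leaves out the intercept for simplicity (scikit-learn's `Ridge` fits an intercept by default and does not penalize it).

```python
import numpy as np

def ridge_loss(X, y, beta, alpha):
    """Sum of squared errors plus alpha times the sum of squared coefficients."""
    residuals = y - X @ beta                      # y_i - y_hat_i for every sample
    squared_error = np.sum(residuals ** 2)        # least squares term
    penalty = alpha * np.sum(beta ** 2)           # L2 penalty on the coefficients
    return squared_error + penalty

# Tiny made-up example (values are illustrative only)
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_demo = np.array([1.0, 2.0, 3.0])
beta_demo = np.array([0.5, 0.0])

loss_no_penalty = ridge_loss(X_demo, y_demo, beta_demo, alpha=0.0)
loss_with_penalty = ridge_loss(X_demo, y_demo, beta_demo, alpha=2.0)
print(loss_no_penalty, loss_with_penalty)
```

With `alpha=0.0` the function reduces to the plain least squares loss; any positive `alpha` adds a cost that grows with the squared size of the coefficients.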

Ridge regression takes a **hyperparameter** called alpha, $\alpha$ (sometimes lambda, $\lambda$), which controls how much regularization is applied. In other words, it sets how much weight the coefficient penalty term gets relative to the sum of squared errors term. The higher the value of alpha, the stronger the regularization and the smaller the resulting coefficients will be. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) for more. 
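
As a rough illustration of how alpha shrinks the coefficients, the sketch below fits `Ridge` for a few alpha values and prints the overall size (Euclidean norm) of the coefficient vector. It uses synthetic data from `make_regression` rather than our dataset, and the alpha grid and variable names are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data, used only for this illustration
X_syn, y_syn = make_regression(n_samples=100, n_features=10,
                               noise=10.0, random_state=0)

coef_norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X_syn, y_syn)
    coef_norms.append(np.linalg.norm(model.coef_))  # overall coefficient size
    print(f"alpha={alpha:>8}: ||coef|| = {coef_norms[-1]:.3f}")
```

The printed norm should get smaller as alpha grows, which is the shrinkage effect described above.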

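If we use an `alpha` value of `0`, the penalty term vanishes and we get the same solution as the OLS regression done above. (The scikit-learn documentation advises using `LinearRegression` rather than `Ridge` with `alpha=0` in practice, for numerical reasons.) Let's check that.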

In [None]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=0,  # regularization strength; 0 means no penalty
                  solver='auto',
                  random_state=rand_seed)
ridge_reg.fit(X_train, y_train)

# Predictions
ridge_train_pred = ridge_reg.predict(X_train)
ridge_test_pred = ridge_reg.predict(X_test)

In [None]:
# Take the square root ourselves: the `squared=False` argument was removed in newer scikit-learn releases
print('Train RMSE: %.04f' % (mse(y_train, ridge_train_pred) ** 0.5))
print('Test RMSE: %.04f' % (mse(y_test, ridge_test_pred) ** 0.5))
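
Matching RMSE values are consistent with the two models being the same, but a more direct check is to compare the fitted coefficients. The sketch below does this on synthetic data from `make_regression` (not our dataset), so the variable names here are placeholders and the result only illustrates the idea for a well-conditioned problem.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, well-conditioned data used only for this check
X_chk, y_chk = make_regression(n_samples=200, n_features=5,
                               noise=5.0, random_state=0)

ols = LinearRegression().fit(X_chk, y_chk)
ridge0 = Ridge(alpha=0).fit(X_chk, y_chk)   # no penalty

# Compare coefficients and intercepts up to floating point tolerance
same_coefs = bool(np.allclose(ols.coef_, ridge0.coef_)
                  and np.isclose(ols.intercept_, ridge0.intercept_))
print("Same solution:", same_coefs)
```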

Generally we don't know in advance what the best hyperparameter values are, so we need some kind of trial-and-error search to find them. We won't cover it today (it's covered in detail on Day 2), but scikit-learn provides a `RidgeCV` model that does just that. It fits a ridge regression model by first using cross-validation to find a good value of alpha. See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html#sklearn.linear_model.RidgeCV) for more.
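
For a quick preview only, a minimal `RidgeCV` sketch might look like the following. The synthetic data and the grid of candidate alphas are assumptions made for this demo, not values tuned for our dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic data, used only for this illustration
X_cv, y_cv = make_regression(n_samples=200, n_features=20,
                             noise=15.0, random_state=0)

# Candidate alpha values on a log scale (an arbitrary grid for the demo)
candidate_alphas = np.logspace(-3, 3, 13)

# With the default cv=None, RidgeCV uses an efficient leave-one-out scheme
ridge_cv = RidgeCV(alphas=candidate_alphas).fit(X_cv, y_cv)
print("Selected alpha:", ridge_cv.alpha_)
```

After fitting, `alpha_` holds the selected value and the model can be used for prediction like any other estimator.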

As a quick sanity check, let's see whether a ridge model with alpha set to 0.1 can improve on our baseline linear regression model.

In [None]:
ridge_reg = Ridge(alpha=0.1,  # regularization strength
                  solver='auto',
                  random_state=rand_seed)
ridge_reg.fit(X_train, y_train)

# Predictions
ridge_train_pred = ridge_reg.predict(X_train)
ridge_test_pred = ridge_reg.predict(X_test)

In [None]:
# Take the square root ourselves: the `squared=False` argument was removed in newer scikit-learn releases
print('Train RMSE: %.04f' % (mse(y_train, ridge_train_pred) ** 0.5))
print('Test RMSE: %.04f' % (mse(y_test, ridge_test_pred) ** 0.5))

Looks like, despite doing slightly worse on the training set, the ridge model did a bit better than regular OLS on the test set!

### GLM - Lasso (L1) Regression

**Lasso (L1) regression** is another form of regularized regression that penalizes the coefficients in a least squares loss. Rather than taking a squared penalty of the coefficients, Lasso uses an absolute value penalty: 

$$ L(\beta) = \sum_{i=1}^n (y_i - \hat y_i)^2 + \alpha \sum_{j=1}^p |\beta_j| $$ 

This has a similar effect of making the coefficients smaller, but it also tends to force some coefficients to exactly 0. This leads to what are called **sparse** models, and is another way to reduce the overfitting introduced by more complex models.

See [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) for more.
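
To see the sparsity effect described above, the sketch below fits `Lasso` and `Ridge` on synthetic data where only 5 of 20 features carry signal, then counts how many coefficients are exactly zero. The data, the alpha value, and the variable names are assumptions made for this demo. Note also that scikit-learn's `Lasso` scales the squared error term by $1/(2n)$, so its `alpha` is not on the same scale as in the formula above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 5 of the 20 features are informative
X_sp, y_sp = make_regression(n_samples=200, n_features=20, n_informative=5,
                             noise=5.0, random_state=0)

lasso = Lasso(alpha=5.0).fit(X_sp, y_sp)
ridge = Ridge(alpha=5.0).fit(X_sp, y_sp)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Zero coefficients (Lasso):", n_zero_lasso)
print("Zero coefficients (Ridge):", n_zero_ridge)
```

Lasso should zero out a number of the uninformative features, while ridge shrinks coefficients toward zero without setting them exactly to zero.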