<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_VI_4_TelCoRidgeExample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regularization Approaches: Ridge and LASSO Regression

## Ridge and LASSO Regression

Ridge and LASSO regression are both examples of *regularized* regression approaches. In what follows, we first briefly review the two approaches and highlight how they differ from their unregularized counterpart. We then work through an example to become familiar with the impact of the *tuning parameter* on the resulting coefficient estimates. We will also determine the in- and out-of-sample fit depending on the choice of the tuning parameter, uncovering a familiar relationship.

## Review of Concepts and Maths

In a conventional (linear) regression problem with independent variables $x_i$ and dependent variables $y_i$, we determine the best fit in the least-squares sense:
$$
\hat{\beta}^{\text{OLS}} = \text{argmin}_{\beta}\left\{\sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j\,x_{i,j}\right)\right)^2 \right\}.
$$
Within a *regularized* approach, we now include penalties for choosing many or large parameters:
$$
\hat{\beta}^{\text{REG}}_\lambda = \text{argmin}_{\beta}\left\{\sum_{i=1}^n \left(y_i - \left(\beta_0 + \sum_{j=1}^p \beta_j\,x_{i,j}\right)\right)^2 + \lambda \times R(\beta) \right\}.
$$
Here, $R(\beta)$ is a so-called *regularization* term that imposes a penalty on the complexity of the regression equation. In particular, within Ridge regression the penalty term is *quadratic* (an L2 penalty), $R(\beta) = \sum_{j=1}^p \beta_j^2,$ whereas the LASSO uses an L1 penalty, $R(\beta) = \sum_{j=1}^p |\beta_j|.$ Note that in both cases the intercept $\beta_0$ is not part of the penalty.
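
To see what the two penalties do to the estimates, it can help to look at a simplified setting. The following is only a sketch under assumptions that are not part of the text above: the features and the response are centered (so the intercept drops out and $\hat{\beta}_0 = \bar{y}$), and $X$ denotes the $n \times p$ matrix of centered features. In that case the Ridge problem has a closed-form solution:
$$
\hat{\beta}^{\text{Ridge}}_\lambda = \left(X^\top X + \lambda\, I_p\right)^{-1} X^\top y.
$$
If we additionally assume orthonormal feature columns, $X^\top X = I_p$, both problems separate coordinate by coordinate:
$$
\hat{\beta}^{\text{Ridge}}_{\lambda,j} = \frac{\hat{\beta}^{\text{OLS}}_j}{1+\lambda}, \qquad
\hat{\beta}^{\text{LASSO}}_{\lambda,j} = \text{sign}\left(\hat{\beta}^{\text{OLS}}_j\right)\,\max\left\{\left|\hat{\beta}^{\text{OLS}}_j\right| - \frac{\lambda}{2},\, 0\right\}.
$$
So in this special case Ridge scales every coefficient toward zero by the same factor, whereas the LASSO subtracts a fixed amount and sets small coefficients exactly to zero. Outside the orthonormal case the LASSO generally has no closed-form solution.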

We call $\lambda$ the *tuning parameter*, and it governs how sizable the complexity penalty will be. In particular, note that for $\lambda=0$ we are back to the unregularized problem, whereas for large $\lambda$ the penalty will be severe, which leads to *shrinkage* of the coefficient estimates. As $\lambda$ becomes larger and larger, the prediction will more and more closely resemble a *constant* prediction, $\hat{y}_i = \beta_0.$ Thus, the choice of the tuning parameter directly trades off a reduction in variance (due to shrinkage) against an increase in bias (due to the less flexible model fit). Again, we will explore these aspects in more detail in the context of our example below.
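
The shrinkage effect can be checked numerically before turning to the data. The sketch below is an illustration on simulated data, not part of the Telecom analysis: the helper `ridge_closed_form` and the variables `X_sim`, `y_sim` are hypothetical names, and the helper uses the standard closed-form Ridge solution $(X^\top X + \lambda I)^{-1} X^\top y$ on centered data so that the intercept stays unpenalized.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Illustrative Ridge fit with an unpenalized intercept (via centering)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    p = X.shape[1]
    # Solve (Xc'Xc + lam * I) beta = Xc'(y - y_mean)
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

# Simulated data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(100, 3))
y_sim = 4.0 + X_sim @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

lambdas = [0.0, 10.0, 1000.0, 1e8]
coef_norms = []
for lam in lambdas:
    beta0, beta = ridge_closed_form(X_sim, y_sim, lam)
    pred = beta0 + X_sim @ beta
    coef_norms.append(np.linalg.norm(beta))
    # The spread of the predictions collapses as lambda grows
    print(f"lambda={lam:>12.1f}  ||beta||={coef_norms[-1]:.6f}  "
          f"prediction range={pred.max() - pred.min():.6f}")
```

With $\lambda = 0$ the helper reproduces ordinary least squares, and for very large $\lambda$ the predictions are essentially the sample mean of `y_sim`.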

## Application: Telecom Work Measurement Study Again

To showcase the regularization approaches discussed here, we go back to our [Telecom Work Measurement Study](http://www.statsci.org/data/oz/telecom.html) example. Here, we predict hours worked in a given section based on the characteristics of that section.

### Preliminaries

As usual, we start by loading the data and some of the packages we will use. We use the "full" data from both sub-datasets:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

from sklearn.preprocessing import StandardScaler #Recall that for regularized approaches we need to scale our features

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
data_base = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_V_4_tel_base.csv')
data_addtl = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_V_4_tel_addtl.csv')
data = pd.concat([data_base, data_addtl], ignore_index=True)

### Baseline Linear Regression

We start by re-running the baseline linear regression using all features:

In [None]:
y = data['Hours']
X = data.drop(columns=['Hours'])
X = sm.add_constant(X)
model_full = sm.OLS(y, X).fit()
model_full.summary()

However, recall that for regularized approaches we will want to scale our features. The [standard scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) subtracts the mean of each feature and divides by its standard deviation, so that every scaled feature has mean zero and variance one:

In [None]:
X = data.drop(columns=['Hours'])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = sm.add_constant(X_scaled)
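
To make the scaling step concrete, here is a minimal sketch of the same transformation done by hand in NumPy. The data and the names `X_demo`, `col_mean`, `col_std` are made up for illustration. `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also the default of `np.std`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two hypothetical features on very different scales
X_demo = np.column_stack([
    rng.normal(loc=50.0, scale=10.0, size=200),
    rng.normal(loc=0.5, scale=0.01, size=200),
])

col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # ddof=0, i.e. the population standard deviation
X_demo_scaled = (X_demo - col_mean) / col_std

print("means after scaling:", X_demo_scaled.mean(axis=0))
print("stds after scaling: ", X_demo_scaled.std(axis=0))
```

After scaling, a penalty on $\beta_j^2$ treats both features alike, whereas on the raw scale the coefficient of the second feature would have to be much larger to have a comparable effect and would therefore be penalized more heavily.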

Let's rerun the OLS linear regression:

In [None]:
model_full = sm.OLS(y,X_scaled).fit()
model_full.summary()

So the coefficients change. However, note that the R-squared and the other fit statistics remain identical. This is no surprise because the predictions remain identical. All we do is change the "units" of the features (think about changing currencies).
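
The invariance of the OLS predictions can be verified on a small simulated example. This is a sketch with made-up data; the helper `ols_fit` and all variable names are hypothetical, and plain NumPy least squares stands in for 'statsmodels' here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150
# Two hypothetical features measured in very different units
X_raw = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(n, 2))
y_demo = 3.0 + X_raw @ np.array([1.5, -0.02]) + rng.normal(size=n)

def ols_fit(X, y):
    """Least-squares fit with an intercept; returns coefficients and fitted values."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, A @ coef

feature_std = X_raw.std(axis=0)
X_std = (X_raw - X_raw.mean(axis=0)) / feature_std

coef_raw, pred_raw = ols_fit(X_raw, y_demo)
coef_std, pred_std = ols_fit(X_std, y_demo)

print("raw coefficients:   ", coef_raw)
print("scaled coefficients:", coef_std)
print("max prediction gap: ", np.max(np.abs(pred_raw - pred_std)))
```

Each scaled slope equals the raw slope multiplied by the feature's standard deviation, and the intercept of the scaled regression is the sample mean of the response. The fitted values, and hence the R-squared, do not change.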

### Ridge Regression

Let's start by running the ridge regression using 'statsmodels', where it can be implemented as a special case of "elastic net" regression (we will come back to this). We first run the model with a zero penalty (alpha):

In [None]:
model_ridge = sm.OLS(y, X_scaled).fit_regularized(method='elastic_net', alpha=0.0, L1_wt=0.0)

In [None]:
model_ridge.params

So, what we see is that the coefficients here are exactly the same as in the OLS case---which is no surprise because we have no penalty!

However, if we go to a penalty of one, the coefficients change:

In [None]:
model_ridge = sm.OLS(y, X_scaled).fit_regularized(method='elastic_net', alpha=1.0, L1_wt=0.0)
model_ridge.params

Here, 'statsmodels' does not offer the same convenient output functionality as for plain OLS. Therefore, we switch to the (more powerful) predictive modeling package 'sklearn'. Let's plot the R-squared, the MSE, and the coefficients for the features SOA, SOB, and SOC to illustrate the impact of the penalty parameter:

In [None]:
from sklearn.linear_model import Ridge

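# Grid of penalty values; alpha=0 corresponds to unpenalized least squares
alpha_values = np.linspace(0, 500, 500)
r2_values = []
mse_values = []
coef_x3 = []
coef_x4 = []
coef_x5 = []

for alpha in alpha_values:
    # Ridge fits its own (unpenalized) intercept by default
    model_ridge = Ridge(alpha=alpha)
    model_ridge.fit(X_scaled, y)
    y_pred = model_ridge.predict(X_scaled)
    # In-sample R-squared and mean squared error
    r2 = 1 - np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2)
    mse = np.mean((y - y_pred)**2)
    r2_values.append(r2)
    mse_values.append(mse)
    # Column 0 of X_scaled is the constant from sm.add_constant, so
    # coef_[3], coef_[4], coef_[5] belong to the 3rd, 4th, and 5th features
    coef_x3.append(model_ridge.coef_[3])
    coef_x4.append(model_ridge.coef_[4])
    coef_x5.append(model_ridge.coef_[5])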

# Plot MSE and R^2
fig, ax1 = plt.subplots()
color = 'tab:red'
ax1.set_xlabel('Alpha')
ax1.set_ylabel('MSE', color=color)
ax1.plot(alpha_values, mse_values, color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()
color = 'tab:blue'
ax2.set_ylabel('R-squared', color=color)
ax2.plot(alpha_values, r2_values, color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout()
plt.title('MSE and R-squared vs. Alpha (sklearn)')
plt.show()

# Plot coefficients
plt.plot(alpha_values, coef_x3, label='x3')
plt.plot(alpha_values, coef_x4, label='x4')
plt.plot(alpha_values, coef_x5, label='x5')
plt.xlabel('Alpha')
plt.ylabel('Coefficient Value')
plt.title('Coefficients vs. Alpha (sklearn)')
plt.legend()
plt.show()


Let's also generate predictions for the last model with the highest penalty:

In [None]:
model_ridge.predict(X_scaled)

So, we notice all predictions are very similar---and very close to the mean of y:

In [None]:
np.mean(y)

Hence, we see that a large penalty pulls the predictions toward the mean!
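
All fit measures above are in-sample. To get a feel for how in- and out-of-sample fit can differ as the penalty changes, here is a minimal sketch on simulated data (not the Telecom data). The data-generating process, the 50/50 split, the alpha grid, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Simulated data: many features, only the first five matter
rng = np.random.default_rng(3)
n, p = 80, 30
X_sim = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y_sim = 10.0 + X_sim @ beta_true + rng.normal(scale=3.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_sim, y_sim, test_size=0.5, random_state=0)

# Fit the scaler on the training part only, then apply it to both parts
scaler_sim = StandardScaler().fit(X_tr)
X_tr_s = scaler_sim.transform(X_tr)
X_te_s = scaler_sim.transform(X_te)

alphas_sim = np.logspace(-2, 3, 30)
mse_train, mse_test = [], []
for a in alphas_sim:
    m = Ridge(alpha=a).fit(X_tr_s, y_tr)
    mse_train.append(np.mean((y_tr - m.predict(X_tr_s))**2))
    mse_test.append(np.mean((y_te - m.predict(X_te_s))**2))

best = int(np.argmin(mse_test))
print(f"alpha with lowest test MSE: {alphas_sim[best]:.3f}")
print(f"train MSE there: {mse_train[best]:.3f}, test MSE there: {mse_test[best]:.3f}")
```

The training MSE can only go up as alpha increases, since the penalty moves the fit away from the least-squares solution. The test MSE need not behave the same way, which is why the tuning parameter is usually chosen on held-out data or by cross-validation rather than by in-sample fit.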