# Supervised Learning with scikit-learn (Lasso Regression)

In [134]:
# Loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Loading classes
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_validate, RandomizedSearchCV
from sklearn import preprocessing
from scipy.stats import uniform

# Ignoring future warnings for readability reasons
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# Lasso Regression

Ridge regression introduces a penalty for each coefficient. The penalty is introduced by adding the sum of the absolute value of the squared coefficients of the linear regression model to the loss function. The extent of the penalty is defined by the hyperparameter alpha which is multiplied with the sum of the coefficients. It is important to know that while the coefficients of the standard OLS regression are scale invariant, those of Lasso regression aren't which is the same like when applying Ridge regression.

The advantage of Lasse regression over OLS is that to some extent ridge regression reduces the variance of the predictionms at cost at the expense of a slightly increased bias. Furthermore, in contrast to Ridge, the Lasso can rule out some of the coeffients completely and can therefore be used for feature selection and reduction.

In the following example, Lasso regression is applied on sales data using expenditure for different media channels. First, I apply Lasso regression with alpha = 0.5 and non standardized data. Then, I standardize the data and show that coefficients differ. Finally, I apply randomized search cross to hyptertune the model.

In [110]:
# Loading data
sales_df = pd.read_csv("datasets/advertising_and_sales_clean.csv")

# Preview data
sales_df.head()

Unnamed: 0,tv,radio,social_media,influencer,sales
0,16000.0,6566.23,2907.98,Mega,54732.76
1,13000.0,9237.76,2409.57,Mega,46677.9
2,41000.0,15886.45,2913.41,Mega,150177.83
3,83000.0,30020.03,6922.3,Mega,298246.34
4,15000.0,8437.41,1406.0,Micro,56594.18


In [111]:
# Splitting data into X matrix and y vector
X = sales_df.drop(["sales", "influencer"], axis = 1).values
y = sales_df["sales"].values

# Splitting data into train and test sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Initialize Ridge regression model
lasso = Lasso(alpha = 0.5)

# Train model
lasso.fit(X_train, y_train)

# Prediction
y_pred = lasso.predict(X_test)

# Compute MSE and R2 squared
mse = mean_squared_error(y_test, y_pred)
r2_squared = r2_score(y_test, y_pred)

# Print results
print("Coefficients: {}".format(lasso.coef_))
print("Mean Squared Error (Lasso regression, alpha = 0.5): {}".format(mse))
print("R2 squared (Lasso regression, alpha = 0.5): {}".format(r2_squared))

Coefficients: [ 3.56337679e+00 -2.63635719e-03 -1.40532634e-02]
Mean Squared Error (Lasso regression, alpha = 0.5): 8321632.083843608
R2 squared (Lasso regression, alpha = 0.5): 0.9990105553100916


The preprocessing library of scikit-learn includes the StandardScaler() which standardizes numerical data to zero mean and standard deviation 1.

In [112]:
# Initialize Ridge regression model
lasso_std = Lasso(alpha = 0.5)

# Standardizing Data
data_df_scaler = preprocessing.StandardScaler().fit(sales_df.drop(["influencer"], axis = 1))
data_df_std = pd.DataFrame(data_df_scaler.transform(sales_df.drop(["influencer"], axis = 1)))

# Rename columns
data_df_std.rename(columns={0: 'tv', 1: 'radio', 2: 'social_media', 3: 'sales'}, inplace=True)

# Inspecting normalized data
data_df_std.head()

Unnamed: 0,tv,radio,social_media,sales
0,-1.458233,-1.199655,-0.18792,-1.480283
1,-1.573167,-0.923162,-0.413342,-1.566885
2,-0.500455,-0.235048,-0.185464,-0.454098
3,1.108613,1.227723,1.627684,1.137871
4,-1.496545,-1.005995,-0.867238,-1.46027


In [122]:
# Split and create train and test dataset into X_std and y_std
X_std = data_df_std.drop("sales", axis = 1).values

y_std = data_df_std["sales"].values

X_std_train, X_std_test, y_std_train, y_std_test = train_test_split(X_std, y_std, test_size = 0.2, random_state = 42)

In [123]:
# Train model
lasso_std.fit(X_std_train, y_std_train)

# Prediction
y_std_pred = lasso_std.predict(X_std_test)

# Compute MSE and R2 squared
mse = mean_squared_error(y_std_test, y_std_pred)
r2_squared = r2_score(y_std_pred, y_std_test)

# Print results
print("Coefficients: {}".format(lasso_std.coef_))
print("Mean Squared Error (Lasso regression, alpha = 0.5): {}".format(mse))
print("R2 squared (Lasso regression, alpha = 0.5): {}".format(r2_squared))

Coefficients: [0.50292537 0.         0.        ]
Mean Squared Error (Lasso regression, alpha = 0.5): 0.24053322885402215
R2 squared (Lasso regression, alpha = 0.5): 0.022786049404024844


As stated above, coefficients differ when using standardized data since these are not scale invariant in Lasso regressions. Interestingly, setting alpha to 0.5 eliminates two out of three features but also the R-squared is very low. Next, I apply randomized search cross validation to hypertune alpha eventhough the fit is very good since we know that the larger alpha is, the more biased the model is. Therefore, a small alpha should be preferred.

In [133]:
# Defining k folds
kf = KFold(4, random_state=42, shuffle=True)

# Defining grid using randomly distributed values for alpha between 0 and 50
param_grid = {'alpha': np.linspace(0, 30, num=30)}

# Initializing a new model
lasso = Lasso()

# Defining Randomized Search CV
lasso_cv = RandomizedSearchCV(lasso, param_grid, cv = kf, random_state = 42, n_iter = 10)

# Fitting model using standardized train and test data
lasso_cv.fit(X_std_train, y_std_train)

# Results of GridSearchCV
pd.DataFrame(lasso_cv.cv_results_).iloc[1:,5:].sort_values("rank_test_score")

  return fit_method(estimator, *args, **kwargs)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  return fit_method(estimator, *args, **kwargs)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  return fit_method(estimator, *args, **kwargs)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  return fit_method(estimator, *args, **kwargs)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  return fit_method(estimator, *args, **kwargs)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


Unnamed: 0,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
9,{'alpha': 0.0},0.998969,0.99891,0.99901,0.9990559,0.998986,5.4e-05,1
1,{'alpha': 15.517241379310345},-0.013192,-2.7e-05,-0.012195,-5.60829e-09,-0.006353,0.00635,2
2,{'alpha': 23.793103448275865},-0.013192,-2.7e-05,-0.012195,-5.60829e-09,-0.006353,0.00635,2
3,{'alpha': 17.586206896551726},-0.013192,-2.7e-05,-0.012195,-5.60829e-09,-0.006353,0.00635,2
4,{'alpha': 8.275862068965518},-0.013192,-2.7e-05,-0.012195,-5.60829e-09,-0.006353,0.00635,2
5,{'alpha': 9.310344827586208},-0.013192,-2.7e-05,-0.012195,-5.60829e-09,-0.006353,0.00635,2
6,{'alpha': 28.965517241379313},-0.013192,-2.7e-05,-0.012195,-5.60829e-09,-0.006353,0.00635,2
7,{'alpha': 24.827586206896555},-0.013192,-2.7e-05,-0.012195,-5.60829e-09,-0.006353,0.00635,2
8,{'alpha': 12.413793103448278},-0.013192,-2.7e-05,-0.012195,-5.60829e-09,-0.006353,0.00635,2


In [132]:
# Predictions for best model from RandomizedSearchCV
y_std_pred_lasso_cv = lasso_cv.predict(X_test_std)

# Compute MSE and R2 squared
mse = mean_squared_error(y_std_test, y_std_pred_lasso_cv)
r2_squared = r2_score(y_std_test, y_std_pred_lasso_cv)

# Print results
print("Best alpha value: {}".format(lasso_cv.best_estimator_))
print("Coefficients (best lasso model): {}".format(lasso_cv.best_estimator_.coef_))
print("Mean Squared Error (best lasso model): {}".format(mse))
print("R squared (best lasso model): {}".format(r2_squared))

Best alpha value: Lasso(alpha=0.0)
Coefficients (best lasso model): [ 1.00002011e+00 -2.73876324e-04 -3.34075327e-04]
Mean Squared Error (best lasso model): 0.001318411870044389
R squared (best lasso model): 0.9986439051759302


Interestingly, the warnings indicate that there is a convergence problem for the lasso algorithm with alpha = 0 and recommend using the OLS model. For comparison, OLS regression is fitted and yields slightly better MSE and R suqared

In [131]:
# Initializing linear model
ols = LinearRegression()

# Model training
ols.fit(X_std_train, y_std_train)

# Computing predictions with test data
y_std_pred_ols = ols.predict(X_std_test)

# Compute MSE and R2 squared
mse = mean_squared_error(y_std_test, y_std_pred_ols)
r2_squared = r2_score(y_std_test, y_std_pred_ols)

# Print results
print("Coefficients: {}".format(ols.coef_))
print("Mean Squared Error: {}".format(mse))
print("R squared: {}".format(r2_squared))

Coefficients: [ 1.00002011e+00 -2.73876324e-04 -3.34075327e-04]
Mean Squared Error: 0.0009619501643116172
R squared: 0.9990105552987837
