## LASSO regression

LASSO (least absolute shrinkage and selection operator) regression is linear regression with L1 regularization: the loss function gains a penalty equal to the sum of the absolute values of the coefficients, multiplied by a tuning parameter alpha.

This contrasts with ridge regression, which uses L2 regularization (penalty proportional to the squared values of coefficients). 

The key difference is that LASSO tends to produce sparse models by driving some coefficients to exactly zero, which amounts to automatic feature selection, while ridge regression shrinks coefficients toward zero but keeps every feature in the model.
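For reference, the two objectives can be written out as below. This is a sketch that follows the scaling used by scikit-learn's `Lasso` and `Ridge` estimators; the symbols are notation introduced here, with $X$ the feature matrix, $y$ the target vector, $w$ the coefficient vector, and $n$ the number of samples.

$$
\hat{w}_{\text{lasso}} = \arg\min_{w} \; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha \,\lVert w \rVert_1
\qquad\qquad
\hat{w}_{\text{ridge}} = \arg\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \,\lVert w \rVert_2^2
$$

Because the data-fit terms are scaled differently in the two estimators, the same numeric alpha does not mean the same regularization strength in both.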

### Key Characteristics of LASSO Regression:

- Uses **L1 regularization**: sum of absolute values of coefficients times alpha.

- Can shrink some coefficients to exactly zero, hence performs feature selection.

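- Typically slower to fit than ridge: the L1 penalty is not differentiable at zero, so there is no closed-form solution and an iterative solver is needed (scikit-learn uses coordinate descent), which can emit convergence warnings if the iteration limit is too low.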

- Useful when you want a simpler, interpretable model with fewer features.

- The zeros fall out of solving one joint optimization problem over all coefficients, not from a sequential (stepwise) feature-selection procedure.
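To see why exact zeros appear, consider the simplest case: one standardized feature. Under scikit-learn's scaling of the objective, the LASSO coefficient in that case is the least-squares coefficient passed through a soft-thresholding function. The sketch below checks this against `Lasso`; the toy data and the helper name `soft_threshold` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso


def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return exactly 0 when |rho| <= alpha."""
    return float(np.sign(rho) * max(abs(rho) - alpha, 0.0))


# Hypothetical toy data: one standardized feature with a weak true effect.
rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
x = (x - x.mean()) / x.std()   # after this, x @ x / n == 1
y = 0.3 * x + rng.standard_normal(n)
y = y - y.mean()               # centered target, so no intercept is needed
rho = float(x @ y / n)         # unpenalized least-squares coefficient

results = []
for alpha in (0.05, 0.2, 1.0):
    closed_form = soft_threshold(rho, alpha)
    fitted = Lasso(alpha=alpha, fit_intercept=False).fit(x.reshape(-1, 1), y).coef_[0]
    results.append((alpha, closed_form, float(fitted)))
    print(f"alpha={alpha}: soft-threshold={closed_form:.4f}, sklearn Lasso={fitted:.4f}")
```

Once alpha exceeds the magnitude of the least-squares coefficient, the thresholded value is exactly zero rather than merely small. A squared (L2) penalty rescales the coefficient instead, so it never reaches zero for a nonzero least-squares coefficient.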

| Aspect                 | Ridge Regression (L2)                | LASSO Regression (L1)                     |
|------------------------|------------------------------------|------------------------------------------|
| Regularization Type    | Penalizes sum of squared coefficients | Penalizes sum of absolute coefficients    |
| Feature Selection      | No feature selection, all features retained with smaller weights | Performs automatic feature selection by zeroing some coefficients |
| Model Complexity       | Includes all features, shrunk coefficients | Simpler model with fewer features         |
| Numerical Stability    | Closed-form solution; generally stable and fast | Iterative solver; may emit convergence warnings |
| Use Case               | When all features are relevant      | When only some features matter            |
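The feature-selection row of the table can be checked empirically. The following is a minimal sketch on synthetic data, assuming a problem where only 5 of 20 features carry signal; the dataset parameters and the choice of `alpha=1.0` for both models are arbitrary illustrations, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0, max_iter=10000).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero in each fitted model.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"LASSO zeroed {n_zero_lasso} of {lasso.coef_.size} coefficients")
print(f"Ridge zeroed {n_zero_ridge} of {ridge.coef_.size} coefficients")
```

The alpha values are not comparable across the two estimators, so this only illustrates the qualitative difference in sparsity, not relative regularization strength.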

```python
# Example Python Implementation of LASSO Regression Using GridSearchCV

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# Example dataset
from sklearn.datasets import load_diabetes
data = load_diabetes()
X = data.data
y = data.target

# Split into train/dev sets
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline with scaling and LASSO regression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', Lasso(max_iter=10000))
])

# Define range of alpha values to search
alpha_values = np.logspace(-5, 1, 30)

# Parameter grid for GridSearchCV
param_grid = {'lasso__alpha': alpha_values}

# Setup GridSearchCV with cross-validation
grid_search = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', cv=5)

# Fit on training data (GridSearchCV internally does cross-validation)
grid_search.fit(X_train, y_train)

# Best alpha (chosen by cross-validation) and performance on the held-out dev set
best_alpha = grid_search.best_estimator_.named_steps['lasso'].alpha
print(f"Best alpha: {best_alpha}")

y_pred = grid_search.best_estimator_.predict(X_dev)
dev_mse = mean_squared_error(y_dev, y_pred)
print(f"Development set MSE: {dev_mse}")

# Extract non-zero coefficients
coefficients = grid_search.best_estimator_.named_steps['lasso'].coef_
non_zero_coefs = {data.feature_names[i]: coef for i, coef in enumerate(coefficients) if coef != 0}
print("Selected Features and their coefficients:")
print(non_zero_coefs)
```

Output:

```
Best alpha: 1.4873521072935119
Development set MSE: 2802.6510986801695
Selected Features and their coefficients:
{'age': np.float64(0.05934148216469456), 'sex': np.float64(-8.257945720277121), 'bmi': np.float64(26.20705687774852), 'bp': np.float64(15.179997187720314), 's1': np.float64(-5.173265909034207), 's3': np.float64(-11.174893258381422), 's5': np.float64(22.180295881633334), 's6': np.float64(1.8630807438295154)}
```


### Summary

- This pipeline scales features to z-scores before fitting LASSO.

- Grid search tests 30 alpha values using 5-fold cross-validation on the training set and picks the one with the lowest cross-validated MSE; the dev set is used only for the final evaluation.

- LASSO's automatic feature selection yields a sparser model: at the selected alpha, two of the ten features (`s2` and `s4`) have coefficients of exactly zero and drop out.

- This is beneficial when interpretability or feature sparsity is desired, unlike ridge regression which retains all features but shrinks their impact.

This matches the conceptual description above: LASSO selects a subset of features as a by-product of a single global optimization, whereas ridge regression only shrinks coefficients.
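How many features survive depends on alpha. As a rough sketch (the alpha values below are arbitrary choices for illustration, and the model is fit on the full diabetes dataset rather than the train split), the number of non-zero coefficients can be counted at a few regularization strengths:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

nonzero_counts = {}
for alpha in (0.01, 1.0, 10.0, 100.0):
    # make_pipeline names each step after its lowercased class name ("lasso").
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha, max_iter=10000))
    model.fit(X, y)
    coef = model.named_steps["lasso"].coef_
    nonzero_counts[alpha] = int(np.count_nonzero(coef))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} of {coef.size} coefficients are non-zero")
```

A sufficiently large alpha removes every feature, leaving an intercept-only model, which is why alpha is tuned by cross-validation rather than set by hand.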
