# Regression with Regularization: Ridge, Lasso, and ElasticNet

## Introduction

This notebook demonstrates the use of **Ridge**, **Lasso**, and **ElasticNet** regression models to predict prices and assess the effects of regularization. Regularization is crucial in regression to prevent overfitting by penalizing large coefficients, improving generalizability on unseen data.

### Overview of Models
- **Ridge Regression (L2 Regularization)**: Penalizes large coefficients, shrinking them to prevent overfitting, while retaining all features.
- **Lasso Regression (L1 Regularization)**: Shrinks coefficients by an absolute value penalty, setting some coefficients to zero and effectively performing feature selection.
- **ElasticNet Regression**: Combines L1 and L2 penalties, balancing the strengths of Ridge and Lasso.
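
For reference, the three penalties can be written as the objectives below. This is a sketch that follows the parameterization documented for scikit-learn's `Ridge`, `Lasso`, and `ElasticNet` estimators. The symbols $n$ (number of samples), $w$ (coefficient vector), $\alpha$ (regularization strength, `alpha`), and $\rho$ (`l1_ratio`) are notation introduced here and do not appear elsewhere in the notebook.

$$
\begin{aligned}
\text{Ridge:} &\quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} &\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:} &\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Setting $\rho = 1$ in the ElasticNet objective recovers Lasso. Because the Ridge loss is not divided by $2n$, the same numeric `alpha` does not imply the same strength of regularization for Ridge as for the other two models.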

### Notebook Outline
- **Data Preparation**: Loading and scaling the dataset.
- **Model Training**: Training Ridge, Lasso, and ElasticNet models with cross-validation to optimize parameters.
- **Evaluation**: Comparing models using Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R² Score.


In [21]:
import warnings

import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# silence library warnings (note: this also hides solver convergence warnings)
warnings.filterwarnings("ignore")

### Purpose of Regularization

Regularization helps us create models that are both accurate and interpretable by:
- Preventing overfitting, especially in high-dimensional datasets.
- Enhancing interpretability through feature selection (in the case of Lasso and ElasticNet).
- Reducing model variance, making predictions more stable across different datasets.

## Data Preparation

Before training the models, we standardize the features using `StandardScaler`. Standardization centers the data around zero with a standard deviation of one, making features comparable in scale. This is essential for Ridge, Lasso, and ElasticNet, which rely on relative magnitudes of coefficients and penalize larger values.

Note: We fit the scaler on the training set only to prevent data leakage, ensuring the test data remains unseen during training.
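
The train-only fitting described in the note can be illustrated with a minimal sketch on toy data. The arrays `X_train_toy` and `X_test_toy` and the name `scaler_toy` are hypothetical and are not part of the notebook's dataset; the point is only that `transform` reuses the statistics learned by `fit_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(200, 2))
X_test_toy = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(10, 2))

scaler_toy = StandardScaler()
X_train_std = scaler_toy.fit_transform(X_train_toy)  # statistics are learned here
X_test_std = scaler_toy.transform(X_test_toy)        # statistics are reused, not re-fitted

# The stored mean is the mean of the training set only.
print(np.allclose(scaler_toy.mean_, X_train_toy.mean(axis=0)))

# Training columns are centered exactly; test columns only approximately,
# because they are shifted by the training mean rather than their own.
print(X_train_std.mean(axis=0).round(6))
print(X_test_std.mean(axis=0).round(3))
```

Calling `fit_transform` on the test set instead would compute a second set of statistics from data the model is not supposed to see.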


In [35]:
train_path = "./data/data_train.csv"
test_path = "./data/data_test.csv"
data_train = pd.read_csv(train_path)
data_test = pd.read_csv(test_path)
data_test.head()

Unnamed: 0,S_0,S_1,S_2,S_3,S_4,S_5,S_6,S_7,S_8,S_9,...,S_89,S_90,S_91,S_92,S_93,S_94,S_95,S_96,S_97,price
0,830,260,24.9,30,216.22,4,190,1580,40.3,40.84,...,544.0,652.8,1755.0,4718,36.4,30.32,1625.98,28449.75,377633.3,40
1,820,250,303.4,370,928.27,7,2400,3052,124.4,108.42,...,1408.4,1473.5,8842.18,8864,105.52,105.32,12113.01,45966.0,1188995.8,580
2,1860,480,539.4,290,836.28,4,3300,2859,111.4,104.33,...,1294.2,1288.0,8555.54,8585,95.35,94.34,12063.5,11491.5,954628.6,380
3,1230,270,123.0,100,341.94,6,630,3678,86.2,83.21,...,1005.0,1270.5,8002.21,8058,79.65,76.92,5062.76,43710.0,788018.3,140
4,2100,500,126.0,60,1158.47,11,300,3321,31.4,33.0,...,225.7,409.3,1750.0,4063,27.84,18.37,1549.44,35763.0,420700.5,180


In [3]:
print("train:", data_train.shape)
print("test:", data_test.shape)

train: (334, 99)
test: (10, 99)


In [6]:
y_train = data_train['price'].values # train target
y_test = data_test['price'].values # test target
X_train = data_train.drop(columns= ['price']).values # train variables
X_test = data_test.drop(columns= ['price']).values # test variables

### Standardization

`StandardScaler` is used to **standardize** features by removing the mean and scaling to unit variance. This centers the data around zero with a standard deviation of one, making feature scales more uniform. Standardization is especially important for regularized regression models like **Ridge**, **Lasso**, and **ElasticNet**, as well as any algorithm sensitive to feature scale.

Regularization methods apply penalties to control coefficient magnitudes and prevent overfitting. If features vary widely in scale, the penalty falls unevenly: a feature with small raw values needs a large coefficient to have any effect and is therefore penalized heavily, while a feature with large raw values needs only a small coefficient and is barely penalized. To allow regularization to affect all coefficients fairly, we need to standardize the features.

Standardized data also helps iterative solvers converge more quickly and reliably. When features share a common scale, the optimization problem is better conditioned and no single feature dominates the updates.
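
A small sketch of how feature scale interacts with the penalty, using synthetic data rather than the notebook's dataset. The two features `x1` and `x2` contribute equally to the target, but the second one is stored with values 1000 times smaller. The names `X_raw`, `y_toy`, `coef_raw`, and `coef_std`, and the choice `alpha=0.5`, are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y_toy = x1 + x2 + 0.1 * rng.normal(size=n)  # both features matter equally

# Second feature stored with tiny raw values: it would need a coefficient
# of about 1000 to contribute, which the L1 penalty makes very expensive.
X_raw = np.column_stack([x1, x2 / 1000.0])

coef_raw = Lasso(alpha=0.5).fit(X_raw, y_toy).coef_
coef_std = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X_raw), y_toy).coef_

print("raw features:         ", coef_raw)  # second coefficient is driven to zero
print("standardized features:", coef_std)  # both coefficients are shrunk similarly
```

On the raw data the penalty removes the small-valued feature entirely, even though it is as informative as the first; after standardization both are treated alike.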


In [7]:
# standard scaling, since Ridge has no built-in normalization
scaler = StandardScaler()
X_scaled_train = scaler.fit_transform(X_train)
X_scaled_test =  scaler.transform(X_test)

## Model Training

We train three regression models with cross-validation:
- **RidgeCV**: Optimizes the `alpha` parameter to control the degree of L2 regularization.
- **LassoCV**: Selects the best `alpha` for L1 regularization, potentially setting some coefficients to zero for feature selection.
- **ElasticNetCV**: Chooses optimal `alpha` and `l1_ratio`, balancing L1 and L2 penalties.

The ranges for `alpha` and `l1_ratio` are chosen based on common practice, allowing the models to explore low to moderate regularization.

For each model, we fit the training data with cross-validation to select the best regularization parameter (`alpha`). This helps in finding a model that generalizes well on unseen data.


In [8]:
# define parameters
alphas=np.logspace(-4, -1, 4)
l1_ratio=np.arange(0.6, 1, 0.1)
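
The two grids defined above are small, and it can help to see the candidate values explicitly. The sketch below repeats the same NumPy calls under the hypothetical names `alphas_grid` and `l1_grid` so it can run on its own.

```python
import numpy as np

alphas_grid = np.logspace(-4, -1, 4)  # same call as for `alphas` above
l1_grid = np.arange(0.6, 1, 0.1)      # same call as for `l1_ratio` above

print(alphas_grid)  # four candidates, one per decade from 1e-4 to 1e-1
print(l1_grid)      # four candidates, approximately 0.6, 0.7, 0.8, 0.9
```

The largest candidates are therefore `alpha = 0.1` and `l1_ratio` of about 0.9. Because `np.arange` accumulates floating-point steps, the `l1_ratio` values are only approximately the round numbers shown.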

In [10]:
# models
reg_ridge = linear_model.RidgeCV(alphas=alphas)
reg_lasso = linear_model.LassoCV(alphas=alphas, max_iter=10000)
reg_elas = linear_model.ElasticNetCV(alphas=alphas, l1_ratio=l1_ratio)

In [11]:
# fit the models
reg_ridge.fit(X_scaled_train, y_train)
reg_lasso.fit(X_scaled_train, y_train)
reg_elas.fit(X_scaled_train, y_train)

## Optimal Parameters

All three models select the same `alpha` value of 0.1. Note that 0.1 is the largest value in the search grid (`np.logspace(-4, -1, 4)`), so this result shows only that the strongest regularization on offer was preferred. The true optimum may lie above 0.1, and a wider grid would be needed to confirm it. Likewise, the `l1_ratio` of 0.9 selected by ElasticNet is the largest candidate in its grid. This preference for a mostly L1 penalty is consistent with Lasso’s tendency to set coefficients to zero and focus the model on the most relevant features, although the notebook does not inspect the coefficients directly.
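
When a cross-validated search returns a value, it is worth checking whether that value sits on the boundary of the grid, since a boundary hit suggests the grid should be widened. The helper `at_grid_edge` below is a hypothetical utility written for this illustration; it is not part of scikit-learn or of the notebook's code.

```python
import numpy as np

def at_grid_edge(selected, grid):
    """Return 'lower', 'upper', or None depending on where `selected` sits in `grid`."""
    grid = np.sort(np.asarray(grid, dtype=float))
    if np.isclose(selected, grid[0]):
        return "lower"
    if np.isclose(selected, grid[-1]):
        return "upper"
    return None

alpha_grid = np.logspace(-4, -1, 4)   # the alpha grid used in this notebook
print(at_grid_edge(0.1, alpha_grid))   # -> upper
print(at_grid_edge(0.01, alpha_grid))  # -> None
```

In the notebook, the same check could be applied to `reg_ridge.alpha_`, `reg_lasso.alpha_`, and `reg_elas.alpha_` against `alphas`.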

In [26]:
# get selected attributes
ridge_alpha = reg_ridge.alpha_
ridge_coef = reg_ridge.coef_
lasso_alpha = reg_lasso.alpha_
lasso_coef = reg_lasso.coef_
elas_alpha = reg_elas.alpha_
elas_coef = reg_elas.coef_
elas_l1 = reg_elas.l1_ratio_

# predictions are computed in a later cell, so only the selected
# parameters and coefficients are collected here
params_dict = {
    'ridge': {'alpha': ridge_alpha, 'coefficients': ridge_coef},
    'lasso': {'alpha': lasso_alpha, 'coefficients': lasso_coef},
    'elastic_net': {'alpha': elas_alpha, 'l1_ratio': elas_l1, 'coefficients': elas_coef}
}

In [28]:
report_data = {
    'Model': ['Ridge', 'Lasso', 'ElasticNet'],
    'Alpha': [params_dict['ridge']['alpha'], params_dict['lasso']['alpha'], params_dict['elastic_net']['alpha']],
    'L1 Ratio': [None, None, params_dict['elastic_net']['l1_ratio']]  # L1 ratio only applies to ElasticNet
}
params_df = pd.DataFrame(report_data)
params_df

Unnamed: 0,Model,Alpha,L1 Ratio
0,Ridge,0.1,
1,Lasso,0.1,
2,ElasticNet,0.1,0.9


## Predictions and Model Evaluation

After fitting each model, we make predictions on the test set to evaluate its performance. Additionally, we extract the best regularization parameters (`alpha`) and the coefficients to compare the influence of each feature across models.


In [32]:
# Predictions on the test set
ridge_predictions = reg_ridge.predict(X_scaled_test)
lasso_predictions = reg_lasso.predict(X_scaled_test)
elas_preds = reg_elas.predict(X_scaled_test)

In [19]:
# Evaluation Metrics
def evaluate_model(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}
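
To make the four metrics concrete, here is a worked example on three made-up values. The arrays `y_true_toy` and `y_pred_toy` are hypothetical; the comments show the arithmetic behind each number.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_toy = np.array([100.0, 200.0, 300.0])
y_pred_toy = np.array([110.0, 190.0, 330.0])  # errors: -10, +10, -30

mae_toy = mean_absolute_error(y_true_toy, y_pred_toy)  # (10 + 10 + 30) / 3
mse_toy = mean_squared_error(y_true_toy, y_pred_toy)   # (100 + 100 + 900) / 3
rmse_toy = np.sqrt(mse_toy)                            # back in the units of the target
r2_toy = r2_score(y_true_toy, y_pred_toy)              # 1 - 1100 / 20000

print(round(mae_toy, 3), round(mse_toy, 3), round(rmse_toy, 3), round(r2_toy, 3))
# -> 16.667 366.667 19.149 0.945
```

The single error of 30 accounts for 900 of the 1100 total squared error, which is why MSE and RMSE react more strongly to occasional large misses than MAE does.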

In [22]:
# Evaluate each model
ridge_eval = evaluate_model(y_test, ridge_predictions)
lasso_eval = evaluate_model(y_test, lasso_predictions)
elas_eval = evaluate_model(y_test, elas_preds)

In [31]:
# Evaluation dictionary
eval_results = {
    'Ridge': ridge_eval,
    'Lasso': lasso_eval,
    'ElasticNet': elas_eval
}

### Model Performance & Discussion

1. **Mean Absolute Error (MAE)**:
   - The **Lasso model** stands out here with the lowest MAE of 29.09, which means it has the smallest average prediction error. This suggests that Lasso is really effective at keeping its predictions close to the actual values overall.
   - **Ridge** follows closely with an MAE of 29.76. It’s also quite accurate but just slightly behind Lasso in terms of precision for individual predictions.
   - **ElasticNet** comes in with the highest MAE at 30.44, so it doesn’t quite match Ridge or Lasso in minimizing average errors.

2. **Mean Squared Error (MSE)**:
   - Lasso performs best here again with the lowest MSE of 1082.82. Because MSE squares each error, a lower value indicates that Lasso makes fewer or smaller large errors.
   - Ridge is also strong, with an MSE of 1162.16. It’s close to Lasso and does a good job of keeping big errors in check.
   - ElasticNet, with an MSE of 1330.08, doesn’t manage larger errors as effectively, trailing behind both Ridge and Lasso on this metric.

3. **Root Mean Squared Error (RMSE)**:
   - RMSE generally follows the MSE trend since it’s derived from it. Here, **Lasso** again leads with the lowest RMSE at 32.91, followed by **Ridge** at 34.09, and **ElasticNet** at 36.47.
   - Since RMSE penalizes larger errors, Lasso’s lower RMSE suggests it’s the best choice if we’re particularly concerned about occasional larger errors in predictions.

4. **R² Score (Coefficient of Determination)**:
   - With an R² score of 0.953, **Lasso** explains about 95.3% of the variance in the target variable, showing it captures the underlying patterns of the data quite well.
   - **Ridge** is right behind with an R² of 0.949, which means it’s also a solid fit for the data, though not quite as high as Lasso.
   - **ElasticNet** has an R² of 0.942, so it explains slightly less variance than the other two models.

In summary, Lasso has the edge overall, with the best score on every metric. Ridge is also a good choice and holds up well next to Lasso, while ElasticNet performs slightly worse on this dataset, possibly because the combined penalties add no extra benefit here. Note that the test set contains only 10 rows, so differences of this size should be interpreted with caution.
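
The feature-selection behaviour attributed to Lasso can be checked by counting non-zero coefficients. The sketch below uses synthetic data from `make_regression`, not the notebook's dataset, and the helper `count_nonzero_coefs`, the tolerance, and `alpha=2.0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only 5 actually drive the target.
X_syn, y_syn = make_regression(
    n_samples=100, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X_syn = StandardScaler().fit_transform(X_syn)

def count_nonzero_coefs(model, tol=1e-8):
    """Count coefficients whose absolute value exceeds `tol`."""
    return int(np.sum(np.abs(model.coef_) > tol))

ridge_syn = Ridge(alpha=2.0).fit(X_syn, y_syn)
lasso_syn = Lasso(alpha=2.0).fit(X_syn, y_syn)

n_ridge = count_nonzero_coefs(ridge_syn)
n_lasso = count_nonzero_coefs(lasso_syn)
print("Ridge non-zero coefficients:", n_ridge)
print("Lasso non-zero coefficients:", n_lasso)
```

Ridge keeps every feature with a shrunken weight, while Lasso drops some of them. Passing the notebook's fitted `reg_ridge`, `reg_lasso`, and `reg_elas` to the same helper would show how many of the 98 features each model retains.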


In [24]:
# Convert evaluation results to a DataFrame 
eval_df = pd.DataFrame(eval_results).T  
eval_df

Unnamed: 0,MAE,MSE,RMSE,R2
Ridge,29.756227,1162.157554,34.090432,0.949474
Lasso,29.094528,1082.816244,32.906173,0.952923
ElasticNet,30.435917,1330.078827,36.470246,0.942173


## Conclusion

Based on the evaluation metrics:
- **Lasso Regression** performed best overall, achieving the lowest MAE, MSE, and RMSE, as well as the highest R². This suggests that Lasso is the most effective model for this dataset, likely due to its ability to set some coefficients to zero, effectively selecting features that best contribute to the target variable.
- **Ridge Regression** also performed well and might be preferred if retaining all features is important, as it does not set coefficients to zero.
- **ElasticNet** did not perform as well as the other two models on this dataset. It may be less suitable here since the combination of L1 and L2 regularization did not provide significant additional benefit.

In summary, **Lasso Regression** is recommended for this dataset due to its superior performance across all metrics. However, **Ridge Regression** remains a strong alternative if feature selection is not desired.

## Appendix

Below is the `regression` function used in this notebook, which trains Ridge, Lasso, and ElasticNet models with cross-validation, scales features, and evaluates performance.

This function prepares the data, scales it using `StandardScaler`, trains the models, and then returns the predictions, coefficients, and selected parameters for each model.


In [18]:
def regression(data_train, data_test):
    # prepare model input
    y_train = data_train['price'].values # train target
    y_test = data_test['price'].values # test target
    X_train = data_train.drop(columns= ['price']).values # train variables
    X_test = data_test.drop(columns= ['price']).values # test variables

    # standard scaling, since Ridge has no built-in normalization
    scaler = StandardScaler()
    X_scaled_train = scaler.fit_transform(X_train)
    X_scaled_test =  scaler.transform(X_test)

    # define parameters
    alphas=np.logspace(-4, -1, 4)
    l1_ratio=np.arange(0.6, 1, 0.1)

    # models
    reg_ridge = linear_model.RidgeCV(alphas=alphas)
    reg_lasso = linear_model.LassoCV(alphas=alphas, max_iter=10000)
    reg_elas= linear_model.ElasticNetCV(alphas=alphas, l1_ratio=l1_ratio)

    # fit the models
    reg_ridge.fit(X_scaled_train, y_train)
    reg_lasso.fit(X_scaled_train, y_train)
    reg_elas.fit(X_scaled_train, y_train)
    
    # predictions on the test set
    ridge_predictions = reg_ridge.predict(X_scaled_test)
    lasso_predictions = reg_lasso.predict(X_scaled_test)
    elas_preds=reg_elas.predict(X_scaled_test)
   
    # get selected attributes of the fitted models
    ridge_alpha = reg_ridge.alpha_
    ridge_coef = reg_ridge.coef_
    lasso_alpha = reg_lasso.alpha_
    lasso_coef = reg_lasso.coef_
    elas_alpha = reg_elas.alpha_
    elas_coef = reg_elas.coef_
    elas_l1 = reg_elas.l1_ratio_
    
    return {
        'ridge': {'alpha': ridge_alpha, 'pred': ridge_predictions, 'coefficients': ridge_coef},
        'lasso': {'alpha': lasso_alpha, 'pred': lasso_predictions, 'coefficients': lasso_coef},
        'elastic_net': {'alpha': elas_alpha, 'l1_ratio': elas_l1, 'pred': elas_preds, 'coefficients': elas_coef}
        }
