## Lasso & Ridge

Lasso and Ridge are regularization techniques that reduce overfitting in linear regression models. Both add a penalty on the model coefficients to the cost function being minimized: Lasso penalizes the sum of their absolute values (L1), and Ridge penalizes the sum of their squares (L2). The penalty discourages large coefficients, which lowers the complexity of the model and, in turn, the risk of overfitting. Lasso can also shrink some coefficients to exactly zero, so it doubles as a feature selection method.
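
To make the penalty term concrete, here is a minimal sketch of the two cost functions, written to mirror the objective forms given in the scikit-learn documentation for `Lasso` and `Ridge`. The function names and the toy data are hypothetical and only serve as illustration.

```python
import numpy as np


def lasso_objective(X, y, coef, intercept, alpha):
    """Halved mean squared error plus an L1 penalty on the coefficients."""
    residuals = y - (X @ coef + intercept)
    return (residuals @ residuals) / (2 * len(y)) + alpha * np.sum(np.abs(coef))


def ridge_objective(X, y, coef, intercept, alpha):
    """Sum of squared errors plus an L2 penalty on the coefficients."""
    residuals = y - (X @ coef + intercept)
    return residuals @ residuals + alpha * np.sum(coef ** 2)


# Toy data (made up): the fit is perfect, so only the penalty contributes.
X_toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
coef_toy = np.array([3.0, -4.0])
y_toy = X_toy @ coef_toy

l1_cost = lasso_objective(X_toy, y_toy, coef_toy, 0.0, alpha=0.1)  # 0.1 * (3 + 4)
l2_cost = ridge_objective(X_toy, y_toy, coef_toy, 0.0, alpha=0.1)  # 0.1 * (9 + 16)
print(l1_cost, l2_cost)
```

Because the L2 penalty squares each coefficient, it punishes large coefficients much harder than small ones, while the L1 penalty grows at the same rate everywhere, which is what allows Lasso to push small coefficients all the way to zero.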


In [134]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
import numpy as np

In [135]:
df = pd.read_csv("./datasets/dataset_colesterol.csv")

In [136]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Id                  1000 non-null   int64  
 1   Grupo Sanguíneo     1000 non-null   object 
 2   Fumante             1000 non-null   object 
 3   Nível de Atividade  1000 non-null   object 
 4   Idade               1000 non-null   int64  
 5   Peso                1000 non-null   float64
 6   Altura              1000 non-null   int64  
 7   Colesterol          1000 non-null   float64
dtypes: float64(2), int64(3), object(3)
memory usage: 62.6+ KB


In [137]:
df.drop("Id", axis=1, inplace=True)
df = pd.get_dummies(
    df, columns=["Grupo Sanguíneo", "Fumante", "Nível de Atividade"], drop_first=True
)

In [138]:
df.head()

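| | Idade | Peso | Altura | Colesterol | Grupo Sanguíneo_AB | Grupo Sanguíneo_B | Grupo Sanguíneo_O | Fumante_Sim | Nível de Atividade_Baixo | Nível de Atividade_Moderado |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | 85.1 | 186 | 199.63 | 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 68 | 105.0 | 184 | 236.98 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 25 | 64.8 | 180 | 161.79 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 43 | 120.2 | 167 | 336.24 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 79 | 88.5 | 175 | 226.23 | 1 | 0 | 0 | 0 | 1 | 0 |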


## Lasso (L1 regularization)


In [139]:
X = df.drop("Colesterol", axis=1)
y = df["Colesterol"]

In [140]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=51
)

In [141]:
# The larger alpha is, the stronger the penalty and the fewer features keep a nonzero coefficient
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train, y_train)
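
The effect of `alpha` can be checked directly by refitting with several values and counting the surviving coefficients. The sketch below uses synthetic data from `make_regression` rather than the cholesterol dataset, and the alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for this sketch): 10 features, only 3 informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

nonzero_counts = {}
for alpha in [0.01, 1.0, 10_000.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count the coefficients that the L1 penalty did not shrink to exactly zero
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} nonzero coefficients")
```

With a very large alpha the penalty dominates and every coefficient is driven to zero, leaving a model that only predicts the intercept.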

In [142]:
# Show feature importance, measured here as the absolute value of each coefficient
def show_feature_importance(model):
    model_name = model.__class__.__name__
    print(f"Feature importance for {model_name}:")
    importance = np.abs(model.coef_)
    # Relies on the global X, so the model must have been fitted on X's columns
    feature_names = X.columns

    # Sort features by absolute coefficient in descending order
    sorted_indices = np.argsort(importance)[::-1]
    sorted_features = feature_names[sorted_indices]
    sorted_importance = importance[sorted_indices]

    # Use a separate loop variable so the `importance` array is not shadowed
    for feature, value in zip(sorted_features, sorted_importance):
        print(f"- {feature}: {value}")

In [143]:
show_feature_importance(lasso_model)

Feature importance for Lasso:
- Peso: 2.495925884059651
- Altura: 2.211340300934588
- Fumante_Sim: 1.6291029934262662
- Nível de Atividade_Moderado: 1.258715438477648
- Grupo Sanguíneo_AB: 1.0751583484069567
- Nível de Atividade_Baixo: 0.8075223247778703
- Idade: 0.018362011583088
- Grupo Sanguíneo_O: 0.010553536908134856
- Grupo Sanguíneo_B: 0.0
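
One caveat when reading coefficient magnitudes as importance: they depend on the units of each feature, and the features here (age, weight, height, 0/1 dummies) are on different scales. The sketch below uses made-up data to show how standardizing the features first can change the ranking. The pipeline is an assumption for illustration and is not part of the notebook above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
small_scale = rng.normal(0.0, 1.0, size=n)    # standard deviation about 1
large_scale = rng.normal(0.0, 100.0, size=n)  # standard deviation about 100
X_demo = np.column_stack([small_scale, large_scale])
# The large-scale feature moves y more overall, despite its small raw coefficient
y_demo = 3.0 * small_scale + 0.05 * large_scale + rng.normal(0.0, 0.1, size=n)

raw_model = Lasso(alpha=0.1).fit(X_demo, y_demo)
scaled_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X_demo, y_demo)

raw_coef = np.abs(raw_model.coef_)
scaled_coef = np.abs(scaled_model[-1].coef_)  # coefficients of the Lasso step
print("raw:   ", raw_coef)
print("scaled:", scaled_coef)
```

On the raw features the first column looks more important, while after standardization the second one does, because each coefficient then measures the effect of a one-standard-deviation change.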


In [144]:
def show_model_performance(model, X_test, y_test):
    """Return the root mean squared error (RMSE) of the model on the given data."""
    y_pred = model.predict(X_test)
    return root_mean_squared_error(y_test, y_pred)

In [145]:
show_model_performance(lasso_model, X_test, y_test)

8.961885997805972
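
The RMSE is in the same units as the target, so the value above is a typical prediction error in cholesterol units. As a small worked example with made-up numbers, the metric can be computed by hand and checked against scikit-learn (using `mean_squared_error` plus a square root here, which also works on versions that predate `root_mean_squared_error`).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([200.0, 180.0, 240.0, 210.0])
y_pred_demo = np.array([190.0, 185.0, 250.0, 210.0])

errors = y_true_demo - y_pred_demo             # [10, -5, -10, 0]
rmse_manual = np.sqrt(np.mean(errors ** 2))    # sqrt((100 + 25 + 100 + 0) / 4) = 7.5
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_manual, rmse_sklearn)
```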

## LassoCV (L1 regularization with cross-validation)


In [146]:
lasso_cv_model = LassoCV(alphas=[0.1, 0.5, 1], cv=5, random_state=51)
lasso_cv_model.fit(X, y)

In [147]:
show_feature_importance(lasso_cv_model)

Feature importance for LassoCV:
- Peso: 2.48720195517911
- Altura: 2.2015378865949926
- Fumante_Sim: 1.8303877959017587
- Grupo Sanguíneo_AB: 1.7440905213856726
- Nível de Atividade_Baixo: 0.4793488234111328
- Idade: 0.01862481015222054
- Nível de Atividade_Moderado: 0.0
- Grupo Sanguíneo_O: 0.0
- Grupo Sanguíneo_B: 0.0


In [148]:
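# Note: the model was fitted on all of X and y, so this is an in-sample RMSE and
# is not directly comparable to the held-out RMSE reported for Lasso above
show_model_performance(lasso_cv_model, X, y)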

8.778812768540636
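
Since the cell above scores `LassoCV` on the same rows it was fitted on, a fairer comparison with the plain Lasso model would run the cross-validation on the training split only and score on the held-out split. The following is a minimal sketch of that workflow on synthetic data. The variable names are hypothetical, and the selected alpha is read from the fitted model's `alpha_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (an assumption for this sketch)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=51
)

candidate_alphas = [0.1, 0.5, 1.0]
# Cross-validation only ever sees the training split
cv_model = LassoCV(alphas=candidate_alphas, cv=5).fit(X_tr, y_tr)

print("selected alpha:", cv_model.alpha_)
held_out_rmse = np.sqrt(mean_squared_error(y_te, cv_model.predict(X_te)))
print("held-out RMSE:", held_out_rmse)
```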

## Ridge (L2 regularization)


In [149]:
ridge_model = Ridge(alpha=0.1)
ridge_model.fit(X_train, y_train)

In [150]:
show_feature_importance(ridge_model)

Feature importance for Ridge:
- Peso: 2.4863830923553354
- Nível de Atividade_Moderado: 2.410334561153793
- Altura: 2.2052163884018507
- Fumante_Sim: 2.14791216785652
- Grupo Sanguíneo_AB: 1.8485448776951132
- Nível de Atividade_Baixo: 1.8442940333342237
- Grupo Sanguíneo_O: 0.7972525784697226
- Grupo Sanguíneo_B: 0.2229554835383535
- Idade: 0.014864518078201874


In [151]:
show_model_performance(ridge_model, X_test, y_test)

9.017498964051006
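
Comparing the Ridge output with the Lasso output above, Ridge keeps every feature with a nonzero coefficient, whereas Lasso set one of them to exactly zero. This is the typical behavior of the two penalties, and the sketch below reproduces it on made-up data where only the first three of six features matter. The coefficients and alpha values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0])  # last three are irrelevant
y_demo = X_demo @ true_coef + rng.normal(0.0, 0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

lasso_zeros = int(np.sum(lasso_demo.coef_ == 0))
ridge_zeros = int(np.sum(ridge_demo.coef_ == 0))
print("Lasso coefficients:", lasso_demo.coef_, "-> exact zeros:", lasso_zeros)
print("Ridge coefficients:", ridge_demo.coef_, "-> exact zeros:", ridge_zeros)
```

Ridge shrinks the irrelevant coefficients toward zero but leaves them small and nonzero, so it does not perform feature selection on its own.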

## RidgeCV (L2 regularization with cross-validation)


In [152]:
ridge_cv_model = RidgeCV(alphas=[0.1, 0.5, 1], cv=5)
ridge_cv_model.fit(X, y)

In [153]:
show_feature_importance(ridge_cv_model)

Feature importance for RidgeCV:
- Peso: 2.4742638075268193
- Fumante_Sim: 2.3071403992069373
- Grupo Sanguíneo_AB: 2.2325133270026623
- Altura: 2.1905519870475207
- Nível de Atividade_Baixo: 1.5099077070763411
- Nível de Atividade_Moderado: 1.0887598616405172
- Grupo Sanguíneo_B: 0.11471258685377374
- Grupo Sanguíneo_O: 0.018148395760780478
- Idade: 0.016838347188250894


In [154]:
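# Note: the model was fitted on all of X and y, so this is an in-sample RMSE and
# is not directly comparable to the held-out RMSE reported for Ridge above
show_model_performance(ridge_cv_model, X, y)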

8.761338616224378
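
The three candidate alphas used above cover barely one order of magnitude. A common alternative is to search a logarithmic grid, sketched here on synthetic data. The grid bounds and the variable names are assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (an assumption for this sketch)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=4, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=51
)

# 13 candidate values spaced evenly on a log scale between 0.001 and 1000
alpha_grid = np.logspace(-3, 3, 13)
ridge_cv_demo = RidgeCV(alphas=alpha_grid, cv=5).fit(X_tr, y_tr)

print("selected alpha:", ridge_cv_demo.alpha_)
print("held-out R^2:", ridge_cv_demo.score(X_te, y_te))
```

If the selected alpha lands on either end of the grid, that is usually a sign the grid should be extended in that direction.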