### 6.5 Ridge and Lasso
In this notebook, we look at two regularization techniques, **L1 (Lasso) and L2 (Ridge)** regularization, and apply them to a linear regression model. The purpose of this notebook is **illustrative**: by working through this simplified example, you should come to understand the concept of regularization and, more broadly, of model tuning and hyperparameters.
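For reference, the two penalties can be written out explicitly. The objectives below follow the form documented for scikit-learn's `Ridge` and `Lasso` estimators. The symbols are notation introduced here, not taken from the notebook: $w$ is the coefficient vector, $n$ the number of training samples, and $\alpha$ the regularization strength.

```latex
\begin{align*}
\text{Ridge (L2 penalty):} \quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso (L1 penalty):} \quad & \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{align*}
```

Because the data-fit term is scaled differently in the two objectives, the same numerical value of `alpha` does not correspond to the same amount of regularization in both models.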

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [9]:
# Open the dataset

import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("yasserh/housing-prices-dataset")
#path = kagglehub.dataset_download("ignacioazua/world-gdp-population-and-co2-emissions-dataset")

print("Path to dataset files:", path) # Path to the downloaded folder
filename = os.listdir(path)
print(filename) # Shows content of the folder
#filepath=os.path.join(path, "World_GDP_Population_CO2_Emissions_Dataset.csv")
filepath=os.path.join(path, "Housing.csv")
print(filepath)

Path to dataset files: /home/cgraiff/.cache/kagglehub/datasets/yasserh/housing-prices-dataset/versions/1
['Housing.csv']
/home/cgraiff/.cache/kagglehub/datasets/yasserh/housing-prices-dataset/versions/1/Housing.csv


In [10]:
df = pd.read_csv(filepath)
df.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [20]:
df["price_standardized"] = (df["price"] - df["price"].mean())/(df["price"].std())
df["area_standardized"] = (df["area"] - df["area"].mean())/(df["area"].std())
df

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus,price_standardized,area_standardized
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished,4.562174,1.045766
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished,4.000809,1.755397
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished,4.000809,2.216196
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished,3.982096,1.082630
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished,3.551716,1.045766
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,yes,no,yes,no,no,2,no,unfurnished,-1.575421,-0.990968
541,1767150,2400,3,1,1,no,no,no,no,no,0,no,semi-furnished,-1.603676,-1.267448
542,1750000,3620,2,1,1,yes,no,no,no,no,0,no,unfurnished,-1.612845,-0.705273
543,1750000,2910,3,1,1,no,no,no,no,no,0,no,furnished,-1.612845,-1.032440
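The cell above standardizes using the mean and standard deviation of the full dataset, before the train/test split. That is fine for an illustration, but it lets test-set statistics influence the training features. A minimal sketch of the alternative, fitting the scaler on the training portion only, might look like this. The toy data and the names `toy`, `toy_train`, `toy_test`, and `scaler` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for two numeric housing columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.uniform(1500, 16000, size=200),
    "bedrooms": rng.integers(1, 7, size=200),
})

# Split first, then learn the scaling parameters from the training rows only
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learns mean/std on train
test_scaled = scaler.transform(toy_test)        # reuses the train mean/std

print("Train means after scaling:", train_scaled.mean(axis=0))
print("Test means after scaling: ", test_scaled.mean(axis=0))
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the scaled values differ very slightly from the manual formula used above.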


In [21]:
numerical_cols = df.select_dtypes(include=[np.number])
y = numerical_cols["price_standardized"]
X = numerical_cols[["area_standardized", "bedrooms", "stories", "bathrooms", "parking"]]

print("Shape of input data: {} and shape of target variable: {}".format(X.shape, y.shape))

X.head()

Shape of input data: (545, 5) and shape of target variable: (545,)


Unnamed: 0,area_standardized,bedrooms,stories,bathrooms,parking
0,1.045766,4,3,2,2
1,1.755397,4,4,4,3
2,2.216196,3,2,2,2
3,1.08263,4,2,2,3
4,1.045766,4,2,1,2


In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Ridge Regression
ridge_model = Ridge(alpha=10)  # alpha is the regularization strength; try several values to find the best one
ridge_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)
MSE = mean_squared_error(y_test, y_pred_ridge)
print("Ridge Regression MSE:", MSE)
print("Ridge Coefficients:", ridge_model.coef_)

Ridge Regression MSE: 0.6578778520855152
Ridge Coefficients: [0.3546772  0.09418723 0.26321391 0.57026253 0.1816129 ]
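One common way to run the "several experiments" on `alpha` is cross-validation on the training data. The sketch below uses synthetic data from `make_regression` so that it runs on its own. The candidate grid and the names `X_demo`, `y_demo`, `candidate_alphas`, and `cv_mse` are assumptions made for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (hypothetical, for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Candidate regularization strengths on a log scale (an arbitrary grid)
candidate_alphas = [0.01, 0.1, 1, 10, 100]

cv_mse = {}
for alpha in candidate_alphas:
    # scikit-learn reports negated MSE so that higher scores are better
    scores = cross_val_score(
        Ridge(alpha=alpha), X_demo, y_demo,
        cv=5, scoring="neg_mean_squared_error",
    )
    cv_mse[alpha] = -scores.mean()

# Pick the alpha with the lowest average validation MSE
best_alpha = min(cv_mse, key=cv_mse.get)
for alpha, mse in cv_mse.items():
    print(f"alpha={alpha:>6}: mean CV MSE = {mse:.3f}")
print("Best alpha:", best_alpha)
```

In the notebook itself, `X_train` and `y_train` would take the place of the synthetic arrays, which keeps the test set untouched until the final evaluation.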


What counts as a "good" MSE depends heavily on the dataset and the task. A useful sanity check is to compare your error to the standard deviation of the target: if the error is only slightly smaller than the standard deviation, your model is not much better than always predicting the mean. For this comparison you need the **root mean squared error** (RMSE), which is in the same units as `y`.

In [None]:
rmse = np.sqrt(MSE)
print("RMSE:", rmse)

RMSE: 0.8110966971240329


In [None]:
# Compare it to standard deviation of y
print(np.std(y_test))

1.2019832787497904
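The comparison can be condensed into a single number. The sketch below plugs in the MSE and standard deviation printed above and computes the ratio RMSE / std together with the equivalent quantity 1 - MSE / variance. Since `np.std` uses the population formula by default, the latter should match the usual R² score on the test set. The variable names are chosen here for illustration.

```python
import numpy as np

# Values copied from the outputs above
mse_reported = 0.6578778520855152       # Ridge test MSE
std_y_reported = 1.2019832787497904     # np.std(y_test)

rmse_reported = np.sqrt(mse_reported)

# Ratio below 1 means the model beats always predicting the test mean
rmse_to_std = rmse_reported / std_y_reported

# Fraction of the target variance explained by the model
explained = 1 - mse_reported / std_y_reported**2

print("RMSE / std:", rmse_to_std)
print("1 - MSE / variance:", explained)
```

A ratio close to 1 (or an explained fraction close to 0) would indicate a model that is barely better than the mean baseline.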


In [24]:
# Lasso Regression
lasso_model = Lasso(alpha=0.1)  # alpha is the regularization strength
lasso_model.fit(X_train, y_train)
y_pred_lasso = lasso_model.predict(X_test)
print("Lasso Regression MSE:", mean_squared_error(y_test, y_pred_lasso))
print("Lasso Coefficients:", lasso_model.coef_)


Lasso Regression MSE: 0.7837322047506355
Lasso Coefficients: [0.32377743 0.02998577 0.21261271 0.29658393 0.10275146]
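In the output above, no Lasso coefficient is exactly zero at `alpha=0.1`, so the characteristic sparsity of the L1 penalty is not visible. The sketch below shows that effect on synthetic data, where only 3 of 8 features are informative. The data, the names `X_syn`, `y_syn`, and `alpha_max`, and the chosen fractions are assumptions made for illustration. `alpha_max` is the smallest penalty at which the Lasso solution (with an intercept) has all coefficients equal to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 8 features, only 3 of which actually influence y
X_syn, y_syn = make_regression(
    n_samples=200, n_features=8, n_informative=3,
    noise=5.0, random_state=0,
)

# Smallest alpha for which every Lasso coefficient is zero
# (computed on centered data because the model fits an intercept)
X_c = X_syn - X_syn.mean(axis=0)
y_c = y_syn - y_syn.mean()
alpha_max = np.max(np.abs(X_c.T @ y_c)) / len(y_syn)

nonzero_counts = {}
for fraction in [0.01, 0.1, 0.5, 1.01]:
    model = Lasso(alpha=fraction * alpha_max, max_iter=10_000)
    model.fit(X_syn, y_syn)
    nonzero_counts[fraction] = int(np.count_nonzero(model.coef_))
    print(f"alpha = {fraction:>4} * alpha_max -> "
          f"{nonzero_counts[fraction]} nonzero coefficients")
```

Ridge, by contrast, shrinks coefficients toward zero but generally does not set them exactly to zero, which is why Lasso is often used when feature selection is part of the goal.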
