<a href="https://colab.research.google.com/github/castudil/exploratory-data-analysis/blob/main/S7%20Linear%20Regression/LASSO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LASSO

Lasso (short for Least Absolute Shrinkage and Selection Operator) is a linear regression method that, like ridge regression, adds a penalty term to the model's loss function to help prevent overfitting. However, lasso penalizes the absolute values of the coefficients (an L1 penalty), whereas ridge regression penalizes their squares (an L2 penalty).

In lasso, the penalty term is the sum of the absolute values of the coefficients, multiplied by a tuning parameter. Unlike the ridge penalty, it can shrink some coefficients to exactly zero, which effectively performs feature selection and makes the model simpler and more interpretable.
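For reference, the penalized objective can be written out explicitly. The form below is the one scikit-learn's Lasso minimizes, where $n$ is the number of samples, $p$ the number of features, $w$ the coefficient vector, and $\alpha$ the tuning parameter. Other texts scale the squared-error term differently, so treat the constant $\frac{1}{2n}$ as a convention rather than part of the definition. The intercept is not penalized.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1, \qquad \lVert w \rVert_1 = \sum_{j=1}^{p} |w_j|
$$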

Lasso is often used in situations where there are many input variables, and it is not clear which variables are most important for predicting the output. By setting some of the coefficients to zero, lasso can effectively select the most important variables and discard the rest.
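The selection effect can be seen in a small sketch. The data here are synthetic and purely illustrative (an assumption, not part of the Boston example): ten random features, of which only the first two influence the target. The names X_demo, y_demo, and demo_lasso are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): 10 features, only the first two matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

demo_lasso = Lasso(alpha=0.2).fit(X_demo, y_demo)

# Indices of the features that kept a non-zero coefficient
selected = np.flatnonzero(demo_lasso.coef_)
print("Selected feature indices:", selected)
print("Coefficients:", np.round(demo_lasso.coef_, 3))
```

The two informative features should keep sizeable coefficients, while most or all of the noise features are set to exactly zero.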

Lasso can be a powerful tool for feature selection and creating interpretable models, but it may not perform as well as more complex models in situations where the relationships between the input variables and the output are highly nonlinear or complex.

In [9]:
import statsmodels.api as sm

# Load the Boston Housing dataset (R package MASS) from the Rdatasets collection; requires internet access
data = sm.datasets.get_rdataset('Boston', package='MASS').data

# Separate the features (X) and the target variable (y)
X = data.drop('medv', axis=1)
y = data['medv']


We use the get_rdataset() function from the statsmodels library to load the Boston Housing dataset from the Rdatasets collection. We specify the name of the dataset ('Boston') and the R package it belongs to ('MASS'). The function returns a Dataset object whose data attribute holds the data as a pandas DataFrame. Then, we extract the features (X) by dropping the medv column and use medv as the target variable (y), as in the previous examples.

In [5]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Lasso regression object with alpha (tuning parameter) set to 0.1
lasso = Lasso(alpha=0.1)

# Fit the Lasso regression model to the training data
lasso.fit(X_train, y_train)

# Predict on the test data and calculate the R^2 score
score = lasso.score(X_test, y_test)
print("R^2 score:", score)


R^2 score: 0.6918147952283058


In this example, we first load the Boston Housing dataset, which is a common dataset for regression problems. Then, we split the data into training and test sets using the train_test_split() function from scikit-learn.

Next, we create a Lasso regression object using Lasso(alpha=0.1) and fit it to the training data using lasso.fit(X_train, y_train). The alpha parameter is the tuning parameter that controls the strength of the penalty term.
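To illustrate how alpha controls the strength of the penalty, the sketch below fits Lasso for several alpha values and counts the non-zero coefficients. It uses synthetic data (an assumption for illustration, with hypothetical names X_demo and y_demo) rather than the Boston dataset, and the alpha grid is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): only the first two features matter
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

alphas = [0.001, 0.1, 1.0, 10.0]  # arbitrary grid, from weak to strong penalty
nonzero_counts = []
for a in alphas:
    model = Lasso(alpha=a).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))
    print(f"alpha={a}: {nonzero_counts[-1]} non-zero coefficients")
```

A very small alpha behaves almost like ordinary least squares, while a sufficiently large alpha shrinks every coefficient to zero and leaves only the intercept.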

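Finally, we calculate the R^2 score using lasso.score(X_test, y_test), which predicts on the test data internally and compares the predictions with y_test. Because the score is computed on data the model did not see during training, it measures how well the model generalizes. A higher R^2 score indicates a better fit; here the model reaches about 0.69 on the test set.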
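The relationship between predicting and scoring can be made explicit. The sketch below, on synthetic data with hypothetical variable names, calls predict() and r2_score() from sklearn.metrics by hand and checks that the result matches the built-in score() method.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

model = Lasso(alpha=0.1).fit(Xd_train, yd_train)

# Explicit route: predict first, then compute R^2 from the predictions
y_pred = model.predict(Xd_test)
manual_r2 = r2_score(yd_test, y_pred)

# Shortcut: score() performs the same two steps internally
builtin_r2 = model.score(Xd_test, yd_test)
print(manual_r2, builtin_r2)
```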

Note that in practice, we would typically use cross-validation to choose the best value of alpha for the Lasso model, rather than setting it manually like we did in this example.
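One way to do this is with scikit-learn's LassoCV, which fits the model over a grid of alpha values and keeps the one with the best cross-validated error. The sketch below is a minimal illustration on synthetic data (an assumption, not the Boston dataset); the choice of 5 folds and the pipeline with StandardScaler are illustrative, not prescribed by this notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Scale inside the pipeline so the scaler is refit whenever the pipeline is fit
cv_model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
cv_model.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
best_alpha = cv_model.named_steps["lassocv"].alpha_
print("Alpha chosen by cross-validation:", best_alpha)
```

The fitted pipeline can then be used like any other estimator, for example cv_model.score(X_new, y_new) on held-out data.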

In [20]:
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
import numpy as np


# Standardize the features so the penalty treats every coefficient on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Initialize the Lasso model
lasso = Lasso(alpha=0.1)

# Fit the model on the standardized features
lasso.fit(X_scaled, y)

# Get the coefficients of the model
coefs = lasso.coef_

# Print each feature with its coefficient; "*" marks coefficients shrunk to exactly zero
for feature, value in zip(data.columns.drop('medv'), coefs):
    if value != 0:
        print(feature, value)
    else:
        print("*", feature, value)



crim -0.63230363899053
zn 0.7084093141987329
* indus -0.0
chas 0.6576072255074015
nox -1.574193347902239
rm 2.8262690307935308
* age -0.0
dis -2.4220790078781755
rad 1.1959368149844016
tax -0.8464677789680638
ptratio -1.9224934488824688
black 0.7621653890824795
lstat -3.7261838282515605


In this example, we reuse the Boston Housing data loaded earlier with get_rdataset() from statsmodels; the target variable y is still medv, the last column of the dataset. We standardize the features using StandardScaler from scikit-learn's preprocessing module, so that the penalty applies to every coefficient on the same scale regardless of the units of each feature.

Next, we initialize the Lasso model with an alpha value of 0.1. We fit the model using the standardized features and target variable.

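Finally, we print the coefficients of the model along with their corresponding feature names. The features with non-zero coefficients are considered relevant by the Lasso model, while those marked with an asterisk (here indus and age) were shrunk to exactly zero and are effectively dropped. Because the features are standardized, the magnitudes are comparable: lstat and rm have the largest coefficients at this value of alpha.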
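A labeled pandas Series can make this kind of inspection more convenient. The sketch below is an illustration on synthetic data with hypothetical feature names (x0 to x5), not the Boston features: it separates the kept features from the dropped ones and sorts the kept ones by absolute coefficient.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical feature names (assumed for illustration)
rng = np.random.default_rng(4)
feature_names = [f"x{i}" for i in range(6)]
X_demo = pd.DataFrame(rng.normal(size=(200, 6)), columns=feature_names)
y_demo = 3.0 * X_demo["x0"] - 2.0 * X_demo["x1"] + rng.normal(scale=0.5, size=200)

X_scaled_demo = StandardScaler().fit_transform(X_demo)
model = Lasso(alpha=0.2).fit(X_scaled_demo, y_demo)

# Attach feature names to the coefficients
coef_table = pd.Series(model.coef_, index=feature_names)

# Kept features, largest absolute coefficient first; dropped features are exactly zero
kept = coef_table[coef_table != 0].sort_values(key=np.abs, ascending=False)
dropped = coef_table[coef_table == 0].index.tolist()

print("Kept features:")
print(kept)
print("Dropped features:", dropped)
```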