Problem Statement: Predicting housing prices using the Lasso regression model from scikit-learn

The workflow follows the usual steps:
- Import the libraries needed
- Load the CSV into a dataframe
- Examine the dataframe with df.info() to get familiar with the data
- Split the data into training and test sets
- Preprocess the numerical and categorical features separately
- Create a pipeline that combines the preprocessing and the model
- I used a Lasso regression model from scikit-learn because its L1 penalty shrinks the coefficients of low-impact features, often all the way to zero. That is useful here because the dataset has many similar features that do not necessarily have much effect on the sale price
- Train the model on the training data
- Evaluate the model's performance on the test data
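
The feature-selection effect of the L1 penalty mentioned above can be sketched on synthetic data. This is only an illustration: the data, the alpha value, and the variable names are assumptions and are not taken from the housing dataset. Only the first two columns influence the target, so Lasso is expected to zero out the others while ordinary least squares keeps small nonzero weights on them.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption): 5 features, only the first two matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

print("OLS coefficients:  ", ols.coef_)
print("Lasso coefficients:", lasso.coef_)

# Count how many coefficients Lasso set exactly to zero
n_zero = int(np.sum(lasso.coef_ == 0))
print("Coefficients set to zero by Lasso:", n_zero)
```

A larger alpha removes more features but also shrinks the useful coefficients more, so the value is a trade-off rather than a fixed choice.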


In [151]:
# Install packages (if needed)
# !pip install pandas numpy matplotlib seaborn scikit-learn

# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

np.set_printoptions(precision=2)


# Load the dataset (make sure you’ve downloaded it from Kaggle first)
train_df = pd.read_csv('Data/train.csv')

# View the shape of the dataset
print("Dataset shape:", train_df.shape)


train_df.drop(columns=['Id'], inplace=True)  # Drop the 'Id' column as it's not needed for training
# Define features and target variable
# After dropping 'Id', the first 79 columns are features
# and the last column ('SalePrice') is the target
X = train_df.iloc[:, :-1]  # First 79 columns as features
y = train_df.iloc[:, -1]   # 80th column (index 79) as target ('SalePrice')

train_df.info()  # Display column names, non-null counts, and dtypes
train_df.describe()  # Display summary statistics of the numeric columns

Dataset shape: (1460, 81)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-nu

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,46.549315,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,161.319273,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,0.0,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,0.0,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


Let's split the dataframe into training and test sets

In [153]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

X_train shape: (1168, 79)
y_train shape: (1168,)


Let's break down the column types to preprocess them separately

In [154]:
# Select non-numeric columns
categorical_cols = X_train.select_dtypes(include=['object']).columns
print("Categorical columns:", categorical_cols)
# Select numeric columns
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns
print("Numerical columns:", numerical_cols)

Categorical columns: Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')
Numerical columns: Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', '

Now let's examine the correlation between the numeric features

A correlation matrix displays the pairwise linear correlations between features, which shows how strongly they depend on one another.

When the target is included as a column, it also indicates how predictive each feature is of the target. The matrix below is computed on X_train only, so it covers feature-to-feature correlations.

I can reduce strong correlations between features by keeping the best one from each correlated group.

In [155]:
X_train[numerical_cols].corr()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
MSSubClass,1.0,-0.371269,-0.116501,0.029719,-0.052768,-0.001928,0.036081,-0.013491,-0.080944,-0.074205,...,-0.09614,-0.022712,-0.011753,-0.008086,-0.058672,-0.033155,0.003578,-0.006216,-0.014139,-0.028758
LotFrontage,-0.371269,1.0,0.427009,0.249726,-0.051725,0.123858,0.091253,0.223236,0.236182,0.065393,...,0.336914,0.07767,0.143753,-0.004749,0.081709,0.052087,0.245424,0.000787,0.02385,0.003097
LotArea,-0.116501,0.427009,1.0,0.102088,0.001625,0.013541,0.017216,0.125634,0.22427,0.122366,...,0.179124,0.177537,0.08632,-0.024948,0.0232,0.046353,0.086463,0.038358,0.003973,-0.005098
OverallQual,0.029719,0.249726,0.102088,1.0,-0.087599,0.558124,0.538251,0.416085,0.204864,-0.050637,...,0.550476,0.232991,0.288691,-0.121967,0.025278,0.06084,0.079182,-0.03204,0.053355,-0.017635
OverallCond,-0.052768,-0.051725,0.001625,-0.087599,1.0,-0.386268,0.055034,-0.143063,-0.043388,0.047532,...,-0.139431,-0.007352,-0.029882,0.067696,0.017296,0.057537,-0.007496,0.077016,0.007975,0.023782
YearBuilt,-0.001928,0.123858,0.013541,0.558124,-0.386268,1.0,0.587311,0.318101,0.223348,-0.051112,...,0.46661,0.217083,0.176241,-0.392513,0.029117,-0.047356,0.004362,-0.033683,0.006257,0.00067
YearRemodAdd,0.036081,0.091253,0.017216,0.538251,0.055034,0.587311,1.0,0.159416,0.104648,-0.068891,...,0.364826,0.211789,0.224053,-0.205697,0.042385,-0.060476,0.011388,-0.006076,0.025142,0.046533
MasVnrArea,-0.013491,0.223236,0.125634,0.416085,-0.143063,0.318101,0.159416,1.0,0.244347,-0.070397,...,0.383544,0.165278,0.12048,-0.128226,0.031382,0.058113,0.021877,-0.031668,0.00281,0.005671
BsmtFinSF1,-0.080944,0.236182,0.22427,0.204864,-0.043388,0.223348,0.104648,0.244347,1.0,-0.043652,...,0.280523,0.189172,0.092833,-0.115216,0.027128,0.054527,0.165951,0.004168,-0.011884,0.032167
BsmtFinSF2,-0.074205,0.065393,0.122366,-0.050637,0.047532,-0.051112,-0.068891,-0.070397,-0.043652,1.0,...,0.003819,0.069764,0.022075,0.0551,-0.030294,0.115927,0.055403,0.008115,-0.016695,0.040306
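
To relate features to the target and to spot groups of strongly correlated features, something along these lines could be used. It is a sketch on a toy dataframe: the column names, the synthetic values, and the 0.9 threshold are assumptions. In the notebook the same calls would apply to X_train[numerical_cols] and y_train, which share the same index.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): "area_copy" nearly duplicates "area"
rng = np.random.default_rng(0)
n = 300
area = rng.normal(1500, 400, n)
toy_X = pd.DataFrame({
    "area": area,
    "area_copy": area + rng.normal(0, 20, n),
    "noise": rng.normal(0, 1, n),
})
toy_y = pd.Series(100 * area + rng.normal(0, 20000, n), name="price")

# Correlation of each feature with the target, strongest first
target_corr = toy_X.corrwith(toy_y)
print(target_corr.abs().sort_values(ascending=False))

# Feature pairs whose absolute correlation exceeds the threshold;
# only the upper triangle is kept so each pair appears once
corr = toy_X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

From each highly correlated pair, one option is to keep the feature with the larger absolute correlation with the target.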


Let's preprocess the numeric features by applying an imputer that fills missing values with the mean of that feature, followed by StandardScaler. Scaling matters because the Lasso penalty acts on coefficient size, so features on different scales would otherwise be penalized unevenly

In [156]:
# Preprocessing for numerical data
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
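
As a small illustration of what this transformer does, the sketch below runs the same two steps on a single toy column containing one missing value. The array is made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy column (assumption) with one missing value
toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

toy_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
toy_out = toy_transformer.fit_transform(toy)

# The missing entry is first replaced by the column mean (2.0),
# then the column is rescaled to zero mean and unit variance
print(toy_out.ravel())
```

Because the imputed value equals the column mean, it ends up at zero after scaling.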

Then we preprocess the categorical features in a similar way: an imputer fills missing values with the most frequent category, and OneHotEncoder encodes the categories. Setting handle_unknown='ignore' is useful because a category that appears in the test data but was not present in training is encoded as all zeros instead of raising an error. Setting sparse_output=False makes the encoder return a NumPy array instead of a sparse matrix

In [157]:
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

Now we will combine both preprocessors

In [158]:
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

Time to create the full pipeline

In [159]:
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', Lasso(alpha=0.1, random_state=42))
])

Fit the model

In [160]:
model.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(


Fitted pipeline (non-empty parameters as displayed):

Pipeline: steps=[('preprocessor', ...), ('regressor', ...)], verbose=False
  preprocessor, ColumnTransformer: transformers=[('num', ...), ('cat', ...)], remainder='drop', sparse_threshold=0.3, verbose=False, verbose_feature_names_out=True, force_int_remainder_cols='deprecated'
    num / imputer, SimpleImputer: strategy='mean', copy=True, add_indicator=False, keep_empty_features=False
    num / scaler, StandardScaler: copy=True, with_mean=True, with_std=True
    cat / imputer, SimpleImputer: strategy='most_frequent', copy=True, add_indicator=False, keep_empty_features=False
    cat / onehot, OneHotEncoder: categories='auto', sparse_output=False, dtype=<class 'numpy.float64'>, handle_unknown='ignore', feature_name_combiner='concat'
  regressor, Lasso: alpha=0.1, fit_intercept=True, precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=42, selection='cyclic'


Predict on the training set and compute the training error

In [161]:
# Predict and evaluate
y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
print(f"Train Mean Squared Error: {mse:.2f}")

Train Mean Squared Error: 381582860.54


Let's evaluate on the test set

In [162]:
y_test_pred = model.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f"Test Mean Squared Error: {test_mse:.2f}")

Test Mean Squared Error: 804730042.80
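
MSE is in squared dollars, which makes it hard to read. Taking the square root gives the RMSE in the same unit as SalePrice. The short calculation below reuses the two MSE values printed above; the variable names are only for this example.

```python
import math

# MSE values copied from the outputs above
train_mse = 381582860.54
test_mse = 804730042.80

train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

print(f"Train RMSE: {train_rmse:,.0f}")
print(f"Test RMSE:  {test_rmse:,.0f}")
print(f"Test/train RMSE ratio: {test_rmse / train_rmse:.2f}")
```

The test RMSE comes out to roughly 16% of the mean SalePrice of about 180,921 reported by describe(), and the gap between the train and test errors suggests some overfitting.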


Now let's generate predictions for the Kaggle test file (Data/test.csv), which has no SalePrice column

In [163]:
test_file_path = "Data/test.csv"
test_data = pd.read_csv(test_file_path)

ids = test_data.pop('Id')


preds = model.predict(test_data)
output = pd.DataFrame({'Id': ids,
                       'SalePrice': preds.squeeze()})

output.head()

Unnamed: 0,Id,SalePrice
0,1461,118697.778232
1,1462,160273.563883
2,1463,185765.094054
3,1464,199111.771806
4,1465,209609.223206


Sample Submission

In [165]:
sample_submission_df = pd.read_csv('Data/sample_submission.csv')
sample_submission_df['SalePrice'] = model.predict(test_data)
sample_submission_df.to_csv('submission.csv', index=False)
sample_submission_df.head()

Unnamed: 0,Id,SalePrice
0,1461,118697.778232
1,1462,160273.563883
2,1463,185765.094054
3,1464,199111.771806
4,1465,209609.223206
