# House Pricing Data Set: Machine Learning Examples

This is a summary of various machine learning techniques applied to the problem of house price prediction.

The dataset is from Kaggle: https://www.kaggle.com/datasets/bharatsahu/house-price-prediction-dataset-insurancehub-usa

Guillaume Häfner

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split as TTS
from sklearn.linear_model import LinearRegression as LR
from sklearn.metrics import mean_squared_error as MSE

import seaborn as sns

%matplotlib inline

In [3]:
# Load the data
df = pd.read_csv("data/train.csv")
print("Shape of df before handling of missing data: ", df.shape)

## Handling of missing values
df_NA = (df.isna().mean() * 100)

# For columns with less than 1% missing values, drop the rows that contain them
df.dropna(subset=df_NA[df_NA < 1].index.tolist(), inplace=True)

# For numeric columns with at least 1% missing values, impute the column mean
for column in df_NA[df_NA >= 1].index:
    if pd.api.types.is_numeric_dtype(df[column]):
        df[column] = df[column].fillna(df[column].mean())

print("Shape of df after handling of missing data: ", df.shape)
print(df.columns)



Shape of df before handling of missing data:  (1460, 81)
Shape of df after handling of missing data:  (1451, 81)
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
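
The missing-value rule in the cell above (drop rows for columns that are almost complete, mean-impute numeric columns that have more gaps) can be checked on a small synthetic frame. The sketch below is only an illustration: `handle_missing`, `threshold_pct`, and the toy columns `a`, `b`, and `c` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd


def handle_missing(df, threshold_pct=1.0):
    """Hypothetical helper mirroring the cell above; returns a new frame."""
    pct_missing = df.isna().mean() * 100

    # Columns below the threshold: drop the few rows that have gaps
    nearly_complete = pct_missing[pct_missing < threshold_pct].index.tolist()
    out = df.dropna(subset=nearly_complete).copy()

    # Numeric columns at or above the threshold: fill gaps with the column mean
    for col in pct_missing[pct_missing >= threshold_pct].index:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
    return out


# Synthetic data, not the housing data
toy = pd.DataFrame({
    "a": np.arange(200, dtype=float),
    "b": np.arange(200, dtype=float),
    "c": ["x"] * 200,
})
toy.loc[0, "a"] = np.nan      # 0.5% missing -> the row is dropped
toy.loc[10:29, "b"] = np.nan  # 10% missing, numeric -> mean-imputed
toy.loc[50:69, "c"] = np.nan  # 10% missing, non-numeric -> left untouched

clean = handle_missing(toy)
print(clean.shape)
print(clean.isna().sum())
```

Non-numeric columns above the threshold keep their missing values. With its default settings, `pd.get_dummies` then encodes such a row as all zeros across that column's indicator columns.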

## Linear Regression Models

In this analysis, we compare the performance of two linear regression models trained on the dataset. The goal is to predict the `SalePrice` of houses based on different sets of features.

### Feature Sets

1. **LotArea Only Model**:
   - This model uses only the `LotArea` column as the predictor variable.

2. **All Columns Model**:
   - This model uses every column except `SalePrice` as a predictor, with categorical columns one-hot encoded.

### Evaluation Metric

We evaluate the performance of each model using mean squared error (MSE) on the testing set. Lower MSE values indicate better performance in predicting house prices.
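
As a small worked example of the metric, the snippet below computes MSE by hand for three made-up prices. The numbers are invented for illustration and are not taken from the dataset.

```python
# Hypothetical true and predicted prices
y_true = [200_000, 150_000, 320_000]
y_pred = [210_000, 140_000, 300_000]

# Square each error, then average
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)  # [100000000, 100000000, 400000000]
print(mse)             # 200000000.0
```

Because the errors are squared, the single 20,000 miss contributes four times as much as each 10,000 miss, and the result is expressed in squared price units.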


In [9]:
# Apply one-hot encoding to categorical columns
df = pd.get_dummies(df)


X_area = df[['LotArea']] # Feature Set 1
X_all = df.drop(columns=['SalePrice']) # Feature Set 2
y = df['SalePrice']

# Split data
X_train_area, X_test_area, y_train_area, y_test_area = TTS(X_area, y, test_size=0.2, random_state=42)
X_train_all, X_test_all, y_train_all, y_test_all = TTS(X_all, y, test_size=0.2, random_state=42)

# Train model
model_area = LR()
model_all = LR()

model_area.fit(X_train_area, y_train_area)
model_all.fit(X_train_all, y_train_all)

# Predict on the test set
y_pred_area = model_area.predict(X_test_area)
y_pred_all = model_all.predict(X_test_all)

# Evaluate the models
mse_area = MSE(y_test_area, y_pred_area)
mse_all = MSE(y_test_all, y_pred_all)

print("Mean Squared Error (LotArea only):", mse_area)
print("Mean Squared Error (All columns):", mse_all)


Mean Squared Error (LotArea only): 5074667187.426045
Mean Squared Error (All columns): 809285788.9023495
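
Both calls to `TTS` above use the same `random_state` on inputs of the same length, so they should select the same rows. One way to make that explicit is to split the frame once and pick columns afterwards. The sketch below assumes synthetic data with made-up coefficients; `toy` and `fit_and_score` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (values are invented)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "LotArea": rng.uniform(5_000, 15_000, n),
    "GrLivArea": rng.uniform(800, 3_000, n),
})
toy["SalePrice"] = (
    10 * toy["LotArea"] + 80 * toy["GrLivArea"] + rng.normal(0, 5_000, n)
)

# Split the rows once; every feature set then shares the same train/test rows
train, test = train_test_split(toy, test_size=0.2, random_state=42)


def fit_and_score(features):
    """Fit a linear regression on the given columns and return the test MSE."""
    model = LinearRegression().fit(train[features], train["SalePrice"])
    return mean_squared_error(test["SalePrice"], model.predict(test[features]))


mse_by_features = {
    "LotArea only": fit_and_score(["LotArea"]),
    "All columns": fit_and_score(["LotArea", "GrLivArea"]),
}
for name, value in mse_by_features.items():
    print(f"Mean Squared Error ({name}): {value:.0f}")
```

With a single split there is only one `y_test`, which removes the need for separate `y_test_area` and `y_test_all` variables.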


## Feature Engineering

Both linear regression models have a high mean squared error (MSE). The next step is to identify the features most strongly correlated with `SalePrice` and test whether a model restricted to them performs better.
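
MSE is expressed in squared price units, which makes the raw numbers hard to read. Taking the square root gives the root mean squared error (RMSE), which is in the same units as `SalePrice`. The sketch below simply applies that conversion to the two MSE values printed earlier; the `rmse_*` names are introduced here for illustration.

```python
import math

# MSE values copied from the output of the linear regression cell
mse_area = 5074667187.426045
mse_all = 809285788.9023495

# RMSE is the square root of MSE
rmse_area = math.sqrt(mse_area)
rmse_all = math.sqrt(mse_all)

print(f"RMSE (LotArea only): {rmse_area:,.0f}")
print(f"RMSE (All columns): {rmse_all:,.0f}")
```

This works out to roughly 71,000 for the `LotArea` model and roughly 28,000 for the all-columns model.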

In [19]:
COR_THRESHOLD = 0.5

feature_correlation = df.corr()['SalePrice'].sort_values(ascending=False)
significant_features = feature_correlation[(abs(feature_correlation) > COR_THRESHOLD) & (feature_correlation.index != 'SalePrice')].index.tolist()

print("The significant features are:")
for feature in significant_features:
    print(f"{feature} (r = {feature_correlation[feature]:.2f})")

The significant features are:
OverallQual (r = 0.79)
GrLivArea (r = 0.71)
GarageCars (r = 0.64)
GarageArea (r = 0.62)
TotalBsmtSF (r = 0.61)
1stFlrSF (r = 0.61)
FullBath (r = 0.56)
BsmtQual_Ex (r = 0.55)
TotRmsAbvGrd (r = 0.54)
YearBuilt (r = 0.52)
YearRemodAdd (r = 0.51)
KitchenQual_Ex (r = 0.50)
KitchenQual_TA (r = -0.52)
ExterQual_TA (r = -0.59)
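
The selection rule keeps any feature whose absolute Pearson correlation with the target exceeds the threshold, so strongly negative features such as `ExterQual_TA` are retained too. A minimal sketch on synthetic data, assuming invented column names (`strong`, `weak`, `negative`) and a slightly shorter way of writing the same filter:

```python
import numpy as np
import pandas as pd

COR_THRESHOLD = 0.5

# Synthetic data: one strongly positive, one weak, one strongly negative feature
rng = np.random.default_rng(0)
n = 500
strong = rng.normal(size=n)
weak = rng.normal(size=n)
toy = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "negative": -strong + 0.5 * rng.normal(size=n),
})
toy["SalePrice"] = 3 * strong + 0.1 * weak + rng.normal(size=n)

# Correlation of every feature with the target, excluding the target itself
corr = toy.corr()["SalePrice"].drop("SalePrice")

# Keep features whose absolute correlation exceeds the threshold
selected = corr[corr.abs() > COR_THRESHOLD].index.tolist()
print(selected)
```

Dropping `SalePrice` from the series first avoids the extra index comparison, and `corr.abs()` handles both signs in one condition.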


In [27]:
X_redux = df[significant_features]

X_train_redux, X_test_redux, y_train_redux, y_test_redux = TTS(X_redux, y, test_size=0.2, random_state=42)

model_redux = LR()
model_redux.fit(X_train_redux, y_train_redux)
y_pred_redux = model_redux.predict(X_test_redux)

# Evaluate the reduced model
mse_redux = MSE(y_test_redux, y_pred_redux)

# Express each MSE as a percentage of the largest (worst) MSE
mse_max = max([mse_all, mse_area, mse_redux])

print(f"Mean Squared Error (LotArea only): {mse_area:.0f} ({100*mse_area/mse_max:.1f}%)")
print(f"Mean Squared Error (All columns): {mse_all:.0f} ({100*mse_all/mse_max:.1f}%)")
print(f"Mean Squared Error (Significant columns only): {mse_redux:.0f} ({100*mse_redux/mse_max:.1f}%)")

Mean Squared Error (LotArea only): 5074667187 (100.0%)
Mean Squared Error (All columns): 809285789 (15.9%)
Mean Squared Error (Significant columns only): 931562980 (18.4%)
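
The comparison above rests on a single 80/20 split, so part of the gap between the all-columns and reduced models may depend on which rows landed in the test set. One common way to get a more stable estimate is k-fold cross-validation. The sketch below uses scikit-learn's `cross_val_score` on synthetic data from `make_regression`, not on the housing data, and its variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the housing features
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=42)

# scikit-learn reports MSE as a negative score so that higher is always better
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
mse_per_fold = -neg_mse

print("MSE per fold:", mse_per_fold.round(1))
print("Mean MSE:", mse_per_fold.mean().round(1))
```

Running the same procedure for each feature set and comparing the mean and spread across folds would show whether the difference between models is larger than the fold-to-fold variation.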
