# House Pricing Data Set: Machine Learning Examples

This is a summary of various machine learning techniques applied to the problem of house price prediction.

The dataset is from Kaggle: https://www.kaggle.com/datasets/bharatsahu/house-price-prediction-dataset-insurancehub-usa

Guillaume Häfner

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split as TTS
from sklearn.linear_model import LinearRegression as LR
# Note: despite the alias "MSE", this function returns the *root* mean squared error (RMSE).
from sklearn.metrics import root_mean_squared_error as MSE

%matplotlib inline

In [4]:
# Load the data
df = pd.read_csv("data/housing_train.csv")


## Handling of missing values
print("Shape of df before handling of missing data: ", df.shape)
df_NA = (df.isna().mean() * 100)

df.dropna(subset=df_NA[df_NA < 1].index.tolist(), inplace=True)

# Impute missing values with mean for columns with missing values higher than 1% and numeric type
for column in df_NA[df_NA >= 1].index:
    if pd.api.types.is_numeric_dtype(df[column]):
        df[column] = df[column].fillna(df[column].mean())

print("Shape of df after handling of missing data: ", df.shape)
print(df.columns)



Shape of df before handling of missing data:  (2919, 81)
Shape of df after handling of missing data:  (2883, 81)
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
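
The missing-value strategy used above (drop rows for columns with less than 1% missing values, mean-impute numeric columns at or above 1%) can be sketched on a small made-up frame. The helper name `handle_missing`, its `threshold_pct` parameter, and the toy data are assumptions for illustration and are not part of the original notebook. As in the cell above, non-numeric columns at or above the threshold are left untouched.

```python
import numpy as np
import pandas as pd


def handle_missing(frame: pd.DataFrame, threshold_pct: float = 1.0) -> pd.DataFrame:
    """Drop rows for sparsely missing columns, mean-impute the other numeric ones."""
    frame = frame.copy()
    # Percentage of missing values per column, measured before any rows are dropped.
    na_pct = frame.isna().mean() * 100

    # Columns below the threshold: drop the affected rows.
    sparse_cols = na_pct[na_pct < threshold_pct].index.tolist()
    frame = frame.dropna(subset=sparse_cols)

    # Columns at or above the threshold: fill numeric ones with the column mean.
    for column in na_pct[na_pct >= threshold_pct].index:
        if pd.api.types.is_numeric_dtype(frame[column]):
            frame[column] = frame[column].fillna(frame[column].mean())
    return frame


# Toy data (made up): 200 rows, 'a' has 1 missing value (0.5%), 'b' has 20 (10%).
toy = pd.DataFrame({"a": np.arange(200, dtype=float), "b": np.ones(200)})
toy.loc[0, "a"] = np.nan
toy.loc[10:29, "b"] = np.nan

cleaned = handle_missing(toy)
print(cleaned.shape)                   # one row dropped because of column 'a'
print(cleaned.isna().sum().to_dict())  # no missing values remain
```

Working on a copy instead of using `inplace=True` keeps the raw frame available for comparison, which is a design choice of this sketch rather than of the notebook.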

## Linear Regression Models

In this analysis, we compare the performance of three linear regression models trained on the dataset. The goal is to predict the `SalePrice` of houses based on different sets of features.

### Feature Sets

1. **LotArea Only Model**:
   - This model utilizes only the `LotArea` column as the predictor variable.
   
2. **All Columns Model**:
   - This model uses all columns except `SalePrice` as predictor variables.

3. **Significant Columns Model**:
   - This model computes each column's correlation with `SalePrice` and uses only the columns whose absolute correlation exceeds a given threshold.

### Evaluation Metric

We evaluate each model using the root mean squared error (RMSE) on the test set. RMSE is expressed in the same units as `SalePrice`, and lower values indicate better predictions.
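
The notebook imports scikit-learn's `root_mean_squared_error`, so the figures it prints are root mean squared errors. A minimal sketch of how MSE and RMSE relate, using made-up prices that are not taken from the dataset:

```python
import numpy as np

# Hypothetical prices, made up for illustration only.
y_true = np.array([200_000.0, 150_000.0, 320_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)  # in squared price units
rmse = np.sqrt(mse)         # back in the same units as the prices

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

Because RMSE is in the units of the target, it can be read as a typical prediction error in price terms, whereas MSE cannot be compared to a price directly.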


In [8]:
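TARGET_VAR = 'SalePrice'


def run_LR(feature_columns, df=df):
    """Fit a linear regression on `feature_columns` and print the test-set RMSE."""
    # One-hot encode categorical columns; their original names are replaced
    # by indicator columns named '<column>_<value>'.
    df = pd.get_dummies(df)
    X = df[feature_columns]
    y = df[TARGET_VAR]

    # Hold out 20% of the rows for testing (fixed seed for reproducibility).
    X_train, X_test, y_train, y_test = TTS(X, y, test_size=0.2, random_state=42)
    model = LR()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # MSE is the import alias for sklearn's root_mean_squared_error.
    rmse = MSE(y_test, y_pred)

    print(f"Root mean squared error: {rmse:.2f}.")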
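
`run_LR` one-hot encodes the frame with `pd.get_dummies` before selecting columns, so any text-valued column is replaced by indicator columns with new names. The sketch below shows the effect on a made-up frame; the values are illustrative, and whether the housing frame still contains text columns at this point depends on cells not shown here.

```python
import pandas as pd

# Toy frame (made up) with one numeric and one categorical column.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

encoded = pd.get_dummies(toy)

# The numeric column keeps its name; 'Street' becomes one column per category.
print(list(encoded.columns))
print("Street" in encoded.columns)
```

A feature list built from the original column names would therefore fail to select an encoded text column, so it is safer to build the list from the encoded frame.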

In [13]:
print("Only Area as feature column.")
run_LR(feature_columns=['LotArea'])

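print("All columns.")
# Take the names from the one-hot encoded frame, since run_LR selects columns after encoding.
all_features = [col for col in pd.get_dummies(df).columns if col != TARGET_VAR]
run_LR(feature_columns=all_features)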

Only Area as feature column.
Root mean squared error: 50690.91.
All columns.
Root mean squared error: 41351.77.


## Feature Engineering

Both linear regression models have a high root mean squared error (RMSE): roughly 41,000 to 51,000 in the units of `SalePrice`. So, let's find out which features correlate most strongly with `SalePrice` and try to improve the model.

In [6]:
COR_THRESHOLD = 0.5

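# numeric_only=True restricts the correlation to numeric columns
# (pandas 2.x raises an error on text columns otherwise).
feature_correlation = df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False)

# Keep features whose absolute correlation exceeds the threshold, excluding the target itself.
is_significant = (feature_correlation.abs() > COR_THRESHOLD) & (feature_correlation.index != 'SalePrice')
significant_features = feature_correlation[is_significant].index.tolist()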

print("The Significant features are:")
for feature in significant_features:
    print(f"{feature} (r = {feature_correlation[feature]:.2f})")

The Significant features are:
OverallQual (r = 0.55)
GrLivArea (r = 0.52)
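
Because the filter compares the absolute value of the correlation against the threshold, a strongly negative correlation qualifies just as a strongly positive one does. A small sketch on made-up data (the column names `price`, `size`, `age`, and `noise` are hypothetical and not from the dataset):

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "price": [100, 150, 200, 250, 300],
    "size": [50, 70, 95, 120, 150],  # rises with price
    "age": [40, 32, 20, 12, 5],      # falls as price rises
    "noise": [3, 1, 3, 1, 3],        # unrelated to price
})

# Correlation of every feature with the target, target itself excluded.
corr = toy.corr(numeric_only=True)["price"].drop("price")

# Keep features whose absolute correlation exceeds the threshold.
selected = corr[corr.abs() > 0.5].index.tolist()
print(corr.round(2).to_dict())
print(selected)
```

Here `age` is kept even though its correlation is negative, while `noise` is dropped.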


In [15]:
print("Only significant feature columns.")
run_LR(significant_features)

Only significant feature columns.
Root mean squared error: 41639.07.
