# House Pricing Data Set: Machine Learning Examples

This notebook summarizes several machine learning techniques applied to the problem of house price prediction.

The dataset is from Kaggle: https://www.kaggle.com/datasets/bharatsahu/house-price-prediction-dataset-insurancehub-usa

Guillaume Häfner

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split as TTS
from sklearn.linear_model import LinearRegression as LR
from sklearn.metrics import mean_squared_error as MSE

In [3]:
# Load the data
df = pd.read_csv("data/train.csv")
print("Shape of df before handling of missing data: ", df.shape)

## Handling of missing values
df_NA = (df.isna().mean() * 100)

# Drop rows that have missing values in columns where fewer than 1% of values are missing
df.dropna(subset=df_NA[df_NA < 1].index.tolist(), inplace=True)

# Impute with the column mean for numeric columns where at least 1% of values are missing
# (non-numeric columns above the threshold are left unchanged)
for column in df_NA[df_NA >= 1].index:
    if pd.api.types.is_numeric_dtype(df[column]):
        df[column] = df[column].fillna(df[column].mean())

print("Shape of df after handling of missing data: ", df.shape)
print(df.columns)



Shape of df before handling of missing data:  (1460, 81)
Shape of df after handling of missing data:  (1451, 81)
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
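
The threshold-based handling of missing values above can be illustrated on a small toy frame. This is only a sketch: the helper name `handle_missing`, its `threshold_pct` parameter, and the toy columns `a`, `b`, `c` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame, threshold_pct: float = 1.0) -> pd.DataFrame:
    """Hypothetical helper mirroring the drop-or-impute strategy used above."""
    na_pct = df.isna().mean() * 100

    # Columns with few missing values: drop the affected rows
    sparse_cols = na_pct[na_pct < threshold_pct].index.tolist()
    out = df.dropna(subset=sparse_cols).copy()

    # Columns with many missing values: mean-impute, but only if numeric
    for col in na_pct[na_pct >= threshold_pct].index:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
    return out


# Toy data with 200 rows (made up for illustration)
toy = pd.DataFrame({
    "a": [float(i) for i in range(200)],
    "b": [float(i % 5) for i in range(200)],
    "c": ["x"] * 200,
})
toy.loc[0, "a"] = None       # 0.5% missing: the row is dropped
toy.loc[10:19, "b"] = None   # 5% missing, numeric: mean-imputed
toy.loc[20:29, "c"] = None   # 5% missing, non-numeric: left as NaN

cleaned = handle_missing(toy)
print(cleaned.shape)
print(cleaned.isna().sum())
```

Note that the mean here is computed on the full data set before any train/test split, so information from the eventual test rows enters the imputation. Computing the mean on the training split only would avoid that.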

We will start with simple linear regression, fitting one model on `LotArea` alone and a second model on all available features.
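
As a minimal illustration of what scikit-learn's `LinearRegression` does, the sketch below fits a line by ordinary least squares to synthetic, noise-free data. The data and the coefficients 3.0 and 2.0 are invented for this example and have nothing to do with the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3x + 2 exactly (assumed values, for illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * x[:, 0] + 2.0

# Ordinary least squares recovers the slope and intercept
model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
```

With real data the relationship is noisy, so the fitted coefficients only approximate the underlying trend and the test-set error is what tells us how useful the fit is.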

In [4]:

X_area = df[['LotArea']] # Feature set 1: lot area only
X_all = df.drop(columns=['SalePrice']) # Feature set 2: every column except the target (this still includes 'Id')
y = df['SalePrice']

# Apply one-hot encoding to categorical variables
X_all = pd.get_dummies(X_all)

# Split data
X_train_area, X_test_area, y_train_area, y_test_area = TTS(X_area, y, test_size=0.2, random_state=42)
X_train_all, X_test_all, y_train_all, y_test_all = TTS(X_all, y, test_size=0.2, random_state=42)

# Train model
model_area = LR()
model_all = LR()

model_area.fit(X_train_area, y_train_area)
model_all.fit(X_train_all, y_train_all)

# Predict on the held-out test set
y_pred_area = model_area.predict(X_test_area)
y_pred_all = model_all.predict(X_test_all)

# Compute the test-set mean squared error for each model
mse_area = MSE(y_test_area, y_pred_area)
mse_all = MSE(y_test_all, y_pred_all)

print("Mean Squared Error (LotArea only):", mse_area)
print("Mean Squared Error (All other columns):", mse_all)


Mean Squared Error (LotArea only): 5074667187.426045
Mean Squared Error (All other columns): 809285788.9023495
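
Because MSE is expressed in squared units of `SalePrice`, the two numbers above are hard to read directly. One common way to interpret them, sketched here as a suggestion rather than as part of the original analysis, is to take the square root (RMSE), which is in the same units as the target. The values are copied from the output above.

```python
import math

# MSE values copied from the cell output above
mse_area = 5074667187.426045
mse_all = 809285788.9023495

# RMSE is in the same units as SalePrice
rmse_area = math.sqrt(mse_area)
rmse_all = math.sqrt(mse_all)

print(f"RMSE (LotArea only): {rmse_area:,.0f}")
print(f"RMSE (all features): {rmse_all:,.0f}")
print(f"Ratio of RMSEs: {rmse_area / rmse_all:.2f}")
```

The model trained on all features has a typical error of roughly 28,000 versus roughly 71,000 for the `LotArea`-only model, in `SalePrice` units.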
