# 1 Background Information

When an analyst determines the price of a house, they need to consider many factors, including location, size, and construction material quality. Luckily, machine learning models exist to do the work for them. In this project, I will train and evaluate regression models to predict the prices of houses. I used a dataset of 2919 houses in Ames, Iowa with 79 predictors ([Description of predictors](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=data_description.txt)). 

# 2 Import Libraries and Data

In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

sns.set_theme()

The authors of the dataset already randomly split it into a training set and testing set so I simply imported them.

In [20]:
# Use the Id column as the row index
# The house prices for the test data were not provided by the authors
train_data = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv', index_col="Id")
X_test = pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv", index_col="Id")

# Extract the response from the training data
X_train = train_data.drop(['SalePrice'], axis=1)
y_train = train_data.SalePrice

# Useful for transforming the training and test data together
X = pd.concat([X_train, X_test], axis=0)

print("Training dataset size: ", X_train.shape)
print("Test dataset size: ", X_test.shape)

Training dataset size:  (1460, 79)
Test dataset size:  (1459, 79)


# 3 Data Procesesing

## 3.1 Removing categorical predictors

To simplify the analysis, I only used quantitative predictors.

In [21]:
X = X.select_dtypes(include="number")
# Manually remove categorical columns encoded with integers
X.drop(columns=["MSSubClass", "MoSold"], inplace=True) 
print(f"There are {len(X.columns)} predictors left:\n",
      list(X.columns), sep="")

There are 34 predictors left:
['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'YrSold']


## 3.2 Handling missing data

In [22]:
missing_data_count = X.isnull().sum()
missing_data_percent = X.isnull().sum() / len(X) * 100
missing_data = pd.DataFrame({
    'No. of missing values': missing_data_count,
    'Percentage of values missing': missing_data_percent
})
missing_data = missing_data[missing_data.loc[:,'No. of missing values'] > 0]
missing_data.sort_values(by='Percentage of values missing', ascending=False, inplace=True)
missing_data.head(3)

Unnamed: 0,No. of missing values,Percentage of values missing
LotFrontage,486,16.649538
GarageYrBlt,159,5.447071
MasVnrArea,23,0.787941


- **LotFrontage**: The distance between the house and the front road. I will impute the missing values with the average value in the training dataset.
- **GarageYrBlt**: The year the garage was built. Since it is highly correlated with **YearBuilt** (r = 0.826, see the correlation map below), I will impute the missing values with the **YearBuilt** values. 

For the remaining predictors, the number of missing values are small so I simply deleted them.

In [23]:
X.LotFrontage = (X.LotFrontage.
                 fillna(X.LotFrontage.mean()))
X.GarageYrBlt = (X.GarageYrBlt.
                 fillna(X.YearBuilt))

# Drop all rows with missing values
X_train = X[:len(X_train)].dropna()
y_train = X[:len(X_train)].dropna()
X_test = X[len(X_train):].dropna()
train_data = pd.concat([X_train, y_train])

print("No. of missing values remaining is", 
      sum(X_train.isnull().sum()) + sum(X_test.isnull().sum()))
print("New training set size:", X_train.shape)
print("New test set size:", X_test.shape)

No. of missing values remaining is 0
New training set size: (1452, 34)
New test set size: (1449, 34)


# 4 Exploratory Data Analysis

## 4.1 Distribution of the house prices (the response)

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(y_train)
plt.title("House Price Distribution")
plt.xlabel("House Price ($USD)")
plt.ylabel("No. of houses")

The distribution of house prices is positively skewed due to the presence of a few very expensive houses. 

## 4.2 Correlations between variables

In [None]:
correlation_matrix = train_data.corr()
lower_triangle_mask = np.tril(np.ones(correlation_matrix.shape), k = -1).astype(bool)
# Replace all values above the lower triangle with NaN
correlation_matrix = correlation_matrix.where(lower_triangle_mask)

with sns.axes_style("dark"):
    f, ax = plt.subplots(figsize=(12, 9))
    sns.heatmap(correlation_matrix, cmap="Blues", square=True);

In [None]:
correlation_matrix.columns

In [None]:
most_positive = correlation_matrix.loc["SalePrice"].sort_values(ascending=False)[0:5]
print(f"The {len(most_positive)} predictors that were most positively correlated with the house price were:\n", 
      most_positive, "\n", sep="")

most_negative = correlation_matrix.loc["SalePrice"].sort_values(ascending=True)[0:1]
print(f"The predictor that was most negatively correlated with the house price was:\n", 
       most_negative, sep="")

- **OverallQual**: Quality of the construction materials and workmanship on a scale of 1 to 10. 
- **GrLivArea**: The total living area above the ground. 
- **GarageCars**: No. of cars that can fit in the garage. 
- **GarageArea**: Area of the garage. (Notice that is is highly correlated with GarageCars). 
- **TotalBsmtSF**: The area of the basement.

There were no quantitative predictors that were significantly negatively correlated with the house price.