# Introduction To Machine Learning

Ellis Huntley <br>
University of Manchester <br>
September 2022

This project uses a decision tree to predict house prices. The data is real estate information from Iowa, taken from an introductory ML course on https://www.kaggle.com.

The purpose of this project is to reinforce basic ML practices and reproduce results using decision trees.



### Initialise

Pandas is an open-source Python package built around the dataframe: a two-dimensional, labelled table, similar to an Excel spreadsheet, that makes handling and analysing large data sets simpler. This makes it particularly useful in ML for preparing the data a model is trained on.

For these reasons, Pandas will be used throughout this project.
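
As a minimal sketch of what a dataframe looks like, the snippet below builds a small one by hand and selects columns from it. The values and the name `toy` are made up for illustration; only the column names mirror those in the Iowa data.

```python
import pandas as pd

# Made-up values for illustration only (not rows from the Iowa data set)
toy = pd.DataFrame({
    "LotArea": [5000, 7500, 10000],
    "YearBuilt": [1950, 1980, 2010],
    "SalePrice": [100000, 150000, 200000],
})

# A single column is a Series; a list of columns gives a smaller dataframe
target = toy["SalePrice"]
subset = toy[["LotArea", "YearBuilt"]]

print(toy.shape)      # (number of rows, number of columns)
print(subset.head())  # first rows of the selected columns
print(target.mean())  # average of the SalePrice column
```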

In [1]:
import pandas as pd
import numpy as np

#read in Iowa data
data_file_path = 'train.csv' #in same folder

data = pd.read_csv(data_file_path)

data.describe() # show summary statistics for the numeric columns

,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


### Choosing the Model Data

The variable we want to model is the sale price. We first find the appropriate column in the data set by printing the column names.

We will also create a new dataframe that holds the features used to predict the price.

In [2]:
print(data.columns)

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [3]:
price = data.SalePrice # select sale price as target variable to be modelled

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd'] # features

X = data[features] # dataframe with relevant features


### Building the Model

This will be done with scikit-learn, a widely used Python library for modelling data that works directly with data stored in dataframes.

Many machine learning models involve some randomness during training. Setting `random_state` to a fixed number ensures the same result is produced on every run, which is considered good practice.
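
The effect of fixing the seed can be sketched as follows. The data here is synthetic, and the names `X_demo`, `y_demo`, `X_new`, `tree_a` and `tree_b` are hypothetical. `max_features=1` is used only so that the tree picks candidate features at random, which makes training depend on the seed.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the Iowa data set)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(0, 1, size=200)
X_new = rng.uniform(0, 10, size=(50, 3))

# max_features=1 makes the tree choose candidate features at random at each
# split, so the fitted tree depends on the random seed
tree_a = DecisionTreeRegressor(max_features=1, random_state=1).fit(X_demo, y_demo)
tree_b = DecisionTreeRegressor(max_features=1, random_state=1).fit(X_demo, y_demo)

# Same seed and same data: both trees give identical predictions
same = np.array_equal(tree_a.predict(X_new), tree_b.predict(X_new))
print(same)
```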

In [4]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=1)

model.fit(X, price)

predictions = model.predict(X) # predict sale prices for the same data the model was trained on

# compare results

print(price.head())
print(predictions[:len(price.head())]) # show a few results

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
[208500. 181500. 223500. 140000. 250000.]


### Model Validation

We want to test how good our model is. This can be done by 'hiding' part of the data from the model (by default, `train_test_split` holds out 25%), training on the remaining data, and then comparing the model's predictions with the hidden values. The hidden data is known as the validation data.
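
To make the split concrete, here is a small sketch on synthetic stand-in data (the `_demo` names are hypothetical). When no `test_size` is passed, `train_test_split` holds out a quarter of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features (illustration only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# With no test_size given, 25% of the rows are held out for validation
train_X_demo, val_X_demo, train_y_demo, val_y_demo = train_test_split(
    X_demo, y_demo, random_state=1
)

print(len(train_X_demo), len(val_X_demo))  # 75 25
```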

To quantify how accurate the model is, we use the Mean Absolute Error (MAE): the average of the absolute differences between the predicted and actual prices.
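
As a rough illustration, the MAE can be computed by hand and checked against scikit-learn's `mean_absolute_error`. The prices below are made up, and `actual`, `predicted`, `manual_mae` and `sklearn_mae` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration only
actual = np.array([200000, 150000, 300000])
predicted = np.array([210000, 140000, 330000])

# MAE: take the absolute error of each prediction, then average them
manual_mae = np.mean(np.abs(actual - predicted))

# scikit-learn's helper should give the same value
sklearn_mae = mean_absolute_error(actual, predicted)

print('${0:.2f}'.format(manual_mae))  # $16666.67
```

Here the individual errors are 10000, 10000 and 30000, so the MAE is their average.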

In [5]:
# split data
from sklearn.model_selection import train_test_split

train_X, val_X, train_price, val_price = train_test_split(X, price, random_state=1)

model = DecisionTreeRegressor(random_state=1)

model.fit(train_X, train_price)

val_predictions = model.predict(val_X)

#calculate error
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(val_price, val_predictions)

print('The house price error is ${0:.2f}'.format(mae))

The house price error is $29652.93


### Underfitting and Overfitting

Overfitting is when the model fits the training data almost exactly but performs poorly when presented with new data. For example, splitting our decision tree on the houses' features too many times means that each final node of the tree (a leaf, which gives the predicted sale price) contains so few houses that the model predicts the training data perfectly but is inaccurate for new data.

Underfitting is the opposite problem: splitting the tree very few times means it can fail to capture important patterns in the data. The resulting model will be inaccurate for both the training data and the validation data.
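
The two extremes can be sketched on synthetic data (not the Iowa data; all names here are hypothetical). An unrestricted tree should show a training error close to zero but a noticeably larger validation error, while a tree with only two leaves should be inaccurate on both.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data for illustration only (not the Iowa data set)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)

results = {}
# None = no limit on leaves (prone to overfitting); 2 = a single split (underfits)
for max_leaves in (None, 2):
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves, random_state=1)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[max_leaves] = (train_mae, val_mae)
    print(max_leaves, round(train_mae, 3), round(val_mae, 3))
```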

We want to find the optimal point between overfitting and underfitting. This can be done with a function that compares validation MAE values across a range of maximum leaf node counts.

In [6]:
def get_mae(max_leaves, train_X, val_X, train_price, val_price):
    """Fit a tree limited to max_leaves leaf nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaves, random_state=1)
    model.fit(train_X, train_price)
    prediction = model.predict(val_X)
    mae = mean_absolute_error(val_price, prediction)
    return mae

# candidate tree sizes: 5, 10, ..., 500
max_leaves_array = np.arange(5, 505, 5)
mae_list = []

for leaves in max_leaves_array:
    new_mae = get_mae(leaves, train_X, val_X, train_price, val_price)
    mae_list.append(new_mae)

# pick the tree size with the lowest validation MAE
best_tree_size = max_leaves_array[np.argmin(mae_list)]

print('Best Max Leaf Nodes: {0} \t\t MAE: ${1:.2f}'.format(best_tree_size, np.min(mae_list)))

#Train model with full data since we now know best tree size

final_model = DecisionTreeRegressor(random_state=1, max_leaf_nodes=best_tree_size)
final_model.fit(X, price)


Best Max Leaf Nodes: 70 		 MAE: $26763.34


DecisionTreeRegressor(max_leaf_nodes=70, random_state=1)

### Random Forests

A better way of using decision trees is the random forest. A random forest trains many trees and makes its prediction by averaging the predictions of the component trees. It is generally more accurate than a single decision tree. Better models than the random forest exist, but they tend to be sensitive to their parameters, and time must be spent tuning them, whereas the random forest often works well with its default parameters.
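
The averaging step can be checked directly. The sketch below uses synthetic data and hypothetical names, and a small `n_estimators=10` only to keep it quick. It predicts with each fitted tree in `estimators_` and compares the mean with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only (not the Iowa data set)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_demo, y_demo)

X_new = rng.uniform(0, 10, size=(5, 3))

# Predict with every component tree, then average across the trees
tree_predictions = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

# The forest's prediction should match the average of its trees
print(np.allclose(manual_average, forest.predict(X_new)))
```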

In [7]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_price)

prediction = forest_model.predict(val_X)

print('${0:.2f}'.format(mean_absolute_error(val_price, prediction)))

$21857.16
