<img width="800" alt="Australian housing" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d3/Elger_Street_housing_001.jpg/640px-Elger_Street_housing_001.jpg">

# Home in the Outback - Predicting Housing Prices in Australia
By Andrew Yang

## Business Understanding

According to Statista, housing affordability is a significant issue in Australia that has sparked many policy debates. Under 30% of Australians own their property outright, while approximately 30% of private households rent. As noted in the report, residential housing prices have typically seen strong annual growth, as have rental costs over the past decade [1]. As a result, Australian cities such as Sydney and Melbourne are rated among the least affordable in the world based on the ratio of housing prices to annual income [2].

<img width="512" alt="Least affordable housing in the world - Statista" src="https://cdn.statcdn.com/Infographic/images/normal/16902.jpeg">


This issue is compounded by the frequent practice of underquoting in the Australian real estate industry. Various organizations operating in the space have noted that "underquoting remains 'endemic' due to poorly regulated and underpoliced regulations". While the practice helps listers attract more interested buyers, it also obscures the true price, causing much grief among hopeful buyers. As David Morrell, a buyer's advocate in Australia, puts it: "It’s misleading, it’s a fraud on purchasers. And until significant penalties are handed out, regulations have been ineffective … if one does it, others have to follow" [3].

A housing price model can mitigate the issue until stronger laws punishing underquoting are passed. By estimating housing prices with a model, buyers can more accurately determine which properties they can afford in the Australian real estate market. This saves them time otherwise spent on properties whose true price is unknown, while helping them spot potential homes at a discount.

## Data Understanding

The Kaggle [dataset](https://www.kaggle.com/datasets/thedevastator/australian-housing-data-1000-properties-sampled) used is a sample of 1000 Australian property listings. It contains information about the housing itself (such as size and amenities count) as well as housing type and product variation. This information can be used to specifically model prices for Australian housing based on common property aspects.

After data cleaning, outlier removal, and imputation, the cleaned data set includes 843 observations and 8 columns. Of these, seven features were chosen for modeling housing prices. Building size, land size, and property type were chosen because housing size is a key component in determining property value. Bedroom, bathroom, and parking count can also affect housing value. Lastly, product depth may have an effect on how a property is marketed: housing considered "premiere" could have higher prices than housing considered "standard".

![prop type vs price](visualizations/propType_price.png)

**Figure 1: Median price of property types.**

![bedrooms vs price](visualizations/bedroom_price.png)

**Figure 2: Relationship between bedroom count and housing prices.**

![bathroom vs price](visualizations/bathroom_price.png)

**Figure 3: Relationship between bathroom count and housing prices.**

![parking vs price](visualizations/parking_price.png)

**Figure 4: Relationship between parking spaces and housing prices.**

![building size vs price](visualizations/buildsize_price.png)

**Figure 5: Relationship between building size and housing prices.**

![land size vs price](visualizations/landsize_price.png)

**Figure 6: Relationship between land size and housing prices.**

The exploratory data analysis (EDA) reveals some relationships between price and various predictors. Figure 1 shows that townhouses have the highest median listing price, with houses as the second most expensive property type. Figure 2 shows a positive correlation between the number of bedrooms and housing price, and Figures 3 and 4 show similar correlations for bathroom count and parking space count. Finally, both building size and land size are positively correlated with price (Figures 5 and 6).

|   | Median | IQR |
| --- | --- | ---|
|Building Size (meters squared)|147.0|103.0|
|Land Size (meters squared)|294.5|650.0|
|Price (AUD)| 485000.0|177500.0|
|Bedrooms|3.0|1.0|
|Bathrooms|2.0|1.0|
|Parking Count|2.0|1.0|

**Table 1: Descriptive statistics of numeric columns.**

|Property Type|Count|
|---|---|
|House|375|
|Unit|221|
|Apartment|192|
|Townhouse|36|
|Duplex/Semi-detached|19|

**Table 2: Frequency of property types.**

|Product Depth|Count|
|---|---|
|Premiere|563|
|Feature|154|
|Standard|126|

**Table 3: Frequency of product depths.**

In terms of descriptive statistics, Table 1 shows the median and interquartile range (IQR) of the numeric features and the target variable (building size, land size, price, etc.); each of these quantitative variables has a right-skewed distribution. Tables 2 and 3 show the frequency of each category.

One key limitation of the data is that the target variable (price) is based on property listings, which are subject to underquoting. It would be more accurate to model based on sold housing, but that data is not publicly available for Australian housing. Because of that, the model will most likely predict prices that are consistently lower than the true value of the property. Additionally, the dataset is relatively small, and as such may be prone to overfitting. Managing this issue will require model regularization.

## Data Preparation

In [1]:
# import the following packages for the EDA and data cleaning sections
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sb
from numpy import median

### Raw Data and Feature Removal
Preparing the data requires us to first import the data and remove all columns that are not our predictors or target.

In [2]:
data = pd.read_csv("data/RealEstateAU_1000_Samples.csv")
data.head()

Unnamed: 0,index,TID,breadcrumb,category_name,property_type,building_size,land_size,preferred_size,open_date,listing_agency,...,state,zip_code,phone,latitude,longitude,product_depth,bedroom_count,bathroom_count,parking_count,RunDate
0,0,1350988,Buy>NT>DARWIN CITY,Real Estate & Property for sale in DARWIN CITY...,House,,,,Added 2 hours ago,Professionals - DARWIN CITY,...,NT,800,08 8941 8289,,,premiere,2.0,1.0,1.0,2022-05-27 15:54:05
1,1,1350989,Buy>NT>DARWIN CITY,Real Estate & Property for sale in DARWIN CITY...,Apartment,171m²,,171m²,Added 7 hours ago,Nick Mousellis Real Estate - Eview Group Member,...,NT,800,0411724000,,,premiere,3.0,2.0,2.0,2022-05-27 15:54:05
2,2,1350990,Buy>NT>DARWIN CITY,Real Estate & Property for sale in DARWIN CITY...,Unit,,,,Added 22 hours ago,Habitat Real Estate - THE GARDENS,...,NT,800,08 8981 0080,,,premiere,2.0,1.0,1.0,2022-05-27 15:54:05
3,3,1350991,Buy>NT>DARWIN CITY,Real Estate & Property for sale in DARWIN CITY...,House,,,,Added yesterday,Ray White - NIGHTCLIFF,...,NT,800,08 8982 2403,,,premiere,1.0,1.0,0.0,2022-05-27 15:54:05
4,4,1350992,Buy>NT>DARWIN CITY,Real Estate & Property for sale in DARWIN CITY...,Unit,201m²,,201m²,Added yesterday,Carol Need Real Estate - Fannie Bay,...,NT,800,0418885966,,,premiere,3.0,2.0,2.0,2022-05-27 15:54:05


In [3]:
# remove columns that are not predictors or target variable
removeCol = ["index", "TID", "breadcrumb", "category_name", "open_date", 
             "preferred_size", "listing_agency",
             "location_number", "location_type", "location_name", "address",
            "address_1", "city", "state", "zip_code", "phone", "latitude",
            "longitude", "RunDate"]
data_c = data.drop(removeCol, axis = 1)
data_c.head()

Unnamed: 0,property_type,building_size,land_size,price,product_depth,bedroom_count,bathroom_count,parking_count
0,House,,,"$435,000",premiere,2.0,1.0,1.0
1,Apartment,171m²,,"Offers Over $320,000",premiere,3.0,2.0,2.0
2,Unit,,,"$310,000",premiere,2.0,1.0,1.0
3,House,,,"$259,000",premiere,1.0,1.0,0.0
4,Unit,201m²,,"$439,000",premiere,3.0,2.0,2.0


### Property Type and Product Depth Trimming
Property types with fewer than 10 observations in the dataset, such as acreages and studios, are removed, along with residential land and the mid-tier product depth. This removes many observations with missing values that are difficult to impute on a property type basis.

In [4]:
data_c["property_type"].value_counts()

House                   441
Unit                    230
Apartment               212
Townhouse                38
Residential Land         33
Duplex/Semi-detached     19
Acreage                   9
Block Of Units            6
Other                     4
Villa                     4
Studio                    2
Lifestyle                 1
Warehouse                 1
Name: property_type, dtype: int64

In [5]:
data_c["product_depth"].value_counts()

premiere    659
feature     172
standard    162
midtier       7
Name: product_depth, dtype: int64

In [6]:
removePropType = ["Residential Land", "Acreage", "Block Of Units", "Other", 
                  "Villa", "Studio", "Lifestyle", "Warehouse"]
data_c2 = data_c[~data_c["property_type"].isin(removePropType)]
data_c2 = data_c2[data_c2["product_depth"] != "midtier"]

data_c2.head()

Unnamed: 0,property_type,building_size,land_size,price,product_depth,bedroom_count,bathroom_count,parking_count
0,House,,,"$435,000",premiere,2.0,1.0,1.0
1,Apartment,171m²,,"Offers Over $320,000",premiere,3.0,2.0,2.0
2,Unit,,,"$310,000",premiere,2.0,1.0,1.0
3,House,,,"$259,000",premiere,1.0,1.0,0.0
4,Unit,201m²,,"$439,000",premiere,3.0,2.0,2.0


### Building and Land Size Cleaning

Building and land size are converted from strings in the original data into a numeric column. Because some properties are in hectares, special care needs to be taken in order to properly convert areas into meters squared.

In [7]:
# some sizes are in hectares, others in meters squared; convert everything to meters squared
def correctSize(string):
    # if the value is missing, keep it as NaN; we will impute it later
    if pd.isna(string):
        return np.nan
    # if the value is in hectares, remove commas and alphabetic letters
    # and convert to meters squared (1 ha = 10,000 m²)
    if "ha" in string:
        return float(re.sub("[a-zA-Z,]", "", string)) * 10000
    # otherwise strip the unit, commas, and alphabetic letters
    string2 = string.replace("m²", "")
    return float(re.sub("[a-zA-Z,]", "", string2))

In [8]:
# convert all area variables (building size, land size) to meters squared
sizePredictors = ["building_size","land_size"]
data_c3 = data_c2.copy()
data_c3["building_size"] = data_c3["building_size"].apply(correctSize)
data_c3["land_size"] = data_c3["land_size"].apply(correctSize)
data_c3.head()

Unnamed: 0,property_type,building_size,land_size,price,product_depth,bedroom_count,bathroom_count,parking_count
0,House,,,"$435,000",premiere,2.0,1.0,1.0
1,Apartment,171.0,,"Offers Over $320,000",premiere,3.0,2.0,2.0
2,Unit,,,"$310,000",premiere,2.0,1.0,1.0
3,House,,,"$259,000",premiere,1.0,1.0,0.0
4,Unit,201.0,,"$439,000",premiere,3.0,2.0,2.0
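
As a quick check of the unit handling described above, the following standalone sketch parses a few area strings. The helper name `parse_area` and the sample values other than `171m²` are illustrative assumptions, not taken from the dataset.

```python
import math
import re

def parse_area(text):
    """Convert an area string to square meters; return NaN for missing values."""
    if not isinstance(text, str):
        return math.nan
    # keep only digits and the decimal point
    number = float(re.sub(r"[^0-9.]", "", text))
    # 1 hectare = 10,000 square meters
    return number * 10_000 if "ha" in text.lower() else number

# hypothetical inputs covering square meters, thousands separators, and hectares
for raw in ["171m²", "1,200m²", "2.5ha", None]:
    print(raw, "->", parse_area(raw))
```

A listing given in hectares is scaled by 10,000, so mixing the two units without conversion would understate large blocks of land by several orders of magnitude.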


### Price Cleaning

Prices in the original dataset are stored as strings and are extracted in a two-part process. First, each value is checked to confirm that it is a real price rather than a non-numeric label. For example, if the price reads "Under Contract", the observation is set to NaN. Then, if a price does exist, the string is stripped of alphabetic and other characters before it is converted to a number.

In [9]:
# if the value is a string containing a $, it is a price.
# missing values (NaN) are not strings, so they are treated as non-prices
def isPrice(string):
    return isinstance(string, str) and "$" in string

In [10]:
def removeAlphabetic(string):
    if(pd.isna(string)): #leave nulls as nulls
        return np.nan
    
    # remove alphabetic, comma, and $
    string2 = re.sub("[a-zA-Z,$]", "", string)
    try:
        # remove whitespace and turn to a number
        return float(string2.replace(" ",""))
    except ValueError:
        # make it a null value
        return np.nan

In [11]:
# set non-prices to null 
data_c3.loc[~data_c3["price"].apply(isPrice), "price"] = np.nan

# clean up strings into numbers
data_c4 = data_c3.copy()
data_c4["price"] = data_c4["price"].apply(removeAlphabetic).copy()
data_c4.head()

Unnamed: 0,property_type,building_size,land_size,price,product_depth,bedroom_count,bathroom_count,parking_count
0,House,,,435000.0,premiere,2.0,1.0,1.0
1,Apartment,171.0,,320000.0,premiere,3.0,2.0,2.0
2,Unit,,,310000.0,premiere,2.0,1.0,1.0
3,House,,,259000.0,premiere,1.0,1.0,0.0
4,Unit,201.0,,439000.0,premiere,3.0,2.0,2.0
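
The two-part price extraction can be sanity-checked on a handful of strings. The sketch below folds both steps into one hypothetical helper, `parse_price`; the first three samples mirror values mentioned in this notebook, while the price range is an assumed edge case.

```python
import math
import re

def parse_price(text):
    """Return the price as a float, or NaN when no single price can be read."""
    # step 1: anything without a dollar sign is not treated as a price
    if not isinstance(text, str) or "$" not in text:
        return math.nan
    # step 2: strip letters, commas, dollar signs, and whitespace
    digits = re.sub(r"[a-zA-Z,$\s]", "", text)
    try:
        return float(digits)
    except ValueError:
        # e.g. a price range leaves "300000-350000", which is not a number
        return math.nan

samples = ["$435,000", "Offers Over $320,000", "Under Contract", "$300,000 - $350,000"]
parsed = [parse_price(s) for s in samples]
print(parsed)
```

Under this approach a listing that advertises a price range ends up as NaN rather than as either endpoint, so it is later handled by imputation.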


### Data Imputation
Because the distributions of our numeric predictors are mostly right-skewed, missing data was imputed using the median value for each property type.

In [12]:
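# group by property type, then fill in missing numeric values with the group median
# (only numeric columns are selected, since the median is undefined for text columns)
num_cols = data_c4.select_dtypes(include = "number").columns
data_c5 = data_c4.groupby("property_type")[num_cols].transform(lambda x: x.fillna(x.median()))
data_c5["property_type"] = data_c4["property_type"]
data_c5["product_depth"] = data_c4["product_depth"]
data_c5.head()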

Unnamed: 0,building_size,land_size,price,bedroom_count,bathroom_count,parking_count,property_type,product_depth
0,210.0,804.0,435000.0,2.0,1.0,1.0,House,premiere
1,171.0,178.0,320000.0,3.0,2.0,2.0,Apartment,premiere
2,107.0,153.0,310000.0,2.0,1.0,1.0,Unit,premiere
3,210.0,804.0,259000.0,1.0,1.0,0.0,House,premiere
4,201.0,153.0,439000.0,3.0,2.0,2.0,Unit,premiere
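
To see what group-wise median imputation does, here is a minimal sketch on a made-up five-row frame. The `toy` data is an assumption for illustration only; the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# toy data: one missing land size per property type
toy = pd.DataFrame({
    "property_type": ["House", "House", "House", "Unit", "Unit"],
    "land_size": [800.0, np.nan, 600.0, 150.0, np.nan],
})

# each missing value is replaced by the median of its own property type
toy["land_size"] = (
    toy.groupby("property_type")["land_size"]
       .transform(lambda s: s.fillna(s.median()))
)
print(toy)
```

The missing house receives the house median (700) and the missing unit receives the unit median (150), which is why many imputed rows in the cleaned data share identical size values.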


### Outlier Removal in Building Size, Land Size, and Price

Lastly, outlier observations in building size, land size, and price were removed to produce better fitting models.

In [13]:
def getQuantiles(column):
    quantiles = np.quantile(column, [0, 0.25, 0.5, 0.75, 1])
    q1 = quantiles[1]
    q3 = quantiles[3]
    return [q1, q3]

In [14]:
# remove outliers in price
quantiles = getQuantiles(data_c5["price"])
iqr = quantiles[1]-quantiles[0]
data_c6 = data_c5.loc[(data_c5["price"] <= quantiles[1] + 1.5*iqr) & 
                      (data_c5["price"] >= 0) &
                     (data_c5["price"] >= quantiles[0] - 1.5*iqr)]

In [15]:
# remove outliers in building size
quantiles = getQuantiles(data_c6["building_size"])
iqr = quantiles[1]-quantiles[0]
data_c7 = data_c6.loc[(data_c6["building_size"] <= quantiles[1] + 1.5*iqr) & 
                      (data_c6["building_size"] >= 0) &
                     (data_c6["building_size"] >= quantiles[0] - 1.5*iqr)]

In [16]:
# remove outliers in land size
quantiles = getQuantiles(data_c7["land_size"])
iqr = quantiles[1]-quantiles[0]
data_c8 = data_c7.loc[(data_c7["land_size"] <= quantiles[1] + 1.5*iqr) & 
                      (data_c7["land_size"] >= 0) &
                     (data_c7["land_size"] >= quantiles[0] - 1.5*iqr)]
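
The three cells above apply the same 1.5 × IQR rule to different columns. One possible way to express that rule once is sketched below; the helper name `drop_iqr_outliers` and the toy prices are assumptions, not part of the original notebook.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value is non-negative and within k*IQR of the quartiles."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    # the lower bound is floored at 0, matching the explicit >= 0 check above
    keep = df[column].between(max(q1 - k * iqr, 0), q3 + k * iqr)
    return df.loc[keep]

# toy example: the 5,000 value lies far above Q3 + 1.5*IQR and is dropped
toy = pd.DataFrame({"price": [400, 420, 450, 480, 500, 5000]})
cleaned = drop_iqr_outliers(toy, "price")
print(cleaned)
```

Applied in sequence to `price`, `building_size`, and `land_size`, a helper like this would recompute the quartiles after each step, just as the cells above do.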

In [17]:
# produce clean data
data_clean = data_c8.copy()
data_clean.sample(10, random_state = 123)

Unnamed: 0,building_size,land_size,price,bedroom_count,bathroom_count,parking_count,property_type,product_depth
784,198.0,178.0,450000.0,3.0,2.0,2.0,Apartment,premiere
312,210.0,818.0,420000.0,3.0,1.0,4.0,House,premiere
380,210.0,804.0,490000.0,3.0,2.0,2.0,House,premiere
327,135.0,178.0,380000.0,3.0,2.0,2.0,Apartment,standard
317,210.0,813.0,435000.0,3.0,1.0,2.0,House,premiere
895,210.0,804.0,500000.0,4.0,2.0,2.0,House,feature
229,210.0,343.0,555000.0,4.0,2.0,2.0,House,premiere
160,172.0,178.0,559000.0,2.0,2.0,1.0,Apartment,standard
919,149.0,804.0,515000.0,4.0,2.0,2.0,House,premiere
836,155.0,487.0,555000.0,3.0,2.0,2.0,House,premiere


In [18]:
#export data
filepath = "data/"
data_clean.to_csv(filepath+"data_clean.csv", index = False)

## Modeling

Root Mean Squared Error (RMSE) was chosen as the performance metric. Because errors are squared before averaging, RMSE penalizes a prediction disproportionately more the further it is from the true housing price. It is also easy to interpret, as the error is expressed in Australian dollars (AUD).
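
To illustrate how RMSE weights large errors, the following sketch compares RMSE and MAE on two made-up sets of predictions with the same total absolute error. The prices and the helper names `rmse` and `mae` are illustrative assumptions, not values from the dataset.

```python
import math

def rmse(y_true, y_pred):
    # square each error, average, then take the square root
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # average of the absolute errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [500_000, 500_000, 500_000, 500_000]
even = [550_000, 450_000, 550_000, 450_000]    # four errors of 50,000 AUD
uneven = [500_000, 500_000, 500_000, 700_000]  # a single error of 200,000 AUD

print("even:   MAE", mae(y_true, even), "RMSE", rmse(y_true, even))
print("uneven: MAE", mae(y_true, uneven), "RMSE", rmse(y_true, uneven))
```

Both sets have an MAE of 50,000 AUD, but the set with one large miss has an RMSE of 100,000 AUD, twice that of the evenly spread errors.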

In [19]:
# import machine learning packages
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error

from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

### Data Import and Split

In [20]:
# import our cleaned data
data = pd.read_csv("data/data_clean.csv", index_col = False)
# define our features that are categories
# these features will be encoded later
categories = ["property_type","product_depth"]
data[categories] = data[categories].apply(lambda x: x.astype("category"))

In [21]:
# set our predictors and target variables
X = data.drop("price", axis = 1)
y = data["price"]

In [22]:
# split our data into training and testing data (75% training, 25% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 123)

### Data Pipeline and Functions

In [23]:
# set up scaling step for numeric columns
nums = X.drop(categories, axis = 1).columns
num_trans = Pipeline(
    steps = [("scaler", StandardScaler())]
)

# set up one hot encoder for categorical variables
cat_trans = OneHotEncoder(handle_unknown = "ignore")

# combine scaling and encoding into a single column transformer
processor = ColumnTransformer(
    transformers = [
        ("num", num_trans, nums),
        ("cat", cat_trans, categories)
    ]
)

In [24]:
# build a pipeline using the preprocessor ColumnTransformer
# and the model specified
def pipeMaker(model):
    # two steps:
    # - a preprocessing step to encode and scale the data
    # - a model step to fit the model on the transformed data
    pipe = Pipeline(
        steps = [("preprocessor", processor), ("model", model)]
    ) 
    return pipe

In [25]:
# prints out model metrics
def printMetrics(string, y_train_pred, y_train, y_test_pred, y_test):
    # model label
    print(string + " Performance")
    
    # training regression metrics
    print("\nTraining")
    print("RMSE: "+ str(np.sqrt(mean_squared_error(y_train, y_train_pred))))
    print("MAE: "+ str(mean_absolute_error(y_train, y_train_pred)))
    
    # testing regression metrics
    print("\nTest")
    print("RMSE: "+ str(np.sqrt(mean_squared_error(y_test, y_test_pred))))
    print("MAE: "+ str(mean_absolute_error(y_test, y_test_pred)))

In [26]:
# get model performance, comparing training and testing metrics
def getModelPerf(string, pipe, X_train, y_train, X_test, y_test):
    # get predicted train/test values
    y_train_pred = pipe.predict(X_train)
    y_test_pred = pipe.predict(X_test)
    
    # prints out model performance metrics
    printMetrics(string, y_train_pred, y_train, y_test_pred, y_test)   

### Baseline Model
We use the overall median housing prices as our baseline, as prices possessed a right-skewed distribution. 

This leaves us with a test RMSE of 124,670.31 AUD. In other words, this model's predictions are typically about 124,670 AUD away from the true price of the property.

In [27]:
# make the predicted values the overall median housing price
pred_train = [np.median(data["price"])] * len(y_train)
pred_test = [np.median(data["price"])] * len(y_test)

In [28]:
# baseline model performance
printMetrics("Baseline (median) Model", pred_train, y_train, pred_test, y_test)   

Baseline (median) Model Performance

Training
RMSE: 140709.85860185194
MAE: 112789.31962025317

Test
RMSE: 124670.31404384715
MAE: 98649.28909952607


### Decision Tree
Decision trees are quick to create, which allows swift training to build up-to-date models. This type of model should also provide improved performance compared to the baseline. To optimize performance, GridSearchCV was used to find the set of hyperparameters that led to the best model performance.

This resulted in a model with an average price test error of 94,997.47 AUD.

In [29]:
# decision tree pipeline
regress_dt = DecisionTreeRegressor(random_state = 123, ccp_alpha = 0.1)
dt = pipeMaker(regress_dt)

# determine best parameters using a parameter grid
param_grid = {
    'model__max_depth': [1, 3, 5, 10],
    'model__min_samples_split': [5, 10, 20],
    'model__min_samples_leaf':[5, 10, 15],
    'model__max_features':[None, "sqrt", "log2"]  # None uses all features (formerly "auto")
}

# use GridSearchCV to determine which model params lead to the best model
# based on mean squared error (the square of RMSE)
# using 10-fold cross-validation
dt_test = GridSearchCV(dt, param_grid, cv=10, scoring = "neg_mean_squared_error")
dt_test.fit(X_train, y_train)
dt_test.best_params_

{'model__max_depth': 10,
 'model__max_features': 'sqrt',
 'model__min_samples_leaf': 5,
 'model__min_samples_split': 5}
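
Because the search is scored with `neg_mean_squared_error`, its `best_score_` is a negated MSE rather than an RMSE. The sketch below shows one way to convert it, using synthetic data and a deliberately tiny grid; the data, variable names, and grid are assumptions for illustration and do not reproduce the search above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# synthetic data: price grows with building size, plus noise (illustrative only)
rng = np.random.default_rng(123)
X_demo = rng.uniform(50, 300, size=(120, 1))
y_demo = 2_000 * X_demo[:, 0] + rng.normal(0, 20_000, size=120)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=123),
    {"max_depth": [1, 3, 5]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds; flip the sign and take the root
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, cv_rmse)
```

scikit-learn also accepts `scoring="neg_root_mean_squared_error"`, which averages the per-fold RMSE directly; the two choices can occasionally rank candidates differently, since the root of a mean is not the mean of the roots.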

In [30]:
# notable improvement over baseline 
# (~125k baseline vs ~95k decision tree test errors)
getModelPerf("Decision Tree Model", dt_test, X_train, y_train, X_test, y_test)

Decision Tree Model Performance

Training
RMSE: 88948.44202486823
MAE: 67639.61169522975

Test
RMSE: 94997.46552529749
MAE: 71849.56353578318


### Gradient Boosting
Gradient boosting models are typically more powerful than a single decision tree, making it worthwhile to search for significant performance increases for the housing price regression despite the longer runtime. By using a gradient boosting model, the housing price error can be reduced, giving more accurate predictions on average for potential homeowners.

The gradient boosting model had an average price test error of 87,969.37 AUD.

In [31]:
# set up gradient boosting model
regress_gb = GradientBoostingRegressor(random_state = 123, ccp_alpha = 0.01)

# gradient boosting pipeline
gb = pipeMaker(regress_gb)

# determine best parameters
param_grid = {
    'model__max_depth': [3, 5, 10],
    'model__min_samples_split': [5, 10, 20],
    'model__min_samples_leaf':[1, 5, 10, 15],
    'model__max_features':[None, "sqrt", "log2"],  # None uses all features (formerly "auto")
}

# use GridSearchCV to determine which model params lead to the best model
# based on mean squared error (the square of RMSE)
# using 10-fold cross-validation
gb_test = GridSearchCV(gb, param_grid, cv=10, scoring = "neg_mean_squared_error")
gb_test.fit(X_train, y_train)
gb_test.best_params_

{'model__max_depth': 3,
 'model__max_features': 'sqrt',
 'model__min_samples_leaf': 1,
 'model__min_samples_split': 20}

In [32]:
# best gradient boosting model according to gridsearchCV in both RMSE and MAE
# improved performance over decision tree, and no overfitting
getModelPerf("Gradient Boosting Model", gb_test, X_train, y_train, X_test, y_test)

Gradient Boosting Model Performance

Training
RMSE: 80903.03003981354
MAE: 62168.32768203028

Test
RMSE: 87969.36918240116
MAE: 68884.80788929528


### Gradient Boosting (Subsampled)
By using subsampling, any overfitting in the model from the small sample size could be reduced, potentially leading to performance improvements over the original gradient boosting model. This allows for a more generalized model for predicting housing prices. 

Setting `subsample = 0.5` led to an average price test error of 85,967.02 AUD.

In [33]:
# make the following adjustment to reduce overfitting:
# - add subsample = 0.5 (regularization to reduce variance)
# the other hyperparameters reuse the best values found by GridSearchCV above
regress_gb2 = GradientBoostingRegressor(random_state = 123, ccp_alpha = 0.01, 
                                     max_depth = 3, min_samples_leaf = 1, 
                                    min_samples_split = 20, max_features = "sqrt",
                                       subsample = 0.5)
gb2 = pipeMaker(regress_gb2)

In [34]:
# the subsampled gradient boosting model improves on the original gradient boosting model:
# lower test errors and closer training/test results -> better generalized
gb2.fit(X_train, y_train)
getModelPerf("Gradient Boosting Model (Subsampled)", gb2, X_train, y_train, X_test, y_test)

Gradient Boosting Model (Subsampled) Performance

Training
RMSE: 81767.37901169198
MAE: 63618.012762477236

Test
RMSE: 85967.02244773197
MAE: 65196.84907461885


## Evaluation

| Model | RMSE (AUD)|
| --- | --- |
| Baseline | 124,670.31 |
| Decision Tree | 94,997.47 |
| Gradient Boosting | 87,969.37 |
| Gradient Boosting (subsampled) | 85,967.02 |

**Table 4: RMSE performance between models.**

In the context of the business issue, the model with the lowest RMSE would be the best. For potential buyers, if a model can predict a housing price closer to the true value of the property, they can make more informed choices about whether to purchase or pass on it. This would allow them to avoid properties whose listing prices are significantly different from the predicted "true price" as well as determine which properties can be bought at a discount. As an example, a model that predicts within 30,000 AUD of the true price would be better than one that predicts within 100,000 AUD, because it is more precise.

Because of this, the gradient boosting model with subsampling is the best model, reducing test RMSE by approximately 31% relative to the baseline. Its predictions are typically within about 85,967 AUD of the true price.
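
The improvement figure can be reproduced from Table 4 as (baseline RMSE − model RMSE) / baseline RMSE. The sketch below applies that formula to every model; the dictionary and variable names are illustrative, and the RMSE values are copied from the table.

```python
# test RMSE values (AUD) copied from Table 4
rmse_by_model = {
    "Baseline": 124_670.31,
    "Decision Tree": 94_997.47,
    "Gradient Boosting": 87_969.37,
    "Gradient Boosting (subsampled)": 85_967.02,
}

baseline = rmse_by_model["Baseline"]

# percentage reduction in RMSE relative to the baseline
improvement = {
    name: (baseline - value) / baseline * 100
    for name, value in rmse_by_model.items()
    if name != "Baseline"
}

for name, pct in improvement.items():
    print(f"{name}: {pct:.1f}% lower RMSE than baseline")
```

This gives reductions of roughly 23.8% for the decision tree, 29.4% for gradient boosting, and 31.0% for the subsampled gradient boosting model.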

## Conclusion

The final model is intended to be used by people who hope to become future property owners within the Australian housing market. For many, house hunting can be both difficult and time consuming. The addition of rampant underquoting as well as rising housing prices year after year makes it crucial for buyers to find a good deal that meets their requirements. If used as intended, the model will allow potential buyers the ability to more quickly determine which homes are truly within their price range, avoiding the trap of underquoted listings, as well as discover homes at a discount, which saves them both money and time.

Still, there are various improvements possible for the model. The relatively small sample used to develop the gradient boosting model with subsampling makes it possible that the model will perform poorly when faced with out-of-sample cases. Training the model on a larger dataset would both reduce potential issues with overfitting and expand the number of cases that it can handle. Additionally, as noted in a prior section, the dataset uses listing prices, which can be affected by the prevalent issue of underquoting in the Australian real estate industry. Using data on sold housing would bring the models closer to their intended purpose.

By making these adjustments, future homeowners will be able to more easily find a property they can call home until Australia can resolve its housing affordability issue.

## Works Cited

[1] Granwal, L. *Residential housing market in Australia - statistics & facts*. Statista, 12 Dec 2022, [https://www.statista.com/topics/4987/residential-housing-market-in-australia/#topicHeader__wrapper](https://www.statista.com/topics/4987/residential-housing-market-in-australia/#topicHeader__wrapper). Accessed 2 Jan 2023.

[2] Buchholz, Katharina. *Where It’s Hardest to Afford a Home*. Statista, 21 Mar 2022, [https://www.statista.com/chart/16902/places-where-its-hardest-to-afford-a-home/](https://www.statista.com/chart/16902/places-where-its-hardest-to-afford-a-home/). Accessed 2 Jan 2023.

[3] Cassidy, Caitlin. *Underquoting in Australian real estate industry is leaving buyers feeling betrayed*. The Guardian, 26 Aug 2022, [https://www.theguardian.com/business/2022/aug/27/underquoting-in-australian-real-estate-industry-is-leaving-buyers-feeling-betrayed](https://www.theguardian.com/business/2022/aug/27/underquoting-in-australian-real-estate-industry-is-leaving-buyers-feeling-betrayed). Accessed 3 Jan 2023.