## Perth House Prices Prediction

Given *data about houses in Perth*, let's try to predict the **price** of a given house.

We will compare three linear models: ordinary least squares, Ridge (L2-regularized), and Lasso (L1-regularized) regression.

Data source: https://www.kaggle.com/datasets/syuzai/perth-house-prices

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv('archive/all_perth_310121.csv')
data

Unnamed: 0,ADDRESS,SUBURB,PRICE,BEDROOMS,BATHROOMS,GARAGE,LAND_AREA,FLOOR_AREA,BUILD_YEAR,CBD_DIST,NEAREST_STN,NEAREST_STN_DIST,DATE_SOLD,POSTCODE,LATITUDE,LONGITUDE,NEAREST_SCH,NEAREST_SCH_DIST,NEAREST_SCH_RANK
0,1 Acorn Place,South Lake,565000,4,2,2.0,600,160,2003.0,18300,Cockburn Central Station,1800,09-2018\r,6164,-32.115900,115.842450,LAKELAND SENIOR HIGH SCHOOL,0.828339,
1,1 Addis Way,Wandi,365000,3,2,2.0,351,139,2013.0,26900,Kwinana Station,4900,02-2019\r,6167,-32.193470,115.859554,ATWELL COLLEGE,5.524324,129.0
2,1 Ainsley Court,Camillo,287000,3,1,1.0,719,86,1979.0,22600,Challis Station,1900,06-2015\r,6111,-32.120578,115.993579,KELMSCOTT SENIOR HIGH SCHOOL,1.649178,113.0
3,1 Albert Street,Bellevue,255000,2,1,2.0,651,59,1953.0,17900,Midland Station,3600,07-2018\r,6056,-31.900547,116.038009,SWAN VIEW SENIOR HIGH SCHOOL,1.571401,
4,1 Aman Place,Lockridge,325000,4,1,2.0,466,131,1998.0,11200,Bassendean Station,2000,11-2016\r,6054,-31.885790,115.947780,KIARA COLLEGE,1.514922,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33651,9C Gold Street,South Fremantle,1040000,4,3,2.0,292,245,2013.0,16100,Fremantle Station,1500,03-2016\r,6162,-32.064580,115.751820,CHRISTIAN BROTHERS' COLLEGE,1.430350,49.0
33652,9C Pycombe Way,Westminster,410000,3,2,2.0,228,114,,9600,Stirling Station,4600,02-2017\r,6061,-31.867055,115.841403,JOHN SEPTIMUS ROE ANGLICAN COMMUNITY SCHOOL,1.679644,35.0
33653,9D Pycombe Way,Westminster,427000,3,2,2.0,261,112,,9600,Stirling Station,4600,02-2017\r,6061,-31.866890,115.841418,JOHN SEPTIMUS ROE ANGLICAN COMMUNITY SCHOOL,1.669159,35.0
33654,9D Shalford Way,Girrawheen,295000,3,1,2.0,457,85,1974.0,12600,Warwick Station,4400,10-2016\r,6064,-31.839680,115.842410,GIRRAWHEEN SENIOR HIGH SCHOOL,0.358494,


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33656 entries, 0 to 33655
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ADDRESS           33656 non-null  object 
 1   SUBURB            33656 non-null  object 
 2   PRICE             33656 non-null  int64  
 3   BEDROOMS          33656 non-null  int64  
 4   BATHROOMS         33656 non-null  int64  
 5   GARAGE            31178 non-null  float64
 6   LAND_AREA         33656 non-null  int64  
 7   FLOOR_AREA        33656 non-null  int64  
 8   BUILD_YEAR        30501 non-null  float64
 9   CBD_DIST          33656 non-null  int64  
 10  NEAREST_STN       33656 non-null  object 
 11  NEAREST_STN_DIST  33656 non-null  int64  
 12  DATE_SOLD         33656 non-null  object 
 13  POSTCODE          33656 non-null  int64  
 14  LATITUDE          33656 non-null  float64
 15  LONGITUDE         33656 non-null  float64
 16  NEAREST_SCH       33656 non-null  object

### Preprocessing

In [117]:
df = data.copy()

In [118]:
df

Unnamed: 0,ADDRESS,SUBURB,PRICE,BEDROOMS,BATHROOMS,GARAGE,LAND_AREA,FLOOR_AREA,BUILD_YEAR,CBD_DIST,NEAREST_STN,NEAREST_STN_DIST,DATE_SOLD,POSTCODE,LATITUDE,LONGITUDE,NEAREST_SCH,NEAREST_SCH_DIST,NEAREST_SCH_RANK
0,1 Acorn Place,South Lake,565000,4,2,2.0,600,160,2003.0,18300,Cockburn Central Station,1800,09-2018\r,6164,-32.115900,115.842450,LAKELAND SENIOR HIGH SCHOOL,0.828339,
1,1 Addis Way,Wandi,365000,3,2,2.0,351,139,2013.0,26900,Kwinana Station,4900,02-2019\r,6167,-32.193470,115.859554,ATWELL COLLEGE,5.524324,129.0
2,1 Ainsley Court,Camillo,287000,3,1,1.0,719,86,1979.0,22600,Challis Station,1900,06-2015\r,6111,-32.120578,115.993579,KELMSCOTT SENIOR HIGH SCHOOL,1.649178,113.0
3,1 Albert Street,Bellevue,255000,2,1,2.0,651,59,1953.0,17900,Midland Station,3600,07-2018\r,6056,-31.900547,116.038009,SWAN VIEW SENIOR HIGH SCHOOL,1.571401,
4,1 Aman Place,Lockridge,325000,4,1,2.0,466,131,1998.0,11200,Bassendean Station,2000,11-2016\r,6054,-31.885790,115.947780,KIARA COLLEGE,1.514922,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33651,9C Gold Street,South Fremantle,1040000,4,3,2.0,292,245,2013.0,16100,Fremantle Station,1500,03-2016\r,6162,-32.064580,115.751820,CHRISTIAN BROTHERS' COLLEGE,1.430350,49.0
33652,9C Pycombe Way,Westminster,410000,3,2,2.0,228,114,,9600,Stirling Station,4600,02-2017\r,6061,-31.867055,115.841403,JOHN SEPTIMUS ROE ANGLICAN COMMUNITY SCHOOL,1.679644,35.0
33653,9D Pycombe Way,Westminster,427000,3,2,2.0,261,112,,9600,Stirling Station,4600,02-2017\r,6061,-31.866890,115.841418,JOHN SEPTIMUS ROE ANGLICAN COMMUNITY SCHOOL,1.669159,35.0
33654,9D Shalford Way,Girrawheen,295000,3,1,2.0,457,85,1974.0,12600,Warwick Station,4400,10-2016\r,6064,-31.839680,115.842410,GIRRAWHEEN SENIOR HIGH SCHOOL,0.358494,


In [119]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{'ADDRESS': 33566,
 'SUBURB': 321,
 'NEAREST_STN': 68,
 'DATE_SOLD': 350,
 'NEAREST_SCH': 160}

In [120]:
# Drop the high-cardinality ADDRESS column (33,566 unique values in 33,656 rows)
df = df.drop('ADDRESS', axis=1)

In [121]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{'SUBURB': 321, 'NEAREST_STN': 68, 'DATE_SOLD': 350, 'NEAREST_SCH': 160}

In [122]:
df.isna().mean()

SUBURB              0.000000
PRICE               0.000000
BEDROOMS            0.000000
BATHROOMS           0.000000
GARAGE              0.073627
LAND_AREA           0.000000
FLOOR_AREA          0.000000
BUILD_YEAR          0.093743
CBD_DIST            0.000000
NEAREST_STN         0.000000
NEAREST_STN_DIST    0.000000
DATE_SOLD           0.000000
POSTCODE            0.000000
LATITUDE            0.000000
LONGITUDE           0.000000
NEAREST_SCH         0.000000
NEAREST_SCH_DIST    0.000000
NEAREST_SCH_RANK    0.325410
dtype: float64

In [123]:
# Drop columns with more than 25% missing values (only NEAREST_SCH_RANK, at about 32.5%)
df = df.drop('NEAREST_SCH_RANK', axis=1)

In [124]:
df.isna().sum()

SUBURB                 0
PRICE                  0
BEDROOMS               0
BATHROOMS              0
GARAGE              2478
LAND_AREA              0
FLOOR_AREA             0
BUILD_YEAR          3155
CBD_DIST               0
NEAREST_STN            0
NEAREST_STN_DIST       0
DATE_SOLD              0
POSTCODE               0
LATITUDE               0
LONGITUDE              0
NEAREST_SCH            0
NEAREST_SCH_DIST       0
dtype: int64

In [125]:
df['GARAGE'].unique()

array([ 2.,  1.,  3.,  8.,  6.,  4., nan,  5.,  7.,  9., 10., 12., 32.,
       14., 16., 11., 13., 17., 18., 21., 20., 99., 26., 22., 50., 31.])

In [126]:
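def preprocess_inputs(df, garage='continuous'):
    """Handle missing GARAGE values by imputation or by one-hot encoding."""
    df = df.copy()

    if garage == 'continuous':
        # Keep GARAGE numeric and fill gaps with the most common value
        df['GARAGE'] = df['GARAGE'].fillna(df['GARAGE'].mode()[0])
    elif garage == 'categorical':
        # One column per garage count; rows with a missing value get all zeros
        dummies = pd.get_dummies(df['GARAGE'], prefix='GARAGE', dtype=int)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop('GARAGE', axis=1)
    else:
        raise ValueError("garage must be 'continuous' or 'categorical'")
    return df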
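
The two `garage` options behave quite differently on missing values. The following is a minimal sketch on a hypothetical four-row frame (the values are made up, not taken from the Perth data) that shows what each strategy does to a `NaN` entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({'GARAGE': [2.0, 1.0, np.nan, 2.0]})

# 'continuous' strategy: fill the gap with the most frequent value (2.0 here)
filled = toy['GARAGE'].fillna(toy['GARAGE'].mode()[0])

# 'categorical' strategy: one indicator column per observed value.
# get_dummies skips NaN by default, so the missing row is all zeros.
dummies = pd.get_dummies(toy['GARAGE'], prefix='GARAGE', dtype=int)

print(filled.tolist())
print(dummies)
```

In the categorical case the all-zero row acts as an implicit "unknown" category, so no imputation is needed for `GARAGE`.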

In [127]:
df = preprocess_inputs(df, garage='categorical')

In [128]:
df.isna().sum()

SUBURB                 0
PRICE                  0
BEDROOMS               0
BATHROOMS              0
LAND_AREA              0
FLOOR_AREA             0
BUILD_YEAR          3155
CBD_DIST               0
NEAREST_STN            0
NEAREST_STN_DIST       0
DATE_SOLD              0
POSTCODE               0
LATITUDE               0
LONGITUDE              0
NEAREST_SCH            0
NEAREST_SCH_DIST       0
GARAGE_1.0             0
GARAGE_2.0             0
GARAGE_3.0             0
GARAGE_4.0             0
GARAGE_5.0             0
GARAGE_6.0             0
GARAGE_7.0             0
GARAGE_8.0             0
GARAGE_9.0             0
GARAGE_10.0            0
GARAGE_11.0            0
GARAGE_12.0            0
GARAGE_13.0            0
GARAGE_14.0            0
GARAGE_16.0            0
GARAGE_17.0            0
GARAGE_18.0            0
GARAGE_20.0            0
GARAGE_21.0            0
GARAGE_22.0            0
GARAGE_26.0            0
GARAGE_31.0            0
GARAGE_32.0            0
GARAGE_50.0            0
GARAGE_99.0            0
dtype: int64


In [129]:
df['BUILD_YEAR'] = df['BUILD_YEAR'].fillna(df['BUILD_YEAR'].median())

In [130]:
df.isna().sum()

SUBURB              0
PRICE               0
BEDROOMS            0
BATHROOMS           0
LAND_AREA           0
FLOOR_AREA          0
BUILD_YEAR          0
CBD_DIST            0
NEAREST_STN         0
NEAREST_STN_DIST    0
DATE_SOLD           0
POSTCODE            0
LATITUDE            0
LONGITUDE           0
NEAREST_SCH         0
NEAREST_SCH_DIST    0
GARAGE_1.0          0
GARAGE_2.0          0
GARAGE_3.0          0
GARAGE_4.0          0
GARAGE_5.0          0
GARAGE_6.0          0
GARAGE_7.0          0
GARAGE_8.0          0
GARAGE_9.0          0
GARAGE_10.0         0
GARAGE_11.0         0
GARAGE_12.0         0
GARAGE_13.0         0
GARAGE_14.0         0
GARAGE_16.0         0
GARAGE_17.0         0
GARAGE_18.0         0
GARAGE_20.0         0
GARAGE_21.0         0
GARAGE_22.0         0
GARAGE_26.0         0
GARAGE_31.0         0
GARAGE_32.0         0
GARAGE_50.0         0
GARAGE_99.0         0
dtype: int64

In [131]:
# Extract date features (raw values look like '09-2018\r', so strip whitespace first)
df['DATE_SOLD'] = pd.to_datetime(df['DATE_SOLD'].str.strip(), format='%m-%Y')
df['DATE_YEAR'] = df['DATE_SOLD'].dt.year
df['DATE_MONTH'] = df['DATE_SOLD'].dt.month
df = df.drop('DATE_SOLD', axis=1)

In [132]:
df

Unnamed: 0,SUBURB,PRICE,BEDROOMS,BATHROOMS,LAND_AREA,FLOOR_AREA,BUILD_YEAR,CBD_DIST,NEAREST_STN,NEAREST_STN_DIST,...,GARAGE_20.0,GARAGE_21.0,GARAGE_22.0,GARAGE_26.0,GARAGE_31.0,GARAGE_32.0,GARAGE_50.0,GARAGE_99.0,DATE_YEAR,DATE_MONTH
0,South Lake,565000,4,2,600,160,2003.0,18300,Cockburn Central Station,1800,...,0,0,0,0,0,0,0,0,2018,9
1,Wandi,365000,3,2,351,139,2013.0,26900,Kwinana Station,4900,...,0,0,0,0,0,0,0,0,2019,2
2,Camillo,287000,3,1,719,86,1979.0,22600,Challis Station,1900,...,0,0,0,0,0,0,0,0,2015,6
3,Bellevue,255000,2,1,651,59,1953.0,17900,Midland Station,3600,...,0,0,0,0,0,0,0,0,2018,7
4,Lockridge,325000,4,1,466,131,1998.0,11200,Bassendean Station,2000,...,0,0,0,0,0,0,0,0,2016,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33651,South Fremantle,1040000,4,3,292,245,2013.0,16100,Fremantle Station,1500,...,0,0,0,0,0,0,0,0,2016,3
33652,Westminster,410000,3,2,228,114,1995.0,9600,Stirling Station,4600,...,0,0,0,0,0,0,0,0,2017,2
33653,Westminster,427000,3,2,261,112,1995.0,9600,Stirling Station,4600,...,0,0,0,0,0,0,0,0,2017,2
33654,Girrawheen,295000,3,1,457,85,1974.0,12600,Warwick Station,4400,...,0,0,0,0,0,0,0,0,2016,10


In [133]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{'SUBURB': 321, 'NEAREST_STN': 68, 'NEAREST_SCH': 160}

In [134]:
# One-hot encode the nominal features (POSTCODE is stored as a number but is a label, not a quantity)
for column in ['SUBURB', 'NEAREST_STN', 'NEAREST_SCH', 'POSTCODE']:
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)

In [135]:
df

Unnamed: 0,PRICE,BEDROOMS,BATHROOMS,LAND_AREA,FLOOR_AREA,BUILD_YEAR,CBD_DIST,NEAREST_STN_DIST,LATITUDE,LONGITUDE,...,POSTCODE_6169,POSTCODE_6170,POSTCODE_6171,POSTCODE_6172,POSTCODE_6173,POSTCODE_6174,POSTCODE_6175,POSTCODE_6176,POSTCODE_6556,POSTCODE_6558
0,565000,4,2,600,160,2003.0,18300,1800,-32.115900,115.842450,...,0,0,0,0,0,0,0,0,0,0
1,365000,3,2,351,139,2013.0,26900,4900,-32.193470,115.859554,...,0,0,0,0,0,0,0,0,0,0
2,287000,3,1,719,86,1979.0,22600,1900,-32.120578,115.993579,...,0,0,0,0,0,0,0,0,0,0
3,255000,2,1,651,59,1953.0,17900,3600,-31.900547,116.038009,...,0,0,0,0,0,0,0,0,0,0
4,325000,4,1,466,131,1998.0,11200,2000,-31.885790,115.947780,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33651,1040000,4,3,292,245,2013.0,16100,1500,-32.064580,115.751820,...,0,0,0,0,0,0,0,0,0,0
33652,410000,3,2,228,114,1995.0,9600,4600,-31.867055,115.841403,...,0,0,0,0,0,0,0,0,0,0
33653,427000,3,2,261,112,1995.0,9600,4600,-31.866890,115.841418,...,0,0,0,0,0,0,0,0,0,0
33654,295000,3,1,457,85,1974.0,12600,4400,-31.839680,115.842410,...,0,0,0,0,0,0,0,0,0,0


In [136]:
# Split df into X and y
y = df['PRICE']
X = df.drop('PRICE', axis=1)

In [137]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)

In [138]:
X_train

Unnamed: 0,BEDROOMS,BATHROOMS,LAND_AREA,FLOOR_AREA,BUILD_YEAR,CBD_DIST,NEAREST_STN_DIST,LATITUDE,LONGITUDE,NEAREST_SCH_DIST,...,POSTCODE_6169,POSTCODE_6170,POSTCODE_6171,POSTCODE_6172,POSTCODE_6173,POSTCODE_6174,POSTCODE_6175,POSTCODE_6176,POSTCODE_6556,POSTCODE_6558
25967,3,1,682,86,1988.0,21200,523,-31.772800,115.784080,1.153645,...,0,0,0,0,0,0,0,0,0,0
28153,3,2,302,140,2008.0,2700,900,-31.931662,115.844089,0.518217,...,0,0,0,0,0,0,0,0,0,0
33655,3,1,296,95,1995.0,16700,1700,-31.882163,116.014755,1.055564,...,0,0,0,0,0,0,0,0,0,0
211,5,2,576,183,2012.0,16000,1900,-31.829932,115.768217,0.898705,...,0,0,0,0,0,0,0,0,0,0
23604,4,2,771,162,1977.0,15300,3700,-31.845770,115.756950,1.958158,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7813,3,1,507,82,1964.0,8400,2200,-32.026569,115.844941,0.980037,...,0,0,0,0,0,0,0,0,0,0
32511,4,2,684,149,1987.0,40600,2800,-32.301830,115.735060,0.578224,...,1,0,0,0,0,0,0,0,0,0
5192,4,2,520,129,1995.0,27700,1700,-32.168043,116.006048,1.066233,...,0,0,0,0,0,0,0,0,0,0
12172,3,2,364,134,2001.0,10000,1100,-32.040937,115.844959,1.587311,...,0,0,0,0,0,0,0,0,0,0


In [139]:
y_train

25967     486000
28153     855000
33655     295000
211       730000
23604     825000
          ...   
7813     2000000
32511     322500
5192      355000
12172     725000
33003     428500
Name: PRICE, Length: 23559, dtype: int64

In [140]:
# Standardize X using statistics from the training set only (avoids leaking test-set information)
scaler = StandardScaler()

scaler.fit(X_train)

X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
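
To make the fit-on-train, transform-both pattern concrete, here is a small sketch on hypothetical one-column arrays. The numbers are invented for illustration. The point is that the test value is scaled with the training mean and standard deviation, so it does not have to land in a zero-mean, unit-variance range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy arrays for illustration only
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Learn mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(train)

train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

print(demo_scaler.mean_, demo_scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```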

### Training

In [141]:
models = {
    "                Linear Regression": LinearRegression(),
    "Ridge (L2-Regularized) Regression": Ridge(),
    "Lasso (L1-Regularized) Regression": Lasso()
}
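
For reference, the three estimators differ only in the penalty added to the least-squares loss. As documented for scikit-learn (with $n$ the number of training rows, $w$ the coefficient vector, and the intercept left unpenalized), the objectives are:

$$
\begin{aligned}
\text{LinearRegression:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 \\
\text{Ridge:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\ \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
$$

Both `Ridge()` and `Lasso()` default to $\alpha = 1$. One plausible reading, not verified against this dataset, is that with a target measured in hundreds of thousands of dollars the data term dominates the penalty at that setting, so the regularized fits may end up very close to the unregularized one.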

In [142]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                Linear Regression trained.
Ridge (L2-Regularized) Regression trained.
Lasso (L1-Regularized) Regression trained.


### Results

In [143]:
for name, model in models.items():
    print(name + ": R^2 Score: {:.5f}".format(model.score(X_test, y_test)))

                Linear Regression: R^2 Score: 0.77742
Ridge (L2-Regularized) Regression: R^2 Score: 0.77742
Lasso (L1-Regularized) Regression: R^2 Score: 0.77739
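
For regressors, `model.score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on hypothetical prices and predictions (invented numbers, not outputs of the models above) and adds RMSE, which is expressed in the same units as the target and can be easier to interpret.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true prices and predictions for illustration only
y_true = np.array([300000.0, 450000.0, 600000.0, 750000.0])
y_pred = np.array([320000.0, 430000.0, 640000.0, 700000.0])

# Residual and total sums of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print("R^2 (manual):  {:.5f}".format(r2_manual))
print("R^2 (sklearn): {:.5f}".format(r2_score(y_true, y_pred)))
print("RMSE: {:.0f}".format(rmse))
```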


In [144]:
# Try a stronger L2 penalty than the default (alpha=1.0)
ridge_model = Ridge(alpha=10)
ridge_model.fit(X_train, y_train)

print("R^2 Score: {:.5f}".format(ridge_model.score(X_test, y_test)))

R^2 Score: 0.77743
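
Trying a single alternative `alpha` by hand gives limited information. A more systematic option is to compare several values with cross-validation on the training set and only then score the chosen model on the test set. The sketch below does this on synthetic data from `make_regression`. The data, the `alphas` grid, and the `*_demo` names are assumptions for illustration, not part of the notebook above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: not the Perth dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=1)

# Hypothetical grid of penalty strengths
alphas = [0.1, 1.0, 10.0, 100.0]

# Mean 5-fold cross-validated R^2 on the training set for each alpha
cv_scores = {a: cross_val_score(Ridge(alpha=a), X_tr, y_tr, cv=5).mean() for a in alphas}
best_alpha = max(cv_scores, key=cv_scores.get)

# Refit with the selected alpha and evaluate once on the held-out test set
final_model = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
test_r2 = final_model.score(X_te, y_te)

for a, s in cv_scores.items():
    print("alpha={:>6}: CV R^2 = {:.5f}".format(a, s))
print("Selected alpha: {}, test R^2: {:.5f}".format(best_alpha, test_r2))
```

Selecting `alpha` on the training folds keeps the test set untouched until the final evaluation.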
