## Perth House Price Prediction

Given *data about houses in Perth*, let's try to predict the **price** of a given house.

We will compare three linear regression models: ordinary least squares, Ridge (L2-regularized), and Lasso (L1-regularized).

Data source: https://www.kaggle.com/datasets/syuzai/perth-house-prices

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv('all_perth_310121.csv')
data

,ADDRESS,SUBURB,PRICE,BEDROOMS,BATHROOMS,GARAGE,LAND_AREA,FLOOR_AREA,BUILD_YEAR,CBD_DIST,NEAREST_STN,NEAREST_STN_DIST,DATE_SOLD,POSTCODE,LATITUDE,LONGITUDE,NEAREST_SCH,NEAREST_SCH_DIST,NEAREST_SCH_RANK
0,1 Acorn Place,South Lake,565000,4,2,2.0,600,160,2003.0,18300,Cockburn Central Station,1800,09-2018\r,6164,-32.115900,115.842450,LAKELAND SENIOR HIGH SCHOOL,0.828339,
1,1 Addis Way,Wandi,365000,3,2,2.0,351,139,2013.0,26900,Kwinana Station,4900,02-2019\r,6167,-32.193470,115.859554,ATWELL COLLEGE,5.524324,129.0
2,1 Ainsley Court,Camillo,287000,3,1,1.0,719,86,1979.0,22600,Challis Station,1900,06-2015\r,6111,-32.120578,115.993579,KELMSCOTT SENIOR HIGH SCHOOL,1.649178,113.0
3,1 Albert Street,Bellevue,255000,2,1,2.0,651,59,1953.0,17900,Midland Station,3600,07-2018\r,6056,-31.900547,116.038009,SWAN VIEW SENIOR HIGH SCHOOL,1.571401,
4,1 Aman Place,Lockridge,325000,4,1,2.0,466,131,1998.0,11200,Bassendean Station,2000,11-2016\r,6054,-31.885790,115.947780,KIARA COLLEGE,1.514922,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33651,9C Gold Street,South Fremantle,1040000,4,3,2.0,292,245,2013.0,16100,Fremantle Station,1500,03-2016\r,6162,-32.064580,115.751820,CHRISTIAN BROTHERS' COLLEGE,1.430350,49.0
33652,9C Pycombe Way,Westminster,410000,3,2,2.0,228,114,,9600,Stirling Station,4600,02-2017\r,6061,-31.867055,115.841403,JOHN SEPTIMUS ROE ANGLICAN COMMUNITY SCHOOL,1.679644,35.0
33653,9D Pycombe Way,Westminster,427000,3,2,2.0,261,112,,9600,Stirling Station,4600,02-2017\r,6061,-31.866890,115.841418,JOHN SEPTIMUS ROE ANGLICAN COMMUNITY SCHOOL,1.669159,35.0
33654,9D Shalford Way,Girrawheen,295000,3,1,2.0,457,85,1974.0,12600,Warwick Station,4400,10-2016\r,6064,-31.839680,115.842410,GIRRAWHEEN SENIOR HIGH SCHOOL,0.358494,


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33656 entries, 0 to 33655
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   ADDRESS           33656 non-null  object 
 1   SUBURB            33656 non-null  object 
 2   PRICE             33656 non-null  int64  
 3   BEDROOMS          33656 non-null  int64  
 4   BATHROOMS         33656 non-null  int64  
 5   GARAGE            31178 non-null  float64
 6   LAND_AREA         33656 non-null  int64  
 7   FLOOR_AREA        33656 non-null  int64  
 8   BUILD_YEAR        30501 non-null  float64
 9   CBD_DIST          33656 non-null  int64  
 10  NEAREST_STN       33656 non-null  object 
 11  NEAREST_STN_DIST  33656 non-null  int64  
 12  DATE_SOLD         33656 non-null  object 
 13  POSTCODE          33656 non-null  int64  
 14  LATITUDE          33656 non-null  float64
 15  LONGITUDE         33656 non-null  float64
 16  NEAREST_SCH       33656 non-null  object

### Preprocessing

In [4]:
df = data.copy()

In [5]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{'ADDRESS': 33566,
 'SUBURB': 321,
 'NEAREST_STN': 68,
 'DATE_SOLD': 350,
 'NEAREST_SCH': 160}

In [6]:
# Drop high cardinality ADDRESS column
df = df.drop('ADDRESS', axis=1)

In [7]:
df.isna().mean()*100

SUBURB               0.000000
PRICE                0.000000
BEDROOMS             0.000000
BATHROOMS            0.000000
GARAGE               7.362729
LAND_AREA            0.000000
FLOOR_AREA           0.000000
BUILD_YEAR           9.374257
CBD_DIST             0.000000
NEAREST_STN          0.000000
NEAREST_STN_DIST     0.000000
DATE_SOLD            0.000000
POSTCODE             0.000000
LATITUDE             0.000000
LONGITUDE            0.000000
NEAREST_SCH          0.000000
NEAREST_SCH_DIST     0.000000
NEAREST_SCH_RANK    32.541003
dtype: float64

In [8]:
# Drop columns with more than 25% missing values
df = df.drop('NEAREST_SCH_RANK', axis=1)
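
The 25% rule above is applied by hand to a single column. As a minimal sketch, the same rule can be written as a threshold filter over the per-column missing share. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Perth data (illustrative values only).
toy = pd.DataFrame({
    "PRICE": [565000, 365000, 287000, 255000],
    "GARAGE": [2.0, np.nan, 1.0, 2.0],                    # 25% missing
    "NEAREST_SCH_RANK": [np.nan, 129.0, np.nan, 113.0],   # 50% missing
})

threshold = 0.25  # drop columns whose missing share is strictly above 25%

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()
to_drop = missing_share[missing_share > threshold].index.tolist()

toy = toy.drop(columns=to_drop)

print(to_drop)
print(toy.columns.tolist())
```

Because the comparison is strict, a column with exactly 25% missing values is kept.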

In [9]:
df.isna().sum()

SUBURB                 0
PRICE                  0
BEDROOMS               0
BATHROOMS              0
GARAGE              2478
LAND_AREA              0
FLOOR_AREA             0
BUILD_YEAR          3155
CBD_DIST               0
NEAREST_STN            0
NEAREST_STN_DIST       0
DATE_SOLD              0
POSTCODE               0
LATITUDE               0
LONGITUDE              0
NEAREST_SCH            0
NEAREST_SCH_DIST       0
dtype: int64

In [10]:
df['GARAGE'].unique()

array([ 2.,  1.,  3.,  8.,  6.,  4., nan,  5.,  7.,  9., 10., 12., 32.,
       14., 16., 11., 13., 17., 18., 21., 20., 99., 26., 22., 50., 31.])

In [11]:
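# Handle GARAGE: either median-fill it as a continuous feature,
# or one-hot encode it as a categorical one (rows with a missing
# value then get 0 in every GARAGE_* column)
garage = 'categorical'   # 'continuous' or 'categorical'
if garage == 'continuous':
    df['GARAGE'] = df['GARAGE'].fillna(df['GARAGE'].median())
elif garage == 'categorical':
    dummies = pd.get_dummies(df['GARAGE'], prefix='GARAGE')
    df = pd.concat([df, dummies], axis=1)
    df = df.drop('GARAGE', axis=1)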
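
With the categorical option, no value is actually imputed: by default `pd.get_dummies` ignores missing entries, so a row with a missing `GARAGE` ends up with 0 in every dummy column. The sketch below shows this on a three-value toy series (made up for illustration), together with the `dummy_na=True` option, which adds an explicit indicator column for missing values. Whether that extra column helps here is an open question, not something the notebook tests.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value (illustrative, not from the dataset).
garage = pd.Series([2.0, np.nan, 1.0], name="GARAGE")

# Default behaviour: the NaN row is 0 in every dummy column.
default_dummies = pd.get_dummies(garage, prefix="GARAGE")

# dummy_na=True adds one more column that flags the missing rows.
with_nan_flag = pd.get_dummies(garage, prefix="GARAGE", dummy_na=True)

print(default_dummies.astype(int))
print(with_nan_flag.astype(int))
```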

In [12]:
df['BUILD_YEAR'] = df['BUILD_YEAR'].fillna(df['BUILD_YEAR'].median())

In [13]:
# Extract Date features
df['DATE_SOLD'] = pd.to_datetime(df['DATE_SOLD'])
df['DATE_YEAR'] = df['DATE_SOLD'].dt.year
df['DATE_MONTH'] = df['DATE_SOLD'].dt.month
df = df.drop('DATE_SOLD', axis=1)
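
The sample rows show `DATE_SOLD` values such as `09-2018\r`, that is, a month-year string with a trailing carriage return. As a sketch, assuming every value follows that pattern (this is not verified against the full dataset here), the whitespace can be stripped and the format passed explicitly so that parsing does not rely on inference:

```python
import pandas as pd

# Three values copied from the sample rows shown earlier.
raw = pd.Series(["09-2018\r", "02-2019\r", "06-2015\r"], name="DATE_SOLD")

# Strip the trailing carriage return, then parse as month-year.
parsed = pd.to_datetime(raw.str.strip(), format="%m-%Y")

date_year = parsed.dt.year
date_month = parsed.dt.month

print(date_year.tolist())
print(date_month.tolist())
```

With an explicit `format`, a value that does not match raises an error instead of being parsed differently without notice.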

In [14]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{'SUBURB': 321, 'NEAREST_STN': 68, 'NEAREST_SCH': 160}

In [15]:
# One-hot encode nominal features
for column in ['SUBURB', 'NEAREST_STN', 'NEAREST_SCH', 'POSTCODE']:
    dummies = pd.get_dummies(df[column], prefix=column)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
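
The loop above can also be expressed as a single `pd.get_dummies` call using its `columns` argument, which encodes the listed columns, prefixes each dummy with its source column name, and drops the originals. A minimal sketch on a toy frame (values taken from the sample rows, the frame itself is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "SUBURB": ["South Lake", "Wandi", "Camillo"],
    "POSTCODE": [6164, 6167, 6111],
    "BEDROOMS": [4, 3, 3],
})

# Encode only the listed columns; BEDROOMS is left untouched.
# POSTCODE is numeric, but listing it explicitly treats it as a nominal label.
encoded = pd.get_dummies(toy, columns=["SUBURB", "POSTCODE"])

print(encoded.columns.tolist())
```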

In [16]:
# Split df into X and y
y = df['PRICE']
X = df.drop('PRICE', axis=1)

In [17]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=4)

In [18]:
X_train.shape, X_test.shape

((23559, 700), (10097, 700))

In [19]:
# Scale X
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
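
The scaler is fitted on the training set only and then applied to both sets, which keeps test-set statistics out of the preprocessing. One way to make that hard to get wrong is to bundle the scaler and the model into a scikit-learn pipeline. The sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `pipeline`); it is an alternative arrangement, not what the notebook itself does.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the real features).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, shuffle=True, random_state=4
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te before predicting.
pipeline = make_pipeline(StandardScaler(), Ridge())
pipeline.fit(X_tr, y_tr)

r2 = pipeline.score(X_te, y_te)
print("R^2 Score: {:.5f}".format(r2))
```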

In [20]:
X_train

,BEDROOMS,BATHROOMS,LAND_AREA,FLOOR_AREA,BUILD_YEAR,CBD_DIST,NEAREST_STN_DIST,LATITUDE,LONGITUDE,NEAREST_SCH_DIST,...,POSTCODE_6169,POSTCODE_6170,POSTCODE_6171,POSTCODE_6172,POSTCODE_6173,POSTCODE_6174,POSTCODE_6175,POSTCODE_6176,POSTCODE_6556,POSTCODE_6558
3344,-0.871826,0.303237,-0.144360,-0.487472,1.092920,0.265234,1.951977,1.031998,0.867838,-0.565627,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661
24569,1.789056,0.303237,-0.134793,0.768150,0.790640,-0.350658,0.460922,0.841785,-0.226862,-0.453321,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661
11284,1.789056,0.303237,-0.124723,-0.682791,-0.166580,-0.799380,0.438667,0.582273,-0.128410,-0.610313,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661
16754,-0.871826,-1.393667,-0.125730,-0.878110,-1.023039,0.194846,-0.696315,-0.839889,1.020655,-0.520883,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661
32568,0.458615,0.303237,-0.138696,1.451766,0.589120,-1.080931,-0.763079,-0.333219,-0.156394,-0.595402,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23346,0.458615,0.303237,-0.119688,0.851858,0.639500,0.643568,-0.629551,1.276193,-1.217119,0.363753,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661
11863,0.458615,0.303237,-0.134038,0.279853,0.639500,0.256436,-0.451515,1.073218,-1.067756,-0.475575,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661
27063,0.458615,3.697046,-0.127933,0.461220,-2.030638,-1.353683,-0.983621,0.049449,-0.562017,-0.289660,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661
8366,0.458615,0.303237,-0.124723,0.517026,0.790640,0.142056,1.462377,0.945830,0.890513,-0.869749,...,-0.124043,-0.067863,-0.05459,-0.074777,-0.077318,-0.047931,-0.069423,-0.047484,-0.057635,-0.039661


### Training

In [21]:
models = {
                    "Linear Regression": LinearRegression(),
    "Ridge (L2-Regularized) Regression": Ridge(),
    "Lasso (L1-Regularized) Regression": Lasso()
}
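
For reference, the three models differ only in the penalty added to the squared error. Following the objectives given in the scikit-learn documentation, with the intercept omitted for brevity, $n$ the number of training samples, and $\alpha$ the regularization strength (left at its default of 1.0 for both `Ridge()` and `Lasso()` here):

$$
\begin{aligned}
\text{Linear regression:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 \\
\text{Ridge:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:}\quad & \min_{w}\ \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
$$

The L2 penalty shrinks coefficients toward zero, while the L1 penalty can set some of them exactly to zero.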

In [22]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

Linear Regression trained.
Ridge (L2-Regularized) Regression trained.
Lasso (L1-Regularized) Regression trained.


### Results

In [23]:
for name, model in models.items():
    print(name + " R^2 Score: {:.5f}".format(model.score(X_test, y_test)))

Linear Regression R^2 Score: 0.77596
Ridge (L2-Regularized) Regression R^2 Score: 0.77599
Lasso (L1-Regularized) Regression R^2 Score: 0.77610
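
R^2 is unitless, so it does not say how far off a prediction is in dollars. As a sketch, error metrics in the units of the target can be computed alongside it with `sklearn.metrics`. The data below is synthetic and the names (`X_demo`, `y_demo`, `pred`) are hypothetical; no figures here relate to the Perth results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic "price-like" target (assumption: illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
true_coef = np.array([80_000.0, 40_000.0, -20_000.0, 10_000.0])
y_demo = 500_000 + X_demo @ true_coef + rng.normal(scale=30_000.0, size=300)

X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

r2 = r2_score(y_te, pred)                        # same value as model.score(X_te, y_te)
mae = mean_absolute_error(y_te, pred)            # average absolute error, in target units
rmse = np.sqrt(mean_squared_error(y_te, pred))   # penalizes large errors more than MAE

print("R^2:  {:.5f}".format(r2))
print("MAE:  {:.0f}".format(mae))
print("RMSE: {:.0f}".format(rmse))
```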


In [24]:
# lasso_model = Lasso(alpha=10.0)
# lasso_model.fit(X_train, y_train)

# print("R^2 Score: {:.5f}".format(lasso_model.score(X_test, y_test)))
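
The commented-out cell above tries a single larger `alpha` for Lasso. A slightly more systematic sketch is to sweep several values and record both the test R^2 and the number of non-zero coefficients. The data is synthetic (only the first three of ten features carry signal) and the `alphas` grid is an arbitrary assumption, so the numbers say nothing about the Perth dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only the first 3 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
true_coef = np.zeros(10)
true_coef[:3] = [80_000.0, 40_000.0, -20_000.0]
y_demo = 500_000 + X_demo @ true_coef + rng.normal(scale=30_000.0, size=300)

X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

alphas = [1.0, 10.0, 100.0, 1000.0, 10000.0]  # hypothetical grid
results = {}
for alpha in alphas:
    lasso_model = Lasso(alpha=alpha, max_iter=10_000)
    lasso_model.fit(X_tr, y_tr)
    n_nonzero = int(np.sum(lasso_model.coef_ != 0))
    results[alpha] = (lasso_model.score(X_te, y_te), n_nonzero)

for alpha, (score, n_nonzero) in results.items():
    print("alpha={:>8.1f}  R^2 Score: {:.5f}  non-zero coefficients: {}".format(
        alpha, score, n_nonzero))
```

Because the target is on the scale of hundreds of thousands, a small `alpha` barely changes the fit. Selecting `alpha` on the test set would leak information, so in practice a validation split or `LassoCV` would be the safer route.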