## 🏠 London House Sales Prediction

Given monthly data about *housing in London*, let's try to predict how many **houses are sold** in a given month and area.

We will train a variety of regression models and compare them using RMSE and R².

Data source: https://www.kaggle.com/datasets/justinas/housing-in-london?resource=download

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

import warnings
warnings.filterwarnings(action="ignore")

In [2]:
data = pd.read_csv("housing_in_london_monthly_variables.csv")
data

Unnamed: 0,date,area,average_price,code,houses_sold,no_of_crimes,borough_flag
0,1995-01-01,city of london,91449,E09000001,17.0,,1
1,1995-02-01,city of london,82203,E09000001,7.0,,1
2,1995-03-01,city of london,79121,E09000001,14.0,,1
3,1995-04-01,city of london,77101,E09000001,7.0,,1
4,1995-05-01,city of london,84409,E09000001,10.0,,1
...,...,...,...,...,...,...,...
13544,2019-09-01,england,249942,E92000001,64605.0,,0
13545,2019-10-01,england,249376,E92000001,68677.0,,0
13546,2019-11-01,england,248515,E92000001,67814.0,,0
13547,2019-12-01,england,250410,E92000001,,,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13549 entries, 0 to 13548
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           13549 non-null  object 
 1   area           13549 non-null  object 
 2   average_price  13549 non-null  int64  
 3   code           13549 non-null  object 
 4   houses_sold    13455 non-null  float64
 5   no_of_crimes   7439 non-null   float64
 6   borough_flag   13549 non-null  int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 741.1+ KB


### Preprocessing

In [4]:
df = data.copy()
df

Unnamed: 0,date,area,average_price,code,houses_sold,no_of_crimes,borough_flag
0,1995-01-01,city of london,91449,E09000001,17.0,,1
1,1995-02-01,city of london,82203,E09000001,7.0,,1
2,1995-03-01,city of london,79121,E09000001,14.0,,1
3,1995-04-01,city of london,77101,E09000001,7.0,,1
4,1995-05-01,city of london,84409,E09000001,10.0,,1
...,...,...,...,...,...,...,...
13544,2019-09-01,england,249942,E92000001,64605.0,,0
13545,2019-10-01,england,249376,E92000001,68677.0,,0
13546,2019-11-01,england,248515,E92000001,67814.0,,0
13547,2019-12-01,england,250410,E92000001,,,0


In [5]:
# Drop redundant columns
df = df.drop('code', axis=1)
df

Unnamed: 0,date,area,average_price,houses_sold,no_of_crimes,borough_flag
0,1995-01-01,city of london,91449,17.0,,1
1,1995-02-01,city of london,82203,7.0,,1
2,1995-03-01,city of london,79121,14.0,,1
3,1995-04-01,city of london,77101,7.0,,1
4,1995-05-01,city of london,84409,10.0,,1
...,...,...,...,...,...,...
13544,2019-09-01,england,249942,64605.0,,0
13545,2019-10-01,england,249376,68677.0,,0
13546,2019-11-01,england,248515,67814.0,,0
13547,2019-12-01,england,250410,,,0


In [9]:
df.isna().mean()*100

date              0.000000
area              0.000000
average_price     0.000000
houses_sold       0.693778
no_of_crimes     45.095579
borough_flag      0.000000
dtype: float64

In [10]:
# Drop no_of_crimes: about 45% of its values are missing
df = df.drop('no_of_crimes', axis=1)
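
The drop above is a manual decision based on the percentages printed in the previous cell. As a minimal sketch, the same rule could be written as a threshold filter. The toy frame, its values, and the `MAX_MISSING_PCT` cutoff are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's df (values are made up).
toy = pd.DataFrame({
    "average_price": [91449, 82203, 79121, 77101],
    "houses_sold": [17.0, 7.0, np.nan, 7.0],
    "no_of_crimes": [np.nan, np.nan, np.nan, 120.0],
})

MAX_MISSING_PCT = 40.0  # hypothetical cutoff, not taken from the notebook

# Percentage of missing values per column, as in df.isna().mean() * 100
missing_pct = toy.isna().mean() * 100

# Columns whose missing share exceeds the cutoff
sparse_cols = missing_pct[missing_pct > MAX_MISSING_PCT].index.tolist()
toy_clean = toy.drop(columns=sparse_cols)
print(sparse_cols)
```

With a cutoff like this, `houses_sold` (under 1% missing in the real data) would be kept and handled row by row, which is what the next cells do.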

In [11]:
df

Unnamed: 0,date,area,average_price,houses_sold,borough_flag
0,1995-01-01,city of london,91449,17.0,1
1,1995-02-01,city of london,82203,7.0,1
2,1995-03-01,city of london,79121,14.0,1
3,1995-04-01,city of london,77101,7.0,1
4,1995-05-01,city of london,84409,10.0,1
...,...,...,...,...,...
13544,2019-09-01,england,249942,64605.0,0
13545,2019-10-01,england,249376,68677.0,0
13546,2019-11-01,england,248515,67814.0,0
13547,2019-12-01,england,250410,,0


In [13]:
df.isna().sum()

date              0
area              0
average_price     0
houses_sold      94
borough_flag      0
dtype: int64

In [14]:
# Drop rows with missing target values
missing_target_rows = df[df["houses_sold"].isna()].index
df = df.drop(missing_target_rows, axis=0).reset_index(drop=True)
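
For comparison, here is a small sketch showing that the index-based drop above gives the same result as `DataFrame.dropna` restricted to the target column. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing target value (values are made up).
toy = pd.DataFrame({
    "area": ["city of london", "england", "barnet"],
    "houses_sold": [17.0, np.nan, 14.0],
})

# Approach used in the notebook: look up the index of the missing rows, then drop them
missing_rows = toy[toy["houses_sold"].isna()].index
via_index = toy.drop(missing_rows, axis=0).reset_index(drop=True)

# Equivalent shorthand: drop rows where the target column is NaN
via_dropna = toy.dropna(subset=["houses_sold"]).reset_index(drop=True)

print(via_index.equals(via_dropna))
```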

In [15]:
df

Unnamed: 0,date,area,average_price,houses_sold,borough_flag
0,1995-01-01,city of london,91449,17.0,1
1,1995-02-01,city of london,82203,7.0,1
2,1995-03-01,city of london,79121,14.0,1
3,1995-04-01,city of london,77101,7.0,1
4,1995-05-01,city of london,84409,10.0,1
...,...,...,...,...,...
13450,2019-07-01,england,248562,70681.0,0
13451,2019-08-01,england,249432,75079.0,0
13452,2019-09-01,england,249942,64605.0,0
13453,2019-10-01,england,249376,68677.0,0


In [16]:
df.isna().sum()

date             0
area             0
average_price    0
houses_sold      0
borough_flag     0
dtype: int64

In [17]:
pd.to_datetime(df['date']).apply(lambda x: x.year)

0        1995
1        1995
2        1995
3        1995
4        1995
         ... 
13450    2019
13451    2019
13452    2019
13453    2019
13454    2019
Name: date, Length: 13455, dtype: int64

In [18]:
# Parse the date strings so that year and month can be extracted
df['date'] = pd.to_datetime(df['date'])

In [19]:
df['year'] = df['date'].apply(lambda x: x.year)
df['month'] = df['date'].apply(lambda x: x.month)
df = df.drop('date', axis=1)
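
As a side note, pandas also exposes these fields through the vectorized `.dt` accessor, which avoids calling a Python lambda per row. The sketch below uses two made-up dates. The values match those from `.apply`, although the integer width of the result may differ between pandas versions.

```python
import pandas as pd

# Two sample dates in the same format as the dataset's date column
dates = pd.to_datetime(pd.Series(["1995-01-01", "2019-10-01"]))

# Row-by-row extraction, as in the notebook
year_apply = dates.apply(lambda x: x.year)

# Vectorized extraction through the .dt accessor
year_dt = dates.dt.year
month_dt = dates.dt.month

# Compare values only, since the integer dtype may differ
values_match = (year_apply.to_numpy() == year_dt.to_numpy()).all()
print(values_match)
```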

In [20]:
df

Unnamed: 0,area,average_price,houses_sold,borough_flag,year,month
0,city of london,91449,17.0,1,1995,1
1,city of london,82203,7.0,1,1995,2
2,city of london,79121,14.0,1,1995,3
3,city of london,77101,7.0,1,1995,4
4,city of london,84409,10.0,1,1995,5
...,...,...,...,...,...,...
13450,england,248562,70681.0,0,2019,7
13451,england,249432,75079.0,0,2019,8
13452,england,249942,64605.0,0,2019,9
13453,england,249376,68677.0,0,2019,10


In [21]:
df['area'].unique()

array(['city of london', 'barking and dagenham', 'barnet', 'bexley',
       'brent', 'bromley', 'camden', 'croydon', 'ealing', 'enfield',
       'tower hamlets', 'greenwich', 'hackney', 'south east',
       'hammersmith and fulham', 'haringey', 'harrow', 'havering',
       'hillingdon', 'hounslow', 'islington', 'kensington and chelsea',
       'kingston upon thames', 'lambeth', 'lewisham', 'merton', 'newham',
       'redbridge', 'richmond upon thames', 'southwark', 'sutton',
       'waltham forest', 'wandsworth', 'westminster', 'inner london',
       'outer london', 'north east', 'north west', 'yorks and the humber',
       'east midlands', 'west midlands', 'east of england', 'london',
       'south west', 'england'], dtype=object)

In [22]:
# One hot encode the area column
area_dummies = pd.get_dummies(df['area'], prefix='area')
df = pd.concat([df, area_dummies], axis=1)
df = df.drop('area', axis=1)
df

Unnamed: 0,average_price,houses_sold,borough_flag,year,month,area_barking and dagenham,area_barnet,area_bexley,area_brent,area_bromley,...,area_south east,area_south west,area_southwark,area_sutton,area_tower hamlets,area_waltham forest,area_wandsworth,area_west midlands,area_westminster,area_yorks and the humber
0,91449,17.0,1,1995,1,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,82203,7.0,1,1995,2,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,79121,14.0,1,1995,3,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,77101,7.0,1,1995,4,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,84409,10.0,1,1995,5,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13450,248562,70681.0,0,2019,7,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13451,249432,75079.0,0,2019,8,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13452,249942,64605.0,0,2019,9,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13453,249376,68677.0,0,2019,10,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
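
The `unique()` output above lists 45 area names, so one-hot encoding yields 45 indicator columns. Together with `average_price`, `borough_flag`, `year` and `month`, that makes 49 features, which agrees with the "number of used features: 49" line in the LightGBM log further down. The sketch below shows the same encoding steps on a made-up four-row frame.

```python
import pandas as pd

# Toy frame with three distinct areas (values are made up).
toy = pd.DataFrame({
    "area": ["camden", "barnet", "camden", "england"],
    "houses_sold": [250.0, 310.0, 240.0, 70000.0],
})

# One indicator column per distinct area, named with the "area_" prefix
dummies = pd.get_dummies(toy["area"], prefix="area")

# Replace the original text column with its indicator columns
encoded = pd.concat([toy.drop("area", axis=1), dummies], axis=1)

# Each row has exactly one active indicator
one_hot_per_row = (dummies.sum(axis=1) == 1).all()
print(list(encoded.columns))
```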


In [23]:
# Split df into X and y
y = df['houses_sold']
X = df.drop('houses_sold', axis=1)

In [24]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
X_train

Unnamed: 0,average_price,borough_flag,year,month,area_barking and dagenham,area_barnet,area_bexley,area_brent,area_bromley,area_camden,...,area_south east,area_south west,area_southwark,area_sutton,area_tower hamlets,area_waltham forest,area_wandsworth,area_west midlands,area_westminster,area_yorks and the humber
10752,128885,0,2018,11,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
10236,137190,0,2000,10,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4512,73211,1,1997,2,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
9208,342013,1,2014,10,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
2672,473140,1,2018,5,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
905,64510,1,1995,9,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5192,203931,1,2003,12,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
12172,196204,0,2012,8,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
235,761544,1,2014,8,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [25]:
y_train

10752     3824.0
10236     6593.0
4512       280.0
9208       334.0
2672       220.0
          ...   
905        281.0
5192       478.0
12172     7823.0
235         27.0
13349    37892.0
Name: houses_sold, Length: 9418, dtype: float64
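
The training length of 9418 follows from the split arithmetic: 70% of 13455 rows is 9418.5, which scikit-learn rounds down for the training set, leaving 4037 rows for testing. The sketch below reproduces the sizes with placeholder arrays of the same length. The arrays are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 13455  # rows left after dropping missing targets

# Placeholder data with the same number of rows as the notebook's X and y
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, shuffle=True, random_state=1
)
print(len(X_tr), len(X_te))
```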

In [26]:
# Scale X
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

In [27]:
X_train

Unnamed: 0,average_price,borough_flag,year,month,area_barking and dagenham,area_barnet,area_bexley,area_brent,area_bromley,area_camden,...,area_south east,area_south west,area_southwark,area_sutton,area_tower hamlets,area_waltham forest,area_wandsworth,area_west midlands,area_westminster,area_yorks and the humber
10752,-0.707342,-1.657173,1.542734,1.315615,-0.145406,-0.150649,-0.148796,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,-0.148049,-0.152846,-0.152117,-0.152482,-0.151385
10236,-0.663007,-1.657173,-0.963307,1.024950,-0.145406,-0.150649,-0.148796,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,-0.148049,-0.152846,-0.152117,-0.152482,-0.151385
4512,-1.004551,0.603437,-1.380980,-1.300369,-0.145406,-0.150649,-0.148796,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,-0.148049,-0.152846,-0.152117,-0.152482,-0.151385
9208,0.430416,0.603437,0.985836,1.024950,-0.145406,-0.150649,-0.148796,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,6.754536,-0.152846,-0.152117,-0.152482,-0.151385
2672,1.130422,0.603437,1.542734,-0.428374,-0.145406,-0.150649,-0.148796,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,-0.148049,-0.152846,-0.152117,-0.152482,-0.151385
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
905,-1.051000,0.603437,-1.659429,0.734285,-0.145406,-0.150649,6.720615,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,-0.148049,-0.152846,-0.152117,-0.152482,-0.151385
5192,-0.306718,0.603437,-0.545633,1.606280,-0.145406,-0.150649,-0.148796,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,-0.148049,-0.152846,-0.152117,-0.152482,-0.151385
12172,-0.347968,-1.657173,0.707387,0.443620,-0.145406,-0.150649,-0.148796,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,-0.148049,-0.152846,-0.152117,-0.152482,-0.151385
235,2.670032,0.603437,0.985836,0.443620,-0.145406,-0.150649,-0.148796,-0.155015,-0.152482,-0.148049,...,-0.148796,-0.156801,-0.152482,-0.148423,-0.146165,-0.148049,-0.152846,-0.152117,-0.152482,-0.151385
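
The indicator columns no longer hold True/False after scaling. For a binary column in which a fraction p of the rows is True, standardization maps True to (1 - p) / sqrt(p(1 - p)) and False to -p / sqrt(p(1 - p)), and the product of these two values is always -1. In the table above, 6.754536 × (-0.148049) ≈ -1 for `area_waltham forest`, which is consistent with this. Here is a minimal check on a made-up column with p = 0.1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy binary column: 1 of 10 rows is True, so p = 0.1 (made-up data)
col = np.array([1.0] + [0.0] * 9).reshape(-1, 1)

# StandardScaler subtracts the mean and divides by the population standard deviation
scaled = StandardScaler().fit_transform(col).ravel()

p = col.mean()
sd = np.sqrt(p * (1 - p))
scaled_true = (1 - p) / sd   # expected scaled value for True rows
scaled_false = -p / sd       # expected scaled value for False rows
print(scaled[0], scaled[1])
```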


### Training

In [29]:
models = {
    "                     Linear Regression": LinearRegression(),
    " Linear Regression (L2 Regularization)": Ridge(),
    " Linear Regression (L1 Regularization)": Lasso(),
    "                   K-Nearest Neighbors": KNeighborsRegressor(),
    "                        Neural Network": MLPRegressor(),
    "Support Vector Machine (Linear Kernel)": LinearSVR(),
    "   Support Vector Machine (RBF Kernel)": SVR(),
    "                         Decision Tree": DecisionTreeRegressor(),
    "                         Random Forest": RandomForestRegressor(),
    "                     Gradient Boosting": GradientBoostingRegressor(),
    "                               XGBoost": XGBRegressor(),
    "                              LightGBM": LGBMRegressor(),
    "                              CatBoost": CatBoostRegressor(verbose=0)
}
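
Several of these estimators (for example `MLPRegressor`, `DecisionTreeRegressor` and `RandomForestRegressor`) involve randomness and accept a `random_state` argument. The dictionary above leaves it unset, so scores may differ slightly between runs. The sketch below uses synthetic data and a hypothetical helper `fit_predict` to show that fixing the seed makes a fit repeatable. It is an optional variation, not what the notebook does.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (made up for this sketch)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

def fit_predict(seed):
    """Fit a small forest with a fixed seed and predict the first five rows."""
    model = RandomForestRegressor(n_estimators=10, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict(X_demo[:5])

# Two fits with the same seed give identical predictions
same_seed_matches = np.allclose(fit_predict(1), fit_predict(1))
print(same_seed_matches)
```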

In [30]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                     Linear Regression trained.
 Linear Regression (L2 Regularization) trained.
 Linear Regression (L1 Regularization) trained.
                   K-Nearest Neighbors trained.
                        Neural Network trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                         Decision Tree trained.
                         Random Forest trained.
                     Gradient Boosting trained.
                               XGBoost trained.
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004082 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 432
[LightGBM] [Info] Number of data points in the train set: 9418, number of used features: 49
[LightGBM] [Info] Start training from score 3895.653642
                              LightGBM trained.
                              CatBoost trained.

### Results

In [31]:
# RMSE metric
for name, model in models.items():
    y_pred = model.predict(X_test)
    rmse = np.sqrt(np.mean((y_test - y_pred)**2))
    print(name + " RMSE: {:.4f}".format(rmse))

                     Linear Regression RMSE: 3336.8416
 Linear Regression (L2 Regularization) RMSE: 3336.2725
 Linear Regression (L1 Regularization) RMSE: 3336.4065
                   K-Nearest Neighbors RMSE: 2264.7955
                        Neural Network RMSE: 3322.9521
Support Vector Machine (Linear Kernel) RMSE: 10322.3665
   Support Vector Machine (RBF Kernel) RMSE: 12501.5496
                         Decision Tree RMSE: 1817.1483
                         Random Forest RMSE: 1488.5985
                     Gradient Boosting RMSE: 1741.3415
                               XGBoost RMSE: 1289.3104
                              LightGBM RMSE: 1410.0693
                              CatBoost RMSE: 1295.0794
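
RMSE is expressed in the units of the target, here houses sold per month. Since the target ranges from single digits (City of London) to tens of thousands (England), errors on the large regions likely dominate these figures. The sketch below works through the formula on three made-up values and checks it against scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
y_true_demo = np.array([10.0, 20.0, 30.0])
y_pred_demo = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3, so the squared errors are 4, 4, 9 with mean 17/3
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# The same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
print(rmse_manual, rmse_sklearn)
```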


In [32]:
# R2 Score metric
for name, model in models.items():
    print(name + " R^2: {:.2f}".format(model.score(X_test, y_test)))

                     Linear Regression R^2: 0.92
 Linear Regression (L2 Regularization) R^2: 0.92
 Linear Regression (L1 Regularization) R^2: 0.92
                   K-Nearest Neighbors R^2: 0.96
                        Neural Network R^2: 0.92
Support Vector Machine (Linear Kernel) R^2: 0.27
   Support Vector Machine (RBF Kernel) R^2: -0.07
                         Decision Tree R^2: 0.98
                         Random Forest R^2: 0.98
                     Gradient Boosting R^2: 0.98
                               XGBoost R^2: 0.99
                              LightGBM R^2: 0.99
                              CatBoost R^2: 0.99
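
For regressors, `model.score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. A negative value, as for the RBF-kernel SVM above, means the model does worse on the test set than always predicting the mean of the test targets. The sketch below computes R² by hand on made-up values and compares it with `r2_score`. The helper `r2_manual` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets with one close and one poor set of predictions
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
good_pred = np.array([1.1, 1.9, 3.2, 3.8])
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])

r2_good = r2_manual(y_true_demo, good_pred)
r2_bad = r2_manual(y_true_demo, bad_pred)
print(r2_good, r2_bad)
```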
