**MODELLING PART 2**

- So far the linear models (LR, Ridge, Lasso, ElasticNet) have performed very poorly compared to KNN and RandomForest. 

- Performing a PowerTransformation on the target variable hasn't helped. It's possible that these models are struggling to capture a non-linear relationship between the predictor variables and the target.

- A possible solution might be to try introducing Polynomial Features.
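As a minimal sketch of what a degree-2 expansion adds, assuming two hypothetical features `a` and `b` with toy values (not taken from the dataset), the transformer appends squares and pairwise products to the original columns:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# toy matrix with two hypothetical continuous features, a and b
X_toy = np.array([[2.0, 3.0],
                  [4.0, 5.0]])

# include_bias=False drops the constant column of ones
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
X_toy_poly = poly_demo.fit_transform(X_toy)

# each row of powers_ gives the exponents of (a, b) for one output column:
# a, b, a^2, a*b, b^2
print(poly_demo.powers_)
print(X_toy_poly)
```

A linear model fitted on these expanded columns can represent curvature in a single feature and pairwise interactions, which a plain linear fit on `a` and `b` cannot.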


In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor

In [2]:
# import transaction dataframe

latest_df = pd.read_csv('../big_files/latest_df.csv')
latest_df = latest_df.drop(columns = ['Unnamed: 0'])

In [3]:
# list of features used as predictor variables

updated_features = ['prop_type', 'fsm_lsoa', 'ea_in_ward', 'avg_airbnb', 'airbnb_tot', 'crime_lsoa', 'deli_count', 'flor_count', 'rest_count', 'income_rank_pos','employment_rank_pos', 'education_rank_pos', 'health_dep_score', 'crime_rank_pos', 'housing_rank_pos', 'living_env_pos', 'zone_50m','zone_200m', 'zone_400m', 'zone_800m', 'zone_1600m', 'zone_2400m', 'dist_traf', 'gang_prob_pos', 'crime_worry_pos', 'safety_fears_pos', 'satisfaction_pos', 'gun_crime', 'knife_crime', 'good_school', 'tube_zone', 'any_tube']
updated_df = latest_df[updated_features]
updated_df.head(1)

|  | prop_type | fsm_lsoa | ea_in_ward | avg_airbnb | airbnb_tot | crime_lsoa | deli_count | flor_count | rest_count | income_rank_pos | ... | dist_traf | gang_prob_pos | crime_worry_pos | safety_fears_pos | satisfaction_pos | gun_crime | knife_crime | good_school | tube_zone | any_tube |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 54.22 | 2.0 | 105.952941 | 278.0 | 287.0 | 0.0 | 0.0 | 0.0 | 5089 | ... | 7797.835405 | 69.360995 | 71.860724 | 82.861349 | 85.408471 | 0.0 | 19.0 | 1 | 2 | 1.0 |


In [4]:
# create a list of continuous features for Polynomials (don't transform categorical variables)
# create a list of categorical features for dummifying
# note: every float or int column counts as continuous here, including integer-coded ones

poly_features = [col for col in updated_features if updated_df[col].dtype == 'float' or updated_df[col].dtype == 'int']
cat_features = [col for col in updated_features if col not in poly_features]
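The dtype test treats every integer column as continuous. In the sample row above, that appears to include integer-coded columns such as `good_school` and `tube_zone`. The following is a minimal sketch on a hypothetical toy frame of an alternative: select numeric columns with `select_dtypes`, then name the integer-coded categoricals explicitly. Whether those columns are better treated as categorical in this project is an assumption, not something the notebook establishes.

```python
import pandas as pd

# hypothetical toy frame mimicking a few of the notebook's columns
toy_df = pd.DataFrame({
    'prop_type': ['F', 'T', 'S'],          # object dtype
    'fsm_lsoa': [54.22, 12.5, 30.1],       # float
    'income_rank_pos': [5089, 120, 990],   # int, genuinely numeric
    'tube_zone': [2, 4, 1],                # int, but arguably a category
})

# assumption: columns we choose to treat as categorical despite a numeric dtype
forced_categorical = ['tube_zone']

numeric_cols = toy_df.select_dtypes(include='number').columns
poly_cols = [c for c in numeric_cols if c not in forced_categorical]
cat_cols = [c for c in toy_df.columns if c not in poly_cols]

print(poly_cols)
print(cat_cols)
```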

In [5]:
# use ColumnTransformer to apply Polynomial transformation only to the continuous features, and dummify categorical features

poly = PolynomialFeatures(degree = 2)
ohe = OneHotEncoder()

t = [('polynomials', poly, poly_features), ('cat', ohe, cat_features)]

col_trans = ColumnTransformer(transformers = t, 
                              remainder='passthrough') 

updated_poly = col_trans.fit_transform(updated_df)

X = updated_poly
y = latest_df['price'].values

#train test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# standardize the data

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [25]:
# create a linear regression model instance

model = LinearRegression(n_jobs = -2)

# get cross validated scores
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-validated training scores:", scores)
print("Mean cross-validated training score:", scores.mean())
# fit and evaluate the data on the whole training set
model.fit(X_train, y_train)
print("Training Score:", model.score(X_train, y_train))
# evaluate the data on the test set
print("Test Score:", model.score(X_test, y_test))

Cross-validated training scores: [0.29567131 0.30243297 0.30991318 0.32800819 0.31812583]
Mean cross-validated training score: 0.310830297612838
Training Score: 0.31555814408287075
Test Score: 0.2523231251628759


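The Polynomial Features have given the LinearRegression model a significant boost, as might be expected.

However, the scores are still far short of the performance of KNN and RandomForest, and the test score (0.25) is noticeably lower than the mean cross-validated score (0.31).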
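One caveat on the setup: the scaler is fitted on the whole training set before `cross_val_score` runs, so each validation fold has already influenced the scaling statistics. The effect is probably small here. Below is a sketch, on synthetic data with hypothetical column names, of wrapping the same steps in a `Pipeline` so that every fold refits the preprocessing on its own training portion.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# synthetic stand-in for the real data (column names are hypothetical)
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'kind': rng.choice(['F', 'T', 'S'], size=n),
})
# target with a squared term and an interaction, plus a little noise
demo_y = demo_df['x1'] ** 2 + demo_df['x1'] * demo_df['x2'] + rng.normal(scale=0.1, size=n)

preprocess = ColumnTransformer([
    ('polynomials', PolynomialFeatures(degree=2, include_bias=False), ['x1', 'x2']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['kind']),
])

pipe = Pipeline([
    ('preprocess', preprocess),
    # with_mean=False keeps the scaler valid if the transformer returns a sparse matrix
    ('scale', StandardScaler(with_mean=False)),
    ('model', LinearRegression()),
])

# preprocessing is refitted inside each of the 5 folds
pipe_scores = cross_val_score(pipe, demo_df, demo_y, cv=5)
print(pipe_scores.mean())
```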



In [31]:
# instantiate GridSearch object, and grid search for best parameters for Ridge model

ridge = Ridge()
params = {'alpha': np.logspace(-4, 4, 10)}
gs_ridge = GridSearchCV(ridge, params, cv = 5, n_jobs = -2, verbose = 3)
gs_ridge.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


GridSearchCV(cv=5, estimator=Ridge(), n_jobs=-2,
             param_grid={'alpha': array([1.00000000e-04, 7.74263683e-04, 5.99484250e-03, 4.64158883e-02,
       3.59381366e-01, 2.78255940e+00, 2.15443469e+01, 1.66810054e+02,
       1.29154967e+03, 1.00000000e+04])},
             verbose=3)

In [32]:
# Ridge scores the same as Linear Regression (as it did without Polynomial Features)

gs_ridge.best_score_

0.31083522545979436
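The best score alone does not show which `alpha` was selected, or whether the choice sits at the edge of the grid. A minimal sketch on synthetic data (the variable names `X_demo`, `y_demo` and `gs_demo` are illustrative) of reading the selected value and the per-candidate scores from a fitted `GridSearchCV`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# synthetic regression problem standing in for the real training data
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

alphas = np.logspace(-4, 4, 10)
gs_demo = GridSearchCV(Ridge(), {'alpha': alphas}, cv=5)
gs_demo.fit(X_demo, y_demo)

# the alpha with the highest mean cross-validated R^2
print(gs_demo.best_params_)

# mean cross-validated R^2 for every alpha in the grid
for alpha, score in zip(alphas, gs_demo.cv_results_['mean_test_score']):
    print(f"alpha={alpha:.4g}  mean R^2={score:.4f}")
```

If the scores are nearly flat across small values of `alpha`, that would be consistent with Ridge matching plain Linear Regression.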

The Polynomial Features have given a boost to the linear models.

We should try them with our most successful model, Random Forest.


In [13]:
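# fit a 500-tree Random Forest on the polynomial features
rf = RandomForestRegressor(n_estimators = 500, verbose = 3, n_jobs = -2)
rf.fit(X_train, y_train)
# 3-fold cross-validation refits a fresh copy of the forest for each fold
cv_score = cross_val_score(rf, X_train, y_train, cv = 3)
# score the forest fitted on the full training set against the test set
# (r2_score must already be imported from sklearn.metrics when this cell runs)
test_predictions = rf.predict(X_test)
test_score = r2_score(y_test, test_predictions)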


[Parallel(n_jobs=-2)]: Using backend ThreadingBackend with 3 concurrent workers.




[Parallel(n_jobs=-2)]: Done  26 tasks      | elapsed: 10.5min



[Parallel(n_jobs=-2)]: Done 122 tasks      | elapsed: 46.0min




[Parallel(n_jobs=-2)]: Done 282 tasks      | elapsed: 119.3min




[Parallel(n_jobs=-2)]: Done 500 out of 500 | elapsed: 210.0min finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  26 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-2)]: Done 122 tasks      | elapsed: 30.9min
[Parallel(n_jobs=-2)]: Done 282 tasks      | elapsed: 72.8min
[Parallel(n_jobs=-2)]: Done 500 out of 500 | elapsed: 163.0min finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  26 tasks      | elapsed:    1.1s
[Parallel(n_jobs=3)]: Done 122 tasks      | elapsed:    5.3s
[Parallel(n_jobs=3)]: Done 282 tasks      | elapsed:   11.4s
[Parallel(n_jobs=3)]: Done 500 out of 500 | elapsed:   21.4s finished
[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  26 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-2)]: Done 122 tasks      | elapsed: 33.5min
[Parallel(n_jobs=-2)]: Done 282 tasks      | elapsed: 73.2min

NameError: name 'r2_score' is not defined

In [23]:
# the original RandomForest model (trained without Polynomial Features) scored over 0.52, so in this case the Polynomial Features haven't helped.

print('mean cross_val_score', cv_score.mean())
print('test score', r2_score(y_test, test_predictions))

mean cross_val_score 0.5054958616314571
test score 0.4705026628042951


It seems unlikely that any of the linear models will get close to the scores achieved by Random Forest. 

However, it might be worth trying Polynomial Features with interaction terms only.
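To make the difference concrete, here is a small sketch with three hypothetical features `a`, `b` and `c` (toy values). With `interaction_only=True` the squared terms are dropped and only products of distinct features remain:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# one toy row with hypothetical features a=2, b=3, c=5
X_small = np.array([[2.0, 3.0, 5.0]])

full = PolynomialFeatures(degree=2, include_bias=False)
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

X_full = full.fit_transform(X_small)    # a, b, c, a^2, ab, ac, b^2, bc, c^2
X_inter = inter.fit_transform(X_small)  # a, b, c, ab, ac, bc

print(X_full)
print(X_inter)
```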

In [26]:
# use ColumnTransformer to apply Polynomial transformation only to the continuous features, and dummify categorical features

poly = PolynomialFeatures(degree = 2, interaction_only = True)
ohe = OneHotEncoder()

poly_features = [col for col in updated_features if updated_df[col].dtype == 'float' or updated_df[col].dtype == 'int']
cat_features = [col for col in updated_features if col not in poly_features]

t = [('polynomials', poly, poly_features), ('cat', ohe, cat_features)]
col_trans = ColumnTransformer(transformers = t, 
                              remainder='passthrough') 

updated_poly = col_trans.fit_transform(updated_df)

X = updated_poly
y = latest_df['price'].values

#train test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# standardize the data

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [27]:
# create a linear regression model instance

model = LinearRegression(n_jobs = -2)

# get cross validated scores
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-validated training scores:", scores)
print("Mean cross-validated training score:", scores.mean())
# fit and evaluate the data on the whole training set
model.fit(X_train, y_train)
print("Training Score:", model.score(X_train, y_train))
# evaluate the data on the test set
print("Test Score:", model.score(X_test, y_test))

Cross-validated training scores: [0.29357902 0.29951905 0.30795153 0.32590788 0.31488793]
Mean cross-validated training score: 0.3083690810404379
Training Score: 0.3129348553220591
Test Score: 0.25009051245077696


Interaction-only terms perform slightly worse for Linear Regression than the full Polynomial Features.

It might be worth trying it for Ridge too.


In [28]:
# instantiate RidgeCV model, which cross-validates over the alpha grid to find the best parameter

ridge_cv = RidgeCV(alphas= np.logspace(-4, 4, 10), cv=5)
ridge_cv.fit(X_train, y_train)

RidgeCV(alphas=array([1.00000000e-04, 7.74263683e-04, 5.99484250e-03, 4.64158883e-02,
       3.59381366e-01, 2.78255940e+00, 2.15443469e+01, 1.66810054e+02,
       1.29154967e+03, 1.00000000e+04]),
        cv=5)

In [29]:
ridge_cv.best_score_

0.30836903631248613

The interaction-only Polynomials haven't helped the linear models as much as the full Polynomial Features.

One other possibility is to revert to the full Polynomial Features and try transforming the target variable.

In [34]:
poly = PolynomialFeatures(degree = 2)
ohe = OneHotEncoder()

t = [('polynomials', poly, poly_features), ('cat', ohe, cat_features)]

col_trans = ColumnTransformer(transformers = t, 
                              remainder='passthrough') 

updated_poly = col_trans.fit_transform(updated_df)

X = updated_poly
y = latest_df['price'].values

#train test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# standardize the data

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

lr = LinearRegression()
pt = PowerTransformer(method = 'box-cox')

# PowerTransformer requires 2-D input, so reshape y into a single column
pt.fit(y_train.reshape(-1, 1))
 
y_train_box = pt.transform(y_train.reshape(-1, 1))
y_test_box = pt.transform(y_test.reshape(-1, 1))

In [35]:
# fit the Linear Regression model on the transformed y_train, and make predictions

lr.fit(X_train, y_train_box)
predictions_trans = lr.predict(X_test)
predictions_trans

array([[-0.68893546],
       [ 1.2600506 ],
       [ 0.87642938],
       ...,
       [-0.21309012],
       [-1.26462668],
       [-0.24658507]])

In [39]:
# transform the predictions back to the correct scale

real_predictions = pt.inverse_transform(predictions_trans)
real_predictions[:10]

array([[ 350703.31007035],
       [1218774.77233074],
       [ 914852.63631143],
       [2324477.6959181 ],
       [ 354492.56547286],
       [ 432271.35774304],
       [ 326303.79766797],
       [ 577993.52737507],
       [ 402653.14718649],
       [ 390172.17550834]])

In [42]:
# compare the predictions to y_test

r2_score(y_test, real_predictions)

0.23424487322879672

The test score is lower than it was without the PowerTransformation.
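The manual fit, transform and inverse-transform steps used for the target can also be wrapped in `TransformedTargetRegressor`, which is already imported at the top of the notebook. The sketch below uses synthetic data (all variable names are illustrative) and is not a rerun of the experiment above. It only shows that predictions come back on the original scale without the explicit `inverse_transform` call.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# synthetic, strictly positive, right-skewed target (Box-Cox needs y > 0)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(300, 3))
y_syn = np.exp(1.0 + X_syn @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=300))

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=1)

# the transformer is fitted on y during fit() and inverted automatically in predict()
ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=PowerTransformer(method='box-cox'),
)
ttr.fit(Xs_train, ys_train)

preds = ttr.predict(Xs_test)  # already back on the original scale
print(r2_score(ys_test, preds))
```

Because the wrapper behaves like any other estimator, it can be passed to `cross_val_score` or `GridSearchCV`, and the target transformation is then refitted inside each fold.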

It's unlikely to help, but we can try the same thing with other linear models.


In [43]:
# try the same thing with Ridge

ridge_cv_box = RidgeCV(alphas= np.logspace(-4, 4, 10), cv=5)
ridge_cv_box.fit(X_train, y_train_box)

RidgeCV(alphas=array([1.00000000e-04, 7.74263683e-04, 5.99484250e-03, 4.64158883e-02,
       3.59381366e-01, 2.78255940e+00, 2.15443469e+01, 1.66810054e+02,
       1.29154967e+03, 1.00000000e+04]),
        cv=5)

In [44]:
ridge_box_preds = ridge_cv_box.predict(X_test)
ridge_box_preds

array([[-0.68872289],
       [ 1.26021462],
       [ 0.87746586],
       ...,
       [-0.21281082],
       [-1.26515642],
       [-0.24736987]])

In [45]:
real_ridge_preds = pt.inverse_transform(ridge_box_preds)
r2_score(y_test, real_ridge_preds)

0.23422346481679635

Experimenting with Polynomial Features has helped the linear models, as expected. Scores for the RandomForest haven't improved.

Combining the Polynomial Features with a transformation on the target variable didn't help.

One other possibility would be to raise the degree of the Polynomial Features (currently 2). However, this would explode the number of features, and given how far the linear models lag behind Random Forest, it is unlikely to produce our best model.
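To put a rough number on that growth: with n input features and degree d, a full polynomial expansion (including the bias column) has C(n + d, d) output columns. The sketch below uses n = 30 as an illustrative assumption, not the notebook's exact count of continuous features, and `n_poly_features` is a hypothetical helper name.

```python
from math import comb

def n_poly_features(n_features: int, degree: int) -> int:
    """Number of columns in a full polynomial expansion, including the bias column."""
    return comb(n_features + degree, degree)

n_cont = 30  # assumption: illustrative count, not the notebook's exact number
for degree in (2, 3, 4):
    print(degree, n_poly_features(n_cont, degree))
# prints:
# 2 496
# 3 5456
# 4 46376
```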

