### AutoML hyperparameter tuning example

Hyperparameter tuning is often the most tedious part of building a machine learning model. Here I will show how to use the `hpsklearn` package to tune a model automatically. The package defines a default hyperparameter search space for each model, searches through that space, and returns the combination that gives the best model.
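
To make the idea of searching a hyperparameter space concrete, here is a minimal sketch in plain Python. It uses simple random search rather than the TPE algorithm used later in this example, and the search space, the `toy_loss` function, and the parameter names are all made up for illustration; they are not part of `hpsklearn`.

```python
import random

# Hypothetical search space: each hyperparameter maps to a sampler.
search_space = {
    "max_depth": lambda rng: rng.randint(1, 10),
    "learning_rate": lambda rng: rng.uniform(0.01, 0.5),
}

def toy_loss(params):
    """Stand-in for a cross-validated loss; lowest at max_depth=6, learning_rate=0.1."""
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2

def random_search(space, loss_fn, max_evals=10, seed=0):
    """Sample max_evals combinations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = {name: sample(rng) for name, sample in space.items()}
        loss = loss_fn(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(search_space, toy_loss, max_evals=10)
print(best_params, best_loss)
```

Raising `max_evals` gives the search more chances to land near the minimum, at the cost of more model fits.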

We will use the same housing dataset as an example.

In [1]:
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Data

In [2]:
houses = pd.read_csv("https://raw.githubusercontent.com/Ziqi-Li/GIS5122/main/data/seattle_data_cleaned.csv")

In [3]:
houses.head()

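| Unnamed: 0.1 | Unnamed: 0 | bathrooms | sqft_living | sqft_lot | grade | condition | waterfront | view | age | UTM_X | UTM_Y | log_10_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9172 | 3.0 | 2660 | 4600 | 8 | 3 | 0 | 0 | 109 | 552217.557035 | 5274945.0 | 6.091315 |
| 1 | 2264 | 2.25 | 2530 | 8736 | 7 | 4 | 0 | 0 | 57 | 565692.484331 | 5272758.0 | 5.790988 |
| 2 | 348 | 2.0 | 1390 | 13464 | 7 | 4 | 0 | 0 | 28 | 562451.661509 | 5245291.0 | 5.31513 |
| 3 | 16463 | 1.0 | 940 | 4264 | 7 | 5 | 0 | 0 | 66 | 546816.935618 | 5264407.0 | 5.619093 |
| 4 | 12598 | 2.25 | 2070 | 7225 | 8 | 3 | 0 | 0 | 36 | 564343.195352 | 5244978.0 | 5.477121 |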


In [4]:
# Draw a reproducible 20% random sample of the rows
sampled = houses.sample(frac=0.2, random_state=1)

# Target: log10 price
y = sampled.log_10_price

# Predictors
X = sampled[['bathrooms', 'sqft_living', 'sqft_lot', 'grade',
             'condition', 'waterfront', 'view', 'age', 'UTM_X', 'UTM_Y']]

We split the sampled data into training (80%) and testing (20%) sets. The model is trained on the training data, and the testing data is used to evaluate its accuracy on unseen data.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
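
As a rough illustration of what a hold-out split does, the sketch below shuffles row indices and sets aside 20% of them for testing. It is a simplified stand-in, not the actual implementation of `train_test_split`, and the helper name `holdout_indices` is hypothetical.

```python
import random

def holdout_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_indices(1000)
print(len(train_idx), len(test_idx))  # 800 200
```

Every row ends up in exactly one of the two sets, so the test rows are never seen during training.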

For a complete list of models supported, see this link:

https://github.com/hyperopt/hyperopt-sklearn

In this example, I chose the following models:
- Linear Regression (baseline)
- XGBoost
- Random Forest
- Decision Tree

In [6]:
from hpsklearn import HyperoptEstimator, linear_regression, decision_tree_regressor
from hpsklearn import xgboost_regression, random_forest_regressor

from hyperopt import tpe

### hpsklearn
Below is a function that can train any model; the only change you need to make is to pass in your own data and a specific model. `max_evals=10` sets the number of hyperparameter combinations to evaluate. Increase it if you think a longer search will help.

The best combination is determined by cross-validation on your training data.

In [16]:
def train_any_model(X_train, y_train, any_regressor, max_evals=10):
    # Set up the search: TPE picks the next combination to try,
    # each trial is capped at 240 seconds, and no preprocessing is applied
    estim = HyperoptEstimator(regressor=any_regressor("myModel"), preprocessing=[],
                              algo=tpe.suggest, max_evals=max_evals,
                              trial_timeout=240, n_jobs=-1)

    # 5-fold cross-validation on the training data
    estim.fit(X_train, y_train, n_folds=5, cv_shuffle=True, random_state=123)
    return estim
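
The 5-fold cross-validation requested in `estim.fit` can be pictured with the following sketch. It only shows how shuffled rows are divided into folds so that each row is used for validation exactly once; it is not how `hpsklearn` implements it internally, and `kfold_indices` is a hypothetical helper.

```python
import random

def kfold_indices(n_rows, n_folds=5, seed=123):
    """Yield (train, validation) index lists; each row is validated exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, valid in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

for train, valid in kfold_indices(10):
    print(len(train), len(valid))  # 8 2 on every fold
```

In k-fold cross-validation generally, each candidate is scored only on rows it was not fitted on, so the test set stays untouched during tuning.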

### XGBoost

In [8]:
%%time
xgb_models = train_any_model(X_train, y_train, xgboost_regression)

best_xgb = xgb_models.best_model()['learner']

# Make predictions
xgb_pred = best_xgb.predict(X_test)

100%|█████████████| 1/1 [00:05<00:00,  5.05s/trial, best loss: 299.10736343209487]
100%|████████████| 2/2 [00:05<00:00,  5.72s/trial, best loss: 0.11579538731155403]
100%|████████████| 3/3 [00:11<00:00, 11.34s/trial, best loss: 0.11579538731155403]
100%|████████████| 4/4 [00:08<00:00,  8.12s/trial, best loss: 0.11579538731155403]
100%|████████████| 5/5 [00:05<00:00,  5.39s/trial, best loss: 0.11579538731155403]
100%|████████████| 6/6 [00:04<00:00,  4.86s/trial, best loss: 0.11579538731155403]
100%|████████████| 7/7 [00:07<00:00,  7.01s/trial, best loss: 0.11579538731155403]
100%|████████████| 8/8 [00:02<00:00,  2.21s/trial, best loss: 0.11579538731155403]
100%|████████████| 9/9 [00:03<00:00,  3.79s/trial, best loss: 0.11579538731155403]
100%|██████████| 10/10 [00:01<00:00,  1.64s/trial, best loss: 0.11579538731155403]
CPU times: user 4.84 s, sys: 5.02 s, total: 9.86 s
Wall time: 56.2 s


### Linear Regression

In [9]:
%%time
lr_models = train_any_model(X_train, y_train, linear_regression)

best_lr = lr_models.best_model()['learner']

# Make predictions 
lr_pred = best_lr.predict(X_test)

100%|████████████| 1/1 [00:01<00:00,  1.35s/trial, best loss: 0.23584924783186423]
100%|████████████| 2/2 [00:01<00:00,  1.29s/trial, best loss: 0.23584924783186423]
100%|████████████| 3/3 [00:01<00:00,  1.33s/trial, best loss: 0.23584924783186423]
100%|████████████| 4/4 [00:01<00:00,  1.29s/trial, best loss: 0.23584924783186423]
100%|████████████| 5/5 [00:01<00:00,  1.30s/trial, best loss: 0.23584924783186423]
100%|████████████| 6/6 [00:01<00:00,  1.31s/trial, best loss: 0.23584924783186423]
100%|████████████| 7/7 [00:01<00:00,  1.34s/trial, best loss: 0.23584924783186423]
100%|████████████| 8/8 [00:01<00:00,  1.30s/trial, best loss: 0.23584924783186423]
100%|████████████| 9/9 [00:01<00:00,  1.33s/trial, best loss: 0.23584924783186423]
100%|██████████| 10/10 [00:01<00:00,  1.30s/trial, best loss: 0.23584924783186423]
CPU times: user 205 ms, sys: 302 ms, total: 507 ms
Wall time: 13.2 s




### Random Forest

In [10]:
%%time
rf_models = train_any_model(X_train, y_train, random_forest_regressor)

best_rf = rf_models.best_model()['learner']

# Make predictions 
rf_pred = best_rf.predict(X_test)

100%|████████████| 1/1 [00:02<00:00,  2.55s/trial, best loss: 0.33984356214795053]
100%|████████████| 2/2 [00:02<00:00,  2.34s/trial, best loss: 0.20350819296581457]
100%|████████████| 3/3 [00:02<00:00,  2.12s/trial, best loss: 0.20350819296581457]
100%|████████████| 4/4 [00:01<00:00,  1.53s/trial, best loss: 0.20350819296581457]
100%|████████████| 5/5 [00:04<00:00,  4.37s/trial, best loss: 0.20350819296581457]
100%|████████████| 6/6 [00:07<00:00,  7.44s/trial, best loss: 0.20350819296581457]
100%|████████████| 7/7 [00:01<00:00,  1.47s/trial, best loss: 0.14462798181350922]
100%|████████████| 8/8 [00:07<00:00,  7.14s/trial, best loss: 0.13836036585060485]
100%|████████████| 9/9 [00:12<00:00, 12.46s/trial, best loss: 0.13836036585060485]
100%|██████████| 10/10 [00:02<00:00,  2.16s/trial, best loss: 0.13836036585060485]
CPU times: user 8.45 s, sys: 910 ms, total: 9.36 s
Wall time: 44.8 s




### Decision Tree

In [11]:
%%time
dt_models = train_any_model(X_train, y_train, decision_tree_regressor)

best_dt = dt_models.best_model()['learner']

# Make predictions 
dt_pred = best_dt.predict(X_test)

100%|█████████████| 1/1 [00:01<00:00,  1.38s/trial, best loss: 0.2710971654486245]
100%|█████████████| 2/2 [00:01<00:00,  1.39s/trial, best loss: 0.2710971654486245]
100%|█████████████| 3/3 [00:01<00:00,  1.33s/trial, best loss: 0.2710971654486245]
100%|█████████████| 4/4 [00:01<00:00,  1.34s/trial, best loss: 0.2710971654486245]
100%|█████████████| 5/5 [00:01<00:00,  1.35s/trial, best loss: 0.2710971654486245]
100%|█████████████| 6/6 [00:01<00:00,  1.35s/trial, best loss: 0.2710971654486245]
100%|█████████████| 7/7 [00:01<00:00,  1.35s/trial, best loss: 0.2710971654486245]
100%|█████████████| 8/8 [00:01<00:00,  1.34s/trial, best loss: 0.2710971654486245]
100%|█████████████| 9/9 [00:01<00:00,  1.32s/trial, best loss: 0.2710971654486245]
100%|███████████| 10/10 [00:01<00:00,  1.27s/trial, best loss: 0.2710971654486245]
CPU times: user 83.7 ms, sys: 99.6 ms, total: 183 ms
Wall time: 13.5 s




### Compare model performance on the test data

In [14]:
from sklearn.metrics import mean_squared_error, r2_score

print("XGB - R2:", r2_score(y_test, xgb_pred))
print("RF - R2:", r2_score(y_test, rf_pred))
print("DT - R2:", r2_score(y_test, dt_pred))
print("LR - R2:", r2_score(y_test, lr_pred))

XGB - R2: 0.8903992104779395
RF - R2: 0.8654967478424883
DT - R2: 0.7353998202869545
LR - R2: 0.7752944914831731
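
For reference, R2 compares the model's squared error with that of always predicting the mean of the observed values. The sketch below computes it by hand on made-up numbers; `r2_by_hand` is a hypothetical helper written for illustration, not the code inside `r2_score`.

```python
def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up log10 prices, for illustration only
y_true = [5.0, 5.5, 6.0, 6.5]
y_pred = [5.1, 5.4, 6.1, 6.4]
print(r2_by_hand(y_true, y_pred))  # approximately 0.968
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.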


In [15]:
print("XGB - MSE:", mean_squared_error(y_test, xgb_pred))
print("RF - MSE:", mean_squared_error(y_test, rf_pred))
print("DT - MSE:", mean_squared_error(y_test, dt_pred))
print("LR - MSE:", mean_squared_error(y_test, lr_pred))

XGB - MSE: 0.00541686347973995
RF - MSE: 0.006647632354615735
DT - MSE: 0.01307748836911186
LR - MSE: 0.011105750862644976
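
Because the target is `log_10_price`, these MSE values are in squared log10 units, which are hard to read directly. One way to interpret them, sketched below, is to take the square root (RMSE) and convert it back to a multiplicative factor on price. The MSE values are copied from the output above; the helper name `typical_price_factor` is hypothetical.

```python
import math

def typical_price_factor(mse_log10):
    """Convert an MSE in log10 units to a typical multiplicative error on price."""
    rmse = math.sqrt(mse_log10)
    return 10 ** rmse

# MSE values copied from the printed results
mse_by_model = {
    "XGB": 0.00541686347973995,
    "RF": 0.006647632354615735,
    "DT": 0.01307748836911186,
    "LR": 0.011105750862644976,
}

for name, mse in mse_by_model.items():
    print(name, round(typical_price_factor(mse), 3))
```

For XGBoost this gives a factor of about 1.18, which can be read loosely as a typical prediction being off by around 18% in price terms.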


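Conclusion: XGBoost has the lowest MSE and the highest R2 on the test data, followed by Random Forest. The tuned Decision Tree performs slightly worse than the Linear Regression baseline.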