# HOPUS

HOPUS (**HO**using **P**ricing **U**tilitie**S**) contains a variety of routines used to predict real estate prices.

This notebook highlights what HOPUS can do, namely:
- clean the raw data,
- train a variety of models for the prediction of real estate prices, and
- evaluate the performance of these models.

## Technical preliminaries

In [5]:
# We clone the HOPUS repository to have access to all its data and routines
!git clone https://github.com/aremondtiedrez/hopus.git
%cd hopus

fatal: destination path 'hopus' already exists and is not an empty directory.
/content/hopus


In [6]:
# Import requisite modules from HOPUS
import evaluation
import models
import preprocessing

## Data cleaning

In [7]:
hpi = preprocessing.home_price_index.load()
preprocessing.home_price_index.preprocess(hpi)

listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

In [8]:
price_rmse = evaluation.hpi_rmse(listings_data, target="price")
log_price_rmse = evaluation.hpi_rmse(listings_data, target="logPrice")
print(
    "When using the available home price index\n"
    "instead of the true home price index,\n"
    f"the price RMSE is ${price_rmse/1_000:.0f}k and \n"
    f"the log-price RMSE is {log_price_rmse:.3f}."
)

When using the available home price index
instead of the true home price index,
the price RMSE is $10k and 
the log-price RMSE is 0.021.
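To put the two error scales side by side, here is a minimal sketch of RMSE computed on prices and on log-prices. The arrays are made-up values, not HOPUS data, and `rmse` is a hypothetical helper, not a function from the `evaluation` module. For small errors, a log-price RMSE of `r` corresponds to a typical relative error of roughly `exp(r) - 1`.

```python
import numpy as np


def rmse(predicted, actual):
    """Root mean squared error between two equally sized arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Made-up prices for illustration only (not HOPUS data)
actual_prices = np.array([400_000.0, 550_000.0, 725_000.0])
predicted_prices = actual_prices * np.array([1.02, 0.98, 1.02])

price_rmse = rmse(predicted_prices, actual_prices)
log_price_rmse = rmse(np.log(predicted_prices), np.log(actual_prices))

# A log-price RMSE of r corresponds to a relative error of roughly exp(r) - 1
relative_error = np.exp(log_price_rmse) - 1
print(f"price RMSE: ${price_rmse / 1_000:.1f}k")
print(f"log-price RMSE: {log_price_rmse:.3f} (about {relative_error:.1%})")
```

The price RMSE is dominated by expensive listings, whereas the log-price RMSE weighs relative errors equally across price ranges.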


## Baseline model
The baseline model uses the average (time-normalized) price per square foot within each ZIP code.
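A baseline of this kind might look like the sketch below. It assumes that a prediction is the ZIP-level average price per square foot multiplied by the listing's square footage; the column names are hypothetical, the data is made up, and the time normalization is left out. It is not the implementation of `models.Baseline`.

```python
import pandas as pd

# Made-up listings; the column names are assumptions, not the HOPUS schema
listings = pd.DataFrame({
    "zipCode": ["02139", "02139", "02140", "02140"],
    "squareFootage": [1_000, 2_000, 1_500, 500],
    "price": [500_000, 900_000, 600_000, 250_000],
})

# Fit: average price per square foot within each ZIP code
price_per_sqft = listings["price"] / listings["squareFootage"]
zip_average = price_per_sqft.groupby(listings["zipCode"]).mean()

# Predict: ZIP-level average times the listing's square footage
predictions = listings["zipCode"].map(zip_average) * listings["squareFootage"]
print(predictions.round(0).tolist())
```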

In [None]:
import numpy as np
import secrets

In [None]:
model = models.Baseline()
model.fit(listings_data, None)
train_mse = model.evaluate(listings_data, listings_data["price"])
train_rmse = np.sqrt(train_mse)
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $157.830k


In [None]:
seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.Baseline, listings_data, listings_data["price"], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $158k
Cross-validation test error:     $159k
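How `evaluation.cv_evaluation` splits the data and averages the errors is not shown in this notebook. As a rough picture of the general idea, the sketch below runs K-fold cross-validation with ordinary least squares on synthetic data and averages the per-fold mean squared errors. The function `k_fold_mse` and all data are hypothetical.

```python
import numpy as np


def k_fold_mse(features, target, n_splits, seed):
    """Average train and test MSE of ordinary least squares over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(target)), n_splits)
    train_mse, test_mse = [], []
    for test_idx in folds:
        train_mask = np.ones(len(target), dtype=bool)
        train_mask[test_idx] = False
        # Fit on the training folds only
        coef, *_ = np.linalg.lstsq(
            features[train_mask], target[train_mask], rcond=None
        )
        train_residual = features[train_mask] @ coef - target[train_mask]
        test_residual = features[test_idx] @ coef - target[test_idx]
        train_mse.append(np.mean(train_residual ** 2))
        test_mse.append(np.mean(test_residual ** 2))
    return float(np.mean(train_mse)), float(np.mean(test_mse))


# Synthetic data: price = 300 * area + noise (illustration only)
rng = np.random.default_rng(0)
area = rng.uniform(500, 3_000, size=200)
features = np.column_stack([np.ones_like(area), area])
target = 300 * area + rng.normal(0, 20_000, size=200)

train_cv_mse, test_cv_mse = k_fold_mse(features, target, n_splits=5, seed=1)
train_cv_rmse, test_cv_rmse = np.sqrt([train_cv_mse, test_cv_mse])
print(f"train RMSE: ${train_cv_rmse / 1_000:.1f}k, test RMSE: ${test_cv_rmse / 1_000:.1f}k")
```

As in the cells of this notebook, the square root is taken after averaging the MSE over the folds, so the reported RMSE summarizes all folds at once.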


## Technical preliminary (before linear regression or XGBoost)
We group the columns into key features, auxiliary features, and target
(as well as into information columns and unused columns).
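The bracket lookups used later in this notebook, such as `listings_data["keyPredictionFeatures"]` and `listings_data[("target", "price")]`, are consistent with a pandas DataFrame whose columns form a two-level index. How `group_columns` builds that index is not shown here; the sketch below is one way to obtain the same access pattern, with made-up column names and a hypothetical `groups` mapping.

```python
import pandas as pd

# Made-up flat table; column names are illustrative assumptions
flat = pd.DataFrame({
    "yearBuilt": [1990, 2005],
    "bedrooms": [2, 4],
    "livingArea": [1_000, 2_000],
    "price": [500_000, 900_000],
})

# Hypothetical assignment of each column to a group
groups = {
    "yearBuilt": "auxiliaryPredictionFeatures",
    "bedrooms": "keyPredictionFeatures",
    "livingArea": "keyPredictionFeatures",
    "price": "target",
}

# Replace the flat column index by a two-level (group, column) index
flat.columns = pd.MultiIndex.from_tuples(
    [(groups[column], column) for column in flat.columns]
)

key_features = flat["keyPredictionFeatures"]  # DataFrame with two columns
price = flat[("target", "price")]             # Series
both = flat[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
print(list(key_features.columns), price.tolist(), both.shape)
```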

In [9]:
preprocessing.property_listings.group_columns(listings_data)

## Linear regression: training and evaluation

In [None]:
import numpy as np
import secrets
from sklearn.model_selection import train_test_split

In [None]:
seed = secrets.randbits(32)

# Train-test split
train_features, test_features, train_target, test_target = train_test_split(
    listings_data["keyPredictionFeatures"],
    listings_data[("target", "price")],
    train_size=0.8,
    shuffle=True,
    random_state=seed,
)

# Train model
model = models.LinearRegression()
model.fit(train_features, train_target)

# Evaluate model
train_rmse = np.sqrt(model.evaluate(train_features, train_target))
test_rmse = np.sqrt(model.evaluate(test_features, test_target))

# Report evaluations
print(f"Seed: {seed}")
print(f"Training error: ${train_rmse / 1_000:.3f}k")
print(f"Test error:     ${test_rmse / 1_000:.3f}k")

Seed: 1480318944
Training error: $161.367k
Test error:     $137.479k


## Linear regression: evaluation with cross-validation

In [None]:
import numpy as np
import secrets

In [None]:
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $156k
Cross-validation test error:     $159k


## Linear regression: cross-validation for various data subsets

In [None]:
import numpy as np
import secrets

In [None]:
seed = secrets.randbits(32)

In [None]:
# --------------------------------------------------
# PART A: USING ONLY THE KEY PREDICTION FEATURES
# --------------------------------------------------

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 4: without outliers or imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $248k
Cross-validation test error:     $252k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $163k
Cross-validation test error:     $167k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $240k
Cross-validation test error:     $244k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $156k
Cross-validation test error:     $160k


In [None]:
# ------------------------------------------------------------
# PART B: USING THE KEY AND THE AUXILIARY PREDICTION FEATURES
# ------------------------------------------------------------
features_label = ["keyPredictionFeatures", "auxiliaryPredictionFeatures"]

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 4: without outliers or imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $230k
Cross-validation test error:     $248k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $144k
Cross-validation test error:     $161k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $227k
Cross-validation test error:     $240k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $138k
Cross-validation test error:     $151k


## XGBoost: training and cross-validation

In [12]:
import numpy as np
import secrets

In [11]:
features = listings_data[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
target = listings_data[("target", "price")]

hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

model = models.BoostedTrees(**hyperparameters)
model.fit(features, target)
train_rmse = np.sqrt(model.evaluate(features, target))
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $27.750k


In [8]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $26k
Cross-validation test error:     $128k


## XGBoost: Hierarchical hyperparameter search

We follow the hierarchical hyperparameter search procedure described in
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
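The structure of such a stage-by-stage search can be sketched as follows: each stage searches a small grid over one group of hyperparameters, freezes the best values, and moves on to the next group. The function `toy_cv_error` is a made-up stand-in for a cross-validated error, and the grids are illustrative only.

```python
from itertools import product


def toy_cv_error(params):
    """Stand-in for a cross-validated error; lower is better (made-up function)."""
    return (
        (params["max_depth"] - 5) ** 2
        + (params["min_child_weight"] - 3) ** 2
        + 10 * (params["gamma"] - 0.2) ** 2
    )


# Starting values and the grids searched at each stage (illustrative only)
params = {"max_depth": 3, "min_child_weight": 1, "gamma": 0.0}
stages = [
    {"max_depth": (3, 5, 7, 9), "min_child_weight": (1, 3, 5)},
    {"gamma": (0.0, 0.1, 0.2, 0.3)},
]

for grid in stages:
    names = list(grid)
    best = min(
        product(*grid.values()),
        key=lambda values: toy_cv_error({**params, **dict(zip(names, values))}),
    )
    # Freeze the winners of this stage before moving on to the next one
    params.update(zip(names, best))

print(params)
```

Searching one group at a time keeps the number of evaluated combinations small, at the cost of possibly missing interactions between hyperparameters tuned in different stages.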

In [2]:
import numpy as np
import pandas as pd
import secrets
import time

from itertools import product

### Step 1: `n_estimators`

In [21]:
# STEP 1
# We use sensible default choices and aim to find a good number of estimators to use
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

# Experiment parameters
experiment_parameters = {
    "seed": None,
    "n_splits": 5,
}

experiment_records = []
n_experiments = 5
start_time = time.time()
# START - HYPERPARAMETER-DEPENDENT SECTION
for n_estimators in (10, 30, 100, 300, 1000):
    hyperparameters["n_estimators"] = n_estimators
# END - HYPERPARAMETER-DEPENDENT SECTION
    for _ in range(n_experiments):
        experiment_parameters["seed"] = secrets.randbits(32)
        train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, **experiment_parameters, hyperparameters=hyperparameters)
        experiment_result = {
            "train_cv_mse": train_cv_mse,
            "test_cv_mse": test_cv_mse,
        }
        record = {
            **experiment_parameters,
            **hyperparameters,
            **experiment_result,
        }
        experiment_records.append(record)
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

experiment_records = pd.DataFrame(experiment_records)

Duration of the experiment: 1.2 minutes.


In [22]:
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(experiment_records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

| max_depth | min_child_weight | gamma | subsample | colsample_bytree | scale_pos_weight | learning_rate | n_estimators | test_cv_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 10 | 164188.656898 |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 30 | 134271.585738 |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 100 | 127132.624531 |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 300 | 130599.472778 |
| 5 | 1 | 0 | 0.8 | 0.8 | 1 | 0.1 | 1000 | 133891.596128 |
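The aggregation used in the cell above averages the test MSE over the repeated runs of each setting and only then takes the square root. The sketch below reproduces that pattern on made-up records and picks the setting with the lowest resulting RMSE using `idxmin`; the numbers are illustrative, not results from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up records: two repetitions for each of three settings
records = pd.DataFrame({
    "n_estimators": [10, 10, 100, 100, 1000, 1000],
    "test_cv_mse": [2.6e10, 2.8e10, 1.5e10, 1.7e10, 1.7e10, 1.9e10],
})

# Mean MSE per setting first, square root second
rmse_by_setting = np.sqrt(records.groupby("n_estimators")["test_cv_mse"].mean())

# Index label of the smallest RMSE, i.e. the best setting
best_n_estimators = rmse_by_setting.idxmin()
print(rmse_by_setting.round(0))
print(f"Best n_estimators: {best_n_estimators}")
```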


### Step 2: `max_depth` and `min_child_weight`

In [23]:
# STEP 2
# We fix `n_estimators = 100` from the previous step and
# now seek to find good values for `max_depth` and `min_child_weight`.
hyperparameters = {
    "n_estimators": 100,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

# Experiment parameters
experiment_parameters = {
    "seed": None,
    "n_splits": 5,
}

experiment_records = []
n_experiments = 5
start_time = time.time()
# START - HYPERPARAMETER-DEPENDENT SECTION
for max_depth, min_child_weight in product((3, 5, 7, 9), (1, 3, 5)):
    hyperparameters["max_depth"] = max_depth
    hyperparameters["min_child_weight"] = min_child_weight
# END - HYPERPARAMETER-DEPENDENT SECTION
    for _ in range(n_experiments):
        experiment_parameters["seed"] = secrets.randbits(32)
        train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, **experiment_parameters, hyperparameters=hyperparameters)
        experiment_result = {
            "train_cv_mse": train_cv_mse,
            "test_cv_mse": test_cv_mse,
        }
        record = {
            **experiment_parameters,
            **hyperparameters,
            **experiment_result,
        }
        experiment_records.append(record)
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

experiment_records = pd.DataFrame(experiment_records)

Duration of the experiment: 1.4 minutes.


In [25]:
# Compute the RMSE for each set of hyperparameters (here, only the values of `max_depth` and `min_child_weight` change)
np.sqrt(experiment_records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

| n_estimators | gamma | subsample | colsample_bytree | scale_pos_weight | learning_rate | max_depth | min_child_weight | test_cv_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 3 | 1 | 128484.940657 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 3 | 3 | 130098.750255 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 3 | 5 | 132524.5352 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 5 | 1 | 127399.627202 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 5 | 3 | 126240.155417 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 5 | 5 | 134657.722118 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 7 | 1 | 129365.990649 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 7 | 3 | 133805.088515 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 7 | 5 | 133939.156881 |
| 100 | 0 | 0.8 | 0.8 | 1 | 0.1 | 9 | 1 | 134836.86591 |


### Step 3: `gamma`

In [31]:
# STEP 3
# We fix `max_depth = 5` and `min_child_weight = 3` from the previous step and
# now seek to find a good value for `gamma`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

# Experiment parameters
experiment_parameters = {
    "seed": None,
    "n_splits": 10,
}

experiment_records = []
n_experiments = 10
start_time = time.time()
# START - HYPERPARAMETER-DEPENDENT SECTION
for gamma in (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    hyperparameters["gamma"] = gamma
# END - HYPERPARAMETER-DEPENDENT SECTION
    for _ in range(n_experiments):
        experiment_parameters["seed"] = secrets.randbits(32)
        train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, **experiment_parameters, hyperparameters=hyperparameters)
        experiment_result = {
            "train_cv_mse": train_cv_mse,
            "test_cv_mse": test_cv_mse,
        }
        record = {
            **experiment_parameters,
            **hyperparameters,
            **experiment_result,
        }
        experiment_records.append(record)
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

experiment_records = pd.DataFrame(experiment_records)

Duration of the experiment: 3.0 minutes.


In [32]:
# Compute the RMSE for each set of hyperparameters (here, only the value of `gamma` changes)
np.sqrt(experiment_records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

| n_estimators | max_depth | min_child_weight | subsample | colsample_bytree | scale_pos_weight | learning_rate | gamma | test_cv_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 5 | 3 | 0.8 | 0.8 | 1 | 0.1 | 0.0 | 123956.596997 |
| 100 | 5 | 3 | 0.8 | 0.8 | 1 | 0.1 | 0.1 | 124706.825855 |
| 100 | 5 | 3 | 0.8 | 0.8 | 1 | 0.1 | 0.2 | 123208.149445 |
| 100 | 5 | 3 | 0.8 | 0.8 | 1 | 0.1 | 0.3 | 126462.903659 |
| 100 | 5 | 3 | 0.8 | 0.8 | 1 | 0.1 | 0.4 | 124128.959584 |
| 100 | 5 | 3 | 0.8 | 0.8 | 1 | 0.1 | 0.5 | 124574.191058 |
| 100 | 5 | 3 | 0.8 | 0.8 | 1 | 0.1 | 0.6 | 127750.085266 |
| 100 | 5 | 3 | 0.8 | 0.8 | 1 | 0.1 | 0.7 | 125643.084585 |


### Step 4: `subsample` and `colsample_bytree`

In [35]:
# STEP 4
# We fix `gamma = 0.2` from the previous step and
# now seek to find good values for `subsample` and `colsample_bytree`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

# Experiment parameters
experiment_parameters = {
    "seed": None,
    "n_splits": 10,
}

experiment_records = []
n_experiments = 10
start_time = time.time()
# START - HYPERPARAMETER-DEPENDENT SECTION
for subsample, colsample_bytree in product((0.6, 0.7, 0.8, 0.9), (0.6, 0.7, 0.8, 0.9)):
    hyperparameters["subsample"] = subsample
    hyperparameters["colsample_bytree"] = colsample_bytree
# END - HYPERPARAMETER-DEPENDENT SECTION
    for _ in range(n_experiments):
        experiment_parameters["seed"] = secrets.randbits(32)
        train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, **experiment_parameters, hyperparameters=hyperparameters)
        experiment_result = {
            "train_cv_mse": train_cv_mse,
            "test_cv_mse": test_cv_mse,
        }
        record = {
            **experiment_parameters,
            **hyperparameters,
            **experiment_result,
        }
        experiment_records.append(record)
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

experiment_records = pd.DataFrame(experiment_records)

Duration of the experiment: 5.8 minutes.


In [36]:
# Compute the RMSE for each set of hyperparameters (here, only the values of `subsample` and `colsample_bytree` change)
np.sqrt(experiment_records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

| n_estimators | max_depth | min_child_weight | gamma | scale_pos_weight | learning_rate | subsample | colsample_bytree | test_cv_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.6 | 0.6 | 126372.267256 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.6 | 0.7 | 128550.076969 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 127118.054087 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.6 | 0.9 | 123376.361519 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.7 | 0.6 | 126324.03382 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.7 | 0.7 | 126534.710717 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.7 | 0.8 | 127830.026204 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.7 | 0.9 | 126192.150668 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.8 | 0.6 | 126254.005388 |
| 100 | 5 | 3 | 0.2 | 1 | 0.1 | 0.8 | 0.7 | 125544.839515 |


### Step 5: `reg_lambda`

In [15]:
# STEP 5
# We fix `subsample = 0.6` and `colsample_bytree = 0.9` from the previous step and
# now seek to find a good value for `reg_lambda`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

# Experiment parameters
experiment_parameters = {
    "seed": None,
    "n_splits": 10,
}

experiment_records = []
n_experiments = 10
start_time = time.time()
# START - HYPERPARAMETER-DEPENDENT SECTION
for reg_lambda in (0, 1e-5, 1e-4, 1e-3, 1e-2, 1, 10, 100):
    hyperparameters["reg_lambda"] = reg_lambda
# END - HYPERPARAMETER-DEPENDENT SECTION
    for _ in range(n_experiments):
        experiment_parameters["seed"] = secrets.randbits(32)
        train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, **experiment_parameters, hyperparameters=hyperparameters)
        experiment_result = {
            "train_cv_mse": train_cv_mse,
            "test_cv_mse": test_cv_mse,
        }
        record = {
            **experiment_parameters,
            **hyperparameters,
            **experiment_result,
        }
        experiment_records.append(record)
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

experiment_records = pd.DataFrame(experiment_records)

Duration of the experiment: 3.0 minutes.


In [16]:
# Compute the RMSE for each set of hyperparameters (here, only the value of `reg_lambda` changes)
np.sqrt(experiment_records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

| n_estimators | max_depth | min_child_weight | gamma | subsample | colsample_bytree | scale_pos_weight | learning_rate | reg_lambda | test_cv_mse |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 5 | 3 | 0.2 | 0.6 | 0.9 | 1 | 0.1 | 0.0 | 127015.200058 |
| 100 | 5 | 3 | 0.2 | 0.6 | 0.9 | 1 | 0.1 | 1e-05 | 125629.753419 |
| 100 | 5 | 3 | 0.2 | 0.6 | 0.9 | 1 | 0.1 | 0.0001 | 126045.71737 |
| 100 | 5 | 3 | 0.2 | 0.6 | 0.9 | 1 | 0.1 | 0.001 | 123938.810957 |
| 100 | 5 | 3 | 0.2 | 0.6 | 0.9 | 1 | 0.1 | 0.01 | 125735.006351 |
| 100 | 5 | 3 | 0.2 | 0.6 | 0.9 | 1 | 0.1 | 1.0 | 122878.631742 |
| 100 | 5 | 3 | 0.2 | 0.6 | 0.9 | 1 | 0.1 | 10.0 | 123679.489386 |
| 100 | 5 | 3 | 0.2 | 0.6 | 0.9 | 1 | 0.1 | 100.0 | 149516.114355 |


### Step 6: Evaluate the final hyperparameter choice and train a final model

In [21]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "reg_lambda": 0.001,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

seed = secrets.randbits(32)

start_time = time.time()
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Duration of the experiment: 13.0 minutes.
Cross-validation training error: $14k
Cross-validation test error:     $119k


In [22]:
# Train the final model on the full dataset
model = models.BoostedTrees(**hyperparameters)
model.fit(features, target)

In [24]:
np.sqrt(model.evaluate(features, target))

np.float64(14189.633962861762)

In [25]:
model._model.save_model("xgb.model")

  self.get_booster().save_model(fname)


In [26]:
model._model.load_model("xgb.model")

  self.get_booster().load_model(fname)


In [27]:
np.sqrt(model.evaluate(features, target))

np.float64(14189.633962861762)