# HOPUS

HOPUS (**HO**using **P**ricing **U**tilitie**S**) contains a variety of routines used to predict real estate prices.

This notebook highlights what HOPUS can do, namely
- clean the raw data,
- train a variety of models for the prediction of real estate prices, and
- evaluate the performance of these models.

## Technical preliminaries

We install the `hopus` package and import it.
We also import other libraries, or tools from other libraries, which will be of use.

In [1]:
pip install git+https://github.com/aremondtiedrez/hopus.git

Collecting git+https://github.com/aremondtiedrez/hopus.git
  Cloning https://github.com/aremondtiedrez/hopus.git to /tmp/pip-req-build-39lsi_vc
  Running command git clone --filter=blob:none --quiet https://github.com/aremondtiedrez/hopus.git /tmp/pip-req-build-39lsi_vc
  Resolved https://github.com/aremondtiedrez/hopus.git to commit 2db5d7488479d8719cb8652e7b05180c46f69fee
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: hopus
  Building wheel for hopus (pyproject.toml) ... done
  Created wheel for hopus: filename=hopus-0.1.0-py3-none-any.whl size=605835 sha256=c51281ff41337a9374f5c35482d9814d0a771568efd3603540684136a50c76b8
  Stored in directory: /tmp/pip-ephem-wheel-cache-zl9o_j7_/wheels/51/70/f3/9b50ab95e93a8c094dca78d2adce93d75f7cc33ce79c7c9cfc
Successfully built hopus
Installing collected packages: hopus
Succ

In [11]:
import secrets
import hopus
from sklearn.model_selection import train_test_split

## Data pipeline

The data used here is obtained via
the [RentCast API](https://www.rentcast.io/api).

Our data pipeline proceeds in three steps.

1. We clean the raw data. The data provided by the RentCast API is already
   of good quality, but nonetheless some gaps in the data are present which
   we must account for. Moreover, the RentCast data comes in the form of a JSON
   file which, due to its hierarchical and nested nature, needs to be untangled
   in order to put the data into a more digestible tabular format.

   Another important step at this stage has to do with the evolution of prices
   over time. Predicting how real estate prices evolve over time is
   an important, complex, and challenging question. We therefore avoid it
   entirely here. Instead we use a national (U.S.) home price index
   (known as HPI below) to "normalize" home prices over time and
   then predict these "normalized" home prices.

2. We remove outliers and samples which are missing key features.

   Why remove outliers? Because, when an unusual real estate sale listing
   is encountered, it is best to gather more information and proceed with
   caution. For example, some houses are bought at prices significantly above
   market rates by developers who then raze them and redevelop the lot
   into a multi-apartment building. Without access to additional information,
   such as zoning information, which indicates neighbourhoods where such
   redevelopments are possible, or demographic and economic data, which
   indicates where such redevelopments are likely to be profitable,
   it is unlikely that our model could reliably predict the sale price
   of such properties.

   Why remove samples which are missing key features? For two reasons.
   First, because by *key* features we mean things such as
   the number of bathrooms or the lot size. These are quantities that
   can always be gathered by hand.
   Second, because the overwhelming majority of real estate listings
   accessed via the RentCast API contain all of the features we label
   as key features. Only about 5% of the listings fail to do so.

3. We split the data into a training set and a test set.
   (Nothing about this step is specific to the data or problem at hand.)
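
As an illustration of the first two steps, the sketch below flattens nested JSON-like records into flat rows and then drops the rows that lack a key feature. It is a minimal sketch in plain Python, not the HOPUS implementation: the record layout, the field names (`lotSize`, `features.bathrooms`) and the helpers `flatten` and `has_key_features` are assumptions made up for this example.

```python
# Hypothetical nested records, loosely imitating a JSON API response.
# The field names are illustrative assumptions, not the actual RentCast schema.
records = [
    {"id": "a", "lotSize": 6000, "features": {"bathrooms": 2, "garage": True}},
    {"id": "b", "lotSize": None, "features": {"bathrooms": 1}},
    {"id": "c", "lotSize": 4500, "features": {}},
]

KEY_FEATURES = ("lotSize", "features.bathrooms")  # assumed key features


def flatten(record, prefix=""):
    """Turn a nested dict into a flat dict with dotted keys."""
    flat = {}
    for name, value in record.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{key}."))
        else:
            flat[key] = value
    return flat


def has_key_features(row):
    """Return True if every key feature is present and not None."""
    return all(row.get(name) is not None for name in KEY_FEATURES)


rows = [flatten(record) for record in records]
complete_rows = [row for row in rows if has_key_features(row)]
print([row["id"] for row in complete_rows])
```

Only record `a` survives here: record `b` has no lot size and record `c` has no bathroom count.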

In [15]:
# Load the data
hpi = hopus.preprocessing.home_price_index.load_demo_data()
listings_data = hopus.preprocessing.property_listings.load_demo_data()

# Pre-process the data, i.e. clean the data
hopus.preprocessing.home_price_index.preprocess(hpi)
listings_data = hopus.preprocessing.property_listings.preprocess(listings_data, hpi)

# Remove outliers and listings which are missing key features
hopus.preprocessing.property_listings.drop_outliers(listings_data)
hopus.preprocessing.property_listings.drop_missing_key_features(listings_data)

# Split the data into training and testing sets.
# The training and test data obtained by using the seed below
# are already saved in the HOPUS package, to enable easy access
# later in this notebook.
# (The seed used was randomly generated via secrets.randbits)
seed = 726109520
training_data, test_data = train_test_split(listings_data, train_size=0.8, shuffle=True, random_state=seed)

In [16]:
training_data

Unnamed: 0,id,formattedAddress,addressLine1,addressLine2,city,state,stateFips,zipCode,county,countyFips,...,saleMonth,saleYear,trueValueHomePriceIndex,availableValueHomePriceIndex,trueMinusAvailableHomePriceIndex,monthAvgTrueMinusAvailableHomePriceIndex,predictedValueHomePriceIndex,pricePerSqFt,timeNormalizedPricePerSqFt,logPrice
1191,"3604-Larson-Ct,-Madison,-WI-53714","3604 Larson Ct, Madison, WI 53714",3604 Larson Ct,,Madison,WI,55,53714,Dane,25.0,...,8,2024,325.076,323.756,1.320,2.295385,326.051385,261.206897,0.803526,12.621488
714,"325-Racine-Rd,-Madison,-WI-53705","325 Racine Rd, Madison, WI 53705",325 Racine Rd,,Madison,WI,55,53705,Dane,25.0,...,6,2023,308.369,297.428,10.941,4.207205,301.635205,256.273358,0.831061,13.081541
370,"5126-Butterfield-Dr,-Madison,-WI-53704","5126 Butterfield Dr, Madison, WI 53704",5126 Butterfield Dr,,Madison,WI,55,53704,Dane,25.0,...,3,2024,316.884,310.913,5.971,1.714053,312.627053,232.828871,0.734745,12.899220
918,"5525-Marsha-Dr,-Madison,-WI-53705","5525 Marsha Dr, Madison, WI 53705",5525 Marsha Dr,,Madison,WI,55,53705,Dane,25.0,...,12,2024,323.347,324.710,-1.363,-0.247974,324.462026,248.534936,0.768632,12.709269
1031,"3537-Dennett-Dr,-Madison,-WI-53714","3537 Dennett Dr, Madison, WI 53714",3537 Dennett Dr,,Madison,WI,55,53714,Dane,25.0,...,1,2023,292.712,298.542,-5.830,-0.228921,298.313079,241.825613,0.826155,12.779873
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222,"1408-Arrowood-Dr,-Madison,-WI-53704","1408 Arrowood Dr, Madison, WI 53704",1408 Arrowood Dr,,Madison,WI,55,53704,Dane,25.0,...,7,2023,310.292,301.542,8.750,3.422769,304.964769,207.067918,0.667332,12.834681
819,"2937-Harvey-St,-Madison,-WI-53705","2937 Harvey St, Madison, WI 53705",2937 Harvey St,,Madison,WI,55,53705,Dane,25.0,...,4,2024,320.812,310.781,10.031,3.224667,314.005667,345.896147,1.078190,12.931203
909,"122-Nautilus-Dr,-Madison,-WI-53705","122 Nautilus Dr, Madison, WI 53705",122 Nautilus Dr,,Madison,WI,55,53705,Dane,25.0,...,11,2024,323.745,325.076,-1.331,-0.026289,325.049711,280.728376,0.867128,13.514405
836,"919-University-Bay-Dr,-Madison,-WI-53705","919 University Bay Dr, Madison, WI 53705",919 University Bay Dr,,Madison,WI,55,53705,Dane,25.0,...,5,2024,323.756,312.704,11.052,4.277949,316.981949,504.658385,1.558761,13.384728
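
The displayed values suggest how some of the derived columns relate to one another. The sketch below checks three such relations on the first two training rows shown above. The relations are inferred from the printed numbers rather than taken from the HOPUS source, so they should be read as an assumption, and the short dictionary keys are names chosen for this example.

```python
import math

# Values copied from the first two rows of the training data displayed above.
rows = [
    {"true_hpi": 325.076, "available_hpi": 323.756, "true_minus_available": 1.320,
     "month_avg_gap": 2.295385, "predicted_hpi": 326.051385,
     "price_per_sqft": 261.206897, "normalized_price_per_sqft": 0.803526},
    {"true_hpi": 308.369, "available_hpi": 297.428, "true_minus_available": 10.941,
     "month_avg_gap": 4.207205, "predicted_hpi": 301.635205,
     "price_per_sqft": 256.273358, "normalized_price_per_sqft": 0.831061},
]

checks = []
for row in rows:
    # trueMinusAvailableHomePriceIndex = true index - available index
    checks.append(math.isclose(row["true_hpi"] - row["available_hpi"],
                               row["true_minus_available"], abs_tol=1e-6))
    # predictedValueHomePriceIndex = available index + monthly average gap
    checks.append(math.isclose(row["available_hpi"] + row["month_avg_gap"],
                               row["predicted_hpi"], abs_tol=1e-6))
    # timeNormalizedPricePerSqFt = pricePerSqFt / true index
    checks.append(math.isclose(row["price_per_sqft"] / row["true_hpi"],
                               row["normalized_price_per_sqft"], abs_tol=1e-5))

print(all(checks))
```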


In [18]:
test_data

Unnamed: 0,id,formattedAddress,addressLine1,addressLine2,city,state,stateFips,zipCode,county,countyFips,...,saleMonth,saleYear,trueValueHomePriceIndex,availableValueHomePriceIndex,trueMinusAvailableHomePriceIndex,monthAvgTrueMinusAvailableHomePriceIndex,predictedValueHomePriceIndex,pricePerSqFt,timeNormalizedPricePerSqFt,logPrice
355,"4138-Grayhawk-Trl,-Madison,-WI-53704","4138 Grayhawk Trl, Madison, WI 53704",4138 Grayhawk Trl,,Madison,WI,55,53704,Dane,25.0,...,2,2024,312.704,312.100,0.604,0.273500,312.373500,240.610329,0.769451,12.923912
788,"1121-Lorraine-Dr,-Madison,-WI-53705","1121 Lorraine Dr, Madison, WI 53705",1121 Lorraine Dr,,Madison,WI,55,53705,Dane,25.0,...,12,2023,310.913,312.511,-1.598,-0.247974,312.263026,254.248810,0.817749,12.832011
914,"5201-Burnett-Dr,-Madison,-WI-53705","5201 Burnett Dr, Madison, WI 53705",5201 Burnett Dr,,Madison,WI,55,53705,Dane,25.0,...,12,2024,323.347,324.710,-1.363,-0.247974,324.462026,236.553785,0.731579,13.071070
1004,"2013-Sundstrom-St,-Madison,-WI-53713","2013 Sundstrom St, Madison, WI 53713",2013 Sundstrom St,,Madison,WI,55,53713,Dane,25.0,...,8,2024,325.076,323.756,1.320,2.295385,326.051385,250.391236,0.770254,12.676076
1192,"3441-Hargrove-St,-Madison,-WI-53714","3441 Hargrove St, Madison, WI 53714",3441 Hargrove St,,Madison,WI,55,53714,Dane,25.0,...,8,2024,325.076,323.756,1.320,2.295385,326.051385,422.900763,1.300929,12.531773
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
894,"322-Cheyenne-Trl,-Madison,-WI-53705","322 Cheyenne Trl, Madison, WI 53705",322 Cheyenne Trl,,Madison,WI,55,53705,Dane,25.0,...,9,2024,324.710,325.309,-0.599,1.146641,326.455641,294.523700,0.907036,13.369223
1551,"5105-Camden-Rd,-Madison,-WI-53716","5105 Camden Rd, Madison, WI 53716",5105 Camden Rd,,Madison,WI,55,53716,Dane,25.0,...,7,2024,325.631,320.812,4.819,3.422769,324.234769,195.016251,0.598887,12.793859
1637,"2226-Van-Hise-Ave,-Madison,-WI-53726","2226 Van Hise Ave, Madison, WI 53726",2226 Van Hise Ave,,Madison,WI,55,53726,Dane,25.0,...,7,2023,310.292,301.542,8.750,3.422769,304.964769,333.353548,1.074322,13.217128
30,"1228-Spaight-St,-Madison,-WI-53703","1228 Spaight St, Madison, WI 53703",1228 Spaight St,,Madison,WI,55,53703,Dane,25.0,...,9,2023,312.511,308.369,4.142,1.146641,309.515641,382.475661,1.223879,13.623139


In [19]:
import pandas as pd

In [20]:
training_data.to_csv("training_data.csv")

In [21]:
test_data.to_csv("test_data.csv")

# BELOW: ONLY OBSOLETE CELLS

## Data cleaning

In [None]:
hpi = preprocessing.home_price_index.load()
preprocessing.home_price_index.preprocess(hpi)

listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

In [None]:
price_rmse = evaluation.hpi_rmse(listings_data, target="price")
log_price_rmse = evaluation.hpi_rmse(listings_data, target="logPrice")
print(
    "When using the available home price index\n"
    "instead of the true home price index,\n"
    f"the price RMSE is ${price_rmse/1_000:.0f}k and \n"
    f"the log-price RMSE is {log_price_rmse:.3f}."
)

When using the available home price index
instead of the true home price index,
the price RMSE is $10k and 
the log-price RMSE is 0.021.
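
To get a feel for the log-price figure, recall that an error of size `e` on the logarithm of a price corresponds to a multiplicative factor `exp(e)` on the price itself. The short computation below converts the reported log-price RMSE into an approximate relative error. It is only a rule-of-thumb reading of the number, not something computed by HOPUS.

```python
import math

log_price_rmse = 0.021  # value reported in the output above

# An additive error on log(price) is a multiplicative error on price.
relative_error = math.exp(log_price_rmse) - 1
print(f"Approximate relative price error: {relative_error:.1%}")
```

For small values the relative error is close to the log-price RMSE itself, which is why log-price errors are often read directly as percentages.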


## Baseline model
Average (time-normalized) price-per-square-foot over each ZIP code
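
A baseline of this kind can be sketched in a few lines. The class below is a hypothetical stand-in written in plain Python, not the `models.Baseline` class from HOPUS. The column names `zipCode` and `timeNormalizedPricePerSqFt` appear in the data shown earlier, whereas `homePriceIndex`, `squareFootage` and the fallback to a global average are assumptions of this sketch.

```python
from collections import defaultdict


class ZipCodeBaseline:
    """Hypothetical baseline: average normalized price per sq ft in each ZIP code."""

    def fit(self, listings):
        totals = defaultdict(float)
        counts = defaultdict(int)
        for listing in listings:
            totals[listing["zipCode"]] += listing["timeNormalizedPricePerSqFt"]
            counts[listing["zipCode"]] += 1
        self.zip_average = {z: totals[z] / counts[z] for z in totals}
        # Fallback for ZIP codes never seen during fitting (an assumption of this sketch).
        self.global_average = sum(totals.values()) / sum(counts.values())
        return self

    def predict(self, listing):
        rate = self.zip_average.get(listing["zipCode"], self.global_average)
        # Undo the normalization: rate * index * area gives a price in dollars.
        return rate * listing["homePriceIndex"] * listing["squareFootage"]


listings = [
    {"zipCode": "53705", "timeNormalizedPricePerSqFt": 0.8},
    {"zipCode": "53705", "timeNormalizedPricePerSqFt": 1.0},
    {"zipCode": "53714", "timeNormalizedPricePerSqFt": 0.7},
]
model = ZipCodeBaseline().fit(listings)
new_listing = {"zipCode": "53705", "homePriceIndex": 320.0, "squareFootage": 1500}
predicted_price = model.predict(new_listing)
print(f"${predicted_price:,.0f}")
```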

In [None]:
import numpy as np
import secrets

In [None]:
model = models.Baseline()
model.fit(listings_data, None)
train_mse = model.evaluate(listings_data, listings_data["price"])
train_rmse = np.sqrt(train_mse)
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $157.830k


In [None]:
# Save the model
model.save("baseline_model")
del model

In [None]:
# Loading the model
model = models.Baseline()
model.load("baseline_model")
loaded_rmse = np.sqrt(model.evaluate(listings_data, listings_data["price"]))
print(f"Training error: ${loaded_rmse / 1_000:.3f}k")

Training error: $157.830k


In [None]:
# Cross-validation
seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.Baseline, listings_data, listings_data["price"], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $158k
Cross-validation test error:     $158k
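
For readers unfamiliar with cross-validation, the sketch below shows a generic shuffled k-fold evaluation in plain Python, using a "predict the training mean" model. It is not the code behind `evaluation.cv_evaluation`, whose internals (including the exact meaning of the argument `100`) are not described in this notebook. The function `k_fold_mse` and the prices are made up for illustration.

```python
import random
import statistics


def k_fold_mse(targets, n_splits, seed):
    """Average train/test MSE of a mean predictor over shuffled k folds."""
    indices = list(range(len(targets)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits disjoint folds.
    folds = [indices[i::n_splits] for i in range(n_splits)]

    train_mses, test_mses = [], []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [targets[i] for i in indices if i not in held_out_set]
        test = [targets[i] for i in held_out]
        prediction = statistics.fmean(train)  # the "model": predict the training mean
        train_mses.append(statistics.fmean((y - prediction) ** 2 for y in train))
        test_mses.append(statistics.fmean((y - prediction) ** 2 for y in test))
    return statistics.fmean(train_mses), statistics.fmean(test_mses)


prices = [310_000, 295_000, 420_000, 380_000, 505_000, 260_000, 350_000, 330_000]
train_mse, test_mse = k_fold_mse(prices, n_splits=4, seed=0)
print(f"train RMSE: ${train_mse ** 0.5 / 1_000:.0f}k")
print(f"test RMSE:  ${test_mse ** 0.5 / 1_000:.0f}k")
```

Fixing the seed makes the folds, and therefore the reported errors, reproducible.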


## Baseline model for *log*-prices

In [None]:
# Training and evaluating a single model
model = models.Baseline()
model.fit(listings_data, None)
train_mse = model.evaluate(listings_data, listings_data["logPrice"], target_type="log_price")
train_rmse = np.sqrt(train_mse)
print(f"Training error: {train_rmse:.3f}")

Training error: 0.317


In [None]:
# Cross-validation
seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.Baseline, listings_data, listings_data["logPrice"], 100, seed, target_type="log_price")
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: {train_cv_rmse:.3f}")
print(f"Cross-validation test error:     {test_cv_rmse:.3f}")

Cross-validation training error: 0.317
Cross-validation test error:     0.318


## Technical preliminary (before linear regression or XGBoost)
We group the columns into key features, auxiliary features, and target
(as well as into information columns and unused columns).
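
The later cells index the data as `listings_data["keyPredictionFeatures"]` and `listings_data[("target", "price")]`, which is how a pandas DataFrame with two-level column labels behaves. The sketch below builds such a grouping by hand on a toy table. It is an assumption about the resulting layout, not the implementation of `group_columns`, and the feature names and their assignment to groups are invented for the example.

```python
import pandas as pd

# Toy listings table; column names and group assignment are illustrative only.
listings = pd.DataFrame(
    {
        "bedrooms": [3, 2],
        "lotSize": [6000, 4500],
        "garage": [1, 0],
        "price": [350_000, 280_000],
    }
)
groups = {
    "bedrooms": "keyPredictionFeatures",
    "lotSize": "keyPredictionFeatures",
    "garage": "auxiliaryPredictionFeatures",
    "price": "target",
}

# Replace the flat column labels by (group, column) pairs.
listings.columns = pd.MultiIndex.from_tuples(
    [(groups[column], column) for column in listings.columns]
)

key_features = listings["keyPredictionFeatures"]  # DataFrame: bedrooms, lotSize
price = listings[("target", "price")]  # Series of prices
all_features = listings[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
print(list(key_features.columns), all_features.shape)
```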

In [None]:
preprocessing.property_listings.group_columns(listings_data)

## Linear regression: training and evaluation

In [None]:
import numpy as np
import secrets
from sklearn.model_selection import train_test_split

In [None]:
# Train a single model, then save it, delete it, and load it back

# Train
model = models.LinearRegression()
model.fit(listings_data["keyPredictionFeatures"], listings_data[("target", "price")])

# Evaluate
train_rmse = np.sqrt(model.evaluate(listings_data["keyPredictionFeatures"], listings_data[("target", "price")]))
print(f"Training error: ${train_rmse / 1_000:.3f}k")

# Save
model.save("linear_regression_model")

# Delete
del model

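# Load
model = models.LinearRegression()
model.load("linear_regression_model")
# Re-evaluate with the loaded model so that the printed error reflects what was loaded
loaded_rmse = np.sqrt(model.evaluate(listings_data["keyPredictionFeatures"], listings_data[("target", "price")]))
print(f"Training error (after deleting the original model and loading it back): ${loaded_rmse / 1_000:.3f}k")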

Training error: $155.970k
Training error (after deleting the original model and loading it back): $155.970k


In [None]:
# Hold-out evaluation with a single train-test split
seed = secrets.randbits(32)

# Train-test split
train_features, test_features = train_test_split(listings_data["keyPredictionFeatures"], train_size=0.8, shuffle=True, random_state=seed)
train_target, test_target = train_test_split(listings_data[("target", "price")], train_size=0.8, shuffle=True, random_state=seed)

# Train model
model = models.LinearRegression()
model.fit(train_features, train_target)

# Evaluate model
train_rmse = np.sqrt(model.evaluate(train_features, train_target))
test_rmse = np.sqrt(model.evaluate(test_features, test_target))

# Report evaluations
print(f"Seed: {seed}")
print(f"Training error: ${train_rmse / 1_000:.3f}k")
print(f"Test error:     ${test_rmse / 1_000:.3f}k")

Seed: 3645453190
Training error: $152.175k
Test error:     $171.501k


## Linear regression: evaluation with cross-validation

In [None]:
import numpy as np
import secrets

In [None]:
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $156k
Cross-validation test error:     $159k


## Linear regression for *log*-prices: evaluation with cross-validation

In [None]:
seed = secrets.randbits(32)
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(
    models.LinearRegression,
    listings_data["keyPredictionFeatures"],
    listings_data[("target", "logPrice")],
    100,
    seed,
    target_type="log_price"
)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: {train_cv_rmse:.3f}")
print(f"Cross-validation test error:     {test_cv_rmse:.3f}")

Cross-validation training error: 0.280
Cross-validation test error:     0.281


## Linear regression: cross-validation for various data subsets

In [None]:
import numpy as np
import secrets

In [None]:
seed = secrets.randbits(32)

In [None]:
# --------------------------------------------------
# PART A: USING ONLY THE KEY PREDICTION FEATURES
# --------------------------------------------------

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 4: without outliers nor imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data["keyPredictionFeatures"], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $248k
Cross-validation test error:     $252k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $163k
Cross-validation test error:     $167k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $240k
Cross-validation test error:     $244k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $156k
Cross-validation test error:     $160k


In [None]:
# ------------------------------------------------------------
# PART B: USING THE KEY AND THE AUXILIARY PREDICTION FEATURES
# ------------------------------------------------------------
features_label = ["keyPredictionFeatures", "auxiliaryPredictionFeatures"]

# Case 1: with outliers and imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers and with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 2: without outliers but with imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
#preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers but with imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 3: with outliers but without imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

#preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"With outliers but without imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

# Case 4: without outliers nor imperfect samples
listings_data = preprocessing.property_listings.load()
listings_data = preprocessing.property_listings.preprocess(listings_data, hpi)

preprocessing.property_listings.drop_outliers(listings_data)
preprocessing.property_listings.drop_missing_key_features(listings_data)

preprocessing.property_listings.group_columns(listings_data)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.LinearRegression, listings_data[features_label], listings_data[("target", "price")], 100, seed)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

print("--------------------------------------------------")
print(f"Without outliers nor imperfect samples.")
print("--------------------------------------------------")
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

--------------------------------------------------
With outliers and with imperfect samples.
--------------------------------------------------
Cross-validation training error: $230k
Cross-validation test error:     $248k
--------------------------------------------------
Without outliers but with imperfect samples.
--------------------------------------------------
Cross-validation training error: $144k
Cross-validation test error:     $161k
--------------------------------------------------
With outliers but without imperfect samples.
--------------------------------------------------
Cross-validation training error: $227k
Cross-validation test error:     $240k
--------------------------------------------------
Without outliers nor imperfect samples.
--------------------------------------------------
Cross-validation training error: $138k
Cross-validation test error:     $151k


## XGBoost: training and cross-validation

In [None]:
import numpy as np
import secrets

In [None]:
# Training a single model
features = listings_data[["keyPredictionFeatures", "auxiliaryPredictionFeatures"]]
target = listings_data[("target", "price")]

hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

model = models.BoostedTrees(**hyperparameters)
model.fit(features, target)
train_rmse = np.sqrt(model.evaluate(features, target))
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $27.750k


In [None]:
# Saving a model
model.save("boosted_tree_model")
del model


In [None]:
# Loading a model
model = models.BoostedTrees()
model.load("boosted_tree_model")

train_rmse = np.sqrt(model.evaluate(features, target))
print(f"Training error: ${train_rmse / 1_000:.3f}k")

Training error: $27.750k



In [None]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Cross-validation training error: $26k
Cross-validation test error:     $125k


## XGBoost: training and cross-validation for *log*-prices

In [None]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1
}

seed = secrets.randbits(32)

train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, np.log(target), 100, seed, hyperparameters=hyperparameters, target_type="log_price")
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))

# Report evaluations
print(f"Cross-validation training error: {train_cv_rmse:.3f}")
print(f"Cross-validation test error:     {test_cv_rmse:.3f}")

Cross-validation training error: 0.065
Cross-validation test error:     0.255


## XGBoost: Hierarchical hyperparameter search

We follow the hierarchical hyperparameter search procedure described in
https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
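
The idea of such a search is to tune one small group of hyperparameters at a time, freeze the best values found, and move on to the next group, rather than searching the full grid at once. The sketch below illustrates this with a made-up scoring function in place of a real cross-validated error. The names `toy_cv_error` and `hierarchical_search` are hypothetical and not part of HOPUS or XGBoost.

```python
from itertools import product


def toy_cv_error(params):
    """Stand-in for a cross-validated error; smaller is better (made up for this sketch)."""
    return (
        (params["max_depth"] - 5) ** 2
        + (params["min_child_weight"] - 3) ** 2
        + 10 * (params["gamma"] - 0.2) ** 2
    )


def hierarchical_search(initial, stages, score):
    """Tune one group of hyperparameters at a time, freezing the best values found."""
    best = dict(initial)
    for grid in stages:
        names = list(grid)
        candidates = [dict(zip(names, values)) for values in product(*grid.values())]
        best_candidate = min(candidates, key=lambda c: score({**best, **c}))
        best.update(best_candidate)
    return best


initial = {"max_depth": 3, "min_child_weight": 1, "gamma": 0.0}
stages = [
    {"max_depth": (3, 5, 7, 9), "min_child_weight": (1, 3, 5)},  # stage 1: tree shape
    {"gamma": (0.0, 0.1, 0.2, 0.3)},  # stage 2: split penalty
]
best = hierarchical_search(initial, stages, toy_cv_error)
print(best)
```

With these two stages the search scores 12 + 4 = 16 candidates, compared with 48 for the full grid. The price is that interactions between hyperparameters tuned in different stages are ignored.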

In [None]:
import numpy as np
import pandas as pd

from itertools import product

### Step 1: `n_estimators`

In [None]:
# STEP 1
# We use sensible default choices and aim to find a good number of estimators to use
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for n_estimators in (10, 30, 100, 300, 1000):
    hyperparameters["n_estimators"] = n_estimators
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

max_depth,min_child_weight,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,n_estimators,test_cv_mse
5,1,0,0.8,0.8,1,0.1,10,167866.093173
5,1,0,0.8,0.8,1,0.1,30,154970.591688
5,1,0,0.8,0.8,1,0.1,100,142922.38565
5,1,0,0.8,0.8,1,0.1,300,131397.791722
5,1,0,0.8,0.8,1,0.1,1000,141849.177067
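
Note that the cell above takes the square root of the averaged test MSE, so the numbers in the last column are RMSE values in dollars even though the column keeps the name `test_cv_mse`. The small computation below, with made-up MSE values, shows that this "root of the mean MSE" is in general not the same as the mean of the per-experiment RMSEs.

```python
import math

# Hypothetical per-experiment test MSEs (in dollars squared), for illustration only.
experiment_mses = [1.0e10, 4.0e10]

# What the cell above computes: average the MSEs first, then take the root.
rmse_of_mean_mse = math.sqrt(sum(experiment_mses) / len(experiment_mses))

# Alternative: take the root of each MSE first, then average.
mean_of_rmses = sum(math.sqrt(mse) for mse in experiment_mses) / len(experiment_mses)

print(f"root of mean MSE: ${rmse_of_mean_mse / 1_000:.1f}k")
print(f"mean of RMSEs:    ${mean_of_rmses / 1_000:.1f}k")
```

Because the square root is concave, the first quantity is never smaller than the second, so the two conventions should not be mixed when comparing results.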


### Step 2: `max_depth` and `min_child_weight`

In [None]:
# STEP 2
# We fix `n_estimators = 100` from the previous step and
# now seek to find good values for `max_depth` and `min_child_weight`.
hyperparameters = {
    "n_estimators": 100,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for max_depth, min_child_weight in product((3, 5, 7, 9), (1, 3, 5)):
    hyperparameters["max_depth"] = max_depth
    hyperparameters["min_child_weight"] = min_child_weight
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only `max_depth` and `min_child_weight` change)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

n_estimators,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,max_depth,min_child_weight,test_cv_mse
100,0,0.8,0.8,1,0.1,3,1,122362.546601
100,0,0.8,0.8,1,0.1,3,3,130029.778251
100,0,0.8,0.8,1,0.1,3,5,131322.190905
100,0,0.8,0.8,1,0.1,5,1,140524.192897
100,0,0.8,0.8,1,0.1,5,3,131566.630501
100,0,0.8,0.8,1,0.1,5,5,132154.710467
100,0,0.8,0.8,1,0.1,7,1,148857.905292
100,0,0.8,0.8,1,0.1,7,3,132531.70392
100,0,0.8,0.8,1,0.1,7,5,132905.108605
100,0,0.8,0.8,1,0.1,9,1,123762.845216


### Step 3: `gamma`

In [None]:
# STEP 3
# We fix `max_depth = 5` and `min_child_weight = 3` from the previous step and
# now seek to find a good value for `gamma`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for gamma in (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    hyperparameters["gamma"] = gamma
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `gamma` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

n_estimators,max_depth,min_child_weight,subsample,colsample_bytree,scale_pos_weight,learning_rate,gamma,test_cv_mse
100,5,3,0.8,0.8,1,0.1,0.0,129868.921322
100,5,3,0.8,0.8,1,0.1,0.1,139951.417399
100,5,3,0.8,0.8,1,0.1,0.2,136127.287884
100,5,3,0.8,0.8,1,0.1,0.3,127061.01858
100,5,3,0.8,0.8,1,0.1,0.4,128624.666416
100,5,3,0.8,0.8,1,0.1,0.5,154614.412506
100,5,3,0.8,0.8,1,0.1,0.6,130056.174284
100,5,3,0.8,0.8,1,0.1,0.7,134921.474051


### Step 4: `subsample` and `colsample_bytree`

In [None]:
# STEP 4
# We fix `gamma = 0.2` from the previous step and
# now seek to find good values for `subsample` and `colsample_bytree`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for subsample, colsample_bytree in product((0.6, 0.7, 0.8, 0.9), (0.6, 0.7, 0.8, 0.9)):
    hyperparameters["subsample"] = subsample
    hyperparameters["colsample_bytree"] = colsample_bytree
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only `subsample` and `colsample_bytree` change)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

n_estimators,max_depth,min_child_weight,gamma,scale_pos_weight,learning_rate,subsample,colsample_bytree,test_cv_mse
100,5,3,0.2,1,0.1,0.6,0.6,132869.031376
100,5,3,0.2,1,0.1,0.6,0.7,131054.764492
100,5,3,0.2,1,0.1,0.6,0.8,138038.242894
100,5,3,0.2,1,0.1,0.6,0.9,127430.389929
100,5,3,0.2,1,0.1,0.7,0.6,138724.340417
100,5,3,0.2,1,0.1,0.7,0.7,142196.856449
100,5,3,0.2,1,0.1,0.7,0.8,140327.985786
100,5,3,0.2,1,0.1,0.7,0.9,130399.316808
100,5,3,0.2,1,0.1,0.8,0.6,127570.024483
100,5,3,0.2,1,0.1,0.8,0.7,127403.831654


### Step 5: `reg_lambda`

In [None]:
# STEP 5
# We fix `subsample = 0.6` and `colsample_bytree = 0.9` from the previous step and
# now seek to find a good value for `reg_lambda`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for reg_lambda in (0, 1e-5, 1e-4, 1e-3, 1e-2, 1, 10, 100):
    hyperparameters["reg_lambda"] = reg_lambda
    record = evaluation.run_experiment(features, target, models.BoostedTrees, hyperparameters, n_experiments=2, n_splits=3)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `reg_lambda` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

n_estimators,max_depth,min_child_weight,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,reg_lambda,test_cv_mse
100,5,3,0.2,0.6,0.9,1,0.1,0.0,130598.809918
100,5,3,0.2,0.6,0.9,1,0.1,1e-05,129145.321268
100,5,3,0.2,0.6,0.9,1,0.1,0.0001,139496.294011
100,5,3,0.2,0.6,0.9,1,0.1,0.001,144907.922949
100,5,3,0.2,0.6,0.9,1,0.1,0.01,132371.172048
100,5,3,0.2,0.6,0.9,1,0.1,1.0,133882.609197
100,5,3,0.2,0.6,0.9,1,0.1,10.0,140659.833262
100,5,3,0.2,0.6,0.9,1,0.1,100.0,153363.197463


### Step 6: Evaluate the final hyperparameter choice and train a final model

In [None]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "reg_lambda": 0.001,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

seed = secrets.randbits(32)

start_time = time.time()
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, target, 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

# Report evaluations
print(f"Cross-validation training error: ${train_cv_rmse / 1_000:.0f}k")
print(f"Cross-validation test error:     ${test_cv_rmse / 1_000:.0f}k")

Duration of the experiment: 13.0 minutes.
Cross-validation training error: $14k
Cross-validation test error:     $119k
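
Because the seed is drawn afresh with `secrets.randbits(32)` on every run, the reported errors will vary slightly from run to run unless the seed is recorded. The sketch below shows the idea with the standard library only. How `evaluation.cv_evaluation` consumes the seed is not shown in this notebook, so `random.Random` serves here purely as a stand-in for any seeded shuffling.

```python
import random
import secrets

# Draw a fresh 32-bit seed, as in the cell above, and record it
seed = secrets.randbits(32)
print(f"seed = {seed}")

# Two generators built from the same recorded seed produce the same shuffle,
# which is what makes a run repeatable later on
first = random.Random(seed).sample(range(100), k=5)
second = random.Random(seed).sample(range(100), k=5)
assert first == second
```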


In [None]:
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 3,
    "gamma": 0.2,
    "subsample": 0.6,
    "colsample_bytree": 0.9,
    "reg_lambda": 0.001,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

# Train the model, then save it
model = models.BoostedTrees(**hyperparameters)
model.fit(features, target)

In [None]:
np.sqrt(model.evaluate(features, target))

np.float64(14189.633962861762)

In [None]:
model._model.save_model("xgb.model")

  self.get_booster().save_model(fname)


In [None]:
model._model.load_model("xgb.model")

  self.get_booster().load_model(fname)


In [None]:
np.sqrt(model.evaluate(features, target))

np.float64(14189.633962861762)

## XGBoost: Hierarchical hyperparameter search for the prediction of *log*-prices

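We follow the hierarchical hyperparameter search procedure described in
[Analytics Vidhya's complete guide to parameter tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/),
this time fitting the models to `np.log(target)` instead of `target`.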
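
As used in this notebook, the procedure tunes one small group of hyperparameters at a time while keeping the earlier choices fixed. A minimal sketch of that pattern follows. The names `sweep_stage` and `toy_score` are hypothetical, and `toy_score` merely stands in for a cross-validated error such as the one derived from `evaluation.run_experiment`.

```python
from itertools import product


def sweep_stage(fixed, grid, score):
    """Score every combination in `grid` on top of `fixed`; return the best parameters."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = {**fixed, **dict(zip(names, values))}
        results.append((score(params), params))
    # Lower score is better, as with an RMSE
    _, best_params = min(results, key=lambda item: item[0])
    return best_params


def toy_score(params):
    # Hypothetical stand-in for a cross-validated error
    return abs(params.get("max_depth", 5) - 5) + abs(params.get("gamma", 0.0) - 0.2)


chosen = {"learning_rate": 0.1}
# Stage 1: tune `max_depth` only
chosen = sweep_stage(chosen, {"max_depth": (3, 5, 7, 9)}, toy_score)
# Stage 2: keep the chosen `max_depth`, tune `gamma`
chosen = sweep_stage(chosen, {"gamma": (0, 0.1, 0.2, 0.3)}, toy_score)
print(chosen)
```

The staged search evaluates far fewer combinations than a full grid, at the price of possibly missing interactions between hyperparameters tuned in different stages.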

In [None]:
import numpy as np
import pandas as pd

from itertools import product

### Step 1: `n_estimators`

In [None]:
# STEP 1
# We use sensible default choices and aim to find a good number of estimators to use
hyperparameters = {
    "max_depth": 5,
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for n_estimators in (10, 30, 100, 300, 1000):
    hyperparameters["n_estimators"] = n_estimators
    record = evaluation.run_experiment(features, np.log(target), models.BoostedTrees, hyperparameters, n_experiments=5, n_splits=10)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `n_estimators` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

max_depth,min_child_weight,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,n_estimators,test_cv_mse
5,1,0,0.8,0.8,1,0.1,10,0.283166
5,1,0,0.8,0.8,1,0.1,30,0.242005
5,1,0,0.8,0.8,1,0.1,100,0.238329
5,1,0,0.8,0.8,1,0.1,300,0.243967
5,1,0,0.8,0.8,1,0.1,1000,0.247144


### Step 2: `max_depth` and `min_child_weight`

In [None]:
# STEP 2
# We fix `n_estimators = 100` from the previous step and
# now seek to find good values for `max_depth` and `min_child_weight`.
hyperparameters = {
    "n_estimators": 100,
    "gamma": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for max_depth, min_child_weight in product((3, 5, 7, 9), (1, 3, 5)):
    hyperparameters["max_depth"] = max_depth
    hyperparameters["min_child_weight"] = min_child_weight
    record = evaluation.run_experiment(features, np.log(target), models.BoostedTrees, hyperparameters, n_experiments=5, n_splits=10)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only `max_depth` and `min_child_weight` change)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

n_estimators,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,max_depth,min_child_weight,test_cv_mse
100,0,0.8,0.8,1,0.1,3,1,0.23378
100,0,0.8,0.8,1,0.1,3,3,0.238169
100,0,0.8,0.8,1,0.1,3,5,0.232269
100,0,0.8,0.8,1,0.1,5,1,0.23854
100,0,0.8,0.8,1,0.1,5,3,0.235693
100,0,0.8,0.8,1,0.1,5,5,0.235061
100,0,0.8,0.8,1,0.1,7,1,0.238368
100,0,0.8,0.8,1,0.1,7,3,0.241023
100,0,0.8,0.8,1,0.1,7,5,0.240938
100,0,0.8,0.8,1,0.1,9,1,0.243694
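
Reading off the best setting from such a table amounts to taking the minimum over the rows. The sketch below does this for the rows shown above (the displayed table is cut off, so the remaining combinations are not included) and also reports the spread between the best and worst displayed rows, which gives a feel for how much the choice matters.

```python
# (max_depth, min_child_weight) -> RMSE of the log-price, copied from the rows above
rmse_by_setting = {
    (3, 1): 0.23378, (3, 3): 0.238169, (3, 5): 0.232269,
    (5, 1): 0.23854, (5, 3): 0.235693, (5, 5): 0.235061,
    (7, 1): 0.238368, (7, 3): 0.241023, (7, 5): 0.240938,
    (9, 1): 0.243694,
}

# Setting with the smallest error among the displayed rows
best_setting = min(rmse_by_setting, key=rmse_by_setting.get)

# Difference between the worst and the best displayed rows
spread = max(rmse_by_setting.values()) - min(rmse_by_setting.values())

print(f"best (max_depth, min_child_weight): {best_setting}")
print(f"spread between best and worst rows: {spread:.6f}")
```

Among the displayed rows the minimum sits at `max_depth = 3` and `min_child_weight = 5`, which are the values carried into the next step.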


### Step 3: `gamma`

In [None]:
# STEP 3
# We fix `max_depth = 3` and `min_child_weight = 5` from the previous step and
# now seek to find a good value for `gamma`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 3,
    "min_child_weight": 5,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for gamma in (0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
    hyperparameters["gamma"] = gamma
    record = evaluation.run_experiment(features, np.log(target), models.BoostedTrees, hyperparameters, n_experiments=5, n_splits=10)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `gamma` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

n_estimators,max_depth,min_child_weight,subsample,colsample_bytree,scale_pos_weight,learning_rate,gamma,test_cv_mse
100,3,5,0.8,0.8,1,0.1,0.0,0.23374
100,3,5,0.8,0.8,1,0.1,0.1,0.233047
100,3,5,0.8,0.8,1,0.1,0.2,0.236558
100,3,5,0.8,0.8,1,0.1,0.3,0.237255
100,3,5,0.8,0.8,1,0.1,0.4,0.236631
100,3,5,0.8,0.8,1,0.1,0.5,0.241904
100,3,5,0.8,0.8,1,0.1,0.6,0.241564
100,3,5,0.8,0.8,1,0.1,0.7,0.244325


### Step 4: `subsample` and `colsample_bytree`

In [None]:
# STEP 4
# We fix `gamma = 0.1` from the previous step and
# now seek to find good values for `subsample` and `colsample_bytree`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 3,
    "min_child_weight": 5,
    "gamma": 0.1,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for subsample, colsample_bytree in product((0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    hyperparameters["subsample"] = subsample
    hyperparameters["colsample_bytree"] = colsample_bytree
    record = evaluation.run_experiment(features, np.log(target), models.BoostedTrees, hyperparameters, n_experiments=5, n_splits=10)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only `subsample` and `colsample_bytree` change)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

n_estimators,max_depth,min_child_weight,gamma,scale_pos_weight,learning_rate,subsample,colsample_bytree,test_cv_mse
100,3,5,0.1,1,0.1,0.3,0.3,0.237191
100,3,5,0.1,1,0.1,0.3,0.4,0.237898
100,3,5,0.1,1,0.1,0.3,0.5,0.238063
100,3,5,0.1,1,0.1,0.3,0.6,0.235584
100,3,5,0.1,1,0.1,0.3,0.7,0.237422
100,3,5,0.1,1,0.1,0.3,0.8,0.238619
100,3,5,0.1,1,0.1,0.3,0.9,0.239914
100,3,5,0.1,1,0.1,0.4,0.3,0.235202
100,3,5,0.1,1,0.1,0.4,0.4,0.235972
100,3,5,0.1,1,0.1,0.4,0.5,0.23677


### Step 5: `reg_lambda`

In [None]:
# STEP 5
# We fix `subsample = 0.8` and `colsample_bytree = 0.4` from the previous step and
# now seek to find a good value for `reg_lambda`.
hyperparameters = {
    "n_estimators": 100,
    "max_depth": 3,
    "min_child_weight": 5,
    "gamma": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.4,
    "scale_pos_weight": 1,
    "learning_rate": 0.1
}

records = []
for reg_lambda in (0, 1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100):
    hyperparameters["reg_lambda"] = reg_lambda
    record = evaluation.run_experiment(features, np.log(target), models.BoostedTrees, hyperparameters, n_experiments=5, n_splits=10)
    records.append(record)

# Combine the experiments into a neat DataFrame
records = pd.DataFrame(records)
# Compute the RMSE for each set of hyperparameters (here, only the value of `reg_lambda` changes)
np.sqrt(records.groupby(list(hyperparameters.keys()))["test_cv_mse"].mean())

n_estimators,max_depth,min_child_weight,gamma,subsample,colsample_bytree,scale_pos_weight,learning_rate,reg_lambda,test_cv_mse
100,3,5,0.1,0.8,0.4,1,0.1,0.0,0.235457
100,3,5,0.1,0.8,0.4,1,0.1,1e-05,0.234757
100,3,5,0.1,0.8,0.4,1,0.1,0.0001,0.233588
100,3,5,0.1,0.8,0.4,1,0.1,0.001,0.235531
100,3,5,0.1,0.8,0.4,1,0.1,0.01,0.236655
100,3,5,0.1,0.8,0.4,1,0.1,0.1,0.233944
100,3,5,0.1,0.8,0.4,1,0.1,1.0,0.232667
100,3,5,0.1,0.8,0.4,1,0.1,10.0,0.236187
100,3,5,0.1,0.8,0.4,1,0.1,100.0,0.245917


### Step 6: Evaluate the final hyperparameter choice and train a final model

In [None]:
import time

In [None]:
hyperparameters = {
    "max_depth": 3,
    "min_child_weight": 5,
    "gamma": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.4,
    "reg_lambda": 1,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

seed = secrets.randbits(32)

start_time = time.time()
train_cv_mse, test_cv_mse, trained_models = evaluation.cv_evaluation(models.BoostedTrees, features, np.log(target), 100, seed, hyperparameters=hyperparameters)
train_cv_rmse, test_cv_rmse = np.sqrt(np.array([train_cv_mse, test_cv_mse]))
end_time = time.time()
print(f"Duration of the experiment: {(end_time - start_time)/60:.1f} minutes.")

# Report evaluations
print(f"Cross-validation training error: {train_cv_rmse:.3f}")
print(f"Cross-validation test error:     {test_cv_rmse:.3f}")

Duration of the experiment: 5.4 minutes.
Cross-validation training error: 0.174
Cross-validation test error:     0.230
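
These errors are measured on the log scale, so they are not directly comparable to the dollar figures reported for the price model. One rough way to read them: an error of `r` in the log-price corresponds to a prediction that is off by a multiplicative factor of `exp(r)`. The sketch below applies this to the two values printed above. It is only an interpretation aid, since an RMSE summarizes a typical error and not a bound.

```python
import math

# Values printed by the cell above (RMSE of the log-price)
train_log_rmse = 0.174
test_log_rmse = 0.230

# An error of r on the log scale is a factor of exp(r) on the price scale
train_factor = math.exp(train_log_rmse)
test_factor = math.exp(test_log_rmse)

print(f"training: factor {train_factor:.3f}, i.e. about {(train_factor - 1) * 100:.0f}% high")
print(f"test:     factor {test_factor:.3f}, i.e. about {(test_factor - 1) * 100:.0f}% high")
```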


In [None]:
hyperparameters = {
    "max_depth": 3,
    "min_child_weight": 5,
    "gamma": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.4,
    "reg_lambda": 1,
    "scale_pos_weight": 1,
    "learning_rate": 0.01,
    "n_estimators": 5_000,
}

# Train the model, then save it
model = models.BoostedTrees(**hyperparameters)
model.fit(features, np.log(target))

In [None]:
np.sqrt(model.evaluate(features, np.log(target)))

np.float64(0.1739820317797629)

In [None]:
model._model.save_model("xgb.model")

  self.get_booster().save_model(fname)


In [None]:
model._model.load_model("xgb.model")

  self.get_booster().load_model(fname)


In [None]:
np.sqrt(model.evaluate(features, np.log(target)))

np.float64(0.1739820317797629)
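
To compare this model with the one trained directly on prices, its predictions would first have to be mapped back to dollars by exponentiation, and only then compared with the observed prices. The following sketch shows that computation on plain lists. The function `dollar_rmse` is a hypothetical helper, the prices are illustrative values and not HOPUS data, and the model's prediction method is not shown in this notebook, so none is called here.

```python
import math


def dollar_rmse(log_predictions, prices):
    """RMSE in dollars for predictions made on the log-price scale."""
    errors = [math.exp(log_pred) - price for log_pred, price in zip(log_predictions, prices)]
    return math.sqrt(sum(error ** 2 for error in errors) / len(errors))


# Illustrative values: true prices and log-scale predictions
prices = [100_000.0, 200_000.0]
log_predictions = [math.log(110_000.0), math.log(180_000.0)]

result = dollar_rmse(log_predictions, prices)
print(f"RMSE in dollars: {result:,.0f}")
```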