Welcome to lrboost's documentation!

lrboost is a scikit-learn compatible implementation of a simple stacking protocol for prediction. Additional machine learning utilities are also included!


Getting Started

LRBoostRegressor works in three steps.

  • Fit a linear model to a target y
    • This is the primary model, accessible via lrboost.primary_model
  • Fit a tree-based model to the residual (y_pred - y) of the linear model
    • This is the secondary model, accessible via lrboost.secondary_model
  • Combine the two predictions into a final prediction on the scale of the original target

LRBoostRegressor defaults to sklearn.linear_model.RidgeCV() and sklearn.ensemble.HistGradientBoostingRegressor() as the linear (primary) and non-linear (secondary) models, respectively.

>>> from sklearn.datasets import load_diabetes
>>> from lrboost import LRBoostRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> lrb = LRBoostRegressor().fit(X, y)
>>> predictions = lrb.predict(X)
>>> detailed_predictions = lrb.predict(X, detail=True)
>>> print(lrb.primary_model.score(X, y))  # R2 of the linear model alone
0.512
>>> print(lrb.score(X, y))  # R2 of the combined model
0.933
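Conceptually, the three steps above amount to something like the following hand-rolled sketch. This is illustrative only; it uses the default model types but is not lrboost's exact implementation.

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import RidgeCV
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> # Step 1: fit the linear (primary) model to the target
>>> primary = RidgeCV().fit(X, y)
>>> primary_pred = primary.predict(X)
>>> # Step 2: fit the tree-based (secondary) model to the residual (y_pred - y)
>>> secondary = HistGradientBoostingRegressor().fit(X, primary_pred - y)
>>> # Step 3: combine the two predictions on the scale of the original target
>>> final_pred = primary_pred - secondary.predict(X)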

Both the linear and non-linear models are fit in the fit() method and are then used to predict on any new data. Because lrboost is a lightly modified scikit-learn class, you can hyperparameter-tune the tree model as you normally would (see the Hyperparameter Tuning section below).

  • predict(X) returns an array-like of final predictions; predict(X, detail=True) returns detailed output that includes the primary and final predictions.
  • predict_dist(X) provides probabilistic predictions when NGBoost or XGBoost-Distribution is used as the non-linear (secondary) estimator (see the sketch below).
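A minimal sketch of predict_dist, assuming the ngboost package is installed and NGBRegressor is passed as the secondary model; the exact return type depends on the secondary estimator and on how lrboost passes its distributional output through.

>>> from ngboost import NGBRegressor
>>> from sklearn.datasets import load_diabetes
>>> from lrboost import LRBoostRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> # NGBRegressor as the secondary model enables probabilistic predictions
>>> lrb = LRBoostRegressor(secondary_model=NGBRegressor()).fit(X, y)
>>> dist = lrb.predict_dist(X)  # probabilistic predictions from the secondary model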

Any sklearn-compatible estimator can be used with lrboost, and you can unpack keyword arguments as needed.

>>> import numpy as np
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import RidgeCV
>>> from sklearn.ensemble import RandomForestRegressor
>>> from lrboost import LRBoostRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> ridge_args = {"alphas": np.logspace(-4, 3, 10, endpoint=True), "cv": 5}
>>> rf_args = {"n_estimators": 50, "n_jobs": -1}
>>> lrb = LRBoostRegressor(primary_model=RidgeCV(**ridge_args),
...                        secondary_model=RandomForestRegressor(**rf_args))
>>> lrb = lrb.fit(X, y)
>>> predictions = lrb.predict(X)

  • lrboost will not magically reduce error in every circumstance.
  • It may be particularly useful in situations that require extrapolating beyond the training data.

Hyperparameter Tuning
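The following is a sketch of one possible tuning approach, not a recipe shipped with lrboost. It assumes LRBoostRegressor exposes its constructor arguments through the standard scikit-learn get_params/set_params machinery, so nested parameters can be addressed as secondary_model__<param>.

>>> import numpy as np
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> from sklearn.model_selection import GridSearchCV
>>> from lrboost import LRBoostRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> lrb = LRBoostRegressor(secondary_model=HistGradientBoostingRegressor())
>>> # Hypothetical nested-parameter grid over the secondary (tree) model
>>> param_grid = {
...     "secondary_model__max_iter": [100, 300, 500],
...     "secondary_model__learning_rate": [0.05, 0.1],
... }
>>> search = GridSearchCV(lrb, param_grid, scoring="neg_root_mean_squared_error", cv=5)
>>> search = search.fit(X, y)
>>> print(search.best_params_)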

Model Comparison - Example 1

  • This is a (simplified) example of predicting clutch minutes from non-clutch minutes in NBA basketball.
  • The relationship has known linear and non-linear components, and extrapolation can be difficult.

[Figure: clutch vs. non-clutch minutes, showing the train/test split used below]

  • Using the train/test split shown above, we can construct an extrapolation task on the tails of the distribution.

    >>> import pandas as pd
    >>> import numpy as np
    >>> from sklearn.metrics import mean_squared_error
    >>> from sklearn.ensemble import HistGradientBoostingRegressor
    >>> from lrboost import LRBoostRegressor
    >>> clutch = pd.read_csv('../examples/clutch.csv')
    >>> # Train on the middle of the distribution, test on the tails (extrapolation)
    >>> train_mask = (clutch['nonclutch_min'] <= 4000) & (clutch['nonclutch_min'] >= 750)
    >>> train = clutch[train_mask]
    >>> test = clutch[~train_mask]
    >>> X_train = train[['nonclutch_min']]
    >>> y_train = train['clutch_min']
    >>> X_test = test[['nonclutch_min']]
    >>> y_test = test['clutch_min']
    >>> gbm = HistGradientBoostingRegressor(max_iter=500, random_state=42).fit(X_train, y_train)
    >>> lrb = LRBoostRegressor(secondary_model=HistGradientBoostingRegressor(max_iter=500, random_state=42)).fit(X_train, y_train)
    >>> print(f"Ridge RMSE: {round(mean_squared_error(lrb.primary_model.predict(X_test), y_test), 2)}")
    Ridge RMSE: 1385.81
    >>> print(f"HistGradientBoostingRegressor RMSE: {round(mean_squared_error(gbm.predict(X_test), y_test), 2)}")
    HistGradientBoostingRegressor RMSE: 3145.87
    >>> print(f"lrboost RMSE: {round(mean_squared_error(lrb.predict(X_test), y_test), 2)}")
    lrboost RMSE: 1080.42

    If we instead use a standard random train/test split, lrboost still performs well.

    Ridge RMSE: 570.01
    HistGradientBoostingRegressor RMSE: 743.66
    lrboost RMSE: 733.4

Model Comparison - Example 2

  • The following are some simple examples using the concrete compressive strength data from Zhang et al. (2019).

    >>> import pandas as pd
    >>> import numpy as np
    >>> from sklearn.metrics import mean_squared_error
    >>> from sklearn.ensemble import HistGradientBoostingRegressor
    >>> from sklearn.linear_model import RidgeCV
    >>> from sklearn.model_selection import train_test_split
    >>> from lrboost import LRBoostRegressor
    >>> concrete = pd.read_csv("../examples/concrete_data.csv")
    >>> features = ['cement', 'slag', 'fly_ash', 'water', 'superplastic', 'coarse_agg', 'fine_agg', 'age', 'cw_ratio']
    >>> target = 'ccs'
    >>> def evaluate_models(X_train, X_test, y_train, y_test):
    ...     lrb = LRBoostRegressor(primary_model=RidgeCV(alphas=np.logspace(-4, 3, 10, endpoint=True)))
    ...     lrb.fit(X_train, y_train.ravel())
    ...     detailed_predictions = lrb.predict(X_test, detail=True)
    ...     primary_predictions = detailed_predictions['primary_prediction']
    ...     lrb_predictions = detailed_predictions['final_prediction']
    ...     hgb = HistGradientBoostingRegressor()
    ...     hgb.fit(X_train, y_train.ravel())
    ...     hgb_predictions = hgb.predict(X_test)
    ...     print(f"Ridge RMSE: {round(mean_squared_error(primary_predictions, y_test.ravel()), 2)}")
    ...     print(f"HistGradientBoostingRegressor RMSE: {round(mean_squared_error(hgb_predictions, y_test.ravel()), 2)}")
    ...     print(f"lrboost RMSE: {round(mean_squared_error(lrb_predictions, y_test.ravel()), 2)}")

    >>> # Scenario 1: 75/25 train/test (interpolation)
    >>> X_train, X_test, y_train, y_test = train_test_split(concrete[features], concrete[target], train_size=0.75, random_state=100)
    >>> evaluate_models(X_train, X_test, y_train, y_test)
    Ridge RMSE: 112.4
    HistGradientBoostingRegressor RMSE: 26.33
    lrboost RMSE: 25.06

    >>> # Scenario 2: 50/50 train/test (interpolation)
    >>> X_train, X_test, y_train, y_test = train_test_split(concrete[features], concrete[target], train_size=0.50, random_state=100)
    >>> evaluate_models(X_train, X_test, y_train, y_test)
    Ridge RMSE: 107.6
    HistGradientBoostingRegressor RMSE: 26.6
    lrboost RMSE: 23.55

    >>> # Scenario 3: training on CCS > 25, testing on CCS <= 25 (extrapolation)
    >>> train = concrete.loc[concrete['ccs'] > 25]
    >>> test = concrete.loc[concrete['ccs'] <= 25]
    >>> X_train = train[features]
    >>> y_train = train[target]
    >>> X_test = test[features]
    >>> y_test = test[target]
    >>> evaluate_models(X_train, X_test, y_train, y_test)
    Ridge RMSE: 89.26
    HistGradientBoostingRegressor RMSE: 4.21
    lrboost RMSE: 3.7

  • With zero tuning of either the internal GBDT that lrboost fits to the residual or the "standalone" GBDT baseline, lrboost performs well.