
Calling XGBModel.fit() should clear the Booster by default #6562

Merged (4 commits) Dec 31, 2020

Conversation

@hcho3 (Collaborator) commented Dec 31, 2020

Closes #6536

Calling fit() twice with a scikit-learn model object should cause the model to be cleared, e.g.

import xgboost as xgb
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

clf = xgb.XGBRegressor(n_estimators=5, max_depth=3, tree_method='hist', objective='reg:squarederror')
clf.fit(X, y, eval_set=[(X, y)])
clf.fit(X, y, eval_set=[(X, y)])  # should start training from scratch

Expected output:

[0]	validation_0-rmse:17.08995
[1]	validation_0-rmse:12.32976
[2]	validation_0-rmse:9.03587
[3]	validation_0-rmse:6.76642
[4]	validation_0-rmse:5.22241
[0]	validation_0-rmse:17.08995
[1]	validation_0-rmse:12.32976
[2]	validation_0-rmse:9.03587
[3]	validation_0-rmse:6.76642
[4]	validation_0-rmse:5.22241

Currently, XGBoost will automatically pick up the existing Booster object and use it as a checkpoint:

[0]	validation_0-rmse:17.08995
[1]	validation_0-rmse:12.32976
[2]	validation_0-rmse:9.03587
[3]	validation_0-rmse:6.76642
[4]	validation_0-rmse:5.22241
[0]	validation_0-rmse:4.18100
[1]	validation_0-rmse:3.46472
[2]	validation_0-rmse:3.02611
[3]	validation_0-rmse:2.74987
[4]	validation_0-rmse:2.54619

This behavior is not desired for most scikit-learn use cases. For example, scikit-learn's cross-validation utilities call fit() multiple times, with the expectation that the underlying model will be cleared on every invocation of fit().

This PR fixes the behavior of fit() as follows:

  • Calling fit() will clear any existing model. This behavior is now documented in the docstring of fit().
  • Users must explicitly specify the parameter xgb_model to resume training from a checkpoint.
  • The xgb_model parameter can now be an XGBModel object.

@hcho3 (Collaborator, Author) commented Dec 31, 2020

@pseudotensor Can you review? I'd like to hear whether this fix is in line with your expectation of how fit() should behave.

@pseudotensor (Contributor) left a comment


Thanks, LGTM.

@codecov-io commented Dec 31, 2020

Codecov Report

Merging #6562 (b4bbd9d) into master (8ad22bf) will decrease coverage by 0.03%.
The diff coverage is 60.00%.


@@            Coverage Diff             @@
##           master    #6562      +/-   ##
==========================================
- Coverage   80.15%   80.11%   -0.04%     
==========================================
  Files          13       13              
  Lines        3588     3591       +3     
==========================================
+ Hits         2876     2877       +1     
- Misses        712      714       +2     
Impacted Files Coverage Δ
python-package/xgboost/sklearn.py 89.11% <60.00%> (-0.38%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8ad22bf...b4bbd9d.

@hcho3 hcho3 merged commit fa13992 into dmlc:master Dec 31, 2020
@hcho3 hcho3 deleted the fix_sklearn_fit branch December 31, 2020 19:02
Successfully merging this pull request may close these issues.

re-use of same model automatically continues training even if not desired
4 participants