
Calling XGBModel.fit() should clear the Booster by default #6562

Merged (4 commits) Dec 31, 2020

Conversation

@hcho3 (Collaborator) commented Dec 31, 2020

Closes #6536

Calling fit() twice with a scikit-learn model object should cause the model to be cleared, e.g.

import xgboost as xgb
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

clf = xgb.XGBRegressor(n_estimators=5, max_depth=3, tree_method='hist', objective='reg:squarederror')
clf.fit(X, y, eval_set=[(X, y)])
clf.fit(X, y, eval_set=[(X, y)])  # should start training from scratch

Expected output:

[0]	validation_0-rmse:17.08995
[1]	validation_0-rmse:12.32976
[2]	validation_0-rmse:9.03587
[3]	validation_0-rmse:6.76642
[4]	validation_0-rmse:5.22241
[0]	validation_0-rmse:17.08995
[1]	validation_0-rmse:12.32976
[2]	validation_0-rmse:9.03587
[3]	validation_0-rmse:6.76642
[4]	validation_0-rmse:5.22241

Currently, XGBoost will automatically pick up the existing Booster object and use it as a checkpoint:

[0]	validation_0-rmse:17.08995
[1]	validation_0-rmse:12.32976
[2]	validation_0-rmse:9.03587
[3]	validation_0-rmse:6.76642
[4]	validation_0-rmse:5.22241
[0]	validation_0-rmse:4.18100
[1]	validation_0-rmse:3.46472
[2]	validation_0-rmse:3.02611
[3]	validation_0-rmse:2.74987
[4]	validation_0-rmse:2.54619

This behavior is not desired for most scikit-learn use cases. For example, scikit-learn's cross-validation utilities call fit() multiple times, with the expectation that the underlying model will be cleared on every invocation of fit().

This PR fixes the behavior of fit() as follows:

  • Calling fit() will clear any existing model. This behavior is now documented in the docstring of fit().
  • Users must explicitly specify the parameter xgb_model to resume training from a checkpoint.
  • The xgb_model parameter can now be an XGBModel object.

@hcho3 (Collaborator, Author) commented Dec 31, 2020

@pseudotensor Can you review? I'd like to hear whether this fix is in line with your expectation of how fit() should behave.

@pseudotensor (Contributor) left a comment


Thanks, LGTM.

@codecov-io commented Dec 31, 2020

Codecov Report

Merging #6562 (b4bbd9d) into master (8ad22bf) will decrease coverage by 0.03%.
The diff coverage is 60.00%.


@@            Coverage Diff             @@
##           master    #6562      +/-   ##
==========================================
- Coverage   80.15%   80.11%   -0.04%     
==========================================
  Files          13       13              
  Lines        3588     3591       +3     
==========================================
+ Hits         2876     2877       +1     
- Misses        712      714       +2     
Impacted Files Coverage Δ
python-package/xgboost/sklearn.py 89.11% <60.00%> (-0.38%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8ad22bf...b4bbd9d.

@hcho3 hcho3 merged commit fa13992 into dmlc:master Dec 31, 2020
@hcho3 hcho3 deleted the fix_sklearn_fit branch December 31, 2020 19:02
Successfully merging this pull request may close these issues.

re-use of same model automatically continues training even if not desired
4 participants