This is a stable release of version 0.81.
New feature: feature interaction constraints
- Users are now able to control which features (independent variables) are allowed to interact by specifying feature interaction constraints (#3466).
- Tutorial is available, as well as R and Python examples.
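As a rough sketch of what a constraint specification can look like, the Python snippet below builds a parameter dictionary in which features 0 and 2 may interact with each other, and features 1, 3 and 4 may interact among themselves. The parameter name interaction_constraints and its nested-list string encoding are assumptions taken from the tutorial; make_constrained_params is a hypothetical helper, not part of XGBoost.

```python
import json


def make_constrained_params(groups, base_params=None):
    """Return a copy of base_params with an interaction_constraints entry.

    groups is a list of lists of feature indices; a feature is allowed to
    interact only with features that share an inner list with it.
    """
    params = dict(base_params or {})
    # Assumed encoding: a string of nested lists, e.g. '[[0, 2], [1, 3, 4]]'.
    params["interaction_constraints"] = json.dumps(groups)
    return params


base = {"eta": 0.1, "max_depth": 5}
params = make_constrained_params([[0, 2], [1, 3, 4]], base_params=base)
print(params["interaction_constraints"])
```

The resulting dictionary would then be passed to xgboost.train() together with a DMatrix, as with any other set of training parameters.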
New feature: learning to rank using scikit-learn interface
- Learning to rank task is now available for the scikit-learn interface of the Python package (#3560, #3848). It is now possible to integrate the XGBoost ranking model into the scikit-learn learning pipeline.
- An example of using the XGBRanker class is found at demo/rank/rank_sklearn.py.
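For orientation, here is a minimal sketch of how ranking data might be handed to XGBRanker. It assumes the rows are already sorted so that each query's documents are contiguous, and that fit() accepts a group argument holding the size of each consecutive query group. group_sizes and fit_ranker are hypothetical helpers written for this illustration.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Count consecutive rows per query; rows must already be sorted by query."""
    return [sum(1 for _ in rows) for _, rows in groupby(query_ids)]


def fit_ranker(X, y, query_ids, **ranker_kwargs):
    """Fit an XGBRanker on rows grouped by query.

    xgboost is imported lazily so that group_sizes() stays usable on its own.
    """
    from xgboost import XGBRanker

    ranker = XGBRanker(**ranker_kwargs)
    ranker.fit(X, y, group=group_sizes(query_ids))
    return ranker


sizes = group_sizes(["q1", "q1", "q1", "q2", "q3", "q3"])
print(sizes)  # [3, 1, 2]
```

Because the group sizes only describe consecutive runs, any shuffling of the rows before fitting would break the correspondence between rows and queries.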
New feature: R interface for SHAP interactions
- SHAP (SHapley Additive exPlanations) is a unified approach to explain the output of any machine learning model. Previously, this feature was only available from the Python package; now it is available from the R package as well (#3636).
New feature: GPU predictor now uses multiple GPUs to predict
- GPU predictor is now able to utilize multiple GPUs at once to accelerate prediction (#3738)
New feature: Scale distributed XGBoost to large-scale clusters
- Fix OS file descriptor limit assertion error on large cluster (#3835, dmlc/rabit#73) by replacing select()-based AllReduce/Broadcast with poll()-based implementation.
- Mitigate tracker "thundering herd" issue on large cluster. Add exponential backoff retry when workers connect to tracker.
- With this change, we were able to scale to 1.5k executors on a 12 billion row dataset after some tweaks here and there.
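The exponential backoff idea mentioned above can be sketched in a few lines of Python. This is an illustration of the general technique only, not the Rabit implementation: connect_with_backoff, max_retries and base_delay are hypothetical names, and the choice of doubling the delay after each failure is an assumption.

```python
import time


def connect_with_backoff(connect, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))


# Simulated tracker that rejects the first three connection attempts.
attempts = {"count": 0}


def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] < 4:
        raise ConnectionError("tracker busy")
    return "connected"


delays = []
result = connect_with_backoff(flaky_connect, sleep=delays.append)
print(result, delays)  # connected [1.0, 2.0, 4.0]
```

Spreading retries out this way keeps a large number of workers from hitting the tracker again at the same instant, which is the "thundering herd" pattern the change targets.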
New feature: Additional objective functions for GPUs
- New objective functions ported to GPU: hinge, multi:softmax, multi:softprob, count:poisson, reg:gamma, reg:tweedie.
- With supported objectives, XGBoost will select the correct devices based on your system and n_gpus parameter.
Major bug fix: learning to rank with XGBoost4J-Spark
- Previously, repartitionForData would shuffle data and lose the ordering necessary for the ranking task.
- To fix this issue, data points within each RDD partition are explicitly grouped by their group (query session) IDs (#3654). Empty RDD partitions are also handled carefully (#3750).
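To make the grouping step concrete, the following Python sketch regroups the rows of a single partition so that every group ID forms one contiguous run. It is a conceptual illustration only and is not taken from the XGBoost4J-Spark code (which is written in Scala); group_partition_rows and the (group_id, payload) row layout are hypothetical.

```python
from collections import OrderedDict


def group_partition_rows(rows, group_id_of):
    """Return rows regrouped so that each group ID forms one contiguous run.

    Groups appear in order of first occurrence, and rows keep their
    relative order within a group. An empty partition yields an empty list.
    """
    groups = OrderedDict()
    for row in rows:
        groups.setdefault(group_id_of(row), []).append(row)
    return [row for members in groups.values() for row in members]


partition = [(7, "a"), (3, "b"), (7, "c"), (3, "d"), (9, "e")]
grouped = group_partition_rows(partition, group_id_of=lambda row: row[0])
print(grouped)  # [(7, 'a'), (7, 'c'), (3, 'b'), (3, 'd'), (9, 'e')]
```

Keeping each query session contiguous is what allows per-group sizes to be derived afterwards, which ranking objectives depend on.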
Major bug fix: early stopping fixed in XGBoost4J-Spark
- The earlier implementation of early stopping had incorrect semantics and didn't let users specify the optimization direction (maximize / minimize).
- A parameter maximize_evaluation_metrics is defined to tell whether a metric should be maximized or minimized as part of the early stopping criteria (#3808). Early stopping now also has correct semantics.
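Why the direction matters can be seen in a small Python sketch of a generic early stopping check. This is an illustration of the semantics under assumed names, not the XGBoost4J-Spark implementation: should_stop and patience are hypothetical, and the rule used (stop when the most recent rounds fail to beat the best earlier score) is one common formulation.

```python
def should_stop(scores, patience, maximize):
    """Return True when the last `patience` rounds brought no improvement.

    scores holds one evaluation metric value per boosting round.
    """
    if len(scores) <= patience:
        return False  # not enough history to compare yet
    pick = max if maximize else min
    best_before = pick(scores[:-patience])
    recent_best = pick(scores[-patience:])
    if maximize:
        return recent_best <= best_before
    return recent_best >= best_before


# A steadily increasing metric, e.g. AUC.
history = [0.70, 0.72, 0.74, 0.76]
stop_if_maximizing = should_stop(history, patience=2, maximize=True)
stop_if_minimizing = should_stop(history, patience=2, maximize=False)
print(stop_if_maximizing, stop_if_minimizing)  # False True
```

The same history keeps training alive when the metric is treated as something to maximize, and stops it when the metric is treated as an error to minimize, so the direction has to be stated explicitly.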
API changes
- Column sampling by level (colsample_bylevel) is now functional for hist algorithm (#3635, #3862)
- GPU tag gpu: for regression objectives is now deprecated. XGBoost will select the correct devices automatically (#3643)
- Add disable_default_eval_metric parameter to disable default metric (#3606)
- Experimental AVX support for gradient computation is removed (#3752)
- XGBoost4J-Spark
  - Add rank:ndcg and rank:map to supported objectives (#3697)
- Python package
  - Add callbacks argument to fit() function of scikit-learn API (#3682)
  - Add XGBRanker to scikit-learn interface (#3560, #3848)
  - Add validate_features argument to predict() function of scikit-learn API (#3653)
  - Allow scikit-learn grid search over parameters specified as keyword arguments (#3791)
  - Add coef_ and intercept_ as properties of scikit-learn wrapper (#3855). Some scikit-learn functions expect these properties.
Performance improvements
- Address very high GPU memory usage for large data (#3635)
- Fix performance regression within EvaluateSplits() of gpu_hist algorithm. (#3680)
Bug-fixes
- Fix a problem in GPU quantile sketch with tiny instance weights. (#3628)
- Fix copy constructor for HostDeviceVectorImpl to prevent dangling pointers (#3657)
- Fix a bug in partitioned file loading (#3673)
- Fixed an uninitialized pointer in gpu_hist (#3703)
- Reshared data among GPUs when number of GPUs is changed (#3721)
- Add back max_delta_step to split evaluation (#3668)
- Do not round up integer thresholds for integer features in JSON dump (#3717)
- Use dmlc::TemporaryDirectory to handle temporaries in cross-platform way (#3783)
- Fix accuracy problem with gpu_hist when min_child_weight and lambda are set to 0 (#3793)
- Make sure that tree_method parameter is recognized and not silently ignored (#3849)
- XGBoost4J-Spark
  - Make sure thresholds are considered when executing predict() method (#3577)
  - Avoid losing precision when computing probabilities by converting to Double early (#3576)
  - getTreeLimit() should return Int (#3602)
  - Fix checkpoint serialization on HDFS (#3614)
  - Throw ControlThrowable instead of InterruptedException so that it is properly re-thrown (#3632)
  - Remove extraneous output to stdout (#3665)
  - Allow specification of task type for custom objectives and evaluations (#3646)
  - Fix distributed updater check (#3739)
  - Fix issue when spark job execution thread cannot return before we execute first() (#3758)
- Python package
- R package
Maintenance: testing, continuous integration, build system
- Add sanitizers tests to Travis CI (#3557)
- Add NumPy, Matplotlib, Graphviz as requirements for doc build (#3669)
- Comply with CRAN submission policy (#3660, #3728)
- Remove copy-paste error in JVM test suite (#3692)
- Disable flaky tests in R-package/tests/testthat/test_update.R (#3723)
- Make Python tests compatible with scikit-learn 0.20 release (#3731)
- Separate out restricted and unrestricted tasks, so that pull requests don't build downloadable artifacts (#3736)
- Add multi-GPU unit test environment (#3741)
- Allow plug-ins to be built by CMake (#3752)
- Test wheel compatibility on CPU containers for pull requests (#3762)
- Fix broken doc build due to Matplotlib 3.0 release (#3764)
- Produce xgboost.so for XGBoost-R on Mac OSX, so that make install works (#3767)
- Retry Jenkins CI tests up to 3 times to improve reliability (#3769, #3769, #3775, #3776, #3777)
- Add basic unit tests for gpu_hist algorithm (#3785)
- Fix Python environment for distributed unit tests (#3806)
- Test wheels on CUDA 10.0 container for compatibility (#3838)
- Fix JVM doc build (#3853)
Maintenance: Refactor C++ code for legibility and maintainability
- Merge generic device helper functions into GPUSet class (#3626)
- Re-factor column sampling logic into ColumnSampler class (#3635, #3637)
- Replace std::vector with HostDeviceVector in MetaInfo and SparsePage (#3446)
- Simplify DMatrix class (#3395)
- De-duplicate CPU/GPU code using Transform class (#3643, #3751)
- Remove obsoleted QuantileHistMaker class (#3761)
- Remove obsoleted NoConstraint class (#3792)
Other Features
- C++20-compliant Span class for safe pointer indexing (#3548, #3588)
- Add helper functions to manipulate multiple GPU devices (#3693)
- XGBoost4J-Spark
  - Allow specifying host ip from the xgboost-tracker.properties file (#3833). This comes in handy when the hosts file doesn't correctly define localhost.
Usability Improvements
- Add reference to GitHub repository in pom.xml of JVM packages (#3589)
- Add R demo of multi-class classification (#3695)
- Document JSON dump functionality (#3600, #3603)
- Document CUDA requirement and lack of external memory for GPU algorithms (#3624)
- Document LambdaMART objectives, both pairwise and listwise (#3672)
- Document aucpr evaluation metric (#3687)
- Document gblinear parameters: feature_selector and top_k (#3780)
- Add instructions for using MinGW-built XGBoost with Python. (#3774)
- Removed nonexistent parameter use_buffer from documentation (#3610)
- Update Python API doc to include all classes and members (#3619, #3682)
- Fix typos and broken links in documentation (#3618, #3640, #3676, #3713, #3759, #3784, #3843, #3852)
- Binary classification demo should produce LIBSVM with 0-based indexing (#3652)
- Process data once for Python and CLI examples of learning to rank (#3666)
- Include full text of Apache 2.0 license in the repository (#3698)
- Save predictor parameters in model file (#3856)
- JVM packages
- Python package
  - Document that custom objective can't contain colon (:) (#3601)
  - Show a better error message for failed library loading (#3690)
  - Document that feature importance is unavailable for non-tree learners (#3765)
  - Document behavior of get_fscore() for zero-importance features (#3763)
  - Recommend pickling as the way to save XGBClassifier / XGBRegressor / XGBRanker (#3829)
- R package
  - Enlarge variable importance plot to make it more visible (#3820)
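The pickling recommendation for the scikit-learn wrappers (#3829) might look like the sketch below. save_estimator and load_estimator are hypothetical helpers built on the standard pickle module; they work for any picklable object, so a plain dictionary stands in for a fitted XGBClassifier here to keep the example independent of an installed XGBoost.

```python
import os
import pickle
import tempfile


def save_estimator(estimator, path):
    """Serialize a fitted estimator (or any picklable object) to path."""
    with open(path, "wb") as handle:
        pickle.dump(estimator, handle)


def load_estimator(path):
    """Load an object written by save_estimator(). Only unpickle trusted files."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a fitted XGBClassifier / XGBRegressor / XGBRanker.
stand_in = {"n_estimators": 100, "max_depth": 3}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    save_estimator(stand_in, model_path)
    restored = load_estimator(model_path)

print(restored == stand_in)  # True
```

Pickling keeps the wrapper's Python-side attributes together with the underlying booster, which is presumably why it is the recommended route for these classes.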
BREAKING CHANGES
- External memory page files have changed, breaking backwards compatibility for temporary storage used during external memory training. This only affects external memory users upgrading their xgboost version - we recommend clearing all *.page files before resuming training. Model serialization is unaffected.
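One way to clear stale page files before resuming training is sketched below in Python. clear_page_files is a hypothetical helper, and both the cache directory and the file names in the demonstration are made up; point it at whatever directory your external memory cache prefix refers to.

```python
import tempfile
from pathlib import Path


def clear_page_files(cache_dir):
    """Delete every *.page file directly under cache_dir; return removed names."""
    removed = []
    for page_file in sorted(Path(cache_dir).glob("*.page")):
        if page_file.is_file():
            page_file.unlink()
            removed.append(page_file.name)
    return removed


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.cache.page", "model.bin"):
        (Path(tmp) / name).write_bytes(b"")
    removed = clear_page_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, remaining)  # ['train.cache.page'] ['model.bin']
```

Only files ending in .page are touched, so saved models in the same directory are left alone, in line with the note that model serialization is unaffected.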
Known issues
- Quantile sketcher fails to produce any quantile for some edge cases (#2943)
- The hist algorithm leaks memory when used with learning rate decay callback (#3579)
- Using custom evaluation function together with early stopping causes assertion failure in XGBoost4J-Spark (#3595)
- Early stopping doesn't work with gblinear learner (#3789)
- Label and weight vectors are not reshared upon the change in number of GPUs (#3794). To get around this issue, delete the DMatrix object and re-load.
- The DMatrix Python objects are initialized with incorrect values when given array slices (#3841)
- The gpu_id parameter is broken and not yet properly supported (#3850)
Acknowledgement
Contributors (in no particular order): Hyunsu Cho (@hcho3), Jiaming Yuan (@trivialfis), Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), Andy Adinets (@canonizer), Vadim Khotilovich (@khotilov), Sergei Lebedev (@superbobry)
First-time Contributors (in no particular order): Matthew Tovbin (@tovbinm), Jakob Richter (@jakob-r), Grace Lam (@grace-lam), Grant W Schneider (@grantschneider), Andrew Thia (@BlueTea88), Sergei Chipiga (@schipiga), Joseph Bradley (@jkbradley), Chen Qin (@chenqin), Jerry Lin (@linjer), Dmitriy Rybalko (@rdtft), Michael Mui (@mmui), Takahiro Kojima (@515hikaru), Bruce Zhao (@BruceZhaoR), Wei Tian (@weitian), Saumya Bhatnagar (@Sam1301), Juzer Shakir (@JuzerShakir), Zhao Hang (@cleghom), Jonathan Friedman (@jontonsoup), Bruno Tremblay (@meztez), Boris Filippov (@frenzykryger), @Shiki-H, @mrgutkun, @gorogm, @htgeis, @jakehoare, @zengxy, @KOLANICH
First-time Reviewers (in no particular order): Nikita Titov (@StrikerRUS), Xiangrui Meng (@mengxr), Nirmal Borah (@Nirmal-Neel)