Exposes thread_count for Catboost estimators as n_jobs parameters and n_jobs as a keyword argument for XGBoost #2410
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##            main    #2410    +/-   ##
=======================================
+ Coverage   99.6%    99.7%    +0.1%
=======================================
  Files        283      283
  Lines      25486    25539      +53
=======================================
+ Hits       25384    25437      +53
  Misses       102      102
```
Continue to review full report at Codecov.
jeremyliweishih
left a comment
I think this looks good! Re: your UX question, I think it's fine the way it is, but maybe we can add an explanation to the docstrings of the catboost components for clarity.
freddyaboulton
left a comment
@angela97lin I think this looks great! I have two points I'd like to resolve before merging:
1. Can you add coverage to the `AutoMLSearch`/`IterativeAlgorithm` tests that xgboost and catboost are initialized with the right value of `n_jobs`? I don't think this line was working as intended for those estimators before this PR, right? https://github.com/alteryx/evalml/blob/main/evalml/automl/automl_algorithm/iterative_algorithm.py#L258
2. Since we'll now be running xgboost with `n_jobs=-1` by default and you noted that can slow down xgboost, how much slower will `AutoMLSearch` be after this PR? Might be good to quantify that via perf tests so we can tag #2437 as high priority.
Let me know what you think!
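The coverage being requested could be sketched roughly as below. This is a hypothetical illustration, not evalml's actual test suite: `MockEstimator` and `build_next_pipeline_estimator` are stand-in names for the estimator classes and the `n_jobs` forwarding done in `iterative_algorithm.py`.

```python
# Hypothetical sketch of the requested coverage: verify that an estimator
# constructed through an automl step actually receives the configured n_jobs.
class MockEstimator:
    def __init__(self, n_jobs=-1):
        self.n_jobs = n_jobs


def build_next_pipeline_estimator(estimator_cls, n_jobs):
    # Stand-in for the line in iterative_algorithm.py that forwards n_jobs
    # into the estimator's keyword arguments.
    return estimator_cls(n_jobs=n_jobs)


estimator = build_next_pipeline_estimator(MockEstimator, n_jobs=4)
assert estimator.n_jobs == 4
```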
evalml/pipelines/components/estimators/classifiers/catboost_classifier.py
bchen1116
left a comment
LGTM! Nice tests and documentation!
```diff
 )
 xgb = import_or_raise("xgboost", error_msg=xgb_error_msg)
-xgb_Regressor = xgb.XGBRegressor(random_state=random_seed, **parameters)
+xgb_regressor = xgb.XGBRegressor(random_state=random_seed, **parameters)
```
Closes #1475.
Some UX questions and concerns:
- CatBoost: What happens if a user sets both `thread_count` and `n_jobs`? Currently, we will only use the `n_jobs` parameter. That means if a user sets `thread_count` to 2 but does not set `n_jobs` (which defaults to -1), we will still use -1. I think this is okay given that our API specifies `n_jobs` as the parameter to use for multicore processing, but this could be a bit confusing (especially given that XGBoost/CatBoost use threads while scikit-learn uses processes). Image after exposing `n_jobs` for CatBoost:

  
- XGBoost: Since we use the Python scikit-learn interface, `n_jobs` is already implemented; this PR just exposes `n_jobs` as a named argument :) It's interesting to note that XGBoost takes more time using all threads than with just two. There is some documentation about how thread contention slows down the algorithm, which probably plays a role in this. See the image below for how the runtime slows down after setting >16 threads. I filed #2437 to further track this, as it could be useful to investigate this with more datasets and runs.
Relevant links: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.core; dmlc/xgboost#3810