Skip to content

Exposes thread_count for Catboost estimators as n_jobs parameters and n_jobs as a keyword argument for XGBoost#2410

Merged
angela97lin merged 24 commits intomainfrom
1475_n_threads
Jul 2, 2021
Merged

Exposes thread_count for Catboost estimators as n_jobs parameters and n_jobs as a keyword argument for XGBoost#2410
angela97lin merged 24 commits intomainfrom
1475_n_threads

Conversation

@angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Jun 21, 2021

Closes #1475.

Some UX questions and concerns:

  • What happens if user specifies thread_count and n_jobs? Currently, will only use the n_jobs parameter. That means if a user sets thread_count to 2, but does not set n_jobs which defaults to -1, we will still use -1. I think this is okay given that our API specifies n_jobs as the parameter to use for multicore processing, but this could be a bit confusing (especially given XGBoost/CatBoost use threads, scikit-learn uses processes).

Image after exposing n_jobs for CatBoost:
image

XGBoost: Since we use the python scikit-learn interface, n_jobs is already implemented. This PR just exposes n_jobs as a named argument :)


It's interesting to note that XGBoost takes more time using all threads than with just two. There is some documentation about how thread contention slows down the algorithm which probably plays a role in this. See image below for how the runtime slows down after setting >16 threads. I filed #2437 to further track this, as it could be useful to investigate this with more datasets and runs.

Relevant links: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.core; dmlc/xgboost#3810

image

@angela97lin angela97lin self-assigned this Jun 21, 2021
@codecov
Copy link

codecov bot commented Jun 21, 2021

Codecov Report

Merging #2410 (30ba35d) into main (ef06526) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2410     +/-   ##
=======================================
+ Coverage   99.6%   99.7%   +0.1%     
=======================================
  Files        283     283             
  Lines      25486   25539     +53     
=======================================
+ Hits       25384   25437     +53     
  Misses       102     102             
Impacted Files Coverage Δ
...nents/estimators/classifiers/xgboost_classifier.py 100.0% <ø> (ø)
evalml/tests/component_tests/test_components.py 100.0% <ø> (ø)
...ents/estimators/classifiers/catboost_classifier.py 100.0% <100.0%> (ø)
...onents/estimators/regressors/catboost_regressor.py 100.0% <100.0%> (ø)
...ponents/estimators/regressors/xgboost_regressor.py 100.0% <100.0%> (ø)
...lml/tests/automl_tests/test_iterative_algorithm.py 100.0% <100.0%> (ø)
.../tests/component_tests/test_catboost_classifier.py 100.0% <100.0%> (ø)
...l/tests/component_tests/test_catboost_regressor.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef06526...30ba35d. Read the comment docs.

@angela97lin angela97lin changed the title Exposes n_threads for Catboost and XGBoost n_jobs parameters Exposes thread_count for Catboost and and nthread for XGBoost as n_jobs parameters Jun 23, 2021
@angela97lin angela97lin changed the title Exposes thread_count for Catboost and and nthread for XGBoost as n_jobs parameters Exposes thread_count for Catboost estimators as n_jobs parameters Jun 23, 2021
@angela97lin angela97lin marked this pull request as ready for review June 23, 2021 20:19
Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good! RE your UX question I think its fine the way it is but maybe we can add an explanation in the docstring of the catboost components for clarity.

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin I think this looks great! I have two points I'd like to resolve before merging:

  1. Can you add coverage to the AutoMLSearch/IterativeAlgorithm that xgboost and catboost are initialized with the right value of n_jobs? I don't think this line was working as intended for those estimators before this pr right? https://github.com/alteryx/evalml/blob/main/evalml/automl/automl_algorithm/iterative_algorithm.py#L258

  2. Since we'll now be running xgboost with n_jobs=-1 by default and you noted that can slow down xgboost, how much slower will AutoMLSearch after this PR? Might be good to quantify that via perf tests so we can tag #2437 as high priority.

Let me know what you think!

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice tests and documentation!

)
xgb = import_or_raise("xgboost", error_msg=xgb_error_msg)
xgb_Regressor = xgb.XGBRegressor(random_state=random_seed, **parameters)
xgb_regressor = xgb.XGBRegressor(random_state=random_seed, **parameters)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

@angela97lin angela97lin changed the title Exposes thread_count for Catboost estimators as n_jobs parameters Exposes thread_count for Catboost estimators as n_jobs parameters and n_jobs as a keyword argument for XGBoost Jun 25, 2021
Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add thread count parameter to Catboost/XGBoost estimators

5 participants