Add CatBoost component and pipeline #247

Merged
merged 102 commits into from Feb 13, 2020

@angela97lin angela97lin commented Dec 3, 2019

Closes #200

Notes / to discuss:

  • CatBoost can handle categorical features natively, and it performs better when they are not encoded. But leaving them unencoded limits our imputation strategies to those that work with both numeric and string values (currently most_frequent). Is this too limiting? (It would be nice to apply different impute strategies to different columns...)
  • Similarly, while CatBoost doesn't require encoding categorical features, if we add an RF feature selector to our pipeline, we would need to encode the features... I'm not sure that SelectFromModel with CatBoost integrates smoothly, because of how sklearn clones the estimator. Perhaps we can add a feature selection method that doesn't require an estimator (separate PR?)
  • Which parameters should we expose to the user? CatBoost has aliases for many of them, but I chose the names that best align with our current API to avoid confusion.
  • CatBoost supports only numerical target labels, so some fiddling was required. See the issue "CatBoostClassifier predict() returns type np.float64 irrespective of input type" (catboost/catboost#305).
  • CatBoost supports snapshotting, which allows models to continue training from a saved point. I've turned this off for now so that we can remove the catboost_info metadata it produces, but it's good to know about. Could we use this when we're considering distributed training? Or at least follow a similar approach?
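
As a rough illustration of two of the points above (applying a different impute strategy per column, and round-tripping non-numeric target labels through a model that returns floats), here is a minimal plain-Python sketch. It uses neither evalml nor CatBoost, and `impute_by_column` and `LabelCodec` are hypothetical names made up for this example. It is not the code in this PR.

```python
from collections import Counter
from statistics import mean


def impute_by_column(rows, strategies):
    """Fill None values using a per-column strategy.

    rows: list of dicts, one per sample.
    strategies: {column: "mean" | "most_frequent"}.
    Hypothetical helper for illustration; not an evalml API.
    """
    fill = {}
    for col, strategy in strategies.items():
        observed = [r[col] for r in rows if r[col] is not None]
        if strategy == "mean":
            fill[col] = mean(observed)  # numeric columns only
        elif strategy == "most_frequent":
            # Works for both numeric and string columns.
            fill[col] = Counter(observed).most_common(1)[0][0]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return [
        {c: (fill[c] if c in fill and v is None else v) for c, v in r.items()}
        for r in rows
    ]


class LabelCodec:
    """Map arbitrary target labels to integer codes and back (hypothetical)."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._to_code = {label: i for i, label in enumerate(self.classes_)}
        return self

    def encode(self, labels):
        return [self._to_code[label] for label in labels]

    def decode(self, predictions):
        # Predictions may come back as floats (e.g. 1.0), so cast to int
        # before looking up the original label.
        return [self.classes_[int(round(p))] for p in predictions]


rows = [
    {"age": 30, "color": "red"},
    {"age": None, "color": "red"},
    {"age": 50, "color": None},
]
imputed = impute_by_column(rows, {"age": "mean", "color": "most_frequent"})

codec = LabelCodec().fit(["spam", "ham", "spam"])
codes = codec.encode(["spam", "ham"])
labels = codec.decode([float(c) for c in codes])  # simulate float predictions
```

The idea is that the numeric column gets a numeric-only strategy while the string column falls back to the most frequent value, and that the original label type is recovered after prediction regardless of what the estimator returns.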

codecov bot commented Dec 4, 2019

Codecov Report

Merging #247 into master will increase coverage by 0.15%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #247      +/-   ##
==========================================
+ Coverage   97.14%   97.29%   +0.15%     
==========================================
  Files          96      102       +6     
  Lines        2974     3180     +206     
==========================================
+ Hits         2889     3094     +205     
- Misses         85       86       +1

| Impacted Files | Coverage Δ |
|---|---|
| evalml/pipelines/components/estimators/__init__.py | 100% <ø> (ø) ⬆️ |
| evalml/pipelines/pipeline_base.py | 98.6% <ø> (ø) ⬆️ |
| evalml/pipelines/components/__init__.py | 100% <ø> (ø) ⬆️ |
| ...eature_selection/rf_classifier_feature_selector.py | 100% <100%> (ø) ⬆️ |
| ...lines/components/estimators/regressors/__init__.py | 100% <100%> (ø) ⬆️ |
| evalml/pipelines/__init__.py | 100% <100%> (ø) ⬆️ |
| evalml/utils/gen_utils.py | 100% <100%> (ø) ⬆️ |
| evalml/tests/pipeline_tests/test_rf_regression.py | 100% <100%> (ø) ⬆️ |
| evalml/pipelines/classification/__init__.py | 100% <100%> (ø) ⬆️ |
| ...sts/pipeline_tests/test_catboost_classification.py | 100% <100%> (ø) |
| ... and 26 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cb5ff6f...758a577.

@angela97lin angela97lin self-assigned this Dec 9, 2019

dsherry commented Feb 7, 2020

This is great! I left a bunch of little comments, but the only major change request I had was for the bootstrap_type parameter. I will go over the tests some more too, but after that I'd say let's ship it!

@dsherry dsherry left a comment

Nice work! LGTM 👍

@angela97lin angela97lin requested a review from dsherry Feb 12, 2020

@dsherry dsherry left a comment

Oops, I thought I approved this yesterday, but I still saw it in the list on Slack. Re-approved!

@angela97lin angela97lin merged commit d442066 into master Feb 13, 2020
2 checks passed
@dsherry dsherry mentioned this pull request Feb 13, 2020
@angela97lin angela97lin mentioned this pull request Mar 9, 2020
@angela97lin angela97lin deleted the cat branch Apr 17, 2020

Linked issue that merging this pull request may close: Add in one other modeling library