Add CatBoost component and pipeline #247

angela97lin · 2019-12-03T20:45:43Z

Closes #200

Notes / to discuss:

It is better for performance not to encode categorical features (CatBoost can handle them natively), but this limits our imputation strategies to those that work with both numeric and string values (currently only most_frequent). Is this too limiting? (It would be nice to apply different imputation strategies to different columns...)
Similarly, while CatBoost doesn't require encoding categorical features, adding an RF feature selector to our pipeline would require encoding them... I'm not sure that SelectFromModel with CatBoost integrates smoothly, because of how sklearn clones the estimator. Perhaps we can add a feature selection method that doesn't require an estimator (separate PR?)
Which parameters should we expose to the user? CatBoost has aliases for many of them, but I chose the names that best align with our current API to avoid confusion.
CatBoost supports only numerical target labels, so some fiddling was required; see the issue "CatBoostClassifier predict() returns type np.float64 irrespective of input type" (catboost/catboost#305).
CatBoost supports snapshotting, which lets a model continue training from a saved point. I've turned this off for now so that we can remove the catboost_info metadata it produces, but it's good to know about. Could we use this when we're considering distributed training? Or at least follow a similar approach?
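
To illustrate the target-label fiddling mentioned above, here is a minimal sketch of one way to round-trip non-numeric labels. It is not the code in this PR: `LabelCodec` and its methods are hypothetical names, and the float predictions are simulated rather than produced by CatBoost.

```python
from typing import Hashable, Iterable, List


class LabelCodec:
    """Hypothetical helper: maps arbitrary target labels to integer codes and back."""

    def __init__(self, labels: Iterable[Hashable]) -> None:
        # Sort distinct labels by their string form so the mapping is deterministic.
        self.classes_ = sorted(set(labels), key=str)
        self._to_code = {label: code for code, label in enumerate(self.classes_)}

    def encode(self, labels: Iterable[Hashable]) -> List[int]:
        # Numeric codes to hand to the estimator's fit().
        return [self._to_code[label] for label in labels]

    def decode(self, codes: Iterable[float]) -> List[Hashable]:
        # Predictions may come back as floats (e.g. 1.0), so round to int first.
        return [self.classes_[int(round(code))] for code in codes]


y = ["spam", "ham", "spam", "eggs"]
codec = LabelCodec(y)
y_encoded = codec.encode(y)

# Stand-in for an estimator that returns float predictions regardless of input type.
fake_predictions = [float(code) for code in y_encoded]
y_decoded = codec.decode(fake_predictions)
```

Keeping the mapping on the component (rather than mutating the caller's data) would let predict() return labels in the same form the user passed to fit(), though the actual component may handle this differently.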

codecov · 2019-12-04T16:25:11Z

Codecov Report

Merging #247 into master will increase coverage by 0.15%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #247      +/-   ##
==========================================
+ Coverage   97.14%   97.29%   +0.15%     
==========================================
  Files          96      102       +6     
  Lines        2974     3180     +206     
==========================================
+ Hits         2889     3094     +205     
- Misses         85       86       +1

Impacted Files	Coverage Δ
evalml/pipelines/components/estimators/__init__.py	`100% <ø> (ø)`	⬆️
evalml/pipelines/pipeline_base.py	`98.6% <ø> (ø)`	⬆️
evalml/pipelines/components/__init__.py	`100% <ø> (ø)`	⬆️
...eature_selection/rf_classifier_feature_selector.py	`100% <100%> (ø)`	⬆️
...lines/components/estimators/regressors/__init__.py	`100% <100%> (ø)`	⬆️
evalml/pipelines/__init__.py	`100% <100%> (ø)`	⬆️
evalml/utils/gen_utils.py	`100% <100%> (ø)`	⬆️
evalml/tests/pipeline_tests/test_rf_regression.py	`100% <100%> (ø)`	⬆️
evalml/pipelines/classification/__init__.py	`100% <100%> (ø)`	⬆️
...sts/pipeline_tests/test_catboost_classification.py	`100% <100%> (ø)`
... and 26 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cb5ff6f...758a577.

test-requirements.txt

evalml/pipelines/components/estimators/classifiers/catboost_classifier.py

evalml/pipelines/components/estimators/regressors/catboost_regressor.py

evalml/pipelines/pipeline_base.py

evalml/pipelines/regression/catboost.py

evalml/tests/conftest.py

evalml/tests/pipeline_tests/test_catboost_classification.py

dsherry · 2020-02-07T04:50:35Z

This is great! I left a bunch of little comments, but the only major change request I had was for the bootstrap_type parameter. I will go over the tests some more too, but after that I'd say let's ship it!

evalml/tests/pipeline_tests/test_xgboost.py

requirements.txt

evalml/pipelines/components/estimators/classifiers/catboost_classifier.py

dsherry

Nice work! LGTM 👍

dsherry

Oops, I thought I approved this yesterday, but I still saw it in the list on slack. Re-approved!

angela97lin added 5 commits December 3, 2019 12:06

cat init wip

25c105f

imports and all

8f0e936

update

e0e5d8c

pipelibe init

dac6333

install

03c773b

angela97lin added 22 commits December 4, 2019 12:26

test

d1e39f8

adding line to remove catboost output

c9c92e1

merging master

7f0f1f9

cat cat

9208514

lint

80cc9e2

adding regression + cat cleanup

697a20f

lint

3eb43e4

fixing feature importance test

2dbf9f2

fixing str

08186f2

adding to autobase

8f5ee81

Merge branch 'master' into cat

b2b3f4b

heavily WIP, testing feature selector

da2546a

Merge branch 'cat' of github.com:FeatureLabs/evalml into cat

9660f58

adding num_features for now to pipeline

c6bc671

Merge branch 'master' of github.com:FeatureLabs/evalml into cat

2e26550

Merge branch 'master' into cat

15ca3bd

cleaning up errors

7446fc3

Merge branch 'cat' of github.com:FeatureLabs/evalml into cat

40fa7e9

removing from autobase to test 3.5 error

403ce03

Merge branch 'master' into cat

b889244

tested locally and ok, retrying

91c24f5

Merge branch 'cat' of github.com:FeatureLabs/evalml into cat

3b42d30

angela97lin self-assigned this Dec 9, 2019

dsherry reviewed Dec 9, 2019

test-requirements.txt