
Add ExtraTrees Pipeline #790

Merged
merged 38 commits into from
May 29, 2020

Conversation

Contributor

@eccabay eccabay commented May 21, 2020

Closes #771.


codecov bot commented May 21, 2020

Codecov Report

Merging #790 into master will increase coverage by 0.07%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #790      +/-   ##
==========================================
+ Coverage   99.56%   99.63%   +0.07%     
==========================================
  Files         161      170       +9     
  Lines        6404     6647     +243     
==========================================
+ Hits         6376     6623     +247     
+ Misses         28       24       -4     
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/estimators/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/utils.py 100.00% <ø> (ø)
evalml/model_family/model_family.py 100.00% <100.00%> (ø)
evalml/pipelines/classification/__init__.py 100.00% <100.00%> (ø)
...lml/pipelines/classification/extra_trees_binary.py 100.00% <100.00%> (ø)
...pipelines/classification/extra_trees_multiclass.py 100.00% <100.00%> (ø)
...ines/components/estimators/classifiers/__init__.py 100.00% <100.00%> (ø)
...components/estimators/classifiers/et_classifier.py 100.00% <100.00%> (ø)
... and 21 more


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c026a89...94759e8. Read the comment docs.

Collaborator

@jeremyliweishih jeremyliweishih left a comment

This looks good to me - great work! Should be ready to merge after Dylan gives it a look.


dsherry commented May 22, 2020

@eccabay I will review this in a bit!

@@ -7,12 +7,14 @@ class ModelFamily(Enum):
XGBOOST = 'xgboost'
LINEAR_MODEL = 'linear_model'
CATBOOST = 'catboost'
EXTRA_TREES = 'extra_trees'
Contributor

Yep. Makes sense to have a separate family here for now. In the new automl algo this will mean the first round runs both RF and extra trees. Down the road we may want to group tree-based models differently, but this is a great starting point. @jeremyliweishih

name = "Extra Trees Classifier"
hyperparameter_ranges = {
"n_estimators": Integer(10, 1000),
"max_depth": Integer(1, 32),
Contributor

We've had trouble with max_depth for catboost--for that model, fit time grows exponentially with depth. Catboost recommended a range of 4 to 10, which is what we use now. So I wonder if this limit should be lower. Will make a general comment about that in a bit.
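For reference, a narrower range along the catboost lines might look like this (a sketch only; plain tuples stand in for the skopt.space.Integer objects evalml actually uses, and the exact bounds are the open question here):

```python
# Hypothetical narrower bounds, mirroring the 4-10 depth range recommended
# for catboost; plain tuples stand in for skopt.space.Integer.
hyperparameter_ranges = {
    "n_estimators": (10, 1000),
    "max_depth": (4, 10),
}
```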

model_family = ModelFamily.EXTRA_TREES
supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

def __init__(self, n_estimators=10, max_depth=None, n_jobs=-1, random_state=0):
Contributor

@dsherry dsherry May 22, 2020

What other parameters does this model support? max_features, min_weight_fraction_leaf, max_leaf_nodes, criterion, anything having to do with the math would be helpful to add here.

Our reasoning here is that we want our component API to be an extension of the underlying models, not a reduction, so most of the things you could do in sklearn, we want to expose.

We do have an issue planned #775 which will add a **kwargs catchall to all components, but we should still make an effort to explicitly declare any parameters we deem important.

Note you shouldn't expose all of these parameters in hyperparameter_ranges for automl tuning -- we're going to do a lot more work on that once we have our new perf tests in place (which @jeremyliweishih is working on). But perhaps max_features should be added to the automl hyperparameters.
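As a sketch of what an expanded constructor might look like (parameter names are sklearn's; the class name and structure here are illustrative, and which subset to expose is exactly the question this comment raises):

```python
from sklearn.ensemble import ExtraTreesClassifier

class ExtraTreesClassifierComponent:
    """Illustrative sketch of a component exposing more of the underlying
    sklearn parameters, rather than a reduction of them."""
    def __init__(self, n_estimators=100, max_features="sqrt", max_depth=None,
                 min_samples_split=2, min_weight_fraction_leaf=0.0,
                 max_leaf_nodes=None, n_jobs=-1, random_state=0):
        # Everything the user passed is recorded so .parameters round-trips
        self.parameters = {
            "n_estimators": n_estimators,
            "max_features": max_features,
            "max_depth": max_depth,
            "min_samples_split": min_samples_split,
            "min_weight_fraction_leaf": min_weight_fraction_leaf,
            "max_leaf_nodes": max_leaf_nodes,
        }
        self._component_obj = ExtraTreesClassifier(
            random_state=random_state, n_jobs=n_jobs, **self.parameters)
```

(#775's **kwargs catchall would make the unlisted parameters reachable too, but the ones above seem worth declaring explicitly.)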

class ETBinaryClassificationPipeline(BinaryClassificationPipeline):
"""Extra Trees Pipeline for binary classification"""
custom_name = "Extra Trees Binary Classification Pipeline"
component_graph = ['One Hot Encoder', 'Simple Imputer', 'RF Classifier Select From Model', 'Extra Trees Classifier']
Contributor

Yep, mirrors what we have for other pipelines at the moment, 👍

Contributor

@dsherry dsherry May 29, 2020

@eccabay can you please remove the "select from model" components from this pipeline, and same for multiclass? RE #819
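A sketch of the simplified graph (the base class here is a stand-in so the snippet is self-contained; only the feature-selection step is dropped):

```python
class BinaryClassificationPipeline:  # stand-in for the evalml base class
    pass

class ETBinaryClassificationPipeline(BinaryClassificationPipeline):
    """Extra Trees Pipeline for binary classification"""
    custom_name = "Extra Trees Binary Classification Pipeline"
    # "RF Classifier Select From Model" removed per this review comment
    component_graph = ['One Hot Encoder', 'Simple Imputer', 'Extra Trees Classifier']
```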

Contributor

(Also for the regression pipeline)

def test_et_objective_tuning(X_y):
X, y = X_y

# The classifier predicts accurately with perfect confidence given the original data,
Contributor Author

@dsherry this is the test I was referring to: the model is too confident, and the assert at line 105 fails without this part here.

Contributor

Got it, thanks for calling that out.

So you're adding coverage which tests that the output of predict(X) vs predict(X, objective) is different? Those would differ if the binary classification threshold the pipeline sets is different from the one used by the underlying estimator (i.e. internally somewhere in sklearn's predict code). You may have already seen this, but check out the impl of BinaryClassificationPipeline.predict. Specifically, if self.threshold is None, we return self.estimator.predict(X_t).

I don't think we need coverage for this per-pipeline. I agree we should have a general pipeline test which covers this, and I believe we already do. So, assuming I'm not missing something here, my recommendation is to delete this test, since we already have the coverage elsewhere.

A test you could add here would be to mock the estimator's predict method and ensure it's passed the right values in each of these cases. That would be great coverage to have, but not required.
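A self-contained sketch of that mock pattern (the Pipeline class here is a hypothetical stand-in mirroring the threshold logic described above; a real test would mock the estimator on the actual evalml pipeline class instead):

```python
from unittest.mock import MagicMock

class Pipeline:
    """Hypothetical stand-in mirroring BinaryClassificationPipeline.predict."""
    def __init__(self, estimator):
        self.estimator = estimator
        self.threshold = None

    def predict(self, X, objective=None):
        X_t = X  # transforms omitted in this sketch
        if self.threshold is None:
            return self.estimator.predict(X_t)
        # thresholding on predict_proba would go here

estimator = MagicMock()
pipeline = Pipeline(estimator)
X = [[0, 1], [1, 0]]
pipeline.predict(X)
# The mock records exactly what the estimator was handed
estimator.predict.assert_called_once_with(X)
```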

def __init__(self,
n_estimators=100,
max_features="auto",
max_depth=None,
Contributor

@dsherry dsherry May 29, 2020

@eccabay I think we should expose max_depth as a tunable hyperparameter. We currently do so for our other tree-based models like xgboost/RF/catboost.

#793 introduces a change where the default value of every parameter exposed as a hyperparameter must fall within its tunable range. So I suggest defaulting max_depth to 6 here, since that's what the other models are set to.

Contributor

(Same applies to the regression estimator)
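Concretely, the constraint from #793 is just that the default sits inside the tunable range; a sketch (plain tuples standing in for skopt.space.Integer):

```python
# Default of 6 matches the other tree-based models and falls inside the range.
DEFAULT_MAX_DEPTH = 6
hyperparameter_ranges = {
    "n_estimators": (10, 1000),
    "max_depth": (1, 32),
}
low, high = hyperparameter_ranges["max_depth"]
assert low <= DEFAULT_MAX_DEPTH <= high
```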

n_jobs=-1,
random_state=0):
parameters = {"n_estimators": n_estimators,
"max_features": max_features}
Contributor

@ctduffy when you add max_depth, don't forget to also add it to the parameters dict. This is another bit of functionality which is necessary for #793.
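In sketch form (a hypothetical helper mirroring the component __init__, just to show max_depth flowing into the dict):

```python
def make_parameters(n_estimators=100, max_features="auto", max_depth=6):
    """Hypothetical helper mirroring the component __init__:
    max_depth now appears alongside the existing entries."""
    return {"n_estimators": n_estimators,
            "max_features": max_features,
            "max_depth": max_depth}
```

With max_depth in the dict, clf.parameters reports it and the #793 in-range check can see it.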

Contributor

(Same applies to the regression estimator)

Contributor

@dsherry I am happy to do this but this is what @eccabay was working on, do you want me to continue working on it instead?

Contributor

@dsherry dsherry May 29, 2020

@ctduffy @eccabay 🤦 augh, I got this mixed up with the ElasticNet PR, which you're working on. They both start with E and have two words 😂😂 So sorry about that!

That said, reviewing these comments may be helpful to you too, because I'll make similar ones on the ElasticNet PR if applicable!

Contributor

@ctduffy ctduffy May 29, 2020

@dsherry No worries, I get the two confused sometimes too. And yep, I have been looking at the comments and changing some things with ElasticNet (like the select-from-model removal).

clf.fit(X, y)
feature_importances = clf.feature_importances

np.testing.assert_almost_equal(sk_feature_importances, feature_importances, decimal=5)
Contributor

These tests are great!

One thing which you can add: check that calling estimator.parameters on the instance returns what you'd expect. This is partially covered in the components tests, but it would be helpful to have direct coverage here.

Contributor

And same for the regressor of course lol

"max_depth": 5
}

assert clf.parameters == expected_parameters
Contributor

👍

clf.fit(X, y)
y_pred = clf.predict(X)

objective = PrecisionMicro()
Contributor

Is this dataset multiclass? I had assumed it was binary. PrecisionMicro is a multiclass objective. Should still work but perhaps we should use Precision or a different objective like F1.

Contributor

Oh, I see you're just doing that to trigger the error. Got it. In that case, I suggest we split this into a separate test.

Contributor Author

You're right, the dataset is binary. Right below is a test that setting the objective to a multiclass one throws an error.

Contributor Author

Got it, can do


objective = PrecisionMicro()
with pytest.raises(ValueError, match="You can only use a binary classification objective to make predictions for a binary classification pipeline."):
clf.predict(X, objective)
Contributor

Nice!

assert len(clf.feature_importances) == len(X.columns)
assert not clf.feature_importances.isnull().all().all()
for col_name in clf.feature_importances["feature"]:
assert "col_" in col_name
Contributor

👍

Contributor

@dsherry dsherry left a comment

Left some comments on the testing, and comments about the parameter defaults and hyperparam ranges, but otherwise LGTM!

@eccabay eccabay merged commit 26fbdc6 into master May 29, 2020
@dsherry dsherry deleted the 771_extratrees_pipeline branch October 29, 2020 23:48