
Add ExtraTrees Pipeline #790

Merged
merged 38 commits into from
May 29, 2020

Conversation

Contributor

@eccabay eccabay commented May 21, 2020

Closes #771.


codecov bot commented May 21, 2020

Codecov Report

Merging #790 into master will increase coverage by 0.07%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #790      +/-   ##
==========================================
+ Coverage   99.56%   99.63%   +0.07%     
==========================================
  Files         161      170       +9     
  Lines        6404     6647     +243     
==========================================
+ Hits         6376     6623     +247     
+ Misses         28       24       -4     
Impacted Files Coverage Δ
evalml/pipelines/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/estimators/__init__.py 100.00% <ø> (ø)
evalml/pipelines/components/utils.py 100.00% <ø> (ø)
evalml/model_family/model_family.py 100.00% <100.00%> (ø)
evalml/pipelines/classification/__init__.py 100.00% <100.00%> (ø)
...lml/pipelines/classification/extra_trees_binary.py 100.00% <100.00%> (ø)
...pipelines/classification/extra_trees_multiclass.py 100.00% <100.00%> (ø)
...ines/components/estimators/classifiers/__init__.py 100.00% <100.00%> (ø)
...components/estimators/classifiers/et_classifier.py 100.00% <100.00%> (ø)
... and 21 more


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c026a89...94759e8. Read the comment docs.

Collaborator

@jeremyliweishih jeremyliweishih left a comment

This looks good to me - great work! Should be ready to merge after Dylan gives it a look.


dsherry commented May 22, 2020

@eccabay I will review this in a bit!

@@ -7,12 +7,14 @@ class ModelFamily(Enum):
XGBOOST = 'xgboost'
LINEAR_MODEL = 'linear_model'
CATBOOST = 'catboost'
EXTRA_TREES = 'extra_trees'
Contributor

Yep. Makes sense to have a separate family here for now. In the new automl algo this will mean the first round runs both RF and extra trees. Down the road we may want to group tree-based models differently, but this is a great starting point. @jeremyliweishih

name = "Extra Trees Classifier"
hyperparameter_ranges = {
"n_estimators": Integer(10, 1000),
"max_depth": Integer(1, 32),
Contributor

We've had trouble with max_depth for catboost--for that model, fit time grows exponentially with depth. Catboost recommended a range of 4 to 10, which is what we use now. So I wonder if this limit should be lower. Will make a general comment about that in a bit.
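For reference, a narrower range along the catboost lines might look like this (a sketch only; plain tuples stand in for the skopt.space.Integer objects evalml actually uses, and the exact bounds are the open question here):

```python
# Hypothetical narrower bounds, mirroring the 4-10 depth range recommended
# for catboost; plain tuples stand in for skopt.space.Integer.
hyperparameter_ranges = {
    "n_estimators": (10, 1000),
    "max_depth": (4, 10),
}
```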

model_family = ModelFamily.EXTRA_TREES
supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

def __init__(self, n_estimators=10, max_depth=None, n_jobs=-1, random_state=0):
Contributor

@dsherry dsherry May 22, 2020

What other parameters does this model support? max_features, min_weight_fraction_leaf, max_leaf_nodes, criterion, anything having to do with the math would be helpful to add here.

Our reasoning here is that we want our component API to be an extension of the underlying models, not a reduction, so most of the things you could do in sklearn, we want to expose.

We do have an issue planned #775 which will add a **kwargs catchall to all components, but we should still make an effort to explicitly declare any parameters we deem important.

Note you shouldn't expose all of these parameters in hyperparameter_ranges for automl tuning -- we're going to do a lot more work on that once we have our new perf tests in place (which @jeremyliweishih is working on). But perhaps max_features should be added to the automl hyperparameters.
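As a sketch of what an expanded constructor might look like (parameter names are sklearn's; the class name and structure here are illustrative, and which subset to expose is exactly the question this comment raises):

```python
from sklearn.ensemble import ExtraTreesClassifier

class ExtraTreesClassifierComponent:
    """Illustrative sketch of a component exposing more of the underlying
    sklearn parameters, rather than a reduction of them."""
    def __init__(self, n_estimators=100, max_features="sqrt", max_depth=None,
                 min_samples_split=2, min_weight_fraction_leaf=0.0,
                 max_leaf_nodes=None, n_jobs=-1, random_state=0):
        # Everything the user passed is recorded so .parameters round-trips
        self.parameters = {
            "n_estimators": n_estimators,
            "max_features": max_features,
            "max_depth": max_depth,
            "min_samples_split": min_samples_split,
            "min_weight_fraction_leaf": min_weight_fraction_leaf,
            "max_leaf_nodes": max_leaf_nodes,
        }
        self._component_obj = ExtraTreesClassifier(
            random_state=random_state, n_jobs=n_jobs, **self.parameters)
```

(#775's **kwargs catchall would make the unlisted parameters reachable too, but the ones above seem worth declaring explicitly.)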

class ETBinaryClassificationPipeline(BinaryClassificationPipeline):
"""Extra Trees Pipeline for binary classification"""
custom_name = "Extra Trees Binary Classification Pipeline"
component_graph = ['One Hot Encoder', 'Simple Imputer', 'RF Classifier Select From Model', 'Extra Trees Classifier']
Contributor

Yep, mirrors what we have for other pipelines at the moment, 👍

Contributor

@dsherry dsherry May 29, 2020

@eccabay can you please remove the "select from model" components from this pipeline, and same for multiclass? RE #819
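A sketch of the simplified graph (the base class here is a stand-in so the snippet is self-contained; only the feature-selection step is dropped):

```python
class BinaryClassificationPipeline:  # stand-in for the evalml base class
    pass

class ETBinaryClassificationPipeline(BinaryClassificationPipeline):
    """Extra Trees Pipeline for binary classification"""
    custom_name = "Extra Trees Binary Classification Pipeline"
    # "RF Classifier Select From Model" removed per this review comment
    component_graph = ['One Hot Encoder', 'Simple Imputer', 'Extra Trees Classifier']
```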

Contributor

(Also for the regression pipeline)

def test_et_objective_tuning(X_y):
X, y = X_y

# The classifier predicts accurately with perfect confidence given the original data,
Contributor Author

@dsherry this is the test I was referring to: the model is too confident, and the assert at line 105 fails without this part here.

Contributor

Got it, thanks for calling that out.

So you're adding coverage which tests that the output of predict(X) vs predict(X, objective) is different? Those would differ if the binary classification threshold the pipeline sets is different from the one used by the underlying estimator (i.e. internally somewhere in sklearn's predict code). You may have already seen this, but check out the impl of BinaryClassificationPipeline.predict. Specifically, if self.threshold is None, we return self.estimator.predict(X_t).

I don't think we need coverage for this per-pipeline. I agree we should have a general pipeline test which covers this, and I believe we already do. So, assuming I'm not missing something here, my recommendation is to delete this test, since we already have the coverage elsewhere.

A test you could add here would be to mock the estimator's predict method and ensure it's passed the right values in each of these cases. That would be great coverage to have, but not required.
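A self-contained sketch of that mock pattern (the Pipeline class here is a hypothetical stand-in mirroring the threshold logic described above; a real test would mock the estimator on the actual evalml pipeline class instead):

```python
from unittest.mock import MagicMock

class Pipeline:
    """Hypothetical stand-in mirroring BinaryClassificationPipeline.predict."""
    def __init__(self, estimator):
        self.estimator = estimator
        self.threshold = None

    def predict(self, X, objective=None):
        X_t = X  # transforms omitted in this sketch
        if self.threshold is None:
            return self.estimator.predict(X_t)
        # thresholding on predict_proba would go here

estimator = MagicMock()
pipeline = Pipeline(estimator)
X = [[0, 1], [1, 0]]
pipeline.predict(X)
# The mock records exactly what the estimator was handed
estimator.predict.assert_called_once_with(X)
```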

def __init__(self,
n_estimators=100,
max_features="auto",
max_depth=None,
Contributor

@dsherry dsherry May 29, 2020

@eccabay I think we should expose max_depth as a tunable hyperparameter. We currently do so for our other tree-based models like xgboost/RF/catboost.

#793 introduces a change where the default value of every parameter exposed as a hyperparameter must fall within its tunable range. So I suggest defaulting max_depth to 6 here, since that's what the other models are set to.

Contributor

(Same applies to the regression estimator)
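Concretely, the constraint from #793 is just that the default sits inside the tunable range; a sketch (plain tuples standing in for skopt.space.Integer):

```python
# Default of 6 matches the other tree-based models and falls inside the range.
DEFAULT_MAX_DEPTH = 6
hyperparameter_ranges = {
    "n_estimators": (10, 1000),
    "max_depth": (1, 32),
}
low, high = hyperparameter_ranges["max_depth"]
assert low <= DEFAULT_MAX_DEPTH <= high
```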

n_jobs=-1,
random_state=0):
parameters = {"n_estimators": n_estimators,
"max_features": max_features}
Contributor

@ctduffy when you add max_depth, don't forget to also add it to the parameters dict. This is another bit of functionality which is necessary for #793.
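In sketch form (a hypothetical helper mirroring the component __init__, just to show max_depth flowing into the dict):

```python
def make_parameters(n_estimators=100, max_features="auto", max_depth=6):
    """Hypothetical helper mirroring the component __init__:
    max_depth now appears alongside the existing entries."""
    return {"n_estimators": n_estimators,
            "max_features": max_features,
            "max_depth": max_depth}
```

With max_depth in the dict, clf.parameters reports it and the #793 in-range check can see it.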

Contributor

(Same applies to the regression estimator)

Contributor

@dsherry I am happy to do this but this is what @eccabay was working on, do you want me to continue working on it instead?

Contributor

@dsherry dsherry May 29, 2020

@ctduffy @eccabay 🤦 augh, I got this mixed up with the ElasticNet PR, which you're working on. They both start with E and have two words 😂😂 So sorry about that!

That said, reviewing these comments may be helpful to you too, because I'll make similar ones on the ElasticNet PR if applicable!

Contributor

@ctduffy ctduffy May 29, 2020

@dsherry No worries, I get the two confused sometimes too. And yep, I have been looking at the comments and changing some things with ElasticNet (like the select-from-model removal).

clf.fit(X, y)
feature_importances = clf.feature_importances

np.testing.assert_almost_equal(sk_feature_importances, feature_importances, decimal=5)
Contributor

These tests are great!

One thing which you can add: check that calling estimator.parameters on the instance returns what you'd expect. This is partially covered in the components tests, but it would be helpful to have direct coverage here.

Contributor

And same for the regressor of course lol

"max_depth": 5
}

assert clf.parameters == expected_parameters
Contributor

👍

clf.fit(X, y)
y_pred = clf.predict(X)

objective = PrecisionMicro()
Contributor

Is this dataset multiclass? I had assumed it was binary. PrecisionMicro is a multiclass objective. Should still work but perhaps we should use Precision or a different objective like F1.

Contributor

Oh, I see you're just doing that to trigger the error. Got it. In that case, I suggest we split this into a separate test.

Contributor Author

You're right, the dataset is binary. Right below is a test that setting the objective to a multiclass one throws an error.

Contributor Author

Got it, can do


objective = PrecisionMicro()
with pytest.raises(ValueError, match="You can only use a binary classification objective to make predictions for a binary classification pipeline."):
clf.predict(X, objective)
Contributor

Nice!

assert len(clf.feature_importances) == len(X.columns)
assert not clf.feature_importances.isnull().all().all()
for col_name in clf.feature_importances["feature"]:
assert "col_" in col_name
Contributor

👍

Contributor

@dsherry dsherry left a comment

Left some comments on the testing, and comments about the parameter defaults and hyperparam ranges, but otherwise LGTM!

@eccabay eccabay merged commit 26fbdc6 into master May 29, 2020
@dsherry dsherry deleted the 771_extratrees_pipeline branch October 29, 2020 23:48