Add ExtraTrees Pipeline #790
Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #790      +/-  ##
=========================================
+ Coverage   99.56%  99.63%   +0.07%
=========================================
  Files         161     170       +9
  Lines        6404    6647     +243
=========================================
+ Hits         6376    6623     +247
+ Misses         28      24       -4

Continue to review the full report at Codecov.
This looks good to me - great work! Should be ready to merge after Dylan gives it a look.
@eccabay I will review this in a bit!
@@ -7,12 +7,14 @@ class ModelFamily(Enum):
    XGBOOST = 'xgboost'
    LINEAR_MODEL = 'linear_model'
    CATBOOST = 'catboost'
    EXTRA_TREES = 'extra_trees'
Yep. Makes sense to have a separate family here for now. In the new automl algo this will mean the first round runs both RF and extra trees. Down the road we may want to group tree-based models differently, but this is a great starting point. @jeremyliweishih
evalml/pipelines/components/estimators/classifiers/et_classifier.py
name = "Extra Trees Classifier"
hyperparameter_ranges = {
    "n_estimators": Integer(10, 1000),
    "max_depth": Integer(1, 32),
We've had trouble with max_depth for catboost -- for that model, fit time is exponential in the depth. Catboost recommends a range of 4 to 10, which is what we use now. So I wonder if this limit should be lower. Will make a general comment about that in a bit.
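For illustration, a narrower range could look something like this; the 4-10 bounds are an assumption borrowed from the catboost guidance above, not a settled choice:

    from skopt.space import Integer

    # Assumed narrower bound, mirroring the 4-10 depth range used for catboost.
    hyperparameter_ranges = {
        "n_estimators": Integer(10, 1000),
        "max_depth": Integer(4, 10),
    }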
model_family = ModelFamily.EXTRA_TREES
supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

def __init__(self, n_estimators=10, max_depth=None, n_jobs=-1, random_state=0):
What other parameters does this model support? max_features, min_weight_fraction_leaf, max_leaf_nodes, criterion -- anything having to do with the math would be helpful to add here.
Our reasoning here is that we want our component API to be an extension of the underlying models, not a reduction, so most of the things you could do in sklearn, we want to expose.
We do have an issue planned (#775) which will add a **kwargs catchall to all components, but we should still make an effort to explicitly declare any parameters we deem important.
Note you shouldn't expose all of these parameters in hyperparameter_ranges to have them get tuned in automl -- we're going to do a lot more work on that once we have our new perf tests in place (which @jeremyliweishih is working on). But perhaps max_features should be added to the automl hyperparameters.
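As a sketch of what a fuller signature could look like -- the added parameters and their defaults below mirror sklearn's ExtraTreesClassifier, and the base-class call follows the usual evalml component pattern, so treat all of it as an assumption rather than settled API:

    from sklearn.ensemble import ExtraTreesClassifier as SKExtraTreesClassifier

    def __init__(self,
                 n_estimators=100,
                 max_features="auto",
                 max_depth=None,
                 min_samples_split=2,
                 min_weight_fraction_leaf=0.0,
                 max_leaf_nodes=None,
                 criterion="gini",
                 n_jobs=-1,
                 random_state=0):
        # Record everything the user passes in parameters and forward it
        # to the underlying sklearn estimator.
        parameters = {"n_estimators": n_estimators,
                      "max_features": max_features,
                      "max_depth": max_depth,
                      "min_samples_split": min_samples_split,
                      "min_weight_fraction_leaf": min_weight_fraction_leaf,
                      "max_leaf_nodes": max_leaf_nodes,
                      "criterion": criterion}
        et_classifier = SKExtraTreesClassifier(n_jobs=n_jobs,
                                               random_state=random_state,
                                               **parameters)
        super().__init__(parameters=parameters,
                         component_obj=et_classifier,
                         random_state=random_state)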
class ETBinaryClassificationPipeline(BinaryClassificationPipeline):
    """Extra Trees Pipeline for binary classification"""
    custom_name = "Extra Trees Binary Classification Pipeline"
    component_graph = ['One Hot Encoder', 'Simple Imputer', 'RF Classifier Select From Model', 'Extra Trees Classifier']
Yep, mirrors what we have for other pipelines at the moment, 👍
(Also for the regression pipeline)
def test_et_objective_tuning(X_y):
    X, y = X_y

    # The classifier predicts accurately with perfect confidence given the original data,
@dsherry this is the test I was referring to where the model is too confident and the assert at line 105 fails without this part here
Got it, thanks for calling that out.
So you're adding coverage which tests that the output of predict(X) vs predict(X, objective) is different? Those would differ if the binary classification threshold the pipeline sets is different from the one used by the underlying estimator (i.e. internally somewhere in sklearn's predict code). You may have already seen this, but check out the impl of BinaryClassificationPipeline.predict. Specifically, if self.threshold is None, we return self.estimator.predict(X_t).
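Roughly, the logic being described (a paraphrase for this thread, not the verbatim evalml source):

    def predict(self, X, objective=None):
        X_t = self._transform(X)
        if self.threshold is None:
            # No pipeline-level threshold set: defer to the estimator.
            return self.estimator.predict(X_t)
        # Otherwise, threshold the positive-class probabilities ourselves.
        ypred_proba = self.predict_proba(X)[:, 1]
        return ypred_proba > self.threshold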
I don't think we need coverage for this per-pipeline. I agree we should have a general pipeline test which covers this, and I believe we already do. So, assuming I'm not missing something here, my recommendation is that you delete this test, because we already have the coverage elsewhere.
A test you could add here would be to mock the estimator's predict method and ensure it's passed the right values during each of these cases. That would be great coverage to have, but not required.
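Something along these lines -- the patch target and the exact assertions are assumptions for illustration, not required API:

    from unittest.mock import patch

    import numpy as np

    @patch('evalml.pipelines.components.estimators.Estimator.predict')
    def test_et_estimator_predict_input(mock_predict, X_y):
        X, y = X_y
        mock_predict.return_value = np.zeros(len(y))
        clf = ETBinaryClassificationPipeline(parameters={})
        clf.fit(X, y)
        clf.predict(X)
        # With no threshold set, the pipeline should call the estimator's
        # predict exactly once, with the transformed feature matrix.
        assert mock_predict.call_count == 1
        predict_arg = mock_predict.call_args[0][0]
        assert predict_arg.shape[0] == X.shape[0]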
def __init__(self,
             n_estimators=100,
             max_features="auto",
             max_depth=None,
@eccabay I think we should expose max_depth as a tunable hyperparameter. We currently do so for our other tree-based models like xgboost/RF/catboost.
#793 introduces a change where all default parameter values for parameters exposed as hyperparameters must be within range. So I suggest you set max_depth here to default to 6, since that's what I have the other models set to.
(Same applies to the regression estimator)
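Concretely, that could look something like this (reusing the assumed 4-10 range from earlier; only the default of 6 is the actual suggestion here):

    # The default of 6 sits inside the declared tuning range, satisfying
    # the in-range requirement introduced by #793.
    hyperparameter_ranges = {
        "n_estimators": Integer(10, 1000),
        "max_depth": Integer(4, 10),
    }

    def __init__(self, n_estimators=100, max_features="auto", max_depth=6,
                 n_jobs=-1, random_state=0):
        ...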
             n_jobs=-1,
             random_state=0):
    parameters = {"n_estimators": n_estimators,
                  "max_features": max_features}
(Same applies to the regression estimator)
@dsherry No worries, I get the two confused sometimes too. And yep, I have been looking at the comments and changing some things with ElasticNet (like the select from model removal).
clf.fit(X, y)
feature_importances = clf.feature_importances

np.testing.assert_almost_equal(sk_feature_importances, feature_importances, decimal=5)
These tests are great!
One thing which you can add: check that calling estimator.parameters on the instance returns what you'd expect. This is partially covered in the components tests, but it would be helpful to have direct coverage here.
And same for the regressor of course lol
"max_depth": 5 | ||
} | ||
|
||
assert clf.parameters == expected_parameters |
👍
clf.fit(X, y)
y_pred = clf.predict(X)

objective = PrecisionMicro()
Is this dataset multiclass? I had assumed it was binary. PrecisionMicro is a multiclass objective. It should still work, but perhaps we should use Precision or a different objective like F1.
Oh, I see you're just doing that to trigger the error. Got it. In that case, I suggest we split this into a separate test.
You're right, the dataset is binary. Right below is a test that setting the objective to a multiclass one throws an error.
Got it, can do
objective = PrecisionMicro()
with pytest.raises(ValueError, match="You can only use a binary classification objective to make predictions for a binary classification pipeline."):
    clf.predict(X, objective)
Nice!
assert len(clf.feature_importances) == len(X.columns)
assert not clf.feature_importances.isnull().all().all()
for col_name in clf.feature_importances["feature"]:
    assert "col_" in col_name
👍
Left some comments on the testing and on the parameter defaults and hyperparam ranges, but otherwise LGTM!
Closes #771.