Update automl algorithm #793
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #793      +/-   ##
==========================================
+ Coverage   99.63%   99.65%   +0.01%
==========================================
  Files         170      175       +5
  Lines        6647     6940     +293
==========================================
+ Hits         6623     6916     +293
  Misses         24       24
Continue to review full report at Codecov.
@kmax12 @jeremyliweishih @angela97lin this PR is currently missing some more docstrings, a few more unit tests, and general documentation updates. I will get that stuff in shortly. But otherwise this is ready for review.
here's my first pass through. probably best to discuss live anything that needs clarity. need to look at tests more in the next round.
MockEstimator = dummy_classifier_estimator_class

class MockBinaryClassificationPipeline1(BinaryClassificationPipeline):
    estimator = MockEstimator
do you need this estimator property?
Yeah. Updated in the latest push, and now it's clear why.
still not clear to me, but i might not be looking in the right place. the reason why i feel it's not needed is because we have this line in PipelineBase.__init__():
self.estimator = self.component_graph[-1] if isinstance(self.component_graph[-1], Estimator) else None
The mock estimator defines the dummy hyperparameter ranges, which, although not referred to directly, are used in the last unit test to make sure the tuner searched for params that differ from the defaults.
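For context, a rough sketch of what that fixture amounts to (the fixture and class names mirror the test above, but the exact Estimator constructor signature here is assumed): the dummy hyperparameter_ranges declared on the mock estimator are the space the tuner proposes from, which is what the final test compares against the defaults.

from evalml.model_family import ModelFamily
from evalml.pipelines import BinaryClassificationPipeline
from evalml.pipelines.components import Estimator
from evalml.problem_types import ProblemTypes

class MockEstimator(Estimator):
    # any existing family works for a mock; RANDOM_FOREST is used only to satisfy the interface
    name = "Mock Classifier"
    model_family = ModelFamily.RANDOM_FOREST
    supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]
    # dummy ranges: the tuner samples from these, so the test can assert that the
    # searched parameters differ from the defaults
    hyperparameter_ranges = {'dummy_parameter': ['default', 'other']}

    def __init__(self, dummy_parameter='default', random_state=0):
        super().__init__(parameters={'dummy_parameter': dummy_parameter},
                         component_obj=None, random_state=random_state)

class MockBinaryClassificationPipeline1(BinaryClassificationPipeline):
    component_graph = [MockEstimator]
    estimator = MockEstimator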
Just spoke with @kmax12. Plan is that I'll address most of his comments, then extend the
proactively leaving some more review comments. looking very close in my eyes!
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"automl.possible_pipelines" |
should we add this line back in but have it be automl.allowed_pipelines? it actually might make more sense to just replace this with print(automl) in the docs to give a complete overview. we can come back to this in a future PR if we want since it's doc changes
Yeah, since this PR crossed the 1000+ line threshold, I'd like to do docs changes separately. That may mean we cut the release without great documentation for AutoMLAlgorithm/IterativeAlgorithm. This doesn't bother me too much because this PR doesn't change much for users in terms of the inputs to automl. You're reminding me though, I'll scan the docs real quick and make sure the external params like allowed_pipelines and allowed_model_families are covered well.
Eh, I'll push this back in for now, using allowed_pipelines.
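For what it's worth, the docs snippet being discussed would look roughly like this (a usage sketch; the attribute names come from the discussion above, the exact constructor call is assumed):

from evalml import AutoClassificationSearch

automl = AutoClassificationSearch(max_pipelines=5)
print(automl)                         # overview of the search configuration, per the print(automl) suggestion
print(automl.allowed_pipelines)       # the pipeline classes the search is allowed to try
print(automl.allowed_model_families)  # the model families those pipelines cover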
Thanks @kmax12! I was about to message and say that I worked through all your previous comments and this is ready for re-review.
@@ -11,12 +11,12 @@ class RandomForestClassifier(Estimator):
    name = "Random Forest Classifier"
    hyperparameter_ranges = {
        "n_estimators": Integer(10, 1000),
        "max_depth": Integer(1, 32),
@kmax12 I ended up having to update some of the param defaults and ranges, according to what I could find recommended on the sklearn/catboost/xgboost sites. We've had this recurring problem where, if calls to the random number generator were changed in a PR, the runtime on a few tests could suddenly spike. I think lowering these max_depth max values for tree models has helped with that. I think it's sped up the unit tests overall, which is nice. And we're close to being able to test all of this directly via the perf tests! (@jeremyliweishih)
    }
    model_family = ModelFamily.RANDOM_FOREST
    supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

-   def __init__(self, n_estimators=10, max_depth=None, n_jobs=-1, random_state=0):
+   def __init__(self, n_estimators=100, max_depth=6, n_jobs=-1, random_state=0):
An unfortunate limitation of the current approach is that all default values must be in the hyperparam range. This means max_depth here has to default to an integer value. Sklearn's doc says for RF, max_depth's default is None so that it's set automatically using some heuristic they have. We should consider how to take advantage of that in the future.
This wasn't a problem before this PR because we never tried to use the default parameters in automl! We always sampled from the hyperparameter ranges.
One option is to acknowledge that the defaults won't necessarily appear in the hyperparam ranges, and update the code to not try to add the evaluation results to the tuner, which is where the error occurs. Another option is to change the max_depth range to be something like [None] + list(range(1, 11)), but then I think the CASH algos won't know anymore that the numeric values are ordered. We can come back to this when it's time for more rigorous perf testing (@jeremyliweishih FYI)
@@ -20,7 +20,8 @@ def __init__(self, number_features=None, n_estimators=10, max_depth=None,
        if number_features:
            max_features = max(1, int(percent_features * number_features))
        parameters = {"percent_features": percent_features,
-                      "threshold": threshold}
+                      "threshold": threshold,
+                      "number_features": number_features}
#819 removed the feature selectors from all pipelines. But in order to get the unit tests from that PR passing, and preserve support for feature selectors in automl, the feature selectors must save the number_features parameter.
    automl_1 = AutoClassificationSearch(objective=fc, max_pipelines=5, random_state=30)
    automl_1.search(X, y)

    assert automl.rankings.equals(automl_1.rankings)
I removed this because this unit test was taking way too long, something like 90 seconds! This test just needs to run two automl searches and check the results are identical. We have separate coverage that FraudCost is a valid objective. I think this is left over from before the objective API overhaul.
@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_automl_allowed_pipelines_search(mock_fit, mock_score, dummy_binary_pipeline_class, X_y):
    X, y = X_y
    mock_score.return_value = {'Log Loss Binary': 1.0}
I'm getting pretty into this mocking pattern to save time in the automl tests. Any time the results of pipelines' fit and score methods don't matter, it's great to mock them out to save time in the search.
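A sketch of that pattern (the patch target and X_y fixture mirror the test above; the rest of the test body is assumed): patch fit and score so search() runs end to end without training anything.

from unittest.mock import patch

from evalml import AutoClassificationSearch

@patch('evalml.pipelines.BinaryClassificationPipeline.score')
@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_search_with_mocked_pipelines(mock_fit, mock_score, X_y):
    X, y = X_y  # small dataset fixture, as in the test above
    mock_score.return_value = {'Log Loss Binary': 1.0}  # canned score so rankings can still be built
    automl = AutoClassificationSearch(max_pipelines=3)
    automl.search(X, y)
    assert mock_fit.called and mock_score.called  # pipelines were "evaluated" without real training
    assert len(automl.rankings) > 0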
@@ -315,11 +323,64 @@ def fit(self, X, y):
    mock_get_pipelines.return_value = allowed_pipelines
    start_iteration_callback = MagicMock()
    automl = AutoClassificationSearch(max_pipelines=2, start_iteration_callback=start_iteration_callback)
    assert automl.possible_pipelines == allowed_pipelines
Now covered in test_automl_allowed_pipelines_init in classification/regression tests
        .format(pipeline_parameters, str(e))
    logger.error(msg)
    raise ParameterError(msg)
raise(e)
I added this code to handle the case when someone passes in a value like None but the range is Int(1, 32). In that case an ugly stack trace gets thrown. This makes the error prettier. I also avoided the error in automl search by changing our default params to not use None for max_depth.
   :nosignatures:

   AutoMLAlgorithm
   IterativeAlgorithm
@angela97lin how does this look for now?
LGTM! 😄
        current_batch_pipelines = self._automl_algorithm.next_batch()
    except StopIteration:
        logger.info('AutoML Algorithm out of recommendations, ending')
        break
I will update this code in a subsequent PR. It's left over from when I had the algo throwing StopIteration. This doesn't block merging, and has test coverage in place, just that we should probably delete it to simplify and instead catch AutoMLAlgorithmError.
Fixes #272 fixes #754 fixes #755 fixes #756 fixes #757 fixes #301 fixes #309 fixes #353
Adds a new iterative automl algorithm, and an API for defining automl algorithms in general.
Also updates default parameters for components and pipelines.
API docs are included, but updating the rest of the docs is not included in this PR.
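For readers landing here before the docs catch up, a hedged sketch of the shape of the new algorithm API (next_batch appears in the code above; add_result and everything else in this snippet are illustrative assumptions, not the actual evalml implementation):

from abc import ABC, abstractmethod

class AutoMLAlgorithm(ABC):
    """Decides which pipelines automl evaluates next and learns from their scores."""

    @abstractmethod
    def next_batch(self):
        """Return the next list of pipeline instances for the search loop to fit and score."""

    @abstractmethod
    def add_result(self, score, pipeline):
        """Record an evaluated pipeline's score so later batches can adapt,
        e.g. by feeding the result back into a per-pipeline tuner."""

# IterativeAlgorithm (the concrete implementation added in this PR) would subclass this:
# roughly, a first batch of each allowed pipeline with default parameters, then
# tuner-driven batches per pipeline -- see the discussion of defaults and ranges above.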