Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update automl algorithm #793

Merged
merged 47 commits into from May 29, 2020
Merged

Update automl algorithm #793

merged 47 commits into from May 29, 2020

Conversation

dsherry
Copy link
Collaborator

@dsherry dsherry commented May 21, 2020

Fixes #272 fixes #754 fixes #755 fixes #756 fixes #757 fixes #301 fixes #309 fixes #353

Adds a new iterative automl algorithm, and an API for defining automl algorithms in general.

Also updates default parameters for components and pipelines.

API docs are included, but updating the rest of the docs is not included in this PR.

@codecov
Copy link

codecov bot commented May 28, 2020

Codecov Report

Merging #793 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #793      +/-   ##
==========================================
+ Coverage   99.63%   99.65%   +0.01%     
==========================================
  Files         170      175       +5     
  Lines        6647     6940     +293     
==========================================
+ Hits         6623     6916     +293     
  Misses         24       24              
Impacted Files Coverage Δ
evalml/__init__.py 100.00% <ø> (ø)
evalml/automl/auto_classification_search.py 100.00% <ø> (ø)
evalml/automl/auto_regression_search.py 100.00% <ø> (ø)
...eature_selection/rf_classifier_feature_selector.py 100.00% <ø> (ø)
...feature_selection/rf_regressor_feature_selector.py 100.00% <ø> (ø)
evalml/tests/tuner_tests/test_grid_search_tuner.py 100.00% <ø> (ø)
...alml/tests/tuner_tests/test_random_search_tuner.py 100.00% <ø> (ø)
evalml/automl/__init__.py 100.00% <100.00%> (ø)
evalml/automl/auto_search_base.py 97.91% <100.00%> (-0.09%) ⬇️
evalml/automl/automl_algorithm/__init__.py 100.00% <100.00%> (ø)
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 26fbdc6...c48df2b. Read the comment docs.

@dsherry dsherry marked this pull request as ready for review May 28, 2020
@dsherry
Copy link
Collaborator Author

dsherry commented May 28, 2020

@kmax12 @jeremyliweishih @angela97lin this PR is currently missing some more docstrings, a few more unit tests, and general documentation updates. I will get that stuff in shortly. But otherwise this is ready for review.

Copy link
Contributor

@kmax12 kmax12 left a comment

here's my first pass through. probably best to discuss live anything that need clarity. need to look at tests more in next round.

MockEstimator = dummy_classifier_estimator_class

class MockBinaryClassificationPipeline1(BinaryClassificationPipeline):
estimator = MockEstimator
Copy link
Contributor

@kmax12 kmax12 May 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need this estimator property?

Copy link
Collaborator Author

@dsherry dsherry May 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Updated in latest push, and now its clear why.

Copy link
Contributor

@kmax12 kmax12 May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not clear to me, but i might not be looking in the right place. the reason why i feel it's not need is because we have this line in PipelineBase.__init__()

self.estimator = self.component_graph[-1] if isinstance(self.component_graph[-1], Estimator) else None

Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mock estimator defines the dummy hyperparameter ranges, which although not referred to directly are used in the last unit test to make sure the tuner searched for params which differ from the default.

evalml/automl/automl_algorithm/iterative_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/iterative_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/iterative_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/__init__.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/automl_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/automl_algorithm.py Outdated Show resolved Hide resolved
evalml/automl/automl_algorithm/iterative_algorithm.py Outdated Show resolved Hide resolved
@dsherry dsherry force-pushed the ds_754_automl_algorithm branch from 0535acd to 466d816 Compare May 28, 2020
@dsherry
Copy link
Collaborator Author

dsherry commented May 28, 2020

Just spoke with @kmax12 . Plan is that I'll address most of his comments, then extend the IterativeAlgorithm to be able to continue indefinitely and to respect max_pipelines, because we feel that's the most useful. There's a risk that will take too long for us to get it into the release tomorrow; if that happens we can always cut another next week.

@dsherry dsherry force-pushed the ds_754_automl_algorithm branch from 0b81bd3 to 0a9a863 Compare May 28, 2020
Copy link
Contributor

@kmax12 kmax12 left a comment

proactively leaving some more review comments. looking very close in my eyes!

"metadata": {},
"outputs": [],
"source": [
"automl.possible_pipelines"
Copy link
Contributor

@kmax12 kmax12 May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add this line back in but be automl.allowed_pipelines? it actually might make more sense to just replaces this with print(automl) in the docs to give complete overview

we can come back this in a future PR if we want since it's doc changes

Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, since this PR crossed the 1000+ line threshold, I'd like to do docs changes separately. That may mean we cut the release without great documentation for AutoMLAlgorithm/IterativeAlgorithm. This doesn't bother me too much because this PR doesn't change much for users in terms of the inputs to automl. You're reminding me though, I'll scan the docs real quick and make sure the external params like allowed_pipelines and allowed_model_families are covered well.

Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I'll push this back in for now, using allowed_pipelines

MockEstimator = dummy_classifier_estimator_class

class MockBinaryClassificationPipeline1(BinaryClassificationPipeline):
estimator = MockEstimator
Copy link
Contributor

@kmax12 kmax12 May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still not clear to me, but i might not be looking in the right place. the reason why i feel it's not need is because we have this line in PipelineBase.__init__()

self.estimator = self.component_graph[-1] if isinstance(self.component_graph[-1], Estimator) else None

@dsherry
Copy link
Collaborator Author

dsherry commented May 29, 2020

Thanks @kmax12 ! I was about to message and say that I worked through all your previous comments and this is ready for re-review.

@@ -11,12 +11,12 @@ class RandomForestClassifier(Estimator):
name = "Random Forest Classifier"
hyperparameter_ranges = {
"n_estimators": Integer(10, 1000),
"max_depth": Integer(1, 32),
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmax12 I ended up having to update some of the param defaults and ranges, according to what I could find recommended on the sklearn/catboost/xgboost sites. We've had this recurring problem where if calls to the random number generator were changed in a PR, the runtime on a few tests could suddenly spike. I think lowering these max_depth max values for tree models has helped with that. I think its sped up the unit tests overall, which is nice. And we're close to being able to test all of this directly via the perf tests! (@jeremyliweishih )

}
model_family = ModelFamily.RANDOM_FOREST
supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

def __init__(self, n_estimators=10, max_depth=None, n_jobs=-1, random_state=0):
def __init__(self, n_estimators=100, max_depth=6, n_jobs=-1, random_state=0):
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An unfortunate limitation of the current approach is that all default values must be in the hyperparam range. This means max_depth here has to default to an integer value. Sklearn's doc says for RF, max_depth's default is None so that its set automatically using some heuristic they have. We should consider how to take advantage of that in the future.

This wasn't a problem before this PR because we never tried to use the default parameters in automl! We always sampled from the hyperparameter ranges.

One option is to acknowledge that the defaults won't necessarily appear in the hyperparam ranges, and update the code to not try to add the evaluation results to the tuner, which is where the error occurs. Another option is to change the max_depth range to be something like [None] + list(range(1, 11)), but then I think the CASH algos won't know anymore that the numeric values are ordered. We can come back to this when its time for more rigorous perf testing (@jeremyliweishih FYI)

@@ -20,7 +20,8 @@ def __init__(self, number_features=None, n_estimators=10, max_depth=None,
if number_features:
max_features = max(1, int(percent_features * number_features))
parameters = {"percent_features": percent_features,
"threshold": threshold}
"threshold": threshold,
"number_features": number_features}
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#819 removed the feature selectors from all pipelines. But in order to get the unit tests from that PR passing, and preserve support for feature selectors in automl, the feature selectors must save the number_features parameter.

automl_1 = AutoClassificationSearch(objective=fc, max_pipelines=5, random_state=30)
automl_1.search(X, y)

assert automl.rankings.equals(automl_1.rankings)
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this because this unit test was taking way too long, something like 90 seconds! This test just needs to run two automl searches and check the results are identical. We have separate coverage that FraudCost is a valid objective. I think this is left over from before the objective API overhaul.

@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_automl_allowed_pipelines_search(mock_fit, mock_score, dummy_binary_pipeline_class, X_y):
X, y = X_y
mock_score.return_value = {'Log Loss Binary': 1.0}
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting pretty into this mocking pattern to save time in the automl tests. Any time the results of pipelines' fit and score methods don't matter, its great to mock them out to save time in the search.

@@ -315,11 +323,64 @@ def fit(self, X, y):
mock_get_pipelines.return_value = allowed_pipelines
start_iteration_callback = MagicMock()
automl = AutoClassificationSearch(max_pipelines=2, start_iteration_callback=start_iteration_callback)
assert automl.possible_pipelines == allowed_pipelines
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now covered in test_automl_allowed_pipelines_init in classification/regression tests

.format(pipeline_parameters, str(e))
logger.error(msg)
raise ParameterError(msg)
raise(e)
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this code to handle the case when someone passes in a value like None but the range is Int(1, 32). In that case an ugly stack trace gets thrown. This makes the error more pretty. I also avoided the error in automl search by changing our default params to not use None for max_depth

@dsherry dsherry force-pushed the ds_754_automl_algorithm branch 2 times, most recently from d3eeead to 2f93b89 Compare May 29, 2020
@dsherry dsherry requested a review from kmax12 May 29, 2020
@dsherry dsherry mentioned this pull request May 29, 2020
@dsherry dsherry mentioned this pull request May 29, 2020
@dsherry dsherry force-pushed the ds_754_automl_algorithm branch from 2f93b89 to de9c9ae Compare May 29, 2020
:nosignatures:

AutoMLAlgorithm
IterativeAlgorithm
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin how does this look for now?

Screen Shot 2020-05-29 at 3 00 57 PM

Copy link
Contributor

@angela97lin angela97lin May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 😄

current_batch_pipelines = self._automl_algorithm.next_batch()
except StopIteration:
logger.info('AutoML Algorithm out of recommendations, ending')
break
Copy link
Collaborator Author

@dsherry dsherry May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update this code in a subsequent PR. Left over from when I had the algo throwing StopIteration. This doesn't block from merging, and has test coverage in place, just that we should probably delete it to simplify and instead catch AutoMLAlgorithmError

@dsherry dsherry merged commit bbc802a into master May 29, 2020
2 checks passed
@dsherry dsherry deleted the ds_754_automl_algorithm branch May 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment