Implementation of max_batches parameter. #1087
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##             main    #1087    +/-   ##
========================================
  Coverage   99.89%   99.89%
========================================
  Files         192      192
  Lines       10853    10894     +41
========================================
+ Hits        10842    10883     +41
  Misses         11       11
```

Continue to review full report at Codecov.
evalml/automl/automl_search.py (Outdated)

```diff
@@ -76,7 +76,8 @@ def __init__(self,
                 n_jobs=-1,
                 tuner_class=None,
                 verbose=True,
-                optimize_thresholds=False):
+                optimize_thresholds=False,
+                max_batches=None):
```
Not adding to docstring because we want to keep this "hidden" for now.
Yep, thanks. Will make it easier to make breaking changes when we next update the automl search UX.
Hm, after thinking on it for a bit, my preference would be to name this _max_batches and add a docstring entry, rather than simply omitting it from the docstring. Not a blocker.
I agree with this, adding the docstring now will also help us with development and keep track of what this is for!
I'll rename it to max_batches and add it to the docstring!
Awesome! This is going to make the perf testing a lot easier :)
I left a comment on making the parameter private as opposed to unlisted (feel free to push back if you disagree), and some minor test suggestions.
evalml/automl/automl_search.py (Outdated)

```diff
@@ -166,7 +167,7 @@ def __init__(self,
             raise TypeError("max_time must be a float, int, or string. Received a {}.".format(type(max_time)))

         self.max_pipelines = max_pipelines
-        if self.max_pipelines is None and self.max_time is None:
+        if self.max_pipelines is None and self.max_time is None and max_batches is None:
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
```python
self._max_batches = max_batches
# This is the default value for IterativeAlgorithm - setting this explicitly makes sure that
# the behavior of max_batches does not break if IterativeAlgorithm is changed.
self._pipelines_per_batch = 5
```
Ah got it, thanks.
```diff
@@ -365,6 +373,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl

         if self.allowed_pipelines == []:
             raise ValueError("No allowed pipelines to search")
+        if self._max_batches and self.max_pipelines is None:
+            self.max_pipelines = 1 + len(self.allowed_pipelines) + (self._pipelines_per_batch * (self._max_batches - 1))
```
Makes sense. When we change our automl algorithm we'll have to remember to update this. I spent some time considering if there's a way to have IterativeAlgorithm compute this for us, but I think this is fine.
Can I clarify this calculation so I better understand? Is this 1 (baseline) + len(self.allowed_pipelines) for first batch + (self._pipelines_per_batch * (self._max_batches - 1)) for each batch thereafter?
Yes you got it @angela97lin !
ah gotcha, thanks @freddyaboulton! 😊
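The calculation discussed above can be written out as a small standalone helper (a sketch for illustration only; the helper name is hypothetical, and it assumes the default `_pipelines_per_batch` of 5 shown earlier in the diff):

```python
def total_pipelines(n_allowed_pipelines, max_batches, pipelines_per_batch=5):
    """Total pipelines evaluated after max_batches rounds:
    1 baseline pipeline,
    + one pipeline per allowed pipeline class in the first batch,
    + pipelines_per_batch pipelines in each subsequent batch."""
    return 1 + n_allowed_pipelines + pipelines_per_batch * (max_batches - 1)

# e.g. 8 allowed pipeline classes searched for 3 batches:
print(total_pipelines(8, 3))  # 1 + 8 + 5 * 2 = 19
```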
```diff
@@ -377,7 +387,8 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
             tuner_class=self.tuner_class,
             random_state=self.random_state,
             n_jobs=self.n_jobs,
-            number_features=X.shape[1]
+            number_features=X.shape[1],
+            pipelines_per_batch=self._pipelines_per_batch
```
👍
```python
n_results = 1 + len(automl.allowed_pipelines) + (5 * (max_batches - 1))

assert automl._automl_algorithm._batch_number == max_batches
assert len(automl._results["pipeline_results"]) == n_results
```
Could you check automl.rankings and automl.full_rankings? Also please use the public accessor for results, i.e. automl.results
Good call!
```python
# So that the test does not break when new estimator classes are added
n_results = 1 + len(automl.allowed_pipelines) + (5 * (max_batches - 1))

assert automl._automl_algorithm._batch_number == max_batches
```
Yeah we don't surface this publicly right now. You're reminding me we should eventually pin down an API for exposing this sort of information.
Please do:

```python
assert automl._automl_algorithm.batch_number == max_batches
assert automl._automl_algorithm.pipeline_number == n_results
```
Thanks for the tips! I was looking at IterativeAlgorithm instead of the base class.
```python
@patch('evalml.pipelines.BinaryClassificationPipeline.score', return_value={"Log Loss Binary": 0.8})
@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_max_batches_plays_nice_with_other_stopping_criteria(mock_fit, mock_score, X_y_binary):
```
Haha nice
```python
def test_max_batches_must_be_non_negative(max_batches):

    with pytest.raises(ValueError, match=f"Parameter max batches must be None or non-negative. Received {max_batches}."):
        AutoMLSearch(problem_type="binary", max_batches=max_batches)
```
Wonderful!
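The guard exercised by the test above can be sketched as a standalone function (a hypothetical helper for illustration; the real check lives inside AutoMLSearch.__init__, and the message mirrors the one asserted in the test):

```python
def check_max_batches(max_batches):
    """Accept None or a non-negative value for max_batches; raise otherwise."""
    if max_batches is not None and max_batches < 0:
        raise ValueError(
            f"Parameter max batches must be None or non-negative. Received {max_batches}."
        )
    return max_batches

check_max_batches(None)  # ok: no batch limit requested
check_max_batches(3)     # ok: non-negative
# check_max_batches(-1)  # would raise ValueError
```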
LGTM! Was curious about implementation for my own understanding but the impl looks great, tests are fantastic :D
Commits 1ad0771 to 823714f
@dsherry I addressed your comments! Thanks for the feedback.
```diff
@@ -76,7 +76,8 @@ def __init__(self,
                 n_jobs=-1,
                 tuner_class=None,
                 verbose=True,
-                optimize_thresholds=False):
+                optimize_thresholds=False,
+                _max_batches=None):
```
Thanks, I think this is a good pattern for us to avoid breaking changes but still communicate clearly to users 👍
evalml/automl/automl_search.py (Outdated)

```diff
@@ -129,6 +130,8 @@ def __init__(self,
             None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

         verbose (boolean): If True, turn verbosity on. Defaults to True

+        _max_batches (int): The maximum number of batches of pipelines to search.
```
Great. Could also mention that max_pipelines and max_time take precedence, but not required
```diff
@@ -129,6 +130,9 @@ def __init__(self,
             None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used.

         verbose (boolean): If True, turn verbosity on. Defaults to True

+        _max_batches (int): The maximum number of batches of pipelines to search. Parameters max_time and
+            max_pipelines have precedence over stopping the search.
```
Thanks!
🚢 💨
Pull Request Description
Fixes #1005 by adding a max_batches parameter to AutoMLSearch. When both max_pipelines and max_batches are set, max_pipelines will take precedence.

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.
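The precedence rule described in this PR can be sketched in plain Python (a hypothetical helper for illustration; the real resolution happens inside AutoMLSearch, and the batch-to-pipeline translation assumes the default of 5 pipelines per batch):

```python
def resolve_max_pipelines(max_pipelines, max_batches, n_allowed_pipelines,
                          pipelines_per_batch=5):
    """Sketch of how max_pipelines and max_batches interact: an explicit
    max_pipelines wins; otherwise max_batches is translated into a pipeline
    count (1 baseline + first batch + later batches); otherwise no limit
    is derived here."""
    if max_pipelines is not None:
        return max_pipelines  # max_pipelines takes precedence
    if max_batches is not None:
        return 1 + n_allowed_pipelines + pipelines_per_batch * (max_batches - 1)
    return None  # neither set: caller falls back to its own defaults

print(resolve_max_pipelines(10, 3, 8))    # explicit limit wins -> 10
print(resolve_max_pipelines(None, 3, 8))  # derived from batches -> 19
```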