
Added max_batches as a public parameter #1320

Merged: 9 commits merged into main from 1294_automl_batching on Oct 23, 2020

Conversation

@christopherbunn (Contributor) commented Oct 19, 2020

Updated search progress look (for max_batches=2):

[screenshot: updated search progress output]

Resolves #1294

@codecov (bot) commented Oct 19, 2020

Codecov Report

Merging #1320 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1320      +/-   ##
==========================================
+ Coverage   99.95%   99.95%   +0.01%     
==========================================
  Files         213      213              
  Lines       13560    13575      +15     
==========================================
+ Hits        13553    13568      +15     
  Misses          7        7              
Impacted Files                               Coverage Δ
evalml/automl/automl_search.py               99.62% <100.00%> (+0.01%) ⬆️
evalml/tests/automl_tests/test_automl.py     100.00% <100.00%> (ø)
evalml/utils/logger.py                       100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@christopherbunn (Contributor, author):

See screenshot above for changes to the progress output.

I updated the docstrings, but any suggestions for additional documentation edits are very welcome.

@freddyaboulton (Contributor) left a comment

@christopherbunn Looks great!

In terms of formatting the progress - maybe we can do something like Batch 1: Iteration 5/14 to make it self-explanatory? I think it might be a good idea to display the batch number regardless of whether the user specified max_batches. Whatever we decide, the formatting for the baseline pipeline iteration should be the same as the other iterations. It looks different in the screenshot you posted.

I'm curious what others think. I think what you have now is great too.

if check_output:
    assert f"Searching up to {max_batches} batches for a total of {n_automl_pipelines} pipelines." in caplog.text
    for i in range(1, max_batches):
        assert f"({i}: " in caplog.text
Contributor:

👍

@christopherbunn (Contributor, author) commented Oct 21, 2020

> In terms of formatting the progress - maybe we can do something like Batch 1: Iteration 5/14 to make it self-explanatory?

@freddyaboulton I think that's a good idea. I was worried that it would look a bit off formatting-wise, but after prototyping it a bit, it doesn't look as bad as I thought it would.

> the formatting for the baseline pipeline iteration should be the same as the other iterations

My reasoning for not having the same formatting is that I think the baseline is not officially considered part of any batch. That being said, maybe we can mark it as a "Baseline" pipeline or something like that. Alternatively, it could be "Batch 0"? The formatting would look a bit more consistent, but I think it could potentially be confusing as well.

Taking both of these things into account, the new output could look like this:

[screenshot: prototype of the new progress output]

@freddyaboulton (Contributor) replied:

@christopherbunn I think the output looks good! I am ok with referring to the baseline as Baseline instead of Batch 1 or Batch 0, but my thought for labeling it as Batch 1 is that the baseline will get run if the user specifies max_batches=1, so it's included in that sense.

@dsherry (Contributor) left a comment

@christopherbunn looks good! I left a few comments.

@@ -86,6 +86,7 @@ def __init__(self,
                  tuner_class=None,
                  verbose=True,
                  optimize_thresholds=False,
+                 max_batches=None,
                  _max_batches=None):
Contributor:

@christopherbunn because we named _max_batches as a "private" input arg, let's delete it in this PR in favor of max_batches. This is why we kept it private, to test it out before renaming to make it public!

@@ -33,9 +34,9 @@ Release Notes
.. warning::

**Breaking Changes**
* ``__max_batches`` will be deprecated in favor of ``max_batches`` as it is now public :pr:`1320`
Contributor:

Please update to:

Make max_batches argument to AutoMLSearch.search public

@@ -11,6 +11,7 @@ Release Notes
     * Added percent-better-than-baseline for all objectives to automl.results :pr:`1244`
     * Added ``HighVarianceCVDataCheck`` and replaced synonymous warning in ``AutoMLSearch`` :pr:`1254`
     * Added `PCA Transformer` component for dimensionality reduction :pr:`1270`
+    * Added ``max_batches`` parameter to `AutoMLSearch` :pr:`1320`
Contributor:

We added this previously, so can you change this to:

Make max_batches argument to AutoMLSearch.search public

if _max_batches:
    if not max_batches:
        max_batches = _max_batches
    logger.warning("`_max_batches` will be deprecated in the next release. Use `max_batches` instead.")
Contributor:

RE comments above: let's delete this, and _max_batches entirely, in favor of max_batches.

self.max_batches = max_batches
# This is the default value for IterativeAlgorithm - setting this explicitly makes sure that
# the behavior of max_batches does not break if IterativeAlgorithm is changed.
self._pipelines_per_batch = 5
Contributor:

Could we add _pipelines_per_batch as an automl search parameter as well? I do think we should keep that one private while we're getting parallel support in place, and then once we're confident we wanna keep it we can consider making it public. But it'll be nice to have external control over this for perf testing purposes. Ok to do this separately, or file a separate issue.
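For illustration, a minimal sketch of what that knob could look like, using a toy stand-in class rather than evalml's actual AutoMLSearch; the name AutoMLSearchSketch, the n_allowed_estimators argument, and the max_iterations_for_batches helper are hypothetical, while the batch accounting mirrors the test suggestion later in this thread:

```python
class AutoMLSearchSketch:
    """Toy stand-in for AutoMLSearch, showing only the batching-related knobs."""

    def __init__(self, n_allowed_estimators, max_batches=None, _pipelines_per_batch=5):
        # max_batches is the public knob added by this PR; _pipelines_per_batch stays
        # "private" (leading underscore) while parallel support is worked out, per the
        # suggestion above.
        self.max_batches = max_batches
        self._pipelines_per_batch = _pipelines_per_batch
        self._n_allowed_estimators = n_allowed_estimators

    def max_iterations_for_batches(self):
        # Mirrors the accounting used in the test suggestion later in this thread:
        # 1 baseline pipeline, one pipeline per allowed estimator in the first batch,
        # then _pipelines_per_batch pipelines in every later batch.
        if self.max_batches is None:
            return None
        return 1 + self._n_allowed_estimators + self._pipelines_per_batch * (self.max_batches - 1)
```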

Contributor Author:

Sure. I'll file this as a separate issue.

Contributor:

Thanks!

@@ -435,7 +443,9 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
        logger.info("Optimizing for %s. " % self.objective.name)
        logger.info("{} score is better.\n".format('Greater' if self.objective.greater_is_better else 'Lower'))

        if self.max_iterations is not None:
        if self.max_batches is not None:
            logger.info(f"Searching up to {self.max_batches} batches for a total of {self.max_iterations} pipelines. ")
Contributor:

👍

if self.max_batches:
    update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations, self._start, self.current_batch)
else:
    update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations, self._start)
Contributor:

Please get the current batch number from AutoMLAlgorithm instead:

self._automl_algorithm.batch_number

        status_update_format = "({current_batch}: {current_iteration}/{max_iterations}) {pipeline_name} Elapsed:{time_elapsed}"
    else:
        status_update_format = "({current_iteration}/{max_iterations}) {pipeline_name} Elapsed:{time_elapsed}"
    format_params = {'current_batch': current_batch, 'max_iterations': max_iterations, 'current_iteration': current_iteration}
else:
    status_update_format = "{pipeline_name} Elapsed: {time_elapsed}"
    format_params = {}
Contributor:

Let's clean this method up:

elapsed_time = time_elapsed(start_time)
if current_batch is not None:
    logger.info(f"({current_batch}: {current_iteration}/{max_iterations}) {pipeline_name} Elapsed:{elapsed_time}")
elif max_iterations is not None:
    logger.info(f"{current_iteration}/{max_iterations}) {pipeline_name} Elapsed:{elapsed_time}")
else:
    logger.info(f"{pipeline_name} Elapsed: {elapsed_time}")

@christopherbunn (Contributor, author) replied:

> @christopherbunn I think the output looks good! I am ok with referring to the baseline as Baseline instead of Batch 1 or Batch 0, but my thought for labeling it as Batch 1 is that the baseline will get run if the user specifies max_batches=1, so it's included in that sense.

Ah I see, that makes sense. In that case, I'll count the baseline pipeline as Batch 1.

@bchen1116 (Contributor) left a comment

LGTM! Just left one question.

@@ -591,7 +596,7 @@ def _add_baseline_pipelines(self, X, y):
        desc = desc.ljust(self._MAX_NAME_LEN)

        update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations,
-                       self._start)
+                       self._start, '1')
Contributor:

Why is '1' passed in as a string instead of an int?

Contributor:

Hmm yeah, I think it's fine to pass as an int or a string, right? It's the current batch number, going into a format string in update_pipeline.

Actually, @christopherbunn, can we please use self._automl_algorithm.batch_number here instead? The automl algorithm is responsible for keeping track of the batch number. In this case it'll start at 0.

@dsherry (Contributor) left a comment

@christopherbunn great!

I left a couple comments about updating the batch number so it starts at 1 instead of 0 in the display output, and always using self._automl_algorithm.batch_number to get the batch number. I also left a suggestion for a test to check the log output is correct.

Otherwise, looks good!

@@ -489,7 +491,10 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
        desc = desc[:self._MAX_NAME_LEN - 3] + "..."
        desc = desc.ljust(self._MAX_NAME_LEN)

-       update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations, self._start)
+       if self.max_batches:
+           update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations, self._start, self._automl_algorithm.batch_number)
Contributor:

@christopherbunn I see we're passing this batch number directly to the output. Does it start at 0, or at 1? RE someone else's comment, I think the batch should start at 1 in the log output.

Contributor:

Aha, I read further. Here's what I suggest:

  • Everywhere we call update_pipeline, pass the batch number in using self._automl_algorithm.batch_number. This will be 0 for the first batch
  • In update_pipeline, update the format string to print {current_batch+1} instead of {current_batch}

Sound good?

We should also add unit test coverage which checks that the right batch number prints out. My suggestion:

  • Set max_iterations = len(get_estimators(self.problem_type))+1 + _pipelines_per_batch*2 to run 3 batches
  • Mock pipeline fit/score to avoid long runtime
  • Run search, and using the caplog fixture, assert that Batch 1 appears len(get_estimators(self.problem_type)) + 1 times, and Batch 2 and Batch 3 each appear _pipelines_per_batch times.
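A rough sketch of that test (it sets max_batches=3 directly rather than deriving max_iterations, so the batch prefix is guaranteed to print; the X_y_binary fixture, the patch targets, and the get_estimators import path follow evalml's test conventions but are assumptions here):

```python
from unittest.mock import patch

from evalml.automl import AutoMLSearch
from evalml.pipelines.components.utils import get_estimators  # import path assumed


@patch('evalml.pipelines.BinaryClassificationPipeline.score', return_value={"Log Loss Binary": 0.5})
@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_batch_number_in_log_output(mock_fit, mock_score, X_y_binary, caplog):
    X, y = X_y_binary
    pipelines_per_batch = 5  # IterativeAlgorithm default noted earlier in this PR
    first_batch_size = len(get_estimators(problem_type="binary")) + 1  # baseline + one per estimator

    automl = AutoMLSearch(problem_type="binary", max_batches=3)
    automl.search(X, y)

    # Batch numbers are printed as "(batch: iteration/max_iterations)" per this PR.
    assert caplog.text.count("(1: ") == first_batch_size
    assert caplog.text.count("(2: ") == pipelines_per_batch
    assert caplog.text.count("(3: ") == pipelines_per_batch
```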

Contributor Author:

The test case sounds good, I'll add that in.

> In update_pipeline, update the format string to print {current_batch+1} instead of {current_batch}

@dsherry because the baseline pipeline is counted as its own separate batch, the displayed batch number would actually be off by one: batch_number is incremented by 1 when next_batch() is called, and next_batch() is called for the first batch before the output is printed, so Batch 1 would actually show up as Batch 2 with {current_batch+1}.

I could:

  1. Keep {current_batch+1} and move where we increment batch number to when the results are reported back
  2. Have the baseline pipeline show as batch 0 and use self._automl_algorithm.batch_number
  3. Override the output for the baseline (and not use self._automl_algorithm.batch_number) and show it as either batch 1, "baseline", or "base"

Imo 1 has the potential to get messy since we're modifying IterativeAlgorithm, so I want to avoid that. I'm leaning towards 2, although 3 seems fine to me as well. If we wanted "base" or "baseline" for 3, then I'll add another case to update_pipeline().
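A toy illustration of the sequencing described above, using a stand-in class rather than evalml's IterativeAlgorithm, just to show the increment-inside-next_batch() behavior:

```python
class ToyAlgorithm:
    """Stand-in for the batch counter behavior described above."""

    def __init__(self):
        self.batch_number = 0

    def next_batch(self):
        # batch_number is bumped when the batch is requested, before any progress lines print.
        self.batch_number += 1
        return [f"pipeline_{i}" for i in range(5)]


algo = ToyAlgorithm()
algo.next_batch()                        # first algorithm batch requested; batch_number is now 1
print(f"Batch {algo.batch_number + 1}")  # "Batch 2" -- the off-by-one described above
print(f"Batch {algo.batch_number}")      # "Batch 1" -- labeling from option 2/2b, using batch_number directly
```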

@dsherry (Contributor) commented Oct 23, 2020

Ah got it. Thanks.

I'll throw one more option on the pile:
2b. Label the baseline pipeline as batch 1 instead of batch 0. Use self._automl_algorithm.batch_number everywhere else, meaning the 1st batch shows up as batch 1, etc.

I think I agree with what you had before for the baseline in _add_baseline_pipelines, and we should keep the baseline pipeline displayed as "batch 1" along with the others.

But really, either 2 or 2b seem fine to me.

Sound good?

Contributor Author:

2b works for me, I'll manually set the baseline as batch 1 then. Thanks!


@christopherbunn merged commit 4c9e5a9 into main on Oct 23, 2020
@christopherbunn deleted the 1294_automl_batching branch on October 23, 2020 at 19:23
@dsherry mentioned this pull request on Oct 29, 2020
Linked issue: Show AutoML Batching to Users (#1294)