
Added max_batches as a public parameter #1320

Merged: 9 commits merged into main from 1294_automl_batching on Oct 23, 2020

Conversation

@christopherbunn (Contributor) commented Oct 19, 2020

Updated search progress look (for max_batches=2):

[screenshot: updated search progress output]

Resolves #1294

@codecov (bot) commented Oct 19, 2020

Codecov Report

Merging #1320 into main will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main    #1320      +/-   ##
==========================================
+ Coverage   99.95%   99.95%   +0.01%     
==========================================
  Files         213      213              
  Lines       13560    13575      +15     
==========================================
+ Hits        13553    13568      +15     
  Misses          7        7              
Impacted Files                               Coverage Δ
evalml/automl/automl_search.py               99.62% <100.00%> (+0.01%) ⬆️
evalml/tests/automl_tests/test_automl.py     100.00% <100.00%> (ø)
evalml/utils/logger.py                       100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@christopherbunn (Contributor, author):

See screenshot above for changes to the progress output.

I updated the docstrings, but any suggestions for additional documentation edits are very welcome.

@freddyaboulton (Contributor) left a comment

@christopherbunn Looks great!

In terms of formatting the progress - maybe we can do something like Batch 1: Iteration 5/14 to make it self-explanatory? I think it might be a good idea to display the batch number regardless of whether the user specified max_batches. Whatever we decide, the formatting for the baseline pipeline iteration should be the same as the other iterations. It looks different in the screenshot you posted.

I'm curious what others think. I think what you have now is great too.

if check_output:
    assert f"Searching up to {max_batches} batches for a total of {n_automl_pipelines} pipelines." in caplog.text
    for i in range(1, max_batches):
        assert f"({i}: " in caplog.text
Contributor:

👍

@christopherbunn (Contributor, author) commented Oct 21, 2020

> In terms of formatting the progress - maybe we can do something like Batch 1: Iteration 5/14 to make it self-explanatory?

@freddyaboulton I think that's a good idea. I was worried that it would look a bit off formatting-wise, but after prototyping it a bit, it doesn't look as bad as I thought it would.

> the formatting for the baseline pipeline iteration should be the same as the other iterations

My reasoning for not having the same formatting is that I think the baseline is not officially considered part of any batch. That being said, maybe we can mark it as a "Baseline" pipeline or something like that. Alternatively, it could be "Batch 0"? The formatting would look a bit more consistent, but I think it could potentially be confusing as well.

Taking both of these things into account, the new output could look like this:

[screenshot: prototype of the new progress output]

@freddyaboulton (Contributor) replied:

@christopherbunn I think the output looks good! I am ok with referring to the baseline as Baseline instead of Batch 1 or Batch 0, but my thought for labeling it as Batch 1 is that the baseline will get run if the user specifies max_batches=1, so it's included in that sense.

@dsherry (Contributor) left a comment

@christopherbunn looks good! I left a few comments.

@@ -86,6 +86,7 @@ def __init__(self,
                  tuner_class=None,
                  verbose=True,
                  optimize_thresholds=False,
+                 max_batches=None,
                  _max_batches=None):
Contributor:

@christopherbunn because we named _max_batches as a "private" input arg, let's delete it in this PR in favor of max_batches. This is why we kept it private, to test it out before renaming to make it public!

@@ -33,9 +34,9 @@ Release Notes
.. warning::

**Breaking Changes**
* ``__max_batches`` will be deprecated in favor of ``max_batches`` as it is now public :pr:`1320`
Contributor:

Please update to:

Make max_batches argument to AutoMLSearch.search public

@@ -11,6 +11,7 @@ Release Notes
     * Added percent-better-than-baseline for all objectives to automl.results :pr:`1244`
     * Added ``HighVarianceCVDataCheck`` and replaced synonymous warning in ``AutoMLSearch`` :pr:`1254`
     * Added `PCA Transformer` component for dimensionality reduction :pr:`1270`
+    * Added ``max_batches`` parameter to `AutoMLSearch` :pr:`1320`
Contributor:

We added this previously, so can you change this to:

Make max_batches argument to AutoMLSearch.search public

if _max_batches:
    if not max_batches:
        max_batches = _max_batches
    logger.warning("`_max_batches` will be deprecated in the next release. Use `max_batches` instead.")
Contributor:

RE comments above: let's delete this, and _max_batches entirely, in favor of max_batches.

self.max_batches = max_batches
# This is the default value for IterativeAlgorithm - setting this explicitly makes sure that
# the behavior of max_batches does not break if IterativeAlgorithm is changed.
self._pipelines_per_batch = 5
Contributor:

Could we add _pipelines_per_batch as an automl search parameter as well? I do think we should keep that one private while we're getting parallel support in place, and then once we're confident we wanna keep it we can consider making it public. But it'll be nice to have external control over this for perf testing purposes. Ok to do this separately, or file a separate issue.
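For illustration, a minimal sketch of what that knob could look like, using a toy stand-in class rather than evalml's actual AutoMLSearch; the name AutoMLSearchSketch, the n_allowed_estimators argument, and the max_iterations_for_batches helper are hypothetical, while the batch accounting mirrors the test suggestion later in this thread:

```python
class AutoMLSearchSketch:
    """Toy stand-in for AutoMLSearch, showing only the batching-related knobs."""

    def __init__(self, n_allowed_estimators, max_batches=None, _pipelines_per_batch=5):
        # max_batches is the public knob added by this PR; _pipelines_per_batch stays
        # "private" (leading underscore) while parallel support is worked out, per the
        # suggestion above.
        self.max_batches = max_batches
        self._pipelines_per_batch = _pipelines_per_batch
        self._n_allowed_estimators = n_allowed_estimators

    def max_iterations_for_batches(self):
        # Mirrors the accounting used in the test suggestion later in this thread:
        # 1 baseline pipeline, one pipeline per allowed estimator in the first batch,
        # then _pipelines_per_batch pipelines in every later batch.
        if self.max_batches is None:
            return None
        return 1 + self._n_allowed_estimators + self._pipelines_per_batch * (self.max_batches - 1)
```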

Contributor Author:

Sure. I'll file this as a separate issue.

Contributor:

Thanks!

@@ -435,7 +443,9 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
        logger.info("Optimizing for %s. " % self.objective.name)
        logger.info("{} score is better.\n".format('Greater' if self.objective.greater_is_better else 'Lower'))

        if self.max_iterations is not None:
        if self.max_batches is not None:
            logger.info(f"Searching up to {self.max_batches} batches for a total of {self.max_iterations} pipelines. ")
Contributor:

👍

if self.max_batches:
    update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations, self._start, self.current_batch)
else:
    update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations, self._start)
Contributor:

Please get the current batch number from AutoMLAlgorithm instead:

self._automl_algorithm.batch_number

        status_update_format = "({current_batch}: {current_iteration}/{max_iterations}) {pipeline_name} Elapsed:{time_elapsed}"
    else:
        status_update_format = "({current_iteration}/{max_iterations}) {pipeline_name} Elapsed:{time_elapsed}"
    format_params = {'current_batch': current_batch, 'max_iterations': max_iterations, 'current_iteration': current_iteration}
else:
    status_update_format = "{pipeline_name} Elapsed: {time_elapsed}"
    format_params = {}
Contributor:

Let's clean this method up:

elapsed_time = time_elapsed(start_time)
if current_batch is not None:
    logger.info(f"({current_batch}: {current_iteration}/{max_iterations}) {pipeline_name} Elapsed:{elapsed_time}")
elif max_iterations is not None:
    logger.info(f"{current_iteration}/{max_iterations}) {pipeline_name} Elapsed:{elapsed_time}")
else:
    logger.info(f"{pipeline_name} Elapsed: {elapsed_time}")

@christopherbunn (Contributor, author) replied:

> @christopherbunn I think the output looks good! I am ok with referring to the baseline as Baseline instead of Batch 1 or Batch 0, but my thought for labeling it as Batch 1 is that the baseline will get run if the user specifies max_batches=1, so it's included in that sense.

Ah I see, that makes sense. In that case, I'll count the baseline pipeline as Batch 1.

@bchen1116 (Contributor) left a comment

LGTM! Just left one question.

@@ -591,7 +596,7 @@ def _add_baseline_pipelines(self, X, y):
        desc = desc.ljust(self._MAX_NAME_LEN)

        update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations,
-                       self._start)
+                       self._start, '1')
Contributor:

Why is '1' passed in as a string instead of an int?

Contributor:

Hmm yeah, I think it's fine to pass as an int or a string, right? It's the current batch number, going into a format string in update_pipeline.

Actually, @christopherbunn, can we please use self._automl_algorithm.batch_number here instead? The automl algorithm is responsible for keeping track of the batch number. In this case it'll start at 0.

@dsherry (Contributor) left a comment

@christopherbunn great!

I left a couple comments about updating the batch number so it starts at 1 instead of 0 in the display output, and always using self._automl_algorithm.batch_number to get the batch number. I also left a suggestion for a test to check the log output is correct.

Otherwise, looks good!

@@ -489,7 +491,10 @@ def search(self, X, y, data_checks="auto", feature_types=None, show_iteration_pl
        desc = desc[:self._MAX_NAME_LEN - 3] + "..."
        desc = desc.ljust(self._MAX_NAME_LEN)

-       update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations, self._start)
+       if self.max_batches:
+           update_pipeline(logger, desc, len(self._results['pipeline_results']) + 1, self.max_iterations, self._start, self._automl_algorithm.batch_number)
Contributor:

@christopherbunn I see we're passing this batch number directly to the output. Does it start at 0, or at 1? RE someone else's comment, I think the batch should start at 1 in the log output.

Contributor:

Aha, I read further. Here's what I suggest:

  • Everywhere we call update_pipeline, pass the batch number in using self._automl_algorithm.batch_number. This will be 0 for the first batch
  • In update_pipeline, update the format string to print {current_batch+1} instead of {current_batch}

Sound good?

We should also add unit test coverage which checks that the right batch number prints out. My suggestion:

  • Set max_iterations = len(get_estimators(self.problem_type))+1 + _pipelines_per_batch*2 to run 3 batches
  • Mock pipeline fit/score to avoid long runtime
  • Run search, and using the caplog fixture, assert that Batch 1 appears len(get_estimators(self.problem_type)) + 1 times, and Batch 2 and Batch 3 each appear _pipelines_per_batch times.
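A rough sketch of that test (it sets max_batches=3 directly rather than deriving max_iterations, so the batch prefix is guaranteed to print; the X_y_binary fixture, the patch targets, and the get_estimators import path follow evalml's test conventions but are assumptions here):

```python
from unittest.mock import patch

from evalml.automl import AutoMLSearch
from evalml.pipelines.components.utils import get_estimators  # import path assumed


@patch('evalml.pipelines.BinaryClassificationPipeline.score', return_value={"Log Loss Binary": 0.5})
@patch('evalml.pipelines.BinaryClassificationPipeline.fit')
def test_batch_number_in_log_output(mock_fit, mock_score, X_y_binary, caplog):
    X, y = X_y_binary
    pipelines_per_batch = 5  # IterativeAlgorithm default noted earlier in this PR
    first_batch_size = len(get_estimators(problem_type="binary")) + 1  # baseline + one per estimator

    automl = AutoMLSearch(problem_type="binary", max_batches=3)
    automl.search(X, y)

    # Batch numbers are printed as "(batch: iteration/max_iterations)" per this PR.
    assert caplog.text.count("(1: ") == first_batch_size
    assert caplog.text.count("(2: ") == pipelines_per_batch
    assert caplog.text.count("(3: ") == pipelines_per_batch
```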

Contributor Author:

The test case sounds good, I'll add that in.

> In update_pipeline, update the format string to print {current_batch+1} instead of {current_batch}

@dsherry because the baseline pipeline is counted as its own separate batch, the displayed batch number would actually be off by one: batch_number is incremented by 1 when next_batch() is called, and next_batch() is called for the first batch before the output is printed, so Batch 1 would actually show up as Batch 2 with {current_batch+1}.

I could:

  1. Keep {current_batch+1} and move where we increment batch number to when the results are reported back
  2. Have the baseline pipeline show as batch 0 and use self._automl_algorithm.batch_number
  3. Override the output for the baseline (and not use self._automl_algorithm.batch_number) and show it as either batch 1, "baseline", or "base"

Imo 1 has the potential to get messy since we're modifying IterativeAlgorithm, so I want to avoid that. I'm leaning towards 2, although 3 seems fine to me as well. If we wanted "base" or "baseline" for 3, then I'll add another case to update_pipeline().
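A toy illustration of the sequencing described above, using a stand-in class rather than evalml's IterativeAlgorithm, just to show the increment-inside-next_batch() behavior:

```python
class ToyAlgorithm:
    """Stand-in for the batch counter behavior described above."""

    def __init__(self):
        self.batch_number = 0

    def next_batch(self):
        # batch_number is bumped when the batch is requested, before any progress lines print.
        self.batch_number += 1
        return [f"pipeline_{i}" for i in range(5)]


algo = ToyAlgorithm()
algo.next_batch()                        # first algorithm batch requested; batch_number is now 1
print(f"Batch {algo.batch_number + 1}")  # "Batch 2" -- the off-by-one described above
print(f"Batch {algo.batch_number}")      # "Batch 1" -- labeling from option 2/2b, using batch_number directly
```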

@dsherry (Contributor) commented Oct 23, 2020

Ah got it. Thanks.

I'll throw one more option on the pile:
2b. Label the baseline pipeline as batch 1 instead of batch 0. Use self._automl_algorithm.batch_number everywhere else, meaning the 1st batch shows up as batch 1, etc.

I think I agree with what you had before for the baseline in _add_baseline_pipelines, and we should keep the baseline pipeline displayed as "batch 1" along with the others.

But really, either 2 or 2b seem fine to me.

Sound good?

Contributor Author:

2b works for me, I'll manually set the baseline as batch 1 then. Thanks!


@christopherbunn merged commit 4c9e5a9 into main on Oct 23, 2020
@christopherbunn deleted the 1294_automl_batching branch on October 23, 2020 at 19:23
@dsherry mentioned this pull request on Oct 29, 2020
Linked issue: Show AutoML Batching to Users (#1294)