Skip to content

2470 report batch times automlsearch#3577

Merged
chukarsten merged 34 commits into
mainfrom
2470-Report-batch-times-automlsearch
Jul 8, 2022
Merged

2470 report batch times automlsearch#3577
chukarsten merged 34 commits into
mainfrom
2470-Report-batch-times-automlsearch

Conversation

@MichaelFu512

@MichaelFu512 MichaelFu512 commented Jun 21, 2022

Copy link
Copy Markdown
Contributor

Pull Request Description

Added an option to record batch and pipeline search times from automl.search() by using an optional parameter called "timing". Using timing = True will output the batch timings to stdout. automl.search() now also returns a dictionary that holds individual batches and pipelines times.

There's also a value in the inner dictionary called "Total time of batch" which records how long the batch took in total.

Closes #2470

@codecov

codecov Bot commented Jun 21, 2022

Copy link
Copy Markdown

Codecov Report

Merging #3577 (93aec54) into main (a303470) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3577     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33456   33512     +56     
=======================================
+ Hits       33327   33383     +56     
  Misses       129     129             
Impacted Files Coverage Δ
evalml/automl/automl_search.py 99.6% <100.0%> (+0.1%) ⬆️
evalml/tests/automl_tests/test_automl.py 99.5% <100.0%> (+0.1%) ⬆️
evalml/tests/utils_tests/test_logger.py 100.0% <100.0%> (ø)
evalml/utils/logger.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a303470...93aec54. Read the comment docs.

@MichaelFu512 MichaelFu512 marked this pull request as ready for review June 21, 2022 17:49
@MichaelFu512 MichaelFu512 enabled auto-merge (squash) June 21, 2022 19:03

@jeremyliweishih jeremyliweishih left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some initial comments - lmk if they make sense! I'll give it another final review after your edits. Thanks!

Comment thread evalml/automl/automl_search.py Outdated
Comment thread evalml/automl/automl_search.py Outdated
Comment thread evalml/automl/automl_search.py Outdated

def test_search_batch_times(caplog):
caplog.clear()
X, y = load_data(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use a smaller dataset here: check out the X_y_binary fixture in contest.py and just search for tests that use it as an example.

additional_objectives=["auc", "f1", "precision"],
max_time=1,
max_iterations=1,
max_iterations=3,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we have to change this test?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that one was an oops, I think I changed that by accident.

"provider": "categorical",
},
)
X_train, _, y_train, _ = evalml.preprocessing.split_data(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can skip out on all this logic about splitting and datachecks as well. Look at something like test_pipeline_score_raises for an example.

@chukarsten chukarsten left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this and for doing it so quickly!

Comment thread evalml/automl/automl_search.py Outdated
Comment thread evalml/automl/automl_search.py Outdated
AutoMLSearchException: If all pipelines in the current AutoML batch produced a score of np.nan on the primary objective.

Returns:
Optinal Dict[int, Dict[str, str]]: Returns dict if timing is set to "return" or "both".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "Optional"

I think we can also just make it so that search always returns this timing dictionary. I think conditional returns is a little troublesome, sometimes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think I should get rid of "return" and "both" as options for timing if we want it to always return this dictionary (meaning we only keep "log" as an option for timing)?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry this took me so long, but yea, I think yes, it's fine to always return the dictionary with all the things in it.

Comment thread evalml/automl/automl_search.py Outdated
Comment thread evalml/automl/automl_search.py Outdated
Comment thread evalml/automl/automl_search.py Outdated

@eccabay eccabay left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huge fan of this, but I think we can make things a little clearer/more accessible for our users, most importantly where and what the argument that controls the behavior lives.

Comment thread evalml/automl/automl_search.py Outdated
Comment on lines +868 to +869
Default: None
log=prints out batch/pipeline timing to console.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the only options are "log" or None, I vote that we switch to a boolean flag for this, something like log_timing which defaults to false. It'd make both our lives as the implementers easier (checking a boolean instead of string equality, not having to validate the input) and make it clearer for users to boot.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed as well!

Comment thread evalml/automl/automl_search.py Outdated
Comment on lines +875 to +878
Returns:
Dict[int, Dict[str, Timestamp]]: Returns dict.
Key=batch #, value=Dict[key=pipeline name, value=timestamp of pipeline].
Inner dict has key called "Total time of batch" with value=total time of batch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really hard to understand without reading closely. I'd refactor it, something more like:

Dict[int, Dict[str, Timestamp]]: Dictionary keyed by batch number that maps to the timings for pipelines run in that batch, 
as well as the total time for each batch. Pipelines within a batch are labeled by pipeline name. 

As a side note, = in docstrings really throws me off. It'd be better to stick to using colons, which maintain consistency with the rest of our docs!

Comment thread evalml/automl/automl_search.py Outdated
Comment on lines +1031 to +1032
log_title(self.logger, "Batch Time Stats")
log_batch_times(self.logger, batch_times)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move the call to log_title into log_batch_times itself, since we don't need to call log_batch_times without setting the title as well.

Comment thread evalml/automl/automl_search.py Outdated
leading_char = ""

def search(self, show_iteration_plot=True):
def search(self, show_iteration_plot=True, timing=None):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move this to be an argument in AutoMLSearch.__init__ instead of AutoMLSearch.search. Reason being, we have two ways for users to run search. This is one of them, but we're trying to move more over to running the top level search method instead of manually instantiating AutoMLSearch first. With the argument living here, users have no access to the argument.

If we move the arg to AutoMLSearch.__init__ and add it to the top level search methods as well, that will ensure users have full access to controlling this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed - thanks for covering this @eccabay!

@jeremyliweishih jeremyliweishih left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @eccabay's comments. @MichaelFu512 can you request a re-review once those changes are in? Thanks!

Comment thread evalml/automl/automl_search.py Outdated
leading_char = ""

def search(self, show_iteration_plot=True):
def search(self, show_iteration_plot=True, timing=None):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed - thanks for covering this @eccabay!

Comment thread evalml/automl/automl_search.py Outdated
Comment on lines +868 to +869
Default: None
log=prints out batch/pipeline timing to console.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed as well!

@MichaelFu512 MichaelFu512 disabled auto-merge July 5, 2022 20:53

@eccabay eccabay left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome Michael, thanks for making all these changes! I just have one small comment, but other than that this is looking great.

Comment thread evalml/automl/automl_search.py Outdated
in time series problems, values should be passed in for the time_index, gap, forecast_horizon, and max_delay variables.
n_splits (int): Number of splits to use with the default data splitter.
verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False.
timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicky clarification: logging info is not guaranteed to display that information in stdout, that will only happen if the logging level is set low enough to expose it. It'd be more accurate to say:

Whether or not to write pipeline search times to the logger. Defaults to False.

By default, if timing is set to True, the user would still not see the timings being logged in stdout since the default logging behavior is at the warning level (will not show info/debug level logs). To see this, they would either need to configure the logger themselves, or set verbose=True. Alternatively, they can choose to dump the log into a file or otherwise redirect that info, in which case it would be logged somewhere but not appearing in stdout.

Sorry for the info dump, I'm just intimately familiar with our logging behavior 😅

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always good for me to learn more so "info dump(s)" are always great.

Comment thread evalml/automl/automl_search.py Outdated

verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False.

timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about stdout vs logging holds here.

Also, I think you're missing this argument in search_iterative?

@chukarsten chukarsten merged commit 7610736 into main Jul 8, 2022
@chukarsten chukarsten deleted the 2470-Report-batch-times-automlsearch branch July 8, 2022 16:28
@chukarsten chukarsten mentioned this pull request Jul 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Report Batch/Pipeline Times in AutoMLSearch

5 participants