2470 report batch times automlsearch #3577
* Unpin nlp_primitives but disallow v2.60 * Update release_notes.rst * more constraint / version updates
* Throws error on describe if uninstantiated * Added test for Exception * Fixed linting and added to release_notes * Made changes that Becca and Jeremy suggested
Codecov Report
@@ Coverage Diff @@
## main #3577 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 335 335
Lines 33456 33512 +56
=======================================
+ Hits 33327 33383 +56
Misses 129 129
jeremyliweishih
left a comment
Just some initial comments - lmk if they make sense! I'll give it another final review after your edits. Thanks!
| def test_search_batch_times(caplog): | ||
| caplog.clear() | ||
| X, y = load_data( |
We can use a smaller dataset here: check out the X_y_binary fixture in conftest.py, and search for tests that use it as an example.
| additional_objectives=["auc", "f1", "precision"], | ||
| max_time=1, | ||
| max_iterations=1, | ||
| max_iterations=3, |
Why did we have to change this test?
that one was an oops, I think I changed that by accident.
| "provider": "categorical", | ||
| }, | ||
| ) | ||
| X_train, _, y_train, _ = evalml.preprocessing.split_data( |
We can skip out on all this logic about splitting and datachecks as well. Look at something like test_pipeline_score_raises for an example.
chukarsten
left a comment
Thanks for tackling this and for doing it so quickly!
| AutoMLSearchException: If all pipelines in the current AutoML batch produced a score of np.nan on the primary objective. | ||
| Returns: | ||
| Optinal Dict[int, Dict[str, str]]: Returns dict if timing is set to "return" or "both". |
typo: "Optional"
I think we can also just make it so that search always returns this timing dictionary. Conditional returns can be a little troublesome sometimes.
Do you think I should get rid of "return" and "both" as options for timing if we want it to always return this dictionary (meaning we only keep "log" as an option for timing)?
I'm sorry this took me so long, but yes, I think it's fine to always return the dictionary with everything in it.
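To illustrate the concern about conditional returns, here is a minimal sketch. The function names `search_conditional` and `search_always` and the timing values are hypothetical and are not evalml code; only the "return" / "both" options and the "Total time of batch" key come from the discussion above.

```python
def search_conditional(timing=None):
    """Hypothetical variant: the return type depends on the argument."""
    batch_times = {1: {"Total time of batch": 0.5}}  # made-up value
    if timing in ("return", "both"):
        return batch_times
    return None  # callers must remember to handle None


def search_always():
    """Hypothetical variant: always returns the timing dictionary."""
    return {1: {"Total time of batch": 0.5}}  # made-up value
```

With the second variant, callers can index into the result without first checking which option was passed.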
….com/alteryx/evalml into 2470-Report-batch-times-automlsearch
eccabay
left a comment
Huge fan of this, but I think we can make things a little clearer and more accessible for our users, most importantly what the argument that controls this behavior is and where it lives.
| Default: None | ||
| log=prints out batch/pipeline timing to console. |
Since the only options are "log" or None, I vote that we switch to a boolean flag for this, something like log_timing which defaults to false. It'd make both our lives as the implementers easier (checking a boolean instead of string equality, not having to validate the input) and make it clearer for users to boot.
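A rough sketch of the trade-off described above. The helper names `resolve_timing_option` and `should_log_timing` are hypothetical and exist only for illustration; they are not evalml functions.

```python
from typing import Optional


def resolve_timing_option(timing: Optional[str] = None) -> bool:
    """String-option variant: the input has to be validated before use."""
    valid_options = (None, "log")
    if timing not in valid_options:
        raise ValueError(f"timing must be one of {valid_options}, got {timing!r}")
    return timing == "log"


def should_log_timing(log_timing: bool = False) -> bool:
    """Boolean-flag variant: nothing to validate, the flag is used directly."""
    return log_timing
```

The boolean version removes both the string comparison and the error path, which is the simplification the comment argues for.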
| Returns: | ||
| Dict[int, Dict[str, Timestamp]]: Returns dict. | ||
| Key=batch #, value=Dict[key=pipeline name, value=timestamp of pipeline]. | ||
| Inner dict has key called "Total time of batch" with value=total time of batch. |
This is really hard to understand without reading closely. I'd refactor it, something more like:
Dict[int, Dict[str, Timestamp]]: Dictionary keyed by batch number that maps to the timings for pipelines run in that batch,
as well as the total time for each batch. Pipelines within a batch are labeled by pipeline name.
As a side note, = in docstrings really throws me off. It'd be better to stick to using colons, which maintain consistency with the rest of our docs!
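For readers who find the nested structure easier to see in code, here is a small sketch of the shape the docstring describes. The pipeline names and values are made up, and plain floats (seconds) are assumed in place of the Timestamp type; `summarize` is a hypothetical helper, not part of evalml.

```python
TOTAL_KEY = "Total time of batch"

# Hypothetical data: batch number -> {pipeline name -> time, plus the batch total}.
batch_times = {
    1: {"Pipeline A": 1.5, "Pipeline B": 2.0, TOTAL_KEY: 3.6},
    2: {"Pipeline C": 0.7, TOTAL_KEY: 0.8},
}


def summarize(batch_times):
    """Return one line per batch total, followed by one line per pipeline."""
    lines = []
    for batch_number, timings in batch_times.items():
        lines.append(f"Batch {batch_number}: total {timings[TOTAL_KEY]}")
        for name, elapsed in timings.items():
            if name != TOTAL_KEY:
                lines.append(f"  {name}: {elapsed}")
    return lines
```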
| log_title(self.logger, "Batch Time Stats") | ||
| log_batch_times(self.logger, batch_times) |
I would move the call to log_title into log_batch_times itself, since we don't need to call log_batch_times without setting the title as well.
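One way the suggested refactor could look, as a sketch built on the standard `logging` module. The bodies of `log_title` and `log_batch_times` here are stand-ins written for illustration; the real evalml helpers may format their output differently.

```python
import logging


def log_title(logger, title):
    """Stand-in title helper: logs the title followed by an underline."""
    logger.info(title)
    logger.info("=" * len(title))


def log_batch_times(logger, batch_times):
    """Log the section title, then every batch's timings, in a single call."""
    log_title(logger, "Batch Time Stats")
    for batch_number, timings in batch_times.items():
        logger.info("Batch %s", batch_number)
        for name, elapsed in timings.items():
            logger.info("%s: %s", name, elapsed)
```

The caller then only needs `log_batch_times(self.logger, batch_times)`, and the title can never be forgotten.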
| leading_char = "" | ||
| def search(self, show_iteration_plot=True): | ||
| def search(self, show_iteration_plot=True, timing=None): |
I think we should move this to be an argument in AutoMLSearch.__init__ instead of AutoMLSearch.search. Reason being, we have two ways for users to run search. This is one of them, but we're trying to move more over to running the top level search method instead of manually instantiating AutoMLSearch first. With the argument living here, users have no access to the argument.
If we move the arg to AutoMLSearch.__init__ and add it to the top level search methods as well, that will ensure users have full access to controlling this.
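A minimal sketch of the pattern being proposed, using toy stand-ins. `ToySearch` and `toy_search` are hypothetical names, and the boolean `timing` flag is an assumption; this is not the evalml implementation.

```python
class ToySearch:
    """Hypothetical stand-in for an AutoML search class."""

    def __init__(self, X, y, timing=False):
        self.X = X
        self.y = y
        self.timing = timing  # stored once, visible to every later method call

    def search(self):
        # Illustration only: report whether timing would be logged.
        return {"timing_enabled": self.timing}


def toy_search(X, y, timing=False):
    """Hypothetical top-level helper that builds the object and runs search."""
    automl = ToySearch(X, y, timing=timing)
    return automl.search()
```

Because the flag lives in the constructor, both entry points (manual instantiation and the top-level helper) can set it.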
jeremyliweishih
left a comment
Agreed with @eccabay's comments. @MichaelFu512 can you request a re-review once those changes are in? Thanks!
eccabay
left a comment
Awesome Michael, thanks for making all these changes! I just have one small comment, but other than that this is looking great.
| in time series problems, values should be passed in for the time_index, gap, forecast_horizon, and max_delay variables. | ||
| n_splits (int): Number of splits to use with the default data splitter. | ||
| verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False. | ||
| timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False. |
Nitpicky clarification: logging at the info level is not guaranteed to display that information in stdout; that will only happen if the logging level is set low enough to expose it. It'd be more accurate to say:
Whether or not to write pipeline search times to the logger. Defaults to False.
By default, if timing is set to True, the user would still not see the timings being logged in stdout since the default logging behavior is at the warning level (will not show info/debug level logs). To see this, they would either need to configure the logger themselves, or set verbose=True. Alternatively, they can choose to dump the log into a file or otherwise redirect that info, in which case it would be logged somewhere but not appearing in stdout.
Sorry for the info dump, I'm just intimately familiar with our logging behavior 😅
It's always good for me to learn more so "info dump(s)" are always great.
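The level behavior described above can be reproduced with the standard `logging` module alone. This sketch assumes a logger named `timing_demo` and a made-up message; it writes to an in-memory stream standing in for stdout and does not use evalml's own logger configuration.

```python
import io
import logging

stream = io.StringIO()  # stands in for stdout so the output can be inspected
logger = logging.getLogger("timing_demo")
logger.propagate = False  # keep the demo isolated from the root logger
logger.addHandler(logging.StreamHandler(stream))

# At the warning level, info messages are dropped before reaching the handler.
logger.setLevel(logging.WARNING)
logger.info("Batch 1 time: 3.6")
hidden = stream.getvalue()

# Lowering the level to info lets the same message through.
logger.setLevel(logging.INFO)
logger.info("Batch 1 time: 3.6")
shown = stream.getvalue()
```

After this runs, `hidden` is empty and `shown` contains the message once, which is why "write to the logger" is the more accurate wording than "display to stdout".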
| verbose (boolean): Whether or not to display semi-real-time updates to stdout while search is running. Defaults to False. | ||
| timing (boolean): Whether or not to display pipeline search times to stdout. Defaults to False. |
Same comment about stdout vs logging holds here.
Also, I think you're missing this argument in search_iterative?
….com/alteryx/evalml into 2470-Report-batch-times-automlsearch
Pull Request Description
Added an option to record batch and pipeline search times from automl.search() by using an optional parameter called "timing". Using timing = True will output the batch timings to stdout. automl.search() now also returns a dictionary that holds individual batch and pipeline times. There's also a value in the inner dictionary called "Total time of batch" which records how long the batch took in total.
Closes #2470