Enable ensembling as a parameter for DefaultAlgorithm
#3435
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main   #3435    +/-   ##
=======================================
+ Coverage   99.7%   99.7%    +0.1%
=======================================
  Files        334     334
  Lines      32959   32982      +23
=======================================
+ Hits       32829   32852      +23
  Misses       130     130
=======================================
* Fixes
    * Fix ``DefaultAlgorithm`` not handling Email and URL features :pr:`3419`
Didn't move this out in #3419.
@@ -208,7 +208,7 @@ def search(
     if data_check_result["level"] == DataCheckMessageType.ERROR.value:
         return None, data_check_results

-    automl = AutoMLSearch(automl_algorithm="default", **automl_config)
+    automl = AutoMLSearch(automl_algorithm="default", ensembling=True, **automl_config)
Keeping the same behavior for the top level search method.
@@ -300,11 +302,11 @@ def test_pipeline_limits(
     automl.search()
     out = caplog.text
     if verbose:
-        assert "Using default limit of max_batches=4." in out
-        assert "Searching up to 4 batches for a total of" in out
+        assert "Using default limit of max_batches=3." in out
Had to change this because ensembling is now turned off by default in `AutoMLSearch`.
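As a rough illustration of why the default dropped from `max_batches=4` to `max_batches=3` (a sketch under the assumption that the stacking ensemble accounts for exactly one extra batch; `default_max_batches` is a hypothetical helper, not EvalML API):

```python
# Hypothetical sketch, not EvalML's actual code: how a default batch limit
# could depend on the ensembling flag.
def default_max_batches(pipeline_batches=3, ensembling=False):
    # One extra batch is appended for the stacking ensemble when enabled.
    return pipeline_batches + (1 if ensembling else 0)

print(default_max_batches(ensembling=True))   # 4
print(default_max_batches(ensembling=False))  # 3
```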
Let's remember to update our perf test code! I think we still want the ability to test the ensembler pipeline in batch 4 when we kick off a job?
LGTM! Thanks!
One thing I forgot to check: do we need to update the documentation, since turning ensembling on/off for the iterative algorithm is called out there?
@@ -605,7 +605,7 @@
     "### Stacking\n",
     "[Stacking](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking) is an ensemble machine learning algorithm that involves training a model to best combine the predictions of several base learning algorithms. First, each base learning algorithm is trained using the given data. Then, the combining algorithm or meta-learner is trained on the predictions made by those base learning algorithms to make a final prediction.\n",
     "\n",
-    "AutoML enables stacking using the `ensembling` flag during initalization; this is set to `False` by default. The stacking ensemble pipeline runs in its own batch after a whole cycle of training has occurred (each allowed pipeline trains for one batch). Note that this means __a large number of iterations may need to run before the stacking ensemble runs__. It is also important to note that __only the first CV fold is calculated for stacking ensembles__ because the model internally uses CV folds."
+    "AutoML enables stacking using the `ensembling` flag during initialization; this is set to `False` by default. How ensembling runs is defined by the AutoML algorithm you choose. In the `IterativeAlgorithm`, the stacking ensemble pipeline runs in its own batch after a whole cycle of training has occurred (each allowed pipeline trains for one batch). Note that this means __a large number of iterations may need to run before the stacking ensemble runs__. It is also important to note that __only the first CV fold is calculated for stacking ensembles__ because the model internally uses CV folds. See the AutoML Algorithms section below for how ensembling is run by `DefaultAlgorithm`."
Made some doc changes here @chukarsten.
gucci
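The stacking description in the doc change above can be sketched in plain Python (a toy illustration, not EvalML's implementation; `stack_two` and the grid-search "meta-learner" are illustrative stand-ins):

```python
# Toy stacking illustration: base learners predict first, then a
# meta-learner is fit on their predictions to combine them.
def stack_two(model_a, model_b, X, y):
    """Fit a convex-combination meta-learner: pred = w*a(x) + (1-w)*b(x)."""
    def squared_error(w):
        return sum(
            (w * model_a(x) + (1 - w) * model_b(x) - target) ** 2
            for x, target in zip(X, y)
        )
    # A coarse grid search over the mixing weight stands in for the
    # meta-learner's training step.
    w = min((i / 100 for i in range(101)), key=squared_error)
    return lambda x: w * model_a(x) + (1 - w) * model_b(x)

# The target equals model_b exactly, so training drives the combined
# prediction toward model_b.
model_a = lambda x: 0.0
model_b = lambda x: x
stacked = stack_two(model_a, model_b, X=[1.0, 2.0, 3.0], y=[1.0, 2.0, 3.0])
print(stacked(2.0))  # 2.0
```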
Looks good @jeremyliweishih! Saw you already merged, but I was halfway through the review so figured I'd finish it.
Nothing blocking, and if we decide to address these comments we can do so in a follow-up.
@@ -735,7 +735,7 @@
     " a. For each of the previous top 3 estimators, sample 10 parameters from the tuner. Run all 30 in one batch\n",
     " b. Run ensembling\n",
     " \n",
-    "To this end, it is recommended to use the top level `search()` method to run `DefaultAlgorithm`. This allows users to specify running search with just the `mode` parameter, where `fast` is recommended for users who want a fast scan at how EvalML pipelines will perform on their problem and where `long` is reserved for a deeper dive into high performing pipelines. One can also specify `automl_algorithm='default'` using `AutoMLSearch` and it will default to using `fast` mode. Users are welcome to select `max_batches` according to the algorithm above (or other stopping criteria) but should be aware that results may not be optimal if the algorithm does not run for the full length of `fast` mode." |
+    "To this end, it is recommended to use the top level `search()` method to run `DefaultAlgorithm`. This allows users to specify running search with just the `mode` parameter, where `fast` is recommended for users who want a fast scan at how EvalML pipelines will perform on their problem and where `long` is reserved for a deeper dive into high performing pipelines. If one needs finer control over AutoML parameters, one can also specify `automl_algorithm='default'` using `AutoMLSearch` and it will default to using `fast` mode. However, in this case ensembling will be defined by the `ensembling` flag (if `ensembling=False`, the above-mentioned ensembling batches will be skipped). Users are welcome to select `max_batches` according to the algorithm above (or other stopping criteria) but should be aware that results may not be optimal if the algorithm does not run for the full length of `fast` mode."
Let's also say that ensembling for time series is not enabled.
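The batch schedule described in the doc hunk above can be paraphrased as a small sketch (batch names are illustrative, taken from the description; this is not EvalML code):

```python
# Illustrative batch schedule for a DefaultAlgorithm-style search.
def batch_plan(ensembling=True):
    plan = [
        "naive pipelines",
        "pipelines with feature selection",
        "top-3 estimators, 10 tuned parameter sets each (30 pipelines)",
    ]
    if ensembling:
        # With ensembling=False the ensembling batch is simply skipped.
        plan.append("stacking ensemble")
    return plan

print(len(batch_plan(ensembling=True)))   # 4
print(len(batch_plan(ensembling=False)))  # 3
```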
@@ -90,6 +90,7 @@ def __init__(
         n_jobs=-1,
         text_in_ensembling=False,
         top_n=3,
+        ensembling=True,
Should we set this to `False` to match the behavior in `AutoMLSearch`?
What I don't like about this is that we silently change the user's parameter value for time series if it's not `False`. Perhaps it would be better to raise an exception if ensembling is set to `True` for a time series problem?
will do!
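The reviewer's suggestion could look roughly like this (`validate_ensembling` is a hypothetical helper, not EvalML API):

```python
# Sketch of the suggestion above: raise instead of silently overriding the
# user's ensembling choice for time series problems.
def validate_ensembling(ensembling, is_time_series):
    if ensembling and is_time_series:
        raise ValueError(
            "Ensembling is not supported for time series problems."
        )
    return ensembling
```

Raising makes the conflict explicit at construction time rather than mutating the user's parameter behind their back.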