
Enable ensembling as a parameter for DefaultAlgorithm #3435

Merged
jeremyliweishih merged 16 commits into main on Apr 4, 2022

Conversation

jeremyliweishih
Collaborator

No description provided.

@codecov

codecov bot commented Mar 31, 2022

Codecov Report

Merging #3435 (cf4a1d4) into main (df13ed9) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3435     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        334     334             
  Lines      32959   32982     +23     
=======================================
+ Hits       32829   32852     +23     
  Misses       130     130             
Impacted Files Coverage Δ
...valml/automl/automl_algorithm/default_algorithm.py 100.0% <100.0%> (ø)
evalml/automl/automl_search.py 99.7% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.5% <100.0%> (+0.1%) ⬆️
...valml/tests/automl_tests/test_default_algorithm.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df13ed9...cf4a1d4. Read the comment docs.

* Fixes
* Fix ``DefaultAlgorithm`` not handling Email and URL features :pr:`3419`
Collaborator Author

Didn't move this out in #3419.

@@ -208,7 +208,7 @@ def search(
if data_check_result["level"] == DataCheckMessageType.ERROR.value:
return None, data_check_results

- automl = AutoMLSearch(automl_algorithm="default", **automl_config)
+ automl = AutoMLSearch(automl_algorithm="default", ensembling=True, **automl_config)
Collaborator Author

Keeping the same behavior for the top level search method.
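For context, a minimal sketch of the behavior this hunk preserves, assuming the `AutoMLSearch` parameters shown in the diff and a demo dataset purely for illustration:

```python
import evalml
from evalml.automl import AutoMLSearch

# Any small demo dataset works here; breast cancer is just a binary example.
X, y = evalml.demos.load_breast_cancer()
automl_config = {"X_train": X, "y_train": y, "problem_type": "binary"}

# The top-level search() keeps ensembling on by passing the flag explicitly,
# so its behavior is unchanged even though AutoMLSearch itself now defaults
# to ensembling=False.
automl = AutoMLSearch(automl_algorithm="default", ensembling=True, **automl_config)
automl.search()
```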

@@ -300,11 +302,11 @@ def test_pipeline_limits(
automl.search()
out = caplog.text
if verbose:
assert "Using default limit of max_batches=4." in out
assert "Searching up to 4 batches for a total of" in out
assert "Using default limit of max_batches=3." in out
Collaborator Author

Had to change because ensembling is turned off by default in AutoMLSearch.

Contributor

Let's remember to update our perf test code! I think we still want the ability to test the ensembler pipeline in batch 4 when we kick off a job?
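A hypothetical illustration (not the actual AutoMLSearch internals) of the default-limit change explained in the author's comment above: three search batches by default now, with a fourth batch only when ensembling adds its own batch.

```python
def default_max_batches(ensembling: bool) -> int:
    """Hypothetical helper mirroring the test expectation:
    three search batches, plus one extra batch for the stacked
    ensemble when ensembling is enabled."""
    return 4 if ensembling else 3

assert default_max_batches(ensembling=False) == 3  # new AutoMLSearch default
assert default_max_batches(ensembling=True) == 4   # previous default behavior
```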

Contributor
@chukarsten left a comment

LGTM! Thanks!

@chukarsten
Contributor

One thing I forgot to check - do we need to update the documentation, since turning ensembling on/off for the iterative algorithm is called out there?

@@ -605,7 +605,7 @@
"### Stacking\n",
"[Stacking](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking) is an ensemble machine learning algorithm that involves training a model to best combine the predictions of several base learning algorithms. First, each base learning algorithms is trained using the given data. Then, the combining algorithm or meta-learner is trained on the predictions made by those base learning algorithms to make a final prediction.\n",
"\n",
"AutoML enables stacking using the `ensembling` flag during initalization; this is set to `False` by default. The stacking ensemble pipeline runs in its own batch after a whole cycle of training has occurred (each allowed pipeline trains for one batch). Note that this means __a large number of iterations may need to run before the stacking ensemble runs__. It is also important to note that __only the first CV fold is calculated for stacking ensembles__ because the model internally uses CV folds."
"AutoML enables stacking using the `ensembling` flag during initalization; this is set to `False` by default. How ensembling runs is defined by the AutoML algorithm you choose. In the `IterativeAlgorithm`, the stacking ensemble pipeline runs in its own batch after a whole cycle of training has occurred (each allowed pipeline trains for one batch). Note that this means __a large number of iterations may need to run before the stacking ensemble runs__. It is also important to note that __only the first CV fold is calculated for stacking ensembles__ because the model internally uses CV folds. See below in the AutoML Algorithms section to see how ensembling is run for `DefaultAlgorithm`."
Collaborator Author

Made some doc changes here @chukarsten.

Contributor

gucci
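As a rough usage sketch of the `IterativeAlgorithm` behavior described in the updated doc paragraph above; the dataset and the `max_batches` value are illustrative only:

```python
import evalml
from evalml.automl import AutoMLSearch

X, y = evalml.demos.load_breast_cancer()

# With the iterative algorithm, the stacking ensemble runs in its own batch
# only after every allowed pipeline has trained for one batch, so max_batches
# must be large enough to reach that extra batch.
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    automl_algorithm="iterative",
    ensembling=True,
    max_batches=10,  # illustrative; must exceed one full cycle of allowed pipelines
)
automl.search()
```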

@jeremyliweishih jeremyliweishih merged commit f05332d into main Apr 4, 2022
Contributor
@freddyaboulton left a comment

Looks good @jeremyliweishih! Saw you already merged but was halfway through the review so figured I'd finish it.

Nothing blocking, and if we decide to address these comments we can do so in a follow-up.

@@ -735,7 +735,7 @@
" a. For each of the previous top 3 estimators, sample 10 parameters from the tuner. Run all 30 in one batch\n",
" b. Run ensembling\n",
" \n",
"To this end, it is recommended to use the top level `search()` method to run `DefaultAlgorithm`. This allows users to specify running search with just the `mode` parameter, where `fast` is recommended for users who want a fast scan at how EvalML pipelines will perform on their problem and where `long` is reserved for a deeper dive into high performing pipelines. One can also specify `automl_algorithm='default'` using `AutoMLSearch` and it will default to using `fast` mode. Users are welcome to select `max_batches` according to the algorithm above (or other stopping criteria) but should be aware that results may not be optimal if the algorithm does not run for the full length of `fast` mode."
"To this end, it is recommended to use the top level `search()` method to run `DefaultAlgorithm`. This allows users to specify running search with just the `mode` parameter, where `fast` is recommended for users who want a fast scan at how EvalML pipelines will perform on their problem and where `long` is reserved for a deeper dive into high performing pipelines. If one needs finer control over AutoML parameters, one can also specify `automl_algorithm='default'` using `AutoMLSearch` and it will default to using `fast` mode. However, in this case ensembling will be defined by the `ensembling` flag (if `ensembling=False` the abovementioned ensembling batches will be skipped). Users are welcome to select `max_batches` according to the algorithm above (or other stopping criteria) but should be aware that results may not be optimal if the algorithm does not run for the full length of `fast` mode."
Contributor

Let's say ensembling for time series is not enabled
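A minimal sketch of the two entry points contrasted in the doc paragraph above, assuming the `search()` signature referenced earlier in this PR and a demo dataset for illustration:

```python
import evalml
from evalml.automl import AutoMLSearch, search

X, y = evalml.demos.load_breast_cancer()

# Recommended path: the top-level search() drives DefaultAlgorithm and only
# asks the user for a mode ("fast" for a quick scan, "long" for a deeper run).
automl, data_check_results = search(
    X_train=X, y_train=y, problem_type="binary", mode="fast"
)

# Finer-grained path: AutoMLSearch with automl_algorithm="default". Here the
# ensembling batches only run if the ensembling flag is set; leaving it at
# the False default skips them, per this PR.
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    automl_algorithm="default",
    ensembling=True,
)
automl.search()
```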

@@ -90,6 +90,7 @@ def __init__(
n_jobs=-1,
text_in_ensembling=False,
top_n=3,
+ ensembling=True,
Contributor

Should we set this to False to match the behavior in AutoMLSearch?

What I don't like about this is that we change the user's parameter value silently for time series if it's not False. Perhaps it would be better to raise an exception if ensembling is set to True for time series?

Collaborator Author

will do!
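A hypothetical validation sketch of the reviewer's suggestion above; this is not the merged implementation, and it assumes the `is_time_series` helper exposed by `evalml.problem_types`:

```python
from evalml.problem_types import is_time_series


def validate_ensembling_for_problem_type(problem_type, ensembling):
    """Hypothetical guard: fail loudly instead of silently flipping the
    user's ensembling flag for time series problems."""
    if ensembling and is_time_series(problem_type):
        raise ValueError(
            "Ensembling is not available for time series problems; "
            "pass ensembling=False."
        )
    return ensembling
```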

