Added AutoML function to access ensemble pipeline's input pipelines IDs #3011
Conversation
Codecov Report

```diff
@@           Coverage Diff           @@
##            main   #3011   +/-   ##
=====================================
+ Coverage   99.7%   99.7%   +0.1%
=====================================
  Files        312     312
  Lines      29856   29893     +37
=====================================
+ Hits       29765   29802     +37
  Misses        91      91
```

Continue to review the full report at Codecov.

Force-pushed from 289c34f to 7575808
```python
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    max_batches=two_stacking_batches,
    objective="Log Loss Binary",
    ensembling=True,
)

test_env = AutoMLTestEnv("binary")
with test_env.test_context(mock_score_side_effect=score_side_effect):
    automl.search()
pipeline_names = automl.rankings["pipeline_name"]
assert pipeline_names.str.contains("Ensemble").any()

ensemble_ids = [
    _get_first_stacked_classifier_no() - 1,
    len(automl.results["pipeline_results"]) - 1,
]

final_best_pipeline_ids = [
    pipeline["id"]
    for pipeline in list(automl._automl_algorithm._best_pipeline_info.values())
]
final_best_pipeline_ids.sort()

input_pipeline_0_ids = automl.get_ensembler_input_pipelines(ensemble_ids[0])
input_pipeline_0_ids.sort()

input_pipeline_1_ids = automl.get_ensembler_input_pipelines(ensemble_ids[1])
input_pipeline_1_ids.sort()
```
Here, we're rerunning AutoMLSearch so that we can get two ensemble pipelines (and to make sure that the IDs for best pipeline are updated). I did this since there isn't a super easy way to verify that the first ensemble pipeline generated is correct. Def open to ideas/suggestions!
Thoughts on rewriting test_describe_pipeline_with_ensembling to also call this new method, as opposed to writing another test? There, we checked that the input pipeline IDs for the second ensemble are all greater than the input IDs of the first ensemble, since the scores are always decreasing, which sounds reasonable to me.
I'm ok with whatever you decide. This looks great to me as is!
I am open to either, but I'm leaning slightly towards keeping it as-is. If we need to update `get_ensembler_input_pipelines()` in the future, there's a single point to update, and it keeps this particular edge case together. The runtime cost of another AutoMLSearch run is relatively minimal for this mocked example.
Thank you @christopherbunn !!
evalml/automl/automl_search.py
Outdated
```python
pipeline_results = self._results["pipeline_results"]
if (
    ensemble_pipeline_id not in pipeline_results.keys()
    or "input_pipeline_ids" not in pipeline_results[ensemble_pipeline_id].keys()
```
nit: We can get rid of `.keys()` here.
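As a quick illustration of this nit (using a toy dict rather than the real `_results` structure, whose contents here are made up), the `in` operator on a dict already checks keys, so `.keys()` is redundant:

```python
# Toy stand-in for self._results["pipeline_results"]: maps pipeline IDs
# to result dicts, mirroring the shape shown in the diff above.
pipeline_results = {4: {"input_pipeline_ids": [0, 1, 2, 3]}, 5: {}}

# `in` on a dict performs a key lookup directly; `.keys()` adds nothing.
assert (4 in pipeline_results) == (4 in pipeline_results.keys())

# The same simplification applies to the nested membership check.
assert "input_pipeline_ids" in pipeline_results[4]
assert "input_pipeline_ids" not in pipeline_results[5]
```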
evalml/automl/automl_search.py
Outdated
```diff
@@ -1590,3 +1590,28 @@ def plot(self):
     @property
     def _sleep_time(self):
         return self._SLEEP_TIME
+
+    def get_ensembler_input_pipelines(self, ensemble_pipeline_id):
+        """Score a list of pipelines on the given holdout data.
```
Docstring first sentence doesn't match what the method does.
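For reference, here is a minimal sketch of what the method with a matching docstring could look like. The class name and the stored-results shape are assumptions pieced together from the diff excerpts in this thread, not the final implementation:

```python
class AutoMLSearchSketch:
    """Toy stand-in for AutoMLSearch, holding a pre-populated results dict."""

    def __init__(self, pipeline_results):
        # Mirrors the self._results["pipeline_results"] access in the diff above.
        self._results = {"pipeline_results": pipeline_results}

    def get_ensembler_input_pipelines(self, ensemble_pipeline_id):
        """Returns a list of input pipeline IDs given an ensembler pipeline ID.

        Args:
            ensemble_pipeline_id (int): Ensemble pipeline ID.

        Returns:
            list[int]: IDs of the pipelines used as input to the ensembler.

        Raises:
            ValueError: If ensemble_pipeline_id is not a valid ensemble pipeline.
        """
        pipeline_results = self._results["pipeline_results"]
        if (
            ensemble_pipeline_id not in pipeline_results
            or "input_pipeline_ids" not in pipeline_results[ensemble_pipeline_id]
        ):
            raise ValueError("Pipeline ID is not a valid ensemble pipeline")
        return pipeline_results[ensemble_pipeline_id]["input_pipeline_ids"]


# Usage on toy data: pipeline 7 is the ensemble, built from pipelines 0-3.
automl = AutoMLSearchSketch({7: {"input_pipeline_ids": [0, 1, 2, 3]}})
assert automl.get_ensembler_input_pipelines(7) == [0, 1, 2, 3]
```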
```python
with test_env.test_context(mock_score_side_effect=score_side_effect):
    automl.search()
pipeline_names = automl.rankings["pipeline_name"]
assert pipeline_names.str.contains("Ensemble").any()
```
Is this assert carried over from a different ensemble test? I noticed that we don't test it above, and it's not exactly what we're testing here, so we could remove it?
Yep, removed this line.
Looks good! I wonder if it's worth sorting the output of `get_ensembler_input_pipelines`, but otherwise 🚢
```python
    pipeline["id"]
    for pipeline in list(automl._automl_algorithm._best_pipeline_info.values())
]
best_pipeline_ids.sort()
```
Q: What do you think about just sorting the output of `get_ensembler_input_pipelines`? Is there ever a case where we wouldn't want to?
@angela97lin I think the pipeline IDs come out in model family order. In the case of the IterativeAlgorithm, it comes out in the order of the first batch. It's a minor detail, but this is the same order that is fed into `_create_ensemble`, so all of the input pipelines in the ensemble pipeline's graph are in this order from top-down.
@christopherbunn Isn't `_best_pipeline_info` a dictionary? So the order isn't guaranteed, unless we're assuming Python preserves insertion order when converting the dict to a list?
@angela97lin From Python 3.7 onwards, dictionaries are guaranteed to preserve insertion order: https://stackoverflow.com/questions/60848/how-do-you-retrieve-items-from-a-dictionary-in-the-order-that-theyre-inserted
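A quick demonstration of that guarantee (the model family keys and IDs here are made up for illustration, not taken from a real `_best_pipeline_info`):

```python
# Dicts preserve insertion order in Python 3.7+ (CPython 3.6 as an
# implementation detail), so .values() iterates in the order items were added.
best_pipeline_info = {}
best_pipeline_info["Random Forest"] = {"id": 3}
best_pipeline_info["Extra Trees"] = {"id": 1}
best_pipeline_info["LightGBM"] = {"id": 5}

ids = [info["id"] for info in best_pipeline_info.values()]
assert ids == [3, 1, 5]  # insertion order, not sorted order
```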
Ah right! 😂
Okay, back to the original question then: is it better to sort or not? Sounds like there's no benefit to doing so, so LGTM! My original question was because I noticed we were sorting the outputs in the tests to compare, and I wondered if there was any value in changing the method to sort. It's not too difficult to add later if we change our minds :)
Ah gotcha! Yeah to be clear I think it would be best to retain insertion order for the output rather than sort it, but it's not necessary for now. I have it sorted for this test case so it's easier to compare values.
Force-pushed from 6c377ef to 9282a01
Resolves #3008