Added AutoML function to access ensemble pipeline's input pipeline IDs #3011
christopherbunn merged 5 commits into main
Conversation
Codecov Report
@@            Coverage Diff            @@
##             main    #3011    +/-   ##
=========================================
+ Coverage    99.7%    99.7%    +0.1%
=========================================
  Files         312      312
  Lines       29856    29893      +37
=========================================
+ Hits        29765    29802      +37
  Misses         91       91
=========================================

Continue to review the full report at Codecov.
289c34f to 7575808
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    max_batches=two_stacking_batches,
    objective="Log Loss Binary",
    ensembling=True,
)

test_env = AutoMLTestEnv("binary")
with test_env.test_context(mock_score_side_effect=score_side_effect):
    automl.search()
pipeline_names = automl.rankings["pipeline_name"]
assert pipeline_names.str.contains("Ensemble").any()

ensemble_ids = [
    _get_first_stacked_classifier_no() - 1,
    len(automl.results["pipeline_results"]) - 1,
]

final_best_pipeline_ids = [
    pipeline["id"]
    for pipeline in list(automl._automl_algorithm._best_pipeline_info.values())
]
final_best_pipeline_ids.sort()

input_pipeline_0_ids = automl.get_ensembler_input_pipelines(ensemble_ids[0])
input_pipeline_0_ids.sort()

input_pipeline_1_ids = automl.get_ensembler_input_pipelines(ensemble_ids[1])
input_pipeline_1_ids.sort()
Here, we're rerunning AutoMLSearch so that we can get two ensemble pipelines (and to make sure that the IDs for the best pipelines are updated). I did this since there isn't an easy way to verify that the first ensemble pipeline generated is correct. Definitely open to ideas/suggestions!
Thoughts on rewriting test_describe_pipeline_with_ensembling to also call this new method, as opposed to writing another test? There, we checked that the input pipeline IDs for the second ensemble are all greater than the input IDs of the first ensemble, since the scores are always decreasing, which sounds reasonable to me.
I'm ok with whatever you decide. This looks great to me as is!
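The monotonic-ID check described above could be sketched like this (the ID lists are made-up stand-ins; in the real test they would come from automl.get_ensembler_input_pipelines(...)):

```python
# Hypothetical ID lists standing in for the values the test would pull from
# automl.get_ensembler_input_pipelines(...); the real IDs come from AutoMLSearch.
first_ensemble_input_ids = [1, 2, 3]
second_ensemble_input_ids = [9, 10, 11]

# AutoMLSearch assigns pipeline IDs in increasing order, so every input
# pipeline of the second ensemble should have a larger ID than any input
# pipeline of the first ensemble.
assert min(second_ensemble_input_ids) > max(first_ensemble_input_ids)
```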
I am open to either, but I'm leaning slightly towards keeping it as-is. If we need to update get_ensembler_input_pipelines() in the future there's a single point to update, and it keeps this particular edge case together. The runtime difference of another AutoMLSearch is relatively minimal for this mocked example.
freddyaboulton
left a comment
Thank you @christopherbunn !!
evalml/automl/automl_search.py
Outdated
pipeline_results = self._results["pipeline_results"]
if (
    ensemble_pipeline_id not in pipeline_results.keys()
    or "input_pipeline_ids" not in pipeline_results[ensemble_pipeline_id].keys()
nit: We can get rid of .keys() here.
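For illustration, a standalone sketch of the lookup with the redundant .keys() calls dropped (the dict contents are mocked up, and ValueError is only a placeholder — the diff doesn't show which exception type the real method raises):

```python
# Mocked-up results dict; the real method reads self._results["pipeline_results"].
pipeline_results = {
    0: {"id": 0},
    3: {"id": 3, "input_pipeline_ids": [0, 1, 2]},  # pretend ensemble entry
}

def get_ensembler_input_pipelines(ensemble_pipeline_id):
    # `in` on a dict already tests membership against its keys,
    # so the explicit `.keys()` calls are redundant.
    if (
        ensemble_pipeline_id not in pipeline_results
        or "input_pipeline_ids" not in pipeline_results[ensemble_pipeline_id]
    ):
        # Placeholder error type; evalml may use its own exception class.
        raise ValueError(
            f"Pipeline ID {ensemble_pipeline_id} is not an ensemble pipeline"
        )
    return pipeline_results[ensemble_pipeline_id]["input_pipeline_ids"]

print(get_ensembler_input_pipelines(3))  # [0, 1, 2]
```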
evalml/automl/automl_search.py
Outdated
    return self._SLEEP_TIME

def get_ensembler_input_pipelines(self, ensemble_pipeline_id):
    """Score a list of pipelines on the given holdout data.
Docstring first sentence doesn't match what the method does.
with test_env.test_context(mock_score_side_effect=score_side_effect):
    automl.search()
pipeline_names = automl.rankings["pipeline_name"]
assert pipeline_names.str.contains("Ensemble").any()
Is this assert carried over from a different ensemble test? I noticed that we don't test it above, and it's not exactly what we're testing here, so we could remove it?
Yep, removed this line.
angela97lin
left a comment
Looks good! I wonder if it's worth sorting the output of get_ensembler_input_pipelines, but otherwise 🚢
    pipeline["id"]
    for pipeline in list(automl._automl_algorithm._best_pipeline_info.values())
]
best_pipeline_ids.sort()
Q: What do you think about just sorting the output for get_ensembler_input_pipelines? Is there ever a case where we wouldn't want to?
@angela97lin I think the pipeline IDs come out in model family order. In the case of the IterativeAlgorithm, it comes out in the order of the first batch. It's a minor detail but this is the same order that is fed into _create_ensemble, so all of the input pipelines in the ensemble pipeline's graph are in this order from top-down.
@christopherbunn Isn't _best_pipeline_info a dictionary? So the order isn't guaranteed, unless we're assuming the Python dict --> list order is the order in which items were added to the dict?
@angela97lin From Python 3.7 onwards dictionaries are guaranteed to be insertion order: https://stackoverflow.com/questions/60848/how-do-you-retrieve-items-from-a-dictionary-in-the-order-that-theyre-inserted
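A quick demonstration of the guarantee being discussed (CPython 3.7+, where dict preserves insertion order; the dict name mirrors _best_pipeline_info but the contents are invented):

```python
best_pipeline_info = {}
for pipeline_id in (4, 1, 7):  # inserted deliberately out of numeric order
    best_pipeline_info[pipeline_id] = {"id": pipeline_id}

# dict.values() iterates in insertion order (guaranteed since Python 3.7),
# not in sorted-key order.
ids = [info["id"] for info in best_pipeline_info.values()]
print(ids)  # [4, 1, 7]
```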
Ah right! 😂
Okay, back to the original question then: is it better to sort or not? Sounds like there's no benefit to doing so, LGTM! My original question was because I noticed we were sorting the outputs in the tests to compare, and wondered if there was any value in changing the method to sort. It's not too difficult to add later if we change our minds :)
Ah, gotcha! Yeah, to be clear, I think it would be best to retain insertion order for the output rather than sort it, but it's not necessary for now. I have it sorted for this test case so it's easier to compare values.
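Sorting only inside the test, as described, keeps the method's insertion-order output intact while still making the comparison order-insensitive. A minimal sketch with made-up stand-in values:

```python
# Stand-in values; in the test these come from get_ensembler_input_pipelines()
# and from _best_pipeline_info, respectively.
returned_ids = [5, 2, 9]
expected_ids = [2, 9, 5]

# Comparing sorted copies makes the assertion order-insensitive, while the
# method's own return value keeps its original insertion order.
assert sorted(returned_ids) == sorted(expected_ids)
```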
6c377ef to 9282a01
Resolves #3008