Added AutoML function to access ensemble pipeline's input pipelines IDs #3011

christopherbunn · 2021-11-04T18:20:13Z

Resolves #3008

codecov · 2021-11-04T18:23:35Z

Codecov Report

Merging #3011 (6c377ef) into main (9c21720) will increase coverage by 0.1%.
The diff coverage is 100.0%.

❗ Current head 6c377ef differs from pull request most recent head 9282a01. Consider uploading reports for the commit 9282a01 to get more accurate results

@@           Coverage Diff           @@
##            main   #3011     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        312     312             
  Lines      29856   29893     +37     
=======================================
+ Hits       29765   29802     +37     
  Misses        91      91

| Impacted Files | Coverage Δ | |
|---|---|---|
| evalml/automl/automl_search.py | `99.9% <100.0%> (+0.1%)` | ⬆️ |
| evalml/tests/automl_tests/test_automl.py | `99.5% <100.0%> (+0.1%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9c21720...9282a01.

christopherbunn · 2021-11-04T19:14:23Z

evalml/tests/automl_tests/test_automl.py

+    automl = AutoMLSearch(
+        X_train=X,
+        y_train=y,
+        problem_type="binary",
+        max_batches=two_stacking_batches,
+        objective="Log Loss Binary",
+        ensembling=True,
+    )
+
+    test_env = AutoMLTestEnv("binary")
+    with test_env.test_context(mock_score_side_effect=score_side_effect):
+        automl.search()
+    pipeline_names = automl.rankings["pipeline_name"]
+    assert pipeline_names.str.contains("Ensemble").any()
+
+    ensemble_ids = [
+        _get_first_stacked_classifier_no() - 1,
+        len(automl.results["pipeline_results"]) - 1,
+    ]
+
+    final_best_pipeline_ids = [
+        pipeline["id"]
+        for pipeline in list(automl._automl_algorithm._best_pipeline_info.values())
+    ]
+    final_best_pipeline_ids.sort()
+
+    input_pipeline_0_ids = automl.get_ensembler_input_pipelines(ensemble_ids[0])
+    input_pipeline_0_ids.sort()
+
+    input_pipeline_1_ids = automl.get_ensembler_input_pipelines(ensemble_ids[1])
+    input_pipeline_1_ids.sort()


Here, we're rerunning AutoMLSearch so that we can get two ensemble pipelines (and to make sure that the IDs for best pipeline are updated). I did this since there isn't a super easy way to verify that the first ensemble pipeline generated is correct. Def open to ideas/suggestions!

Thoughts on rewriting test_describe_pipeline_with_ensembling to also call this new method, as opposed to writing another test? There, we checked that the input pipeline IDs for the second ensemble are all greater than the input IDs of the first ensemble, since the scores are always decreasing, which sounds reasonable to me.

I'm ok with whatever you decide. This looks great to me as is!

I am open to either, but I'm leaning slightly towards keeping it as-is. If we need to update get_ensembler_input_pipelines() in the future there's a single point to update + it keeps this particular edge case together. The runtime difference for another AutoMLSearch is relatively minimal for this mocked example.
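For reference, the ordering check described for test_describe_pipeline_with_ensembling could look roughly like the sketch below. The helper name and the ID values are hypothetical, and reading "all greater than" as "the smallest ID of the second ensemble exceeds the largest ID of the first" is an assumption, not something taken from the existing test.

```python
def ids_strictly_later(first_ids, second_ids):
    """Return True if every ID in second_ids is greater than every ID in first_ids."""
    return min(second_ids) > max(first_ids)


# Hypothetical input pipeline IDs for two ensembles (illustrative values only).
first_ensemble_inputs = [1, 2, 3, 4]
second_ensemble_inputs = [10, 11, 12, 13]

assert ids_strictly_later(first_ensemble_inputs, second_ensemble_inputs)
assert not ids_strictly_later(second_ensemble_inputs, first_ensemble_inputs)
```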

freddyaboulton

Thank you @christopherbunn !!

freddyaboulton · 2021-11-05T18:07:28Z

evalml/automl/automl_search.py

+        pipeline_results = self._results["pipeline_results"]
+        if (
+            ensemble_pipeline_id not in pipeline_results.keys()
+            or "input_pipeline_ids" not in pipeline_results[ensemble_pipeline_id].keys()


nit: We can get rid of .keys() here.
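A small sketch of why the .keys() calls are redundant: the `in` operator on a dict already tests key membership. The dictionary below is a made-up stand-in for `self._results["pipeline_results"]`, not real search output.

```python
# Hypothetical stand-in for self._results["pipeline_results"]; values are illustrative.
pipeline_results = {
    0: {"pipeline_name": "Baseline"},
    9: {"pipeline_name": "Stacked Ensemble", "input_pipeline_ids": [1, 2, 3]},
}

# `key in d` and `key in d.keys()` give the same answer for present and missing keys.
for candidate in (0, 9, 42):
    assert (candidate in pipeline_results) == (candidate in pipeline_results.keys())

# The nested check works the same way without .keys().
assert "input_pipeline_ids" in pipeline_results[9]
assert "input_pipeline_ids" not in pipeline_results[0]
```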

freddyaboulton · 2021-11-05T18:17:07Z

evalml/tests/automl_tests/test_automl.py

Thoughts on rewriting test_describe_pipeline_with_ensembling to also call this new method, as opposed to writing another test? There, we checked that the input pipeline IDs for the second ensemble are all greater than the input IDs of the first ensemble, since the scores are always decreasing, which sounds reasonable to me.

I'm ok with whatever you decide. This looks great to me as is!

freddyaboulton · 2021-11-05T18:17:48Z

evalml/automl/automl_search.py

@@ -1590,3 +1590,28 @@ def plot(self):
    @property
    def _sleep_time(self):
        return self._SLEEP_TIME
+
+    def get_ensembler_input_pipelines(self, ensemble_pipeline_id):
+        """Score a list of pipelines on the given holdout data.


Docstring first sentence doesn't match what the method does.
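As an illustration of a docstring whose first sentence matches the behavior, here is a standalone sketch built around the validity check shown in the diff. It is written as a free function over a plain dict rather than as the actual AutoMLSearch method, and both the ValueError and its message are assumptions; the PR excerpt does not show what happens when the check fails.

```python
def get_ensembler_input_pipelines(results, ensemble_pipeline_id):
    """Return the IDs of the pipelines used as input to an ensemble pipeline.

    Args:
        results (dict): Search results with a "pipeline_results" mapping
            (hypothetical stand-in for the AutoMLSearch results).
        ensemble_pipeline_id (int): ID of the ensemble pipeline to look up.

    Returns:
        list[int]: IDs of the ensemble's input pipelines.

    Raises:
        ValueError: If the ID is unknown or is not an ensemble (assumed error type).
    """
    pipeline_results = results["pipeline_results"]
    if (
        ensemble_pipeline_id not in pipeline_results
        or "input_pipeline_ids" not in pipeline_results[ensemble_pipeline_id]
    ):
        raise ValueError(
            f"Pipeline ID {ensemble_pipeline_id} is not a valid ensemble pipeline"
        )
    return pipeline_results[ensemble_pipeline_id]["input_pipeline_ids"]


# Illustrative data only.
example_results = {
    "pipeline_results": {
        0: {"pipeline_name": "Baseline"},
        9: {"pipeline_name": "Stacked Ensemble", "input_pipeline_ids": [1, 2, 3]},
    }
}
assert get_ensembler_input_pipelines(example_results, 9) == [1, 2, 3]
```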

angela97lin · 2021-11-07T18:28:21Z

evalml/tests/automl_tests/test_automl.py

+    with test_env.test_context(mock_score_side_effect=score_side_effect):
+        automl.search()
+    pipeline_names = automl.rankings["pipeline_name"]
+    assert pipeline_names.str.contains("Ensemble").any()


Is this assert carried over from a different ensemble test? I noticed that we don't test it above, and I'm not sure exactly what we're testing here, so could we remove it?

Yep, removed this line.

angela97lin

Looks good! I wonder if it's worth sorting the output of get_ensembler_input_pipelines, but otherwise 🚢

angela97lin · 2021-11-07T18:30:07Z

evalml/tests/automl_tests/test_automl.py

+        pipeline["id"]
+        for pipeline in list(automl._automl_algorithm._best_pipeline_info.values())
+    ]
+    best_pipeline_ids.sort()


Q: What do you think about just sorting the output for get_ensembler_input_pipelines? Is there ever a case where we wouldn't want to?

@angela97lin I think the pipeline IDs come out in model family order. In the case of the IterativeAlgorithm, they come out in the order of the first batch. It's a minor detail, but this is the same order that is fed into _create_ensemble, so all of the input pipelines in the ensemble pipeline's graph are in this order from top-down.

@christopherbunn Isn't _best_pipeline_info a dictionary? So the order isn't guaranteed, unless we're assuming that Python's dict --> list conversion follows the order in which items were added to the dict?

@angela97lin From Python 3.7 onwards, dictionaries are guaranteed to preserve insertion order: https://stackoverflow.com/questions/60848/how-do-you-retrieve-items-from-a-dictionary-in-the-order-that-theyre-inserted

Ah right! 😂
Okay, back to the original question then: is it better to sort or not? Sounds like there's no benefit to doing so, so LGTM! My original question was because I noticed we were sorting the outputs in the tests to compare, and wondered if there was any value in changing the method to sort. It's not too difficult to add later if we change our minds :)

Ah gotcha! Yeah to be clear I think it would be best to retain insertion order for the output rather than sort it, but it's not necessary for now. I have it sorted for this test case so it's easier to compare values.
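The two points from this thread can be illustrated with a short sketch: dict values iterate in insertion order on Python 3.7+, and a test can compare sorted copies without changing the order the method returns. The dictionary is a hypothetical stand-in for _best_pipeline_info; keying it by model family and the ID values are assumptions made for the example.

```python
# Hypothetical stand-in for _best_pipeline_info (keys and IDs are illustrative).
best_pipeline_info = {}
best_pipeline_info["random_forest"] = {"id": 7}
best_pipeline_info["xgboost"] = {"id": 3}
best_pipeline_info["linear_model"] = {"id": 5}

# On Python 3.7+, values() iterates in the order the keys were inserted.
ids_in_insertion_order = [info["id"] for info in best_pipeline_info.values()]
assert ids_in_insertion_order == [7, 3, 5]

# sorted() returns a new list, so a test can compare order-insensitively
# while the original insertion order is left untouched.
assert sorted(ids_in_insertion_order) == [3, 5, 7]
assert ids_in_insertion_order == [7, 3, 5]
```

Using sorted() in the test, rather than calling .sort() on the returned list, avoids mutating the list that the method hands back.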

christopherbunn force-pushed the 3008_ensemble_ids branch from 289c34f to 7575808 Compare November 4, 2021 18:32

christopherbunn marked this pull request as ready for review November 4, 2021 19:12

auto-assign bot assigned christopherbunn Nov 4, 2021

christopherbunn requested a review from a team November 4, 2021 19:12

christopherbunn commented Nov 4, 2021

freddyaboulton approved these changes Nov 5, 2021

angela97lin reviewed Nov 7, 2021

angela97lin approved these changes Nov 7, 2021

christopherbunn added 5 commits November 8, 2021 10:41

Initial push

c5c6f85

Added additional test case for two pipelines

19e5c8f

Updated release notes

b4c96b8

Lint fixes

425e051

Fixed docstring

9282a01

christopherbunn force-pushed the 3008_ensemble_ids branch from 6c377ef to 9282a01 Compare November 8, 2021 15:41

christopherbunn merged commit a65e026 into main Nov 8, 2021

chukarsten mentioned this pull request Nov 9, 2021

Release v0.37.0 #3029

Merged

freddyaboulton deleted the 3008_ensemble_ids branch May 13, 2022 15:03