Added AutoML function to access ensemble pipeline's input pipelines IDs #3011

Merged: 5 commits, Nov 8, 2021

christopherbunn
Contributor

Resolves #3008

codecov bot commented Nov 4, 2021

Codecov Report

Merging #3011 (6c377ef) into main (9c21720) will increase coverage by 0.1%.
The diff coverage is 100.0%.

❗ Current head 6c377ef differs from pull request most recent head 9282a01. Consider uploading reports for the commit 9282a01 to get more accurate results

@@           Coverage Diff           @@
##            main   #3011     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        312     312             
  Lines      29856   29893     +37     
=======================================
+ Hits       29765   29802     +37     
  Misses        91      91             
Impacted Files                              Coverage Δ
evalml/automl/automl_search.py              99.9% <100.0%> (+0.1%) ⬆️
evalml/tests/automl_tests/test_automl.py    99.5% <100.0%> (+0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9c21720...9282a01.

Comment on lines 5404 to 5432
automl = AutoMLSearch(
    X_train=X,
    y_train=y,
    problem_type="binary",
    max_batches=two_stacking_batches,
    objective="Log Loss Binary",
    ensembling=True,
)

test_env = AutoMLTestEnv("binary")
with test_env.test_context(mock_score_side_effect=score_side_effect):
    automl.search()
pipeline_names = automl.rankings["pipeline_name"]
assert pipeline_names.str.contains("Ensemble").any()

ensemble_ids = [
    _get_first_stacked_classifier_no() - 1,
    len(automl.results["pipeline_results"]) - 1,
]

final_best_pipeline_ids = [
    pipeline["id"]
    for pipeline in list(automl._automl_algorithm._best_pipeline_info.values())
]
final_best_pipeline_ids.sort()

input_pipeline_0_ids = automl.get_ensembler_input_pipelines(ensemble_ids[0])
input_pipeline_0_ids.sort()

input_pipeline_1_ids = automl.get_ensembler_input_pipelines(ensemble_ids[1])
input_pipeline_1_ids.sort()
Contributor Author

Here, we're rerunning AutoMLSearch so that we can get two ensemble pipelines (and to make sure that the IDs for best pipeline are updated). I did this since there isn't a super easy way to verify that the first ensemble pipeline generated is correct. Def open to ideas/suggestions!

Contributor

Thoughts on rewriting test_describe_pipeline_with_ensembling to also call this new method, as opposed to writing another test? There, we checked that the input pipeline IDs for the second ensemble are all greater than the input IDs of the first ensemble, since the scores are always decreasing, which sounds reasonable to me.

I'm ok with whatever you decide. This looks great to me as is!
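For reference, the ordering check described above could look roughly like the sketch below. The helper name `all_ids_greater` and the ID lists are made up for illustration; they are not taken from the evalml test suite.

```python
def all_ids_greater(first_ids, second_ids):
    """Return True if every ID in second_ids is larger than every ID in first_ids."""
    # Compare every pair so the check also behaves sensibly for empty lists.
    return all(second > first for first in first_ids for second in second_ids)


# Hypothetical input pipeline IDs for two successive ensembles.
first_ensemble_inputs = [1, 3, 5]
second_ensemble_inputs = [8, 10, 12]

ordering_holds = all_ids_greater(first_ensemble_inputs, second_ensemble_inputs)
ordering_fails = all_ids_greater(first_ensemble_inputs, [4, 10, 12])
```

This only verifies a relative ordering between the two ensembles, which is why the dedicated test in this PR compares against the tracked best-pipeline IDs instead.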

Contributor Author

I am open to either, but I'm leaning slightly towards keeping it as-is. If we need to update get_ensembler_input_pipelines() in the future there's a single point to update + it keeps this particular edge case together. The runtime difference for another AutoMLSearch is relatively minimal for this mocked example.

Contributor

freddyaboulton left a comment

Thank you @christopherbunn !!

pipeline_results = self._results["pipeline_results"]
if (
    ensemble_pipeline_id not in pipeline_results.keys()
    or "input_pipeline_ids" not in pipeline_results[ensemble_pipeline_id].keys()
Contributor

nit: We can get rid of .keys() here.
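A quick illustration of why the nit is safe: membership tests on a dict already check its keys, so both forms of the condition give the same answer. The results dictionary below is a made-up stand-in, not real AutoMLSearch output, and the two helper functions are hypothetical names used only for the comparison.

```python
# Hypothetical stand-in for the "pipeline_results" mapping (pipeline ID -> result dict).
pipeline_results = {
    0: {"pipeline_name": "Baseline"},
    9: {"pipeline_name": "Stacked Ensemble", "input_pipeline_ids": [3, 1, 5]},
}


def is_invalid_with_keys(results, pipeline_id):
    # Condition as written in the diff, with explicit .keys() calls.
    return (
        pipeline_id not in results.keys()
        or "input_pipeline_ids" not in results[pipeline_id].keys()
    )


def is_invalid_without_keys(results, pipeline_id):
    # Same condition; `in` on a dict tests its keys directly.
    return (
        pipeline_id not in results
        or "input_pipeline_ids" not in results[pipeline_id]
    )


# Compare both forms for a non-ensemble ID, an ensemble ID, and a missing ID.
comparison = {
    pid: (
        is_invalid_with_keys(pipeline_results, pid),
        is_invalid_without_keys(pipeline_results, pid),
    )
    for pid in (0, 9, 42)
}
```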

@@ -1590,3 +1590,28 @@ def plot(self):
    @property
    def _sleep_time(self):
        return self._SLEEP_TIME

    def get_ensembler_input_pipelines(self, ensemble_pipeline_id):
        """Score a list of pipelines on the given holdout data.
Contributor

Docstring first sentence doesn't match what the method does.
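As a rough sketch of what a matching docstring could say, here is a standalone version of the lookup. It is not the merged evalml code: the function name `get_ensembler_input_pipelines_sketch`, the explicit `pipeline_results` argument, the `ValueError` branch, and the sample data are all assumptions made so the example runs on its own.

```python
def get_ensembler_input_pipelines_sketch(pipeline_results, ensemble_pipeline_id):
    """Return the input pipeline IDs recorded for an ensemble pipeline.

    Args:
        pipeline_results (dict): Maps pipeline ID to that pipeline's result dict.
        ensemble_pipeline_id (int): ID of the ensemble pipeline to look up.

    Returns:
        list[int]: Input pipeline IDs, in the order they were stored.

    Raises:
        ValueError: If the ID is unknown or is not an ensemble pipeline.
            (Assumed behavior; the excerpt in this PR does not show this branch.)
    """
    if (
        ensemble_pipeline_id not in pipeline_results
        or "input_pipeline_ids" not in pipeline_results[ensemble_pipeline_id]
    ):
        raise ValueError(
            f"Pipeline ID {ensemble_pipeline_id} is not a valid ensemble pipeline"
        )
    return pipeline_results[ensemble_pipeline_id]["input_pipeline_ids"]


# Made-up results for illustration only.
example_results = {
    0: {"pipeline_name": "Baseline"},
    9: {"pipeline_name": "Stacked Ensemble", "input_pipeline_ids": [3, 1, 5]},
}
ensemble_inputs = get_ensembler_input_pipelines_sketch(example_results, 9)
```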

with test_env.test_context(mock_score_side_effect=score_side_effect):
    automl.search()
pipeline_names = automl.rankings["pipeline_name"]
assert pipeline_names.str.contains("Ensemble").any()
Contributor

Is this assert carried over from a different ensemble test? I noticed that we don't test it above, and it's not exactly what we're testing here, so could we remove it?

Contributor Author

Yep, removed this line.

Contributor

angela97lin left a comment

Looks good! I wonder if it's worth sorting the output of get_ensembler_input_pipelines, but otherwise 🚢

    pipeline["id"]
    for pipeline in list(automl._automl_algorithm._best_pipeline_info.values())
]
best_pipeline_ids.sort()
Contributor

Q: What do you think about just sorting the output for get_ensembler_input_pipelines? Is there ever a case where we wouldn't want to?

Contributor Author

@angela97lin I think the pipeline IDs come out in model family order. In the case of the IterativeAlgorithm, it comes out in the order of the first batch. It's a minor detail but this is the same order that is fed into _create_ensemble, so all of the input pipelines in the ensemble pipeline's graph are in this order from top-down.

Contributor

@christopherbunn Isn't _best_pipeline_info a dictionary? So the order isn't guaranteed, unless we're assuming the Python order from dict --> list is the order in which items were added to the dict?

Contributor Author

Contributor

Ah right! 😂
Okay, back to the original question then: is it better to sort or not? Sounds like there's no benefit to doing so, so LGTM! My original question was because I noticed we were sorting the outputs in the tests to compare, and wondered if there was any value in changing the method to sort. It's not too difficult to add later if we change our minds :)

Contributor Author

Ah gotcha! Yeah, to be clear, I think it would be best to retain insertion order for the output rather than sort it, but it's not necessary for now. I have it sorted for this test case so it's easier to compare values.
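To make the ordering point concrete: since Python 3.7, dicts are guaranteed to iterate in insertion order, so a list built from `.values()` keeps the order in which entries were added. The sketch below uses a made-up stand-in for `_best_pipeline_info` (the keys and IDs are hypothetical) and shows how a test can compare sorted copies without changing the stored order.

```python
# Hypothetical stand-in for _best_pipeline_info: model family -> info about its best pipeline.
best_pipeline_info = {}
best_pipeline_info["random_forest"] = {"id": 7}
best_pipeline_info["xgboost"] = {"id": 2}
best_pipeline_info["linear_model"] = {"id": 5}

# Iterating over a dict follows insertion order (guaranteed since Python 3.7).
ids_in_insertion_order = [info["id"] for info in best_pipeline_info.values()]

# sorted() returns a new list, so the original order is left untouched;
# list.sort() would instead reorder the list in place.
ids_sorted_for_comparison = sorted(ids_in_insertion_order)
```

Using `sorted()` in a test keeps the method's output order intact while still allowing an order-independent comparison.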

christopherbunn merged commit a65e026 into main Nov 8, 2021
chukarsten mentioned this pull request Nov 9, 2021
freddyaboulton deleted the 3008_ensemble_ids branch May 13, 2022 15:03
Successfully merging this pull request may close these issues.

AutoMLSearch: build API to access IDs of ensemble pipeline's input pipelines