Dont allow duplicate pipeline names in AutoMLSearch #1932

Merged

freddyaboulton
Contributor

Pull Request Description

Fixes #1858


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@freddyaboulton freddyaboulton changed the title Check if there are duplicate pipeline names in AutoMLSearch init. Dont allow duplicate pipeline names in AutoMLSearch Mar 4, 2021

codecov bot commented Mar 4, 2021

Codecov Report

Merging #1932 (2d666f0) into main (5051e67) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1932     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         267      267             
  Lines       21700    21730     +30     
=========================================
+ Hits        21694    21724     +30     
  Misses          6        6             
| Impacted Files | Coverage Δ |
|---|---|
| evalml/automl/automl_search.py | 100.0% <100.0%> (ø) |
| evalml/automl/utils.py | 100.0% <100.0%> (ø) |
| evalml/tests/automl_tests/test_automl.py | 100.0% <100.0%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -85,3 +85,29 @@ def tune_binary_threshold(pipeline, objective, problem_type, X_threshold_tuning,
y_predict_proba = pipeline.predict_proba(X_threshold_tuning)
y_predict_proba = y_predict_proba.iloc[:, 1]
pipeline.threshold = objective.optimize_threshold(y_predict_proba, y_threshold_tuning, X=X_threshold_tuning)


def check_all_pipeline_names_unique(pipelines):
Contributor Author

This is also in my #1913 PR

Contributor

classic stacked branches. disappointed this was only one on another and not 5

Contributor

@angela97lin angela97lin left a comment

LGTM! 🥳

If we want to be super sure, maybe it'd be good to add test cases for when three or more pipelines have the same name, to make sure it's printed out once (Custom, Custom, Custom), or for when several names are each duplicated (Custom, Custom, Custom1, Custom1). But that's probably not super necessary, since at that point we're just testing set functionality 😂
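
The extra cases suggested here could be sketched roughly as follows. This is a minimal, self-contained illustration, not the test added in this PR: `FakePipeline` and `find_duplicate_names` are hypothetical stand-ins for evalml pipelines and the check under review, and plain asserts are used in place of pytest.

```python
class FakePipeline:
    """Hypothetical stand-in for a pipeline; only the name matters here."""

    def __init__(self, name):
        self.name = name


def find_duplicate_names(pipelines):
    """Return the set of names used by more than one pipeline (illustrative helper)."""
    seen_names = set()
    duplicate_names = set()
    for pipeline in pipelines:
        if pipeline.name in seen_names:
            duplicate_names.add(pipeline.name)
        else:
            seen_names.add(pipeline.name)
    return duplicate_names


# Case 1: the same name three times should be reported only once.
triple = [FakePipeline("Custom"), FakePipeline("Custom"), FakePipeline("Custom")]
assert find_duplicate_names(triple) == {"Custom"}

# Case 2: two different names, each duplicated, should both be reported.
mixed = [FakePipeline(n) for n in ("Custom", "Custom", "Custom1", "Custom1")]
assert find_duplicate_names(mixed) == {"Custom", "Custom1"}

# Case 3: unique names produce no duplicates.
unique = [FakePipeline("A"), FakePipeline("B")]
assert find_duplicate_names(unique) == set()
```

Because the duplicates are collected in a set, a name repeated many times still appears once, which is the behaviour the comment is asking about.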


with pytest.raises(ValueError,
                   match="All pipeline names must be unique. The names 'Custom Pipeline' were repeated."):
    AutoMLSearch(X, y, problem_type="binary", allowed_pipelines=[MyPipeline1, MyPipeline2, MyPipeline3])
Contributor

Super duper nit-pick suggestion, but I wonder if there's a way to phrase this so that it grammatically makes sense for the case with 1 or multiple duplicates :P

Contributor Author

Done!

Contributor

I should've known @angela97lin was driving the tense calculator

Contributor

I had to heavily debate whether it was worth it or not to comment about this LOL 😅

Contributor

@jeremyliweishih jeremyliweishih left a comment

looks good but I like Angela's suggestion on fixing the grammar in the one duplicate case!

Contributor

@chukarsten chukarsten left a comment

Proposed alternate impl for the check to reduce lines, but no need to accept. Great work.

None

Raises:
ValueError if any pipeline names are duplicated.
Contributor

I think I mentioned in your other PR that normally I think we do: ValueError: if....

Contributor Author

Good call!
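
For reference, the convention mentioned here (exception type, a colon, then the condition) matches how Google-style docstrings list exceptions. Below is a sketch of what the docstring might look like after the change. The wording of the docstring and the deliberately simplified body are assumptions for illustration, not the merged code.

```python
def check_all_pipeline_names_unique(pipelines):
    """Check whether all the pipeline names are unique.

    Arguments:
        pipelines (list): Objects with a ``name`` attribute (assumed shape).

    Returns:
        None

    Raises:
        ValueError: If any pipeline names are duplicated.
    """
    names = [pipeline.name for pipeline in pipelines]
    # Simplified body for illustration: a set drops repeats, so a length
    # mismatch means at least one name occurs more than once.
    if len(names) != len(set(names)):
        raise ValueError("All pipeline names must be unique.")
```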

            seen_names.add(pipeline.name)

    if duplicate_names:
        plural, tense = ("s", "were") if len(duplicate_names) > 1 else ("", "was")
Contributor

I respect this dedication to verb tense calculation.
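
To make the singular/plural handling concrete, here is a small sketch of how the message could be assembled from the `plural` and `tense` values. The `build_duplicate_message` helper and the exact wording are assumptions modelled on the test string quoted in this thread, not necessarily the merged code.

```python
def build_duplicate_message(duplicate_names):
    """Build an error message that reads correctly for one or many names.

    Hypothetical helper; the wording mirrors the message asserted in the
    test shown in this thread.
    """
    plural, tense = ("s", "were") if len(duplicate_names) > 1 else ("", "was")
    # Sort so the message is deterministic even though the input is a set.
    quoted = ", ".join(f"'{name}'" for name in sorted(duplicate_names))
    return f"All pipeline names must be unique. The name{plural} {quoted} {tense} repeated."


print(build_duplicate_message({"Custom Pipeline"}))
print(build_duplicate_message({"Custom", "Custom1"}))
```

With a single duplicate the sentence reads "The name 'Custom Pipeline' was repeated."; with two it reads "The names 'Custom', 'Custom1' were repeated."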

Comment on lines 102 to 109
    seen_names = set()
    duplicate_names = set()

    for pipeline in pipelines:
        if pipeline.name in seen_names:
            duplicate_names.add(pipeline.name)
        else:
            seen_names.add(pipeline.name)
Contributor

I have twice proposed a pandas impl and twice deleted it thinking I was being overbearing. Here it is...not blocking, feel free to reject.

name_count = pd.Series([p.name for p in pipelines]).value_counts()
duplicate_names = name_count[name_count > 1] # I don't think this line is quite right.

Contributor Author

Ended up taking it! 👏
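
A runnable sketch of the pandas approach proposed above, for anyone following along. The filtering line appears to work as written: boolean indexing keeps the entries whose count exceeds 1, and the duplicated names end up in the index of the filtered Series. The `duplicate_names_with_pandas` wrapper is a hypothetical name used only for this illustration.

```python
import pandas as pd


def duplicate_names_with_pandas(names):
    """Return the names that occur more than once (hypothetical helper)."""
    name_count = pd.Series(names).value_counts()
    # Keep only the entries whose count exceeds 1; the names themselves
    # live in the index of the resulting Series.
    return name_count[name_count > 1].index.tolist()


duplicates = duplicate_names_with_pandas(["Custom", "Custom", "Other", "Custom"])
print(duplicates)
```

Compared with the explicit loop over two sets, this trades a few lines for a pandas dependency in the helper, which is presumably acceptable in a codebase that already imports pandas.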


@freddyaboulton freddyaboulton merged commit cfc601d into main Mar 5, 2021
@freddyaboulton freddyaboulton deleted the 1858-dont-allow-duplicate-pipeline-names-in-automl-search branch March 5, 2021 21:17
@dsherry dsherry mentioned this pull request Mar 11, 2021
Successfully merging this pull request may close these issues.

Stacktrace in AutoMLSearch when pipelines have duplicate names
4 participants