Limit Classification problems to appropriate target data #3185

Merged: 23 commits from Limit_BinaryPipeline into main on Jan 12, 2022

Conversation

@ParthivNaresh (Contributor) commented on Jan 5, 2022:

Fixes #2975

The scope of this issue was expanded to ensure that the appropriate number of classes exist for binary and multiclass pipelines.
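For illustration, a minimal self-contained sketch of the kind of validation this adds (the helper name _validate_target and the string problem-type labels are hypothetical; the real check lives in the pipeline classes, as the diffs below show):

import pandas as pd

def _validate_target(problem_type, y):
    # Hypothetical standalone version of the check added in this PR:
    # binary targets need exactly 2 classes, multiclass needs 3 or more.
    y = pd.Series(y)
    if problem_type == "binary" and y.nunique() != 2:
        raise ValueError("Binary pipelines require y to have 2 unique classes!")
    if problem_type == "multiclass" and y.nunique() in [1, 2]:
        raise ValueError(
            "Multiclass pipelines require y to have 3 or more unique classes!"
        )

_validate_target("binary", [0, 1, 1])  # passes: exactly 2 classes
_validate_target("binary", [0, 1, 2])  # raises ValueError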

@codecov bot commented on Jan 5, 2022:

Codecov Report

Merging #3185 (b18f58c) into main (70d8dc9) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #3185     +/-   ##
=======================================
+ Coverage   99.7%   99.8%   +0.1%     
=======================================
  Files        326     326             
  Lines      31268   31372    +104     
=======================================
+ Hits       31172   31283    +111     
+ Misses        96      89      -7     
Impacted Files Coverage Δ
.../pipelines/time_series_classification_pipelines.py 100.0% <ø> (ø)
evalml/pipelines/time_series_pipeline_base.py 100.0% <ø> (ø)
evalml/pipelines/classification_pipeline.py 100.0% <100.0%> (ø)
...valml/pipelines/time_series_regression_pipeline.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl.py 99.6% <100.0%> (+0.1%) ⬆️
evalml/tests/conftest.py 97.0% <100.0%> (+1.2%) ⬆️
...assification_pipeline_tests/test_classification.py 100.0% <100.0%> (ø)
evalml/tests/pipeline_tests/test_pipelines.py 99.8% <100.0%> (ø)
.../tests/pipeline_tests/test_time_series_pipeline.py 99.9% <100.0%> (+0.1%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 70d8dc9...b18f58c.

@ParthivNaresh marked this pull request as ready for review on January 6, 2022.
@eccabay (Contributor) left a comment:

Good stuff! Just left a few cleanup notes.

@@ -67,8 +68,18 @@ def fit(self, X, y):

Returns:
self

Raises:
ValueError: If the target is not numeric.
@eccabay:
This should be more specific - the target doesn't need to be numeric in all cases, just regression cases, right?

@ParthivNaresh (Author) replied:
Yes, specifically just time series regression in this function.
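For context, a hypothetical sketch of the numeric-target check being discussed (the helper name and error message are illustrative, not the actual evalml implementation):

import pandas as pd
from pandas.api.types import is_numeric_dtype

def _check_regression_target(y):
    # Time series regression needs a numeric target; classification targets
    # can be categorical, hence scoping the docstring to regression cases.
    if not is_numeric_dtype(pd.Series(y)):
        raise ValueError("Time series regression pipelines require a numeric target.")

_check_regression_target([1.2, 3.4])  # passes
_check_regression_target(["a", "b"])  # raises ValueError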

Resolved (outdated) review threads on evalml/tests/automl_tests/test_automl.py and evalml/tests/pipeline_tests/test_pipelines.py.
Comment on lines 61 to 66
if self.problem_type == ProblemTypes.BINARY and y.nunique() != 2:
raise ValueError(f"Binary pipelines require y to have 2 unique classes!")
elif self.problem_type == ProblemTypes.MULTICLASS and y.nunique() in [1, 2]:
raise ValueError(
"Multiclass pipelines require y to have 3 or more unique classes!"
)
@chukarsten (Contributor):
I think I asked this on your previous PR, but is there any way we can tie this to the logic in invalid_target_data_check? That data check, albeit somewhat complex, does check for this. But at the same time, this is starting to feel weird, adding bits and pieces of certain data checks back into the actual pipelines. @angela97lin, any thoughts? Are we moving in a direction contrary to the philosophy we had in mind for data checks?

Contributor:
I don't think that's a bad idea at all @chukarsten, especially since we already have a check for whether or not a target has enough instances of a class in a classification problem--it seems to fit the philosophy for data checks quite well.

Pretty interesting problem, maybe this is worth a discussion with the larger group? I wonder if we need to better draw the line between when we want to throw an exception to strongly disallow something vs use data checks.

Contributor:

Calling attention to this issue I filed forever ago that I think tackles the same problem: #968

Basically, let pipelines accept data checks on init so that we can run data checks prior to fit.

I think that's just one particular solution to the problem at hand, but I think it's worth considering.
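A rough sketch of that idea, letting a pipeline accept data checks at init and run them before fitting (the class name, constructor signature, and message format here are all assumptions for illustration, not actual evalml API):

class PipelineWithDataChecks:
    def __init__(self, component_graph, data_checks=None):
        self.component_graph = component_graph
        self.data_checks = data_checks or []

    def fit(self, X, y):
        # Run the configured data checks up front and fail fast on any
        # error-level result, instead of re-implementing the logic in fit.
        for check in self.data_checks:
            messages = check.validate(X, y)
            errors = [m for m in messages if m.get("level") == "error"]
            if errors:
                raise ValueError(f"Data check failed before fit: {errors}")
        # ... continue with the normal fitting logic ...
        return self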

@chukarsten (Contributor) left a comment:

Thanks for tackling this one, Parthiv! I am gonna block based on the duplication of the data checks/actions logic. I am not sure what we want to do with this. The situation is kind of the same for the now reverted #3160. What do we want to do when we're testing for something that a data check already tests for? I suspect we're going to want to refactor the data checks in such a way that we can use the same logic. Let's talk about it in OH Monday?

"""
X, y = self._convert_to_woodwork(X, y)

@chukarsten:
Isn't this exactly the same as ClassificationPipeline.fit()? I feel like we can actually just get rid of the time series fit, unless there's currently a compelling reason to override it with the same thing? @freddyaboulton , did you have something in mind?
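If the bodies really are identical, the override could just be deleted so the subclass inherits fit; roughly (a toy sketch, not the actual class hierarchy):

class ClassificationPipelineSketch:
    def fit(self, X, y):
        # converts X/y to Woodwork, validates the target, fits components
        return self

class TimeSeriesClassificationPipelineSketch(ClassificationPipelineSketch):
    pass  # no fit override; the inherited method already does the work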

Resolved (outdated) review thread on evalml/tests/pipeline_tests/test_pipelines.py.
Merge commits resolved conflicts in: docs/source/release_notes.rst, evalml/tests/pipeline_tests/test_time_series_pipeline.py, evalml/tests/pipeline_tests/test_pipelines.py, and docs/source/user_guide/pipelines.ipynb.
@bchen1116 (Contributor) left a comment:
LGTM! Left a nit. Agreed with Karsten, I think it would be useful to see if there's a way to solve this without duplicating code that already exists in our data check.

"""
X = infer_feature_types(X)
y = infer_feature_types(y)

if is_binary(self.problem_type) and y.nunique() != 2:
raise ValueError(f"Binary pipelines require y to have 2 unique classes!")
@bchen1116:
nit: We don't need the f prefix on this string since it has no placeholders.
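The suggested change would simply drop the f:

raise ValueError("Binary pipelines require y to have 2 unique classes!")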

@chukarsten (Contributor) left a comment:

LGTM (Looks Gucci to me)

@ParthivNaresh merged commit cb846dd into main on Jan 12, 2022
@chukarsten mentioned this pull request on Jan 18, 2022
@freddyaboulton deleted the Limit_BinaryPipeline branch on May 13, 2022
Successfully merging this pull request may close these issues: Limit BinaryClassificationPipeline to binary problems only.
6 participants