Limit Classification problems to appropriate target data #3185
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3185 +/- ##
=======================================
+ Coverage 99.7% 99.8% +0.1%
=======================================
Files 326 326
Lines 31268 31372 +104
=======================================
+ Hits 31172 31283 +111
+ Misses 96 89 -7
Continue to review full report at Codecov.
Good stuff! Just left a few cleanup notes
@@ -67,8 +68,18 @@ def fit(self, X, y):

        Returns:
            self

        Raises:
            ValueError: If the target is not numeric.
This should be more specific - the target doesn't need to be numeric in all cases, just regression cases, right?
Yes, specifically just time series regression in this function.
if self.problem_type == ProblemTypes.BINARY and y.nunique() != 2:
    raise ValueError(f"Binary pipelines require y to have 2 unique classes!")
elif self.problem_type == ProblemTypes.MULTICLASS and y.nunique() in [1, 2]:
    raise ValueError(
        "Multiclass pipelines require y to have 3 or more unique classes!"
    )
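As a standalone illustration of the class-count guard under discussion (a simplified sketch with illustrative names, not the evalml source), the validation boils down to:

```python
import pandas as pd


def check_target_classes(problem_type, y):
    """Simplified sketch of the class-count validation; names are illustrative."""
    n_unique = y.nunique()
    if problem_type == "binary" and n_unique != 2:
        raise ValueError("Binary pipelines require y to have 2 unique classes!")
    elif problem_type == "multiclass" and n_unique in [1, 2]:
        raise ValueError(
            "Multiclass pipelines require y to have 3 or more unique classes!"
        )


# A three-class target passed to a binary problem type raises:
try:
    check_target_classes("binary", pd.Series([0, 1, 2]))
except ValueError as err:
    print(err)
```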
I think I asked this on your previous PR, but is there any way we can tie this to the logic in invalid_target_data_check? That data check, albeit somewhat complex, does check for this. But at the same time, this is starting to feel weird, adding bits and pieces of certain data checks back into the actual pipelines. @angela97lin, any thoughts? Are we moving in a direction contrary to the philosophy we had in mind for data checks?
I don't think that's a bad idea at all @chukarsten, especially since we already have a check for whether or not a target has enough instances of a class in a classification problem--it seems to fit the philosophy for data checks quite well.
Pretty interesting problem, maybe this is worth a discussion with the larger group? I wonder if we need to better draw the line between when we want to throw an exception to strongly disallow something vs use data checks.
Calling attention to this issue I filed forever ago that I think tackles the same problem: #968
Basically, let pipelines accept data checks on init so that we can run data checks prior to fit.
I think that's just one particular solution to the problem at hand but I think it's worth considering.
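One way to picture the #968 proposal (a hypothetical sketch, not evalml's actual API — the class names, `data_checks` parameter, and `validate` signature are all assumptions for illustration): the pipeline accepts data checks on init and runs them at the top of fit.

```python
import pandas as pd


class TargetClassesCheck:
    """Hypothetical data check mirroring the class-count validation in this PR."""

    def validate(self, X, y, problem_type):
        errors = []
        n_unique = y.nunique()
        if problem_type == "binary" and n_unique != 2:
            errors.append("Binary pipelines require y to have 2 unique classes!")
        elif problem_type == "multiclass" and n_unique in [1, 2]:
            errors.append(
                "Multiclass pipelines require y to have 3 or more unique classes!"
            )
        return errors


class Pipeline:
    """Toy pipeline that accepts data checks on init, per the #968 idea."""

    def __init__(self, problem_type, data_checks=None):
        self.problem_type = problem_type
        self.data_checks = data_checks or []

    def fit(self, X, y):
        # Run every configured data check before any fitting happens,
        # so the validation logic lives in the checks, not the pipeline.
        for check in self.data_checks:
            errors = check.validate(X, y, self.problem_type)
            if errors:
                raise ValueError("; ".join(errors))
        return self
```

With this shape, `Pipeline("binary", data_checks=[TargetClassesCheck()]).fit(X, y)` raises on a three-class target without the pipeline itself re-implementing the check.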
Thanks for tackling this one, Parthiv! I am gonna block based on the duplication of the data checks/actions logic. I am not sure what we want to do with this. The situation is kind of the same for the now-reverted #3160. What do we want to do when we're testing for something that a data check already tests for? I suspect we're going to want to refactor the data checks in such a way that we can reuse the same logic. Let's talk about it in OH Monday?
""" | ||
X, y = self._convert_to_woodwork(X, y) | ||
|
Isn't this exactly the same as ClassificationPipeline.fit()? I feel like we can actually just get rid of the time series fit, unless there's currently a compelling reason to override it with the same thing? @freddyaboulton, did you have something in mind?
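For reference, if the override really is identical, deleting it changes nothing: Python's method resolution order falls back to the parent implementation. A toy sketch (the class names here are stand-ins, not the evalml classes):

```python
class ClassificationPipeline:
    """Stand-in for the parent pipeline class."""

    def fit(self, X, y):
        self._is_fitted = True
        return self


class TimeSeriesClassificationPipeline(ClassificationPipeline):
    # No fit override: lookups on this class resolve to
    # ClassificationPipeline.fit automatically.
    pass
```

Callers see the exact same behavior either way, which is why a byte-for-byte duplicate override is pure maintenance cost.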
LGTM! Left a nit. Agreed with Karsten, I think it would be useful to see if there was a way to solve this without duplicating code that already exists in our data check.
""" | ||
X = infer_feature_types(X) | ||
y = infer_feature_types(y) | ||
|
||
if is_binary(self.problem_type) and y.nunique() != 2: | ||
raise ValueError(f"Binary pipelines require y to have 2 unique classes!") |
nit: We don't need the f prefix on this string, since it contains no formatting placeholders.
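To illustrate the nit: the f prefix is a no-op when a string literal contains no replacement fields, so the two forms produce identical strings.

```python
# The f prefix only matters when the literal contains {replacement} fields;
# here it does nothing, so the prefix can simply be dropped.
with_prefix = f"Binary pipelines require y to have 2 unique classes!"
without_prefix = "Binary pipelines require y to have 2 unique classes!"
assert with_prefix == without_prefix
```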
LGTM (Looks Gucci to me)
Fixes #2975
The scope of this issue was expanded to ensure that the appropriate number of classes exist for binary and multiclass pipelines.