Limit Classification problems to appropriate target data#3185
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3185 +/- ##
=======================================
+ Coverage 99.7% 99.8% +0.1%
=======================================
Files 326 326
Lines 31268 31372 +104
=======================================
+ Hits 31172 31283 +111
+ Misses 96 89 -7
Continue to review full report at Codecov.
|
eccabay
left a comment
There was a problem hiding this comment.
Good stuff! Just left a few cleanup notes
| self | ||
|
|
||
| Raises: | ||
| ValueError: If the target is not numeric. |
There was a problem hiding this comment.
This should be more specific - the target doesn't need to be numeric in all cases, just regression cases, right?
There was a problem hiding this comment.
Yes specifically just time series regression in this function
| if self.problem_type == ProblemTypes.BINARY and y.nunique() != 2: | ||
| raise ValueError(f"Binary pipelines require y to have 2 unique classes!") | ||
| elif self.problem_type == ProblemTypes.MULTICLASS and y.nunique() in [1, 2]: | ||
| raise ValueError( | ||
| "Multiclass pipelines require y to have 3 or more unique classes!" | ||
| ) |
There was a problem hiding this comment.
I think I asked this on your previous PR, but is there anyway we can tie this to the logic in invalid_target_data_check? That data check, albeit somewhat complex, does check for this. But at the same time, this is starting to feel weird, adding bits and pieces of certain data checks back into the actual pipelines. @angela97lin , any thoughts? Are we moving in a direction contrary to the philosophy we had in mind for data checks?
There was a problem hiding this comment.
I don't think that's a bad idea at all @chukarsten, especially since we already have a check for whether or not a target has enough instances of a class in a classification problem--it seems to fit the philosophy for data checks quite well.
Pretty interesting problem, maybe this is worth a discussion with the larger group? I wonder if we need to better draw the line between when we want to throw an exception to strongly disallow something vs use data checks.
There was a problem hiding this comment.
Calling attention to this issue a filed forever ago that I think tackles the same problem #968
Basically, let pipelines accept data checks on init so that we can run data checks prior to fit.
I think that's just one particular solution to the problem at hand but I think it's worth considering.
chukarsten
left a comment
There was a problem hiding this comment.
Thanks for tackling this one, Parthiv! I am gonna block based on the duplication of the data checks/actions logic. I am not sure what we want to do with this. The situation is kind of the same for the now reverted #3160 . What do we want to do when we're testing for something that a data check already tests for? I suspect we're going to want to refactor the datachecks in such a way that we can use the same logic. Let's talk about it in OH Monday?
| ValueError: If the number of unique classes in y are not appropriate for the type of pipeline. | ||
| """ | ||
| X, y = self._convert_to_woodwork(X, y) | ||
|
|
There was a problem hiding this comment.
Isn't this exactly the same as ClassificationPipeline.fit()? I feel like we can actually just get rid of the time series fit, unless there's currently a compelling reason to override it with the same thing? @freddyaboulton , did you have something in mind?
# Conflicts: # docs/source/release_notes.rst # evalml/tests/pipeline_tests/test_time_series_pipeline.py
# Conflicts: # evalml/tests/pipeline_tests/test_pipelines.py
# Conflicts: # docs/source/release_notes.rst # docs/source/user_guide/pipelines.ipynb
bchen1116
left a comment
There was a problem hiding this comment.
LGTM! Left a nit. Agreed with Karsten, I think it would be useful to see if there was a way to solve this without duplicating code that already exists in our data check
| y = infer_feature_types(y) | ||
|
|
||
| if is_binary(self.problem_type) and y.nunique() != 2: | ||
| raise ValueError(f"Binary pipelines require y to have 2 unique classes!") |
There was a problem hiding this comment.
nit: We don't need the formatting f prefacing this string
chukarsten
left a comment
There was a problem hiding this comment.
LGTM (Looks Gucci to me)
Fixes #2975
The scope of this issue was expanded to ensure that the appropriate number of classes exist for binary and multiclass pipelines.