Conversation
8d064b0 to
a0a5f39
Compare
Codecov Report
@@ Coverage Diff @@
## main #1665 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 240 240
Lines 18577 18625 +48
=========================================
+ Hits 18569 18617 +48
Misses 8 8
Continue to review full report at Codecov.
|
645b9fa to
f0989dd
Compare
|
|
||
| # with 1 class not having min 2 instances | ||
| assert invalid_targets_check.validate(X, y=pd.Series([0, 1, 1, 2, 2])) == { | ||
| assert invalid_targets_check.validate(X, y=pd.Series([0] + [1] * 19 + [2] * 80)) == { |
There was a problem hiding this comment.
This is an edit to make the targets pass the test that looks for targets with large numbers of unique values relative to the total number of target values. I do this in a few places.
| details=details).to_dict()) | ||
|
|
||
| num_class_to_num_value_ratio = len(unique_values) / len(y) | ||
| if num_class_to_num_value_ratio >= self.multiclass_continuous_threshold: |
There was a problem hiding this comment.
Not sure I'm understanding this. If there's 3 unique values in a list of length 50, this would warn that the problem could be regression? 3/50 == 0.06 > 0.05. If so, this default seems really high, but definitely open to discuss.
There was a problem hiding this comment.
Good point, I think that's why this is a warning rather than an error, but maybe there could be an additional check taking into account the size of the target data when calculating this ratio?
There was a problem hiding this comment.
Yea I agree that I wouldn't expect a warning on a problem with 3 unique classes across 50 observations. Maybe we should bump it higher. As a follow up, it'd be nice if the detect_problem_type util used this same logic for recommending multiclass vs regression. Right now that has a hard cap on 10 classes for multiclass. That can be tackled in a separate issue though.
There was a problem hiding this comment.
Yup, I think this is a good place for a discussion. I just implemented what was in the issue as I wasn't present for the discussion - a 5% threshold. I'm certainly down for a smarter method of doing it.
ParthivNaresh
left a comment
There was a problem hiding this comment.
Looks excellent, just a few wording changes!
| "objective": None}}) | ||
|
|
||
|
|
||
| def test_invalid_target_data_check_regression_problem_nonnumeric_data(): |
There was a problem hiding this comment.
You could parametrize with a non-regression problem type to check the differences if you want to
There was a problem hiding this comment.
So I went ahead and parametrized it...I guess I'm not very familiar with what we ultimately want. I know in some of my tests I use non-numerics for a categorical, but I don't think we're setup to do that, right? ["Happy", "Birthday", "Birthday"] isn't considered a multiclass target, but [0, 1, 2, 2, 2, 1] is, right?
evalml/tests/data_checks_tests/test_invalid_targets_data_check.py
Outdated
Show resolved
Hide resolved
evalml/tests/data_checks_tests/test_invalid_targets_data_check.py
Outdated
Show resolved
Hide resolved
freddyaboulton
left a comment
There was a problem hiding this comment.
@chukarsten Thanks for the improvements!
| details=details).to_dict()) | ||
|
|
||
| num_class_to_num_value_ratio = len(unique_values) / len(y) | ||
| if num_class_to_num_value_ratio >= self.multiclass_continuous_threshold: |
There was a problem hiding this comment.
Yea I agree that I wouldn't expect a warning on a problem with 3 unique classes across 50 observations. Maybe we should bump it higher. As a follow up, it'd be nice if the detect_problem_type util used this same logic for recommending multiclass vs regression. Right now that has a hard cap on 10 classes for multiclass. That can be tackled in a separate issue though.
… Need to refactor to handle the details checks.
…ulticlass problem with binary target data and tests. Need to update the docstrings and also look at how the previous X and y were updated to not trigger the large amount of classes warning. Many of these X/y combos can probably be merged.
Shortened the unique class message.
Refactored to accomodate : TARGET_MULTICLASS_HIGH_UNIQUE_CLASS
Updated tests to accomodate : TARGET_MULTICLASS_HIGH_UNIQUE_CLASS change.
1f410af to
9da1f50
Compare
For regression, error if the problem type was not "numeric"
For binary, error if the problem type was "categorical" (but first we should confirm that if a categorical column has only two unique values (other than nan) woodwork will label it as binary, not as "categorical" (this was a little unclear - I think this already is the case and is tested.)
For multiclass, error if the problem type was binary
For multiclass, warn if the problem type had a high number of unique values--perhaps set a max cap at over 5%.
addresses Issue #1548