
InvalidTargetDataCheck additions #1665

Merged: 17 commits merged into main from 1548-make_invalidatargetdatacheck_smarter on Jan 12, 2021

Conversation

@chukarsten chukarsten commented Jan 7, 2021

  • For regression, error if the target data is not numeric.

  • For binary, error if the target data is labeled "categorical" (but first we should confirm that if a categorical column has only two unique values other than NaN, woodwork will label it as binary rather than categorical; this was a little unclear, but I think this is already the case and is tested).

  • For multiclass, error if the target data is binary.

  • For multiclass, warn if the target has a high number of unique values relative to its length; perhaps set a max cap at 5%.

addresses Issue #1548
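
A minimal sketch of the kind of target validation the bullets above describe, written in plain pandas. The function name, default threshold, and message strings here are illustrative stand-ins, not the actual InvalidTargetDataCheck implementation:

```python
import pandas as pd

def check_target(y, problem_type, multiclass_continuous_threshold=0.05):
    """Illustrative version of the checks described in this PR; returns a list of messages."""
    messages = []
    y = pd.Series(y)
    unique_values = y.dropna().unique()

    if problem_type == "regression" and not pd.api.types.is_numeric_dtype(y):
        messages.append("Error: target for a regression problem must be numeric.")
    if problem_type == "binary" and len(unique_values) > 2:
        messages.append("Error: target for a binary problem has more than two classes.")
    if problem_type == "multiclass":
        if len(unique_values) <= 2:
            messages.append("Error: target for a multiclass problem looks binary.")
        # Warn when unique classes make up a large fraction of all target values (5% default cap).
        if len(unique_values) / len(y) >= multiclass_continuous_threshold:
            messages.append("Warning: high class-to-value ratio; target may be continuous.")
    return messages
```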

@chukarsten chukarsten force-pushed the 1548-make_invalidatargetdatacheck_smarter branch from 8d064b0 to a0a5f39 on January 8, 2021 21:43

codecov bot commented Jan 11, 2021

Codecov Report

Merging #1665 (9da1f50) into main (80daa66) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1665     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         240      240             
  Lines       18577    18625     +48     
=========================================
+ Hits        18569    18617     +48     
  Misses          8        8             
Impacted Files Coverage Δ
evalml/data_checks/no_variance_data_check.py 100.0% <ø> (ø)
...lml/tests/model_understanding_tests/test_graphs.py 100.0% <ø> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)
...ta_checks_tests/test_invalid_targets_data_check.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 80daa66...9da1f50.

@chukarsten chukarsten force-pushed the 1548-make_invalidatargetdatacheck_smarter branch from 645b9fa to f0989dd on January 11, 2021 17:14
@@ -81,20 +83,20 @@ def test_invalid_target_data_check_multiclass_two_examples_per_class():
expected_message = "Target does not have at least two instances per class which is required for multiclass classification"

# with 1 class not having min 2 instances
- assert invalid_targets_check.validate(X, y=pd.Series([0, 1, 1, 2, 2])) == {
+ assert invalid_targets_check.validate(X, y=pd.Series([0] + [1] * 19 + [2] * 80)) == {

@chukarsten (Contributor, Author):
This is an edit to make the targets pass the test that looks for targets with large numbers of unique values relative to the total number of target values. I do this in a few places.
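
For context, the edit keeps the ratio computed later in the check (unique classes divided by total target values) under the 5% default. A quick illustration of the arithmetic, assuming that ratio and threshold:

```python
import pandas as pd

old_y = pd.Series([0, 1, 1, 2, 2])            # 3 unique classes / 5 values   = 0.60, would trip the warning
new_y = pd.Series([0] + [1] * 19 + [2] * 80)  # 3 unique classes / 100 values = 0.03, stays under 0.05

print(old_y.nunique() / len(old_y))  # 0.6
print(new_y.nunique() / len(new_y))  # 0.03
```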

@bchen1116 bchen1116 (Contributor) left a comment:
Just left a question

details=details).to_dict())

num_class_to_num_value_ratio = len(unique_values) / len(y)
if num_class_to_num_value_ratio >= self.multiclass_continuous_threshold:

Contributor:
Not sure I'm understanding this. If there are 3 unique values in a list of length 50, this would warn that the problem could be regression? 3/50 == 0.06 > 0.05. If so, this default seems really high, but definitely open to discuss.

Contributor:
Good point, I think that's why this is a warning rather than an error, but maybe there could be an additional check taking into account the size of the target data when calculating this ratio?

Contributor:
Yea I agree that I wouldn't expect a warning on a problem with 3 unique classes across 50 observations. Maybe we should bump it higher. As a follow up, it'd be nice if the detect_problem_type util used this same logic for recommending multiclass vs regression. Right now that has a hard cap on 10 classes for multiclass. That can be tackled in a separate issue though.

@chukarsten (Contributor, Author):
Yup, I think this is a good place for a discussion. I just implemented what was in the issue as I wasn't present for the discussion - a 5% threshold. I'm certainly down for a smarter method of doing it.
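
To make the example in this thread concrete, here is the check from the diff applied to the 50-observation case, assuming the default multiclass_continuous_threshold of 0.05 discussed above:

```python
unique_values = [0, 1, 2]               # 3 unique classes
n_target_values = 50
multiclass_continuous_threshold = 0.05  # default being debated in this thread

num_class_to_num_value_ratio = len(unique_values) / n_target_values  # 0.06
if num_class_to_num_value_ratio >= multiclass_continuous_threshold:
    # 0.06 >= 0.05, so a 3-class target over 50 rows would get the warning
    print("warning: target has a high unique-class ratio; problem may be regression")
```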

evalml/data_checks/no_variance_data_check.py (resolved review thread)

@ParthivNaresh ParthivNaresh (Contributor) left a comment:
Looks excellent, just a few wording changes!

evalml/data_checks/data_check_message_code.py (outdated, resolved)
evalml/data_checks/invalid_targets_data_check.py (outdated, resolved)
evalml/data_checks/invalid_targets_data_check.py (outdated, resolved)
@@ -314,3 +320,67 @@ def test_invalid_target_data_check_initialize_with_none_objective():
with pytest.raises(DataCheckInitError, match="Encountered the following error"):
DataChecks([InvalidTargetDataCheck], {"InvalidTargetDataCheck": {"problem_type": "multiclass",
"objective": None}})


def test_invalid_target_data_check_regression_problem_nonnumeric_data():

Contributor:
You could parametrize with a non-regression problem type to check the differences if you want to

@chukarsten (Contributor, Author):
So I went ahead and parametrized it... I guess I'm not very familiar with what we ultimately want. I know in some of my tests I use non-numerics for a categorical, but I don't think we're set up to do that, right? ["Happy", "Birthday", "Birthday"] isn't considered a multiclass target, but [0, 1, 2, 2, 2, 1] is, right?
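
A rough sketch of the parametrization suggested above; the objective names, the result keys, and the assertion are assumptions for illustration rather than the exact test that was added:

```python
import pandas as pd
import pytest

from evalml.data_checks import InvalidTargetDataCheck  # import path assumed

# Placeholder objective per problem type; names assumed from evalml's built-in objectives.
OBJECTIVES = {"regression": "R2", "binary": "Log Loss Binary", "multiclass": "Log Loss Multiclass"}

@pytest.mark.parametrize("problem_type", ["regression", "binary", "multiclass"])
def test_invalid_target_data_check_nonnumeric_data(problem_type):
    X = pd.DataFrame({"col": range(6)})
    y = pd.Series(["Happy", "Birthday", "Birthday", "Happy", "Birthday", "Happy"])
    check = InvalidTargetDataCheck(problem_type=problem_type, objective=OBJECTIVES[problem_type])
    results = check.validate(X, y=y)
    if problem_type == "regression":
        # Non-numeric targets should error for regression (assuming validate returns a dict with an "errors" list).
        assert len(results["errors"]) > 0
```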

@freddyaboulton freddyaboulton (Contributor) left a comment:
@chukarsten Thanks for the improvements!

evalml/data_checks/invalid_targets_data_check.py (outdated, resolved)
evalml/data_checks/invalid_targets_data_check.py (outdated, resolved)
chukarsten and others added 15 commits January 12, 2021 16:19
… Need to refactor to handle the details checks.
…ulticlass problem with binary target data and tests. Need to update the docstrings and also look at how the previous X and y were updated to not trigger the large amount of classes warning. Many of these X/y combos can probably be merged.
Shortened the unique class message.
Refactored to accommodate TARGET_MULTICLASS_HIGH_UNIQUE_CLASS
Updated tests to accommodate the TARGET_MULTICLASS_HIGH_UNIQUE_CLASS change.
@chukarsten chukarsten force-pushed the 1548-make_invalidatargetdatacheck_smarter branch from 1f410af to 9da1f50 on January 12, 2021 21:20
@chukarsten chukarsten merged commit ba7590f into main Jan 12, 2021
@chukarsten chukarsten deleted the 1548-make_invalidatargetdatacheck_smarter branch January 12, 2021 21:55
@bchen1116 bchen1116 mentioned this pull request Jan 26, 2021