Add problematic target data check #814

Merged
merged 19 commits into from Jun 4, 2020

@angela97lin angela97lin commented May 27, 2020

Closes #710 by adding InvalidTargetsDataCheck and appending it to the DefaultDataChecks run by AutoML.

InvalidTargetsDataCheck currently only checks if there are any NaN/None values in the target labels.
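
As a rough illustration of what such a check has to detect (a sketch only, not the code from this PR; the helper name `has_null_target` is hypothetical), pandas treats both `None` and `NaN` as null once the labels are wrapped in a Series:

```python
import numpy as np
import pandas as pd


def has_null_target(y):
    """Return True if any target label is NaN or None.

    Hypothetical helper for illustration; not part of evalml.
    """
    # pd.Series accepts lists, arrays, and Series; isnull() flags None and NaN alike.
    return bool(pd.Series(y).isnull().any())


print(has_null_target([0, 1, None, 1]))     # True
print(has_null_target([0.0, np.nan, 1.0]))  # True
print(has_null_target([0, 1, 1]))           # False
```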

@angela97lin angela97lin self-assigned this May 27, 2020

codecov bot commented May 27, 2020

Codecov Report

Merging #814 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #814   +/-   ##
=======================================
  Coverage   99.67%   99.67%           
=======================================
  Files         186      188    +2     
  Lines        7295     7338   +43     
=======================================
+ Hits         7271     7314   +43     
  Misses         24       24           
Impacted Files Coverage Δ
evalml/data_checks/__init__.py 100.00% <100.00%> (ø)
evalml/data_checks/default_data_checks.py 100.00% <100.00%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.00% <100.00%> (ø)
...ta_checks_tests/test_invalid_targets_data_check.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c55e109...9fbcb54.

@angela97lin angela97lin marked this pull request as ready for review June 1, 2020 15:13
@angela97lin angela97lin requested a review from dsherry June 1, 2020 15:13
# Emits one DataCheckError per null row in the target
y = pd.Series(y)
null_rows = y.isnull()
error_msg = "Row '{}' contains a null value"
return [DataCheckError(error_msg.format(row_index), self.name)
        for row_index, is_null in null_rows.items() if is_null]

I don't think we need an error for each row. Let's just return one error saying the target contains a missing value, yeah?

what if we put the count and % of the null values in the error message? 1 row vs 50% of rows is a very different error

@dsherry @kmax12 Something along the lines of "1 row(s) (50%) of rows are null"?

that's what i was thinking
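
The message format discussed in this thread could be produced along these lines. This is a sketch under assumptions, not the merged implementation: `null_target_message` is a hypothetical helper, the wording only loosely follows the proposal above, and it returns a plain string rather than a `DataCheckError`.

```python
import pandas as pd


def null_target_message(y):
    """Summarize null target values in one message, or return None if there are none.

    Hypothetical helper for illustration; the merged check may word this differently.
    """
    y = pd.Series(y)
    null_count = int(y.isnull().sum())
    if null_count == 0:
        return None
    # len(y) > 0 here, because at least one row was counted as null.
    null_percent = null_count / len(y) * 100
    return "{} row(s) ({:.1f}%) of target values are null".format(null_count, null_percent)


print(null_target_message([1, None]))  # 1 row(s) (50.0%) of target values are null
print(null_target_message([1, 0, 1]))  # None
```

A single aggregate message stays readable whether one row or half of the rows are affected, which is the distinction raised in the review.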

@dsherry dsherry left a comment

Good stuff. I think we should return a single error though.

@angela97lin angela97lin requested a review from dsherry June 2, 2020 22:15

@dsherry dsherry left a comment

Almost there, left another comment about what info is included in the error

@kmax12 kmax12 left a comment

LGTM

@dsherry dsherry left a comment

Looks great! 🚢

@angela97lin angela97lin merged commit f21a2aa into master Jun 4, 2020
@angela97lin angela97lin deleted the 710_target_check branch June 4, 2020 15:27
@angela97lin angela97lin mentioned this pull request Jun 30, 2020

Successfully merging this pull request may close these issues.

Data Checks API: Add new “invalid or problematic target data” data check
3 participants