Standardize data check messages by adding default "rows" and "columns" metadata #2869

Merged: 32 commits, Oct 21, 2021

Conversation

@angela97lin commented Oct 1, 2021

Closes #2792.

  • Adds a default "rows" and "columns" metadata field to data check messages.
  • Updates data checks that previously emitted one message per column so that similar messages are consolidated into a single message.
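
To make the two bullets above concrete, here is a minimal sketch of consolidating per-column findings into one message that always carries "columns" and "rows" keys. It uses plain dictionaries instead of evalml's message classes. The helper name `consolidate_column_messages` and the `details` key are assumptions for illustration, not the code in this PR.

```python
def consolidate_column_messages(columns, template, rows=None):
    """Build one message dict covering every flagged column.

    Hypothetical helper: plain dicts stand in for evalml's message classes.
    """
    # Quote each column name and join them into one readable list.
    quoted = ", ".join(f"'{col}'" for col in columns)
    return {
        "message": template.format(columns=quoted),
        # Both default keys are always present, even when one does not apply.
        "details": {"columns": list(columns) if columns else None, "rows": rows},
    }


flagged = ["a", "b", "c", "d"]
warning = consolidate_column_messages(
    flagged, "Columns {columns} are 80.0% or more correlated with the target"
)
print(warning["message"])
```

Always emitting both keys means a consumer can read `details["columns"]` and `details["rows"]` without first checking which data check produced the message.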

I still need to update the docstring tests; I'll wait until #2933 is merged to avoid unnecessary changes and conflicts.

@angela97lin self-assigned this Oct 1, 2021

codecov bot commented Oct 1, 2021

Codecov Report

Merging #2869 (e9e863b) into main (c941353) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2869     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28593   28626     +33     
=======================================
+ Hits       28500   28533     +33     
  Misses        93      93             
Impacted Files Coverage Δ
evalml/data_checks/class_imbalance_data_check.py 100.0% <ø> (ø)
evalml/data_checks/datetime_format_data_check.py 100.0% <ø> (ø)
evalml/data_checks/datetime_nan_data_check.py 100.0% <ø> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <ø> (ø)
evalml/data_checks/multicollinearity_data_check.py 100.0% <ø> (ø)
...lml/data_checks/natural_language_nan_data_check.py 100.0% <ø> (ø)
...alml/data_checks/target_distribution_data_check.py 100.0% <ø> (ø)
.../data_checks_tests/test_datetime_nan_data_check.py 100.0% <ø> (ø)
...s/data_checks_tests/test_highly_null_data_check.py 100.0% <ø> (ø)
...ts/data_checks_tests/test_id_columns_data_check.py 100.0% <ø> (ø)
... and 22 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@angela97lin changed the title on Oct 15, 2021:
  from: [Spike] Standardize data check messages (columns vs column)
  to: Standardize data check messages by adding default "rows" and "columns" metadata
@angela97lin marked this pull request as ready for review October 15, 2021 03:40
@bchen1116 bchen1116 left a comment

LGTM! Left some nits, but I like that we have now simplified the returns and combined the column/rows when needed!

evalml/data_checks/no_variance_data_check.py (resolved)
@@ -86,14 +86,14 @@ def validate(self, X, y):
>>> y = pd.Series([10, 42, 31, 51, 40])
>>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
>>> assert target_leakage_check.validate(X, y) == {
- ... "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
+ ... "warnings": [{"message": "Columns 'leak' are 95.0% or more correlated with the target",
It feels weird that this is plural when there's only one column. Not a nit, but would be nice to fix this
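
One possible way to address the singular/plural wording is to choose the noun and verb from the number of flagged columns. The sketch below is only an illustration: `correlation_message` is a hypothetical helper, not a function in evalml.

```python
def correlation_message(columns, threshold_pct):
    """Pick singular or plural wording from the number of flagged columns."""
    quoted = ", ".join(f"'{col}'" for col in columns)
    if len(columns) == 1:
        subject, verb = "Column", "is"
    else:
        subject, verb = "Columns", "are"
    return f"{subject} {quoted} {verb} {threshold_pct}% or more correlated with the target"


single = correlation_message(["leak"], 95.0)
several = correlation_message(["a", "b"], 95.0)
print(single)
print(several)
```

The trade-off is that the message text varies with the column count, so anything matching on exact message strings would need to handle both forms.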

@ParthivNaresh ParthivNaresh left a comment

Looks great, it's definitely odd that the docstrings aren't failing in HighlyNullDataCheck and elsewhere. Getting this compatible with tempo health shouldn't be a heavy lift from here for me, thanks for the changes!

evalml/data_checks/highly_null_data_check.py (resolved)
).to_dict(),
DataCheckWarning(
- message="Column 'd' is 80.0% or more correlated with the target",
+ message="Columns 'a', 'b', 'c', 'd' are 80.0% or more correlated with the target",
The beauty of consolidation

results["actions"].append(
DataCheckAction(
DataCheckActionCode.DROP_ROWS,
- metadata={"indices": all_rows_with_indices},
+ metadata={"rows": all_rows_with_indices},
I guess a follow up comment, do we want to repeat the all_rows_with_indices over here? Maybe it should exist only in the DataCheckAction and not the Warning?

Contributor Author

Oof I missed this comment before merging, but I'm pretty indifferent about this either way!
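
For reference, the alternative raised in this thread (row indices living only on the action, not on the warning) could look roughly like the sketch below. It uses plain dictionaries; `build_results`, the `"DROP_ROWS"` string code, and the warning text are assumptions for illustration rather than the merged evalml code.

```python
def build_results(rows_to_drop):
    """Return a results dict where row indices appear only on the action."""
    results = {"warnings": [], "errors": [], "actions": []}
    if rows_to_drop:
        # The warning describes the problem without repeating the indices.
        results["warnings"].append(
            {"message": f"{len(rows_to_drop)} row(s) are flagged for removal"}
        )
        # The action is the single place that lists the affected rows.
        results["actions"].append(
            {"code": "DROP_ROWS", "metadata": {"rows": list(rows_to_drop)}}
        )
    return results


results = build_results([1, 3, 4])
print(results["actions"][0]["metadata"]["rows"])
```

Keeping the indices in one place avoids the duplication, at the cost of the warning no longer being self-describing when read without its action.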

Successfully merging this pull request may close these issues.

Standardize data check return output for warnings / errors for columns
3 participants