Standardize data check messages by adding default "rows" and "columns" metadata#2869

Merged
angela97lin merged 32 commits into main from 2792_standardize_dc_cleanup on Oct 21, 2021

@angela97lin angela97lin commented Oct 1, 2021

Closes #2792.

  • Adds a default "rows" and "columns" metadata field to data check messages.
  • Also updates data checks that previously output one message per column so that all similar messages are consolidated into a single message.
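
As a rough illustration of the two bullets above, here is a minimal sketch using plain dictionaries rather than the actual evalml classes. The helper names `make_message` and `consolidate`, and the choice to nest the metadata under a `"details"` key, are assumptions made for this example and are not taken from the PR's code.

```python
from typing import Any, Dict, List, Optional


def make_message(
    message: str,
    columns: Optional[List[str]] = None,
    rows: Optional[List[int]] = None,
) -> Dict[str, Any]:
    """Build a message dict that always carries "columns" and "rows" keys.

    Hypothetical layout: both keys are always present and default to None.
    """
    return {"message": message, "details": {"columns": columns, "rows": rows}}


def consolidate(correlated: List[str], threshold_pct: float) -> Dict[str, Any]:
    """Collapse per-column findings into one message (hypothetical helper)."""
    quoted = ", ".join(f"'{name}'" for name in correlated)
    text = f"Columns {quoted} are {threshold_pct}% or more correlated with the target"
    return make_message(text, columns=list(correlated))


# One message for four columns instead of four separate messages.
warning = consolidate(["a", "b", "c", "d"], 80.0)
```

With this shape, a consumer can always read `warning["details"]["columns"]` and `warning["details"]["rows"]` without first checking whether the keys exist.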

I still need to update the docstring tests; I'll wait until #2933 is merged to avoid unnecessary changes and conflicts.

@angela97lin angela97lin self-assigned this Oct 1, 2021
codecov bot commented Oct 1, 2021

Codecov Report

Merging #2869 (e9e863b) into main (c941353) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2869     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28593   28626     +33     
=======================================
+ Hits       28500   28533     +33     
  Misses        93      93             
Impacted Files Coverage Δ
evalml/data_checks/class_imbalance_data_check.py 100.0% <ø> (ø)
evalml/data_checks/datetime_format_data_check.py 100.0% <ø> (ø)
evalml/data_checks/datetime_nan_data_check.py 100.0% <ø> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <ø> (ø)
evalml/data_checks/multicollinearity_data_check.py 100.0% <ø> (ø)
...lml/data_checks/natural_language_nan_data_check.py 100.0% <ø> (ø)
...alml/data_checks/target_distribution_data_check.py 100.0% <ø> (ø)
.../data_checks_tests/test_datetime_nan_data_check.py 100.0% <ø> (ø)
...s/data_checks_tests/test_highly_null_data_check.py 100.0% <ø> (ø)
...ts/data_checks_tests/test_id_columns_data_check.py 100.0% <ø> (ø)
... and 22 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@angela97lin angela97lin changed the title from "[Spike] Standardize data check messages (columns vs column)" to "Standardize data check messages by adding default "rows" and "columns" metadata" on Oct 15, 2021
@angela97lin angela97lin marked this pull request as ready for review October 15, 2021 03:40
@bchen1116 bchen1116 left a comment

LGTM! Left some nits, but I like that we have now simplified the returns and combined the column/rows when needed!

  >>> target_leakage_check = TargetLeakageDataCheck(pct_corr_threshold=0.95)
  >>> assert target_leakage_check.validate(X, y) == {
- ... "warnings": [{"message": "Column 'leak' is 95.0% or more correlated with the target",
+ ... "warnings": [{"message": "Columns 'leak' are 95.0% or more correlated with the target",

It feels weird that this is plural when there's only one column. Not a nit, but would be nice to fix this
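
One possible way to address the singular/plural wording is sketched below. It is a standalone helper, not evalml code: `describe_columns` and `leakage_message` are hypothetical names introduced only for this illustration.

```python
from typing import List


def describe_columns(columns: List[str]) -> str:
    """Return "Column 'x' is" for one column, "Columns 'x', 'y' are" otherwise."""
    quoted = ", ".join(f"'{name}'" for name in columns)
    if len(columns) == 1:
        return f"Column {quoted} is"
    return f"Columns {quoted} are"


def leakage_message(columns: List[str], pct: float) -> str:
    """Format a target-leakage style message with matching noun and verb."""
    return f"{describe_columns(columns)} {pct}% or more correlated with the target"


single = leakage_message(["leak"], 95.0)
several = leakage_message(["a", "b", "c", "d"], 80.0)
```

Here `single` reads "Column 'leak' is 95.0% or more correlated with the target", while `several` keeps the consolidated plural form.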

@ParthivNaresh ParthivNaresh left a comment

Looks great, it's definitely odd that the docstrings aren't failing in HighlyNullDataCheck and elsewhere. Getting this compatible with tempo health shouldn't be a heavy lift from here for me, thanks for the changes!

  ).to_dict(),
  DataCheckWarning(
- message="Column 'd' is 80.0% or more correlated with the target",
+ message="Columns 'a', 'b', 'c', 'd' are 80.0% or more correlated with the target",

The beauty of consolidation

  DataCheckAction(
  DataCheckActionCode.DROP_ROWS,
- metadata={"indices": all_rows_with_indices},
+ metadata={"rows": all_rows_with_indices},

I guess a follow-up comment: do we want to repeat all_rows_with_indices here? Maybe it should exist only in the DataCheckAction and not the warning?

@angela97lin (PR author) replied:

Oof I missed this comment before merging, but I'm pretty indifferent about this either way!

Linked issue: Standardize data check return output for warnings / errors for columns