Updating Data Checks to support Data Health #2907
Conversation
# Conflicts:
#	docs/source/release_notes.rst
Codecov Report

@@            Coverage Diff            @@
##             main   #2907     +/-   ##
========================================
+ Coverage    99.7%   99.7%    +0.1%
========================================
  Files         302     302
  Lines       28850   28872      +22
========================================
+ Hits        28759   28781      +22
  Misses         91      91

Continue to review the full report at Codecov.
  log_scale = (
      -19.8196822259052
-     + 8.5359212447622 * log_n
+     + 18.5359212447622 * log_n
Not sure if we missed this on purpose before, but this was part of the original log scale.
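For context, the restored coefficient belongs to an empirically fitted expression in log(n). The sketch below is illustrative only: the full set of terms isn't visible in this diff, so the coefficient tuple here contains just the two coefficients shown above, and the helper name is hypothetical.

```python
import math

def log_scale_fit(n, coeffs=(-19.8196822259052, 18.5359212447622)):
    """Evaluate a polynomial in log(n) with empirically fitted coefficients.

    Illustrative only: the real OutliersDataCheck fit has more terms than the
    two visible in the diff above.
    """
    log_n = math.log(n)
    return sum(c * log_n ** i for i, c in enumerate(coeffs))
```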
          return results


      @staticmethod
      def _get_boxplot_data(data_):
The purpose of this is mainly to give consumers easy access to the information they need without having to instantiate OutliersDataCheck, while still letting validate use the numbers it needs so this code doesn't clutter up the function.
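For a rough idea of what a boxplot-data helper can return, here is a standalone sketch (not evalml's actual implementation; the linear quartile interpolation and Tukey's 1.5 * IQR fences are assumptions):

```python
def get_boxplot_data(values):
    """Sketch of boxplot statistics: quartiles, IQR fences, and outliers."""
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between closest ranks (numpy's default method).
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, median, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey's fences
    outliers = [v for v in s if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3,
            "low_bound": low, "high_bound": high, "outliers": outliers}
```

Exposing something like this as a static (or public) method is what makes it callable without constructing the data check first.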
Maybe we should make it public then?
Agreed!
  """Data check that checks if there are any outliers in input data by using IQR to determine score anomalies."""
  import numpy as np
  from scipy.stats import gamma
  from statsmodels.stats.stattools import medcouple
We have to rely on the statsmodels implementation of medcouple because robustats doesn't seem to have a conda installation option. This also avoids adding extra dependencies.
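For intuition about what medcouple measures (robust skewness), here is a naive O(n^2) sketch; the statsmodels implementation is the one actually used, and this toy version simply skips the special kernel for ties at the median:

```python
import statistics

def medcouple_naive(values):
    """Naive medcouple: median of a kernel over pairs straddling the median.

    Sketch only; tie pairs at the median are skipped here, whereas
    statsmodels handles them with a dedicated sign kernel.
    """
    med = statistics.median(values)
    upper = [x for x in values if x >= med]
    lower = [x for x in values if x <= med]
    # xi == xj can only happen when both equal the median, so this filter
    # drops exactly the tie pairs.
    h = [
        ((xi - med) - (med - xj)) / (xi - xj)
        for xi in upper for xj in lower if xi != xj
    ]
    return statistics.median(h)
```

A symmetric sample gives 0; right-skewed data gives a positive value.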
-   details={"pct_null_cols": highly_null_rows},
+   details={
+       "pct_null_cols": highly_null_rows,
+       "pct_of_rows_above_thresh": round(
Necessary for pct_highly_null_rows in table_health
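A minimal sketch of what a pct_of_rows_above_thresh figure could compute (a plain-Python stand-in; the real check operates on a DataFrame, and the function and parameter names here are hypothetical, taken only from the diff above):

```python
def pct_rows_above_threshold(rows, pct_null_threshold=0.95):
    """Fraction of rows whose share of missing values meets the threshold.

    rows: list of dicts mapping column name -> value; None marks a missing
    entry (assumed representation for this sketch).
    """
    highly_null = [
        r for r in rows
        if sum(v is None for v in r.values()) / len(r) >= pct_null_threshold
    ]
    return round(len(highly_null) / len(rows), 4)
```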
  DataCheckActionCode.DROP_COL,
  metadata={
      "column": col_name,
      "row_indices": X[col_name][X[col_name].isnull()].index.tolist(),
Necessary for missing_score in table_health
Correct me if I'm wrong, but looking at the table health code, missing_score doesn't use the indices to compute the scores. I think this is fine to keep as is while we do the integration, but maybe we should file an issue to delete this? I'm worried it can be a very big list.
The score doesn't use this, but the indices are part of the payload:

    payload = {"score": score, "indices": indices}

and are added to the column_payload in composite_score:

    if metric == "missing":
        missing = missing_score(data)
        column_scores.append(missing["score"])
        column_payload["missing"] = missing

which is then appended to the final payload:

    payload["column_scores"].append(column_payload)
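That flow can be sketched end to end as a toy stand-in for the table-health code (None marks a missing value, and the scoring formula and aggregation here are assumptions, not the actual implementation):

```python
def missing_score(column):
    """Toy missing_score: health score plus the indices of missing entries."""
    indices = [i for i, v in enumerate(column) if v is None]
    score = 1.0 - len(indices) / len(column)  # assumed: higher is healthier
    return {"score": score, "indices": indices}

def composite_score(table, metrics=("missing",)):
    """Toy composite_score: per-column payloads rolled up into a table score."""
    payload = {"column_scores": []}
    table_scores = []
    for name, column in table.items():
        column_payload = {"column": name}
        column_scores = []
        if "missing" in metrics:
            missing = missing_score(column)
            column_scores.append(missing["score"])
            column_payload["missing"] = missing  # indices ride along here
        column_payload["score"] = sum(column_scores) / len(column_scores)
        table_scores.append(column_payload["score"])
        payload["column_scores"].append(column_payload)
    payload["score"] = sum(table_scores) / len(table_scores)
    return payload
```

This also shows why the indices list can grow large: every missing row index is carried through into the final payload.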
  DataCheckActionCode.IMPUTE_COL,
  metadata={"column": None, "is_target": True, "impute_strategy": "mean"},
  ).to_dict(),
  DataCheckAction(
NoVarianceDataCheck throws this
Maybe we should get @angela97lin's opinion on this before merging?
I think the intention is not to have duplicate actions for the same column, and now there are two DROP_COL actions for both all_null and also_all_null.
Do we need to include row_indices in the action code? Maybe it can go in the warning?
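One way to avoid the duplicate DROP_COLs would be to deduplicate actions by (code, column) before returning them. A hypothetical sketch with dict-shaped actions (not evalml's DataCheckAction class):

```python
def dedupe_actions(actions):
    """Keep the first action per (code, column) pair.

    actions: list of dicts like
    {"code": "DROP_COL", "metadata": {"column": ...}} (assumed shape).
    """
    seen = set()
    deduped = []
    for action in actions:
        key = (action["code"], action["metadata"].get("column"))
        if key not in seen:
            seen.add(key)
            deduped.append(action)
    return deduped
```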
#2869 ends up combining all of the actions into one action! Agreed though, if the action is just drop the column, the row indices seem unimportant 😛
# Conflicts:
#	docs/source/release_notes.rst
@ParthivNaresh This is great! I think this is pretty much ready to go, but I think we should discuss whether we need to include row_indices in the action code for the highly null check. It's introducing a subtle change where there can now be two DROP_COLs per column.
…yx/evalml into Match-DataHealth-Functions
# Conflicts:
#	evalml/data_checks/highly_null_data_check.py
#	evalml/tests/data_checks_tests/test_data_checks.py
#	evalml/tests/data_checks_tests/test_highly_null_data_check.py
freddyaboulton left a comment
Looks great to me @ParthivNaresh! Thanks for making the changes.
After conversation with Tyler, Raymond, Raj, and Dylan, we've decided to proceed with Woodwork's box plot for outlier detection. Future work by Raymond will determine whether this is the best approach; if not, the necessary changes can be made in Woodwork's implementation.