
Updating Data Checks to support Data Health #2907

Merged
merged 23 commits into main on Oct 22, 2021

Conversation


@ParthivNaresh ParthivNaresh commented Oct 14, 2021

After conversation with Tyler, Raymond, Raj, and Dylan, we've decided to proceed with Woodwork's box plot for outlier detection, and future work done by Raymond will determine if this is the best approach. If not, the necessary changes can be made in Woodwork's implementation.
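For context, a box plot flags values outside the Tukey/IQR fences, which is the approach being adopted here. A minimal sketch of that idea in plain NumPy (the function name and the conventional 1.5 whisker multiplier are illustrative, not Woodwork's API):

```python
import numpy as np

def boxplot_outlier_bounds(values, whisker=1.5):
    """Return the (low, high) IQR fences of a standard box plot.

    Points outside [Q1 - whisker*IQR, Q3 + whisker*IQR] are flagged
    as outliers; whisker=1.5 is the conventional Tukey multiplier.
    """
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr

data = np.array([1, 2, 3, 4, 5, 100])
low, high = boxplot_outlier_bounds(data)
outliers = data[(data < low) | (data > high)]  # -> array([100])
```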

codecov bot commented Oct 14, 2021

Codecov Report

Merging #2907 (edd73a4) into main (47c1fec) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #2907     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28850   28872     +22     
=======================================
+ Hits       28759   28781     +22     
  Misses        91      91             
Impacted Files Coverage Δ
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <ø> (ø)
...s/data_checks_tests/test_highly_null_data_check.py 100.0% <ø> (ø)
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/outliers_data_check.py 100.0% <100.0%> (ø)
...ests/data_checks_tests/test_outliers_data_check.py 100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -143,7 +207,7 @@ def _no_outlier_prob(num_records: int, pct_outliers: float) -> float:
shape_param = np.exp(log_shape)
     log_scale = (
         -19.8196822259052
-        + 8.5359212447622 * log_n
+        + 18.5359212447622 * log_n
@ParthivNaresh ParthivNaresh Oct 14, 2021


Not sure if we left this out on purpose before, but this term was part of the original log scale.

@@ -98,6 +101,67 @@ def validate(self, X, y=None):
)
return results

@staticmethod
def _get_boxplot_data(data_):
@ParthivNaresh ParthivNaresh Oct 14, 2021


The purpose of this is mainly to give consumers easy access to the information they need without having to instantiate OutliersDataCheck, while still letting validate use the numbers it needs, so this code doesn't clutter up that function.

@freddyaboulton freddyaboulton

Maybe we should make it public then?


Agreed!
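The design discussed above (a @staticmethod so callers get the box-plot numbers without an instance) can be sketched roughly like this; the returned fields are an assumption for illustration, not EvalML's exact payload:

```python
import numpy as np

class OutliersDataCheck:
    """Illustrative shell; only the static helper discussed above is sketched."""

    @staticmethod
    def _get_boxplot_data(data_):
        # A @staticmethod lets consumers call this without instantiating
        # OutliersDataCheck, while validate() can reuse the same numbers.
        values = np.asarray(data_, dtype=float)
        q1, median, q3 = np.percentile(values, [25, 50, 75])
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return {
            "quantiles": (q1, median, q3),
            "bounds": (low, high),
            "outlier_indices": np.flatnonzero(
                (values < low) | (values > high)
            ).tolist(),
        }

stats = OutliersDataCheck._get_boxplot_data([1, 2, 3, 4, 5, 100])
```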

@@ -1,6 +1,7 @@
"""Data check that checks if there are any outliers in input data by using IQR to determine score anomalies."""
import numpy as np
from scipy.stats import gamma
from statsmodels.stats.stattools import medcouple
@ParthivNaresh ParthivNaresh

We have to rely on the statsmodels implementation of medcouple because robustats doesn't seem to offer a conda installation option. This also avoids adding extra dependencies.
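For reference, medcouple is a robust skewness statistic in [-1, 1] used to adjust box-plot fences for skewed data. A naive O(n²) sketch of what statsmodels.stats.stattools.medcouple computes (the real implementation is faster and handles ties at the median more carefully):

```python
import numpy as np

def medcouple_naive(x):
    """Naive medcouple: the median of the kernel
    h(xi, xj) = ((xj - m) - (m - xi)) / (xj - xi)
    over pairs with xi <= median <= xj and xi != xj."""
    x = np.sort(np.asarray(x, dtype=float))
    m = np.median(x)
    lower = x[x <= m]
    upper = x[x >= m]
    h = [
        ((xj - m) - (m - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xj != xi
    ]
    return float(np.median(h))

# A symmetric sample gives 0; right skew pushes the statistic positive.
```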

-    details={"pct_null_cols": highly_null_rows},
+    details={
+        "pct_null_cols": highly_null_rows,
+        "pct_of_rows_above_thresh": round(
@ParthivNaresh ParthivNaresh

Necessary for pct_highly_null_rows in table_health
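What this detail feeds: table_health needs the share of rows that are mostly null. A hedged sketch of that computation (the helper name and the 0.95 default are illustrative, not the data check's exact code):

```python
import pandas as pd

def pct_highly_null_rows(X, pct_null_threshold=0.95):
    """Share of rows whose fraction of null cells meets the threshold.

    Illustrative helper; the data check surfaces a similar number as
    "pct_of_rows_above_thresh" in its warning details.
    """
    row_null_pct = X.isnull().mean(axis=1)  # per-row null fraction
    return round(float((row_null_pct >= pct_null_threshold).mean()), 4)

X = pd.DataFrame({"a": [None, 1, None], "b": [None, 2, 3]})
pct = pct_highly_null_rows(X)  # only the first row is fully null
```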

DataCheckActionCode.DROP_COL,
metadata={
"column": col_name,
"row_indices": X[col_name][X[col_name].isnull()].index.tolist(),
@ParthivNaresh ParthivNaresh

Necessary for missing_score in table_health

@freddyaboulton freddyaboulton

Correct me if I'm wrong, but looking at the table health code, the missing_score doesn't use the indices to compute the scores. I think this is fine to keep as is while we do the integration but maybe we should file an issue to delete this? I guess I'm worried it can be a very big list.

@ParthivNaresh ParthivNaresh

The score doesn't use this but the indices are part of the payload:

payload = {"score": score, "indices": indices}

and are added to the column_payload in composite_score:

if metric == "missing":
    missing = missing_score(data)
    column_scores.append(missing["score"])
    column_payload["missing"] = missing

which is then appended to the final payload:

payload["column_scores"].append(column_payload)
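Putting the quoted snippets together, the flow is roughly as follows (a standalone sketch; this missing_score/composite_score pair is a simplified stand-in for the table health code being discussed, not its actual implementation):

```python
def missing_score(data):
    """Score a column's missingness; keep offending row indices in the payload."""
    indices = [i for i, v in enumerate(data) if v is None]
    score = 1.0 - len(indices) / len(data)
    return {"score": score, "indices": indices}

def composite_score(columns):
    """Aggregate per-column payloads, mirroring the quoted flow."""
    payload = {"column_scores": []}
    for name, data in columns.items():
        column_scores = []
        column_payload = {"column": name}
        for metric in ("missing",):  # other metrics elided
            if metric == "missing":
                missing = missing_score(data)
                column_scores.append(missing["score"])
                column_payload["missing"] = missing
        payload["column_scores"].append(column_payload)
    return payload

result = composite_score({"a": [1, None, 3]})
```

Note that the per-column indices ride along in the payload even though the score itself doesn't use them, which is the concern raised above about very large lists.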

).to_dict(),
DataCheckAction(
DataCheckActionCode.IMPUTE_COL,
metadata={"column": None, "is_target": True, "impute_strategy": "mean"},
).to_dict(),
DataCheckAction(
@ParthivNaresh ParthivNaresh

NoVarianceDataCheck throws this

@freddyaboulton freddyaboulton

Maybe we should get @angela97lin 's opinion on this before merging?

I think the intention is to not have duplicate actions for the same column, and now there are two DROP_COL actions each for all_null and also_all_null.

Do we need to include row_indices in the action code? Maybe it can go in the warning?


#2869 ends up combining all of the actions into one action! Agreed though: if the action is just dropping the column, the row indices seem unimportant 😛

@ParthivNaresh ParthivNaresh marked this pull request as ready for review October 14, 2021 19:48
@freddyaboulton freddyaboulton left a comment


@ParthivNaresh This is great! I think this is pretty much ready to go, but we should discuss whether we need to include row_indices in the action code for the highly null check. It introduces a subtle change where there can now be two DROP_COLs per column.


@chukarsten chukarsten changed the title Updating Data Checks to support Tempo Data Health Updating Data Checks to support Data Health Oct 19, 2021
@ParthivNaresh ParthivNaresh marked this pull request as draft October 20, 2021 18:05
@ParthivNaresh ParthivNaresh marked this pull request as ready for review October 21, 2021 20:05
@freddyaboulton freddyaboulton left a comment


Looks great to me @ParthivNaresh ! Thanks for making the changes

@ParthivNaresh ParthivNaresh merged commit 8357f13 into main Oct 22, 2021
@chukarsten chukarsten mentioned this pull request Oct 27, 2021
@freddyaboulton freddyaboulton deleted the Match-DataHealth-Functions branch May 13, 2022 15:23