Port over highly-null Data Check and define BasicDataChecks and DisableDataChecks classes #745
Codecov Report
@@            Coverage Diff            @@
##           master    #745    +/-   ##
=======================================
  Coverage   99.35%   99.36%
=======================================
  Files         148      151     +3
  Lines        5299     5378    +79
=======================================
+ Hits         5265     5344    +79
  Misses         34       34
Good stuff! I left some comments. In particular, I think the highly null data check should return one warning for each column, not just one which lists all columns.
"""Checks if there are any highly-null columns in the input.

Arguments:
    percent_threshold(float): If the percentage of values in an input feature exceeds this amount,
Let's update this description. It doesn't mention what the percentage applies to. In fact, I wonder if we should rename this parameter. Perhaps pct_null_threshold?
I like pct_null_threshold! I'm not sure what you mean by "doesn't mention what the percentage applies to"; is the input feature not clear? :o
@dsherry I just changed it to:
pct_null_threshold(float): If the percentage of NaN values in an input feature exceeds this amount, that feature will be considered highly-null. Defaults to 0.95.
Is that better?
👍 yeah thanks!
Cool, I think that's the last comment to address on this PR, so I'll merge it when tests are green 👍
percent_null = (X.isnull().mean()).to_dict()
highly_null_cols = {key: value for key, value in percent_null.items() if value >= self.percent_threshold}
warning_msg = "Column '{}' is {}% or more null"
return [DataCheckWarning(warning_msg.format(col_name, self.percent_threshold * 100), self.name) for col_name in highly_null_cols]
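For readers following along, here is a minimal standalone sketch of the logic in the snippet above, written as a plain function rather than the PR's data check class. The function name `find_highly_null_columns`, the coercion of non-DataFrame input with `pd.DataFrame(...)`, and the sample data are assumptions for illustration; the sketch returns message strings instead of `DataCheckWarning` objects so that it runs without the project installed.

```python
import pandas as pd


def find_highly_null_columns(X, pct_null_threshold=0.95):
    """Return one warning message per column whose null fraction meets the threshold."""
    # Assumption: coerce lists / arrays to a DataFrame so .isnull() is available.
    # The PR's own input handling is not shown in this thread.
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)
    # isnull() yields a boolean frame; its column-wise mean is the fraction of nulls.
    percent_null = X.isnull().mean().to_dict()
    warning_msg = "Column '{}' is {}% or more null"
    return [
        warning_msg.format(col_name, pct_null_threshold * 100)
        for col_name, fraction in percent_null.items()
        if fraction >= pct_null_threshold
    ]


# Illustrative data: one fully null column, one half null, one with no nulls.
df = pd.DataFrame({
    "all_null": [None, None, None, None],
    "half_null": [1.0, None, 3.0, None],
    "no_null": [1, 2, 3, 4],
})
messages = find_highly_null_columns(df, pct_null_threshold=0.5)
```

With a threshold of 0.5, both `all_null` and `half_null` are flagged, each with its own message, because the comparison is `>=` as in the diff.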
What does this output for inputs which aren't pd.DataFrame, or for dataframes which don't have column names set? If a dataframe's column names aren't set, pd.DataFrame.to_dict will use the dataframe index. I see we have coverage for this in test_highly_null_data_check_input_formats. I guess there's not much we can do about that, haha.
Relatedly, in the future we'll probably want each data check to have its own message type. For instance, if we had HighlyNullColumnWarning, we could have that add a column_name parameter as metadata. I don't think we should add that now, but I bet we'll need that later.
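To make the idea concrete, a per-check message type might look roughly like the sketch below. Everything here is hypothetical: `DataCheckWarning` is a stand-in dataclass (the project's real class is not shown in this thread), and `HighlyNullColumnWarning`, its `column_name` field, and the check name string are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class DataCheckWarning:
    """Stand-in for the project's warning type; the real class is not shown here."""
    message: str
    data_check_name: str


@dataclass
class HighlyNullColumnWarning(DataCheckWarning):
    """Hypothetical subclass that carries the offending column as structured metadata."""
    # Column labels may be strings or integers, so the type is left open.
    column_name: object = None


warning = HighlyNullColumnWarning(
    message="Column 'age' is 95.0% or more null",
    data_check_name="HighlyNullDataCheck",  # illustrative name
    column_name="age",
)
```

A caller could then read `warning.column_name` directly instead of parsing the column out of the message text.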
Yeah, right now it'd use the dataframe index, which I don't think is too bad of an idea? I think it's pretty nice for 2d data, but maybe a little weird for lists / series.
And yeah, I think having each data check use its own message type was something we had talked about during the design phase, but it seemed excessive / unnecessary for now.
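A quick way to see which labels end up in the messages is to run the same `isnull().mean().to_dict()` chain on unnamed inputs. This sketch assumes non-DataFrame input is wrapped with `pd.DataFrame(...)` first, which may differ from how the check actually converts its input. In pandas, a frame built without column names gets default integer column labels, and those integers become the dict keys.

```python
import pandas as pd

# 2-D data with no column names: pandas assigns the integer labels 0 and 1.
unnamed = pd.DataFrame([[None, 1], [None, 2], [None, 3]])
unnamed_labels = unnamed.isnull().mean().to_dict()

# A Series wrapped in a DataFrame becomes a single column labelled 0,
# so a warning would read "Column '0' is ...".
series_labels = pd.DataFrame(pd.Series([None, None, 1.0])).isnull().mean().to_dict()
```

For the 2-D case the keys are simply the column positions; for a list or Series the single key `0` says little, which matches the "a little weird" remark above.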
Left more comments, but still LGTM 🚀
Closes #708
Some questions:
DetectHighlyNullDataCheck: name too long? DetectHighlyNull?