Add null rows check to HighlyNullDataCheck
#2222
Conversation
HighlyNullRowDataCheck → HighlyNullRowsDataCheck
```python
class HighlyNullRowsDataCheck(DataCheck):
    """Checks if there are any highly-null rows in the input."""

    def __init__(self, pct_null_threshold=0.5):
```
I set the default to 0.5 (as opposed to 0.95) because I think this can be more aggressive on removing individual samples/rows than removing a whole column in `HighlyNullDataCheck`.
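To make the tradeoff concrete, here is a minimal sketch (hypothetical data, not from this PR) of how a 0.5 row threshold behaves compared to the 0.95 column default:

```python
import pandas as pd

# Hypothetical data, not from this PR: with the 0.5 default, a row is flagged
# as soon as half of its values are missing.
X = pd.DataFrame({"a": [None, 1, 2], "b": [None, 3, 4], "c": [5, 6, None]})
pct_null_per_row = X.isnull().mean(axis=1)  # fraction of nulls in each row
print(pct_null_per_row[pct_null_per_row >= 0.5])
# only row 0 (2/3 null) meets the 0.5 threshold; a 0.95 threshold would flag nothing here
```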
Codecov Report

```
@@           Coverage Diff            @@
##             main    #2222    +/-   ##
=========================================
+ Coverage   100.0%   100.0%    +0.1%
=========================================
  Files         280      280
  Lines       24322    24357      +35
=========================================
+ Hits        24300    24335      +35
  Misses         22       22
```

Continue to review the full report at Codecov.
Great work here, Jeremy. I think we should merge the logic for the 0% threshold and the other threshold values in the data check's `.validate()`. Unless there's a really compelling reason for this additional complexity, I think it'll be good to drop the if branching and merge them. Additionally, we might want to talk about adding test coverage for other null types? This looks great though. Really like the tests - very readable.
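One possible reading of that suggestion, as a rough sketch rather than the code from this PR (the helper name is made up): keep a single row-selection step so a single shared block can build the warning, with no separate 0%-specific warning branch.

```python
# Rough sketch of one reading of this suggestion (helper name is made up, not
# from the PR): one selection + warning path regardless of the threshold.
def _select_highly_null_rows(X, pct_null_threshold):
    percent_null = X.isnull().mean(axis=1)
    if pct_null_threshold == 0:
        # a 0% threshold means "flag any row containing a null"
        return percent_null[percent_null > 0]
    return percent_null[percent_null >= pct_null_threshold]
# from here a single shared block can build the DataCheckWarning for the result
```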
@jeremyliweishih Looks great! I think this is good to merge pending @chukarsten's comments.
```python
from evalml.utils import _convert_woodwork_types_wrapper, infer_feature_types


class HighlyNullRowsDataCheck(DataCheck):
```
I wonder if we can share some code with `HighlyNullDataCheck`. The only computational difference I see is `(X.isnull().mean()).to_dict()` vs `(X.isnull().mean(axis=1)).to_dict()`.

I'd be happy to punt on this though, doesn't seem super pressing.
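If we did want to share the implementation, one hypothetical shape for it (the helper name is illustrative, not part of this PR) is a single helper parameterized by axis:

```python
# Hypothetical shared helper (name made up for illustration, not part of this
# PR): axis=0 gives per-column null fractions, axis=1 gives per-row fractions.
def _percent_null(X, axis):
    return X.isnull().mean(axis=axis).to_dict()

# the column check would call _percent_null(X, axis=0),
# the row check would call _percent_null(X, axis=1)
```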
I thought about this as well, but I felt that it fits better into `DefaultDataChecks` and `DataChecks` as separate data checks. I don't feel too strongly on this though - happy to see what others think!
The two options presented so far:

1. Update the existing `HighlyNullDataCheck` to also perform a null row check. Don't define a new data check.
2. Define a separate `HighlyNullRowsDataCheck`, as this PR currently does.

The argument for option 1: Simpler. No duplicated impl code. No need to add plumbing for the new check up the stack (outside EvalML). Potential for better performance on larger datasets because we'd only need to call `X.isnull()` once.

The argument for option 2: Easier for users to selectively run either the col or row null check in isolation instead of always running both.

Anything else?
I agree with @freddyaboulton here that option 1 is better.
This implementation looks good! I did have a question that I think we should address, whether in this PR or in a future issue.
Currently, we find the null values with the original dataset, but I can see issues if we have a dataset like:
X = pd.DataFrame({"col_1": [None, None, None],
"col_2": [None, 1, 2],
"col_3": [1, 2, None]})
Then the data check for null columns would suggest dropping col_1, and the data check for null rows would suggest dropping the first and third rows, which also throws away the remaining non-null values in col_2 and col_3. I'm curious if we want to drop the null columns BEFORE finding the rows that satisfy this data check.
It's mainly a concern for larger datasets, where there might be 1000+ columns and maybe half of the columns are empty. These two `HighlyNull` data checks would then drop all the rows and columns, which doesn't seem like the behavior we want.
Just something that could raise problems in the future. Requested changes since I wanted to have this discussion before merging!
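One hedged sketch of the ordering described above (illustrative only, not what the PR currently does): apply the column check first, drop what it flags, then compute row null fractions on what remains.

```python
import pandas as pd

# Illustrative sketch only (not the PR's implementation): run the column check
# first, drop what it flags, then compute row null fractions on what remains.
X = pd.DataFrame({"col_1": [None, None, None],
                  "col_2": [None, 1, 2],
                  "col_3": [1, 2, None]})

col_threshold, row_threshold = 0.95, 0.5
pct_null_cols = X.isnull().mean()
X_reduced = X.drop(columns=pct_null_cols[pct_null_cols >= col_threshold].index)  # drops col_1

pct_null_rows = X_reduced.isnull().mean(axis=1)
print(pct_null_rows[pct_null_rows >= row_threshold])
# with col_1 gone, the first and third rows fall from 2/3 null to 1/2 null,
# so whether they are still flagged depends on where the row threshold sits
```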
@bchen1116 good points! I'm not sure if there have been discussions on the order of operations with data check actions, but I do like your suggestion. A solution we could do now would be to bump …
@jeremyliweishih yeah, I think that's a good short-term solution! Might need to discuss further for a longer-term solution.
Nice!
@jeremyliweishih I'd like you to change the following:

- Add to the existing `HighlyNullDataCheck` instead of creating a separate data check
- Return a single `DataCheckWarning` when one or more rows are highly null, instead of one warning for each row
- In the warning `details`, save info about which rows are highly null and what the percentage was, using a single pandas `Series` instead of a dict.
```
@@ -7,5 +7,8 @@ class DataCheckActionCode(Enum):
    DROP_COL = "drop_col"
    """Action code for dropping a column."""

    DROP_ROW = "drop_row"
    """Action code for dropping a row."""
```
Let's call this `DROP_ROWS` since we'll rarely be dropping a single row. Relates to my comments about the impl.
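For illustration, the renamed member would look like this (a sketch of the suggestion, not the merged code):

```python
from enum import Enum

# Sketch of the suggested rename (not the merged code):
class DataCheckActionCode(Enum):
    DROP_COL = "drop_col"
    """Action code for dropping a column."""

    DROP_ROWS = "drop_rows"
    """Action code for dropping rows; plural, since an action will usually drop several rows at once."""
```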
```python
X = _convert_woodwork_types_wrapper(X.to_dataframe())

percent_null = (X.isnull().mean(axis=1)).to_dict()
highly_null_rows = {key: value for key, value in percent_null.items() if value >= self.pct_null_threshold and value != 0}
```
Instead of using a dictionary here, what if we represented this like so:

```python
percent_null = X.isnull().mean(axis=1)
highly_null = percent_null[percent_null >= self.pct_null_threshold]
results["warnings"].extend([DataCheckWarning(..., details={"row": row_name, "pct_null_cols": highly_null})])
```

For large datasets I think this is a more efficient way to share this info. Since we don't reset the index above, pandas will keep track of it for us, so the indices of the rows which are highly null can be obtained by grabbing `"pct_null_cols"` from the warning details and running `pct_null_cols.index`.
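As a hedged illustration of the consumer side of this suggestion (it assumes the warning's details carry the pandas Series as described, which this PR as written does not guarantee):

```python
# Illustrative consumer-side sketch; assumes the warning dict carries the
# pandas Series of per-row null fractions under "pct_null_cols".
warning = results["warnings"][0]
pct_null_cols = warning["details"]["pct_null_cols"]
rows_to_drop = pct_null_cols.index  # original row labels, since the index was never reset
X_clean = X.drop(index=rows_to_drop)
```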
```python
data_check_name=self.name,
message_code=DataCheckMessageCode.HIGHLY_NULL_ROWS,
details={"row": row_name, "pct_null_cols": highly_null_rows[row_name]}).to_dict()
for row_name in highly_null_rows])
```
This approach will not do well on large and/or sparse datasets. Consider the case when the data is 100k rows x 100 cols, each feature is 1% missing values, and the row positions of those missing values have 0% overlap. In that case we'd end up with 100k `DataCheckWarning` instances in the list here.

We should certainly still allow the caller to see which rows were null, but I don't think we should require them to process each row as a separate warning. Instead, let's return one `DataCheckWarning` for the entire dataset:

```python
if len(highly_null) > 0:
    warning_msg = f"{len(highly_null)} out of {len(X)} rows are more than {self.pct_null_threshold*100}% null"
    results["warnings"].append(
        DataCheckWarning(message=warning_msg,
                         data_check_name=self.name,
                         message_code=DataCheckMessageCode.HIGHLY_NULL_ROWS,
                         details={"pct_null_cols": highly_null}))
```

where I'm using `highly_null` from my other comment above.
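Pulling these threads together, a rough sketch of how the merged check could reuse one `X.isnull()` call for both the column and row paths (assumed structure, not the code that was actually merged):

```python
# Rough sketch (assumed structure, not the code that was merged): compute
# X.isnull() once and reuse it for both the column and the row checks.
null_mask = X.isnull()
pct_null_cols = null_mask.mean()        # per-column null fraction
pct_null_rows = null_mask.mean(axis=1)  # per-row null fraction

highly_null_cols = pct_null_cols[pct_null_cols >= self.pct_null_threshold]
highly_null_rows = pct_null_rows[pct_null_rows >= self.pct_null_threshold]
# one DataCheckWarning per kind (columns / rows), each carrying its Series in
# the details, plus the corresponding DROP_COL / DROP_ROWS actions
```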
HighlyNullRowsDataCheck → HighlyNullDataCheck
@jeremyliweishih thanks, nice going! 🚢 😁
results["warnings"].append(DataCheckWarning(message=warning_msg, | ||
data_check_name=self.name, | ||
message_code=DataCheckMessageCode.HIGHLY_NULL_ROWS, | ||
details={"pct_null_cols": highly_null_rows}).to_dict()) |
Thanks! Looks great
```diff
 Arguments:
     pct_null_threshold(float): If the percentage of NaN values in an input feature exceeds this amount,
-        that feature will be considered highly-null. Defaults to 0.95.
+        that column/row will be considered highly-null. Defaults to 0.95.
```
Ah, I do think it would be helpful to have a separate threshold for rows vs columns.
I filed #2270 to track that separately.
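A hypothetical shape for that follow-up (just a sketch of what #2270 could add; the parameter names are guesses, not its actual implementation):

```python
# Hypothetical sketch of what a follow-up like #2270 could add; parameter
# names are guesses, not the actual implementation.
class HighlyNullDataCheck(DataCheck):
    def __init__(self, pct_null_col_threshold=0.95, pct_null_row_threshold=0.95):
        """Allow separate thresholds for flagging highly-null columns vs. highly-null rows."""
        self.pct_null_col_threshold = pct_null_col_threshold
        self.pct_null_row_threshold = pct_null_row_threshold
```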
Fixes #2220.