
Add null rows check to HighlyNullDataCheck #2222

Merged · 23 commits · May 13, 2021
Conversation

jeremyliweishih (Contributor)

Fixes #2220.

@jeremyliweishih changed the title from Add HighlyNullRowDataCheck to Add HighlyNullRowsDataCheck on May 3, 2021
class HighlyNullRowsDataCheck(DataCheck):
"""Checks if there are any highly-null rows in the input."""

def __init__(self, pct_null_threshold=0.5):
jeremyliweishih (Contributor, Author):

I set the default to 0.5 (as opposed to 0.95) because I think we can be more aggressive about removing individual samples/rows than about removing a whole column, as HighlyNullDataCheck does.

codecov bot commented May 4, 2021

Codecov Report

Merging #2222 (8c96390) into main (40a8765) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2222     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         280      280             
  Lines       24322    24357     +35     
=========================================
+ Hits        24300    24335     +35     
  Misses         22       22             
Impacted Files Coverage Δ
evalml/data_checks/default_data_checks.py 100.0% <ø> (ø)
evalml/data_checks/data_check_action_code.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (ø)
...tests/data_checks_tests/test_data_check_message.py 100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)
...s/data_checks_tests/test_highly_null_data_check.py 100.0% <100.0%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40a8765...8c96390.

@jeremyliweishih jeremyliweishih marked this pull request as ready for review May 5, 2021 13:02
@chukarsten left a comment (Contributor)
Great work here, Jeremy. I think we should merge the logic for the 0% case with the rest of the percentage-empty logic in the data check's .validate(). Unless there's a really compelling reason for the additional complexity, I think it would be good to drop the if branching and merge them. Additionally, we might want to talk about adding test coverage for other null types? This looks great though. I really like the tests; very readable.

@freddyaboulton left a comment (Contributor)

@jeremyliweishih Looks great! I think this is good to merge pending @chukarsten 's comments.

evalml/data_checks/default_data_checks.py
from evalml.utils import _convert_woodwork_types_wrapper, infer_feature_types


class HighlyNullRowsDataCheck(DataCheck):
Contributor:

I wonder if we can share some code with HighlyNullDataCheck. The only computational difference I see is

(X.isnull().mean()).to_dict() vs (X.isnull().mean(axis=1)).to_dict().

I'd be happy to punt on this though, doesn't seem super pressing.
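A minimal sketch of the shared computation being discussed (the helper name here is hypothetical, not evalml API): the two checks differ only in the axis passed to mean.

```python
import pandas as pd

def highly_null_fractions(X: pd.DataFrame, axis: int = 0) -> pd.Series:
    """Fraction of null entries per column (axis=0) or per row (axis=1)."""
    return X.isnull().mean(axis=axis)

X = pd.DataFrame({"a": [None, 1.0, 2.0], "b": [None, None, 3.0]})
per_column = highly_null_fractions(X, axis=0)  # a -> 1/3, b -> 2/3
per_row = highly_null_fractions(X, axis=1)     # 1.0, 0.5, 0.0
```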

jeremyliweishih (Contributor, Author):

I thought about this as well, but I felt it fits better into DefaultDataChecks and DataChecks as separate data checks. I don't feel too strongly on this though - happy to see what others think!

Contributor:

The two options presented so far:

  1. Update the existing HighlyNullDataCheck to also perform a null row check. Don't define a new data check.
  2. Define a separate HighlyNullRowsDataCheck as this PR currently does.

The argument for option 1: Simpler. No duplicated impl code. No need to add plumbing for new check up the stack (outside EvalML). Potential for better performance on larger datasets because we'd only need to call X.isnull() once.
The argument for option 2: Easier for users to selectively only run either the col or row null checks in isolation instead of always running both.

Anything else?

I agree with @freddyaboulton here that option 1 is better.

@bchen1116 left a comment (Contributor)

This implementation looks good! I did have a question that I think we should address, whether in this PR or in a future issue.

Currently, we find the null values with the original dataset, but I can see issues if we have a dataset like:

X = pd.DataFrame({"col_1": [None, None, None],
                  "col_2": [None, 1, 2],
                  "col_3": [1, 2, None]})

Then the data check for null columns would suggest dropping col_1, and the data check for null rows would suggest dropping the first and third rows. I'm curious if we want to drop the null columns BEFORE finding the rows that satisfy this data check.

It's mainly for larger datasets, where there might be 1000+ columns and maybe half of them are mostly empty. These two HighlyNull data checks would then drop nearly all the rows and columns, which doesn't seem like the behavior we want.

Just something that could raise problems in the future. I requested changes since I wanted to have this discussion before merging!
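To make the interaction concrete, here is a sketch of what each check would flag on the DataFrame above with this PR's 0.5 default threshold (plain pandas, outside the evalml classes):

```python
import pandas as pd

X = pd.DataFrame({"col_1": [None, None, None],
                  "col_2": [None, 1, 2],
                  "col_3": [1, 2, None]})
threshold = 0.5

# Column check: col_1 is 100% null.
col_nulls = X.isnull().mean()
flagged_cols = list(col_nulls[col_nulls >= threshold].index)

# Row check on the raw data: the first and third rows are 2/3 null.
row_nulls = X.isnull().mean(axis=1)
flagged_rows = list(row_nulls[row_nulls >= threshold].index)

# Acting on both suggestions leaves a single row of two columns.
remaining = X.drop(columns=flagged_cols).drop(index=flagged_rows)
```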

docs/source/release_notes.rst
@jeremyliweishih (Contributor, Author)

@bchen1116 good points! I'm not sure if there have been discussions on the order of operations with data check actions, but I do like your suggestion. A solution we could adopt now is to bump pct_null_threshold up so that the case you brought up won't happen. I can bump it to something like 0.75, which may be more reasonable. What do you think?

@bchen1116 (Contributor)

@jeremyliweishih yeah, I think that's a good short-term solution! We might need to discuss further for a longer-term solution.

@bchen1116 left a comment (Contributor)
Nice!

@dsherry left a comment (Contributor)

@jeremyliweishih I'd like you to change the following:

  • Add to existing HighlyNullDataCheck instead of creating a separate data check
  • Return a single DataCheckWarning when one or more rows are highly null, instead of one warning for each row
  • In the warning details, save info about which rows are highly null and what the percentage was, using a single pandas Series instead of a dict.

@@ -7,5 +7,8 @@ class DataCheckActionCode(Enum):
DROP_COL = "drop_col"
"""Action code for dropping a column."""

DROP_ROW = "drop_row"
"""Action code for dropping a row."""
Contributor:

Let's call this DROP_ROWS since we'll rarely be dropping a single row. Relates to my comments about the impl.


X = _convert_woodwork_types_wrapper(X.to_dataframe())

percent_null = (X.isnull().mean(axis=1)).to_dict()
highly_null_rows = {key: value for key, value in percent_null.items() if value >= self.pct_null_threshold and value != 0}
Contributor:

Instead of using a dictionary here, what if we represented this like so:

percent_null = X.isnull().mean(axis=1)
highly_null = percent_null[percent_null >= self.pct_null_threshold]
results["warnings"].extend([DataCheckWarning(..., details={"row": row_name, "pct_null_cols": highly_null})])

For large datasets I think this is a more efficient way to share this info. Since we don't reset the index above, pandas will keep track of it for us, so the indices of the highly-null rows can be obtained by grabbing "pct_null_cols" from the warning details and running pct_null_cols.index.
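A standalone illustration of this suggestion (the data here is made up; variable names follow the snippet above):

```python
import pandas as pd

X = pd.DataFrame({"a": [None, None, 1.0, 2.0],
                  "b": [None, 3.0, None, 4.0]})
pct_null_threshold = 0.75

percent_null = X.isnull().mean(axis=1)
highly_null = percent_null[percent_null >= pct_null_threshold]

# The Series keeps the original row index, so a caller can recover the
# offending row positions directly from the warning details.
null_row_indices = list(highly_null.index)
```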

data_check_name=self.name,
message_code=DataCheckMessageCode.HIGHLY_NULL_ROWS,
details={"row": row_name, "pct_null_cols": highly_null_rows[row_name]}).to_dict()
for row_name in highly_null_rows])
Contributor:

This approach will not do well on large and/or sparse datasets. Consider the case where the data is 100k rows x 100 cols, each feature has 1% missing values, and the row positions of those missing values have 0% overlap. In that case we'd end up with 100k DataCheckWarning instances in the list here.

We should certainly still allow the caller to see which rows were null, but I don't think we should require them to process each row as a separate warning.

Instead let's return one DataCheckWarning for the entire dataset.

if len(highly_null) > 0:
    warning_msg = f"{len(highly_null)} out of {len(X)} rows are more than {self.pct_null_threshold*100}% null"
    results["warnings"].append(
        DataCheckWarning(message=warning_msg,
                         data_check_name=self.name,
                         message_code=DataCheckMessageCode.HIGHLY_NULL_ROWS,
                         details={"pct_null_cols": highly_null}))

where I'm using highly_null from my other comment above
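Putting the two comments together, a runnable sketch of the proposed single-warning flow (simplified: plain dicts stand in for the actual DataCheckWarning objects):

```python
import pandas as pd

def validate_null_rows(X: pd.DataFrame, pct_null_threshold: float = 0.95) -> dict:
    """Emit at most one warning covering all highly-null rows combined."""
    results = {"warnings": [], "errors": []}
    percent_null = X.isnull().mean(axis=1)
    highly_null = percent_null[percent_null >= pct_null_threshold]
    if len(highly_null) > 0:
        warning_msg = (f"{len(highly_null)} out of {len(X)} rows are more than "
                       f"{pct_null_threshold * 100}% null")
        results["warnings"].append({
            "message": warning_msg,
            "details": {"pct_null_cols": highly_null},  # Series keyed by row index
        })
    return results

X = pd.DataFrame({"a": [None, 1.0], "b": [None, 2.0]})
out = validate_null_rows(X)
```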

@jeremyliweishih changed the title from Add HighlyNullRowsDataCheck to Add null rows check to HighlyNullDataCheck on May 12, 2021
@dsherry left a comment (Contributor)

@jeremyliweishih thanks, nice going! 🚢 😁

results["warnings"].append(DataCheckWarning(message=warning_msg,
                                            data_check_name=self.name,
                                            message_code=DataCheckMessageCode.HIGHLY_NULL_ROWS,
                                            details={"pct_null_cols": highly_null_rows}).to_dict())
Contributor:

Thanks! Looks great.


Arguments:
    pct_null_threshold(float): If the percentage of NaN values in an input feature exceeds this amount,
-       that feature will be considered highly-null. Defaults to 0.95.
+       that column/row will be considered highly-null. Defaults to 0.95.
Contributor:

Ah, I do think it would be helpful to have a separate threshold for rows vs columns.

I filed #2270 to track that separately.

@jeremyliweishih jeremyliweishih merged commit 3a7ee46 into main May 13, 2021
@chukarsten chukarsten mentioned this pull request May 17, 2021
@freddyaboulton freddyaboulton deleted the js_2220_row_datacheck branch May 13, 2022 15:18
Linked issue: Add HighlyNullRowsDataCheck (#2220)
5 participants