Port over highly-null Data Check and define BasicDataChecks and DisableDataChecks classes #745

Merged
merged 18 commits into from May 8, 2020

Conversation

angela97lin
Contributor

@angela97lin angela97lin commented May 5, 2020

Closes #708

Some questions:

  • DetectHighlyNullDataCheck: name too long? DetectHighlyNull?
  • should we return a warning or error?

@codecov

codecov bot commented May 5, 2020

Codecov Report

Merging #745 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #745   +/-   ##
=======================================
  Coverage   99.35%   99.36%           
=======================================
  Files         148      151    +3     
  Lines        5299     5378   +79     
=======================================
+ Hits         5265     5344   +79     
  Misses         34       34           
Impacted Files Coverage Δ
evalml/data_checks/__init__.py 100.00% <100.00%> (ø)
evalml/data_checks/default_data_checks.py 100.00% <100.00%> (ø)
...valml/data_checks/detect_highly_null_data_check.py 100.00% <100.00%> (ø)
evalml/data_checks/utils.py 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_check.py 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37412f2...9b3e9d5. Read the comment docs.

@angela97lin angela97lin marked this pull request as ready for review May 6, 2020
@angela97lin angela97lin requested a review from dsherry May 6, 2020
dsherry
dsherry approved these changes May 6, 2020
Collaborator

@dsherry dsherry left a comment

Good stuff! I left some comments. In particular, I think the highly null data check should return one warning for each column, not just one which lists all columns.
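
To make the difference between the two reporting styles concrete, here is a minimal sketch. It uses plain strings in place of evalml's DataCheckWarning objects, and the function names, sample data, and the wording of the combined message are all made up for illustration; only the per-column message format is taken from the diff under review.

```python
import pandas as pd


def null_fractions(X):
    """Fraction of null values per column, keyed by column label."""
    return X.isnull().mean().to_dict()


def single_combined_warning(X, threshold):
    """Hypothetical 'one warning listing all columns' style."""
    cols = [c for c, frac in null_fractions(X).items() if frac >= threshold]
    if not cols:
        return []
    names = ", ".join("'{}'".format(c) for c in cols)
    return ["Columns {} are {}% or more null".format(names, threshold * 100)]


def one_warning_per_column(X, threshold):
    """Style requested in the review: one message per highly-null column."""
    return ["Column '{}' is {}% or more null".format(c, threshold * 100)
            for c, frac in null_fractions(X).items() if frac >= threshold]


# Sample data (an assumption): 'a' is fully null, 'b' is half null, 'c' has no nulls.
X = pd.DataFrame({"a": [None, None], "b": [None, 1.0], "c": [1.0, 2.0]})

combined = single_combined_warning(X, 0.5)    # one message covering 'a' and 'b'
per_column = one_warning_per_column(X, 0.5)   # one message each for 'a' and 'b'
```

With one message per column, a caller can filter or count warnings by column without parsing a combined string.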

"""Checks if there are any highly-null columns in the input.

Arguments:
percent_threshold(float): If the percentage of values in an input feature exceeds this amount,
Collaborator

@dsherry dsherry May 7, 2020

Let's update this description. It doesn't mention what the percentage applies to. In fact, I wonder if we should rename this parameter. Perhaps pct_null_threshold

Contributor Author

@angela97lin angela97lin May 7, 2020

I like pct_null_threshold! I'm not sure what you mean by "doesn't mention what the percentage applies to"; is the input feature not clear? :o

Contributor Author

@angela97lin angela97lin May 8, 2020

@dsherry I just changed it to:

pct_null_threshold(float): If the percentage of NaN values in an input feature exceeds this amount, that feature will be considered highly-null. Defaults to 0.95.

Is that better?
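
As a rough illustration of what the threshold applies to, the sketch below computes the per-column null fraction the same way the diff under review does (`X.isnull().mean()`). The sample data and the 0.75 threshold are assumptions chosen so the arithmetic is easy to follow; the default named in the docstring is 0.95. Note that the docstring says "exceeds" while the reviewed code compares with `>=`, so a column sitting exactly at the threshold is flagged in this sketch.

```python
import pandas as pd

# Hypothetical sample data: "mostly_null" is 3/4 null, "all_null" is fully null.
X = pd.DataFrame({
    "full": [1, 2, 3, 4],
    "mostly_null": [None, None, None, 1.0],
    "all_null": [None, None, None, None],
})

pct_null_threshold = 0.75  # chosen for this example only

# isnull() yields a boolean frame; its column-wise mean is the fraction of nulls.
pct_null = X.isnull().mean().to_dict()

# Same comparison as the reviewed code: >= rather than a strict >.
highly_null = [col for col, frac in pct_null.items() if frac >= pct_null_threshold]
```

Here "mostly_null" is exactly at the threshold and is still reported, which is the inclusive behaviour of `>=`.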

Collaborator

@dsherry dsherry May 8, 2020

👍 yeah thanks!

Contributor Author

@angela97lin angela97lin May 8, 2020

Cool, I think that's the last comment to address on this PR, so I'll merge it when tests are green 👍

percent_null = (X.isnull().mean()).to_dict()
highly_null_cols = {key: value for key, value in percent_null.items() if value >= self.percent_threshold}
warning_msg = "Column '{}' is {}% or more null"
return [DataCheckWarning(warning_msg.format(col_name, self.percent_threshold * 100), self.name) for col_name in highly_null_cols]
Collaborator

@dsherry dsherry May 7, 2020

What does this output for inputs which aren't pd.DataFrame, or for dataframes which don't have column names set? If a dataframe's column names aren't set, pd.DataFrame.to_dict will use the dataframe index. I see we have coverage for this in test_highly_null_data_check_input_formats. I guess there's not much we can do about that, haha.

Relatedly, in the future we'll probably want each data check to have its own message type. For instance, if we had HighlyNullColumnWarning, we could have that add a column_name parameter as metadata. I don't think we should add that now, but I bet we'll need that later.
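
For the question about inputs without column names, here is a small sketch of what plain pandas does. It only exercises pandas itself and makes no claim about how evalml converts its inputs before the check runs. When a DataFrame is built without column names, the keys of the resulting dict are pandas' default integer column labels; for a Series, `isnull().mean()` collapses to a single number, so there is nothing to key by.

```python
import numpy as np
import pandas as pd

# A DataFrame built from a 2-D array gets the default integer column labels 0, 1, ...
X = pd.DataFrame(np.array([[np.nan, 1.0], [np.nan, 2.0]]))
pct_null = X.isnull().mean().to_dict()  # keys are the integers 0 and 1

# For a Series the same expression is a scalar: the overall fraction of nulls.
s = pd.Series([np.nan, np.nan, 1.0])
series_pct_null = s.isnull().mean()
```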

Contributor Author

@angela97lin angela97lin May 7, 2020

Yeah, right now it'd use the dataframe index, which I don't think is too bad of an idea? I think it's pretty nice for 2d data, but maybe a little weird for lists / series.

And yeah, I think the idea of each data check having its own message type was something we had talked about during the design phase, but it seemed excessive / unnecessary for now.
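
For reference, the per-check message type discussed in this thread might look roughly like the sketch below. It was explicitly not added in this PR. Both classes are stand-ins written for illustration: the base class only mirrors the two positional arguments (message, check name) visible in the diff, and HighlyNullColumnWarning with its column_name field is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataCheckWarning:
    """Stand-in for evalml's DataCheckWarning: a message plus the check's name."""
    message: str
    data_check_name: str


@dataclass
class HighlyNullColumnWarning(DataCheckWarning):
    """Hypothetical subclass that carries the offending column as metadata."""
    column_name: str


warning = HighlyNullColumnWarning(
    "Column 'a' is 50.0% or more null",  # message text, as in the diff's format
    "DetectHighlyNullDataCheck",         # check name used in this PR's discussion
    "a",                                 # structured metadata, no string parsing needed
)
```

Carrying the column name as a field would let downstream code act on the affected column directly instead of extracting it from the message.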

@angela97lin angela97lin requested a review from dsherry May 7, 2020
dsherry
dsherry approved these changes May 7, 2020
Collaborator

@dsherry dsherry left a comment

Left more comments, but still LGTM 🚀

@angela97lin angela97lin merged commit c5c8846 into master May 8, 2020
2 checks passed
@dsherry dsherry deleted the 708_basic_data_checks branch Oct 29, 2020
Development

Successfully merging this pull request may close these issues.

Data Checks API: Port over highly-null Data Check and define BasicDataChecks and DisableDataChecks classes
2 participants