Skip to content

Added threshold for NullDataCheck#3507

Merged
ParthivNaresh merged 11 commits intomainfrom
add_thresholding_nulldc
May 10, 2022
Merged

Added threshold for NullDataCheck#3507
ParthivNaresh merged 11 commits intomainfrom
add_thresholding_nulldc

Conversation

@ParthivNaresh
Copy link
Contributor

Fixes #3286

@ParthivNaresh ParthivNaresh self-assigned this May 9, 2022
@codecov
Copy link

codecov bot commented May 9, 2022

Codecov Report

Merging #3507 (cff66a9) into main (2caa985) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3507     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        336     336             
  Lines      33420   33436     +16     
=======================================
+ Hits       33294   33310     +16     
  Misses       126     126             
Impacted Files Coverage Δ
...nsformers/preprocessing/time_series_regularizer.py 100.0% <ø> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <ø> (ø)
evalml/data_checks/null_data_check.py 100.0% <100.0%> (ø)
...ml/tests/data_checks_tests/test_null_data_check.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2caa985...cff66a9. Read the comment docs.

else:
ww_payload = infer_frequency(
X[self.time_index],
X_ww[self.time_index],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sneaking this in here

@ParthivNaresh ParthivNaresh marked this pull request as ready for review May 9, 2022 19:34
Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks solid to me. Left a few nits and some things that aren't blocking that you can feel free to reject.

def __init__(self, pct_null_col_threshold=0.95, pct_null_row_threshold=0.95):
def __init__(
self,
pct_null_col_threshold=0.95,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to change the name of this to reflect that this is "more null" than "moderately null"? Like "heavily null" or something. I dunno, I'm not a thesaurus.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I chose not to was to prevent this from becoming a breaking change!

).to_dict(),
DataCheckWarning(
message="Column(s) 'lots_of_null', 'nullable_integer', 'nullable_bool' have null values",
message="Column(s) 'lots_of_null', 'nullable_integer', 'nullable_bool' have between 20.0% and 95.0% null values",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nit, but is there any value between making these values like a module level variable for the data check and import it here to build this string? Or do we just want to accept that we might change the datacheck's default values and then have to update this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think we could try something like what woodwork has. Since we have enough instances throughout our data checks that if we go down the path of creating a config dc defaults file with all default parameters, we can call upon that in all our tests if we aren't using the default parameters. Should I file an issue for this?

@ParthivNaresh ParthivNaresh merged commit 8f7b4f5 into main May 10, 2022
@chukarsten chukarsten mentioned this pull request May 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add thresholding for warning and suggesting IMPUTE_COL action in NullDataCheck

2 participants