Separate thresholds for pct null rows and columns in HighlyNullDataCheck #2562

Merged
eccabay merged 8 commits into main from 2270_null_vols_rows_thresh
Jul 30, 2021

Conversation

@eccabay eccabay commented Jul 27, 2021

Closes #2270

codecov bot commented Jul 27, 2021

Codecov Report

Merging #2562 (fe17a9b) into main (0a12418) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2562     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        287     287             
  Lines      26377   26398     +21     
=======================================
+ Hits       26338   26359     +21     
  Misses        39      39             
Impacted Files Coverage Δ
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (ø)
...s/data_checks_tests/test_highly_null_data_check.py 100.0% <100.0%> (ø)


@eccabay eccabay marked this pull request as ready for review July 28, 2021 13:34
@eccabay eccabay requested a review from a team July 28, 2021 13:34
@bchen1116 bchen1116 left a comment

Looks good! I left a nit and a comment about potentially separating the check for valid thresholds on rows vs columns, but nothing blocking!

docs/source/release_notes.rst (outdated, resolved)

  raise ValueError(
- "pct_null_threshold must be a float between 0 and 1, inclusive."
+ "pct null thresholds must be a float between 0 and 1, inclusive."

I think it'd be nice to separate the checks for cols vs rows, so the error raised names the specific threshold that is invalid


I think you can also do a compound inequality
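
Both suggestions in this thread can be combined. The following is only a rough sketch, not the code that was merged: the helper name `validate_null_thresholds` and the parameter names `pct_null_col_threshold` and `pct_null_row_threshold` are assumptions made for illustration. Each threshold gets its own check, written as a compound (chained) inequality, so the error message names the offending value.

```python
def validate_null_thresholds(pct_null_col_threshold=0.95, pct_null_row_threshold=0.95):
    """Hypothetical helper: validate the column and row thresholds separately.

    The function and parameter names are illustrative assumptions, not
    necessarily the names used in the merged change.
    """
    # Compound inequality: equivalent to `0 <= x and x <= 1`.
    if not 0 <= pct_null_col_threshold <= 1:
        raise ValueError(
            "pct_null_col_threshold must be a float between 0 and 1, inclusive."
        )
    if not 0 <= pct_null_row_threshold <= 1:
        raise ValueError(
            "pct_null_row_threshold must be a float between 0 and 1, inclusive."
        )
    return pct_null_col_threshold, pct_null_row_threshold
```

With separate checks, a call such as `validate_null_thresholds(0.5, 1.2)` fails with a message that points at the row threshold rather than at both.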

@angela97lin angela97lin left a comment


Left a nitpicky comment but LGTM, thanks for this! 😁

evalml/data_checks/highly_null_data_check.py (outdated, resolved)
@chukarsten chukarsten left a comment


Awesome, this looks like a solid extension of HighlyNullDataCheck. I think we need to clean up the added test a little bit; it seems to carry over some copy pasta. We might want to consider a test_null_rows that tests the rows the same way as the columns: all null, some null, and all full, shifting the threshold around in each case.
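
A dependency-free sketch of what such a `test_null_rows` might look like. Everything here is a stand-in: `highly_null_rows` is a hypothetical helper rather than the data check itself, `None` stands for a null value, and the rule that a row is flagged when its null fraction is greater than or equal to the threshold is an assumption.

```python
def highly_null_rows(rows, pct_null_row_threshold):
    """Hypothetical stand-in: map row index -> null fraction for flagged rows.

    Assumes every row is non-empty and that a row is flagged when its
    null fraction is >= the threshold.
    """
    flagged = {}
    for index, row in enumerate(rows):
        null_fraction = sum(value is None for value in row) / len(row)
        if null_fraction >= pct_null_row_threshold:
            flagged[index] = null_fraction
    return flagged


ROWS = [
    [None, None, None, None],  # all null
    [1, None, 3, None],        # some null (half of the values)
    [1, 2, 3, 4],              # all full
]


def test_null_rows():
    # Threshold of 1.0: only the fully null row is flagged.
    assert highly_null_rows(ROWS, 1.0) == {0: 1.0}
    # Threshold between the half-null and fully null rows.
    assert highly_null_rows(ROWS, 0.75) == {0: 1.0}
    # Lowering the threshold to 0.5 also catches the half-null row.
    assert highly_null_rows(ROWS, 0.5) == {0: 1.0, 1: 0.5}
    # The fully populated row is never flagged by a positive threshold.
    assert 2 not in highly_null_rows(ROWS, 0.25)


test_null_rows()
```

The real test would presumably build a DataFrame and compare against the data check's warning messages; the shape of the cases (all null, some null, all full, with a shifting threshold) is the part this sketch is meant to show.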

  raise ValueError(
- "pct_null_threshold must be a float between 0 and 1, inclusive."
+ "pct null thresholds must be a float between 0 and 1, inclusive."

I think you can also do a compound inequality

@eccabay eccabay merged commit 66dcf8f into main Jul 30, 2021
@eccabay eccabay deleted the 2270_null_vols_rows_thresh branch July 30, 2021 18:20
@chukarsten chukarsten mentioned this pull request Aug 3, 2021
Successfully merging this pull request may close these issues.

Highly null data check: separate thresholds for pct null cols vs rows
4 participants