Include recommended actions for each data check's set of actions #1968

Merged
merged 15 commits into from Mar 16, 2021

Conversation

angela97lin
@angela97lin angela97lin commented Mar 11, 2021

First half of #1881.

Addresses:

  • DataChecks
  • UniquenessDataCheck
  • SparsityDataCheck
  • TargetLeakageDataCheck
  • HighlyNullDataCheck
  • IDColumnsDataCheck
  • NoVarianceDataCheck

These data checks are relatively simple: the recommended action for each is just to drop the problem column.

Currently, different data checks can produce the same data check action (for example, dropping a specific column). Should the DataChecks class handle this and remove duplicates?

In this PR I answer "yes", because I imagine this will be useful when creating components: we don't want a second DropColumns component to error out when it can't find the column. Open to thoughts and concerns, though! Note that this doesn't affect different transformations on the same column (e.g. impute, then drop).
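
To make the deduplication concrete, here is a minimal sketch of the merging step, not the actual implementation in this PR. It assumes each check returns a plain dict with "warnings", "errors", and "actions" lists, and that each action is a dict with a "code" and "metadata". The helper name `merge_check_results` and the "IMPUTE_COL" code are hypothetical and used only for illustration.

```python
def merge_check_results(messages, messages_new):
    """Fold one check's results into the running totals, skipping repeated actions."""
    messages["warnings"].extend(messages_new["warnings"])
    messages["errors"].extend(messages_new["errors"])
    for new_action in messages_new["actions"]:
        # Actions are plain dicts here, so `in` compares both code and metadata.
        if new_action not in messages["actions"]:
            messages["actions"].append(new_action)
    return messages


# Assumed action shape; "IMPUTE_COL" is a made-up code for this example.
drop_id = {"code": "DROP_COL", "metadata": {"column": "id"}}
impute_id = {"code": "IMPUTE_COL", "metadata": {"column": "id"}}

combined = {"warnings": [], "errors": [], "actions": []}
merge_check_results(
    combined,
    {"warnings": ["'id' looks like an ID column"], "errors": [], "actions": [drop_id]},
)
merge_check_results(
    combined,
    {"warnings": ["'id' is too unique"], "errors": [], "actions": [impute_id, dict(drop_id)]},
)
```

In this sketch both warnings are kept, the repeated drop action is kept only once, and the different action on the same column is left alone, which matches the behavior described above.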

@angela97lin angela97lin self-assigned this Mar 11, 2021
@angela97lin angela97lin changed the title Include recommended actions for each data check's set of actions#1881 Include recommended actions for each data check's set of actions Mar 11, 2021
codecov bot commented Mar 15, 2021

Codecov Report

Merging #1968 (73fa23c) into main (9576d5d) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1968     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         273      273             
  Lines       22356    22381     +25     
=========================================
+ Hits        22350    22375     +25     
  Misses          6        6             
Impacted Files Coverage Δ
...ts/data_checks_tests/test_id_columns_data_check.py 100.0% <ø> (ø)
...ests/data_checks_tests/test_sparsity_data_check.py 100.0% <ø> (ø)
...ts/data_checks_tests/test_uniqueness_data_check.py 100.0% <ø> (ø)
evalml/data_checks/data_checks.py 100.0% <100.0%> (ø)
evalml/data_checks/highly_null_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/no_variance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/sparsity_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/target_leakage_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/uniqueness_data_check.py 100.0% <100.0%> (ø)
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


# test 2D list
assert highly_null_check.validate([[None, None, None, None, 0], [None, None, None, "hi", 5]]) == {
Contributor Author

Just cleaning up. We were using the same expected value for each of these cases.

Contributor

nice!

@bchen1116 bchen1116 left a comment

Looks good to me! I like that the data checks class handles removing duplicate column names that we want to drop.



new_actions = messages_new["actions"]
for new_action in new_actions:
    if new_action not in messages["actions"]:
Contributor

Good catch!

@chukarsten chukarsten left a comment

This looks solid as is. If you want to add the explicit test to show the removal of anticipated redundant columns, that would be cool. But you probably have better things to do lol

@angela97lin

@chukarsten There's no better way to spend time than to write more unit tests 😂 Thanks for the suggestion, added!
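
For readers who want a picture of what such a test might look like, here is a rough sketch rather than the test actually added in this PR. `DropColumnCheck` and `run_checks` are hypothetical stand-ins for the real data check classes and the DataChecks runner, and the result and action dict shapes are assumptions.

```python
class DropColumnCheck:
    """Hypothetical stand-in for a data check that always flags one column."""

    def __init__(self, column, reason):
        self.column = column
        self.reason = reason

    def validate(self, X, y=None):
        return {
            "warnings": [f"{self.column}: {self.reason}"],
            "errors": [],
            "actions": [{"code": "DROP_COL", "metadata": {"column": self.column}}],
        }


def run_checks(checks, X, y=None):
    """Run every check and collect results, keeping each distinct action once."""
    messages = {"warnings": [], "errors": [], "actions": []}
    for check in checks:
        messages_new = check.validate(X, y)
        messages["warnings"].extend(messages_new["warnings"])
        messages["errors"].extend(messages_new["errors"])
        for new_action in messages_new["actions"]:
            if new_action not in messages["actions"]:
                messages["actions"].append(new_action)
    return messages


def test_duplicate_drop_actions_are_removed():
    checks = [
        DropColumnCheck("id", "looks like an ID column"),
        DropColumnCheck("id", "has no variance"),
        DropColumnCheck("notes", "mostly null"),
    ]
    result = run_checks(checks, X=None)
    # Every check still reports its own warning...
    assert len(result["warnings"]) == 3
    # ...but the two identical "drop id" actions collapse into one.
    assert result["actions"] == [
        {"code": "DROP_COL", "metadata": {"column": "id"}},
        {"code": "DROP_COL", "metadata": {"column": "notes"}},
    ]


test_duplicate_drop_actions_are_removed()
```

The point of a test like this is that the warnings stay per-check while the action list is what a downstream component would consume, so only the actions need to be unique.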

@angela97lin angela97lin merged commit 23ced7f into main Mar 16, 2021
@angela97lin angela97lin deleted the 1881_fill_in_actions branch March 16, 2021 19:06
@dsherry dsherry mentioned this pull request Mar 24, 2021