Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update data check actions API to return options instead of actions and add functionality to suggest and take action on columns with null values #3182

Merged
merged 29 commits into from
Jan 25, 2022

Conversation

angela97lin
Copy link
Contributor

@angela97lin angela97lin commented Jan 5, 2022

Closes #3116, closes #3013, closes #3144, closes #3243, closes #3240

Everything in this branch should have been previously reviewed; I just didn't want to continuously break and fix any code that was using the intermediate API changes as I broke down issues into smaller (and hopefully easier to review) PRs.

* init

* init

* start updating tests

* add validation code for option

* remove data check updates for now

* start to clean up tests and add validate_parameter tests

* revert highly null dc

* add in more valueerror test checking

* fix test and logic for column parameters

* init

* update some tests to new API

* fix more tests

* fix doc and sparsity test

* fix integration tests

* fix doctests

* fix doctests

* fix ts splitting test and impl

* fix more ts splitting tests

* fix doctest for ts data check

* release notes

* update target leakage data check docstring for consistency

* doctest linting and cleanup

* linting
@codecov
Copy link

codecov bot commented Jan 5, 2022

Codecov Report

Merging #3182 (81288de) into main (2cb356d) will increase coverage by 0.1%.
The diff coverage is 99.9%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #3182     +/-   ##
=======================================
+ Coverage   99.8%   99.8%   +0.1%     
=======================================
  Files        326     326             
  Lines      31563   31698    +135     
=======================================
+ Hits       31472   31607    +135     
  Misses        91      91             
Impacted Files Coverage Δ
evalml/data_checks/data_check.py 100.0% <ø> (ø)
...tests/data_checks_tests/test_data_check_message.py 100.0% <ø> (ø)
..._tests/test_data_checks_and_actions_integration.py 99.0% <98.0%> (-1.0%) ⬇️
evalml/automl/automl_search.py 99.9% <100.0%> (+0.1%) ⬆️
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/class_imbalance_data_check.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_action_option.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/data_checks.py 100.0% <100.0%> (ø)
... and 39 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2cb356d...81288de. Read the comment docs.

…taCheckAction` (#3152)

* init

* init

* start updating tests

* add validation code for option

* remove data check updates for now

* start to clean up tests and add validate_parameter tests

* revert highly null dc

* add in more valueerror test checking

* fix test and logic for column parameters

* init

* update some tests to new API

* fix more tests

* fix doc and sparsity test

* fix integration tests

* fix doctests

* init, update no variance

* update highly null dc

* fix id column dc

* update target leakage dc

* update sparsity dc

* outliers and uniqueness dc

* update target dis dc

* update invalid target data check

* fix doctests

* fix ts splitting test and impl

* fix dc validate and tests

* fix more ts splitting tests

* fix data check actions notebook

* update to remove columns to drop

* freeze

* remove rows to drop

* update dc tests

* update doctests

* fix integration tests

* revert requirements

* update parameters to set to empty dict

* fix tests

* fix doctests

* retrigger

* revert parameter for data check option

* fix data check option test

* fix more tests from updating default parameters

* release notes

* use empty instead of none

* release notes

* cleanup unnecessary code

* move logic to data check option class and rename

* update integration tests

* add tests and some cleanup

* fix tests

* fix pipeline util test

* remove unnecessary conditional

* fix naming, need to fix tests

* fix tests

* fix doctest

* add new enums

* add logic for enums

* update files to use enum

* add in testing for invalid enum

* fix doctest by updating to_dict impl

* linting

* try with different base

* oops revert yaml

* fix tests

* remove outdated code

* add more tests
@angela97lin angela97lin changed the title Update data check actions API Update data check actions API to return options instead of actions and add functionality to suggest and take action on columns with null values Jan 12, 2022
… to return impute action for non-highly null columns. (#3197)

* init

* init

* start updating tests

* add validation code for option

* remove data check updates for now

* start to clean up tests and add validate_parameter tests

* revert highly null dc

* add in more valueerror test checking

* fix test and logic for column parameters

* init

* update some tests to new API

* fix more tests

* fix doc and sparsity test

* fix integration tests

* fix doctests

* init, update no variance

* update highly null dc

* fix id column dc

* update target leakage dc

* update sparsity dc

* outliers and uniqueness dc

* update target dis dc

* update invalid target data check

* fix doctests

* fix ts splitting test and impl

* fix dc validate and tests

* fix more ts splitting tests

* fix data check actions notebook

* update to remove columns to drop

* freeze

* remove rows to drop

* update dc tests

* update doctests

* fix integration tests

* revert requirements

* update parameters to set to empty dict

* fix tests

* fix doctests

* retrigger

* revert parameter for data check option

* fix data check option test

* fix more tests from updating default parameters

* release notes

* use empty instead of none

* release notes

* cleanup unnecessary code

* move logic to data check option class and rename

* init rename

* retrigger

* revert release nots

* add new logic for detecting null cols

* update integration tests

* add tests and some cleanup

* fix tests

* fix pipeline util test

* remove unnecessary conditional

* fix test and iteration for null data check

* fix naming, need to fix tests

* fix tests

* fix doctest

* add new enums

* add logic for enums

* update files to use enum

* add in testing for invalid enum

* fix doctest by updating to_dict impl

* linting

* try with different base

* oops revert yaml

* fix tests

* remove outdated code

* oops fix merge

* add more tests

* fix null data check tests

* update to use per column strategy and fix tests

* fix tests for data checks

* fix tests and doctests

* release notes

* fix release notes

* oops update action code

* oops fix test

* update wording of messages

* move logic out of dict

* update mode to most_frequent for impute strategies

* oops fix linting and doctests
* init

* more cleanup

* begin to clean up tests

* fix more tests

* updating naming to action_options and fixing more tests

* fix no variance

* fix another test

* fixing automl tests

* oops actually fix automl tests

* fix the other automl tests

* fixing notebook

* fix data check tests

* fix doctests and docs

* integration tests and cleanup

* cleanup based on comments

* linting

* clean up notebook and tests
…3237)

* init

* init

* start updating tests

* add validation code for option

* remove data check updates for now

* start to clean up tests and add validate_parameter tests

* revert highly null dc

* add in more valueerror test checking

* fix test and logic for column parameters

* init

* update some tests to new API

* fix more tests

* fix doc and sparsity test

* fix integration tests

* fix doctests

* init, update no variance

* update highly null dc

* fix id column dc

* update target leakage dc

* update sparsity dc

* outliers and uniqueness dc

* update target dis dc

* update invalid target data check

* fix doctests

* fix ts splitting test and impl

* fix dc validate and tests

* fix more ts splitting tests

* fix data check actions notebook

* update to remove columns to drop

* freeze

* remove rows to drop

* update dc tests

* update doctests

* fix integration tests

* revert requirements

* update parameters to set to empty dict

* fix tests

* fix doctests

* retrigger

* revert parameter for data check option

* fix data check option test

* fix more tests from updating default parameters

* release notes

* use empty instead of none

* release notes

* cleanup unnecessary code

* move logic to data check option class and rename

* init rename

* retrigger

* revert release nots

* add new logic for detecting null cols

* update integration tests

* add tests and some cleanup

* fix tests

* fix pipeline util test

* remove unnecessary conditional

* fix test and iteration for null data check

* fix naming, need to fix tests

* fix tests

* fix doctest

* add new enums

* add logic for enums

* update files to use enum

* add in testing for invalid enum

* fix doctest by updating to_dict impl

* linting

* try with different base

* oops revert yaml

* fix tests

* remove outdated code

* oops fix merge

* add more tests

* fix null data check tests

* update to use per column strategy and fix tests

* fix tests for data checks

* fix tests and doctests

* release notes

* fix release notes

* init

* init and fix testing

* fix integration test

* updates test

* clean up docs

* lint notebook

* fix tests with types

* linting

* release notes

* update wording

* update impl for natural language and datetimes, remove old tests

* fix tests

* fix doctest
Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, Angela! I'm sorry I wasn't able to post some of the reviews I did of your constituent PRs, I had a few pending submission when they got merged. But they were all approvals anyways. This looks good to me and is great work. Way to really stick it through...this one must have been exhausting.

Comment on lines +407 to +430
parameters={
"impute_strategies": {
"parameter_type": "column",
"columns": {
"some_column": {
"impute_strategy": {
"categories": ["mean", "most_frequent"],
"type": "category",
"default_value": "most_frequent",
},
"fill_value": {"type": "float", "default_value": 0.0},
},
"some_other_column": {
"impute_strategy": {
"categories": ["mean", "most_frequent"],
"type": "category",
"default_value": "mean",
},
"fill_value": {"type": "float", "default_value": 1.0},
},
},
}
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes!

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angela97lin This looks good to me! Thanks for all the work in this refactor.

I skimmed this PR as I reviewed all the sub PRs. Only one thing stood out to me.

X, pct_null_col_threshold=self.pct_null_col_threshold
)

X_to_check_for_any_null = X.ww.select(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can improve this implementation. Right now we do two scans of the data to determine the highly null columns and the columns with any nulls. Doing so also involves creating a new copy of the data without NaturalLanguage/DateTime.

I think if we refactor get_null_column_information to return {col_name: % nulls} we only have to do one scan without creating another copy to determine which are highly null and which are null.

Let's file an issue to do that in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, thank you for pointing out that nuance! I filed #3273 to track this :)

Copy link
Contributor

@ParthivNaresh ParthivNaresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work on this, I can't wait for it to be merged. And great coordination on splitting it up into manageable PRs first!

Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic! Just left a few nits

docs/source/release_notes.rst Outdated Show resolved Hide resolved
@@ -38,7 +48,6 @@
* Changed the default objective to ``MedianAE`` from ``R2`` for time series regression :pr:`3205`
* Removed all-nan Unknown to Double logical conversion in ``infer_feature_types`` :pr:`3196`
* Checking the validity of holdout data for time series problems can be performed by calling ``pipelines.utils.validate_holdout_datasets`` prior to calling ``predict`` :pr:`3208`
* Uncapped numba version and removed it from requirements :pr:`3263`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have moved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay Haha yes! I put it in the wrong place before and am sneaking this in here :)

docs/source/release_notes.rst Outdated Show resolved Hide resolved
Comment on lines +99 to 100


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another extra space?

impute_strategies = parameters["impute_strategies"]
components.append(
PerColumnImputer(
impute_strategies=impute_strategies, impute_all=False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the merge of #3267, we don't need the impute_all parameter any more!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true! Thank you @eccabay 🙏

@angela97lin angela97lin merged commit 9e36684 into main Jan 25, 2022
@angela97lin angela97lin deleted the update_data_check_action_API branch January 25, 2022 17:50
@angela97lin angela97lin mentioned this pull request Jan 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants