Detect highly null columns #121

Merged
angela97lin merged 22 commits into master from gr_highly_null on Oct 18, 2019

Conversation

angela97lin (Contributor) commented on Oct 10, 2019

Adds a preprocessing function to remove highly null columns and incorporates it into the Auto(*) classes.

Fixes #116
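
For readers following along, here is a minimal sketch of the kind of check described above, assuming the data is a pandas DataFrame. The function name `find_highly_null_columns`, the `threshold` parameter, and its 0.95 default are hypothetical and are not taken from the evalml code in this PR.

```python
import numpy as np
import pandas as pd


def find_highly_null_columns(df: pd.DataFrame, threshold: float = 0.95) -> dict:
    """Map each column whose null fraction is >= threshold to that fraction.

    Hypothetical helper for illustration; not the evalml implementation.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # isnull() gives a boolean frame; the column-wise mean is the null fraction.
    null_fraction = df.isnull().mean()
    flagged = null_fraction[null_fraction >= threshold]
    return flagged.to_dict()


df = pd.DataFrame(
    {
        "all_null": [np.nan] * 5,
        "mostly_null": [np.nan, np.nan, np.nan, np.nan, 1.0],
        "no_null": [1, 2, 3, 4, 5],
    }
)

# Detect first, then (optionally) drop the flagged columns.
highly_null = find_highly_null_columns(df, threshold=0.8)
reduced = df.drop(columns=list(highly_null))
```

Returning the flagged columns with their null fractions, rather than dropping them inside the helper, leaves the decision about what to do with them to the caller.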

angela97lin changed the title from "Guardrail: remove highly null columns" to "Remove highly null columns" on Oct 10, 2019

codecov bot commented on Oct 10, 2019

Codecov Report

Merging #121 into master will increase coverage by 0.05%.
The diff coverage is 95.23%.

@@            Coverage Diff             @@
##           master     #121      +/-   ##
==========================================
+ Coverage   94.08%   94.13%   +0.05%     
==========================================
  Files          55       58       +3     
  Lines        1436     1467      +31     
==========================================
+ Hits         1351     1381      +30     
- Misses         85       86       +1

Impacted Files                                           Coverage Δ
evalml/utils/__init__.py                                 100% <ø> (ø) ⬆️
evalml/models/auto_regressor.py                          90.9% <ø> (ø) ⬆️
evalml/pipelines/__init__.py                             100% <ø> (ø) ⬆️
evalml/preprocessing/utils.py                            88.88% <ø> (+0.25%) ⬆️
evalml/models/auto_classifier.py                         100% <ø> (ø) ⬆️
...l/tests/guardrail_tests/test_detect_highly_null.py    100% <100%> (ø)
...tests/guardrail_tests/test_detect_label_leakage.py    100% <100%> (ø)
evalml/guardrails/__init__.py                            100% <100%> (ø)
evalml/__init__.py                                       100% <100%> (ø) ⬆️
evalml/models/auto_base.py                               92.16% <85.71%> (-0.29%) ⬇️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2e9c7e3...03f7aec.
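
The headline percentages in the report can be reproduced from the Hits and Lines rows of the coverage diff. The sketch below is only a sanity check on the arithmetic. The helper name is hypothetical, and it assumes the percentages are truncated to two decimals rather than rounded, which is inferred from the numbers in the report and not from Codecov documentation.

```python
def coverage_basis_points(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated (assumed, not rounded)."""
    # Integer arithmetic avoids floating-point noise in the last digit.
    return hits * 10000 // lines


master = coverage_basis_points(hits=1351, lines=1436)
pr = coverage_basis_points(hits=1381, lines=1467)
delta = pr - master

print(f"master: {master / 100:.2f}%")
print(f"#121:   {pr / 100:.2f}%")
print(f"change: {delta / 100:+.2f}%")
```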

angela97lin requested a review from kmax12 on Oct 10, 2019

kmax12 (Contributor) commented on Oct 10, 2019

Quick comment: if we drop columns on the training data, we need to drop the same columns on the test data, even if they aren't highly null there. Would it be better to make this a component in the pipeline with fit/predict?

angela97lin (Contributor, Author) commented on Oct 10, 2019

Hmmm, since the columns are dropped before the data is passed to _do_iteration(...), where the split into training and testing data happens, we don't have to manually drop the columns in the test data too, right? Still, do you think it'd make more sense to do this as part of the pipeline restructuring?

kmax12 (Contributor) commented on Oct 10, 2019

I agree that works during testing, but what about when we call the fitted pipeline on new data in the future? We'd need to make sure to drop the same columns.

kmax12 (Contributor) commented on Oct 10, 2019

And actually, as I think about it, we don't want to drop columns based on the test data. That would be creating label leakage. We should only drop columns based on what is in the training data split, i.e., "learn" which columns to drop on the train data, then reuse that learned selection on test data and in future usage.
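
To make the "learn on train, reuse everywhere" idea concrete, here is a rough sketch of what such a component could look like. It assumes pandas and a scikit-learn-style fit/transform interface; the class name `HighlyNullColumnDropper`, its `threshold` parameter, and the `columns_to_drop_` attribute are all hypothetical and do not come from the pipeline_v2 work mentioned in this thread.

```python
import numpy as np
import pandas as pd


class HighlyNullColumnDropper:
    """Hypothetical component: learn columns to drop in fit, reuse in transform."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.columns_to_drop_ = None  # set by fit

    def fit(self, X: pd.DataFrame) -> "HighlyNullColumnDropper":
        # Only the data passed to fit (the training split) decides what is dropped.
        null_fraction = X.isnull().mean()
        self.columns_to_drop_ = list(
            null_fraction[null_fraction >= self.threshold].index
        )
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        if self.columns_to_drop_ is None:
            raise RuntimeError("call fit before transform")
        # errors="ignore" tolerates new data that lacks a learned column.
        return X.drop(columns=self.columns_to_drop_, errors="ignore")


X_train = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [np.nan, np.nan, np.nan, np.nan],  # highly null in train
        "c": [1.0, np.nan, 3.0, 4.0],
    }
)
X_test = pd.DataFrame(
    {
        "a": [5.0, 6.0],
        "b": [7.0, 8.0],        # not null in test, still dropped
        "c": [np.nan, np.nan],  # all null in test, still kept
    }
)

dropper = HighlyNullColumnDropper(threshold=0.75).fit(X_train)
train_out = dropper.transform(X_train)
test_out = dropper.transform(X_test)
```

In this sketch the test split never influences which columns are removed: "b" is dropped from both frames because of the training data, and "c" survives in both even though it is entirely null in the test frame.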

angela97lin (Contributor, Author) commented on Oct 10, 2019

Ah, I think I see what you mean. In that case, should I connect (rebase) this with the pipeline components PR we’re working on for pipeline_v2 to create an actual Component and then merge it when we merge in the new pipeline code?

kmax12 (Contributor) commented on Oct 10, 2019

ya, I think that would make sense

kmax12 self-assigned this on Oct 14, 2019
Resolved review thread on evalml/models/auto_base.py (outdated)
Resolved review thread on evalml/models/auto_base.py (outdated)
Resolved review thread on evalml/models/auto_classifier.py (outdated)
Resolved review thread on evalml/preprocessing/utils.py (outdated)
Resolved review thread on evalml/tests/preprocessing_tests/test_drop_null.py (outdated)

kmax12 (Contributor) commented on Oct 15, 2019

This PR should probably be retitled "Detect highly null columns".

Resolved review thread on evalml/preprocessing/utils.py (outdated)
angela97lin changed the title from "Remove highly null columns" to "Detect highly null columns" on Oct 15, 2019
angela97lin requested a review from kmax12 on Oct 17, 2019
Resolved review thread on evalml/guardrails/utils.py (outdated)
Resolved review thread on evalml/guardrails/utils.py (outdated)
Resolved review thread on evalml/models/auto_regressor.py

kmax12 (Contributor) previously approved these changes on Oct 17, 2019, with the comment:

LGTM

kmax12 previously approved these changes on Oct 17, 2019
angela97lin requested a review from kmax12 on Oct 18, 2019

kmax12 (Contributor) approved these changes on Oct 18, 2019, with the comment:

LGTM

angela97lin merged commit 9c0470b into master on Oct 18, 2019
angela97lin deleted the gr_highly_null branch on Oct 18, 2019
angela97lin mentioned this pull request on Oct 29, 2019
Successfully merging this pull request may close these issues.

Guardrail: detect highly null columns
2 participants