Detect highly null columns #121

angela97lin · 2019-10-10T19:00:47Z

Adds a preprocessing function to remove highly null columns and incorporates it into the Auto(*) classes.

Fixes #116
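
For readers unfamiliar with the idea, here is a minimal sketch of what detecting highly null columns could look like with pandas. The function name `highly_null_columns`, its signature, and the 0.95 default threshold are assumptions made for illustration; this is not the code from the PR.

```python
import numpy as np
import pandas as pd


def highly_null_columns(X, threshold=0.95):
    """Return {column: null fraction} for columns whose null fraction >= threshold.

    Hypothetical helper for illustration only; the name, signature, and
    default threshold are assumptions, not the API added in this PR.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    # Accept non-pandas input (lists, numpy arrays) by wrapping it.
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)
    # isnull() gives a boolean frame; the column-wise mean is the null fraction.
    null_fraction = X.isnull().mean()
    flagged = null_fraction[null_fraction >= threshold]
    return flagged.to_dict()


example = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan, 1.0],
})
flagged_example = highly_null_columns(example, threshold=0.75)
```

Returning the null fraction alongside each column name lets a caller log or warn about the columns instead of silently dropping them.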

codecov · 2019-10-10T19:15:35Z

Codecov Report

Merging #121 into master will increase coverage by 0.05%.
The diff coverage is 95.23%.

@@            Coverage Diff             @@
##           master     #121      +/-   ##
==========================================
+ Coverage   94.08%   94.13%   +0.05%     
==========================================
  Files          55       58       +3     
  Lines        1436     1467      +31     
==========================================
+ Hits         1351     1381      +30     
- Misses         85       86       +1

Impacted Files	Coverage Δ
evalml/utils/__init__.py	`100% <ø> (ø)`	⬆️
evalml/models/auto_regressor.py	`90.9% <ø> (ø)`	⬆️
evalml/pipelines/__init__.py	`100% <ø> (ø)`	⬆️
evalml/preprocessing/utils.py	`88.88% <ø> (+0.25%)`	⬆️
evalml/models/auto_classifier.py	`100% <ø> (ø)`	⬆️
...l/tests/guardrail_tests/test_detect_highly_null.py	`100% <100%> (ø)`
...tests/guardrail_tests/test_detect_label_leakage.py	`100% <100%> (ø)`
evalml/guardrails/__init__.py	`100% <100%> (ø)`
evalml/__init__.py	`100% <100%> (ø)`	⬆️
evalml/models/auto_base.py	`92.16% <85.71%> (-0.29%)`	⬇️
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2e9c7e3...03f7aec.

kmax12 · 2019-10-10T19:21:52Z

Quick comment: if we drop columns on the training data, we need to drop the same columns on the test data, even if they aren't highly null there. Would it be better to make this a component in the pipeline with fit/predict?

angela97lin · 2019-10-10T20:06:49Z

Hmmm, since the columns are dropped before the data is passed to _do_iteration(...), where the split into training and testing data happens, we don't have to manually drop the columns in the test data too? Still, do you think it'd make more sense with the restructuring of pipelines?

kmax12 · 2019-10-10T20:43:48Z

I agree that works during testing, but what about when we call the fitted pipeline on new data in the future? We'd need to make sure to drop the same columns.

kmax12 · 2019-10-10T20:45:01Z

And actually, as I think about it, we don't want to drop columns based on the test data; that would create label leakage. We should only drop columns based on what is in the training data split, i.e. "learn" which columns to drop on the train data, then reuse that learned selection on the test data and in future usage.
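
The "learn on train, reuse on test" idea above can be sketched as a small component. This is a sketch under stated assumptions: the class name `DropHighlyNullColumns`, the `threshold` parameter, and the scikit-learn-style `fit`/`transform` method names are all hypothetical and are not taken from this PR or from the pipeline_v2 work.

```python
import numpy as np
import pandas as pd


class DropHighlyNullColumns:
    """Hypothetical component: learn which columns to drop in fit(),
    then drop exactly those columns in transform()."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.columns_to_drop_ = None  # set by fit()

    def fit(self, X, y=None):
        # Only the training split decides which columns count as highly null.
        null_fraction = X.isnull().mean()
        self.columns_to_drop_ = list(
            null_fraction[null_fraction >= self.threshold].index
        )
        return self

    def transform(self, X):
        if self.columns_to_drop_ is None:
            raise RuntimeError("call fit() before transform()")
        # Reuse the learned selection; never re-inspect nulls in X here.
        return X.drop(columns=self.columns_to_drop_)


X_train = pd.DataFrame({"age": [31, 45, 27], "notes": [np.nan, np.nan, np.nan]})
# "notes" is not highly null in the test split, but it must still be dropped.
X_test = pd.DataFrame({"age": [52, 38], "notes": ["late", np.nan]})

dropper = DropHighlyNullColumns(threshold=0.95).fit(X_train)
X_train_out = dropper.transform(X_train)
X_test_out = dropper.transform(X_test)
```

Because `transform` only consults the stored column list, the test split and any future data pass through the same selection without influencing it.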

angela97lin · 2019-10-10T21:14:37Z

Ah, I think I see what you mean. In that case, should I connect (rebase) this with the pipeline components PR we’re working on for pipeline_v2 to create an actual Component and then merge it when we merge in the new pipeline code?

kmax12 · 2019-10-10T21:21:35Z

ya, I think that would make sense

evalml/models/auto_base.py

evalml/models/auto_classifier.py

evalml/preprocessing/utils.py

evalml/tests/preprocessing_tests/test_drop_null.py

kmax12 · 2019-10-15T16:43:58Z

This PR's title should probably change to "Detect highly null columns".

evalml/preprocessing/utils.py

evalml/guardrails/utils.py

evalml/models/auto_regressor.py

kmax12

LGTM

kmax12

LGTM

angela97lin added 2 commits October 10, 2019 14:57

adding drop_null functionality to auto classes

0440c8a

adding drop_null fxn

6944a1b

angela97lin changed the title ~~Guardrail: remove highly null columns~~ Remove highly null columns Oct 10, 2019

Merge branch 'master' into gr_highly_null

a4b6e7a

angela97lin requested a review from kmax12 October 10, 2019 19:16

angela97lin added 5 commits October 11, 2019 13:17

updating drop --> detect

db03e4b

Merge branch 'master' into gr_highly_null

01225f0

fixing merge

6723897

Merge branch 'master' into gr_highly_null

9cd6134

Merge branch 'master' into gr_highly_null

7291af0

kmax12 self-assigned this Oct 14, 2019

kmax12 suggested changes Oct 15, 2019

kmax12 reviewed Oct 15, 2019

evalml/preprocessing/utils.py (outdated, resolved)

angela97lin changed the title ~~Remove highly null columns~~ Detect highly null columns Oct 15, 2019

angela97lin added 5 commits October 15, 2019 16:12

addressing some PR comments

abf91f6

moved guardrails from preprocessing, updating api

1849c05

cleanup and moving tests

105e395

Merge branch 'master' into gr_highly_null

5e26a39

adding check before logging

239b7ad

angela97lin requested a review from kmax12 October 17, 2019 15:04

kmax12 suggested changes Oct 17, 2019

evalml/guardrails/utils.py (outdated, resolved)

evalml/guardrails/utils.py (outdated, resolved)

evalml/models/auto_regressor.py (resolved)

addressing comments

49e32c4

kmax12 previously approved these changes Oct 17, 2019

linting

b897c82

angela97lin dismissed kmax12’s stale review via b897c82 October 17, 2019 18:36

angela97lin and others added 4 commits October 17, 2019 15:00

linting

9fd66cf

linting

dfd51e8

Merge branch 'gr_highly_null' of github.com:FeatureLabs/evalml into gr_highly_null

0a6d393

Merge branch 'master' into gr_highly_null

c6815b6

kmax12 previously approved these changes Oct 17, 2019

angela97lin added 2 commits October 18, 2019 10:05

change to retrigger circleci

9d0f582

Merge branch 'gr_highly_null' of github.com:FeatureLabs/evalml into gr_highly_null

ddf822e

angela97lin dismissed kmax12’s stale review via ddf822e October 18, 2019 14:05

adding test for non pandas + rename

03f7aec

angela97lin requested a review from kmax12 October 18, 2019 18:18

kmax12 approved these changes Oct 18, 2019

angela97lin merged commit 9c0470b into master Oct 18, 2019

angela97lin deleted the gr_highly_null branch October 18, 2019 19:51

angela97lin mentioned this pull request Oct 29, 2019

v0.5.0 #163

Merged
