-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect highly null columns #121
Conversation
Codecov Report
@@ Coverage Diff @@
## master #121 +/- ##
==========================================
+ Coverage 94.08% 94.13% +0.05%
==========================================
Files 55 58 +3
Lines 1436 1467 +31
==========================================
+ Hits 1351 1381 +30
- Misses 85 86 +1
Continue to review full report at Codecov.
|
quick comment. if we drop columns on training data, we need to drop the same columns on the test data even if they aren't highly null. would it be better to create this a component in the pipeline with fit/predict? |
Hmmm, since the columns are dropped before it is passed to _do_iteration(...) where the split into training and testing data happens, we don't have to manually drop the columns in the test data too? Still, do you think it'd make more sense with the restructuring of pipelines? |
i agree that works during testing, but what about when we call the fitted pipeline on new data in the future. we'd need to make sure to drop the same columns |
and actually, as I think about it, we don't want to drop columns based on the test data. that would be creating label leakage. we should only drop columns based on what is in the training data split e.g "learn" what columns to drop on the train data, but reuse that learned selection on test data and future usage. |
Ah, I think I see what you mean. In that case, should I connect (rebase) this with the pipeline components PR we’re working on for pipeline_v2 to create an actual Component and then merge it when we merge in the new pipeline code? |
ya, I think that would make sense |
this PR should probably change to "Detect highly null columns" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Adding preprocessing function to remove highly null columns and incorporating into Auto(*) classes
Fixes #116