-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add create_if_missing
to SelectColumns
#2912
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2912 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 302 302
Lines 28593 28634 +41
=======================================
+ Hits 28500 28541 +41
Misses 93 93
Continue to review full report at Codecov.
|
append_all_known_values
to OHEcreate_if_missing
to SelectColumns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeremyliweishih Thank you for this solution! Before moving forward, I want to make sure I understand the problem and discuss some other possible solutions.
If I understand correctly, the feature selector will select features after all feature engineering happens. Sometimes this will include some/all of the features created by the one hot encoder. Moreover, the DefaultAlgorithm only stores one _selected_cols
list corresponding to the columns selected during the last fold of cv.
The problem is that the features created by the one hot encoder during the last fold are sometimes different from those created during the second and first fold.
I worry about determining which features to select from the OHE based only on the last fold of cv. Isn't this problem caused by us "overfitting" to the features created by the OHE on the last fold?
I have three points I want to discuss with you before moving forward:
- Instead of selecting a subset of the features created by the OHE, maybe the better (as in less prone to overfitting) thing to do is to select all features created by the OHE if any one of them is selected by the feature selector. I think we can accomplish this by changing the pipeline structure a bit.
- Can we store the
_selected_cols
for each fold of cv as opposed to only the last fold? This might require us changing the api of DefaultAlgortihm.add_result - What
create_if_missing
does is recategorize the categorical features based on the features selected during the last fold. So if we have a categorical column with all 50 states, but only FL, NY, and MA get selected in the last fold, then what we're saying is that there are now four categories: FL, NY, MA, and "all other". I wonder if we can do that re-categorization in a way that's more specific to the one hot encoder rather than doing it via SelectColumns. That way it doesn't impact pipelines without a OHE, e.g. Catboost. If I understand correctly, since all pipelines have a SelectColumns after the second batch, this PR will add columns of all zeros corresponding to the OHE-created features for Catboost pipelines even though catboost does not need a OHE. That seems wasteful and probably has a negative impact on performance? Maybe we can use thecategories
parameter of the OHE or write a custom component to do the recategorization?
Closing in favor of #2944. |
Fixes #2904, #2903.
The main issue was that because of cross validation, different columns would be created by the OHE but our feature selector (turned into a column selector) expects certain columns that doesn't exist. This PR fixes this by adding empty columns if the selected columns do not exist.