Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add create_if_missing to SelectColumns #2912

Closed
wants to merge 22 commits into from
Closed

Conversation

jeremyliweishih
Copy link
Collaborator

@jeremyliweishih jeremyliweishih commented Oct 14, 2021

Fixes #2904, #2903.

The main issue was that because of cross validation, different columns would be created by the OHE but our feature selector (turned into a column selector) expects certain columns that doesn't exist. This PR fixes this by adding empty columns if the selected columns do not exist.

@codecov
Copy link

codecov bot commented Oct 14, 2021

Codecov Report

Merging #2912 (625111f) into main (08601ea) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2912     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28593   28634     +41     
=======================================
+ Hits       28500   28541     +41     
  Misses        93      93             
Impacted Files Coverage Δ
evalml/automl/automl_algorithm/automl_algorithm.py 100.0% <ø> (ø)
...valml/automl/automl_algorithm/default_algorithm.py 100.0% <ø> (ø)
...elines/components/transformers/column_selectors.py 100.0% <100.0%> (ø)
...mponent_tests/test_column_selector_transformers.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08601ea...625111f. Read the comment docs.

@jeremyliweishih jeremyliweishih changed the title Add append_all_known_values to OHE Add create_if_missing to SelectColumns Oct 18, 2021
@jeremyliweishih jeremyliweishih marked this pull request as ready for review October 19, 2021 13:41
Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremyliweishih Thank you for this solution! Before moving forward, I want to make sure I understand the problem and discuss some other possible solutions.

If I understand correctly, the feature selector will select features after all feature engineering happens. Sometimes this will include some/all of the features created by the one hot encoder. Moreover, the DefaultAlgorithm only stores one _selected_cols list corresponding to the columns selected during the last fold of cv.

The problem is that the features created by the one hot encoder during the last fold are sometimes different from those created during the second and first fold.

I worry about determining which features to select from the OHE based only on the last fold of cv. Isn't this problem caused by us "overfitting" to the features created by the OHE on the last fold?

I have three points I want to discuss with you before moving forward:

  1. Instead of selecting a subset of the features created by the OHE, maybe the better (as in less prone to overfitting) thing to do is to select all features created by the OHE if any one of them is selected by the feature selector. I think we can accomplish this by changing the pipeline structure a bit.
  2. Can we store the _selected_cols for each fold of cv as opposed to only the last fold? This might require us changing the api of DefaultAlgortihm.add_result
  3. What create_if_missing does is recategorize the categorical features based on the features selected during the last fold. So if we have a categorical column with all 50 states, but only FL, NY, and MA get selected in the last fold, then what we're saying is that there are now four categories: FL, NY, MA, and "all other". I wonder if we can do that re-categorization in a way that's more specific to the one hot encoder rather than doing it via SelectColumns. That way it doesn't impact pipelines without a OHE, e.g. Catboost. If I understand correctly, since all pipelines have a SelectColumns after the second batch, this PR will add columns of all zeros corresponding to the OHE-created features for Catboost pipelines even though catboost does not need a OHE. That seems wasteful and probably has a negative impact on performance? Maybe we can use the categories parameter of the OHE or write a custom component to do the recategorization?

@jeremyliweishih
Copy link
Collaborator Author

Closing in favor of #2944.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DefaultAlgorithm errors out during cross validation with encoder and feature selection interaction
2 participants