Add `create_if_missing` to `SelectColumns` #2912

jeremyliweishih · 2021-10-14T18:41:16Z

The main issue was that, because of cross validation, the OHE would create different columns on different folds, but our feature selector (turned into a column selector) expects certain columns that don't exist. This PR fixes this by adding empty columns when the selected columns do not exist.
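
To illustrate the behavior described above, here is a minimal sketch that uses plain dicts of lists in place of a DataFrame so it stays self-contained. It is not the code from this PR: the function name `select_columns`, the `Table` alias, and the choice of zeros as the fill value are assumptions; only the `create_if_missing` name comes from the PR title.

```python
from typing import Dict, List

Table = Dict[str, List[float]]  # column name -> values, a stand-in for a DataFrame


def select_columns(X: Table, columns: List[str], create_if_missing: bool = False) -> Table:
    """Return only `columns` from X, optionally creating absent ones as all zeros."""
    n_rows = len(next(iter(X.values()))) if X else 0
    out: Table = {}
    for col in columns:
        if col in X:
            out[col] = list(X[col])
        elif create_if_missing:
            # Assumption: an "empty" column is filled with zeros, like an OHE
            # indicator for a category that never appears in this fold.
            out[col] = [0.0] * n_rows
        else:
            raise KeyError(f"Column {col!r} not found in input data")
    return out


# Hypothetical fold in which the encoder never saw "MA", so "state_MA" is absent.
fold = {"state_FL": [1.0, 0.0], "state_NY": [0.0, 1.0], "age": [31.0, 45.0]}
selected = select_columns(fold, ["state_FL", "state_MA"], create_if_missing=True)
```

Without the flag, the same call raises a `KeyError`, which corresponds to the failure mode described above.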

codecov · 2021-10-14T18:52:29Z

Codecov Report

Merging #2912 (625111f) into main (08601ea) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #2912     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        302     302             
  Lines      28593   28634     +41     
=======================================
+ Hits       28500   28541     +41     
  Misses        93      93

Impacted Files	Coverage Δ
evalml/automl/automl_algorithm/automl_algorithm.py	`100.0% <ø> (ø)`
...valml/automl/automl_algorithm/default_algorithm.py	`100.0% <ø> (ø)`
...elines/components/transformers/column_selectors.py	`100.0% <100.0%> (ø)`
...mponent_tests/test_column_selector_transformers.py	`100.0% <100.0%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 08601ea...625111f. Read the comment docs.

freddyaboulton

@jeremyliweishih Thank you for this solution! Before moving forward, I want to make sure I understand the problem and discuss some other possible solutions.

If I understand correctly, the feature selector will select features after all feature engineering happens. Sometimes this will include some/all of the features created by the one hot encoder. Moreover, the DefaultAlgorithm only stores one _selected_cols list corresponding to the columns selected during the last fold of cv.

The problem is that the features created by the one hot encoder during the last fold are sometimes different from those created during the first and second folds.
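
A small sketch of how this mismatch can arise, assuming the encoder names each indicator column `<feature>_<category>` and creates one column per category it sees in the training fold. The helper `one_hot_columns` and the fold data are hypothetical and only illustrate the mechanism.

```python
from typing import List, Set


def one_hot_columns(values: List[str], prefix: str) -> Set[str]:
    """Names of the indicator columns an encoder fit on `values` would create."""
    return {f"{prefix}_{v}" for v in values}


# Hypothetical training data for two CV folds of the same categorical feature.
first_fold = ["FL", "NY", "TX", "FL"]
last_fold = ["FL", "NY", "MA", "NY"]

cols_first = one_hot_columns(first_fold, "state")
cols_last = one_hot_columns(last_fold, "state")

# Columns that exist on the last fold but are never produced on the first fold.
missing_in_first = sorted(cols_last - cols_first)
```

If the selected columns are recorded only from the last fold, any name in `missing_in_first` is requested on a fold where it was never created.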

I worry about determining which features to select from the OHE based only on the last fold of cv. Isn't this problem caused by us "overfitting" to the features created by the OHE on the last fold?

I have three points I want to discuss with you before moving forward:

1. Instead of selecting a subset of the features created by the OHE, maybe the better (as in less prone to overfitting) thing to do is to select all features created by the OHE if any one of them is selected by the feature selector. I think we can accomplish this by changing the pipeline structure a bit.
2. Can we store the _selected_cols for each fold of cv as opposed to only the last fold? This might require changing the API of DefaultAlgorithm.add_result.
3. What create_if_missing does is recategorize the categorical features based on the features selected during the last fold. So if we have a categorical column with all 50 states, but only FL, NY, and MA get selected in the last fold, then what we're saying is that there are now four categories: FL, NY, MA, and "all other". I wonder if we can do that recategorization in a way that's more specific to the one hot encoder rather than doing it via SelectColumns. That way it doesn't impact pipelines without an OHE, e.g. CatBoost. If I understand correctly, since all pipelines have a SelectColumns after the second batch, this PR will add columns of all zeros corresponding to the OHE-created features for CatBoost pipelines even though CatBoost does not need an OHE. That seems wasteful and probably has a negative impact on performance? Maybe we can use the categories parameter of the OHE or write a custom component to do the recategorization?
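
The recategorization in the third point could look roughly like the sketch below. It uses plain Python lists rather than the evalml one hot encoder; `recategorize`, `one_hot`, and the `OTHER` label are hypothetical names, and encoding against a fixed category list is an assumption about how a `categories`-style parameter would be used.

```python
from typing import Dict, List

OTHER = "all other"  # hypothetical label for the collapsed category


def recategorize(values: List[str], kept: List[str]) -> List[str]:
    """Map every value not in `kept` to the single OTHER category."""
    kept_set = set(kept)
    return [v if v in kept_set else OTHER for v in values]


def one_hot(values: List[str], categories: List[str]) -> Dict[str, List[int]]:
    """Encode against a fixed category list so every fold yields the same columns."""
    return {c: [int(v == c) for v in values] for c in categories}


kept = ["FL", "NY", "MA"]            # categories selected on the last fold
states = ["FL", "TX", "MA", "CA"]    # hypothetical data from another fold
collapsed = recategorize(states, kept)
encoded = one_hot(collapsed, kept)
```

Because the category list is fixed inside the encoding step, every fold produces the same columns, rows from unselected states encode as all zeros, and pipelines without an encoder would not receive extra columns.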

jeremyliweishih · 2021-10-21T17:22:18Z

Closing in favor of #2944.

jeremyliweishih added 3 commits October 14, 2021 14:33

Add append all values flag

b8fe4df

Add flag to OHE in default algo

cf0a78d

Merge branch 'main' of github.com:alteryx/evalml into js_2904_ohe

435f31f

jeremyliweishih added 5 commits October 18, 2021 11:57

Move fix to column selector

8f9989b

Remove checks for testing

a903c8d

testing

f095732

Fix init

43d4eb6

Remove OHE changes

21357e3

jeremyliweishih changed the title ~~Add append_all_known_values to OHE~~ Add create_if_missing to SelectColumns Oct 18, 2021

jeremyliweishih added 7 commits October 18, 2021 17:41

add back datachecks'

f29dfd7

RL

a45e19b

Lint

01e52b2

Merge branch 'main' into js_2904_ohe

82c845e

Add back results

7f78b38

Merge branch 'js_2904_ohe' of github.com:alteryx/evalml into js_2904_ohe

8be60da

add None back

22e24e4

jeremyliweishih marked this pull request as ready for review October 19, 2021 13:41

auto-assign bot assigned jeremyliweishih Oct 19, 2021

jeremyliweishih added 7 commits October 19, 2021 11:32

Fix tests

10d6aec

Merge branch 'main' into js_2904_ohe

14c8f1f

Fix parameter passing

9e9c57b

Merge branch 'js_2904_ohe' of github.com:alteryx/evalml into js_2904_ohe

52f3dbc

Lint

509f48c

Merge branch 'main' of github.com:alteryx/evalml into js_2904_ohe

fb05f19

Add docstring

625111f

jeremyliweishih requested review from bchen1116, freddyaboulton, angela97lin and christopherbunn October 19, 2021 18:58

jeremyliweishih requested review from chukarsten, eccabay and ParthivNaresh October 19, 2021 18:58

freddyaboulton reviewed Oct 20, 2021

jeremyliweishih removed request for angela97lin, christopherbunn, bchen1116, ParthivNaresh, eccabay and chukarsten October 21, 2021 13:30

jeremyliweishih closed this Oct 21, 2021

jeremyliweishih mentioned this pull request Dec 9, 2021

Default Algorithm: fully split numeric and categorical columns in preprocessing split #3020

Closed

freddyaboulton deleted the js_2904_ohe branch May 13, 2022 15:35
