Change default parameters for feature selectors #3110

jeremyliweishih · 2021-12-01T21:03:00Z

This PR proposes changing the default parameters for RFRegressorSelectFromModel and RFClassifierSelectFromModel. The current behavior of these components is incorrect. The defaults are:

number_features=None, percent_features=0.5, and threshold=-np.inf

max_features is then calculated as:

max_features = (
    max(1, int(percent_features * number_features)) if number_features else None
)

Therefore, max_features = None.
With max_features == None and threshold=-np.inf, the component selects every feature with importance above -np.inf, which is every feature available.
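To make the reasoning above concrete, here is a minimal sketch in plain Python. It reuses the max_features expression quoted in this description; `select_features` and the importance values are hypothetical stand-ins for illustration, not the evalml or scikit-learn implementation, and the "keep if importance >= threshold" rule is an assumption about how the underlying selector behaves.

```python
import math


def compute_max_features(number_features, percent_features):
    """Same expression as quoted in the PR description."""
    return (
        max(1, int(percent_features * number_features)) if number_features else None
    )


def select_features(importances, threshold, max_features):
    """Hypothetical helper: keep features at or above the threshold,
    ranked by importance, then cap the result at max_features if set."""
    kept = [name for name, value in importances.items() if value >= threshold]
    kept.sort(key=lambda name: importances[name], reverse=True)
    return kept if max_features is None else kept[:max_features]


# Made-up importances for five features.
importances = {"a": 0.50, "b": 0.30, "c": 0.15, "d": 0.05, "e": 0.0}

# Old defaults: number_features=None, percent_features=0.5, threshold=-np.inf
max_features = compute_max_features(number_features=None, percent_features=0.5)
selected = select_features(importances, threshold=-math.inf, max_features=max_features)

print(max_features)  # None, so no cap is applied
print(selected)      # all five features survive
```

With number_features left as None, percent_features never takes effect, which is why the old defaults amount to no feature selection at all.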

Performance tests using default algorithm:
fs_parameters_tests.zip

jeremyliweishih · 2021-12-01T21:04:20Z

evalml/pipelines/components/transformers/feature_selection/rf_classifier_feature_selector.py

@@ -29,11 +29,11 @@ class RFClassifierSelectFromModel(FeatureSelector):
    name = "RF Classifier Select From Model"
    hyperparameter_ranges = {
        "percent_features": Real(0.01, 1),
-        "threshold": ["mean", -np.inf],
+        "threshold": ["mean"],


This PR proposes using the mean as the threshold, but experiments show that the median performs similarly well. The median will select roughly half of the features, while the number selected by the mean depends on the distribution of the importances. Happy to discuss which one to choose, but I chose mean in the... meantime.

On second thought, using median might be the move, as the mean of the feature importances is bound to be dragged down by low-signal features (which are inevitable). However, performance results show similar model quality between median and mean, with median having slightly longer fit times because it selects more features.
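As a rough illustration of how the two thresholds differ, the sketch below counts how many features pass each one. The importance values are made up (a few strong features and a long tail of weak ones), and the "keep if importance >= threshold" rule is an assumption about the selector, so treat this as a toy comparison rather than a reproduction of the perf tests.

```python
from statistics import mean, median

# Hypothetical, right-skewed feature importances that sum to 1.
importances = [0.40, 0.25, 0.15, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01, 0.0]


def count_selected(values, threshold):
    """Count features whose importance is at or above the threshold."""
    return sum(1 for value in values if value >= threshold)


mean_threshold = mean(importances)      # about 0.1
median_threshold = median(importances)  # about 0.04

kept_by_mean = count_selected(importances, mean_threshold)
kept_by_median = count_selected(importances, median_threshold)

print(f"mean keeps {kept_by_mean} of {len(importances)} features")
print(f"median keeps {kept_by_median} of {len(importances)} features")
```

On a distribution shaped like this one, the median keeps half of the features while the mean keeps fewer, which would be consistent with the slightly longer fit times reported for median.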

Why not allow both 'mean' and 'median' to be in the hyperparameter ranges? We can default to one, but allowing both in the ranges seems to be ideal for our automlsearch algo

I like that suggestion. Right now the DefaultAlgorithm only runs one feature selector, so I'm down to set median as the default. Yesterday we discussed letting the algorithm tune the parameters of the selector, in which case broadening the threshold search space (for example, parametrizing it as a quantile of the observed feature importance distribution) will be in play.

@bchen1116 @freddyaboulton

Sounds good, I'll add them both as hyperparameter ranges! My main concern is what Freddy brought up about the default parameter and having only one FS batch, but I'm still on the fence about choosing mean or median. Either way it'll be a quick fix, so I'm not too worried!

codecov · 2021-12-01T21:13:40Z

Codecov Report

Merging #3110 (11ceecc) into main (840fc3b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3110     +/-   ##
=======================================
+ Coverage   99.8%   99.8%   +0.1%     
=======================================
  Files        313     313             
  Lines      30579   30585      +6     
=======================================
+ Hits       30489   30495      +6     
  Misses        90      90

Impacted Files	Coverage Δ
...eature_selection/rf_classifier_feature_selector.py	`100.0% <ø> (ø)`
...feature_selection/rf_regressor_feature_selector.py	`100.0% <ø> (ø)`
evalml/tests/component_tests/test_components.py	`98.9% <100.0%> (+0.1%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

bchen1116

Looks great! So this change will allow the RF...FeatureSelector to choose the features whose importance falls above the mean importance weight?

Looking at the perf tests, the fit times and performance changes seem reasonable. In all three tests, though, LightGBM seems to drop very significantly in performance:

Do you know why this is? This seems to be something that we should figure out and resolve before moving ahead with the change, especially if it's a potential bug somewhere in the code.

bchen1116 · 2021-12-01T22:13:57Z

evalml/pipelines/components/transformers/feature_selection/rf_classifier_feature_selector.py

@@ -29,11 +29,11 @@ class RFClassifierSelectFromModel(FeatureSelector):
    name = "RF Classifier Select From Model"
    hyperparameter_ranges = {
        "percent_features": Real(0.01, 1),
-        "threshold": ["mean", -np.inf],
+        "threshold": ["mean"],


Why not allow both 'mean' and 'median' to be in the hyperparameter ranges? We can default to one, but allowing both in the ranges seems to be ideal for our automlsearch algo

bchen1116 · 2021-12-01T22:14:46Z

evalml/pipelines/components/transformers/feature_selection/rf_regressor_feature_selector.py

@@ -29,11 +28,11 @@ class RFRegressorSelectFromModel(FeatureSelector):
    name = "RF Regressor Select From Model"
    hyperparameter_ranges = {
        "percent_features": Real(0.01, 1),
-        "threshold": ["mean", -np.inf],
+        "threshold": ["mean"],


Same here: it seems ideal to allow median as a possible value in the hyperparameter ranges.

bchen1116 · 2021-12-01T22:15:16Z

evalml/tests/component_tests/test_components.py

+            elif isinstance(component, RFRegressorSelectFromModel):
+                assert transform_output.shape == (X.shape[0], 2)
+            elif isinstance(component, RFClassifierSelectFromModel):
+                assert transform_output.shape == (X.shape[0], 5)


Where do the values 2 and 5 come from here?

They are the number of columns selected by the FS component using the default parameters on X_y_binary and X_y_regression.

freddyaboulton

Looks good to me @jeremyliweishih ! Agree with @bchen1116 that setting median and mean as hyperparameter values makes sense.

freddyaboulton · 2021-12-01T22:29:07Z

evalml/pipelines/components/transformers/feature_selection/rf_classifier_feature_selector.py

@@ -29,11 +29,11 @@ class RFClassifierSelectFromModel(FeatureSelector):
    name = "RF Classifier Select From Model"
    hyperparameter_ranges = {
        "percent_features": Real(0.01, 1),
-        "threshold": ["mean", -np.inf],
+        "threshold": ["mean"],


I like that suggestion. Right now the DefaultAlgorithm only runs one feature selector, so I'm down to set median as the default. Yesterday we discussed letting the algorithm tune the parameters of the selector, in which case broadening the threshold search space (for example, parametrizing it as a quantile of the observed feature importance distribution) will be in play.
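One way the quantile idea could look is sketched below. The function name `quantile_threshold`, the linear interpolation scheme, and the importance values are all assumptions made for illustration; nothing like this is part of the PR.

```python
def quantile_threshold(importances, q):
    """Return the q-quantile (0 <= q <= 1) of the importances,
    using linear interpolation between neighbouring sorted values."""
    if not importances:
        raise ValueError("importances must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(importances)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical feature importances.
importances = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]

kept_counts = {}
for q in (0.25, 0.5, 0.75):
    threshold = quantile_threshold(importances, q)
    # Assumed rule: keep features at or above the threshold.
    kept_counts[q] = sum(1 for value in importances if value >= threshold)
    print(f"q={q}: threshold={threshold:.3f}, kept={kept_counts[q]}")
```

Under this parametrization q=0.5 corresponds to the median threshold, and a tuner could search over q as a single real-valued hyperparameter instead of a fixed list of named thresholds.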

jeremyliweishih · 2021-12-01T23:13:49Z

@bchen1116

LightGBM drops significantly in terms of percentage change (since the log loss is < 0.1), but the drop is not that big in absolute terms, and the same goes for the best pipeline validation score on that dataset, so I'm not too worried about it. I guess the columns selected by the FS don't play nicely with LightGBM, but I don't know enough about how LightGBM works to make any concrete statements about the change.
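A quick arithmetic sketch of why a small log loss exaggerates percentage changes. The scores below are invented for illustration and are not taken from the perf tests.

```python
def score_change(baseline, candidate):
    """Return (absolute change, relative change in percent) between two scores."""
    absolute = candidate - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct


# Hypothetical log loss values: the same absolute increase of 0.01 in both cases.
small_abs, small_pct = score_change(baseline=0.05, candidate=0.06)
large_abs, large_pct = score_change(baseline=0.50, candidate=0.51)

print(f"baseline 0.05: {small_abs:+.3f} absolute, {small_pct:+.1f}% relative")
print(f"baseline 0.50: {large_abs:+.3f} absolute, {large_pct:+.1f}% relative")
```

The same absolute change reads as roughly 20% against the small baseline and roughly 2% against the larger one.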

bchen1116

Changes look good to me!

jeremyliweishih added 2 commits November 29, 2021 13:42

use mean threshold only

57de034

Merge branch 'main' of github.com:alteryx/evalml into js_change_fs_pa…

68bc04d

…rameters

jeremyliweishih commented Dec 1, 2021

docs

67ce90b

jeremyliweishih added 2 commits December 1, 2021 16:21

lint

bebb5b7

Fix tests

cf99f4b

jeremyliweishih marked this pull request as ready for review December 1, 2021 21:31

auto-assign bot assigned jeremyliweishih Dec 1, 2021

jeremyliweishih requested review from freddyaboulton, bchen1116, angela97lin, chukarsten, eccabay, ParthivNaresh and christopherbunn December 1, 2021 21:56

bchen1116 reviewed Dec 1, 2021

freddyaboulton approved these changes Dec 1, 2021

bchen1116 approved these changes Dec 2, 2021

Make default median and add to hyperparameter ranges

11ceecc

jeremyliweishih merged commit 36a50d2 into main Dec 2, 2021

chukarsten mentioned this pull request Dec 9, 2021

Release v0.39.0 #3136

Merged

freddyaboulton deleted the js_change_fs_parameters branch May 13, 2022 15:03
