
Change default parameters for feature selectors #3110

Merged: 6 commits, Dec 2, 2021. Showing changes from 5 commits.
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
docs/source/release_notes.rst
@@ -4,6 +4,7 @@ Release Notes
 * Enhancements
     * Renamed ``DelayedFeatureTransformer`` to ``TimeSeriesFeaturizer`` and enhanced it to compute rolling features :pr:`3028`
 * Fixes
+    * Default parameters for ``RFRegressorSelectFromModel`` and ``RFClassifierSelectFromModel`` have been fixed to avoid selecting all features :pr:`3110`
 * Changes
 * Documentation Changes
 * Testing Changes
@@ -1,5 +1,4 @@
 """Component that selects top features based on importance weights using a Random Forest classifier."""
-import numpy as np
 from sklearn.ensemble import RandomForestClassifier as SKRandomForestClassifier
 from sklearn.feature_selection import SelectFromModel as SkSelect
 from skopt.space import Real
@@ -29,11 +28,11 @@ class RFClassifierSelectFromModel(FeatureSelector):
     name = "RF Classifier Select From Model"
     hyperparameter_ranges = {
         "percent_features": Real(0.01, 1),
-        "threshold": ["mean", -np.inf],
+        "threshold": ["mean"],
Collaborator Author:
This PR proposes using the mean as the threshold, but experiments show that median performs comparably. median will select exactly half of the features, while the number mean selects depends on the importance distribution. Happy to discuss which one to choose, but I chose mean in the... meantime.

Collaborator Author:
On second thought: using median might be the move, as the mean of feature importances is bound to be dragged down by low-signal features (which are inevitable). However, performance results show similar model quality between median and mean, with median having slightly longer fit times because it selects more features.
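To see the tradeoff concretely, here is a small sketch using sklearn's `SelectFromModel` directly (the class these evalml components wrap). The dataset and estimator settings are made up for illustration; with only a few informative features, the importance distribution is skewed, which is exactly the situation discussed above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, only 3 informative, so importances are skewed.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=10, random_state=0)

counts = {}
for threshold in ("mean", "median"):
    selector = SelectFromModel(forest, threshold=threshold)
    selector.fit(X, y)
    counts[threshold] = int(selector.get_support().sum())

# "median" keeps at least half of the 20 features; "mean" keeps only those
# whose importance exceeds the average, which shrinks as the skew grows.
print(counts)
```

With `threshold="median"` roughly half the columns always survive; with `"mean"` the count floats with the distribution, which is the behavior being weighed in this thread.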

Contributor:
Why not allow both 'mean' and 'median' to be in the hyperparameter ranges? We can default to one, but allowing both in the ranges seems ideal for our AutoMLSearch algorithm.

Contributor:
I like that suggestion. Right now the DefaultAlgorithm only runs one feature selector, so I'm down to set median as the default. Yesterday we discussed letting the algorithm tune the parameters of the selector, in which case broadening the threshold search space (e.g., parametrizing it as a quantile of the observed feature importance distribution) would be in play.

Collaborator Author:
@bchen1116 @freddyaboulton

Sounds good, I'll add them both as hyperparameter ranges! My main concern is what Freddy brought up about the default parameter and having only one feature selection batch, but I'm still on the fence on choosing mean or median. Either way it'll be a quick fix, so I'm not too worried!
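The broadened ranges would then look something like this (a sketch of the change discussed above, not code from the PR; a plain tuple stands in for `skopt.space.Real(0.01, 1)` so the snippet has no skopt dependency):

```python
# Sketch of the broadened search space: both threshold values exposed to the
# tuner, with one of them (here "median") serving as the default.
hyperparameter_ranges = {
    "percent_features": (0.01, 1.0),  # stands in for skopt's Real(0.01, 1)
    "threshold": ["mean", "median"],
}
```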

     }
     """{
         "percent_features": Real(0.01, 1),
-        "threshold": ["mean", -np.inf],
+        "threshold": ["mean"],
     }"""

     def __init__(
@@ -42,7 +41,7 @@ def __init__(
         n_estimators=10,
         max_depth=None,
         percent_features=0.5,
-        threshold=-np.inf,
+        threshold="mean",
         n_jobs=-1,
         random_seed=0,
         **kwargs,
@@ -1,5 +1,4 @@
 """Component that selects top features based on importance weights using a Random Forest regressor."""
-import numpy as np
 from sklearn.ensemble import RandomForestRegressor as SKRandomForestRegressor
 from sklearn.feature_selection import SelectFromModel as SkSelect
 from skopt.space import Real
@@ -29,11 +28,11 @@ class RFRegressorSelectFromModel(FeatureSelector):
     name = "RF Regressor Select From Model"
     hyperparameter_ranges = {
         "percent_features": Real(0.01, 1),
-        "threshold": ["mean", -np.inf],
+        "threshold": ["mean"],
Contributor:
Also here, it seems ideal to allow median as a possible value in the hyperparameters.

     }
     """{
         "percent_features": Real(0.01, 1),
-        "threshold": ["mean", -np.inf],
+        "threshold": ["mean"],
     }"""

     def __init__(
@@ -42,7 +41,7 @@ def __init__(
         n_estimators=10,
         max_depth=None,
         percent_features=0.5,
-        threshold=-np.inf,
+        threshold="mean",
         n_jobs=-1,
         random_seed=0,
         **kwargs,
8 changes: 8 additions & 0 deletions evalml/tests/component_tests/test_components.py
@@ -888,6 +888,10 @@ def test_transformer_transform_output_type(X_y_binary):
         component, SelectByType
     ):
         assert transform_output.shape == (X.shape[0], 0)
+    elif isinstance(component, RFRegressorSelectFromModel):
+        assert transform_output.shape == (X.shape[0], 2)
+    elif isinstance(component, RFClassifierSelectFromModel):
+        assert transform_output.shape == (X.shape[0], 5)
Contributor:
Where are the 2 and 5 values coming from here?

Collaborator Author:
It's the number of columns selected by the feature selection component using the default parameters on X_y_binary and X_y_regression.
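As an illustration of how such expected counts arise (hypothetical data, not the evalml fixtures): fit the selector with its new defaults on fixed random data and count the surviving columns; the shape asserted in the test is just that count, pinned by the fixture's random seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical stand-in for the X_y_binary fixture; the 2 and 5 in the test
# were obtained analogously from evalml's own fixtures and component defaults.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=10, random_state=0), threshold="mean"
)
n_selected = selector.fit_transform(X, y).shape[1]
print(n_selected)  # deterministic for this seed; the number a test would pin
```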

     elif isinstance(component, PCA) or isinstance(
         component, LinearDiscriminantAnalysis
     ):
@@ -915,6 +919,10 @@ def test_transformer_transform_output_type(X_y_binary):
         component, SelectByType
     ):
         assert transform_output.shape == (X.shape[0], 0)
+    elif isinstance(component, RFRegressorSelectFromModel):
+        assert transform_output.shape == (X.shape[0], 2)
+    elif isinstance(component, RFClassifierSelectFromModel):
+        assert transform_output.shape == (X.shape[0], 5)
     elif isinstance(component, PCA) or isinstance(
         component, LinearDiscriminantAnalysis
     ):