-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change default parameters for feature selectors #3110
Changes from 5 commits
57de034
68bc04d
67ce90b
bebb5b7
cf99f4b
11ceecc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,4 @@ | ||
"""Component that selects top features based on importance weights using a Random Forest regresor.""" | ||
import numpy as np | ||
from sklearn.ensemble import RandomForestRegressor as SKRandomForestRegressor | ||
from sklearn.feature_selection import SelectFromModel as SkSelect | ||
from skopt.space import Real | ||
|
@@ -29,11 +28,11 @@ class RFRegressorSelectFromModel(FeatureSelector): | |
name = "RF Regressor Select From Model" | ||
hyperparameter_ranges = { | ||
"percent_features": Real(0.01, 1), | ||
"threshold": ["mean", -np.inf], | ||
"threshold": ["mean"], | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also here, seems to be ideal to allow |
||
} | ||
"""{ | ||
"percent_features": Real(0.01, 1), | ||
"threshold": ["mean", -np.inf], | ||
"threshold": ["mean"], | ||
}""" | ||
|
||
def __init__( | ||
|
@@ -42,7 +41,7 @@ def __init__( | |
n_estimators=10, | ||
max_depth=None, | ||
percent_features=0.5, | ||
threshold=-np.inf, | ||
threshold="mean", | ||
n_jobs=-1, | ||
random_seed=0, | ||
**kwargs, | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -888,6 +888,10 @@ def test_transformer_transform_output_type(X_y_binary): | |
component, SelectByType | ||
): | ||
assert transform_output.shape == (X.shape[0], 0) | ||
elif isinstance(component, RFRegressorSelectFromModel): | ||
assert transform_output.shape == (X.shape[0], 2) | ||
elif isinstance(component, RFClassifierSelectFromModel): | ||
assert transform_output.shape == (X.shape[0], 5) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it's the number of columns selected by the FS component using the default parameters on |
||
elif isinstance(component, PCA) or isinstance( | ||
component, LinearDiscriminantAnalysis | ||
): | ||
|
@@ -915,6 +919,10 @@ def test_transformer_transform_output_type(X_y_binary): | |
component, SelectByType | ||
): | ||
assert transform_output.shape == (X.shape[0], 0) | ||
elif isinstance(component, RFRegressorSelectFromModel): | ||
assert transform_output.shape == (X.shape[0], 2) | ||
elif isinstance(component, RFClassifierSelectFromModel): | ||
assert transform_output.shape == (X.shape[0], 5) | ||
elif isinstance(component, PCA) or isinstance( | ||
component, LinearDiscriminantAnalysis | ||
): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR proposes using the
mean
as the threshold but experiments show thatmedian
performs similarly as well.median
will select exactly half of the features andmean
depending on the distribution. Happy to discuss which one to choose but I chosemean
in the... meantime..There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought: using
median
might be the move as themean
of feature importances will bound to be dragged down by low signal features (which are inevitable). However, performance results show similar model quality betweenmedian
andmean
butmedian
having slightly longer fit times due tomedian
selecting more features.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not allow both 'mean' and 'median' to be in the hyperparameter ranges? We can default to one, but allowing both in the ranges seems to be ideal for our automlsearch algo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that suggestion. Right now the DefaultAlgorithm only runs one feature selector, so I'm down to set median as the default. Yesterday we discussed letting the algorithm tune the parameters of the selector in which case broadening the threshold search space (like parametrizing it as a quantile of the observed feature importance distribution) will be in play.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bchen1116 @freddyaboulton
sounds good, I'll add them both as hyperparameter ranges! My main concern is what Freddy brought up about the default parameter and having only 1 FS batch but I'm still on the fence on chosing
mean
ormedian
. Either way it'll be a quick fix so I'm not too worried!