Updated partial dependence methods to support non-numeric columns #1150
Conversation
Codecov Report
```diff
@@           Coverage Diff           @@
##             main    #1150   +/-   ##
=======================================
  Coverage   99.92%   99.92%
=======================================
  Files         196      196
  Lines       11998    12029    +31
=======================================
+ Hits        11989    12020    +31
  Misses          9        9
```

Continue to review full report at Codecov.
@angela97lin I tried to run this with the following, and while some of the errors from before are gone, I am unable to generate partial dependence with my CatBoost pipeline.

```python
import evalml
from evalml import AutoMLSearch
from evalml.model_understanding.graphs import partial_dependence
from featuretools.demo import load_retail
from evalml.model_family import ModelFamily

X_y = load_retail(nrows=1000, return_single_table=True)
X = X_y.drop(columns=['total'])
y = X_y['total']

automl = AutoMLSearch(problem_type="regression", objective="auto", max_pipelines=5)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y,
                                                                         test_size=0.2,
                                                                         regression=True)
automl.search(X_train, y_train)

for x in automl.rankings['id']:
    pipeline = automl.get_pipeline(x)
    if pipeline.model_family != ModelFamily.BASELINE:
        pipeline.fit(X_train, y_train)
        # quantity is a numeric field
        try:
            data = partial_dependence(pipeline, X_holdout, feature='quantity')
        except Exception as e:
            print(pipeline.name)
            print(e)
```

Error:

```
CatBoost Regressor w/ Imputer + DateTime Featurization Component
Invalid cat_features[6] = 9 value: index must be < 9.
```
@gsheni Ah, thanks for the feedback! I just updated this PR; could you try again and let me know if you run into any issues?
@angela97lin Yes, that does seem to work better, and I am no longer getting the same error. The only remaining thing is support for categorical, natural-language, and boolean columns (which you mentioned might come in a follow-up PR), and maybe datetime columns, though I'm not sure that makes sense mathematically. Would it be possible to raise a generic error if the data type is unsupported? Maybe in a follow-up PR? (A rough sketch of the kind of guard I mean follows the snippet below.) I noticed that with the following code, it errors on categorical, natural-language, boolean, and datetime columns, but that is expected behavior.

```python
import evalml
from evalml import AutoMLSearch
from evalml.model_understanding.graphs import partial_dependence
from featuretools.demo import load_retail
from evalml.model_family import ModelFamily

X_y = load_retail(nrows=1000, return_single_table=True)
X = X_y.drop(columns=['total'])
y = X_y['total']

automl = AutoMLSearch(problem_type="regression", objective="auto", max_pipelines=5)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y,
                                                                         test_size=0.2,
                                                                         regression=True)
automl.search(X_train, y_train)

for x in automl.rankings['id']:
    pipeline = automl.get_pipeline(x)
    if pipeline.model_family != ModelFamily.BASELINE:
        pipeline.fit(X_train, y_train)
        for col in X_train.columns.tolist():
            try:
                data = partial_dependence(pipeline, X_holdout, feature=col)
            except Exception as e:
                print(col)
                print(X[col].head(3))
                print(pipeline.name)
                print(e)
```

Errors
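As a rough illustration of the kind of generic guard being asked for here, a minimal sketch under the assumption that only plain numeric columns are supported (the helper name is hypothetical; this is not the PR's implementation):

```python
import pandas as pd

def _check_partial_dependence_dtype(X, feature):
    # Reject feature columns the method cannot handle with one generic error,
    # instead of a different low-level failure per data type.
    # Note: pandas treats booleans as numeric, so they would pass this check
    # and might need separate handling.
    if not pd.api.types.is_numeric_dtype(X[feature]):
        raise ValueError(
            f"Partial dependence is not supported for column '{feature}' "
            f"with dtype {X[feature].dtype}."
        )
```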
@angela97lin I think this looks good! I wonder if we should update the docs to mention that you can't compute the partial dependence of non-numeric columns?
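For instance, the user-facing note could live in the function's docstring; a minimal sketch against the `partial_dependence` signature quoted further down in this review (the wording is illustrative, not the repository's actual docs):

```python
def partial_dependence(pipeline, X, feature, grid_resolution=100):
    """Calculates one-way partial dependence data for the given pipeline and feature.

    Note:
        Computing partial dependence is currently only supported for numeric
        feature columns; non-numeric columns raise a ValueError.

    Raises:
        ValueError: If the requested feature column has a non-numeric dtype.
    """
```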
@angela97lin looks good! I left some questions and requests to file a couple follow-on items.
evalml/model_understanding/graphs.py
Outdated
```python
    if not pipeline._is_fitted:
        raise ValueError("Pipeline to calculate partial dependence for must be fitted")
    if pipeline.model_family == ModelFamily.CATBOOST:
        pipeline.estimator._component_obj._fitted_ = True
    elif X[feature].dtype not in numeric_dtypes:
        raise ValueError(f"Partial dependence is is currently only supported for numeric dtypes for non-CatBoost pipelines.")
```
Typo: "is is"

Also, can we throw a separate error for CatBoost? I think it's confusing having one error for both cases.
@dsherry I'm not sure what you mean; CatBoost doesn't throw an error here. Let me know if I can make this clearer.
evalml/model_understanding/graphs.py
Outdated
```python
    if pipeline.model_family == ModelFamily.BASELINE:
        raise ValueError("Partial dependence plots are not supported for Baseline pipelines")
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)
    if not pipeline._is_fitted:
        raise ValueError("Pipeline to calculate partial dependence for must be fitted")
    if pipeline.model_family == ModelFamily.CATBOOST:
        pipeline.estimator._component_obj._fitted_ = True
```
I forgot about this, but if we're making modifications to the pipeline that's passed in, we should probably warn users about that. Ideally, we should remove these added attributes once the method completes; they were only needed to get the scikit-learn code to run. Do you know what I mean, and if so, could you file something for that?
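One way to guarantee that cleanup, sketched as a context manager (the helper name `_sklearn_compat_attrs` is hypothetical; the attribute names come from the diffs quoted in this review, and the hard-coded "regressor" type assumes a regression pipeline):

```python
from contextlib import contextmanager

@contextmanager
def _sklearn_compat_attrs(pipeline):
    # Temporarily add the attributes scikit-learn's checks look for,
    # and remove them again even if computing partial dependence raises.
    pipeline._estimator_type = "regressor"  # assumption: regression pipeline
    pipeline.feature_importances_ = pipeline.feature_importance
    try:
        yield pipeline
    finally:
        del pipeline._estimator_type
        del pipeline.feature_importances_
```

Usage would then be a `with _sklearn_compat_attrs(pipeline):` block wrapped around the `sk_partial_dependence` call.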
evalml/model_understanding/graphs.py
Outdated
```diff
@@ -441,9 +447,9 @@ def partial_dependence(pipeline, X, feature, grid_resolution=100):
     pipeline.estimator._estimator_type = "regressor"
     # set arbitrary attribute that ends in underscore to pass scikit-learn check for is_fitted
     pipeline.estimator.feature_importances_ = pipeline.feature_importance
-    avg_pred, values = sk_partial_dependence(pipeline.estimator, X=X, features=[feature], grid_resolution=grid_resolution)
+    avg_pred, values = sk_partial_dependence(pipeline.estimator, X=pipeline._transform(X), features=[feature], grid_resolution=grid_resolution)
```
@angela97lin hmm, apologies if I'm forgetting context here, but was there a reason we couldn't pass in the pipeline rather than the estimator?
I think this is fine, and that you are right that we should provide the features for the estimator here. The side-effect is that the feature names provided as input need to be names of the generated features provided to the estimator, not the user-defined features. I suggest we file something to track passing pipelines in here instead of estimators.
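To illustrate that side-effect, a small continuation of the snippets above, assuming the fitted `pipeline` and `X_holdout` from gsheni's example (the column names shown in the comments are hypothetical examples of what encoders and the DateTime Featurization Component might produce, not output from this repository):

```python
# Before the component graph runs, the data has the user's original columns.
print(list(X_holdout.columns))
# e.g. ['country', 'order_date', 'quantity', ...]

# After the pipeline's components transform the data, the estimator sees
# generated feature names, such as one-hot and datetime-derived columns.
X_t = pipeline._transform(X_holdout)
print(list(X_t.columns))
# e.g. ['country_US', 'country_UK', 'order_date_year', 'order_date_month', 'quantity', ...]

# So with the estimator-based call, `feature` must name one of the generated
# columns (e.g. 'country_US'), not the user-defined 'country'.
```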
Aha, I think I went with estimators because the attributes we had to add belonged to the scikit-learn estimators. But since they're temporary, it's okay to add them to our pipelines and remove them afterwards. Updating to use pipelines actually lets us calculate and support non-numeric columns too! :D
@angela97lin I took a quick look (sorry I missed your comments yesterday). Looks great; I'm glad pipelines worked here! I left one idea for an improvement.
```python
    # Delete scikit-learn attributes that were temporarily set
    del pipeline._estimator_type
    del pipeline.feature_importances_
```
Very cool!

One note: if `sk_partial_dependence` throws, we'll never reach this code. You could use `finally`, though:

```python
try:
    avg_pred, values = sk_partial_dependence(pipeline, X=X, features=[feature], grid_resolution=grid_resolution)
finally:
    del pipeline._estimator_type
    del pipeline.feature_importances_
```

This will allow the exception to still propagate up the stack because we didn't add an `except` clause, but it will guarantee that the code in `finally` runs before the exception continues on!

Not critical, but it would be cool to add this sometime.
Closes #1125
This PR adds support for non-numeric columns in our partial dependence methods by transforming the input before passing it to the scikit-learn method.
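A condensed sketch of that approach, pieced together from the diff snippets quoted above (the `pipeline._transform` call, the `sk_partial_dependence` alias, and the tuple-style return all follow those snippets; the wrapper name is illustrative and this is not the merged implementation):

```python
from sklearn.inspection import partial_dependence as sk_partial_dependence

def partial_dependence_sketch(pipeline, X, feature, grid_resolution=100):
    # Encode non-numeric columns (categoricals, datetimes, text) with the
    # pipeline's own components before handing the data to scikit-learn.
    X_t = pipeline._transform(X)
    # Note: after the transform, `feature` must name one of the generated
    # columns seen by the estimator, as discussed in the review above.
    avg_pred, values = sk_partial_dependence(
        pipeline.estimator, X=X_t, features=[feature],
        grid_resolution=grid_resolution)
    return avg_pred, values
```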