Updated partial dependence methods to support non-numeric columns #1150

Merged
merged 26 commits into from
Sep 25, 2020

Conversation

angela97lin
Contributor

@angela97lin angela97lin commented Sep 8, 2020

Closes #1125

This PR enables support of non-numeric columns in our partial dependence methods by transforming the input before passing it to the scikit-learn method.

@angela97lin angela97lin self-assigned this Sep 8, 2020
@codecov

codecov bot commented Sep 8, 2020

Codecov Report

Merging #1150 into main will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1150   +/-   ##
=======================================
  Coverage   99.92%   99.92%           
=======================================
  Files         196      196           
  Lines       11998    12029   +31     
=======================================
+ Hits        11989    12020   +31     
  Misses          9        9           
Impacted Files Coverage Δ
...elines/components/transformers/imputers/imputer.py 100.00% <ø> (ø)
evalml/model_understanding/graphs.py 100.00% <100.00%> (ø)
...lml/tests/model_understanding_tests/test_graphs.py 100.00% <100.00%> (ø)

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b0f75d...9b22e89. Read the comment docs.

@angela97lin angela97lin changed the title Support non-numeric columns in one-way Partial Dependency Support non-numeric columns in one-way Partial Dependencw Sep 9, 2020
@angela97lin angela97lin changed the title Support non-numeric columns in one-way Partial Dependencw Support non-numeric columns in one-way Partial Dependence Sep 9, 2020
@angela97lin angela97lin added this to the September 2020 milestone Sep 9, 2020
@angela97lin angela97lin marked this pull request as ready for review September 9, 2020 18:35
@gsheni
Contributor

gsheni commented Sep 9, 2020

@angela97lin I tried to run this with the following, and while some of the errors from before are gone, I am unable to generate a partial dependency with my CatBoost pipeline.

import evalml
from evalml import AutoMLSearch
from evalml.model_understanding.graphs import partial_dependence
from featuretools.demo import load_retail
from evalml.model_family import ModelFamily

X_y = load_retail(nrows=1000, return_single_table=True)
X = X_y.drop(columns=['total'])
y = X_y['total']

automl = AutoMLSearch(problem_type="regression", objective="auto", max_pipelines=5)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, 
                                                                         test_size=0.2,
                                                                         regression=True)
automl.search(X_train, y_train)
for x in automl.rankings['id']:
    pipeline = automl.get_pipeline(x)
    if pipeline.model_family != ModelFamily.BASELINE:
        pipeline.fit(X_train, y_train)
        # quantity is a numeric field
        try:
            data = partial_dependence(pipeline, X_holdout, feature='quantity')
        except Exception as e:
            print(pipeline.name)
            print(e)

Error

CatBoost Regressor w/ Imputer + DateTime Featurization Component
Invalid cat_features[6] = 9 value: index must be < 9.

@angela97lin
Contributor Author

@gsheni Ah thanks for the feedback; I just updated this PR, could you try again and let me know if you run into any issues?

@gsheni
Contributor

gsheni commented Sep 10, 2020

@angela97lin Yes, that does seem to work better, and I am no longer getting the same error. The only remaining thing is support for categorical, natural-language, and boolean columns (you mentioned categoricals might come in a follow-up MR). Maybe datetime columns too, though I'm not sure that makes sense mathematically.

Would it be possible to raise a generic error if the data type is unsupported? Maybe in a follow-up MR?

I noticed that with the following code, it errors on categoricals, natural-language, boolean, and datetime columns. But that is expected behavior.

import evalml
from evalml import AutoMLSearch
from evalml.model_understanding.graphs import partial_dependence
from featuretools.demo import load_retail
from evalml.model_family import ModelFamily

X_y = load_retail(nrows=1000, return_single_table=True)
X = X_y.drop(columns=['total'])
y = X_y['total']

automl = AutoMLSearch(problem_type="regression", objective="auto", max_pipelines=5)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, 
                                                                         test_size=0.2,
                                                                         regression=True)
automl.search(X_train, y_train)
for x in automl.rankings['id']:
    pipeline = automl.get_pipeline(x)
    if pipeline.model_family != ModelFamily.BASELINE:
        pipeline.fit(X_train, y_train)
        for col in X_train.columns.tolist():
            try:
                data = partial_dependence(pipeline, X_holdout, feature=col)
            except Exception as e:
                print(col)
                print(X[col].head(3))
                print(pipeline.name)
                print(e)

Errors

can't multiply sequence by non-int of type 'float'
A given column is not a column of the dataframe

Contributor

@freddyaboulton freddyaboulton left a comment

@angela97lin I think this looks good! I wonder if we should update the docs to mention that you can't compute the partial dependence of non-numeric columns?

evalml/model_understanding/graphs.py (outdated, resolved)
evalml/tests/model_understanding_tests/test_graphs.py (outdated, resolved)
Contributor

@dsherry dsherry left a comment

@angela97lin looks good! I left some questions and requests to file a couple follow-on items.

evalml/model_understanding/graphs.py (outdated, resolved)
evalml/model_understanding/graphs.py (outdated, resolved)
if not pipeline._is_fitted:
    raise ValueError("Pipeline to calculate partial dependence for must be fitted")
if pipeline.model_family == ModelFamily.CATBOOST:
    pipeline.estimator._component_obj._fitted_ = True
elif X[feature].dtype not in numeric_dtypes:
    raise ValueError(f"Partial dependence is is currently only supported for numeric dtypes for non-CatBoost pipelines.")
Contributor

Typo: "is is"

Also, can we throw a separate error for CatBoost? I think it's confusing to have one error for both cases.

Contributor Author

@dsherry I'm not sure what you mean; CatBoost doesn't throw an error. Let me know if I can make this clearer.

if pipeline.model_family == ModelFamily.BASELINE:
    raise ValueError("Partial dependence plots are not supported for Baseline pipelines")
if not isinstance(X, pd.DataFrame):
    X = pd.DataFrame(X)
if not pipeline._is_fitted:
    raise ValueError("Pipeline to calculate partial dependence for must be fitted")
if pipeline.model_family == ModelFamily.CATBOOST:
    pipeline.estimator._component_obj._fitted_ = True
Contributor

I forgot about this, but if we're making modifications to the inputted pipeline, we should probably warn users about that. Ideally, we should remove these added attributes once the method is complete. They were only needed to get the sklearn code to run. Do you know what I mean, and if so could you file something for that?
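A minimal sketch of one way to guarantee that temporarily added attributes are removed again, using a context manager. The helper name `temporary_attributes` and the `DummyPipeline` class are hypothetical and are not part of evalml or scikit-learn; the sketch also assumes the object stores instance attributes in `__dict__`.

```python
from contextlib import contextmanager

_MISSING = object()  # sentinel meaning "the instance had no such attribute before"


@contextmanager
def temporary_attributes(obj, **attrs):
    """Set attributes on obj inside the with-block, then restore the prior state."""
    # Remember what the instance itself held before we touched it.
    previous = {name: obj.__dict__.get(name, _MISSING) for name in attrs}
    for name, value in attrs.items():
        setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs on normal exit and when the block raises.
        for name, old in previous.items():
            if old is _MISSING:
                delattr(obj, name)
            else:
                setattr(obj, name, old)


class DummyPipeline:
    """Hypothetical stand-in for a fitted pipeline."""


pipeline = DummyPipeline()
with temporary_attributes(pipeline, _estimator_type="regressor", feature_importances_=[0.7, 0.3]):
    inside = hasattr(pipeline, "feature_importances_")
after = hasattr(pipeline, "feature_importances_")
```

Here `inside` is `True` and `after` is `False`, so the caller's object looks the same after the block as it did before.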

@@ -441,9 +447,9 @@ def partial_dependence(pipeline, X, feature, grid_resolution=100):
     pipeline.estimator._estimator_type = "regressor"
     # set arbitrary attribute that ends in underscore to pass scikit-learn check for is_fitted
     pipeline.estimator.feature_importances_ = pipeline.feature_importance
-    avg_pred, values = sk_partial_dependence(pipeline.estimator, X=X, features=[feature], grid_resolution=grid_resolution)
+    avg_pred, values = sk_partial_dependence(pipeline.estimator, X=pipeline._transform(X), features=[feature], grid_resolution=grid_resolution)
Contributor

@angela97lin hmm, apologies if I'm forgetting context here, but was there a reason we couldn't pass in the pipeline rather than the estimator?

I think this is fine, and that you are right that we should provide the features for the estimator here. The side-effect is that the feature names provided as input need to be names of the generated features provided to the estimator, not the user-defined features. I suggest we file something to track passing pipelines in here instead of estimators.
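To illustrate the side effect described above, here is a small sketch of how a transformation step can change the column names an estimator sees. The `one_hot_encode` helper, its `<column>_<category>` naming scheme, and the sample rows are all hypothetical; evalml's actual components may name generated features differently.

```python
def one_hot_encode(rows, column):
    """Replace `column` with one 0/1 column per category (illustrative naming only)."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"quantity": 6, "country": "France"},
    {"quantity": 2, "country": "Norway"},
]
transformed = one_hot_encode(rows, "country")

user_columns = list(rows[0])                # what the user knows about
estimator_columns = list(transformed[0])    # what the estimator is given
```

After the transformation, `country` no longer exists as a column, so asking an estimator-level routine about `country` cannot work; the caller would have to name a generated column instead.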

Contributor Author

Aha, I think I went with estimators because the attributes we had to add belonged to the scikit-learn estimators. But since they're temporary, it's okay to add them to our pipelines and remove them afterwards. Updating to use pipelines actually lets us calculate and support non-numeric columns too! :D
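As a rough illustration of why working at the pipeline level helps, here is a brute-force sketch of one-way partial dependence: force the feature to each observed value in turn and average the predictions. This is not scikit-learn's or evalml's implementation; `ToyPipeline`, its fixed coefficients, and `brute_force_partial_dependence` are hypothetical names used only for this example.

```python
from statistics import mean


class ToyPipeline:
    """Hypothetical fitted pipeline that accepts raw rows, including a categorical column."""

    country_effect = {"France": 1.0, "Norway": 3.0}

    def predict(self, rows):
        # Handling of the categorical value happens inside the pipeline.
        return [2.0 * row["quantity"] + self.country_effect[row["country"]] for row in rows]


def brute_force_partial_dependence(pipeline, rows, feature):
    """Average prediction when `feature` is set to each of its observed values."""
    grid = sorted({row[feature] for row in rows})
    averages = []
    for value in grid:
        # Overwrite the feature in every row, leaving the other columns untouched.
        modified = [{**row, feature: value} for row in rows]
        averages.append(mean(pipeline.predict(modified)))
    return grid, averages


rows = [
    {"quantity": 1, "country": "France"},
    {"quantity": 3, "country": "Norway"},
]
grid, averages = brute_force_partial_dependence(ToyPipeline(), rows, "country")
```

Because the pipeline consumes the raw rows, the grid can contain category labels directly, and the caller refers to the feature by its original name.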

@angela97lin angela97lin changed the title Support non-numeric columns in one-way Partial Dependence Updated partial dependence methods to support calculating numeric columns in a dataset with non-numeric columns Sep 24, 2020
@angela97lin angela97lin changed the title Updated partial dependence methods to support calculating numeric columns in a dataset with non-numeric columns Updated partial dependence methods to support non-numeric columns Sep 25, 2020
@angela97lin
Contributor Author

@dsherry @gsheni By updating our implementation to pass in evalml pipelines directly (thanks @dsherry!), I believe we're now able to calculate partial dependence for categorical columns. Just an FYI before I merge, in case you want to take another look! 🥳

@angela97lin angela97lin merged commit 9e6706b into main Sep 25, 2020
@angela97lin angela97lin deleted the 1125_non_numeric branch September 25, 2020 17:21
Contributor

@dsherry dsherry left a comment

@angela97lin I took a quick look. Sorry I missed your comments yesterday. This looks great, and I'm glad pipelines worked here! I left one idea for an improvement.


# Delete scikit-learn attributes that were temporarily set
del pipeline._estimator_type
del pipeline.feature_importances_
Contributor

Very cool!

One note: if sk_partial_dependence throws, we'll never reach this code. You could use finally though:

try:
    avg_pred, values = sk_partial_dependence(pipeline, X=X, features=[feature], grid_resolution=grid_resolution)
finally:
    del pipeline._estimator_type
    del pipeline.feature_importances_

This will allow the exception to still propagate up the stack because we didn't add an except clause, but it will guarantee that the code in finally runs before the exception continues on!

Not critical, but it would be cool to add this sometime.
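A small self-contained check of the behavior described above, using a stand-in that always raises. `DummyPipeline`, `failing_partial_dependence`, and `compute` are hypothetical names for this demonstration and do not come from evalml or scikit-learn.

```python
class DummyPipeline:
    """Hypothetical stand-in for a fitted pipeline."""


def failing_partial_dependence(pipeline):
    # Stand-in for a computation that raises partway through.
    raise ValueError("simulated failure")


def compute(pipeline):
    # Temporarily attach the attributes, as in the snippet above.
    pipeline._estimator_type = "regressor"
    pipeline.feature_importances_ = [1.0]
    try:
        return failing_partial_dependence(pipeline)
    finally:
        # No except clause: cleanup runs, then the exception keeps propagating.
        del pipeline._estimator_type
        del pipeline.feature_importances_


pipeline = DummyPipeline()
try:
    compute(pipeline)
    propagated = False
except ValueError:
    propagated = True

leftover = [name for name in ("_estimator_type", "feature_importances_") if hasattr(pipeline, name)]
```

With this setup `propagated` is `True` and `leftover` is empty: the caller still sees the error, and the pipeline carries no stray attributes.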

Development

Successfully merging this pull request may close these issues.

Support non-numeric columns in one-way Partial Dependency
4 participants