Updated partial dependence methods to support non-numeric columns #1150
Conversation
Codecov Report
```diff
@@           Coverage Diff           @@
##             main    #1150   +/-   ##
=======================================
  Coverage   99.92%   99.92%
=======================================
  Files         196      196
  Lines       11998    12029    +31
=======================================
+ Hits        11989    12020    +31
  Misses          9        9
```

Continue to review full report at Codecov.
@angela97lin I tried to run this with the following, and while some of the errors from before are gone, I am unable to generate partial dependence with my CatBoost pipeline.

```python
import evalml
from evalml import AutoMLSearch
from evalml.model_understanding.graphs import partial_dependence
from featuretools.demo import load_retail
from evalml.model_family import ModelFamily

X_y = load_retail(nrows=1000, return_single_table=True)
X = X_y.drop(columns=['total'])
y = X_y['total']

automl = AutoMLSearch(problem_type="regression", objective="auto", max_pipelines=5)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y,
                                                                         test_size=0.2,
                                                                         regression=True)
automl.search(X_train, y_train)

for x in automl.rankings['id']:
    pipeline = automl.get_pipeline(x)
    if pipeline.model_family != ModelFamily.BASELINE:
        pipeline.fit(X_train, y_train)
        # quantity is a numeric field
        try:
            data = partial_dependence(pipeline, X_holdout, feature='quantity')
        except Exception as e:
            print(pipeline.name)
            print(e)
```

Error:

```
CatBoost Regressor w/ Imputer + DateTime Featurization Component
Invalid cat_features[6] = 9 value: index must be < 9.
```
@gsheni Ah, thanks for the feedback! I just updated this PR; could you try again and let me know if you run into any issues?
@angela97lin Yes, that does seem to work better, and I am no longer getting the same error. The only remaining thing is support for categorical, natural-language, and boolean columns (which you mentioned might come in a follow-up PR), and maybe datetime columns, though I'm not sure that makes sense mathematically. Would it be possible to raise a generic error if the data type is unsupported? Maybe in a follow-up PR? (A rough sketch of the kind of guard I mean follows the snippet below.) I noticed that with the following code, it errors on categorical, natural-language, boolean, and datetime columns, but that is expected behavior.

```python
import evalml
from evalml import AutoMLSearch
from evalml.model_understanding.graphs import partial_dependence
from featuretools.demo import load_retail
from evalml.model_family import ModelFamily

X_y = load_retail(nrows=1000, return_single_table=True)
X = X_y.drop(columns=['total'])
y = X_y['total']

automl = AutoMLSearch(problem_type="regression", objective="auto", max_pipelines=5)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y,
                                                                         test_size=0.2,
                                                                         regression=True)
automl.search(X_train, y_train)

for x in automl.rankings['id']:
    pipeline = automl.get_pipeline(x)
    if pipeline.model_family != ModelFamily.BASELINE:
        pipeline.fit(X_train, y_train)
        for col in X_train.columns.tolist():
            try:
                data = partial_dependence(pipeline, X_holdout, feature=col)
            except Exception as e:
                print(col)
                print(X[col].head(3))
                print(pipeline.name)
                print(e)
```

Errors
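As a rough illustration of the kind of generic guard being asked for here, a minimal sketch under the assumption that only plain numeric columns are supported (the helper name is hypothetical; this is not the PR's implementation):

```python
import pandas as pd

def _check_partial_dependence_dtype(X, feature):
    # Reject feature columns the method cannot handle with one generic error,
    # instead of a different low-level failure per data type.
    # Note: pandas treats booleans as numeric, so they would pass this check
    # and might need separate handling.
    if not pd.api.types.is_numeric_dtype(X[feature]):
        raise ValueError(
            f"Partial dependence is not supported for column '{feature}' "
            f"with dtype {X[feature].dtype}."
        )
```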
@angela97lin I think this looks good! I wonder if we should update the docs to mention that you can't compute the partial dependence of non-numeric columns?
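For instance, the user-facing note could live in the function's docstring; a minimal sketch against the `partial_dependence` signature quoted further down in this review (the wording is illustrative, not the repository's actual docs):

```python
def partial_dependence(pipeline, X, feature, grid_resolution=100):
    """Calculates one-way partial dependence data for the given pipeline and feature.

    Note:
        Computing partial dependence is currently only supported for numeric
        feature columns; non-numeric columns raise a ValueError.

    Raises:
        ValueError: If the requested feature column has a non-numeric dtype.
    """
```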
@angela97lin looks good! I left some questions and requests to file a couple follow-on items.
evalml/model_understanding/graphs.py
Outdated
```python
    if not pipeline._is_fitted:
        raise ValueError("Pipeline to calculate partial dependence for must be fitted")
    if pipeline.model_family == ModelFamily.CATBOOST:
        pipeline.estimator._component_obj._fitted_ = True
    elif X[feature].dtype not in numeric_dtypes:
        raise ValueError(f"Partial dependence is is currently only supported for numeric dtypes for non-CatBoost pipelines.")
```
Typo: "is is"

Also, can we throw a separate error for CatBoost? I think it's confusing having one error for both cases.
@dsherry I'm not sure what you mean; CatBoost doesn't throw an error here. Let me know if I can make this clearer.
evalml/model_understanding/graphs.py
Outdated
```python
    if pipeline.model_family == ModelFamily.BASELINE:
        raise ValueError("Partial dependence plots are not supported for Baseline pipelines")
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X)
    if not pipeline._is_fitted:
        raise ValueError("Pipeline to calculate partial dependence for must be fitted")
    if pipeline.model_family == ModelFamily.CATBOOST:
        pipeline.estimator._component_obj._fitted_ = True
```
I forgot about this, but if we're making modifications to the pipeline that's passed in, we should probably warn users about that. Ideally, we should remove these added attributes once the method completes; they were only needed to get the scikit-learn code to run. Do you know what I mean, and if so, could you file something for that?
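One way to guarantee that cleanup, sketched as a context manager (the helper name `_sklearn_compat_attrs` is hypothetical; the attribute names come from the diffs quoted in this review, and the hard-coded "regressor" type assumes a regression pipeline):

```python
from contextlib import contextmanager

@contextmanager
def _sklearn_compat_attrs(pipeline):
    # Temporarily add the attributes scikit-learn's checks look for,
    # and remove them again even if computing partial dependence raises.
    pipeline._estimator_type = "regressor"  # assumption: regression pipeline
    pipeline.feature_importances_ = pipeline.feature_importance
    try:
        yield pipeline
    finally:
        del pipeline._estimator_type
        del pipeline.feature_importances_
```

Usage would then be a `with _sklearn_compat_attrs(pipeline):` block wrapped around the `sk_partial_dependence` call.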
evalml/model_understanding/graphs.py
Outdated
```diff
@@ -441,9 +447,9 @@ def partial_dependence(pipeline, X, feature, grid_resolution=100):
     pipeline.estimator._estimator_type = "regressor"
     # set arbitrary attribute that ends in underscore to pass scikit-learn check for is_fitted
     pipeline.estimator.feature_importances_ = pipeline.feature_importance
-    avg_pred, values = sk_partial_dependence(pipeline.estimator, X=X, features=[feature], grid_resolution=grid_resolution)
+    avg_pred, values = sk_partial_dependence(pipeline.estimator, X=pipeline._transform(X), features=[feature], grid_resolution=grid_resolution)
```
@angela97lin hmm, apologies if I'm forgetting context here, but was there a reason we couldn't pass in the pipeline rather than the estimator?
I think this is fine, and that you are right that we should provide the features for the estimator here. The side-effect is that the feature names provided as input need to be names of the generated features provided to the estimator, not the user-defined features. I suggest we file something to track passing pipelines in here instead of estimators.
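To illustrate that side-effect, a small continuation of the snippets above, assuming the fitted `pipeline` and `X_holdout` from gsheni's example (the column names shown in the comments are hypothetical examples of what encoders and the DateTime Featurization Component might produce, not output from this repository):

```python
# Before the component graph runs, the data has the user's original columns.
print(list(X_holdout.columns))
# e.g. ['country', 'order_date', 'quantity', ...]

# After the pipeline's components transform the data, the estimator sees
# generated feature names, such as one-hot and datetime-derived columns.
X_t = pipeline._transform(X_holdout)
print(list(X_t.columns))
# e.g. ['country_US', 'country_UK', 'order_date_year', 'order_date_month', 'quantity', ...]

# So with the estimator-based call, `feature` must name one of the generated
# columns (e.g. 'country_US'), not the user-defined 'country'.
```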
Aha, I think I went with estimators because the attributes we had to add belonged to the scikit-learn estimators. But since they're temporary, it's okay to add them to our pipelines and remove them afterwards. Updating to use pipelines actually lets us calculate and support non-numeric columns too! :D
@angela97lin I took a quick look (sorry I missed your comments yesterday). Looks great; I'm glad pipelines worked here! I left one idea for an improvement.
```python
    # Delete scikit-learn attributes that were temporarily set
    del pipeline._estimator_type
    del pipeline.feature_importances_
```
Very cool!

One note: if `sk_partial_dependence` throws, we'll never reach this code. You could use `finally`, though:

```python
try:
    avg_pred, values = sk_partial_dependence(pipeline, X=X, features=[feature], grid_resolution=grid_resolution)
finally:
    del pipeline._estimator_type
    del pipeline.feature_importances_
```

This will allow the exception to still propagate up the stack because we didn't add an `except` clause, but it will guarantee that the code in `finally` runs before the exception continues on!

Not critical, but it would be cool to add this sometime.
Closes #1125
This PR adds support for non-numeric columns in our partial dependence methods by transforming the input before passing it to the scikit-learn method.
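A condensed sketch of that approach, pieced together from the diff snippets quoted above (the `pipeline._transform` call, the `sk_partial_dependence` alias, and the tuple-style return all follow those snippets; the wrapper name is illustrative and this is not the merged implementation):

```python
from sklearn.inspection import partial_dependence as sk_partial_dependence

def partial_dependence_sketch(pipeline, X, feature, grid_resolution=100):
    # Encode non-numeric columns (categoricals, datetimes, text) with the
    # pipeline's own components before handing the data to scikit-learn.
    X_t = pipeline._transform(X)
    # Note: after the transform, `feature` must name one of the generated
    # columns seen by the estimator, as discussed in the review above.
    avg_pred, values = sk_partial_dependence(
        pipeline.estimator, X=X_t, features=[feature],
        grid_resolution=grid_resolution)
    return avg_pred, values
```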