Skip to content

Partial Dependence Fast Mode#3753

Merged
tamargrey merged 37 commits into
mainfrom
partial-dependence-performance
Nov 9, 2022
Merged

Partial Dependence Fast Mode#3753
tamargrey merged 37 commits into
mainfrom
partial-dependence-performance

Conversation

@tamargrey

@tamargrey tamargrey commented Oct 13, 2022

Copy link
Copy Markdown
Contributor

Optimize partial dependence by avoiding the repeated transformations that come from running pipeline.predict(X) on the entire dataset for every grid value. This implementation transforms only the features whose partial dependence we are calculating by running pipeline.transform_all_but_final(X[[single_col]]), updating the transformed full X, and running estimator.predict on the updated transformed X.

Has the following limitations:

  • Fast mode cannot be used if the pipeline contains components that rely on more than one column to fit and transform data correctly.
    • If there is a way of handling such components for fast mode, it will exist in a _handle_partial_dependence_fast_mode method on the component itself.
    • Known components with this issue that have not been handled are Oversampler, StackedEnsembleRegressor, and StackedEnsembleClassifier. They are blocked with a _can_be_used_for_fast_partial_dependence = False property.
    • Known components that have handed this in some way are DFSTransformer, RFClassifierSelectFromModel, and RFRegressorSelectFromModel.
  • Requires that training X and y are passed in in order to properly refit cloned pipelines with the correct data. Spoke to @eccabay about how X_train is known to be needed in partial dependence parameters for future time series work. y_train is known to be needed for the oversampler, which is currently blocked, so we could in theory remove it for now, but I think keeping it is a good failsafe for any components that might use for fit and transform that we missed here.

Note that listed above are only the known components with problems in fast mode. This implementation provides some great performance improvements, but there are risks, as it is not hard to accidentally add or change a component in such a way as to make it not compatible with fast mode. Therefore, it is an opt-in API.

@tamargrey tamargrey force-pushed the partial-dependence-performance branch from ff43db8 to 460f0c8 Compare October 13, 2022 19:22
@codecov

codecov Bot commented Oct 13, 2022

Copy link
Copy Markdown

Codecov Report

Merging #3753 (48d5465) into main (132e712) will increase coverage by 0.1%.
The diff coverage is 99.7%.

@@           Coverage Diff           @@
##            main   #3753     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        343     344      +1     
  Lines      35807   36080    +273     
=======================================
+ Hits       35670   35942    +272     
- Misses       137     138      +1     
Impacted Files Coverage Δ
evalml/tests/component_tests/test_components.py 99.0% <88.9%> (-0.1%) ⬇️
...derstanding/_partial_dependence_fast_mode_utils.py 100.0% <100.0%> (ø)
...l/model_understanding/_partial_dependence_utils.py 99.4% <100.0%> (+0.1%) ⬆️
...odel_understanding/partial_dependence_functions.py 100.0% <100.0%> (ø)
evalml/pipelines/components/component_base.py 100.0% <100.0%> (ø)
...lines/components/ensemble/stacked_ensemble_base.py 100.0% <100.0%> (ø)
...transformers/feature_selection/feature_selector.py 100.0% <100.0%> (ø)
...ponents/transformers/preprocessing/featuretools.py 100.0% <100.0%> (ø)
...es/components/transformers/samplers/oversampler.py 100.0% <100.0%> (ø)
...del_understanding_tests/test_partial_dependence.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@tamargrey tamargrey force-pushed the partial-dependence-performance branch from 99d2367 to ab5b813 Compare October 26, 2022 16:10
X_t_single_col = X_t_single_col[cols_to_replace]

# --> not keeping in woodwork - problematic?
X_t[cols_to_replace] = X_t_single_col

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if I should be adding the columns into X_t though woodwork or not. This X_t gets passed to estimator.predict, so I would think that it'd be problematic if the woodwork schema was invalid. What I think is happening here to allow this to work is that the cols_to_replace are the correct types to not break the woodwork schema.

But maybe I should just go through woodwork to be safe. Open to hearing other peoples' thoughts!

# --> this won't make custom components invalid for fast mode, but lets us opt into it.
# Given that fast mode isn't default behavior and we'll have proper warnings about not
# using it for custom components, (maybe we raise a warning if we find custom?), this makes the most sense
_can_be_used_for_fast_partial_dependence = True

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to highlight this behavior: Any component defaults to being valid for fast mode, so if a user created a custom component or a developer added a new component, it's possible they might get incorrect PD results for pipelines containing it.

I added a test in test_components.py that will raise an error when a new component is added to evalml, forcing developers to update the test and hopefully spend a little time thinking about if their component should be used for fast mode. It should be very quick to update, and devs can just default to disallowing it if they want. If people think that's a pain in the butt, I can remove it.

There's nothing currently in place to stop custom components, but I was considering adding a warning using the all_components util to let users know the results may be wrong if a custom component is detected. Does anyone feel strongly about whether this warning should exist or not?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its ok for now since fast mode is off by default but we should file an issue to track it!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue: #3817

While thinking about this more, it occurred to me that we could add a check in fast mode itself as to whether the pipeline relies on multiple columns. I think that check will be a little tricky to implement correctly, and we should test the impact on fast mode's runtime, so we shouldn't do that here, but I outlined the idea in the issue!

A new instance of this pipeline with identical components.
"""
# --> why is there no setting of the threshold for binary like there is in clone?
return self.__class__(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really relevant to the rest of the PR, but I noticed that the following check that's in clone isn't in this method and wanted to see if anyone knew why:

        if is_binary(self.problem_type):
            clone.threshold = self.threshold

prediction_method = prediction_object.predict_proba
if fast_mode and no_features_passed_to_estimator:
original_predictions = prediction_method(X_eval)
original_predictions_mean = np.mean(original_predictions, axis=0)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory we could do this for slow mode too-- it would require us to calculate pipeline.transform_all_but_final(X_eval), though, and I didn't do a full set of analysis on whether that ends up saving time in many cases, so I decided to not add it in for now. I created an issue to look into adding it at a separate time.

@tamargrey tamargrey changed the title [Draft] Partial Dependence Optimization POC Partial Dependence Fast Mode Nov 1, 2022
@tamargrey tamargrey marked this pull request as ready for review November 1, 2022 17:09
@tamargrey tamargrey force-pushed the partial-dependence-performance branch from 7974af5 to 64f1f4d Compare November 2, 2022 16:01

@jeremyliweishih jeremyliweishih left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just some nits and questions but overall looks great! Is there an issue add to our docs? If not we should create one so we can tell users to use fast mode.

Comment thread evalml/model_understanding/_partial_dependence_utils.py Outdated
Comment thread evalml/model_understanding/_partial_dependence_utils.py Outdated
Comment thread evalml/model_understanding/_partial_dependence_fast_mode_utils.py Outdated
Comment thread evalml/pipelines/components/component_base.py
cloned_feature_pipelines = {}
for variable in features:
# Don't fit pipelines if the feature has no impact on predictions
if not variable_has_features_passed_to_estimator[variable]:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give me an example of when this happens in practice? Can't think of one off the top of my head!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the RFRegressorSelectFromModel decides that the same feature you're trying to calculate PD for is unimportant and should be dropped prior to passing the data to the estimator.

It's what lets us just use the original data's average predictions as PD values. We know we won't have to change that feature's values and re-transform, so there's no need to waste time fitting a cloned pipeline for that feature!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha - I think I understand how everything is linked now. Thanks!

# --> this won't make custom components invalid for fast mode, but lets us opt into it.
# Given that fast mode isn't default behavior and we'll have proper warnings about not
# using it for custom components, (maybe we raise a warning if we find custom?), this makes the most sense
_can_be_used_for_fast_partial_dependence = True

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its ok for now since fast mode is off by default but we should file an issue to track it!

"""
return self.fit(X, y).transform(X, y)

def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the feature selector decided to drop a feature - should we be calculating partial dependence for it? I'm not sure what the current behavior is but I feel like we shouldn't be if the feature selector deemed the feature unnecessary. Don't think we need to solve in this PR but might be worth putting a issue up to discuss!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your test case! So does this mean features dropped by the feature selector show no impact on predictions? I'm a little confused how the code chunk below affects it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(forgive me if this is now unnecessary after my comment above!)

We can use the information from the original pipeline to determine if the feature gets dropped or not, and if it doesn't get dropped, this code lets us keep it. It basically turns off the Feature Selector for the single column-fitted pipelines. We say "take 100% of features" and "take features above the importance threshold of 0", so that we don't drop the single feature we're fitting that cloned pipeline with!

It's definitely a bit confusing to look at completely out of context, though. Let me know if you think the docstring isn't descriptive enough.

As to your other point: I don't know that it should be up to the partial dependence function to say that it won't calculate partial dependence for you if it deems the feature to not have an impact on predictions. For all we know, the partial dependence plot could be the way a user discovers this, and I could see it being useful to emit a warning or add something to the plot that indicates why this feature's PD is static.

If a user wants to not calculate PD at all in this case, I say it's up to them to decide that, but maybe we should be making that information (whether or not a feature gets passed to the estimator at all) more readily available. I can make an issue to think more about the concept of knowing that a feature won't have an impact on predictions and how that information can be accessible to users.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no this is an amazing explanation! it might be good to provide a little more context in the docstring 😄

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a Return section to the docstring that hopefully explains this a little--let me know if it makes sense to you!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue for adding a feature to show whether a feature will be passed to the estimator: #3818

@eccabay eccabay left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks solid to me! Most of my comments are nitpicks related to docstrings. Awesome work!

Comment thread evalml/model_understanding/_partial_dependence_fast_mode_utils.py Outdated

# Create a fit pipeline for each feature
cloned_feature_pipelines = {}
for variable in features:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work if features is a string, as the docstring says it can be

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good call. I also think it should just be (list(str)), because it's getting passed in from _partial_dependence_calculation, which already says it'll just be a list of strings at that point.

Comment thread evalml/model_understanding/_partial_dependence_fast_mode_utils.py Outdated
Comment thread evalml/model_understanding/_partial_dependence_fast_mode_utils.py Outdated
Comment thread evalml/model_understanding/_partial_dependence_utils.py Outdated
Comment thread evalml/model_understanding/partial_dependence_functions.py Outdated
def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters):
"""Updates pipeline parameters to not drop any features based off of feature importance.

Note:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A style mega nit is that we don't need this in its own section. We can mirror other parts of the docstring here by just keeping it as its own unmarked separate paragraph.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eccabay Do you mean like this?
image

Comment on lines +89 to +91
to not drop any features. This is needed because the cloned pipeline that uses these features
will be fit on a single column, and we do not want to drop that column. We can use the original
pipeline to determine if that feature gets dropped or not.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels odd and/or slightly redundant to mention the "this is needed" within the return statement. It might fit better and be easier to reference in the note section instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call -- removed that section here and added another sentence above, so the note section now says:

This is needed, because fast mode refits cloned pipelines on single columns,
so categorical columns that have one-hot encoding applied may lose some
of their encoded columns with the default parameters. Therefore, we update the
parameters here to not drop any columns, and use the original
pipeline to determine if that feature gets dropped or not.

Comment on lines +2719 to +2721
error = re.escape(
f"cannot run partial dependence fast mode",
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be something w.r.t regex I'm unfamiliar with, but why do we need re.escape here? Does keeping this as a regular string fail?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I think this was just an accidental holdover from when I tried to include the full error message with the component name and the extra quotation marks and whatnot weren't playing nice. I'll remove!

Comment on lines +32 to +33
if X_train is None or y_train is None:
raise ValueError("Training data is required for partial dependence fast mode.")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is purely my opinion, but it may be neater to raise this error higher up in the call stack - potentially even in the highest level partial_dependence function. Just as fewer things to wade through when debugging an error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that makes sense--moved up to partial_dependence with all the other error logic

@tamargrey tamargrey force-pushed the partial-dependence-performance branch 2 times, most recently from 29053e3 to 2eb025f Compare November 9, 2022 18:48
@tamargrey tamargrey force-pushed the partial-dependence-performance branch from 2eb025f to 48d5465 Compare November 9, 2022 20:16
@tamargrey tamargrey merged commit 9057491 into main Nov 9, 2022
@tamargrey tamargrey deleted the partial-dependence-performance branch November 9, 2022 21:33
@chukarsten chukarsten mentioned this pull request Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants