Partial Dependence Fast Mode#3753
Conversation
ff43db8 to
460f0c8
Compare
Codecov Report
@@ Coverage Diff @@
## main #3753 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 343 344 +1
Lines 35807 36080 +273
=======================================
+ Hits 35670 35942 +272
- Misses 137 138 +1
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
99d2367 to
ab5b813
Compare
| X_t_single_col = X_t_single_col[cols_to_replace] | ||
|
|
||
| # --> not keeping in woodwork - problematic? | ||
| X_t[cols_to_replace] = X_t_single_col |
There was a problem hiding this comment.
I wasn't sure if I should be adding the columns into X_t though woodwork or not. This X_t gets passed to estimator.predict, so I would think that it'd be problematic if the woodwork schema was invalid. What I think is happening here to allow this to work is that the cols_to_replace are the correct types to not break the woodwork schema.
But maybe I should just go through woodwork to be safe. Open to hearing other peoples' thoughts!
| # --> this won't make custom components invalid for fast mode, but lets us opt into it. | ||
| # Given that fast mode isn't default behavior and we'll have proper warnings about not | ||
| # using it for custom components, (maybe we raise a warning if we find custom?), this makes the most sense | ||
| _can_be_used_for_fast_partial_dependence = True |
There was a problem hiding this comment.
I just wanted to highlight this behavior: Any component defaults to being valid for fast mode, so if a user created a custom component or a developer added a new component, it's possible they might get incorrect PD results for pipelines containing it.
I added a test in test_components.py that will raise an error when a new component is added to evalml, forcing developers to update the test and hopefully spend a little time thinking about if their component should be used for fast mode. It should be very quick to update, and devs can just default to disallowing it if they want. If people think that's a pain in the butt, I can remove it.
There's nothing currently in place to stop custom components, but I was considering adding a warning using the all_components util to let users know the results may be wrong if a custom component is detected. Does anyone feel strongly about whether this warning should exist or not?
There was a problem hiding this comment.
I think its ok for now since fast mode is off by default but we should file an issue to track it!
There was a problem hiding this comment.
Created an issue: #3817
While thinking about this more, it occurred to me that we could add a check in fast mode itself as to whether the pipeline relies on multiple columns. I think that check will be a little tricky to implement correctly, and we should test the impact on fast mode's runtime, so we shouldn't do that here, but I outlined the idea in the issue!
| A new instance of this pipeline with identical components. | ||
| """ | ||
| # --> why is there no setting of the threshold for binary like there is in clone? | ||
| return self.__class__( |
There was a problem hiding this comment.
Not really relevant to the rest of the PR, but I noticed that the following check that's in clone isn't in this method and wanted to see if anyone knew why:
if is_binary(self.problem_type):
clone.threshold = self.threshold865149b to
4e4269b
Compare
| prediction_method = prediction_object.predict_proba | ||
| if fast_mode and no_features_passed_to_estimator: | ||
| original_predictions = prediction_method(X_eval) | ||
| original_predictions_mean = np.mean(original_predictions, axis=0) |
There was a problem hiding this comment.
In theory we could do this for slow mode too-- it would require us to calculate pipeline.transform_all_but_final(X_eval), though, and I didn't do a full set of analysis on whether that ends up saving time in many cases, so I decided to not add it in for now. I created an issue to look into adding it at a separate time.
7974af5 to
64f1f4d
Compare
jeremyliweishih
left a comment
There was a problem hiding this comment.
LGTM - just some nits and questions but overall looks great! Is there an issue add to our docs? If not we should create one so we can tell users to use fast mode.
| cloned_feature_pipelines = {} | ||
| for variable in features: | ||
| # Don't fit pipelines if the feature has no impact on predictions | ||
| if not variable_has_features_passed_to_estimator[variable]: |
There was a problem hiding this comment.
can you give me an example of when this happens in practice? Can't think of one off the top of my head!
There was a problem hiding this comment.
If the RFRegressorSelectFromModel decides that the same feature you're trying to calculate PD for is unimportant and should be dropped prior to passing the data to the estimator.
It's what lets us just use the original data's average predictions as PD values. We know we won't have to change that feature's values and re-transform, so there's no need to waste time fitting a cloned pipeline for that feature!
There was a problem hiding this comment.
gotcha - I think I understand how everything is linked now. Thanks!
| # --> this won't make custom components invalid for fast mode, but lets us opt into it. | ||
| # Given that fast mode isn't default behavior and we'll have proper warnings about not | ||
| # using it for custom components, (maybe we raise a warning if we find custom?), this makes the most sense | ||
| _can_be_used_for_fast_partial_dependence = True |
There was a problem hiding this comment.
I think its ok for now since fast mode is off by default but we should file an issue to track it!
| """ | ||
| return self.fit(X, y).transform(X, y) | ||
|
|
||
| def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters): |
There was a problem hiding this comment.
if the feature selector decided to drop a feature - should we be calculating partial dependence for it? I'm not sure what the current behavior is but I feel like we shouldn't be if the feature selector deemed the feature unnecessary. Don't think we need to solve in this PR but might be worth putting a issue up to discuss!
There was a problem hiding this comment.
I saw your test case! So does this mean features dropped by the feature selector show no impact on predictions? I'm a little confused how the code chunk below affects it.
There was a problem hiding this comment.
(forgive me if this is now unnecessary after my comment above!)
We can use the information from the original pipeline to determine if the feature gets dropped or not, and if it doesn't get dropped, this code lets us keep it. It basically turns off the Feature Selector for the single column-fitted pipelines. We say "take 100% of features" and "take features above the importance threshold of 0", so that we don't drop the single feature we're fitting that cloned pipeline with!
It's definitely a bit confusing to look at completely out of context, though. Let me know if you think the docstring isn't descriptive enough.
As to your other point: I don't know that it should be up to the partial dependence function to say that it won't calculate partial dependence for you if it deems the feature to not have an impact on predictions. For all we know, the partial dependence plot could be the way a user discovers this, and I could see it being useful to emit a warning or add something to the plot that indicates why this feature's PD is static.
If a user wants to not calculate PD at all in this case, I say it's up to them to decide that, but maybe we should be making that information (whether or not a feature gets passed to the estimator at all) more readily available. I can make an issue to think more about the concept of knowing that a feature won't have an impact on predictions and how that information can be accessible to users.
There was a problem hiding this comment.
no this is an amazing explanation! it might be good to provide a little more context in the docstring 😄
There was a problem hiding this comment.
Added a Return section to the docstring that hopefully explains this a little--let me know if it makes sense to you!
There was a problem hiding this comment.
Created an issue for adding a feature to show whether a feature will be passed to the estimator: #3818
eccabay
left a comment
There was a problem hiding this comment.
This looks solid to me! Most of my comments are nitpicks related to docstrings. Awesome work!
|
|
||
| # Create a fit pipeline for each feature | ||
| cloned_feature_pipelines = {} | ||
| for variable in features: |
There was a problem hiding this comment.
I don't think this will work if features is a string, as the docstring says it can be
There was a problem hiding this comment.
Very good call. I also think it should just be (list(str)), because it's getting passed in from _partial_dependence_calculation, which already says it'll just be a list of strings at that point.
| def _handle_partial_dependence_fast_mode(self, X, pipeline_parameters): | ||
| """Updates pipeline parameters to not drop any features based off of feature importance. | ||
|
|
||
| Note: |
There was a problem hiding this comment.
A style mega nit is that we don't need this in its own section. We can mirror other parts of the docstring here by just keeping it as its own unmarked separate paragraph.
| to not drop any features. This is needed because the cloned pipeline that uses these features | ||
| will be fit on a single column, and we do not want to drop that column. We can use the original | ||
| pipeline to determine if that feature gets dropped or not. |
There was a problem hiding this comment.
It feels odd and/or slightly redundant to mention the "this is needed" within the return statement. It might fit better and be easier to reference in the note section instead?
There was a problem hiding this comment.
Good call -- removed that section here and added another sentence above, so the note section now says:
This is needed, because fast mode refits cloned pipelines on single columns,
so categorical columns that have one-hot encoding applied may lose some
of their encoded columns with the default parameters. Therefore, we update the
parameters here to not drop any columns, and use the original
pipeline to determine if that feature gets dropped or not.
| error = re.escape( | ||
| f"cannot run partial dependence fast mode", | ||
| ) |
There was a problem hiding this comment.
This might be something w.r.t regex I'm unfamiliar with, but why do we need re.escape here? Does keeping this as a regular string fail?
There was a problem hiding this comment.
Nope, I think this was just an accidental holdover from when I tried to include the full error message with the component name and the extra quotation marks and whatnot weren't playing nice. I'll remove!
| if X_train is None or y_train is None: | ||
| raise ValueError("Training data is required for partial dependence fast mode.") |
There was a problem hiding this comment.
This is purely my opinion, but it may be neater to raise this error higher up in the call stack - potentially even in the highest level partial_dependence function. Just as fewer things to wade through when debugging an error.
There was a problem hiding this comment.
I think that makes sense--moved up to partial_dependence with all the other error logic
29053e3 to
2eb025f
Compare
…o decide how to handle pd fast mode
2eb025f to
48d5465
Compare

Optimize partial dependence by avoiding the repeated transformations that come from running
pipeline.predict(X)on the entire dataset for every grid value. This implementation transforms only the features whose partial dependence we are calculating by runningpipeline.transform_all_but_final(X[[single_col]]), updating the transformed full X, and runningestimator.predicton the updated transformed X.Has the following limitations:
_handle_partial_dependence_fast_modemethod on the component itself.Oversampler,StackedEnsembleRegressor, andStackedEnsembleClassifier. They are blocked with a_can_be_used_for_fast_partial_dependence = Falseproperty.DFSTransformer,RFClassifierSelectFromModel, andRFRegressorSelectFromModel.X_trainis known to be needed in partial dependence parameters for future time series work.y_trainis known to be needed for the oversampler, which is currently blocked, so we could in theory remove it for now, but I think keeping it is a good failsafe for any components that might use for fit and transform that we missed here.Note that listed above are only the known components with problems in fast mode. This implementation provides some great performance improvements, but there are risks, as it is not hard to accidentally add or change a component in such a way as to make it not compatible with fast mode. Therefore, it is an opt-in API.