Speed up permutation importance #1762
Conversation
Force-pushed from 6bfd79e to 1db411b
Codecov Report

@@             Coverage Diff             @@
##              main    #1762     +/-   ##
==========================================
+ Coverage    100.0%   100.0%    +0.1%
==========================================
  Files          247      248       +1
  Lines        19679    19871     +192
==========================================
+ Hits         19670    19863     +193
+ Misses           9        8       -1

Continue to review the full report at Codecov.
Force-pushed from 2ed9772 to 8bd4890
@@ -68,3 +68,6 @@ def fit_transform(self, X, y=None):
        except MethodPropertyNotFoundError as e:
            raise e
        return _convert_to_woodwork_structure(X_t)

    def _get_feature_provenance(self):
Keeping this private for now since it's only used for permutation importance, but this same mechanism can help with #1347.
@@ -523,70 +523,6 @@ def test_graph_confusion_matrix_title_addition(X_y_binary):
    assert fig_dict['layout']['title']['text'] == 'Confusion matrix with added title text, normalized using method "true"'


def test_get_permutation_importance_invalid_objective(X_y_regression, linear_regression_pipeline_class):
There has been talk about splitting up test_graphs.py. I figured I would get a head start by creating test_permutation_importance.py and moving all of the permutation tests there.
Good call
Nice, thank you for doing this!
@pytest.mark.parametrize('pipeline_class, parameters', test_cases)
@patch('evalml.pipelines.PipelineBase._supports_fast_permutation_importance', new_callable=PropertyMock)
def test_fast_permutation_importance_matches_sklearn_output(mock_supports_fast_importance, pipeline_class, parameters,
I think it's important to test that our optimization calculates the values correctly. Mocking doesn't really make sense here, but since I'm only using 100 points and n_jobs=1, I think it's ok.
Agreed!
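As an aside for readers unfamiliar with the decorator in the test above: a small, self-contained demo of the PropertyMock pattern, using a toy class rather than evalml's PipelineBase, showing how one test can force both code paths.

from unittest.mock import PropertyMock, patch

# Toy stand-in for PipelineBase; not the evalml implementation.
class ToyPipeline:
    @property
    def _supports_fast_permutation_importance(self):
        return False

with patch.object(ToyPipeline, '_supports_fast_permutation_importance', new_callable=PropertyMock) as mock_prop:
    mock_prop.return_value = True   # force the fast path
    assert ToyPipeline()._supports_fast_permutation_importance is True
    mock_prop.return_value = False  # force the slow path
    assert ToyPipeline()._supports_fast_permutation_importance is False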
Force-pushed from 55aef58 to 0073b7f
Looks good! The time speedup is pretty awesome!
evalml/model_understanding/graphs.py
Outdated
def _fast_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_seed=None):
    """Calculate permutation importance faster by onlu computing the estimator features once.
typo: onlu
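For context, a minimal sketch of the idea behind _fast_permutation_importance (not the evalml source; final_estimator, X_t, provenance, and scorer are placeholder names): the transformers run once, and each shuffle only pays for scoring the final estimator on the already computed feature matrix, with all columns derived from the same original feature permuted together.

import numpy as np
import pandas as pd

def fast_permutation_importance_sketch(final_estimator, X_t, y, provenance, scorer, n_repeats=5, seed=0):
    """Permute the already transformed columns derived from each original feature."""
    rng = np.random.default_rng(seed)
    baseline = scorer(final_estimator, X_t, y)
    importances = {}
    for original_feature, derived_cols in provenance.items():
        saved = X_t[derived_cols].copy()
        drops = []
        for _ in range(n_repeats):
            shuffled = rng.permutation(len(X_t))
            X_t[derived_cols] = saved.iloc[shuffled].to_numpy()  # same shuffle for the whole group
            drops.append(baseline - scorer(final_estimator, X_t, y))
        X_t[derived_cols] = saved  # restore before moving to the next feature
        importances[original_feature] = float(np.mean(drops))
    return pd.Series(importances).sort_values(ascending=False)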
evalml/pipelines/component_graph.py
Outdated
@@ -92,7 +93,12 @@ def fit(self, X, y):
            X (ww.DataTable, pd.DataFrame): The input training data of shape [n_samples, n_features]
            y (ww.DataColumn, pd.Series): The target training data of length [n_samples]
        """
        if isinstance(X, ww.DataTable):
why not use _convert_to_woodwork and convert_woodwork_types_wrapper here instead of doing these two checks to convert to pandas?
Good point! I'll delete this in favor of that.
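A hedged sketch of what the reviewer's suggestion could look like; the exact helper names and import path are assumptions based on this conversation and may differ in the actual codebase.

# Assumed import path; the conversation only names the helpers, not their module.
from evalml.utils.gen_utils import (
    _convert_to_woodwork_structure,
    _convert_woodwork_types_wrapper,
)

def fit(self, X, y):
    # Normalize any supported input (pd.DataFrame, ww.DataTable, ...) to woodwork,
    # then back to pandas with consistent dtypes, instead of isinstance checks.
    X = _convert_woodwork_types_wrapper(_convert_to_woodwork_structure(X).to_dataframe())
    y = _convert_woodwork_types_wrapper(_convert_to_woodwork_structure(y).to_series())
    ...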
        scores = pipeline.score(X, y, objectives=[objective])
        return scores[objective.name] if objective.greater_is_better else -scores[objective.name]
    perm_importance = sk_permutation_importance(pipeline, X, y, n_repeats=n_repeats, scoring=scorer, n_jobs=n_jobs, random_state=random_state)
    if pipeline._supports_fast_permutation_importance:
nice!
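A toy illustration of the negation in the scorer quoted above (the objective object here is a stand-in, not an evalml class): sklearn's permutation_importance always treats a larger scorer value as better, so "lower is better" objectives like Log Loss must be flipped.

class FakeObjective:
    name = "Log Loss Binary"
    greater_is_better = False

def to_sklearn_score(raw_score, objective):
    # sklearn maximizes the scorer, so negate objectives where lower is better
    return raw_score if objective.greater_is_better else -raw_score

assert to_sklearn_score(0.35, FakeObjective()) == -0.35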
    slow_scores = calculate_permutation_importance(pipeline, X, y, objective='Log Loss Binary',
                                                   random_state=0)

    pd.testing.assert_frame_equal(fast_scores, slow_scores)
Test looks good! Since we're ideally getting a speed boost from the faster implementation, would we be able to add assertions that the time for fast_scores is <= the time for slow_scores?
Good point! I tried a few times locally and the time for fast scores is smaller than slow scores. I'm pushing this up and keeping an eye on whether the results vary from run to run.
It flaked on windows so I'm reverting it lol.
classic windows lol
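For reference, a sketch of the kind of wall-clock assertion discussed in this thread (the helper and the commented call sites are hypothetical); as noted above, this proved flaky on Windows CI and was reverted.

import time

def time_call(fn, *args, **kwargs):
    # Return how long a single call takes, in seconds.
    start = time.perf_counter()
    fn(*args, **kwargs)
    return time.perf_counter() - start

# Hypothetical usage inside the test, with the fast path toggled via the PropertyMock:
# fast_time = time_call(calculate_permutation_importance, pipeline, X, y,
#                       objective='Log Loss Binary', random_state=0)
# slow_time = time_call(...)  # same call with mock_supports_fast_importance.return_value = False
# assert fast_time <= slow_time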
Like it. Lots of good work here. A few nits/suggestions/questions but nothing to stop a merge.
evalml/model_understanding/graphs.py
Outdated
def _fast_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_seed=None):
    """Calculate permutation importance faster by onlu computing the estimator features once.
typo: "only"
evalml/model_understanding/graphs.py
Outdated
        pipeline (PipelineBase or subclass): Fitted pipeline
        X (ww.DataTable, pd.DataFrame): The input data used to score and compute permutation importance
        y (ww.DataColumn, pd.Series): The target data
        objective (str, ObjectiveBase): Objective to score on
Nittiest of nits: periods. It is private, though, so do we want docstrings anyway?
Good point. I think the docstring for calculate_permutation_importance is enough. I'll add a type hint for the return type there!
evalml/model_understanding/graphs.py
Outdated
        random_state (int): The random seed. Defaults to 0.

    Returns:
        Mean feature importance scores over n_repeats number of shuffles.
Type hinting for the return?
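A sketch of the type-hinted signature being discussed (abbreviated, not the exact evalml source); the return type follows from the tests comparing results with pd.testing.assert_frame_equal.

import pandas as pd

def calculate_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_state=0) -> pd.DataFrame:
    """Calculates permutation importance for features.

    Returns:
        pd.DataFrame: Mean feature importance scores over n_repeats number of shuffles.
    """
    ...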
    def _get_feature_provenance(self, component_list, input_feature_names):
        """Get the feature provenance for each feature in the input_feature_names.

        The provenance is a mapping from the original feature names in the dataset to a list of
Solid, thanks for putting that in here.
evalml/pipelines/component_graph.py
Outdated
        if not component_list:
            return {}
What's the intent here?
Pipelines with empty component graphs are valid so I wanted to exit before we get the final_estimator_features.
Force-pushed from 4d25044 to 4e01fa6
Will take a look shortly. I love the performance numbers! 10x speedup for text is 🔥
That seems like a reasonable limitation for the immediate future. We'll need to revisit this at some point, yes? Transformers which create columns today: OHE, target encoder, datetime, text, timeseries delayed features, and soon seasonality. We currently have LDA/PCA transformers defined but aren't using them in automl--adding those would definitely violate this limitation.
@dsherry Agreed on enhancing this implementation to support more transformers! When we get there, we also need to consider pipelines with multiple estimators. The estimators act like transformers in that case - they ingest many columns and output many columns.
This is some really great work!! The speedups are chef's kiss. Implementation looks great, just left a few minor questions and comments (some just so I understand this better hehe).
Also, _get_feature_provenance is really interesting to see from a conceptual level, to better understand how our components transform data. (Also useful since I've been working on something that describes transformations :P)
@@ -3,9 +3,12 @@ Release Notes

**Future Releases**
    * Enhancements
        * Sped up permutation importance for some pipelines :pr:`1762`
🥳 !!
evalml/pipelines/component_graph.py
Outdated
        features that were created from that original feature.

        For example, after fitting a OHE on a feature called 'cats', with categories 'a' and 'b', the
        provenance, would have the following entry: {'cats': ['a', 'b']}.
nit pick: the provenance, --> the provenance (no comma needed :P)
evalml/pipelines/component_graph.py
Outdated
            input_feature_names (list(str)): Names of the features in the input dataframe.

        Returns:
            dictionary: mapping of feature name of set feature names that were created from that feature.
mapping of feature name of --> mapping of feature name to ?
Thank you! 😅
                if component_input in out_feature:
                    provenance[in_feature] = out_feature.union(set(component_output))

        # Get rid of features that are not in the dataset the final estimator uses
In what cases does this not happen? Is this just when columns are dropped?
Yes! The other case is when we create features from a feature but delete the intermediate feature, e.g. we create a month column from a datetime feature and then we OHE the month column. Since the OHE drops the month column, we want to remove it from the provenance of the datetime feature.
Oooo got it, thanks for explaining! sgtm 😀
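A toy illustration of the pruning described above, with hypothetical column names: the DateTime featurizer creates a month column, the OHE replaces it with dummy columns, and only columns the final estimator actually sees are kept in the provenance of the original datetime feature.

raw_provenance = {
    'signup_date': {'signup_date_month', 'signup_date_month_Jan', 'signup_date_month_Feb'},
}
final_estimator_features = {'signup_date_month_Jan', 'signup_date_month_Feb', 'amount'}

# Drop intermediate columns (like 'signup_date_month') that the final estimator never sees.
pruned = {feature: cols & final_estimator_features for feature, cols in raw_provenance.items()}
assert pruned == {'signup_date': {'signup_date_month_Jan', 'signup_date_month_Feb'}}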
        for col_name in self._date_time_col_names:
            provenance[col_name] = []
            for feature in self.parameters['features_to_extract']:
                provenance[col_name].append(f'{col_name}_{feature}')
I know this is from our current implementation of how we create new features, but it makes me wonder if we can abstract that logic away so we don't have to change it in two places. Just a thought :p Looks good!
Or maybe somewhat related, but it seems like in other components we calculate self._provenance when we fit/transform, but here we reconstruct it almost from scratch. Could we also follow that similar pattern? Would that be helpful?
Good observation! I like the "from scratch" pattern because it keeps the definition of the provenance separate from the transform logic, but you're right that it can lead to some duplication. I think in most cases I tried to follow the "from scratch" pattern, but I couldn't use it everywhere because the OHE, for example, has special logic for making names unique, so the provenance basically has to happen in get_feature_names.
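A minimal sketch of the alternative pattern mentioned above (a toy transformer, not an evalml component): provenance is recorded at the moment each derived column is created, rather than reconstructed afterwards.

import pandas as pd

class ToyMonthExtractor:
    """Toy transformer that records its own provenance while creating columns."""

    def __init__(self):
        self._provenance = {}

    def transform(self, X):
        X_t = X.copy()
        for col in X_t.select_dtypes(include='datetime').columns:
            new_col = f'{col}_month'
            X_t[new_col] = X_t[col].dt.month
            self._provenance[col] = [new_col]  # recorded as the feature is created
        return X_t.drop(columns=list(self._provenance))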
evalml/pipelines/component_graph.py
Outdated
@@ -231,6 +235,56 @@ def _compute_features(self, component_list, X, y=None, fit=False):
            output_cache[component_name] = output
        return output_cache

    def _get_feature_provenance(self, component_list, input_feature_names):
Curious, why component_list as an argument? I noticed the only time we use this is in calling self._feature_provenance = self._get_feature_provenance(self.compute_order, X.columns); would it make sense to just remove the arguments and internally use the component graph's compute_order? Or is this for flexibility? Are there use cases for this? (Maybe computing not the whole graph but some path in it? 🤔)
Great point! This is an oversight on my part. I think I was refactoring some code and thought I could make it a static method but that's not really reasonable.
I'm deleting the component_list parameter now!
test_cases = [(LinearPipelineWithDropCols, {"Drop Columns Transformer": {'columns': ['country']}}),
              (LinearPipelineWithImputer, {}),
              (LinearPipelineSameFeatureUsedByTwoComponents, {'DateTime Featurization Component': {'encode_as_categories': True}}),
Amazing. I was wondering what happens when the datetime featurizer creates categorical columns that the OHE then uses, since it'd rely on the output cols of the datetime featurizer as provenance. 🤩
    # Do this to make sure we use the same int as sklearn under the hood
    random_state = np.random.RandomState(0)
    random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
Is this different behavior from what we use in get_random_seed / can we just use that? 👀
Unfortunately it is! Great suggestion though!
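A small self-contained check of the idea in the snippet above: drawing the seed from a fresh RandomState(0) always yields the same integer, which is how the test can reproduce the shuffles sklearn performs internally with random_state=0.

import numpy as np

# Two RandomState(0) generators drawn the same way produce the same int, so seeding
# the fast implementation with this value can match sklearn's permutations exactly.
seed_a = np.random.RandomState(0).randint(np.iinfo(np.int32).max + 1)
seed_b = np.random.RandomState(0).randint(np.iinfo(np.int32).max + 1)
assert seed_a == seed_b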
}


test_cases = [(LinearPipelineWithDropCols, {"Drop Columns Transformer": {'columns': ['country']}}),
Wow, this is parametrization taken to the next level LMAO
Force-pushed from ad41815 to 3f0cbbd
Force-pushed from 3f0cbbd to 3736c42
Pull Request Description
Fixes #1416
Currently this only supports pipelines where each feature is created from at most one other feature.
Handling pipelines where that is not the case (PCA, multiple estimators) will be more complicated. We'd have to cache the features before each of these transformers/estimators and do the permutation there.
10x speedup if pipeline has text features
On 500 rows of the fraud dataset:
About 2x speedup on a "standard" pipeline
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:`123`.
.