
Speed up permutation importance #1762

Merged

freddyaboulton merged 17 commits into main from 1416_perm_importance_speedup on Feb 5, 2021

Conversation

Contributor

@freddyaboulton freddyaboulton commented Jan 29, 2021

Pull Request Description

Fixes #1416

Currently this only supports pipelines where each feature is created from at most one other feature.

Handling pipelines where that is not the case (PCA, multiple estimators) will be more complicated. We'd have to cache the features before each of these transformers/estimators and do the permutation there.
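
For context, a minimal sketch of the idea (not the actual evalml code): run the data through every component except the final estimator once, then for each original feature permute only its derived columns (looked up via the provenance mapping) and re-score the estimator. The `transform_all_but_final` and `score_features` methods and the `provenance` argument below are assumptions made purely for illustration.

```python
# Illustrative sketch only, not the evalml implementation. Assumes a fitted
# `pipeline` with two hypothetical methods: `transform_all_but_final(X)` (runs
# every component except the final estimator) and `score_features(X_t, y)`
# (scores the estimator on already-transformed features), plus a `provenance`
# dict mapping each original column to the transformed columns derived from it.
import numpy as np
import pandas as pd


def fast_permutation_importance(pipeline, X, y, provenance, n_repeats=5, random_seed=0):
    """Permute in transformed-feature space so the transformers only run once."""
    rng = np.random.default_rng(random_seed)
    X_t = pipeline.transform_all_but_final(X)   # computed once and reused below
    baseline = pipeline.score_features(X_t, y)

    importances = {}
    for original_col in X.columns:
        # Shuffle every transformed column that traces back to this input column.
        derived = list(provenance.get(original_col, [original_col]))
        drops = []
        for _ in range(n_repeats):
            permuted = X_t.copy()
            idx = rng.permutation(len(permuted))
            permuted[derived] = X_t[derived].to_numpy()[idx]
            drops.append(baseline - pipeline.score_features(permuted, y))
        importances[original_col] = np.mean(drops)
    return pd.Series(importances, name="importance")
```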

10x speedup if pipeline has text features

On 500 rows of the fraud dataset:

[timing comparison chart: original vs. fast permutation importance]

About 2x speedup on a "standard" pipeline

[timing comparison chart: original vs. fast permutation importance]


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.


codecov bot commented Jan 29, 2021

Codecov Report

Merging #1762 (3736c42) into main (9b39336) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1762     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         247      248      +1     
  Lines       19679    19871    +192     
=========================================
+ Hits        19670    19863    +193     
+ Misses          9        8      -1     
Impacted Files Coverage Δ
...lml/tests/model_understanding_tests/test_graphs.py 100.0% <ø> (ø)
evalml/model_understanding/graphs.py 100.0% <100.0%> (+0.3%) ⬆️
evalml/pipelines/component_graph.py 100.0% <100.0%> (ø)
...components/transformers/encoders/onehot_encoder.py 100.0% <100.0%> (ø)
...components/transformers/encoders/target_encoder.py 100.0% <100.0%> (ø)
.../transformers/preprocessing/datetime_featurizer.py 100.0% <100.0%> (ø)
...lines/components/transformers/preprocessing/lsa.py 100.0% <100.0%> (ø)
...ents/transformers/preprocessing/text_featurizer.py 100.0% <100.0%> (ø)
...l/pipelines/components/transformers/transformer.py 100.0% <100.0%> (ø)
evalml/pipelines/pipeline_base.py 100.0% <100.0%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b39336...3736c42.

@freddyaboulton freddyaboulton added the enhancement An improvement to an existing feature. label Jan 29, 2021
@freddyaboulton freddyaboulton marked this pull request as ready for review January 29, 2021 21:38
@@ -68,3 +68,6 @@ def fit_transform(self, X, y=None):
except MethodPropertyNotFoundError as e:
raise e
return _convert_to_woodwork_structure(X_t)

def _get_feature_provenance(self):
Contributor Author

@freddyaboulton freddyaboulton Jan 29, 2021

Keeping this private for now since it's only used for permutation importance but this same mechanism can help with #1347

@@ -523,70 +523,6 @@ def test_graph_confusion_matrix_title_addition(X_y_binary):
assert fig_dict['layout']['title']['text'] == 'Confusion matrix with added title text, normalized using method "true"'


def test_get_permutation_importance_invalid_objective(X_y_regression, linear_regression_pipeline_class):
Contributor Author

There has been talk about splitting up test_graphs.py. I figured I would get a head start by creating test_permutation_importance.py and moving all of the permutation tests there.

Contributor

Good call

Contributor

Nice, thank you for doing this!


@pytest.mark.parametrize('pipeline_class, parameters', test_cases)
@patch('evalml.pipelines.PipelineBase._supports_fast_permutation_importance', new_callable=PropertyMock)
def test_fast_permutation_importance_matches_sklearn_output(mock_supports_fast_importance, pipeline_class, parameters,
Contributor Author

I think it's important to test that our optimization calculates the values correctly. Mocking doesn't really make sense here but since I'm only using 100 points and n_jobs=1 I think it's ok.

Contributor

Agreed!
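
(For reference, the comparison roughly works like the sketch below: the property that gates the fast path is patched on and off, and the two outputs are asserted equal. The import path and fixture wiring are simplified assumptions; the real test lives in test_permutation_importance.py with the parametrized test_cases.)

```python
# Simplified sketch of the fast-vs-slow comparison; not the exact test code.
from unittest.mock import PropertyMock, patch

import pandas as pd

from evalml.model_understanding.graphs import calculate_permutation_importance


@patch('evalml.pipelines.PipelineBase._supports_fast_permutation_importance',
       new_callable=PropertyMock)
def check_fast_matches_slow(mock_supports_fast, pipeline, X, y):
    # Fast path: permute precomputed estimator features via provenance.
    mock_supports_fast.return_value = True
    fast_scores = calculate_permutation_importance(
        pipeline, X, y, objective='Log Loss Binary', random_state=0)

    # Slow path: fall back to sklearn's permutation_importance over the raw data.
    mock_supports_fast.return_value = False
    slow_scores = calculate_permutation_importance(
        pipeline, X, y, objective='Log Loss Binary', random_state=0)

    pd.testing.assert_frame_equal(fast_scores, slow_scores)
```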

@freddyaboulton freddyaboulton force-pushed the 1416_perm_importance_speedup branch 2 times, most recently from 55aef58 to 0073b7f Compare February 2, 2021 20:19
Contributor

@bchen1116 bchen1116 left a comment

Looks good! The time speedup is pretty awesome!



def _fast_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_seed=None):
"""Calculate permutation importance faster by onlu computing the estimator features once.
Contributor

typo: onlu

@@ -92,7 +93,12 @@ def fit(self, X, y):
X (ww.DataTable, pd.DataFrame): The input training data of shape [n_samples, n_features]
y (ww.DataColumn, pd.Series): The target training data of length [n_samples]
"""
if isinstance(X, ww.DataTable):
Contributor

why not use _convert_to_woodwork and convert_woodwork_types_wrapper here instead of doing these two checks to convert to pandas?

Contributor Author

Good point! I'll delete this in favor of that.

scores = pipeline.score(X, y, objectives=[objective])
return scores[objective.name] if objective.greater_is_better else -scores[objective.name]
perm_importance = sk_permutation_importance(pipeline, X, y, n_repeats=n_repeats, scoring=scorer, n_jobs=n_jobs, random_state=random_state)
if pipeline._supports_fast_permutation_importance:
Contributor

nice!
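
(For anyone reading along: sklearn's permutation_importance assumes higher scores are better, which is why the scorer in the hunk above negates lower-is-better objectives. Roughly, the slow path looks like this sketch, with helper names simplified for illustration:)

```python
# Sketch of the sklearn-backed (slow) path, simplified from the hunk above.
from sklearn.inspection import permutation_importance as sk_permutation_importance


def _build_scorer(objective):
    def scorer(pipeline, X, y):
        scores = pipeline.score(X, y, objectives=[objective])
        # sklearn maximizes the score, so flip the sign for lower-is-better objectives.
        return scores[objective.name] if objective.greater_is_better else -scores[objective.name]
    return scorer


def slow_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_state=0):
    return sk_permutation_importance(pipeline, X, y, n_repeats=n_repeats,
                                     scoring=_build_scorer(objective),
                                     n_jobs=n_jobs, random_state=random_state)
```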

slow_scores = calculate_permutation_importance(pipeline, X, y, objective='Log Loss Binary',
random_state=0)

pd.testing.assert_frame_equal(fast_scores, slow_scores)
Contributor

Test looks good! Since we're ideally getting a speed boost from the faster implementation, could we add an assertion that the time for fast_scores is <= the time for slow_scores?

Contributor Author

Good point! I tried a few times locally and the time for fast scores is smaller than for slow scores. I'm pushing this up and keeping an eye on whether the results vary from run to run.

Contributor Author

It flaked on windows so I'm reverting it lol.

Contributor

classic windows lol

Contributor

@chukarsten chukarsten left a comment

Like it. Lots of good work here. A few nits/suggestions/questions but nothing to stop a merge.



def _fast_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_seed=None):
"""Calculate permutation importance faster by onlu computing the estimator features once.
Contributor

typo: "only"

Comment on lines 326 to 329
pipeline (PipelineBase or subclass): Fitted pipeline
X (ww.DataTable, pd.DataFrame): The input data used to score and compute permutation importance
y (ww.DataColumn, pd.Series): The target data
objective (str, ObjectiveBase): Objective to score on
Contributor

Nittiest of nits: periods. It is private, though, so do we want docstrings anyway?

Contributor Author

Good point. I think the docstring for calculate_permutation_importance is enough. I'll add a type hint for the return type there!

random_state (int): The random seed. Defaults to 0.

Returns:
Mean feature importance scores over n_repeats number of shuffles.
Contributor

Type hinting for the return?

def _get_feature_provenance(self, component_list, input_feature_names):
"""Get the feature provenance for each feature in the input_feature_names.

The provenance is a mapping from the original feature names in the dataset to a list of
Contributor

Solid, thanks for putting that in here.

if not component_list:
return {}
Contributor

What's the intent here?

Contributor Author

Pipelines with empty component graphs are valid so I wanted to exit before we get the final_estimator_features.

@@ -523,70 +523,6 @@ def test_graph_confusion_matrix_title_addition(X_y_binary):
assert fig_dict['layout']['title']['text'] == 'Confusion matrix with added title text, normalized using method "true"'


def test_get_permutation_importance_invalid_objective(X_y_regression, linear_regression_pipeline_class):
Contributor

Good call


@pytest.mark.parametrize('pipeline_class, parameters', test_cases)
@patch('evalml.pipelines.PipelineBase._supports_fast_permutation_importance', new_callable=PropertyMock)
def test_fast_permutation_importance_matches_sklearn_output(mock_supports_fast_importance, pipeline_class, parameters,
Contributor

Agreed!

Contributor

dsherry commented Feb 3, 2021

Will take a look shortly. I love the performance numbers! 10x speedup for text is 🔥
@freddyaboulton is it possible to add data points to that graph? Helpful to see how many samples were taken heh.

Currently this only supports pipelines where each feature is created from at most one other feature.

That seems like a reasonable limitation for the immediate future. We'll need to revisit this at some point, yes? Transformers which create columns today: OHE, target encoder, datetime, text, timeseries delayed features, and soon seasonality. We currently have LDA/PCA transformers defined but aren't using them in automl--adding those would definitely violate this limitation.

Contributor Author

freddyaboulton commented Feb 3, 2021

@dsherry Agreed on enhancing this implementation to support more transformers! When we get there, we'll also need to consider pipelines with multiple estimators. The estimators act like transformers in that case: they ingest many columns and output many columns.

Contributor

@angela97lin angela97lin left a comment

This is some really great work!! The speedups are chef's kiss. Implementation looks great, just left a few minor questions and comments (some just so I understand this better hehe).

Also, _get_feature_provenance is really interesting to see at a conceptual level, to better understand how our components transform data. (Also useful since I've been working on something that describes transformations :P)

@@ -3,9 +3,12 @@ Release Notes

**Future Releases**
* Enhancements
* Sped up permutation importance for some pipelines :pr:`1762`
Contributor

🥳 !!

features that were created from that original feature.

For example, after fitting a OHE on a feature called 'cats', with categories 'a' and 'b', the
provenance, would have the following entry: {'cats': ['a', 'b']}.
Contributor

nit pick: the provenance, --> the provenance (no comma needed :P)

input_feature_names (list(str)): Names of the features in the input dataframe.

Returns:
dictionary: mapping of feature name of set feature names that were created from that feature.
Contributor

mapping of feature name of --> mapping of feature name to ?

Contributor Author

Thank you! 😅

if component_input in out_feature:
provenance[in_feature] = out_feature.union(set(component_output))

# Get rid of features that are not in the dataset the final estimator uses
Contributor

In what cases does this not happen? Is this just when columns are dropped?

Contributor Author

@freddyaboulton freddyaboulton Feb 3, 2021

Yes! The other case is when we create features from a feature but delete the intermediate feature, e.g. we create a month column from a datetime feature and then we OHE the month column. Since the OHE drops the month column, we want to remove it from the provenance of the datetime feature.

Contributor

Oooo got it, thanks for explaining! sgtm 😀
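
(A made-up example of that chain, with invented column names: a datetime column 'ts' produces 'ts_month', the OHE then one-hot encodes and drops 'ts_month', and the pruning step keeps only the columns the final estimator actually sees.)

```python
# Hypothetical illustration only; column names are invented.
# Everything ever derived from 'ts' across the component graph:
raw_provenance = {'ts': {'ts_month', 'ts_month_Jan', 'ts_month_Feb'}}

# Columns that actually reach the final estimator (the OHE dropped 'ts_month'):
final_estimator_features = {'ts_month_Jan', 'ts_month_Feb', 'amount'}

# Mirror of the "get rid of features the final estimator doesn't use" step:
provenance = {col: children & final_estimator_features
              for col, children in raw_provenance.items()}
assert provenance == {'ts': {'ts_month_Jan', 'ts_month_Feb'}}
```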

for col_name in self._date_time_col_names:
provenance[col_name] = []
for feature in self.parameters['features_to_extract']:
provenance[col_name].append(f'{col_name}_{feature}')
Contributor

I know this is from our current implementation of how we create new features; it makes me wonder if we can abstract that logic away so we don't have to change it in two places. Just a thought :p, looks good!

Contributor

Or maybe somewhat related, but it seems like in other components, we calculate self._provenance when we fit/transform, but here we reconstruct it almost from scratch. Could we also follow that similar pattern? Would that be helpful?

Contributor Author

Good observation! I like the "from scratch" pattern because it keeps the definition of the provenance separate from the transform logic, but you're right that it can lead to some duplication. I tried to follow the "from scratch" pattern in most cases, but I couldn't use it everywhere because the OHE, for example, has special logic for making names unique, so its provenance basically has to happen in get_feature_names.

@@ -231,6 +235,56 @@ def _compute_features(self, component_list, X, y=None, fit=False):
output_cache[component_name] = output
return output_cache

def _get_feature_provenance(self, component_list, input_feature_names):
Contributor

Curious, why component_list as an argument? I noticed the only time we use this is in calling
self._feature_provenance = self._get_feature_provenance(self.compute_order, X.columns); would it make sense to just remove the arguments and internally use the component graph's compute_order? Or is this for flexibility? Are there use cases for this? (Maybe computing not the whole graph but some path in it? 🤔 )

Contributor Author

Great point! This is an oversight on my part. I think I was refactoring some code and thought I could make it a static method but that's not really reasonable.

I'm deleting the component_list parameter now!


test_cases = [(LinearPipelineWithDropCols, {"Drop Columns Transformer": {'columns': ['country']}}),
(LinearPipelineWithImputer, {}),
(LinearPipelineSameFeatureUsedByTwoComponents, {'DateTime Featurization Component': {'encode_as_categories': True}}),
Contributor

Amazing. I was wondering what happens when the datetime featurizer creates categorical columns that the OHE then uses, since it'd rely on the output cols of the datetime featurizer as provenance. 🤩


# Do this to make sure we use the same int as sklearn under the hood
random_state = np.random.RandomState(0)
random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
Contributor

Is this different behavior from what we use in get_random_seed / can we just use that? 👀

Contributor Author

Unfortunately it is! Great suggestion though!
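
(For context on the seed-drawing snippet above, and hedged since sklearn internals can change between releases: around sklearn 0.24, permutation_importance(random_state=0) draws one integer seed from RandomState(0) and reuses it for every column's shuffles, so drawing the same integer up front lets the fast path reproduce the exact same permutations.)

```python
# Sketch of why the seed is drawn this way (based on sklearn ~0.24 internals,
# which may change between versions).
import numpy as np

# What sklearn's permutation_importance(random_state=0) does under the hood:
sk_rng = np.random.RandomState(0)
sk_seed = sk_rng.randint(np.iinfo(np.int32).max + 1)

# Drawing the same integer lets the fast path shuffle with the identical seed,
# so its results can be compared exactly against sklearn's.
our_seed = np.random.RandomState(0).randint(np.iinfo(np.int32).max + 1)
assert our_seed == sk_seed
```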


}


test_cases = [(LinearPipelineWithDropCols, {"Drop Columns Transformer": {'columns': ['country']}}),
Contributor

Wow, this is parametrization taken to the next level LMAO

@freddyaboulton freddyaboulton added the performance Issues tracking performance improvements. label Feb 4, 2021
@freddyaboulton freddyaboulton merged commit d92a3cb into main Feb 5, 2021
@freddyaboulton freddyaboulton deleted the 1416_perm_importance_speedup branch February 5, 2021 15:14
@ParthivNaresh ParthivNaresh mentioned this pull request Feb 9, 2021
Successfully merging this pull request may close these issues.

Permutation importance: speed up by precomputing estimator features