
Speed up permutation importance #1762

Merged

freddyaboulton merged 17 commits into main from 1416_perm_importance_speedup on Feb 5, 2021

Conversation

Contributor

@freddyaboulton freddyaboulton commented Jan 29, 2021

Pull Request Description

Fixes #1416

Currently this only supports pipelines where each feature is created from at most one other feature.

Handling pipelines where that is not the case (PCA, multiple estimators) will be more complicated. We'd have to cache the features before each of these transformers/estimators and do the permutation there.
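
For context, a minimal sketch of the idea (not the actual evalml code): run the data through every component except the final estimator once, then for each original feature permute only its derived columns (looked up via the provenance mapping) and re-score the estimator. The `transform_all_but_final` and `score_features` methods and the `provenance` argument below are assumptions made purely for illustration.

```python
# Illustrative sketch only, not the evalml implementation. Assumes a fitted
# `pipeline` with two hypothetical methods: `transform_all_but_final(X)` (runs
# every component except the final estimator) and `score_features(X_t, y)`
# (scores the estimator on already-transformed features), plus a `provenance`
# dict mapping each original column to the transformed columns derived from it.
import numpy as np
import pandas as pd


def fast_permutation_importance(pipeline, X, y, provenance, n_repeats=5, random_seed=0):
    """Permute in transformed-feature space so the transformers only run once."""
    rng = np.random.default_rng(random_seed)
    X_t = pipeline.transform_all_but_final(X)   # computed once and reused below
    baseline = pipeline.score_features(X_t, y)

    importances = {}
    for original_col in X.columns:
        # Shuffle every transformed column that traces back to this input column.
        derived = list(provenance.get(original_col, [original_col]))
        drops = []
        for _ in range(n_repeats):
            permuted = X_t.copy()
            idx = rng.permutation(len(permuted))
            permuted[derived] = X_t[derived].to_numpy()[idx]
            drops.append(baseline - pipeline.score_features(permuted, y))
        importances[original_col] = np.mean(drops)
    return pd.Series(importances, name="importance")
```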

10x speedup if pipeline has text features

On 500 rows of the fraud dataset:

[timing comparison chart: original vs. fast permutation importance]

About 2x speedup on a "standard" pipeline

[timing comparison chart: original vs. fast permutation importance]


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.


codecov bot commented Jan 29, 2021

Codecov Report

Merging #1762 (3736c42) into main (9b39336) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1762     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         247      248      +1     
  Lines       19679    19871    +192     
=========================================
+ Hits        19670    19863    +193     
+ Misses          9        8      -1     
Impacted Files Coverage Δ
...lml/tests/model_understanding_tests/test_graphs.py 100.0% <ø> (ø)
evalml/model_understanding/graphs.py 100.0% <100.0%> (+0.3%) ⬆️
evalml/pipelines/component_graph.py 100.0% <100.0%> (ø)
...components/transformers/encoders/onehot_encoder.py 100.0% <100.0%> (ø)
...components/transformers/encoders/target_encoder.py 100.0% <100.0%> (ø)
.../transformers/preprocessing/datetime_featurizer.py 100.0% <100.0%> (ø)
...lines/components/transformers/preprocessing/lsa.py 100.0% <100.0%> (ø)
...ents/transformers/preprocessing/text_featurizer.py 100.0% <100.0%> (ø)
...l/pipelines/components/transformers/transformer.py 100.0% <100.0%> (ø)
evalml/pipelines/pipeline_base.py 100.0% <100.0%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b39336...3736c42.

@freddyaboulton freddyaboulton added the enhancement An improvement to an existing feature. label Jan 29, 2021
@freddyaboulton freddyaboulton marked this pull request as ready for review January 29, 2021 21:38
@@ -68,3 +68,6 @@ def fit_transform(self, X, y=None):
except MethodPropertyNotFoundError as e:
raise e
return _convert_to_woodwork_structure(X_t)

def _get_feature_provenance(self):
Contributor Author

@freddyaboulton freddyaboulton Jan 29, 2021

Keeping this private for now since it's only used for permutation importance but this same mechanism can help with #1347

@@ -523,70 +523,6 @@ def test_graph_confusion_matrix_title_addition(X_y_binary):
assert fig_dict['layout']['title']['text'] == 'Confusion matrix with added title text, normalized using method "true"'


def test_get_permutation_importance_invalid_objective(X_y_regression, linear_regression_pipeline_class):
Contributor Author

There has been talk about splitting up test_graphs.py. I figured I would get a head start by creating test_permutation_importance.py and moving all of the permutation tests there.

Contributor

Good call

Contributor

Nice, thank you for doing this!


@pytest.mark.parametrize('pipeline_class, parameters', test_cases)
@patch('evalml.pipelines.PipelineBase._supports_fast_permutation_importance', new_callable=PropertyMock)
def test_fast_permutation_importance_matches_sklearn_output(mock_supports_fast_importance, pipeline_class, parameters,
Contributor Author

I think it's important to test that our optimization calculates the values correctly. Mocking doesn't really make sense here but since I'm only using 100 points and n_jobs=1 I think it's ok.

Contributor

Agreed!
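
(For reference, the comparison roughly works like the sketch below: the property that gates the fast path is patched on and off, and the two outputs are asserted equal. The import path and fixture wiring are simplified assumptions; the real test lives in test_permutation_importance.py with the parametrized test_cases.)

```python
# Simplified sketch of the fast-vs-slow comparison; not the exact test code.
from unittest.mock import PropertyMock, patch

import pandas as pd

from evalml.model_understanding.graphs import calculate_permutation_importance


@patch('evalml.pipelines.PipelineBase._supports_fast_permutation_importance',
       new_callable=PropertyMock)
def check_fast_matches_slow(mock_supports_fast, pipeline, X, y):
    # Fast path: permute precomputed estimator features via provenance.
    mock_supports_fast.return_value = True
    fast_scores = calculate_permutation_importance(
        pipeline, X, y, objective='Log Loss Binary', random_state=0)

    # Slow path: fall back to sklearn's permutation_importance over the raw data.
    mock_supports_fast.return_value = False
    slow_scores = calculate_permutation_importance(
        pipeline, X, y, objective='Log Loss Binary', random_state=0)

    pd.testing.assert_frame_equal(fast_scores, slow_scores)
```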

@freddyaboulton freddyaboulton force-pushed the 1416_perm_importance_speedup branch 2 times, most recently from 55aef58 to 0073b7f Compare February 2, 2021 20:19
Contributor

@bchen1116 bchen1116 left a comment

Looks good! The time speedup is pretty awesome!



def _fast_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_seed=None):
"""Calculate permutation importance faster by onlu computing the estimator features once.
Contributor

typo: onlu

@@ -92,7 +93,12 @@ def fit(self, X, y):
X (ww.DataTable, pd.DataFrame): The input training data of shape [n_samples, n_features]
y (ww.DataColumn, pd.Series): The target training data of length [n_samples]
"""
if isinstance(X, ww.DataTable):
Contributor

why not use _convert_to_woodwork and convert_woodwork_types_wrapper here instead of doing these two checks to convert to pandas?

Contributor Author

Good point! I'll delete this in favor of that.

scores = pipeline.score(X, y, objectives=[objective])
return scores[objective.name] if objective.greater_is_better else -scores[objective.name]
perm_importance = sk_permutation_importance(pipeline, X, y, n_repeats=n_repeats, scoring=scorer, n_jobs=n_jobs, random_state=random_state)
if pipeline._supports_fast_permutation_importance:
Contributor

nice!
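
(For anyone reading along: sklearn's permutation_importance assumes higher scores are better, which is why the scorer in the hunk above negates lower-is-better objectives. Roughly, the slow path looks like this sketch, with helper names simplified for illustration:)

```python
# Sketch of the sklearn-backed (slow) path, simplified from the hunk above.
from sklearn.inspection import permutation_importance as sk_permutation_importance


def _build_scorer(objective):
    def scorer(pipeline, X, y):
        scores = pipeline.score(X, y, objectives=[objective])
        # sklearn maximizes the score, so flip the sign for lower-is-better objectives.
        return scores[objective.name] if objective.greater_is_better else -scores[objective.name]
    return scorer


def slow_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_state=0):
    return sk_permutation_importance(pipeline, X, y, n_repeats=n_repeats,
                                     scoring=_build_scorer(objective),
                                     n_jobs=n_jobs, random_state=random_state)
```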

slow_scores = calculate_permutation_importance(pipeline, X, y, objective='Log Loss Binary',
random_state=0)

pd.testing.assert_frame_equal(fast_scores, slow_scores)
Contributor

Test looks good! Since we're ideally getting a speed boost from the faster implementation, could we add an assertion that the time for fast_scores is <= the time for slow_scores?

Contributor Author

Good point! I tried a few times locally and the time for fast scores is smaller than for slow scores. I'm pushing this up and keeping an eye on whether the results vary from run to run.

Contributor Author

It flaked on windows so I'm reverting it lol.

Contributor

classic windows lol

Contributor

@chukarsten chukarsten left a comment

Like it. Lots of good work here. A few nits/suggestions/questions but nothing to stop a merge.



def _fast_permutation_importance(pipeline, X, y, objective, n_repeats=5, n_jobs=None, random_seed=None):
"""Calculate permutation importance faster by onlu computing the estimator features once.
Contributor

typo: "only"

Comment on lines 326 to 329
pipeline (PipelineBase or subclass): Fitted pipeline
X (ww.DataTable, pd.DataFrame): The input data used to score and compute permutation importance
y (ww.DataColumn, pd.Series): The target data
objective (str, ObjectiveBase): Objective to score on
Contributor

Nittiest of nits: periods. It is private, though, so do we want docstrings anyway?

Contributor Author

Good point. I think the docstring for calculate_permutation_importance is enough. I'll add a type hint for the return type there!

random_state (int): The random seed. Defaults to 0.

Returns:
Mean feature importance scores over n_repeats number of shuffles.
Contributor

Type hinting for the return?

def _get_feature_provenance(self, component_list, input_feature_names):
"""Get the feature provenance for each feature in the input_feature_names.

The provenance is a mapping from the original feature names in the dataset to a list of
Contributor

Solid, thanks for putting that in here.

if not component_list:
return {}
Contributor

What's the intent here?

Contributor Author

Pipelines with empty component graphs are valid so I wanted to exit before we get the final_estimator_features.

@@ -523,70 +523,6 @@ def test_graph_confusion_matrix_title_addition(X_y_binary):
assert fig_dict['layout']['title']['text'] == 'Confusion matrix with added title text, normalized using method "true"'


def test_get_permutation_importance_invalid_objective(X_y_regression, linear_regression_pipeline_class):
Contributor

Good call


@pytest.mark.parametrize('pipeline_class, parameters', test_cases)
@patch('evalml.pipelines.PipelineBase._supports_fast_permutation_importance', new_callable=PropertyMock)
def test_fast_permutation_importance_matches_sklearn_output(mock_supports_fast_importance, pipeline_class, parameters,
Contributor

Agreed!

Contributor

dsherry commented Feb 3, 2021

Will take a look shortly. I love the performance numbers! 10x speedup for text is 🔥
@freddyaboulton is it possible to add data points to that graph? Helpful to see how many samples were taken heh.

Currently this only supports pipelines where each feature is created from at most one other feature.

That seems like a reasonable limitation for the immediate future. We'll need to revisit this at some point, yes? Transformers which create columns today: OHE, target encoder, datetime, text, timeseries delayed features, and soon seasonality. We currently have LDA/PCA transformers defined but aren't using them in automl--adding those would definitely violate this limitation.

Contributor Author

freddyaboulton commented Feb 3, 2021

@dsherry Agreed on enhancing this implementation to support more transformers! When we get there, we'll also need to consider pipelines with multiple estimators. The estimators act like transformers in that case: they ingest many columns and output many columns.

Contributor

@angela97lin angela97lin left a comment

This is some really great work!! The speedups are chef's kiss. Implementation looks great, just left a few minor questions and comments (some just so I understand this better hehe).

Also, _get_feature_provenance is really interesting to see at a conceptual level, to better understand how our components transform data. (Also useful since I've been working on something that describes transformations :P)

@@ -3,9 +3,12 @@ Release Notes

**Future Releases**
* Enhancements
* Sped up permutation importance for some pipelines :pr:`1762`
Contributor

🥳 !!

features that were created from that original feature.

For example, after fitting a OHE on a feature called 'cats', with categories 'a' and 'b', the
provenance, would have the following entry: {'cats': ['a', 'b']}.
Contributor

nit pick: the provenance, --> the provenance (no comma needed :P)

input_feature_names (list(str)): Names of the features in the input dataframe.

Returns:
dictionary: mapping of feature name of set feature names that were created from that feature.
Contributor

mapping of feature name of --> mapping of feature name to ?

Contributor Author

Thank you! 😅

if component_input in out_feature:
provenance[in_feature] = out_feature.union(set(component_output))

# Get rid of features that are not in the dataset the final estimator uses
Contributor

In what cases does this not happen? Is this just when columns are dropped?

Contributor Author

@freddyaboulton freddyaboulton Feb 3, 2021

Yes! The other case is when we create features from a feature but delete the intermediate feature, e.g. we create a month column from a datetime feature and then we OHE the month column. Since the OHE drops the month column, we want to remove it from the provenance of the datetime feature.

Contributor

Oooo got it, thanks for explaining! sgtm 😀
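
(A made-up example of that chain, with invented column names: a datetime column 'ts' produces 'ts_month', the OHE then one-hot encodes and drops 'ts_month', and the pruning step keeps only the columns the final estimator actually sees.)

```python
# Hypothetical illustration only; column names are invented.
# Everything ever derived from 'ts' across the component graph:
raw_provenance = {'ts': {'ts_month', 'ts_month_Jan', 'ts_month_Feb'}}

# Columns that actually reach the final estimator (the OHE dropped 'ts_month'):
final_estimator_features = {'ts_month_Jan', 'ts_month_Feb', 'amount'}

# Mirror of the "get rid of features the final estimator doesn't use" step:
provenance = {col: children & final_estimator_features
              for col, children in raw_provenance.items()}
assert provenance == {'ts': {'ts_month_Jan', 'ts_month_Feb'}}
```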

for col_name in self._date_time_col_names:
provenance[col_name] = []
for feature in self.parameters['features_to_extract']:
provenance[col_name].append(f'{col_name}_{feature}')
Contributor

I know this is from our current implementation of how we create new features; it makes me wonder if we can abstract that logic away so we don't have to change it in two places. Just a thought :p, looks good!

Contributor

Or maybe somewhat related, but it seems like in other components, we calculate self._provenance when we fit/transform, but here we reconstruct it almost from scratch. Could we also follow that similar pattern? Would that be helpful?

Contributor Author

Good observation! I like the "from scratch" pattern because it keeps the definition of the provenance separate from the transform logic, but you're right that it can lead to some duplication. I tried to follow the "from scratch" pattern in most cases, but I couldn't use it everywhere because the OHE, for example, has special logic for making names unique, so its provenance basically has to happen in get_feature_names.

@@ -231,6 +235,56 @@ def _compute_features(self, component_list, X, y=None, fit=False):
output_cache[component_name] = output
return output_cache

def _get_feature_provenance(self, component_list, input_feature_names):
Contributor

Curious, why component_list as an argument? I noticed the only time we use this is in calling
self._feature_provenance = self._get_feature_provenance(self.compute_order, X.columns); would it make sense to just remove the arguments and internally use the component graph's compute_order? Or is this for flexibility? Are there use cases for this? (Maybe computing not the whole graph but some path in it? 🤔 )

Contributor Author

Great point! This is an oversight on my part. I think I was refactoring some code and thought I could make it a static method but that's not really reasonable.

I'm deleting the component_list parameter now!


test_cases = [(LinearPipelineWithDropCols, {"Drop Columns Transformer": {'columns': ['country']}}),
(LinearPipelineWithImputer, {}),
(LinearPipelineSameFeatureUsedByTwoComponents, {'DateTime Featurization Component': {'encode_as_categories': True}}),
Contributor

Amazing. I was wondering what happens when the datetime featurizer creates categorical columns that the OHE then uses, since it'd rely on the output cols of the datetime featurizer as provenance. 🤩


# Do this to make sure we use the same int as sklearn under the hood
random_state = np.random.RandomState(0)
random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
Contributor

Is this different behavior from what we use in get_random_seed / can we just use that? 👀

Contributor Author

Unfortunately it is! Great suggestion though!
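
(For context on the seed-drawing snippet above, and hedged since sklearn internals can change between releases: around sklearn 0.24, permutation_importance(random_state=0) draws one integer seed from RandomState(0) and reuses it for every column's shuffles, so drawing the same integer up front lets the fast path reproduce the exact same permutations.)

```python
# Sketch of why the seed is drawn this way (based on sklearn ~0.24 internals,
# which may change between versions).
import numpy as np

# What sklearn's permutation_importance(random_state=0) does under the hood:
sk_rng = np.random.RandomState(0)
sk_seed = sk_rng.randint(np.iinfo(np.int32).max + 1)

# Drawing the same integer lets the fast path shuffle with the identical seed,
# so its results can be compared exactly against sklearn's.
our_seed = np.random.RandomState(0).randint(np.iinfo(np.int32).max + 1)
assert our_seed == sk_seed
```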


}


test_cases = [(LinearPipelineWithDropCols, {"Drop Columns Transformer": {'columns': ['country']}}),
Contributor

Wow, this is parametrization taken to the next level LMAO

@freddyaboulton freddyaboulton added the performance Issues tracking performance improvements. label Feb 4, 2021
@freddyaboulton freddyaboulton merged commit d92a3cb into main Feb 5, 2021
@freddyaboulton freddyaboulton deleted the 1416_perm_importance_speedup branch February 5, 2021 15:14
@ParthivNaresh ParthivNaresh mentioned this pull request Feb 9, 2021
Successfully merging this pull request may close these issues.

Permutation importance: speed up by precomputing estimator features