Skip to content

Refactor permutation importance method and add per-column permutation importance method#2302

Merged
chukarsten merged 21 commits into
mainfrom
2299_per_col
Jun 1, 2021
Merged

Refactor permutation importance method and add per-column permutation importance method#2302
chukarsten merged 21 commits into
mainfrom
2299_per_col

Conversation

@angela97lin

@angela97lin angela97lin commented May 25, 2021

Copy link
Copy Markdown
Contributor

Closes #2299

  • Also updates calculate_permutation_importance to only return dict with mean. Previously, the slow impl would return the std / the feature importances but that was different from the fast impl which only returned the mean. Hence, standardizing to return just the mean, since we don't use the std/feature importances to begin with.

Question for discussion: We no longer use the scikit-learn calculate_permutation_importance method as the "slow" method. Should we still compare outputs? I originally wrote this test and confirmed that it passed, but don't think it's necessary to add to codebase. (It also takes 5 minutes for the test locally 😱). Since we're moving away from the sklearn method, I think its fine.



@pytest.mark.parametrize('pipeline_class, parameters', test_cases)
@patch('evalml.pipelines.PipelineBase._supports_fast_permutation_importance', new_callable=PropertyMock)
def test_slow_permutation_importance_matches_sklearn_output(mock_supports_fast_importance, pipeline_class, parameters,
                                                            has_minimal_dependencies, fraud_100):
    if has_minimal_dependencies and pipeline_class == LinearPipelineWithTargetEncoderAndOHE:
        pytest.skip("Skipping test_fast_permutation_importance_matches_sklearn_output for target encoder cause "
                    "dependency not installed.")
    from evalml.demos import load_fraud
    from evalml.objectives import get_objective
    from sklearn.inspection import permutation_importance as sk_permutation_importance

    X, y = fraud_100

    if pipeline_class == LinearPipelineWithTextFeatures:
        X.ww.set_types(logical_types={'provider': 'NaturalLanguage'})

    # Do this to make sure we use the same int as sklearn under the hood
    random_state = np.random.RandomState(0)
    random_seed = random_state.randint(np.iinfo(np.int32).max + 1)

    mock_supports_fast_importance.return_value = False
    parameters['Random Forest Classifier'] = {'n_jobs': 1}
    pipeline = pipeline_class(pipeline_class.component_graph, parameters=parameters)
    pipeline.fit(X, y)

    random_state = np.random.RandomState(0)
    random_seed = random_state.randint(np.iinfo(np.int32).max + 1)
    def sklearn_impl(pipeline, X, y, objective, n_repeats, n_jobs, random_seed):
        X = infer_feature_types(X)
        y = infer_feature_types(y)
        objective = get_objective(objective, return_instance=True)

        def scorer(pipeline, X, y):
            scores = pipeline.score(X, y, objectives=[objective])
            return scores[objective.name] if objective.greater_is_better else -scores[objective.name]
        perm_importance = sk_permutation_importance(pipeline, X, y, n_repeats=n_repeats, scoring=scorer, n_jobs=n_jobs,
                                                    random_state=random_seed)
        mean_perm_importance = perm_importance["importances_mean"]
        feature_names = list(X.columns)
        mean_perm_importance = list(zip(feature_names, mean_perm_importance))
        mean_perm_importance.sort(key=lambda x: x[1], reverse=True)
        return pd.DataFrame(mean_perm_importance, columns=["feature", "importance"])

    sklearn_scores = sklearn_impl(pipeline, X, y, objective='Log Loss Binary', n_repeats=5, n_jobs=None,
                                                   random_seed=0)
    mock_supports_fast_importance.return_value = False
    slow_scores = calculate_permutation_importance(pipeline, X, y, objective='Log Loss Binary',
                                                   random_seed=random_seed)
    pd.testing.assert_frame_equal(sklearn_scores, slow_scores)

@angela97lin angela97lin self-assigned this May 25, 2021
return _go.Figure(layout=layout, data=graph_data)


def _calculate_permutation_scores_fast(pipeline, precomputed_features, y, objective, col_name,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved everything to a permutation_importance file since we were starting to add more methods that were just to calculate permutation importance. Left the graphing permutation importance method here, though.

@codecov

codecov Bot commented May 25, 2021

Copy link
Copy Markdown

Codecov Report

Merging #2302 (9431939) into main (275ea65) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2302     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        280     281      +1     
  Lines      24492   24566     +74     
=======================================
+ Hits       24464   24538     +74     
  Misses        28      28             
Impacted Files Coverage Δ
evalml/model_understanding/__init__.py 100.0% <100.0%> (ø)
evalml/model_understanding/graphs.py 100.0% <100.0%> (ø)
...alml/model_understanding/permutation_importance.py 100.0% <100.0%> (ø)
...understanding_tests/test_permutation_importance.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 275ea65...9431939. Read the comment docs.

@freddyaboulton

Copy link
Copy Markdown
Contributor

@angela97lin What i like about that test is that it makes sure our permutation importance implementation is producing correct results (at least by sklearn standards). I'm not sure what you have in mind for how the new tests would look, but I think it's worth ensuring we're producing correct values (especially since we're venturing out with our own impl now). We don't have to call sklearn's permutation importance to do that though.

@angela97lin

Copy link
Copy Markdown
Contributor Author

@freddyaboulton I converted the previous test which checked against sklearn's impl to compare the new "slow" vs fast method, as a way of checking for correctness. This test which I added above is to check that the "slow" produces the same as sklearn's... which I think is fine to check for this PR but since this PR is also breaking away from the public sklearn interface, is perhaps not necessary? What do you think? 🤔
I guess TLDR is, how tied are we to checking for correctness via external library :d

@freddyaboulton

Copy link
Copy Markdown
Contributor

Got it! I don't think we have to use sklearn to establish "correctness"!

@angela97lin

Copy link
Copy Markdown
Contributor Author

@freddyaboulton Sounds good!! 😁

@angela97lin angela97lin marked this pull request as ready for review May 26, 2021 03:53

@bchen1116 bchen1116 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I left a few nit-pick comments, but the test coverage is solid, and I like how you broke the methods apart!

Comment thread evalml/model_understanding/permutation_importance.py Outdated
pipeline (PipelineBase or subclass): Fitted pipeline
X (pd.DataFrame): The input data used to score and compute permutation importance
y (pd.Series): The target data
objective (str, ObjectiveBase): Objective to score on

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, but can you sort these args based on the order of the args in the method? Also, can you add col_name to the args here?

return pd.DataFrame(mean_perm_importance, columns=["feature", "importance"])


def calculate_permutation_importance_one_column(X, y, pipeline, col_name, objective,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: to make this and the previous method more similar, can you normalize the order of X, y, pipeline vs pipeline, X, y?

@chukarsten chukarsten left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Angela, this looks good. I think there are a few doctstring adjustments to make. It also seems that the per-column and base perm importance functions are pretty similar so I was wondering if its possible to reuse the function more or less as-is. I had grand ideas to try it out, but I got distracted :P. Anyways, I'm sure it's the way it is for a reason, so take it or leave it if you think it's worth exploring.

Comment thread evalml/model_understanding/permutation_importance.py Outdated
Comment thread evalml/model_understanding/permutation_importance.py
Comment thread evalml/model_understanding/permutation_importance.py Outdated
Comment thread evalml/model_understanding/permutation_importance.py Outdated
Comment thread evalml/model_understanding/permutation_importance.py
Comment thread evalml/model_understanding/permutation_importance.py Outdated
@chukarsten chukarsten merged commit e323993 into main Jun 1, 2021
@chukarsten chukarsten deleted the 2299_per_col branch June 1, 2021 23:22
@chukarsten chukarsten mentioned this pull request Jun 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add per-column permutation importance methods

4 participants