[eda] added explain_rows method to autogluon.eda.auto - Kernel SHAP visualization #3014

Merged
gradientsky merged 10 commits into autogluon:master from gradientsky:2023_03_06_eda_explain
Mar 13, 2023

Conversation

@gradientsky
Contributor

@gradientsky gradientsky commented Mar 7, 2023

Description of changes:

  • added explain_rows method to autogluon.eda.auto; the method performs Kernel SHAP value analysis and visualization
  • quick_fit: fixes for highest_error and undecided row calculations

Examples

import pandas as pd
import numpy as np
import autogluon.eda.auto as auto

# Load data
df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
label='Survived'

# Fit model
state = auto.quick_fit(
    train_data=df_train,
    label=label,
    save_model_to_state=True,
    return_state=True,
    render_analysis=False,  # Don't render analysis
)

# Explain the row with highest error
auto.explain_rows(
    train_data=df_train,
    model=state.model,
    backend='shap',  # backend: 'shap' (default) or 'fastshap'
    plot='force',  # plot type: 'force' (default) or 'waterfall'
    rows=state.model_evaluation.highest_error[:1],
)

[force plot rendered here]

# Explain the row predicted incorrectly, but closest to the decision boundary as waterfall plot
auto.explain_rows(
    train_data=df_train,
    model=state.model,
    display_rows=True,
    plot='waterfall',
    rows=state.model_evaluation.undecided[:1],
)

[waterfall plot rendered here]

Using primitives

import autogluon.eda.analysis as eda
import autogluon.eda.visualization as viz

s = auto.analyze(
    train_data=df_train, model=state.model,
    return_state=True,
    anlz_facets=[
        # Backend using `shap` package
        eda.explain.ShapAnalysis(state.model_evaluation.highest_error[:2]),
        # Backend using `fastshap` package
        # eda.explain.FastShapAnalysis(state.model_evaluation.highest_error[:2]),
    ],
    viz_facets=[
        viz.explain.ExplainForcePlot(),  # Force layout
        viz.explain.ExplainWaterfallPlot(),  # Waterfall layout
    ]
)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@gradientsky gradientsky changed the title [eda] added explain_rows method to autogluon.eda.auto; the methods performs Kernel SHAP values analysis and visualization [eda] added explain_rows method to autogluon.eda.auto - Kernel SHAP visualization Mar 7, 2023
@gradientsky gradientsky added this to the 0.8 Release milestone Mar 7, 2023
eda/setup.py
'phik>=0.12.2,<0.13',
'seaborn>=0.12.0,<0.13',
'ipywidgets>=7.7.1,<9.0', # min versions guidance: 7.7.1 collab/kaggle
'shap>=0.41,<0.42',
Collaborator


@gradientsky FYI, fastshap may be of interest to test out and compare performance as suggested here: #2222 (comment)

fastshap claims to be much faster than shap:

https://raw.githubusercontent.com/AnotherSamWilson/fastshap/master/benchmarks/iris_benchmark_time.png

Contributor Author


Did a code rework. The code is now split into backend analysis and rendering parts. Analysis supports both the shap and fastshap libraries (two different backends). shap is used as the default backend for the auto functionality because it is faster.
Visualizations are also split into separate primitives, compatible with both backends.
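The split described in this comment can be sketched as follows (the class names and state key below are made up for illustration, not the actual AutoGluon EDA code): analysis primitives compute results into a shared state, and visualization primitives render from that state, so any backend that fills the same keys works with any renderer.

```python
class State(dict):
    """Shared state passed between analysis and visualization primitives."""


class ShapAnalysis:
    """Backend primitive: computes explanation values into the state."""

    def __init__(self, rows):
        self.rows = rows

    def fit(self, state):
        # A real backend would run Kernel SHAP here; we store a placeholder.
        state["shap_values"] = [r * 2 for r in self.rows]


class ForcePlot:
    """Rendering primitive: reads from the state, knows nothing about backends."""

    def render(self, state):
        return f"force plot of {state['shap_values']}"


state = State()
ShapAnalysis([1, 2, 3]).fit(state)
print(ForcePlot().render(state))  # any backend filling "shap_values" works here
```

Swapping in a second backend (as with fastshap here) then only requires producing the same state keys.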

@github-actions
Contributor

github-actions bot commented Mar 8, 2023

Job PR-3014-c5e116b is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3014/c5e116b/index.html

@gradientsky gradientsky force-pushed the 2023_03_06_eda_explain branch from c5e116b to 4c93e16 on March 10, 2023 03:11
@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@gradientsky gradientsky force-pushed the 2023_03_06_eda_explain branch from 4c93e16 to f1229f8 on March 10, 2023 08:28
@gradientsky
Contributor Author

Blocked by AnotherSamWilson/fastshap#8

eda/setup.py Outdated
'seaborn>=0.12.0,<0.13',
'ipywidgets>=7.7.1,<9.0', # min versions guidance: 7.7.1 collab/kaggle
'shap>=0.41,<0.42',
'fastshap>=0.3,<0.4',
Collaborator


Is fastshap good enough to have as a required dependency? What are the relative advantages of fastshap over shap?

Contributor Author


For the purposes of explaining a few rows here, fastshap is slower than shap. I sent an update to remove the fastshap backend completely (performance + support concerns).

@gradientsky gradientsky force-pushed the 2023_03_06_eda_explain branch from 4a6d17d to 2bc659e on March 10, 2023 20:01
@github-actions
Contributor

Job PR-3014-4a6d17d is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3014/4a6d17d/index.html

@github-actions
Contributor

Job PR-3014-2bc659e is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3014/2bc659e/index.html

@github-actions
Contributor

Job PR-3014-c4fc440 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3014/c4fc440/index.html

Collaborator

@Innixma Innixma left a comment


LGTM! Had some minor comments

Comment on lines +24 to +36
def predict_proba(self, X):
    if isinstance(X, pd.Series):
        X = X.values.reshape(1, -1)
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X, columns=self.feature_names)
    if self.ag_model.problem_type == REGRESSION:
        preds = self.ag_model.predict(X)
    else:
        preds = self.ag_model.predict_proba(X)
    if self.ag_model.problem_type == REGRESSION or self.target_class is None:
        return preds
    else:
        return preds[self.target_class]
Collaborator


I see a similar hack here that I see in the general AutoGluon Tabular code of using predict_proba for regression. Do you think this is the right thing to do long-term? Is it used to simplify the code logic, or just to align with how Tabular does things?

logger = logging.getLogger(__name__)


class _ShapAutogluonWrapper:
Collaborator


_ShapAutogluonWrapper or _ShapAutoGluonWrapper?

Comment on lines +47 to +50
baseline_sample: int, default = 100
    The background dataset size to use for integrating out features. To determine the impact
    of a feature, that feature is set to "missing" and the change in the model output
    is observed.
Collaborator


Any guidance on what magnitudes represent what noise level?

Is 100 enough? 1000? Will 10000 differ significantly from 100?

Why choose 100 when I could choose 10000? Is it purely to save compute time?

Will it still work if I set it to 1?

Contributor Author


There should be enough rows to cancel out significant variance in the base value.

[chart: base value variance vs. background sample size]

1 row should work, but you will get garbage output. The more rows you use, the longer it takes to compute the values.
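The point about sample size can be illustrated with a quick simulation (a sketch using a toy model output, not AutoGluon or shap code): the base value is essentially the mean model output over the background sample, so its standard deviation shrinks roughly as 1/sqrt(n), which is why 100 rows is far more stable than 1 while 10000 adds little over 1000.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_value_std(n_background, n_trials=2000):
    """Std of the estimated base value (mean model output over the
    background sample) across repeated resamples of that sample."""
    # Stand-in for model outputs on background rows: uniform scores in [0, 1]
    outputs = rng.uniform(0.0, 1.0, size=(n_trials, n_background))
    return outputs.mean(axis=1).std()

for n in (1, 10, 100, 1000):
    print(n, round(base_value_std(n), 4))  # std falls roughly as 1/sqrt(n)
```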

Collaborator


"There should be enough rows to cancel-out significant variance in the base value"

Do we assume the user knows what "enough rows" is? Would it be better to give guidance in the doc-string?

    rows to explain
baseline_sample: int, default = 100
    The background dataset size to use for integrating out features. To determine the impact
    of a feature, that feature is set to "missing" and the change in the model output
Collaborator


"missing": Do we have a consistent definition for what setting to "missing" means?

Contributor Author


This is how the shap package describes the process.
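For intuition, "setting a feature to missing" in Kernel SHAP means replacing it with values drawn from the background dataset and averaging the model output over those replacements. A simplified sketch of that idea (a toy model, not the shap implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def model(X):
    # Toy linear model over two features
    return 2.0 * X[:, 0] + 1.0 * X[:, 1]

background = rng.normal(size=(100, 2))  # background (baseline) sample
x = np.array([1.0, 3.0])                # the row being explained

def output_with_feature_missing(x, missing_idx, background):
    """Marginalize one feature: substitute background values for it
    and average the model output over those substitutions."""
    X = np.tile(x, (len(background), 1))
    X[:, missing_idx] = background[:, missing_idx]
    return model(X).mean()

full = model(x[None, :])[0]  # 5.0 for this toy model
without_f0 = output_with_feature_missing(x, 0, background)
# The drop in output is (roughly) feature 0's contribution for this row.
print(full - without_f0)
```

This also shows why the quality of the background sample matters: the averaged output inherits the sampling noise of the background rows.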

else:
    _baseline_sample = len(args.train_data)

baseline = args.train_data.sample(_baseline_sample, random_state=0)
Collaborator


Move random_state=0 to an init arg?

Contributor Author

@gradientsky gradientsky Mar 13, 2023


updated

for _, row in self.rows.iterrows():
    _row = pd.DataFrame([row])
    if args.model.problem_type == REGRESSION:
        predicted_class = 0
Collaborator


Why not None?

Contributor Author


Good call; updated


misclassified = y_proba[y_true_val != y_pred_val]
expected_value = misclassified.join(y_true_val).apply(lambda row: row.loc[row[label]], axis=1)
predicted_value = misclassified.max(axis=1)
Collaborator


Long-term we may need to revisit this, as .max might not actually be identical to the predicted value. For example, if we start maximizing f1 score, we would set a threshold that isn't 0.5. Probably not necessary to address in this PR, but maybe worth adding a TODO.

Collaborator


I'd imagine we eventually make a method like pred = predictor.proba_to_pred(proba, metric='f1') which contains an inner dictionary of metric -> threshold such as {'f1': 0.63} that influences the pred that is returned
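That idea might look roughly like the sketch below (proba_to_pred, the threshold table, and its values are hypothetical, for illustration only; no such API exists in this PR):

```python
import numpy as np

# Hypothetical metric -> decision-threshold table (illustrative values only)
METRIC_THRESHOLDS = {"accuracy": 0.5, "f1": 0.63}

def proba_to_pred(proba, metric="accuracy"):
    """Turn positive-class probabilities into binary predictions using a
    metric-specific threshold instead of a hard-coded 0.5."""
    threshold = METRIC_THRESHOLDS[metric]
    return (np.asarray(proba) >= threshold).astype(int)

proba = [0.55, 0.70, 0.40]
print(proba_to_pred(proba).tolist())               # [1, 1, 0]
print(proba_to_pred(proba, metric="f1").tolist())  # [0, 1, 0]
```

With such a mapping, `.max(axis=1)` over probabilities would no longer be assumed to match the predicted class when a metric-tuned threshold is in play.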

Collaborator


Nevermind this specific comment, I see predictor.predict called above, so it should be fine.

Comment on lines +978 to +981
baseline_sample: int, default = 100
The background dataset size to use for integrating out features. To determine the impact
of a feature, that feature is set to "missing" and the change in the model output
is observed.
Collaborator


Very random thought completely tangential to the PR: I wish IDEs/code had logic where we could specify a docstring for a parameter one time in some file, then have it auto-populate via a variable reference in the source-code docs so we just have to write

baseline_sample: int, default = 100
     {baseline_sample_docstring}

and the IDE magically converts it to text unless we click to force it to show source, so we avoid having to copy/paste the same doc-string many times across many usages.

That is my random thought of the day. Carry on.

@github-actions
Contributor

Job PR-3014-5c757a8 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3014/5c757a8/index.html

@gradientsky gradientsky merged commit 3ab306d into autogluon:master Mar 13, 2023
@gradientsky gradientsky deleted the 2023_03_06_eda_explain branch March 13, 2023 23:38
@gradientsky gradientsky restored the 2023_03_06_eda_explain branch March 13, 2023 23:38
@gradientsky gradientsky deleted the 2023_03_06_eda_explain branch March 13, 2023 23:38