minor: Causal fidelity metrics: Harmonization between timeseries and image - detailed evaluate #70

Merged 7 commits on Dec 10, 2021
24 changes: 23 additions & 1 deletion docs/api/deletion.md

@@ -10,7 +10,29 @@ the important pixels.

-- <cite>[RISE: Randomized Input Sampling for Explanation of Black-box Models (2018)](https://arxiv.org/abs/1806.07421)</cite>[^1]

The better the method, the smaller the score.

## Score interpretation

If the explanations are accurate, the score will quickly fall from the score on the non-perturbed input to the score of a random predictor.
Thus, in this case, a lower score represents a more accurate explanation.


## Remarks

This metric only evaluates the order of importance between features.

The parameters `metric`, `steps` and `max_percentage_perturbed` may drastically change the score:

- For inputs with many features, increasing the number of steps lets you capture the differences between attribution methods more efficiently.

- The order of importance among features with low importance may not matter; hence, decreasing `max_percentage_perturbed` may make the score more relevant.

Some attribution methods return negative attributions.
For those methods, do not take the absolute value before computing the insertion and deletion metrics.
Otherwise, negative attributions may have higher absolute values, and the order of importance between features will change.
Take these remarks into account to get a relevant score.
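To make the remarks above concrete, here is a toy NumPy sketch of the deletion principle; the model, input and attribution are invented for illustration and this is not Xplique's implementation. Pixels are removed in decreasing order of *signed* attribution (no absolute value is taken), and a faithful ranking makes the score fall quickly, hence a lower area under the curve.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(8, 8))      # toy single-channel "image"
attribution = x - 0.5             # signed attribution: abs() is NOT taken

def toy_model(inp):
    # hypothetical stand-in for model(inp) -> class score
    return float(inp.mean())

# rank pixels from most to least important
order = np.argsort(attribution, axis=None)[::-1]

steps, max_percentage_perturbed = 10, 0.5
n_removed = int(max_percentage_perturbed * x.size)

scores = []
for k in np.linspace(0, n_removed, steps, dtype=int):
    perturbed = x.ravel().copy()
    perturbed[order[:k]] = 0.0    # baseline value replaces the top-k pixels
    scores.append(toy_model(perturbed.reshape(x.shape)))

# the deletion score approximates the area under the score curve
deletion_score = float(np.mean(scores))
```

With more `steps` the curve (and therefore the area estimate) is resolved more finely, which is why the number of steps matters for inputs with many features.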


## Example

34 changes: 31 additions & 3 deletions docs/api/deletion_ts.md

@@ -6,7 +6,35 @@ This metric computes the capacity of the model to make predictions while perturb
Specific explanation metrics for time series are necessary because time series and images have different shapes (number of dimensions) and perturbations should be applied differently to them.
As the insertion and deletion metrics use input perturbation to be computed, creating new metrics for time series is natural[^2].

The better the method, the smaller the score.

## Score interpretation

The interpretation of the score depends on the metric you use to evaluate your model.

- For metrics where the score increases with the performance of the model (such as accuracy):
If the explanations are accurate, the score will quickly fall from the score on the non-perturbed input to the score of a random predictor.
Thus, in this case, a lower score represents a more accurate explanation.

- For metrics where the score decreases with the performance of the model (such as losses):
If the explanations are accurate, the score will quickly rise.
Thus, in this case, a higher score represents a more accurate explanation.


## Remarks

This metric only evaluates the order of importance between features.

The parameters `metric`, `steps` and `max_percentage_perturbed` may drastically change the score:

- For inputs with many features, increasing the number of steps lets you capture the differences between attribution methods more efficiently.

- The order of importance among features with low importance may not matter; hence, decreasing `max_percentage_perturbed` may make the score more relevant.

Some attribution methods return negative attributions.
For those methods, do not take the absolute value before computing the insertion and deletion metrics.
Otherwise, negative attributions may have higher absolute values, and the order of importance between features will change.
Take these remarks into account to get a relevant score.
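A numbers-only sketch of why the interpretation flips with the chosen metric; the deletion curve below is invented for illustration. Read through an accuracy-like metric, the curve falls as features are deleted, while the cross-entropy loss on the same predictions rises.

```python
import numpy as np

# toy probability assigned to the true class as features are deleted
p_correct = np.linspace(0.9, 0.1, 6)

# accuracy-like reading: the curve falls, so lower score = better explanation
accuracy_curve = (p_correct > 0.5).astype(float)
accuracy_score = float(accuracy_curve.mean())

# loss-like reading (cross-entropy): the curve rises, so higher score = better
loss_curve = -np.log(p_correct)
loss_score = float(loss_curve.mean())
```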


## Example

@@ -25,5 +53,5 @@ score = metric.evaluate(explanations)

{{xplique.metrics.DeletionTS}}

[^1]: [RISE: Randomized Input Sampling for Explanation of Black-box Models (2018)](https://arxiv.org/abs/1806.07421)
[^2]: [Towards a Rigorous Evaluation of XAI Methods on Time Series (2019)](https://arxiv.org/abs/1909.07082)
[^1]: [RISE: Randomized Input Sampling for Explanation of Black-box Models (2018)](https://arxiv.org/abs/1806.07421)
[^2]: [Towards a Rigorous Evaluation of XAI Methods on Time Series (2019)](https://arxiv.org/abs/1909.07082)
24 changes: 23 additions & 1 deletion docs/api/insertion.md

@@ -10,7 +10,29 @@ The Insertion Fidelity metric measures how well a saliency-map–based explanati

-- <cite>[RISE: Randomized Input Sampling for Explanation of Black-box Models (2018)](https://arxiv.org/abs/1806.07421)</cite>[^1]

The better the method, the higher the score.

## Score interpretation

If the explanations are accurate, the score will quickly rise to the score on the non-perturbed input.
Thus, in this case, a higher score represents a more accurate explanation.


## Remarks

This metric only evaluates the order of importance between features.

The parameters `metric`, `steps` and `max_percentage_perturbed` may drastically change the score:

- For inputs with many features, increasing the number of steps lets you capture the differences between attribution methods more efficiently.

- The order of importance among features with low importance may not matter; hence, decreasing `max_percentage_perturbed` may make the score more relevant.

Some attribution methods return negative attributions.
For those methods, do not take the absolute value before computing the insertion and deletion metrics.
Otherwise, negative attributions may have higher absolute values, and the order of importance between features will change.
Take these remarks into account to get a relevant score.
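Since the metric only evaluates the ordering of features, a quick toy check (NumPy only, with a hypothetical model and data, not Xplique's implementation) is to compare a faithful ranking against a random one: inserting pixels in the faithful order makes the score rise faster, giving a higher area under the curve.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=(8, 8))      # toy single-channel "image"

def toy_model(inp):
    # hypothetical stand-in for model(inp) -> class score
    return float(inp.mean())

def insertion_score(attribution, steps=10):
    # insert pixels into a zero baseline, most important first
    order = np.argsort(attribution, axis=None)[::-1]
    scores = []
    for k in np.linspace(0, x.size, steps, dtype=int):
        baseline = np.zeros(x.size)
        baseline[order[:k]] = x.ravel()[order[:k]]
        scores.append(toy_model(baseline.reshape(x.shape)))
    return float(np.mean(scores))

faithful_score = insertion_score(x - 0.5)                  # correct ordering
shuffled_score = insertion_score(rng.permutation(x.size))  # random ordering
# faithful_score > shuffled_score: the better ordering rises faster
```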


## Example

30 changes: 29 additions & 1 deletion docs/api/insertion_ts.md

@@ -3,7 +3,35 @@
The Time Series Insertion Fidelity metric measures the faithfulness of explanations on Time Series predictions[^2].
This metric computes the capacity of the model to make predictions while only the most important features are not perturbed[^1].

The better the method, the higher the score.

## Score interpretation

The interpretation of the score depends on the metric you use to evaluate your model.

- For metrics where the score increases with the performance of the model (such as accuracy):
If the explanations are accurate, the score will quickly rise to the score on the non-perturbed input.
Thus, in this case, a higher score represents a more accurate explanation.

- For metrics where the score decreases with the performance of the model (such as losses):
If the explanations are accurate, the score will quickly fall to the score on the non-perturbed input.
Thus, in this case, a lower score represents a more accurate explanation.


## Remarks

This metric only evaluates the order of importance between features.

The parameters `metric`, `steps` and `max_percentage_perturbed` may drastically change the score:

- For inputs with many features, increasing the number of steps lets you capture the differences between attribution methods more efficiently.

- The order of importance among features with low importance may not matter; hence, decreasing `max_percentage_perturbed` may make the score more relevant.

Some attribution methods return negative attributions.
For those methods, do not take the absolute value before computing the insertion and deletion metrics.
Otherwise, negative attributions may have higher absolute values, and the order of importance between features will change.
Take these remarks into account to get a relevant score.
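For time series, the perturbation replaces selected (time step, feature) cells with a baseline rather than masking pixel regions. The sketch below illustrates the kinds of baseline modes exercised by the tests in this PR (a constant float, `"inverse"`, `"negative"`); the exact semantics in Xplique may differ, this is only an assumed illustration.

```python
import numpy as np

x = np.array([[0.2, 0.8],
              [0.6, 0.4]])        # toy (time steps, features) series
mask = np.array([[True, False],
                 [False, True]])  # cells selected for perturbation

def perturb(series, mask, baseline_mode):
    # hypothetical illustration of baseline modes; not Xplique's code
    out = series.copy()
    if isinstance(baseline_mode, float):
        out[mask] = baseline_mode          # constant baseline, e.g. 0.0
    elif baseline_mode == "inverse":
        out[mask] = 1.0 - series[mask]     # flip within [0, 1]
    elif baseline_mode == "negative":
        out[mask] = -series[mask]          # sign flip
    return out
```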


## Example

20 changes: 12 additions & 8 deletions tests/metrics/test_fidelity.py

@@ -4,6 +4,7 @@
from ..utils import generate_model, generate_timeseries_model, generate_data, almost_equal
from xplique.metrics import Insertion, Deletion, MuFidelity, InsertionTS, DeletionTS


def test_mu_fidelity():
# ensure we can compute the metric with consistent arguments
input_shape, nb_labels, nb_samples = ((32, 32, 3), 10, 20)
@@ -50,20 +51,23 @@ def test_perturbation_metrics():
model = generate_timeseries_model(input_shape, nb_labels)
explanations = np.random.uniform(0, 1, x.shape)

for step in [-1, 2, 10]:
for max_percentage_perturbed in [0.2, 1.0]:
for baseline_mode in [0.0, "zero", "inverse", "negative"]:
for step in [-1, 10]:
for baseline_mode in [0.0, "inverse"]:
for metric in ["loss", "accuracy"]:
score_insertion = InsertionTS(
model, x, y, metric="loss", baseline_mode=baseline_mode,
steps=step, max_percentage_perturbed=max_percentage_perturbed,
model, x, y, metric=metric, baseline_mode=baseline_mode,
steps=step, max_percentage_perturbed=0.2,
)(explanations)
score_deletion = DeletionTS(
model, x, y, metric="loss", baseline_mode=baseline_mode,
steps=step, max_percentage_perturbed=max_percentage_perturbed,
model, x, y, metric=metric, baseline_mode=baseline_mode,
steps=step, max_percentage_perturbed=0.2,
)(explanations)

for score in [score_insertion, score_deletion]:
assert 0.0 < score < 1
if metric == "loss":
assert 0.0 < score
elif metric == "accuracy":
assert 0.0 <= score <= 1.0


def test_perfect_correlation():
3 changes: 2 additions & 1 deletion tests/utils.py

@@ -30,7 +30,8 @@ def generate_timeseries_model(input_shape=(20, 10), output_shape=10):
model.add(GlobalAveragePooling1D())
model.add(Dense(output_shape))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
model.compile(loss='categorical_crossentropy', optimizer='sgd',
metrics=['accuracy'])

return model
