minor: Causal fidelity metrics: Harmonization between timeseries and image - detailed evaluate #70

Merged 7 commits on Dec 10, 2021
24 changes: 23 additions & 1 deletion docs/api/deletion.md

@@ -10,7 +10,29 @@ the important pixels.

-- <cite>[RISE: Randomized Input Sampling for Explanation of Black-box Models (2018)](https://arxiv.org/abs/1806.07421)</cite>[^1]

The better the method, the smaller the score.

## Score interpretation

If the explanations are accurate, the score will quickly fall from the score on the non-perturbed input to the score of a random predictor.
Thus, in this case, a lower score represents a more accurate explanation.


## Remarks

This metric only evaluates the order of importance between features.

The parameters `metric`, `steps` and `max_percentage_perturbed` may drastically change the score:

- For inputs with many features, increasing the number of steps lets you capture the differences between attribution methods more efficiently.

- The order of importance among features with low importance may not matter; hence, decreasing `max_percentage_perturbed` may make the score more relevant.

Some attribution methods return negative attributions.
For those methods, do not take the absolute value before computing the insertion and deletion metrics.
Otherwise, negative attributions may have higher absolute values, and the order of importance between features will change.
Take these remarks into account to get a relevant score.
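To make the remarks above concrete, here is a toy NumPy sketch of the deletion principle; the model, input and attribution are invented for illustration and this is not Xplique's implementation. Pixels are removed in decreasing order of *signed* attribution (no absolute value is taken), and a faithful ranking makes the score fall quickly, hence a lower area under the curve.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(8, 8))      # toy single-channel "image"
attribution = x - 0.5             # signed attribution: abs() is NOT taken

def toy_model(inp):
    # hypothetical stand-in for model(inp) -> class score
    return float(inp.mean())

# rank pixels from most to least important
order = np.argsort(attribution, axis=None)[::-1]

steps, max_percentage_perturbed = 10, 0.5
n_removed = int(max_percentage_perturbed * x.size)

scores = []
for k in np.linspace(0, n_removed, steps, dtype=int):
    perturbed = x.ravel().copy()
    perturbed[order[:k]] = 0.0    # baseline value replaces the top-k pixels
    scores.append(toy_model(perturbed.reshape(x.shape)))

# the deletion score approximates the area under the score curve
deletion_score = float(np.mean(scores))
```

With more `steps` the curve (and therefore the area estimate) is resolved more finely, which is why the number of steps matters for inputs with many features.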


## Example

34 changes: 31 additions & 3 deletions docs/api/deletion_ts.md

@@ -6,7 +6,35 @@ This metric computes the capacity of the model to make predictions while perturb
Specific explanation metrics for time series are necessary because time series and images have different shapes (number of dimensions) and perturbations should be applied differently to them.
As the insertion and deletion metrics use input perturbation to be computed, creating new metrics for time series is natural[^2].

The better the method, the smaller the score.

## Score interpretation

The interpretation of the score depends on the metric you use to evaluate your model.

- For metrics where the score increases with the performance of the model (such as accuracy):
If the explanations are accurate, the score will quickly fall from the score on the non-perturbed input to the score of a random predictor.
Thus, in this case, a lower score represents a more accurate explanation.

- For metrics where the score decreases with the performance of the model (such as losses):
If the explanations are accurate, the score will quickly rise.
Thus, in this case, a higher score represents a more accurate explanation.


## Remarks

This metric only evaluates the order of importance between features.

The parameters `metric`, `steps` and `max_percentage_perturbed` may drastically change the score:

- For inputs with many features, increasing the number of steps lets you capture the differences between attribution methods more efficiently.

- The order of importance among features with low importance may not matter; hence, decreasing `max_percentage_perturbed` may make the score more relevant.

Some attribution methods return negative attributions.
For those methods, do not take the absolute value before computing the insertion and deletion metrics.
Otherwise, negative attributions may have higher absolute values, and the order of importance between features will change.
Take these remarks into account to get a relevant score.
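A numbers-only sketch of why the interpretation flips with the chosen metric; the deletion curve below is invented for illustration. Read through an accuracy-like metric, the curve falls as features are deleted, while the cross-entropy loss on the same predictions rises.

```python
import numpy as np

# toy probability assigned to the true class as features are deleted
p_correct = np.linspace(0.9, 0.1, 6)

# accuracy-like reading: the curve falls, so lower score = better explanation
accuracy_curve = (p_correct > 0.5).astype(float)
accuracy_score = float(accuracy_curve.mean())

# loss-like reading (cross-entropy): the curve rises, so higher score = better
loss_curve = -np.log(p_correct)
loss_score = float(loss_curve.mean())
```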


## Example

@@ -25,5 +53,5 @@ score = metric.evaluate(explanations)

{{xplique.metrics.DeletionTS}}

[^1]: [RISE: Randomized Input Sampling for Explanation of Black-box Models (2018)](https://arxiv.org/abs/1806.07421)
[^2]: [Towards a Rigorous Evaluation of XAI Methods on Time Series (2019)](https://arxiv.org/abs/1909.07082)
[^1]: [RISE: Randomized Input Sampling for Explanation of Black-box Models (2018)](https://arxiv.org/abs/1806.07421)
[^2]: [Towards a Rigorous Evaluation of XAI Methods on Time Series (2019)](https://arxiv.org/abs/1909.07082)
24 changes: 23 additions & 1 deletion docs/api/insertion.md

@@ -10,7 +10,29 @@ The Insertion Fidelity metric measures how well a saliency-map–based explanati

-- <cite>[RISE: Randomized Input Sampling for Explanation of Black-box Models (2018)](https://arxiv.org/abs/1806.07421)</cite>[^1]

The better the method, the higher the score.

## Score interpretation

If the explanations are accurate, the score will quickly rise to the score on the non-perturbed input.
Thus, in this case, a higher score represents a more accurate explanation.


## Remarks

This metric only evaluates the order of importance between features.

The parameters `metric`, `steps` and `max_percentage_perturbed` may drastically change the score:

- For inputs with many features, increasing the number of steps lets you capture the differences between attribution methods more efficiently.

- The order of importance among features with low importance may not matter; hence, decreasing `max_percentage_perturbed` may make the score more relevant.

Some attribution methods return negative attributions.
For those methods, do not take the absolute value before computing the insertion and deletion metrics.
Otherwise, negative attributions may have higher absolute values, and the order of importance between features will change.
Take these remarks into account to get a relevant score.
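Since the metric only evaluates the ordering of features, a quick toy check (NumPy only, with a hypothetical model and data, not Xplique's implementation) is to compare a faithful ranking against a random one: inserting pixels in the faithful order makes the score rise faster, giving a higher area under the curve.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=(8, 8))      # toy single-channel "image"

def toy_model(inp):
    # hypothetical stand-in for model(inp) -> class score
    return float(inp.mean())

def insertion_score(attribution, steps=10):
    # insert pixels into a zero baseline, most important first
    order = np.argsort(attribution, axis=None)[::-1]
    scores = []
    for k in np.linspace(0, x.size, steps, dtype=int):
        baseline = np.zeros(x.size)
        baseline[order[:k]] = x.ravel()[order[:k]]
        scores.append(toy_model(baseline.reshape(x.shape)))
    return float(np.mean(scores))

faithful_score = insertion_score(x - 0.5)                  # correct ordering
shuffled_score = insertion_score(rng.permutation(x.size))  # random ordering
# faithful_score > shuffled_score: the better ordering rises faster
```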


## Example

30 changes: 29 additions & 1 deletion docs/api/insertion_ts.md

@@ -3,7 +3,35 @@
The Time Series Insertion Fidelity metric measures the faithfulness of explanations on Time Series predictions[^2].
This metric computes the capacity of the model to make predictions while only the most important features are not perturbed[^1].

The better the method, the higher the score.

## Score interpretation

The interpretation of the score depends on the metric you use to evaluate your model.

- For metrics where the score increases with the performance of the model (such as accuracy):
If the explanations are accurate, the score will quickly rise to the score on the non-perturbed input.
Thus, in this case, a higher score represents a more accurate explanation.

- For metrics where the score decreases with the performance of the model (such as losses):
If the explanations are accurate, the score will quickly fall to the score on the non-perturbed input.
Thus, in this case, a lower score represents a more accurate explanation.


## Remarks

This metric only evaluates the order of importance between features.

The parameters `metric`, `steps` and `max_percentage_perturbed` may drastically change the score:

- For inputs with many features, increasing the number of steps lets you capture the differences between attribution methods more efficiently.

- The order of importance among features with low importance may not matter; hence, decreasing `max_percentage_perturbed` may make the score more relevant.

Some attribution methods return negative attributions.
For those methods, do not take the absolute value before computing the insertion and deletion metrics.
Otherwise, negative attributions may have higher absolute values, and the order of importance between features will change.
Take these remarks into account to get a relevant score.
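For time series, the perturbation replaces selected (time step, feature) cells with a baseline rather than masking pixel regions. The sketch below illustrates the kinds of baseline modes exercised by the tests in this PR (a constant float, `"inverse"`, `"negative"`); the exact semantics in Xplique may differ, this is only an assumed illustration.

```python
import numpy as np

x = np.array([[0.2, 0.8],
              [0.6, 0.4]])        # toy (time steps, features) series
mask = np.array([[True, False],
                 [False, True]])  # cells selected for perturbation

def perturb(series, mask, baseline_mode):
    # hypothetical illustration of baseline modes; not Xplique's code
    out = series.copy()
    if isinstance(baseline_mode, float):
        out[mask] = baseline_mode          # constant baseline, e.g. 0.0
    elif baseline_mode == "inverse":
        out[mask] = 1.0 - series[mask]     # flip within [0, 1]
    elif baseline_mode == "negative":
        out[mask] = -series[mask]          # sign flip
    return out
```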


## Example

20 changes: 12 additions & 8 deletions tests/metrics/test_fidelity.py

@@ -4,6 +4,7 @@
from ..utils import generate_model, generate_timeseries_model, generate_data, almost_equal
from xplique.metrics import Insertion, Deletion, MuFidelity, InsertionTS, DeletionTS


def test_mu_fidelity():
# ensure we can compute the metric with consistent arguments
input_shape, nb_labels, nb_samples = ((32, 32, 3), 10, 20)
@@ -50,20 +51,23 @@ def test_perturbation_metrics():
model = generate_timeseries_model(input_shape, nb_labels)
explanations = np.random.uniform(0, 1, x.shape)

for step in [-1, 2, 10]:
for max_percentage_perturbed in [0.2, 1.0]:
for baseline_mode in [0.0, "zero", "inverse", "negative"]:
for step in [-1, 10]:
for baseline_mode in [0.0, "inverse"]:
for metric in ["loss", "accuracy"]:
score_insertion = InsertionTS(
model, x, y, metric="loss", baseline_mode=baseline_mode,
steps=step, max_percentage_perturbed=max_percentage_perturbed,
model, x, y, metric=metric, baseline_mode=baseline_mode,
steps=step, max_percentage_perturbed=0.2,
)(explanations)
score_deletion = DeletionTS(
model, x, y, metric="loss", baseline_mode=baseline_mode,
steps=step, max_percentage_perturbed=max_percentage_perturbed,
model, x, y, metric=metric, baseline_mode=baseline_mode,
steps=step, max_percentage_perturbed=0.2,
)(explanations)

for score in [score_insertion, score_deletion]:
assert 0.0 < score < 1
if metric == "loss":
assert 0.0 < score
elif metric == "accuracy":
assert 0.0 <= score <= 1.0


def test_perfect_correlation():
3 changes: 2 additions & 1 deletion tests/utils.py

@@ -30,7 +30,8 @@ def generate_timeseries_model(input_shape=(20, 10), output_shape=10):
model.add(GlobalAveragePooling1D())
model.add(Dense(output_shape))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')
model.compile(loss='categorical_crossentropy', optimizer='sgd',
metrics=['accuracy'])

return model
