# **Lab 7b - Explainable and Trustworthy AI**


---

**Teacher**: Eliana Pastor (eliana.pastor@polito.it)


---

## **Evaluating explanations - Text data**

### Lab Sentiment Classification and Benchmark explanations with ***ferret***

We will use [ferret](https://github.com/g8a9/ferret) to explain the predictions of a text classifier with four post-hoc explanation methods and to benchmark their explanations. Our running example is the sentiment classification task. We will:

1. compute token-level post-hoc feature attributions with SHAP, LIME, and Gradient variants;
2. visualize the explanations;
3. benchmark explanations using several faithfulness metrics;
4. collect human rationales and test explanations for plausibility.

Note: if you are running this notebook in Google Colab, you can switch to the GPU Runtime for faster compute.

Install *ferret* along with *transformers* and *datasets*.

In [1]:
%%capture
! SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True pip install ferret-xai

In [2]:
#!pip install torch
#!pip install transformers
#!pip install datasets
#!pip install protobuf
#!pip install sentencepiece

### Sentiment classification

In this lab, we use a pre-trained sentiment classification model named `cardiffnlp/twitter-xlm-roberta-base-sentiment`.

The model is already trained, so we can use it directly to make predictions.
This classifier labels each input text with the predicted sentiment: positive, negative, or neutral.
We are going to use this model as a black box. We will explain its behavior on individual predictions and evaluate the explanations.

Bonus point: repeat the analysis with the model used in Lab 6, `grecosalvatore/binary-toxicity-BERT-xai-course`.

We import the model from Hugging Face as follows:

In [3]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

name = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(name, model_max_length=512)
model = AutoModelForSequenceClassification.from_pretrained(name).to(device)



In [4]:
model.config.id2label

{0: 'negative', 1: 'neutral', 2: 'positive'}

We load ferret's main API access point: the `Benchmark` class. The class requires the model and tokenizer in use.

In [5]:
from ferret import Benchmark
bench = Benchmark(model, tokenizer)

Inference with ferret

In [6]:
txt = "I love your style!"

In [7]:
bench.score(txt, return_dict=False)

tensor([0.0127, 0.0616, 0.9257])

In [8]:
p_idx = bench.score(txt, return_dict=False).argmax(-1).item()
print("Text:", txt)
print("Predicted sentiment: ", model.config.id2label[p_idx])

Text: I love your style!
Predicted sentiment:  positive
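
The step from class probabilities to a label can be reproduced without loading the model. The sketch below hard-codes the probabilities and the `id2label` mapping printed above; it only illustrates the argmax step and makes no claim about what `bench.score` does internally.

```python
# Probabilities copied from the bench.score output above (not recomputed here).
scores = [0.0127, 0.0616, 0.9257]
id2label = {0: "negative", 1: "neutral", 2: "positive"}

# argmax: index of the largest probability
p_idx = max(range(len(scores)), key=lambda i: scores[i])
predicted = id2label[p_idx]
print(p_idx, predicted)
```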


### Post-Hoc Explanation

We use all supported post-hoc explainers to explain the model prediction.
Each explainer provides feature importance scores that quantify how *large* the contribution of each token to a target class was.

In [9]:
txt = "I love your style!"
p_idx = bench.score(txt, return_dict=False).argmax(-1).item()

explanations = bench.explain(txt, target=p_idx)


Visualize the results.

In [10]:
t = bench.show_table(explanations)
t

| Token | ▁I | ▁love | ▁your | ▁style | ! |
|---|---|---|---|---|---|
| Partition SHAP | 0.01 | 0.42 | 0.28 | 0.22 | 0.06 |
| LIME | 0.01 | 0.38 | 0.13 | 0.2 | 0.28 |
| Gradient | 0.08 | 0.17 | 0.12 | 0.35 | 0.07 |
| Gradient (x Input) | -0.02 | 0.08 | -0.32 | 0.22 | 0.08 |
| Integrated Gradient | -0.21 | 0.03 | 0.11 | -0.23 | 0.03 |
| Integrated Gradient (x Input) | 0.01 | 0.23 | 0.08 | 0.03 | 0.66 |
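
One quick way to read the table is to check which token each explainer ranks first. The sketch below copies the scores from the table by hand (it does not call ferret) and picks the highest-scoring token per explainer; the helper `top_token` is introduced here only for illustration.

```python
tokens = ["▁I", "▁love", "▁your", "▁style", "!"]

# Attribution scores copied from the table above.
attributions = {
    "Partition SHAP": [0.01, 0.42, 0.28, 0.22, 0.06],
    "LIME": [0.01, 0.38, 0.13, 0.2, 0.28],
    "Gradient": [0.08, 0.17, 0.12, 0.35, 0.07],
    "Gradient (x Input)": [-0.02, 0.08, -0.32, 0.22, 0.08],
    "Integrated Gradient": [-0.21, 0.03, 0.11, -0.23, 0.03],
    "Integrated Gradient (x Input)": [0.01, 0.23, 0.08, 0.03, 0.66],
}

def top_token(scores):
    """Return the token with the highest (signed) attribution score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return tokens[best]

top_tokens = {name: top_token(s) for name, s in attributions.items()}
for name, tok in top_tokens.items():
    print(f"{name:32s} {tok}")
```

On these numbers the explainers disagree about the most important token, which is one reason to benchmark them rather than trust a single method.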


## Exercises

### Evaluating the Faithfulness of each explanation


We evaluate the faithfulness of the explanations using the function [`evaluate_explanations`](https://ferret.readthedocs.io/en/latest/api/api/ferret.Benchmark.evaluate_explanations.html#ferret.Benchmark.evaluate_explanations).

It requires:
* the explanations to evaluate;
* the `target` parameter, i.e., the class the explanations were computed for.

It computes the faithfulness of each explanation for all supported measures.

Currently, it supports the following faithfulness metrics:

*   **AOPC Comprehensiveness**. Comprehensiveness measures the drop in the model probability when the relevant tokens of the explanation are removed. We measure comprehensiveness via the Area Over the Perturbation Curve (AOPC): we progressively remove the $k$ most important tokens, with $k$ from 1 to #tokens (by default), and average the results. The higher the value, the better the explainer is at selecting the tokens relevant for the prediction.

*   **AOPC Sufficiency**. Sufficiency captures whether the tokens in the explanation are sufficient for the model to make the prediction. As for comprehensiveness, we use the AOPC score.

*   **Correlation with Leave-One-Out scores**. We first compute the leave-one-out scores, i.e., the difference in the prediction when one token at a time is omitted. We then measure the Spearman correlation between these scores and the explanation.
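
The three metrics above can be illustrated on a toy problem. The sketch below is a simplified illustration, not ferret's implementation: `predict` and `WEIGHTS` are a hypothetical stand-in for the classifier, the two explanations are made up, and the rank-based Spearman formula assumes there are no ties.

```python
import math

# Hypothetical toy "model": per-token weights for the positive class.
WEIGHTS = {"I": 0.0, "love": 2.0, "your": 0.2, "style": 0.8, "!": 0.4}

def predict(tokens):
    """Toy positive-class probability: sigmoid of the summed token weights."""
    z = sum(WEIGHTS[t] for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))

def aopc(tokens, scores, keep_top):
    """Average over k of p(full) - p(perturbed).

    keep_top=False removes the top-k tokens (comprehensiveness);
    keep_top=True keeps only the top-k tokens (sufficiency).
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    p_full = predict(tokens)
    diffs = []
    for k in range(1, len(tokens) + 1):
        top = set(order[:k])
        kept = [t for i, t in enumerate(tokens) if (i in top) == keep_top]
        diffs.append(p_full - predict(kept))
    return sum(diffs) / len(diffs)

def leave_one_out(tokens):
    """Prediction drop when each token is omitted in turn."""
    p_full = predict(tokens)
    return [p_full - predict(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

def spearman(a, b):
    """Spearman correlation via rank differences (valid only without ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

tokens = ["I", "love", "your", "style", "!"]
good = [0.0, 0.5, 0.05, 0.3, 0.15]   # made-up explanation that ranks tokens like the toy model
bad = [0.5, 0.01, 0.3, 0.05, 0.15]   # made-up explanation with the reverse ranking

comp_good, comp_bad = aopc(tokens, good, False), aopc(tokens, bad, False)
suff_good, suff_bad = aopc(tokens, good, True), aopc(tokens, bad, True)
loo = leave_one_out(tokens)
corr_good, corr_bad = spearman(good, loo), spearman(bad, loo)

print(f"comprehensiveness: good={comp_good:.3f} bad={comp_bad:.3f}")
print(f"sufficiency:       good={suff_good:.3f} bad={suff_bad:.3f}")
print(f"LOO correlation:   good={corr_good:.1f} bad={corr_bad:.1f}")
```

In this toy setting the explanation that matches the model gets higher comprehensiveness, lower sufficiency, and a higher leave-one-out correlation than the reversed one.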

In [11]:
# Compute the faithfulness of the explanations using the evaluate_explanations function 

# .. = bench.evaluate_explanations(.., target=..)

Visualize the results in a tabular format. We can use the [`show_evaluation_table`](https://ferret.readthedocs.io/en/latest/api/api/ferret.Benchmark.show_evaluation_table.html) API: it receives the list of evaluated explanations and displays their scores as a table.

In [12]:
# Use the show_evaluation_table function to display the results

#bench.show_evaluation_table(..)

Note that:
- comprehensiveness goes from 0 to 1 (best);
- sufficiency goes from 0 (best) to 1;
- correlation with leave-one-out goes from -1 to 1 (best).



### Evaluating the Plausibility of each explanation


We evaluate the plausibility of the explanations, again using the function [`evaluate_explanations`](https://ferret.readthedocs.io/en/latest/api/api/ferret.Benchmark.evaluate_explanations.html#ferret.Benchmark.evaluate_explanations).

Plausibility quantifies how well an explanation agrees with a human-annotated rationale, i.e., the set of tokens that we as humans consider important.

ferret currently supports the following plausibility measures:
Area Under the Precision-Recall Curve (`auprc_plau`), token-level F1 score (`token_f1_plau`), and average Intersection-Over-Union at the token level (`token_iou_plau`).

* **Area Under the Precision-Recall Curve (AUPRC) - `auprc_plau`** is computed by sweeping a threshold over the token scores.

* Token-level F1 score and average Intersection-Over-Union consider discrete rationales. We derive a discrete rationale by taking the top-k values; in this example, k is set to 1.

* **Token-level F1 score - `token_f1_plau`** is the F1 score derived from the token-level precision and recall.

* **Intersection-Over-Union (IOU) - `token_iou_plau`** is the size of the overlap between the tokens covered by the explanation and those covered by the human rationale, divided by the size of their union.
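
The two discrete measures can be sketched in a few lines. This is a minimal illustration, not ferret's code: the helper names are introduced here, the scores are copied from the explanation table earlier in the notebook, and the rationale marking only "love" is an assumed example. AUPRC is left out to keep the sketch dependency-free.

```python
def top_k_rationale(scores, k=1):
    """Binary mask marking the k highest-scoring tokens."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(scores))]

def token_f1(pred, gold):
    """Token-level F1 between two binary masks."""
    tp = sum(p and g for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / sum(pred)
    recall = tp / sum(gold)
    return 2 * precision * recall / (precision + recall)

def token_iou(pred, gold):
    """Intersection-over-union between two binary masks."""
    inter = sum(p and g for p, g in zip(pred, gold))
    union = sum(p or g for p, g in zip(pred, gold))
    return inter / union if union else 0.0

# Tokens: ▁I, ▁love, ▁your, ▁style, !
gold = [0, 1, 0, 0, 0]                     # assumed rationale: only "love"
shap = [0.01, 0.42, 0.28, 0.22, 0.06]      # Partition SHAP row of the table
gradient = [0.08, 0.17, 0.12, 0.35, 0.07]  # Gradient row of the table

shap_top1 = top_k_rationale(shap, k=1)
grad_top1 = top_k_rationale(gradient, k=1)
grad_top2 = top_k_rationale(gradient, k=2)

print("SHAP k=1:    ", token_f1(shap_top1, gold), token_iou(shap_top1, gold))
print("Gradient k=1:", token_f1(grad_top1, gold), token_iou(grad_top1, gold))
print("Gradient k=2:", token_f1(grad_top2, gold), token_iou(grad_top2, gold))
```

With k set to 1, an explanation scores either 0 or 1 on this single-token rationale; larger k trades precision for recall.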


To evaluate the plausibility we need to:

- Specify which tokens we expect to be salient, i.e., highlighted by an explanation.
- Provide the human_rationale to the [`evaluate_explanations`](https://ferret.readthedocs.io/en/latest/api/api/ferret.Benchmark.evaluate_explanations.html#ferret.Benchmark.evaluate_explanations) function.  
We will get the plausibility scores of each explanation with respect to the provided rationales.

In [13]:
print("Tokens: ", tokenizer.tokenize(txt))

Tokens:  ['▁I', '▁love', '▁your', '▁style', '!']


### 1. Specify the human rationale

Specify the tokens that we as humans consider important for the prediction. In this case, we can consider the token "love".

Specify them as a dictionary whose keys are the tokens and whose values are 1 if the token is important and 0 otherwise.
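
As an alternative to writing the values by hand, a rationale aligned with the tokens can be derived from a set of words. The sketch below assumes the token list printed above and a hypothetical `important_words` set; it is plain Python and does not use ferret.

```python
tokens = ["▁I", "▁love", "▁your", "▁style", "!"]  # output of tokenizer.tokenize(txt)
important_words = {"love"}  # hypothetical choice of salient words

# "▁" (U+2581) is the SentencePiece word-boundary marker; strip it before comparing.
rationale = [1 if tok.lstrip("▁") in important_words else 0 for tok in tokens]
print(rationale)
```

The resulting list has one entry per token, in token order, which is the shape passed as `human_rationale` later in this section.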

In [14]:
# Define the hand-crafted rationale
# Substitute XYZ with 1 if the token is part of the rationale and 0 otherwise
# human_rationale = {"▁I": XYZ, "▁love": XYZ, "▁your": 0, "▁style": XYZ, "!": XYZ}


### 2. Evaluate the plausibility of the explanations with respect to the human rationale

Pass the `human_rationale` parameter to the `evaluate_explanations` function to evaluate the plausibility of the explanations with respect to the human rationale.

In [15]:
# Compute also the plausibility of the explanations using the evaluate_explanations function and specify the human_rationale

# evaluations = bench.evaluate_explanations(<explanations>, target=<target class>, human_rationale=list(human_rationale.values()))

Visualize the evaluation results using again the [`show_evaluation_table`](https://ferret.readthedocs.io/en/latest/api/api/ferret.Benchmark.show_evaluation_table.html) method.

In [16]:
# Use the show_evaluation_table function to display the results

# table = bench.show_evaluation_table(evaluations)
# table

## Bonus - Use individual explainers and metrics

We can

- use *ferret* built-in explainers to have fine-grained control over their *init* and *call* parameters (please refer to our [doc](https://ferret.readthedocs.io/en/latest/?version=latest) to know more)
- compute individual faithfulness and plausibility metrics over explanations

**Interface to individual explainers**

You can also use individual explainers through an object-oriented interface.

In [17]:
from ferret import SHAPExplainer, LIMEExplainer

In [18]:
exp = LIMEExplainer(model, tokenizer)
e1 = exp("hello my friend", target = 2)
e1

Explanation(text='hello my friend', tokens=['<s>', '▁hell', 'o', '▁my', '▁friend', '</s>'], scores=array([0.        , 0.04224412, 0.03824117, 0.04593733, 0.09091682,
       0.        ]), explainer='LIME', target=2)

In [19]:
exp = SHAPExplainer(model, tokenizer)
e2 = exp("hello my friend", target = 2)
e2

Explanation(text='hello my friend', tokens=['<s>', '▁hell', 'o', '▁my', '▁friend', '</s>'], scores=array([ 7.45058060e-09, -8.16055015e-03,  3.67656909e-02,  6.23211376e-02,
        1.20165404e-01,  3.72529030e-08]), explainer='Partition SHAP', target=2)

In [20]:
import copy

import numpy as np


def lp_normalize_explanation(explanation, ord=1):
    """Run Lp-normalization of explanation attribution scores

    Args:
        explanation (Explanation): explanation to normalize
        ord (int, optional): order of the norm. Defaults to 1.

    Returns:
        Explanation: normalized explanation
    """
    explanation_norm = copy.copy(explanation)
    norm_axis = -1 if explanation_norm.scores.ndim == 1 else (0, 1)
    norm = np.linalg.norm(explanation_norm.scores, axis=norm_axis, ord=ord)
    if norm != 0:  # avoid division by zero
        # Assign a new array: the shallow copy shares `scores` with the input,
        # so an in-place `/=` would also modify the original explanation.
        explanation_norm.scores = explanation_norm.scores / norm
    return explanation_norm

In [21]:
# We can normalize the explanations using Lp-norm to make them comparable

bench.show_table([lp_normalize_explanation(e1), lp_normalize_explanation(e2)])

| Token | ▁hell | o | ▁my | ▁friend |
|---|---|---|---|---|
| LIME | 0.19 | 0.18 | 0.21 | 0.42 |
| Partition SHAP | -0.04 | 0.16 | 0.27 | 0.53 |
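
The normalized values can be checked by hand. The sketch below copies the raw scores from the two `Explanation` outputs printed earlier, drops the `<s>` and `</s>` entries (their scores are zero or negligible, so they do not change the norm at this precision), and divides by the L1 norm.

```python
tokens = ["▁hell", "o", "▁my", "▁friend"]

# Raw scores copied from the Explanation outputs, without <s> and </s>.
lime_scores = [0.04224412, 0.03824117, 0.04593733, 0.09091682]
shap_scores = [-8.16055015e-03, 3.67656909e-02, 6.23211376e-02, 1.20165404e-01]

def l1_normalize(scores):
    """Divide each score by the sum of absolute values (L1 norm)."""
    norm = sum(abs(s) for s in scores)
    return [s / norm for s in scores]

lime_norm = [round(s, 2) for s in l1_normalize(lime_scores)]
shap_norm = [round(s, 2) for s in l1_normalize(shap_scores)]

print(dict(zip(tokens, lime_norm)))
print(dict(zip(tokens, shap_norm)))
```

After normalization the absolute values in each row sum to 1, which puts explainers with different raw scales on a common footing.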


We can also evaluate them with an individual evaluation measure.

In [22]:
from ferret import AOPC_Comprehensiveness_Evaluation
from ferret.evaluators import ModelHelper

aopc_compr_eval = AOPC_Comprehensiveness_Evaluation(model, tokenizer)

In [23]:
ev1 = aopc_compr_eval.compute_evaluation(e1, target = 0)
ev1

Evaluation(name='aopc_compr', score=-0.1909705)
