# **Lab 6b - Explainable and Trustworthy AI**


---



**Teaching Assistant**: *Salvatore Greco*

**DISCLAIMER**: *This lab contains examples of offensive language*.

## **Lab 6b:** Explainable Natural Language Processing (NLP) 

In this lab, you will use *post-hoc* and *feature-based* explainability techniques to explain the **binary toxicity prediction BERT classifier** trained in lab 6a. <br>
You will also use a library to visualize the internal attention values in the model.

First, install the required libraries. If any of them is missing from your environment, uncomment the corresponding lines in the next cells and run them.


In [None]:
#!pip install transformers
#!pip install datasets
#!pip install accelerate -U

In [None]:
#!pip install shap

In [None]:
#!pip install -U ferret-xai

In [None]:
#!pip install bertviz
#!pip install jupyterlab
#!pip install ipywidgets

Run the next cell to import the required libraries for this lab.

In [None]:
# Import the required libraries for this lab
from datasets import load_dataset

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import pipeline, utils
import transformers


import pandas as pd

import shap
from ferret import Benchmark

Run the following command to check GPU utilization, memory usage, and availability.
If the command outputs information about your GPU, it means the GPU is available.
If instead the command returns an error or no information, the GPU might not be available or there might be a configuration issue.

Note that a GPU is highly recommended for training (fine-tuning) transformer models.<br> However, you can also complete this lab without a GPU, since you only need to perform inference and compute explanations.

In [None]:
!nvidia-smi
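
If you prefer to run the same check from Python, the following is a minimal sketch that uses only the standard library. The helper name `gpu_status` is hypothetical and is not part of the lab material.

```python
import shutil
import subprocess


def gpu_status() -> str:
    """Return the output of `nvidia-smi`, or a short message if it is unavailable."""
    # `nvidia-smi` is on the PATH only when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found: no NVIDIA GPU driver is visible."
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    if result.returncode != 0:
        return f"nvidia-smi failed: {result.stderr.strip()}"
    return result.stdout


print(gpu_status())
```

A missing GPU is not a blocker here: inference and explanations also run on CPU, only more slowly.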

---

## **Exercise 1: Load the dataset and model**

### **Exercise 1.1**: Load dataset

Firstly, you will load the dataset using the HuggingFace [Datasets](https://huggingface.co/docs/datasets/index) library. You will load the same dataset of publicly available Wikipedia comments annotated for several aspects of toxicity ([Jigsaw Toxic Comments](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data)).

Remember that you can load the dataset using the [load_dataset](https://huggingface.co/docs/datasets/loading) function of the `Datasets` library.

The dataset of Wikipedia comments annotated for toxicity is available on HuggingFace [here](https://huggingface.co/datasets/google/jigsaw_toxicity_pred).<br> However, in this particular case, you must also have the files in a local folder and specify that folder in the `load_dataset` function.

Make sure you have a local folder with the following tree structure:
```
.
└── jigsaw_toxicity_pred/
    ├── test.csv
    ├── test_labels.csv
    └── test_pred.csv
```
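
Before loading the dataset, you can verify that the folder matches the tree above. This is a minimal sketch that assumes the folder sits in the current working directory; the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the tree structure above.
EXPECTED_FILES = ("test.csv", "test_labels.csv", "test_pred.csv")


def missing_files(folder: str = "jigsaw_toxicity_pred") -> list:
    """Return the expected file names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_files()
if missing:
    print(f"Missing files in jigsaw_toxicity_pred/: {missing}")
else:
    print("All expected files are in place.")
```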

You only need to load the **test** set for this lab, so you do not need to upload the training set.
If you only upload the test set, make sure to specify it when loading the dataset.

Replace `None` with your code.

In [None]:
#!ls

In [None]:
#!unzip -o jigsaw_toxicity_pred.zip

In [None]:
#### START CODE HERE (Replace None with your code) ####

# Load the jigsaw toxicity prediction dataset
dataset = None

#### END CODE HERE ####

In [None]:
dataset

In [None]:
# Dictionary that maps the label id to the label name
id2label = {0: "Non-Toxic", 1: "Toxic"} 

# Dictionary that maps the label name to the label id
label2id = {"Non-Toxic": 0, "Toxic": 1}

label_names = ['Non-Toxic', 'Toxic']

### **Exercise 1.2**: Load the predicted labels

In the `test_pred.csv` file, you are also provided with the **labels predicted on the test set** by the fine-tuned BERT model.

Now load the predicted labels into a Pandas DataFrame.

Replace `None` with your code.

In [None]:
#### START CODE HERE (Replace None with your code) ####

df_test_pred = None

#### END CODE HERE ####

In [None]:
df_test_pred.head()

### **Exercise 1.3**: Add the predicted columns to the dataset

Add a new column `'pred_label'` to `dataset`, containing the predicted label ids.

You can find the method to use in the [Datasets features](https://huggingface.co/docs/datasets/about_dataset_features).

In [None]:
#### START CODE HERE (~ 2 lines) ####


#### END CODE HERE ####

In [None]:
dataset

### **Exercise 1.4**: Select correctly classified and misclassified toxic comments

Select the following subsets of the dataset:
- `toxic_comments_dataset`: All the texts predicted as toxic.
- `misclassified_toxic_comments_dataset`: All the texts that were non-toxic but are predicted as toxic (i.e., misclassified toxic comments).
- `correctly_classified_toxic_comments_dataset`: All the texts that were toxic and are predicted as toxic (i.e., correctly classified toxic comments).

You can find the method to use in the [Datasets process](https://huggingface.co/docs/datasets/process) methods.
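
The three subsets differ only in the condition applied to the `'toxic'` and `'pred_label'` columns. The sketch below expresses each condition as a predicate on a single example and applies it to a few hypothetical toy rows with Python's built-in `filter`. A function that takes one example dictionary and returns a boolean is also the kind of callable that `Dataset.filter` accepts, so the same predicates should carry over; the toy rows and function names are assumptions made for illustration.

```python
# Toy rows with the two columns used in this lab:
# 'toxic' is the original label id, 'pred_label' is the predicted label id.
rows = [
    {"comment_text": "comment a", "toxic": 1, "pred_label": 1},
    {"comment_text": "comment b", "toxic": 0, "pred_label": 1},
    {"comment_text": "comment c", "toxic": 0, "pred_label": 0},
    {"comment_text": "comment d", "toxic": 1, "pred_label": 0},
]


def is_predicted_toxic(example):
    """True for every text predicted as toxic."""
    return example["pred_label"] == 1


def is_misclassified_toxic(example):
    """True for non-toxic texts that are predicted as toxic."""
    return example["toxic"] == 0 and example["pred_label"] == 1


def is_correctly_classified_toxic(example):
    """True for toxic texts that are predicted as toxic."""
    return example["toxic"] == 1 and example["pred_label"] == 1


predicted_toxic = list(filter(is_predicted_toxic, rows))
misclassified_toxic = list(filter(is_misclassified_toxic, rows))
correctly_classified_toxic = list(filter(is_correctly_classified_toxic, rows))

print(len(predicted_toxic), len(misclassified_toxic), len(correctly_classified_toxic))
```

Note that the last two subsets partition the first one: every text predicted as toxic is either a correct prediction or a misclassification.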

Replace `None` with your code.

In [None]:
#### START CODE HERE (Replace None with your code) ####

toxic_comments_dataset = None
misclassified_toxic_comments_dataset = None
correctly_classified_toxic_comments_dataset = None

#### END CODE HERE ####

In [None]:
misclassified_toxic_comments_dataset

In [None]:
correctly_classified_toxic_comments_dataset

In [None]:
text_id = 0

print(f"Original label: {id2label[misclassified_toxic_comments_dataset['toxic'][text_id]]}")
print(f"Predicted label: {id2label[misclassified_toxic_comments_dataset['pred_label'][text_id]]}")

In [None]:
text_id = 0

print(f"Original label: {id2label[correctly_classified_toxic_comments_dataset['toxic'][text_id]]}")
print(f"Predicted label: {id2label[correctly_classified_toxic_comments_dataset['pred_label'][text_id]]}")

### **Exercise 1.5**: Load the model and tokenizer

Load the fine-tuned BERT model (available at `"grecosalvatore/binary-toxicity-BERT-xai-course"`) and the tokenizer. <br>
Set the correct maximum sequence length (i.e., 256).

Replace `None` with your code.

In [None]:
#### START CODE HERE (Replace None with your code) ####

model_name = None
tokenizer = None
model = None

#### END CODE HERE ####

### **Exercise 1.6**: Make predictions

Write a new input text, and use the model to **predict** the **toxicity label** for that text.
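
A sequence classification model returns one raw score (logit) per class. To obtain a label, you normalize the logits with a softmax and take the class with the highest probability. The sketch below shows only this last step on hypothetical logit values, without running the model; `logits_to_label` is an illustrative helper name.

```python
import math

# Same mapping as defined earlier in this lab.
id2label = {0: "Non-Toxic", 1: "Toxic"}


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_label(logits):
    """Return the predicted label name and its probability."""
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    return id2label[pred_id], probs[pred_id]


# Hypothetical logits for [Non-Toxic, Toxic]; a real model would produce these.
label, score = logits_to_label([-1.2, 2.3])
print(label, round(score, 3))
```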

In [None]:
#### START CODE HERE ####



---

## **Exercise 2: Local-explanations with SHAP**

[SHAP](https://arxiv.org/pdf/1705.07874) (SHapley Additive exPlanations) is a framework for interpreting machine learning model predictions. It uses concepts from cooperative game theory (Shapley values) to fairly distribute a prediction among the input features, indicating whether each feature pushes the prediction up or down. This provides a consistent measure of feature importance across different models.
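
To make the game-theoretic idea concrete, the sketch below computes **exact** Shapley values for a three-token input by enumerating every coalition of tokens. The scoring function `toy_toxicity_score` and its weights are hypothetical stand-ins for the model. The SHAP library does not enumerate coalitions like this on real inputs; it relies on approximations, because the number of coalitions grows exponentially with the number of tokens.

```python
from itertools import combinations
from math import factorial


def toy_toxicity_score(present_tokens):
    """Hypothetical toxicity score of a text containing only `present_tokens`."""
    score = 0.0
    if "stupid" in present_tokens:
        score += 0.6
    if "you" in present_tokens:
        score += 0.1
    # Interaction: the insult is more toxic when aimed at someone.
    if "you" in present_tokens and "stupid" in present_tokens:
        score += 0.2
    return score


def shapley_values(tokens, value_fn):
    """Exact Shapley value of each token (assumes tokens are unique)."""
    n = len(tokens)
    values = {}
    for token in tokens:
        others = [t for t in tokens if t != token]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of `token` when it joins the coalition.
                phi += weight * (value_fn(coalition | {token}) - value_fn(coalition))
        values[token] = phi
    return values


tokens = ["you", "are", "stupid"]
phi = shapley_values(tokens, toy_toxicity_score)
for token in tokens:
    print(f"{token}: {phi[token]:+.3f}")
```

The contributions add up to the difference between the score of the full text and the score of the empty text, and the interaction bonus is split equally between the two tokens that produce it.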

Firstly, you will use SHAP to explain the BERT model already fine-tuned for binary toxicity prediction.

The official SHAP implementation is available at the following [GitHub](https://github.com/shap/shap) repository and [Docs](https://shap.readthedocs.io/en/latest/). 



### **Exercise 2.1**: Compute the Local-explanations of correctly classified examples with SHAP

Compute and visualize the **local-explanations** of the first 5 comments **correctly classified** as toxic using SHAP.

In [None]:
#### START CODE HERE ####



### **Exercise 2.2**: Compute the Local-explanations of misclassified examples with SHAP

Now, compute and visualize the **local-explanations** of the first 5 comments **misclassified** as toxic using SHAP.


In [None]:
#### START CODE HERE ####



---

## **Exercise 3: Global-explanations with SHAP**

SHAP can also provide global-explanations (or global insights) about the model by aggregating many local-explanations.
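
The sketch below illustrates what aggregating means, using hypothetical per-token contributions for three comments. It groups the contributions by token and combines them with either the **mean** or the **sum**. It is a simplified illustration in plain Python, not the SHAP library's own implementation; the `aggregate` helper and the scores are assumptions.

```python
from collections import defaultdict

# Hypothetical local explanations: one list of (token, contribution) pairs per comment.
local_explanations = [
    [("you", 0.10), ("idiot", 0.70)],
    [("what", 0.00), ("an", 0.05), ("idiot", 0.60)],
    [("you", 0.20), ("are", -0.05), ("wrong", 0.10)],
]


def aggregate(explanations, how="mean"):
    """Combine per-token contributions across comments with the mean or the sum."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for explanation in explanations:
        for token, contribution in explanation:
            totals[token] += contribution
            counts[token] += 1
    if how == "sum":
        return dict(totals)
    if how == "mean":
        return {token: totals[token] / counts[token] for token in totals}
    raise ValueError(f"Unknown aggregation: {how}")


mean_scores = aggregate(local_explanations, how="mean")
sum_scores = aggregate(local_explanations, how="sum")

for token in sorted(sum_scores, key=sum_scores.get, reverse=True):
    print(f"{token}: mean={mean_scores[token]:+.2f} sum={sum_scores[token]:+.2f}")
```

The two aggregations answer different questions: the mean highlights tokens that contribute strongly whenever they appear, while the sum also rewards tokens that appear often.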

### **Exercise 3.1**: Compute the Global-explanations of correctly classified examples with SHAP

Compute and visualize the **global-explanations** by aggregating the first 25 comments **correctly classified** as toxic using SHAP.<br> Visualize it by aggregating the local score with the **mean** of the individual contributions.

***(if it takes too long, reduce the number of input sentences)***

In [None]:
#### START CODE HERE ####



### **Exercise 3.2**: Compute the Global-explanations of correctly classified examples with SHAP (sum aggregation)

Now, visualize the same explanation by aggregating the local score with the **sum** of the individual contributions.

In [None]:
#### START CODE HERE ####



### **Exercise 3.3**: Compute the Global-explanations of misclassified examples with SHAP

Compute and visualize the **global-explanations** by aggregating the first 25 comments **misclassified** as toxic using SHAP. 

Visualize it by aggregating the local score first with the **mean** and then with the **sum** of the individual contributions.

***(if it takes too long, reduce the number of input sentences)***

In [None]:
#### START CODE HERE ####



---

## **Exercise 4: Local-explanations using Ferret**

There are several libraries that provide implementations of many XAI techniques for the NLP domain, such as [Ferret](https://github.com/g8a9/ferret), [Captum](https://github.com/pytorch/captum), [Alibi Explain](https://docs.seldon.io/projects/alibi/en/latest/).

In this exercise, you will use the [Ferret](https://github.com/g8a9/ferret) library implementation of the SHAP, LIME, and Integrated Gradients techniques.

You can read the official Ferret [Docs](https://ferret.readthedocs.io/en/latest/) and [Paper](https://aclanthology.org/2023.eacl-demo.29.pdf).
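
Unlike SHAP and LIME, which perturb the input, Integrated Gradients is gradient-based: it accumulates the gradient of the output along a straight path from a baseline input to the actual input. The sketch below applies the idea to a hypothetical three-feature logistic model with made-up weights, approximating the path integral with a midpoint Riemann sum. For BERT, the technique is applied to token embeddings rather than to three scalar features, so treat this only as an illustration of the principle.

```python
import math

# Hypothetical logistic model: f(x) = sigmoid(w . x + b).
WEIGHTS = [2.0, -1.0, 0.5]
BIAS = -0.5


def model(x):
    """Model output (a probability) for the feature vector `x`."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic gradient of the model output with respect to each feature."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    avg_grads = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between the baseline and the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            avg_grads[i] += g / steps
    # Scale the average gradient by the distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grads)]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

print("attributions:", [round(a, 4) for a in attributions])
print("sum of attributions:", round(sum(attributions), 4))
print("f(x) - f(baseline):", round(model(x) - model(baseline), 4))
```

The last two printed values should agree closely: the attributions of Integrated Gradients sum to the difference between the model output on the input and on the baseline.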

In [None]:
input_text = correctly_classified_toxic_comments_dataset["comment_text"][15]
input_text

### **Exercise 4.1**: Compute local-explanations with the SHAP, LIME, and Integrated Gradients algorithms using the Ferret library

Compute and visualize the **local-explanation** of the comment stored in the variable `input_text` using the [Ferret](https://github.com/g8a9/ferret) implementation of the SHAP, LIME, and Integrated Gradients techniques.

In [None]:
#### START CODE HERE ####



---

## **Exercise 5: Visualize attention scores with BertViz**

Some techniques **visualize attention scores** in Transformer language models to provide some insights into the internal model behavior.

In this exercise, you will use [BertViz](https://github.com/jessevig/bertviz), an interactive tool for visualizing attention in Transformer language models.

You can read the demonstration [Paper](https://aclanthology.org/P19-3007.pdf).
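
The attention scores shown by such tools are, for each attention head, a matrix with one row per token: row *i* contains the weights that token *i* assigns to every token in the input, and each row sums to 1. The sketch below computes these weights with scaled dot-product attention for three tokens, using hypothetical 2-dimensional query and key vectors. In BERT, these vectors are produced by learned projections, one set per head and per layer.

```python
import math


def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention: softmax(q . k / sqrt(d)) for each query."""
    dim = len(keys[0])
    weights = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights.append(softmax(scores))
    return weights


# Hypothetical query and key vectors for three tokens.
tokens = ["you", "are", "rude"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

weights = attention_weights(queries, keys)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```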

### **Exercise 5.1**: Visualize attention scores of the BERT model for an input text

Visualize the **attention scores** of BERT for the comment stored in the variable `input_text` using the [BertViz](https://github.com/jessevig/bertviz) library.

In [None]:
from bertviz import model_view, head_view
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

In [None]:
input_text = correctly_classified_toxic_comments_dataset["comment_text"][15]
input_text

In [None]:
#### START CODE HERE ####

