# Welcome 

Welcome to this lab on explainability at the winter school [ALPS 2021](http://lig-alps.imag.fr/)!

In this tutorial, we will lay the foundations of post-hoc explainability techniques and ways of evaluating them.

We've already prepared the modelling part for you in the **Modelling** section so that you can concentrate on exploring the explainability techniques.

As you already know from the accompanying lecture, *post-hoc explainability techniques* generate saliency maps over the input for an already trained model, thus indicating which parts of the input were most important for the prediction.

We'll review two types of explainability techniques in the **Explainability techniques** section and then we'll explore how those can be evaluated in the **Properties evaluation** section.

You will have the chance to *define explainability techniques on your own* as well as to see *what parameters of a model affect the performance of the above*. You will also take part in a *human-in-the-loop experiment* where you'll be able to use the explainability techniques to detect blind spots of the trained models.

For this notebook of the lab, we encourage you to work in groups, so that you can split the work and discuss the outcomes.

We also provide a [notebook](https://colab.research.google.com/drive/1-aZ9-Kzkb_BVb-8vcvHBAAYy2iBk0khV?usp=sharing) with **solutions**, which we encourage you to consult after every task to make sure you're on track.

## Set-up

This notebook can be run on Google Colab. One way to start working on the lab's notebook is to make your own copy of it (**File->Save a copy to Drive**).

First, make sure you've selected a GPU runtime from the menu: **Runtime -> Change Runtime Type**.

This lab notebook depends largely on external code from the explainability tutorial [repo](https://github.com/copenlu/ALPS_2021), which we'll download shortly.

In [None]:
# Execute this line first as you might have to restart the runtime after this
!pip install -U scikit-learn 

In [None]:
# magic commands to make sure changes to external packages are automatically loaded and plots are displayed in the notebook
# thus, if you make any change on the imported files with code, 
# you can upload(overwrite) them and re-import the package without restarting the kernel
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import torch
import os
import torchtext
import nltk
import argparse
import random
import numpy as np

from argparse import Namespace
from tqdm.notebook import tqdm
from torch.utils.data import DataLoader
from functools import partial

In [None]:
!pip install transformers
!pip install captum
nltk.download('punkt')
!git clone https://github.com/copenlu/ALPS_2021
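# `!export` only affects a temporary subshell, so add the source folder to the kernel's import path instead
import sys; sys.path.append('ALPS_2021/tutorial_src/')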

Collecting captum
  Downloading https://files.pythonhosted.org/packages/bf/27/e6d97c600cabc38b860ead2f6be243d819ff3d259bc3195b7d2ed943ba5d/captum-0.3.0-py3-none-any.whl (5.7MB)
Installing collected packages: captum
Successfully installed captum-0.3.0
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Cloning into 'ALPS_2021'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 50 (delta 6), reused 12 (delta 5), pack-reused 37
Unpacking objects: 100% (50/50), done.


In [None]:
from transformers import BertTokenizerFast
from ALPS_2021.tutorial_src.data_loaders import TwitterDataset, get_embeddings, EmbeddingsVocabTokenizer, collate_tweet
from ALPS_2021.tutorial_src.model_builders import get_model
from ALPS_2021.tutorial_src.training_utils import enforce_reproducibility, train_model, eval_model
from ALPS_2021.tutorial_src.args_utils import ALL_ARGUMENTS, get_model_args

# Modelling

## Arguments

In [None]:
args = ALL_ARGUMENTS
args['model'] = 'transformer' # cnn/rnn/transformer are possible models
args.update(get_model_args(ALL_ARGUMENTS['model']))

model_args = Namespace(**args)
enforce_reproducibility(seed=model_args.seed)

# 'train' trains a new model; to use one of the pre-trained models instead, set the mode to 'test' (see the next cell)
model_args.mode = 'train'
# check the folder for other models
model_args.epochs = 3
model_args

Namespace(batch_size=8, epochs=3, gpu=True, init_only=False, labels=3, lr=3e-05, mode='train', model='transformer', model_path='tweet_model', seed=73)

In [None]:
# in case you don't want to train a model, download one of the models and set the mode to test:
# download pre-trained lstm model:
#!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1KBc0v-Iin5CYWEcpOckUgrBwVW01BHhg&authuser' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1KBc0v-Iin5CYWEcpOckUgrBwVW01BHhg&authuser" -O tweet_model_lstm && rm -rf /tmp/cookies.txt

# download pre-trained cnn model:
#!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1jk2PzoULNwwLKPlyZ8RN8RtAj8zeHMTz&authuser' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1jk2PzoULNwwLKPlyZ8RN8RtAj8zeHMTz&authuser" -O tweet_model_cnn && rm -rf /tmp/cookies.txt

# download pre-trained transformer model:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1FZwDnsxgearvPo8XlqJ-zszfUGq_H8o_&authuser' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1FZwDnsxgearvPo8XlqJ-zszfUGq_H8o_&authuser" -O tweet_model_transformer && rm -rf /tmp/cookies.txt

model_args.mode = 'test'
model_args.model_path = 'tweet_model'  # the loading code appends f'_{model}', which yields 'tweet_model_transformer'

--2021-01-17 23:31:43--  https://docs.google.com/uc?export=download&confirm=I92t&id=1FZwDnsxgearvPo8XlqJ-zszfUGq_H8o_&authuser
Resolving docs.google.com (docs.google.com)... 74.125.195.101, 74.125.195.102, 74.125.195.139, ...
Connecting to docs.google.com (docs.google.com)|74.125.195.101|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-0k-74-docs.googleusercontent.com/docs/securesc/d7hum75s3huioouvps2pqmpvk6iapj9n/rnv1nmmsi4k591n7j337rq76u91shjsc/1610926275000/09251033333050931776/09755432298029098590Z/1FZwDnsxgearvPo8XlqJ-zszfUGq_H8o_?e=download [following]
--2021-01-17 23:31:43--  https://doc-0k-74-docs.googleusercontent.com/docs/securesc/d7hum75s3huioouvps2pqmpvk6iapj9n/rnv1nmmsi4k591n7j337rq76u91shjsc/1610926275000/09251033333050931776/09755432298029098590Z/1FZwDnsxgearvPo8XlqJ-zszfUGq_H8o_?e=download
Resolving doc-0k-74-docs.googleusercontent.com (doc-0k-74-docs.googleusercontent.com)... 173.194.203.132, 2607:f8b0:400e:c05::84


The arguments contain some training-specific parameters as well as **model hyper-parameters**. Later, if you want to change a hyper-parameter of a model and train it again, you can just change its value in the arguments.

In [None]:
# reviewing the parameters for a model:
get_model_args('cnn')

{'activation': 'relu',
 'batch_size': 64,
 'dropout': 0.05,
 'embedding_dim': 300,
 'epochs': 3,
 'in_channels': 1,
 'kernel_heights': [2, 3, 4],
 'lr': 0.001,
 'out_channels': 50,
 'padding': 0,
 'pooling': 'max',
 'stride': 1}

## Prepare Data

Downloading the embeddings for the CNN and LSTM models might take a while.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU if no GPU runtime is selected

if model_args.model == 'transformer':
  tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
  embeddings = None
else:
  embeddings, word_to_index = get_embeddings('glove', model_args.embedding_dim)
  tokenizer = EmbeddingsVocabTokenizer(word_to_index, {v: k for k, v in word_to_index.items()})

collate_fn = partial(collate_tweet,
                     tokenizer=tokenizer,
                     device=device,
                     return_attention_masks=model_args.model == 'trans',
                     pad_to_max_length=False,
                     return_seq_lens = True)

train = TwitterDataset(split='train')
dev = TwitterDataset(split='val')
train_dl = DataLoader(batch_size=model_args.batch_size, dataset=train, collate_fn=collate_fn, shuffle=True)
dev_dl = DataLoader(batch_size=model_args.batch_size, dataset=dev, collate_fn=collate_fn, shuffle=False)

## Load/Train Models

In [None]:
enforce_reproducibility(model_args.seed)
model, optimizer, scheduler = get_model(model_args, device, embeddings)

if model_args.mode == 'train':
  if model_args.init_only:
      best_model_w, best_perf = model.state_dict(), {'val_f1': 0}
  else:
      best_model_w, best_perf = train_model(model, train_dl, dev_dl, optimizer, scheduler, model_args.epochs)
      print('F1', best_perf['val_f1'])
      checkpoint = {
        'performance': best_perf,
        'args': vars(model_args),
        'model': best_model_w
      }
      torch.save(checkpoint, f'{model_args.model_path}_{args["model"]}')
      model.load_state_dict(best_model_w)
else:
  checkpoint = torch.load(f'{model_args.model_path}_{args["model"]}')
  model.load_state_dict(checkpoint["model"])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

# Interpretability Techniques

In this section we'll get familiar with two types of explainability approaches: gradient-based and perturbation-based. Most existing explainability approaches are available through the [captum](https://captum.ai/) package, which we'll also make use of.

## Gradient-Based

Gradient-based approaches compute a saliency map from the gradient of the output with respect to the input.

In Natural Language Processing, such approaches are harder to apply because the input layer, which receives discrete token ids, is not differentiable. To alleviate this, a common trick is to *patch the model so that it receives as input the differentiable embeddings* of the token ids.
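To make the trick concrete, here is a minimal sketch on a toy model. The `embedding` and `classifier` modules below are hypothetical stand-ins, not the lab's models: the point is only that gradients can be taken with respect to the float embeddings, whereas the integer token ids cannot carry a gradient.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins (hypothetical, not the lab's model): an embedding table and a linear classifier
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)
classifier = torch.nn.Linear(4, 3)

token_ids = torch.tensor([[1, 5, 7]])      # integer ids: no gradient can flow to them
input_embeddings = embedding(token_ids)    # float tensor of shape (1, 3, 4)

# Forward pass that starts from the embeddings instead of the token ids
logits = classifier(input_embeddings.mean(dim=1))  # shape (1, 3)

# Gradient of the logit of class 0 with respect to the embeddings
grads = torch.autograd.grad(logits[0, 0], input_embeddings)[0]
print(grads.shape)  # one gradient vector per token, same shape as the embeddings
```

In this toy model every token receives the same gradient, because the classifier only sees the mean of the embeddings; a real model would produce a different vector for each token.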

We configure the input needed for the saliency approaches by using the `get_embeddings_input_captum` and `get_tokens_input_captum` functions. The first returns the embeddings of the input instances' tokens, with respect to which the gradient can then be computed.

The baseline gradient-based technique (*Saliency*) takes the gradient of the output with respect to the input [(see paper)](https://arxiv.org/abs/1312.6034):

In [None]:
from captum.attr import Saliency
from typing import Callable, Union, Tuple, Any, List
from captum._utils.gradient import _run_forward

In [None]:
def compute_gradients(
    forward_fn: Callable,
    inputs: Union[torch.Tensor, Tuple[torch.Tensor, ...]],
    target_ind = None,
    additional_forward_args: Any = None,
) -> Tuple[torch.Tensor, ...]:
    r"""
    https://github.com/pytorch/captum/blob/45f3339b58bca9773e09273589db2e95298b33e4/captum/_utils/gradient.py#L94
    Computes gradients of the output with respect to inputs for an arbitrary forward function.
    Args:
        forward_fn: forward function. This can be for example model's forward function.
        input:      Input at which gradients are evaluated, will be passed to forward_fn.
        target_ind: Index of the target class for which gradients must be computed (classification only).
        additional_forward_args: Additional input arguments that forward function requires. It takes an empty tuple (no additional
                    arguments) if no additional arguments are required
    """
    with torch.autograd.set_grad_enabled(True):
        # runs forward pass, configures some specifics about the layers that require gradients
        outputs = _run_forward(forward_fn, inputs, target_ind, additional_forward_args)
        assert outputs[0].numel() == 1, (
            "Target not provided when necessary, cannot"
            " take gradient with respect to multiple outputs."
        )
        # torch.unbind(forward_out) is a list of scalar tensor tuples and
        # contains batch_size * #steps elements
        grads = torch.autograd.grad(torch.unbind(outputs), inputs)
    return grads

In [None]:
def get_embeddings_input_captum(model: torch.nn.Module, 
                         model_type: str, 
                         collate_fn: Callable, 
                         instances: List[Any], 
                         pad_token_id: int = None,
                         batch=None):
  if batch is None:
    batch = collate_fn(instances)
  token_ids = batch[0]
  sequence_lengths = batch[-1]

  if model_type == 'transformer':
    input_embeddings = model.transformer.bert.embeddings(token_ids)
  else:
    input_embeddings = model.embedding(token_ids)
  if model_type == 'lstm':
    additional_forward_args = (sequence_lengths,)
  elif model_type == 'cnn':
    additional_forward_args = None
  elif model_type == 'transformer':
    additional_forward_args = (token_ids != pad_token_id, )
  return input_embeddings, additional_forward_args

def get_tokens_input_captum(model_type: str, 
                         collate_fn: Callable, 
                         instances: List[Any], 
                         pad_token_id: int = None):
  batch = collate_fn(instances)
  token_ids = batch[0]
  sequence_lengths = batch[-1]

  if model_type == 'lstm':
    additional_forward_args = (sequence_lengths,)
  elif model_type == 'cnn':
    additional_forward_args = None
  elif model_type == 'transformer':
    additional_forward_args = (token_ids != pad_token_id, )
  return token_ids, additional_forward_args

In [None]:
input_embeddings, additional_forward_args = get_embeddings_input_captum(model, 
                                                                 model_args.model, 
                                                                 collate_fn,
                                                                 dev.dataset[:2],
                                                                 tokenizer.pad_token_id)

attributions = compute_gradients(model, 
                                 inputs=input_embeddings, 
                                 target_ind=0,
                                 additional_forward_args=additional_forward_args)
attributions[0]

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


tensor([[[-9.3983e-03,  1.1500e-03,  1.4011e-03,  ...,  3.3202e-03,
           7.9773e-03,  9.5162e-04],
         [ 2.8178e-03, -1.6889e-03, -8.3604e-03,  ...,  2.6754e-03,
          -4.5492e-04,  2.4837e-03],
         [ 2.9172e-03,  7.0193e-03, -4.9965e-03,  ..., -7.8309e-03,
           1.1074e-02,  2.3271e-03],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[-6.8152e-04,  3.0616e-04, -5.2241e-04,  ...,  1.3034e-04,
          -2.6643e-04,  1.1583e-05],
         [ 4.4404e-05, -7.2836e-04,  1.4775e-04,  ...,  2.0612e-03,
           1.7945e-04, -3.4629e-04],
         [ 2.6430e-04,  8.7656e-04,  2.1136e-04,  ...,  5.6390e-04,
          -8.8478e-04,  3.2502e-04],
         ...,
         [ 1.3351e-04,  1

We now have the attributions at the embedding layer, but these are not very informative on their own: we are looking for saliency maps at the word level.

### Task 1
Can you think of how the gradients of the embeddings can be used to produce **one saliency score** per word? What are the consequences of the different approaches?

*(check the solutions notebook for answers)*

You can aggregate the embedding gradients by taking the mean or the L2 norm of the vector. In our work ([see paper](https://www.aclweb.org/anthology/2020.emnlp-main.263.pdf)), we found that L2 norm aggregation works better, as the mean made the attributions more uniform and the information from the separate dimensions was lost.
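The difference between the two aggregations can be seen on a tiny made-up example. The helper name `aggregate_token_scores` is hypothetical and this is only one possible sketch, not necessarily the version in the solutions notebook:

```python
import torch

def aggregate_token_scores(embedding_grads: torch.Tensor, how: str = "l2") -> torch.Tensor:
    """Reduce (batch, tokens, embedding_dim) gradients to (batch, tokens) scores."""
    if how == "mean":
        return embedding_grads.mean(dim=-1)
    if how == "l2":
        return embedding_grads.norm(p=2, dim=-1)
    raise ValueError(f"unknown aggregation: {how}")

# The first token has large gradient components of opposite sign
toy_grads = torch.tensor([[[3.0, -3.0], [1.0, 1.0]]])
mean_scores = aggregate_token_scores(toy_grads, "mean")
l2_scores = aggregate_token_scores(toy_grads, "l2")
print(mean_scores)  # the first token cancels out to zero
print(l2_scores)    # the first token gets the larger score
```

With the mean, positive and negative components cancel, so a token with strong but mixed-sign gradients can look unimportant; the L2 norm keeps the magnitude.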

In [None]:
def summarize_attributions(attributions, type='mean'):
    # YOUR CODE HERE
    # reduce the embedding dimension (last axis) to one score per token
    raise NotImplementedError

In [None]:
summarized_saliency = summarize_attributions(attributions[0], type='l2').detach().cpu().numpy()
summarized_saliency

array([[2.4415631 , 4.2119226 , 4.190755  , 1.4050797 , 1.1381679 ,
        2.9951453 , 2.3997874 , 0.9750633 , 0.6543945 , 0.61226463,
        0.64614785, 0.6857956 , 1.2404237 , 0.61521757, 0.4497512 ,
        1.015996  , 0.5740527 , 0.9215991 , 0.99524164, 0.558179  ,
        0.9547747 , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.7337796 , 0.6752617 , 0.31475812, 0.35597217, 0.69898075,
        2.6614914 , 0.35795587, 0.6978358 , 0.1671758 , 0.15031673,
        0.5202161 , 0.20650557, 0.5049865 , 0.37323612, 0.17771149,
        0.29355967, 0.5930433 , 0.27872103, 0.45088464, 0.22605383,
        0.16704844, 0.30293605, 0.20800115, 0.13248089, 0.17115802,
        0.10573292, 0.09743957, 0.08023961, 0.10284176, 0.08306251,
        0.13217077, 0.18541744]], dtype=float32)

### Task 2
There are many approaches that improve over the baseline by accounting for the specifics of different layers and their back-propagation process. One of them, **Input X Gradient**, multiplies the gradient with the embeddings ([see paper](http://proceedings.mlr.press/v70/shrikumar17a.html)). It was proposed as a technique to improve the sharpness of the attribution maps.

Can you quickly modify the above approach to Input X Gradient and compare the differences between the two?
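As a hint, the core operation can be sketched on made-up tensors (the values below are invented for illustration and do not come from the lab's model):

```python
import torch

# Toy embeddings and gradients (made-up values), shape (batch, tokens, embedding_dim)
toy_embeddings = torch.tensor([[[0.5, -2.0], [0.1, 0.1]]])
toy_gradients = torch.tensor([[[1.0, 1.0], [4.0, -4.0]]])

# Input X Gradient: element-wise product of each embedding with its gradient
input_x_gradient = toy_embeddings * toy_gradients

# L2 norm per token, to compare with the plain gradient norms
gradient_scores = toy_gradients.norm(dim=-1)
inputx_scores = input_x_gradient.norm(dim=-1)
print(gradient_scores)  # the gradient alone favours the second token
print(inputx_scores)    # the product favours the first token
```

The ranking of tokens can change because the product also takes the magnitude of the embedding itself into account.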

In [None]:
gradients = compute_gradients(model, 
                              inputs=input_embeddings, 
                              target_ind=0,
                              additional_forward_args=additional_forward_args)[0]

attributions = # YOUR CODE HERE
summarized_inputx = summarize_attributions(attributions, type='l2').detach().cpu().numpy()
summarized_inputx

array([[0.4184738 , 1.9442306 , 2.1370826 , 0.57326627, 0.42939132,
        1.4509289 , 1.2539201 , 0.41862074, 0.29001504, 0.2806509 ,
        0.3043002 , 0.3496624 , 0.6471158 , 0.29934034, 0.15684427,
        0.53703433, 0.26753074, 0.38561398, 0.5373926 , 0.24556574,
        0.290793  , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        ],
       [0.12156984, 0.32546508, 0.12648456, 0.1506634 , 0.31653205,
        1.17974   , 0.15311517, 0.3511062 , 0.06942554, 0.06161044,
        0.26566064, 0.09083129, 0.24966733, 0.19913176, 0.07825722,
        0.10958775, 0.26920623, 0.1329355 , 0.25202718, 0.11766723,
        0.0648167 , 0.14561571, 0.09213648, 0.0571308 , 0.08479664,
        0.04895854, 0.04220755, 0.03769324, 0.05185024, 0.03864675,
        0.04872997, 0.05726779]], dtype=float32)

We can see that the scores from the Saliency approach have a larger standard deviation, but also a higher mean. Normalising the attribution scores first might help to compare them better.

In [None]:
for i in range(len(summarized_inputx)):
  print(np.mean(summarized_inputx[i]), 
        np.std(summarized_inputx[i]), 
        np.mean(summarized_saliency[i]), 
        np.std(summarized_saliency[i]))

0.41305542 0.538557 0.9275414 1.1308007
0.16845421 0.20347749 0.381468 0.45472315
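One simple way to normalise the scores, as suggested above, is to rescale every instance so that its absolute scores sum to 1. This is a sketch under that assumption (other choices, such as min-max scaling, are equally plausible), and `normalise_scores` is a hypothetical helper name:

```python
import numpy as np

def normalise_scores(scores: np.ndarray) -> np.ndarray:
    """Scale each row of per-token scores so that its absolute values sum to 1."""
    totals = np.abs(scores).sum(axis=-1, keepdims=True)
    # rows that are all zeros are left unchanged instead of dividing by zero
    return scores / np.where(totals == 0, 1.0, totals)

# Made-up scores for one instance from two methods with different scales
saliency_like = np.array([[2.0, 4.0, 2.0, 0.0]])
inputx_like = np.array([[0.5, 1.0, 0.5, 0.0]])

norm_saliency = normalise_scores(saliency_like)
norm_inputx = normalise_scores(inputx_like)
print(norm_saliency)
print(norm_inputx)  # identical to the line above once the scale is removed
```

After normalisation, differences between two methods reflect how they distribute importance over the tokens rather than the overall scale of their scores.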


## Local-Approximation Based

Another type of explainability technique perturbs the input to find which regions of it change the prediction the most. One such method is LIME ([see paper](https://arxiv.org/abs/1602.04938)), which builds a local linear approximator for each instance. It perturbs the input and tries to predict how the output changes with each local perturbation. The weights of the linear model for each token are used as saliency scores.

See this [book on interpretability](https://christophm.github.io/interpretable-ml-book/lime.html) for more information on LIME. It discusses some of the disadvantages of the approach, which are important to take into account: 

*   The correct definition of the neighborhood is a very big, unsolved problem.
*   Sampling could be improved in the current implementation of LIME. Data points are sampled from a Gaussian distribution, ignoring the correlation between features. This can lead to unlikely data points which can then be used to learn local explanation models.
*   The complexity of the explanation model has to be defined in advance.
*   The explanations are unstable. If you repeat the sampling process, the explanations that come out can be different. Instability means that it is difficult to trust the explanations, and you should be very critical.
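To make the idea of a local linear approximation concrete, here is a minimal sketch on a toy black box. It is not the implementation of the LIME package or of captum: the `black_box` function, its made-up token effects, and the similarity weighting are all assumptions chosen for illustration.

```python
import numpy as np

def black_box(mask: np.ndarray) -> float:
    """Toy stand-in for a model's confidence given which tokens are kept (1) or dropped (0)."""
    token_effects = np.array([0.05, 0.6, 0.0, 0.1])  # made-up: the second token matters most
    return float(0.2 + mask @ token_effects)

def lime_style_weights(num_tokens: int, num_samples: int, seed: int) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # 1) perturb: randomly keep or drop each token
    masks = rng.integers(0, 2, size=(num_samples, num_tokens))
    # 2) query the black box on every perturbation
    targets = np.array([black_box(m) for m in masks])
    # 3) weight samples by similarity to the original (fraction of tokens kept)
    sample_weights = masks.mean(axis=1) + 1e-3
    # 4) fit a weighted linear model with an intercept via least squares
    design = np.hstack([masks, np.ones((num_samples, 1))])
    root_w = np.sqrt(sample_weights)[:, None]
    coefs, *_ = np.linalg.lstsq(design * root_w, targets * root_w[:, 0], rcond=None)
    return coefs[:-1]  # drop the intercept; one weight per token

weights = lime_style_weights(num_tokens=4, num_samples=200, seed=0)
print(np.round(weights, 2))
```

Because this toy black box is itself linear, the fitted weights recover the token effects. With a real, non-linear model and few samples, rerunning with a different `seed` can give noticeably different weights, which is the instability mentioned in the list above.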

### Task 3 Version 1

For this task, you have to train a linear classifier that will approximate the decision of the model in the neighbourhood of one instance. Here are some guidelines on how to approach the task:


*   Implement a sampling function that will perturb a given instance randomly to another instance with a few tokens changed/removed/added.
*   Collect the predictions of the original model on the perturbations.
*   Train a linear model with the different perturbations to predict the confidence of the original model for the target class.
*   Use the weights of the linear model to explain the predictions of the original model.

***Note.*** If this task seems difficult and/or you won't have enough time to complete it, continue below to use the implementation of the captum package and implement only the sampling function.



In [None]:
from captum.attr import LimeBase
from sklearn.metrics import mean_squared_error
from captum._utils.models.linear_model import SkLearnLinearModel

In [None]:
def similarity_kernel(
    original_input: torch.Tensor,
    perturbed_input: torch.Tensor,
    perturbed_interpretable_input: torch.Tensor,
    **kwargs) -> torch.Tensor:
    # similarity = number of token positions left unchanged by the perturbation
    return torch.sum(original_input == perturbed_input)

def to_interp_rep_transform_custom(curr_sample, original_input, **kwargs: Any):
  return curr_sample

### Task 3 Version 2

*   Can you implement the sampling function that perturbs the input?
* Then, experiment with LIME to see how the explanation varies based on the number of samples and the different seed for the samples.


In [None]:
# Define sampling function
 # This function samples in original input space
def perturb_func(
     original_input: torch.Tensor,
     **kwargs: Any)->torch.Tensor:
         return # YOUR CODE HERE

 # For this example, we are setting the interpretable input to
 # match the model input, so the to_interp_rep_transform
 # function simply returns the input. In most cases, the interpretable
 # input will be different and may have a smaller feature set, so
 # an appropriate transformation function should be provided.

In [None]:
sklearnmodel = SkLearnLinearModel("linear_model.Ridge")

# The LimeBase attributor will need another wrapper for 
# the model if we try to explain more than one instance
batch = collate_fn(dev[:1])
input, additional_forward_args = get_tokens_input_captum(model_args.model, collate_fn, dev[:1], tokenizer.pad_token_id)

 # Defining LimeBase interpreter
lime_attr = LimeBase(model,
                    sklearnmodel,
                    similarity_func=similarity_kernel,
                    perturb_func=perturb_func,
                    perturb_interpretable_space=False,
                    from_interp_rep_transform=None,
                    to_interp_rep_transform=to_interp_rep_transform_custom)

 # Computes interpretable model, returning coefficients of linear model.
attr_coefs = lime_attr.attribute(input, 
                                 n_perturb_samples=100, 
                                 target=1,
                                 additional_forward_args=additional_forward_args)
attr_coefs

tensor([[ 4.6955e-04,  4.0859e-05,  3.8338e-05,  1.2541e-04,  1.6611e-04,
          2.0025e-05,  1.9535e-05,  2.4987e-04,  1.3605e-04,  4.0213e-05,
          1.3477e-04,  7.3199e-06,  9.3516e-07,  4.5949e-05,  2.3781e-05,
          4.6775e-05,  3.4967e-04,  1.7209e-05, -1.2891e-06, -1.9280e-05,
          8.7523e-03]])

In [None]:
# You can evaluate the convergence of the local explanation:
pred, true = [], []
enforce_reproducibility(seed=model_args.seed)

for i in range(20):
  instance = dev[i]
  if len(instance[0].split()) < 5:
    continue
  instance_input, additional_args = get_tokens_input_captum(model_args.model, 
                                                   collate_fn, 
                                                   [instance], 
                                                   tokenizer.pad_token_id)
  
  lime_attr.attribute(instance_input, 
                      n_perturb_samples=10, 
                      target=1,
                      additional_forward_args=additional_args)
  pred.append(sklearnmodel(instance_input.to('cpu').float()).item())
  true.append(model(instance_input, additional_args[0] if additional_args else None)[0][1].item())
mean_squared_error(true, pred)

1.2863891527193987

In [None]:
# YOUR CODE HERE

# Visualising Saliency Maps

Humans are good at understanding complex patterns when presented with good visualisations. We will visualize the saliency maps and the predictions of a model using the 

In [None]:
from ALPS_2021.tutorial_src.explainability_utils import GradientBasedVisualizer

In [None]:
ablator = Saliency(model)
visualizer = GradientBasedVisualizer(collate_fn, tokenizer, ablator)

In [None]:
# we add the instances to the visualizer and then show all of them together:
for i in range(3):
  visualizer.interpret_sentence(model, model_args.model, dev[i], target=0)
  visualizer.interpret_sentence(model, model_args.model, dev[i], target=1)
  visualizer.interpret_sentence(model, model_args.model, dev[i], target=2)
visualizer.visualize()

| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| neutral | neutral (0.99) | negative | 3.59 | [CLS] last session of the day http : / / t ##wi ##tp ##ic . com / 67 ##ez ##h [SEP] |
| neutral | neutral (0.99) | neutral | 3.55 | [CLS] last session of the day http : / / t ##wi ##tp ##ic . com / 67 ##ez ##h [SEP] |
| neutral | neutral (0.99) | positive | 3.75 | [CLS] last session of the day http : / / t ##wi ##tp ##ic . com / 67 ##ez ##h [SEP] |
| positive | positive (0.97) | negative | 3.64 | [CLS] shanghai is also really exciting ( precisely - - skyscraper ##s gal ##ore ) . good t ##wee ##ps in china : ( sh ) ( b ##j ) . [SEP] |
| positive | positive (0.97) | neutral | 3.77 | [CLS] shanghai is also really exciting ( precisely - - skyscraper ##s gal ##ore ) . good t ##wee ##ps in china : ( sh ) ( b ##j ) . [SEP] |


# Human-in-the-Loop

## Task 4
Observing the saliency maps:


*   Can you use the saliency maps to spot patterns which can be used to **re-write an example to an adversarial one**, which fools the model to predict the wrong label?
*   Can you use the saliency maps on wrong predictions to find what the model fails to capture?

Hint: you can use the visualizer from above.

In [None]:
# YOUR CODE HERE

# Properties

As there are a lot of explainability approaches and they can produce quite different saliency maps, it is important to be aware of their properties and be able to perform sanity checks with them. 

For a full list of properties for evaluating explainability techniques, see this [paper](https://www.aclweb.org/anthology/2020.emnlp-main.263.pdf).

Here, we will look at two of the most common properties.

## Faithfulness

Since explanation techniques are employed to explain model predictions for a single instance, an essential property is that they are faithful to the model’s inner workings and not based on arbitrary choices. A well-established way of measuring this property is by replacing a number of the most-salient words with a mask token and observing the drop in the model’s performance.

Here, for simplicity, we will mask the 20 most salient words of each instance and observe the change in the predictive performance. The bigger the drop in performance, the more faithful the explainability approach is.
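The masking step can be sketched in isolation as follows. The helper `mask_most_salient` and the token ids are hypothetical (0 plays the role of the padding id and 99 of the mask-token id); this is one possible way to select the top-k tokens, not necessarily the one used in the tutorial code.

```python
import torch

def mask_most_salient(token_ids, scores, pad_id, mask_id, k):
    """Replace the k highest-scoring non-padding tokens of each row with mask_id."""
    is_pad = token_ids == pad_id
    # padding gets the lowest possible score so that it is ranked last
    ranked = scores.masked_fill(is_pad, float("-inf"))
    k = min(k, token_ids.shape[1])
    top_idx = ranked.topk(k, dim=1).indices
    # mark the selected positions; rows with fewer than k real tokens never mask padding
    to_mask = torch.zeros_like(token_ids).scatter_(1, top_idx, 1).bool() & ~is_pad
    return torch.where(to_mask, torch.full_like(token_ids, mask_id), token_ids)

# Made-up ids: 0 is padding, 99 is the mask token
token_ids = torch.tensor([[11, 12, 13, 14, 0]])
scores = torch.tensor([[0.1, 0.9, 0.3, 0.7, 5.0]])
masked = mask_most_salient(token_ids, scores, pad_id=0, mask_id=99, k=2)
print(masked)  # tokens 12 and 14 are replaced by 99, the padding stays untouched
```

The model's performance on the masked inputs can then be compared with its performance on the original inputs.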

In [None]:
from sklearn.metrics import f1_score
from ALPS_2021.tutorial_src.explainability_utils import attribute_predict

In [None]:
def threshold_predict(dataset, attribution_method, model, words_remove=20, neg_inf=1e-10, target = 0):
  predictions_old, predictions_new, true_target = [], [], []
  dl = DataLoader(batch_size=model_args.batch_size, dataset=dataset, collate_fn=collate_fn)
      
  for batch in tqdm(dl):
    token_ids = batch[0]
    additional_args = None

    input_embeds, additional_args = get_embeddings_input_captum(model, 
                                                                model_args.model, 
                                                                collate_fn,
                                                                None,
                                                                tokenizer.pad_token_id, 
                                                                batch=batch)

    if isinstance(attribution_method, LimeBase):
      inputs = token_ids
      instance_attribution = attribution_method.attribute(token_ids, 
                                                          additional_forward_args=additional_args,
                                                          n_perturb_samples=100, 
                                                          target=1)
    else:
      inputs = input_embeds
      instance_attribution = attribution_method.attribute(inputs, 
                                                          additional_forward_args=additional_args,
                                                          target=1)
      instance_attribution = summarize_attributions(instance_attribution, type='mean').detach().cpu()
      
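    mask = (token_ids != tokenizer.pad_token_id).long()
    # give padding positions the lowest possible score so they are never selected
    instance_attribution = torch.as_tensor(instance_attribution).to(device).masked_fill(mask == 0, float('-inf'))
    
    # rank the tokens by saliency (rank 0 = most salient) and select the top `words_remove`
    ranks = torch.argsort(torch.argsort(instance_attribution, dim=-1, descending=True), dim=-1)
    words_to_mask = (ranks < words_remove).long() * mask
    
    # replace the selected tokens with the mask token and keep all the others
    inputs_masked = token_ids * (1 - words_to_mask) + words_to_mask * tokenizer.mask_token_id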
    
    predictions_old += torch.max(model(inputs, additional_args[0] if additional_args else None), dim=1)[1].detach().cpu().numpy().tolist()
    predictions_new += torch.max(model(inputs_masked, additional_args[0]  if additional_args else None), dim=1)[1].detach().cpu().numpy().tolist()
    true_target += batch[1].detach().cpu().numpy().tolist()

  return (f1_score(true_target, predictions_old, average='macro'), 
          f1_score(true_target, predictions_new, average='macro'))

In [None]:
dev.dataset = dev.dataset[:100]
threshold_predict(dev, ablator, model)

(0.8929330065359476, 0.8344035212322379)
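
The two numbers are the macro-F1 scores before and after masking. As a small sketch of how to turn them into a single faithfulness figure, the snippet below computes the absolute and relative drop from the values printed above; the variable names are our own.

```python
# Macro-F1 before and after masking, copied from the output above.
f1_before, f1_after = 0.8929330065359476, 0.8344035212322379

absolute_drop = f1_before - f1_after
relative_drop = absolute_drop / f1_before

print(f"absolute drop: {absolute_drop:.3f}")  # absolute drop: 0.059
print(f"relative drop: {relative_drop:.1%}")  # relative drop: 6.6%
```

The relative drop is convenient when comparing models whose starting F1 differs.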

## Stability

With this property, we test whether instances with similar rationales also receive similar explanations.

Here, for simplicity, we will consider two instances to have similar rationales if their inputs are similar and the model produces the same output for both. A more thorough approach would also measure the similarity between the activation maps in the individual layers, which we won't consider here for computational reasons.

To simplify the experiment further, we will append a few words that don't change the meaning to the end of each instance, and then measure the correlation between the change in the prediction and the change in the saliency maps.
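
The measurement can be sketched on toy data as follows. All numbers are invented for illustration, and `pearson` is a hand-written helper so the example needs only the standard library; the lab code uses `scipy` for the same purpose.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Invented class probabilities and saliency vectors for three instances,
# before and after the meaning-preserving edit.
preds_old = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
preds_new = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]
attrs_old = [[0.5, 0.1, 0.4], [0.3, 0.3, 0.4], [0.1, 0.8, 0.1]]
attrs_new = [[0.5, 0.1, 0.4], [0.2, 0.4, 0.4], [0.4, 0.3, 0.3]]

# Per-instance Euclidean distance between the old and new vectors.
diff_pred = [math.dist(p, q) for p, q in zip(preds_old, preds_new)]
diff_attr = [math.dist(a, b) for a, b in zip(attrs_old, attrs_new)]

r = pearson(diff_pred, diff_attr)
print(f"Pearson r = {r:.2f}")
```

A high positive correlation means the saliency maps change mostly when the prediction changes, which is what we expect from a stable explainability technique.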

In [None]:
from scipy.stats import pearsonr, spearmanr
from scipy.spatial import distance

In [None]:
def add_dataset_sentence(dataset):
  for i in range(len(dataset)):
    dataset.dataset[i] = (dataset[i][0] + ' Bye.', dataset[i][1])
  return dataset

def stability_attributions(dataset, attribution_method, model, edit_fn, neg_inf=1e-10, abs=False):
  predictions_old, attributions_old, token_ids_old,_ = attribute_predict(collate_fn, 
                                                                         model_args, 
                                                                         dataset, 
                                                                         attribution_method,
                                                                         model, 
                                                                         target=0)

  dataset_edited = edit_fn(dataset)
  predictions_new, attributions_new, token_ids_new,_ = attribute_predict(collate_fn,
                                                                         model_args,
                                                                         dataset_edited, 
                                                                         attribution_method, 
                                                                         model, 
                                                                         target=0)
  diff_pred, diff_attr = [], []

  for p1, p2 in zip(predictions_old, predictions_new):
    diff_pred.append(distance.euclidean(p1, p2))

  for a1, a2, token_ids in zip(attributions_old, attributions_new, token_ids_old):
    a1 = [token_score for i, token_score in enumerate(a1) if token_ids[i]!=tokenizer.pad_token_id]
    a2 = a2[:len(a1)]
    diff_attr.append(distance.euclidean(a1, a2))

  if abs:
    diff_pred = np.abs(diff_pred)
    diff_attr = np.abs(diff_attr)
  
  return pearsonr(diff_pred, diff_attr), spearmanr(diff_pred, diff_attr)

In [None]:
stability_attributions(dev, ablator, model, add_dataset_sentence)

((0.5534135545926878, 2.3510616145391444e-09),
 SpearmanrResult(correlation=0.5269486948694869, pvalue=1.7811362756981797e-08))

## Task 5

Experiment with the model to find which characteristics make it harder to interpret.

Guidelines:
Think of several parameters you could change, on the assumption that the resulting model will be harder to interpret, and train a new model with them. (The CNN model is the fastest to train; you could also decrease the number of epochs in the model_args.)

Then compare the faithfulness and the stability of the same explainability approach on the old and the new model to see what has changed.

Why do you think that is?
Let us know about your findings!