# Template for quantitative experiments: text

This template is meant as a basis for the quantitative text experiments defined in issues [#474](https://github.com/dianna-ai/dianna/issues/474) and [#481](https://github.com/dianna-ai/dianna/issues/481).

It is based on the dianna [text tutorials](https://github.com/dianna-ai/dianna/tree/main/tutorials) for [RISE](https://github.com/dianna-ai/dianna/blob/main/tutorials/rise_text.ipynb) and [LIME](https://github.com/dianna-ai/dianna/blob/main/tutorials/lime_text.ipynb), which largely overlap.

### Imports and paths

In [1]:
import os
import matplotlib.pyplot as plt
import numpy as np
import spacy
from torchtext.vocab import Vectors
from scipy.special import expit as sigmoid

import dianna
from dianna import visualization
from dianna import utils
from dianna.utils.tokenizers import SpacyTokenizer


In [3]:
dianna_repo_tutorials_path = os.path.join('..','..','..', 'dianna','tutorials')
model_file = os.path.join(dianna_repo_tutorials_path,'models','movie_review_model.onnx')
word_vector_file = os.path.join(dianna_repo_tutorials_path,'data','movie_reviews_word_vectors.txt')

print("model file: ", model_file)
print("word vector file: ", word_vector_file)
labels = ("negative", "positive")

model file:  ../../../dianna/tutorials/models/movie_review_model.onnx
word vector file:  ../../../dianna/tutorials/data/movie_reviews_word_vectors.txt


## Explainable AI (XAI) method

Here we define the XAI method and its parameters.

### Explainer

In [18]:
Explainer_type = 'RISE'
#Explainer_type = 'LIME'

### Explainer's parameters

In [21]:
if Explainer_type == 'RISE':
    print('Setting up RISE parameters')
    # default values, edit as needed
    n_masks = 1000
    feature_res = 8
    p_keep = None  # None: p_keep is determined automatically
    preprocess_function = None
elif Explainer_type == 'LIME':
    print('Setting up LIME parameters')
else:
    raise ValueError(f'Unknown explainer type: {Explainer_type}')

Setting up RISE parameters
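
One possible way to keep these settings together is a dictionary keyed by method name, so that the chosen set can later be forwarded as keyword arguments. The sketch below is a hypothetical arrangement, not part of dianna: `EXPLAINER_PARAMS` and `get_explainer_params` are illustrative names, and the LIME entry is left empty because no LIME parameters are defined in this template yet.

```python
# Hypothetical helper: collect per-method settings in one place.
# The RISE values mirror the defaults set in the cell above.
EXPLAINER_PARAMS = {
    'RISE': {
        'n_masks': 1000,
        'feature_res': 8,
        'p_keep': None,
        'preprocess_function': None,
    },
    'LIME': {},  # placeholder: no LIME parameters are defined in this template yet
}


def get_explainer_params(explainer_type):
    """Return a copy of the settings for the given method name."""
    if explainer_type not in EXPLAINER_PARAMS:
        raise ValueError(f'Unknown explainer type: {explainer_type!r}')
    return dict(EXPLAINER_PARAMS[explainer_type])


rise_params = get_explainer_params('RISE')
print(rise_params)
```

Whether each of these keys is accepted as a keyword argument by the explainer is an assumption that should be checked against the dianna documentation before forwarding them.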


## Loading the pre-trained Stanford movie reviews model

The model (sentiment classifier) is in [ONNX format](https://onnx.ai/).
It accepts numerical tokens as input and outputs a logit. Applying a sigmoid to that logit gives a score between 0 and 1, where 0 means the review has a _negative_ sentiment and 1 means it is _positive_.
Here we define a class to run the model. It accepts a sentence (i.e. a string) as input and returns a score for each of two classes: negative and positive.
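
The model runner defined below applies a sigmoid to the raw model output and returns one column per class. As a rough illustration of that last step, the sketch below converts a few made-up logits into the two-column format. The logit values are invented, and the sigmoid is written out with NumPy so the example does not depend on SciPy.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, mapping any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


# Made-up raw model outputs (logits) for three reviews, for illustration only.
logits = np.array([-2.0, 0.0, 3.0])

positivity = sigmoid(logits)
negativity = 1 - positivity

# One row per review, columns ordered as (negative, positive).
scores = np.transpose([negativity, positivity])
print(scores.shape)
```

Each row sums to 1 by construction, so the two columns can be read as complementary class scores.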

### Tokenizer

In [4]:
# ensure the tokenizer for English is available
spacy.cli.download('en_core_web_sm')

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 25.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Model runner

In [5]:
class MovieReviewsModelRunner:
    def __init__(self, model, word_vectors, max_filter_size):
        self.run_model = utils.get_function(model)
        self.vocab = Vectors(word_vectors, cache=os.path.dirname(word_vectors))
        self.max_filter_size = max_filter_size
        
        self.tokenizer = SpacyTokenizer(name='en_core_web_sm')

    def __call__(self, sentences):
        # ensure the input has a batch axis
        if isinstance(sentences, str):
            sentences = [sentences]

        output = []
        for sentence in sentences:
            # tokenize and pad to minimum length
            tokens = self.tokenizer.tokenize(sentence)
            if len(tokens) < self.max_filter_size:
                tokens += ['<pad>'] * (self.max_filter_size - len(tokens))
            
            # numericalize the tokens
            tokens_numerical = [self.vocab.stoi[token] if token in self.vocab.stoi else self.vocab.stoi['<unk>']
                                for token in tokens]

            # run the model and apply a sigmoid because the model outputs logits;
            # squeeze removes any remaining batch axis before converting to float
            pred = float(np.squeeze(sigmoid(self.run_model([tokens_numerical]))))
            output.append(pred)

        # output two classes
        positivity = np.array(output)
        negativity = 1 - positivity
        return np.transpose([negativity, positivity])
            

In [7]:
# define model runner. max_filter_size is a property of the model
model_runner = MovieReviewsModelRunner(model_file, word_vector_file, max_filter_size=5)
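
The padding and numericalization steps inside the runner can be tried in isolation. The sketch below uses a tiny hand-made vocabulary (`toy_stoi`, hypothetical) and plain whitespace splitting in place of the spaCy tokenizer and the real word vectors, so the indices have no relation to the actual model.

```python
# Hypothetical toy vocabulary; the real one comes from the word vector file.
toy_stoi = {'<unk>': 0, '<pad>': 1, 'A': 2, 'great': 3, 'movie': 4}


def pad_and_numericalize(sentence, stoi, min_length):
    """Split on whitespace, pad to min_length, and map tokens to indices."""
    tokens = sentence.split()
    if len(tokens) < min_length:
        tokens = tokens + ['<pad>'] * (min_length - len(tokens))
    # Tokens missing from the vocabulary map to the '<unk>' index.
    return [stoi.get(token, stoi['<unk>']) for token in tokens]


indices = pad_and_numericalize('A great thriller', toy_stoi, min_length=5)
print(indices)
```

Here `thriller` is not in the toy vocabulary and falls back to `<unk>`, and two `<pad>` tokens bring the sentence up to the minimum length of 5.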

## Loading the test data

At the moment, only a single-sentence review is loaded. For testing, this should be a small batch of reviews.

In [22]:
review = "A delectable and intriguing thriller filled with surprises"
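
A small batch could be represented as a plain list of strings, since the model runner above already accepts a list of sentences. In the sketch below, the extra sentences are invented placeholders, and `dummy_runner` is a hypothetical stand-in so that the example runs without the ONNX model.

```python
import numpy as np

# The first review is the one used in this template; the others are
# invented placeholders for illustration.
reviews = [
    "A delectable and intriguing thriller filled with surprises",
    "A dull and predictable story",
    "Great acting but a weak plot",
]


def dummy_runner(sentences):
    """Hypothetical stand-in for the model runner: returns a fixed 0.5 score
    for both classes, with one row per sentence."""
    if isinstance(sentences, str):
        sentences = [sentences]
    return np.full((len(sentences), 2), 0.5)


predictions = dummy_runner(reviews)
# One row per review, columns ordered as (negative, positive).
print(predictions.shape)
```

Replacing `dummy_runner` with the real model runner should give the same output shape, with actual scores in place of the constant 0.5.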

## Explaining the model with the dianna explainer

The simplest way to run DIANNA on text data is with dianna.explain_text. The arguments are:

- The function that runs the model (a path to a model in ONNX format is also accepted)
- The text we want to explain
- The tokenizer used to split the text into tokens
- The name of the explainable-AI method we want to use (RISE, LIME, etc.)
- The numerical indices of the classes we want an explanation for

For each requested class, dianna.explain_text returns a list of tuples. Each tuple contains a word, its location in the input text, and its relevance for the selected output class.

In [25]:
# TODO: run dianna with the pre-specified parameters
# (the parameters defined earlier are not yet passed to explain_text)

# An explanation is returned for each label, but we ask for just one label so the output is a list of length one.
explanation_relevances = dianna.explain_text(model_runner, review, model_runner.tokenizer, Explainer_type,
                                             labels=[labels.index('positive')])[0]
explanation_relevances

Rise parameter p_keep was automatically determined at 0.30000000000000004


Explaining: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:15<00:00,  1.53s/it]


[('A', 0, 0.8483416525522866),
 ('delectable', 1, 0.8601043719053266),
 ('and', 2, 0.7479977830251058),
 ('intriguing', 3, 0.9797707005341846),
 ('thriller', 4, 0.8323371187845864),
 ('filled', 5, 0.7250519095857937),
 ('with', 6, 0.8161629191040992),
 ('surprises', 7, 0.7482723124821979)]
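
Because each tuple holds a word, its position, and its relevance, the result can be post-processed with ordinary Python. The sketch below copies the values printed above (rounded to four decimals) and sorts the words by relevance. `explanation` and `ranked` are illustrative names.

```python
# Relevances copied from the output above, rounded to four decimals.
explanation = [
    ('A', 0, 0.8483),
    ('delectable', 1, 0.8601),
    ('and', 2, 0.7480),
    ('intriguing', 3, 0.9798),
    ('thriller', 4, 0.8323),
    ('filled', 5, 0.7251),
    ('with', 6, 0.8162),
    ('surprises', 7, 0.7483),
]

# Sort by relevance (third tuple element), highest first.
ranked = sorted(explanation, key=lambda item: item[2], reverse=True)

for word, position, relevance in ranked[:3]:
    print(f'{word} (position {position}): {relevance:.4f}')
```

For this run, the three highest-ranked words are "intriguing", "delectable", and "A".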

## Visualization