# Interpretability for a Linear Text Classification Model

In this example we show how to apply a simple white-box interpretability method used by [Haufe et al.](https://www.sciencedirect.com/science/article/pii/S1053811913010914) in neuroimaging applications to the predictions of a linear text classification model operating on unigram bag-of-words features. 

An empirical quantification of interpretability quality based on human-in-the-loop experiments by [Schmidt and Biessmann](https://arxiv.org/pdf/1901.08558.pdf) indicates that this method yields better explanations of machine learning (ML) predictions than other interpretability methods, such as [LIME](https://arxiv.org/pdf/1602.04938v1.pdf). Better here means that human annotators become faster and more accurate when assisted with these explanations.

This notebook shows how to train the ML model and the interpretability model used in [the study by Schmidt and Biessmann](https://arxiv.org/pdf/1901.08558.pdf). The classification task is to predict whether the sentiment of an IMDB movie review is positive or negative, a standard text classification benchmark due to [Maas et al.](http://www.aclweb.org/anthology/P11-1015).

## Data Preparation

First, download the data from [this link](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz), extract the archive, and adapt the path in the code cell below.
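The download and extraction step can also be scripted. The following is a minimal sketch using only the standard library; the helper name `fetch_and_extract` and its default file names are assumptions for illustration and are not part of the original notebook.

```python
import os
import shutil
import urllib.request

IMDB_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

def fetch_and_extract(url=IMDB_URL, archive_path="aclImdb_v1.tar.gz", target_dir="."):
    """Download the archive if it is not present yet, then unpack it into target_dir.

    Hypothetical helper; returns the path of the extracted 'aclImdb' folder.
    """
    if not os.path.exists(archive_path):
        # only hit the network when there is no local copy of the archive
        urllib.request.urlretrieve(url, archive_path)
    # the archive format is inferred from the '.tar.gz' file extension
    shutil.unpack_archive(archive_path, target_dir)
    return os.path.join(target_dir, "aclImdb")
```

Calling `fetch_and_extract()` once in the working directory should leave an `aclImdb` folder that matches the `DATADIR` setting used in the next cell.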

In [1]:
import os, glob, re
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from bs4 import BeautifulSoup

# contains the unzipped data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
DATADIR = "aclImdb"

top_what = 3
random_state = 42
np.random.seed(random_state)

features = 'review'
label = 'rating_binary'

def load_data(path=DATADIR):
    """Read all reviews below `path` into a DataFrame, one row per review."""
    reviews = []
    for train_test_split in ['train', 'test']:
        for sentiment in ['pos', 'neg']:
            for file in glob.glob(os.path.join(path, train_test_split, sentiment, '*.txt')):
                # file names follow the scheme <id>_<rating>.txt
                movie_id, rating = os.path.splitext(os.path.basename(file))[0].split('_')
                with open(file, encoding='utf-8') as f:
                    review = f.read()
                reviews.append({
                    'review': review,
                    'movie_id': int(movie_id),
                    'rating_binary': sentiment,
                    'rating': int(rating),
                    'split': train_test_split
                })
    return pd.DataFrame(reviews)

df = load_data()

df[features] = df[features].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

# work on copies so that adding columns later does not raise a SettingWithCopyWarning
train_df = df[df['split'] == 'train'].copy()
test_df = df[df['split'] == 'test'].reset_index(drop=True)

In [2]:
df

| | review | movie_id | rating_binary | rating | split |
|---|---|---|---|---|---|
| 0 | For a movie that gets no respect there sure ar... | 4715 | pos | 9 | train |
| 1 | Bizarre horror movie filled with famous faces ... | 12390 | pos | 8 | train |
| 2 | A solid, if unremarkable film. Matthau, as Ein... | 8329 | pos | 7 | train |
| 3 | It's a strange feeling to sit alone in a theat... | 9063 | pos | 8 | train |
| 4 | You probably all already know this by now, but... | 3092 | pos | 10 | train |
| ... | ... | ... | ... | ... | ... |
| 49995 | With actors like Depardieu and Richard it is r... | 11513 | neg | 1 | test |
| 49996 | If you like to get a couple of fleeting glimps... | 5409 | neg | 1 | test |
| 49997 | When something can be anything you want it to ... | 11187 | neg | 1 | test |
| 49998 | I had heard good things about "States of Grace... | 9359 | neg | 3 | test |


## Training a linear unigram bag-of-words classifier

The classifier is restricted to unigram features. For this setting, the hyperparameters used in the code below were found to be optimal in offline experiments.

In [3]:
# a unigram BOW vectorizer
vect = TfidfVectorizer(stop_words=ENGLISH_STOP_WORDS).fit(train_df[features])
X_train = vect.transform(train_df[features])
X_test = vect.transform(test_df[features])

# logistic loss; scikit-learn versions before 1.1 call this loss 'log'
clf = SGDClassifier(loss='log_loss',
                    n_jobs=-1,
                    random_state=random_state,
                    alpha=0.0001
                   ).fit(X_train, train_df[label])

test_df['predictions'] = clf.predict(X_test)
predictions_proba = clf.predict_proba(X_test)
test_df['prediction_proba_max'] = predictions_proba.max(axis=1)

print(classification_report(test_df[label], test_df['predictions']))

              precision    recall  f1-score   support

         neg       0.88      0.87      0.87     12500
         pos       0.87      0.88      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000



## Rendering the model interpretable

As we only trained a linear text classifier, there is a simple way of rendering the model and its predictions interpretable.

Note that **we cannot simply interpret the coefficients of the linear model itself**! Doing so would implicitly assume that the covariates (the features or data) are uncorrelated, which is almost never the case for real data (although it often holds for synthetic covariates, as in the [design matrix](https://en.wikipedia.org/wiki/Design_matrix) of well-designed experiments). This is well known in some parts of the scientific community and ignored in others.

This reasoning is also recapitulated in the [Haufe et al. study](https://www.sciencedirect.com/science/article/pii/S1053811913010914), in which the authors show that the optimal generative model (explanation/interpretability model) $a\in \mathbb{R}^d$ for a linear binary classification model $w\in \mathbb{R}^d$ and $n$ $d$-dimensional data points stored in a matrix $X\in \mathbb{R}^{n\times d}$ is

$a = X^{\top}Xw = X^{\top}\hat{y}$

where $\hat{y}$ are the predictions of the linear model and we assume that the data as well as the predictions are centered (meaning $\sum_{i=1}^n x_i=0$) and have unit variance (meaning $\frac{1}{n}\sum_{i=1}^n x_i^2=1$). Following Haufe et al., the model explanations are referred to as *patterns*.
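To make the difference between model weights and patterns concrete, here is a small synthetic sketch in the spirit of the argument by Haufe et al. The two-feature setup and the names `signal` and `distractor` are assumptions for illustration and are unrelated to the IMDB data. The first feature mixes the quantity of interest with a distractor, while the second feature contains only the distractor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
signal = rng.standard_normal(n)       # quantity the model should recover
distractor = rng.standard_normal(n)   # unrelated noise source

# feature 0 contains signal plus distractor, feature 1 only the distractor
X = np.column_stack([signal + distractor, distractor])
X = X - X.mean(axis=0)

# least-squares weights that recover the (centered) signal from the features
w, *_ = np.linalg.lstsq(X, signal - signal.mean(), rcond=None)

# pattern: covariance between each feature and the model output
y_hat = X @ w
pattern = X.T @ y_hat / n

print("weights:", w.round(2))
print("pattern:", pattern.round(2))
```

The model needs a large negative weight on the second feature to cancel the distractor, even though that feature carries no information about the signal. The pattern entry for the second feature should be close to zero, which matches the intuition that only the first feature covaries with the prediction.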

Note that as we are dealing with very high-dimensional data in the case of bag-of-words featurized text data, we approximate this z-scoring operation in order to avoid densifying the sparse data matrix.
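The effect of skipping the centering step can be checked on a toy matrix. The sketch below uses random data as a stand-in for a bag-of-words matrix (an assumption for illustration) and compares scaling without centering against full z-scoring. Since $\sum_i (x_i - \mu)\,\hat{y}_i = \sum_i x_i \hat{y}_i$ whenever $\hat{y}$ is centered, the resulting patterns should agree up to floating point error, while only the uncentered variant stays sparse.

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# toy stand-in for a bag-of-words matrix: 1000 documents, 500 terms, about 1% non-zero
dense = rng.random((1000, 500)) * (rng.random((1000, 500)) < 0.01)
X = sparse.csr_matrix(dense)

# centered, unit-variance stand-in for the model predictions
y = rng.standard_normal(1000)
y_centered = (y - y.mean()) / y.std()

# scaling without centering keeps the matrix sparse
X_scaled = StandardScaler(with_mean=False).fit_transform(X)
pattern_sparse = X_scaled.T @ y_centered

# full z-scoring requires a dense matrix
X_zscored = StandardScaler().fit_transform(dense)
pattern_dense = X_zscored.T @ y_centered

print("stored entries after sparse scaling:", X_scaled.nnz)
print("non-zero entries after full z-scoring:", np.count_nonzero(X_zscored))
print("patterns agree:", np.allclose(pattern_sparse, pattern_dense))
```

On the real data the dense variant would need a matrix with one entry per document and vocabulary term, which is why the notebook avoids it.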

In [8]:
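# z-score the predicted class probabilities (columns follow clf.classes_, i.e. ['neg', 'pos'])
labels_normalized = StandardScaler().fit_transform(predictions_proba)
# scale features to unit variance without centering, which keeps the matrix sparse
data_scaled = StandardScaler(with_mean=False).fit_transform(X_test)
# +1 if the negative class (column 0) was predicted, -1 for the positive class
prediction_sign = -np.sign(predictions_proba.argmax(axis=1) - .5)

# pattern of the negative class: covariance of each word with the predicted probability
pattern = labels_normalized[:,0].T @ data_scaled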

idx2word = {idx: word for word, idx in vect.vocabulary_.items()}

# print most covarying words for negative sentiment class
print([idx2word[idx] for idx in pattern.argsort()[-5:][::-1]])

['bad', 'worst', 'waste', 'awful', 'terrible']


## Explanations for single data points / predictions

The above model is a global interpretability model, meaning there is one pattern/explanation per class, but we would like to obtain explanations for single predictions. 

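This can be done by computing the elementwise product between the feature vector $x_i$ of the $i$th data point and a class pattern, here the positive pattern $a^{pos}$:

$a^{pos}_{i} = \operatorname{sign}(\hat{y}_i)\, a^{pos} \circ x_i$

where $\circ$ stands for elementwise multiplication. We multiply by $\operatorname{sign}(\hat{y}_i)$ to obtain positive values for the predicted class, independent of whether the predicted class was negative or positive sentiment. The code below uses the pattern of the negative class with the sign flipped accordingly, which is equivalent in the binary case because the two class patterns differ only in sign. It also replaces $x_i$ by a binary indicator of whether a word occurs in the document.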
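The formula can be traced by hand on a tiny example. In the sketch below the five-word vocabulary, the pattern values, and the helper `explain` are all hypothetical and chosen only to illustrate the elementwise product and the sign flip.

```python
import numpy as np

vocab = np.array(["bad", "boring", "great", "plot", "wonderful"])
# hypothetical global pattern of the positive class, one value per word
pattern_pos = np.array([-2.0, -1.0, 1.5, 0.1, 2.0])

def explain(x, y_hat_sign, top_k=2):
    """Return the top_k words that support the predicted class of one document.

    x is the feature vector of the document, y_hat_sign is +1 for a positive
    and -1 for a negative prediction.
    """
    scores = y_hat_sign * pattern_pos * x            # elementwise product
    top = np.argsort(scores)[::-1][:top_k]           # indices of the largest scores
    return [(str(vocab[i]), float(scores[i])) for i in top if scores[i] > 0]

# document containing "bad", "boring" and "plot", predicted negative
negative_doc = np.array([1.0, 1.0, 0.0, 1.0, 0.0])
print(explain(negative_doc, y_hat_sign=-1))

# document containing "great", "plot" and "wonderful", predicted positive
positive_doc = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
print(explain(positive_doc, y_hat_sign=+1))
```

In both cases the highlighted words receive positive scores, because the sign of the prediction flips the pattern for negative predictions.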


In [42]:
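# per-document word scores: word occurrence indicator times the pattern,
# sign-flipped so that words supporting the predicted class get positive scores
word_scores = sp.sparse.diags(prediction_sign) * ((data_scaled>0)*1.) @ sp.sparse.diags(pattern)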

sample = test_df.sample(n=100)
sample['highlighted_features_covar'] = ''
for i in range(len(sample)):
    idx = sample.index[i]
    top_words = word_scores[idx,:].toarray().flatten().argsort()[-top_what:][::-1]
    words = [idx2word[w_idx] for w_idx in top_words]
    sentence_covar = test_df.loc[idx, features]
    for word in words:
        # escape the word and reuse the matched text so the original casing is kept
        sentence_covar = re.sub(r'\b({})\b'.format(re.escape(word)),
                                r'<mark>\1</mark>',
                                sentence_covar,
                                flags=re.IGNORECASE
                               )
    sample.loc[idx, 'highlighted_features_covar'] = sentence_covar
    sample.loc[idx, 'n_words_highlighted_in_sentence_covar'] = len(re.findall('<mark>', sentence_covar))
    
    

In [43]:
from IPython.display import display, HTML

rand_example = sample.sample(n=1)
print(f"True sentiment {rand_example['rating_binary'].values[0]}, predicted sentiment {rand_example['predictions'].values[0]} (p={rand_example['prediction_proba_max'].values[0]:0.2})")
display(HTML(rand_example['highlighted_features_covar'].values[0]))

True sentiment neg, predicted sentiment neg (p=0.69)
