# Feature attribution

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2022"

## Contents

1. [Overview](#Overview)
1. [InputXGradients](#InputXGradients)
1. [Selectivity examples](#Selectivity-examples)
1. [Simple feed-forward classifier example](#Simple-feed-forward-classifier-example)
1. [Bag-of-words classifier for the SST](#Bag-of-words-classifier-for-the-SST)
1. [BERT example](#BERT-example)

## Overview

This notebook is an experimental extension of the CS224u course code. It focuses on the [Integrated Gradients](https://arxiv.org/abs/1703.01365) method for feature attribution, with comparisons to the "inputs $\times$ gradients" method. To run the notebook, first install [the Captum library](https://captum.ai/):

In [2]:
!pip install captum



This is not currently a required installation (but it will be in future years).

## InputXGradients

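Both implementations call the `forward` method of `model`. `X` is an (m x n) tensor of inputs (m examples, n features), and the returned attributions have the same shape. Use `targets=None` (`target=None` in the Captum version) for models with scalar outputs; otherwise supply a LongTensor giving a label index for each example.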

In [3]:
import torch

def grad_x_input(model, X, targets=None):
    """Implementation using PyTorch directly."""
    X.requires_grad = True
    y = model(X)
    y = y if targets is None else y[list(range(len(y))), targets]
    (grads, ) = torch.autograd.grad(y.unbind(), X)
    return grads * X

In [4]:
from captum.attr import InputXGradient

def captum_grad_x_input(model, X, target):
    """Captum-based implementation."""
    X.requires_grad = True
    amod = InputXGradient(model)
    return amod.attribute(X, target=target)
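
To make the quantity concrete: for a linear model $f(x) = w \cdot x$, the gradient with respect to $x_i$ is $w_i$, so inputs $\times$ gradients reduces to $w_i x_i$. The sketch below checks this in plain Python, using central finite differences in place of autograd. The weights, inputs, and helper names are illustrative assumptions and not part of the course code.

```python
def linear_model(x, w=(2.0, -1.0, 0.5)):
    """Hypothetical scalar-output model: f(x) = w . x"""
    return sum(wi * xi for wi, xi in zip(w, x))


def finite_diff_grad(f, x, eps=1e-6):
    """Approximate df/dx_i with central differences."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def grad_x_input_fd(f, x):
    """Inputs x gradients: elementwise product of x and df/dx."""
    return [g * xi for g, xi in zip(finite_diff_grad(f, x), x)]


x = [1.0, 3.0, 0.0]
attrs = grad_x_input_fd(linear_model, x)
# For a linear model the attributions are w_i * x_i,
# i.e. approximately [2.0, -3.0, 0.0].
print([round(a, 4) for a in attrs])
```

A feature whose input value is zero always receives zero attribution under this method, whatever its weight.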

## Selectivity examples

In [5]:
import numpy as np
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients
from captum.attr import InputXGradient

In [6]:
class SelectivityAssessor(nn.Module):
    """Model used by Sundararajan et al, section 2.1 to show that
    input * gradients violates their selectivity axiom.
    """
    def __init__(self):
        super().__init__()
        self.relu = nn.ReLU()

    def forward(self, X):
        return 1.0 - self.relu(1.0 - X)

In [7]:
sel_mod = SelectivityAssessor()

Simple inputs with just one feature:

In [8]:
X_sel = torch.FloatTensor([[0.0], [2.0]])

The outputs for our two examples differ:

In [9]:
sel_mod(X_sel)

tensor([[0.],
        [1.]])

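However, `InputXGradient` assigns zero importance to the feature in both examples, even though the outputs differ: at 0 the input itself is zero, and at 2 the gradient is zero because the function is flat for inputs above 1. This violates selectivity (called Sensitivity in Sundararajan et al.):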

In [10]:
captum_grad_x_input(sel_mod, X_sel, target=None)

tensor([[0.],
        [-0.]], grad_fn=<MulBackward0>)

Integrated gradients addresses the problem by averaging gradients over inputs interpolated along the straight line from the baseline to the actual input, and then scaling that average by the difference between the input and the baseline:

In [11]:
ig_sel = IntegratedGradients(sel_mod)

In [12]:
sel_baseline = torch.FloatTensor([[0.0]])

In [13]:
ig_sel.attribute(X_sel, sel_baseline)

tensor([[0.],
        [1.]], dtype=torch.float64, grad_fn=<MulBackward0>)
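
For reference, the quantity being computed can be written as follows, where $F$ is the model, $x$ the input, $x'$ the baseline, and $m$ the number of interpolation steps. This is the standard formulation from Sundararajan et al.; the notation is chosen here for illustration, and the quadrature rule that Captum uses by default may differ from the simple Riemann sum shown.

$$
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha
\;\approx\; (x_i - x'_i)\, \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\big(x' + \tfrac{k}{m}(x - x')\big)}{\partial x_i}
$$

For the example above, with $x' = 0$ and $x = 2$, the gradient is 1 on the half of the path where the interpolated input is below 1 and 0 on the other half. The integral is therefore $1/2$ and the attribution is $2 \times 1/2 = 1$, which equals the output difference $F(2) - F(0) = 1$.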

A toy implementation to help bring out what is happening:

In [14]:
def ig_reference_implementation(model, x, base, m=50):
    vals = []
    for k in range(m):
        # Interpolated representation, moving from `base` (k=0) toward `x`:
        xx = base + (k / m) * (x - base)
        # Gradient for the interpolated example:
        xx.requires_grad = True
        y = model(xx)
        (grads, ) = torch.autograd.grad(y.unbind(), xx)
        vals.append(grads)
    return (1 / m) * torch.cat(vals).sum(axis=0) * (x - base)

In [15]:
ig_reference_implementation(sel_mod, torch.FloatTensor([[2.0]]), sel_baseline)

tensor([[1.]])
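
A useful sanity check on any implementation is the completeness property from Sundararajan et al.: the attributions should sum to $F(x) - F(\text{baseline})$, up to discretization error. The plain-Python sketch below checks this for the one-feature model with a nonzero baseline, which exercises the interpolation step `base + (k / m) * (x - base)`. The hand-written gradient and the function names are assumptions made for illustration.

```python
def f(x):
    """One-feature model from the selectivity example: 1 - ReLU(1 - x)."""
    return 1.0 - max(0.0, 1.0 - x)


def df_dx(x):
    """Analytic gradient of f, using the convention ReLU'(0) = 0."""
    return 1.0 if x < 1.0 else 0.0


def ig_riemann(x, base, m=50):
    """Left Riemann-sum approximation of integrated gradients."""
    total = 0.0
    for k in range(m):
        # Point on the straight line from `base` (k=0) toward `x`:
        xx = base + (k / m) * (x - base)
        total += df_dx(xx)
    return (total / m) * (x - base)


x, base = 2.0, 0.5
attr = ig_riemann(x, base, m=1000)
# Completeness: the attribution should be close to f(x) - f(base) = 0.5.
print(attr, f(x) - f(base))
```

With an all-zero baseline the interpolated point is simply `(k / m) * x`, so errors in how the baseline enters the interpolation only show up when the baseline is nonzero.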

## Simple feed-forward classifier example

In [16]:
from captum.attr import IntegratedGradients
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import torch
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier

In [17]:
X_cls, y_cls = make_classification(
    n_samples=5000,
    n_classes=3,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=42)

The classification problem has two uninformative features:

In [18]:
mutual_info_classif(X_cls, y_cls)

array([0.20138107, 0.02833358, 0.11584416, 0.        , 0.        ])

In [19]:
X_cls_train, X_cls_test, y_cls_train, y_cls_test = train_test_split(X_cls, y_cls)

In [20]:
classifier = TorchShallowNeuralClassifier()

In [21]:
_ = classifier.fit(X_cls_train, y_cls_train)

Stopping after epoch 449. Training loss did not improve more than tol=1e-05. Final error is 1.3419027030467987.

In [22]:
cls_preds = classifier.predict(X_cls_test)

In [23]:
accuracy_score(y_cls_test, cls_preds)

0.8568

In [24]:
classifier_ig = IntegratedGradients(classifier.model)

In [25]:
classifier_baseline = torch.zeros(1, X_cls_train.shape[1])

Integrated gradients with respect to the actual labels:

In [26]:
classifier_attrs = classifier_ig.attribute(
    torch.FloatTensor(X_cls_test),
    classifier_baseline,
    target=torch.LongTensor(y_cls_test))

Average attribution is low for the two uninformative features:

In [27]:
classifier_attrs.mean(axis=0)

tensor([ 0.6544,  0.6739,  0.7057, -0.0173, -0.0059], dtype=torch.float64)
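
One caveat when averaging attributions: positive and negative values can cancel, so a feature with a near-zero mean is not necessarily unimportant. The mean of absolute values is a common complementary summary. The sketch below uses a small hand-made attribution matrix (hypothetical numbers, not output from the classifier above) to show the difference.

```python
# Hypothetical attributions: 4 examples x 3 features.
attrs = [
    [0.9, -0.8, 0.01],
    [0.7, 0.9, -0.02],
    [0.8, -0.7, 0.00],
    [0.6, 0.6, 0.01],
]

n = len(attrs)
# zip(*attrs) iterates over columns, i.e. over features.
mean_attr = [sum(col) / n for col in zip(*attrs)]
mean_abs_attr = [sum(abs(v) for v in col) / n for col in zip(*attrs)]

# Feature 1 has a mean near zero but a large mean absolute value,
# so it matters for individual predictions; feature 2 is small on both.
print([round(v, 3) for v in mean_attr])
print([round(v, 3) for v in mean_abs_attr])
```

For the tensor computed above, the analogous summary would be `classifier_attrs.abs().mean(axis=0)`.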

## Bag-of-words classifier for the SST

In [28]:
from collections import Counter
from captum.attr import IntegratedGradients
from nltk.corpus import stopwords
from operator import itemgetter
import os
from sklearn.metrics import classification_report
import torch
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
import sst

In [29]:
SST_HOME = os.path.join("data", "sentiment")

Bag-of-words featurization with stopword removal to make this a little easier to study:

In [30]:
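# Use a new name so that the imported `stopwords` module is not
# shadowed and this cell can be re-run safely.
english_stopwords = set(stopwords.words('english'))

def phi(text):
    return Counter([w for w in text.lower().split() if w not in english_stopwords])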
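
To see what this featurization produces, here is a standalone sketch that mirrors `phi` but uses a tiny hand-picked stopword set. That set is an assumption for illustration (the notebook uses NLTK's English list), chosen so the example runs without downloading NLTK data.

```python
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list.
toy_stopwords = {"the", "is", "a", "and", "it"}


def toy_phi(text):
    """Lowercase, split on whitespace, drop stopwords, count tokens."""
    return Counter(w for w in text.lower().split() if w not in toy_stopwords)


feats = toy_phi("The film is fun , and it is a fun ride .")
print(feats)
```

Punctuation tokens such as `,` and `.` survive here because they are not in the stopword set, so they become features like any other token.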

In [31]:
def fit_mlp(X, y):
    mod = TorchShallowNeuralClassifier(early_stopping=True)
    mod.fit(X, y)
    return mod

In [32]:
experiment = sst.experiment(
    sst.train_reader(SST_HOME),
    phi,
    fit_mlp,
    sst.dev_reader(SST_HOME))

Stopping after epoch 24. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 1.3742991983890533

              precision    recall  f1-score   support

    negative      0.629     0.696     0.661       428
     neutral      0.295     0.100     0.150       229
    positive      0.625     0.773     0.691       444

    accuracy                          0.603      1101
   macro avg      0.516     0.523     0.500      1101
weighted avg      0.558     0.603     0.567      1101



Trained model:

In [33]:
sst_classifier = experiment['model']

Captum needs to have labels as indices rather than strings:

In [34]:
sst_classifier.classes_

['negative', 'neutral', 'positive']

In [35]:
y_sst_test = [sst_classifier.classes_.index(label)
              for label in experiment['assess_datasets'][0]['y']]

sst_preds = [sst_classifier.classes_.index(label)
             for label in experiment['predictions'][0]]
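
The conversion is just a lookup from class name to position in `classes_`. A dictionary makes each lookup constant-time, which can matter for large datasets. The sketch below is a standalone illustration with made-up labels; `class2index` is a hypothetical name.

```python
classes = ['negative', 'neutral', 'positive']

# Map each class name to its position in `classes`.
class2index = {label: i for i, label in enumerate(classes)}

labels = ['positive', 'negative', 'neutral', 'positive']  # made-up labels
indices = [class2index[label] for label in labels]
print(indices)  # [2, 0, 1, 2]
```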

Our featurized test set:

In [36]:
X_sst_test = experiment['assess_datasets'][0]['X']

Feature names to help with analyses:

In [37]:
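# `get_feature_names()` was removed in scikit-learn 1.2. On versions
# before 1.0, call `get_feature_names()` instead.
fnames = experiment['train_dataset']['vectorizer'].get_feature_names_out()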

Integrated gradients:

In [38]:
sst_ig = IntegratedGradients(sst_classifier.model)

All-0s baseline:

In [39]:
sst_baseline = torch.zeros(1, experiment['train_dataset']['X'].shape[1])

Attributions with respect to the model's predictions:

In [40]:
sst_attrs = sst_ig.attribute(
    torch.FloatTensor(X_sst_test),
    sst_baseline,
    target=torch.LongTensor(sst_preds))

Helper functions for error analysis:

In [41]:
def error_analysis(gold=1, predicted=2):
    err_ind = [i for i, (g, p) in enumerate(zip(y_sst_test, sst_preds))
               if g == gold and p == predicted]
    attr_lookup = create_attr_lookup(sst_attrs[err_ind])
    return attr_lookup, err_ind

def create_attr_lookup(attrs):
    mu = attrs.mean(axis=0).detach().numpy()
    return sorted(zip(fnames, mu), key=itemgetter(1), reverse=True)

In [42]:
sst_attrs_lookup, sst_err_ind = error_analysis(gold=1, predicted=2)

In [43]:
sst_attrs_lookup[: 5]

[(',', 0.04512846003808179),
 ('.', 0.03875377384651548),
 ('film', 0.036562292947638124),
 ('fun', 0.02995531556022619),
 ('best', 0.015621606617978723)]

Error analysis for a specific example:

In [44]:
ex_ind = sst_err_ind[0]

In [45]:
experiment['assess_datasets'][0]['raw_examples'][ex_ind]

'No one goes unindicted here , which is probably for the best .'

In [46]:
ex_attr_lookup = create_attr_lookup(sst_attrs[ex_ind:ex_ind+1])

In [47]:
[(f, a) for f, a in ex_attr_lookup if a != 0]

[('best', 0.43364394640626314),
 (',', 0.04500691178712216),
 ('.', 0.03940604247146967),
 ('probably', 0.03321118433841792),
 ('one', 0.008722432294266332),
 ('goes', -0.03914730368530946)]

## BERT example

In [48]:
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from captum.attr import LayerIntegratedGradients
from captum.attr import visualization as viz

In [49]:
hf_weights_name = 'cardiffnlp/twitter-roberta-base-sentiment'

In [50]:
hf_tokenizer = AutoTokenizer.from_pretrained(hf_weights_name)

In [51]:
hf_model = AutoModelForSequenceClassification.from_pretrained(hf_weights_name)

In [52]:
def hf_predict_one_proba(text):
    input_ids = hf_tokenizer.encode(
        text, add_special_tokens=True, return_tensors='pt')
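    # Eval mode disables dropout so that predictions are deterministic.
    # The model is left in eval mode afterwards so that later
    # attribution calls are not affected by dropout either.
    hf_model.eval()
    with torch.no_grad():
        logits = hf_model(input_ids)[0]
        preds = F.softmax(logits, dim=1)
    return preds.squeeze(0)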
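
The softmax step turns the model's raw logits into a probability distribution over classes. A dependency-free sketch of the computation follows. The logit values are made up, and subtracting the maximum is a standard numerical-stability trick rather than something visible in the code above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([-1.2, 0.3, 2.1])  # made-up logits for 3 classes
# Index of the most probable class:
pred_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], pred_class)
```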

In [53]:
def hf_ig_encodings(text):
    pad_id = hf_tokenizer.pad_token_id
    cls_id = hf_tokenizer.cls_token_id
    sep_id = hf_tokenizer.sep_token_id
    input_ids = hf_tokenizer.encode(text, add_special_tokens=False)
    base_ids = [pad_id] * len(input_ids)
    input_ids = [cls_id] + input_ids + [sep_id]
    base_ids = [cls_id] + base_ids + [sep_id]
    return torch.LongTensor([input_ids]), torch.LongTensor([base_ids])
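
The baseline keeps the special tokens in place and replaces every content token with the padding token, so the input and the baseline have the same length and differ only at the content positions. The sketch below shows the construction on made-up token ids. The special-token ids (0, 1, 2) are assumptions chosen for illustration; in practice they should be read from the tokenizer, as in the function above.

```python
def build_ig_pair(content_ids, cls_id=0, pad_id=1, sep_id=2):
    """Return (input_ids, baseline_ids) as equal-length lists."""
    input_ids = [cls_id] + list(content_ids) + [sep_id]
    base_ids = [cls_id] + [pad_id] * len(content_ids) + [sep_id]
    return input_ids, base_ids


input_ids, base_ids = build_ig_pair([713, 16, 372])  # made-up ids
print(input_ids)  # [0, 713, 16, 372, 2]
print(base_ids)   # [0, 1, 1, 1, 2]
```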

In [54]:
def hf_ig_analyses(text2class):
    data = []
    for text, true_class in text2class.items():
        score_vis = hf_ig_analysis_one(text, true_class)
        data.append(score_vis)
    viz.visualize_text(data)


def hf_ig_analysis_one(text, true_class):
    # Option to look at different layers:
    # layer = hf_model.roberta.encoder.layer[0]
    # layer = hf_model.roberta.embeddings.word_embeddings
    layer = hf_model.roberta.embeddings

    def ig_forward(inputs):
        return hf_model(inputs).logits

    ig = LayerIntegratedGradients(ig_forward, layer)

    input_ids, base_ids = hf_ig_encodings(text)

    attrs, delta = ig.attribute(
        input_ids,
        base_ids,
        target=true_class,
        return_convergence_delta=True)

    # Sum over the hidden dimension to get one score per token, then
    # mean-center and scale by the L2 norm. (This is not a true
    # z-score, which would divide by the standard deviation.)
    scores = attrs.sum(dim=-1).squeeze(0)
    scores = (scores - scores.mean()) / scores.norm()

    # Intuitive tokens to help with analysis:
    raw_input = hf_tokenizer.convert_ids_to_tokens(input_ids.tolist()[0])
    # RoBERTa-specific clean-up:
    raw_input = [x.strip("Ġ") for x in raw_input]

    # Predictions for comparisons:
    pred_probs = hf_predict_one_proba(text)
    pred_class = pred_probs.argmax()

    score_vis = viz.VisualizationDataRecord(
        word_attributions=scores,
        pred_prob=pred_probs.max(),
        pred_class=pred_class,
        true_class=true_class,
        attr_class=None,
        attr_score=attrs.sum(),
        raw_input_ids=raw_input,
        convergence_score=delta)

    return score_vis
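
The normalization step above mean-centers the per-token scores and divides by the L2 norm of the uncentered scores. Since centering cannot increase the norm, every resulting value lies in [-1, 1], which is convenient for color-coded displays. A z-score, by contrast, divides by the standard deviation and is not bounded in this way. The plain-Python sketch below compares the two on made-up scores; the use of the population standard deviation is an arbitrary choice for the illustration.

```python
import math
import statistics

scores = [0.2, -0.1, 1.5, 0.05, -0.4]  # made-up per-token scores

mean = statistics.fmean(scores)
l2_norm = math.sqrt(sum(s * s for s in scores))

# Normalization used in the analysis function: center, then divide by
# the L2 norm of the original scores.
norm_scaled = [(s - mean) / l2_norm for s in scores]

# True z-scores: center, then divide by the population standard deviation.
std = statistics.pstdev(scores)
z_scores = [(s - mean) / std for s in scores]

print([round(v, 3) for v in norm_scaled])
print([round(v, 3) for v in z_scores])
```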

In [55]:
hf_ig_analyses({
    "They said it would be great, and they were right.": 2,
    "They said it would be great, and they were wrong.": 0,
    "They were right to say it would be great.": 2,
    "They were wrong to say it would be great.": 0,
    "They said it would be stellar, and they were correct.": 2,
    "They said it would be stellar, and they were incorrect.": 0})

| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|---|---|---|---|---|
| 2.0 | 2 (0.82) | | 1.98 | #s They said it would be great , and they were right . #/s |
| 0.0 | 0 (0.50) | | 0.07 | #s They said it would be great , and they were wrong . #/s |
| 2.0 | 2 (0.76) | | 2.39 | #s They were right to say it would be great . #/s |
| 0.0 | 0 (0.62) | | 3.46 | #s They were wrong to say it would be great . #/s |
| 2.0 | 2 (0.77) | | 1.78 | #s They said it would be stellar , and they were correct . #/s |
