# Explainable Artificial Intelligence (XAI) - Large Language Models

___

**Introduction**


Machine learning (ML) systems are increasingly integrated across domains, from self-driving cars to smart assistants. The widespread use of ML in complex applications has increased interest in systems that are not only high-performing but also safe, fair, and reliable. Unlike straightforward performance metrics such as accuracy, however, these additional criteria are often hard to quantify. For instance, it is difficult to enumerate every safety check a semi-autonomous vehicle needs, or to fully verify that a credit scoring system is free from bias. In these situations, understanding how an ML system arrives at its output becomes essential: if the system can explain its reasoning, we can assess whether that reasoning meets these standards.

This is where Explainable AI (XAI) becomes invaluable. For data scientists and researchers, it is a tool for uncovering biases and errors, and thereby for improving a system's performance and fairness. For end users and consumers, XAI fosters trust and acceptance by making AI decisions clear and understandable. The ability to look under the hood of AI systems and understand how they reach decisions is a necessity, especially as privacy and transparency laws such as GDPR emphasize clarity in AI-driven decisions.

**Resources**

- Using Captum to explain LLMs (Meta paper): https://aclanthology.org/2023.nlposs-1.19.pdf

# Traditional Fine-tuning Paradigm

## Local Explanations

### 1. Perturbation-based method



In [None]:
# below is the example provided by the Meta paper
# Assumes `model` (a causal language model) and `tokenizer` are already loaded.
# Note: the paper's listing names the input class TextTemplateFeature; released
# Captum versions expose it as TextTemplateInput.
from captum.attr import FeatureAblation, LLMAttribution, TextTemplateInput

fa = FeatureAblation(model)
llm_attr = LLMAttribution(fa, tokenizer)
inp = TextTemplateInput(
    # the text template
    "{} lives in {}, {} and is a {}. {} personal interests include",
    # the values of the features
    ["Dave", "Palm Coast", "FL", "lawyer", "His"],
    # the reference baseline values of the features
    baselines=["Sarah", "Seattle", "WA", "doctor", "Her"],
)
# With no target given, the model's own generated continuation is attributed
llm_attr.attribute(inp)

In [None]:
# assignment: perturb each word to see when the sentiment changes
# todo: see if there is a way to quantify each word's contribution to the output prediction, like coefficients.

from transformers import pipeline

# Specify the model name
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load a pre-trained sentiment analysis model with the specified model name
nlp = pipeline("sentiment-analysis", model=model_name)

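def perturb_input(sentence, word_to_remove):
    """
    Removes every occurrence of a word from the sentence (case-insensitive,
    ignoring attached punctuation) and returns the modified sentence.
    """
    target = word_to_remove.lower()
    words = sentence.split()
    # Strip punctuation before comparing so that "good," or "good." also match
    modified_words = [word for word in words if word.strip(".,!?;:\"'").lower() != target]
    return ' '.join(modified_words)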

def analyze_sentiment_change(original_sentence, word_to_remove):
    """
    Analyzes the change in sentiment when a specific word is removed from the sentence.
    """
    # Original sentiment
    original_sentiment = nlp(original_sentence)[0]
    
    # Modified sentence after removing the word
    modified_sentence = perturb_input(original_sentence, word_to_remove)
    
    # Sentiment of the modified sentence
    modified_sentiment = nlp(modified_sentence)[0]
    
    print(f"Original sentence: {original_sentence}")
    print(f"Sentiment: {original_sentiment}")
    print(f"\nModified sentence: {modified_sentence}")
    print(f"Sentiment: {modified_sentiment}")

# Example usage
original_sentence = "The movie was surprisingly good despite its slow start."
word_to_remove = "good"
analyze_sentiment_change(original_sentence, word_to_remove)
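
The TODO in the cell above asks for a per-word score. One minimal sketch is leave-one-out (occlusion) importance: remove each word in turn and record how much the positive-class score drops. The `toy_positive_score` function and `TOY_LEXICON` below are hypothetical stand-ins for the sentiment pipeline, used only so the example runs without downloading a model.

```python
import math

# Hypothetical stand-in for the sentiment model: a tiny lexicon-based scorer.
TOY_LEXICON = {"good": 2.0, "surprisingly": 0.5, "slow": -1.0, "despite": -0.25}

def toy_positive_score(sentence):
    """Return a pseudo-probability of positive sentiment in (0, 1)."""
    total = sum(TOY_LEXICON.get(w.strip(".,!?").lower(), 0.0) for w in sentence.split())
    return 1.0 / (1.0 + math.exp(-total))

def leave_one_out_importance(sentence, score_fn):
    """Score each word by how much the output drops when that word is removed."""
    words = sentence.split()
    base = score_fn(sentence)
    importances = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        importances.append((word, base - score_fn(reduced)))
    return importances

sentence = "The movie was surprisingly good despite its slow start."
scores = leave_one_out_importance(sentence, toy_positive_score)

# Print words ordered by the magnitude of their effect
for word, delta in sorted(scores, key=lambda pair: -abs(pair[1])):
    print(f"{word:>12}: {delta:+.3f}")
```

A positive value means removing the word lowers the positive score, so the word supports the positive prediction. To apply the same idea to the real model, `score_fn` would need to return the pipeline's positive-class probability (the reported `score` when the label is `POSITIVE`, otherwise one minus it).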


### 2. Gradient-based method


- [Integrated gradients](https://captum.ai/tutorials/IMDB_TorchText_Interpret)

- [Medium article](https://medium.com/@CVxTz/add-interpretability-to-your-nlp-model-the-easy-way-using-captum-ec56f538f746) with [associated notebook](https://colab.research.google.com/drive/1SKitCFjbiZ3k7eL3UfMuKOISKmD_b9bc?usp=sharing)
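
As a rough illustration of what integrated gradients computes, the sketch below applies it to a hypothetical three-feature logistic model in plain Python rather than to a transformer. The weights, input, and all-zeros baseline are assumptions made for the example. The attribution for feature i is (x_i - baseline_i) times the average gradient along the straight path from the baseline to the input.

```python
import math

# Hypothetical model parameters for a 3-feature logistic model
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = 0.1

def model(x):
    """Logistic model: sigmoid(w . x + b)."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x):
    """Analytic gradient of the logistic model with respect to the input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x = [1.0, 0.5, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("F(x) - F(baseline):", model(x) - model(baseline))
```

The last two printed numbers should agree closely. This is the completeness property: the attributions add up to the change in model output between the baseline and the input.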

### 3. Surrogate Models

- [shap](https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/text.html)

In [None]:
# LIME
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from lime.lime_text import LimeTextExplainer

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model.eval()

# Define a prediction function that takes texts and returns probability scores
def predict_proba(texts):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Apply softmax to logits to get probabilities
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    # Convert to numpy array and return
    return probabilities.numpy()

# Create a LimeTextExplainer
explainer = LimeTextExplainer(class_names=["Negative", "Positive"])

# Example text to explain
input_text = "The movie was surprisingly good despite its slow start."

# Generate explanation with LIME
exp = explainer.explain_instance(input_text, predict_proba, num_features=6, labels=[1])

# Show explanation
exp.show_in_notebook(text=True)


In [None]:
# SHAP
import datasets
import numpy as np
import torch
import transformers
from scipy.special import logit

import shap

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# load a DistilBERT sentiment analysis model
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased"
)
model = transformers.DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).to(device)


# define a prediction function
def f(x):
    tv = torch.tensor(
        [
            tokenizer.encode(v, padding="max_length", max_length=500, truncation=True)
            for v in x
        ]
    ).to(device)
    with torch.no_grad():
        outputs = model(tv)[0].cpu().numpy()
    scores = (np.exp(outputs).T / np.exp(outputs).sum(-1)).T
    val = logit(scores[:, 1])  # use one vs rest logit units
    return val


# build an explainer using a token masker
explainer = shap.Explainer(f, tokenizer)

# explain the model's predictions on IMDB reviews
# (the `nlp` package was renamed to `datasets`)
imdb_train = datasets.load_dataset("imdb")["train"]
shap_values = explainer(imdb_train[:10]["text"], fixed_context=1)
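
SHAP estimates Shapley values, which become expensive to compute exactly as the number of features grows. For a handful of features they can be computed by brute force, which may help build intuition for what the library approximates. The sketch below averages each word's marginal contribution over all orderings; `toy_value` is a hypothetical scoring function standing in for a model, not part of SHAP.

```python
from itertools import permutations

def shapley_values(players, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value_fn(frozenset(coalition))
            coalition.add(p)
            totals[p] += value_fn(frozenset(coalition)) - before
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_value(kept_words):
    """Hypothetical model score for a sentence containing only `kept_words`."""
    score = 0.0
    if "good" in kept_words:
        score += 2.0
    if "slow" in kept_words:
        score -= 1.0
    # Interaction: "surprisingly" only matters when "good" is also present
    if "good" in kept_words and "surprisingly" in kept_words:
        score += 1.0
    return score

words = ["surprisingly", "good", "slow"]
phi = shapley_values(words, toy_value)
for word, value in phi.items():
    print(f"{word:>12}: {value:+.2f}")
```

The interaction bonus is split evenly between "surprisingly" and "good", and the values sum to the score of the full word set minus the score of the empty set.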

In [None]:
# from the Meta paper on Captum
# Assumes `model` (a causal language model) and `tokenizer` are already loaded.
# Note: the paper's listing names the input class TextTemplateFeature; released
# Captum versions expose it as TextTemplateInput.
from captum.attr import (
    ShapleyValueSampling,
    LLMAttribution,
    TextTemplateInput,
    ProductBaselines,
)

svs = ShapleyValueSampling(model)
baselines = ProductBaselines(
    {
        ("name", "pronoun"): [("Sarah", "Her"), ("John", "His")],
        "city": ["Seattle", "Boston"],
        "state": ["WA", "MA"],
        "occupation": ["doctor", "engineer", "teacher", "technician", "plumber"],
    }
)
llm_attr = LLMAttribution(svs, tokenizer)
inp = TextTemplateInput(
    "{name} lives in {city}, {state} and is a {occupation}. {pronoun} personal interests include",
    {"name": "Dave", "city": "Palm Coast", "state": "FL", "occupation": "lawyer", "pronoun": "His"},
    baselines=baselines,
)
attr_result = llm_attr.attribute(inp, target="playing golf, hiking, and cooking.")

### 4. Decomposition method

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import LayerLRP

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model.eval()  # Set the model to evaluation mode

# Tokenize the input text
input_text = "The movie was surprisingly good despite its slow start."
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"].to(model.device)

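# Captum's LayerLRP expects the whole model first and the layer of interest second.
# The model must return a tensor, so wrap it to expose the logits.
class LogitsModel(torch.nn.Module):
    def __init__(self, hf_model):
        super().__init__()
        self.hf_model = hf_model

    def forward(self, input_ids):
        return self.hf_model(input_ids).logits

layer_lrp = LayerLRP(LogitsModel(model), model.distilbert.embeddings)

# Caveat: Captum ships LRP propagation rules only for a limited set of layer types
# (such as nn.Linear). DistilBERT contains layers outside that set (for example
# LayerNorm and Embedding), so this call is expected to raise an error until a
# `rule` attribute is assigned to each of those modules.
# Compute attributions for the positive class (index 1) using LRP
attributions = layer_lrp.attribute(input_ids, target=1)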

# Process and visualize your attributions as needed
# Here we sum the attributions across the embedding dimension and print them
attributions_sum = attributions.sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze().tolist())
for token, attribution in zip(tokens, attributions_sum.cpu().detach().numpy()):
    print(f"{token}: {attribution}")
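
To show the decomposition idea in isolation, the sketch below applies the LRP epsilon rule to a tiny bias-free ReLU network in plain Python. It is not Captum's implementation; the weights and input are hypothetical. Relevance starts as the network output and is redistributed layer by layer in proportion to each input's contribution `a_i * w_ik` to the pre-activation `z_k`.

```python
# Hypothetical weights of a tiny bias-free network:
# 3 inputs -> 2 hidden units (ReLU) -> 1 output
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.5, 0.5]]  # W1[i][j]: input i -> hidden j
W2 = [[2.0], [-1.0]]                          # W2[j][k]: hidden j -> output k
EPS = 1e-9  # stabilizer that keeps the denominator away from zero

def linear(a, W):
    """Bias-free linear layer: z_k = sum_i a_i * W[i][k]."""
    return [sum(a[i] * W[i][k] for i in range(len(a))) for k in range(len(W[0]))]

def lrp_linear(a, W, relevance_out):
    """Redistribute relevance from a layer's outputs to its inputs (epsilon rule)."""
    z = linear(a, W)
    relevance_in = [0.0] * len(a)
    for k, (z_k, r_k) in enumerate(zip(z, relevance_out)):
        denom = z_k + (EPS if z_k >= 0 else -EPS)
        for i in range(len(a)):
            relevance_in[i] += a[i] * W[i][k] / denom * r_k
    return relevance_in

# Forward pass
x = [1.0, 2.0, 0.5]
h = [max(0.0, v) for v in linear(x, W1)]
out = linear(h, W2)

# Backward relevance pass: ReLU passes relevance through unchanged
r_hidden = lrp_linear(h, W2, out)
r_input = lrp_linear(x, W1, r_hidden)

print("output:", out[0])
print("input relevances:", r_input)
print("sum of input relevances:", sum(r_input))
```

Because the network has no biases, relevance is conserved: the input relevances sum to the output up to the small epsilon term.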


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from captum.attr import Saliency

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model.eval()  # Set the model to evaluation mode

# Tokenize the input text
input_text = "The movie was surprisingly good despite its slow start."
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"].to(model.device)

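# Token IDs are integers and cannot carry gradients, so attribute with respect
# to the input embeddings instead
attention_mask = inputs["attention_mask"].to(model.device)
input_embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)

# Define a forward function that maps embeddings to the model's logits
def forward_func(input_embeds):
    return model(inputs_embeds=input_embeds, attention_mask=attention_mask).logits

# Initialize Saliency using the forward function
saliency = Saliency(forward_func)

# Compute the saliency scores with respect to the target class (positive sentiment)
# Assuming the positive class index is 1
attributions = saliency.attribute(input_embeds, target=1)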

# Process and visualize the attributions as before
# ...


# Global Explanations