This notebook is largely the same as the PDF-conversion section of [Token_attention_with_head_importance](https://colab.research.google.com/drive/1iVojJQp0CZS484tMZqIizosXPLxgKvRX?usp=sharing), but it focuses solely on converting the attentions into a PDF visualization. It predicts over the dataset and finds interesting examples whose attentions one may want to visualize, such as false negatives, false positives, and very confident predictions. It then writes a LaTeX file that compiles to a PDF of the text, with each token's attention layered on top of the token.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Install and Import Dependencies

In [None]:
# import sys
# sys.path.append('/content/drive/My Drive/{}'.format("cogs402longformer/"))

In [None]:
pip install datasets --quiet

In [None]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Import Dataset and Model

In [None]:
import os

import numpy as np
import pandas as pd

import torch
import torch.nn as nn

Import the Research Papers dataset

In [None]:
from datasets import load_dataset
from transformers import LongformerForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('allenai/longformer-base-4096')
model_path = 'danielhou13/longformer-finetuned_papers_v2'
model_path2 = 'danielhou13/longformer-finetuned-news-cogs402'
model_path3 = 'allenai/longformer-base-4096'

def longformer_finetuned_papers():
    test = torch.load("/content/drive/MyDrive/fakeclinicalnotes/models/full_augmented_lr2e-5_dropout3_10_trained_threshold.pt")
    model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096', state_dict=test['state_dict'], num_labels = 2)
    return model

def preprocess_function(tokenizer, example, max_length):
    example.update(tokenizer(example['text'], padding='max_length', max_length=max_length, truncation=True))
    return example

def get_papers_dataset(dataset_type):
    max_length = 2048
    dataset = load_dataset("danielhou13/cogs402datafake")[dataset_type]

    # tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
    dataset = dataset.map(lambda x: preprocess_function(tokenizer, x, max_length), batched=True)
    setattr(dataset, 'input_columns', ['input_ids', 'attention_mask'])
    setattr(dataset, 'target_columns', ['labels'])
    setattr(dataset, 'max_length', max_length)
    setattr(dataset, 'tokenizer', tokenizer)
    return dataset

def papers_test_set():
    return get_papers_dataset('test')

def papers_train_set():
    return get_papers_dataset('train')

Load papers model and dataset and preprocess it

In [None]:
cogs402_test = papers_train_set()
model = longformer_finetuned_papers()
columns = cogs402_test.input_columns + cogs402_test.target_columns
print(columns)
cogs402_test.set_format(type='torch', columns=columns)
cogs402_test=cogs402_test.remove_columns(['text'])

Using custom data configuration danielhou13--cogs402datafake-f5349e6cf83e41d8
Reusing dataset parquet (/root/.cache/huggingface/datasets/danielhou13___parquet/danielhou13--cogs402datafake-f5349e6cf83e41d8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/danielhou13___parquet/danielhou13--cogs402datafake-f5349e6cf83e41d8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-3b0f0b1006ba438f.arrow
Some weights of the model checkpoint at allenai/longformer-base-4096 were not used when initializing LongformerForSequenceClassification: ['longformer_model.encoder.layer.8.attention.self.query.bias', 'longformer_model.encoder.layer.0.attention.self.query_global.bias', 'longformer_model.encoder.layer.0.attention.self.query.bias', 'longformer_model.encoder.layer.2.attention.output.dense.weight', 'longformer_model.encoder.layer.4.attention.self.value.weight', 'longformer_model.encoder.layer.4.attention.output.LayerNorm.bias', 'longformer_model.encoder.layer.9.attention.output.dense.bias', 'longformer_model.encoder.layer.3.output.LayerNorm.bias', 'longformer_model.encoder.layer.6.attention.output.LayerNorm.bias', 'longformer_model.encoder.layer.5.output.d

['input_ids', 'attention_mask', 'labels']


Load news model and dataset and preprocess it

In [None]:
# cogs402_test = news_test_set()
# model = longformer_finetuned_news(model_path2)
# columns = cogs402_test.input_columns + cogs402_test.target_columns
# print(columns)
# cogs402_test.set_format(type='torch', columns=columns)
# cogs402_test=cogs402_test.remove_columns(['text'])

In [None]:
if torch.cuda.is_available():
    model = model.cuda()

print(model.device)

cuda:0


## Predict over the dataset

Predict using the model on the selected dataset using the [Huggingface trainer](https://huggingface.co/docs/transformers/main_classes/trainer) API.

In [None]:
from transformers import Trainer, TrainingArguments

batch_size = 1
gradient_acc = 4
model_name = f"longformer-finetuned_papers"
training_args = TrainingArguments(output_dir=f"models/{model_name}",
                                  num_train_epochs = 2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  push_to_hub=False,
                                  log_level="error",
                                  fp16=True,
                                  gradient_accumulation_steps=gradient_acc,
                                  gradient_checkpointing=True,
                                  save_strategy = "epoch")

F1 and accuracy are good general metrics for model performance. Recall and precision can be used if desired.

In [None]:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
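
If precision and recall are wanted alongside accuracy and F1, they can be computed directly from the confusion counts. The following is a minimal sketch on toy labels, not part of the original pipeline; the helper name `precision_recall` and the toy arrays are our own, and the values it returns could be merged into the dictionary that `compute_metrics` returns.

```python
import numpy as np

def precision_recall(labels, preds, positive=1):
    """Precision and recall for one class, computed from raw counts."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    tp = int(np.sum((preds == positive) & (labels == positive)))
    fp = int(np.sum((preds == positive) & (labels != positive)))
    fn = int(np.sum((preds != positive) & (labels == positive)))
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}

# Toy data (hypothetical): one true positive, one false positive, two false negatives.
toy_labels = [1, 1, 1, 0, 0]
toy_preds = [1, 0, 0, 0, 1]
toy_scores = precision_recall(toy_labels, toy_preds)
print(toy_scores)
```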

Put the finishing touches on our trainer by passing in the arguments, model, metrics, and data collator (the collator doesn't really matter here, as we pass in one item at a time).

In [None]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator = data_collator
)

Here we predict over the entire dataset loaded above.

In [None]:
preds_output = trainer.predict(cogs402_test)

## Picking Examples

False negatives and false positives are usually very interesting examples to analyze. To get the list of all false negatives and false positives, we need our model's predictions and the list of true labels.

In [None]:
y_preds = np.argmax(preds_output.predictions, axis=1)
y_true = np.array(cogs402_test["labels"])

We can get the lists of false negatives and false positives by subtracting the predicted labels from the true labels.


If the difference is 0, the prediction is correct, as the two labels are the same. We can then filter by the value of the true label to separate the positive and negative classes.


If the difference is negative, then the actual label is 0 while the predicted label is 1 (as 0-1 is -1), so we have a false positive.


On the other hand, if the difference is positive, then the actual label is 1 while the predicted label is 0 (as 1-0 is 1), so we have a false negative.

In [None]:
diff = y_true-y_preds
correct = np.where(diff == 0)[0]

pos = np.where((y_true-y_preds == 0) & (y_true==1))[0]
neg = np.where((y_true-y_preds == 0) & (y_true==0))[0]

false_pos = np.where(diff == -1)[0]
false_neg = np.where(diff == 1)[0]

print('Correctly classified: ', correct)

print('cor pos: ', pos)
print('cor neg: ', neg)

print('False positives: ', false_pos)
print('False negatives: ', false_neg)

Correctly classified:  [ 9 10 11]
cor pos:  []
cor neg:  [ 9 10 11]
False positives:  []
False negatives:  [0 1 2 3 4 5 6 7 8]
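
To make the subtraction rule concrete, here is a small worked example on five made-up labels. The `toy_` arrays are hypothetical and independent of the dataset above; only the `np.where` pattern mirrors the cell.

```python
import numpy as np

toy_true = np.array([1, 0, 1, 0, 1])
toy_pred = np.array([1, 1, 0, 0, 1])

# 0 = correct, -1 = false positive (true 0, predicted 1), 1 = false negative (true 1, predicted 0)
toy_diff = toy_true - toy_pred

toy_false_pos = np.where(toy_diff == -1)[0]
toy_false_neg = np.where(toy_diff == 1)[0]
toy_true_pos = np.where((toy_diff == 0) & (toy_true == 1))[0]
toy_true_neg = np.where((toy_diff == 0) & (toy_true == 0))[0]

print(toy_diff)
print(toy_false_pos, toy_false_neg, toy_true_pos, toy_true_neg)
```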


Pick examples for evaluation at random.

In [None]:
# rand_pos = np.random.choice(pos, size=1)
# rand_neg = np.random.choice(neg, size=1)
# rand_fp = np.random.choice(false_pos, size=1)
# rand_fn = np.random.choice(false_neg, size=1)

Other interesting examples are those most confidently predicted to be positive or negative (i.e. the examples with the highest output score for each class).

In [None]:
highest_pos = [np.argmax(preds_output.predictions[:,1])]
highest_neg = [np.argmax(preds_output.predictions[:,0])]

# for news dataset
# highest_neg = [np.argmax(np.delete(preds_output.predictions, 1933, 0)[:,0])]

print(highest_pos)
print(highest_neg)

[10]
[7]
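
The cell above ranks examples by the raw score in each column. Assuming `preds_output.predictions` holds logits rather than probabilities, the largest logit for a class is not always the largest probability, so it may be worth applying a softmax first. A minimal sketch on made-up logits (the `softmax` helper and `toy_` names are ours):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for three examples, columns = (negative, positive).
toy_logits = np.array([[2.0, -1.0], [0.5, 0.4], [-3.0, 4.0]])
toy_probs = softmax(toy_logits)

most_confident_pos = int(np.argmax(toy_probs[:, 1]))
most_confident_neg = int(np.argmax(toy_probs[:, 0]))
print(most_confident_pos, most_confident_neg)
```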


## Getting the attention

Now that we have the example we want to visualize the attentions for, we pass it into the model again in order to obtain the attention output. We stack the attentions to get an output attention tensor of shape: (layer, batch, head, seq_len, x + attention_window + 1) and a global attention tensor of shape (layer, batch, head, seq_len, x) where x is the number of global attention tokens.

In [None]:
test_val = [7]
print(test_val)
testexam = cogs402_test[test_val]

[7]


In [None]:
output = model(testexam["input_ids"].cuda(), attention_mask=testexam['attention_mask'].cuda(), labels=testexam['labels'].cuda(), output_attentions = True)
batch_attn = output[-2]
output_attentions = torch.stack(batch_attn).cpu()
global_attention = output[-1]
output_global_attentions = torch.stack(global_attention).cpu()
print("output_attention.shape", output_attentions.shape)
print("gl_output_attention.shape", output_global_attentions.shape)

output_attention.shape torch.Size([12, 1, 12, 2046, 514])
gl_output_attention.shape torch.Size([12, 1, 12, 2048, 1])


In [None]:
print(testexam['labels'][0])
print(output[1].argmax())

tensor(1)
tensor(0, device='cuda:0')


In [None]:
# print(os.getcwd())
# yes = torch.load("resources/longformer_test2/epoch_3/aggregate_attn.pt")

A unique property of the Longformer model is that its attention output is not a seq_len x seq_len matrix. Each token can only attend to the preceding w/2 tokens and the succeeding w/2 tokens, where w is the model's attention window. This is called sliding window attention. Therefore, we need to convert the sliding attention matrix to a full seq_len x seq_len matrix to remain consistent with other types of Transformer neural networks.

To do so, we run the following 4 functions. Our attentions will change from an output attention tensor of shape (layer, batch, head, seq_len, x + attention_window + 1) and a global attention tensor of shape (layer, batch, head, seq_len, x) to a single tensor of shape (layer, batch, head, seq_len, seq_len). More information about the functions can be found [here](https://colab.research.google.com/drive/1Kxx26NtIlUzioRCHpsR8IbSz_DpRFxEZ#scrollTo=liVhkxiH9Le0).

In [None]:
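def create_head_matrix(output_attentions, global_attentions):
    # output_attentions: (seq_len, x + attention_window + 1) for one head
    # global_attentions: (seq_len, x) for the same head
    seq_len = output_attentions.shape[0]
    new_attention_matrix = torch.zeros((seq_len, seq_len))
    for i in range(seq_len):
        # Nonzero columns of row i; the first is assumed to be the global-token column.
        test_non_zeroes = torch.nonzero(output_attentions[i]).squeeze()
        test2 = output_attentions[i][test_non_zeroes[1:]]
        # Local column 257 is token i itself (1 global column + 256 = window // 2),
        # so shifting by i - 257 gives absolute token positions.
        new_attention_matrix_indices = test_non_zeroes[1:]-257 + i
        new_attention_matrix[i][new_attention_matrix_indices] = test2
        # Column 0 holds the attention from token i to the global token.
        new_attention_matrix[i][0] = output_attentions[i][0]
    # Row 0 is the global token's attention to every other token.
    new_attention_matrix[0] = global_attentions.squeeze()[:seq_len]
    return new_attention_matrix.detach().cpu().numpy()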


def attentions_all_heads(output_attentions, global_attentions):
    new_matrix = []
    for i in range(output_attentions.shape[0]):
        matrix = create_head_matrix(output_attentions[i], global_attentions[i])
        new_matrix.append(matrix)
    return np.stack(new_matrix)


def all_batches(output_attentions, global_attentions):
    new_matrix = []
    for i in range(output_attentions.shape[0]):
        matrix = attentions_all_heads(output_attentions[i], global_attentions[i])
        new_matrix.append(matrix)
    return np.stack(new_matrix)

def all_layers(output_attentions, global_attentions):
    new_matrix = []
    for i in range(output_attentions.shape[0]):
        matrix = all_batches(output_attentions[i], global_attentions[i])
        new_matrix.append(matrix)
    return np.stack(new_matrix)

In [None]:
converted_mat = all_layers(output_attentions, output_global_attentions)
print(converted_mat.shape)

(12, 1, 12, 2046, 2046)
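
The index shift used in the conversion can be illustrated on a tiny band matrix. This is a simplified sketch, not the functions above: it ignores the global-attention column, uses a one-sided window of 1 instead of 256, and the name `band_to_full` is our own.

```python
import numpy as np

def band_to_full(band, one_sided_window):
    """Expand (seq_len, 2*w + 1) sliding-window scores to (seq_len, seq_len)."""
    seq_len, width = band.shape
    assert width == 2 * one_sided_window + 1
    full = np.zeros((seq_len, seq_len), dtype=band.dtype)
    for i in range(seq_len):
        for k in range(width):
            j = i - one_sided_window + k   # absolute position of the attended token
            if 0 <= j < seq_len:           # skip window slots that fall off either end
                full[i, j] = band[i, k]
    return full

# Toy band: 5 tokens, each seeing itself and one neighbour on each side.
toy_band = np.arange(1, 16, dtype=np.float32).reshape(5, 3)
toy_full = band_to_full(toy_band, one_sided_window=1)
print(toy_full)
```

The middle column of the band lands on the diagonal of the full matrix, which is the same role the offset of 257 plays in `create_head_matrix`.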


## Formatting the attentions

Our end goal is to overlay the attentions onto the tokens and produce a PDF of the results, so we need the original tokens from the text. We can't use the original text directly, as it is one large string, but the tokenizer can convert our input ids back to a list of tokens.

In [None]:
all_tokens = tokenizer.convert_ids_to_tokens(testexam["input_ids"][0])
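
The Longformer tokenizer is a byte-level BPE tokenizer, so the tokens it returns mark a leading space with "Ġ" and a newline with "Ċ". If those markers clutter the rendered text, they can be stripped before plotting. The helper below is a hypothetical sketch on made-up tokens and is not used elsewhere in this notebook.

```python
def readable_tokens(tokens):
    """Replace byte-level BPE markers with plain characters for display."""
    cleaned = []
    for token in tokens:
        token = token.replace("\u0120", " ")   # "Ġ" marks a leading space
        token = token.replace("\u010a", " ")   # "Ċ" marks a newline
        # Keep one token per position so the list still lines up with the attentions.
        cleaned.append(token.strip() or " ")
    return cleaned

toy_tokens = ["<s>", "The", "\u0120patient", "\u0120was", "\u010a", "</s>"]
toy_cleaned = readable_tokens(toy_tokens)
print(toy_cleaned)
```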

Some heads may be more important than others, so we scale each attention matrix by the importance of its head and layer. The notebook used to get head importance is [here](https://colab.research.google.com/drive/1O4QCi8ewBp7asegKqySRflTQZ9HeH8mQ?usp=sharing).

In [None]:
# head_importance = torch.load("/content/drive/MyDrive/fakeclinicalnotes/t3-visapplication/notes/head_importance.pt")
head_importance = torch.load("/content/drive/MyDrive/cogs402longformer/t3-visapplication/resources/notes/head_importance.pt") 

In [None]:
def scale_by_importance(attention_matrix, head_importance):
  new_matrix = np.zeros_like(attention_matrix)
  for i in range(attention_matrix.shape[0]):
    head_importance_layer = head_importance[i]
    for j in range(attention_matrix.shape[1]):
      new_matrix[i,j] = attention_matrix[i,j] * np.expand_dims(head_importance_layer, axis=(1,2))
  return new_matrix

In [None]:
converted_mat_importance = scale_by_importance(converted_mat, head_importance)
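
The same scaling can also be written as a single broadcast multiplication. The sketch below assumes the importance scores form a (layer, head) NumPy array; if `head_importance` is loaded as a torch tensor, it would first need converting, for example with `.cpu().numpy()`. The `toy_` arrays are random stand-ins, not real attentions.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_attn = rng.random((2, 1, 3, 4, 4)).astype(np.float32)   # (layer, batch, head, seq, seq)
toy_importance = rng.random((2, 3)).astype(np.float32)       # (layer, head)

# Insert singleton axes so the weights line up with (layer, batch, head, seq, seq).
toy_scaled = toy_attn * toy_importance[:, None, :, None, None]

# Reference: explicit loops over layer and head give the same result.
toy_loop = np.zeros_like(toy_attn)
for layer in range(toy_attn.shape[0]):
    for head in range(toy_attn.shape[2]):
        toy_loop[layer, :, head] = toy_attn[layer, :, head] * toy_importance[layer, head]

print(np.allclose(toy_scaled, toy_loop))
```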

Get the sum of the attentions for all the tokens (column-wise). In other words, find out how much every token is attended to.

In [None]:
attention_matrix_importance = converted_mat_importance.sum(axis=3)
print(attention_matrix_importance.shape)

(12, 1, 12, 2046)
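
The difference between summing over rows and over columns is easiest to see on a small matrix. In this made-up 3-token example, rows are the tokens doing the attending and columns are the tokens being attended to, which matches summing over axis 3 above.

```python
import numpy as np

toy = np.array([[0.7, 0.2, 0.1],
                [0.5, 0.4, 0.1],
                [0.6, 0.1, 0.3]])

given = toy.sum(axis=1)      # each row distributes a total of 1.0
received = toy.sum(axis=0)   # total attention each token receives

print(given)
print(received)
```

Here the first token receives the most attention, even though every token gives out the same total.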


## Visualizing the Attention

A dataframe is good for picking out information from the example, but it isn't the easiest visualization to read. It's easier to see how much each word is attended to if we have the actual example, with the words highlighted based on the magnitude of attention.

We use https://github.com/jiesutd/Text-Attention-Heatmap-Visualization to show how much each token in the example is attended to, up to the max number of tokens we specified earlier.

In short, these functions iterate over the list of attentions and tokens, clean the tokens by removing or escaping LaTeX-sensitive characters, and optionally normalize the data.

In [None]:
## Convert the text/attention list to LaTeX code, which generates the text heatmap based on attention weights.
import numpy as np

latex_special_token = ["!@#$%^&*(){}"]

def generate(text_list, attention_list, latex_file, color='red', rescale_value = True):
	assert(len(text_list) == len(attention_list))
	if rescale_value:
		attention_list = rescale(attention_list)
	word_num = len(text_list)
	text_list = clean_word(text_list)
	with open(latex_file,'w') as f:
		f.write(r'''\documentclass[varwidth]{standalone}
\special{papersize=210mm,297mm}
\usepackage{color}
\usepackage{tcolorbox}
\usepackage{CJK}
\usepackage{adjustbox}
\tcbset{width=0.9\textwidth,boxrule=0pt,colback=red,arc=0pt,auto outer arc,left=0pt,right=0pt,boxsep=5pt}
\begin{document}
\begin{CJK*}{UTF8}{gbsn}'''+'\n')
		string = r'''{\setlength{\fboxsep}{0pt}\colorbox{white!0}{\parbox{0.9\textwidth}{'''+"\n"
		for idx in range(word_num):
			string += "\\colorbox{%s!%s}{"%(color, attention_list[idx])+"\\strut " + text_list[idx]+"} "
		string += "\n}}}"
		f.write(string+'\n')
		f.write(r'''\end{CJK*}
\end{document}''')

def rescale(input_list):
	the_array = np.asarray(input_list)
	the_max = np.max(the_array)
	the_min = np.min(the_array)
	rescale = ((the_array - the_min)/(the_max-the_min))*100
	return rescale.tolist()


def clean_word(word_list):
	new_word_list = []
	for word in word_list:
		for special_sensitive in ["\\", "^"]:
			if special_sensitive in word:
				word = word.replace(special_sensitive, '')
		for latex_sensitive in ["%", "&", "#", "_",  "{", "}"]:
			if latex_sensitive in word:
				word = word.replace(latex_sensitive, '\\' +latex_sensitive)
		new_word_list.append(word)
	return new_word_list
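
Note that `rescale` divides by `the_max - the_min`, which is zero when every attention value is identical. A guarded variant could look like the sketch below; `rescale_safe` is our own name, and returning all zeros in the constant case is an assumption about the desired behaviour.

```python
import numpy as np

def rescale_safe(values):
    """Min-max scale to [0, 100]; return all zeros when every value is equal."""
    arr = np.asarray(values, dtype=np.float64)
    span = arr.max() - arr.min()
    if span == 0:
        return np.zeros_like(arr).tolist()
    return ((arr - arr.min()) / span * 100).tolist()

print(rescale_safe([2.0, 4.0, 6.0]))
print(rescale_safe([3.0, 3.0, 3.0]))
```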

Here, we sum the attentions over all layers and heads.

In [None]:
average_attention = attention_matrix_importance.squeeze().sum(axis=1)
average_attention = average_attention.sum(axis=0)
print(average_attention)

[266.9439    43.682247  43.542274 ...  44.49121   43.936462  44.018005]


We call the main function above. It takes in a list of tokens, a list of attentions, an output file name, and a colour. Please change "notes" in the file name to whatever your project requires.

In [None]:
title_all = f"notes_{test_val[0]}.tex"
generate(all_tokens, average_attention, title_all, 'red')

Let's suppose you don't want the attentions over all layers, but just one layer. You can get that by doing one less summation and instead picking out the layer you want directly. Here we pick out the last layer.

In [None]:
print(attention_matrix_importance[11].squeeze().shape)
average_attention_final_layer = attention_matrix_importance[11].squeeze().sum(axis=0)
print(average_attention_final_layer)

# mean_12 = np.median(average_attention_final_layer)
# average_attention_final_layer[average_attention_final_layer < mean_12] = 0
# print(average_attention_final_layer)

(12, 2046)
[8.128963  1.2755948 1.462493  ... 1.4194398 1.2911068 1.3994944]


We call the main function above. Please change "notes" in the file name to whatever your project requires.

In [None]:
title_last_layer = f"notes_{test_val[0]}_layer_12_only.tex"
generate(all_tokens, average_attention_final_layer, title_last_layer, 'red')

Of course, you can experiment with which layers or heads you visualize the attentions for, depending on what your own project needs.
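
As a starting point for such experiments, the sketch below picks out a single head and sums over the last four layers. It runs on a random stand-in with the same axis order as `attention_matrix_importance`, (layer, batch, head, seq_len); the array and the chosen indices are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_importance_attn = rng.random((12, 1, 12, 8))   # (layer, batch, head, seq_len)

# Zero-based indices: last layer, fourth head.
layer_idx, head_idx = 11, 3
single_head = toy_importance_attn[layer_idx, 0, head_idx]

# Sum over the last four layers and all of their heads.
last_four_layers = toy_importance_attn[-4:, 0].sum(axis=(0, 1))

print(single_head.shape, last_four_layers.shape)
```

Either result is a one-dimensional array with one value per token, so it can be passed to `generate` in place of `average_attention`.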