This notebook explores the relationship between a model's attributions and its attentions for a given example. Previously, we found that attentions are not a feasible method of explanation whereas attributions are, but attributions are not part of a model's standard outputs. It is therefore interesting to see whether attentions reveal anything when we compare them to attributions, a feasible and plausible method of explanation. Furthermore, we apply masking in several scenarios to examine its effect on the similarity between attributions and attentions.

This notebook creates and exports a dictionary and a few dataframes. The dictionary stores the attributions so that we do not have to recompute them, as that takes a while. The exported dataframes summarize the results for every example that has been explored so far.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Import dependencies

In [None]:
pip install transformers --quiet


In [None]:
pip install captum --quiet

[?25l[K     |▎                               | 10 kB 35.9 MB/s eta 0:00:01[K     |▌                               | 20 kB 21.1 MB/s eta 0:00:01[K     |▊                               | 30 kB 16.8 MB/s eta 0:00:01[K     |█                               | 40 kB 14.8 MB/s eta 0:00:01[K     |█▏                              | 51 kB 6.8 MB/s eta 0:00:01[K     |█▍                              | 61 kB 8.1 MB/s eta 0:00:01[K     |█▋                              | 71 kB 7.3 MB/s eta 0:00:01[K     |█▉                              | 81 kB 8.2 MB/s eta 0:00:01[K     |██                              | 92 kB 9.1 MB/s eta 0:00:01[K     |██▎                             | 102 kB 7.5 MB/s eta 0:00:01[K     |██▌                             | 112 kB 7.5 MB/s eta 0:00:01[K     |██▊                             | 122 kB 7.5 MB/s eta 0:00:01[K     |███                             | 133 kB 7.5 MB/s eta 0:00:01[K     |███▏                            | 143 kB 7.5 MB/s eta 0:00:01[K 

In [None]:
pip install datasets --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.

In [None]:
pip install rbo --quiet

In [None]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [None]:
from captum.attr import visualization as viz
from captum.attr import IntegratedGradients, LayerConductance, LayerIntegratedGradients
from captum.attr import configure_interpretable_embedding_layer, remove_interpretable_embedding_layer

import torch
import pandas as pd

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## Import model

Replace `model_path`, the checkpoint path, and the tokenizer with the ones for your own project.

In [None]:
from transformers import LongformerForSequenceClassification, LongformerTokenizer, LongformerConfig

model_path = 'danielhou13/longformer-finetuned_papers_v2'
#model_path = 'danielhou13/longformer-finetuned-new-cogs402'

# load model
test = torch.load("/content/drive/MyDrive/cogs402longformer/fakeclinicalnotes/models/full_augmented_lr2e-5_dropout3_10_trained_threshold.pt")
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096', state_dict=test['state_dict'], num_labels = 2)
model.to(device)
model.eval()
model.zero_grad()

# load tokenizer
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")

Some weights of the model checkpoint at allenai/longformer-base-4096 were not used when initializing LongformerForSequenceClassification: ['longformer_model.encoder.layer.1.attention.self.value_global.weight', 'longformer_model.encoder.layer.1.output.LayerNorm.bias', 'longformer_model.encoder.layer.0.attention.self.key_global.weight', 'longformer_model.encoder.layer.7.attention.self.key.weight', 'longformer_model.embeddings.position_ids', 'longformer_model.encoder.layer.4.attention.self.key.bias', 'longformer_model.encoder.layer.8.output.dense.weight', 'longformer_model.encoder.layer.9.attention.self.query.weight', 'longformer_model.encoder.layer.3.attention.self.value_global.weight', 'longformer_model.encoder.layer.1.attention.output.LayerNorm.bias', 'longformer_model.encoder.layer.0.output.dense.weight', 'longformer_model.encoder.layer.11.attention.self.query.bias', 'longformer_model.encoder.layer.1.attention.self.query_global.weight', 'longformer_model.encoder.layer.2.output.LayerNo


In [None]:
ref_token_id = tokenizer.pad_token_id # A token used for generating token reference
sep_token_id = tokenizer.sep_token_id # A token used as a separator between question and text and it is also added to the end of the text.
cls_token_id = tokenizer.cls_token_id # A token used for prepending to the concatenated question-text word sequence

## Import Dataset

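Here we import the dataset. The active code reads the fake clinical notes CSV from Drive; the commented-out line loads a dataset from the Hugging Face Hub instead.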

In [None]:
import datasets  # needed for datasets.Dataset below
from datasets import load_dataset
import numpy as np
# cogs402_ds = load_dataset("danielhou13/cogs402datafake")["train"]

# Read the notes from CSV and wrap them in a Hugging Face Dataset
ds = pd.read_csv("/content/drive/MyDrive/cogs402longformer/fakeclinicalnotes/data/fake_notes.csv")
dataset = datasets.Dataset.from_pandas(ds)
cogs402_ds = dataset


Alternatively, import the news dataset by uncommenting the cell below.

In [None]:
# cogs402_ds2 = load_dataset('hyperpartisan_news_detection', 'bypublisher')['validation']
# val_size = 5000
# val_indices = np.random.randint(0, len(cogs402_ds2), val_size)
# val_ds = cogs402_ds2.select(val_indices)
# labels2 = map(int, val_ds['hyperpartisan'])
# labels2 = list(labels2)
# val_ds = val_ds.add_column("labels", labels2)

## Get Attributions

We need to create a custom forward function for use with [Integrated Gradients](https://arxiv.org/abs/1703.01365). Specifically, the output we want from the forward pass of the model is the softmaxed logits, which give the probability of each class for the given example.

In [None]:
def predict(inputs, position_ids=None, attention_mask=None):
    output = model(inputs,
                   position_ids=position_ids,
                   attention_mask=attention_mask)
    return output.logits

In [None]:
# Returns class probabilities; the class to attribute is chosen later via `target` (1 for the positive class, 0 for the negative class)
def custom_forward(inputs, position_ids=None, attention_mask=None):
    preds = predict(inputs,
                   position_ids=position_ids,
                   attention_mask=attention_mask
                   )
    return torch.softmax(preds, dim = 1)
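
To make the role of the softmax concrete, here is a minimal, dependency-free sketch of what the forward function returns for one example. The logits are made up for illustration, and `softmax` is a hypothetical pure-Python stand-in for `torch.softmax`, not the notebook's own code.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [-1.2, 2.3]            # hypothetical [negative, positive] logits
toy_probs = softmax(toy_logits)     # probabilities for both classes
toy_positive_prob = toy_probs[1]    # the value that target=1 would select
```

Attributing with respect to index 1 therefore explains the probability of the positive class rather than the raw logit.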

To get the attributions, we perform Integrated Gradients using the model's embeddings and pass in our custom forward function.

In [None]:
lig = LayerIntegratedGradients(custom_forward, model.longformer.embeddings)

Here we pick out the example we want to compare the attributions and the attentions for. You should either pick this example at random, or if another part of your project has given some interesting results, you can use that example.

In [None]:
example = 7
text = cogs402_ds['text'][example]
label = cogs402_ds['labels'][example]

Create functions that give us the input ids and the position ids for the text we want to examine. They also return the baselines we need for Integrated Gradients. In this case, every text token in the baseline is replaced with a padding token, while the CLS and SEP tokens are kept.

In [None]:
max_length = 2046
def construct_input_ref_pair(text, ref_token_id, sep_token_id, cls_token_id):

    text_ids = tokenizer.encode(text, truncation = True, add_special_tokens=False, max_length = max_length)
    # construct input token ids
    input_ids = [cls_token_id] + text_ids + [sep_token_id]
    # construct reference token ids 

    ref_input_ids = [cls_token_id] + [ref_token_id] * len(text_ids) + [sep_token_id]

    return torch.tensor([input_ids], device=device), torch.tensor([ref_input_ids], device=device), len(text_ids)

def construct_input_ref_pos_id_pair(input_ids):
    seq_length = input_ids.size(1)

    #taken from the longformer implementation
    mask = input_ids.ne(ref_token_id).int()
    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask
    position_ids = incremental_indices.long().squeeze() + ref_token_id

    # we could potentially also use random permutation with `torch.randperm(seq_length, device=device)`
    ref_position_ids = torch.zeros(seq_length, dtype=torch.long, device=device)

    position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
    ref_position_ids = ref_position_ids.unsqueeze(0).expand_as(input_ids)
    return position_ids, ref_position_ids
    
def construct_attention_mask(input_ids):
    return torch.ones_like(input_ids)
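
The following is a small sketch of the same two ideas on plain Python lists: building the padding baseline and deriving position ids from a running count of non-padding tokens. The special-token ids (0, 1, 2) and the text token ids are assumptions chosen for illustration; in the notebook they come from the tokenizer.

```python
from itertools import accumulate

CLS_ID, PAD_ID, SEP_ID = 0, 1, 2   # assumed RoBERTa-style ids, for illustration only

def build_input_and_baseline(text_ids):
    """Wrap the text in CLS/SEP and build a baseline of the same length made of PAD."""
    toy_input = [CLS_ID] + text_ids + [SEP_ID]
    toy_baseline = [CLS_ID] + [PAD_ID] * len(text_ids) + [SEP_ID]
    return toy_input, toy_baseline

def build_position_ids(ids, padding_idx=PAD_ID):
    """Count non-padding tokens cumulatively; padding keeps position padding_idx."""
    mask = [int(tok != padding_idx) for tok in ids]
    running = list(accumulate(mask))
    return [count * m + padding_idx for count, m in zip(running, mask)]

toy_text_ids = [713, 16, 10, 1296]                      # made-up token ids
toy_input_ids, toy_ref_ids = build_input_and_baseline(toy_text_ids)
toy_position_ids = build_position_ids(toy_input_ids)    # [2, 3, 4, 5, 6, 7]
```

The input and its baseline always have the same length, which Integrated Gradients requires in order to interpolate between them.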

In [None]:
all_tokens = {}

We get the inputs, position_ids and the mask along with the baselines. We store the tokens in the dictionary created above for access in future functions.

In [None]:
input_ids, ref_input_ids, sep_id = construct_input_ref_pair(text, ref_token_id, sep_token_id, cls_token_id)
position_ids, ref_position_ids = construct_input_ref_pos_id_pair(input_ids)
attention_mask = construct_attention_mask(input_ids)

indices = input_ids[0].detach().tolist()
all_tokens_curr = tokenizer.convert_ids_to_tokens(indices)

all_tokens[str(example)] = all_tokens_curr

The attributions returned have very high dimensionality (one value per embedding dimension), and we just want a single number for every token in our example, so we sum over the last dimension and squeeze the result to get an array of shape (seq_len). You may notice that we are not normalizing the attributions here. That is fine because we normalize them later.

In [None]:
def summarize_attributions(attributions):
    attributions = attributions.sum(dim=-1).squeeze(0)
    return attributions
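
As a toy illustration of that reduction, the sketch below collapses a made-up (seq_len, hidden_size) attribution table into one score per token. The numbers and the helper name `summarize_token_attributions` are hypothetical.

```python
def summarize_token_attributions(attributions):
    """Sum each token's attribution vector into a single signed score."""
    return [sum(token_vector) for token_vector in attributions]

# Three tokens, each with a 3-dimensional attribution vector (made-up values).
toy_attributions = [
    [0.10, -0.05, 0.20],
    [-0.30, 0.10, 0.00],
    [0.00, 0.00, 0.00],
]
toy_token_scores = summarize_token_attributions(toy_attributions)
```

Because the sum is signed, positive and negative contributions within one token can cancel each other out.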

In [None]:
print(attention_mask.shape)

torch.Size([1, 2048])


For use in later functions, we want to store the attributions we find along with their respective tokens.

In [None]:
all_attributions = {}

On the other hand, if you have a dictionary of attributions already saved, you can import it as follows. Replace the path with a path to your own dictionary.

In [None]:
# all_attributions = torch.load('/content/drive/MyDrive/cogs402longformer/results/papers/papers_attributions/example_attrib_dict.pt')

This function is where we perform Integrated Gradients, sum the attributions and store the result in the dictionary. We can also save the dictionary if we require. If you have already loaded your attributions, you can skip this step.

Note: the attributions will be with respect to the positive class, meaning positive attributions have more influence in the model predicting positive and negative attributions will be more influential in predicting negative.

In [None]:
attributions, delta = lig.attribute(inputs=input_ids,
                                  baselines=ref_input_ids,
                                  return_convergence_delta=True,
                                  additional_forward_args=(position_ids, attention_mask),
                                  target=1,
                                  n_steps=1500,
                                  internal_batch_size = 2)

attributions_sum = summarize_attributions(attributions)

all_attributions[str(example)] = attributions_sum.detach().cpu().numpy()
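
For intuition about `n_steps` and the convergence delta, here is a minimal sketch of Integrated Gradients on a toy function with a known gradient. It uses a plain midpoint Riemann sum, which may differ from the integration rule Captum uses, and `toy_model` and its gradient are invented for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Approximate IG by averaging gradients along the straight path baseline -> x."""
    attributions = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps   # midpoint of each sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i, g in enumerate(grad):
            attributions[i] += g * (x[i] - baseline[i]) / n_steps
    return attributions

def toy_model(v):
    return v[0] ** 2 + 3.0 * v[1]

def toy_model_grad(v):
    return [2.0 * v[0], 3.0]

toy_x, toy_baseline = [2.0, 1.0], [0.0, 0.0]
toy_ig = integrated_gradients(toy_model_grad, toy_x, toy_baseline)
# Completeness check: attributions should add up to f(x) - f(baseline).
toy_gap = sum(toy_ig) - (toy_model(toy_x) - toy_model(toy_baseline))
```

A gap close to zero suggests that enough steps were used; for a real model, a large gap is usually a sign that `n_steps` should be increased.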

In [None]:
# torch.save(all_attributions, '/content/drive/MyDrive/cogs402longformer/results/papers/papers_attributions/example_attrib_dict.pt')

## Grabbing the attentions

We then get the attentions and global attentions so we can compare with the attributions. We stack the attention to get a tensor of shape: (layer, batch, head, seq_len, x + attention_window + 1) and a tensor of shape (layer, batch, head, seq_len, x) where x is the number of global attention tokens.

In [None]:
# Use `device` rather than .cuda() so the cell also runs when no GPU is available
output = model(input_ids.to(device), attention_mask=attention_mask.to(device), labels=torch.tensor(label).to(device), output_attentions = True)
batch_attn = output[-2]
output_attentions = torch.stack(batch_attn).cpu()
global_attention = output[-1]
output_global_attentions = torch.stack(global_attention).cpu()
print("output_attention.shape", output_attentions.shape)
print("gl_output_attention.shape", output_global_attentions.shape)

output_attention.shape torch.Size([12, 1, 12, 2048, 514])
gl_output_attention.shape torch.Size([12, 1, 12, 2048, 1])


A unique property of the Longformer model is that the attention output is not a seq_len x seq_len matrix. Each token can only attend to the preceding w/2 tokens and the succeeding w/2 tokens, where w is the attention window you choose for the model. This is called sliding window attention. Therefore, we need to convert the sliding attention matrix to a full seq_len x seq_len matrix to remain consistent with other types of Transformer neural networks.

To do so, we run the following 4 functions. Our attentions will change from an output attention tensor of shape (layer, batch, head, seq_len, x + attention_window + 1) and a global attention tensor of shape (layer, batch, head, seq_len, x) to a single tensor of shape (layer, batch, head, seq_len, seq_len). More information about the functions can be found [here](https://colab.research.google.com/drive/1Kxx26NtIlUzioRCHpsR8IbSz_DpRFxEZ#scrollTo=t_XCoyTsQKAU).

In [None]:
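def create_head_matrix(output_attentions, global_attentions):
    # output_attentions: (seq_len, x + attention_window + 1) for a single head.
    # With one global token (x = 1), column 0 holds the attention to that token
    # and the remaining columns hold the sliding-window attention.
    seq_len = output_attentions.shape[0]
    new_attention_matrix = torch.zeros((seq_len, seq_len))
    for i in range(seq_len):
        test_non_zeroes = torch.nonzero(output_attentions[i]).squeeze()
        # Skip the first nonzero entry (assumed to be the global column) and copy the local weights.
        test2 = output_attentions[i][test_non_zeroes[1:]]
        # 257 = 1 global column + 256 (half of the 512-token window): shifts a
        # window column index to the absolute position of the attended token.
        new_attention_matrix_indices = test_non_zeroes[1:]-257 + i
        new_attention_matrix[i][new_attention_matrix_indices] = test2
        new_attention_matrix[i][0] = output_attentions[i][0]
    # Row 0 is the global token, which attends to every token in the sequence.
    new_attention_matrix[0] = global_attentions.squeeze()[:seq_len]
    return new_attention_matrix.detach().cpu().numpy()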


def attentions_all_heads(output_attentions, global_attentions):
    new_matrix = []
    for i in range(output_attentions.shape[0]):
        matrix = create_head_matrix(output_attentions[i], global_attentions[i])
        new_matrix.append(matrix)
    return np.stack(new_matrix)


def all_batches(output_attentions, global_attentions):
    new_matrix = []
    for i in range(output_attentions.shape[0]):
        matrix = attentions_all_heads(output_attentions[i], global_attentions[i])
        new_matrix.append(matrix)
    return np.stack(new_matrix)

def all_layers(output_attentions, global_attentions):
    new_matrix = []
    for i in range(output_attentions.shape[0]):
        matrix = all_batches(output_attentions[i], global_attentions[i])
        new_matrix.append(matrix)
    return np.stack(new_matrix)
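
The index arithmetic in the conversion is easier to follow on a tiny example. The sketch below is a simplified, hypothetical version of the idea (one head, one global token at position 0, a one-sided window of 1 instead of 256) written with plain lists; it leaves out the step that overwrites row 0 with the global token's own attention.

```python
def band_to_full(band_rows, one_sided_window, num_global=1):
    """Expand rows of [global columns | sliding-window columns] into a square matrix."""
    seq_len = len(band_rows)
    full = [[0.0] * seq_len for _ in range(seq_len)]
    for i, row in enumerate(band_rows):
        # Leading columns: attention from token i to the global tokens.
        for g in range(num_global):
            full[i][g] = row[g]
        # Remaining columns: window column c maps to absolute position i + c - w/2.
        for c, weight in enumerate(row[num_global:]):
            j = i + c - one_sided_window
            if 0 <= j < seq_len and weight != 0.0:
                full[i][j] = weight
    return full

# 4 tokens; each row is [global | left neighbour, self, right neighbour] (made-up weights).
toy_band = [
    [0.5, 0.0, 0.0, 0.5],
    [0.2, 0.0, 0.5, 0.3],
    [0.1, 0.3, 0.4, 0.2],
    [0.1, 0.4, 0.5, 0.0],
]
toy_full_attention = band_to_full(toy_band, one_sided_window=1)
```

Entry `[i][j]` of the result is the attention from token `i` to token `j`, and positions outside the window stay at zero.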

We should only run one example at a time, so applying the above 4 functions gives a tensor of shape (layer, batch, head, seq_len, seq_len) with a batch size of 1. The batch axis is squeezed out later.

In [None]:
converted_mat = all_layers(output_attentions, output_global_attentions)
print(converted_mat.shape)

(12, 1, 12, 2048, 2048)


We get the attentions for each token by summing the converted attention matrix over the first seq_len axis. As such the resulting matrix is of shape (layer, batch, head, seq_len).

In [None]:
attention_matrix_summed = converted_mat.sum(axis=3)
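
As a small illustration of what this sum means for one head, the sketch below adds up each column of a made-up 3 x 3 attention matrix. Rows are the attending tokens and columns are the attended tokens, so a column total is the attention a token receives; the helper name is hypothetical.

```python
def attention_received(full_matrix):
    """Sum over the attending (row) axis to get one total per attended token."""
    seq_len = len(full_matrix)
    return [sum(full_matrix[i][j] for i in range(seq_len)) for j in range(seq_len)]

toy_matrix = [
    [0.5, 0.5, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
]
toy_received = attention_received(toy_matrix)   # roughly [0.8, 1.3, 0.9]
```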

Some heads may be more important than others, so we scale each attention matrix by the importance of its head and layer. The notebook used to get head importance is [here](https://colab.research.google.com/drive/1sIEvUvCofF0puv0mRZUio3JEF0y1Ce3g?usp=sharing). However, it's possible that you might not want to scale the attentions, in which case you can skip this section.

In [None]:
head_importance = torch.load("/content/drive/MyDrive/cogs402longformer/fakeclinicalnotes/t3-visapplication/notes/head_importance.pt")
# head_importance = torch.load("/content/drive/MyDrive/cogs402longformer/t3-visapplication/resources/news/head_importance.pt")

In [None]:
def scale_by_importance(attention_matrix, head_importance):
  new_matrix = np.zeros_like(attention_matrix)
  for i in range(attention_matrix.shape[0]):
    head_importance_layer = head_importance[i]
    for j in range(attention_matrix.shape[1]):
      new_matrix[i,j] = attention_matrix[i,j] * np.expand_dims(head_importance_layer, axis=(1))
  return new_matrix
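
For a single layer, the scaling amounts to multiplying each head's per-token scores by that head's importance before the heads are summed. A minimal sketch with made-up numbers (two heads, two tokens) and hypothetical helper names:

```python
def scale_heads(per_head_scores, head_importance):
    """Multiply every token score of a head by that head's importance."""
    return [[importance * score for score in scores]
            for scores, importance in zip(per_head_scores, head_importance)]

def sum_over_heads(per_head_scores):
    """Collapse the head axis to get one score per token."""
    return [sum(column) for column in zip(*per_head_scores)]

toy_head_scores = [[0.2, 0.8],    # head 0: scores for token 0 and token 1
                   [0.6, 0.4]]    # head 1
toy_importance = [1.0, 0.5]       # made-up importance per head
toy_scaled = scale_heads(toy_head_scores, toy_importance)
toy_token_totals = sum_over_heads(toy_scaled)   # roughly [0.5, 1.0]
```

Without scaling, both tokens would total 0.8 and 1.2; down-weighting head 1 changes how much each token counts.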

In [None]:
attention_matrix_summed = scale_by_importance(attention_matrix_summed, head_importance)

Here we are using the squeeze function to remove the batch axis, as we are likely only working with one example at a time. After that, we can either select a specific layer we want, or a range of layers we wish to compare. 

In this case, when taking a specific layer, you pick the layer you want (replace 11 with whatever layer you wish) and then we sum over all of the heads.

When taking a range of layers, you either want to specify a range (e.g. attention_matrix_summed[0:6]) or leave as it is to sum over all layers. Then we sum up the layers and the heads.

The result of both versions will be an array of shape (seq_len), the same as our attributions as desired.

In [None]:
attention_final_layer = attention_matrix_summed[11].squeeze().sum(axis=0)
attention_all_layer = attention_matrix_summed.squeeze().sum(axis=1)
attention_all_layer = attention_all_layer.sum(axis=0)
print(attention_all_layer.shape)

(2048,)


## Starting the Comparison

Grab the attributions we stored earlier. As a safeguard, truncate them so that they are never longer than the attentions.



In [None]:
exam_attrib = all_attributions[str(example)]
exam_attrib = exam_attrib[:len(attention_final_layer)]

Now that we have the attributions and the attentions, we want to see how the attributions (in terms of magnitude) compare to the attentions.

First, however, it's a good idea to check the cosine similarities without any processing, using the raw attributions and attentions.

In [None]:
from numpy.linalg import norm
cosine_raw = np.dot(exam_attrib, attention_final_layer) / (norm(exam_attrib)*norm(attention_final_layer))
print("Layer 12 Cosine Similarity raw attrib:\n", cosine_raw)
cosine_all_raw = np.dot(exam_attrib, attention_all_layer) / (norm(exam_attrib)*norm(attention_all_layer))
print("All layer Cosine Similarity raw attrib:\n", cosine_all_raw)

Layer 12 Cosine Similarity raw attrib:
 -0.06023476253384612
All layer Cosine Similarity raw attrib:
 -0.058512041780778415
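
For reference, the cosine similarity computed above can be written as a small dependency-free helper. Returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero; the NumPy expression in the notebook has no such guard.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0   # assumed convention for degenerate inputs
    return dot / (norm_a * norm_b)

# Signed scores against non-negative scores: positive and negative terms cancel.
toy_signed_sim = cosine_similarity([1.0, -1.0], [1.0, 1.0])     # 0.0
# The same magnitudes after taking absolute values line up perfectly.
toy_abs_sim = cosine_similarity([1.0, 1.0], [1.0, 1.0])
```

This cancellation is one reason the raw similarity can sit near zero even when the magnitudes of the two arrays are related.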


The attributions and the attentions have different ranges. The attributions can be negative or positive, whereas the attentions are never negative. However, a negative attribution does not necessarily mean that the token has low attention. It may have very high attention, because it helps the model predict the negative class and that may be something the attention picked up on. Therefore, we take the absolute value of the attributions and then normalize both the attentions and the attributions so that they range from 0 to 1.

In [None]:
def normalize(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data))
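
The sketch below shows the same min-max scaling on a plain list, applied after taking absolute values as described above. The guard for a constant array is an addition made here as an assumption; the NumPy version divides by zero in that case. The input values are made up.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)   # assumed convention for constant input
    return [(v - lo) / (hi - lo) for v in values]

toy_signed_attrib = [-0.4, 0.1, 0.2, -0.05]                      # made-up signed attributions
toy_magnitudes = [abs(v) for v in toy_signed_attrib]
toy_normalized = min_max_normalize(toy_magnitudes)
```

The token with the strongest negative attribution (-0.4) ends up with the highest normalized score, which is the intended effect of taking absolute values first.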

In [None]:
attention_final_layer2 = normalize(attention_final_layer)
attention_all_layer2 = normalize(attention_all_layer)

In [None]:
exam_attrib2 = np.abs(exam_attrib)
exam_attrib2 = normalize(exam_attrib2)

In [None]:
print(exam_attrib2)

[0.         0.3947175  0.17893116 ... 0.20388472 0.03394853 0.        ]


Now we calculate cosine similarity using the normalized attentions and attributions.

In [None]:
cosine = np.dot(exam_attrib2, attention_final_layer2) / (norm(exam_attrib2)*norm(attention_final_layer2))
print("Layer 12 Cosine Similarity:\n", cosine)
cosine2 = np.dot(exam_attrib2, attention_all_layer2) / (norm(exam_attrib2)*norm(attention_all_layer2))
print("All layer Cosine Similarity:\n", cosine2)

Layer 12 Cosine Similarity:
 0.7481737914400142
All layer Cosine Similarity:
 0.7403405674229725


It might be interesting to know whether only the top 50% of tokens share similar attentions and attributions. So, after taking the absolute value of the attributions and normalizing both arrays, we apply a mask that sets every value below the median of its own array to 0.

In [None]:
exam_attrib3 = np.abs(exam_attrib)
exam_attrib3 = normalize(exam_attrib3)
median_exam = np.percentile(exam_attrib3, 50)
exam_attrib3[exam_attrib3 < median_exam] = 0

In [None]:
attention_final_layer3 = np.copy(attention_final_layer)
attention_final_layer3 = normalize(attention_final_layer3)
median_12 = np.percentile(attention_final_layer3, 50)
attention_final_layer3[attention_final_layer3 < median_12] = 0

attention_all_layer3 = np.copy(attention_all_layer) 
attention_all_layer3 = normalize(attention_all_layer3)
median_all = np.percentile(attention_all_layer3, 50)
attention_all_layer3[attention_all_layer3 < median_all] = 0
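
The masking step can be sketched on a plain list as follows. The helper `mask_below` is hypothetical and the values are made up; the same helper also covers a mean threshold, which only changes the cutoff that is passed in.

```python
import statistics

def mask_below(values, threshold):
    """Keep values at or above the threshold and set the rest to 0."""
    return [v if v >= threshold else 0.0 for v in values]

toy_scores = [0.1, 0.2, 0.3, 1.0]                 # made-up normalized scores
toy_median = statistics.median(toy_scores)        # 0.25
toy_mean = statistics.mean(toy_scores)            # about 0.4

toy_median_masked = mask_below(toy_scores, toy_median)   # [0.0, 0.0, 0.3, 1.0]
toy_mean_masked = mask_below(toy_scores, toy_mean)       # [0.0, 0.0, 0.0, 1.0]
```

When a few scores are much larger than the rest, the mean sits above the median, so a mean threshold keeps fewer tokens.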

Now we calculate cosine similarity for the median masked attributions and attentions.

In [None]:
cosine_med = np.dot(exam_attrib3, attention_final_layer3) / (norm(exam_attrib3)*norm(attention_final_layer3))
print("Layer 12 Cosine Similarity med:\n", cosine_med)
cosine_med2 = np.dot(exam_attrib3, attention_all_layer3) / (norm(exam_attrib3)*norm(attention_all_layer3))
print("All layer Cosine Similarity med:\n", cosine_med2)

Layer 12 Cosine Similarity med:
 0.44271328850650005
All layer Cosine Similarity med:
 0.4379153570579718


Now we do the same as above, but this time we mask all the values that are lower than the mean.

In [None]:
exam_attrib4 = np.abs(exam_attrib)
exam_attrib4 = normalize(exam_attrib4)
mean_exam = np.mean(exam_attrib4)
exam_attrib4[exam_attrib4 < mean_exam] = 0

In [None]:
attention_final_layer4 = np.copy(attention_final_layer)
attention_final_layer4 = normalize(attention_final_layer4)
mean_12 = np.mean(attention_final_layer4)
attention_final_layer4[attention_final_layer4 < mean_12] = 0

attention_all_layer4 = np.copy(attention_all_layer) 
attention_all_layer4 = normalize(attention_all_layer4)
mean_all = np.mean(attention_all_layer4)
attention_all_layer4[attention_all_layer4 < mean_all] = 0

Calculate cosine similarity for our mean-masked attentions and attributions.

In [None]:
cosine_mean = np.dot(exam_attrib4, attention_final_layer4) / (norm(exam_attrib4)*norm(attention_final_layer4))
print("Layer 12 Cosine Similarity mean:\n", cosine_mean)
cosine_mean2 = np.dot(exam_attrib4, attention_all_layer4) / (norm(exam_attrib4)*norm(attention_all_layer4))
print("All layer Cosine Similarity mean:\n", cosine_mean2)

Layer 12 Cosine Similarity mean:
 0.4160853945983448
All layer Cosine Similarity mean:
 0.4129308653709167


With our normalized attributions and attentions, tokens with the same rank in both the attention and attribution arrays can have drastically different values. Therefore, even if two arrays have the same ordering when ranked, their similarity may be low.

Converting each value of both arrays into its rank within its own array alleviates this problem: not only do both arrays have the same range, they also contain exactly the same set of values (0 to 2047, or however many tokens your example has). With an identical set of values, we can make sure that if a token has the same rank in both arrays (indicating that the attentions and the attributions have some degree of similarity), our cosine similarity picks up on that.

Note that the ranks are ordered from highest to lowest value, so rank 0 is the token with the largest value and a high rank number means a low value.

In [None]:
exam_attrib_rank = np.abs(exam_attrib)
order_attrib = exam_attrib_rank.argsort()[::-1]
print(order_attrib)
ranks_attrib = order_attrib.argsort()
print(ranks_attrib)

[   0 2047  408 ... 1678 1843 1998]
[   0 1839 1123 ... 1247  229    1]
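
The two arrays printed above are easy to confuse, so here is a toy version without NumPy. The `order` list holds token indices sorted from the highest to the lowest value, and the `ranks` list holds each token's position in that ordering. Helper names and values are made up, and ties are not handled specially.

```python
def descending_order(values):
    """Token indices sorted from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

def ranks_from_order(order):
    """Invert an ordering: ranks[token] is the position of that token in the order."""
    ranks = [0] * len(order)
    for rank, token_index in enumerate(order):
        ranks[token_index] = rank
    return ranks

toy_values = [0.2, 0.9, 0.5, 0.1]
toy_order = descending_order(toy_values)    # [1, 2, 0, 3]
toy_ranks = ranks_from_order(toy_order)     # [2, 0, 1, 3]
```

Token 1 has the largest value, so it comes first in `toy_order` and has rank 0 in `toy_ranks`.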


In [None]:
attention_final_layer_rank = np.copy(attention_final_layer)
order = attention_final_layer_rank.argsort()[::-1]
ranks = order.argsort()

attention_all_layer_rank = np.copy(attention_all_layer)
order2 = attention_all_layer_rank.argsort()[::-1]
ranks2 = order2.argsort()

In [None]:
cosine_rank = np.dot(ranks_attrib, ranks) / (norm(ranks_attrib)*norm(ranks))
print("Layer 12 Cosine Similarity rank:\n", cosine_rank)
cosine_rank2 = np.dot(ranks_attrib, ranks2) / (norm(ranks_attrib)*norm(ranks2))
print("All layer Cosine Similarity rank:\n", cosine_rank2)

Layer 12 Cosine Similarity rank:
 0.7491951488352472
All layer Cosine Similarity rank:
 0.751903946936216


Cosine similarity is not the only similarity metric we can use. Let's evaluate similarity on our example with two other metrics: [Kendall's tau](https://www.jstor.org/stable/2332226), and [Rank-biased Overlap (RBO)](https://dl.acm.org/doi/10.1145/1852102.1852106).

With Kendall's tau, you compare two arrays of rankings, meaning every item in each array is the rank of the corresponding token (here from 0 to max_len - 1).

In [None]:
import scipy.stats as stats
tau, p_value = stats.kendalltau(ranks_attrib, ranks)
print("Tau statistic layer 12:", tau, "p value", p_value)
tau2, p_value2 = stats.kendalltau(ranks_attrib, ranks2)
print("Tau statistic all layers:", tau2, "p value", p_value2)

Tau statistic layer 12: -0.001660203957010259 p value 0.9103459739732513
Tau statistic all layers: 0.005411 p value 0.7136246600448692
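
To show what the statistic measures, here is a naive sketch of Kendall's tau for rankings without ties (the tau-a form). It is quadratic in the number of items and is meant only as an illustration, not as a replacement for `scipy.stats.kendalltau`.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total pairs, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1      # both rankings order the pair the same way
        elif direction < 0:
            discordant += 1      # the rankings disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

toy_same = kendall_tau_a([0, 1, 2, 3], [0, 1, 2, 3])       # 1.0
toy_reversed = kendall_tau_a([0, 1, 2, 3], [3, 2, 1, 0])   # -1.0
toy_one_swap = kendall_tau_a([0, 1, 2, 3], [0, 2, 1, 3])   # 4 / 6
```

A value near 0, as in the output above, means the two rankings agree on about as many token pairs as they disagree on.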


With RBO, instead of passing in an array of rankings, you rank each item in the array such that the item at array index 0 is the highest rank item, the item at array index 1 is the second highest, and the one at array index max_len -1 is the lowest. 

In [None]:
import rbo
rbo_1 = rbo.RankingSimilarity(order_attrib, order).rbo()
rbo_2 = rbo.RankingSimilarity(order_attrib, order2).rbo()
print("rbo layer 12", rbo_1)
print("rbo all", rbo_2)

rbo layer 12 0.5004955440189166
rbo all 0.5031104425220926
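
As a rough sketch of how RBO weights the top of the lists, the function below computes the truncated RBO sum over the common prefix depth. The persistence value `p=0.9` is an arbitrary choice for illustration, and the `rbo` package may use a different default and an extrapolated variant, so its numbers need not match this sketch.

```python
def rbo_truncated(s, t, p=0.9):
    """(1 - p) * sum over depths d of p**(d - 1) * (overlap of the top-d prefixes) / d."""
    depth = min(len(s), len(t))
    seen_s, seen_t = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_s.add(s[d - 1])
        seen_t.add(t[d - 1])
        agreement = len(seen_s & seen_t) / d
        score += (1 - p) * p ** (d - 1) * agreement
    return score

toy_top_swap = rbo_truncated(["a", "b", "c"], ["b", "a", "c"], p=0.5)     # 0.375
toy_bottom_swap = rbo_truncated(["a", "b", "c"], ["a", "c", "b"], p=0.5)  # 0.75
```

Swapping the two highest-ranked items costs more than swapping the two lowest-ranked ones, which is what makes RBO top-weighted.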


Here we compile all of the similarities we calculated into one dataframe for easier viewing.

In [None]:
d = {'example': [example], 'similarity normalized': [cosine], 'similarity raw': [cosine_raw], 'sim_norm w/ median threshold': [cosine_med], 'sim_norm w/ mean threshold': [cosine_mean], "sim w/ ranks":[cosine_rank], "kendall_tau":[tau], "RBO":[rbo_1]}
df = pd.DataFrame(data=d)
df

| | example | similarity normalized | similarity raw | sim_norm w/ median threshold | sim_norm w/ mean threshold | sim w/ ranks | kendall_tau | RBO |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 0.748174 | -0.060235 | 0.442713 | 0.416085 | 0.749195 | -0.00166 | 0.500496 |


We do the same here as we have another set of similarities we want to examine.

In [None]:
d2 = {'example': [example], 'similarity normalized': [cosine2], 'similarity raw': [cosine_all_raw], 'sim_norm w/ median threshold':[cosine_med2], 'sim_norm w/ mean threshold':[cosine_mean2], "sim w/ ranks":[cosine_rank2], "kendall_tau":[tau2], "RBO":[rbo_2]}
df2 = pd.DataFrame(data=d2)
df2

| | example | similarity normalized | similarity raw | sim_norm w/ median threshold | sim_norm w/ mean threshold | sim w/ ranks | kendall_tau | RBO |
|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 0.740341 | -0.058512 | 0.437915 | 0.412931 | 0.751904 | 0.005411 | 0.50311 |


While not completely necessary, you can save these dataframes into a csv and add onto it every time you look at a new example.

In [None]:
# df_layer12 = pd.read_csv("/content/drive/MyDrive/cogs402longformer/results/papers/papers_attributions/cos_sim_layer12.csv")
# df_all = pd.read_csv("/content/drive/MyDrive/cogs402longformer/results/papers/papers_attributions/cos_sim_all.csv")

In [None]:
# df_layer12

In [None]:
# df_all

Append the new row into the dataframe.

In [None]:
# df_layer12 = pd.concat([df_layer12, df], axis=0)
# df_all = pd.concat([df_all, df2], axis=0)

In [None]:
# df_layer12

In [None]:
# df_all

When we concatenate our two dataframes, we want to make sure we don't have any duplicate rows. We consider two rows duplicates if they have the same example number. If we do find duplicate rows, we keep the last occurring instance, which is the one obtained earlier in this notebook (and not the one that was read from file).

In [None]:
# df_layer12 = df_layer12.drop_duplicates(['example'], keep='last')
# df_all = df_all.drop_duplicates(['example'], keep='last')

Sort the rows by example number.

In [None]:
# df_layer12 = df_layer12.sort_values(by=['example'])
# df_all = df_all.sort_values(by=['example'])

Save the dataframe

In [None]:
# df_layer12.to_csv("/content/drive/MyDrive/cogs402longformer/results/papers/papers_attributions/cos_sim_layer12.csv", index=False)
# df_all.to_csv("/content/drive/MyDrive/cogs402longformer/results/papers/papers_attributions/cos_sim_all.csv", index=False)

## Comparing Only the Highest Attentions and Attributions

With long pieces of text, it is generally unlikely that a token will have the same rank in both arrays. However, the intuition is that tokens with very high attributions may also have very high attentions, as they may be what the model focused on when making its prediction.

As such, we apply the same series of functions as we did for masking all the values below the median, but this time we mask all the values below the 95th percentile.

In [None]:
attention_final_layer5 = np.copy(attention_final_layer)
attention_final_layer5 = normalize(attention_final_layer5)

attention_all_layer5 = np.copy(attention_all_layer) 
attention_all_layer5 = normalize(attention_all_layer5)

exam_attrib5 = np.abs(exam_attrib)
exam_attrib5 = normalize(exam_attrib5)
print(exam_attrib5)

[0.         0.3947175  0.17893116 ... 0.20388472 0.03394853 0.        ]


In [None]:
top_final = np.percentile(attention_final_layer5, 95)
top_all = np.percentile(attention_all_layer5, 95)
top_attrib = np.percentile(exam_attrib5, 95)
print(top_attrib)

0.5117164376263138


In [None]:
attention_final_layer5[attention_final_layer5<top_final] = 0
attention_all_layer5[attention_all_layer5<top_all] = 0
exam_attrib5[exam_attrib5<top_attrib] = 0
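
The percentile cutoff can be illustrated without NumPy as follows. The sketch uses linear interpolation between the two nearest sorted values, which is assumed here to correspond to the default behaviour of `np.percentile`; the helper names and data are made up.

```python
import math

def percentile_linear(values, q):
    """q-th percentile with linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = (q / 100) * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

def keep_top(values, q):
    """Zero out every value below the q-th percentile of the list."""
    cutoff = percentile_linear(values, q)
    return [v if v >= cutoff else 0.0 for v in values]

toy_data = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_median_value = percentile_linear(toy_data, 50)    # 3.0
toy_top_only = keep_top(toy_data, 95)                 # [0.0, 0.0, 0.0, 0.0, 5.0]
```

With 2048 tokens, a 95th percentile cutoff leaves roughly 100 nonzero entries per array, so the cosine similarity depends on how many of those surviving positions coincide.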

In [None]:
print(exam_attrib5)

[0. 0. 0. ... 0. 0. 0.]


Calculating cosine similarities again with our new array.

In [None]:
cosine_thresh = np.dot(exam_attrib5, attention_final_layer5) / (norm(exam_attrib5)*norm(attention_final_layer5))
print("Layer 12 Cosine Similarity 95th:\n", cosine_thresh)
cosine_thresh2 = np.dot(exam_attrib5, attention_all_layer5) / (norm(exam_attrib5)*norm(attention_all_layer5))
print("All layer Cosine Similarity 95th:\n", cosine_thresh2)

Layer 12 Cosine Similarity 95th:
 0.06879101353384702
All layer Cosine Similarity 95th:
 0.056949122857615594
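
After masking, each vector is non-zero at only about 5% of positions, so the dot product only picks up positions that are in the top 5% of both arrays, which is one reason the values above are small. The cosine computation itself can be sketched as a small helper. `cosine_similarity` is a hypothetical name, and returning 0 for an all-zero vector is an assumption made here to avoid a division by zero, not something the notebook does.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # assumption: an all-zero vector has no similarity to anything
    return float(np.dot(a, b) / denom)

print(cosine_similarity([1, 0, 0], [1, 0, 0]))  # same non-zero position
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # no shared non-zero position
```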


We do the same for our rankings, but we now set all the ranks below our 95th percentile to 0. Then, we calculate cosine similarities.

In [None]:
num = np.ceil(2048 * 0.95)
exam_attrib_rank2 = np.copy(ranks_attrib)
exam_attrib_rank2[exam_attrib_rank2 < num] = 0

attention_final_layer_rank2 = np.copy(ranks)
attention_final_layer_rank2[attention_final_layer_rank2 < num] = 0

attention_all_layer_rank2 = np.copy(ranks2)
attention_all_layer_rank2[attention_all_layer_rank2 < num] = 0

In [None]:
print(num)

1946.0
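
As a rough illustration of the rank masking, the sketch below derives the cutoff from the array length instead of hard-coding 2048. It assumes 0-based ranks in which the largest value receives the highest rank, as the printed `ranks` suggest; the random `values` array and the name `ranks_sketch` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(2048)  # hypothetical scores, one per token position

# 0-based ranks: the smallest value gets rank 0, the largest gets len(values) - 1.
ranks_sketch = np.argsort(np.argsort(values))

# Same cutoff as above, but computed from the array length.
cutoff = int(np.ceil(len(values) * 0.95))
masked_ranks = np.where(ranks_sketch < cutoff, 0, ranks_sketch)
print(cutoff, np.count_nonzero(masked_ranks))
```

Under these assumptions the cutoff keeps ranks 1946 through 2047, which is 102 positions.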


In [None]:
cosine_rank_top = np.dot(exam_attrib_rank2, attention_final_layer_rank2) / (norm(exam_attrib_rank2)*norm(attention_final_layer_rank2))
print("Layer 12 Cosine Similarity 95th ranks:\n", cosine_rank_top)
cosine_rank_top2 = np.dot(exam_attrib_rank2, attention_all_layer_rank2) / (norm(exam_attrib_rank2)*norm(attention_all_layer_rank2))
print("All layer Cosine Similarity 95th ranks:\n", cosine_rank_top2)

Layer 12 Cosine Similarity 95th ranks:
 0.0790698730093019
All layer Cosine Similarity 95th ranks:
 0.06867245604100757


Of course, cosine similarity isn't the only metric that exists for similarities, so we try RBO again on our new arrays of ranks.

In [None]:
exam_attrib_order2 = np.copy(order_attrib)

attention_final_layer_order2 = np.copy(order)

attention_all_layer_order2 = np.copy(order2)

In [None]:
print("rbo layer 12 95th", rbo.RankingSimilarity(exam_attrib_order2[:int(num)], attention_final_layer_order2[:int(num)]).rbo())
print("rbo all 95th", rbo.RankingSimilarity(exam_attrib_order2[:int(num)], attention_all_layer_order2[:int(num)]).rbo())

rbo layer 12 95th 0.4755699774701047
rbo all 95th 0.47834500264273444
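
For intuition about what RBO measures, here is a minimal sketch of a truncated rank-biased overlap: at each depth `d` it takes the fraction of items shared by the two length-`d` prefixes and weights it by `p ** (d - 1)`. This is not the `rbo` package's implementation, whose weighting and extrapolation options may differ; `truncated_rbo` and its default `p` are assumptions made for illustration.

```python
def truncated_rbo(s, t, p=0.9):
    """Truncated rank-biased overlap of two ranked lists (no duplicates within a list)."""
    k = min(len(s), len(t))
    seen_s, seen_t = set(), set()
    total = 0.0
    for d in range(1, k + 1):
        seen_s.add(s[d - 1])
        seen_t.add(t[d - 1])
        agreement = len(seen_s & seen_t) / d  # overlap of the two depth-d prefixes
        total += (p ** (d - 1)) * agreement
    return (1 - p) * total

print(truncated_rbo([1, 2, 3], [1, 2, 3], p=0.5))  # identical rankings
print(truncated_rbo([1, 2, 3], [4, 5, 6], p=0.5))  # disjoint rankings
```

Smaller values of `p` put more weight on agreement near the top of the two rankings.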


### Examining the Specifics

Checking whether the model's attributions and attentions are exactly the same is one way of comparing the two arrays; another is to determine whether the model puts the most focus on the same group of tokens.

Here we take the set of position ids that make up the top 5 percent of tokens in the attention array and in the attribution array. By doing so, we can find out which tokens both arrays have in common and which tokens are unique to each array. We will be able to identify which tokens are buzzwords in both the attention and the attributions, and compute one last similarity metric to check how well the attention and the attributions agree.

In [None]:
attention_final_layer_top = np.flatnonzero(attention_final_layer5)
attention_final_layer_top = set(attention_final_layer_top)

attention_all_layer_top = np.flatnonzero(attention_all_layer5)
attention_all_layer_top = set(attention_all_layer_top)

exam_attrib_top = np.flatnonzero(exam_attrib5)
exam_attrib_top = set(exam_attrib_top)
print(exam_attrib_top)

{1537, 514, 519, 1545, 1034, 524, 544, 34, 1585, 574, 1600, 1623, 1625, 1630, 622, 118, 1153, 132, 1159, 1165, 1678, 1167, 1694, 1698, 165, 1195, 1204, 700, 199, 201, 1762, 743, 1767, 240, 752, 753, 243, 246, 763, 1278, 259, 260, 1798, 267, 1291, 1294, 1295, 1304, 794, 1819, 1308, 799, 289, 809, 300, 302, 816, 1842, 1843, 309, 823, 1848, 1857, 1858, 1349, 1862, 1353, 845, 335, 1359, 1871, 341, 853, 1365, 1877, 1881, 353, 869, 359, 875, 877, 376, 383, 385, 1426, 1428, 405, 1944, 923, 442, 1997, 1998, 2002, 982, 987, 480, 994, 485, 1512, 2028, 498, 1522, 1523}
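
The set comparisons used in the rest of this section can be previewed on a toy example. The two small arrays below are hypothetical stand-ins for the masked attention and attribution arrays; the names are not used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical masked arrays: non-zero entries mark positions in the top group.
attn_masked = np.array([0.9, 0.0, 0.0, 0.7, 0.0, 0.8])
attr_masked = np.array([0.0, 0.6, 0.0, 0.9, 0.0, 0.7])

attn_top = set(np.flatnonzero(attn_masked).tolist())
attr_top = set(np.flatnonzero(attr_masked).tolist())

only_attention = sorted(attn_top - attr_top)    # high attention, not high attribution
only_attribution = sorted(attr_top - attn_top)  # high attribution, not high attention
shared = sorted(attn_top & attr_top)            # high in both
print(only_attention, only_attribution, shared)
```

Converting with `.tolist()` stores plain Python integers in the sets, which keeps the printed output easy to read.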


In [None]:
print(ranks)
print(attention_final_layer_rank)

[2047    0    3 ...    1   31   76]
[8.345974  1.2188263 1.2822603 ... 1.2687193 1.4237021 1.5241399]


Grab the tokens stored in the `all_tokens` dictionary so that we know which tokens we are working with, since so far we only have their indices.

In [None]:
exam_tokens = all_tokens[str(example)]

Find out which tokens have the highest attentions but not the highest attributions, and display them in a dataframe with the unmasked attentions and the attributions.

In [None]:
diff = sorted(list(attention_final_layer_top - exam_attrib_top))
print(len(diff))
diff_tokens = [exam_tokens[idx] for idx in diff]
d_diff = {"token": diff_tokens, "position":diff, "attention_norm":attention_final_layer2[diff], "attention_rank": ranks[diff], "attribution_norm":exam_attrib2[diff], "attribution_rank":ranks_attrib[diff]}
df_diff = pd.DataFrame(d_diff)
df_diff

95


Unnamed: 0,token,position,attention_norm,attention_rank,attribution_norm,attribution_rank
0,<s>,0,1.000000,2047,0.000000,0
1,Ġa,206,0.152016,1958,0.142334,899
2,Ġbased,224,0.176038,2041,0.297475,1632
3,Ġdialogue,225,0.157150,1993,0.118423,757
4,Ġmanager,226,0.158432,2000,0.290743,1608
...,...,...,...,...,...,...
90,-,1820,0.153382,1965,0.246859,1441
91,Ġthe,1828,0.154999,1976,0.105035,663
92,Ġslot,1829,0.156081,1984,0.195746,1211
93,Ġtags,1830,0.166686,2027,0.180731,1134


Let's check what tokens are different and how many times they appear.

In [None]:
print(df_diff['token'].value_counts())

Ġand             6
Ġa               6
-                5
Ġof              4
Ġdialogue        4
                ..
Ġis              1
Ġdemonstrated    1
Ġseveral         1
Ġdifferent       1
Ġ~               1
Name: token, Length: 67, dtype: int64


Find out which tokens have the highest attributions but not the highest attentions.

In [None]:
diff2 = sorted(list(exam_attrib_top - attention_final_layer_top))
print(len(diff2))
diff_tokens2 = [exam_tokens[idx] for idx in diff2]
d_diff2 = {"token": diff_tokens2, "position":diff2, "attention_norm": attention_final_layer2[diff2], "attention_rank": ranks[diff2], "attribution_norm":exam_attrib2[diff2], "attribution_rank":ranks_attrib[diff2]}
df_diff2 = pd.DataFrame(d_diff2)
df_diff2

95


Unnamed: 0,token,position,attention_norm,attention_rank,attribution_norm,attribution_rank
0,ao,34,0.028510,29,0.757903,2035
1,ĠFeb,118,0.092118,390,0.541050,1971
2,pletion,132,0.098642,604,0.516848,1954
3,Ġsystem,165,0.114003,1259,0.530169,1965
4,Ġneural,199,0.133504,1772,0.634609,2012
...,...,...,...,...,...,...
90,1,1944,0.099056,626,0.555691,1983
91,1,1997,0.055756,129,0.612820,2008
92,"Ġ,",1998,0.042450,75,1.000000,2047
93,"Ġ,",2002,0.042083,72,0.868935,2044


Let's check what tokens are different and how many times they appear.

In [None]:
print(df_diff2['token'].value_counts())

Ġsystem           20
1                  9
Ġneural            7
Ġ,                 7
ĠIn                4
Ġare               3
Ġweekend           3
Ġfind              3
G                  3
Ġwi                3
Ġsequence          2
Ġend               2
Ġmovie             2
ances              2
Ġavailable         2
Ġhave              1
Ġshow              1
ĠâĢĶ               1
Ġcontributions     1
date               1
Ġhow               1
ĠFramework         1
Ġabove             1
Ġassociated        1
ao                 1
Ġlacks             1
ĠFeb               1
Ġsupervised        1
M                  1
Ġcom               1
Ġconsisting        1
Ġcomplex           1
Ġorder             1
Ġallow             1
Ġunderstanding     1
Ġwith              1
pletion            1
ag                 1
Name: token, dtype: int64


Find out which tokens are part of the highest attentions and highest attributions.

In [None]:
same = sorted(list(attention_final_layer_top & exam_attrib_top))
print(len(same))
same_tokens = [exam_tokens[idx] for idx in same]
d_same = {"token": same_tokens, "position":same, "attention_norm": attention_final_layer2[same], "attention_rank": ranks[same], "attribution_norm":exam_attrib2[same], "attribution_rank":ranks_attrib[same]}
df_same = pd.DataFrame(d_same)
df_same

8


Unnamed: 0,token,position,attention_norm,attention_rank,attribution_norm,attribution_rank
0,Ġexperiments,243,0.164711,2022,0.651922,2018
1,Ġmovie,246,0.157143,1992,0.769662,2037
2,Ġnot,260,0.170399,2036,0.513761,1949
3,Ġsystem,267,0.162266,2017,0.722033,2029
4,1,302,0.153822,1968,0.650406,2017
5,1,1767,0.156727,1988,0.818578,2040
6,ents,1798,0.168598,2033,0.524858,1962
7,out,1819,0.156512,1987,0.556536,1984


Let's check what tokens are the same and how many times they appear.

In [None]:
print(df_same['token'].value_counts())

1               2
Ġexperiments    1
Ġmovie          1
Ġnot            1
Ġsystem         1
ents            1
out             1
Name: token, dtype: int64


Our final measure of similarity between the attention and the attributions uses the Jaccard Index, which is the size of the intersection of two sets divided by the size of their union. This gives us an idea of how many tokens in our top 5% are the same and how many are different.

In [None]:
def jaccard_similarity(set1, set2):
    intersection = len(list(set1.intersection(set2)))
    print(intersection)
    union = (len(set1) + len(set2)) - intersection
    print(union)
    return float(intersection) / union

In [None]:
jaccard_similarity(attention_final_layer_top, exam_attrib_top)

8
198


0.04040404040404041
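
The result above follows directly from the set sizes: each set holds 103 positions and 8 are shared, so the union has 103 + 103 - 8 = 198 elements and the index is 8 / 198 ≈ 0.0404. A variant of the function without the debugging prints, using Python's set operators, might look like the sketch below. `jaccard_index` is a hypothetical name, and returning 0 for two empty sets is an assumption.

```python
def jaccard_index(set1, set2):
    """Size of the intersection divided by the size of the union."""
    union = set1 | set2
    if not union:
        return 0.0  # assumption: two empty sets are treated as having no similarity
    return len(set1 & set2) / len(union)

# Toy sets with the same sizes as above: 103 elements each, 8 shared.
a = set(range(0, 103))
b = set(range(95, 198))
print(jaccard_index(a, b))
```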

### Removing Non-Alphanumeric Tokens

Here we run through the 95th percentile analysis again, but we first mask all the tokens that are non-alphabetic or are stopwords before we obtain our top 103 (top 5%) tokens.

In [None]:
import nltk
from transformers import AutoTokenizer
nltk.download('stopwords')
tokenizer2 = AutoTokenizer.from_pretrained('allenai/longformer-base-4096', add_prefix_space=True)

In [None]:
from nltk.corpus import stopwords
all_stopwords = stopwords.words('english')
all_stopwords.append(" ")
stopwords = set(tokenizer2.tokenize(all_stopwords, is_split_into_words =True))
stopwords.update(all_stopwords)
print(stopwords)

In [None]:
attention_final_layer6 = np.copy(attention_final_layer)
attention_final_layer6 = normalize(attention_final_layer6)

attention_all_layer6 = np.copy(attention_all_layer) 
attention_all_layer6 = normalize(attention_all_layer6)

exam_attrib6 = np.abs(exam_attrib)
exam_attrib6 = normalize(exam_attrib6)
print(exam_attrib6)

exam_tokens = all_tokens[str(example)]
alpha_neumeric_nums = [idx for idx, element in enumerate(exam_tokens) if element.isalpha() if element not in stopwords]

[0.         0.3947175  0.17893116 ... 0.20388472 0.03394853 0.        ]


The mask is `True` for every token that fails the filter above (it is not alphabetic, or it is a stopword), so the value at that position will be set to 0.

In [None]:
mask = np.ones(attention_final_layer6.shape,dtype=bool) 
mask[alpha_neumeric_nums] = False

attention_final_layer6[mask] = 0
attention_all_layer6[mask] = 0
exam_attrib6[mask] = 0
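
The token filter and the boolean mask can be seen end to end on a toy example. The token list, scores, and tiny stopword set below are hypothetical. Note that `str.isalpha()` is stricter than an alphanumeric test: it rejects digits as well as punctuation, while the `Ġ` prefix that the tokenizer uses to mark a leading space counts as a letter.

```python
import numpy as np

tokens = ["<s>", "Ġthe", "Ġsystem", ",", "Ġneural", "1", "</s>"]
scores = np.array([1.0, 0.4, 0.8, 0.9, 0.7, 0.6, 0.0])
stopword_tokens = {"Ġthe", "the"}  # hypothetical, tiny stopword set

# Positions to keep: purely alphabetic tokens that are not stopwords.
keep = [i for i, tok in enumerate(tokens)
        if tok.isalpha() and tok not in stopword_tokens]

mask = np.ones(scores.shape, dtype=bool)  # True means "zero this position out"
mask[keep] = False

filtered = scores.copy()
filtered[mask] = 0
print(keep, filtered)
```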

Once we have masked these tokens, we apply the same percentile masking as before to obtain our top 5% of tokens.

In [None]:
top_final2 = np.percentile(attention_final_layer6, 95)
top_all2 = np.percentile(attention_all_layer6, 95)
top_attrib2 = np.percentile(exam_attrib6, 95)
print(top_attrib2)

0.5016993530498294


As before, we zero out the values below each threshold and convert the remaining positions of each array into a set.

In [None]:
attention_final_layer6[attention_final_layer6<top_final2] = 0
attention_all_layer6[attention_all_layer6<top_all2] = 0
exam_attrib6[exam_attrib6<top_attrib2] = 0

attention_final_layer_top2 = np.flatnonzero(attention_final_layer6)
attention_final_layer_top2 = set(attention_final_layer_top2)
print(len(attention_final_layer_top2))

attention_all_layer_top2 = np.flatnonzero(attention_all_layer6)
attention_all_layer_top2 = set(attention_all_layer_top2)
print(len(attention_all_layer_top2))

exam_attrib_top2 = np.flatnonzero(exam_attrib6)
exam_attrib_top2 = set(exam_attrib_top2)
print(len(exam_attrib_top2))

103
103
103


We once again find out which tokens have the highest attentions but not the highest attributions, and display them in a dataframe with the unmasked attentions and the attributions.

In [None]:
diff_alpha = sorted(list(attention_final_layer_top2 - exam_attrib_top2))
print(len(diff_alpha))
diff_alpha_tokens = [exam_tokens[idx] for idx in diff_alpha]
d_diff_alpha = {"token": diff_alpha_tokens, "position":diff_alpha, "attention_norm":attention_final_layer2[diff_alpha], "attribution_norm":exam_attrib2[diff_alpha]}
df_diff_alpha = pd.DataFrame(d_diff_alpha)
df_diff_alpha

95


Unnamed: 0,token,position,attention_norm,attribution_norm
0,Ġinteract,204,0.149415,0.063223
1,Ġa,206,0.152016,0.142334
2,Ġbased,224,0.176038,0.297475
3,Ġdialogue,225,0.157150,0.118423
4,Ġmanager,226,0.158432,0.290743
...,...,...,...,...
90,ĠI,1814,0.149192,0.244435
91,Ġthe,1828,0.154999,0.105035
92,Ġslot,1829,0.156081,0.195746
93,Ġtags,1830,0.166686,0.180731


Let's check what tokens are different and how many times they appear.

In [None]:
print(df_diff_alpha['token'].value_counts())

Ġa              7
Ġand            6
Ġdialogue       4
Ġof             4
Ġto             3
               ..
Ġseveral        1
Ġexperiments    1
Ġdifferent      1
Ġgran           1
x               1
Name: token, Length: 69, dtype: int64


We once again find out which tokens have the highest attributions but not the highest attentions, and display them in a dataframe with the unmasked attentions and the attributions.

In [None]:
diff_alpha2 = sorted(list(exam_attrib_top2- attention_final_layer_top2))
print(len(diff_alpha2))
diff_alpha_tokens2 = [exam_tokens[idx] for idx in diff_alpha2]
d_diff_alpha2 = {"token": diff_alpha_tokens2, "position":diff_alpha2, "attention_norm":attention_final_layer2[diff_alpha2], "attribution_norm":exam_attrib2[diff_alpha2]}
df_diff_alpha2 = pd.DataFrame(d_diff_alpha2)
df_diff_alpha2

95


Unnamed: 0,token,position,attention_norm,attribution_norm
0,End,5,0.008427,0.503378
1,ao,34,0.028510,0.757903
2,iv,103,0.065753,0.504767
3,ĠFeb,118,0.092118,0.541050
4,pletion,132,0.098642,0.516848
...,...,...,...,...
90,Ġsequence,1871,0.118334,0.738424
91,Ġassociated,1877,0.135052,0.556761
92,1,1944,0.099056,0.555691
93,1,1997,0.055756,0.612820


Let's check what tokens are different and how many times they appear.

In [None]:
print(df_diff_alpha2['token'].value_counts())

Ġsystem           21
1                  9
Ġneural            7
ĠIn                4
Ġwi                3
Ġare               3
ĠâĢĶ               3
Ġweekend           3
Ġfind              3
G                  3
Ġend               2
Ġavailable         2
Ġmovie             2
ances              2
Ġunderstanding     2
Ġsequence          2
Ġhow               1
Ġfirst             1
Ġshow              1
Ġhave              1
Ġabove             1
date               1
ĠFramework         1
Ġassociated        1
Ġcontributions     1
End                1
Ġlacks             1
ao                 1
Ġsupervised        1
M                  1
Ġcom               1
Ġconsisting        1
Ġcomplex           1
Ġorder             1
Ġallow             1
Ġwith              1
pletion            1
ĠFeb               1
iv                 1
ag                 1
Name: token, dtype: int64


Lastly, we find out which tokens are part of the highest attentions and highest attributions.

In [None]:
same_alpha = sorted(list(exam_attrib_top2 & attention_final_layer_top2))
print(len(same_alpha))
same_alpha_tokens = [exam_tokens[idx] for idx in same_alpha]
d_same_alpha = {"token": same_alpha_tokens, "position":same_alpha, "attention_norm":attention_final_layer2[same_alpha], "attribution_norm":exam_attrib2[same_alpha]}
df_same_alpha = pd.DataFrame(d_same_alpha)
df_same_alpha

8


Unnamed: 0,token,position,attention_norm,attribution_norm
0,Ġexperiments,243,0.164711,0.651922
1,Ġmovie,246,0.157143,0.769662
2,Ġnot,260,0.170399,0.513761
3,Ġsystem,267,0.162266,0.722033
4,1,302,0.153822,0.650406
5,1,1767,0.156727,0.818578
6,ents,1798,0.168598,0.524858
7,out,1819,0.156512,0.556536


Check what tokens are the same and how many times they appear.

In [None]:
print(df_same_alpha['token'].value_counts())

1               2
Ġexperiments    1
Ġmovie          1
Ġnot            1
Ġsystem         1
ents            1
out             1
Name: token, dtype: int64


Finally, we compute the Jaccard Index to identify how similar our groups of top attributions and top attentions are.

In [None]:
jaccard_similarity(attention_final_layer_top2, exam_attrib_top2)

8
198


0.04040404040404041