In [3]:
import numpy as np
import torch
import transformers
from transformers import BertTokenizer, BertModel
import pandas as pd
import conllu
from unidecode import unidecode

# Transformers and BERT

In 2017, a paper was published that revolutionised deep learning for NLP: [Attention is All you Need](https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html) by Vaswani and others.
This paper described a new way of encoding text using a neural architecture called a transformer.
It essentially ended up replacing recurrent neural networks (RNNs).

The problem with RNNs is that they are inherently sequential.
So regardless of how powerful your computer is, if you want to encode a sentence with $n$ words then you will need to process $n$ sequential time steps.

The transformer solves this problem by doing something similar to taking the average vector of the embedding vectors.
Adding all the embedding vectors together can be done in parallel, so that would solve the RNN's problem.
Unfortunately it also loses word order information, which is a big deal.
We'll get to how word order information is preserved in a bit, but for now we should focus on the rest of the process.

## Queries, keys, and values

Transformers do the following:

* Take each word vector ($w_i$) and use three separate neural layers to transform each word vector into three separate vectors: a query vector ($q_i$), a key vector ($k_i$), and a value vector ($v_i$).
* Combine each $q_i$ with each $k_j$ (including the key from the same word as $q_i$) to produce a value $a_{ij}$ called an attention value.
* Multiply $a_{ij}$ by $v_i$ (the value vector that came from the same word as the key vector) and take their sum to produce a vector for word $j$.

Here is a diagram illustrating this architecture:

![](attention_full.png)

Here is the same diagram but focusing only on the first word's output:

![](attention_focused.png)

The attention value $ij$ is a number that determines how relevant a particular key $j$ is to a particular query $i$.
The more relevant it is, the higher the attention value and the more word $j$ will contribute to the context vector of word $i$.
The attention values of a particular query are passed through a softmax function in order to make them all positive and sum to 1.
Since each context vector consists of the sum of value vectors multiplied by attention values that sum to 1, the context vector will be a weighted average of value vectors.

Essentially, we're comparing each word $w_i$ to each other word $w_j$, including $w_i$ itself, and then determining how important $w_j$ is for $w_i$.
This importance is used to make a weighted average of all the words in the sequence in order to represent the meaning of $w_i$.
The way we compare $w_i$ with $w_j$ is by comparing the query vector of $w_i$ to the key vector of $w_j$, sort of like how a query is used to search for a key in a database.
You can think of the value vector as the value that is returned by the database after making the query, or the value associated with the key in a Python dictionary.
This process is done for each word, which happens all in parallel.
The advantage of having separate query and key vectors is that you can compare a word to itself without comparing a vector to itself.

The attention maker consists of just the dot product of the query and key vectors, which is very fast as you can get the dot product of a bunch of vectors with every vector in another bunch by doing a single matrix multiplication (if each bunch of vectors is made into a matrix).
In order to avoid the dot product getting too big with large vectors, the dot product is divided by the square root of the vector size.

Transformers generally do not stop here but complicate themselves a little more:

* First of all, there isn't just one weighted average of value vectors as shown above but several, which is called multi-head attention.
    The process of multi-head attention consists of simply splitting each incoming vector into a number of smaller 'head' vectors which are then processed separately into different queries, keys, and values.
    The results are then concatented together into a single large vector which is passed through a feedforward layer.
* Secondly, on top of the weighted average of value vectors there is also a vector normalisation layer (converts each vector into a zero-mean unit-variance vector).
    In addition, the incoming vectors are added to the result of each layer in order to create residual connections the reduce the effect of vanishing gradients.
* We should also keep in mind that there isn't just a single layer of this, but several layers repeating the same process.

Finally, it is important to keep in mind that, since we're generating an attention value for each pairing of words, the amount of memory needed is quadratic (squared) to the number of words, which is the price we pay for being able to process everything in parallel.

## The embeddings

The above description suggests that there is no preservation of word order information in the transformer.
As-is, when calculating the attention, there is no way to take into account where a particular query and key pair are situated in the sentence, which makes it equivalent to processing a bag of words vector.

To avoid this, the word embedding vectors are modified to include positional information by adding to each vector a "positional embedding vector".
A positional embedding vector is like a word embedding vector but for positions instead of tokens.
There are many ways to do this, but one way is by using a positional embedding matrix, that is, having a matrix with a row for each possible position, where each row would be a positional embedding vector, similar to the word embedding matrix.
The problem with this is that, just like using a word embedding matrix implies having a fixed number of words in your vocabulary, you'll also need to have a fixed number of positions, which means that you have a maximum sentence length you can process.
This is a normal issue in transformers, but we'll generally run out of memory beyond a certain number of words anyway.

Many transformers also allow you to include 'token type embeddings', which are used to specify extra information about the words.
For example, if you want to input two sentences into the transformer (to do things like predict if one sentence contradicts the other), you would concatenate the two sentences together into a single sequence and use token type embeddings to specify if a word belongs to the first sentence or the other.

## Sentence representation

We've seen how to represent each word in context, but how do you represent a whole sentence with one vector?
A lot of papers to do so by just taking the average of the word vectors.
Another way is to add a pseudo-token to the beginning of each sentence, usually called a 'class token', whose context vector is then used to represent the whole sentence.

## HuggingFace and BERT

For most tasks, you're better off using a pre-trained transformer that turns words in context vectors.
One of the most popular pre-trained transformers is [BERT](https://aclanthology.org/N19-1423/) (Bidirectional Encoder Representations from Transformers).
This transformer was trained by self-supervision on a lot of text (mainly from books and Wikipedia) to do two things: predict the missing word from a sentence and predict if two sentences follow each other.
The first task was trained by using the pseudo-token "\[MASK\]" which stands for a missing token. 
The second was trained by using two pseudo-tokens: "\[SEP\]" (separator) and "\[CLS\]" (class) where the first is placed in between two sentences which are concatenated together into a single sequence whilst the context vector of the second is passed into a classifier to determine if the two sentences follow each other or if they were picked randomly.

Since it's pre-trained, BERT has its own vocabulary that you need to use together with its own tokeniser.
The vocabulary consists of "word pieces" which is a solution to unknown tokens.
Basically, it is possible to break down each individual character in a word into separate tokens, which would make the vocabulary be all the characters in the alphabet, digits, punctuation, and so on.
But this would make the sequences longer than they need to be which requires more memory.
So in addition to the individual characters, the vocabulary also includes commonly occuring substrings that are found in words.
BERT's tokeniser automatically identifies these substrings and makes them a single token.
Any unknown substrings are broken into individual characters and treated as separate tokens.
The tokeniser does not include the space character as a token.
Instead, each token has two versions: the first token in a word and an inner token.
Each of these versions have different indexes in the vocabulary, which allows BERT to know where each word begins and ends.

To use BERT, you can use [HuggingFace](https://huggingface.co/models), a library of pre-trained transformers that are readily usable and downloadable.
To use HuggingFace you just need to install the "transformers" Python library using pip.
Let's try using this library to make use of BERT.

First, you need to download the pre-trained tokeniser and model.

In [12]:
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
bert_model = BertModel.from_pretrained("bert-base-multilingual-cased")

bert_default_vocab = tokenizer.get_vocab().keys()

#!pip install allentune --upgrade weird, perhaps dont


Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:
english_train = open("UD_English-ParTUT/en_partut-ud-train.conllu", "r", encoding="utf-8") 
english_test = open("UD_English-ParTUT/en_partut-ud-test.conllu", "r", encoding="utf-8") 
english_dev = open("UD_English-ParTUT/en_partut-ud-dev.conllu", "r", encoding="utf-8") 

eng_train_data = conllu.parse(english_train.read())  # ['id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc']
eng_train_sents = [[unidecode(token['form']) for token in sentence] for sentence in eng_train_data]
eng_u_train_tags = [[token['upostag'] for token in sentence] for sentence in eng_train_data]

eng_valid_data = conllu.parse(english_dev.read())  # ['id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc']
eng_valid_sents = [[unidecode(token['form']) for token in sentence] for sentence in eng_valid_data]
eng_u_valid_tags = [[token['upostag'] for token in sentence] for sentence in eng_valid_data]

eng_test_data = conllu.parse(english_test.read())  # ['id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc']
eng_test_sents = [[unidecode(token['form']) for token in sentence] for sentence in eng_test_data]
eng_u_test_tags = [[token['upostag'] for token in sentence] for sentence in eng_test_data]


In [15]:
tag_set = list(set(tag for tags in eng_u_train_tags for tag in tags))
tag_set

['ADP',
 '_',
 'PRON',
 'AUX',
 'DET',
 'X',
 'PUNCT',
 'ADV',
 'PROPN',
 'PART',
 'CCONJ',
 'NUM',
 'NOUN',
 'INTJ',
 'ADJ',
 'SCONJ',
 'SYM',
 'VERB']

Next, you need to tokenise your text.

Now we can use BERT to encode the tokens into contextual vectors:

The important thing about these pre-trained transformers is that they are normal PyTorch modules, which means that you can manipulate them however you want.
You can get the parameters of the model, get the activations or attention values (see [here](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel.forward) for how), and you can even optimise their parameters on your own data.
Further optimising a pre-trained model, called fine-tuning, is done to create context vectors that work better for your data set and task.
For example, you can use the pre-trained BERT model in your model to perform part of speech tagging by training both the softmax layer you add on top of BERT and the BERT model itself.
This takes advantage of all the knowledge that was learned by BERT during pre-training so that you get less over-fitting and better performance on small training sets.

There are certain things you need to keep in mind when fine-tuning:

* You have to use `model.train(true)` and `model.train(false)` to say whether the calls you're making on the model are for optimisation or to get predictions.
    This disables the use of dropout and other training regularisation features.
* You should use a smaller learning rate for BERT than for your own parameters.
    This is to avoid "catastrophic forgetting", which is when the pre-trained model overfits on your data and forgets what it was pre-trained on.
    We usually use a learning rate of `2E-5` on BERT.
    You can do this easily in PyTorch, as shown below.
* Due to all the regularisation stuff happening inside BERT, it will have slightly unstable learning progress, which is normal, provided that the error goes down mostly.

Let's look at three common uses of a BERT model.

A 5-word sentence will result in 9 predictions.
How can you make predictions for each whole word?

What is usually done is that only the first token of every word is considered, with all other inner subword tokens being masked out.
Here's a convenient function you can use in your projects that converts a list of words with tags into a list of tokens, mask, and aligned tags:

In [14]:
def get_aligned_tokens_and_tags(tokeniser, words, tags, no_tag=''):
    indexes = [tokeniser.cls_token_id]
    tag_mask = [False]
    aligned_tags = [no_tag]
    for tag, word in zip(tags, words):
        for i, index in enumerate(tokeniser(word)['input_ids'][1:-1]):
            indexes.append(index)
            if i == 0:
                tag_mask.append(True)
                aligned_tags.append(tag)
            else:
                tag_mask.append(False)
                aligned_tags.append(no_tag)
    indexes.append(tokeniser.sep_token_id)
    tag_mask.append(False)
    aligned_tags.append(no_tag)
    
    return (indexes, tag_mask, aligned_tags)

# words = eng_train_sents[0]
# tags = eng_u_train_tags[0]
# print('words:', words)
# print('tags:', tags)
# print()

# (indexes, tag_mask, aligned_tags) = get_aligned_tokens_and_tags(tokenizer, words, tags, no_tag='_')
# print('indexes:', indexes)
# print('tokenised indexes:', tokenizer.convert_ids_to_tokens(indexes))
# print('tag_mask:', tag_mask)
# print('aligned_tags:', aligned_tags)

You can then pad the sequences and turn them into a tensor using this function:

In [16]:
def get_padded_tokens_and_tags(all_indexes, all_tag_masks, all_aligned_tags, no_tag=''):
    max_len = max(len(indexes) for indexes in all_indexes)
    token_masks = []
    new_indexes = []
    new_tag_masks = []
    new_aligned_tags = []
    for (indexes, tag_mask, aligned_tags) in zip(all_indexes, all_tag_masks, all_aligned_tags):
        num_tokens = len(indexes)
        token_masks.append([1]*num_tokens + [0]*(max_len - num_tokens))
        new_indexes.append(indexes + [tokenizer.pad_token_id]*(max_len - num_tokens))
        new_tag_masks.append(tag_mask + [False]*(max_len - num_tokens))
        new_aligned_tags.append(aligned_tags + [no_tag]*(max_len - num_tokens))
    return (
        torch.tensor(token_masks, dtype=torch.int64),
        torch.tensor(new_indexes, dtype=torch.int64),
        torch.tensor(new_tag_masks, dtype=torch.int64),
        torch.tensor(new_aligned_tags, dtype=torch.int64),
    )

# all_words = eng_train_sents
# all_tags = eng_u_train_tags


# all_indexes = []
# all_tag_masks = []
# all_aligned_tags = []
# for words, tags in zip(all_words, all_tags):
#     tags = [tag_set.index(tag) for tag in tags]
#     (indexes, tag_mask, aligned_tags) = get_aligned_tokens_and_tags(tokenizer, words, tags, no_tag=tag_set.index('PUNCT'))
#     all_indexes.append(indexes)
#     all_tag_masks.append(tag_mask)
#     all_aligned_tags.append(aligned_tags)
# (token_masks, new_indexes, new_tag_masks, new_aligned_tags) = get_padded_tokens_and_tags(all_indexes, all_tag_masks, all_aligned_tags, no_tag=tag_set.index('PUNCT'))

# print('token_masks:')
# print(token_masks)
# print()
# print('indexes:')
# print(new_indexes)
# print()
# print('tag_mask:')
# print(new_tag_masks)
# print()
# print('aligned_tags:')
# print(new_aligned_tags)

And now, the code of interest:

In [17]:
train_x = eng_train_sents[:5]
train_y = eng_u_train_tags[:5]

indexed_train_x = []
mask_train_y = []
indexed_train_y = []
for (tokens, tags) in zip(train_x, train_y):
    tags = [tag_set.index(tag) for tag in tags]
    (indexes, tag_mask, aligned_tags) = get_aligned_tokens_and_tags(tokenizer, tokens, tags, tag_set.index('X'))
    indexed_train_x.append(indexes)
    mask_train_y.append(tag_mask)
    indexed_train_y.append(aligned_tags)
(mask_train_x, indexed_train_x, mask_train_y, indexed_train_y) = get_padded_tokens_and_tags(indexed_train_x, mask_train_y, indexed_train_y, tag_set.index('X'))

print('train_x:')
print(train_x)
print()
print('train_y:')
print(train_y)
print()
print('mask_train_x:')
print(mask_train_x)
print()
print('indexed_train_x tokens:')
for row in indexed_train_x:
    print(tokenizer.convert_ids_to_tokens(row))
print()
print('indexed_train_x:')
print(indexed_train_x)
print()
print('mask_train_y:')
print(mask_train_y)
print()
print('indexed_train_y tokens:')
for row in indexed_train_y:
    print([tag_set[index] for index in row])
print()
print('indexed_train_y:')
print(indexed_train_y)

train_x:
[['Distribution', 'of', 'this', 'license', 'does', 'not', 'create', 'an', 'attorney', '-', 'client', 'relationship', '.'], ['Creative', 'Commons', 'provides', 'this', 'information', 'on', 'an', '"', 'as', '-', 'is', '"', 'basis', '.'], ['Creative', 'Commons', 'makes', 'no', 'warranties', 'regarding', 'the', 'information', 'provided', ',', 'and', 'disclaims', 'liability', 'for', 'damages', 'resulting', 'from', 'its', 'use', '.'], ['License', '.'], ['The', 'work', 'is', 'protected', 'by', 'copyright', 'and', '/', 'or', 'other', 'applicable', 'law', '.']]

train_y:
[['NOUN', 'ADP', 'DET', 'NOUN', 'AUX', 'PART', 'VERB', 'DET', 'NOUN', 'PUNCT', 'NOUN', 'NOUN', 'PUNCT'], ['PROPN', 'PROPN', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'PUNCT', 'ADP', 'PUNCT', 'VERB', 'PUNCT', 'NOUN', 'PUNCT'], ['PROPN', 'PROPN', 'VERB', 'DET', 'NOUN', 'VERB', 'DET', 'NOUN', 'VERB', 'PUNCT', 'CCONJ', 'VERB', 'NOUN', 'ADP', 'NOUN', 'VERB', 'ADP', 'DET', 'NOUN', 'PUNCT'], ['NOUN', 'PUNCT'], ['DET', 'NOUN', 'AUX

In [18]:
class Model(torch.nn.Module):

    def __init__(self, bert, num_output):
        super().__init__()
        self.bert = bert
        self.w = torch.nn.Parameter(torch.tensor(np.random.normal(0.0, 1.0, (768, num_output)), dtype=torch.float32))
        self.b = torch.nn.Parameter(torch.zeros((num_output,), dtype=torch.float32))

    def forward(self, x, mask):
        context_vecs = self.bert(x, attention_mask=mask).last_hidden_state
        return context_vecs@self.w + self.b

model = Model(bert_model, len(tag_set))

optimiser = torch.optim.Adam([{'params': [model.w, model.b], 'lr': 0.1}, {'params': bert_model.parameters(), 'lr': 2E-5}])

print('epoch', 'error')
for epoch in range(1, 10+1):
    optimiser.zero_grad()
    model.train(True)
    logits = model(indexed_train_x, mask_train_x)
    errors = torch.nn.functional.cross_entropy(logits.transpose(1, 2), indexed_train_y, reduction='none')
    errors = errors*mask_train_y
    error = errors.sum()/mask_train_y.sum()
    error.backward()
    optimiser.step()
    model.train(False)

    if epoch%1 == 0:
        print(epoch, error.detach().tolist())

print()

with torch.no_grad():
    print('sentence', 'prediction')
    logits = model(indexed_train_x, mask_train_x)
    batch_probs = torch.softmax(logits, dim=1)
    for (indexes, probs) in zip(indexed_train_x, batch_probs):
        print(tokenizer.convert_ids_to_tokens(indexes), [tag_set[index] for index in probs.numpy().argmax(1).tolist()])

epoch error
1 26.869413375854492
2 15.400179862976074
3 11.201945304870605
4 7.383209228515625
5 3.4416868686676025
6 1.32474684715271
7 1.790221095085144
8 1.1491752862930298
9 0.7319633364677429
10 0.52610844373703

sentence prediction
['[CLS]', 'Distribution', 'of', 'this', 'license', 'does', 'not', 'create', 'an', 'attorney', '-', 'client', 'relationship', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'] ['_', 'SCONJ', 'ADP', 'DET', 'NOUN', 'AUX', 'PART', 'VERB', 'DET', 'ADV', 'PUNCT', 'NOUN', 'INTJ', 'PROPN', 'SCONJ', 'INTJ', 'INTJ', 'INTJ', 'INTJ', 'INTJ', 'INTJ', 'INTJ', 'CCONJ', 'CCONJ', 'INTJ', 'SCONJ', 'ADJ', 'INTJ']
['[CLS]', 'Creative', 'Commons', 'provides', 'this', 'information', 'on', 'an', '"', 'as', '-', 'is', '"', 'basis', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]'] ['_', 'PROPN', 'PROPN', 'VERB', 'DET', 'CCONJ

## Getting hidden layer activations

BERT has 12 encoder layers plus an embedding layer.
Each one of these layers gives you a 768-element vector for each token.
You can get the activations of each one of these layers by using the parameter `output_hidden_states=True` when calling the model.
This will add the attribute `hidden_states` to the returned object, like this:

In [19]:
output = bert_model(tokenised_text['input_ids'], attention_mask=tokenised_text['attention_mask'], output_hidden_states=True)

embeddings = output.hidden_states[0]
layer1 = output.hidden_states[1]
layer2 = output.hidden_states[2]
# ...
output_layer = output.hidden_states[12]

print(embeddings.shape) # 2 sentences, 8 tokens, 768-element vectors

torch.Size([8, 14, 768])


The last layer is equivalent to calling `output.last_hidden_state` as usual.

Note that the vectors at the end could be from pad tokens.

In [20]:
print(tokenised_text['input_ids'])
print()
print(tokeniser.convert_ids_to_tokens(tokenised_text['input_ids'][0, :]))
print(tokeniser.convert_ids_to_tokens(tokenised_text['input_ids'][1, :]))

# sentence 1 activations: embeddings[0, :8, :]
# sentence 2 activations: embeddings[0, :7, :]

tensor([[ 101,  146, 1176, 1122,  119,  102,  146, 1274,  112,  189, 4819, 1122,
          119,  102],
        [ 101,  146, 4819, 1122,  119,  102,  146, 1274,  112,  189, 1176, 1122,
          119,  102],
        [ 101,  146, 1274,  112,  189, 4819, 1122,  119,  102,  146, 1176, 1122,
          119,  102],
        [ 101,  146, 1274,  112,  189, 1176, 1122,  119,  102,  146, 4819, 1122,
          119,  102],
        [ 101,  146, 1176, 1122,  119,  102,  146, 1274,  112,  189, 1176, 1122,
          119,  102],
        [ 101,  146, 4819, 1122,  119,  102,  146, 1274,  112,  189, 4819, 1122,
          119,  102],
        [ 101,  146, 1274,  112,  189, 4819, 1122,  119,  102,  146, 4819, 1122,
          119,  102],
        [ 101,  146, 1274,  112,  189, 1176, 1122,  119,  102,  146, 1176, 1122,
          119,  102]])

['[CLS]', 'I', 'like', 'it', '.', '[SEP]', 'I', 'don', "'", 't', 'hate', 'it', '.', '[SEP]']
['[CLS]', 'I', 'hate', 'it', '.', '[SEP]', 'I', 'don', "'", 't', 'like', 'it', '.

Many papers find that the middle layers, particularly layer 8, are the most transferrable across tasks.
This means that `output.hidden_states[7]` should give a better performance when used to represent words than the last layer.

You can also get the attention values.
BERT produces attention values in each encoder layer and has 12 attention heads.
Remember that attention values are produced between each word pairing, making a square matrix.
You can get the attention values in each one of these layers by using the parameter `output_attentions=True` when calling the model.
This will add the attribute `attentions` to the returned object, like this:

In [16]:
output = bert_model(tokenised_text['input_ids'], attention_mask=tokenised_text['attention_mask'], output_attentions=True)
layer1_attentions = output.attentions[0]
print(layer1_attentions.shape) # 2 sentences, 12 attention heads, 8 tokens by 8 tokens

print('attention values to produce the vector of sentence 0, head 0, word 0:')
print(layer1_attentions[0, 0, 0, :]) # An attention value for each word
print(layer1_attentions[0, 0, 0, :].sum()) # Attention values to sum to 1

torch.Size([8, 12, 14, 14])
attention values to produce the vector of sentence 0, head 0, word 0:
tensor([0.4331, 0.0044, 0.0053, 0.0111, 0.0236, 0.2365, 0.0053, 0.0055, 0.0073,
        0.0078, 0.0042, 0.0115, 0.0215, 0.2226], grad_fn=<SliceBackward0>)
tensor(1.0000, grad_fn=<SumBackward0>)


Note that if there are pad tokens then you only want to take the upper left corner of the square matrix.

In [17]:
# sentence 1 attentions: layer1_attentions[0, :, :8, :8]
# sentence 2 attentions: layer1_attentions[1, :, :7, :7]

Also, since there are many attention heads, you can take the average attention value across heads to get a nice single attention value for each word pairing:

In [18]:
averaged_attentions = layer1_attentions.mean(1)
print(averaged_attentions[0, :, :]) # Attentions in first sentence
print(averaged_attentions[0, 0, :].sum()) # Attentions still sum to 1

tensor([[0.4000, 0.0300, 0.0261, 0.0363, 0.0470, 0.1200, 0.0291, 0.0196, 0.0288,
         0.0290, 0.0317, 0.0344, 0.0461, 0.1219],
        [0.0570, 0.0995, 0.1146, 0.0785, 0.0688, 0.1023, 0.0878, 0.0809, 0.0520,
         0.0374, 0.0688, 0.0467, 0.0461, 0.0596],
        [0.1009, 0.0845, 0.0601, 0.1054, 0.0748, 0.1225, 0.0571, 0.0701, 0.0552,
         0.0400, 0.0705, 0.0422, 0.0447, 0.0718],
        [0.0648, 0.0825, 0.0925, 0.0699, 0.1153, 0.1234, 0.0631, 0.0652, 0.0478,
         0.0365, 0.0533, 0.0583, 0.0598, 0.0674],
        [0.0741, 0.0428, 0.0572, 0.0809, 0.1439, 0.1896, 0.0428, 0.0327, 0.0480,
         0.0309, 0.0328, 0.0442, 0.1189, 0.0612],
        [0.1598, 0.0433, 0.0387, 0.0812, 0.1265, 0.1885, 0.0522, 0.0335, 0.0393,
         0.0276, 0.0174, 0.0288, 0.0628, 0.1005],
        [0.0464, 0.1081, 0.0535, 0.0566, 0.0757, 0.1492, 0.0843, 0.1108, 0.0649,
         0.0399, 0.0698, 0.0395, 0.0476, 0.0537],
        [0.1032, 0.0549, 0.0617, 0.0509, 0.0537, 0.1404, 0.0756, 0.0406, 0.1041,
  