# BERT: Pre-training of Deep Bidirectional Transformers

![bert](../figs/deep_nlp/bert/entelecheia_bert.png)

- The year 2018 marked a turning point for the field of Natural Language Processing (NLP). 
- The BERT {cite}`devlin2018bert` paper introduced a new language representation model that achieved state-of-the-art results on eleven NLP tasks at the time of publication. 
- BERT is a deep bidirectional transformer model that is pre-trained on a large corpus of unlabeled text. 
- The model is trained to predict masked words in a sentence and is also trained to predict the next sentence in a sequence of sentences. 
- The pre-trained model can then be fine-tuned on a variety of downstream NLP tasks with state-of-the-art results.

BERT builds on two key ideas:

- The transformer architecture {cite}`vaswani2017attention`
- Unsupervised pre-training

During pre-training, BERT's weights are learned with two objectives: predicting masked words in a sentence (masked language modeling) and predicting whether one sentence follows another (next sentence prediction).

**BERT is a (multi-headed) beast**

BERT is a deep bidirectional transformer model. It is a multi-headed beast: BERT-Base has 12 layers, 12 attention heads per layer, and 110 million parameters, while BERT-Large has 24 layers, 16 heads per layer, and 340 million parameters. Since model weights are not shared across layers, the total number of distinct attention heads is 12 x 12 = 144 for BERT-Base and 24 x 16 = 384 for BERT-Large.
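The head counts above are just layers multiplied by heads per layer. A minimal sketch of that arithmetic follows; the dictionary and its key names are illustrative and not part of any library.

```python
# Layer and head counts for the two BERT sizes, as quoted in the text.
bert_sizes = {
    "base": {"layers": 12, "heads_per_layer": 12},
    "large": {"layers": 24, "heads_per_layer": 16},
}

# Weights are not shared across layers, so every (layer, head) pair is distinct.
total_heads = {
    name: cfg["layers"] * cfg["heads_per_layer"] for name, cfg in bert_sizes.items()
}
print(total_heads)  # {'base': 144, 'large': 384}
```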

## Visualizing BERT

Because of BERT’s complexity, it is difficult to understand the meaning of its learned weights intuitively. To help with this, we can visualize the attention weights of BERT’s self-attention layers.

```python
%pip install bertviz
```

```python
%config InlineBackend.figure_format='retina'

from bertviz import model_view, head_view
from transformers import AutoTokenizer, AutoModel, utils

utils.logging.set_verbosity_error()  # Suppress standard warnings
```

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
inputs = tokenizer.encode("The cat sat on the mat", return_tensors='pt')
outputs = model(inputs)
attention = outputs[-1]  # Output includes attention weights when output_attentions=True
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, tokens)
```

- The tool visualizes attention as lines connecting the position being updated (left) with the position being attended to (right).
- Colors identify the corresponding attention head(s), while line thickness reflects the attention score. 
- At the top of the visualization, you can select the model layer and the attention head(s) to visualize.
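Each line in the view corresponds to one entry of a head's attention matrix. As a rough sketch of where such scores come from, the following computes scaled dot-product attention weights for a toy sequence in plain Python. The query and key vectors are made-up numbers, and `attention_weights` is a hypothetical helper, not a bertviz or transformers function.

```python
import math


def attention_weights(queries, keys):
    """Return softmax(Q K^T / sqrt(d)) as a list of rows, one per query position."""
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Toy 3-token sequence with 2-dimensional vectors (illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Row `i` holds the scores that position `i` (left column in the view) assigns to every position on the right; larger entries are drawn as thicker lines.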

## What does BERT actually learn?

Let's explore the attention patterns in various layers of BERT (the BERT-Base, uncased version).

> Sentence A: I went to the store.

> Sentence B: At the store, I bought fresh strawberries.

BERT uses WordPiece tokenization and inserts special classifier ([CLS]) and separator ([SEP]) tokens. Because the uncased model also lowercases its input, the actual input sequence is:

> [CLS] i went to the store . [SEP] at the store , i bought fresh straw ##berries . [SEP]
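To illustrate how a word such as `strawberries` ends up as two pieces, here is a minimal sketch of greedy longest-match-first WordPiece splitting. It assumes a tiny hypothetical vocabulary (`toy_vocab`) and pre-split, lowercased words; `wordpiece` and `encode_pair` are illustrative helpers, not the transformers implementation.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode_pair(words_a, words_b, vocab):
    """Build [CLS] A [SEP] B [SEP] from two pre-split, lowercased word lists."""
    tokens = ["[CLS]"]
    for w in words_a:
        tokens += wordpiece(w, vocab)
    tokens.append("[SEP]")
    for w in words_b:
        tokens += wordpiece(w, vocab)
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary; the real BERT vocabulary has about 30,000 entries.
toy_vocab = {"i", "went", "to", "the", "store", "at", "bought", "fresh",
             "straw", "##berries", ".", ","}

words_a = "i went to the store .".split()
words_b = "at the store , i bought fresh strawberries .".split()
print(" ".join(encode_pair(words_a, words_b, toy_vocab)))
```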

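```python
sentence_a = "I went to the store."
sentence_b = "At the store, I bought fresh strawberries."
# Pass the sentences as a text pair so the tokenizer inserts [SEP] and segment ids
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
outputs = model(**inputs)
attention = outputs[-1]  # Tuple of per-layer attention tensors
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
```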

### Pattern 1: Attention to next word

Select layer 2, head 0. (The selected head is indicated by the highlighted square in the color bar at the top.) Most of the attention at a particular position is directed to the next token in the sequence.

- If you do not select any token, the visualization shows the attention pattern for all tokens in the sequence.
- If you select a token, the visualization shows the attention pattern for the selected token.
- For example, if you select the token `i`, virtually all the attention is directed to the next token, `went`.
- The [SEP] token disrupts the next-token attention pattern, as most of the attention from [SEP] is directed to [CLS] (the first token in the sequence) rather than the next token.
- This pattern, attention to the next token, appears to work primarily within a sentence.
- This pattern resembles a backward recurrent neural network (RNN), in which state updates proceed from right to left, so each position draws on the token that follows it.

```python
head_view(attention, tokens, layer=2, heads=[0])
```
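A positional pattern like this can also be checked numerically. The sketch below averages the attention that each position gives to the position at a fixed offset; a positive offset measures attention to later tokens, a negative one to earlier tokens. The matrix is synthetic, not real BERT output, and `mean_offset_attention` is a hypothetical helper.

```python
def mean_offset_attention(attn, offset):
    """Average attention that position i gives to position i + offset."""
    n = len(attn)
    values = [attn[i][i + offset] for i in range(n) if 0 <= i + offset < n]
    return sum(values) / len(values)


# Synthetic 4x4 attention matrix for one head (rows sum to 1).
toy_attn = [
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.05, 0.85],
    [0.70, 0.10, 0.10, 0.10],  # last token has no successor
]

next_score = mean_offset_attention(toy_attn, offset=1)
prev_score = mean_offset_attention(toy_attn, offset=-1)
print(f"next-token: {next_score:.2f}, previous-token: {prev_score:.2f}")
```

With the tensors computed earlier, one head's matrix could presumably be obtained as `attention[layer][0, head].tolist()`, assuming the usual `(batch, heads, seq_len, seq_len)` layout of the returned attention tensors.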

### Pattern 2: Attention to previous word

Select layer 6, head 11. In this pattern, much of the attention is directed to the previous token in the sequence.

- For example, most of the attention from `went` is directed to the previous token `i`.
- The pattern is not as distinct as the next-token pattern, but it is still present.
- Some attention is also dispersed to other tokens in the sequence, especially to the [SEP] token.
- This pattern is also related to the idea of an RNN, in this case the forward direction of an RNN.

```python
head_view(attention, tokens, layer=6, heads=[11])
```

### Pattern 3: Attention to identical/related words

Select layer 2, head 6. In this pattern, much of the attention is directed to identical or related words, including the source word itself.

- For example, most of the attention for the first occurrence of `store` is directed to itself and to the second occurrence of `store`.
- This pattern is not as distinct as some of the other patterns.

```python
head_view(attention, tokens, layer=2, heads=[6])
```

### Pattern 4: Attention to identical/related words in other sentence

Select layer 10, head 10. In this pattern, much of the attention is directed to identical or related words in the other sentence.

- For example, most of the attention of `store` in the second sentence is directed to `store` in the first sentence.
- This is helpful for the next sentence prediction task, which is one of the pre-training tasks for BERT.

```python
head_view(attention, tokens, layer=10, heads=[10])
```
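One way to quantify a cross-sentence pattern is to measure how much of the attention from second-sentence tokens lands on first-sentence tokens. The following is a minimal sketch on a synthetic matrix; `cross_sentence_mass` is a hypothetical helper, and the token list and scores are invented for illustration.

```python
def cross_sentence_mass(attn, tokens):
    """Mean share of attention that sentence-B tokens place on sentence-A tokens."""
    first_sep = tokens.index("[SEP]")
    a_positions = range(1, first_sep)                    # sentence A, excluding [CLS]
    b_positions = range(first_sep + 1, len(tokens) - 1)  # sentence B, excluding final [SEP]
    mass = [sum(attn[i][j] for j in a_positions) for i in b_positions]
    return sum(mass) / len(mass)


toy_tokens = ["[CLS]", "the", "store", "[SEP]", "the", "store", "[SEP]"]
n = len(toy_tokens)

# Start from uniform attention, then make sentence-B tokens attend to their
# identical counterparts in sentence A (synthetic values, rows sum to 1).
toy_attn = [[1.0 / n] * n for _ in range(n)]
toy_attn[4] = [0.05, 0.70, 0.05, 0.05, 0.05, 0.05, 0.05]
toy_attn[5] = [0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05]

score = cross_sentence_mass(toy_attn, toy_tokens)
print(f"attention from sentence B to sentence A: {score:.2f}")
```

A head that mostly attends within its own sentence would score close to 0 on this measure, while a head following the pattern described here would score well above the uniform baseline.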
