In [1]:
%%html
<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
<style>
.rendered_html td {
    font-size: xx-large;
    text-align: left !important;
}
.rendered_html th {
    font-size: xx-large;
    text-align: left !important;
}
</style>

In [2]:
%%capture
import sys
sys.path.append("..")
import statnlpbook.util as util
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)

In [3]:
%load_ext tikzmagic

In [4]:
from IPython.display import Image
import random

# Transformer Language Models

* Self-attention
* Masked language models

## Attention is all you need

*Transformers* replace the whole LSTM with *self-attention* ([Vaswani et al., 2017](https://arxiv.org/pdf/1706.03762.pdf))


All tokens attend to each other:

<center>
    <img src="http://jalammar.github.io/images/t/transformer_self-attention_visualization.png" width=40%/>
</center>

<div style="text-align: left;">
    (Image source: <a href="http://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>)
</div>

### The full transformer model

Deep multi-head self-attention encoder-decoder with sinusoidal positional encodings:

<center>
    <img src="mt_figures/transformer.png" width=30%/>
</center>

<div style="text-align: right;">
    (from <a href="https://arxiv.org/pdf/1706.03762.pdf">Vaswani et al., 2017</a>)
</div>

### Transformer unit

Add residual connections, layer normalization and feed-forward layers (MLPs):

<center>
    <img src="mt_figures/transformer_layer.png" width=30%/>
</center>

<div style="text-align: left;">
    (Image source: <a href="https://arxiv.org/pdf/1706.03762.pdf">Vaswani et al., 2017</a>)
</div>

### Multi-head self-attention

Repeat this multiple times with multiple sets of parameter matrices, then concatenate:

<center>
    <img src="mt_figures/multi_head_self_att.png" width=30%/>
</center>

<div style="text-align: left;">
    (Image source: <a href="https://arxiv.org/pdf/1706.03762.pdf">Vaswani et al., 2017</a>)
</div>

### Dot-Product Attention

For each index $i$ in the sequence (of length $n$), use hidden representation $\mathbf{h}_i$ to create three other $d_{\mathbf{h}}$-dimensional vectors:
query vector $\color{purple}{\mathbf{q}_i}=W^q\mathbf{h}_i$,
key vector $\color{orange}{\mathbf{k}_i}=W^k\mathbf{h}_i$,
value vector $\color{blue}{\mathbf{v}_i}=W^v\mathbf{h}_i$.

We use them to calculate the attention probability distribution $\mathbf{\alpha}_i$ for each position $i$ (stacked over all positions, $\mathbf{\alpha}$ is an $n\times n$ matrix), and the new hidden representation (for the next layer) $\mathbf{h}_i^\prime$, which is a $d_{\mathbf{h}}$-dimensional vector:
$$
\mathbf{\alpha}_i = \text{softmax}\left(
\begin{array}{c}
\color{purple}{\mathbf{q}_i} \cdot \color{orange}{\mathbf{k}_1} \\
\ldots \\
\color{purple}{\mathbf{q}_i} \cdot \color{orange}{\mathbf{k}_n}
\end{array}
\right) \\
\mathbf{h}_i^\prime = \sum_{j=1}^n \mathbf{\alpha}_{i,j} \color{blue}{\mathbf{v}_j}
$$

$W^q$, $W^k$ and $W^v$ are all trained.
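
The equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper names (`softmax`, `dot_product_attention`), the toy sizes and the random parameter matrices are all assumptions made for the example, whereas in a real model $W^q$, $W^k$ and $W^v$ would be trained.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def dot_product_attention(H, W_q, W_k, W_v):
    """Unscaled dot-product self-attention (hypothetical helper).

    H: (n, d) matrix whose row i is the hidden representation h_i.
    W_q, W_k, W_v: (d, d) parameter matrices (random here, trained in practice).
    Returns the (n, n) attention matrix and the (n, d) new representations.
    """
    Q = H @ W_q.T  # row i is q_i = W^q h_i
    K = H @ W_k.T  # row i is k_i = W^k h_i
    V = H @ W_v.T  # row i is v_i = W^v h_i
    alpha = softmax(Q @ K.T, axis=-1)  # alpha[i, j] is proportional to exp(q_i . k_j)
    H_new = alpha @ V                  # h'_i = sum_j alpha[i, j] * v_j
    return alpha, H_new

rng = np.random.default_rng(0)
n, d = 5, 8  # toy sequence length and hidden size (assumed values)
H = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
alpha, H_new = dot_product_attention(H, W_q, W_k, W_v)
```

Each row of `alpha` is one probability distribution over the $n$ positions, and `H_new` has the same shape as `H`.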

### ***Scaled*** Dot-Product Attention

If we assume that $\color{purple}{\mathbf{q}_i}$ and $\color{orange}{\mathbf{k}_j}$ are $d_{\mathbf{h}}$-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, $\color{purple}{\mathbf{q}_i} \cdot \color{orange}{\mathbf{k}_j} = \sum_{m=1}^{d_{\mathbf{h}}} \mathbf{q}_{im} \mathbf{k}_{jm}$, has mean 0 and variance $d_{\mathbf{h}}$. Since we would prefer these values to have variance 1, we divide by $\sqrt{d_{\mathbf{h}}}$ (dividing by a constant $c$ scales the variance by $1/c^2$).

$$
\mathbf{\alpha}_i = \text{softmax}\left(
\begin{array}{c}
\frac{\color{purple}{\mathbf{q}_i} \cdot \color{orange}{\mathbf{k}_1}}{\sqrt{d_{\mathbf{h}}}} \\
\ldots \\
\frac{\color{purple}{\mathbf{q}_i} \cdot \color{orange}{\mathbf{k}_n}}{\sqrt{d_{\mathbf{h}}}}
\end{array}
\right) \\
\mathbf{h}_i^\prime = \sum_{j=1}^n \mathbf{\alpha}_{i,j} \color{blue}{\mathbf{v}_j}
$$
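
The variance argument behind the scaling can be checked empirically. The sketch below assumes standard normal components (one possible distribution with mean 0 and variance 1); the sample size and $d_{\mathbf{h}}=64$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h = 64             # assumed dimensionality for the experiment
num_samples = 20000  # number of random (q, k) pairs

# Components are independent with mean 0 and variance 1.
q = rng.normal(size=(num_samples, d_h))
k = rng.normal(size=(num_samples, d_h))

dots = np.sum(q * k, axis=1)                   # one dot product per pair
var_unscaled = dots.var()                      # expected to be close to d_h
var_scaled = (dots / np.sqrt(d_h)).var()       # expected to be close to 1

print(f"unscaled variance: {var_unscaled:.2f}, scaled variance: {var_scaled:.2f}")
```

Without the scaling, large $d_{\mathbf{h}}$ produces large-magnitude scores, which pushes the softmax towards near one-hot distributions with very small gradients.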

In matrix form:

$$
\text{Attention}(Q,K,V)=
\text{softmax}\left(
\frac{\color{purple}{Q}
\color{orange}{K}^\intercal}
{\sqrt{d_{\mathbf{h}}}}
\right) \color{blue}{V}
$$

$$
\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O
$$
where
$$
\text{head}_i=\text{Attention}(QW_i^q,KW_i^k,VW_i^v)
$$
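
A minimal NumPy sketch of the matrix form and of multi-head attention follows. The function names, toy sizes and random weights are assumptions for illustration. Each head here scales by its own per-head dimension, as in Vaswani et al. (2017).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    """Concat(head_1, ..., head_h) W^O, with head_i = Attention(Q W_i^q, K W_i^k, V W_i^v)."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4  # toy sizes (assumed)
d_head = d_model // num_heads

X = rng.normal(size=(n, d_model))
# One (d_model, d_head) projection per head, stacked along the first axis.
W_q = rng.normal(size=(num_heads, d_model, d_head))
W_k = rng.normal(size=(num_heads, d_model, d_head))
W_v = rng.normal(size=(num_heads, d_model, d_head))
W_o = rng.normal(size=(num_heads * d_head, d_model))

# Self-attention: queries, keys and values all come from the same input.
out = multi_head(X, X, X, W_q, W_k, W_v, W_o)
```

The output has the same shape as the input, which is what allows layers to be stacked.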

### Transformer

Repeat this for multiple layers, each using the previous as input:


$$
\text{MultiHead}^\ell(Q^\ell,K^\ell,V^\ell)=\text{Concat}(\text{head}_1^\ell,\ldots,\text{head}_h^\ell)W_\ell^O
$$
where
$$
\text{head}_i^\ell=\text{Attention}(Q^\ell W_{i,\ell}^q,K^\ell W_{i,\ell}^k,V^\ell W_{i,\ell}^v)
$$
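
Stacking can be sketched as a loop in which each layer consumes the output of the previous one. The example below combines attention with the residual connections, layer normalization and feed-forward layer of the transformer unit. It is a simplified sketch under several assumptions: a single head, no learned gain or bias in the layer normalization, a ReLU feed-forward layer, normalization applied after each residual sum, and random rather than trained weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and (roughly) unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def transformer_layer(X, p):
    """One simplified layer: attention and MLP, each with residual + layer norm."""
    X = layer_norm(X + self_attention(X, p["W_q"], p["W_k"], p["W_v"]))
    ffn = np.maximum(0.0, X @ p["W_1"]) @ p["W_2"]  # two-layer MLP with ReLU
    return layer_norm(X + ffn)

def init_layer(rng, d, d_ff):
    """Hypothetical helper: random parameters for one layer."""
    shapes = {"W_q": (d, d), "W_k": (d, d), "W_v": (d, d),
              "W_1": (d, d_ff), "W_2": (d_ff, d)}
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
n, d, d_ff, num_layers = 5, 16, 64, 3  # toy sizes (assumed)
X = rng.normal(size=(n, d))
layers = [init_layer(rng, d, d_ff) for _ in range(num_layers)]

H = X
for p in layers:  # each layer uses the previous layer's output as input
    H = transformer_layer(H, p)
```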

### Long-distance dependencies

<center>
    <img src="mt_figures/ldd.png" width=80%/>
</center>

<div style="text-align: left;">
    (Image source: <a href="https://arxiv.org/pdf/1706.03762.pdf">Vaswani et al., 2017</a>)
</div>

## Back to bag-of-words?

RNNs process tokens sequentially, but
Transformers process all tokens **at once**.

In fact, we did not even provide any information about the order of tokens...
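
This order-blindness can be demonstrated directly: without positional information, shuffling the input tokens just shuffles the output rows in the same way. The sketch below uses a single-head self-attention with random weights. All names and sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 6, 8  # toy sizes (assumed)
X = rng.normal(size=(n, d))  # token embeddings without positional encoding
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)  # a random reordering of the tokens
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Each token gets the same representation regardless of where it appears.
order_blind = np.allclose(out[perm], out_shuffled)
print(order_blind)
```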

## Positional Encoding ##

Represent **positions** with fixed-length vectors, with the same dimensionality as word embeddings:

(1st position, 2nd position, 3rd position, ...) $\to$ Must decide on maximum sequence length

Add to word embeddings at the input layer:

<center>
    <img src="../img/positional_1.png" width="80%">
</center>

## Positional Encoding ##

Alternatives:

* Learned position embeddings (like word embeddings)
* **Static position encoding:**

<center>
    <img src="../img/positional_2.png" width="80%">
</center>

Picture source: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
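
One common static encoding is the sinusoidal scheme of Vaswani et al. (2017), where even dimensions use a sine and odd dimensions a cosine with geometrically increasing wavelengths. The sketch below is one way to compute it. The function name is hypothetical, it assumes an even `d_model`, and the random "word embeddings" are placeholders.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Sinusoidal position encodings of shape (max_len, d_model).

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    positions = np.arange(max_len)[:, None]    # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

max_len, d_model = 50, 16  # toy sizes (assumed)
pe = sinusoidal_positions(max_len, d_model)

# Position encodings are added to the word embeddings at the input layer.
word_embeddings = np.random.default_rng(0).normal(size=(max_len, d_model))
inputs = word_embeddings + pe
```

Because the encoding is a fixed function of the position, it needs no training and can be evaluated for any position index.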

## Superiority over LSTMs

Transformers at similar parameter counts to LSTMs are:
* better at language modelling
* better at effectively using more input tokens

<center>
    <img src="../img/scaling_laws_lstms_trms.png" width="80%">
    (Kaplan et al., 2020, arXiv:2001.08361)
</center>


### Transformers for decoding

The decoder attends to the encoded input *and* to the partial output.

<center>
    <img src="http://jalammar.github.io/images/xlnet/transformer-encoder-decoder.png" width=70%/>
</center>

<div style="text-align: left;">
    (Image source: <a href="http://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT-2</a>)
</div>

In the decoder's self-attention, each position can only attend to already-generated tokens (masked self-attention).

<center>
    <img src="http://jalammar.github.io/images/gpt2/self-attention-and-masked-self-attention.png" width=80%/>
</center>

<div style="text-align: left;">
    (Image source: <a href="http://jalammar.github.io/illustrated-gpt2/">The Illustrated GPT-2</a>)
</div>
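
Masked self-attention can be sketched by setting the scores of all future positions to $-\infty$ before the softmax, so that they receive zero attention weight. The helper names, sizes and random inputs below are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention where position i only sees positions j <= i."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # exp(-inf) = 0 after the softmax
    alpha = softmax(scores, axis=-1)
    return alpha, alpha @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # toy inputs (assumed)
alpha, out = masked_self_attention(Q, K, V)
print(np.round(alpha, 2))  # lower-triangular attention matrix
```

The first position can only attend to itself, so the first row of `alpha` puts all its weight on the first column.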

The encoder transformer, which has no such mask, is sometimes called a "bidirectional transformer".

## BERT

**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)).

<center>
    <img src="https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg" width=40%/>
</center>

### Reminder: BERT training objective (1): **masked** language model

Predict masked words given context on both sides:

<center>
    <img src="http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png" width=50%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
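
Constructing a masked-LM training example can be sketched as below. This is a simplification: it always replaces the selected tokens with `[MASK]`, whereas Devlin et al. (2019) select 15% of tokens and replace them with `[MASK]` 80% of the time, a random token 10% of the time, and leave them unchanged 10% of the time. Whitespace tokenization and the function name are further assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked tokens, {position: original token}) for masked-LM training.

    Simplified: every selected token becomes [MASK], and at least one is selected.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(mask_prob * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the model must predict this token
        masked[pos] = MASK
    return masked, targets

tokens = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

The model sees `masked` as input and is trained to predict the original tokens in `targets`, using context on both sides of each masked position.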

### Reminder: BERT training objective (2): next sentence prediction

**Conditional encoding** of both sentences:

<center>
    <img src="http://jalammar.github.io/images/bert-next-sentence-prediction.png" width=60%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
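
The input for a sentence pair can be sketched as follows: both sentences are packed into one sequence, separated by `[SEP]` and preceded by `[CLS]`, with segment ids marking which sentence each token belongs to. Whitespace tokenization stands in for BERT's WordPiece tokenizer here, and the function name is hypothetical.

```python
def build_pair_input(sentence_a, sentence_b):
    """Pack two sentences as: [CLS] A [SEP] B [SEP], plus segment ids."""
    tokens_a = sentence_a.split()  # assumption: whitespace instead of WordPiece
    tokens_b = sentence_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input("the man went to the store",
                                       "he bought a gallon of milk")
for token, segment in zip(tokens, segment_ids):
    print(segment, token)
```

The representation at the `[CLS]` position is then fed to a binary classifier that predicts whether the second sentence actually follows the first.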

### BERT architecture

Transformer with $L$ layers of dimension $H$, and $A$ self-attention heads.

* BERT$_\mathrm{BASE}$: $L=12, H=768, A=12$
* BERT$_\mathrm{LARGE}$: $L=24, H=1024, A=16$

(Many other variations available through [HuggingFace Transformers](https://huggingface.co/docs/transformers/index))
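
To make the two configurations concrete, the sketch below computes the per-head dimension $H/A$ and a rough parameter count. The vocabulary size (30522), maximum number of positions (512), two segment types and feed-forward size of $4H$ are assumptions taken from the released English BERT models, not from the text above. The count omits the pooling and output layers, so it should land in the neighbourhood of the roughly 110M and 340M parameters reported by Devlin et al. (2019) without matching them exactly.

```python
def bert_config_summary(L, H, A, vocab_size=30522, max_positions=512, ffn_multiplier=4):
    """Rough size summary of a BERT-style encoder (hypothetical helper)."""
    assert H % A == 0, "hidden size must be divisible by the number of heads"
    head_dim = H // A
    d_ff = ffn_multiplier * H

    attention = 4 * (H * H + H)                        # W^q, W^k, W^v, W^O with biases
    feed_forward = (H * d_ff + d_ff) + (d_ff * H + H)  # two dense layers with biases
    layer_norms = 2 * 2 * H                            # two layer norms (gain + bias)
    per_layer = attention + feed_forward + layer_norms

    # Token, position and segment embeddings, plus one layer norm.
    embeddings = (vocab_size + max_positions + 2) * H + 2 * H
    return {"head_dim": head_dim, "parameters": L * per_layer + embeddings}

base = bert_config_summary(L=12, H=768, A=12)
large = bert_config_summary(L=24, H=1024, A=16)
print(base)
print(large)
```

Both configurations use the same per-head dimension, so the larger model gains capacity through more layers, more heads and a wider hidden state.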

Trained on 16GB of text from Wikipedia + BookCorpus.

* BERT$_\mathrm{BASE}$: 4 TPUs for 4 days
* BERT$_\mathrm{LARGE}$: 16 TPUs for 4 days

#### SNLI results

| Model | Accuracy |
|---|---|
| LSTM | 77.6 |
| LSTMs with conditional encoding | 80.9 |
| LSTMs with conditional encoding + attention | 82.3 |
| LSTMs with word-by-word attention | 83.5 |
| Self-attention | 85.6 |
| BERT$_\mathrm{BASE}$ | 89.2 |
| BERT$_\mathrm{LARGE}$ |  90.4 |

([Zhang et al., 2019](https://bcmi.sjtu.edu.cn/home/zhangzs/pubs/paclic33.pdf))

## RoBERTa

Same architecture as BERT, but with better hyperparameter tuning and more training data ([Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf)):

- CC-News (76GB)
- OpenWebText (38GB)
- Stories (31GB)

and **no** next-sentence-prediction task (only masked LM).

Training: 1024 GPUs for one day.


#### SNLI results

| Model | Accuracy |
|---|---|
| LSTM | 77.6 |
| LSTMs with conditional encoding | 80.9 |
| LSTMs with conditional encoding + attention | 82.3 |
| LSTMs with word-by-word attention | 83.5 |
| Self-attention | 85.6 |
| BERT$_\mathrm{BASE}$ | 89.2 |
| BERT$_\mathrm{LARGE}$ |  90.4 |
| RoBERTa$_\mathrm{BASE}$ | 90.7 |
| RoBERTa$_\mathrm{LARGE}$ |  91.4 |

([Sun et al., 2020](https://arxiv.org/abs/2012.01786))

## Transformer LMs as pre-trained representations

<center>
    <img src="https://d3i71xaburhd42.cloudfront.net/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035/4-Figure1-1.png" width=90%/>
</center>

<div style="text-align: right;">
    (from <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">Radford et al., 2018</a>)
</div>

### Text and position embeddings in BERT (and friends)

<center>
    <img src="https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/5-Figure2-1.png" width=70%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>

### Using BERT (and friends)

<center>
    <img src="https://d3i71xaburhd42.cloudfront.net/df2b0e26d0599ce3e70df8a9da02e51594e0e992/3-Figure1-1.png" width=70%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>

### Using BERT (and friends) for various tasks

<center>
    <img src="http://jalammar.github.io/images/bert-tasks.png" width=70%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>

### Which layer to use?

<center>
    <img src="http://jalammar.github.io/images/bert-contexualized-embeddings.png" width=80%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>

<center>
    <img src="http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png" width=80%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
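
Some common ways of turning per-layer outputs into token features can be sketched with array operations. The hidden states below are random placeholders standing in for the outputs of a 12-layer model with $H=768$ (plus the embedding layer at index 0). With a real model they would come from the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, n, H = 12, 7, 768  # BERT-base-like sizes, 7 tokens (assumed)

# Placeholder hidden states: index 0 is the embedding layer, 1..12 the transformer layers.
hidden_states = rng.normal(size=(num_layers + 1, n, H))

last_layer = hidden_states[-1]                   # (n, H)
second_to_last = hidden_states[-2]               # (n, H)
sum_last_four = hidden_states[-4:].sum(axis=0)   # (n, H)
sum_all_layers = hidden_states[1:].sum(axis=0)   # (n, H), embedding layer excluded
concat_last_four = np.concatenate(list(hidden_states[-4:]), axis=-1)  # (n, 4 * H)
```

Which combination works best depends on the downstream task, so it is usually treated as a hyperparameter.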

## Multilingual BERT

* One model pre-trained on the 104 languages with the largest Wikipedias
* *Shared* WordPiece vocabulary of 110k tokens
* Same architecture as BERT$_\mathrm{BASE}$: $L=12, H=768, A=12$
* Same training objectives, **no cross-lingual signal**

https://github.com/google-research/bert/blob/master/multilingual.md

### Other multilingual transformers

+ XLM and XLM-R ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf))
+ DistilmBERT ([Sanh et al., 2020](https://arxiv.org/pdf/1910.01108.pdf)) is a lighter version of mBERT
+ Many monolingual BERTs for languages other than English
([CamemBERT](https://arxiv.org/pdf/1911.03894.pdf),
[BERTje](https://arxiv.org/pdf/1912.09582),
[Nordic BERT](https://github.com/botxo/nordic_bert)...)

## Outlook

* Transformer models keep coming out: larger, trained on more data, languages and domains, etc.
  + Increasing energy usage and climate impact: see https://github.com/danielhers/climate-awareness-nlp
* In the machine translation lecture, you will learn how to use them for cross-lingual tasks

## Additional Reading

+ [Jurafsky & Martin Chapter 9](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
+ [Jurafsky & Martin Chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf)
+ Jay Alammar's blog posts:
    + [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)
    + [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/)