# Transformers: Intuition

![](../images/transformers-ex.png)
(Source: http://jalammar.github.io/illustrated-bert/)

- Transformers are a very exciting development in deep learning NLP.
- The transformer can be seen as an important deep learning architecture that allows a model to learn from the co-occurring contexts of words.
- Most importantly, its attention mechanism enables the model to capture long-distance dependency relations in language effectively, which has long been a difficult task in traditional statistical NLP.

## Self-attention

- The fundamental component of a transformer is the **self-attention** mechanism.
- Self-attention is a sequence-to-sequence operation: an input sequence of vectors goes in and an output sequence of vectors comes out.
- Its main characteristic is that, when determining each token of the output sequence, it considers not just the corresponding token of the input sequence but all the other input tokens as well.

- In other words, each output token $y_i$ is a weighted average over all the input tokens $x_j$:

$$
y_i = \sum_jw_{ij}x_j
$$

![](../images/transformers-self-attention.svg)
(Source: http://peterbloem.nl/blog/transformers)
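
The weighted average above can be sketched in a few lines of NumPy. The notes do not say where the weights $w_{ij}$ come from; this sketch assumes the simplest version from Peter Bloem's post, where the raw weight is the dot product of $x_i$ and $x_j$ and a row-wise softmax turns the raw weights into positive values that sum to one. The function names and the toy input are illustrative only.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def basic_self_attention(x):
    """x: (t, k) array holding t token vectors. Returns (y, w)."""
    raw = x @ x.T               # raw[i, j] = dot product of x_i and x_j (assumption)
    w = softmax(raw, axis=1)    # each row of w is positive and sums to 1
    y = w @ x                   # y_i = sum_j w[i, j] * x_j
    return y, w

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))     # toy input: 4 tokens, 3-dimensional embeddings
y, w = basic_self_attention(x)
print(y.shape)                  # (4, 3)
```

Note that this basic version has no learned parameters: everything the operation does is determined by the input embeddings themselves.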

## From Self-Attention to Transformers

- A **transformer** is an architecture that builds upon self-attention layers.
- Peter Bloem's definition of transformers:

> Any architecture designed to process a connected set of units--such as the tokens in a sequence or the pixels in an image--where the only interaction between units is through self-attention.

![](../images/transformer-block.svg)
(Source: http://peterbloem.nl/blog/transformers)

- A transformer block combines the self-attention layer with a local feedforward network and adds normalization and residual connections.
- Normalization and residual connections are standard tricks that help neural networks train faster and more accurately.
- A transformer block can also use **multi-head attention**, which runs multiple self-attention operations in parallel to keep track of different types of long-distance relationships between input tokens.
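
As a rough illustration of how these pieces fit together, here is a minimal NumPy sketch of one transformer block. Several details are assumptions that go beyond these notes: the query/key/value projections and the scaling by the square root of the head size follow the "Attention is All You Need" paper, the feedforward layer uses a hidden size of four times the embedding size with a ReLU, and the layer normalization omits its learnable gain and bias. The parameter names (`w_q`, `w_1`, and so on) are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and (roughly) unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_attention(x, p, heads):
    """x: (t, d). The projection matrices in p are (d, d); d must be divisible by heads."""
    t, d = x.shape
    s = d // heads                                          # size of each head
    def split(m):                                           # (t, d) -> (heads, t, s)
        return m.reshape(t, heads, s).transpose(1, 0, 2)
    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(s)          # (heads, t, t)
    out = softmax(scores, axis=-1) @ v                      # (heads, t, s)
    out = out.transpose(1, 0, 2).reshape(t, d)              # concatenate the heads
    return out @ p["w_o"]

def transformer_block(x, p, heads):
    x = layer_norm(x + multi_head_attention(x, p, heads))   # attention + residual + norm
    hidden = np.maximum(0.0, x @ p["w_1"] + p["b_1"])        # feedforward, applied per token
    return layer_norm(x + hidden @ p["w_2"] + p["b_2"])      # feedforward + residual + norm

rng = np.random.default_rng(0)
t, d, heads = 5, 8, 2
params = {n: rng.normal(scale=0.1, size=(d, d)) for n in ("w_q", "w_k", "w_v", "w_o")}
params.update(
    w_1=rng.normal(scale=0.1, size=(d, 4 * d)), b_1=np.zeros(4 * d),
    w_2=rng.normal(scale=0.1, size=(4 * d, d)), b_2=np.zeros(d),
)
x = rng.normal(size=(t, d))
y = transformer_block(x, params, heads)
print(y.shape)                                               # (5, 8)
```

Because the block maps a `(t, d)` sequence to another `(t, d)` sequence, blocks of this kind can be chained one after another.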

## From Transformers to Classifiers

- With transformer blocks in hand, the most common way to build a classifier is an architecture consisting of a long chain of transformer blocks.
- All we need to do is work out how to feed the input sequences into the architecture and how to transform the final output sequence into a single classification.

![](../images/transformers-classifier.svg)
(Source: http://peterbloem.nl/blog/transformers)

- The trick in the classifier is to apply global average pooling to the final output sequence, and map the result to a softmaxed class vector.
    - The output sequence is averaged to produce a single vector.
    - This vector is then projected down to a vector with one element per class and softmaxed into probabilities.
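
The pooling and projection steps above could look roughly like this in NumPy. The weight matrix `w`, the bias `b`, and the toy sizes are made up for illustration; in a real model they would be learned together with the transformer blocks.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(output_seq, w, b):
    """output_seq: (t, k) output of the last transformer block."""
    pooled = output_seq.mean(axis=0)    # global average pooling: (t, k) -> (k,)
    logits = pooled @ w + b             # project down to one element per class
    return softmax(logits)              # class probabilities

rng = np.random.default_rng(0)
t, k, num_classes = 6, 8, 3
output_seq = rng.normal(size=(t, k))    # stand-in for the final output sequence
w = rng.normal(size=(k, num_classes))   # hypothetical projection weights
b = np.zeros(num_classes)
probs = classify(output_seq, w, b)
print(probs.shape)                      # (3,)
```

Averaging over the token axis is what lets the same classifier handle input sequences of any length.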

## Token Positions

- The operations described above do not take into account the positions of tokens in the sequence.
- If the input tokens are reordered, the output sequence contains the same vectors in the correspondingly reordered positions (self-attention is **permutation equivariant**), so after average pooling the classifier gives the same result for any word order (it is **permutation invariant**).
- To fix this, most transformer models create **position embeddings** or **position encodings** for each token of the sequence to:
    - represent the position of the word/token in the current sequence
    - add this to the word/token embedding
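
A small NumPy experiment can make the problem and the fix concrete. It assumes the basic dot-product self-attention from Peter Bloem's post (no learned parameters) and uses random vectors as stand-ins for learned position embeddings; all names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def basic_self_attention(x):
    return softmax(x @ x.T, axis=1) @ x

rng = np.random.default_rng(0)
t, k = 5, 4
tokens = rng.normal(size=(t, k))        # token embeddings of one sequence
perm = np.array([4, 0, 3, 1, 2])        # a reordering of the same tokens

# Without position information, reordering the input only reorders the output,
# so the average-pooled vector used by the classifier does not change.
y = basic_self_attention(tokens)
y_perm = basic_self_attention(tokens[perm])
print(np.allclose(y[perm], y_perm))                       # True
print(np.allclose(y.mean(axis=0), y_perm.mean(axis=0)))   # True

# With a position vector added to each slot, word order now matters.
pos = rng.normal(size=(t, k))           # stand-in for learned position embeddings
z = basic_self_attention(tokens + pos)
z_perm = basic_self_attention(tokens[perm] + pos)
print(np.allclose(z.mean(axis=0), z_perm.mean(axis=0)))   # False
```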

## Famous Transformers-based Models

### BERT

- The paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
- BERT consists of a simple stack of transformer blocks.
- It is pre-trained on a large general-domain corpus consisting of 800M words from English books and 2.5B words of Wikipedia articles.

- BERT pretraining features two language tasks:
    - **Masking**: A certain proportion of the words in each input sequence are randomly masked out, replaced with a random word, or kept as is, and the model learns to predict what the original words were.
    - **Next Sequence Classification** (called next sentence prediction in the BERT paper): Two sequences of around 256 words are sampled from the corpus; they either follow each other directly in the corpus or are taken from random places. The model learns to predict which case applies.
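
The masking task can be sketched as a small data-preparation function. The proportions used here are taken from the BERT paper rather than from these notes: about 15% of the tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. The function name, the toy sentence, and the toy vocabulary are hypothetical.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Return (corrupted, targets); targets[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            targets.append(tok)                  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")       # 80%: replace with the mask symbol
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)            # 10%: keep the original token
        else:
            targets.append(None)                 # position is not scored
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat and looked at the dog".split()
vocab = sorted(set(tokens))
# A high selection probability is used here only so the toy output shows some masks.
corrupted, targets = mask_tokens(tokens, vocab, select_prob=0.5, seed=1)
print(corrupted)
print(targets)
```

During pretraining, the loss would be computed only at the positions where `targets` is not `None`.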

- BERT uses **WordPiece** tokenization: each token is a subword unit, somewhere between a whole word and a single character.
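
To give a feel for subword tokens, here is a sketch of the greedy longest-match-first procedure that WordPiece tokenizers use to split a word once a vocabulary is given. The toy vocabulary is invented for this example; the `##` prefix for word-internal pieces and the `[UNK]` fallback follow the convention of the released BERT vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:                       # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate     # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                        # no piece matches: give up on the word
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece_tokenize("playing", toy_vocab))      # ['play', '##ing']
print(wordpiece_tokenize("unplayable", toy_vocab))   # ['un', '##play', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

This only covers applying a fixed vocabulary; how the vocabulary itself is learned from a corpus is a separate step not shown here.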

- With this pretrained BERT, we can add a single task-specific layer after the stack of transformer blocks, which maps the general-purpose representation to a task-specific output (e.g., binary classification).
- The model is then fine-tuned for the particular task at hand (**transfer learning**).

- Statistics of the large BERT model:
    - Transformer blocks: 24
    - Sequence length: 512 (maximum)
    - Embedding dimension: 1024
    - Attention heads: 16
    - Parameter number: 340M
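
As a sanity check, the parameter count can be roughly reproduced from the other statistics. This back-of-the-envelope sketch makes assumptions that are not in these notes: each block has four `d x d` attention projections plus a feedforward layer with hidden size `4d` (about `12 * d^2` weights in total), biases and normalization parameters are ignored, the WordPiece vocabulary has 30,522 entries, and there are 512 position embeddings.

```python
def estimate_params(blocks, d_model, vocab_size, max_positions):
    """Rough parameter count; ignores biases and normalization parameters."""
    per_block = 12 * d_model ** 2                       # 4*d^2 attention + 8*d^2 feedforward
    embeddings = (vocab_size + max_positions) * d_model  # token + position embeddings
    return blocks * per_block + embeddings

bert_large = estimate_params(blocks=24, d_model=1024, vocab_size=30522, max_positions=512)
print(f"{bert_large / 1e6:.0f}M")                       # 334M
```

The estimate lands within a few percent of the reported 340M. The same function with 48 blocks, an embedding dimension of 1600, a vocabulary of 50,257 tokens, and 1024 positions gives about 1.56B, in line with the GPT-2 figure listed in the next section.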

### GPT-2

- GPT-2 is famous (notorious) in the news media as the "[malicious writing AI](https://www.bbc.com/news/technology-47249163)".
- Different from BERT, GPT-2 is fundamentally a language **generation** model.
- GPT-2 is notable for the linguistic diversity of its training data (e.g., text from pages linked in posts on the social media site *Reddit* that received a minimum level of social support, i.e., number of upvotes).
- Statistics of GPT-2:
    - Transformer blocks: 48
    - Sequence length: 1024
    - Embedding dimension: 1600
    - Attention heads: 25
    - Parameter number: 1.5B
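
What makes a transformer usable for generation is that each position may only look at the tokens before it, so the model can be trained to predict the next token. A minimal sketch of this masked (causal) self-attention is shown below; it reuses the parameter-free dot-product attention from Peter Bloem's post as a simplifying assumption and is not GPT-2's actual implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x):
    """x: (t, k). Position i may only attend to positions j <= i."""
    t = x.shape[0]
    raw = x @ x.T
    future = np.triu(np.ones((t, t), dtype=bool), k=1)   # True strictly above the diagonal
    raw = np.where(future, -np.inf, raw)                 # block attention to later tokens
    w = softmax(raw, axis=1)                             # masked entries get weight 0
    return w @ x, w

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
y, w = causal_self_attention(x)
print(np.allclose(np.triu(w, k=1), 0.0))   # True: no weight on future tokens
print(np.allclose(y[0], x[0]))             # True: the first token only sees itself
```

With this mask in place, changing a later token never changes the outputs at earlier positions, which is what allows text to be generated one token at a time.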

## More

- [Transformer-XL](https://arxiv.org/abs/1901.02860)
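- According to Bloem's post, the current limit on transformer performance lies in the hardware (how large a model and how long a sequence fit in memory) rather than in the architecture.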
- Transformers are generic, waiting to be exploited in many more fields.

## References

- The paper: [Attention is All You Need](https://arxiv.org/abs/1706.03762)
- This lecture is based on Peter Bloem's blog post: [Transformers from Scratch](http://peterbloem.nl/blog/transformers).
- Jay Alammar's blog post: [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
- Jay Alammar's blog post: [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/)
