# Transformers: Intuition

![](../images/transformers-ex.png)
(Source: http://jalammar.github.io/illustrated-bert/)

- Transformers are a very exciting development in deep learning for NLP.
- The transformer is an important architecture that allows a model to learn from the co-occurring contexts of words.
- Most importantly, its attention mechanism enables the model to capture long-distance dependency relations in language, which has long been a difficult task in traditional statistical NLP.

## Why Attention is All We Need?

- In the encoder-decoder sequence model, as the input sequence grows longer, more and more of the input has to be compressed into a single vector (e.g., the hidden state of the LSTM at the last time step).
- The intermediate hidden states that the encoder produces are NOT easily available to the decoder.
- That is, the encoder states are mostly hidden from the decoder.
- The goal of the **attention layer** is to make the hidden states of the encoder at all time steps available to the decoder.

## Types of Attention

- The **General Attention** mechanism allows the decoder to focus on a subset of the encoded input vectors while decoding.
- The **Self Attention** mechanism connects the encodings of input tokens at different positions with one another.
    - Every attention mechanism needs a way to determine, for a given token, how much it depends on the other tokens of the input. These quantities are referred to as the **attention weights**.
    - There are several ways to compute the attention weights, e.g., **Bahdanau** Attention and **Luong** Attention.
    - The key is that the output of the attention layer is a weighted version of the input vectors.
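
As a rough illustration of the last point, the sketch below scores each encoder hidden state against one decoder state with a plain dot product (one of the scoring functions used in Luong-style attention), turns the scores into weights with a softmax, and returns the weighted sum of the encoder states. The toy vectors and the helper names `softmax` and `attend` are assumptions made for this example; plain Python lists are used so that the arithmetic stays visible.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Dot-product attention of one decoder state over all encoder states."""
    # One score per encoder time step: how well it matches the decoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted sum of the encoder states.
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Toy values (hypothetical): three encoder time steps, hidden size 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print(weights)  # attention weights, one per encoder time step
print(context)  # weighted version of the encoder states
```

Bahdanau attention differs mainly in the scoring step, where a small feedforward network replaces the dot product; the softmax and the weighted sum stay the same.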

## Self-attention

- The fundamental component of a transformer is the **self-attention** mechanism.
- Self-attention is a sequence-to-sequence operation: an input sequence goes in and an output sequence comes out.
- The main characteristic of self-attention is that, when determining each token of the output sequence, it considers not just one particular token of the input sequence but all the input tokens.

- In other words, each output token, $y_i$, is a weighted average over all the input tokens $x_j$, where $w_{ij}$ is the weight that $x_j$ receives when computing $y_i$.

$$
y_i = \sum_jw_{ij}x_j
$$

![](../images/transformers-self-attention.svg)
(Source: http://peterbloem.nl/blog/transformers)
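
The weighted average can be sketched in a few lines. The notes above do not say how the weights $w_{ij}$ are obtained; this sketch assumes the simplest choice, a softmax over the dot products $x_i \cdot x_j$, with no learned parameters. Real transformers additionally apply learned query, key and value projections and scale the dot products. The input vectors are made up.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Return one output vector per input vector: y_i = sum_j w_ij * x_j."""
    dim = len(xs[0])
    ys = []
    for x_i in xs:
        # Raw scores: dot product of x_i with every input vector, itself included.
        scores = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in xs]
        w_i = softmax(scores)  # row i of the weight matrix
        ys.append([sum(w * x_j[k] for w, x_j in zip(w_i, xs)) for k in range(dim)])
    return ys

# Hypothetical input sequence: three tokens with 2-dimensional embeddings.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = self_attention(xs)
for y in ys:
    print([round(v, 3) for v in y])
```

The output has the same length as the input, and every output vector mixes information from all positions, which is what makes long-distance dependencies directly reachable.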

## From Self-Attention to Transformers

- A **transformer** is an architecture that builds upon self-attention layers.
- Peter Bloem's definition of transformers:

> Any architecture designed to process a connected set of units--such as the tokens in a sequence or the pixels in an image--where the only interaction between units is through self-attention.

![](../images/transformer-block.svg)
(Source: http://peterbloem.nl/blog/transformers)

- A transformer block combines the self-attention layer with a local feedforward network and adds normalization and residual connections.
- Normalization and residual connections are standard tricks that help neural networks train faster and more accurately.
- A transformer block can also have **multi-head attention layers**, which run multiple self-attention layers in parallel to keep track of different types of long-distance relationships between input tokens.
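
To make the wiring concrete, here is a minimal single-head sketch of one block in plain Python: self-attention, a residual connection with layer normalization, a per-token feedforward network, and a second residual connection with normalization. It assumes unparameterised self-attention, a layer norm without learned scale and shift, and randomly drawn weights; all function names are illustrative and do not come from any library.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Unparameterised self-attention: each output is a weighted average of all inputs."""
    out = []
    for x_i in xs:
        w = softmax([sum(a * b for a, b in zip(x_i, x_j)) for x_j in xs])
        out.append([sum(w_j * x_j[k] for w_j, x_j in zip(w, xs))
                    for k in range(len(x_i))])
    return out

def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight):
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def feed_forward(x, w_in, w_out):
    """Local network applied to each token separately: linear, ReLU, linear."""
    hidden = [max(0.0, h) for h in linear(x, w_in)]
    return linear(hidden, w_out)

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def transformer_block(xs, w_in, w_out):
    # Self-attention, then residual connection and normalization.
    attended = self_attention(xs)
    xs = [layer_norm(add(x, a)) for x, a in zip(xs, attended)]
    # Feedforward on each position, then residual connection and normalization.
    return [layer_norm(add(x, feed_forward(x, w_in, w_out))) for x in xs]

# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
dim, hidden = 4, 8
w_in = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(dim)]
xs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]

out = transformer_block(xs, w_in, w_out)
print(len(out), len(out[0]))  # same sequence length and embedding size as the input
```

Because the block maps a sequence of vectors to a sequence of the same shape, blocks can be stacked one after another.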

## From Transformers to Classifiers

- With the transformer blocks, the most common way to build a classifier is to have an architecture consisting of a large chain of transformer blocks.
- All we need to do is work out how to feed the input sequences into the architecture and how to transform the final output sequence into a single classification.

![](../images/transformers-classifier.svg)
(Source: http://peterbloem.nl/blog/transformers)

- The trick in the classifier is to apply global average pooling to the final output sequence, and map the result to a softmaxed class vector.
    - The output sequence is averaged to produce a single vector.
    - This vector is then projected down to a vector with one element per class and softmaxed into probabilities.
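
The pooling and projection steps can be sketched as follows. The output vectors and the projection matrix `class_weights` are made-up numbers for a two-class problem; in a trained model the projection would be learned.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(output_sequence, class_weights):
    """Average the output vectors, project to one score per class, and softmax."""
    n = len(output_sequence)
    dim = len(output_sequence[0])
    # Global average pooling: one vector for the whole sequence.
    pooled = [sum(vec[k] for vec in output_sequence) / n for k in range(dim)]
    # Linear projection: one score (logit) per class.
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in class_weights]
    return softmax(logits)

# Hypothetical final output sequence: three tokens, embedding size 3.
output_sequence = [[0.2, 0.8, -0.1], [0.4, 0.6, 0.3], [0.0, 1.0, 0.1]]
# Hypothetical projection for two classes (one row per class).
class_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

probs = classify(output_sequence, class_weights)
print(probs)  # class probabilities, summing to 1
```

Since the average does not depend on the order of the vectors, shuffling `output_sequence` yields the same probabilities.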

## Token Positions

- The operations described above do not take into account the positions of tokens in a sequence.
- If the input tokens are reordered, the output vectors are simply reordered in the same way (self-attention is **permutation equivariant**), so after average pooling the classification is the same no matter how the input tokens are ordered. (The classifier is **permutation invariant**.)
- To fix this, most transformer models create **position embeddings** or **position encodings** for each token of the sequence to:
    - represent the position of the word/token in the current sequence
    - add this to word/token embedding
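
One concrete option is the fixed sinusoidal encoding from the *Attention is All You Need* paper, sketched below; other models, BERT among them, learn their position embeddings instead. The token embeddings here are made-up numbers, and the helper names are illustrative.

```python
import math

def position_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for k in range(dim):
        # Each sine/cosine pair (indices 2i and 2i+1) shares one wavelength.
        angle = position / (10000 ** ((k - k % 2) / dim))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(token_embeddings):
    """Add the encoding of each position to the embedding at that position."""
    dim = len(token_embeddings[0])
    return [
        [e + p for e, p in zip(emb, position_encoding(pos, dim))]
        for pos, emb in enumerate(token_embeddings)
    ]

# Hypothetical 4-dimensional embeddings; "dog" appears at positions 0 and 2.
embedding = {"dog": [0.1, 0.2, 0.3, 0.4], "bites": [0.5, 0.6, 0.7, 0.8]}
sequence = ["dog", "bites", "dog"]

with_positions = add_positions([embedding[t] for t in sequence])
for token, vec in zip(sequence, with_positions):
    print(token, [round(v, 3) for v in vec])
```

After the addition the two occurrences of "dog" are no longer represented by the same vector, so the model can tell their positions apart.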

## Famous Transformers-based Models

### BERT

- Prior embedding models are mostly context-free; BERT, in contrast, takes context into account in its language model. BERT was introduced by Google researchers in October 2018.
- The paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
- BERT stands for Bidirectional Encoder Representations from Transformers.
- BERT consists of a simple stack of transformer blocks.
- It is pre-trained on a large general-domain corpus consisting of 800M words from English books and 2.5B words of Wikipedia articles.

- BERT pretraining features two language tasks:
    - **Masking**: A certain number of words in each input sequence are randomly masked out or otherwise modified, and the model learns to predict which words have been modified and what the original words were.
    - **Next Sequence Classification** (called next sentence prediction in the BERT paper): Two sequences (around 256 words) are sampled from the corpus; they either follow each other directly in the corpus or are taken from random places. The model needs to learn which case applies.
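
The masking task can be sketched as a small data-corruption routine. The proportions used here are the ones reported in the BERT paper, not stated in the notes above: about 15% of the tokens are selected, and of those 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy sentence, and the toy vocabulary are assumptions for illustration.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # random replacement
            else:
                corrupted.append(token)  # selected but left unchanged
        else:
            corrupted.append(token)
            labels.append(None)
    return corrupted, labels

tokens = "the cat sat on the mat and looked at the dog".split()
vocab = sorted(set(tokens))  # toy vocabulary built from the sentence itself

corrupted, labels = mask_tokens(tokens, vocab, random.Random(0))
print(corrupted)
print(labels)
```

The training loss is computed only at positions whose label is not `None`.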

- BERT utilizes **WordPiece** tokenization. Each token is somewhere between a whole word and a single character.
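
At tokenization time, WordPiece splits each word greedily into the longest pieces found in its vocabulary, marking non-initial pieces with a `##` prefix. The sketch below illustrates that matching step only, assuming a tiny hand-made vocabulary; a real BERT vocabulary has roughly 30,000 entries learned from data, and the function `wordpiece` is written for this example rather than taken from a library.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary.
vocab = {"walk", "play", "##ing", "##ed", "##s", "un", "##play", "##able"}

print(wordpiece("walking", vocab))     # ['walk', '##ing']
print(wordpiece("unplayable", vocab))  # ['un', '##play', '##able']
print(wordpiece("xyz", vocab))         # ['[UNK]']
```

Frequent words stay whole, while rare words are broken into reusable pieces, which keeps the vocabulary small without resorting to single characters.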

- With this pretrained BERT, we can add a single task-specific layer after the stack of transformer blocks, which maps the general-purpose representation to a task-specific output (e.g., binary classification).
- The model is then fine-tuned for the particular task at hand. (**transfer learning**!!)

- Statistics of the large BERT model:
    - Transformer blocks: 24
    - Sequence length: 512 (maximum)
    - Embedding dimension: 1024
    - Attention heads: 16
    - Parameter number: 340M

#### English Examples

In [1]:
## Google Colab Setting
!pip install -U transformers

Requirement already up-to-date: transformers in /usr/local/lib/python3.7/dist-packages (4.3.3)


In [2]:
import textwrap

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer_dbert = AutoTokenizer.from_pretrained("distilbert-base-cased")
model_dbert = AutoModelForMaskedLM.from_pretrained("distilbert-base-cased")


In [3]:
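%%time
text = f"China and Taiwan are two {tokenizer_dbert.mask_token} countries."
# Encode the sentence into a tensor of token ids (shape: 1 x sequence length)
input_ids = tokenizer_dbert.encode(text, return_tensors="pt")

# Find the position of the mask token in the sequence
mask_token_index = torch.where(input_ids == tokenizer_dbert.mask_token_id)[1]
# Score every vocabulary item at every position, then keep the masked position
token_logits = model_dbert(input_ids).logits
mask_token_logits = token_logits[0, mask_token_index, :]
# Ids of the five highest-scoring candidates for the masked slot
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()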

CPU times: user 56.2 ms, sys: 4.02 ms, total: 60.2 ms
Wall time: 59.7 ms


In [4]:
for token in top_5_tokens:
    print(text.replace(tokenizer_dbert.mask_token, tokenizer_dbert.decode([token])))

China and Taiwan are two sister countries.
China and Taiwan are two partner countries.
China and Taiwan are two neighbouring countries.
China and Taiwan are two neighboring countries.
China and Taiwan are two member countries.


#### Chinese Examples

In [5]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer_zh_albert = AutoTokenizer.from_pretrained("ckiplab/albert-tiny-chinese")
model_zh_albert = AutoModelForMaskedLM.from_pretrained("ckiplab/albert-tiny-chinese")


In [6]:
%%time
# The sentence reads: "Taiwan and China are two [MASK] countries."
text = f"台灣與中國是兩個{tokenizer_zh_albert.mask_token}的國家。"
input_ids = tokenizer_zh_albert.encode(text, return_tensors="pt")

# Locate the mask token, score the vocabulary there, and keep the top five ids
mask_token_index = torch.where(input_ids == tokenizer_zh_albert.mask_token_id)[1]
token_logits = model_zh_albert(input_ids).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

CPU times: user 15.4 ms, sys: 889 µs, total: 16.2 ms
Wall time: 16.6 ms


In [7]:
for token in top_5_tokens:
    print(text.replace(tokenizer_zh_albert.mask_token, tokenizer_zh_albert.decode([token])))

台灣與中國是兩個新的國家。
台灣與中國是兩個國的國家。
台灣與中國是兩個好的國家。
台灣與中國是兩個洲的國家。
台灣與中國是兩個大的國家。


### GPT-2

- GPT-2 is famous (notorious) in the news media as the "[malicious writing AI](https://www.bbc.com/news/technology-47249163)".
- Different from BERT, GPT-2 is fundamentally a language **generation** model.
- GPT-2 is notable for the linguistic diversity of its training data (e.g., posts and links shared on the social media site *Reddit* with a minimum level of social support, i.e., number of upvotes).
- Statistics of GPT-2:
    - Transformer blocks: 48
    - Sequence length: 1024
    - Embedding dimension: 1600
    - Attention heads: 25
    - Parameter number: 1.5B

#### English Examples

In [8]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer_en_gpt2 = AutoTokenizer.from_pretrained("gpt2")

# "gpt2" is the smallest checkpoint; the 1.5B-parameter model is "gpt2-xl"
model_en_gpt2 = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

In [9]:
%%time
text = "Taiwan and China are two independent countries"

# Tokenize the input string
input_ids = tokenizer_en_gpt2.encode(text, return_tensors="pt")

# Generate a continuation by sampling, restricted by top-k and top-p filtering
output = model_en_gpt2.generate(input_ids, max_length=100, do_sample=True, top_p=0.95, top_k=60)

# Decode the generated ids and wrap the text at 40 characters
print(textwrap.fill('\n' + tokenizer_en_gpt2.decode(output[0]), 40))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 Taiwan and China are two independent
countries. The Republic of China has
claimed the status of a democratic
nation for 100 years, in accordance with
China's state laws, but has been engaged
in a civil war with the Chinese mainland
for nearly a decade.  In the past year,
North Korea's rocket launch has been
repeatedly monitored by South Korean
media outlets. It has not been
independently confirmed.<|endoftext|>
CPU times: user 3.89 s, sys: 25.6 ms, total: 3.92 s
Wall time: 3.92 s
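
The `top_k` and `top_p` arguments above restrict which tokens can be sampled at each step. The sketch below shows the idea on a made-up next-token distribution: keep the `top_k` most likely tokens, renormalise, then keep the smallest set of tokens whose cumulative probability reaches `top_p`. It is a simplified illustration under those assumptions, not the library's implementation, and `top_k_top_p_filter` is a name chosen for this example.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]  # renormalise after top-k

    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break  # enough probability mass collected

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalise after top-p

# Hypothetical next-token distribution.
probs = {"countries": 0.5, "nations": 0.3, "states": 0.15, "islands": 0.04, "bananas": 0.01}

filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.9)
print(filtered)
```

With `do_sample=True`, the next token is drawn at random from the filtered distribution, which is why repeated runs produce different continuations.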


#### Chinese Examples

In [10]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer_zh_gpt2 = AutoTokenizer.from_pretrained("ckiplab/gpt2-base-chinese")

model_zh_gpt2 = AutoModelForCausalLM.from_pretrained("ckiplab/gpt2-base-chinese", output_hidden_states=True)

In [11]:
# Alternative prompt ("From the very beginning, I did not want to trust AI, because"):
#text = "打從一開始，我就不想信人工智慧，因為"
# Prompt: a Chinese narrative passage about an old lady looking over and under her spectacles
text = "老太太把眼鏡往下一移，從眼鏡上面朝屋內四處張望了一圈，然後又把眼鏡往上抬著，從眼鏡下面往屋外瞧。她平時很少、甚至從來沒有透過眼鏡去找像一個小男孩這樣小傢伙。對她來說，自己這副做工考究的眼鏡是地位的象徵，它的裝飾價值遠遠超出了實用價值，其實，即使戴上兩片爐蓋也照樣看得一清二楚。"
# Tokenize the input string
input_ids = tokenizer_zh_gpt2.encode(text, add_special_tokens=False, return_tensors="pt")

# Run the model
output = model_zh_gpt2.generate(input_ids, max_length=500,
                                do_sample=True, top_p=0.95, top_k=60)



Setting `pad_token_id` to `eos_token_id`:102 for open-end generation.


In [12]:

# Print the output
print(textwrap.fill('\n'+ tokenizer_zh_gpt2.decode(output[0]), 40))

 老 太 太 把 眼 鏡 往 下 一 移 ， 從 眼 鏡 上 面 朝 屋 內 四
處 張 望 了 一 圈 ， 然 後 又 把 眼 鏡 往 上 抬 著 ， 從 眼
鏡 下 面 往 屋 外 瞧 。 她 平 時 很 少 、 甚 至 從 來 沒 有
透 過 眼 鏡 去 找 像 一 個 小 男 孩 這 樣 小 傢 伙 。 對 她
來 說 ， 自 己 這 副 做 工 考 究 的 眼 鏡 是 地 位 的 象 徵
， 它 的 裝 飾 價 值 遠 遠 超 出 了 實 用 價 值 ， 其 實 ，
即 使 戴 上 兩 片 爐 蓋 也 照 樣 看 得 一 清 二 楚 。 她 們
一 共 只 見 到 了 六 分 鐘 的 眼 鏡 ， 就 有 兩 片 小 男 孩
眼 鏡 。 這 時 ， 一 位 自 稱 是 [UNK] [UNK] 的 眼 鏡
， 卻 有 一 塊 不 亮 的 小 男 孩 ， 在 沒 有 任 何 人 都 能
看 到 的 情 況 下 ， 眼 鏡 可 能 是 不 存 在 的 ， 眼 鏡 能
在 幾 秒 內 變 回 更 多 的 距 離 。 眼 鏡 中 的 鏡 面 亮 度
最 高 ， 僅 有 6 米 ， 是 世 界 上 最 高 。 她 的 眼 睛 和
地 球 之 間 的 不 同 屬 於 全 球 性 的 ， 這 對 這 種 眼 鏡
是 最 好 的 一 種 。 由 於 眼 睛 的 顏 色 以 往 是 棕 色 、
深 色 、 光 色 和 黃 色 ， 與 以 往 的 鏡 眼 並 不 相 同 。
她 可 以 辨 識 許 多 不 同 顏 色 的 人 ， 例 如 美 國 女 性
， 以 及 來 自 英 國 的 女 性 ， 不 會 像 一 個 更 為 聰 明
的 男 性 所 感 到 的 奇 怪 、 更 危 險 。 眼 鏡 也 會 像
[UNK]. net ， 比 如 將 眼 鏡 塗 抹 在 人 眼 上 的 人 更
為 亮 眼 。 她 會 在 人 眼 上 塗 抹 的 是 綠 色 ， 以 防 止
眼 睛 內 出 現 不 明 原 因 ， 最 終 它 們 會 被 破 壞 。 她
用 眼 睛 塗 抹 的 是 紅 色 或 黑 色 。 她 還 被 作 為 女 人
臉 部 塗 料 。 但 是 ， 她 還 能 夠 使 用 自 己 的 眼 球 塗
抹 。 她 在 看 起 來 是 自 然 光 或 者 是 [UNK]. com 。

## More

- [Transformer-XL](https://arxiv.org/abs/1901.02860)
- The current performance limit is purely in the hardware.
- Transformers are generic, waiting to be exploited in many more fields.

## References

- The paper: [Attention is All You Need](https://arxiv.org/abs/1706.03762)
- This lecture is based on Peter Bloem's blog post: [Transformers from Scratch](http://peterbloem.nl/blog/transformers).
- Jay Alammar's blog post: [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
- Jay Alammar's blog post: [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/)
