# Transformer

The transformer is now a popular neural network architecture in NLP. Compared with an RNN, it lets more context information flow through the network, which generally yields more accurate results.
It can also be trained with much more parallelism, so it typically takes less time to train and to run inference. In this notebook, I'm going to explain how a transformer works.

A transformer usually consists of an encoder-decoder structure, where the encoder maps an input sequence of symbol representations $(x_1, x_2, ..., x_n)$ to a sequence of continuous representations $\mathbf{z} = (z_1, z_2, ..., z_n)$. Given $\mathbf{z}$, the decoder generates an output sequence $(y_1, y_2, ..., y_m)$ of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
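The auto-regressive loop described above can be sketched in a few lines. This is only an illustration: `greedy_decode`, `toy_encode`, `toy_next_token`, and the `<eos>` end marker are hypothetical names, and the "model" is a lookup table standing in for trained encoder and decoder layers.

```python
from typing import Callable, List


def greedy_decode(
    encode: Callable[[List[str]], List[str]],
    next_token: Callable[[List[str], List[str]], str],
    source: List[str],
    max_len: int = 10,
    eos: str = "<eos>",
) -> List[str]:
    """Generate output tokens one at a time, feeding earlier outputs back in."""
    z = encode(source)  # the encoder runs once over the whole input
    generated: List[str] = []
    for _ in range(max_len):
        # The decoder sees z plus everything generated so far (y_1 .. y_{t-1}).
        token = next_token(z, generated)
        if token == eos:
            break
        generated.append(token)
    return generated


# Hypothetical stand-ins for a trained model (assumption, not a real network).
def toy_encode(source: List[str]) -> List[str]:
    return list(source)


TOY_TABLE = {
    (): "Good",
    ("Good",): "morning",
    ("Good", "morning"): "<eos>",
}


def toy_next_token(z: List[str], generated: List[str]) -> str:
    return TOY_TABLE[tuple(generated)]


result = greedy_decode(toy_encode, toy_next_token, ["早上好"])
print(result)  # ['Good', 'morning']
```

The key point is that `generated` grows by one token per step and is passed back to the decoder each time, while the encoder output `z` is computed once and reused.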

## Concepts

First I will introduce some basic terminology. You may find it boring, but it is worth understanding before we dive deeper.

* token_ids or input_ids

As with other neural network models, a transformer expects numeric input, but what we have during training or inference is raw text, so we need a preprocessing step that converts the text into a numeric representation the transformer can understand. This step is called tokenization. Tokenization chops a long sentence into separate tokens (a token may be a word, several words, or part of a word); all the unique tokens form a vocabulary, and each token is mapped to its unique index in that vocabulary. The input_ids of a sentence is the list of token indices obtained by tokenizing it.
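To make the token-to-index mapping concrete, here is a minimal sketch that splits on whitespace. The helpers `build_vocab` and `encode` are hypothetical; real tokenizers (BERT uses WordPiece, for example) split text into subword units and ship with a fixed vocabulary.

```python
def build_vocab(sentences):
    """Give each unique token an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map a sentence to its list of token indices (its input_ids)."""
    return [vocab[token] for token in sentence.lower().split()]


corpus = ["good morning", "good night"]
vocab = build_vocab(corpus)
input_ids = encode("good night", vocab)

print(vocab)      # {'good': 0, 'morning': 1, 'night': 2}
print(input_ids)  # [0, 2]
```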

* attention_mask

We usually feed a batch of input texts to the transformer instead of a single one to improve performance, but their lengths often differ, while the transformer needs all inputs in a batch to have the same length. The tokenizer therefore pads or truncates the sequences so that the resulting input_ids all have the same length. When padding is used, we need to tell the transformer which tokens of the shorter sequences are padding and should not be attended to during training or inference. This is where attention_mask helps: it is a tensor with the same shape as input_ids, with 0 at padding positions and 1 at non-padding positions.
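The padding and mask construction can be sketched as follows. `pad_batch` is a hypothetical helper and the padding id of 0 is an assumption for illustration; a real tokenizer uses its own padding token id.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that should be ignored.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


padded_ids, mask = pad_batch([[5, 6, 7], [8]])
print(padded_ids)  # [[5, 6, 7], [8, 0, 0]]
print(mask)        # [[1, 1, 1], [1, 0, 0]]
```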

* token_type_ids or segment_ids

NLP tasks such as sentence-pair classification or question answering require two different sequences to be joined in a single input_ids entry, which is usually done with the help of special tokens such as the classifier ([CLS]) and separator ([SEP]) tokens. For example, the BERT model builds such a sequence as `[CLS] SEQUENCE_A [SEP] SEQUENCE_B [SEP]`. For these tasks the tokenizer also returns a token_type_ids entry, which tells the model which sequence each token comes from.
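A small sketch of how a BERT-style pair and its token_type_ids line up. `build_pair` is a hypothetical helper that works on already-split tokens rather than on ids.

```python
def build_pair(tokens_a, tokens_b):
    """Join two token lists as [CLS] A [SEP] B [SEP] and label each segment."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(["how", "are", "you"], ["fine"])
print(tokens)          # ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', '[SEP]']
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1]
```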

* self-attention

Each element of the input works out which other elements of the input it should attend to.


You may refer to HuggingFace for formal definitions:

* [HuggingFace transformers/glossary](https://huggingface.co/docs/transformers/glossary)
    * [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
    * [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)
    * [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids)


## Papers

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
    - [Transformer: A Novel Neural Network Architecture for Language Understanding](https://blog.research.google/2017/08/transformer-novel-neural-network.html)
    - [Recommended: visualize the associations between tokens](https://huggingface.co/spaces/exbert-project/exbert)
    - [tensorflow transformer tutorial](https://www.tensorflow.org/text/tutorials/transformer)
    - [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

In the following section, let's look at what's inside a transformer at a high level. You can find more details in [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).

In [None]:
#! pip install mermaid-python
from mermaid import Mermaid

Mermaid("""
---
title: Transformer for Machine Translation
---
graph LR
input[早上好]
output[Good morning]
input -- input --> Transformer -- output --> output  
""")

In [47]:
Mermaid("""
---
title: Decode Transformer 1
---
flowchart LR
    input[早上好]
    output[Good morning] 
    subgraph Transformer
        direction LR
        encoder --> decoder
    end
    input -- input --> encoder
    decoder -- output --> output
""")

In [56]:
Mermaid("""
---
title: Decode Transformer 2
---
%% The Encoder is a stack of encoder layers, and likewise the Decoder. The paper uses six of each;
%% six is not a magic number, so you may adjust it in your experiments
flowchart LR
    input[早上好]
    output[Good morning] 
    subgraph Transformer
        direction LR
        subgraph Encoder
            direction TB
            e1[encoder]
            e2[encoder]
            e3[encoder]
            e4[encoder]
            e5[encoder]
            e6[encoder]
            e1 --> e2 --> e3 --> e4 --> e5 --> e6
        end
        subgraph Decoder
            direction BT
            d1[decoder]
            d2[decoder]
            d3[decoder]
            d4[decoder]
            d5[decoder]
            d6[decoder]
            d1 --> d2 --> d3 --> d4 --> d5 --> d6
        end
        Encoder --> Decoder
    end
    input -- input --> Encoder
    Decoder -- output --> output
""")

In [67]:
Mermaid("""
---
title: Decode Transformer 3
---
%% Inside one encoder layer: self-attention followed by a feed-forward network.
%% Inside one decoder layer: self-attention, then encoder-decoder attention, then a feed-forward network.
graph LR
    subgraph Encoder
        direction BT
        sat1[Self-attention]
        ffw1[Feed Forward network]
        sat1 --> ffw1
    end
    subgraph Decoder
        direction BT
        sat2[Self-attention]
        edt[Encoder-Decoder attention]
        ffw2[Feed Forward network]
        sat2 --> edt --> ffw2
    end
    Encoder ~~~ Decoder
""")

In [68]:
import os
from pprint import pprint
from transformers import BertTokenizer

# Local copy of the bert-base-uncased checkpoint; adjust the path to your setup.
model_path = os.path.join(os.path.expanduser("~"), "dev-repo/model-store/bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained(model_path)
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
print("tokenization result:", padded_sequences.keys())
pprint(padded_sequences["input_ids"][0])
pprint(padded_sequences["input_ids"][1])
pprint(padded_sequences["attention_mask"])
pprint(padded_sequences["token_type_ids"])

# What is token_type_ids? Encode a sentence pair and see which segment each token belongs to.
question = "what's your name?"
answer = "my name is cherry."
encoded_dict = tokenizer(question, answer)
print("input_ids:", encoded_dict["input_ids"])
decoded = tokenizer.decode(encoded_dict["input_ids"])
print("decoded: ", decoded)
#print("attention_mask:", encoded_dict["attention_mask"])
print("token_type_ids", encoded_dict["token_type_ids"])

tokenization result: dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
[101, 2023, 2003, 1037, 2460, 5537, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[101,
 2023,
 2003,
 1037,
 2738,
 2146,
 5537,
 1012,
 2009,
 2003,
 2012,
 2560,
 2936,
 2084,
 1996,
 5537,
 1037,
 1012,
 102]
[[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
input_ids: [101, 2054, 1005, 1055, 2115, 2171, 1029, 102, 2026, 2171, 2003, 9115, 1012, 102]
decoded:  [CLS] what's your name? [SEP] my name is cherry. [SEP]
token_type_ids [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]


What does self-attention compute mathematically?

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$


There are several intermediate values when running a transformer:

1. embeddings: a float vector that encodes a token, obtained by looking the token up in an embedding layer. Its dimension is an adjustable hyperparameter, 512 in the paper.
2. query vectors (q): obtained by multiplying the embedding by a weight matrix called $W^Q$. Their dimension is an adjustable hyperparameter, 64 in the paper.
3. key vectors (k): obtained by multiplying the embedding by a weight matrix called $W^K$. Their dimension is an adjustable hyperparameter, 64 in the paper.
4. value vectors (v): obtained by multiplying the embedding by a weight matrix called $W^V$. Their dimension is an adjustable hyperparameter, 64 in the paper.

With `q, k, v` we can compute how much focus each token puts on every position in the sequence. You may think of the conversion from embeddings to `q, k, v` as a kind of dimensionality reduction.
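The whole path from token ids to attention output can be traced with a dependency-free sketch. Everything here is an assumption made for illustration: the weights are random rather than trained, the sizes are tiny (`d_model = 8`, `d_k = 4` instead of 512 and 64), and the helpers `rand_matrix`, `matmul`, `softmax`, and `shape` are hypothetical. It covers a single attention head with no masking.

```python
import math
import random

random.seed(0)  # reproducible toy weights


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def shape(m):
    return (len(m), len(m[0]))


vocab_size, d_model, d_k = 10, 8, 4  # toy sizes, not the paper's 512 and 64

embedding_table = rand_matrix(vocab_size, d_model)
w_q = rand_matrix(d_model, d_k)  # W^Q
w_k = rand_matrix(d_model, d_k)  # W^K
w_v = rand_matrix(d_model, d_k)  # W^V

input_ids = [3, 1, 4]
x = [embedding_table[i] for i in input_ids]  # embedding lookup, one row per token

# Project each embedding down to a query, a key and a value vector.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# softmax(Q K^T / sqrt(d_k)) V
k_t = [list(col) for col in zip(*k)]
scores = matmul(q, k_t)
weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
output = matmul(weights, v)

print(shape(x), shape(q), shape(weights), shape(output))
# (3, 8) (3, 4) (3, 3) (3, 4)
```

Row `i` of `weights` sums to 1 and says how much token `i` attends to each position; `output` then mixes the value vectors with those weights.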