# Introduction to Python and Natural Language Technologies

__Lecture 9, Transformers, BERT__

__April 13, 2021__

__Judit Ács__

In [None]:
import gc
from IPython.display import Image
import numpy as np
import seaborn as sns
import torch
import torch.nn as nn

from transformers import pipeline
from transformers import AutoTokenizer, AutoModel

# Attention mechanism

Attention:
- emphasizes the important part of the input
- and de-emphasizes the rest.
- Mimics cognitive attention.

Method:
- It does this by assigning weights to the elements of the input sequence.
- The weights depend on the current context in the decoder:
    - the current decoder hidden state,
    - the previous output.
- The source vectors are multiplied by the weights and then summed -> **context vector**
- The context vector is used for predicting the next output symbol.
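
The method above can be sketched in a few lines of plain Python. This is a minimal illustration, not the exact formulation of any particular paper: the dot-product scoring function and the toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, source_vectors):
    """Return (weights, context) for a single decoder step."""
    # Assumed scoring function: dot product of the decoder state with each source vector
    scores = [sum(d * s for d, s in zip(decoder_state, vec)) for vec in source_vectors]
    weights = softmax(scores)
    dim = len(source_vectors[0])
    # The context vector is the weighted sum of the source vectors
    context = [sum(w * vec[k] for w, vec in zip(weights, source_vectors)) for k in range(dim)]
    return weights, context

source_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder outputs
decoder_state = [2.0, 0.0]                             # toy decoder hidden state
weights, context = attend(decoder_state, source_vectors)
print(weights)
print(context)
```

The second source vector is orthogonal to the decoder state, so it receives the smallest weight.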

In [None]:
Image("img/dl/attention_mechanism.jpg")

## Problems

Recall that we used recurrent neural cells, specifically LSTMs to encode and decode sequences.

__Problem 1. No parallelism__

LSTMs are recurrent: every step relies on its left (or, in the backward direction, right) history (horizontal arrows), so the symbols need to be processed in order -> no parallelism.

__Problem 2. Long-range dependencies__

Long-range dependencies are not infrequent in NLP.

"The **people/person** who called and wanted to rent your house when you go away next year **are/is** from California" -- Miller & Chomsky 1963

LSTMs have a problem capturing these because there are too many backpropagation steps between the symbols.

# Transformers

Introduced in [Attention Is All You Need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al., 2017

Transformers solve Problem 1 by relying purely on attention instead of recurrence.

Not having recurrent connections means that the order of the symbols is no longer encoded implicitly, so position information has to be added separately (see Positional encoding).

Recurrence is replaced by **self-attention**.

Each symbol is encoded the following way:

__Step 1__: the encoder 'looks' at the other symbols in the input sequence
    - In the example above: the representation of **are/is** depends on **people/person** more than any other word in the sentence, it should receive the highest attention weight.

In [None]:
Image("http://jalammar.github.io/images/t/transformer_self-attention_visualization.png", embed=True)  # from Illustrated Transformers

__Step 2__: the context vector is passed through a feed-forward network which is shared across all symbols.

In [None]:
Image("http://jalammar.github.io/images/t/encoder_with_tensors.png", embed=True)  # from Illustrated Transformers

This visualization is available in the [Tensor2tensor notebook in Google Colab](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)
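
Step 1 can be sketched in plain Python. This is a simplified version: a real Transformer first maps every symbol to separate query, key and value vectors with learned projections, which are left out here. The scaling by $\sqrt{d}$ follows Vaswani et al.; the toy vectors are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Encode every position as a weighted sum of all positions."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:  # no position depends on the result of another one
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = self_attention(x)

# Swap the first two input positions and encode again
x_swapped = [x[1], x[0], x[2]]
y_swapped = self_attention(x_swapped)
print(y)
print(y_swapped)
```

Swapping two input symbols only swaps the corresponding outputs. Self-attention alone does not see word order, and the loop body could run for all positions in parallel.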

## Other components

__Residual connections__

- Also called __skip connections__
- The output of a module is added to the input

$$
\text{output} = \text{layer}(\text{input}) + \text{input}
$$
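
A minimal sketch of the formula. The `layer` function is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
def layer(x):
    """Hypothetical sublayer: scales every coordinate by 0.1."""
    return [0.1 * v for v in x]

def residual(layer_fn, x):
    """Apply layer_fn and add its input back to its output."""
    return [out + inp for out, inp in zip(layer_fn(x), x)]

x = [1.0, -2.0, 3.0]
y = residual(layer, x)
print(y)
```

The addition only works if the layer keeps the dimension of its input, which is why every sublayer of a Transformer outputs vectors of the same size.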

__Softmax__

- As an output layer it is only used in the decoder (the attention weights are also normalized with softmax)
- Maps the output vector to a probability distribution
    - In other words it tells us how likely each symbol is.
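
A small sketch of softmax as an output layer. The three-word vocabulary and the scores are made up for the example.

```python
import math

def softmax(logits):
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "bird"]  # hypothetical output vocabulary
logits = [2.0, 1.0, 0.1]        # hypothetical decoder output for one position
probs = softmax(logits)
best = vocab[probs.index(max(probs))]
print(dict(zip(vocab, probs)))
print(best)
```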

## Multiple heads and layers

Transformers have a number of additional components summarized in this figure:

In [None]:
Image("img/dl/transformer.png")  # from Vaswani et al. 2017

## PyTorch support

PyTorch has an `nn.Transformer` class as well as separate encoder and decoder classes.

In [None]:
from torch.nn import TransformerEncoder, TransformerEncoderLayer

In [None]:
embedding_dim = 12
# num_heads = 5  # embedding_dim must be divisible by the number of heads
num_heads = 2
hidden_size = 7  # dimension of the feed-forward sublayer (dim_feedforward)
dropout = 0.2
TransformerEncoderLayer(embedding_dim, num_heads, hidden_size, dropout)
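
Why does the embedding size have to be divisible by the number of heads? A simplified picture, in plain Python rather than PyTorch: each head works on its own equally sized slice of the vector. The `split_heads` helper is hypothetical and only illustrates the size constraint.

```python
def split_heads(vector, num_heads):
    """Split one vector into num_heads equally sized chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("embedding_dim must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

embedding = list(range(12))        # one 12-dimensional vector
heads = split_heads(embedding, 2)  # two heads, 6 dimensions each
print(heads)

try:
    split_heads(embedding, 5)      # 12 is not divisible by 5
    failed = False
except ValueError as err:
    failed = True
    print(err)
```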

In [None]:
layer = TransformerEncoderLayer(embedding_dim, num_heads, hidden_size, dropout)
TransformerEncoder(layer, 2)

In [None]:
encoder = TransformerEncoder(layer, 2)

sequence_len = 9
batch_size = 3
X = torch.rand((sequence_len, batch_size, embedding_dim))
y = encoder(X)
y.size()

## Positional encoding

Without recurrence, word order information is lost.

Positional information is important:

    John loves Mary.
    Mary loves John.

Transformers apply positional encoding:

$$
\text{PE}_{\text{pos},2i} = \sin(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}), \\
\text{PE}_{\text{pos},2i+1} = \cos(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}),
$$

where:
- $d_{\text{model}}$ is the input dimension to the Transformer, usually the embedding size
- $\text{pos}$ is the position of the symbol in the input sequence i.e. first word, second word etc.
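- $i$ indexes the coordinate pairs of the input vector: coordinate $2i$ gets the sine and coordinate $2i+1$ the cosine, so $i$ runs from 0 to $d_{\text{model}}/2 - 1$.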
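
The formula can be evaluated directly with nested loops. This is a dependency-free sketch that assumes an even $d_{\text{model}}$; the function name is chosen for this example.

```python
import math

def positional_encoding(maxlen, d_model):
    """Return a maxlen x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(maxlen)]
    for pos in range(maxlen):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            table[pos][2 * i] = math.sin(angle)      # even coordinates
            table[pos][2 * i + 1] = math.cos(angle)  # odd coordinates
    return table

table = positional_encoding(maxlen=4, d_model=6)
for row in table:
    print([round(v, 3) for v in row])
```

Every row is the vector that is added to the embedding of the symbol at that position.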

Let's create a position encoder in PyTorch.

For $\text{pos}=0$, the sine values are 0, and the cosine values are 1:

In [None]:
t = torch.FloatTensor([0.])
torch.cos(t), torch.sin(t)

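For large $i$ values, the denominator is close to 10000, so the argument $\text{pos}/10000^{2i/d_{\text{model}}}$ is close to 0 for any realistic position: the sine is again close to 0 and the cosine is close to 1.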

In [None]:
# Pick a few random values for pos
pos = torch.randint(512, size=(10, ))
print(pos)
# Divide by 10000^(2*i/d_model). Make 2*i/d_model close to one (high 2*i values)
t = pos / (10000 ** 0.95)

torch.cos(t), torch.sin(t)

Let's generate the full grid. There are $\text{maxlen} \times d_\text{model}$ values.

__maxlen__ is the maximum position we allow. This has to be predefined.

__d_model__ is the size of the input, which is embedding_dim in most cases.

In [None]:
maxlen = 20
d_model = 12

pe = torch.zeros((maxlen, d_model))
pe.dtype

__pos__ are the indices of the sequence from 0 to $\text{maxlen}-1$:

In [None]:
pos = torch.arange(maxlen, dtype=torch.float)
pos

Reminder:

$$
\text{PE}_{\text{pos},2i} = \sin(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}), \\
\text{PE}_{\text{pos},2i+1} = \cos(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}),
$$

Let's define the denominator:

In [None]:
divterm = 10000 ** (torch.arange(0, d_model, step=2) / float(d_model))
divterm

In [None]:
pos.size(), divterm.size()
(pos[:, None] / divterm).size()

In [None]:
pe[:, ::2] = torch.sin(pos[:, None] / divterm)
sns.heatmap(pe, cmap='RdBu', center=0)

In [None]:
pe[:, 1::2] = torch.cos(pos[:, None] / divterm)
sns.heatmap(pe, cmap='RdBu', center=0)

Combining it in a `nn.Module`:

In [None]:
# took inspiration from here: https://pytorch.org/tutorials/beginner/transformer_tutorial.html
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, maxlen=50):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        pe = torch.zeros(maxlen, d_model)
        pos = torch.arange(maxlen, dtype=torch.float)
        divterm = 10000 ** (torch.arange(0, d_model, step=2) / float(d_model))
        pe[:, ::2] = torch.sin(pos[:, None] / divterm)
        pe[:, 1::2] = torch.cos(pos[:, None] / divterm)
        
        # Since pe is a constant value not a parameter of the module, we register it as a buffer.
        # Buffers are part of the state dictionary of the module along with parameters.
        # Docs: https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer
        self.register_buffer('pe', pe)

    def forward(self, x):
        # The input sequence may be shorter than maxlen
        seqlen = x.size(0)
        
        # The middle dimension is the batch size.
        # We add it as a dummy dimension.
        x = x + self.pe[:seqlen, None, :]
        return self.dropout(x)

In [None]:
d_model = 12
maxlen = 20
batch_size = 7
seqlen = 11
pos_enc = PositionalEncoding(d_model=d_model, dropout=0., maxlen=maxlen)
x = torch.rand(size=(seqlen, batch_size, d_model))
x_pe = pos_enc(x)
x_pe.size()

# Contextual embeddings

In GloVe and Word2vec, words have static representations. In other words, the same vector is assigned to every occurrence of the word.
But words can have different meanings in different contexts, e.g. the word 'stick':

1. Find some dry sticks and we'll make a campfire.
2. Let's stick with GloVe embeddings.

![elmo](http://jalammar.github.io/images/elmo-embedding-robin-williams.png)

_(Peters et al., 2018 in the ELMo paper)_

## ELMo

**E**mbeddings from **L**anguage **Mo**dels

Word representations are functions of the full sentence instead of the word alone.

Two bidirectional LSTM layers are linearly combined.

[Deep contextualized word representations](https://arxiv.org/abs/1802.05365) by Peters et al., 2018, 6300 citations

# BERT

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://www.aclweb.org/anthology/N19-1423/)
by Devlin et al. 2018, 17500 citations

[BERTology](https://huggingface.co/transformers/bertology.html) is the nickname for the growing amount of BERT-related research.

Trained on two tasks:

1. Masked language model:

    1. 15% of the <s>tokens</s>wordpieces are selected at the beginning.
    2. 80% of those are replaced with `[MASK]`,
    3. 10% are replaced with a random token,
    4. 10% are kept intact.
    
2. Next sentence prediction:
    - Are sentences A and B consecutive sentences?
    - Training pairs are generated 50-50%: half of them are consecutive, in the other half B is a random sentence.
    - Binary classification task.
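
The masking procedure of the first task can be sketched as follows. Two simplifications are assumed here: every wordpiece is selected independently with 15% probability (reference implementations may pick a fixed number of positions per sequence instead), and the vocabulary is a made-up toy one.

```python
import random

def mask_wordpieces(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); labels are None for unselected positions."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)   # not selected: nothing to predict here
            labels.append(None)
            continue
        labels.append(token)          # selected: the model must predict the original
        r = rng.random()
        if r < 0.8:
            corrupted.append("[MASK]")           # 80%: replaced with [MASK]
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replaced with a random token
        else:
            corrupted.append(token)              # 10%: kept intact, but still predicted
    return corrupted, labels

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # hypothetical toy vocabulary
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50
corrupted, labels = mask_wordpieces(tokens, vocab, rng)
selected = sum(label is not None for label in labels)
print(selected / len(tokens))
```

The loss is computed only at the selected positions, including the ones that were kept intact.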

## Embedding layer

In [None]:
Image("img/dl/bert_embedding.png")

## Transformer layers


## Finetuning

1. Take a trained BERT model.
2. Add a small classification layer on top (typically a 2-layer MLP).
3. Train BERT along with the classification layer on an annotated dataset.
    - Much smaller than the data BERT was trained on

Another option: freeze BERT and train the classification layer only.
- Easier training regime.
- Smaller memory footprint.
- Worse performance.

In [None]:
Image("img/dl/bert_encoding_finetuning.png")

## BERT pretrained checkpoints

### BERT-Base

- 12 layers
- 12 attention heads per layer
- 768 hidden size
- 110M parameters

### BERT-Large

- 24 layers
- 16 attention heads per layer
- 1024 hidden size
- 340M parameters
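
The parameter counts can be roughly reproduced from the layer count and the hidden size. The sketch below relies on values that are not listed above and are assumptions about the original English uncased checkpoints: a vocabulary of 30522 wordpieces, 512 positions, 2 segment types and a feed-forward size of 4 times the hidden size. The number of attention heads does not change the count.

```python
def bert_parameter_count(layers, hidden, vocab_size=30522, max_positions=512, ffn_mult=4):
    """Count the weights and biases of a BERT encoder with the given shape."""
    ffn = ffn_mult * hidden
    layer_norm = 2 * hidden  # scale and bias
    # Token, position and segment embeddings followed by a layer norm
    embeddings = (vocab_size + max_positions + 2) * hidden + layer_norm
    # Query, key, value and output projections
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> ffn -> hidden
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    # Linear layer on top of [CLS]
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

base = bert_parameter_count(layers=12, hidden=768)
large = bert_parameter_count(layers=24, hidden=1024)
print(f"BERT-Base: {base / 1e6:.1f}M, BERT-Large: {large / 1e6:.1f}M")
```

Both results land within a few percent of the rounded figures above. Most of the parameters are in the Transformer layers, while the embedding matrix accounts for a larger share in the Base model.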

### Cased and uncased

Uncased: everything is lowercased. Diacritics are removed.

### Multilingual BERT - mBERT

A 104-language version trained on the 100 largest Wikipedias.

## BERT implementations

[Original Tensorflow implementation](https://github.com/google-research/bert)

[Huggingface Transformers](https://huggingface.co/transformers/)
- PyTorch implementation originally for BERT-only
- Now it supports dozens of other models
- Hundreds of other model checkpoints from the community

# BERT tokenization

## WordPiece tokenizer

BERT's input **must** be tokenized with BERT's own tokenizer.

A middle ground between word and character tokenization.

Static vocabulary:
- Byte-pair encoding: simple frequency-based tokenization method
- Continuation symbols (\#\#symbol)
- Special tokens: `[CLS]`, `[SEP]`, `[MASK]`, `[UNK]`
- It tokenizes everything, falling back to characters and `[UNK]` if necessary
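
How a static vocabulary is applied to a single word can be sketched with greedy longest-match-first search, which, to my knowledge, is the strategy of BERT's reference tokenizer. The toy vocabulary is hypothetical and the function is an illustration, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into wordpieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:  # try the longest remaining substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"play", "##ing", "##ed", "##s", "p", "##l", "##a", "##y"}  # hypothetical toy vocabulary
print(wordpiece_tokenize("playing", vocab))
print(wordpiece_tokenize("plays", vocab))
print(wordpiece_tokenize("xyz", vocab))
```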

`AutoTokenizer` is a factory class for pretrained tokenizers. Each pretrained checkpoint has a string id. `from_pretrained` instantiates the corresponding tokenizer class and loads its vocabulary:

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-uncased')
print(type(t))
print(len(t.get_vocab()))

t.tokenize("My beagle's name is Tündérke.")

In [None]:
t.tokenize("Русский")

**Cased** models keep diacritics:

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-cased')

t.tokenize("My beagle's name is Tündérke.")

In [None]:
len(t.get_vocab())

It tokenizes Chinese and Japanese character by character, but it does not know all the characters:

In [None]:
t.tokenize("日本語")

Korean is missing from this version:

In [None]:
t.tokenize("한 한국어")

## mBERT tokenization

104 languages, 1 vocabulary

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

In [None]:
len(t.get_vocab())

In [None]:
t.tokenize("My puppy's name is Tündérke.")

In [None]:
t.tokenize("한 한국어")

In [None]:
t.tokenize("日本語")

# Using BERT

## Using `BertModel` directly

`AutoModel`
- each pretrained checkpoint has a string id. `from_pretrained` instantiates the corresponding class and loads the weights:

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')
type(model), type(tokenizer)

In [None]:
tokenizer.tokenize("There are black cats and black dogs.")

`__call__` returns a dictionary of BERT's encoding:

In [None]:
tokenizer("There are black cats and black dogs.")

It can be used for pairs of sentences. Note the values of `token_type_ids`:

In [None]:
tokenizer("There are black cats and black dogs.", "Another sentence.")
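
The layout of a sentence pair can be reproduced by hand. The sketch works on already tokenized input and returns tokens instead of ids, so it only mirrors the structure of the tokenizer's output; `encode_pair` is a hypothetical helper, not part of the library.

```python
def encode_pair(tokens_a, tokens_b=None):
    """Build the token sequence, token_type_ids and attention_mask for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # first segment, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second segment and its [SEP]
    attention_mask = [1] * len(tokens)  # no padding in this example
    return {"tokens": tokens, "token_type_ids": token_type_ids, "attention_mask": attention_mask}

enc = encode_pair(["There", "are", "cats", "."], ["Another", "sentence", "."])
for key, value in enc.items():
    print(key, value)
```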

It can be used for multiple sentences:

In [None]:
tokenizer(["There are black cats and black dogs.", "There are two white cats."])

We need tensors as inputs for BERT:

In [None]:
encoded = tokenizer("There are black cats and black dogs.", return_tensors='pt')
encoded['input_ids'].size()

In [None]:
output = model(**encoded, return_dict=True)
output.keys()

In [None]:
output['last_hidden_state'].size(), output['pooler_output'].size()

Getting all layers:

In [None]:
output = model(**encoded, output_hidden_states=True, return_dict=True)
output.keys()

In [None]:
len(output['hidden_states']), output['hidden_states'][0].size()

Remove variable from the global namespace, run the garbage collector:

In [None]:
del model
gc.collect()

## BERT applications

### Sequence classification

Pretrained model for sentiment analysis.

Base model: `distilbert-base-uncased`

Finetuned on the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) or SST-2, a popular sentiment analysis dataset.

Model id: `distilbert-base-uncased-finetuned-sst-2-english`

In [None]:
nlp = pipeline("sentiment-analysis")
nlp("This is an amazing class.")

In [None]:
nlp("This is not a good class but it's not too bad either.")

In [None]:
nlp("This is not a class.")

In [None]:
del nlp
gc.collect()

### Sequence tagging/labeling: Named entity recognition

Base model: `bert-large-cased`

Finetuned on [CoNLL-2003 NER](https://www.clips.uantwerpen.be/conll2003/ner/).

In [None]:
nlp = pipeline("ner")

In [None]:
result = nlp("jupiter is a Planet that orbits around James the center of the Universe")
result

In [None]:
result = nlp("George Clooney has a pet pig named Estella.")
result

In [None]:
del nlp
gc.collect()

### Machine translation

In [None]:
nlp = pipeline("translation_en_to_fr")
print(nlp("Hugging Face is a technology company based in New York and Paris", max_length=40))

Even the [blessé - blessed false cognate](https://frenchtogether.com/french-english-false-friends/) is handled correctly:

In [None]:
nlp("I was blessed by God after I injured my head.", max_length=40)

In [None]:
del nlp
gc.collect()

### Masked language modeling

Uses `distilroberta-base`

In [None]:
nlp = pipeline("fill-mask")

In [None]:
prompt = "Twitter is a bad idea"  # the mask token is appended in the loop below

for n in range(10):
    result = nlp(f"{prompt} {nlp.tokenizer.mask_token}")
    # Drop the first character of the top prediction, which marks the word boundary
    token = result[0]['token_str'][1:]
    prompt += " " + token
    
prompt

In [None]:
from pprint import pprint
pred = nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks.")
pprint(pred)

In [None]:
pred = nlp(f"{nlp.tokenizer.mask_token} is a very good idea.")
pprint(pred)

In [None]:
pred = nlp(f"{nlp.tokenizer.mask_token} is a bad idea.")
pprint(pred)

In [None]:
del nlp
gc.collect()

# Other models

## Pretrained models

RoBERTa: same architecture as BERT, more training data, modified training objective (no next sentence prediction, dynamic masking)

DistilBERT: smaller version of BERT. It was _distilled_ or compressed from BERT with a student-teacher setup.

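ALBERT: a BERT variant with far fewer parameters, thanks to sharing parameters across layers and factorizing the embedding matrix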

XLM-RoBERTa: multilingual version of RoBERTa

Distil-mBERT: distilled multilingual BERT

## Community models

[Over 1000 community contributions](https://huggingface.co/models)

## huBERT

The first Hungarian-only model and the only one registered on Huggingface.
Other models are available at https://hilanco.github.io/.

BERT base, trained on Webcorpus 2.0, a version of CommonCrawl.

Its tokenizer works much better for Hungarian than mBERT's:

In [None]:
hubert_tokenizer = AutoTokenizer.from_pretrained('SZTAKI-HLT/hubert-base-cc')
# hubert = AutoModel.from_pretrained('SZTAKI-HLT/hubert-base-cc')

In [None]:
sent = ("George Clooney Magyarországról szóló, az Orbán-kormányt kritizáló levelére miniszteri és "
        "államtitkári szinten is reagált a magyar kormány.")
hubert_tokenizer.tokenize(sent)

In [None]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

In [None]:
bert_tokenizer.tokenize(sent)

## GPT-2 text generation

Causal language modeling is when the $i^{th}$ token is modeled based on all the previous tokens. In masked language modeling, by contrast, both the left and the right context are used.
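
In Transformer implementations the difference is commonly enforced with an attention mask. A minimal sketch of the two patterns, with function names chosen for this example:

```python
def causal_mask(n):
    """mask[i][j] is True if position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """Bidirectional attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```

The causal mask is lower triangular, so no position can look at the tokens that come after it.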

In [None]:
text_generator = pipeline("text-generation")

In [None]:
print(text_generator("This is a serious issue we should address", max_length=50, do_sample=False)[0]['generated_text'])

In [None]:
print(text_generator("Twitter is a bad idea, Jack Dorsey had a bad day when he came up with it", max_length=100, do_sample=False)[0]['generated_text'])

In [None]:
del text_generator
gc.collect()

# Further information

[Official PyTorch Transformer tutorial](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)

[Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- Famous blog post with a detailed gentle introduction to Transformers

[The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
- A walkthrough of the original Transformer paper with code and detailed illustrations

[Huggingface Transformers - Summary of tasks](https://huggingface.co/transformers/task_summary.html)

[My blog post about mBERT's tokenizer](http://juditacs.github.io/2019/02/19/bert-tokenization-stats.html)