# Introduction to Python and Natural Language Technologies

__Lecture 10, Transformers, BERT__

__Nov 25, 2020__

__Judit Ács__

In [None]:
from IPython.display import Image
import torch
import torch.nn as nn

from transformers import pipeline
from transformers import AutoTokenizer, AutoModel

# Transformers

Introduced in [Attention Is All You Need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al., 2017

[Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)



## Motivation

Recall that we used recurrent neural cells, specifically LSTMs, to encode and decode sequences.

LSTMs rely on their left and right history (horizontal arrows).

In [None]:
Image("img/tikz/abstract_seq2seq.png")

This makes it impossible to parallelize these steps.

Transformers solve this problem by relying purely on attention instead of recurrence.

In [None]:
Image("http://jalammar.github.io/images/t/encoder_with_tensors.png", embed=True)  # from Illustrated Transformers

Attention assigns a weight to each element of the sequence.

This weight is the _importance_ of the element, i.e. how much 'attention we should pay'.

Self-attention means that the encoder attends to itself:

In [None]:
Image("http://jalammar.github.io/images/t/transformer_self-attention_visualization.png", embed=True)  # from Illustrated Transformers

This visualization is available in the [Tensor2tensor notebook in Google Colab](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)
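
The attention weights described above can be sketched in a few lines. This is a minimal pure-Python illustration of scaled dot-product self-attention as defined by Vaswani et al. To keep it short, it assumes that the queries, keys and values are the input vectors themselves, whereas a real Transformer layer first applies learned projection matrices. The function names and the toy vectors are made up for this example.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention without learned projections.

    x is a list of n vectors of dimension d. Returns (outputs, weights),
    where weights[i][j] is how much position i attends to position j.
    """
    d = len(x[0])
    weights = []
    outputs = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of all input vectors.
        outputs.append([sum(w_j * value[i] for w_j, value in zip(w, x)) for i in range(d)])
    return outputs, weights


# Toy sequence of three 2-dimensional "word vectors" (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and every output vector depends on the whole sequence at once, which is why no step has to wait for the previous one.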

## Word order

Without recurrence, word order information is lost.

Positional information is important:

    John loves Mary.
    Mary loves John.

Transformers apply positional encoding:

$$
PE_{pos,2i} = \sin(pos/10000^{2i/d_{\text{model}}}), \\
PE_{pos,2i+1} = \cos(pos/10000^{2i/d_{\text{model}}}).
$$
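
The formula above can be computed directly. The following is a small pure-Python sketch. The function name `positional_encoding` and the toy sizes are chosen for this example only.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(d_model):
            i = k // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(max_len=4, d_model=6)
print(pe[0])  # position 0: sin(0) = 0 in even dimensions, cos(0) = 1 in odd ones
print([round(v, 3) for v in pe[1]])
```

In the original Transformer, each row of this table is added to the embedding of the token at that position.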

## Other components

Transformers have a number of additional components summarized in this figure:

In [None]:
Image("img/dl/transformer.png")  # from Vaswani et al. 2017

## PyTorch support

PyTorch provides an `nn.Transformer` class, along with separate encoder and decoder modules (`nn.TransformerEncoder` and `nn.TransformerDecoder`).
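
As a rough illustration of these classes, the sketch below builds a tiny encoder stack and runs it on random input. The sizes are toy values picked for this example, not the ones from the paper.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 32, 4, 2  # toy sizes chosen for this example

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, dim_feedforward=64)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
encoder.eval()  # disable dropout so that the output is deterministic

# By default the expected shape is (sequence length, batch size, d_model).
src = torch.rand(10, 3, d_model)
with torch.no_grad():
    out = encoder(src)
print(out.shape)
```

These modules operate on vectors. Token embeddings and positional encodings are not included and have to be added separately.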

# Contextual embeddings

(from Lecture 8) In GloVe and Word2vec representations, words have static representations, in other words, the same vector is assigned for every occurrence of the word.
But words can have different meanings in different contexts, e.g. the word 'stick':

1. Find some dry sticks and we'll make a campfire.
2. Let's stick with GloVe embeddings.

![elmo](http://jalammar.github.io/images/elmo-embedding-robin-williams.png)

_(Peters et al., 2018 in the ELMo paper)_

## ELMo

**E**mbeddings from **L**anguage **Mo**dels

Word representations are functions of the full sentences instead of the word alone.

Two bidirectional LSTM layers are linearly combined.

[Deep contextualized word representations](https://arxiv.org/abs/1802.05365) by Peters et al., 2018, 5200 citations

# BERT

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://www.aclweb.org/anthology/N19-1423/)
by Devlin et al. 2018, 12500 citations

[BERTology](https://huggingface.co/transformers/bertology.html) is the nickname for the growing amount of BERT-related research.

Trained on two tasks:

1. Masked language model:

    1. 15% of the wordpieces are selected at the beginning.
    2. 80% of those are replaced with `[MASK]`,
    3. 10% are replaced with a random token,
    4. 10% are kept intact.
    
2. Next sentence prediction:
    - Are sentences A and B consecutive sentences?
    - Training pairs are generated 50-50: in half of them B really follows A, in the other half B is a random sentence.
    - Binary classification task.
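
The masking scheme of the first task can be sketched as follows. This is a simplified illustration, not the original preprocessing code: the function name `mask_wordpieces`, the toy sentence and the toy vocabulary are made up, and special tokens are ignored.

```python
import random


def mask_wordpieces(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original wordpiece at the selected positions and
    None elsewhere, so a loss would only be computed on selected positions.
    """
    rng = random.Random(seed)
    n_select = max(1, round(len(tokens) * mask_prob))
    selected = set(rng.sample(range(len(tokens)), n_select))
    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        if i not in selected:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        r = rng.random()
        if r < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random wordpiece
        else:
            corrupted.append(tok)                # 10%: keep intact
    return corrupted, labels


tokens = "there are black cats and black dogs in the garden behind our old house .".split()
vocab = sorted(set(tokens))  # toy vocabulary
corrupted, labels = mask_wordpieces(tokens, vocab)
print(corrupted)
print(labels)
```

The model has to predict the original wordpiece at every selected position, including those that were kept intact or replaced with a random wordpiece.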

## Embedding layer

In [None]:
Image("img/dl/bert_embedding.png")

## Transformer layers


## Finetuning

1. Take a trained BERT model.
2. Add a small classification layer on top (typically a 2-layer MLP).
3. Train BERT along with the classification layer on an annotated dataset.
    - Much smaller than the data BERT was trained on

Another option: freeze BERT and train the classification layer only.
- Easier training regime.
- Smaller memory footprint.
- Worse performance.
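
Both options can be sketched in PyTorch. The example below is a hedged illustration only: `ClassifierOnEncoder` and `toy_encoder` are hypothetical names, and a tiny stand-in encoder is used so that the code runs without downloading a checkpoint. With a real BERT model the forward pass would take the tokenizer's output and read `last_hidden_state` instead.

```python
import torch
import torch.nn as nn


class ClassifierOnEncoder(nn.Module):
    """A small MLP head on top of an encoder that returns one vector per token."""

    def __init__(self, encoder, hidden_size, n_classes, freeze_encoder=False):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:
            # Frozen option: only the classification head receives gradients.
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        hidden = self.encoder(x)   # (batch, seq_len, hidden_size)
        first = hidden[:, 0]       # vector of the first position ([CLS] in BERT)
        return self.head(first)    # (batch, n_classes)


# Stand-in encoder (an assumption for this sketch, not a real BERT).
toy_encoder = nn.Sequential(nn.Linear(16, 16), nn.Tanh())
clf = ClassifierOnEncoder(toy_encoder, hidden_size=16, n_classes=2, freeze_encoder=True)

x = torch.rand(4, 7, 16)        # 4 "sentences", 7 positions, 16 features
y = torch.tensor([0, 1, 1, 0])  # made-up labels
trainable = [p for p in clf.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

# One training step on the toy batch.
loss = nn.functional.cross_entropy(clf(x), y)
loss.backward()
optimizer.step()
print(sum(p.numel() for p in trainable), "trainable parameters")
```

Setting `freeze_encoder=False` corresponds to full finetuning: the optimizer then also updates the encoder, which needs more memory because gradients are stored for every encoder parameter.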

In [None]:
Image("img/dl/bert_encoding_finetuning.png")

## BERT pretrained checkpoints

### BERT-Base

- 12 layers
- 12 attention heads per layer
- 768 hidden size
- 110M parameters

### BERT-Large

- 24 layers
- 16 attention heads per layer
- 1024 hidden size
- 340M parameters
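
The parameter counts follow from the layer and hidden sizes listed above. The sketch below reproduces the arithmetic under a few assumptions that are not stated in the lists: a vocabulary of 30522 wordpieces (the English uncased checkpoint), 512 positions, 2 segment types, a feed-forward size of 4 times the hidden size, and no task-specific head. The function name is made up for this example.

```python
def bert_param_count(n_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Count the parameters of a BERT encoder without any task-specific head."""
    ffn = 4 * hidden         # the feed-forward layer is 4 times wider than the hidden size
    layer_norm = 2 * hidden  # scale and bias

    # Token, position and segment embeddings followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections followed by a layer norm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two linear layers followed by a layer norm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    # Linear layer applied to the first token's vector.
    pooler = hidden * hidden + hidden

    return embeddings + n_layers * (attention + feed_forward) + pooler


print(f"BERT-Base:  {bert_param_count(12, 768) / 1e6:.1f}M")
print(f"BERT-Large: {bert_param_count(24, 1024) / 1e6:.1f}M")
```

The number of attention heads does not appear in the formula: the heads split the hidden size among themselves, so changing their number does not change the parameter count. The results land close to the rounded figures above.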

### Cased and uncased

Uncased: everything is lowercased. Diacritics are removed.

### Multilingual BERT - mBERT

A 104-language version trained on the 100 largest Wikipedias.

## BERT implementations

[Original Tensorflow implementation](https://github.com/google-research/bert)

[Huggingface Transformers](https://huggingface.co/transformers/)
- PyTorch implementation originally for BERT-only
- Now it supports dozens of other models
- Hundreds of other model checkpoints from the community

# BERT tokenization

## WordPiece tokenizer

BERT's input **must** be tokenized with BERT's own WordPiece tokenizer.
WordPiece is a middle ground between word and character tokenization.

Static vocabulary:
- Built with a method similar to byte-pair encoding, a simple frequency-based tokenization method
- Continuation symbols (\#\#symbol)
- Special tokens: `[CLS]`, `[SEP]`, `[MASK]`, `[UNK]`
- It tokenizes everything, falling back to characters and `[UNK]` if necessary
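
How a word is split given a fixed vocabulary can be sketched with a greedy longest-match-first search. This is a simplified pure-Python illustration, not the library's implementation, and the toy vocabulary is made up for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation symbol
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, made up for this example.
vocab = {"play", "##ing", "##ed", "b", "##e", "##a", "##g", "##l"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("beagle", vocab))   # ['b', '##e', '##a', '##g', '##l', '##e']
print(wordpiece_tokenize("cat", vocab))      # ['[UNK]']
```

Frequent words stay whole or split into a few long pieces, rare words fall back to characters, and a word with a character missing from the vocabulary becomes `[UNK]`.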

`AutoTokenizer` is a factory class for pretrained tokenizers. Each pretrained checkpoint has a string id. `from_pretrained` instantiates the corresponding tokenizer class and loads its vocabulary:

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-uncased')

t.tokenize("My beagle's name is Tündérke.")

In [None]:
t.tokenize("Русский")

**Cased** models keep diacritics:

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-cased')

t.tokenize("My beagle's name is Tündérke.")

It splits Chinese and Japanese text into characters but doesn't know all the characters:

In [None]:
t.tokenize("日本語")

Korean is missing from this version:

In [None]:
t.tokenize("한 한국어")

## mBERT tokenization

104 languages, 1 vocabulary

In [None]:
t = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

In [None]:
len(t.get_vocab())

In [None]:
t.tokenize("My beagle's name is Tündérke.")

In [None]:
t.tokenize("한 한국어")

In [None]:
t.tokenize("日本語")

# Using BERT

## Using `BertModel` directly

`AutoModel` is the corresponding factory class for pretrained models:
- Each pretrained checkpoint has a string id. `from_pretrained` instantiates the corresponding class and loads the weights:

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased')
type(model), type(tokenizer)

In [None]:
tokenizer.tokenize("There are black cats and black dogs.")

`__call__` returns a dictionary of BERT's encoding:

In [None]:
tokenizer("There are black cats and black dogs.")

It can be used for multiple sentences:

In [None]:
tokenizer(["There are black cats and black dogs.", "There are two white cats."])

We need tensors as inputs for BERT:

In [None]:
encoded = tokenizer("There are black cats and black dogs.", return_tensors='pt')
encoded['input_ids'].size()
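
When several sentences of different lengths go into one tensor, the shorter ones have to be padded and the padding has to be marked in the attention mask. The tokenizer does this when called with `padding=True`. The pure-Python sketch below only illustrates the idea: `pad_batch` is a hypothetical helper and the token ids are made up.

```python
def pad_batch(batch, pad_id=0):
    """Pad lists of token ids to equal length and build the attention mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding that should be ignored.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids standing in for two encoded sentences.
batch = [[101, 7, 8, 9, 10, 102], [101, 7, 8, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```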

In [None]:
output = model(**encoded, return_dict=True)
output.keys()

In [None]:
output['last_hidden_state'].size(), output['pooler_output'].size()

Getting all layers:

In [None]:
output = model(**encoded, output_hidden_states=True, return_dict=True)
output.keys()

In [None]:
len(output['hidden_states']), output['hidden_states'][0].size()

## BERT applications

### Sequence classification

Pretrained model for sentiment analysis.

Base model: `distilbert-base-uncased`

Finetuned on the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) or SST-2, a popular sentiment analysis dataset.

Model id: `distilbert-base-uncased-finetuned-sst-2-english`

In [None]:
nlp = pipeline("sentiment-analysis")
nlp("This is an amazing class.")

### Sequence tagging/labeling: Named entity recognition

Base model: `bert-large-cased`

Finetuned on [CoNLL-2003 NER](https://www.clips.uantwerpen.be/conll2003/ner/).

In [None]:
nlp = pipeline("ner")

In [None]:
result = nlp("jupiter is a Planet that orbits around James the center of the Universe")
result

### Machine translation

In [None]:
translator = pipeline("translation_en_to_fr")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

### Masked language modeling

Uses `distilroberta-base`

In [None]:
nlp = pipeline("fill-mask")

In [None]:
from pprint import pprint
pred = nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks.")
pprint(pred)

In [None]:
pred[0]['token_str']

# Other models

## Pretrained models

RoBERTa: identical architecture, larger training data, modified training objective (no next sentence prediction, dynamic masking)

DistilBERT: smaller version of BERT. It was _distilled_ or compressed from BERT with a student-teacher setup.

XLM-RoBERTa: multilingual version of RoBERTa

Distil-mBERT: distilled multilingual BERT
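
The student-teacher setup mentioned for DistilBERT can be illustrated with the soft-target loss of generic knowledge distillation. This is a hedged sketch of the general idea, not DistilBERT's exact training recipe. The function names, the temperature and the logits are made up for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature that flattens the distribution when above 1."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))


teacher = [4.0, 1.0, 0.5]  # made-up logits over three classes
print(distillation_loss([4.0, 1.0, 0.5], teacher))  # identical predictions: zero loss
print(distillation_loss([0.5, 1.0, 4.0], teacher))  # disagreement: positive loss
```

Minimizing this loss pushes the small student model to reproduce the full output distribution of the large teacher, not only its top prediction.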

## Community models

[Over 1000 community contributions](https://huggingface.co/models)

## huBERT

The first and so far only Hungarian-only model

BERT base, trained on Webcorpus 2.0, a version of CommonCrawl.

Its tokenizer works much better for Hungarian than mBERT's:

In [None]:
hubert_tokenizer = AutoTokenizer.from_pretrained('SZTAKI-HLT/hubert-base-cc')
# hubert = AutoModel.from_pretrained('SZTAKI-HLT/hubert-base-cc')

In [None]:
sent = ("George Clooney Magyarországról szóló, az Orbán-kormányt kritizáló levelére miniszteri és "
        "államtitkári szinten is reagált a magyar kormány.")
hubert_tokenizer.tokenize(sent)

In [None]:
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

In [None]:
bert_tokenizer.tokenize(sent)
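
One rough way to quantify the difference between two tokenizers is the average number of wordpieces per word. The helper below is a hypothetical sketch. It treats every piece without the `##` prefix as the start of a new word, so punctuation counts as a word too. The two segmentations are hand-written for illustration and are not actual tokenizer output.

```python
def pieces_per_word(wordpieces):
    """Average number of wordpieces per word, using ## as continuation marker."""
    n_words = sum(1 for p in wordpieces if not p.startswith("##"))
    return len(wordpieces) / n_words


# Hand-written segmentations of the same two words, for illustration only.
coarse = ["magyar", "kormány"]
fine = ["mag", "##yar", "ko", "##rm", "##ány"]
print(pieces_per_word(coarse))  # 1.0
print(pieces_per_word(fine))    # 2.5
```

Applied to the outputs of `hubert_tokenizer.tokenize(sent)` and `bert_tokenizer.tokenize(sent)`, a lower value means that words are kept in larger units.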

## GPT-2 text generation

Causal language modeling is when the $i^{th}$ token is modeled based on all the previous tokens, as opposed to masked language modeling, where both left and right context are used.
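
The difference between the two setups can be shown as an attention mask. In this small pure-Python sketch, the function name is made up, and row `i` marks the positions that the representation at position `i` may use as context.

```python
def visible_positions(n_tokens, causal):
    """mask[i][j] is 1 if position i may attend to position j."""
    return [[1 if (j <= i or not causal) else 0 for j in range(n_tokens)]
            for i in range(n_tokens)]


print("causal:")
for row in visible_positions(4, causal=True):
    print(row)

print("bidirectional:")
for row in visible_positions(4, causal=False):
    print(row)
```

With the causal mask, no position can see anything to its right, so the model can generate text one token at a time from left to right.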

In [None]:
text_generator = pipeline("text-generation")

In [None]:
print(text_generator("This is a serious issue we should address", max_length=50, do_sample=False)[0]['generated_text'])

# Further information

[Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- Famous blog post with a detailed gentle introduction to Transformers

[The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
- A walkthrough of the original Transformer paper with code and detailed illustrations

[Huggingface Transformers - Summary of tasks](https://huggingface.co/transformers/task_summary.html)

[My blog post about mBERT's tokenizer](http://juditacs.github.io/2019/02/19/bert-tokenization-stats.html)