
Tutorial 4: BERT, ELMo, and Flair Embeddings

Next to standard WordEmbeddings and CharacterEmbeddings, we also provide classes for BERT, ELMo and Flair embeddings. These embeddings enable you to train truly state-of-the-art NLP models.

This tutorial explains how to use these embeddings. We assume that you're familiar with the base types of this library as well as standard word embeddings, in particular the StackedEmbeddings class.

Embeddings

All word embedding classes inherit from the TokenEmbeddings class and implement the embed() method which you need to call to embed your text. This means that for most users of Flair, the complexity of different embeddings remains hidden behind this interface. Simply instantiate the embedding class you require and call embed() to embed your text.

All embeddings produced with our methods are PyTorch vectors, so they can be immediately used for training and fine-tuning.
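
As a quick illustration, here is a minimal sketch using the standard GloVe WordEmbeddings from the previous tutorial: after calling embed(), each token exposes its vector as a regular torch tensor.

from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# init a standard GloVe embedding (covered in the word embeddings tutorial)
glove_embedding = WordEmbeddings('glove')

# embed a sentence
sentence = Sentence('The grass is green .')
glove_embedding.embed(sentence)

# each token now carries a PyTorch tensor that can be fed directly into a model
for token in sentence:
    print(token.text, token.embedding.shape)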

Flair Embeddings

Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information beyond standard word embeddings. Key differences are: (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (2) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, same as standard word embeddings:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

You choose which embeddings you load by passing the appropriate string to the constructor of the FlairEmbeddings class. Currently, the following contextual string embeddings are provided (note: replace 'X' with either 'forward' or 'backward'):

| ID | Language | Embedding |
| --- | --- | --- |
| 'multi-X' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
| 'multi-X-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly |
| 'news-X' | English | Trained with 1 billion word corpus |
| 'news-X-fast' | English | Trained with 1 billion word corpus, CPU-friendly |
| 'mix-X' | English | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'ar-X' | Arabic | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'bg-X' | Bulgarian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'bg-X-fast' | Bulgarian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or SETimes) |
| 'cs-X' | Czech | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'cs-v0-X' | Czech | Added by @stefan-it: LM embeddings (earlier version) |
| 'de-X' | German | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'de-historic-ha-X' | German (historical) | Added by @stefan-it: Historical German trained over Hamburger Anzeiger |
| 'de-historic-wz-X' | German (historical) | Added by @stefan-it: Historical German trained over Wiener Zeitung |
| 'es-X' | Spanish | Added by @iamyihwa: Trained with Wikipedia |
| 'es-X-fast' | Spanish | Added by @iamyihwa: Trained with Wikipedia, CPU-friendly |
| 'eu-X' | Basque | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'eu-v0-X' | Basque | Added by @stefan-it: LM embeddings (earlier version) |
| 'fa-X' | Persian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'fi-X' | Finnish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'fr-X' | French | Added by @mhham: Trained with French Wikipedia |
| 'he-X' | Hebrew | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'hi-X' | Hindi | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'hr-X' | Croatian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'id-X' | Indonesian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'it-X' | Italian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'ja-X' | Japanese | Added by @frtacoa: Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers) |
| 'nl-X' | Dutch | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'nl-v0-X' | Dutch | Added by @stefan-it: LM embeddings (earlier version) |
| 'no-X' | Norwegian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'pl-X' | Polish | Added by @borchmann: Trained with web crawls (Polish part of CommonCrawl) |
| 'pl-opus-X' | Polish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'pt-X' | Portuguese | Added by @ericlief: LM embeddings |
| 'sl-X' | Slovenian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'sl-v0-X' | Slovenian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| 'sv-X' | Swedish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'sv-v0-X' | Swedish | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'pubmed-X' | English | Added by @jessepeng: Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers) |

So, if you want to load embeddings from the German forward LM model, instantiate the embeddings as follows:

flair_de_forward = FlairEmbeddings('de-forward')

And if you want to load embeddings from the Bulgarian backward LM model, instantiate them as follows:

flair_bg_backward = FlairEmbeddings('bg-backward')

Recommended Flair Usage

We recommend combining both forward and backward Flair embeddings. Depending on the task, we also recommend adding standard word embeddings into the mix. So, our recommended StackedEmbedding for most English tasks is:

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
                                        WordEmbeddings('glove'),
                                        FlairEmbeddings('news-forward'),
                                        FlairEmbeddings('news-backward'),
                                       ])

That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Words are now embedded using a concatenation of three different embeddings. This combination often gives state-of-the-art accuracy.
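
As a quick sanity check, a sketch assuming the embedding_length attribute exposed by Flair embedding classes: the dimensionality of each token vector equals the combined length of the stacked embeddings.

# the stacked embedding length is the sum of the individual embedding lengths
print(stacked_embeddings.embedding_length)

# each token's concatenated vector has exactly that size
for token in sentence:
    assert token.embedding.size(0) == stacked_embeddings.embedding_length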

PyTorch-Transformers

Thanks to the brilliant pytorch-transformers library from Hugging Face, Flair is able to support various Transformer-based architectures like BERT or XLNet.

The following embeddings can be used in Flair:

  • BertEmbeddings
  • OpenAIGPTEmbeddings
  • OpenAIGPT2Embeddings
  • TransformerXLEmbeddings
  • XLNetEmbeddings
  • XLMEmbeddings
  • RoBERTaEmbeddings

This section shows how to use these Transformer-based architectures in Flair and is heavily based on the excellent [PyTorch-Transformers pre-trained models documentation](https://huggingface.co/pytorch-transformers/pretrained_models.html).

BERT Embeddings

BERT embeddings were developed by Devlin et al. (2018) and are a different kind of powerful word embedding based on a bidirectional transformer architecture. The embeddings themselves are wrapped into our simple embedding interface, so that they can be used like any other embedding.

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# init embedding
embedding = BertEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

The BertEmbeddings class has several arguments:

| Argument | Default | Description |
| --- | --- | --- |
| bert_model_or_path | bert-base-uncased | Defines the BERT model or points to a user-defined path |
| layers | -1,-2,-3,-4 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

You can load any of the pre-trained BERT models by providing bert_model_or_path during initialization:

| Model | Details |
| --- | --- |
| bert-base-uncased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text |
| bert-large-uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text |
| bert-base-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased English text |
| bert-large-cased | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text |
| bert-base-multilingual-uncased | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased text in the top 102 languages with the largest Wikipedias (see details) |
| bert-base-multilingual-cased | (New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased text in the top 104 languages with the largest Wikipedias (see details) |
| bert-base-chinese | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Chinese Simplified and Traditional text |
| bert-base-german-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by Deepset.ai (see details on deepset.ai website) |
| bert-large-uncased-whole-word-masking | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text using Whole-Word-Masking (see details) |
| bert-large-cased-whole-word-masking | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text using Whole-Word-Masking (see details) |
| bert-large-uncased-whole-word-masking-finetuned-squad | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD (see details of fine-tuning in the example section of PyTorch-Transformers) |
| bert-large-cased-whole-word-masking-finetuned-squad | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-cased-whole-word-masking model fine-tuned on SQuAD (see details of fine-tuning in the example section) |
| bert-base-cased-finetuned-mrpc | 12-layer, 768-hidden, 12-heads, 110M parameters. The bert-base-cased model fine-tuned on MRPC (see details of fine-tuning in the example section of PyTorch-Transformers) |
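
Putting these arguments together, the following sketch loads the recommended multilingual model and pools subwords with mean pooling. The argument values are taken from the tables above; treat the exact combination as illustrative rather than prescriptive.

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# load the recommended multilingual model and average subword embeddings per token
embedding = BertEmbeddings(bert_model_or_path='bert-base-multilingual-cased',
                           layers='-1,-2,-3,-4',
                           pooling_operation='mean')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)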

OpenAI GPT Embeddings

The OpenAI GPT model was proposed by Radford et al. (2018). GPT is a uni-directional Transformer-based model.

The following example shows how to use the OpenAIGPTEmbeddings:

from flair.data import Sentence
from flair.embeddings import OpenAIGPTEmbeddings

# init embedding
embedding = OpenAIGPTEmbeddings()

# create a sentence
sentence = Sentence('Berlin and Munich are nice cities .')

# embed words in sentence
embedding.embed(sentence)

The OpenAIGPTEmbeddings class has several arguments:

| Argument | Default | Description |
| --- | --- | --- |
| model | openai-gpt | Defines the GPT model |
| layers | 1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first_last | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

OpenAI GPT-2 Embeddings

The OpenAI GPT-2 model was proposed by [Radford et al. (2019)](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). GPT-2 is also a uni-directional Transformer-based model that was trained on a larger corpus than the GPT model.

The GPT-2 model can be used with the OpenAIGPT2Embeddings class:

from flair.data import Sentence
from flair.embeddings import OpenAIGPT2Embeddings

# init embedding
embedding = OpenAIGPT2Embeddings()

# create a sentence
sentence = Sentence('The Englischer Garten is a large public park in the centre of Munich .')

# embed words in sentence
embedding.embed(sentence)

The OpenAIGPT2Embeddings class has several arguments:

| Argument | Default | Description |
| --- | --- | --- |
| model | gpt2-medium | Defines the GPT-2 model |
| layers | 1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first_last | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

The following GPT-2 models can be used:

| Model | Details |
| --- | --- |
| gpt2 | 12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model |
| gpt2-medium | 24-layer, 1024-hidden, 16-heads, 345M parameters. OpenAI's Medium-sized GPT-2 English model |
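
To use the smaller gpt2 checkpoint instead of the default gpt2-medium, pass its name as the first constructor argument (a minimal sketch based on the model table above):

from flair.data import Sentence
from flair.embeddings import OpenAIGPT2Embeddings

# use the smaller GPT-2 checkpoint instead of the default gpt2-medium
embedding = OpenAIGPT2Embeddings('gpt2')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)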

Transformer-XL Embeddings

The Transformer-XL model was proposed by Dai et al. (2019). It is a uni-directional Transformer-based model with relative positional embeddings.

The Transformer-XL model can be used with the TransformerXLEmbeddings class:

from flair.data import Sentence
from flair.embeddings import TransformerXLEmbeddings

# init embedding
embedding = TransformerXLEmbeddings()

# create a sentence
sentence = Sentence('The Berlin Zoological Garden is the oldest and best-known zoo in Germany .')

# embed words in sentence
embedding.embed(sentence)

The following arguments can be passed to the TransformerXLEmbeddings class:

| Argument | Default | Description |
| --- | --- | --- |
| model | transfo-xl-wt103 | Defines the Transformer-XL model |
| layers | 1,2,3 | Defines which layers of the Transformer-based model to use |
| use_scalar_mix | False | See Scalar mix section |

Notice: The Transformer-XL model (trained on WikiText-103) is a word-based language model. Thus, no subword tokenization is needed and the pooling_operation argument does not apply.

XLNet Embeddings

The XLNet model was proposed by Yang et al. (2019). It is an extension of the Transformer-XL model that uses an autoregressive method to learn bi-directional contexts.

The XLNet model can be used with the XLNetEmbeddings class:

from flair.data import Sentence
from flair.embeddings import XLNetEmbeddings

# init embedding
embedding = XLNetEmbeddings()

# create a sentence
sentence = Sentence('The Hofbräuhaus is a beer hall in Munich .')

# embed words in sentence
embedding.embed(sentence)

The following arguments can be passed to the XLNetEmbeddings class:

| Argument | Default | Description |
| --- | --- | --- |
| model | xlnet-large-cased | Defines the XLNet model |
| layers | 1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first_last | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

The following XLNet models can be used:

| Model | Details |
| --- | --- |
| xlnet-base-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model |
| xlnet-large-cased | 24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English model |

XLM Embeddings

The XLM model was proposed by Lample and Conneau (2019). It extends the generative pre-training approach for English to multiple languages and shows the effectiveness of cross-lingual pretraining.

The XLM model can be used with the XLMEmbeddings class:

from flair.data import Sentence
from flair.embeddings import XLMEmbeddings

# init embedding
embedding = XLMEmbeddings()

# create a sentence
sentence = Sentence('The BER is an international airport under construction near Berlin .')

# embed words in sentence
embedding.embed(sentence)

The following arguments can be passed to the XLMEmbeddings class:

| Argument | Default | Description |
| --- | --- | --- |
| model | xlm-mlm-en-2048 | Defines the XLM model |
| layers | 1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first_last | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

The following XLM models can be used:

| Model | Details |
| --- | --- |
| xlm-mlm-en-2048 | 12-layer, 1024-hidden, 8-heads. XLM English model |
| xlm-mlm-ende-1024 | 12-layer, 1024-hidden, 8-heads. XLM English-German multi-language model |
| xlm-mlm-enfr-1024 | 12-layer, 1024-hidden, 8-heads. XLM English-French multi-language model |
| xlm-mlm-enro-1024 | 12-layer, 1024-hidden, 8-heads. XLM English-Romanian multi-language model |
| xlm-mlm-xnli15-1024 | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM on the 15 XNLI languages |
| xlm-mlm-tlm-xnli15-1024 | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM + TLM on the 15 XNLI languages |
| xlm-clm-enfr-1024 | 12-layer, 1024-hidden, 8-heads. XLM English model trained with CLM (Causal Language Modeling) |
| xlm-clm-ende-1024 | 12-layer, 1024-hidden, 8-heads. XLM English-German multi-language model trained with CLM (Causal Language Modeling) |
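
To select one of these checkpoints, pass its name as the first constructor argument (a minimal sketch; the English-German model is chosen here purely for illustration):

from flair.data import Sentence
from flair.embeddings import XLMEmbeddings

# pick the English-German multi-language checkpoint instead of the default
embedding = XLMEmbeddings('xlm-mlm-ende-1024')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)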

RoBERTa Embeddings

The RoBERTa (Robustly optimized BERT pre-training approach) model was proposed by Liu et al. (2019), and uses an improved pre-training procedure to train a BERT model on a large corpus.

It can be used with the RoBERTaEmbeddings class:

from flair.data import Sentence
from flair.embeddings import RoBERTaEmbeddings

# init embedding
embedding = RoBERTaEmbeddings()

# create a sentence
sentence = Sentence("The Oktoberfest is the world's largest Volksfest .")

# embed words in sentence
embedding.embed(sentence)

The following arguments can be passed to the RoBERTaEmbeddings class:

| Argument | Default | Description |
| --- | --- | --- |
| model | roberta.large | Defines the RoBERTa model |
| layers | -1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

The following RoBERTa models can be used:

| Model | Details |
| --- | --- |
| roberta.base | 12-layer, 768-hidden, 12-heads. RoBERTa English model |
| roberta.large | 24-layer, 1024-hidden, 16-heads. RoBERTa English model |
| roberta.large.mnli | 24-layer, 1024-hidden, 16-heads. RoBERTa English model, fine-tuned on MNLI |

Pooling operation

Most of the Transformer-based models (except Transformer-XL) use subword tokenization. E.g. the following token puppeteer could be tokenized into the subwords: pupp, ##ete and ##er.

We implement different pooling operations for these subwords to generate the final token representation:

  • first: only the embedding of the first subword is used
  • last: only the embedding of the last subword is used
  • first_last: embeddings of the first and last subwords are concatenated and used
  • mean: a torch.mean over all subword embeddings is calculated and used
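
For example, to average all subword embeddings into one vector per token, pass pooling_operation="mean" to the embedding class. This is a minimal sketch using BertEmbeddings, whose arguments are listed above.

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# average all subword embeddings into a single vector per token
embedding = BertEmbeddings(pooling_operation='mean')

sentence = Sentence('The puppeteer is skilled .')
embedding.embed(sentence)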

Scalar mix

The Transformer-based models have a certain number of layers. Liu et al. (2019) propose a technique called scalar mix, which computes a parameterised scalar mixture of user-defined layers.

This technique is very useful, because for some downstream tasks like NER or PoS tagging it can be unclear which layer(s) of a Transformer-based model perform well, and per-layer analysis can take a lot of time.

To use scalar mix, all Transformer-based embeddings in Flair come with a use_scalar_mix argument. The following example shows how to use scalar mix for a base RoBERTa model on all layers:

from flair.data import Sentence
from flair.embeddings import RoBERTaEmbeddings

# init embedding
embedding = RoBERTaEmbeddings(model="roberta.base", layers="0,1,2,3,4,5,6,7,8,9,10,11,12",
                              pooling_operation="first", use_scalar_mix=True)

# create a sentence
sentence = Sentence("The Oktoberfest is the world's largest Volksfest .")

# embed words in sentence
embedding.embed(sentence)

ELMo Embeddings

ELMo embeddings were presented by Peters et al. in 2018. They use a bidirectional recurrent neural network to predict the next word in a text. We use the implementation of AllenNLP. Because this implementation comes with a lot of sub-dependencies that we don't want to include in Flair, you first need to install the library via pip install allennlp before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:

from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# init embedding
embedding = ELMoEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

AllenNLP provides the following pre-trained models. To use any of them inside Flair, simply specify the embedding ID when initializing the ELMoEmbeddings.

| ID | Language | Embedding |
| --- | --- | --- |
| 'small' | English | 1024-hidden, 1 layer, 14.6M parameters |
| 'medium' | English | 2048-hidden, 1 layer, 28.0M parameters |
| 'original' | English | 4096-hidden, 2 layers, 93.6M parameters |
| 'pt' | Portuguese | |
| 'pubmed' | English | Biomedical data (more information) |
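
To pick one of these models, pass its ID as the first constructor argument (a minimal sketch; the 'small' model is chosen here purely for illustration):

from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# use the small English ELMo model instead of the default
embedding = ELMoEmbeddings('small')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)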

Combining BERT and Flair

You can very easily mix and match Flair, ELMo, BERT and classic word embeddings. All you need to do is instantiate each embedding you wish to combine and use them in a StackedEmbedding.

For instance, let's say we want to combine the multilingual Flair and BERT embeddings to train a hyper-powerful multilingual downstream task model.

First, instantiate the embeddings you wish to combine:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, BertEmbeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')

Now instantiate the StackedEmbeddings class and pass it a list containing these three embeddings.

from flair.embeddings import StackedEmbeddings

# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])

That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Words are now embedded using a concatenation of three different embeddings. This means that the resulting embedding vector is still a single PyTorch vector.

Next

You can now either look into document embeddings to embed entire text passages with one vector for tasks such as text classification, or go directly to the tutorial about loading your corpus, which is a prerequisite for training your own models.