
Tutorial 4: BERT, ELMo, and Flair Embeddings

Next to standard WordEmbeddings and CharacterEmbeddings, we also provide classes for BERT, ELMo and Flair embeddings. These embeddings enable you to train truly state-of-the-art NLP models.

This tutorial explains how to use these embeddings. We assume that you're familiar with the base types of this library as well as standard word embeddings, in particular the StackedEmbeddings class.

Embeddings

All word embedding classes inherit from the TokenEmbeddings class and implement the embed() method which you need to call to embed your text. This means that for most users of Flair, the complexity of different embeddings remains hidden behind this interface. Simply instantiate the embedding class you require and call embed() to embed your text.

All embeddings produced with our methods are PyTorch vectors, so they can be immediately used for training and fine-tuning.
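
As a quick illustration, here is a minimal sketch using the standard GloVe WordEmbeddings from the previous tutorial: after calling embed(), each token exposes its vector as a regular torch tensor.

from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# init a standard GloVe embedding (covered in the word embeddings tutorial)
glove_embedding = WordEmbeddings('glove')

# embed a sentence
sentence = Sentence('The grass is green .')
glove_embedding.embed(sentence)

# each token now carries a PyTorch tensor that can be fed directly into a model
for token in sentence:
    print(token.text, token.embedding.shape)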

Flair Embeddings

Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information beyond standard word embeddings. Key differences are: (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (2) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, same as standard word embeddings:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

You choose which embeddings you load by passing the appropriate string to the constructor of the FlairEmbeddings class. Currently, the following contextual string embeddings are provided (note: replace 'X' with either 'forward' or 'backward'):

| ID | Language | Embedding |
| --- | --- | --- |
| 'multi-X' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News) |
| 'multi-X-fast' | English, German, French, Italian, Dutch, Polish | Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly |
| 'news-X' | English | Trained with 1 billion word corpus |
| 'news-X-fast' | English | Trained with 1 billion word corpus, CPU-friendly |
| 'mix-X' | English | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'ar-X' | Arabic | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'bg-X' | Bulgarian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'bg-X-fast' | Bulgarian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or SETimes) |
| 'cs-X' | Czech | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'cs-v0-X' | Czech | Added by @stefan-it: LM embeddings (earlier version) |
| 'de-X' | German | Trained with mixed corpus (Web, Wikipedia, Subtitles) |
| 'de-historic-ha-X' | German (historical) | Added by @stefan-it: Historical German trained over Hamburger Anzeiger |
| 'de-historic-wz-X' | German (historical) | Added by @stefan-it: Historical German trained over Wiener Zeitung |
| 'es-X' | Spanish | Added by @iamyihwa: Trained with Wikipedia |
| 'es-X-fast' | Spanish | Added by @iamyihwa: Trained with Wikipedia, CPU-friendly |
| 'eu-X' | Basque | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'eu-v0-X' | Basque | Added by @stefan-it: LM embeddings (earlier version) |
| 'fa-X' | Persian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'fi-X' | Finnish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'fr-X' | French | Added by @mhham: Trained with French Wikipedia |
| 'he-X' | Hebrew | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'hi-X' | Hindi | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'hr-X' | Croatian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'id-X' | Indonesian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'it-X' | Italian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'ja-X' | Japanese | Added by @frtacoa: Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers) |
| 'nl-X' | Dutch | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'nl-v0-X' | Dutch | Added by @stefan-it: LM embeddings (earlier version) |
| 'no-X' | Norwegian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'pl-X' | Polish | Added by @borchmann: Trained with web crawls (Polish part of CommonCrawl) |
| 'pl-opus-X' | Polish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'pt-X' | Portuguese | Added by @ericlief: LM embeddings |
| 'sl-X' | Slovenian | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'sl-v0-X' | Slovenian | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018) |
| 'sv-X' | Swedish | Added by @stefan-it: Trained with Wikipedia/OPUS |
| 'sv-v0-X' | Swedish | Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018) |
| 'pubmed-X' | English | Added by @jessepeng: Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers) |

So, if you want to load embeddings from the German forward LM model, instantiate the embeddings as follows:

flair_de_forward = FlairEmbeddings('de-forward')

And if you want to load embeddings from the Bulgarian backward LM model, instantiate them as follows:

flair_bg_backward = FlairEmbeddings('bg-backward')

Recommended Flair Usage

We recommend combining both forward and backward Flair embeddings. Depending on the task, we also recommend adding standard word embeddings into the mix. So, our recommended StackedEmbedding for most English tasks is:

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# create a StackedEmbedding object that combines glove and forward/backward flair embeddings
stacked_embeddings = StackedEmbeddings([
                                        WordEmbeddings('glove'),
                                        FlairEmbeddings('news-forward'),
                                        FlairEmbeddings('news-backward'),
                                       ])

That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Words are now embedded using a concatenation of three different embeddings. This combination often gives state-of-the-art accuracy.
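
As a quick sanity check, a sketch assuming the embedding_length attribute exposed by Flair embedding classes: the dimensionality of each token vector equals the combined length of the stacked embeddings.

# the stacked embedding length is the sum of the individual embedding lengths
print(stacked_embeddings.embedding_length)

# each token's concatenated vector has exactly that size
for token in sentence:
    assert token.embedding.size(0) == stacked_embeddings.embedding_length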

PyTorch-Transformers

Thanks to the brilliant pytorch-transformers library from Hugging Face, Flair is able to support various Transformer-based architectures like BERT or XLNet.

The following embeddings can be used in Flair:

  • BertEmbeddings
  • OpenAIGPTEmbeddings
  • OpenAIGPT2Embeddings
  • TransformerXLEmbeddings
  • XLNetEmbeddings
  • XLMEmbeddings
  • RoBERTaEmbeddings

This section shows how to use these Transformer-based architectures in Flair and is heavily based on the excellent [PyTorch-Transformers pre-trained models documentation](https://huggingface.co/pytorch-transformers/pretrained_models.html).

BERT Embeddings

BERT embeddings were developed by Devlin et al. (2018) and are a different kind of powerful word embedding based on a bidirectional transformer architecture. The embeddings themselves are wrapped into our simple embedding interface, so that they can be used like any other embedding.

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# init embedding
embedding = BertEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

The BertEmbeddings class has several arguments:

| Argument | Default | Description |
| --- | --- | --- |
| bert_model_or_path | bert-base-uncased | Defines the BERT model or points to a user-defined path |
| layers | -1,-2,-3,-4 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

You can load any of the pre-trained BERT models by providing bert_model_or_path during initialization:

| Model | Details |
| --- | --- |
| bert-base-uncased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased English text |
| bert-large-uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text |
| bert-base-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased English text |
| bert-large-cased | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text |
| bert-base-multilingual-uncased | (Original, not recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on lower-cased text in the top 102 languages with the largest Wikipedias (see details) |
| bert-base-multilingual-cased | (New, recommended) 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased text in the top 104 languages with the largest Wikipedias (see details) |
| bert-base-chinese | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased Chinese Simplified and Traditional text |
| bert-base-german-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. Trained on cased German text by Deepset.ai (see details on deepset.ai website) |
| bert-large-uncased-whole-word-masking | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on lower-cased English text using Whole-Word-Masking (see details) |
| bert-large-cased-whole-word-masking | 24-layer, 1024-hidden, 16-heads, 340M parameters. Trained on cased English text using Whole-Word-Masking (see details) |
| bert-large-uncased-whole-word-masking-finetuned-squad | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-uncased-whole-word-masking model fine-tuned on SQuAD (see details of fine-tuning in the example section of PyTorch-Transformers) |
| bert-large-cased-whole-word-masking-finetuned-squad | 24-layer, 1024-hidden, 16-heads, 340M parameters. The bert-large-cased-whole-word-masking model fine-tuned on SQuAD (see details of fine-tuning in the example section) |
| bert-base-cased-finetuned-mrpc | 12-layer, 768-hidden, 12-heads, 110M parameters. The bert-base-cased model fine-tuned on MRPC (see details of fine-tuning in the example section of PyTorch-Transformers) |
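
Putting these arguments together, the following sketch loads the recommended multilingual model and pools subwords with mean pooling. The argument values are taken from the tables above; treat the exact combination as illustrative rather than prescriptive.

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# load the recommended multilingual model and average subword embeddings per token
embedding = BertEmbeddings(bert_model_or_path='bert-base-multilingual-cased',
                           layers='-1,-2,-3,-4',
                           pooling_operation='mean')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)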

OpenAI GPT Embeddings

The OpenAI GPT model was proposed by Radford et al. (2018). GPT is a uni-directional Transformer-based model.

The following example shows how to use the OpenAIGPTEmbeddings:

from flair.data import Sentence
from flair.embeddings import OpenAIGPTEmbeddings

# init embedding
embedding = OpenAIGPTEmbeddings()

# create a sentence
sentence = Sentence('Berlin and Munich are nice cities .')

# embed words in sentence
embedding.embed(sentence)

The OpenAIGPTEmbeddings class has several arguments:

| Argument | Default | Description |
| --- | --- | --- |
| model | openai-gpt | Defines the GPT model |
| layers | 1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first_last | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

OpenAI GPT-2 Embeddings

The OpenAI GPT-2 model was proposed by [Radford et al. (2019)](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). GPT-2 is also a uni-directional Transformer-based model that was trained on a larger corpus than the GPT model.

The GPT-2 model can be used with the OpenAIGPT2Embeddings class:

from flair.data import Sentence
from flair.embeddings import OpenAIGPT2Embeddings

# init embedding
embedding = OpenAIGPT2Embeddings()

# create a sentence
sentence = Sentence('The Englischer Garten is a large public park in the centre of Munich .')

# embed words in sentence
embedding.embed(sentence)

The OpenAIGPT2Embeddings class has several arguments:

| Argument | Default | Description |
| --- | --- | --- |
| model | gpt2-medium | Defines the GPT-2 model |
| layers | 1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first_last | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

The following GPT-2 models can be used:

| Model | Details |
| --- | --- |
| gpt2 | 12-layer, 768-hidden, 12-heads, 117M parameters. OpenAI GPT-2 English model |
| gpt2-medium | 24-layer, 1024-hidden, 16-heads, 345M parameters. OpenAI's Medium-sized GPT-2 English model |
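
To use the smaller gpt2 checkpoint instead of the default gpt2-medium, pass its name as the first constructor argument (a minimal sketch based on the model table above):

from flair.data import Sentence
from flair.embeddings import OpenAIGPT2Embeddings

# use the smaller GPT-2 checkpoint instead of the default gpt2-medium
embedding = OpenAIGPT2Embeddings('gpt2')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)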

Transformer-XL Embeddings

The Transformer-XL model was proposed by Dai et al. (2019). It is a uni-directional Transformer-based model with relative positional embeddings.

The Transformer-XL model can be used with the TransformerXLEmbeddings class:

from flair.data import Sentence
from flair.embeddings import TransformerXLEmbeddings

# init embedding
embedding = TransformerXLEmbeddings()

# create a sentence
sentence = Sentence('The Berlin Zoological Garden is the oldest and best-known zoo in Germany .')

# embed words in sentence
embedding.embed(sentence)

The following arguments can be passed to the TransformerXLEmbeddings class:

| Argument | Default | Description |
| --- | --- | --- |
| model | transfo-xl-wt103 | Defines the Transformer-XL model |
| layers | 1,2,3 | Defines which layers of the Transformer-based model to use |
| use_scalar_mix | False | See Scalar mix section |

Notice: The Transformer-XL model (trained on WikiText-103) is a word-based language model. Thus, no subword tokenization is needed and the pooling_operation argument does not apply.

XLNet Embeddings

The XLNet model was proposed by Yang et al. (2019). It is an extension of the Transformer-XL model that uses an autoregressive method to learn bi-directional contexts.

The XLNet model can be used with the XLNetEmbeddings class:

from flair.data import Sentence
from flair.embeddings import XLNetEmbeddings

# init embedding
embedding = XLNetEmbeddings()

# create a sentence
sentence = Sentence('The Hofbräuhaus is a beer hall in Munich .')

# embed words in sentence
embedding.embed(sentence)

The following arguments can be passed to the XLNetEmbeddings class:

| Argument | Default | Description |
| --- | --- | --- |
| model | xlnet-large-cased | Defines the XLNet model |
| layers | 1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first_last | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

The following XLNet models can be used:

| Model | Details |
| --- | --- |
| xlnet-base-cased | 12-layer, 768-hidden, 12-heads, 110M parameters. XLNet English model |
| xlnet-large-cased | 24-layer, 1024-hidden, 16-heads, 340M parameters. XLNet Large English model |

XLM Embeddings

The XLM model was proposed by Lample and Conneau (2019). It extends the generative pre-training approach for English to multiple languages and shows the effectiveness of cross-lingual pretraining.

The XLM model can be used with the XLMEmbeddings class:

from flair.data import Sentence
from flair.embeddings import XLMEmbeddings

# init embedding
embedding = XLMEmbeddings()

# create a sentence
sentence = Sentence('The BER is an international airport under construction near Berlin .')

# embed words in sentence
embedding.embed(sentence)

The following arguments can be passed to the XLMEmbeddings class:

| Argument | Default | Description |
| --- | --- | --- |
| model | xlm-mlm-en-2048 | Defines the XLM model |
| layers | 1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first_last | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

The following XLM models can be used:

| Model | Details |
| --- | --- |
| xlm-mlm-en-2048 | 12-layer, 1024-hidden, 8-heads. XLM English model |
| xlm-mlm-ende-1024 | 12-layer, 1024-hidden, 8-heads. XLM English-German multi-language model |
| xlm-mlm-enfr-1024 | 12-layer, 1024-hidden, 8-heads. XLM English-French multi-language model |
| xlm-mlm-enro-1024 | 12-layer, 1024-hidden, 8-heads. XLM English-Romanian multi-language model |
| xlm-mlm-xnli15-1024 | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM on the 15 XNLI languages |
| xlm-mlm-tlm-xnli15-1024 | 12-layer, 1024-hidden, 8-heads. XLM model pre-trained with MLM + TLM on the 15 XNLI languages |
| xlm-clm-enfr-1024 | 12-layer, 1024-hidden, 8-heads. XLM English model trained with CLM (Causal Language Modeling) |
| xlm-clm-ende-1024 | 12-layer, 1024-hidden, 8-heads. XLM English-German multi-language model trained with CLM (Causal Language Modeling) |
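
To select one of these checkpoints, pass its name as the first constructor argument (a minimal sketch; the English-German model is chosen here purely for illustration):

from flair.data import Sentence
from flair.embeddings import XLMEmbeddings

# pick the English-German multi-language checkpoint instead of the default
embedding = XLMEmbeddings('xlm-mlm-ende-1024')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)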

RoBERTa Embeddings

The RoBERTa (Robustly optimized BERT pre-training approach) model was proposed by Liu et al. (2019), and uses an improved pre-training procedure to train a BERT model on a large corpus.

It can be used with the RoBERTaEmbeddings class:

from flair.data import Sentence
from flair.embeddings import RoBERTaEmbeddings

# init embedding
embedding = RoBERTaEmbeddings()

# create a sentence
sentence = Sentence("The Oktoberfest is the world's largest Volksfest .")

# embed words in sentence
embedding.embed(sentence)

The following arguments can be passed to the RoBERTaEmbeddings class:

| Argument | Default | Description |
| --- | --- | --- |
| model | roberta.large | Defines the RoBERTa model |
| layers | -1 | Defines which layers of the Transformer-based model to use |
| pooling_operation | first | See Pooling operation section |
| use_scalar_mix | False | See Scalar mix section |

The following RoBERTa models can be used:

| Model | Details |
| --- | --- |
| roberta.base | 12-layer, 768-hidden, 12-heads. RoBERTa English model |
| roberta.large | 24-layer, 1024-hidden, 16-heads. RoBERTa English model |
| roberta.large.mnli | 24-layer, 1024-hidden, 16-heads. RoBERTa English model, fine-tuned on MNLI |

Pooling operation

Most of the Transformer-based models (except Transformer-XL) use subword tokenization. E.g. the following token puppeteer could be tokenized into the subwords: pupp, ##ete and ##er.

We implement different pooling operations for these subwords to generate the final token representation:

  • first: only the embedding of the first subword is used
  • last: only the embedding of the last subword is used
  • first_last: embeddings of the first and last subwords are concatenated and used
  • mean: a torch.mean over all subword embeddings is calculated and used
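
For example, to average all subword embeddings into one vector per token, pass pooling_operation="mean" to the embedding class. This is a minimal sketch using BertEmbeddings, whose arguments are listed above.

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

# average all subword embeddings into a single vector per token
embedding = BertEmbeddings(pooling_operation='mean')

sentence = Sentence('The puppeteer is skilled .')
embedding.embed(sentence)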

Scalar mix

The Transformer-based models have a certain number of layers. Liu et al. (2019) propose a technique called scalar mix, which computes a parameterised scalar mixture of user-defined layers.

This technique is very useful, because for some downstream tasks like NER or PoS tagging it can be unclear which layer(s) of a Transformer-based model perform well, and per-layer analysis can take a lot of time.

To use scalar mix, all Transformer-based embeddings in Flair come with a use_scalar_mix argument. The following example shows how to use scalar mix for a base RoBERTa model on all layers:

from flair.data import Sentence
from flair.embeddings import RoBERTaEmbeddings

# init embedding
embedding = RoBERTaEmbeddings(model="roberta.base", layers="0,1,2,3,4,5,6,7,8,9,10,11,12",
                              pooling_operation="first", use_scalar_mix=True)

# create a sentence
sentence = Sentence("The Oktoberfest is the world's largest Volksfest .")

# embed words in sentence
embedding.embed(sentence)

ELMo Embeddings

ELMo embeddings were presented by Peters et al. in 2018. They use a bidirectional recurrent neural network to predict the next word in a text. We use the implementation of AllenNLP. Because this implementation comes with a lot of sub-dependencies that we don't want to include in Flair, you first need to install the library via pip install allennlp before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:

from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# init embedding
embedding = ELMoEmbeddings()

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

AllenNLP provides the following pre-trained models. To use any of them inside Flair, simply specify the embedding ID when initializing the ELMoEmbeddings.

| ID | Language | Embedding |
| --- | --- | --- |
| 'small' | English | 1024-hidden, 1 layer, 14.6M parameters |
| 'medium' | English | 2048-hidden, 1 layer, 28.0M parameters |
| 'original' | English | 4096-hidden, 2 layers, 93.6M parameters |
| 'pt' | Portuguese | |
| 'pubmed' | English | Biomedical data (more information) |
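
To pick one of these models, pass its ID as the first constructor argument (a minimal sketch; the 'small' model is chosen here purely for illustration):

from flair.data import Sentence
from flair.embeddings import ELMoEmbeddings

# use the small English ELMo model instead of the default
embedding = ELMoEmbeddings('small')

sentence = Sentence('The grass is green .')
embedding.embed(sentence)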

Combining BERT and Flair

You can very easily mix and match Flair, ELMo, BERT and classic word embeddings. All you need to do is instantiate each embedding you wish to combine and use them in a StackedEmbedding.

For instance, let's say we want to combine the multilingual Flair and BERT embeddings to train a hyper-powerful multilingual downstream task model.

First, instantiate the embeddings you wish to combine:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, BertEmbeddings

# init Flair embeddings
flair_forward_embedding = FlairEmbeddings('multi-forward')
flair_backward_embedding = FlairEmbeddings('multi-backward')

# init multilingual BERT
bert_embedding = BertEmbeddings('bert-base-multilingual-cased')

Now instantiate the StackedEmbeddings class and pass it a list containing these three embeddings.

from flair.embeddings import StackedEmbeddings

# now create the StackedEmbedding object that combines all embeddings
stacked_embeddings = StackedEmbeddings(
    embeddings=[flair_forward_embedding, flair_backward_embedding, bert_embedding])

That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbedding as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Words are now embedded using a concatenation of three different embeddings. This means that the resulting embedding vector is still a single PyTorch vector.

Next

You can now either look into document embeddings to embed entire text passages with one vector for tasks such as text classification, or go directly to the tutorial about loading your corpus, which is a prerequisite for training your own models.