Flair Embeddings

Contextual string embeddings are powerful embeddings that capture latent syntactic-semantic information that goes beyond standard word embeddings. They differ in two key ways: (1) they are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (2) they are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.

With Flair, you can use these embeddings simply by instantiating the appropriate embedding class, same as standard word embeddings:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

# init embedding
flair_embedding_forward = FlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

You choose which embeddings you load by passing the appropriate string to the constructor of the FlairEmbeddings class. Currently, the following contextual string embeddings are provided (note: replace 'X' with either 'forward' or 'backward'):

ID	Language	Embedding
'multi-X'	300+	JW300 corpus, as proposed by Agić and Vulić (2019). The corpus is licensed under CC-BY-NC-SA
'multi-X-fast'	English, German, French, Italian, Dutch, Polish	Mix of corpora (Web, Wikipedia, Subtitles, News), CPU-friendly
'news-X'	English	Trained with 1 billion word corpus
'news-X-fast'	English	Trained with 1 billion word corpus, CPU-friendly
'mix-X'	English	Trained with mixed corpus (Web, Wikipedia, Subtitles)
'ar-X'	Arabic	Added by @stefan-it: Trained with Wikipedia/OPUS
'bg-X'	Bulgarian	Added by @stefan-it: Trained with Wikipedia/OPUS
'bg-X-fast'	Bulgarian	Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or SETimes)
'cs-X'	Czech	Added by @stefan-it: Trained with Wikipedia/OPUS
'cs-v0-X'	Czech	Added by @stefan-it: LM embeddings (earlier version)
'de-X'	German	Trained with mixed corpus (Web, Wikipedia, Subtitles)
'de-historic-ha-X'	German (historical)	Added by @stefan-it: Historical German trained over Hamburger Anzeiger
'de-historic-wz-X'	German (historical)	Added by @stefan-it: Historical German trained over Wiener Zeitung
'de-historic-rw-X'	German (historical)	Added by @redewiedergabe: Historical German trained over 100 million tokens
'es-X'	Spanish	Added by @iamyihwa: Trained with Wikipedia
'es-X-fast'	Spanish	Added by @iamyihwa: Trained with Wikipedia, CPU-friendly
'es-clinical-X'	Spanish (clinical)	Added by @matirojasg: Trained with Wikipedia
'eu-X'	Basque	Added by @stefan-it: Trained with Wikipedia/OPUS
'eu-v0-X'	Basque	Added by @stefan-it: LM embeddings (earlier version)
'fa-X'	Persian	Added by @stefan-it: Trained with Wikipedia/OPUS
'fi-X'	Finnish	Added by @stefan-it: Trained with Wikipedia/OPUS
'fr-X'	French	Added by @mhham: Trained with French Wikipedia
'he-X'	Hebrew	Added by @stefan-it: Trained with Wikipedia/OPUS
'hi-X'	Hindi	Added by @stefan-it: Trained with Wikipedia/OPUS
'hr-X'	Croatian	Added by @stefan-it: Trained with Wikipedia/OPUS
'id-X'	Indonesian	Added by @stefan-it: Trained with Wikipedia/OPUS
'it-X'	Italian	Added by @stefan-it: Trained with Wikipedia/OPUS
'ja-X'	Japanese	Added by @frtacoa: Trained with 439M words of Japanese Web crawls (2048 hidden states, 2 layers)
'nl-X'	Dutch	Added by @stefan-it: Trained with Wikipedia/OPUS
'nl-v0-X'	Dutch	Added by @stefan-it: LM embeddings (earlier version)
'no-X'	Norwegian	Added by @stefan-it: Trained with Wikipedia/OPUS
'pl-X'	Polish	Added by @borchmann: Trained with web crawls (Polish part of CommonCrawl)
'pl-opus-X'	Polish	Added by @stefan-it: Trained with Wikipedia/OPUS
'pt-X'	Portuguese	Added by @ericlief: LM embeddings
'sl-X'	Slovenian	Added by @stefan-it: Trained with Wikipedia/OPUS
'sl-v0-X'	Slovenian	Added by @stefan-it: Trained with various sources (Europarl, Wikipedia and OpenSubtitles2018)
'sv-X'	Swedish	Added by @stefan-it: Trained with Wikipedia/OPUS
'sv-v0-X'	Swedish	Added by @stefan-it: Trained with various sources (Europarl, Wikipedia or OpenSubtitles2018)
'ta-X'	Tamil	Added by @stefan-it
'pubmed-X'	English	Added by @jessepeng: Trained with 5% of PubMed abstracts until 2015 (1150 hidden states, 3 layers)
'de-impresso-hipe-v1-X'	German (historical)	In-domain data (Swiss and Luxembourgish newspapers) for CLEF HIPE Shared task. More information on the shared task can be found in this paper
'en-impresso-hipe-v1-X'	English (historical)	In-domain data (Chronicling America material) for CLEF HIPE Shared task. More information on the shared task can be found in this paper
'fr-impresso-hipe-v1-X'	French (historical)	In-domain data (Swiss and Luxembourgish newspapers) for CLEF HIPE Shared task. More information on the shared task can be found in this paper
'am-X'	Amharic	Based on 6.5m Amharic text corpus crawled from different sources. See this paper and the official GitHub Repository for more information.
'uk-X'	Ukrainian	Added by @dchaplinsky: Trained with UberText corpus.

So, if you want to load embeddings from the German forward LM model, instantiate the class as follows:

flair_de_forward = FlairEmbeddings('de-forward')

And if you want to load embeddings from the Bulgarian backward LM model, instantiate the class as follows:

flair_bg_backward = FlairEmbeddings('bg-backward')
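Since the forward and backward models share a naming pattern, the 'X' substitution described above can be wrapped in a small helper. The sketch below is only an illustration: `expand_model_id` is a hypothetical helper, not part of Flair, and it only builds the ID strings without checking that a matching model actually exists.

```python
from typing import Dict

DIRECTIONS = ("forward", "backward")


def expand_model_id(template: str) -> Dict[str, str]:
    """Turn a table ID such as 'news-X-fast' into its forward/backward IDs."""
    if "-X" not in template:
        raise ValueError(f"expected an ID containing '-X', got {template!r}")
    # Replace only the first '-X' placeholder with the direction name.
    return {d: template.replace("-X", f"-{d}", 1) for d in DIRECTIONS}


ids = expand_model_id("news-X-fast")
print(ids["forward"])   # news-forward-fast
print(ids["backward"])  # news-backward-fast
```

Each resulting string can then be passed to the FlairEmbeddings constructor in the same way as 'de-forward' and 'bg-backward' above.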

Recommended Flair Usage

We recommend combining both forward and backward Flair embeddings. Depending on the task, we also recommend adding standard word embeddings into the mix. So, our recommended StackedEmbeddings setup for most English tasks is:

from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# create a StackedEmbeddings object that combines GloVe and forward/backward Flair embeddings
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

That's it! Now just use this embedding like all the other embeddings, i.e. call the embed() method over your sentences.

from flair.data import Sentence

sentence = Sentence('The grass is green .')

# just embed a sentence using the StackedEmbeddings as you would with any single embedding.
stacked_embeddings.embed(sentence)

# now check out the embedded tokens.
for token in sentence:
    print(token)
    print(token.embedding)

Words are now embedded using a concatenation of three different embeddings. This combination often gives state-of-the-art accuracy.
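To make the concatenation concrete, here is a toy sketch in plain Python that does not use Flair at all. The vectors and their sizes are made up for illustration, and `stack_vectors` is a hypothetical helper. The only point is that the stacked vector's length is the sum of the component lengths.

```python
from typing import List, Sequence


def stack_vectors(parts: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-token vectors from several embeddings into one vector."""
    stacked: List[float] = []
    for part in parts:
        stacked.extend(part)
    return stacked


# Hypothetical toy vectors for a single token (real embeddings are far larger).
word_vec = [0.1, 0.2]            # stands in for a classic word embedding
forward_vec = [0.3, 0.4, 0.5]    # stands in for a forward Flair embedding
backward_vec = [0.6, 0.7, 0.8]   # stands in for a backward Flair embedding

token_vec = stack_vectors([word_vec, forward_vec, backward_vec])
print(len(token_vec))  # 8 = 2 + 3 + 3
```

In Flair itself, the size of the combined vector should be available through the `embedding_length` property of the stacked embedding object.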

Pooled Flair Embeddings

We also developed a pooled variant of the FlairEmbeddings. These embeddings differ in that they constantly evolve over time, even at prediction time (i.e. after training is complete). This means that the same words in the same sentence at two different points in time may have different embeddings.

PooledFlairEmbeddings manage a 'global' representation of each distinct word by applying a pooling operation over all past occurrences. More details on how this works may be found in Akbik et al. (2019).
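The idea of pooling over past occurrences can be sketched in a few lines of plain Python. This is a toy illustration, not Flair's implementation: `ToyPooledMemory` is a hypothetical class, the vectors are made up, and the choices shown here (element-wise min, max, or mean, with the pooled vector appended to the current contextual vector) are assumptions made for the sake of the example.

```python
from collections import defaultdict
from typing import Dict, List


class ToyPooledMemory:
    """Keeps every past vector per word and pools them element-wise."""

    def __init__(self, pooling: str = "min") -> None:
        if pooling not in ("min", "max", "mean"):
            raise ValueError(f"unsupported pooling: {pooling!r}")
        self.pooling = pooling
        self.memory: Dict[str, List[List[float]]] = defaultdict(list)

    def embed(self, word: str, contextual: List[float]) -> List[float]:
        """Store the new contextual vector, then return contextual + pooled."""
        self.memory[word].append(list(contextual))
        # zip(*vectors) groups the values of each dimension across occurrences.
        columns = list(zip(*self.memory[word]))
        if self.pooling == "min":
            pooled = [min(c) for c in columns]
        elif self.pooling == "max":
            pooled = [max(c) for c in columns]
        else:
            pooled = [sum(c) / len(c) for c in columns]
        return list(contextual) + pooled


memory = ToyPooledMemory(pooling="min")
first = memory.embed("green", [0.5, 0.9])
second = memory.embed("green", [0.7, 0.1])
print(first)   # [0.5, 0.9, 0.5, 0.9]
print(second)  # [0.7, 0.1, 0.5, 0.1]
```

The second call returns a different pooled half than the first because the memory has grown, which mirrors how the embedding of a word can change over time. Storing every past vector, as this sketch does, also shows why memory use grows with the amount of text seen.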

You can instantiate and use PooledFlairEmbeddings like any other embedding:

from flair.data import Sentence
from flair.embeddings import PooledFlairEmbeddings

# init embedding
flair_embedding_forward = PooledFlairEmbeddings('news-forward')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
flair_embedding_forward.embed(sentence)

Note that while we get some of our best results with PooledFlairEmbeddings, they are very memory-inefficient because they keep past embeddings of all words in memory. In many cases, regular FlairEmbeddings will be nearly as good but with much lower memory requirements.
