Release 0.6.1

@alanakbik alanakbik released this 23 Sep 10:40
0ac2704

Release 0.6.1 is a bugfix release that fixes the issues caused by moving the server that originally hosted the Flair models. It also adds many new NER datasets, including the XTREME corpus for 40 languages, and a new model for NER on German-language legal text.

New Model: Legal NER (#1872)

Adds a legal NER model for German. It was trained on the German legal NER dataset, which can be loaded in Flair with the LER_GERMAN corpus object.

The model uses German Flair and FastText embeddings and reaches an F1 score of 96.35.

Use like this:

from flair.data import Sentence
from flair.models import SequenceTagger

# load German LER tagger
tagger = SequenceTagger.load('de-ler')

# example text
text = "vom 6. August 2020. Alle Beschwerdeführer befinden sich derzeit gemeinsam im Urlaub auf der Insel Mallorca , die vom Robert-Koch-Institut als Risikogebiet eingestuft wird. Sie wollen am 29. August 2020 wieder nach Deutschland einreisen, ohne sich gemäß § 1 Abs. 1 bis Abs. 3 der Verordnung zur Testpflicht von Einreisenden aus Risikogebieten auf das SARS-CoV-2-Virus testen zu lassen. Die Verordnung sei wegen eines Verstoßes der ihr zugrunde liegenden gesetzlichen Ermächtigungsgrundlage, des § 36 Abs. 7 IfSG , gegen Art. 80 Abs. 1 Satz 1 GG verfassungswidrig."

sentence = Sentence(text)

# predict and print entities
tagger.predict(sentence)

for entity in sentence.get_spans('ner'):
    print(entity)
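
The training data mentioned above can be inspected as well. The following is a minimal sketch, assuming that LER_GERMAN is importable from flair.datasets like the other corpora in these notes; `load_ler_german` and `split_sizes` are hypothetical helper names introduced only for illustration.

```python
def load_ler_german():
    """Return the German legal NER corpus (downloaded on first use)."""
    # Assumption: LER_GERMAN is importable from flair.datasets like the
    # other corpora in these notes. The import is kept inside the function
    # so that split_sizes() below can be used without Flair installed.
    from flair.datasets import LER_GERMAN

    return LER_GERMAN()


def split_sizes(corpus):
    """Return the number of sentences in each split of a corpus.

    Hypothetical helper: it only relies on the corpus exposing `train`,
    `dev` and `test` collections that support len().
    """
    return {
        "train": len(corpus.train),
        "dev": len(corpus.dev),
        "test": len(corpus.test),
    }


# Typical use (requires Flair and a network connection on the first run):
# corpus = load_ler_german()
# print(corpus)
# print(split_sizes(corpus))
```

The same `split_sizes` helper should work for any of the corpora listed below, since it only reads the three standard splits.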

New Datasets

Add XTREME and WikiANN corpora for multilingual NER (#1862)

These large corpora provide training data for NER in 176 languages. You can either load a language-specific part by supplying a language code:

from flair.datasets import XTREME

# load German Xtreme
german_corpus = XTREME('de')
print(german_corpus)

# load French Xtreme
french_corpus = XTREME('fr')
print(french_corpus)

Or you can load the default 40 languages at once into a single large MultiCorpus by omitting the language code:

# load Xtreme MultiCorpus for all default languages
multi_corpus = XTREME()
print(multi_corpus)

Add Twitter NER Dataset (#1850)

Dataset of tweets annotated with NER tags. Load with:

from flair.datasets import TWITTER_NER

# load twitter dataset
corpus = TWITTER_NER()

# print example tweet
print(corpus.test[0])

Add German Europarl NER Dataset (#1849)

Dataset of German-language speeches in the European parliament annotated with standard NER tags like person and location. Load with:

from flair.datasets import EUROPARL_NER_GERMAN

# load corpus
corpus = EUROPARL_NER_GERMAN()
print(corpus)

# print first test sentence
print(corpus.test[0])

Add MIT Restaurant NER Dataset (#1177)

Dataset of English restaurant reviews annotated with entities like "dish", "location" and "rating". Load with:

from flair.datasets import MIT_RESTAURANTS

# load restaurant dataset
corpus = MIT_RESTAURANTS()

# print example sentence
print(corpus.test[0])

Add Universal Proposition Banks for French and German (#1866)

This is the first step toward supporting the Universal Proposition Banks and adds the first two UP datasets to Flair. Load with:

from flair.datasets import UP_GERMAN

# load German UP
corpus = UP_GERMAN()
print(corpus)

# print example sentence
print(corpus.dev[1])

Add Universal Dependencies Dataset for Chinese (#1880)

Adds the Kyoto dataset for Chinese. Load with:

from flair.datasets import UD_CHINESE_KYOTO

# load Chinese UD dataset
corpus = UD_CHINESE_KYOTO()

# print example sentence
print(corpus.test[0])

Bug fixes

  • Move models to HU server (#1834 #1839 #1842)
  • Fix deserialization issues in transformer tokenizers (#1865)
  • Documentation fixes (#1819 #1821 #1836 #1852)
  • Add link to a repo with examples of Flair on GCP (#1825)
  • Correct variable names (#1875)
  • Fix problem with custom delimiters in ColumnDataset (#1876)
  • Fix offensive language detection model (#1877)
  • Correct Dutch NER model (#1881)