Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition

Modern approaches to Named Entity Recognition (NER) use neural networks (NN) to automatically extract features from text and seamlessly integrate them with sequence taggers in an end-to-end fashion. Word embeddings, which are a side product of pretrained neural language models (LMs), are a key ingredient for boosting the performance of NER systems. More recently, contextual word embeddings, which adapt to the context in which a word appears, have proved to be an invaluable resource for improving NER systems. In this work, we assess how different combinations of (shallow) word embeddings and contextual embeddings impact NER for the Portuguese language. We present a comparative study of 16 different combinations of shallow and contextual embeddings and explore how textual diversity and the size of the training corpora used in LMs impact our NER results. We evaluate NER performance using the HAREM corpus. Our best NER system outperforms the state of the art in Portuguese NER by 5.99 absolute percentage points. All results, including the state-of-the-art baselines, were evaluated with the CoNLL-2002 script.

Results for the Total Scenario (HAREM)

Approach	Precision	Recall	F1
BiLSTM-CRF+FlairBBP	74.91%	74.37%	74.64%
BiLSTM-CRF (Castro, et al.)	72.28%	68.03%	70.33%
CharWNN (dos Santos, et al.)	67.16%	63.74%	65.41%

Results for the Selective Scenario (HAREM)

Approach	Precision	Recall	F1
BiLSTM-CRF+FlairBBP	83.38%	81.17%	82.26%
BiLSTM-CRF (Castro, et al.)	78.26%	74.39%	76.27%
CharWNN (dos Santos, et al.)	73.98%	68.68%	71.23%
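
The F1 column in both tables is the harmonic mean of precision and recall. The following is a minimal sketch for checking a row by hand; the function name f1_score is our own and is not part of this repository or of the CoNLL-2002 script. The published figures come from the evaluation script, which works on entity counts, so recomputing from rounded percentages may differ in the last digit for some rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent).

    Hypothetical helper for illustration; not part of this repository.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of BiLSTM-CRF+FlairBBP, copied from the two tables above.
total_f1 = f1_score(74.91, 74.37)
selective_f1 = f1_score(83.38, 81.17)

print(f"Total: {total_f1:.2f}%  Selective: {selective_f1:.2f}%")
```

For the BiLSTM-CRF+FlairBBP rows this reproduces the reported 74.64% (Total) and 82.26% (Selective).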

Reproduce our tests for NER

Before you begin, install the Flair library. Flair is an NLP library with state-of-the-art results, developed by Zalando Research. See the Flair GitHub repository for full details.

Paper: Contextual String Embeddings for Sequence Labeling (Akbik, et al.)

STEP 1: Download our language model FlairBBP (backward and forward);

STEP 2: Clone this repository;

STEP 3: Install Flair 0.4.1. See how to install here;

STEP 4: Download NILC's word embeddings. You must download Word2Vec Skip-Gram with 300 dimensions and put the file inside the cloned folder;

STEP 5: Run our script: python3.6 ner_flair.py

Tagging your Portuguese text with our NER model

Tag your text using our best NER model. The model combines FlairBBP + NILC-Word2Vec-Skpg-300d. It recognizes the following categories: PERSON, LOCATION, ORGANIZATION, TIME and VALUE. You need to install Flair 0.4.1.

STEP 1: Download our NER model (Download Here!);

STEP 2: Clone this repository;

STEP 3: Run our script: python3.6 tagging_ner.py [input_file_name.txt] [output_file_name.txt] [mode]

Modes:

conll - input text in CoNLL format
plain - input text in plain format
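
To illustrate the difference between the two modes, here is a minimal sketch of how an input file could be split into tokenized sentences. It is not the logic of tagging_ner.py. It assumes that CoNLL input has one token per line (token in the first whitespace-separated column) with a blank line between sentences, and that plain input has one sentence per line; the function name read_sentences and the sample text are hypothetical.

```python
from typing import List


def read_sentences(text: str, mode: str) -> List[List[str]]:
    """Split raw input into sentences, each a list of tokens.

    Assumed formats (not taken from tagging_ner.py):
      conll - one token per line, blank line between sentences
      plain - one sentence per line, tokens separated by whitespace
    """
    if mode == "conll":
        sentences, current = [], []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                # A blank line closes the current sentence.
                if current:
                    sentences.append(current)
                    current = []
                continue
            current.append(line.split()[0])  # first column holds the token
        if current:
            sentences.append(current)
        return sentences
    if mode == "plain":
        return [line.split() for line in text.splitlines() if line.strip()]
    raise ValueError(f"unknown mode: {mode!r}")


# The same two sentences in both assumed formats.
conll_text = "Maria O\nmora O\nem O\nLisboa O\n\nEla O\nchegou O\n"
plain_text = "Maria mora em Lisboa\nEla chegou\n"

conll_sentences = read_sentences(conll_text, "conll")
plain_sentences = read_sentences(plain_text, "plain")
print(conll_sentences)
```

Under these assumptions both inputs yield the same token lists, so the mode only tells the reader how the file is laid out.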

Language Models

Flair Embeddings - FlairBBP

You can download our Flair Embeddings models (FlairBBP) in the following links:

Backward: FlairBBP-Backward
Forward: FlairBBP-Forward

Word Embeddings

You can download our Word Embedding models from the following links. Note that all models were trained with 300 dimensions:

Algorithm	Architecture	Downloads
Word2Vec	Skip-Gram	Word2Vec_skpg_300d
Word2Vec	CBOW	Word2Vec_cbow_300d
FastText	Skip-Gram	Fasttext_skpg_300d
FastText	CBOW	Fasttext_cbow_300d

NILC Word Embeddings

You can download the Word Embeddings provided by NILC in the following link: http://nilc.icmc.usp.br/embeddings

Paper: Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks (Hartmann, et al.)

Language Models Corpora

BlogSet-BR

BlogSet-BR is a large corpus built from millions of sentences taken from Brazilian Portuguese web blogs.

Paper: BlogSet-BR: A Brazilian Portuguese Blog Corpus (Santos, et al.)
Download Here!

brWaC

brWaC is another large Portuguese corpus.

Paper: The brWaC Corpus: A New Open Resource for Brazilian Portuguese (Filho, et al.)
Download Here!

ptwiki-20190301

ptwiki-20190301 is a corpus formed by texts from the Portuguese Wikipedia.

Download Here!

Language Model Corpora Size Details (after pre-processing):

Corpus	Sentences	Tokens
brWaC	127,272,109	2,930,573,938
BlogSet-BR	58,494,090	1,807,669,068
ptwiki-20190301	7,053,954	162,109,057
All Corpora	192,820,153	4,900,352,063
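
As a quick consistency check, the "All Corpora" row can be recomputed from the three individual rows. The sketch below copies the counts from the table above; the variable names are our own, and the per-corpus token share and average sentence length are derived values, not figures reported in the paper.

```python
# (sentences, tokens) per corpus, copied from the table above.
corpora = {
    "brWaC": (127_272_109, 2_930_573_938),
    "BlogSet-BR": (58_494_090, 1_807_669_068),
    "ptwiki-20190301": (7_053_954, 162_109_057),
}

total_sentences = sum(sentences for sentences, _ in corpora.values())
total_tokens = sum(tokens for _, tokens in corpora.values())
print(f"All Corpora: {total_sentences:,} sentences, {total_tokens:,} tokens")

# Derived values: share of all tokens and average sentence length per corpus.
for name, (sentences, tokens) in corpora.items():
    share = tokens / total_tokens
    avg_len = tokens / sentences
    print(f"{name}: {share:.1%} of tokens, {avg_len:.1f} tokens per sentence")
```

The sums match the "All Corpora" row: 192,820,153 sentences and 4,900,352,063 tokens.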

Citing our Paper

@inproceedings{santos2019assessing,
  author    = {Joaquim Santos and
               Bernardo Consoli and
               Cicero dos Santos and
               Juliano Terra and
               Sandra Collovini and
               Renata Vieira},
  title     = {Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition},
  booktitle = {8th Brazilian Conference on Intelligent Systems, {BRACIS}, Bahia, Brazil, October 15-18},
  pages     = {437--442},
  year      = {2019}
}
