NER-datasets for Portuguese

HAREM

HAREM was an evaluation contest for named entity recognition in Portuguese. There were two editions:

First HAREM

data: ColeccaoDouradaHAREM.txt
cite: "HAREM: An Advanced NER Evaluation Contest for Portuguese"

Second HAREM

data:
- CDSegundoHAREMclassico.xml: named-entities
- CDSegundoHAREM_TEMPO.xml: named entities, plus time expressions grounded to a time/date structure.
- CDSegundoHAREMReRelEM.xml: named-entities and relationships.
cite: "Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese"

NOTE: the XML format can be painful to parse; see the Paramopama corpus, which includes the HAREM data in CoNLL format.
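If you do want to work with the XML directly, the sketch below flattens inline entity markup into (token, tag) pairs. It is a minimal illustration, not a HAREM parser: it assumes entities appear as inline `EM` elements with a `CATEG` attribute, as in the hypothetical sample string, and it tokenizes on whitespace only, so punctuation stays attached to words. The function name `em_to_iob2` is made up for this example, and nested markup inside an entity is not handled specially.

```python
import xml.etree.ElementTree as ET


def em_to_iob2(doc_xml):
    """Flatten a document with inline <EM CATEG="..."> entities into (token, tag) pairs."""
    root = ET.fromstring(doc_xml)
    pairs = []

    def add(text, label):
        # Whitespace tokenization; the first token of an entity gets B-, the rest I-.
        for i, tok in enumerate((text or "").split()):
            if label is None:
                pairs.append((tok, "O"))
            else:
                pairs.append((tok, ("B-" if i == 0 else "I-") + label))

    def walk(node, label):
        add(node.text, label)
        for child in node:
            # Assumption: entity elements are named EM and carry a CATEG attribute.
            child_label = child.get("CATEG") if child.tag == "EM" else label
            walk(child, child_label)
            # Text after a child's closing tag belongs to the parent.
            add(child.tail, label)

    walk(root, None)
    return pairs


# Hypothetical sample, not taken from the HAREM files.
sample = (
    '<DOC DOCID="d1"><P>O <EM ID="e1" CATEG="PESSOA">João Silva</EM> '
    'visitou <EM ID="e2" CATEG="LOCAL">Lisboa</EM> ontem.</P></DOC>'
)
for token, tag in em_to_iob2(sample):
    print(token, tag)
```

The real files may use additional elements or attribute conventions, so check the markup before relying on this.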

WikiNER

A NER corpus built by exploiting inter-document links in Wikipedia.

data: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
cite: "Learning multilingual named entity recognition from Wikipedia"

Paramopama

Extends the PtBR version of the WikiNER corpus, revising incorrectly assigned tags to improve corpus quality. For their experiments the authors also produced a CoNLL-format version of the HAREM corpus, which they made publicly available:

data:
cite: "Paramopama: a Brazilian-Portuguese Corpus for Named Entity Recognition"
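CoNLL-style files are straightforward to read. The sketch below assumes the common layout of one token per line with whitespace-separated columns, the token in the first column, the NER tag in the last, and a blank line between sentences. The exact column layout of the Paramopama files is not stated here, so treat that as an assumption; `read_conll` is a hypothetical helper name.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: token is the first column, NER tag is the last.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample lines; with a real file, pass the open file object instead.
sample_lines = [
    "Maria B-PER",
    "mora O",
    "em O",
    "Recife B-LOC",
    "",
    "Ela O",
    "trabalha O",
]
for sentence in read_conll(sample_lines):
    print(sentence)
```

Because the function only iterates over lines, it works equally well on a list of strings or on a file opened with `open(path, encoding="utf-8")`.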

leNER-Br

A dataset for named entity recognition in Brazilian legal documents. Unlike other Portuguese-language datasets, it is composed entirely of legal documents. In addition to tags for persons, locations, time entities and organizations, the dataset contains specific tags for law and legal case entities.

data:
cite: "LeNER-Br: a Dataset for Named Entity Recognition in Brazilian Legal Text"
web: https://cic.unb.br/~teodecampos/LeNER-Br/

Peres 2017

A dataset for named entity recognition in Brazilian Portuguese (#noisydata #twitter)

data:
- dataset.ptbr.twitter.train.ner
- dataset.ptbr.twitter.test.ner
cite: "Bidirectional LSTM with a Context Input Window for Named Entity Recognition in Tweets"

WikiANN

WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.

data:
- train
- extra
- dev
- test
cite: "Cross-lingual Name Tagging and Linking for 282 Languages"
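In IOB2, every entity starts with a `B-` tag, continues with `I-` tags of the same type, and everything else is `O`. The sketch below shows one way to turn a tag sequence into entity spans. The function name `iob2_spans` is hypothetical, and the handling of malformed input (an `I-` tag with no matching `B-` is ignored) is a design choice of this example, not something the dataset prescribes.

```python
def iob2_spans(tags):
    """Return (label, start, end) spans from IOB2 tags; end is exclusive."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.append((label, start, i))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


tokens = ["Fernando", "Pessoa", "nasceu", "em", "Lisboa"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
for label, start, end in iob2_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Spans use exclusive end indices so that `tokens[start:end]` recovers the entity text directly.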
