HAREM was an evaluation contest for named entity recognition in Portuguese. There were two editions:
-
data:
- CDSegundoHAREMclassico.xml: named entities.
- CDSegundoHAREM_TEMPO.xml: named entities, with time expressions grounded to a time/date structure.
- CDSegundoHAREMReRelEM.xml: named entities and the relationships between them.
-
cite: "Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese"
NOTE: the XML format might be painful to parse; see the Paramopama corpus, which includes the HAREM data in CoNLL format.
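If you do want to read the XML directly, the following is a minimal sketch using only the Python standard library. The element and attribute names (`colHAREM`, `DOC`, `DOCID`, `EM`, `CATEG`) and the inline sample are assumptions made for illustration, so check them against the actual files before relying on this. Real files may also contain structure that this sketch does not handle, such as alternative segmentations or multi-valued attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample imitating the assumed layout of a HAREM file.
SAMPLE = """<colHAREM>
  <DOC DOCID="doc-1">
    <P>O <EM ID="e1" CATEG="PESSOA">Fernando Pessoa</EM> nasceu em <EM ID="e2" CATEG="LOCAL">Lisboa</EM>.</P>
  </DOC>
</colHAREM>"""


def extract_entities(xml_text):
    """Return (doc_id, entity_text, category) tuples in document order."""
    root = ET.fromstring(xml_text)
    entities = []
    for doc in root.iter("DOC"):  # assumed document element
        doc_id = doc.get("DOCID")
        for em in doc.iter("EM"):  # assumed entity element
            # itertext() also collects text from any nested elements.
            text = "".join(em.itertext()).strip()
            entities.append((doc_id, text, em.get("CATEG")))
    return entities


entities = extract_entities(SAMPLE)
for doc_id, text, category in entities:
    print(doc_id, text, category)
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.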
An NER corpus built by exploiting inter-document links in Wikipedia.
- data: https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500
- cite: "Learning multilingual named entity recognition from Wikipedia"
Extends the PtBR version of the WikiNER corpus, revising incorrectly assigned tags to improve corpus quality. In their experiments the authors also produced a CoNLL-format version of the HAREM corpus, which they made publicly available:
A dataset for named entity recognition in Brazilian legal documents. Unlike other Portuguese-language datasets, it is composed entirely of legal documents. In addition to tags for persons, locations, time entities, and organizations, the dataset contains specific tags for law and legal case entities.
A dataset for named entity recognition in Brazilian Portuguese (#noisydata #twitter)
WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
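In IOB2, every entity starts with a `B-` tag, continues with `I-` tags of the same type, and tokens outside any entity are tagged `O`. The sketch below decodes a tag sequence into entity spans. The function name `iob2_to_spans` and the example sentence are hypothetical and not part of any dataset tooling. As a simplifying assumption, an `I-` tag with no matching open entity is treated as outside.

```python
def iob2_to_spans(tags):
    """Convert IOB2 tags to (label, start, end) spans, with end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the current entity only if the types match.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the entity that is currently open.
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        # A B- tag opens a new entity; O and stray I- tags open nothing.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tokens = ["Fernando", "Pessoa", "nasceu", "em", "Lisboa", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
spans = iob2_to_spans(tags)
for label, start, end in spans:
    print(label, " ".join(tokens[start:end]))
```

Using exclusive end indices means `tokens[start:end]` recovers the entity text directly.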