NER-datasets for Portuguese

HAREM

HAREM was an evaluation contest for named entity recognition in Portuguese. There were two editions:

First HAREM

Second HAREM

NOTE: the XML format can be painful to parse. See the Paramopama corpus, which includes the HAREM data in CoNLL format.

WikiNER

An NER corpus built by exploiting inter-document links in Wikipedia.

Paramopama

Extends the PtBR version of the WikiNER corpus, revising incorrectly assigned tags to improve corpus quality. For their experiments, the authors also produced a CoNLL-format version of the HAREM corpus, which they made publicly available.
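For readers who have not worked with CoNLL-style files before, the following is a minimal sketch of a reader. It assumes the common layout of one token per line, whitespace-separated columns with the token first and the tag last, and a blank line between sentences. The exact column layout of the Paramopama and HAREM files is not described here, so check it against the actual data. The function name `read_conll` and the sample sentence are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: token is the first column, tag is the last column.
        current.append((columns[0], columns[-1]))
    # Flush the last sentence if the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Invented sample, not taken from any of the corpora listed here.
sample = [
    "Maria PER",
    "mora O",
    "em O",
    "Lisboa LOC",
    "",
    "Obrigado O",
]
sentences = read_conll(sample)
```

To read a file, pass an open file handle, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to a local copy of the corpus.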

leNER-Br

A dataset for named entity recognition in Brazilian legal documents. Unlike other Portuguese-language datasets, it is composed entirely of legal documents. In addition to tags for persons, locations, time entities and organizations, the dataset contains specific tags for law and legal case entities.

Peres 2017

A dataset for named entity recognition in Brazilian Portuguese (#noisydata #twitter)

WikiANN

WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
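In IOB2, the first token of every entity is tagged `B-<label>`, following tokens of the same entity are tagged `I-<label>`, and tokens outside any entity are tagged `O`. The sketch below shows one way to turn such a tag sequence into entity spans. The function name `iob2_to_spans` is hypothetical, and ignoring an `I-` tag that does not continue an open entity is a choice made for this example, not something the dataset prescribes.

```python
from typing import List, Optional, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans from IOB2 tags; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An I- tag with the same label extends the currently open span.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        # A B- tag opens a new span; O and stray I- tags open nothing.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


example_tags = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"]
example_spans = iob2_to_spans(example_tags)
```

Each span indexes into the token list, so `tokens[start:end]` recovers the entity text for a given `(label, start, end)` triple.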