We semi-automatically corrected annotation errors in the classic CoNLL-03 dataset for Named Entity Recognition (NER). Get our corpus CleanCoNLL -- CoNLL-03 with nearly noise-free NER annotations -- with the help of this repository!
For details of the creation and evaluation of the dataset, have a look at our EMNLP 2023 paper: Rücker and Akbik (2023)!
Details are in the paper, but in short: We leveraged the Wikipedia links from the AIDA CoNLL-YAGO dataset to assign NER labels to each mention in a hybrid (automatic as well as manual) relabeling approach. Furthermore, we performed several rounds of cross-checking to correct remaining errors and resolve inconsistencies. Overall, we updated 7% of the labels from the original CoNLL-03.
We keep the original tagging scheme with 4 types (PER, LOC, ORG, MISC). In addition, we add NEL (Named Entity Linking) annotations, i.e. Wikipedia links, to the mentions.
Note: As the source text base, we used the corrected corpus version by Reiss et al. (2020) (paper, repo), as they not only fixed some of the label errors, but also corrected problems with token, sentence and mention splitting.
We distribute our CleanCoNLL annotations in column format. In the annotation files, the tokens are masked (`[TOKEN]`) for licence reasons, but you'll find a simple shell script that allows you to recreate CleanCoNLL with the help of the original CoNLL-03.
Step-by-step guide:
- Clone this repository.
- Inside `/data/cleanconll_annotations` you can find our masked annotation files (`cleanconll_annotations.train`, `cleanconll_annotations.dev`, `cleanconll_annotations.test`).
- Inside `/data/patch_files` you find patch files that represent the updates in the text base between the original CoNLL-03 and CleanCoNLL. These are needed for merging our annotations into the original corpus!
- However, you simply need to run our script:
```
chmod u+x create_cleanconll_from_conll03.sh
bash create_cleanconll_from_conll03.sh
```

This script will:

- download the original CoNLL-03 corpus,
- apply the patch files to the original CoNLL-03 text to align the text base before merging our annotations,
- create the three CleanCoNLL files with text and annotations; they will be placed inside `/data/cleanconll` (a quick sanity check is sketched below).
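After running the script, a quick sanity check of the generated splits can be helpful. The following is only a minimal sketch and not part of the repository's scripts; the folder follows the guide above, and the counting logic assumes the column format described below:

```python
from pathlib import Path

# Folder into which create_cleanconll_from_conll03.sh writes the recreated corpus
# (see the guide above); adjust if your checkout uses a different location.
data_dir = Path("data/cleanconll")

for path in sorted(data_dir.iterdir()):
    lines = path.read_text(encoding="utf-8").splitlines()
    docs = sum(1 for line in lines if line.startswith("-DOCSTART-"))
    tokens = sum(1 for line in lines if line and not line.startswith("-DOCSTART-"))
    # Groups of consecutive non-empty lines are sentences, except the
    # -DOCSTART- marker lines, which form their own one-line groups.
    blocks = sum(1 for prev, cur in zip([""] + lines, lines) if not prev and cur)
    print(f"{path.name}: {docs} articles, {blocks - docs} sentences, {tokens} tokens")
```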
The three files will look like this: column format with the following 5 columns, the last 3 using the BIO tagging scheme:

```
Token   POS   Wikipedia   NER (CleanCoNLL*)   NER (CleanCoNLL)
```
CleanCoNLL* is the CleanCoNLL version before Phase 3, i.e. before reverting the adjectival affiliations back to MISC (see the paper for details).
Sentences are separated by an empty line, articles by the `-DOCSTART-` token.
So, an excerpt of the dataset looks like this:
```
-DOCSTART- -X-  O                               O      O
SOCCER     NN   O                               O      O
-          :    O                               O      O
JAPAN      NNP  B-Japan_national_football_team  B-ORG  B-ORG
GET        VB   O                               O      O
LUCKY      NNP  O                               O      O
WIN        NNP  O                               O      O
,          ,    O                               O      O
CHINA      NNP  B-China_national_football_team  B-ORG  B-ORG
IN         IN   O                               O      O
SURPRISE   DT   O                               O      O
DEFEAT     NN   O                               O      O
.          .    O                               O      O
```
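To train or evaluate models on the recreated files, any CoNLL-style column reader works. Below is a minimal sketch using Flair's generic `ColumnCorpus` (assuming Flair is installed; the file names and the layer names in the column map are our own assumptions here, so adjust them to the files actually produced in `/data/cleanconll`):

```python
from flair.datasets import ColumnCorpus

# Map the 5 columns shown above to layer names (the names are arbitrary choices).
columns = {0: "text", 1: "pos", 2: "nel", 3: "ner_star", 4: "ner"}

corpus = ColumnCorpus(
    "data/cleanconll",                      # folder created by the script
    columns,
    train_file="cleanconll.train",          # assumed file names -- adjust to your output
    dev_file="cleanconll.dev",
    test_file="cleanconll.test",
    document_separator_token="-DOCSTART-",  # articles are separated by -DOCSTART-
)

print(corpus)
# Inspect the first training sentence together with its annotations.
print(corpus.train[0])
```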
If you use CleanCoNLL or find our approach useful, please cite our work.
- In the EMNLP 2023 Proceedings:
```
@inproceedings{rucker-akbik-2023-cleanconll,
    title = "{C}lean{C}o{NLL}: A Nearly Noise-Free Named Entity Recognition Dataset",
    author = {R{\"u}cker, Susanna and Akbik, Alan},
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.533",
    doi = "10.18653/v1/2023.emnlp-main.533",
    pages = "8628--8645",
}
```
- On arXiv:
```
@misc{rücker2023cleanconll,
    title={{C}lean{C}o{NLL}: A Nearly Noise-Free Named Entity Recognition Dataset},
    author={Susanna R{\"u}cker and Alan Akbik},
    year={2023},
    eprint={2310.16225},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```