Add AQMAR NER modified corpus #2188
@megantosh why "draft"? Looks good ;)
I thought I would mark it as a draft since this is not the original dataset. Subsequent edits are possible (e.g. splitting the corpus into docs, or splitting MISC into MIS0, MIS1, MIS2, MIS3). I just tested the branch after uploading. PR ready for merge 👍
A few potential issues with the dataset. Running corpus = AQMAR() followed by corpus.make_label_dictionary() prints the following label dictionary:
That is, it includes the labels B-SPANISH and B-ENGLISH, but they occur only twice in the corpus and seem to be wrong. Another potential problem is sentences like this one:
Why is Richard Stallman marked as MISC? Shouldn't it be a person?
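The rare-label check described above can be reproduced without loading the corpus through the library. The following is a minimal sketch, assuming the data is available as CoNLL-style column lines (token first, tag last, blank line between sentences). The helper names `count_labels` and `rare_labels` and the toy input are hypothetical and not part of this PR.

```python
from collections import Counter


def count_labels(lines, label_column=-1):
    """Count how often each tag appears in CoNLL-style column lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Skip blank lines (sentence boundaries) and lines without a tag column.
        if len(parts) < 2:
            continue
        counts[parts[label_column]] += 1
    return counts


def rare_labels(counts, max_count=2, ignore=("O",)):
    """Return labels seen at most max_count times, sorted by name."""
    return sorted(
        label
        for label, n in counts.items()
        if n <= max_count and label not in ignore
    )


# Hypothetical toy input: placeholder tokens, not lines from the corpus.
sample = [
    "tok1 B-PER", "tok2 I-PER", "tok3 O", "",
    "tok4 B-PER", "tok5 I-PER", "tok6 B-ENGLISH", "",
    "tok7 B-PER", "tok8 I-PER", "tok9 O",
]
counts = count_labels(sample)
print(rare_labels(counts))
```

On the real files, any label this reports (for example one that occurs only twice) is a candidate for manual inspection before deciding whether it is an annotation or a conversion error.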
According to the original dataset's README.txt: The problem is that in another file, MIS-2 represents something else. The dataset does indeed have a few issues beyond that, too. However, I thought it best to keep these "mistakes" for the sake of comparability with the published scores in https://www.aclweb.org/anthology/W19-4607/ (same paper as above). Alternatively, we can establish different versions of the dataset if that helps. What do you think?
For clarification: ريتشارد B-PER
Good question, but I think merging them into one MISC label for now is a good way to go. What about the issue of the B-ENGLISH and B-SPANISH tags? Is this an error in the conversion script? I.e.:
That is my assumption, yes. The same issue exists in the original corpus.
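For the merge of the MIS variants into a single MISC label suggested earlier in the thread, a minimal sketch might look like the following. It assumes the tags use a B-/I- prefix followed by MIS, an optional "-" or "_", and an optional number (covering the MIS0 to MIS3 and MIS-2 spellings mentioned here); the exact tag spelling in the original files may differ. The function name `merge_misc` is hypothetical and not taken from the conversion script.

```python
import re

# Assumed tag shapes: B-MIS, B-MIS2, I-MIS-2, I-MIS_3, and so on.
# Tags that are already B-MISC / I-MISC do not match and pass through unchanged.
_MIS_VARIANT = re.compile(r"^([BI])-MIS[-_]?\d*$")


def merge_misc(tag):
    """Map any numbered MIS variant to MISC, keeping the B-/I- prefix."""
    match = _MIS_VARIANT.match(tag)
    if match:
        return f"{match.group(1)}-MISC"
    return tag


# Hypothetical tag sequence for illustration only.
tags = ["B-MIS2", "I-MIS2", "O", "B-PER", "I-MIS-2", "B-MISC"]
print([merge_misc(t) for t in tags])
```

Keeping the mapping in one small function makes it easy to switch off later if separate versions of the dataset (with and without merged labels) are published.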
@megantosh thanks for adding this - will merge this version now!
Draft of a Modern Standard Arabic dataset from 28 Wikipedia articles used in https://www.aclweb.org/anthology/W19-4607/