Add AQMAR NER modified corpus #2188

Merged: 1 commit merged on Apr 7, 2021


megantosh
Contributor

Draft of a Modern Standard Arabic dataset from 28 Wikipedia articles used in https://www.aclweb.org/anthology/W19-4607/

@alanakbik
Collaborator

@megantosh why "draft"? Looks good ;)

@megantosh
Contributor Author

I thought I would mark it as a draft since this is not the original dataset. Subsequent edits are possible (e.g. splitting the corpus into documents, or splitting MISC into MIS0, MIS1, MIS2, MIS3).

Just tested the branch after uploading. The PR is ready to merge 👍

@alanakbik
Collaborator

A few potential issues with the dataset:

corpus = AQMAR()
corpus.make_label_dictionary()

prints the following label dictionary:

[b'B-PER', b'I-PER', b'O', b'B-MISC', b'I-MISC', b'B-ORG', b'I-ORG', b'B-LOC', b'I-LOC', b'B-SPANISH', b'B-ENGLISH']

That is, it includes the labels B-SPANISH and B-ENGLISH, but they occur only twice in the corpus and seem to be wrong.

Another potential problem is sentences like this one:
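One way to confirm how rare such labels are is to count tags directly in the column-formatted file. The following is a minimal sketch, assuming one `token tag` pair per line and blank lines between sentences; the helper name `count_tags` and the sample lines are hypothetical, not part of flair or of this PR.

```python
from collections import Counter


def count_tags(lines):
    """Count how often each tag occurs in 'token tag' lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank line (sentence boundary) or malformed line
        counts[parts[-1]] += 1  # the tag is assumed to be the last column
    return counts


# Hypothetical sample in the same layout as the corpus excerpts in this thread.
sample = [
    "مدريد B-LOC",
    "للعب O",
    "",
    "ملعب O",
    "O'Donnell B-SPANISH",
]

tag_counts = count_tags(sample)
for tag, count in tag_counts.most_common():
    print(tag, count)
```

Running this over the full corpus file instead of `sample` would show whether B-SPANISH and B-ENGLISH are isolated occurrences.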

ستالمان I-PER
( O
بالإنجليزية O
: O
Richard B-MISC
Stallman I-MISC
) O
( O

Why is Richard Stallman marked as MISC? Shouldn't it be a person?

@megantosh
Contributor Author

According to the original dataset's README.txt, the MIS tags for Richard_Stallman.txt are defined as:
MIS-1: Name of software or hardware (e.g. Emacs, Kernel)
MIS-2: English entities

Once MIS-1, MIS-2, etc. are all merged into a single MISC label, the English name appears to be mistagged.
Hence it looks like an error; however, the name in Arabic
ريتشارد B-PER
ستالمان I-PER

is tagged correctly as PER.

The problem is that in another file, MIS-2 represents something else.
E.g. in atom.txt: MIS-2: Name of theories (e.g. Dalton Theory)
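The merge described above can be sketched as follows. This is only an illustration, assuming the numbered tags appear in BIO form as `B-MIS2` or `B-MIS-2` (both spellings are accepted here because the exact spelling in the source files is not shown in this thread); the function name `merge_misc` is hypothetical and is not the conversion script used for this PR.

```python
import re

# Matches B-/I- tags with a numbered MIS type, e.g. B-MIS1, I-MIS-2 (assumed spellings).
MIS_PATTERN = re.compile(r"^([BI])-MIS-?\d+$")


def merge_misc(tag):
    """Collapse numbered MIS tags into MISC; leave every other tag unchanged."""
    match = MIS_PATTERN.match(tag)
    if match:
        return f"{match.group(1)}-MISC"
    return tag


# Hypothetical rows mirroring the Richard Stallman example.
rows = [
    ("ريتشارد", "B-PER"),
    ("ستالمان", "I-PER"),
    ("Richard", "B-MIS2"),
    ("Stallman", "I-MIS2"),
]

merged = [(token, merge_misc(tag)) for token, tag in rows]
for token, tag in merged:
    print(token, tag)
```

Because the meaning of each MIS number is defined per file, the merge discards that per-file meaning, which is why "English entities" ends up indistinguishable from "Name of theories" under MISC.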

The dataset does indeed have a few issues beyond that, too. However, I thought I would keep these "mistakes" for the sake of comparability with the scores published in https://www.aclweb.org/anthology/W19-4607/ (same paper as above).

Alternatively, we can establish different versions of the dataset if that helps. What do you think?

@megantosh
Contributor Author

for clarification:

ريتشارد B-PER
ستالمان I-PER
is a transliteration of the English name
Richard B-MISC
Stallman I-MISC

@alanakbik
Collaborator

Good question, but I think merging them into one MISC label for now is a good way to go.

What about the issue of the B-ENGLISH and B-SPANISH tags? Is this an error in the conversion script?

For example:

كان O
الفريق O
يستخدم O
بعض O
الملاعب O
البسيطة O
في O
مدريد B-LOC
للعب O
المباريات O
حتى O
انتقل O
لأول O
ملعب O
كان O
اسمه O
ملعب O
O'Donnell B-SPANISH
في O
عام O

@megantosh
Contributor Author

megantosh commented Mar 30, 2021

That is my assumption, yes. The same issue exists in the original corpus.
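To check whether the stray tags come from the conversion or from the source files, the same scan could be run over both. Below is a rough sketch, assuming `token tag` lines and an expected type set of PER, LOC, ORG and MISC (taken from the label dictionary printed earlier in this thread); `find_unexpected_tags` is a hypothetical helper, not a flair function.

```python
EXPECTED_TYPES = {"PER", "LOC", "ORG", "MISC"}


def find_unexpected_tags(lines, expected_types=EXPECTED_TYPES):
    """Return (line_number, token, tag) for tags that are neither O nor B-/I- plus an expected type."""
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        token, tag = parts[0], parts[-1]
        if tag == "O":
            continue
        prefix, _, entity_type = tag.partition("-")
        if prefix not in {"B", "I"} or entity_type not in expected_types:
            problems.append((number, token, tag))
    return problems


# Hypothetical excerpt based on the sentence quoted above.
sample = [
    "ملعب O",
    "O'Donnell B-SPANISH",
    "في O",
    "مدريد B-LOC",
]

problems = find_unexpected_tags(sample)
for number, token, tag in problems:
    print(f"line {number}: {token} -> {tag}")
```

If the same line numbers are flagged in the original corpus, the tags were inherited rather than introduced by the conversion.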

@alanakbik
Collaborator

@megantosh thanks for adding this - will merge this version now!

@alanakbik alanakbik merged commit 6661799 into flairNLP:master Apr 7, 2021