Add AQMAR NER modified corpus #2188

megantosh · 2021-03-26T19:33:14Z

Draft of a Modern Standard Arabic dataset from 28 Wikipedia articles used in https://www.aclweb.org/anthology/W19-4607/

alanakbik · 2021-03-26T20:11:28Z

@megantosh why "draft"? Looks good ;)

megantosh · 2021-03-28T00:03:25Z

I thought I would mark it as a draft since this is not the original dataset. Subsequent edits are possible (e.g. splitting the corpus into documents, or splitting MISC into MIS0, MIS1, MIS2, MIS3).

Just tested the branch after uploading. The PR is ready to merge 👍

alanakbik · 2021-03-30T14:19:20Z

A few potential issues with the dataset:

corpus = AQMAR()
corpus.make_label_dictionary()

prints the following label dictionary:

[b'B-PER', b'I-PER', b'O', b'B-MISC', b'I-MISC', b'B-ORG', b'I-ORG', b'B-LOC', b'I-LOC', b'B-SPANISH', b'B-ENGLISH']

I.e. it includes the labels B-SPANISH and B-ENGLISH, but they only occur twice in the corpus and seem to be wrong?
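
One way to check how rare such labels are is to count tag frequencies directly in the column-formatted file. The following is only a minimal sketch: it assumes whitespace-separated "token tag" lines with blank lines between sentences, and `count_tags`, `max_count` and the sample lines are hypothetical illustrations, not part of flair or of the corpus loader.

```python
from collections import Counter


def count_tags(lines):
    """Count how often each tag occurs in 'token tag' column lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:  # blank line (sentence boundary) or malformed line
            continue
        counts[parts[-1]] += 1  # the tag is assumed to be the last column
    return counts


# Hypothetical sample in the same format as the excerpts in this thread.
sample = [
    "كان O",
    "في O",
    "مدريد B-LOC",
    "ملعب O",
    "O'Donnell B-SPANISH",
    "",
    "في O",
    "مدريد B-LOC",
]

tag_counts = count_tags(sample)
max_count = 1  # assumed threshold for "suspiciously rare"
rare = sorted(tag for tag, n in tag_counts.items() if n <= max_count)
print(rare)
```

Running the same function over the real data file would show whether B-SPANISH and B-ENGLISH are the only labels that fall below the threshold.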

Another potential problem is sentences like this one:

ستالمان I-PER
( O
بالإنجليزية O
: O
Richard B-MISC
Stallman I-MISC
) O
( O

Why is Richard Stallman marked as MISC? Shouldn't it be a person?

megantosh · 2021-03-30T14:53:39Z

According to the original dataset's README.txt, the tags for Richard_Stallman.txt are defined as:
MIS-1: Name of software or hardware (e.g. Emacs, Kernel)
MIS-2: English entities
An entity tagged MIS-2 therefore appears mistagged once MIS-1, MIS-2, etc. are all merged into a single MISC label.
Hence it looks like an error. However, the name in Arabic
ريتشارد B-PER
ستالمان I-PER
is tagged correctly as PER.

The problem is that in another file, MIS-2 represents something else.
E.g. in atom.txt: MIS-2: Name of theories (e.g. Dalton Theory)

The dataset does indeed have a few issues beyond that, too. However, I decided to keep these "mistakes" for the sake of comparability with the scores published in https://www.aclweb.org/anthology/W19-4607/ (same paper as above).

Alternatively, we can establish different versions of the dataset if that helps. What do you think?
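
To make the merge step concrete, here is a rough sketch of how the per-article MIS variants could be collapsed into a single MISC label. The exact tag spellings are an assumption (this thread mentions both MIS0..MIS3 and MIS-1, MIS-2), and `merge_misc` is a hypothetical helper, not the conversion script used for this PR.

```python
import re

# Assumed spellings of the per-article tags: B-MIS0, I-MIS2, B-MIS-1, ...
_MIS_VARIANT = re.compile(r"^([BI])-MIS-?\d+$")


def merge_misc(tag):
    """Map any numbered MIS variant to MISC, leaving other tags untouched."""
    match = _MIS_VARIANT.match(tag)
    if match:
        return f"{match.group(1)}-MISC"  # keep the B-/I- prefix
    return tag


merged = [merge_misc(t) for t in ["B-MIS1", "I-MIS-2", "B-PER", "O", "B-MISC"]]
print(merged)
```

A second dataset version could skip this mapping and keep the numbered tags, although they would still not be comparable across articles, since MIS-2 means different things in different files.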

megantosh · 2021-03-30T14:59:57Z

For clarification:

ريتشارد B-PER
ستالمان I-PER
is a transliteration of the English name
Richard B-MISC
Stallman I-MISC

alanakbik · 2021-03-30T15:01:02Z

Good question, but I think merging them into one MISC label for now is a good way to go.

What about the issue of the B-ENGLISH and B-SPANISH tags? Is this an error in the conversion script?

E.g.:

كان O
الفريق O
يستخدم O
بعض O
الملاعب O
البسيطة O
في O
مدريد B-LOC
للعب O
المباريات O
حتى O
انتقل O
لأول O
ملعب O
كان O
اسمه O
ملعب O
O'Donnell B-SPANISH
في O
عام O

megantosh · 2021-03-30T15:20:10Z

That is my assumption, yes. The same issue exists in the original corpus.
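
If it helps to locate these cases, a small check could list every token whose tag falls outside the expected label set. This is only a sketch: the expected set below is assumed from the merged label dictionary quoted earlier in this thread, and `find_unexpected` is a hypothetical helper, not an existing flair function.

```python
# Expected labels after merging, assumed from the label dictionary above.
EXPECTED_TAGS = {"O"} | {
    f"{prefix}-{label}"
    for prefix in ("B", "I")
    for label in ("PER", "LOC", "ORG", "MISC")
}


def find_unexpected(lines, expected=EXPECTED_TAGS):
    """Return (line number, token, tag) for every tag not in `expected`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) >= 2 and parts[-1] not in expected:
            hits.append((number, parts[0], parts[-1]))
    return hits


# Hypothetical sample taken from the excerpt quoted above.
sample = ["ملعب O", "O'Donnell B-SPANISH", "في O"]
hits = find_unexpected(sample)
print(hits)
```

The reported line numbers could then be compared against the original corpus files to confirm that the stray tags are already present there.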

alanakbik · 2021-04-07T12:58:37Z

@megantosh thanks for adding this - will merge this version now!

Add AQMAR NER modified corpus

25cf70c

alanakbik merged commit 6661799 into flairNLP:master Apr 7, 2021
