The Arabic-PADT UD treebank is based on the Prague Arabic Dependency Treebank (PADT), created at Charles University in Prague.
The treebank consists of 7,664 sentences (282,384 tokens) and its domain is mainly newswire. The annotation is licensed under the terms of CC BY-NC-SA 3.0 and its original (non-UD) version can be downloaded from http://hdl.handle.net/11858/00-097C-0000-0001-4872-3.
The morphological and syntactic annotation of the Arabic UD treebank was created by converting the PADT data. The conversion procedure was designed by Dan Zeman. The main coordinator of the original PADT project was Otakar Smrž.
Source of annotations
This table summarizes the origins and checking of the various columns of the CoNLL-U data.
|ID||Sentence-level units in PADT often correspond to entire paragraphs and were obtained automatically. Low-level tokenization (whitespace and punctuation) was done automatically and then hand-corrected. Splitting of fused tokens into syntactic words in Arabic is part of morphological analysis: ElixirFM provided the context-independent options, which were then disambiguated manually.|
|FORM||The unvocalized surface form is used. Its fully vocalized counterpart can be found in the MISC column as the Vform attribute.|
|LEMMA||Plausible analyses provided by ElixirFM, followed by manual disambiguation. Lemmas are vocalized. Lemma selection also involved word sense disambiguation of the lexemes, which provides their English equivalents (see the Gloss attribute of the MISC column).|
|UPOSTAG||Converted automatically from XPOSTAG (via Interset); human checking of patterns revealed by automatic consistency tests.|
|XPOSTAG||Manual selection from possibilities provided by ElixirFM.|
|FEATS||Converted automatically from XPOSTAG (via Interset); human checking of patterns revealed by automatic consistency tests.|
|HEAD||Original PADT annotation is manual. Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests.|
|DEPREL||Original PADT annotation is manual. Automatic conversion to UD; human checking of patterns revealed by automatic consistency tests.|
|DEPS||— (currently unused)|
|MISC||Information about token spacing taken from PADT annotation. Additional word attributes provided by morphological analysis (i.e. ElixirFM rules + manual disambiguation): Vform (fully vocalized Arabic form), Translit (Latin transliteration of word form), LTranslit (Latin transliteration of lemma), Root (word root), Gloss (English translation of lemma).|
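The MISC attributes listed in the table can be read with a few lines of Python. The following is a minimal sketch, assuming the standard CoNLL-U layout of ten tab-separated columns and `|`-separated `Key=Value` pairs in MISC. The function names `parse_misc` and `parse_token_line` are hypothetical helpers, and the sample line uses made-up placeholder values rather than real treebank data.

```python
# Column names in CoNLL-U order, as used in the table above.
COLUMNS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
           "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_misc(misc: str) -> dict:
    """Turn a MISC value such as 'Vform=...|Gloss=...' into a dict."""
    if not misc or misc == "_":
        return {}
    attrs = {}
    for item in misc.split("|"):
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = item.partition("=")
        attrs[key] = value if sep else ""
    return attrs


def parse_token_line(line: str) -> dict:
    """Parse one CoNLL-U token line into a dict keyed by column name."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(cols)}")
    token = dict(zip(COLUMNS, cols))
    token["MISC"] = parse_misc(token["MISC"])
    return token


# Made-up placeholder line (not taken from the treebank).
sample = "\t".join([
    "1", "FORM", "LEMMA", "NOUN", "_", "_", "0", "root", "_",
    "Vform=VFORM|Translit=TRANSLIT|LTranslit=LTRANSLIT|Root=ROOT|Gloss=book",
])
token = parse_token_line(sample)
print(token["FORM"], token["MISC"].get("Vform"), token["MISC"].get("Gloss"))
```

Attributes that are absent for a given word are simply missing from the dictionary, so `dict.get` is a safe way to look them up.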
We wish to thank all of the contributors to the original PADT annotation effort, including Otakar Smrž, Jan Hajič, Petr Zemánek, Petr Pajas, Jan Šnaidauf, Emanuel Beška, Jakub Kráčmar, and Kamila Hassanová.
Further corrections of additional data (not part of PADT release 1.0) were done by Shadi Saleh and Zdeněk Žabokrtský.
- Jan Hajič, Otakar Smrž, Petr Zemánek, Petr Pajas, Jan Šnaidauf, Emanuel Beška, Jakub Kráčmar, Kamila Hassanová. 2009. Prague Arabic Dependency Treebank 1.0, LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague, http://hdl.handle.net/11858/00-097C-0000-0001-4872-3.
- Otakar Smrž, Viktor Bielický, Iveta Kouřilová, Jakub Kráčmar, Jan Hajič, Petr Zemánek. 2008. Prague Arabic Dependency Treebank: A Word on the Million Words. In: Proceedings of the Workshop on Arabic and Local Languages (LREC 2008), pp. 16–23. Marrakech, Morocco.
- Added enhanced relations with case information.
- Added empty nodes to enhanced graphs (but orphans are just converted to dep).
- Added enhanced relations around relative clauses.
- Distinguishing SCONJ from CCONJ.
- Fixed various bugs found by the new UD validator.
- Fixed partial word forms of those multiword tokens where original morphological analysis was not disambiguated but existed.
- Repository renamed from UD_Arabic to UD_Arabic-PADT.
- Added enhanced representation of dependencies propagated across coordination. The distinction of shared and private dependents is derived deterministically from the original Prague annotation.
- Prepositional objects are now obl:arg.
- Fixed relative pronouns that were attached as 'cc' to their antecedents.
- Multi-word prepositions annotated with the 'fixed' relation.
- Converted to UD v2 guidelines.
- Reconsidered PRON vs. DET.
- Improved advmod vs. obl distinction.
- Changed train-dev-test split to be compatible with UD_Arabic-NYUAD, which in turn follows Diab et al. (Mona Diab, Nizar Habash, Owen Rambow, Ryan Roth. 2013. LDC Arabic treebanks and associated corpora: Data divisions manual. arXiv preprint arXiv:1309.5652.) Some documents appear both in UD_Arabic and in UD_Arabic-NYUAD, and we want these to end up in the same section in both treebanks.
- Unvocalized surface forms are now the main word forms in the FORM column. Fused tokens are shown. Vocalized forms available as MISC attributes.
- Added lemmas, roots, transliteration and English glosses.
- The % symbols are now attached as
- Chains of auxiliaries have been removed as the negative copula لَيسَ / laysa is now treated as copula and not as auxiliary verb.
- Fixed adverbs that were attached as nmod; correct: advmod.
- Fixed sentence ids.
- Added the MISC attribute SpaceAfter=No.
- Improved conversion of AuxY.
- Modified version from HamleDT 3.0
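
Several entries above concern how surface text is represented: fused tokens are shown as multiword token ranges, spacing is recorded with `SpaceAfter=No`, and empty nodes appear in the enhanced graphs. The sketch below shows how these conventions combine when rebuilding the surface text of one sentence. It assumes only the general CoNLL-U conventions (range IDs such as `2-3` for fused tokens, decimal IDs such as `8.1` for empty nodes). The function name `detokenize` is hypothetical and the sample sentence is a made-up placeholder.

```python
def detokenize(conllu_sentence: str) -> str:
    """Rebuild surface text from the lines of one CoNLL-U sentence."""
    pieces = []
    skip_until = 0  # last word ID covered by the current fused token
    for line in conllu_sentence.splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:
            continue  # empty node: no surface form
        if "-" in tok_id:
            # Fused token: use its surface form, skip the words it covers.
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue  # syntactic word inside a fused token
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


def row(tok_id: str, form: str, misc: str = "_") -> str:
    """Build a placeholder token line with unused columns set to '_'."""
    return "\t".join([tok_id, form] + ["_"] * 7 + [misc])


# Made-up placeholder sentence: 'X', then fused token 'AB' (= 'A' + 'B'), then '.'
sentence = "\n".join([
    "# sent_id = example",
    row("1", "X"),
    row("2-3", "AB", "SpaceAfter=No"),
    row("2", "A"),
    row("3", "B"),
    row("4", "."),
])
print(detokenize(sentence))
```

The spacing attribute is read from the fused-token line rather than from the words it covers, because the component words never appear on the surface.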