UD EWT treebank consists of different genres of new media. The treebank contains 5,536 trees, 68,868 tokens.
Estonian Web Treebank UD v2.8 consists of three parts. Its older part (1,662 trees, v2.4) is a converted version of the Estonian Web Treebank (EWT), originally annotated in the Constraint Grammar (CG) annotation scheme, and consisting of different genres of new media. The second part (1,495 trees, v2.6) consists of internet forum texts and has been annotated using Stanza parser, followed by manual post-editing. The third part (v2.8) has been annnotated in the same way. It consists of users' feedbacks to news about Covid19 pandemic in 2020-2021 (~12,725 tokens).
The treebank consists of 5,536 trees, 68,868 tokens. As for enhanced dependencies, the empty nodes for missing predicates have been added, and the relative pronoun is attached to its antecedent with the relation 'ref' but there are no other types of enhanced dependencies in this version.
The treebank has been divided to train, test and dev parts as 46,756; 13,156 and 8,956 tokens respectively.
The treebank covers unedited new media texts.
This work was financed by the National Programme for Estonian Language Technology and Estonian Ministery of Education and Research (grant 20-56 IUT20-56 "Computational models for Estonian").
- Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Eleri Aedmaa, Riin Kirt, Dage Särg. 2014. Estonian Dependency Treebank and its annotation scheme. In: Proceedings of the 13th Workshop on Treebanks and Linguistic Theories (TLT13), pp. 285–291, ISBN 978-3-9809183-9-8, Tübingen, Germany.
- Kadri Muischnek, Kaili Müürisep, Dage Särg. 2019. CG Roots of UD Treebank of Estonian Web Language. In Proceedings of the NoDaLiDa 2019 Workshop on Constraint Grammar-Methods, Tools and Applications, pp. 23-26, Turku, Finland
- UD v2.8: new texts added to the training corpus, annotation of numerals modified, enhanced annotation of relative pronouns added
- UD v2.7: new texts, extra annotation for typos, better tokenization and sentence segmentation
- UD v2.6: new internet forum texts (~15,000 tokens), 0-nodes in clauses.
- UD v2.4: automatic conversion from CG, manual reannotation.
=== Machine-readable metadata ================================================= Documentation status: stub Data source: semi-automatic Data available since: UD v2.4 License: CC BY-NC-SA 4.0 Includes text: yes Genre: blog web social Lemmas: converted from manual UPOS: converted from manual XPOS: converted from manual Features: converted from manual Relations: converted from manual Contributing: here Contributors: Muischnek, Kadri; Müürisep, Kaili; Puolakainen, Tiina; Särg, Dage; Eiche, Sandra Contact: firstname.lastname@example.org, email@example.com ===============================================================================