Summary

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank.

Introduction

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank (the treebank of the Faculty of Computer Science, ”AL. I. Cuza” University, Iași, Romania). This is a balanced treebank. Its contemporary standard part (Perez, 2014) was included in the UD-Romanian-RRT Treebank. Since 2015, the UAIC Treebank has been extended with several nonstandard language genres: Old Romanian, chat, and folklore (Mărănduc 2015, 2016, 2017c, 2018, Perez 2016), on the view that nonstandard language is used more than the standard one. The digitization of cultural heritage covers old texts as well as folklore, which is an oral phenomenon threatened with extinction (Mărănduc, 2017b).

The UAIC-RoDia Treebank (ISLRN 156-635-615-024-0) now has 19,000 sentences in its basic format.

For the first release, we converted a part of the New Testament from Alba Iulia (1648), 916 sentences, to the UD format. It is the first New Testament printed in Romanian, in Cyrillic letters. The Latin-alphabet text was obtained with an OCR program built at the Institute of Mathematics and Computer Science of Chișinău, Republic of Moldova, by a group of researchers led by Alexander Colesnicov and Ludmila Malahov (Colesnicov 2016, Cojocaru 2017).

The second part of the first release consists of 284 sentences of folklore in verse: 230 sentences from Romania and 54 from the Republic of Moldova (where the Romanian language is also spoken) (Bobicev 2016).

For the second release, we finished converting the first part of the New Testament (1648) to the UD format: all the prefaces and the four Gospels, 5,172 sentences in total, including the 916 from the first release.

Also, the contribution of the Republic of Moldova is now 902 folklore sentences, including the 54 sentences from the first release. The treebank now has 1,132 folklore sentences (230 from Romania and 902 from the Republic of Moldova).
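
The sentence counts quoted above can be checked against a release file. The following is a minimal sketch, assuming the data is distributed in the standard CoNLL-U layout (comment lines start with `#`, sentences are separated by blank lines). The function name `count_sentences` and the inline sample are hypothetical and are not part of the treebank tooling.

```python
def count_sentences(conllu_text: str) -> int:
    """Count sentences in CoNLL-U text.

    A sentence is a block of token lines terminated by a blank line;
    comment lines (starting with '#') do not open a sentence by themselves.
    """
    count = 0
    in_sentence = False
    for raw_line in conllu_text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            in_sentence = False
        elif not line.startswith("#") and not in_sentence:
            # First token line of a new sentence.
            in_sentence = True
            count += 1
    return count


# Hypothetical two-sentence sample with placeholder tokens (not treebank data).
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    "# text = a b",
    "\t".join(["1", "a", "a", "X", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "b", "b", "X", "_", "_", "1", "dep", "_", "_"]),
    "",
    "# sent_id = demo-2",
    "# text = c",
    "\t".join(["1", "c", "c", "X", "_", "_", "0", "root", "_", "_"]),
    "",
])

print(count_sentences(SAMPLE))  # prints 2
```

To apply it to a real file, read the file as UTF-8 text and pass its contents to `count_sentences`.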

Contributors

Cătălina Mărănduc - The New Testament 1648; revision of all the data. (Affiliation: Faculty of Computer Science, ”AL. I. Cuza” University, Iași, http://www.info.uaic.ro/bin/Main/, and the Academic Institute of Linguistics ”Iorgu Iordan - Al. Rosetti”, Bucharest, www.lingv.ro/, Romania)

Cenel-Augusto Perez - Romanian Folklore, 230 sentences. (Affiliation: ”AL. I. Cuza” University, www.uaic.ro/, Iași, Romania)

Victoria Bobicev - Moldovan Folklore. (Affiliation: Technical University of Moldova, utm.md/, Chișinău, Republic of Moldova)

Cătălin Mititelu (Ro) - Author of TREEOPS, an XML converter from the UAIC format to the UD format (Colhon 2017, Mărănduc 2018).

Florinel Hociung (Ro) - Author of the Treebank Multiformat Annotator interface (Mărănduc 2017a).

Valentin Roșca (Ro) - Author of the XML-to-CoNLLU converter, which carries all the data specific to our treebank over into the CoNLLU files.

Sources

The New Testament. Printed for the first time in Romanian in 1648 by Simion Stephen, Metropolitan of Transylvania.

Grigore C. Bostan, The Romanian folk poetry in the Carpathian-Nistrian. Iași: Cantes, 1998, 280 p.

References

Bobicev, Victoria, Tudor Bumbu, Victoria Lazu, Victoria Maxim, Daniela Istrati, 2016. Folk poetry for computers: Moldovan Codri’s ballads parsing. In Proceedings of the 12th International Conference “Linguistic Resources and Tools for Processing the Romanian Language”, pp. 39-50.

Cojocaru, Svetlana, Alexander Colesnicov, and Ludmila Malahov, 2017. Digitization of Old Romanian Texts Printed in the Cyrillic Script. In Proceedings of the International Conference on Digital Access to Textual Cultural Heritage, pp. 143–148.

Colhon, Mihaela, Cătălina Mărănduc and Cătălin Mititelu, 2017. A Multiform Balanced Dependency Treebank for Romanian, in Proceedings of Knowledge Resources for the Socio-Economic Sciences and Humanities, (KnowRSH), Varna, Bulgaria September 8, 2017 workshop at the Recent Advances in Natural Language Processing (RANLP) p. 9-19.

Colesnicov, Alexander, Ludmila Malahov, Tudor Bumbu, 2016. Digitization of Romanian Printed Texts of the 17th Century. In Proceedings of the 12th International Conference “Linguistic Resources and Tools for Processing the Romanian Language”, pp. 3-11.

Mărănduc, Cătălina, Cenel-Augusto Perez, 2015. A Romanian Dependency Treebank. In International Journal of Computational Linguistics and Applications, vol. 6, no. 2, issue July-December 2015, pp. 25–40.

Mărănduc, Cătălina, Ludmila Malahov, Cenel-Augusto Perez, Alexander Colesnicov, 2016. RoDia project of a regional and historical corpus for Romanian. In Proceedings of MFOI, Chisinau, pp. 268-284.

Mărănduc, Cătălina, Florinel Hociung, Victoria Bobicev, 2017a. Treebank Annotator for multiple formats and conventions. In Proceedings of the 4th Conference of Mathematical and Computer Science Society of the Republic of Moldova, Chisinau, Republic of Moldova, June 28 – July 2, 2017, pp. 529-534.

Mărănduc, Cătălina, Victoria Bobicev and Cenel-Augusto Perez, 2017b. Tools for Building a Corpus to Study the Historical and Geographical Variation of the Romanian Language. In Proceedings of Language technology for Digital Humanities in Central and (South-) Eastern Europe (LT4DH-CEE 2017), Varna, Bulgaria, September 8, 2017, workshop at the Recent Advances in Natural Language Processing (RANLP) conference, pp. 10-20.

Mărănduc, Cătălina, Victoria Bobicev, 2017c. Non Standard Treebank Romania – Republic of Moldova in the Universal Dependencies. In Proceedings of Conference on Mathematical Foundations of Informatics (MFOI-2017), November 9–11, 2017, Chisinau, Moldova, pp. 111-116.

Mărănduc, Cătălina, Cătălin Mititelu, Victoria Bobicev, 2018. Syntactic Semantic Correspondence in Dependency Grammar. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, Prague, Jan. 23-24.

Perez, Cenel-Augusto, 2014. Linguistic Resources for Natural Language Processing. (PhD thesis). Al. I. Cuza University, Iași.

Perez, Cenel-Augusto, Cătălina Mărănduc, and Radu Simionescu, 2016. Social media – processing Romanian chats and discourse analysis. Computación y Sistemas 20(3):404–414.

Changelog

  • 2018-07-01 v2.2
    • More data both in the 1648 translation of the New Testament and in the Moldovan folklore.
  • 2017-11-15 v2.1
    • Initial release in Universal Dependencies.

Metadata

=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.1
License: CC BY-SA 4.0
Includes text: yes
Genre: bible poetry
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Mărănduc, Cătălina; Perez, Cenel-Augusto; Bobicev, Victoria; Mititelu, Cătălin; Hociung, Florinel; Roșca, Valentin
Contributing: elsewhere
Contact: catalinamaranduc@gmail.com, perez_cenel_augusto@yahoo.com, victoria.bobicev@gmail.com
===============================================================================