
Summary

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank (ISLRN 156-635-615-024-0).

Introduction

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank (the treebank of the Faculty of Computer Science, “Al. I. Cuza” University, Iași, Romania). This is a balanced treebank. Its contemporary standard part (Perez, 2014) was included in the UD-Romanian-RRT Treebank. Since 2015, the UAIC Treebank has been extended with several nonstandard language genres: Old Romanian, chat, and folklore (Mărănduc 2015, 2016, 2017c, 2018, Perez 2016), on the view that nonstandard language is used more widely than the standard one. The digitization of cultural heritage covers old texts as well as folklore, which is an oral phenomenon threatened with extinction (Mărănduc, 2017b).

As of March 2020, the UAIC-RoDia Treebank (ISLRN 156-635-615-024-0) contains 34,794 sentences in its basic format.

For the first release, we converted to the UD format a part of the New Testament from Alba Iulia (1648), 916 sentences. It is the first New Testament printed in Romanian, in Cyrillic letters. The Latin-alphabet text was obtained with an OCR program built at the Institute of Mathematics and Computer Science of Chișinău, Republic of Moldova, by a group of researchers led by Alexander Colesnicov and Ludmila Malahov (Colesnicov 2016, Cojocaru 2017).

The second part of the first release consists of 284 sentences of folklore in verse: 230 sentences from Romania and 54 from the Republic of Moldova, where Romanian is also spoken (Bobicev 2016).

For the second release, we finished converting the first part of the New Testament (1648) to the UD format: all the prefaces and the four Gospels, 5,172 sentences in total, including the 916 from the first release.

The third release covers the entire Alba Iulia New Testament (1648).

The following release adds Flower of Gifts, Moldavian Ballads, and Romanian Ballads.

The contribution of the Republic of Moldova now amounts to 1,805 folklore sentences.

On 23 September 2019 we added a new sub-corpus, Caragea's Law (1818). In May 2020 we added the whole book by Dosoftei, “David's Psalms translation with rhymes” (1673), and the first part of Ion Neculce's “Chronicle” (1743). In October 2020 we added 1,000 sentences of “Romanian Ballads”. The folklore is at the beginning of the train document, but 50 sentences are at the end of the test and dev documents. Also in October 2020 we added the rest of Ion Neculce's “Chronicle” (1743).
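The sentence counts quoted above can be checked against the released files. The following is a minimal sketch, assuming only the general CoNLL-U layout (comment lines start with `#`, token lines follow, and a blank line ends each sentence). The function name `count_sentences` and the sample text are illustrative and are not part of the treebank or its tools.

```python
from typing import Iterable


def count_sentences(lines: Iterable[str]) -> int:
    """Count sentences in CoNLL-U content (token blocks separated by blank lines)."""
    count = 0
    in_sentence = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                count += 1
                in_sentence = False
        elif not line.startswith("#"):
            # Any non-blank, non-comment line is treated as a token line.
            in_sentence = True
    # Count a final sentence that is not followed by a blank line.
    if in_sentence:
        count += 1
    return count


# Tiny hand-made sample for illustration (not taken from the treebank).
SAMPLE = """# sent_id = demo-1
1\tA\ta\tDET\t_\t_\t2\tdet\t_\t_
2\tb\tb\tNOUN\t_\t_\t0\troot\t_\t_

# sent_id = demo-2
1\tc\tc\tNOUN\t_\t_\t0\troot\t_\t_
"""

print(count_sentences(SAMPLE.splitlines()))
```

To apply it to a real file, open the train, dev, or test file with `encoding="utf-8"` and pass the file object to `count_sentences`; the exact file names depend on the release and are not stated here.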

Contributors

Cătălina Mărănduc - The New Testament 1648; Flower of Gifts; Caragea's Law; Romanian Folk Ballads. Revision of all the data. (Affiliation: Faculty of Computer Science, “Al. I. Cuza” University, Iași, http://www.info.uaic.ro/bin/Main/ and the Academic Institute of Linguistics “Iorgu Iordan - Al. Rosetti”, Bucharest, www.lingv.ro/, Romania)

Cenel Augusto Perez - Romanian Folklore, 230 sentences. (Affiliation: “Al. I. Cuza” University, www.uaic.ro/, Iași, Romania)

Victoria Bobicev - Moldovan Folklore; Moldavian Ballads. Training of the UD parsers on this corpus. (Affiliation: Technical University of Moldova, utm.md/, Chișinău, Republic of Moldova)

Cătălin Mititelu (Ro) - Author of TREEOPS, an XML converter from the UAIC format to the UD format (Colhon 2017, Mărănduc 2018).

Florinel Hociung (Ro) - Author of the Treebank Multiformat Annotator interface (Mărănduc 2017a).

Valentin Roșca (Ro) - Author of the XML-to-CoNLL-U converter, which carries all the specific data of our treebank into the CoNLL-U files.

Roman Untilov - Training of the UD parsers and other tools on this corpus and on the Romanian language. The Folklore corpus was made a separate train corpus, but a small part of it also appears in the test and dev corpora. (Affiliation: Technical University of Moldova, utm.md/, Chișinău, Republic of Moldova)

Petru Rebeja - Work on the CoNLL-U converter and on the POS-tagger. (Affiliation: Faculty of Computer Science, “Al. I. Cuza” University, Iași, Romania)

Sources

(1592) Flower of gifts. [Anonymous translation]. In The Oldest Popular Books in Romanian Literature, I. (Coordinators: Ion Ghetie and Alexandru Mareş). Bucharest, Minerva Publishing House, 1996, p. 119–182.

(1648) The New Testament. Printed for the first time in Romanian at 1648 by Simion Stephen, Metropolitan of Transylvania.

(1673) Dosoftei, Psalms of David, translation with rhymes.

Grigore C. Bostan, The Romanian folk poetry in the Carpathian-Nistrian. Iași: Cantes, 1998, 280 p.

Folklore from the Codri parts, Academy of Sciences of the Republic of Moldova, 1962

Romanian folk ballads, I. Anthology by Al. Amzulescu. Bucharest, Publishing House for Literature, 1964.

Caragea Voievod's Law, printed at the typography from Cismeaua Rosie, Bucharest, 1818.

Ion Neculce, Works, “The Chronicle of the Country of Moldova”, “A lot of words” (1743), edition by Gabriel Ștrempel, 1982.

References

Bobicev, Victoria, Tudor Bumbu, Victoria Lazu, Victoria Maxim, Daniela Istrati, 2016. Folk poetry for computers: Moldovan Codri’s ballads parsing. Proceedings of the 12th International Conference “Linguistic Resources and Tools for Processing the Romanian Language”, pp. 39-50.

Cojocaru, Svetlana, Alexander Colesnicov, and Ludmila Malahov, 2017. Digitization of Old Romanian Texts Printed in the Cyrillic Script. In Proceedings of International Conference on Digital Access to Textual Cultural Heritage. pages 143–148.

Colhon, Mihaela, Cătălina Mărănduc and Cătălin Mititelu, 2017. A Multiform Balanced Dependency Treebank for Romanian, in Proceedings of Knowledge Resources for the Socio-Economic Sciences and Humanities, (KnowRSH), Varna, Bulgaria September 8, 2017 workshop at the Recent Advances in Natural Language Processing (RANLP) p. 9-19.

Colesnicov, Alexander, Ludmila Malahov, Tudor Bumbu, 2016. Digitization of Romanian Printed Texts of the 17th Century. Proceedings of the 12th International Conference “Linguistic Resources and Tools for Processing the Romanian Language”, pp. 3-11.

Mărănduc, Cătălina, Perez, Cenel-Augusto, 2015. A Romanian Dependency Treebank. In International Journal of Computational Linguistics and Applications, vol. 6, no. 2, issue July-December 2015, p. 25–40.

Mărănduc, Cătălina, Malahov, Ludmila, Perez, Cenel-Augusto, Colesnicov, Alexander, 2016. “RoDia project of a regional and historical corpus for Romanian”, in Proceedings of MFOI, Chisinau, p. 268-284.

Cătălina Mărănduc, Florinel Hociung, Victoria Bobicev, 2017a. Treebank Annotator for multiple formats and conventions. In Proceedings of The 4th Conference of Mathematical and Computer Science Society of the Republic of Moldova, Chisinau, Republic of Moldova, June 28 – July 2, 2017, p. 529-534

Cătălina Mărănduc, Victoria Bobicev and Cenel-Augusto Perez, 2017b. Tools for Building a Corpus to Study the Historical and Geographical Variation of the Romanian Language. In Proceedings of Language Technology for Digital Humanities in Central and (South-) Eastern Europe (LT4DH-CEE 2017), Varna, Bulgaria, September 8, 2017, workshop at the Recent Advances in Natural Language Processing (RANLP) conference, p. 10-20.

Cătălina Mărănduc, Victoria Bobicev. 2017c. Non Standard Treebank Romania – Republic of Moldova in the Universal Dependencies, in Proceedings of Conference on Mathematical Foundations of Informatics (MFOI-2017) November 9–11, 2017, Chisinau, Moldova, pp. 111-116.

Cătălina Mărănduc, Cătălin Mititelu, Victoria Bobicev. 2018. Syntactic Semantic Correspondence in Dependency Grammar. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, Prague, Jan. 23-24.

Perez, Cenel-Augusto, 2014. Linguistic Resources for Natural Language Processing. (PhD thesis). Al. I. Cuza University, Iași.

Perez, Cenel-Augusto, Cătălina Mărănduc, and Radu Simionescu, 2016. Social media – processing romanian chats and discourse analysis. Computación y Sistemas 20(3):404–414.

Cătălina Mărănduc, Victoria Bobicev, Roman Untilov, “Syntactic Parser for Old and Regional Romanian”, presented at the 3rd DATeCH Conference, Brussels, May 2019.

Cătălina Mărănduc, Victoria Bobicev, Roman Untilov, Morpho-Syntactic Regularities in UD_Romanian-Nonstandard Parsing, In Proceedings of ConsILR, Cluj, 18-20 Nov. 2019, Iași, Al. I. Cuza University Publishing House, 2019.

Changelog

  • 2018-07-01 v2.2
    • More data both in the 1648 translation of the New Testament and in the Moldovan folklore.
  • 2017-11-15 v2.1
    • Initial release in Universal Dependencies.

Metadata

=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: UD v2.1
License: CC BY-SA 4.0
Includes text: yes
Genre: bible poetry
Lemmas: converted from manual
UPOS: converted from manual
XPOS: manual native
Features: converted from manual
Relations: converted from manual
Contributors: Mărănduc, Cătălina; Perez, Cenel-Augusto; Bobicev, Victoria; Mititelu, Cătălin; Hociung, Florinel; Roșca, Valentin; Untilov, Roman; Rebeja, Petru
Contributing: elsewhere
Contact: catalinamaranduc@gmail.com, perez_cenel_augusto@yahoo.com, victoria.bobicev@gmail.com
===============================================================================