adrianastan/rolex

RoLEX

An extended Romanian lexical dataset

RoLEX, the lexicon developed in ReTeRom, is a resource with 330,866 entries of the following tabular form:

word_form lemma MSD_tag syllabification lexical_stress phonetic_transcription

It has been thoroughly validated and corrected by hand, and it is the most extensive phonologically validated dataset currently available for Romanian.
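The six-field layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes one entry per line with tab-separated fields, which the description does not confirm, so adjust `sep` to match the actual file. The `Entry` class, the `parse_entries` function, and the sample row are hypothetical placeholders and are not taken from the dataset.

```python
from dataclasses import dataclass, fields
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Entry:
    """One lexicon row; field order follows the tabular form in the README."""
    word_form: str
    lemma: str
    msd_tag: str
    syllabification: str
    lexical_stress: str
    phonetic_transcription: str


def parse_entries(lines: Iterable[str], sep: str = "\t") -> Iterator[Entry]:
    """Yield one Entry per non-empty line.

    ASSUMPTION: fields are separated by `sep` (tab by default).
    """
    expected = len(fields(Entry))
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        parts = line.split(sep)
        if len(parts) != expected:
            raise ValueError(
                f"line {line_no}: expected {expected} fields, got {len(parts)}"
            )
        yield Entry(*parts)


# Placeholder row, NOT real RoLEX data: it only illustrates the column order.
sample = ["FORM\tLEMMA\tMSD\tSYLLABLES\tSTRESS\tPHONES\n"]
entries = list(parse_entries(sample))
```

To read a real file, pass an open file handle instead of `sample`, for example `parse_entries(open(path, encoding="utf-8"))`, where `path` points to your local copy of the lexicon.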

RoLEX is based on the textual component of an assembled speech corpus that draws on Romanian Wikipedia articles, news, interviews on contemporary subjects, talk shows, spontaneous speech, tales, novels, etc. The morphosyntactic and lemma information was added from TBL, a general, large-scale lexicon of the Romanian language (over 1.1 million entries) under development at RACAI. TBL was also used to include in RoLEX all the morphological variants of the lemmas encountered in the speech corpus. The other lexical information (syllabification, lexical stress, phonetic transcription) was partially based on RoSyllabiDict (Barbu, 2008) and MaRePhor (Toma et al., 2017), and partially predicted automatically with the front-end tool developed in (Stan et al., 2011). The aggregated lexicon then underwent a carefully designed automatic and manual correction process.

Additional resources, such as lexical information prediction models, will soon be added.

If you are using RoLEX in your work, please cite:

Beáta LŐRINCZ, Elena IRIMIA, Adriana STAN, Verginica BARBU MITITELU, RoLEX: The development of an extended Romanian lexical dataset and its evaluation at predicting concurrent lexical information, Natural Language Engineering, vol. 29, issue 3, 2022.

About

The Romanian Lexicon
