This dataset is the Romanian version of the STS dataset. It is a high-quality translation of the aforementioned dataset, containing 8628 pairs of sentences with their similarity score. The dataset respects the same split: 5749 train, 1500 dev and 1379 test sentence pairs.
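The dataset is offered in two flavours: a text-similarity dataset and a RO-EN parallel corpus. In the text-similarity files, each line holds a similarity score followed by the two sentences, for example: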
1.5 Un bărbat cântă la harpă. Un bărbat cântă la claviatură.
1.8 O femeie taie cepe. O femeie taie tofu.
3.5 Un bărbat merge pe o bicicletă electrică. Un bărbat merge pe bicicletă.
2.2 Un bărbat cântă la tobe. Un bărbat cântă la chitară.
2.2 Un bărbat cântă la chitară. O doamnă cântă la chitară.
The train/dev/test splits are identical to the original English STS corpus splits.
Direct download links:
- As a single zip file containing everything: RO-STS.text-similarity.zip
- Separate files: RO-STS.train.tsv, RO-STS.dev.tsv, RO-STS.test.tsv
More information in the dataset folder.
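As a minimal sketch of reading the text-similarity files, the snippet below parses rows into typed tuples. It assumes the `.tsv` files are tab-separated with no header row and three columns in the order shown in the samples above (score, first sentence, second sentence); `StsPair` and `read_sts_rows` are hypothetical names introduced for illustration.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class StsPair(NamedTuple):
    """One sentence pair with its similarity score (hypothetical helper type)."""
    score: float
    sentence1: str
    sentence2: str


def read_sts_rows(lines: Iterable[str]) -> List[StsPair]:
    """Parse tab-separated rows of the assumed form: score, sentence 1, sentence 2."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        score, sentence1, sentence2 = row
        pairs.append(StsPair(float(score), sentence1, sentence2))
    return pairs


# In-memory sample built from two of the rows shown above (tab separator assumed).
sample = (
    "1.5\tUn bărbat cântă la harpă.\tUn bărbat cântă la claviatură.\n"
    "1.8\tO femeie taie cepe.\tO femeie taie tofu.\n"
)
pairs = read_sts_rows(io.StringIO(sample))
```

To read a real file, pass an open file handle instead, for example `open("RO-STS.train.tsv", encoding="utf-8", newline="")`.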
The parallel corpus is a direct result of the translation process. It can be used as-is in any other downstream NLP task. It is split into 3 train/dev/test pairs of `ro` & `en` files, totaling 6 files. It is formatted in the standard one-sentence-per-line layout.
Direct download link, as a single zip file containing all the `ro-en` files: RO-STS.ro-en.zip
For more information and the unzipped files go to the dataset folder.
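Because the parallel files are one sentence per line, line N of a `ro` file corresponds to line N of the matching `en` file. The sketch below pairs them up and fails loudly if the two sides differ in length. The function name `align_parallel` is hypothetical, and the English sentences in the example are illustrative translations written for this snippet, not lines copied from the corpus.

```python
from typing import Iterable, List, Tuple


def align_parallel(ro_lines: Iterable[str], en_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair Romanian and English sentences by line position."""
    ro = [line.rstrip("\n") for line in ro_lines]
    en = [line.rstrip("\n") for line in en_lines]
    if len(ro) != len(en):
        # A length mismatch means the files are not line-aligned.
        raise ValueError(f"line count mismatch: {len(ro)} ro vs {len(en)} en")
    return list(zip(ro, en))


# Illustrative in-memory input; with real data, pass two open file handles.
ro_sample = ["O femeie taie cepe.\n", "Un bărbat cântă la chitară.\n"]
en_sample = ["A woman is cutting onions.\n", "A man is playing the guitar.\n"]
aligned = align_parallel(ro_sample, en_sample)
```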
We provide 2 baselines for this dataset, a transformer-based model and a recurrent neural network model. Both models were trained on the train set until the Pearson score on the dev set stopped improving, and results are reported on both dev and test sets.
Model | # of parameters | Dev-set Pearson | Dev-set Spearman | Test-set Pearson | Test-set Spearman |
---|---|---|---|---|---|
RNN | 16.7M | 0.7342 | 0.7349 | 0.6744 | 0.6662 |
Romanian BERT v1 (uncased) | 124M | 0.8459 | 0.8426 | 0.8159 | 0.8086 |
Romanian BERT v1 (cased) | 124M | 0.8426 | 0.8409 | 0.7911 | 0.7826 |
Multilingual BERT (uncased) | 167M | 0.8237 | 0.8235 | 0.7690 | 0.7650 |
Multilingual BERT (cased) | 167M | 0.8071 | 0.8077 | 0.7664 | 0.7641 |
For more details on how to reproduce these scores please check out the detailed evaluation page.
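For readers unfamiliar with the two metrics in the table, here is a dependency-free sketch of Pearson and Spearman correlation between gold and predicted scores. It is not the evaluation script used for the reported numbers; the function names are hypothetical, and the `predicted` values are made up for illustration. Spearman is computed as Pearson over average ranks, so tied scores share a rank.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [1.5, 1.8, 3.5, 2.2, 2.2]       # scores from the sample rows above
predicted = [1.0, 2.0, 4.0, 2.5, 3.0]  # hypothetical model outputs
print(f"Pearson:  {pearson(gold, predicted):.4f}")
print(f"Spearman: {spearman(gold, predicted):.4f}")
```

If SciPy is available, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` compute the same coefficients.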
The dataset was created in three steps:
- Automatic translation with Google's translation service.
- Correction round by a person who fixed all errors resulting from the automatic translation, of which there were plenty.
- Validation by a different person who double-checked the translation.
This process has ensured the high-quality translation of this dataset.
Here are the annotators/contributors, alphabetically listed:
- Adriana STAN
- Andrei PRUTEANU
- Andrei-Marius AVRAM
- Beáta LŐRINCZ
- Madalina CHITEZ
- Mihai ILIE
- Petru REBEJA
- Razvan PASCANU
- Roxana ROGOBETE
- Stefan Daniel DUMITRESCU
- Viorica PATRAUCEAN
This work, like its original, is licensed as CC BY-SA 4.0. That means you're free to do anything you want with it, as long as you give attribution and keep the same license.
Dumitrescu, S. D., Rebeja, P., Lorincz, B., Gaman, M., Avram, A., Ilie, M., Pruteanu, A., Stan, A., Rosia, L., Iacobescu, C., & others. (2021). Liro: Benchmark and leaderboard for romanian language tasks. Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
@inproceedings{dumitrescu2021liro,
title={Liro: Benchmark and leaderboard for romanian language tasks},
author={Dumitrescu, Stefan Daniel and Rebeja, Petru and Lorincz, Beata and Gaman, Mihaela and Avram, Andrei and Ilie, Mihai and Pruteanu, Andrei and Stan, Adriana and Rosia, Lorena and Iacobescu, Cristina and others},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
year={2021}
}