This repo is meant as a space to centralize Romanian Transformers and to provide a uniform evaluation. Contributions are welcome.
We're using HuggingFace's Transformers lib, an awesome tool for NLP. What's BERT, you ask? Here's a clear and condensed article about what BERT is and what it can do. Also check out this summary of different transformer models.
What follows is the list of Romanian transformer models, both masked and causal language models.
Feel free to open an issue and add your model/eval here!
Model | Type | Size | Article/Citation/Source | Pre-trained / Fine-tuned | Release Date |
---|---|---|---|---|---|
dumitrescustefan/bert-base-romanian-cased-v1 | BERT | 124M | PDF / Cite | Pre-trained | Apr, 2020 |
dumitrescustefan/bert-base-romanian-uncased-v1 | BERT | 124M | PDF / Cite | Pre-trained | Apr, 2020 |
racai/distillbert-base-romanian-cased | DistilBERT | 81M | - | Pre-trained | Apr, 2021 |
readerbench/RoBERT-small | BERT | 19M | - | Pre-trained | May, 2021 |
readerbench/RoBERT-base | BERT | 114M | - | Pre-trained | May, 2021 |
readerbench/RoBERT-large | BERT | 341M | - | Pre-trained | May, 2021 |
dumitrescustefan/bert-base-romanian-ner | BERT | 124M | HF Space | Named Entity Recognition on RONECv2 | Jan, 2022 |
snisioi/bert-legal-romanian-cased-v1 | BERT | 124M | - | Legal documents on MARCELLv2 | Jan, 2022 |
readerbench/jurBERT-base | BERT | 111M | - | Legal documents | Oct, 2021 |
readerbench/jurBERT-large | BERT | 337M | - | Legal documents | Oct, 2021 |
Model | Type | Size | Article/Citation/Source | Pre-trained / Fine-tuned | Release Date |
---|---|---|---|---|---|
dumitrescustefan/gpt-neo-romanian-780m | GPT-Neo | 780M | not yet / HF Space | Pre-trained | Sep, 2022 |
readerbench/RoGPT2-base | GPT2 | 124M | - | Pre-trained | Jul, 2021 |
readerbench/RoGPT2-medium | GPT2 | 354M | - | Pre-trained | Jul, 2021 |
readerbench/RoGPT2-large | GPT2 | 774M | - | Pre-trained | Jul, 2021 |
NEW: Check out this HF Space to play with Romanian generative models: https://huggingface.co/spaces/dumitrescustefan/romanian-text-generation
Models are evaluated using the public Colab script available here. All reported results are the average of 5 runs with the same parameters. For larger models, where possible, a larger batch size was simulated by accumulating gradients, so that all models share the same effective batch size. Only standard models (not fine-tuned for a particular task) that fit in 16 GB of RAM are evaluated.
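For context, the gradient-accumulation trick can be sketched as follows. This is a minimal, illustrative PyTorch loop, not the actual evaluation script; the toy model, data, and `accum_steps` value are all placeholders:

```python
import torch
from torch import nn

# Toy stand-ins purely for illustration; the real script fine-tunes a
# transformer, not a linear layer.
model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
data = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(8)]

accum_steps = 4  # physical batch of 8 x 4 steps = effective batch of 32

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()  # scale so the sum matches one big batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Scaling each micro-batch loss by `accum_steps` makes the accumulated gradient numerically equivalent to a single update over the full effective batch.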
The tests cover the following fields, and, for brevity, we select a single metric from each field:
- Named Entity Recognition: on RONECv2 we report the test-set strict-match score. A model must correctly detect whether a word is an entity and tag it with the correct class.
- Part-of-Speech Tagging: on ro-pos-tagger we report the test-set UPOS F1 score. This test should reveal how well a model understands the language's structure.
- Semantic Textual Similarity: on RO-STS we report the test-set Pearson correlation coefficient. Given two sentences, the model must predict how similar in meaning they are, and the predicted scores are correlated with the human-annotated ones. This test should highlight how well a model can embed the meaning of a sentence.
- Emotion Detection: on REDv2, an emotion-detection dataset of Romanian tweets, we report the test-set Hamming loss in the multi-label classification setting (lower is better). This test should show how well a model can "understand" emotions in short texts.
- Perplexity: on the test split of wiki-ro we measure the perplexity of the causal (CLM-only) models, with a stride of 512 and a batch size of 4; a minimal sketch of this computation follows the list.
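Here is that sketch: a sliding-window perplexity computation in the style of HuggingFace's standard fixed-length-model recipe. The actual Colab script may differ in dataset loading, batching, and windowing details; the short `text` string stands in for the wiki-ro test split:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "readerbench/RoGPT2-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# In the real evaluation this is the full wiki-ro test split.
text = "Acesta este un text de test pentru calculul perplexității."
encodings = tokenizer(text, return_tensors="pt")

max_length = model.config.n_positions  # context window (1024 for GPT-2)
stride = 512
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # number of new tokens scored in this window
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask tokens that serve only as context
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)  # back to a summed negative log-likelihood
    prev_end = end
    if end == seq_len:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```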
Model | Type | Size | NER/EM_strict | RoSTS/Pearson | Ro-pos-tagger/UPOS F1 | REDv2/hamming_loss |
---|---|---|---|---|---|---|
dumitrescustefan/bert-base-romanian-cased-v1 | BERT | 124M | 0.8815 | 0.7966 | 0.982 | 0.1039 |
dumitrescustefan/bert-base-romanian-uncased-v1 | BERT | 124M | 0.8572 | 0.8149 | 0.9826 | 0.1038 |
racai/distillbert-base-romanian-cased | DistilBERT | 81M | 0.8573 | 0.7285 | 0.9637 | 0.1119 |
readerbench/RoBERT-small | BERT | 19M | 0.8512 | 0.7827 | 0.9794 | 0.1085 |
readerbench/RoBERT-base | BERT | 114M | 0.8768 | 0.8102 | 0.9819 | 0.1041 |
Model | Type | Size | NER/EM_strict | RoSTS/Pearson | Ro-pos-tagger/UPOS F1 | REDv2/hamming_loss | Perplexity |
---|---|---|---|---|---|---|---|
readerbench/RoGPT2-base | GPT2 | 124M | 0.6865 | 0.7963 | 0.9009 | 0.1068 | 52.34 |
readerbench/RoGPT2-medium | GPT2 | 354M | 0.7123 | 0.7979 | 0.9098 | 0.114 | 31.26 |
Using HuggingFace's Transformers lib, instantiate a model (replacing the model name as needed), then use the appropriate model head for your task. Here are a few examples:
```python
from transformers import AutoTokenizer, AutoModel
import torch

# load the tokenizer and the bare encoder (no task head)
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence and run it through the model
input_ids = tokenizer.encode("Acesta este un test.", add_special_tokens=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)

# get the encoding: one contextual vector per input token
last_hidden_states = outputs.last_hidden_state
```
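To attach a task head instead of the bare encoder, swap `AutoModel` for the matching `AutoModelFor*` class. A hypothetical sequence-classification setup might look like this (the `num_labels` value is purely illustrative, and the head is randomly initialized until you fine-tune it on your own labeled data):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "dumitrescustefan/bert-base-romanian-cased-v1",
    num_labels=3,  # illustrative: e.g. a hypothetical 3-class task
)
```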
- For dumitrescustefan/* models, remember to correct the ș/ț diacritics before feeding text to the model (these models were trained only with the correct, comma-below diacritics; the cedilla ş and ţ are seen as UNKs and will decrease overall performance):
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
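Equivalently, the same mapping can be applied in a single pass with a translation table (just a stylistic alternative to the chained `replace` calls):

```python
CEDILLA_TO_COMMA = str.maketrans("şţŞŢ", "șțȘȚ")
text = text.translate(CEDILLA_TO_COMMA)
```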
Give a prompt to a generative model and let it write:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/gpt-neo-romanian-125m")
model = AutoModelForCausalLM.from_pretrained("dumitrescustefan/gpt-neo-romanian-125m")

# encode the prompt, then sample a continuation
input_ids = tokenizer.encode("Cine a fost Mihai Eminescu? A fost", return_tensors="pt")
output = model.generate(input_ids, max_length=128, do_sample=True, no_repeat_ngram_size=2, top_k=50, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
P.S. You can test all generative models here: https://huggingface.co/spaces/dumitrescustefan/romanian-text-generation
- While this repo started back in 2020 as an in-depth look at a single transformer model, with the express hope that more models would quickly follow, it turned out that training a good model is not that easy: curating the data and securing sufficient compute take a lot of effort. Listing just a couple of models no longer feels useful, so it makes more impact to list all the Romanian-only models I could find that have a minimal level of performance/documentation. Here you go :)
- This repo used to contain code to download and clean a Romanian corpus. I have removed it, as OSCAR is now offered directly on HuggingFace (in a new version) and OPUS's API no longer works as it should (some manual filtering is now required, not to mention that new resources are added constantly) - maintaining this code is thus not really feasible.
- Please contribute to this repo with new Romanian models you might find, or with citations or updates to existing models.