This is the repository for the TransMI framework, which aims to directly build strong baselines from existing mPLMs for transliterated data. In this work, we only consider mPLMs that use SentencePiece Unigram tokenizers; specifically, we consider three models: XLM-R, Glot500, and Furina. We apply three merge modes (Min-Merge, Average-Merge, and Max-Merge; sketched below) to each model and evaluate the resulting models on both the original (non-transliterated) evaluation datasets and their transliterated counterparts, where we use Uroman to transliterate the original texts into a common script, Latin.
Paper on arXiv: TransMI
Models available on HuggingFace:
xlm-r-with-transliteration-average, xlm-r-with-transliteration-min, xlm-r-with-transliteration-max
glot500-with-transliteration-average, glot500-with-transliteration-min, glot500-with-transliteration-max
furina-with-transliteration-average, furina-with-transliteration-min, furina-with-transliteration-max
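The three merge modes differ only in how the Unigram score of a newly added Latin-script token is derived from the original subwords that transliterate to it. Below is a minimal, illustrative sketch of that aggregation; the function name merge_scores and the toy inputs are hypothetical and not part of transmi.py.

from statistics import mean

def merge_scores(original_scores, merge_mode):
    # Combine the Unigram log-probabilities of all original subwords whose
    # transliteration collapses to the same Latin-script token.
    if merge_mode == 'min':
        return min(original_scores)
    if merge_mode == 'average':
        return mean(original_scores)
    if merge_mode == 'max':
        return max(original_scores)
    raise ValueError(f'unknown merge mode: {merge_mode}')

# Toy example: two original subwords that romanize to the same Latin form.
print(merge_scores([-9.1, -10.4], 'max'))  # -9.1 under Max-Merge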
Run the following command to create the corresponding tokenizer and the TransMI-modified model in Max-Merge mode. The tokenizer and the model will be stored at ./models/xlm-roberta-base-with-transliteration-max
python transmi.py \
--save_path './models' \
--model_name 'xlm-roberta-base' \
--merge_mode 'max'
Load the model by specifying the saved path with the from_pretrained method. For example, to load the tokenizer and the model stored above:
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer
MODEL_PATH = './models/xlm-roberta-base-with-transliteration-max'
model = XLMRobertaForMaskedLM.from_pretrained(MODEL_PATH)
tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_PATH)
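Continuing from the snippet above, the loaded model can be sanity-checked with masked-token prediction on Latin-script input; the example sentence is only illustrative and not taken from the evaluation data.

from transformers import pipeline

# Illustrative check: fill a masked token in a Latin-script sentence
# using the model and tokenizer loaded above.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(fill_mask('Paris is the capital of <mask>.'))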
Please refer to Glot500 and SIB200 for downloading the datasets used for evaluation. The scripts used to run each evaluation experiment are included in the corresponding directories in this repo.
If you find our code or models useful for your research, please consider citing:
@article{liu2024transmi,
title={TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data},
author={Yihong Liu and Chunlan Ma and Haotian Ye and Hinrich Sch{\"u}tze},
journal={arXiv preprint arXiv:2405.09913},
year={2024}
}
This repository is built on top of xtreme, Glot500 and Furina.