This is the repository for the TransMI framework, which aims to directly build strong baselines from existing mPLMs for transliterated data. In this work, we only consider mPLMs that use SentencePiece Unigram tokenizers; specifically, we consider three models: XLM-R, Glot500, and Furina. We apply three merge modes (Min-Merge, Average-Merge, and Max-Merge; sketched below) to each model and evaluate the resulting models on both the original (non-transliterated) evaluation datasets and their transliterated counterparts, where we use Uroman to transliterate the original texts into a common script, Latin.
Paper on arXiv: TransMI
Models available on HuggingFace:
xlm-r-with-transliteration-average, xlm-r-with-transliteration-min, xlm-r-with-transliteration-max
glot500-with-transliteration-average, glot500-with-transliteration-min, glot500-with-transliteration-max
furina-with-transliteration-average, furina-with-transliteration-min, furina-with-transliteration-max
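The three merge modes differ only in how the Unigram score of a newly added Latin-script token is derived from the original subwords that transliterate to it. Below is a minimal, illustrative sketch of that aggregation; the function name merge_scores and the toy inputs are hypothetical and not part of transmi.py.

from statistics import mean

def merge_scores(original_scores, merge_mode):
    # Combine the Unigram log-probabilities of all original subwords whose
    # transliteration collapses to the same Latin-script token.
    if merge_mode == 'min':
        return min(original_scores)
    if merge_mode == 'average':
        return mean(original_scores)
    if merge_mode == 'max':
        return max(original_scores)
    raise ValueError(f'unknown merge mode: {merge_mode}')

# Toy example: two original subwords that romanize to the same Latin form.
print(merge_scores([-9.1, -10.4], 'max'))  # -9.1 under Max-Merge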
Run the following command to create the corresponding tokenizer and the TransMI-modified model in Max-Merge mode. The tokenizer and the model will be stored at ./models/xlm-roberta-base-with-transliteration-max
python transmi.py \
--save_path './models' \
--model_name 'xlm-roberta-base' \
--merge_mode 'max'
Load the model by specifying the saved path with the from_pretrained method. For example, to load the tokenizer and the model stored above:
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer
MODEL_PATH = './models/xlm-roberta-base-with-transliteration-max'
model = XLMRobertaForMaskedLM.from_pretrained(MODEL_PATH)
tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_PATH)
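Continuing from the snippet above, the loaded model can be sanity-checked with masked-token prediction on Latin-script input; the example sentence is only illustrative and not taken from the evaluation data.

from transformers import pipeline

# Illustrative check: fill a masked token in a Latin-script sentence
# using the model and tokenizer loaded above.
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(fill_mask('Paris is the capital of <mask>.'))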
Please refer to Glot500 and SIB200 for downloading the datasets used for evaluation. The scripts used to run each evaluation experiment are included in the corresponding directories in this repo.
If you find our code or models useful for your research, please consider citing:
@article{liu2024transmi,
title={TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data},
author={Yihong Liu and Chunlan Ma and Haotian Ye and Hinrich Sch{\"u}tze},
journal={arXiv preprint arXiv:2405.09913},
year={2024}
}
This repository is built on top of xtreme, Glot500 and Furina.