This repository contains the code for the paper [Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability](http://arxiv.org/abs/2203.10753).

## Requirements
- pytorch
- transformers
- jsonlines
- conllu (for preprocessing)
If running the regression analysis, additionally:
- statsmodels
- lang2vec
## Data

Download the CoNLL 2017 Wikipedia dump from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989, or fetch a single language directly with

    wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989/English-annotated-conll17.tar

replacing "English" with the name of any other language you need.
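The per-language download URLs all follow one pattern; a minimal sketch that prints the URLs for a hypothetical subset of languages (uncomment the `wget` line to actually download):

```shell
BASE="https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1989"
# Hypothetical subset; substitute the languages you actually need.
for lang in English French German; do
  url="${BASE}/${lang}-annotated-conll17.tar"
  echo "$url"
  # wget "$url"
done
```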
## Pretraining

- Preprocess to obtain the raw text by running, e.g., `scripts/preprocess_en/en00.sh`.
- Downsample the data by running `src/scripts/downsample_train_dev.sh`.
- Pretrain by running `scripts/train/pretrain_en.sh` (this pretrains on English only; remember to set all necessary constants, such as the output directory, beforehand).
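The constant names live inside `scripts/train/pretrain_en.sh` itself; a hedged sketch of preparing a run (the `OUTPUT_DIR` name here is an assumption, so match it to the variable actually used in the script):

```shell
# Assumption: an output-directory constant named OUTPUT_DIR; check
# scripts/train/pretrain_en.sh for the real variable names before running.
OUTPUT_DIR="${PWD}/exp/pretrain_en"
mkdir -p "$OUTPUT_DIR"
echo "checkpoints will be written to: $OUTPUT_DIR"
# bash scripts/train/pretrain_en.sh   # run once the constants are set
```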
## Adaptation

See https://github.com/facebookresearch/XLM for details on XLM-R, XLM-17, and XLM-100.

- For XLM-R models, use the `--continued_pretraining` flag and specify the models to adapt.
- For XLM-17 and XLM-100 models, run `scripts/train/adapt_xlm{17,100}.sh`.
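The `{17,100}` in the script name is shell brace shorthand for two separate scripts, one per model; expanded:

```shell
# Brace shorthand expands to one adaptation script per XLM model:
for n in 17 100; do
  echo "scripts/train/adapt_xlm${n}.sh"
done
```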
## Fine-tuning

Fine-tuning uses scripts from the https://github.com/google-research/xtreme repository:

- `xtreme/scripts/train_udpos.sh` for POS tagging
- `xtreme/scripts/train_panx.sh` for NER
- `xtreme/scripts/train_xnli.sh` for NLI
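A sketch of the fine-tuning step, assuming XTREME is cloned alongside this repository (the clone and run lines are commented out; the task-to-script mapping is the one listed above):

```shell
# git clone https://github.com/google-research/xtreme
# One XTREME script per downstream task (POS tagging, NER, NLI):
for script in train_udpos.sh train_panx.sh train_xnli.sh; do
  echo "xtreme/scripts/${script}"
  # bash "xtreme/scripts/${script}"   # run once XTREME is set up
done
```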
## Regression analysis

`src/regression/regression.py` provides an example output.

Note: `src/regression/lang_data.txt` is taken from Table 5 in the appendix of the XTREME paper.
## Citation

    @inproceedings{fujinuma2022match,
        title = "Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability",
        author = "Yoshinari Fujinuma and Jordan Boyd-Graber and Katharina Kann",
        booktitle = "Proceedings of the Association for Computational Linguistics (to appear)",
        year = "2022",
        url = "http://arxiv.org/abs/2203.10753",
    }