When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

This repository includes pointers and scripts to reproduce the experiments presented in When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models (accepted at NAACL-HLT 2021).

Transliteration

Linguistically motivated Transliteration

Uyghur to the Latin script

Install Perl (https://www.perl.org/get.html) and run:

cd transfer/transliteration/

cat ug.txt | perl ./alTranscribe.pl -f ug -t tr > ug_latin_script.txt

Sorani to the Latin script

cat ckb.txt | perl ./alTranscribe.pl -f ckb -t ku > ckb_latin_script.txt
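
Both transliteration commands above follow the same pattern, so they can also be driven from Python. A minimal sketch that wraps the Perl script with subprocess; the job list and helper function below are illustrative and not part of the repository:

import subprocess
from pathlib import Path

# (input corpus, source language code, target language code), as in the commands above.
JOBS = [
    ("ug.txt", "ug", "tr"),    # Uyghur -> Latin script
    ("ckb.txt", "ckb", "ku"),  # Sorani -> Latin script
]

def transliterate(src: str, from_code: str, to_code: str) -> Path:
    """Equivalent to: cat <src> | perl ./alTranscribe.pl -f <from> -t <to> > <src>_latin_script.txt"""
    out_path = Path(src).with_name(Path(src).stem + "_latin_script.txt")
    with open(src, "rb") as fin, open(out_path, "wb") as fout:
        subprocess.run(
            ["perl", "./alTranscribe.pl", "-f", from_code, "-t", to_code],
            stdin=fin, stdout=fout, check=True,
        )
    return out_path

if __name__ == "__main__":
    for src, from_code, to_code in JOBS:
        print(transliterate(src, from_code, to_code))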

Fine-tuning mBERT

Data

Raw data for MLM training/unsupervised fine-tuning

Download the deduplicated OSCAR datasets for the target languages.
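
The deduplicated OSCAR corpora can also be pulled through the Hugging Face datasets library. A minimal sketch for Uyghur that writes the train/test files used below; the config name follows the hub's "oscar" dataset naming scheme ("unshuffled_deduplicated_<lang>"), and the 95/5 split and file names are arbitrary choices for this sketch:

from datasets import load_dataset

# Deduplicated OSCAR for Uyghur; swap the language code for other corpora.
oscar = load_dataset("oscar", "unshuffled_deduplicated_ug", split="train")

# Hold out a small evaluation set.
split = oscar.train_test_split(test_size=0.05, seed=42)

for name, part in [("train.txt", split["train"]), ("test.txt", split["test"])]:
    with open(name, "w", encoding="utf-8") as f:
        for row in part:
            f.write(row["text"].replace("\n", " ") + "\n")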

MLM training

We use the run_language_modeling.py script from Hugging Face's transformers examples:

export TRAIN_FILE=./train.txt
export TEST_FILE=./test.txt

python ./run_language_modeling.py \
    --output_dir=output \
    --model_type="bert" \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 4 \
    --num_train_epochs 20 \
    --evaluate_during_training \
    --save_steps 1000 \
    --save_total_limit 2 \
    --mlm \
    --overwrite_output_dir \
    --block_size 128 \
    --line_by_line
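
Note that without --model_name_or_path the script initializes a BERT model from scratch, so a tokenizer has to be supplied via --tokenizer_name (and optionally a config via --config_name). A minimal sketch for building a WordPiece vocabulary with the tokenizers library; the vocabulary size and output directory below are arbitrary choices, not values from the paper:

import os
from tokenizers import BertWordPieceTokenizer

# Train a cased WordPiece vocabulary on the raw training corpus.
os.makedirs("./new_tokenizer", exist_ok=True)
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["train.txt"], vocab_size=30000)
tokenizer.save_model("./new_tokenizer")  # writes vocab.txt into ./new_tokenizer

The resulting directory can then be passed to the command above with --tokenizer_name ./new_tokenizer.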

mBERT Unsupervised Fine-Tuning

Similarly, run run_language_modeling.py from Hugging Face, this time initializing from the pretrained bert-base-multilingual-cased checkpoint:

export TRAIN_FILE=./train.txt
export TEST_FILE=./test.txt

python ./run_language_modeling.py \
    --output_dir=output \
    --model_type="bert" \
    --model_name_or_path="bert-base-multilingual-cased" \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --per_gpu_train_batch_size 8 \
    --per_gpu_eval_batch_size 4 \
    --num_train_epochs 20 \
    --evaluate_during_training \
    --save_steps 1000 \
    --save_total_limit 2 \
    --mlm \
    --overwrite_output_dir \
    --block_size 128 \
    --line_by_line
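
Once fine-tuning has finished, the checkpoint can be loaded like any other transformers model. A minimal sketch using a fill-mask pipeline, assuming the checkpoint was saved to ./output as in the command above (the example sentence is only an illustration):

from transformers import pipeline

# Load the fine-tuned checkpoint and its tokenizer from the output directory.
fill_mask = pipeline("fill-mask", model="./output", tokenizer="./output")

# BERT-style models use [MASK] as the mask token.
print(fill_mask("Paris is the [MASK] of France."))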

How to cite

If you extend or use this work, please cite:

@misc{muller2020unseen,
      title={When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models}, 
      author={Benjamin Muller and Antonis Anastasopoulos and Benoît Sagot and Djamé Seddah},
      year={2020},
      eprint={2010.12858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
