Facebook Low Resource (FLoRes) MT Benchmark
Branch: master
Clone or download

README.md

Facebook Low Resource (FLoRes) MT Benchmark

This repository contains data and baselines from the paper:
Two New Evaluation Data-Sets for Low-Resource Machine Translation: Nepali–English and Sinhala–English.

The data can be downloaded directly at:
https://github.com/facebookresearch/flores/raw/master/data/wikipedia_en_ne_si_test_sets.tgz

Baselines

The following instructions will can be used to reproduce the baseline results from the paper.

Requirements

The baseline uses the Indic NLP Library and sentencepiece for preprocessing; fairseq for model training; and sacrebleu for scoring.

Dependencies can be installed via pip:

$ pip install fairseq sacrebleu sentencepiece

The Indic NLP Library will be cloned automatically by the prepare-{ne,si}en.sh scripts.

Download and preprocess data

The download-data.sh script can be used to download and extract the raw data. Thereafter the prepare-neen.sh and prepare-sien.sh scripts can be used to preprocess the raw data. In particular, they will use the sentencepiece library to learn a shared BPE vocabulary with 5000 subword units and binarize the data for training with fairseq.

To download and extract the raw data:

$ bash download-data.sh

Thereafter, run the following to preprocess the raw data:

$ bash prepare-neen.sh
$ bash prepare-sien.sh

Train a baseline Transformer model

To train a baseline Ne-En model on a single GPU:

$ CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --arch transformer --share-all-embeddings \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-update 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --min-lr 1e-9 \
    --max-tokens 4000 \
    --update-freq 4 \
    --max-epoch 100 --save-interval 10

To train on 4 GPUs, remove the --update-freq flag and run CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train (...). If you have a Volta or newer GPU you can further improve training speed by adding the --fp16 flag.

This same architecture can be used for En-Ne, Si-En and En-Si:

  • For En-Ne, update the training command with:
    fairseq-train data-bin/wiki_ne_en_bpe5000 --source-lang en --target-lang ne
  • For Si-En, update the training command with:
    fairseq-train data-bin/wiki_si_en_bpe5000 --source-lang si --target-lang en
  • For En-Si, update the training command with:
    fairseq-train data-bin/wiki_si_en_bpe5000 --source-lang en --target-lang si

Compute BLEU using sacrebleu

Run beam search generation and scoring with sacrebleu:

$ fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --gen-subset valid \
    --remove-bpe=sentencepiece \
    --sacrebleu

Replace --gen-subset valid with --gen-subset test above to score the test set.

Tokenized BLEU for En-Ne and En-Si:

For these language pairs we report tokenized BLEU. You can compute tokenized BLEU by removing the --sacrebleu flag from generate.py:

$ fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang en --target-lang ne \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --gen-subset valid \
    --remove-bpe=sentencepiece

Citation

If you use this data in your work, please cite:

@inproceedings{,
  title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
  author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1902.01382},
  year={2019}
}

License

The dataset is licenced under CC-BY-SA, see the LICENSE file for details.