STAPLE 2020 README
This repository has code for:
- Scoring a predictions file
- Training an example baseline model with fairseq
Python 3.6+ is required. It is strongly recommended that you run this in a virtual environment.
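As a quick sanity check before running anything, the sketch below tests the interpreter version and whether a virtual environment appears to be active. The function names are hypothetical and not part of this repository; only the 3.6 minimum comes from this README.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in this README


def python_is_supported(min_version=MIN_PYTHON):
    """Return True if the running interpreter is at least min_version."""
    return sys.version_info >= min_version


def in_virtual_environment():
    """Return True if the interpreter appears to run inside a virtual environment.

    The standard-library venv module and recent virtualenv releases make
    sys.prefix differ from sys.base_prefix; older virtualenv releases set
    sys.real_prefix instead.
    """
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix
```

If `in_virtual_environment()` returns False, one common way to create an environment is `python3 -m venv` followed by activating it, though any environment manager should work.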
There are no special requirements for running the scoring function.
You can score a predictions file as follows (using the AWS baseline as an example, and running from the repository's top-level directory):
$ python staple_2020_scorer.py --goldfile staple-2020-train/en_vi/train.en_vi.2020-01-13.gold.txt --predfile staple-2020-train/en_vi/train.en_vi.aws_baseline.pred.txt
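If you want to call the scorer from your own Python code, for example to score several prediction files in a loop, a minimal sketch could wrap the command above with `subprocess`. The helper names `build_scorer_command` and `score` are hypothetical; only the script name and the `--goldfile` and `--predfile` flags are taken from the command shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_scorer_command(goldfile, predfile, scorer="staple_2020_scorer.py"):
    """Build the argument list for the scorer, using the current interpreter."""
    return [
        sys.executable,
        scorer,
        "--goldfile", str(goldfile),
        "--predfile", str(predfile),
    ]


def score(goldfile, predfile):
    """Run the scorer after checking that both input files exist.

    Assumes the working directory is the top level of the repository,
    so that the scorer script can be found by its relative path.
    """
    for path in (goldfile, predfile):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Input file not found: {path}")
    # check=True raises CalledProcessError if the scorer exits with an error.
    return subprocess.run(build_scorer_command(goldfile, predfile), check=True)
```

Calling `score(gold_path, pred_path)` with the two paths from the command above should be equivalent to running that command by hand; the scorer's output goes to the terminal as usual.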
If all you want to do is evaluation, then ignore this section.
Most participants will probably write their own code for this task, but we also provide code for training a vanilla sequence-to-sequence model using fairseq. This does not produce the best results for this task, but it is an obvious baseline and may give you a jump start. This code is an adaptation of the translation tutorials from fairseq.
Certain scripts require Perl to run. If you are on macOS or Linux, you probably already have it. See here for more details.
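To confirm that the external tools are available before starting, a small check along these lines may help. The function name is hypothetical; `perl` is required by the scripts as noted above, and `git` is included on the assumption that you will clone the repositories with it.

```python
import shutil


def missing_tools(tools=("perl", "git")):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from `missing_tools()` means both executables were found on your PATH.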
Next, get these repositories:
$ git clone https://github.com/moses-smt/mosesdecoder
$ git clone https://github.com/rsennrich/subword-nmt
Go to the variables.sh file and set the paths for
Install the Python requirements:
$ pip install fairseq sacremoses subword_nmt sacrebleu tqdm
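After installation, you may want to verify that the packages can be found from the active environment. The sketch below is one way to do that; the function name is hypothetical, and the module names are assumed to match the package names in the pip command above.

```python
import importlib.util

# Import names assumed to match the packages installed by the pip command above.
REQUIRED_MODULES = ("fairseq", "sacremoses", "subword_nmt", "sacrebleu", "tqdm")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

If `missing_modules()` returns a non-empty list, the most likely cause is that pip installed into a different environment than the one your interpreter is using.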
The following files are provided.
- variables.sh: common Bash variables
- preprocess.sh: preprocesses the data for training with fairseq
- train.sh: trains the model using the preprocessed data
- run_pretrained.sh: runs pretrained fairseq models
- my_cands_extract.py: converts outputs from fairseq into shared task format files (used in
- get_traintest_data.py: converts shared task format files into fairseq-readable format (used in
The most relevant files are
If you have questions, feel free to check or post to the mailing list.