Word Quality Estimation for NMT
This is an updated version of the WMT word-level quality estimation task (Bojar et al., 2017) that takes into account both fluency and adequacy issues. It requires detecting not only wrong words but also insertion errors. It also requires detecting words in the source that can be related to errors on the target side.
The tags are determined using the tools from previous WMT editions (fast_align, tercom) with minor changes. Namely, alignments are used to determine source words that can be related to target-side errors, and one or more consecutive insertions after tercom alignment are indicated as a single gap (insertion) error.
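The tagging idea described above can be illustrated with a small sketch. This is not the code used to build the corpus: Python's `difflib.SequenceMatcher` stands in for tercom (it does not model shifts), the function names are hypothetical, the `OK`/`BAD` labels and the separate word and gap lists are assumptions, and alignments are assumed to be (source index, target index) pairs such as those produced by fast_align.

```python
from difflib import SequenceMatcher


def tag_target(mt, pe):
    """Tag MT words and the gaps between them by comparing MT to its post-edit.

    mt, pe: lists of tokens. Returns (word_tags, gap_tags), where gap_tags[k]
    is the gap before word k and gap_tags[len(mt)] is the gap after the last word.
    """
    word_tags = ["OK"] * len(mt)
    gap_tags = ["OK"] * (len(mt) + 1)
    matcher = SequenceMatcher(a=mt, b=pe, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op in ("replace", "delete"):
            # MT words that were changed or removed in the post-edit
            for i in range(i1, i2):
                word_tags[i] = "BAD"
        elif op == "insert":
            # One or more consecutive insertions collapse into a single gap error
            gap_tags[i1] = "BAD"
    return word_tags, gap_tags


def tag_source(n_src, alignments, word_tags):
    """Mark source words aligned to at least one BAD target word."""
    src_tags = ["OK"] * n_src
    for src_idx, tgt_idx in alignments:
        if word_tags[tgt_idx] == "BAD":
            src_tags[src_idx] = "BAD"
    return src_tags


if __name__ == "__main__":
    mt = "the cat sat mat".split()
    pe = "the cat sat on the mat".split()
    print(tag_target(mt, pe))
```

In the example, the two words missing from the MT output ("on the") produce a single BAD gap rather than two separate errors.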
The following tools are needed.
Install Fast Align
Download zip and uncompress it into the
./external_tools/ folder. In Unix
systems this can be done with
    mkdir ./external_tools/
    cd ./external_tools/
    wget https://github.com/clab/fast_align/archive/master.zip
    unzip master.zip
    rm master.zip
Check the README.md in that folder, as there may be extra libraries needed.
Ubuntu-friendly commands are provided to install these. With the needed
libraries installed, just do
    mkdir build
    cd build
    cmake ..
    make
as indicated in the
fast_align-master/README.md. If everything goes right,
this should create
Install Tercom
Just go to
Download the latest version of the tool and decompress it. For the WMT2018 corpus creation we used
    cd ./external_tools
    wget http://www.cs.umd.edu/~snover/tercom/tercom-0.7.25.tgz
    tar -xf tercom-0.7.25.tgz
If you are successful, the following file should be available
Generating the first version of the tags
This is a simple example using WMT2017. In practice you will need to train fast_align on a sufficiently large corpus.
Create a DATA folder and uncompress the WMT2017 data into it
    mkdir DATA
This should look like
    DATA/WMT2017/task2_de-en_training
    DATA/WMT2017/task2_de-en_training-dev
    DATA/WMT2017/task2_de-en_dev
    DATA/WMT2017/task2_en-de_dev
    DATA/WMT2017/task2_en-de_training
    DATA/WMT2017/task2_de-en_test
    DATA/WMT2017/task2_en-de_test
    cd corpus_generation/
    bash train_fast_align_wmt2017.sh
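If you train fast_align on your own, larger corpus, note that fast_align reads a single file with one sentence pair per line, the two sides separated by ` ||| `, and writes alignments as `i-j` index pairs. The sketch below shows one way to prepare that input and parse that output in Python. The helper names are hypothetical and this is not what `train_fast_align_wmt2017.sh` does internally; it only illustrates the file formats.

```python
def format_fast_align(src_sentences, tgt_sentences):
    """Build fast_align input lines: 'source tokens ||| target tokens'."""
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("source and target must have the same number of sentences")
    lines = []
    for src, tgt in zip(src_sentences, tgt_sentences):
        # Collapse repeated whitespace so tokens are separated by single spaces
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        lines.append(f"{src} ||| {tgt}")
    return lines


def parse_alignment_line(line):
    """Parse one line of 'i-j' pairs into (source_index, target_index) tuples."""
    pairs = []
    for pair in line.split():
        src_idx, tgt_idx = pair.split("-")
        pairs.append((int(src_idx), int(tgt_idx)))
    return pairs


if __name__ == "__main__":
    print(format_fast_align(["das ist  gut"], ["this is good"]))
    print(parse_alignment_line("0-0 1-1 2-2"))
```

Both sides should already be tokenized in the same way as the data you later want to tag.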
Once fast_align is trained, call the following to generate the tags
Tags are currently stored under e.g.
Exploring the tags
You can explore the created tags using the notebook in notebooks. For this
you will have to install the jupyter Python module.
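For a quick look without the notebook, a few lines of Python are enough to get basic statistics. The sketch below assumes that a tags file holds one sentence per line with space-separated `OK`/`BAD` tags; this layout and the function name are assumptions, so adapt them to the files actually produced.

```python
from collections import Counter


def tag_statistics(tag_lines):
    """Count tags over all lines and return (counts, fraction of BAD tags)."""
    counts = Counter()
    for line in tag_lines:
        counts.update(line.split())
    total = sum(counts.values())
    bad_ratio = counts["BAD"] / total if total else 0.0
    return counts, bad_ratio


if __name__ == "__main__":
    # In practice, pass an open file object instead of this toy list
    counts, bad_ratio = tag_statistics(["OK BAD OK", "OK OK"])
    print(counts, bad_ratio)
```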