Translation Error Categorization-based Translation Quality Metric
TerrorCat
http://terra.cl.uzh.ch/terrorcat.html

TODO POS-SETS

* WHAT IS IT

TerrorCat performs automatic ranking of translations -- i.e. given several
translations (e.g. produced by several MT systems) of the same source text, it
ranks those translations according to their quality.

* WHAT IS IT NOT

TerrorCat is not an MT metric (like BLEU, TER or METEOR) -- you cannot use its
output as an objective and independent estimate of translation quality; it can
only rank translations relative to other translations.

* HOW DOES IT DO IT

TerrorCat is based on automatic translation error categorization. It uses
Addicter and Hjerson to automatically analyze and categorize the translation
errors of every hypothesis translation, then calculates the frequency of every
error type per part of speech (in other words, frequencies of missing nouns,
missing verbs, superfluous nouns, superfluous adjectives, misinflected nouns,
misplaced nouns, etc. -- for every error type and every part of speech). It
then uses those frequencies for pairwise comparison of the translation
sentences (with a trained binary classifier), and finally computes the ranking
based on the outcome of all the pairwise comparisons.

* HOW DO I APPLY IT

First, take a look at the dependencies section down below.

To rank a set of hypothesis translations, you need to

1. lemmatize and PoS-tag the source text, the reference and the hypothesis
   translations

   - save them all in the factored format "surface-form|pos-tag|lemma", for
     example
     ======
     the|DT|the cars|NN|car were|VB|be small|JJ|small .|SENT|.
     ======
   - name the source and reference files following the convention
     <test-set>.<language>.fact, for example newstest2011.fr.fact and
     newstest2011.en.fact
   - name the hypothesis files following the convention
     <test-set>.<language-pair>.<MT-system-ID>.fact, for example
     newstest2011.fr-en.uedin.fact, newstest2011.fr-en.cmu.fact and
     newstest2011.fr-en.cu-combo.fact
   - place them all in a single folder

2. launch TerrorCat's ranking with the script terrorcat.pl:

   ./terrorcat.pl [options] work-dir source-dir model-file > output.txt

   - work-dir is a path for saving auxiliary files, such as the automatic
     error analysis output; you can specify a non-existent path and it will
     be created. To avoid having to re-generate the error analysis, always
     specify the same path -- as long as file names do not overlap, you can
     re-use the folder over several sessions (with different languages, etc.)
   - source-dir is the path to the folder where you put the source texts and
     the reference and hypothesis translations
   - model-file is the path to the pre-trained model for the pairwise
     classifier; you can download models from the homepage, or train your own
   - output.txt is the file where you want the ranking saved; the format is
     the same as for the WMT metrics shared tasks
     (e.g. http://www.statmt.org/wmt12/metrics-task.html)
   - the options include
     -s  to switch to sentence-level ranking (default: document-level ranking)
     -m  to set the number of threads to use (default: 2)
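As a sketch of what such a call might look like (the directory and file names
below are made up for illustration; only the script, its arguments and the
options are taken from the usage line above):

======
# Document-level ranking of the newstest2011 fr-en submissions prepared as
# described above, using a pre-trained fr-en model and 4 threads.
./terrorcat.pl -m 4 work/ data/newstest2011/ models/fr-en.model > ranking.fr-en.txt

# The same, but ranking on the sentence level instead.
./terrorcat.pl -s -m 4 work/ data/newstest2011/ models/fr-en.model > ranking.fr-en.sent.txt
======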
* HOW DO I TRAIN A NEW MODEL

First, take a look at the dependencies section down below.

To train a new model for TerrorCat you need a set of manually ranked
translations -- the source text, the reference and two or more hypothesis
translations, together with their quality comparison on a per-sentence basis.

1. launch the training with the script train-model.pl:

   ./train-model.pl [options] work-dir man-rank-file source-dir model-file

   - work-dir is a path for saving auxiliary files, such as the automatic
     error analysis output; you can specify a non-existent path and it will
     be created. To avoid having to re-generate the error analysis, always
     specify the same path -- as long as file names do not overlap, you can
     re-use the folder over several sessions (with different languages, etc.)
   - source-dir is where you placed the source text, reference and hypothesis
     translation files
   - man-rank-file is the file with the manual translation rankings on a
     per-sentence basis; the format is
     ======
     language-pair,test-set-id,line-nr,sys-1-id,sys-1-rank,sys-2-id,sys-2-rank,...
     ...
     ======
     for example
     ======
     fr-en,newstest2011,42,uedin,1,cmu,2,cu-combo,1
     ======
     and a couple of hundred (or thousand) more such lines
   - model-file is the path where the model file will be saved (to pass to
     the ranking script terrorcat.pl afterwards)
   - the options include
     -m  to set the number of threads to use (default: 2)

* HOW DO I META-EVALUATE

If you have a trained model and a manual ranking file for data that was not
involved in training the model, you can calculate the correlation between
TerrorCat's ranking and the manual ranking. To do so, use the script
metaeval.sh:

   ./metaeval.sh man-ranks auto-ranks

   - man-ranks is the file with the manual ranking data, in the same format
     as for training models (see above)
   - auto-ranks is the output of TerrorCat's ranking script; metaeval.sh will
     automatically guess whether the ranking output was on the document or
     the sentence level and calculate the appropriate correlation coefficient
     (Spearman's rho for document level and Kendall's tau for sentence level)

An example session that puts training, ranking and meta-evaluation together is
given at the end of this file.

* WHAT ARE THE DEPENDENCIES

TerrorCat uses Addicter, Hjerson and Weka:

- https://wiki.ufal.ms.mff.cuni.cz/user:zeman:addicter
- http://www.dfki.de/~mapo02/hjerson/
- http://www.cs.waikato.ac.nz/ml/weka/

The exact Weka version requirement is unclear, but it works with 3.6.6 (and
probably with newer versions as well). The paths to these software packages
can be set in the "config.ini" file.

Also, TerrorCat has only been tested on Linux, but in theory it could work
elsewhere (e.g. on Windows via Cygwin, or on MacOS). It consists of Perl
scripts, Bash scripts and GNU Make scripts, so Perl, Bash and Make are also
required.

* HOW WELL DOES IT WORK

The correlation with human judgements varies, depending on the language, text
domain, size of the training data used to create the model, etc. When
evaluated on WMT'2011 data, the document-level correlations were:

  English: 0.854
  French:  0.919
  German:  0.879
  Spanish: 0.943
  Czech:   0.915

and the sentence-level correlations were:

  English: 0.305
  French:  0.300
  German:  0.175
  Spanish: 0.254
  Czech:   0.208

See the homepage for more recent evaluation results.
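* EXAMPLE SESSION

To illustrate how the pieces fit together, here is a hypothetical end-to-end
session: training a fr-en model on one manually ranked test set, ranking a
held-out test set with it, and meta-evaluating the result. All directory and
file names below are made up for illustration; only the scripts, argument
order and options come from the sections above.

======
# 1. Train a fr-en model from manually ranked newstest2010 data
#    (factored files and the ranking file prepared as described above).
./train-model.pl -m 4 work/ rankings/newstest2010.fr-en.csv data/newstest2010/ models/fr-en.model

# 2. Rank the newstest2011 fr-en submissions with the new model,
#    on the sentence level.
./terrorcat.pl -s -m 4 work/ data/newstest2011/ models/fr-en.model > auto-ranks.fr-en.txt

# 3. Compare against the held-out manual rankings; since the output is
#    sentence-level, metaeval.sh should report Kendall's tau.
./metaeval.sh rankings/newstest2011.fr-en.csv auto-ranks.fr-en.txt
======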