
SharedTask2021

This repository contains the data, baselines and evaluation scripts for the Eval4NLP Shared Task on Explainable Quality Estimation.

Datasets

Training and development data

The data directory contains training and development data for the Romanian-English (Ro-En) and Estonian-English (Et-En) language pairs.

The directories for each data partition and language pair contain the following files:

  • <partition>.src: source sentences
  • <partition>.mt: MT outputs
  • <partition>.pe: post-edited versions of the MT outputs
  • <partition>.da: sentence-level quality (direct assessment) scores
  • <partition>.hter: sentence-level HTER scores
  • <partition>.tgt-tags: word-level labels indicating whether each token in the MT output is an error (1) or correct (0)
  • <partition>.src-tags: word-level labels indicating whether each token in the source corresponds to an error (1) or to a correct token (0) in the MT output

All the data is tokenized. Word-level labels were derived by comparing the MT outputs with their post-edited versions based on the alignments provided by the TER tool.
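
As an illustration, here is a minimal Python sketch for loading one partition into memory. The directory layout (data/ro-en) and split name (train) are assumptions, so adjust them to the actual file tree:

from pathlib import Path

def load_partition(directory, partition):
    """Read the parallel files of one data partition into a dict of lists."""
    base = Path(directory)

    def lines(suffix):
        # One sentence (or one tag sequence) per line in every file.
        return (base / f"{partition}.{suffix}").read_text(encoding="utf-8").splitlines()

    return {
        "src": lines("src"),                        # source sentences
        "mt": lines("mt"),                          # MT outputs
        "pe": lines("pe"),                          # post-edited MT outputs
        "da": [float(x) for x in lines("da")],      # sentence-level quality scores
        "hter": [float(x) for x in lines("hter")],  # sentence-level HTER scores
        "tgt_tags": [[int(t) for t in row.split()] for row in lines("tgt-tags")],
        "src_tags": [[int(t) for t in row.split()] for row in lines("src-tags")],
    }

train = load_partition("data/ro-en", "train")  # hypothetical paths
print(train["mt"][0], train["tgt_tags"][0])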

This data is an adjusted version of the MLQE-PE dataset that was used at the WMT2020 Shared Task on Quality Estimation. The differences from the WMT2020 QE Shared Task data are as follows:

  • For simplicity, gap tokens are ignored. Thus, the number of word-level labels matches the number of tokens in the MT output (see the sanity check below).
  • OK and BAD labels are replaced by 0 and 1, respectively.
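
Since gaps are dropped, each line of <partition>.tgt-tags should carry exactly one 0/1 label per token of the corresponding line in <partition>.mt. A minimal sanity check, again assuming the hypothetical data/ro-en/train.* paths:

with open("data/ro-en/train.mt", encoding="utf-8") as mt_f, \
     open("data/ro-en/train.tgt-tags", encoding="utf-8") as tags_f:
    for i, (mt, tags) in enumerate(zip(mt_f, tags_f)):
        labels = [int(t) for t in tags.split()]
        assert all(l in (0, 1) for l in labels), f"line {i}: labels must be 0/1"
        assert len(labels) == len(mt.split()), f"line {i}: label/token count mismatch"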

Test data

As test data, we are collecting sentence-level quality scores and word-level error annotations for Et-En and Ro-En, as well as for two zero-shot language pairs: German-Chinese (De-Zh) and Russian-German (Ru-De). Human annotators are asked to mark the translation errors that explain the overall sentence score, as well as the corresponding words in the source sentence. The guidelines for this annotation effort are available in the annotation-guidelines directory.

We provide a gold standard of 20 annotated sentence pairs for De-Zh and Ru-De, which participants may use to gain an intuition of the task for these language pairs: De-Zh, Ru-De. Unlike the data that will eventually be used in the shared task, this data is not tokenized (also, the different color codings are not relevant).

Baselines

We provide the following baselines:

Evaluation

The following metrics will be used to assess performance (a short computation sketch follows the list):

  • AUC (Area Under the ROC Curve)
  • AP (Average Precision)
  • Recall at top-K
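
For intuition, the word-level metrics can be reproduced per sentence with scikit-learn and a few lines of Python. This is only a sketch (it assumes K is set to the number of gold error tokens); the official numbers come from scripts/evaluate.py:

from sklearn.metrics import roc_auc_score, average_precision_score

gold = [0, 0, 1, 0, 1, 0]                # 1 = error token (toy example)
scores = [0.1, 0.2, 0.9, 0.3, 0.4, 0.1]  # model explanation scores

auc = roc_auc_score(gold, scores)
ap = average_precision_score(gold, scores)

# Recall at top-K: fraction of gold errors ranked among the K highest scores.
k = sum(gold)
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
recall_at_k = sum(gold[i] for i in top_k) / k

print(f"AUC={auc:.3f}  AP={ap:.3f}  Recall@top-{k}={recall_at_k:.3f}")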

To run evaluation on a toy example:

cd scripts
python evaluate.py --gold_explanations_fname example/target-gold.example.roen \
                   --model_explanations_fname example/target.example.roen \
                   --gold_sentence_scores_fname example/sentence-gold.example.roen \
                   --model_sentence_scores_fname example/sentence.example.roen

Contact information
