SharedTask2021

This repository contains the data, baselines and evaluation scripts for the Eval4NLP Shared Task on Explainable Quality Estimation.

Datasets

Training and development data (for sentence-level scores)

The directory data contains training and development data for Romanian-English (Ro-En) and Estonian-English (Et-En) language pairs.

In particular, the data contains both sentence-level scores and word-level labels.

  • The sentence-level scores can be used to train a supervised model that predicts the similarity between source and target sentences. Alternatively, participants may use unsupervised approaches such as XMoverScore, i.e., they may ignore the sentence-level training scores.

  • The data also contains word-level labels. Word-level labels were derived by comparing the MT outputs with their post-edited versions based on the alignments provided by the TER tool.

  • The word-level labels can be used for exploratory purposes, to estimate how well a model performs for word-level explainability (in the absence of test data, which will be provided only at a later stage of the shared task). Participants who do not train their systems on word-level labels will be considered for the constrained track.

  • Alternatively, participants may train a system on the existing word-level annotations (unconstrained track). In this case, participants should be aware that the word-level labels derived from post-editing are similar to the human annotations in the test data, but there is no full correspondence (which can be viewed as a domain shift).

Participants submitting to the constrained track will be evaluated separately from participants submitting to the unconstrained track.

Data format

  • <partition>.src: source sentences
  • <partition>.mt: MT outputs
  • <partition>.pe: post-editing of the MT outputs
  • <partition>.da: sentence-level quality scores
  • <partition>.hter: HTER score
  • <partition>.tgt-tags: word-level labels indicating whether each token in the MT output is an error (1) or is correct (0)
  • <partition>.src-tags: word-level labels indicating whether each token in the source corresponds to an error (1) or to a correct token (0) in the MT output

All the data is tokenized.
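
For illustration, here is a minimal Python sketch of how one partition could be loaded; the load_partition helper and the path prefix in the usage line are hypothetical and only follow the file extensions listed above.

from pathlib import Path

def load_partition(prefix):
    """Load the line-aligned files of one partition given a path prefix
    (hypothetical helper; file extensions as documented above)."""
    def lines(ext):
        return Path(f"{prefix}.{ext}").read_text(encoding="utf-8").splitlines()

    src, mt, pe = lines("src"), lines("mt"), lines("pe")
    da = [float(x) for x in lines("da")]        # sentence-level quality scores
    hter = [float(x) for x in lines("hter")]    # HTER scores
    tgt_tags = [[int(t) for t in l.split()] for l in lines("tgt-tags")]
    src_tags = [[int(t) for t in l.split()] for l in lines("src-tags")]

    examples = []
    for i in range(len(src)):
        # Gaps are ignored, so each token carries exactly one 0/1 label.
        assert len(mt[i].split()) == len(tgt_tags[i])
        assert len(src[i].split()) == len(src_tags[i])
        examples.append({
            "src": src[i], "mt": mt[i], "pe": pe[i],
            "da": da[i], "hter": hter[i],
            "tgt_tags": tgt_tags[i], "src_tags": src_tags[i],
        })
    return examples

# Hypothetical usage; adjust the prefix to the actual layout of the data directory.
# examples = load_partition("data/ro-en/train")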

This data is an adjusted version of the MLQE-PE dataset that was used at the WMT2020 Shared Task on Quality Estimation. The differences from the WMT2020 QE Shared Task data are as follows:

  • For simplicity, gaps are ignored. Thus, the number of word-level labels corresponds to the number of tokens.
  • OK and BAD labels are replaced by 0 and 1, respectively.

Additionally, we provide a gold standard of 20 annotated sentence pairs for DE-ZH and RU-DE. Participants may use it to gain an intuition of the task for those language pairs: DE-ZH, RU-DE.

Test data

As test data, we are collecting sentence-level quality scores and word-level error annotations for Et-En and Ro-En, as well as two zero-shot language pairs: German-Chinese (DE-ZH) and Russian-German (RU-DE). Human annotators are asked to indicate translation errors as an explanation for the overall sentence score, as well as the corresponding words in the source sentence. The guidelines for this annotation effort are available in the annotation-guidelines directory.

Baselines

We provide the following baselines:

Evaluation

The following metrics will be used to assess performance:

  • AUC score
  • AP (Average Precision)
  • Recall at top-K
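
As a rough illustration of these word-level metrics (not the official implementation; see scripts/evaluate.py and the toy example below), the following Python sketch assumes scikit-learn and two hypothetical inputs per sentence: gold 0/1 error labels and continuous token-level explanation scores. Recall at top-K is computed here with K set to the number of gold error tokens; the exact definition in the official script may differ in detail.

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def recall_at_top_k(gold, scores):
    # With K = number of gold error tokens, return the fraction of those
    # errors found among the K highest-scoring tokens (assumed definition).
    gold, scores = np.asarray(gold), np.asarray(scores)
    k = int(gold.sum())
    if k == 0:
        return 0.0
    top_k = np.argsort(-scores)[:k]
    return float(gold[top_k].sum()) / k

def word_level_metrics(gold, scores):
    # AUC, AP and Recall@top-K for one sentence's token-level explanation.
    return {
        "auc": roc_auc_score(gold, scores),
        "ap": average_precision_score(gold, scores),
        "recall_at_top_k": recall_at_top_k(gold, scores),
    }

# Hypothetical example: 1 = error token, higher score = "more likely an error".
print(word_level_metrics([0, 1, 0, 1, 0], [0.1, 0.9, 0.2, 0.4, 0.3]))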

To run evaluation on a toy example:

cd scripts
python evaluate.py --gold_explanations_fname example/target-gold.example.roen \
                   --model_explanations_fname example/target.example.roen \
                   --gold_sentence_scores_fname example/sentence-gold.example.roen \
                   --model_sentence_scores_fname example/sentence.example.roen

Citation

@inproceedings{fomicheva-etal-2021-eval4nlp,
    title = "The {E}val4{NLP} Shared Task on Explainable Quality Estimation: Overview and Results",
    author = "Fomicheva, Marina  and
      Lertvittayakumjorn, Piyawat  and
      Zhao, Wei  and
      Eger, Steffen  and
      Gao, Yang",
    booktitle = "Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.eval4nlp-1.17",
    pages = "165--178",
}

Contact information
