SharedTask2021

This repository contains the data, baselines and evaluation scripts for the Eval4NLP Shared Task on Explainable Quality Estimation.

Datasets

Training and development data (for sentence-level scores)

The directory data contains training and development data for Romanian-English (Ro-En) and Estonian-English (Et-En) language pairs.

In particular, the data contains both sentence-level scores and word-level labels.

  • The sentence-level scores can be used to train a supervised model that predicts the similarity between source and target sentences. Alternatively, participants may use unsupervised approaches such as XMoverScore, i.e., they may ignore the sentence-level training scores.

  • The data also contains word-level labels. Word-level labels were derived by comparing the MT outputs with their post-edited versions based on the alignments provided by the TER tool.

  • The word-level labels can be used for exploratory purposes, to estimate how well a model performs for word-level explainability (in the absence of test data, which will be provided only at a later stage of the shared task). Participants who do not train their systems on word-level labels will be considered for the constrained track.

  • Alternatively, participants may train a system on the existing word-level annotations (unconstrained track). In this case, participants should be aware that the word-level labels derived from post-editing are similar to the human annotations in the test data, but there is no full correspondence (which can be viewed as a domain shift).

Participants submitting to the constrained track will be evaluated separately from participants submitting to the unconstrained track.

Data format

  • <partition>.src: source sentences
  • <partition>.mt: MT outputs
  • <partition>.pe: post-editing of the MT outputs
  • <partition>.da: sentence-level quality scores
  • <partition>.hter: HTER score
  • <partition>.tgt-tags: word-level labels indicating whether each token in the MT output is an error (1) or is correct (0)
  • <partition>.src-tags: word-level labels indicating whether each token in the source corresponds to an error (1) or to a correct token (0) in the MT output

All the data is tokenized.
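
For illustration, here is a minimal Python sketch of how one partition could be loaded; the load_partition helper and the path prefix in the usage line are hypothetical and only follow the file extensions listed above.

from pathlib import Path

def load_partition(prefix):
    """Load the line-aligned files of one partition given a path prefix
    (hypothetical helper; file extensions as documented above)."""
    def lines(ext):
        return Path(f"{prefix}.{ext}").read_text(encoding="utf-8").splitlines()

    src, mt, pe = lines("src"), lines("mt"), lines("pe")
    da = [float(x) for x in lines("da")]        # sentence-level quality scores
    hter = [float(x) for x in lines("hter")]    # HTER scores
    tgt_tags = [[int(t) for t in l.split()] for l in lines("tgt-tags")]
    src_tags = [[int(t) for t in l.split()] for l in lines("src-tags")]

    examples = []
    for i in range(len(src)):
        # Gaps are ignored, so each token carries exactly one 0/1 label.
        assert len(mt[i].split()) == len(tgt_tags[i])
        assert len(src[i].split()) == len(src_tags[i])
        examples.append({
            "src": src[i], "mt": mt[i], "pe": pe[i],
            "da": da[i], "hter": hter[i],
            "tgt_tags": tgt_tags[i], "src_tags": src_tags[i],
        })
    return examples

# Hypothetical usage; adjust the prefix to the actual layout of the data directory.
# examples = load_partition("data/ro-en/train")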

This data is an adjusted version of the MLQE-PE dataset that was used at the WMT2020 Shared Task on Quality Estimation. The differences from the WMT2020 QE Shared Task data are as follows:

  • For simplicity, gaps are ignored. Thus, the number of word-level labels corresponds to the number of tokens.
  • OK and BAD labels are replaced by 0 and 1, respectively.

Additionally, we provide a gold standard of 20 annotated sentence pairs for DE-ZH and RU-DE. Participants may use it to gain an intuition of the task for those language pairs: DE-ZH, RU-DE.

Test data

As test data, we are collecting sentence-level quality scores and word-level error annotations for Et-En and Ro-En, as well as two zero-shot language pairs: German-Chinese (DE-ZH) and Russian-German (RU-DE). Human annotators are asked to indicate translation errors as an explanation for the overall sentence score, as well as the corresponding words in the source sentence. The guidelines for this annotation effort are available in the annotation-guidelines directory.

Baselines

We provide the following baselines:

Evaluation

The following metrics will be used to assess performance:

  • AUC score
  • AP (Average Precision)
  • Recall at top-K
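
As a rough illustration of these word-level metrics (not the official implementation; see scripts/evaluate.py and the toy example below), the following Python sketch assumes scikit-learn and two hypothetical inputs per sentence: gold 0/1 error labels and continuous token-level explanation scores. Recall at top-K is computed here with K set to the number of gold error tokens; the exact definition in the official script may differ in detail.

import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def recall_at_top_k(gold, scores):
    # With K = number of gold error tokens, return the fraction of those
    # errors found among the K highest-scoring tokens (assumed definition).
    gold, scores = np.asarray(gold), np.asarray(scores)
    k = int(gold.sum())
    if k == 0:
        return 0.0
    top_k = np.argsort(-scores)[:k]
    return float(gold[top_k].sum()) / k

def word_level_metrics(gold, scores):
    # AUC, AP and Recall@top-K for one sentence's token-level explanation.
    return {
        "auc": roc_auc_score(gold, scores),
        "ap": average_precision_score(gold, scores),
        "recall_at_top_k": recall_at_top_k(gold, scores),
    }

# Hypothetical example: 1 = error token, higher score = "more likely an error".
print(word_level_metrics([0, 1, 0, 1, 0], [0.1, 0.9, 0.2, 0.4, 0.3]))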

To run evaluation on a toy example:

cd scripts
python evaluate.py --gold_explanations_fname example/target-gold.example.roen \
                   --model_explanations_fname example/target.example.roen \
                   --gold_sentence_scores_fname example/sentence-gold.example.roen \
                   --model_sentence_scores_fname example/sentence.example.roen

Citation

@inproceedings{fomicheva-etal-2021-eval4nlp,
    title = "The {E}val4{NLP} Shared Task on Explainable Quality Estimation: Overview and Results",
    author = "Fomicheva, Marina  and
      Lertvittayakumjorn, Piyawat  and
      Zhao, Wei  and
      Eger, Steffen  and
      Gao, Yang",
    booktitle = "Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.eval4nlp-1.17",
    pages = "165--178",
}

Contact information
