---
layout: single
permalink: /
excerpt: HumEval 2021
classes: wide
header:
---
Human evaluation plays a central role in NLP, from the large-scale crowd-sourced evaluations carried out, e.g., by the WMT workshops, to the much smaller experiments routinely encountered in conference papers. Moreover, while NLP embraced automatic evaluation metrics from BLEU (Papineni et al., 2001) onwards, the field has always been acutely aware of their limitations (Callison-Burch et al., 2006; Reiter and Belz, 2009; Novikova et al., 2017; Reiter, 2018), and has gauged their trustworthiness in terms of how well, and how consistently, they correlate with human evaluation scores (Over et al., 2007; Gatt and Belz, 2008; Bojar et al., 2016; Shimorina, 2018; Ma et al., 2019; Mille et al., 2019; Dušek et al., 2020).
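To make the kind of meta-evaluation referred to above concrete, the sketch below shows one common way of relating an automatic metric to human judgments, namely system-level correlation; the score lists are purely hypothetical and the snippet is an illustration, not the protocol of any particular shared task.

```python
# Minimal sketch: system-level correlation between an automatic metric
# and human evaluation scores (hypothetical numbers, purely illustrative).
from scipy.stats import pearsonr, spearmanr

# One score per system, e.g. averaged over the same set of test outputs.
metric_scores = [0.31, 0.28, 0.35, 0.22, 0.30]   # e.g. a BLEU-like metric
human_scores  = [72.4, 69.1, 75.0, 61.3, 70.8]   # e.g. mean adequacy ratings

r, r_p = pearsonr(metric_scores, human_scores)        # linear correlation
rho, rho_p = spearmanr(metric_scores, human_scores)   # rank correlation

print(f"Pearson r     = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho  = {rho:.3f} (p = {rho_p:.3f})")
```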
Yet there is growing unease about how human evaluations are conducted in NLP. Researchers have pointed out the less-than-perfect experimental and reporting standards that prevail (van der Lee et al., 2019). Only a small proportion of papers provide enough detail for their human evaluations to be reproduced, and in many cases the information provided is not even enough to support the conclusions drawn. We have found that more than 200 different quality criteria (Fluency, Grammaticality, etc.) have been used in NLP, with different papers using the same quality criterion name with different definitions, and the same definition under different names. As a result, we currently have no way of determining whether two evaluations assess the same thing, which poses problems for both meta-evaluation and reproducibility assessments.
Reproducibility in the context of automatically computed system scores has recently attracted a lot of attention, against the background of a troubling history (Pedersen, 2008; Mieskes et al., 2019) in which reproduction fails in 24.9% of cases for a team's own results, and in 56.7% of cases for another team's (Mieskes et al., 2019). Initiatives have included the Reproducibility Challenge (Pineau et al., 2019; Sinha et al., 2020); the Reproduction Paper special category at COLING'18; the reproducibility programme at NeurIPS'19, comprising code submission, a reproducibility challenge, and the ML Reproducibility Checklist, also adopted by EMNLP'20 and AAAI'21; and the REPROLANG shared task at LREC'20 (Branco et al., 2020).
However, reproducibility in the context of system scores obtained via human evaluations has barely been addressed at all, with only a tiny number of papers (e.g. Belz & Kow, 2010; Cooper & Shardlow, 2020) reporting attempted reproductions of results. The developments in reproducibility of automatically computed scores listed above are important, but it is concerning that not a single one of these initiatives and events addresses human evaluations. For example, a paper may fully comply with all of the NeurIPS'19/EMNLP'20 reproducibility criteria, and yet any human evaluation results reported in it may not be reproducible to any degree, simply because the criteria do not address human evaluation in any way.
With this workshop we wish to create a forum for current human evaluation research and future directions, a space for researchers working with human evaluations to exchange ideas and begin to address the issues human evaluation in NLP currently faces, including experimental design, reporting standards, meta-evaluation and reproducibility. We will invite papers on topics including, but not limited to, the following:
- Experimental design for human evaluations
- Reproducibility of human evaluations
- Ethical considerations in human evaluation of computational systems
- Quality assurance for human evaluation
- Crowdsourcing for human evaluation
- Issues in meta-evaluation of automatic metrics by correlation with human evaluations
- Alternative forms of meta-evaluation and validation of human evaluations
- Comparability of different human evaluations
- Methods for assessing the quality of human evaluations
- Methods for assessing the reliability of human evaluations
- Work on measuring inter-evaluator and intra-evaluator agreement (an illustrative sketch follows this list)
- Frameworks, model cards and checklists for human evaluation
- Explorations of the role of human evaluation in the context of Responsible AI and Accountable AI
- Protocols for human evaluation experiments in NLP
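As a very simple illustration of the agreement topic above, inter-evaluator agreement for categorical judgments is often summarised with chance-corrected coefficients such as Cohen's kappa. The sketch below uses hypothetical labels from two hypothetical evaluators; real studies typically involve more evaluators and items, and may prefer coefficients such as Krippendorff's alpha.

```python
# Minimal sketch of inter-evaluator agreement: Cohen's kappa for two
# evaluators assigning categorical quality labels to the same outputs.
# The labels below are hypothetical and purely illustrative.
from collections import Counter

evaluator_a = ["good", "good", "poor", "fair", "good", "poor", "fair", "good"]
evaluator_b = ["good", "fair", "poor", "fair", "good", "poor", "good", "good"]

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"Cohen's kappa = {cohens_kappa(evaluator_a, evaluator_b):.3f}")
```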
We welcome work on the above topics and more from any subfield of NLP (and ML/AI more generally), with a particular focus on evaluation of systems that produce language as output.
In addition to the general track for technical papers, there will be discussion and poster sessions, and a special Shared Task Proposals track.
Keynote speakers:
- Margaret Mitchell, Google AI Ethics Team, US
- Lucia Specia, University College London, UK