
[Per-Turn Eval] Add project page #4304

Merged
merged 3 commits into from Jan 13, 2022
2 changes: 2 additions & 0 deletions projects/README.md
@@ -144,3 +144,5 @@ _QA model for answering questions by retrieving and reading knowledge._
- **ACUTE-Eval** [[parlai task]](https://github.com/facebookresearch/ParlAI/tree/main/parlai/crowdsourcing/tasks/acute_eval) [[paper]](https://arxiv.org/abs/1909.03087).
_ACUTE Eval is a sensitive human evaluation method for dialogue which evaluates whole conversations in a pair-wise fashion, and is our recommended method._

- **Human Evaluation Comparison** [[project]](https://parl.ai/projects/humaneval) [paper coming soon!].
_Compares how well different human crowdworker evaluation techniques can detect relative performance differences among dialogue models._
11 changes: 11 additions & 0 deletions projects/humaneval/README.md
@@ -0,0 +1,11 @@
# Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Eric Michael Smith, Orion Hsu, Rebecca Qian, Stephen Roller, Y-Lan Boureau, Jason Weston

## Abstract

At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known ([Liu et al., 2016](https://arxiv.org/abs/1603.08023)), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use which method, and to possible future directions.

## Paper

[Link coming soon!]