
[Per-Turn Eval] Add project page #4304

Merged
merged 3 commits into from Jan 13, 2022
2 changes: 2 additions & 0 deletions projects/README.md
@@ -144,3 +144,5 @@ _QA model for answering questions by retrieving and reading knowledge._
- **ACUTE-Eval** [[parlai task]](https://github.com/facebookresearch/ParlAI/tree/main/parlai/crowdsourcing/tasks/acute_eval) [[paper]](https://arxiv.org/abs/1909.03087).
_ACUTE Eval is a sensitive human evaluation method for dialogue which evaluates whole conversations in a pair-wise fashion, and is our recommended method._

- **Human Evaluation Comparison** [[project]](https://parl.ai/projects/humaneval) [paper coming soon!].
_Compares how well different human crowdworker evaluation techniques can detect relative performance differences among dialogue models._
11 changes: 11 additions & 0 deletions projects/humaneval/README.md
@@ -0,0 +1,11 @@
# Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Eric Michael Smith, Orion Hsu, Rebecca Qian, Stephen Roller, Y-Lan Boureau, Jason Weston

## Abstract

At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known ([Liu et al., 2016](https://arxiv.org/abs/1603.08023)), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice on when to use which method, and to possible future directions.

## Paper

[Link coming soon!]