# How NOT To Evaluate Your Dialogue System: 
An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

* Babelfish / Deep Elastic: Part 3 - Deep Chatbot [1]
* 김무성

# Contents
* Abstract
* 1 Introduction
* 2 Related Work
* 3 Evaluation Metrics
    - 3.1 Word Overlap-based Metrics
    - 3.2 Embedding-based Metrics
* 4 End-to-End Dialogue Models
    - 4.1 Retrieval Models
    - 4.2 Generative Models
    - 4.3 Conclusions from an Incomplete Analysis
* 5 Human Correlation Analysis
    - 5.1 Data collection
    - 5.2 Survey Results
    - 5.3 Qualitative Analysis
* 6 Discussion

#### References
* [2] Slides for the paper 'How NOT To Evaluate Your Dialogue System' - http://cs.mcgill.ca/~rlowe1/dialogue_evaluation_mila2016.pdf
* [3] Laurent Charlin's homepage - http://www.cs.toronto.edu/~lcharlin/

# Abstract

* We investigate evaluation metrics for end-to-end dialogue systems where supervised labels, such as task completion, are not available.
* Recent works in end-to-end dialogue systems 
    - have adopted metrics from 
        - machine translation and 
        - text summarization 
    - to compare 
        - a model’s generated response 
        - to a single target response. 
* We show that these metrics 
    - correlate very weakly or 
    - not at all with human judgements of 
    - the response quality 
        - in both technical and non-technical domains. 
* We provide 
    - quantitative and qualitative results 
        - highlighting specific weaknesses in existing metrics, and 
    - provide recommendations 
        - for future development of 
            - better automatic evaluation metrics 
                - for dialogue systems.

# 1 Introduction

Significant progress has been made in learning end-to-end systems 
* directly from large amounts of text data for a variety of natural language tasks, such as 
    - question answering (Weston et al., 2015), 
    - machine translation (Cho et al., 2014), and 
    - dialogue response generation systems (Sordoni et al., 2015), 
* in particular through the use of neural network models.

In the case of dialogue systems, 
* an important challenge is to provide a <font color="red">reliable evaluation of the learned systems</font>.

#### Supervised dialogue models vs. unsupervised dialogue models

Supervised dialogue models
* Typically, evaluation is done using human-generated supervised signals, such as 
    - a task completion test or 
    - a user satisfaction score
* We call models that are <font color="red">trained to optimize for such supervised objectives</font> 
    - supervised dialogue models
* While supervised models have historically been the method of choice,
    - <font color="red">supervised labels are difficult</font> 
        - to collect on a large scale 
            - due to the cost of human labour. 
    - Further, for free-form types of dialogues 
        - (e.g., chatbots), 
        - <font color="red">the notion of task completion is ill-defined</font> 
            - since it may differ from one human user to another.

Unsupervised dialogue models
* Unsupervised dialogue models are receiving increased attention. 
* These models are typically 
    - trained (end-to-end) to 
    - predict the next utterance of a conversation, 
        - given several context utterances (Serban et al., 2015). 
    - This task is referred to as <font color="red">response generation</font>. 
* However, <font color="red">automatically evaluating the quality of unsupervised models</font> 
    - remains an <font color="blue">open question</font>. 
    - Automatic evaluation metrics would 
        - help accelerate the deployment of unsupervised dialogue systems.

#### Automatic evaluation metrics

Faced with similar challenges, other natural language tasks have successfully developed automatic evaluation metrics.
* machine translation
    - BLEU (Papineni et al., 2002a)
    - METEOR (Banerjee and Lavie, 2005) 
* automatic summarization
    - ROUGE (Lin, 2004) 

#### Automatic evaluation metrics for dialogue response generation systems

* Since the machine translation task appears similar to the dialogue response generation task, 
    - dialogue researchers have <font color="red">adopted the same metrics</font> for evaluating the performance of their models.
* However, the applicability of these methods has not been validated for dialogue-related tasks. 
    - A particular challenge in dialogues is 
        - the <font color="red">significant diversity</font> 
            - in the space of valid responses to 
                - a given conversational context.
    -  This is illustrated in Table 1, 
        - where two reasonable proposed responses are given to the context; 
        - however, these responses 
            - <font color="red">do not share any words in common</font> and 
            - <font color="red">do not have the same semantic meaning</font>.
* We investigate several evaluation metrics for dialogue response generation systems, including both 
    - <font color="blue">statistical word based similarity metrics</font> 
        - such as BLEU, METEOR, and ROUGE, and 
    - <font color="blue">word-embedding based similarity metrics</font> 
        - derived from word embedding models such as Word2Vec (Mikolov et al., 2013).

<img src="figures/cap1.png" width=600 />

#### applicability

We study the applicability of these metrics 
*  by using them to evaluate 
    - a <font color="red">variety of end-to-end dialogue models</font>, including both 
        - <font color="blue">retrieval models</font> such as 
            - the Dual Encoder (Lowe et al., 2015) and 
        - <font color="blue">generative models</font> that incorporate some form of 
            - recurrent decoder (Serban et al., 2015). 
* We use these models 
    - to produce a proposed response 
        - given the context of the conversation and 
    - compare them to the ground-truth response 
        - (the actual next response) 
    - using the above metrics.

#### Results

When evaluating these models <font color="red">with the embedding-based metrics</font>, 
* we find that even though some models significantly outperform others across several metrics and domains, <font color="red">the metrics only very weakly correlate with human judgement</font>, 
    - as determined by human evaluation of the responses.
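
The phrase "correlate with human judgement" can be made concrete with a small sketch. It assumes that the Pearson and Spearman coefficients are used to compare a metric's scores with human ratings of the same responses; the helper names and the example numbers below are hypothetical and are not taken from the paper's data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant sequence has no defined correlation; return 0.0 by convention.
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five responses (illustration only).
human_scores = [1, 2, 2, 4, 5]
metric_scores = [0.30, 0.10, 0.25, 0.20, 0.35]
print(pearson(metric_scores, human_scores))
print(spearman(metric_scores, human_scores))
```

A coefficient near 0 means the metric's ranking of responses says little about how humans would rank them, which is the failure mode these notes are about.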

We highlight the <font color="red">shortcomings of these metrics</font> using: 
* a) a <font color="blue">statistical analysis</font> of 
    - our survey’s results; 
* b) a <font color="blue">qualitative analysis</font> of 
    - examples taken from our data; and 
* c) an <font color="blue">exploration of the sensitivity</font> of 
    - the metrics.

# 2 Related Work

#### Evaluation methods for supervised dialogue systems
* PARADISE framework (Walker et al., 1997),
* MeMo (Möller et al., 2006) 
* Jokinen and McTear, 2009

#### model-independent
* We <font color="red">focus on metrics that are model-independent</font>, 
    - i.e. where the model generating the response does not also evaluate its quality; 
* thus, <font color="red">we do not consider word perplexity</font>, 
    - although it has been used to evaluate unsupervised dialogue models (Serban et al., 2015).
    - This is because it is not computed on a per-response basis, 
        - and cannot be computed for retrieval models. 
* Further, we only consider metrics 
    - that can be used to evaluate proposed responses 
        - against ground-truth responses, 
    - so <font color="red">we do not consider retrieval-based metrics</font> such as 
        - <font color="blue">recall</font>, 
            - which has been used to evaluate dialogue models (Schatzmann et al., 2005; Lowe et al., 2015).

#### unsupervised dialogue systems 
* BLEU
* Ritter et al. (2011b)
* statistical machine translation (SMT) model
* using a simple bag-of-words model
* deltaBLEU (Galley et al., 2015b),
* Galley et al. (2015b) 

# 3 Evaluation Metrics
* 3.1 Word Overlap-based Metrics
* 3.2 Embedding-based Metrics

## 3.1 Word Overlap-based Metrics
* BLEU
* METEOR
* ROUGE

#### word-overlap
We first consider metrics that evaluate the amount of word-overlap between the proposed response and the ground-truth response.

#### BLEU

#### References
* [4] Chapter 8. Evaluation (Statistical Machine Translation) - http://www.statmt.org/book/slides/08-evaluation.pdf

BLEU (Papineni et al., 2002a) analyzes the co-occurrences of n-grams in the ground truth and the proposed responses.
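
The n-gram co-occurrence idea can be sketched as follows. This is a minimal sentence-level version assuming a single reference, uniform n-gram weights, and no smoothing; the function names and the default `max_n=2` are choices made for this illustration, not the configuration used in the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram level zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_avg)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(bleu(candidate, reference))  # about 0.707
```

Two valid dialogue responses that share no words score exactly 0 under this scheme, which is the concern raised in the Introduction.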

<img src="http://image.slidesharecdn.com/kantanmtworkshop-150501084851-conversion-gate01/95/eamt-workshop-2015-kantanmt-23-638.jpg?cb=1430470217" width=600 />

<img src="figures/cap2.png" width=600 />

<img src="figures/cap3.png" width=600 />

#### METEOR

#### References
* [5] METEOR: Metric for Evaluation of Translation with Explicit Ordering An Automatic Metric for MT Evaluation with Improved Correlations with Human Judgments. - http://slideplayer.com/slide/7480005/

<img src="figures/meteor.png" width=600 />

The METEOR metric (Banerjee and Lavie, 2005) was introduced to address several weaknesses in BLEU.
* The alignment is based on exact token matching, followed by WordNet synonyms, stemmed tokens, and then paraphrases. 
* Given a set of alignments, the METEOR score is 
    - the harmonic mean of 
        - precision and recall 
            - between the proposed and ground truth sentence.
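
The harmonic-mean step can be illustrated with a deliberately reduced sketch. It assumes exact token matching only and an unweighted harmonic mean; the published METEOR metric additionally uses the stemming, synonym, and paraphrase matching stages listed above, weights recall more heavily, and applies a word-order penalty, none of which is reproduced here. The function name is hypothetical.

```python
from collections import Counter


def unigram_harmonic_mean(candidate, reference):
    """Harmonic mean of unigram precision and recall (exact matches only)."""
    # Multiset intersection: each reference token can be matched at most once.
    matches = sum((Counter(candidate) & Counter(reference)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    return 2 * precision * recall / (precision + recall)


print(unigram_harmonic_mean("the cat".split(), "the cat sat down".split()))
```

In the usage line, precision is 1.0 and recall is 0.5, so the short candidate is penalised for missing half of the reference rather than rewarded for being entirely correct.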

#### ROUGE

#### References
* [6] Text summarization - http://www.slideshare.net/kareemhashem/text-summarization

<img src="http://image.slidesharecdn.com/textsummarization-130415121119-phpapp02/95/text-summarization-14-638.jpg?cb=1366027930" width=600 />

ROUGE (Lin, 2004) is a set of evaluation metrics used for automatic summarization. 
* We consider ROUGE-L, which is an F-measure based on the Longest Common Subsequence (LCS) between a candidate and target sentence.
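
A minimal sketch of ROUGE-L at the sentence level, assuming a single reference and tokenised input. The `beta` parameter (relative weight of recall) and the function names are assumptions of this illustration; `beta=1.0` gives the balanced F-measure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """F-measure over LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(rouge_l(candidate, reference))  # LCS is "the cat on the mat", length 5
```

Unlike n-gram matching, the LCS does not require matched words to be adjacent, only to appear in the same order.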

## 3.2 Embedding-based Metrics
* Greedy Matching
* Embedding Average
* Vector Extrema

An alternative to using word-overlap based metrics is to consider the meaning of each word as defined by a word embedding.

<img src="figures/w2v.png" width=600 />

#### Greedy Matching

Greedy matching is the one embedding-based metric that does <font color="red">not compute sentence-level embeddings</font>. 
* Instead, given two sequences $r$ and $\hat{r}$, each token $w \in r$ is greedily matched with a token $\hat{w} \in \hat{r}$ based on the cosine similarity of their word embeddings ($e_w$), and the total score is then averaged across all words:
    <img src="figures/cap4.png" width=400 />
* This formula is asymmetric, so we must average the greedy matching scores $G$ in each direction. 
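
The greedy matching procedure can be sketched as below. The two-dimensional vectors in `TOY_EMB` are hypothetical values chosen so the arithmetic is easy to follow; they are not real Word2Vec embeddings, and the function names are assumptions of this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_one_way(r, r_hat, emb):
    """G(r, r_hat): each token in r takes its best match in r_hat; average over r."""
    return sum(max(cosine(emb[w], emb[w_hat]) for w_hat in r_hat) for w in r) / len(r)


def greedy_matching(r, r_hat, emb):
    """Symmetric score: mean of the two one-way scores."""
    return 0.5 * (greedy_one_way(r, r_hat, emb) + greedy_one_way(r_hat, r, emb))


# Hypothetical toy embeddings (illustration only).
TOY_EMB = {"yes": [1.0, 0.0], "sure": [1.0, 0.0], "thanks": [0.0, 1.0]}

r = ["yes", "thanks"]
r_hat = ["sure"]
print(greedy_one_way(r, r_hat, TOY_EMB))   # 0.5: "thanks" has no good match
print(greedy_one_way(r_hat, r, TOY_EMB))   # 1.0: "sure" matches "yes"
print(greedy_matching(r, r_hat, TOY_EMB))  # 0.75
```

The two one-way scores differ (0.5 versus 1.0), which is the asymmetry that the averaging step removes.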

#### Embedding Average

The embedding average metric <font color="red">calculates sentence-level embeddings</font> using additive composition, a method for computing the meanings of phrases by averaging the vector representations of their constituent words.
* This method has been widely used in other domains, for example in textual similarity tasks.

The embedding average, $\bar{e}$, is defined as the mean of the word embeddings of each token in a sentence $r$:

<img src="figures/cap5.png"  />

To compare a ground truth response $r$ and retrieved response $\hat{r}$, 
* we compute the cosine similarity between their respective sentence level embeddings: 
    - $\mathrm{EA} := \cos(\bar{e}_{r}, \bar{e}_{\hat{r}})$.
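
A minimal sketch of the embedding average score. The vectors in `TOY_EMB` are hypothetical and the function names are assumptions of this illustration. Because cosine similarity ignores vector length, taking the mean or the plain sum of the word vectors gives the same score.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_average(tokens, emb):
    """Sentence embedding: element-wise mean of the token embeddings."""
    vectors = [emb[w] for w in tokens]
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def ea_score(r, r_hat, emb):
    """Cosine similarity between the two averaged sentence embeddings."""
    return cosine(embedding_average(r, emb), embedding_average(r_hat, emb))


# Hypothetical toy embeddings (illustration only).
TOY_EMB = {"yes": [1.0, 0.0], "sure": [1.0, 0.0], "thanks": [0.0, 1.0]}

print(embedding_average(["yes", "thanks"], TOY_EMB))  # [0.5, 0.5]
print(ea_score(["yes", "thanks"], ["sure"], TOY_EMB))  # about 0.707
```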

#### Vector Extrema

Another way to calculate sentence-level embeddings is using vector extrema (Forgues et al., 2014). 
* For each dimension of the word vectors, 
    - take the most extreme value amongst 
        - all word vectors in the sentence, and 
    - use that value in the sentence-level embedding:
    <img src="figures/cap6.png"  />
        - where $d$ indexes the dimensions of a vector; $e_{wd}$ is the $d$-th dimension of $e_w$ ($w$'s embedding).
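
The per-dimension selection can be sketched as follows. It assumes that "most extreme" means the value with the largest magnitude, with its sign kept, and that two sentence vectors are then compared by cosine similarity as in the embedding average metric; both points, the toy vectors, and the function names are assumptions of this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vector_extrema(tokens, emb):
    """Sentence embedding built from the most extreme value in each dimension."""
    vectors = [emb[w] for w in tokens]
    sentence = []
    for values in zip(*vectors):  # iterate over dimensions
        highest, lowest = max(values), min(values)
        # Keep the maximum unless the minimum is larger in magnitude.
        sentence.append(highest if highest > abs(lowest) else lowest)
    return sentence


# Hypothetical toy embeddings (illustration only).
TOY_EMB = {"not": [0.2, -0.9], "bad": [0.5, 0.1], "awful": [0.4, -0.8]}

print(vector_extrema(["not", "bad"], TOY_EMB))  # [0.5, -0.9]
print(cosine(vector_extrema(["not", "bad"], TOY_EMB), vector_extrema(["awful"], TOY_EMB)))
```

Compared with averaging, this keeps strongly signed dimensions from being diluted by words whose values sit near zero.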

# 4 End-to-End Dialogue Models
* 4.1 Retrieval Models
* 4.2 Generative Models
* 4.3 Conclusions from an Incomplete Analysis

## 4.1 Retrieval Models
* TF-IDF
* Dual Encoder

Ranking or retrieval models for dialogue systems are typically evaluated based on whether they can retrieve the correct response <font color="red">from a corpus of pre-defined responses</font>, which includes the ground truth response to the conversation.

#### TF-IDF

<img src="figures/tfidf.png" width=600 />

<img src="figures/cap7.png" />

#### Dual Encoder

#### References
* [7] DEEP LEARNING FOR CHATBOTS, PART 2 – IMPLEMENTING A RETRIEVAL-BASED MODEL IN TENSORFLOW - http://www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/
* [8] Dual LSTM Encoder for Dialog Response Generation (tensorflow code) - https://github.com/dennybritz/chatbot-retrieval/

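The Dual Encoder (DE) model consists of two RNNs that compute vector representations $c$ and $r$ of the input context and response, respectively. The model then calculates the probability that the given response is the ground truth response given the context, by taking a weighted dot product: 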

$p(r \text{ is correct} \mid c, r, M) = \sigma(c^T M r + b)$ 

where $M$ is a matrix of learned parameters and $b$ is a bias.
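
The scoring step alone can be sketched in a few lines. In the real model $c$ and $r$ come from trained encoders and $M$ and $b$ are learned; here all four are hand-picked hypothetical values, and the function names are assumptions of this illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def de_probability(c, r, M, b):
    """sigma(c^T M r + b) for context vector c and response vector r."""
    Mr = [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]
    score = sum(c_i * v for c_i, v in zip(c, Mr)) + b
    return sigmoid(score)


# Hand-picked values (illustration only, not learned parameters).
c = [1.0, 0.0]
r = [0.0, 1.0]
M = [[0.0, 2.0],
     [0.0, 0.0]]
b = 0.0
print(de_probability(c, r, M, b))  # sigma(2), about 0.881
```

At retrieval time, a score of this kind would be computed for every candidate response and the highest-scoring candidate returned.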

To our knowledge, our application of neural network models to large-scale retrieval in dialogue systems is novel.

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2016/04/Screen-Shot-2016-04-21-at-10.51.18-AM.png" width=600 />

## 4.2 Generative Models
* LSTM language model
* HRED

In this context, we refer to a model as generative if it is able to <font color="red">generate entirely new sentences</font> that are <font color="red">unseen in the training set</font>.

#### LSTM language model

#### References
* [9] Research Trends in Natural Language Processing Using Deep Learning - http://www.slideshare.net/ssuser06e0c5/ss-64417928
* [10] Understanding LSTM Networks - http://roboticist.tistory.com/m/571
* [11] Generating Sequences With Recurrent Neural Networks - https://arxiv.org/pdf/1308.0850v5.pdf
* [12] Generating Sequences With Recurrent Neural Networks (tensorflow code) - https://github.com/tensorflow/magenta/blob/master/magenta/reviews/summary_generation_sequences.md

<img src="http://image.slidesharecdn.com/random-160727015834/95/-25-638.jpg?cb=1469584793" width=600 />

<img src="http://colah.github.io/images/post-covers/lstm.png" width=600 />

#### HRED

#### References
* [13] Building end-to-end dialogue systems using generative hierarchical neural network models - https://blog.acolyer.org/2016/07/01/building-end-to-end-dialogue-systems-using-generative-hierarchical-neural-network-models/
* [14] Hierarchical Recurrent Encoder-Decoder code (HRED) for Query Suggestion (code) - https://github.com/sordonia/hred-qs

Finally we consider the Hierarchical Recurrent Encoder-Decoder (HRED) (Serban et al., 2015).

The HRED model uses a hierarchy of encoders; 
* each utterance in the context passes through an ‘utterance-level’ encoder, and 
* the output of these encoders is passed through another ‘context-level’ encoder. 

This enables the handling of longer-term dependencies compared to a conventional Encoder-Decoder.
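
The two-level encoding can be illustrated with a toy sketch. It assumes plain tanh recurrent units with random, untrained weights and omits the decoder entirely; the recurrent unit type, dimensions, embeddings, and all names are assumptions of this illustration rather than the original model's configuration.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix (untrained, illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_encode(inputs, w_in, w_rec):
    """Run a simple tanh RNN over a sequence of vectors; return the final state."""
    hidden = [0.0] * len(w_rec)
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[i], x))
                + sum(w * hj for w, hj in zip(w_rec[i], hidden))
            )
            for i in range(len(hidden))
        ]
    return hidden


def encode_context(utterances, emb, utt_params, ctx_params):
    """Utterance-level encoder per utterance, then a context-level encoder on top."""
    utterance_vectors = [
        rnn_encode([emb[w] for w in utterance], *utt_params)
        for utterance in utterances
    ]
    return rnn_encode(utterance_vectors, *ctx_params)


rng = random.Random(0)
EMB_DIM, UTT_DIM, CTX_DIM = 3, 4, 5
vocab = ["hi", "how", "are", "you", "fine", "thanks"]
emb = {w: [rng.uniform(-1, 1) for _ in range(EMB_DIM)] for w in vocab}
utt_params = (make_matrix(UTT_DIM, EMB_DIM, rng), make_matrix(UTT_DIM, UTT_DIM, rng))
ctx_params = (make_matrix(CTX_DIM, UTT_DIM, rng), make_matrix(CTX_DIM, CTX_DIM, rng))

context = [["hi"], ["how", "are", "you"], ["fine", "thanks"]]
context_vector = encode_context(context, emb, utt_params, ctx_params)
print(len(context_vector))  # 5: one vector summarising the whole context
```

The context-level RNN takes one step per utterance rather than one step per word, so information from early turns has to survive far fewer recurrent updates.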

<img src="https://adriancolyer.files.wordpress.com/2016/06/hred-fig-1.png" width=600 />

## 4.3 Conclusions from an Incomplete Analysis

<img src="figures/cap8.png" width=600 />

# 5 Human Correlation Analysis
* 5.1 Data collection
* 5.2 Survey Results
* 5.3 Qualitative Analysis

## 5.1 Data collection

<img src="figures/cap12.png" width=600 />

## 5.2 Survey Results

<img src="figures/cap11.png" width=600 />

<img src="figures/cap10.png" width=600 />

## 5.3 Qualitative Analysis

<img src="figures/cap13.png" width=600 />

# 6 Discussion
* Constrained tasks
* Incorporating multiple responses
* Searching for suitable metrics

<img src="figures/cap9.png" width=600 />

#### Constrained tasks

#### Incorporating multiple responses

#### Searching for suitable metrics

# Appendix: Full scatter plots

<img src="figures/cap14.png" width=600 />

<img src="figures/cap15.png" width=600 />

<img src="figures/cap16.png" width=600 />

<img src="figures/cap17.png" width=600 />

<img src="figures/cap18.png" width=600 />
<img src="figures/cap19.png" width=600 />

<img src="figures/cap20.png" width=600 />

# References
* [1] How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation - http://arxiv.org/pdf/1603.08023v1.pdf
* [2] Slides for the paper 'How NOT To Evaluate Your Dialogue System' - http://cs.mcgill.ca/~rlowe1/dialogue_evaluation_mila2016.pdf
* [3] Laurent Charlin's homepage - http://www.cs.toronto.edu/~lcharlin/
* [4] Chapter 8. Evaluation (Statistical Machine Translation) - http://www.statmt.org/book/slides/08-evaluation.pdf
* [5] METEOR: Metric for Evaluation of Translation with Explicit Ordering An Automatic Metric for MT Evaluation with Improved Correlations with Human Judgments. - http://slideplayer.com/slide/7480005/
* [6] Text summarization - http://www.slideshare.net/kareemhashem/text-summarization
* [7] DEEP LEARNING FOR CHATBOTS, PART 2 – IMPLEMENTING A RETRIEVAL-BASED MODEL IN TENSORFLOW - http://www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/
* [8] Dual LSTM Encoder for Dialog Response Generation (tensorflow code) - https://github.com/dennybritz/chatbot-retrieval/
* [9] Research Trends in Natural Language Processing Using Deep Learning - http://www.slideshare.net/ssuser06e0c5/ss-64417928
* [10] Understanding LSTM Networks - http://roboticist.tistory.com/m/571
* [11] Generating Sequences With Recurrent Neural Networks - https://arxiv.org/pdf/1308.0850v5.pdf
* [12] Generating Sequences With Recurrent Neural Networks (tensorflow code) - https://github.com/tensorflow/magenta/blob/master/magenta/reviews/summary_generation_sequences.md
* [13] Building end-to-end dialogue systems using generative hierarchical neural network models - https://blog.acolyer.org/2016/07/01/building-end-to-end-dialogue-systems-using-generative-hierarchical-neural-network-models/
* [14] Hierarchical Recurrent Encoder-Decoder code (HRED) for Query Suggestion (code) - https://github.com/sordonia/hred-qs