Randomness in evaluation? #2

Open
shaform opened this issue Nov 27, 2018 · 2 comments

Comments

@shaform

shaform commented Nov 27, 2018

I spotted some randomness in the evaluation code. For example,

venue_offered[domain] = random.sample(venues, 1)

Wouldn't it make the match and success rates different even if we evaluate the same model twice?
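
To make the concern concrete, here is a minimal sketch (with toy venue names, not the repo's actual data) of how that single sampling call can make two evaluations of the same model disagree:

import random

venues = ["venue_a", "venue_b", "venue_c", "venue_d"]  # all venues matching the tracked state
goal_venues = {"venue_a", "venue_b"}                   # the subset actually consistent with the user's goal

for run in range(3):
    offered = random.sample(venues, 1)[0]              # the same call as in the evaluator
    print(run, offered, offered in goal_venues)        # match/fail can flip between runs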

@budzianowski
Owner

Hi!

Thanks for raising this. Indeed, the scores will differ slightly between runs (up to 0.5% in my experience).
The randomness models the real-life setting where the system can often choose among many entities to offer the user. It also lets us differentiate between systems that adapt to the user's changing goal.

@skiingpacman

I realise this is a somewhat old topic, but I ended up digging into it to understand the Inform Rate metric better.

On a static, offline set of dialogue logs the score can vary by around 1 percentage point (tested empirically). As an example, I used a fixed set of test-set dialogues sampled from a recent GPT-2-based model and found the Inform Rate varied from a minimum of 71.9% to a maximum of 72.9%, with a mean of 72.42% +/- 0.56% (+/- 2*STD) over 10 runs.
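
For reference, the spread above was computed along these lines (a sketch only: evaluate_inform_rate is a stand-in for the actual evaluation script, stubbed here so the snippet runs):

import random
import statistics

def evaluate_inform_rate(logs):
    # Stub standing in for the repo's evaluator; the real score jitters run to run
    # because of the venue sampling discussed in this issue.
    return 72.4 + random.uniform(-0.5, 0.5)

rates = [evaluate_inform_rate("fixed GPT-2 test-set logs") for _ in range(10)]
print(f"min {min(rates):.1f}%  max {max(rates):.1f}%  "
      f"mean {statistics.mean(rates):.2f}% +/- {2 * statistics.stdev(rates):.2f}% (2*STD)")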

For the Dialogue-Context-to-Text Generation task, my interpretation of Inform Rate is that, for each domain contained in a dialogue, the last mentioned entity (venue name or id) for that domain should match the requirements in the user's goal, where the user's goal is specified a priori for each test dialogue (but hidden from the dialogue system). Given the assumed benchmarking setup of a perfect (oracle) belief state and database lookup at each turn, the set of entities (per domain) available to be mentioned by the policy should eventually satisfy the goal constraints, assuming the dialogue stays on course and the user provides all of their constraints.
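
A minimal sketch of this reading (the data structures here are hypothetical, not the evaluator's actual ones):

def informed(last_mentioned, goal_matching_entities):
    """A dialogue counts as informed if, for every goal domain, the last entity
    the system mentioned for that domain is consistent with the user's goal.
    last_mentioned: domain -> last venue name/id offered by the system.
    goal_matching_entities: domain -> set of venues consistent with the goal."""
    return all(
        last_mentioned.get(domain) in entities
        for domain, entities in goal_matching_entities.items()
    )

# Example: a single restaurant-domain dialogue.
print(informed({"restaurant": "venue_a"}, {"restaurant": {"venue_a", "venue_b"}}))  # True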

In this offline setting, the Inform Rate metric should thus assess whether the policy finally presents a matching entity to the user once all the information about the user's constraints (including any revisions) is available. So I was somewhat confused as to why there should be a random element when computing this metric over a fixed corpus of dialogue logs: the venue sampling identified by @shaform (above) should still yield venues consistent with the user's goal.

Understanding where the stochasticity in the results comes from reveals an assumption that is broken in the current implementation. To provide the 'oracle' belief state for each turn in the dialogue, the Inform Rate implementation treats the (test-set) dialogues collected with human wizards in the loop as "perfect" state trackers. Unfortunately, the human wizards were not perfect, sometimes making mistakes and sometimes taking shortcuts.

In one example (MUL0890) the user asks for Indian cuisine on the east side of town. The human wizard records this and comes up with four results, two moderately priced and two expensive. The user then requests an expensive restaurant, but the human wizard fails to update the belief state with this information - in reality I guess they didn't need to, as they could simply read off one of the expensive options (which they did) and reply to the user without updating the UI.

Looking at the code, the Inform Rate implementation uses the human-wizard-provided belief state from the dialogues. It computes the set of venues that satisfy that belief state for the turn, samples one entity from this set, and compares it to the set of entities that match the user's goal. In the above example there are two entities (two restaurants) that satisfy the goal, but the set of entities matching the human-wizard-provided belief state has four members, so there is a 50:50 chance that randomly sampling one entity results in a mismatch and the dialogue is classified as having failed on this metric.
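
In code, the failure mode for the MUL0890 example looks roughly like this (hypothetical venue ids; a simplification of the logic described above, not the evaluator's actual code):

import random

wizard_state_matches = ["expensive_1", "expensive_2", "moderate_1", "moderate_2"]  # four venues match the stale wizard belief state
goal_matches = {"expensive_1", "expensive_2"}                                      # only the expensive two satisfy the goal

offered = random.sample(wizard_state_matches, 1)[0]
print("match" if offered in goal_matches else "fail")  # roughly 50:50 across runs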

Having discovered the cause of the randomness, the next question that occurred to me is to what extent the Inform Rate implementation makes it impossible to select a match, i.e. in how many test-set dialogues does the set of venues matching the human-wizard-provided belief state not overlap at all with the set of venues satisfying the user's goal?

Assuming that the human-wizard dialogues in the test set present the domain entities at the optimum turns with respect to belief state, it turns out that in 84 dialogues there is zero overlap between the venues matching the human-wizard-provided belief state and the set of venues satisfying the user goal. Given that the test set consists of 1000 dialogues, this would likely* cap the highest score achievable under this implementation of Inform Rate at 91.6%.
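
The check and the arithmetic behind that ceiling amount to something like the following (again with hypothetical data structures):

def unwinnable(wizard_state_matches, goal_matches):
    """True if, for some goal domain, no venue matching the wizard-provided
    belief state also satisfies the user's goal.
    Both arguments: domain -> set of venue ids."""
    return any(
        not (wizard_state_matches[domain] & goal_matches[domain])
        for domain in goal_matches
    )

# 84 of the 1000 test dialogues come out unwinnable, which caps Inform Rate at:
print(f"{(1000 - 84) / 1000:.1%}")  # 91.6%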

*There's a possibility - though my suspicion is that it's a low one - that a better strategy exists than using the human-wizard responses to decide at which turn to present entities. I hacked the metric code to simulate a policy that presents a venue for each domain in the goal set at the end of every dialogue, but this scored worse (~83% Inform Rate), indicating that there is an "optimal" turn in several of the dialogues. I speculate that the human wizards make further erroneous updates to the belief state after that turn.

Note that all of this analysis is based on MultiWOZ 2.0.

Thoughts?

budzianowski pushed a commit that referenced this issue Jul 8, 2021
New evaluation scripts for BLEU, Inform & Success #2