Randomness in evaluation? #2
Comments
Hi! Thanks for raising this. Indeed, the scores will be slightly different (up to 0.5% in my experience).
I realise this is a somewhat old topic, but I ended up digging into it to understand the Inform Rate metric better. On a static, offline set of dialogue logs the score can vary by around 1 percentage point (tested empirically). As an example, I used a fixed set of test-set dialogues sampled from a recent GPT-2-based model and found the Inform Rate varied from a minimum of 71.9% to a maximum of 72.9%, with a mean of 72.42% +/- 0.56% (+/- 2*STD) over 10 runs.

For the Dialogue-Context-to-Text Generation task, my interpretation of Inform Rate is that, for each domain contained in a dialogue, the last mentioned entity (venue name or id) for that domain should match the requirements in the user's goal, where the goal is specified a priori for each test dialogue (but hidden from the dialogue system). Given the assumed benchmarking setup of a perfect (oracle) belief state and database extraction at each turn, the set of entities (per domain) available for the policy to mention should eventually satisfy the goal constraints, assuming the dialogue stayed on course and the user provided all of their constraints. In this offline setting, the Inform Rate metric should therefore assess whether the policy finally presents a matching entity to the user once all the information about the user's constraints (including any revisions) is available. I was thus somewhat confused about why there should be a random element when computing this metric over a fixed corpus of dialogue logs; the venue sampling, as identified by @shaform (above), should still result in venues consistent with the user's goal.

Understanding where the stochasticity in the results comes from reveals an assumption that is broken in the current implementation. To provide the 'oracle' belief state for each turn, the Inform Rate implementation uses the (test set) dialogues collected with human wizards in the loop as "perfect" state trackers. Unfortunately, the human wizards were not perfect, sometimes making mistakes and sometimes taking shortcuts. In one example (MUL0890), the user asks for Indian cuisine on the east side of town. The human wizard records this and comes up with 4 results: 2 moderately priced and 2 expensive. The user then requests an expensive restaurant, but the human wizard fails to update the belief state with this information - in reality I guess they didn't need to, as they could likely read off one of the expensive options (which they did) and reply to the user without updating the UI.

Looking at the code, the Inform Rate implementation uses the human-wizard-provided belief state from the dialogues. It computes the set of venues that satisfy that belief state for the turn, samples one entity from this set, and compares it to the set of entities that match the user's goal. In the example above, two entities (two restaurants) satisfy the goal, but the set of entities matching the human-wizard-provided belief state has 4 members, so there is a 50:50 chance that randomly sampling one entity will result in a mismatch, and the dialogue will then be classified as having failed on this metric.

Having discovered the cause of the randomness, the next question that occurred to me is to what extent the Inform Rate implementation makes it impossible to select a match at all, i.e. in how many test set dialogues the human-wizard-provided belief state does not overlap at all with the venues satisfying the user's goal.
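Before getting to that, here is a minimal, self-contained sketch of the sampling step as I understand it; the venue names are illustrative and this is not the repository's actual code, it just mirrors the 4-candidates / 2-goal-matches situation in MUL0890:

```python
import random

# 4 venues match the (stale) human-wizard belief state, but only 2 of them
# satisfy the user's goal, as in MUL0890. Names are illustrative placeholders.
wizard_state_venues = ["venue_a", "venue_b", "venue_c", "venue_d"]
goal_venues = {"venue_c", "venue_d"}

def inform_match(candidates, goal):
    # The evaluator samples a single candidate and checks only that one,
    # rather than testing whether *any* candidate satisfies the goal.
    return random.choice(candidates) in goal

trials = 10_000
print(sum(inform_match(wizard_state_venues, goal_venues) for _ in range(trials)) / trials)
# ~0.5: this dialogue is scored as a success only about half the time.
# A deterministic variant would instead check: bool(set(candidates) & goal)
```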
Assuming that the human-wizard dialogues in the test set present the domain entities at the optimal turns with respect to the belief state, it turns out that there is zero overlap between the venues matching the human-wizard-provided belief state and the user's goal set of venues in 84 dialogues. Given that the test set consists of 1000 dialogues, this would likely* bound the highest score that can be achieved under this implementation of Inform Rate to 91.6%.

*There's a possibility - though my suspicion is that it's a low one - that a better strategy exists than using the human-wizard responses to decide at which turn to present entities. I hacked the metric code to simulate a policy that presents a venue for each domain in the goal at the end of every dialogue, but this scored worse (~83% Inform Rate), indicating that there is an "optimal" turn in several of the dialogues; I speculate that the human wizards make further erroneous updates to the belief state after that turn.

Note that all of this analysis is based on MultiWOZ 2.0. Thoughts?
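P.S. For reference, the zero-overlap count above came from logic roughly along these lines. This is only a sketch: the per-dialogue candidate and goal sets would have to be extracted from the corpus, and the function below simply takes them as inputs rather than reproducing the evaluator's code:

```python
from typing import Iterable, Set, Tuple

def inform_rate_upper_bound(dialogues: Iterable[Tuple[Set[str], Set[str]]]) -> float:
    """Fraction of dialogues where at least one venue consistent with the
    human-wizard belief state (union over all turns) also satisfies the goal,
    i.e. the best Inform Rate a policy restricted to those states could reach."""
    dialogues = list(dialogues)
    reachable = sum(1 for wizard_venues, goal_venues in dialogues
                    if wizard_venues & goal_venues)
    return reachable / len(dialogues)

# Toy check matching the numbers above: 84 of 1000 dialogues with no overlap.
toy = [({"a"}, {"a"})] * 916 + [({"a"}, {"b"})] * 84
print(inform_rate_upper_bound(toy))  # 0.916
```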
I spotted some randomness in the evaluation code. For example:

multiwoz/model/evaluator.py, line 142 (commit e4922d6)

Wouldn't this make the match and success rates different even if we evaluate the same model twice?
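To illustrate the concern without reproducing the exact snippet at line 142, the sketch below (with hypothetical `candidates`/`goal` stand-ins) shows how any unseeded sampling inside the metric makes two evaluations of the same fixed outputs disagree, while seeding the RNG makes the score repeatable:

```python
import random

# Hedged, self-contained illustration (not the evaluator's actual code):
# a metric that samples one candidate venue per dialogue is non-deterministic.
def match_rate(logs, seed=None):
    rng = random.Random(seed)
    hits = sum(rng.choice(candidates) in goal for candidates, goal in logs)
    return hits / len(logs)

# Hypothetical fixed logs: 4 candidate venues per dialogue, 2 satisfying the goal.
logs = [(["v1", "v2", "v3", "v4"], {"v1", "v2"})] * 1000

print(match_rate(logs), match_rate(logs))                    # differs between calls
print(match_rate(logs, seed=0) == match_rate(logs, seed=0))  # True: seeding makes it repeatable
```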