
The upper bound of Inform and Success rate? #20

Open
yunhaoli1995 opened this issue Apr 28, 2020 · 5 comments


@yunhaoli1995

yunhaoli1995 commented Apr 28, 2020

I ran evaluate.py and got Matches (inform): 90.40, Success: 82.3. Are these the upper bounds of the Inform and Success metrics? In some papers, the inform and success rates exceed 90.40 and 82.3; DAMD, for example, reports Inform 95.4 and Success 87.2 under its data augmentation setting. This confuses me a lot.

@budzianowski
Owner

Hi, can you explain in more detail which models you evaluated?

@yunhaoli1995
Author

Sorry, I didn't express myself clearly. I evaluated data/test_dials (the ground-truth dialogues) with the script evaluate.py and got Matches (inform): 90.40, Success: 82.3. I want to know whether these are the upper bounds of the Inform and Success metrics. DAMD, under its data augmentation setting, reports Inform 95.4 and Success 87.2, which confuses me a lot.

@skiingpacman

skiingpacman commented Sep 8, 2020

I think there is a likely upper bound on the Inform Rate of 91.6% on the MultiWOZ 2.0 test set, due to a combination of how the Inform Rate is implemented and errors in the belief-state annotations of the test set. This follows from the metric internally using the test set to provide the "oracle" belief state when sampling the venues that the policy presents.

In practice, evaluating the test-set dialogues themselves (as per @leeyunhao), I got min 90.3%, max 90.9%, and mean 90.54% +/- 0.46% (+/- 2 * STD) over 5 runs.
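To make the run-to-run variation concrete, here is a minimal sketch (my own toy reconstruction, not the repo's actual code) of a sampled-venue inform check: because a single venue is drawn at random and tested for membership in the true-venue set, the score fluctuates between runs, and belief-state errors cap it below 100% even on the ground-truth dialogues.

```python
import random

# Toy reconstruction (not the repo's actual code) of a sampled-venue
# inform check: one venue is drawn at random from what the system
# offered, and the dialogue only counts as a match if that venue is in
# the goal's true-venue set.
def inform_match(offered_venues, true_venues, rng):
    if not offered_venues:
        return 0
    return int(rng.choice(sorted(offered_venues)) in true_venues)

def inform_rate(dialogues, rng):
    """Average the per-dialogue match over a test set, in percent."""
    matches = [inform_match(d["offered"], d["true"], rng) for d in dialogues]
    return 100.0 * sum(matches) / len(matches)

# Toy data: in the first dialogue only one of the two offered venues is
# truly valid, e.g. because the annotated belief state is wrong there.
dialogues = [
    {"offered": {"hotel_a", "hotel_b"}, "true": {"hotel_a"}},
    {"offered": {"restaurant_c"}, "true": {"restaurant_c"}},
]
for seed in range(5):  # five runs, mirroring the 5-run experiment above
    print(round(inform_rate(dialogues, random.Random(seed)), 1))
```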

For more details, see my comment in #2.

@comprehensiveMap

> Sorry, I didn't express myself clearly. I evaluated data/test_dials (the ground-truth dialogues) with the script evaluate.py and got Matches (inform): 90.40, Success: 82.3. I want to know whether these are the upper bounds of the Inform and Success metrics. DAMD, under its data augmentation setting, reports Inform 95.4 and Success 87.2, which confuses me a lot.

Hello, I am confused by this too. Have you solved it? In my opinion, the difference comes from the evaluation scripts: DAMD counts a "match" as 1 if the set of returned venues overlaps with the set of true venues, whereas in this script, as you can see, the single randomly selected venue must itself be included in the set of true venues. The sketch below contrasts the two criteria.
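Here is a minimal sketch of the two criteria (my own toy code under the assumptions above, not taken from either codebase):

```python
import random

# Illustrative-only sketch of the two match criteria being compared;
# the function names and data are mine, not from either codebase.
def damd_style_match(offered, true_venues):
    """DAMD-style: a match as soon as any offered venue is a true venue."""
    return int(bool(set(offered) & set(true_venues)))

def sampled_venue_match(offered, true_venues, rng):
    """This repo's style (as described above): a match only if the single
    randomly selected venue is itself among the true venues."""
    if not offered:
        return 0
    return int(rng.choice(sorted(offered)) in true_venues)

offered = {"hotel_a", "hotel_b", "hotel_c"}
true_venues = {"hotel_a"}
rng = random.Random(0)
print(damd_style_match(offered, true_venues))          # always 1: sets overlap
print(sampled_venue_match(offered, true_venues, rng))  # 1 in only ~1/3 of runs
```

Since set overlap is a strictly weaker condition than sampled-venue membership, the DAMD-style criterion can only score equal or higher on the same outputs, which would be consistent with the higher Inform numbers reported there.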

@yunhaoli1995
Author

yunhaoli1995 commented May 10, 2021

> Hello, I am confused by this too. Have you solved it? In my opinion, the difference comes from the evaluation scripts: DAMD counts a "match" as 1 if the set of returned venues overlaps with the set of true venues, whereas in this script the single randomly selected venue must itself be included in the set of true venues.

It's still unsolved. But at the very least, I think models should be compared with the same evaluation script; otherwise the comparison is meaningless.
