Reproducing the ASQA numbers #4
Hi, thank you so much for reporting! Hmm, the citation recall and precision in particular look low... Let me check on this tomorrow.
Hi, apologies for pinging again, but just checking in to see if you have any updates. Thanks so much!
Sorry for my late response! I was busy with other commitments over the past two weeks. I think the issue may have been introduced by some refactoring changes I made, but I haven't investigated the diffs line by line. Do you mind if I get back to you early next week? I can also upload the model prediction file we have first if that helps!
No worries! Early next week sounds good.
Sorry for my late response! Here is the link to our 7B prediction results: Google Drive. Here's the output of the ASQA eval.py script.
I'm still investigating the gap in citation recall and precision, but someone just found a bug that I mistakenly introduced into our long-form QA script during refactoring, and I am currently re-running the evaluations. I'll keep you posted!
Thanks for sharing this! Please let me know whenever you have updated the long-form QA script and I will try it out again.
Hello, I also encountered a similar situation when reproducing the ASQA numbers for the 13B model, where:
I wonder if you could also share the 13B prediction results. Thanks a lot.
Sorry for the delay on this issue; I was busy helping wrap up some other projects and traveling over the past few weeks. I can upload the 13B results tomorrow and will take a closer look at the code base.
Here are the 13B predictions (Google Drive) and results:
Hi, apologies for pinging, but it seems I have run into the same issue...
@AkariAsai Facing the same issue with ASQA citation precision and recall. Here is the diff between the authors' output and my reproduced output: https://www.diffchecker.com/HLAGTddk/.
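A diff viewer works for spot checks, but with long prediction files it can help to locate the diverging examples programmatically. Below is a minimal sketch, assuming each prediction file is a JSON list of records with an `output` field holding the generated text (the file paths and field name are hypothetical, not taken from this repository):

```python
import json

def diff_predictions(author_path, repro_path, key="output"):
    """Return the indices of examples whose generated text differs
    between two prediction files (each a JSON list of records)."""
    with open(author_path) as f:
        author = json.load(f)
    with open(repro_path) as f:
        repro = json.load(f)
    if len(author) != len(repro):
        print(f"Length mismatch: {len(author)} vs {len(repro)}")
    return [
        i for i, (a, b) in enumerate(zip(author, repro))
        if a.get(key) != b.get(key)
    ]

# Hypothetical usage:
# mismatches = diff_predictions("author_asqa.json", "repro_asqa.json")
```

The mismatched indices can then be inspected one by one to see whether the gap comes from generation differences or from the evaluation script itself.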
Hi, I would like to ask about the retrieval and test settings for PopQA. I ran retrieval using the retriever and settings described in the paper, but the subsequent evaluation accuracy was only 0.42, far below the 0.55 reported in the paper. Could there be a problem with my experimental setup?
Hi, I was unable to reproduce the ASQA numbers for long-form generation. After evaluating the output with ALCE, I see the numbers below, which are very different from those reported in the paper:
The command I used:
I have also uploaded the model output file here for your reference. Just wanted to know whether I am doing anything wrong for ASQA.
Btw, I did a sanity check by evaluating on short-form generation with PopQA and I see 55.0 for accuracy, which matches the number reported in the paper.
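For context on the PopQA sanity check above: short-form QA accuracy in this kind of evaluation is often computed as a substring match, counting an example correct if any gold answer string appears in the generation. A minimal sketch under that assumption, with hypothetical `output` and `answers` fields (an illustration, not this repository's actual eval code):

```python
def popqa_accuracy(predictions):
    """Fraction of examples where any gold answer string appears
    (case-insensitively) in the model's generated output."""
    hits = 0
    for ex in predictions:
        output = ex["output"].lower()
        if any(ans.lower() in output for ans in ex["answers"]):
            hits += 1
    return hits / len(predictions)

# Example: one match out of two examples -> 0.5
preds = [
    {"output": "The capital is Paris.", "answers": ["Paris"]},
    {"output": "I am not sure.", "answers": ["Berlin"]},
]
print(popqa_accuracy(preds))  # 0.5
```

If a reproduction uses exact match instead of substring match, numbers can shift noticeably, so it is worth confirming which criterion the eval script applies.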