
Results for PRIMERA-arxiv #12

Closed · oaimli opened this issue May 23, 2022 · 4 comments


oaimli commented May 23, 2022

Hi,

Thanks for sharing this nice work. After running your code, I get only 28.5 ROUGE-L F-measure on the arXiv dataset, whereas your paper reports 42.6 in Table 3; my ROUGE-1 and ROUGE-2 match yours. Similarly, with led-large-16384-arxiv (i.e., the SOTA for arXiv) I can only get 46.6/19.1/27.5 for ROUGE-1/2/L, but your Table 3 reports 41.8 for ROUGE-L. Could you please explain how you obtain such high ROUGE-L values on the arXiv dataset?


Wendy-Xiao commented May 26, 2022

Hi,

Yes, this is due to an inconsistency in how sentences in the summary are handled when measuring ROUGE-L.

In previous work, the sentences in the generated and ground-truth summaries are joined with '\n' in between:

s1 . \n s2 . \n s3 . \n ...

but in our experimental setup the sentences were joined with ' ' in between:

s1 . s2 . s3 . ....

To make the results comparable with previous work, we split the summaries (both generated and ground-truth) on '.' and re-joined the sentences with '\n' in between.

ROUGE-L computed this way is consistent with previous work, and it is the number shown in our paper (the same applies to LED).
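For anyone reproducing this, here is a minimal sketch of the preprocessing, assuming the Google `rouge_score` package (the helper below is illustrative, not the exact PRIMERA evaluation script). In that package, the `rougeLsum` metric splits the text on '\n' and computes summary-level LCS, which matches the aggregation described above:

```python
# Minimal sketch, assuming the `rouge_score` package (pip install rouge-score).
# `to_newline_separated` is an illustrative helper, not the exact PRIMERA script.
from rouge_score import rouge_scorer

def to_newline_separated(summary: str) -> str:
    """Split a flat summary on '.' and re-join the sentences with newlines."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return " .\n".join(sentences) + " ."

# rougeLsum treats each newline-separated line as a sentence and computes
# summary-level (union) LCS, mirroring the original ROUGE-L protocol.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"],
                                  use_stemmer=True)

generated = "s1 . s2 . s3 ."  # flat, space-joined model output
reference = "s1 . s2 . s3 ."  # flat, space-joined ground truth

scores = scorer.score(to_newline_separated(reference),   # target first
                      to_newline_separated(generated))   # prediction second
print(scores["rougeLsum"].fmeasure)
```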


oaimli commented May 26, 2022

Hi!

Thank you so much for the reply and for sharing how you obtained the numbers. After splitting sentences with '\n', I can now reproduce the results reported in the paper for the arXiv dataset. If that is the case, then in Table 3 of your paper, under the 'RougeL' column, you report ROUGE-L for Multi-News, Multi-XScience, and WCEP, but ROUGE-Lsum for the arXiv dataset, which may be a little confusing. Would you mind explaining why different measurements are used in the same column of the results table? Much appreciated!
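For other readers, a toy illustration of the gap (assuming the `rouge_score` package; not an example from the paper): flat `rougeL` takes one LCS over the whole token sequence, while `rougeLsum` takes union LCS over '\n'-separated sentences, so reordering sentences hurts the former but not the latter.

```python
# Toy illustration (not from the paper): rougeL vs rougeLsum on the same pair.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL", "rougeLsum"])

# The same two sentences, in a different order.
prediction = "the cat sat .\nthe dog ran ."
reference = "the dog ran .\nthe cat sat ."

scores = scorer.score(reference, prediction)
# rougeL ignores the newlines and runs a single LCS over the full token
# sequence, so the reordering cuts it in half; rougeLsum matches sentence
# by sentence and is unaffected.
print(scores["rougeL"].fmeasure)     # 0.5
print(scores["rougeLsum"].fmeasure)  # 1.0
```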

Again, thanks for sharing your nice work.

Wendy-Xiao commented

Hi there,

Sorry for the late reply. There is no particular reason beyond matching the results of previous work on each dataset. The difference between datasets likely comes from the native format of the original datasets, i.e., some datasets store summaries as '\n'-separated sentences, while others store the summary as a single paragraph.


oaimli commented Jul 25, 2022

Thanks for your kind reply!

oaimli closed this as completed Jul 25, 2022