potential issue with SARI n-gram add-score #99
Comments
Hi! Apologies for the very late reply. I ran your example with the original implementation of SARI and got the same score. So, to begin with, this is not an issue with our implementation, but rather one arising from the design of the metric itself. I think it would be a good idea to raise this in the SARI GitHub repo to get the opinion of the metric's original authors. It would seem that giving 0.0 for ADD makes sense in this case, because there actually are no new bigrams, trigrams, or 4-grams. You could possibly extend this logic to the other operations, and always give a zero whenever the operation at the unigram level is already zero. I haven't really given this much thought, though. What's your take on this?
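For concreteness, the suggestion above ("always give a zero if the operation at the unigram level is already a zero") could be sketched as a small post-processing helper. This is a hypothetical illustration, not part of the library:

```python
# Hypothetical helper: if the unigram score for an operation (e.g. ADD) is
# already 0.0, zero out the higher-order n-gram scores for that operation too.
def propagate_unigram_zero(op_scores):
    """op_scores: per-n F1 scores for one operation, e.g. [f1_1gram, ..., f1_4gram]."""
    if op_scores and op_scores[0] == 0.0:
        return [0.0] * len(op_scores)
    return op_scores

print(propagate_unigram_zero([0.0, 100.0, 100.0, 100.0]))  # -> [0.0, 0.0, 0.0, 0.0]
print(propagate_unigram_zero([50.0, 100.0, 100.0, 100.0]))  # unchanged
```

Under this tweak, the 75.0 add score discussed in this issue would become 0.0 rather than 100.0.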
Hi, thanks for the response! I didn't expect this to be something wrong with your implementation specifically, but more a quirk of the metric itself. I just posted here because this library seems to be the best evaluation library for simplification and already contains some modifications/fixes to the original SARI implementation. To my intuition, I would expect a score of 100 for cases where the outputs are identical or near-identical to the reference. Even if no n-grams are added, matching the references suggests the outputs are about as simple as they can be, and therefore should receive the highest score even when the transformation only deletes content. It doesn't make sense to me that another example could receive a higher score purely because its reference introduces a new word. Perhaps no change is necessary, since instances of this type should still receive relatively high scores overall (as in the example above), but it seemed like an interesting edge case and I thought it was worth bringing up to see if anyone else had thoughts on it. Thanks for your time!
Hi, I have observed a particular situation with the SARI implementation where system outputs can receive a score below 100 even when they are identical to the reference (where there is only a single reference).
Basically, if the reference does not introduce any new tokens, the output will receive a 0.00 unigram add-score, but 100.0 for all n>1-grams.
Take the following example:
In this case, the add score will be 75.0 because there are no new unigrams (due to the `if sys_total > 0:` checks in `compute_precision_recall_f1()`), but there are technically new bigrams, trigrams, and 4-grams around the location of the deleted word (`["a japanese", "a japanese football", "is a japanese"]`, etc.). I am just curious whether this is the expected behaviour, or whether a definitive 0.00 or 100.0 result for the add-score would be more desirable?
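The behaviour can be reproduced with a from-scratch sketch of the per-n add-score (F1 over "added" n-grams). This is not the library's actual code, and the sentences below are hypothetical stand-ins chosen only to match the n-grams quoted above:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def add_f1(orig_sent, sys_sent, ref_sent, n):
    """F1 over n-grams that appear in sys/ref but NOT in the original input."""
    o = ngram_counts(orig_sent.split(), n)
    s = ngram_counts(sys_sent.split(), n)
    r = ngram_counts(ref_sent.split(), n)
    sys_added = s - o  # n-grams the system introduced
    ref_added = r - o  # n-grams the reference introduced
    correct = sum((sys_added & ref_added).values())
    sys_total = sum(sys_added.values())
    ref_total = sum(ref_added.values())
    # Mirrors the `if sys_total > 0:` guard: nothing added -> precision is 0.0
    precision = 100.0 * correct / sys_total if sys_total > 0 else 0.0
    recall = 100.0 * correct / ref_total if ref_total > 0 else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the output equals the single reference and is a pure
# deletion of one word from the input.
orig_sent = "he is a famous japanese football player"
sys_sent = ref_sent = "he is a japanese football player"

scores = [add_f1(orig_sent, sys_sent, ref_sent, n) for n in range(1, 5)]
print(scores)                     # unigram add is 0.0; n>1-grams are 100.0
print(sum(scores) / len(scores))  # averaged add score: 75.0
```

The deletion creates no new unigrams, so the unigram add-F1 is 0.0, while the bigrams/trigrams/4-grams spanning the deletion point (e.g. `"a japanese"`) are "new" relative to the input and score 100.0, averaging to 75.0.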
Thanks in advance for any insight.