potential issue with SARI n-gram add-score #99
Comments
Hi! Apologies for the very late reply. I ran your example with the original implementation of SARI and got the same score. So, to begin with, this is not an issue with our implementation, but rather one arising from the design of the metric itself. I think it would be a good idea to raise this in the SARI GitHub repo to get the opinion of the metric's original authors. It would seem that giving 0.0 for ADD makes sense in this case, because there actually are no new bigrams, trigrams, or 4-grams. You could possibly extend this logic to the other operations, and always give a zero whenever the operation at the unigram level is already zero. I haven't really given this much thought, though. What's your take on this?
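For concreteness, the suggestion above ("always give a zero if the operation at the unigram level is already a zero") could be sketched as a small post-processing helper. This is a hypothetical illustration, not part of the library:

```python
# Hypothetical helper: if the unigram score for an operation (e.g. ADD) is
# already 0.0, zero out the higher-order n-gram scores for that operation too.
def propagate_unigram_zero(op_scores):
    """op_scores: per-n F1 scores for one operation, e.g. [f1_1gram, ..., f1_4gram]."""
    if op_scores and op_scores[0] == 0.0:
        return [0.0] * len(op_scores)
    return op_scores

print(propagate_unigram_zero([0.0, 100.0, 100.0, 100.0]))  # -> [0.0, 0.0, 0.0, 0.0]
print(propagate_unigram_zero([50.0, 100.0, 100.0, 100.0]))  # unchanged
```

Under this tweak, the 75.0 add score discussed in this issue would become 0.0 rather than 100.0.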
Hi, thanks for the response! I didn't expect this to be something wrong with your implementation specifically, but more a quirk of the metric itself. I just posted here because this library seems to be the best evaluation library for simplification and already contains some modifications/fixes to the original SARI implementation. To my intuition, I would expect a score of 100 for cases where the outputs are identical or near-identical to the reference. Even if no n-grams are added, matching the references suggests the outputs are about as simple as they can be, and therefore should receive the highest score even when the transformation only deletes content. It doesn't make sense to me that another example could receive a higher score purely because its reference introduces a new word. Perhaps no change is necessary, since instances of this type should still receive relatively high scores overall (as in the example above), but it seemed like an interesting edge case and I thought it was worth bringing up to see if anyone else had thoughts on it. Thanks for your time!
Hi, I have observed a particular situation with the SARI implementation where system outputs can receive a score below 100 even when they are identical to the reference (where there is only a single reference).
Basically, if the reference does not introduce any new tokens, the output will receive a 0.00 unigram add-score, but 100.0 for all n>1-grams.
Take the following example:
In this case, the add score will be 75.0 because there are no new unigrams (due to the `if sys_total > 0:` checks in `compute_precision_recall_f1()`), but there are technically new bigrams, trigrams, and 4-grams around the location of the deleted word (`["a japanese", "a japanese football", "is a japanese"]`, etc.). I am just curious whether this is the expected behaviour, or whether a definitive 0.00 or 100.0 result for the add-score would be more desirable?
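The behaviour can be reproduced with a from-scratch sketch of the per-n add-score (F1 over "added" n-grams). This is not the library's actual code, and the sentences below are hypothetical stand-ins chosen only to match the n-grams quoted above:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def add_f1(orig_sent, sys_sent, ref_sent, n):
    """F1 over n-grams that appear in sys/ref but NOT in the original input."""
    o = ngram_counts(orig_sent.split(), n)
    s = ngram_counts(sys_sent.split(), n)
    r = ngram_counts(ref_sent.split(), n)
    sys_added = s - o  # n-grams the system introduced
    ref_added = r - o  # n-grams the reference introduced
    correct = sum((sys_added & ref_added).values())
    sys_total = sum(sys_added.values())
    ref_total = sum(ref_added.values())
    # Mirrors the `if sys_total > 0:` guard: nothing added -> precision is 0.0
    precision = 100.0 * correct / sys_total if sys_total > 0 else 0.0
    recall = 100.0 * correct / ref_total if ref_total > 0 else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: the output equals the single reference and is a pure
# deletion of one word from the input.
orig_sent = "he is a famous japanese football player"
sys_sent = ref_sent = "he is a japanese football player"

scores = [add_f1(orig_sent, sys_sent, ref_sent, n) for n in range(1, 5)]
print(scores)                     # unigram add is 0.0; n>1-grams are 100.0
print(sum(scores) / len(scores))  # averaged add score: 75.0
```

The deletion creates no new unigrams, so the unigram add-F1 is 0.0, while the bigrams/trigrams/4-grams spanning the deletion point (e.g. `"a japanese"`) are "new" relative to the input and score 100.0, averaging to 75.0.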
Thanks in advance for any insight.