Verbose output explaining the SubER score #4
Hi @patrick-wilken, thank you very much for the PR. It would be helpful for my analysis.
Sorry for the slow progress here. I have now added separate statistics for word and break edit operations.
I will have to look further into the implementation details and what actually happens in practice. But in principle you can get to the same sequence of shifted words by shifting either a word or a break.
Also distinguishing now between […].
I think I am still going to flip deletions and insertions in the statistics output. Usually you think of the edit operations as being performed on the hypothesis to transform it into the reference. That's the direction the TER code and paper use, and it is what I currently use in the output. However, I think people are used to calling a missing word in the hypothesis a deletion (although the required edit operation in the above sense would be an insertion).
Hi, sorry for my late reply. I think that it should be counted as a break shift, not as both a break and a word shift. If it involves a break, it is always a break shift, otherwise a word shift. But this is my interpretation, of course.
Okay, let me get more technical. 😄 In the code a shift is defined by the tuple […] But that does not work because a shift can be expressed as multiple different tuples […]
vs.
You could say the first line break moved one position earlier, and by the extended definition it would be a break shift. But to me this looks like just a word shift of "only". Thinking about it, what would work is to regard all shifts that either shift only a single break token or that shift across a single break token as break shifts, because those are the cases where the text stays the same and only the segmentation changes. All other shifts would then be "word or mixed shifts"...
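A minimal sketch of the classification rule proposed above. The shift representation `(start, length, target)`, the function name, and the break token strings `<eol>` / `<eob>` are assumptions made for illustration; they are not taken from the actual code in this PR.

```python
# Hypothetical break tokens; the real implementation may use different markers.
BREAK_TOKENS = {"<eol>", "<eob>"}


def classify_shift(tokens, start, length, target):
    """Classify a shift as "break" or "word_or_mixed".

    The span tokens[start:start + length] is moved so that it is re-inserted
    before the original index `target` (target <= start or
    target >= start + length). This tuple layout is an assumption.
    """
    moved = tokens[start:start + length]
    if target <= start:
        # Span moves left: it jumps over the tokens between target and start.
        crossed = tokens[target:start]
    else:
        # Span moves right: it jumps over the tokens between its end and target.
        crossed = tokens[start + length:target]

    def is_single_break(span):
        return len(span) == 1 and span[0] in BREAK_TOKENS

    # Break shift: either only one break token moves, or the moved span
    # crosses exactly one break token. In both cases the words keep their order.
    if is_single_break(moved) or is_single_break(crossed):
        return "break"
    return "word_or_mixed"


tokens = ["this", "is", "<eol>", "only", "a", "test"]
print(classify_shift(tokens, 3, 1, 2))  # "only" moved across the single <eol>
print(classify_shift(tokens, 3, 1, 0))  # "only" moved to the front
```

Under this rule, moving "only" across the line break and moving the line break across "only" get the same label, which sidesteps the ambiguity between equivalent tuples.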
Yes, I see, I get the problem now. Thanks for the explanation. I agree with your last comment: there are mixed cases, and these cannot be counted as break shifts only. I would go for your definition of block shift. Thank you again.
I rebased and have now also switched deletions and insertions in the statistics, meaning edit operations are considered to be applied to the reference. Thus deletions are words missing in the hypothesis, and insertions are additional words in the hypothesis. This is not the direction the TER paper and code use, but as far as I know it is far more common (e.g. https://en.wikipedia.org/wiki/Word_error_rate).
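To illustrate the convention, here is a small sketch that counts edit operations with the reference as the starting point. It uses plain word-level Levenshtein distance without shifts, so it is not the TER/SubER computation from this PR; the function name and the output keys are assumptions.

```python
def count_edits(reference, hypothesis):
    """Count word-level edits that turn the reference into the hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()

    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # reference word dropped
                             dist[i][j - 1] + 1)         # extra hypothesis word

    # Backtrace one optimal path and count the operations on it.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            counts["substitutions"] += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1   # word missing in the hypothesis
            i -= 1
        else:
            counts["insertions"] += 1  # additional word in the hypothesis
            j -= 1
    return counts


print(count_edits("the cat sat", "the sat"))
print(count_edits("the cat", "the black cat"))
```

With the opposite (TER) direction, the two labels would simply be swapped, since the hypothesis would be the sequence being edited.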
@sarapapi That's what I have so far for #3.
Setting `--suber-statistics` as a command line option would lead to an output like: […]
Not so sure about the output format, maybe I'm overdoing this JSON format. But I think it's better than writing to a separate file or just printing those statistics to stderr. The idea of the extra nesting level is that maybe at some point we want additional outputs for other metrics as well.
What could further be added here:
- Split `num_deletions` into `num_word_deletions` and `num_break_deletions`, same for insertions and substitutions (a substitution of a break is "end of block" <-> "end of line"). This gives some additional insights, for example whether there is over-/under-segmentation in general. But it requires an alignment of the words before and after the TER shifts so we know the positions of breaks in the edit operation "trace". Doable though... By the way, `num_word_shifts` / `num_break_shifts` does not really make sense because it's ambiguous: swapping a word and a subsequent break could either be a word shift right or a break shift left.
- […] `0-0 1-2 2-3` etc. This could be used to create visualizations like Figure 3 in the paper to see exactly which words / breaks are edited. Nice to have, but not so high priority for me at the moment, I would say.
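As a rough sketch of the proposed word/break split: once an edit operation trace with token positions is available, the counts could be separated as below. The trace format (a list of `(operation, token)` pairs, with the reference-side token for substitutions), the function name, and the `<eol>` / `<eob>` token strings are assumptions for illustration, not the format used in this PR.

```python
from collections import Counter

# Hypothetical break tokens; the real implementation may use different markers.
BREAK_TOKENS = {"<eol>", "<eob>"}


def split_edit_counts(trace):
    """Count deletions, insertions and substitutions separately for words and breaks.

    `trace` is a list of (operation, token) pairs, where operation is one of
    "correct", "deletion", "insertion" or "substitution".
    """
    counts = Counter()
    for operation, token in trace:
        if operation == "correct":
            continue  # matched tokens do not contribute to the statistics
        kind = "break" if token in BREAK_TOKENS else "word"
        counts[f"num_{kind}_{operation}s"] += 1
    return dict(counts)


trace = [
    ("correct", "hello"),
    ("deletion", "<eol>"),      # line break missing in the hypothesis
    ("insertion", "world"),     # additional word in the hypothesis
    ("substitution", "<eob>"),  # "end of block" <-> "end of line"
    ("deletion", "there"),
]
print(split_edit_counts(trace))
```

A surplus of break insertions over break deletions in such counts would hint at over-segmentation, and the reverse at under-segmentation.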