Verbose output explaining the SubER score #4

patrick-wilken · 2022-10-19T13:13:21Z

@sarapapi That's what I have so far for #3.

Setting --suber-statistics as a command line option would lead to an output like:

{
    "SubER": 46.435,
    "#info": {
        "SubER": {
            "num_reference_words": 5946,
            "num_shifts": 230,
            "num_deletions": 828,
            "num_insertions": 783,
            "num_substitutions": 920
        }
    }
}

I'm not so sure about the output format; maybe I'm overdoing it with JSON. But I think it's better than writing to a separate file or just printing the statistics to stderr. The extra nesting level is there because at some point we may want additional outputs for other metrics as well.
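As a quick sanity check on how to read these statistics, here is a minimal sketch that recomputes the score from the counts. It assumes, based only on the numbers in this example and not on the implementation, that SubER is the total number of edit operations divided by num_reference_words, expressed as a percentage.

```python
import json

# Statistics copied from the example output in this comment.
output = json.loads("""
{
    "SubER": 46.435,
    "#info": {
        "SubER": {
            "num_reference_words": 5946,
            "num_shifts": 230,
            "num_deletions": 828,
            "num_insertions": 783,
            "num_substitutions": 920
        }
    }
}
""")

stats = output["#info"]["SubER"]

# Assumption: every edit operation type contributes with cost 1.
edit_keys = ("num_shifts", "num_deletions", "num_insertions", "num_substitutions")
num_edits = sum(stats[key] for key in edit_keys)

recomputed = round(100 * num_edits / stats["num_reference_words"], 3)
print(num_edits, recomputed)
```

For the example above the recomputed value matches the reported "SubER" field, which suggests the counts can be read as a direct breakdown of the score.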

What could further be added here:

- Separating num_deletions into num_word_deletions and num_break_deletions, same for insertions and substitutions (substitution of break is "end of block" <-> "end of line"). This gives some additional insights, for example whether there is over-/under-segmentation in general. But it requires an alignment of the words before and after the TER shifts so we know the positions of breaks in the edit operation "trace". Doable though...
  By the way, num_word_shifts / num_break_shifts does not really make sense because it's ambiguous: swapping a word and a subsequent break could either be a word shift to the right or a break shift to the left.
- Output of the full Levenshtein alignment, e.g. in the form of hypothesis + reference word lists and an alignment like 0-0 1-2 2-3 etc. This could be used to create visualizations like Figure 3 in the paper to see exactly which words / breaks are edited. Nice to have, but not such a high priority for me at the moment, I would say.
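To illustrate the alignment idea from the last point, here is a small sketch of how a consumer could read an alignment string such as "0-0 1-2 2-3". Nothing like this exists in the PR; the function name, the "hypothesis-reference" order of each pair and the example word lists are assumptions made for illustration only.

```python
def parse_alignment(alignment):
    """Parse 'i-j' pairs into (hypothesis_index, reference_index) tuples."""
    return [tuple(int(index) for index in pair.split("-")) for pair in alignment.split()]


# Hypothetical word lists; the reference has one word the hypothesis lacks.
hypothesis = ["I", "half", "<eol>"]
reference = ["I", "recognized", "half", "<eol>"]

pairs = parse_alignment("0-0 1-2 2-3")

# Reference positions without an aligned hypothesis word.
aligned_reference = {ref_index for _, ref_index in pairs}
unaligned_reference = [
    reference[i] for i in range(len(reference)) if i not in aligned_reference
]
print(pairs, unaligned_reference)
```

With such a list of pairs, a visualization only has to mark unaligned positions and aligned positions whose tokens differ.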

sarapapi · 2022-10-19T18:13:04Z

Hi @patrick-wilken, thank you very much for the PR. It would be helpful for my analysis.
I was wondering about the ambiguity between num_word_shifts and num_break_shifts. I think that if the shift operation involves a break, it should be counted as a break shift, otherwise as a word shift. I see no ambiguity in this definition.
A natural further improvement would certainly be to isolate the word and break information, as you said, especially to provide information about the segmentation, a critical aspect of subtitling. For example, I observed good (hence low, as you already know) SubER in some cases where Sigma (another metric, theoretically developed for evaluating subtitle segmentation) is bad (hence low), and I was wondering why this bad segmentation does not seem to have an impact on SubER. Such a distinction might help to identify whether the metrics disagree or whether they somehow agree on the quality of the segmentation.

patrick-wilken · 2022-11-15T19:15:16Z

Sorry for the slow progress here. I now added separate statistics for word and break edit operations.

{
    "SubER": 46.435,
    "#info": {
        "SubER": {
            "num_reference_words": 5946,
            "num_shifts": 230,
            "num_word_deletions": 620,
            "num_break_deletions": 208,
            "num_word_insertions": 566,
            "num_break_insertions": 217,
            "num_word_substitutions": 834,
            "num_break_substitutions": 86
        }
    }
}
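A short sketch of how these split counts could be used, for example to see how much of the edit effort goes into breaks. It only relies on the key names shown in the output above; summing per operation type is an illustration, not something the tool does itself.

```python
# Statistics copied from the "#info" -> "SubER" part of the output above.
stats = {
    "num_reference_words": 5946,
    "num_shifts": 230,
    "num_word_deletions": 620,
    "num_break_deletions": 208,
    "num_word_insertions": 566,
    "num_break_insertions": 217,
    "num_word_substitutions": 834,
    "num_break_substitutions": 86,
}

operations = ("deletions", "insertions", "substitutions")

# Word and break counts per operation add up to the earlier combined counts.
totals = {op: stats[f"num_word_{op}"] + stats[f"num_break_{op}"] for op in operations}

word_edits = sum(stats[f"num_word_{op}"] for op in operations)
break_edits = sum(stats[f"num_break_{op}"] for op in operations)

print(totals)
print(f"break share of non-shift edits: {break_edits / (word_edits + break_edits):.1%}")
```

The per-operation totals equal the num_deletions, num_insertions and num_substitutions values from the first version of the output, so the split does not change the score itself.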

> I think that if the shift operation involves a break, then it should be counted as a break shift otherwise as a word shift, I see no ambiguity in this definition.

I will have to look further into the implementation details and what actually happens in practice. But in principle you can get to the same sequence of shifted words by either shifting a word or a break.
E.g. A B <eol> C -> A <eol> B C: Is it a shift of B or of <eol>? You could say that any shift going across a break position is also a break shift, but that is a bit complicated.

patrick-wilken · 2022-11-16T14:20:53Z

I'm now also distinguishing between num_reference_words and num_reference_breaks.
I also added tests. It seems to work and should already be safe to use.

I think I am still going to flip deletions and insertions in the statistics output. Usually you think of the edit operations as being performed on the hypothesis to transform it into the reference. That's the direction the TER code and paper use, and it's also what I currently use in the output. However, I think people are used to calling a missing word in the hypothesis a deletion (although the required edit operation in the above sense would be an insertion).

sarapapi · 2022-12-15T14:44:52Z

> > I think that if the shift operation involves a break, then it should be counted as a break shift otherwise as a word shift, I see no ambiguity in this definition.

> I will have to look further into the implementation details and what actually happens in practice. But in principle you can get to the same sequence of shifted words by either shifting a word or a break. E.g. A B <eol> C -> A <eol> B C: Is it a shift of B or of <eol>? You could say that any shift going across a break position is also a break shift, but that is a bit complicated.

Hi, sorry for my late reply. I think it should be counted as a break shift, not as both a break shift and a word shift. If it involves a break, it is always a break shift; otherwise it is a word shift. But this is my interpretation, of course.

patrick-wilken · 2022-12-15T17:28:19Z

Okay, let me get more technical. 😄 In the code a shift is defined by the tuple start, length, target. start is the start position of the range of words to be shifted, length is the number of words to shift, and target is the word position to shift to. I guess what you are saying is that we define a shift as a break shift if and only if any word in the range start to start + length is a break token.

But that does not work because a shift can be expressed as multiple different tuples start, length, target. In my example A B <eol> C -> A <eol> B C it could be start=2, length=1, target=1 (shift of <eol>) or start=1, length=1, target=2 (shift of B). By the definition above the first would be a break shift, the second one not. That should be regarded as a contradiction, because both really describe the same shift.
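The ambiguity can be reproduced with a few lines. This is only a sketch: apply_shift is a hypothetical helper, and it assumes that target is the index at which the moved range is re-inserted after it has been removed from the sequence. The actual TER code may interpret target differently.

```python
def apply_shift(tokens, start, length, target):
    """Move tokens[start:start + length] so that it begins at index `target`.

    Assumption: `target` refers to a position in the sequence *after* the
    shifted range has been removed.
    """
    moved = tokens[start:start + length]
    rest = tokens[:start] + tokens[start + length:]
    return rest[:target] + moved + rest[target:]


tokens = ["A", "B", "<eol>", "C"]

shift_of_break = apply_shift(tokens, start=2, length=1, target=1)  # moves <eol>
shift_of_word = apply_shift(tokens, start=1, length=1, target=2)   # moves B

print(shift_of_break)
print(shift_of_word)
```

Both calls produce the same token sequence, so a definition that only looks at the tokens inside the shifted range would label the same edit in two different ways.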
You could try to alter the definition by adding "or if there is any break token between positions start and target", i.e. we shift across some break position. That would be well defined, and then both are break shifts. However, in general that doesn't seem to fit what a break shift should be. For example:

1
00:00:00,000 --> 00:00:05,000
I recognized only
half of the people

vs.

1
00:00:00,000 --> 00:00:05,000
I recognized
half of the people only

You could say the first line break moved to one position earlier, and by the extended definition it would be a break shift. But to me this looks like just a word shift of "only".

Thinking about it, what would work is to regard all shifts that either shift only a single break token or that shift across a single break token as break shifts, because those are the cases where the text stays the same and only the segmentation changes. All other shifts would then be "word or mixed shifts"...
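One possible reading of this definition, written out as a sketch. classify_shift is a hypothetical helper and not part of the PR; it assumes the break tokens are "<eol>" and "<eob>", and that target is the insertion index after the shifted range has been removed.

```python
BREAK_TOKENS = {"<eol>", "<eob>"}  # assumed token names for line / block breaks


def classify_shift(tokens, start, length, target):
    """Label a shift as 'break shift' or 'word or mixed shift'."""
    moved = tokens[start:start + length]
    rest = tokens[:start] + tokens[start + length:]
    # Tokens the moved range jumps over (`target` indexes into `rest`).
    crossed = rest[start:target] if target > start else rest[target:start]

    def is_single_break(span):
        return len(span) == 1 and span[0] in BREAK_TOKENS

    if is_single_break(moved) or is_single_break(crossed):
        return "break shift"
    return "word or mixed shift"


example = ["A", "B", "<eol>", "C"]
print(classify_shift(example, start=2, length=1, target=1))  # <eol> moved
print(classify_shift(example, start=1, length=1, target=2))  # B moved across <eol>

subtitle = ["I", "recognized", "only", "<eol>", "half", "of", "the", "people", "<eob>"]
print(classify_shift(subtitle, start=2, length=1, target=7))  # "only" moved to the end
```

Under this reading both descriptions of the A B <eol> C shift get the same label, while moving "only" to the end of the second line is not counted as a break shift.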

sarapapi · 2022-12-19T17:04:04Z

Yes, I see. I understand the problem now, thanks for the explanation. I agree with your last comment: there are mixed cases, and these cannot be counted as break shifts only. I would go for your definition of break shift. Thank you again.

patrick-wilken · 2023-01-16T16:00:11Z

I rebased and now also switched deletions and insertions in the statistics, meaning edit operations are considered to be applied to the reference. Deletions are thus words missing in the hypothesis, and insertions are additional words in the hypothesis. This is not the direction the TER paper and code use, but as far as I know it is far more common (e.g. https://en.wikipedia.org/wiki/Word_error_rate).
So we leave "num_shifts" as is for now? Then good to merge, I think.
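For anyone comparing numbers produced before and after this change, the effect on the statistics can be sketched as a key swap. flip_direction is a hypothetical helper for illustration only; in the PR the switch happens where the statistics are computed.

```python
def flip_direction(stats):
    """Swap the deletion and insertion counts, leaving all other keys untouched."""
    flipped = {}
    for key, value in stats.items():
        if key.endswith("_deletions"):
            key = key[: -len("_deletions")] + "_insertions"
        elif key.endswith("_insertions"):
            key = key[: -len("_insertions")] + "_deletions"
        flipped[key] = value
    return flipped


# Counts from the Nov 15 output, i.e. in the TER (hypothesis -> reference) direction.
ter_direction = {
    "num_shifts": 230,
    "num_word_deletions": 620,
    "num_break_deletions": 208,
    "num_word_insertions": 566,
    "num_break_insertions": 217,
    "num_word_substitutions": 834,
    "num_break_substitutions": 86,
}

wer_direction = flip_direction(ter_direction)
print(wer_direction)
```

Shifts and substitutions are symmetric with respect to the direction, so only the deletion and insertion counts change places.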

patrick-wilken force-pushed the feature/statistics branch from e4dc277 to c74638a Compare November 16, 2022 14:07

sarapapi approved these changes Dec 15, 2022


patrick-wilken force-pushed the feature/statistics branch from c74638a to 2fae93a Compare January 16, 2023 15:37

patrick-wilken added 3 commits January 23, 2023 08:24

Added option to print number of different edit ops

555bedd

SubER statistics: added tests

fbe73ee

SubERMetricTests: fixed comment (insertion vs. deletion)

a1dba80

patrick-wilken force-pushed the feature/statistics branch from 2fae93a to a1dba80 Compare January 23, 2023 13:25

patrick-wilken merged commit 226d207 into main Jan 23, 2023

patrick-wilken deleted the feature/statistics branch January 23, 2023 13:28
