New metric "SubER-cased"; also use tokenizer for "WER-cased" #6

patrick-wilken · 2022-12-18T16:32:47Z

See #5
Adds the metric "SubER-cased", a case- and punctuation-sensitive variant of SubER. Tokenization is used to treat punctuation marks as separate tokens. Note that the analysis in our paper shows weaker correlation with human post-editing effort for this variant. However, it might be useful when punctuation and casing errors are considered highly important.

I also added tokenization to "WER-cased" for consistency with "SubER-cased". This makes sense intuitively, and it also yields a slightly higher correlation than what we reported for "WER + case/punct" in the paper. (The numbers in Table 1, row 2 become -0.685, -0.520, -0.504, -0.657.) I think no one relies on the exact behaviour of "WER-cased" yet, so it's ok to make a breaking change.
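
To illustrate why tokenization matters for a case- and punctuation-sensitive WER, here is a minimal sketch. It is not the code from this PR: `tokenize` is a hypothetical, simplified regex tokenizer (not the tokenizer SubER uses), and `word_error_rate` is a plain token-level Levenshtein distance divided by the reference length.

```python
import re


def tokenize(text: str) -> list[str]:
    # Hypothetical simplification: runs of word characters stay together,
    # every other non-space character (punctuation) becomes its own token.
    # Casing is left untouched. Contractions such as "don't" get split here,
    # which a real tokenizer would likely handle differently.
    return re.findall(r"\w+|[^\w\s]", text)


def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    # Token-level Levenshtein distance (substitutions, insertions, deletions
    # all cost 1), normalized by the number of reference tokens.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(reference)


reference = "Yes, I know."
hypothesis = "Yes I know"

# Whitespace splitting: "Yes," vs "Yes" and "know." vs "know" are two full
# word substitutions out of three reference words.
wer_whitespace = word_error_rate(reference.split(), hypothesis.split())

# With punctuation as separate tokens, only "," and "." are missing:
# two deletions out of five reference tokens.
wer_tokenized = word_error_rate(tokenize(reference), tokenize(hypothesis))

print(round(wer_whitespace, 3))  # 0.667
print(round(wer_tokenized, 3))   # 0.4
```

Without tokenization, a missing comma makes the whole attached word count as an error. With punctuation as separate tokens, the word itself still matches and only the punctuation mark is penalized.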

patrick-wilken added 3 commits December 18, 2022 07:46

Added SubER-cased metric

d24dc42

Added test for SubER-cased

d6cf4ba

BREAKING CHANGE: use tokenizer for WER-cased

f088902

patrick-wilken mentioned this pull request Dec 18, 2022

Punctuation and case sensitive #5

Closed

README: added section about SubER-cased

aab4d2a

sarapapi approved these changes Jan 9, 2023

patrick-wilken merged commit 6745153 into main Jan 9, 2023

patrick-wilken deleted the feauture/suber_cased branch January 9, 2023 17:33
