Computing time-constrained WER #17
There is no straightforward answer to your problem, but the following might help to start a discussion. We classify WER algorithms by these three properties:
Could you elaborate further on your requirements regarding the properties defined above? In particular, whether you have diarization labels and want to use them.
Can you clarify what exactly you mean by asclite aWER? Is it the WER that is used in the libri-CSS publication? From our understanding, the asclite WER from the libri-CSS publication does the following:
Currently we are working on a WER that considers diarization and time stamps (word- or segment-level). You can find it as tcpWER in this package, but we haven't decided yet which hyperparameters we want to suggest. We plan to publish it for the CHiME workshop.
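If it ends up following the same command-line pattern as the other metrics in this package, computing it might look roughly like the line below; the subcommand name, flags, and collar value here are assumptions, not a settled interface:

```bash
# Hypothetical invocation; exact subcommand and flag names may differ in the released version.
meeteval-wer tcpwer -r reference.stm -h hypothesis.stm --collar 5
```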
This is not "beyond the scope of this toolkit", but it is beyond our know-how. We think normalization is kind of orthogonal to the actual WER calculation, and for that it might be better to use external tools from people who have more experience in this topic (e.g., language model people). One idea would be to use Kaldi, but we haven't thought about this until now. We are open to suggestions. We have some more plans, but they are at an early stage and we don't want to talk about those yet in public.
Some clarifications: By asclite WER, I meant exactly what you described. One problem with this metric seems to be that references are "loose" (i.e., STM files). By "normalization" I meant providing multiple possible references (similar to what is done to compute BLEU scores in MT).
Could you give an example where multiple possible references are useful? In MT this is different, because translations have more degrees of freedom. There are a few issues if we were to allow a "graph" instead of a sequence of words for the reference:
In ASR, SCLITE and ASCLITE handle this through "GLM files": basically, you provide rules that map words or phrases to alternative reference forms (roughly like the sketch below). In any case, this is a "desirable" but not "necessary" property to have. My main purpose in creating this issue was basically to get your insights on what kind of metrics would work for the task of long-form ASR and segmentation. Edit: If you are planning to attend ICASSP, we can have more discussions then :)
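To be concrete about the idea (not the actual GLM syntax; the rules below are made up), such alternatives can be applied as a text preprocessing step that rewrites transcripts before scoring. A real GLM-based scorer keeps both forms as acceptable alternatives during alignment, whereas this simpler sketch just picks one canonical form:

```python
# Hypothetical sketch of GLM-style alternative-reference rules applied as a
# preprocessing step before scoring (rule contents are made up; real GLM
# files use their own syntax).
import re

RULES = [
    (r"\bgonna\b", "going to"),
    (r"\bok\b", "okay"),
]

def normalize(transcript: str) -> str:
    out = transcript.lower()
    for pattern, replacement in RULES:
        out = re.sub(pattern, replacement, out)
    return out

print(normalize("OK we're gonna start"))  # -> "okay we're going to start"
```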
Thanks for the explanation. Yes, with the timing information, the complexity can be significantly reduced.
There are different long-form ASR and segmentation systems, and they are evaluated differently. Let's say you build a "CSS pipeline" [1]. Some people stop before the diarization and want to evaluate "separation + ASR". When you build a system that yields a speaker-attributed transcription with temporal information, the asclite tool ignores the speaker-attributed part of your estimate. For this situation, we implemented a time-constrained Levenshtein distance and replaced the classical Levenshtein distance in cpWER with it: the time-Constrained minimum Permutation Word Error Rate (tcpWER); a rough sketch of the time-constrained matching idea is given below. We provide several options to account for different timing accuracies between reference and hypothesis (e.g., with "ctm" estimates).
I am not there, but Thilo will attend the conference.
[1] https://arxiv.org/pdf/2011.02014.pdf
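For illustration, here is a minimal sketch of the time-constrained matching idea mentioned above (not the actual tcpWER implementation in this package, and the collar handling is simplified): two words may only be matched or substituted if their time intervals overlap after being extended by a collar; otherwise the aligner is forced to use a deletion plus an insertion.

```python
# Minimal sketch of a time-constrained Levenshtein distance (illustration only).
# Each word is a tuple (token, begin_time, end_time).

def overlaps(a, b, collar=0.0):
    # Intervals overlap after extending both by the collar.
    return a[1] - collar < b[2] + collar and b[1] - collar < a[2] + collar

def time_constrained_levenshtein(ref, hyp, collar=0.0):
    INF = float("inf")
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                                  # deletions
    for j in range(1, m + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r, h = ref[i - 1], hyp[j - 1]
            if overlaps(r, h, collar):
                sub = 0 if r[0] == h[0] else 1       # correct or substitution
            else:
                sub = INF                            # matching across time is forbidden
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + sub)
    return d[n][m]

ref = [("hello", 0.0, 0.5), ("world", 0.6, 1.0)]
hyp = [("hello", 0.1, 0.4), ("word", 5.0, 5.3)]      # second word is far off in time
print(time_constrained_levenshtein(ref, hyp, collar=0.5))  # -> 2 (deletion + insertion)
```

cpWER additionally searches over speaker permutations; the time-constrained variant plugs a distance like the one above into that search in place of the plain Levenshtein distance.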
Thanks. For the models we are using now, we don't have speaker attribution. I am actually using the asclite WER at the moment, so it seems we are on the same page about that.
I'll add a few more comments:
I am thinking of a metric for long-form ASR and segmentation. Consider the following scenario:
If the reference is an STM and the hypothesis is a CTM (illustrated below), this may correspond to computing the asclite aWER metric, but we also want to support (i) other kinds of systems that may not provide word-level timestamps, and (ii) a tighter penalty on segmentation by providing a reference CTM.
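For concreteness, here are illustrative (made-up) lines in the two formats: the STM reference carries segment-level timing, while the CTM hypothesis carries word-level timing.

```python
# Illustrative (made-up) reference/hypothesis lines.
# STM reference (segment-level timing, optional label column omitted):
#   filename channel speaker begin end transcript...
# CTM hypothesis (word-level timing):
#   filename channel begin duration word

stm_line = "meeting1 1 spk_A 12.34 17.80 thanks for joining the call"
ctm_line = "meeting1 1 12.40 0.31 thanks"

fname, chan, spk, beg, end, *words = stm_line.split()
print(spk, float(beg), float(end), " ".join(words))
```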
Additionally, we want to be able to include multiple possible references (e.g., references may be orthographic or normalized in some way), although I understand that this may be beyond the scope of this toolkit.
I am looking for suggestions about what would be a good metric (if one exists) for this scenario.
(cc @MartinKocour since we were having related discussions.)