Connection between output tokens and input audio #4278

Closed
Daniel-asr opened this issue Apr 14, 2022 · 6 comments
Labels: Feature request, Force alignment including CTC segmentation

Comments

@Daniel-asr

I want to find a way to know which interval in the input audio corresponds to each output token. For example, if "hello" was said in the audio, I want a mapping such as:

  • H --> [0.2, 0.3] (seconds)
  • E --> [0.3, 0.7]
  • L --> [0.7, 1.1]
  • L --> [0.7, 1.1]
  • O --> [1.1, 1.8]

I saw this attention images issue, which might help, since there is a roughly linear correspondence between the input audio timing and the encoder frames, but I am still not sure whether this is the right way to go or whether there are other tools in ESPnet for my need. Furthermore, I am expecting a single correspondence mapping, so how can there be more than one attention image (as shown in the issue above)?

Thanks!

@sw005320
Contributor

sw005320 commented Apr 14, 2022

ESPnet models usually have a CTC branch, and we can get token onset information (and each token's offset from the frame preceding the next token's onset) with a Viterbi algorithm.

@lumaku, is it easy to obtain the token-level segmentation information from your CTC segmentation?
This is one of the frequent requests.
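For intuition, here is a toy sketch (plain numpy, not the actual ESPnet implementation) of such a Viterbi pass over the CTC posteriors: it finds the best blank-interleaved path for a given token sequence and reads off each token's onset frame, which can then be converted to seconds with the feature frame shift and the encoder subsampling factor.

```python
# Toy sketch of Viterbi forced alignment over CTC log-posteriors.
import numpy as np

def ctc_viterbi_onsets(log_probs, labels, blank=0):
    """log_probs: (T, V) per-frame log posteriors; labels: token ids without blanks.
    Returns the onset frame of each label on the best CTC path."""
    T = log_probs.shape[0]
    ext = [blank]
    for l in labels:                      # blank-interleaved states: b, y1, b, y2, ..., b
        ext += [l, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    step = np.zeros((T, S), dtype=int)    # 0/1/2 = stay / advance one state / skip a blank
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]
            if s >= 1:
                cands.append(alpha[t - 1, s - 1])
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            k = int(np.argmax(cands))
            alpha[t, s] = cands[k] + log_probs[t, ext[s]]
            step[t, s] = k
    # the best path must end in the final blank or the final label
    s = S - 1 if alpha[T - 1, S - 1] >= alpha[T - 1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s -= step[t, s]
        path.append(s)
    path.reverse()                        # path[t] = extended state occupied at frame t
    onsets, seen = [], set()
    for t, s in enumerate(path):
        if ext[s] != blank and s not in seen:
            seen.add(s)
            onsets.append(t)              # first frame of each label state = token onset
    return onsets

# Tiny demo: 6 frames, vocab {0: blank, 1: "A", 2: "B"}, transcript "AB"
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.8, 0.1, 0.1]])
print(ctc_viterbi_onsets(np.log(probs), labels=[1, 2]))  # -> [1, 4]
```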

sw005320 added the Feature request and Force alignment including CTC segmentation labels on Apr 14, 2022
@sw005320
Contributor

I would not recommend using the attention map to obtain the segmentation results.
The attention maps are noisy, and in the Transformer case there are multiple of them (one per layer and head), so it is not trivial to aggregate that information.

@Daniel-asr
Author

I agree. I believe that what makes the attention map noisy is the self-attention in the encoder layers.
I am curious about the CTC segmentation option. How can we evaluate that option? @lumaku

@sw005320
Contributor

> ESPnet models usually have an attention branch, and we can get token onset information (and each token's offset from the frame preceding the next token's onset) with a Viterbi algorithm.

Sorry, "an attention branch" --> "a CTC branch"

@lumaku
Contributor

lumaku commented Apr 16, 2022

> @lumaku, is it easy to obtain the token-level segmentation information from your CTC segmentation? This is one of the frequent requests.

It is possible: you can set the intervals (utterances) to single tokens.

Use a character-based model. Also, the number of output tokens of the ASR model must be at least two times larger than the number of ground-truth tokens; if not, the aligned tokens will overlap and become inaccurate.

The CTC output always takes the form of spikes at the output index where a certain token "occurs". The CTC segmentation algorithm sets the interval as the next-neighbor region around the token activation. This is usually a bit tricky for the last token (but that also depends on the audio and the ASR model).

@Daniel-asr You can test this, for example, with the character-based WSJ model kamo-naoyuki/wsj on the example audio file test_utils/ctc_align_test.wav. The algorithm expects a list of utterances (the parameter text); you can pass a list of characters instead, and it will align those and give you an interval for each character.
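A minimal sketch of that recipe, assuming the espnet2 `CTCSegmentation` interface and the `espnet_model_zoo` downloader; treat the exact keyword arguments and the placeholder transcript as assumptions rather than a verified example:

```python
# Hedged sketch: per-character alignment with the character-based WSJ model.
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_align import CTCSegmentation

# Download the pretrained model mentioned above and build the aligner.
downloaded = ModelDownloader().download_and_unpack("kamo-naoyuki/wsj")
# kaldi_style_text=False is assumed here so that plain strings (no utt-ids) are accepted.
aligner = CTCSegmentation(**downloaded, fs=16000, kaldi_style_text=False)

speech, fs = soundfile.read("test_utils/ctc_align_test.wav")

# Pass one "utterance" per character instead of whole sentences,
# so the returned segments are per-character intervals.
transcript = "THE SALE OF THE HOTELS"  # placeholder; use the clip's actual transcript
chars = [c for c in transcript if c != " "]
segments = aligner(speech, chars)
print(segments)  # start/end times (seconds) and a confidence score per character
```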

@sw005320
Contributor

> @Daniel-asr You can test this, for example, with the character-based WSJ model kamo-naoyuki/wsj on the example audio file test_utils/ctc_align_test.wav. The algorithm expects a list of utterances (the parameter text); you can pass a list of characters instead, and it will align those and give you an interval for each character.

Oh, I see. If we prepare a list of characters or tokens, the model is expected to provide token-level alignments.
Thanks for your tips!

sw005320 added a commit that referenced this issue Apr 18, 2022
Add a description of token-level alignment, as discussed in #4278