Audio-to-text alignment with trained Espnet2 asr model #3018
Comments
Very good point!
Yes, I can do that.
All pretrained models can be found at https://github.com/espnet/espnet_model_zoo
@kamo-naoyuki @sw005320 I have one more question aside from the actual CTC alignment script and was hoping you could advise me on this. I have a Conformer trained on Librispeech with a BPE vocabulary (size=5000) according to the recipe from https://zenodo.org/record/4276519. After taking the argmax of the CTC output, I get the following:

[argmax result and tokenized transcript omitted]

I checked the indexes of the non-zero tokens against the timestamps in the audio recording. These indexes seem to match the timestamps of the beginnings of the tokens in the recording. This means one could easily retrieve the starting timestamps, but it is unclear to me how to get the endings of the tokens. Do I understand correctly that it is not really feasible to retrieve the token lengths from the CTC output, since only the starting frame is encoded with the token id and the rest is filled with blanks?
The CTC layer does not assign time steps to tokens (as hybrid ASR does with states/phonemes), but rather fires when a token "occurs". That can be at the start, in the middle, or at the end of a word/token.
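To illustrate why end times cannot be read off a plain argmax, here is a toy sketch (made-up frame values and token ids, not actual espnet output; the blank id is assumed to be 0):

```python
import numpy as np

# Toy argmax of a CTC posterior over 10 frames; 0 is the blank id.
# The network "fires" each non-blank token on a single frame.
argmax_ids = np.array([0, 0, 3, 0, 0, 0, 7, 0, 0, 0])

fire_frames = np.nonzero(argmax_ids)[0]   # frames where a token fires
fired_tokens = argmax_ids[fire_frames]    # which token fired there
print(list(zip(fired_tokens.tolist(), fire_frames.tolist())))
# -> [(3, 2), (7, 6)]: each token occupies a single frame, and the
# surrounding frames are blanks, so the token's end time is not encoded.
```

This is why a segmentation algorithm on top of the CTC posteriors (rather than a bare argmax) is needed to recover token spans.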
Current progress: master...lumaku:espnet2_ctc_segmentation

To test the new code:

```python
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_align import CTCSegmentation

d = ModelDownloader(cachedir="./modelcache")
wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")

speech, rate = soundfile.read("./test_utils/ctc_align_test.wav")
aligner = CTCSegmentation(**wsjmodel, kaldi_style_text=False)
text = ["THE SALE OF THE HOTELS",
        "IS PART OF HOLIDAY'S STRATEGY",
        "TO SELL OFF ASSETS",
        "AND CONCENTRATE ON PROPERTY MANAGEMENT"]
segments = aligner(speech, text)
print(segments)
# utt_0000 utt 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt_0001 utt 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt_0002 utt 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt_0003 utt 4.20 6.02 -0.9003 AND CONCENTRATE ON PROPERTY MANAGEMENT
```

Some remarks and discussion points:
Thanks.
The text cleaner is a mandatory module for espnet. Please keep it.
The segments style is good for the output format of the command line tool. Sorry, I'm not sure how you are giving the sampling rate to the ctc-segmentation function.
Please use ASRTask.build_streaming_iterator as it is. I'm not sure why you asked this.
The algorithm basically measures time in frames, so it needs two parameters: the time per input frame, i.e.
@lumaku I thought about how to give it. My concern is that we can't use any other frontend for alignment if we derive the information from DefaultFrontend. We would need to implement an additional interface for all frontends for extensibility.

About the audio length:

```python
audio, fs = soundfile.read("in.wav")
audio_len = len(audio)
encoder_out = encoder(audio)
encoder_out_len = len(encoder_out)
```

About fs:

```python
speech, fs = soundfile.read("in.wav")
aligned = aligner(speech, text, fs)
parser.add_argument("--fs")
```

Note that fbank can also be used for training in espnet2, but I think it's okay to ignore it for alignment.
I like your idea to calculate the ratio of audio length to encoded frames, as it is more user-friendly. This value can be determined automatically from the model: I checked, and it only takes a few milliseconds. Then, only
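A minimal sketch of how such a ratio could be determined with a dummy forward pass (the `encoder` here is a stand-in with hypothetical hop and subsampling values, not the actual espnet model API):

```python
import numpy as np

def encoder(speech):
    # Stand-in for the model encoder: assume a 10 ms hop (160 samples
    # at 16 kHz) with 4x subsampling; real models differ.
    hop, subsample = 160, 4
    n_frames = len(speech) // (hop * subsample)
    return np.zeros((n_frames, 256))  # dummy encoder states

fs = 16000
dummy = np.zeros(fs)                      # one second of silence
enc_out = encoder(dummy)
samples_per_frame = len(dummy) / len(enc_out)
time_per_frame = samples_per_frame / fs   # seconds per encoder frame
print(time_per_frame)                     # 0.04 under these assumptions
```

The user then only needs to supply the waveform and its sampling rate; the frame duration falls out of the model itself.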
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is this available in the latest espnet?
@SaadBazaz The code still works; nevertheless, I updated it, because a few default settings have changed since then, e.g., the parameter
@lumaku Thanks for the quick response. So in this fresh environment, should I install espnet via
@lumaku I tried with a fresh install with Python 3.9 and PyTorch 1.1.0, and am on macOS now (I was on Linux before). I installed in the following order in a new conda environment:
Then I ran the code. It gives me the same error as above! Any ideas?
As already written in the error message above, your issue is a circular import: you import
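For reference, this failure mode can be reproduced with two tiny modules that import from each other (hypothetical module names, unrelated to the actual espnet files):

```python
import os
import sys
import tempfile

# Write two modules that import from each other.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "mod_a.py"), "w") as f:
    f.write("from mod_b import g\ndef f():\n    return g()\n")
with open(os.path.join(tmp, "mod_b.py"), "w") as f:
    f.write("from mod_a import f\ndef g():\n    return 1\n")

sys.path.insert(0, tmp)
try:
    import mod_a  # loads mod_b, which imports back into the
                  # partially initialized mod_a before f is defined
except ImportError as e:
    print("circular import:", e)
```

The fix is the same in any such case: rename the conflicting file or run from a directory where your script does not shadow the installed package.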
Hmm. I am using only the code you've written above (legit, nothing else). Is it possible that espnet-model-zoo is now integrated in espnet?
There is a circular import, as lumaku said.
This is obviously not an issue on the espnet side; the problem comes from
Oh.
I am ashamed.
Hi, can anyone tell me how one does audio-to-text alignment using Espnet2? I can see there is `asr_align.py` in Espnet and was curious whether Espnet2 provides a similar interface. Thank you.