
Audio-to-text alignment with trained Espnet2 asr model #3018

Closed

qdenisq opened this issue Feb 26, 2021 · 22 comments

@qdenisq

qdenisq commented Feb 26, 2021

Hi, can anyone tell me how to do audio-to-text alignment using Espnet2? I can see there is asr_align.py in Espnet and was curious whether Espnet2 provides a similar interface. Thank you

@sw005320
Contributor

Very good point!
Yes, right now, we only support it in espnet1.
@lumaku, are you interested in extending your CTC alignment to espnet2?
As you know, espnet2 is more portable and has a better API, so your CTC alignment tool would become more powerful in these regards.

@lumaku
Contributor

lumaku commented Feb 28, 2021

Yes, I can do that.
A small pre-trained model for English speech (librispeech/wsj), similar to wsj.transformer_small.v1 in espnet1, would be very helpful for testing. Does espnet2 currently have a similar model?

@kamo-naoyuki
Collaborator

All pretrained models can be found at https://github.com/espnet/espnet_model_zoo

@qdenisq
Author

qdenisq commented Mar 2, 2021

@kamo-naoyuki @sw005320 I have one more question aside from the actual CTC alignment script and was hoping you could advise me on this. I have a Conformer trained on Librispeech with a BPE vocabulary (size=5000) according to the recipe from https://zenodo.org/record/4276519. After taking the argmax of the CTC output I get the following:

Argmax result

[   0    0    0    7    0    0    0    0    0    0    0    0   83    0
    0    0    0    0    0    0    0    0  729    0    0    0    0    0
    0    0    0    0 3415    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0]

, where

{
 0: '<blank>',
 7: '▁A',
 83: '▁LITTLE',
 729:  '▁GREEN',
 3415: '▁FROG'
}

Thus,

Tokenized transcript
['▁A', '▁LITTLE', '▁GREEN', '▁FROG']

I checked the indices of the non-zero tokens against timestamps in the audio recording. These indices seem to match the timestamps of the beginnings of the tokens in the recording. This means one could easily retrieve the starting timestamps, but it is unclear to me how to get the endings of tokens.

Do I understand correctly that it is not really feasible to retrieve the token lengths from the CTC output, since only the starting frame is encoded with the token id and the rest is filled with <blank> until the next token is detected?
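For reference, a minimal sketch of how I map the non-blank argmax frames to approximate onsets (the hop length and subsampling factor below are placeholder values, not the recipe's actual settings):

import numpy as np

# Placeholder timing parameters; adjust to the model's actual frontend config.
hop_length_s = 0.01      # seconds per frontend frame (assumed)
subsampling_factor = 4   # encoder frame reduction (assumed)

ctc_argmax = np.array([0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 83, 0,
                       0, 0, 0, 0, 0, 0, 0, 0, 729, 0, 0, 0, 0, 0,
                       0, 0, 0, 0, 3415, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                       0, 0, 0, 0, 0, 0, 0])

# Frames where CTC emits a non-blank token (blank id is 0 here).
for frame in np.nonzero(ctc_argmax)[0]:
    onset_s = frame * subsampling_factor * hop_length_s
    print(f"token {ctc_argmax[frame]}: ~{onset_s:.2f} s")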

@lumaku
Contributor

lumaku commented Mar 2, 2021

The CTC layer does not assign time steps to tokens (as hybrid ASR does with states/phonemes), but rather fires when a token "occurs". That can be at the start, in the middle, or at the end of a word/token.
CTC segmentation takes the length of the tokens into account (this can be deactivated if needed).
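Roughly, the underlying ctc_segmentation package works like this (a sketch, not the final espnet2 interface; lpz and char_list are assumed to come from your model: the CTC log-posteriors of shape frames x vocab, and the token inventory):

from ctc_segmentation import CtcSegmentationParameters, ctc_segmentation
from ctc_segmentation import determine_utterance_segments, prepare_text

config = CtcSegmentationParameters()
config.index_duration = 0.04  # seconds per CTC output index (model-dependent)

text = ["A LITTLE GREEN FROG"]
ground_truth_mat, utt_begin_indices = prepare_text(config, text, char_list)
timings, char_probs, state_list = ctc_segmentation(config, lpz, ground_truth_mat)
segments = determine_utterance_segments(
    config, utt_begin_indices, char_probs, timings, text
)
# Each entry of segments: (start in seconds, end in seconds, confidence score).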

@sw005320
Contributor

sw005320 commented Mar 2, 2021

Yes, @lumaku is right.
@qdenisq, I think you could use the fired timing as an approximate onset of the token, but as @lumaku says, it is often shifted.
Also, the shift becomes very large in adverse environments (sometimes more than 10 frames).
This is just a note.

@lumaku
Contributor

lumaku commented Mar 6, 2021

Current progress: master...lumaku:espnet2_ctc_segmentation

To test the new CTCSegmentation module:

import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_align import CTCSegmentation
d = ModelDownloader(cachedir="./modelcache")
wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")
speech, rate = soundfile.read("./test_utils/ctc_align_test.wav")
aligner = CTCSegmentation(**wsjmodel, kaldi_style_text=False)
text = ["THE SALE OF THE HOTELS",
        "IS PART OF HOLIDAY'S STRATEGY",
        "TO SELL OFF ASSETS",
        "AND CONCENTRATE ON PROPERTY MANAGEMENT"]
segments = aligner(speech, text)
print(segments)
# utt_0000 utt 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt_0001 utt 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt_0002 utt 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt_0003 utt 4.20 6.02 -0.9003 AND CONCENTRATE ON PROPERTY MANAGEMENT

Some remarks and discussion points:

  • In my installation, importing Speech2Text raised an import error in the text cleaner for some libraries that are optional for ASR. I changed the corresponding imports to conditional ones and included this in a separate commit.

  • The output format is a regular kaldi-style segments file.

  • Input data includes the audio file (or its name), the ground-truth text, and optionally an utterance name. In espnet1, the audio data was stored in a json; in espnet2, this has changed. For example, asr_inference uses ASRTask.build_streaming_iterator. Which module / dataloader is best to use here instead?

@kamo-naoyuki
Collaborator

Thanks.

  • In my installation, importing Speech2Text raised an import error in the text cleaner for some libraries that are optional for ASR. I changed the corresponding imports to conditional ones and included this in a separate commit.

Text cleaner is a mandatory module for espnet. Please keep espnet2/text/cleaner.py as it is.

  • The output format is a regular kaldi-style segments file.

Segments style is good for the output format of the command line tool.

Sorry, I'm not sure how you are giving the sampling rate to the ctc-segmentation function.

  • Input data includes the audio file (or its name), the ground-truth text, and optionally an utterance name. In espnet1, the audio data was stored in a json; in espnet2, this has changed. For example, asr_inference uses ASRTask.build_streaming_iterator. Which module / dataloader is best to use here instead?

Please use ASRTask.build_streaming_iterator as it is. I'm not sure why you asked this.

@lumaku
Contributor

lumaku commented Mar 9, 2021

sampling rate to the ctc-segmentation function

The algorithm basically measures time in frames, so it needs two parameters: the time per input frame, i.e., frame_duration, and the factor by which the encoder reduces its input frames, subsampling_factor. The frame_duration can also be derived automatically from the sample rate fs and the hop_length of DefaultFrontend - I'll add this to the module.
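In code, the derivation looks roughly like this (illustrative values; hop_length and subsampling_factor depend on the model configuration):

fs = 16000               # sampling rate of the input audio
hop_length = 128         # STFT hop of DefaultFrontend (model-dependent)
subsampling_factor = 4   # frame reduction in the encoder (model-dependent)

frame_duration = hop_length / fs                       # seconds per frontend frame
index_duration = frame_duration * subsampling_factor   # seconds per CTC output index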

@kamo-naoyuki
Collaborator

@lumaku I thought about how to give it. My concern is that we can't use other frontends for alignment if we derive the information from DefaultFrontend. We would need to implement an additional interface for every frontend for extensibility.

About hop_length and subsampling_factor, how about calculating them from the audio length and the output length of the encoder? In this way, it's not necessary to assume a particular frontend.

audio, fs = soundfile.read("in.wav")
audio_len = len(audio)

encoder_out = encoder(audio)
encoder_out_len = len(encoder_out)

About fs,

  1. For the Python interface, there is no problem because we can give it directly.
        speech, fs = soundfile.read("in.wav")
        aligned = aligner(speech, text, fs)
  2. For the command line tool, we have a problem because build_streaming_iterator doesn't return the fs value in the mini-batch. This is because the input feature is not always a raw wave; it can also be fbank features. We need a --fs option to give it.
        parser.add_argument("--fs")

Note that fbank can also be used for training in espnet2, but I think it's okay to ignore that case for alignment.
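Putting it together, a sketch of the calculation (encoder(...) stands in for the model-specific way of obtaining the encoder output):

import soundfile

speech, fs = soundfile.read("in.wav")
audio_len = len(speech)

encoder_out = encoder(speech)        # pseudocode: (frames x dim) encoder output
encoder_out_len = len(encoder_out)

samples_per_index = audio_len / encoder_out_len
index_duration = samples_per_index / fs   # seconds per CTC output index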

@lumaku
Contributor

lumaku commented Mar 11, 2021

I like your idea of calculating the ratio of audio length to encoded frames, as it is more user-friendly. This value can be determined automatically from the model - I checked, it only takes a few milliseconds. Then only fs is needed, at the initialization of the module.

@stale

stale bot commented Jul 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@SaadBazaz

SaadBazaz commented Dec 8, 2021

Is this available in the latest espnet?
I'm getting the following error when running the code provided in the comments above:

Traceback (most recent call last):
  File "/home/saad/Projects/deepdub/deepdub_cli/deepdub/pipeline/extract/transcription/alignment/espnet.py", line 2, in <module>
    from espnet_model_zoo.downloader import ModelDownloader
  File "/home/saad/anaconda3/lib/python3.9/site-packages/espnet_model_zoo/downloader.py", line 23, in <module>
    from espnet2.main_funcs.pack_funcs import find_path_and_change_it_recursive
  File "/home/saad/anaconda3/lib/python3.9/site-packages/espnet2/__init__.py", line 3, in <module>
    from espnet import __version__  # NOQA
  File "/home/saad/Projects/deepdub/deepdub_cli/deepdub/pipeline/extract/transcription/alignment/espnet.py", line 2, in <module>
    from espnet_model_zoo.downloader import ModelDownloader
ImportError: cannot import name 'ModelDownloader' from partially initialized module 'espnet_model_zoo.downloader' (most likely due to a circular import) (/home/saad/anaconda3/lib/python3.9/site-packages/espnet_model_zoo/downloader.py)

@lumaku
Contributor

lumaku commented Dec 8, 2021

@SaadBazaz The code still works; nevertheless, I have updated it, because a few default settings have changed since then, e.g., the parameter kaldi_style_text.
In your case, it seems that you also have other code around this that interferes with the import of espnet_model_zoo.
Use a fresh Python notebook, take the updated code from above, and try again. Good luck.

@SaadBazaz

@lumaku Thanks for the quick response. So in this fresh environment, should I install espnet via pip install espnet, and then espnet-model-zoo through pip install espnet-model-zoo? Sorry for the noob question, I'm new to this. If you redirect me to another source that's ok too.

@SaadBazaz

@lumaku I tried with a fresh install with Python 3.9 and PyTorch 1.1.0, and am on macOS now (I was on Linux before).
I put your code in a .py file and executed from the Terminal.

I installed in the following order in a new conda environment:

conda install pytorch torchvision torchaudio -c pytorch
sudo pip3 install espnet
sudo pip3 install espnet-model-zoo

Then I ran using sudo python3 <the_file>.py.

It gives me the same error as above! Any ideas?

@lumaku
Contributor

lumaku commented Dec 9, 2021

As already written in the error message above, your issue is a circular import: You import espnet_model_zoo and then at the same time import another file that also imports espnet_model_zoo.

@SaadBazaz

Hmm. I am using only the code you've written above (legit, nothing else). Is it possible that espnet-model-zoo is now integrated into espnet?

@kamo-naoyuki
Collaborator

kamo-naoyuki commented Dec 9, 2021

There is a circular import as lumaku said.

/home/saad/Projects/deepdub/deepdub_cli/deepdub/pipeline/extract/transcription/alignment/espnet.py
-> site-packages/espnet_model_zoo/downloader.py
-> espnet2/__init__.py
-> /home/saad/Projects/deepdub/deepdub_cli/deepdub/pipeline/extract/transcription/alignment/espnet.py

This is obviously not an issue on the espnet side; the problem comes from extract/transcription/alignment/espnet.py. I guess you are executing from extract/transcription/alignment/, or have set the Python path to there, so your local espnet.py shadows the installed espnet package and from espnet import __version__ resolves back to your own file.

@SaadBazaz

Oh.
My.
God.

@SaadBazaz

I am ashamed.

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
