Audio-to-text alignment with trained Espnet2 asr model #3018
Comments
Very good point!
Yes, I can do that.
All pretrained models can be found at https://github.com/espnet/espnet_model_zoo
@kamo-naoyuki @sw005320 I have one more question aside from the actual CTC alignment script and was hoping you could advise me on this. I have a Conformer trained on Librispeech with a BPE vocabulary (size=5000) according to the recipe from https://zenodo.org/record/4276519. After taking the argmax of the CTC output, I get the following:

[argmax result and tokenized transcript omitted]

I checked the indexes of the non-zero tokens against the timestamps in the audio recording. These indexes seem to match the timestamps of the beginnings of the tokens in the recording. This means one could easily retrieve the starting timestamps, but it is unclear to me how to get the endings of the tokens. Do I understand correctly that it is not really feasible to retrieve the token lengths from the CTC output, since only the starting frame is encoded with the token id and the rest is filled with blanks?
The CTC layer does not assign time steps to tokens (as hybrid ASR does with states/phonemes), but rather fires when a token "occurs". That can be at the start, in the middle, or at the end of a word/token.
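To illustrate why end times cannot be read off a plain argmax, here is a toy sketch (made-up frame values and token ids, not actual espnet output; the blank id is assumed to be 0):

```python
import numpy as np

# Toy argmax of a CTC posterior over 10 frames; 0 is the blank id.
# The network "fires" each non-blank token on a single frame.
argmax_ids = np.array([0, 0, 3, 0, 0, 0, 7, 0, 0, 0])

fire_frames = np.nonzero(argmax_ids)[0]   # frames where a token fires
fired_tokens = argmax_ids[fire_frames]    # which token fired there
print(list(zip(fired_tokens.tolist(), fire_frames.tolist())))
# -> [(3, 2), (7, 6)]: each token occupies a single frame, and the
# surrounding frames are blanks, so the token's end time is not encoded.
```

This is why a segmentation algorithm on top of the CTC posteriors (rather than a bare argmax) is needed to recover token spans.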
Current progress: master...lumaku:espnet2_ctc_segmentation

To test the new code:

```python
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_align import CTCSegmentation

d = ModelDownloader(cachedir="./modelcache")
wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")

speech, rate = soundfile.read("./test_utils/ctc_align_test.wav")
aligner = CTCSegmentation(**wsjmodel, kaldi_style_text=False)
text = ["THE SALE OF THE HOTELS",
        "IS PART OF HOLIDAY'S STRATEGY",
        "TO SELL OFF ASSETS",
        "AND CONCENTRATE ON PROPERTY MANAGEMENT"]
segments = aligner(speech, text)
print(segments)
# utt_0000 utt 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt_0001 utt 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt_0002 utt 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt_0003 utt 4.20 6.02 -0.9003 AND CONCENTRATE ON PROPERTY MANAGEMENT
```

Some remarks and discussion points:
Thanks.
The text cleaner is a mandatory module for espnet. Please keep it.
The segments style is good for the output format of the command line tool. Sorry, I'm not sure how you are giving the sampling rate to the ctc-segmentation function.
Please use ASRTask.build_streaming_iterator as it is. I'm not sure why you asked this.
The algorithm basically measures time in frames, so it needs two parameters: the time per input frame, i.e.
@lumaku I thought about how to give it. My concern is that we can't use any other frontend for alignment if we derive the information from DefaultFrontend. We would need to implement an additional interface for all frontends for extensibility.

About the audio length:

```python
audio, fs = soundfile.read("in.wav")
audio_len = len(audio)
encoder_out = encoder(audio)
encoder_out_len = len(encoder_out)
```

About fs:

```python
speech, fs = soundfile.read("in.wav")
aligned = aligner(speech, text, fs)
parser.add_argument("--fs")
```

Note that fbank can also be used for training in espnet2, but I think it's okay to ignore it for alignment.
I like your idea to calculate the ratio of audio length to encoded frames, as it is more user-friendly. This value can be determined automatically from the model: I checked, and it only takes a few milliseconds. Then, only
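A minimal sketch of how such a ratio could be determined with a dummy forward pass (the `encoder` here is a stand-in with hypothetical hop and subsampling values, not the actual espnet model API):

```python
import numpy as np

def encoder(speech):
    # Stand-in for the model encoder: assume a 10 ms hop (160 samples
    # at 16 kHz) with 4x subsampling; real models differ.
    hop, subsample = 160, 4
    n_frames = len(speech) // (hop * subsample)
    return np.zeros((n_frames, 256))  # dummy encoder states

fs = 16000
dummy = np.zeros(fs)                      # one second of silence
enc_out = encoder(dummy)
samples_per_frame = len(dummy) / len(enc_out)
time_per_frame = samples_per_frame / fs   # seconds per encoder frame
print(time_per_frame)                     # 0.04 under these assumptions
```

The user then only needs to supply the waveform and its sampling rate; the frame duration falls out of the model itself.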
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is this available in the latest espnet?
@SaadBazaz The code still works; nevertheless, I updated it, because a few default settings have changed since then, e.g., the parameter
@lumaku Thanks for the quick response. So in this fresh environment, should I install espnet via
@lumaku I tried with a fresh install with Python 3.9 and PyTorch 1.1.0, and am on macOS now (I was on Linux before). I installed in the following order in a new conda environment:
Then I ran the code. It gives me the same error as above! Any ideas?
As already written in the error message above, your issue is a circular import: you import
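For reference, this failure mode can be reproduced with two tiny modules that import from each other (hypothetical module names, unrelated to the actual espnet files):

```python
import os
import sys
import tempfile

# Write two modules that import from each other.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "mod_a.py"), "w") as f:
    f.write("from mod_b import g\ndef f():\n    return g()\n")
with open(os.path.join(tmp, "mod_b.py"), "w") as f:
    f.write("from mod_a import f\ndef g():\n    return 1\n")

sys.path.insert(0, tmp)
try:
    import mod_a  # loads mod_b, which imports back into the
                  # partially initialized mod_a before f is defined
except ImportError as e:
    print("circular import:", e)
```

The fix is the same in any such case: rename the conflicting file or run from a directory where your script does not shadow the installed package.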
Hmm. I am using only the code you've written above (legit, nothing else). Is it possible that espnet-model-zoo is now integrated in espnet?
There is a circular import, as lumaku said.
This is obviously not an issue on the espnet side; the problem comes from
Oh.
I am ashamed.
Hi, can anyone tell me how one does audio-to-text alignment using Espnet2? I can see there is `asr_align.py` in Espnet and was curious whether Espnet2 provides a similar interface. Thank you.