Cannot build model from audio files with a length of 3 seconds #98

m-haecker · 2019-12-14T16:02:24Z

I'm trying to train my own model, using Google's Speech Commands dataset as the basis. In addition, I have six keywords of my own (alexa, jarvis, and computer are three of them) whose recordings are longer than 1 second. I therefore padded all WAVs to a length of 3 seconds (many now have silence at the end). Then I run:

python -m utils.train --wanted_words alexa jarvis computer down left right learn dog sheila marvin --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --input_length 48000 --model res8 --no_cuda true --pos_key_size 1000 --data_folder ./speech_commands_v0.02/ --output_file ./speech_commands_v0.02/model.pt
(input_length is set to 48000 because the clips are 3 seconds long at 16 kHz.)

However, this leads to the following error:

File "workspace/voice/honk/utils/model.py", line 258, in collate_fn
audio_tensor = torch.from_numpy(self.audio_processor.compute_mfccs(audio_data).reshape(1, 101, 40))
ValueError: cannot reshape array of size 12040 into shape (1,101,40)

I don't understand the message or how to fix it.
When I add the parameter "--audio_preprocess_type PCEN", I am able to train the model. From it I can also generate the weights file and use it in Honkling, but recognition doesn't work at all: it constantly recognizes "computer" and nothing else, even when that keyword is not spoken or nothing is spoken at all.

What can I do to make it work?

daemon · 2019-12-14T23:26:17Z

Ah, that's a bug. Can you try replacing that line with
audio_tensor = torch.from_numpy(self.audio_processor.compute_mfccs(audio_data).reshape(1, -1, 40))?
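
To see why the hard-coded 101 fails: the size in the error, 12040, is 301 × 40, which is the frame count one would expect for 3 seconds of audio. The sketch below reproduces that arithmetic with plain NumPy. It assumes 16 kHz audio and a 10 ms hop (160 samples) with one extra frame from centering; `n_frames` and `to_model_input` are hypothetical helper names for illustration, not functions from Honk.

```python
import numpy as np

HOP_SAMPLES = 160   # assumption: 10 ms hop at 16 kHz
N_FEATURES = 40     # taken from the reshape call in the traceback


def n_frames(n_samples, hop=HOP_SAMPLES):
    # Frame count for a centered STFT: one frame per hop, plus one.
    return 1 + n_samples // hop


def to_model_input(features):
    # -1 lets NumPy infer the frame axis instead of hard-coding 101.
    return features.reshape(1, -1, N_FEATURES)


# Flat feature arrays standing in for the MFCC output (hypothetical data).
one_sec = np.zeros(n_frames(16000) * N_FEATURES, dtype=np.float32)
three_sec = np.zeros(n_frames(48000) * N_FEATURES, dtype=np.float32)

print(three_sec.size)                    # 12040
print(to_model_input(one_sec).shape)     # (1, 101, 40)
print(to_model_input(three_sec).shape)   # (1, 301, 40)
```

With `-1`, the same line handles both 1-second and 3-second input, provided the model itself accepts a longer time axis.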

m-haecker · 2019-12-15T12:49:14Z

Yes, thank you very much, the change allowed me to train the model 👍🏻
The final test accuracy was 0.921, which doesn't sound too bad (I used only 10 epochs).

So I created the JS file with the weights from this model and loaded it into Honkling. But in Honkling the recognition is very poor, and almost all keywords are recognized incorrectly.

Maybe this is because I padded all WAVs to 3 seconds, so every WAV from Google's Speech Commands dataset now has 2 seconds of silence at the end? But the keywords I added are 1.5 to 3 seconds long, and I didn't know of a better approach.

Or do I have to adjust the code in Honk or Honkling to handle WAVs longer than 1 second?
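
For reference, the end-padding described above can be sketched as follows. This is only an illustration of the idea, assuming mono clips already loaded as 1-D NumPy arrays at 16 kHz; `pad_or_trim` is a hypothetical helper and not part of Honk or Honkling.

```python
import numpy as np


def pad_or_trim(samples, target_len=48000):
    """Zero-pad at the end (or cut) so the clip has exactly target_len samples."""
    samples = np.asarray(samples)
    if samples.shape[0] >= target_len:
        return samples[:target_len]
    # Append silence (zeros) after the existing audio.
    return np.pad(samples, (0, target_len - samples.shape[0]), mode="constant")


clip = np.ones(16000, dtype=np.float32)   # stand-in for a 1-second recording
padded = pad_or_trim(clip)

print(padded.shape)                  # (48000,)
print(float(padded[16000:].sum()))   # 0.0
```

Padded this way, every 1-second clip ends in 2 seconds of silence, which is exactly the situation the question above is about.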

daemon · 2019-12-15T15:49:54Z

@ljj7975 can best answer any Honkling questions. My guess is that Honkling also needs to be modified to support three-second audio.

Alternatively, if you'd like to support variable-length audio, you can use either a CTC-based decoder or a streaming seq2seq model.

ljj7975 · 2019-12-15T16:33:22Z

Though I tried my best to make Honkling configurable,
I have never tried supporting audio longer than 1 second.

Can you try updating these?
https://github.com/castorini/honkling/blob/master/common/config.js#L9
https://github.com/castorini/honkling/blob/master/common/config.js#L186
https://github.com/castorini/honkling/blob/master/common/offlineAudioProcessor.js#L12

m-haecker · 2019-12-17T09:12:54Z

I haven't managed it yet, but now I know where to start and hope to get it working.
Thank you very much!

ljj7975 · 2019-12-17T14:34:51Z

That is great to hear.
Please let us know if you get it working so we can add instructions to our page.
Thanks!

m-haecker closed this as completed Dec 17, 2019
