
Issue with PyAudio / SpeechRecognition Integration #67

Closed
jhoelzl opened this issue Sep 14, 2016 · 11 comments

jhoelzl commented Sep 14, 2016

Hello,

I am using Python 2.7.6 on Ubuntu 14.04 LTS.

I tried to integrate the aubio pitch detection into the SpeechRecognition module: jhoelzl/speech_recognition@355a952

However, the call pitch = pitch_o(signal)[0] (jhoelzl/speech_recognition@355a952#diff-873076ce119583cd8f8e749e2465a287R484) always returns

('Unexpected error:', <type 'exceptions.UnboundLocalError'>).

Does anybody have an idea or suggestion?

Thanks for your support.

Regards,
Josef


piem commented Sep 14, 2016

Hi @jhoelzl ,

It seems you are overwriting aubio.pitch in pitch = pitch_o(signal)[0]. The following patch seems to help: t.diff.gz.
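For readers hitting the same error: the failure can be reproduced without aubio at all. In Python, a name that is assigned anywhere in a function body is treated as local for the entire function, so reading it before the assignment raises UnboundLocalError. A minimal sketch (the names here are stand-ins, not the issue's actual code):

```python
pitch = "stand-in for the name imported from aubio"

def listen():
    try:
        detector = pitch  # UnboundLocalError: 'pitch' is local here ...
    except UnboundLocalError as exc:
        return str(exc)
    pitch = detector      # ... because of this later assignment
    return pitch

message = listen()  # returns the UnboundLocalError message
```

Renaming the local result variable so it no longer shadows the imported name resolves this kind of error.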

SpeechRecognition looks great. How are you planning to use aubio in it? I would be interested to know more!

Best, Paul


jhoelzl commented Sep 14, 2016

Thank you very much @piem, now it works!

I have been using the SpeechRecognition module for several months and it is indeed a handy tool.

However, the voice activity detector in the SpeechRecognition module needs improvement. Currently, only the overall energy level of the frame is used as a measure (which is not very robust), so I have integrated the WebRTC VAD, which measures the energy levels in the noise and speech bands.

In addition, I also want to measure MFCCs (added from python_speech_features) and pitch (added from your aubio module).

@jhoelzl jhoelzl closed this as completed Sep 14, 2016

piem commented Sep 14, 2016

OK, sounds great! Please make sure you check out the mfcc in aubio, and let me know how it is going.

For best results, a simple machine learning algorithm could be trained to discriminate speech / non-speech segments using (some of) these features (energy of each bands, mfcc, pitch, ...).
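To illustrate the idea (not the thread's actual code), here is a toy numpy sketch of such a discriminator: a nearest-centroid classifier over per-frame feature vectors. The feature layout and values are invented for the example:

```python
import numpy as np

# Hypothetical per-frame features: [band energy, first MFCC, pitch confidence].
# In practice these would come from the VAD, python_speech_features and aubio.
speech = np.array([[0.8, 12.0, 0.9], [0.7, 10.0, 0.8]])
noise = np.array([[0.1, 2.0, 0.1], [0.2, 3.0, 0.0]])

# One centroid per class, estimated from labelled training frames.
centroids = np.stack([noise.mean(axis=0), speech.mean(axis=0)])

def classify(frame_features):
    # 0 = non-speech, 1 = speech: pick the nearest class centroid.
    distances = np.linalg.norm(centroids - frame_features, axis=1)
    return int(np.argmin(distances))
```

A real system would use many labelled frames and a proper classifier, but the structure (features in, speech/non-speech decision out) is the same.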


jhoelzl commented Sep 14, 2016

Okay, I have also added the calculation of the MFCCs from your module (jhoelzl/speech_recognition@59a87bf), but I get this error:

('Unexpected error:', <type 'exceptions.ValueError'>)

when performing spec = p(signal).

@jhoelzl jhoelzl reopened this Sep 14, 2016

piem commented Sep 14, 2016

Strange. When trying your latest git, I get this instead:

[...]
Say something!
Traceback (most recent call last):
  File "examples/microphone_recognition.py", line 11, in <module>
    audio = r.listen(source)
  File "/home/piem/projects/aubio/contrib/speech_recognition/speech_recognition/__init__.py", line 490, in listen
    spec = p(signal)
ValueError: input fvec has length 1024, but pvoc expects length 128


jhoelzl commented Sep 14, 2016

Hi, yes, it is working now when I disable the WebRTC VAD module. The problem is that the WebRTC VAD requires a frame size of 10, 20, or 30 ms. I have 16 kHz audio, so I have to set the variable source.CHUNK = 480 in my application to get a 30 ms frame.

Then I have problems with the fft_size of the MFCC, because it has to be a power of two.
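The arithmetic behind those numbers, as a quick sketch (values taken from this thread):

```python
SAMPLE_RATE = 16000  # Hz, as used above

def vad_chunk(frame_ms):
    # Samples per WebRTC VAD frame of the given duration.
    return SAMPLE_RATE * frame_ms // 1000

# The VAD accepts only 10, 20 or 30 ms frames: 160, 320 or 480 samples
# at 16 kHz. None of these is a power of two, hence the clash with the
# FFT size expected by the MFCC analysis.
```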


jhoelzl commented Sep 14, 2016

Okay, when I set m_hop_s = 480 (instead of m_win_s // 4), it works.
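In other words, a sketch of the constraint (assuming the variable names from the linked commit): the FFT window stays a power of two while the hop matches the 30 ms VAD chunk.

```python
m_win_s = 512  # FFT window size: must be a power of two for aubio's default FFT
m_hop_s = 480  # hop size: exactly one 30 ms WebRTC VAD chunk at 16 kHz

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

# Only the window is constrained to a power of two; the hop can follow
# the VAD chunk size, so each analysis step receives one 480-sample frame.
```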


piem commented Sep 14, 2016

Good! Yes, as long as the window size is a power of 2, it should work.

Note you could recompile aubio to use fftw3, but I'd recommend sticking with power-of-2 lengths for speed.

@jhoelzl jhoelzl closed this as completed Sep 14, 2016

jhoelzl commented Sep 19, 2016

I also added the zero-crossing rate (ZCR):
jhoelzl/speech_recognition@03185a0
jhoelzl/speech_recognition@908ebb9

However, for the ZCR, I always get values smaller than 0.2.
I thought this should be a positive integer value, since it is defined as the number of times in a sound sample where the amplitude of the sound wave changes sign.

@jhoelzl jhoelzl reopened this Sep 19, 2016

piem commented Sep 19, 2016

Hi @jhoelzl,

ZCR is a rate, so you need to divide the number of sign changes by the total number of samples. Here is the doc in aubio and the actual code.

Note: the definition on Wikipedia has a -1 normalisation offset and points to a different implementation.

hope this helps,
best, Paul
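The normalisation Paul describes can be sketched in a few lines of numpy (a stand-in illustration, not aubio's actual implementation):

```python
import numpy as np

def zero_crossing_rate(frame):
    # Count sign changes between consecutive samples, then divide by the
    # frame length: the result is a rate in [0, 1), not a raw count,
    # which is why values below 0.2 are perfectly normal.
    signs = np.signbit(frame)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings / float(len(frame))
```

For example, a frame that alternates sign every sample, [1, -1, 1, -1], has 3 crossings over 4 samples, a rate of 0.75.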


jhoelzl commented Sep 19, 2016

@piem thanks, now I understand!

@jhoelzl jhoelzl closed this as completed Sep 19, 2016