Input dataset #26
Comments
@ku2482 Thanks for your question. Regarding your questions:
@astorfi Thanks for answering. I think every 0.81-second audio file results in an (80, 40) feature, and you concatenate 20 features to make a (20, 80, 40) feature for the development phase, is that right? Anyway, I appreciate your work and kindness.
@ku2482 Yes, that's correct. For the second part, (20, 80, 40) features are fed to the network. "20" is the number of spoken utterances for the speaker. However, there is no restriction on the number of (20, 80, 40) features for any speaker. The rule of thumb is "the more, the better" for background model generation. You can pick 20 spoken utterances at random for data augmentation (although they all need to belong to the same speaker).
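Under the assumption discussed above (each 0.81 s utterance yields an (80, 40) log-mel-energy feature), the stacking step can be sketched with NumPy. The shapes and random data here are illustrative, not taken from the repository's code:

```python
import numpy as np

# Hypothetical stand-in for per-utterance features: 80 frames
# (0.81 s at a 10 ms frame stride) x 40 log-mel filterbank energies.
rng = np.random.default_rng(0)
utterance_features = [rng.standard_normal((80, 40)) for _ in range(20)]

# Stack 20 utterances from the same speaker into one (20, 80, 40) input cube.
speaker_cube = np.stack(utterance_features, axis=0)
print(speaker_cube.shape)  # (20, 80, 40)
```

Any number of such (20, 80, 40) cubes can be drawn per speaker by re-sampling which 20 utterances go into each stack.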
@astorfi Thank you so much!! All my questions are now resolved and I understand your script. I'll close this issue. Again, thank you!!
Hi @astorfi
I have some questions about the input dataset.
According to the paper, the number of speakers in the development phase is 511, but how long is the input audio file per speaker?
Also, although input_feature.py includes a CMVN preprocessing function, I'm not sure whether CMVN is appropriate for the output of the speechpy.feature.lmfe function.
Did you use CMVN preprocessing in the experiments reported in the paper?
Thank you for your work!!
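For reference, CMVN itself is just per-coefficient normalization over the time axis, so it can be applied to any (frames, coefficients) matrix, including lmfe output. A minimal NumPy sketch of the idea (the same operation speechpy.processing.cmvn performs, not the repository's exact code):

```python
import numpy as np

def cmvn(features, variance_normalization=True, eps=1e-8):
    """Cepstral mean (and variance) normalization over the time axis.

    features: (num_frames, num_coeffs) array, e.g. an (80, 40)
    log-mel-energy matrix for one 0.81 s utterance.
    """
    normalized = features - features.mean(axis=0, keepdims=True)
    if variance_normalization:
        normalized /= features.std(axis=0, keepdims=True) + eps
    return normalized

# Illustrative input standing in for a log-mel-energy matrix.
rng = np.random.default_rng(1)
feat = 5.0 + 2.0 * rng.standard_normal((80, 40))
out = cmvn(feat)
print(np.allclose(out.mean(axis=0), 0.0, atol=1e-6))  # True
```

After normalization each of the 40 coefficient tracks has zero mean (and, with variance normalization, unit variance) across the 80 frames.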