
Input dataset #26

Closed

toshikwa opened this issue Jun 16, 2018 · 4 comments

toshikwa commented Jun 16, 2018

Hi @astorfi,
I have some questions about the input dataset.

  1. According to the paper, the number of speakers in the development phase is 511. But how long is the input audio file per speaker?

  2. input_feature.py includes a CMVN preprocessing function, but I'm not sure whether CMVN is appropriate for the output of the speechpy.feature.lmfe function. Did you use CMVN preprocessing in the paper's experiments?

Thank you for your work!!

astorfi (Owner) commented Jun 18, 2018

@ku2482 Thanks for your question.

Regarding your questions:

  1. 0.8 seconds, as mentioned in the paper.

  2. No; no CMVN was used for the paper. CMVN is just an available feature in the SpeechPy library (a sketch of the pipeline follows below).
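For reference, a minimal sketch of the feature-extraction step being discussed. The frame length (25 ms), stride (10 ms), and FFT size are assumptions not stated in this thread, and the file path is hypothetical; with these settings, a 0.8-second clip yields roughly 80 frames of 40 log mel filterbank energies, i.e. a feature map close to (80, 40):

```python
import scipy.io.wavfile as wav
import speechpy

# Hypothetical 0.8-second utterance; the path is for illustration only.
fs, signal = wav.read('utterance.wav')

# Log mel filterbank energies: with a 25 ms frame and 10 ms stride
# (assumed values), ~0.8 s of audio gives roughly 80 frames x 40 filters.
logenergy = speechpy.feature.lmfe(signal, sampling_frequency=fs,
                                  frame_length=0.025, frame_stride=0.01,
                                  num_filters=40, fft_length=512)

# Optional CMVN (per-coefficient mean/variance normalization). Available in
# SpeechPy but, per the answer above, not used for the paper's experiments.
logenergy_cmvn = speechpy.processing.cmvn(logenergy,
                                          variance_normalization=True)
```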

toshikwa (Author) commented

@astorfi Thanks for answering.

I think every 0.81-second audio file results in an (80, 40) feature, and you stack 20 of these to make a (20, 80, 40) feature for the development phase. Is that right?
I don't know how many (20, 80, 40) features per speaker you used in the paper.
Did you use just one (20, 80, 40) feature per speaker, making the dataset shape (511, 20, 80, 40)?

Anyway, I appreciate your work and kindness.

astorfi (Owner) commented Jun 19, 2018

@ku2482 Yes, that's quite correct.

For the second part, (20, 80, 40) features are fed to the network; "20" is the number of spoken utterances for the speaker. However, there is no restriction on the number of (20, 80, 40) features for any speaker; the rule of thumb is "more is better" for background model generation. You can pick the 20 spoken utterances at random for data augmentation (although all of them need to belong to the same speaker), as sketched below.
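A minimal sketch of that random-sampling scheme; the function name and data layout are assumptions for illustration, not taken from the repository. Given a list of per-utterance (80, 40) features for one speaker, draw 20 at random and stack them into a single (20, 80, 40) input cube:

```python
import numpy as np

def sample_input_cube(utterances, num_utterances=20, rng=None):
    """Stack randomly chosen (80, 40) utterance features from one speaker
    into a single (num_utterances, 80, 40) cube. Hypothetical helper."""
    rng = rng or np.random.default_rng()
    # Sample with replacement only if the speaker has fewer than 20 utterances.
    idx = rng.choice(len(utterances), size=num_utterances,
                     replace=len(utterances) < num_utterances)
    return np.stack([utterances[i] for i in idx], axis=0)

# Example: 30 utterances for one speaker -> a fresh (20, 80, 40) cube per
# draw, so one speaker can contribute many cubes for data augmentation.
features = [np.random.randn(80, 40) for _ in range(30)]
cube = sample_input_cube(features)
assert cube.shape == (20, 80, 40)
```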

toshikwa (Author) commented

@astorfi Thank you so much!!

All my questions are resolved, and now I understand your script.
Your work is really great!!

I'm closing this issue, and again, thank you!!
