Reproducing the results #41
Comments
Hi isn0gud, Hope it helps!
Hmm, weird. I did my own feature extraction and removed all silences longer than 100 ms. But thanks for your input! Gonna investigate :)
Yeah, the script is something crazy... and beware, some of the packages are sometimes unavailable. Just to be clear, you need to remove the silence at the beginning of the sentences. There is a lot of it! Also, a number of files are filled entirely with silence. I would cut tighter than 100 ms, but that is just a feeling... A good way to check whether what you are doing makes sense is to compare the size of your features with those from FAIR.
Did you rewrite the feature extraction?
No, I cleaned it up a little and made it independent from the web (I downloaded all packages to my computer).
Would you mind sharing yours? 😍 The silence removal using
I'm sorry, I can't, because I wrote this code for my company. But silence removal is pretty straightforward:
Remember that you do not want to remove the silence inside the sentences, because it is important for the model...
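Since the original script cannot be shared, here is a minimal sketch of the edge-trimming idea described above: per-frame energy decides what is silence, and only leading/trailing silence is cut, so pauses inside the sentence survive. The `threshold` and `frame_ms` values are illustrative assumptions, not the values used in the original code.

```python
import numpy as np

def trim_edge_silence(y, sr, threshold=0.01, frame_ms=10):
    """Drop leading/trailing silence but keep pauses inside the utterance."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(y) // frame_len
    frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))   # per-frame energy
    voiced = np.flatnonzero(rms > threshold)    # frames above the threshold
    if voiced.size == 0:
        return y[:0]                            # file is pure silence
    return y[voiced[0] * frame_len : (voiced[-1] + 1) * frame_len]

# Synthetic check: half a second of silence around a 1 s, 440 Hz tone
sr = 16000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
y = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])
trimmed = trim_edge_silence(y, sr)  # only the edge silence is removed
```

Because the start/end indices come from the first and last voiced frames, any silent stretch between two voiced regions is kept untouched.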
That was my intuition too, but the provided extracted features have some silences removed in the middle of the sentences. Since this is done by phone alignment, the important silences are probably kept. So just removing everything below Anyway, I am extracting the features with the script now. It just takes forever.
Hi friesch, you can try librosa's trim method: https://librosa.github.io/librosa/generated/librosa.effects.trim.html#librosa-effects-trim. It does exactly what you want; I found top_db=15 to work well enough.
Hi, thanks for open sourcing the code!
I am trying to reproduce your results. However, I am running into problems. I have been training:
So the problem is that only some speakers actually produce a speech signal based on the input; the majority of speakers produce only noise. Moreover, which speakers produce speech depends on the actual phoneme input. The problem seems to be that attention does not work correctly for these samples: it basically stays at the beginning of the sequence and does not advance.
Did you have a similar issue when training the model? Or might you have an idea what the problem could be?
good attention with speech output:
p226_009_11.pdf
p225_005_4.pdf
somewhat working:
p226_009_2.pdf
Most examples:
p226_009_9.pdf
p226_009_13.pdf
p226_009_1.pdf
Thanks!