Experiments / Discussion #7
Hi, thanks for the hint - I actually skimmed through the paper after I finished the implementation - there is a lot of overlap with ForwardTacotron - it could definitely be worth a try and could help for sure. As for the autoregressive ForwardTacotron, it worked, but I found that it exhibits lower mel quality (I didn't do an exhaustive test though) - the main problem was (probably) that I trained with teacher forcing and thus got a very low loss very quickly. With additional pre-nets and dropout as in the Tacotron the quality improved slightly, but was still lower than the non-autoregressive model.
Thanks, having read this paper recently https://arxiv.org/abs/1909.01145 I've come to think that autoregression hurts more than it helps ;). I think I'll try the length regulator from your repo in a Taco2 setting (as I see it, you use the Taco1 CBHG etc. layers) and see how it goes. It might also make sense to use a classical forced aligner instead of training a vanilla Taco first just for the alignments.
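For reference, the length regulator being discussed is conceptually tiny. A minimal PyTorch sketch (names are illustrative, not the repo's actual API):

```python
import torch

def length_regulate(x: torch.Tensor, durs: torch.Tensor) -> torch.Tensor:
    """Repeat each encoder step according to its integer duration.

    x:    (batch, num_symbols, channels) encoder outputs
    durs: (batch, num_symbols) integer frame counts per symbol
    """
    out = [xi.repeat_interleave(di, dim=0) for xi, di in zip(x, durs)]
    return torch.nn.utils.rnn.pad_sequence(out, batch_first=True)
```

The expanded sequence then feeds the mel decoder directly, with no attention involved.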
Hey, that paper is actually something I want to try soon for ForwardTacotron as well - although I am not sure if it would be beneficial for a non-autoregressive model. Trying Taco2 definitely makes sense; I also had some success with conv-only models similar to what they use in MelGAN, so there is probably room for improvement! I also thought about using an STT model for extracting the durations. Keep us posted if you find anything interesting!
Oh right, I actually found your repo there. I'm currently just using DFR 0.2 without MMI because of some reports there, and would also first have to adapt the code to the phone set instead of characters. It's interesting that this duration model is trained together with the rest instead of separately. I'm quite eager to get rid of attention, as it's really the #1 source of issues I encounter.
Yes, I found that the model without attention is really robust. It seems to be the general trend to get rid of it. Also worth a look: https://arxiv.org/abs/2006.04558
Oh thanks, didn't see that one yet. I got the impression that the two lines of research are either to use explicit durations (the IBM model, the new Facebook model, Tencent etc.) or to try to improve on the attention mechanism. But you really wonder how much of an attention model this actually is if you just use it to attend to a single input at a time.
Yeah, that's true. From my experience the duration models perform well enough, and they are much faster. The next thing I will try is a different approach for extracting durations from the data, probably with a simple STT model with CTC loss.
Just integrated your model into my training framework and preprocessing, and with MelGAN it already works quite well after just 20k steps. Audible but noisy, very smooth spectra. Let's see how it evolves. Also prepared the integration of Taco2 layers, but I first want a baseline. I wonder if training the duration model separately would be beneficial, but I guess it won't make a big difference. Any reason why you picked L1 loss? I mostly plan to try forced alignment next, perhaps try subphone units like in HMM systems, and a couple of smaller things similar to DurIAN, like the skip encoder and the positional index.
Cool, keep me updated - the spectra get much better until 200k steps in my experience. No special reason for L1 over L2, I would not think it makes a big difference. I am now trying to extract the durations with a simple conv-LSTM STT model. I use the output log-probabilities to align the mels to the input text with a graph search algorithm. Works pretty well, but so far I don't see it performing better than the alignments extracted from the taco.
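A minimal sketch of such a graph search, assuming a matrix of per-frame phoneme log-probabilities from the STT model (a simple monotonic dynamic program, not the repo's actual implementation):

```python
import numpy as np

def align(log_probs: np.ndarray) -> np.ndarray:
    """Align each mel frame to a phoneme index via monotonic DP.

    log_probs: (T_frames, N_phonemes), entry [t, n] = log P(phoneme n | frame t).
    Returns a length-T path of phoneme indices that visits all phonemes in
    order while maximising the summed log-probability.
    """
    T, N = log_probs.shape
    score = np.full((T, N), -np.inf)
    score[0, 0] = log_probs[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]
            move = score[t - 1, n - 1] if n > 0 else -np.inf
            score[t, n] = log_probs[t, n] + max(stay, move)
    # Backtrack from the last phoneme at the last frame.
    path = np.zeros(T, dtype=int)
    path[-1] = N - 1
    for t in range(T - 2, -1, -1):
        n = path[t + 1]
        path[t] = n if n == 0 or score[t, n] >= score[t, n - 1] else n - 1
    return path
```

Per-phoneme durations then fall out as `np.bincount(path, minlength=N)`.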
It started to converge a bit at around 100k steps (batch size 32). Stopped for now and am trying the suggestions from alexdemartos (prenet output into the duration model, duration model from FastSpeech). Implemented the positional index here: https://github.com/vocalid/tacotron2/blob/b958c7d889b7b6161f56f36b2d525650ff55df3c/model.py#L41 But that's next. I planned to use something like https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/alignment/state_align/forced_alignment.py
Yeah, if the model works well I'll surely open source it. If you're interested, check out the (researchy) branch 'aligner' and run train_aligner.py and then force_alignment.py. The HMM models seem to be standard for extracting alignments; I want something independent from third parties though. I'd be really interested in how the HMM approach works for you!
Hi, just to share with you - I did a couple of tests extracting the durations with an STT model (a simple conv-LSTM, such as a standard OCR model). I overfitted the STT model on the train set, extracted the prediction probs and used a graph-search method to align phonemes and mel steps (based on maximising the prediction prob for the current phoneme at each mel step). It works pretty well, the results are intelligible, but the prosody is slightly worse and more robotic than with the taco-extracted durations. Any luck yet for you with the forced alignment?
Hey. Haven't tried it yet. I ran lots of variations with the Taco1 vs Taco2 postnet, prenet vs no prenet. Generally I often see differences in the training loss but none in the validation loss. The biggest difference was exchanging the duration predictor with the FastSpeech-style one and putting it after the CBHG instead of before. Unfortunately I didn't test yet which of the two is the more important modification. Generally I see a lot more artefacts than with our Taco2 model. I train MelGAN by generating mel spectra using ground truth durations, and it reconstructs them very well. Once I feed mel spectra generated using predicted durations, things get ugly. Also training with positional indices at the moment, but no significant difference either. I'm also interested in trying a generative model for durations as in https://www.ibm.com/blogs/research/2019/09/tts-using-lpcnet/ Another interesting aspect: I see the training loss decreasing after 500k+ steps, but the validation loss is pretty much stable after about 50k or so. Seems a bit early for overfitting to me.
Thanks for the update. As for the duration predictor - I've had the problem of overfitting when I put it after the prenet, plus the mels looked worse. As for the stagnating validation loss - I think this is kind of normal, as the model is not teacher forced and the predicted patterns differ from the ground truth; the audio quality still improves up until 200k steps or so in my experience. I normally do not even look at the validation loss, to be honest, and judge more by the audio. Also, I have seen quite some artifacts with a pretrained LJSpeech MelGAN + forward taco, but fewer with WaveRNN. On our male custom dataset there are quite few artifacts with MelGAN - I increased MelGAN's receptive field, maybe that helps...
One more thing: when I train the MelGAN, I usually mix ground truth mels with predicted ones, as I think it makes training more stable - this could also be worth a try. If you only train on predicted spectra, the problem could be that they differ too much from the GT, as they are not really teacher forced (e.g. the pitch could be different etc.).
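The mixing itself can be as simple as a per-item coin flip when assembling vocoder batches; a sketch (assuming ground-truth and GTA mels are stored in parallel lists, paired with the same waveform targets):

```python
import random

def mixed_batch(gt_mels, gta_mels, p_gt=0.5):
    """Per item, pick the ground-truth mel with probability p_gt,
    otherwise the TTS-predicted (GTA) one. The paired waveform target
    stays the same either way, which is what keeps training stable."""
    return [gt if random.random() < p_gt else gta
            for gt, gta in zip(gt_mels, gta_mels)]
```

The ratio `p_gt` is a tunable knob; 0.5 is just a neutral starting point, not a value from the thread.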
Yeah, I've also increased the receptive field as proposed in a paper I forgot. Didn't see the huge improvement they saw, but well... Regarding mixing GT with GTA - I could have sworn I did that, but strangely I only find it in my WaveRNN codebase. I'll also try multispeaker. With vanilla taco I never could get it to learn attention well for all speakers, but the spectra were pretty good, so I guess it should work well with this model. Considering the models I trained using Merlin (mostly just 3 LSTMs on HTS labels) were very well able to produce and mix more than 1000 voices easily in combination with WORLD.
I would also assume that the model goes well with multispeaker; that's quite some work though. For this it probably makes sense to first find a quicker way of extracting durations. I am running another LJSpeech training now with MelGAN. I see improvements of audio quality up to 400k steps on the forward model if I test it with the standard pretrained MelGAN (fewer squeezy artefacts).
Seems it was an issue of patience again. The MelGAN loss is hard to interpret and just letting it run often helps. So I let it run over the weekend - the acoustic model to 500k steps, MelGAN just another day more or so - and it's definitely better now. Prosody is not yet at the level of the Taco2 model I trained, but we're getting closer. And yes, I definitely have to work on the aligner first before tackling multispeaker.
That's not too bad for MelGAN, and neither is the inference speed you get with both models. I would assume that the durations could be overfitted (did you check the duration validation loss?). I am also testing some model variations and I found that it helps to concat the LSTM output with the prenet output:
This is closer to the Taco architecture, where the attention context is also concatenated with the LSTM output.
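The concatenation amounts to a skip connection around the LSTM; a minimal sketch with illustrative dimensions (not the repo's actual layer sizes):

```python
import torch

# Illustrative: feed prenet/encoder features through a BiLSTM, then
# concatenate the LSTM output with its own input before projecting to
# mels - mirroring how Tacotron concatenates the attention context
# with the decoder LSTM output.
lstm = torch.nn.LSTM(80, 256, batch_first=True, bidirectional=True)
proj = torch.nn.Linear(2 * 256 + 80, 80)   # lstm_out + prenet_out -> mel

prenet_out = torch.randn(4, 100, 80)       # (batch, time, channels)
lstm_out, _ = lstm(prenet_out)
mel = proj(torch.cat([lstm_out, prenet_out], dim=-1))
```

The projection layer sees both the recurrent context and the raw features, so the skip path can carry fine detail the LSTM might smooth away.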
Seems like instant overfitting... |
https://drive.google.com/file/d/1S__-0_3N2swYCsWu4TciZ6O9owkxX7eK/view?usp=sharing Here is a result with the updated model trained for 320k steps, together with the pretrained MelGAN from the repo.
Very cool. That's what I expected too: the structure gets more pronounced but may vary from the ground truth (e.g. a different pitch, or the voice going a bit up instead of down) - as the model is not teacher forced. BTW, I found that with the MelGAN preprocessing it is necessary to do some cherry picking with the TTS model, but training to 400k steps definitely is worth it.
Oh, I got the alignment with HTK to work, and while it generally works fine, I'm currently getting more "raspy" voices and I'm not completely sure if it's because of the alignment. I think it might be best not to have them in the alignment and then use some skip encoder like in DurIAN, where they keep the word boundaries as separate symbols until the state expansion. Well, having diacritics as separate symbols is not really helpful either...
That's also my experience with durations from an STT model. I tried 1. generating phoneme boundaries (and word boundaries) from the output probabilities and 2. extracting the exact phoneme positions in time and splitting right between them. Both resulted in lower mel quality.
Should change the name of this issue ;) Still seeing generalization issues. Implemented multispeaker, injecting speaker codes after the CBHG, but in general it always defaults to one voice (or, I'm not fully sure, maybe an average voice), except if I pick a sentence from the training set with the respective speaker code. Strange, as it's fed directly into the LSTM. Hmm, gotta try synthesizing from pre-postnet mel spectra to see if it makes a difference. - Update: nope, sounds a bit different but has already lost the speaker information.
Renamed - good research. Did you use durations extracted by a respective tacotron trained separately on each dataset? As for the overfitting - I see the same issues. You mentioned that having a pre-net with heavy dropout did not help, did it? I conducted a lot of unsuccessful experiments, mainly trying various forms of duration prediction, e.g. a separate autoregressive duration predictor (heavily overfitted). Currently I am experimenting with GANs again and got them to work quite well, although the voice quality is not yet better than with the standard L1 loss.
Ah cool. I'm also planning on messing with a pitch predictor; the FastPitch samples seem quite convincing. Let me know how it goes!
Oh, it worked quite well without any real issues. All the more I wondered why it ignores my speaker IDs while those work nicely. Update 2: Such a simple, stupid bug, and it took me longer than that X11 forwarding syscall issue here mozilla/TTS#417 (comment) ;) 😠 Retraining now.... Update 3:
Hi there @cschaefer26, nice project! Regarding #7 (comment), while integrating fatchord's tacotron model with https://github.com/CorentinJ/Real-Time-Voice-Cloning (my work is in #472), I've also encountered the same problems you had with LibriTTS, which are mainly caused by the highly inconsistent prosody between speakers. You can get much better results by preprocessing or curating the dataset (either trimming mid-sentence pauses or discarding utterances when that occurs). VCTK works a lot better if you trim the silence at the beginning and end of each file. I can go into more detail if it is helpful. The baseline tacotron requires very clean data for multispeaker, and even then I'm having trouble producing a decent model. Which is what leads me to your repo. :) I will be trying it out. Keep up the great work! |
Hi @m-toman, I totally missed your update. Sounds really good, I assume it's MelGAN? I got some OK results using VCTK; as @blue-fish states, the datasets require some good trimming etc. I found this to be really helpful: https://github.com/resemble-ai/Resemblyzer/blob/cf57923d50c9faa7b5f7ea1740f288aa279edbd6/resemblyzer/audio.py#L57 Any updates? We are also looking into adding GST. The main problem I have right now is that I would need really clean German datasets to benefit from transfer learning for our use case. I also looked into other open source repos and tested Taco2 etc., but didn't find them to perform much better. Currently I am looking into some different preprocessings, e.g. mean-variance scaling, to improve the voice quality.
Hi. Yeah, it's interesting. As I can't really believe it, I regularly compare to Taco2 and other more complex methods out there (https://github.com/tugstugi/dl-colab-notebooks), but neither attention nor autoregression really seems to make a significant difference. Regarding styles, I would have considered something like the simple method presented in DurIAN, which is mostly just a style embedding. Also read the Flowtron paper again and thought about wrapping the whole model in such a flow formulation, but after listening to the samples again I felt it's probably not worth it vs. just playing with the pitch predictor I got (it might also be possible to predict and sample F0 from a Gaussian, where you could then play with the variance). I would have to read the GST paper again, but I felt the control is a bit too random, if I remember correctly? In the sense that the tokens are hard to interpret and probably different with each run.
Yeah, exactly, although they show some impressive results absorbing background noise into the tokens. I would think that pitch prediction is the lowest-hanging fruit of them all...
I feel we're getting to a similar state of saturation as we had before deep learning entered the speech synthesis field. The HMM-based methods became so loaded with more and more tricks and features, the complexity was insane. The training script I used during my PhD consisted of, I think, 120 separate steps in the end, each calling HTS tools with dozens of command line parameters and additional script files ;). I admit I wasn't brave enough to just try what you did and throw all that stuff out. The thousand people at Google would have certainly done that, right? :)
Good rant though! The more I test the autoregressive stuff, the less impressed I am. It's basically not usable for us in production (we are trying to synth long German politics articles). The forward models are pretty robust though. I wish I could get rid of the AR model to extract durations; we experimented with the aligner module from Google's EATS, didn't work. Extracting with an STT model worked, but the quality was worse. Today I spent the whole day debugging why my forward model all of a sudden sucked badly and found that the tacotron alignments were shifted - somehow I got unlucky; increasing the batch size solved this. When I started with TTS I wondered why people got so interested in these attention plots, now I know - watching a tacotron giving birth to attention is one of my good moments :D .. Honestly, the forward models seem to be SOTA now and are probably used in production by Microsoft, AWS, Google...
My samples above used alignment via HTS and I didn't notice a difference to the taco attention ones. Using those scripts: https://github.com/CSTR-Edinburgh/merlin/tree/master/misc/scripts/alignment/state_align Think it's mostly Google that still clings to it: https://arxiv.org/abs/1910.10288 Yeah, I was astonished to see that Springer does TTS ;). My ex-colleagues recorded this corpus: https://speech.kfs.oeaw.ac.at/mmascs/ Edit: and obviously it's Austrian German (Vorarlberg in this case ;)).
Cool, I actually just found a glitch in the duration extraction of my STT models and it seems fine now. Probably going to release that, as it's cumbersome to train a tacotron for extractions. Good stuff! I'll keep you updated on how it goes with pitch prediction, multispeaker etc.
I've struggled for a week now with suddenly getting a burp sound at the end of many sentences. Honestly, still no idea why; it seems to happen sometimes. I now force it to silence... I assumed that it usually aligns the final punctuation symbol to silence, but if the silence trimming trims too aggressively it has to align that symbol with voice. So I added a little bit of silence myself (quite common actually in older systems, to have a silence symbol at the beginning and end and prepend and append artificial silence). But that didn't help. Now I force it to silence after synthesis, but no idea where it's coming from... Anyways, did you ever get the validation loss to make sense? For me it still gradually increases, although at probably 1/10th of the rate at which the training loss decreases. Tried really small model sizes, more dropout, but still. Even the multispeaker model I'm currently training on 100k sentences does it, although admittedly less pronounced.
Did you check the padding? I had a similar problem once and found that the padding values were at zero (and not at -11.5, which is silence in the log space). The validation loss in this case does not mean anything imo, since the model has too much freedom in predicting intonation, pitch etc. without teacher forcing. I don't even look at it (for durations it still makes sense though, imo).
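The fix being suggested is just to pad batched spectrograms with the log-space silence value instead of zero; a sketch (the -11.5 constant is roughly log(1e-5) and depends on the mel preprocessing, as the next comments confirm):

```python
import torch

LOG_SILENCE = -11.5  # silence in natural-log mel space; dataset-dependent

def pad_mels(mels):
    """Pad a list of variable-length (time, n_mels) spectrograms into one
    batch tensor, filling with log-space silence rather than zero so the
    padding doesn't look like loud energy to the model."""
    return torch.nn.utils.rnn.pad_sequence(
        mels, batch_first=True, padding_value=LOG_SILENCE)
```

Zero is a large value in log-mel space (far above silence), which is why zero padding can leak audible artifacts through the recurrent layers.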
Hmm, good idea, thanks. (See ForwardTacotron/utils/dataset.py, line 207 at d5c5d88.)
But I have to check how my mel representation differs from yours. The loss masking should work now, so it should mostly be about the context from the LSTMs and convolutions "leaking" into the actual speech.
OK, checked it - for me silence is at -4, and I modified the padding accordingly.
Any improvements with the padding? I agree that the glitch can't come from the loss directly, due to masking, but I found something else - in my case the length regulator is slightly overexpanding, i.e. attaching some extra repeats of the last input vectors due to the padding within batches during training. I added this to set the repeated inputs to zero: (ForwardTacotron/models/forward_tacotron.py, line 171 at 611dd81)
Imo this could be one of the causes of a 'leak' in the RNNs.
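The masking fix being described can be sketched as zeroing every expanded frame past each item's true mel length (illustrative, not the repo's exact code):

```python
import torch

def mask_overexpansion(expanded: torch.Tensor, mel_lens: torch.Tensor) -> torch.Tensor:
    """Zero out frames past each item's true mel length, so extra repeats
    produced by padded durations can't leak into the recurrent layers.

    expanded: (batch, max_time, channels) length-regulated features
    mel_lens: (batch,) true mel lengths per item
    """
    t = expanded.size(1)
    mask = torch.arange(t, device=expanded.device)[None, :] < mel_lens[:, None]
    return expanded * mask.unsqueeze(-1)
```

Unlike loss masking, this stops the junk frames *before* they enter the LSTM, which is exactly the 'leak' path discussed above.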
Oh, btw, I am also now adding a pitch module and the first results seem very promising. I might add an 'energy' vector as in FastSpeech 2 as well, although in their ablation study the gain was pretty small. I thought: could one, instead of calculating F0, just use the mean of the frequency distribution along the mel axis? Imo this should be pretty similar and wouldn't require an external lib to do the calculation.
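That mean-of-the-frequency-distribution idea is essentially a per-frame spectral centroid over the mel bins; a sketch, assuming log-mel input (note this tracks overall spectral balance, not true F0, so it is only a rough proxy):

```python
import numpy as np

def mel_centroid(mel: np.ndarray) -> np.ndarray:
    """Per-frame weighted mean of the mel-bin index, using the linear
    magnitudes as weights - an external-lib-free stand-in for F0.

    mel: (n_mels, T) log-mel spectrogram. Returns (T,) centroid per frame.
    """
    mags = np.exp(mel)                       # back to linear magnitudes
    bins = np.arange(mel.shape[0])[:, None]  # mel-bin indices as a column
    return (mags * bins).sum(axis=0) / mags.sum(axis=0)
```

Harmonics above F0 pull the centroid upward, so it will sit above the true pitch; it is usable as a relative contour, not as Hz values.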
I'll check the LR thing above. Not sure now if I used your solution or wrote something myself, because I added the positional index (didn't see a huge difference though, if at all). I'll post a sample when I'm back home. For pitch I used the approach from FastPitch (the repo is out there), which works fine with the proposed mean pitch per symbol, but I am thinking about a more complex parameterization that also allows controlling some delta features (perhaps just categories of falling/rising/flat pitch or so).
Until now it seems to be better - https://drive.google.com/file/d/1LkusT0VO8cKw3nI5jJ1GBmDLCu4jP_vv/view?usp=sharing |
That's only 7k steps? Impressive. I found no real improvements with a pitch module, although it's fun to play around with the pitch. May be a limitation of our dataset though (8 hrs only). I also tested the Nvidia FastPitch implementation; not better. I thought about looking more into the data, to be honest, e.g. cleaning inconsistent pronunciations with an STT model.
Yeah, I didn't see an improvement either, but we need it for implementing SSML tags, and it definitely works better than synthesis-resynthesis methods, which introduce too much noise. And yeah, probably one can think about some generative model to produce/sample interesting pitch contours. My results were a bit strange: with both padding modifications above I started to get weird pauses/prosody. Trained 3 times to verify it's not some random effect, and at different stages of training. Integrating DiffWave might be interesting as well.
No improvements as in with the pitch? Generally I have the same problems as you: it seems that trainings can vary by large degrees; probably there is some randomness in what the model really fits...
Yeah, it sounds pretty much the same with and without the pitch model. Will try multispeaker again soon, but it seemed to me as if it averages a bit too much.
Sounds good. I am regularly comparing the model to other architectures and I find that the LSTM produces a bit more fluent output, but tends more towards mumbling compared to a transformer-based model à la FastSpeech. Multispeaker didn't really add a benefit so far, but that could be due to a lack of data.
Just a quick update: I merged all the pitch stuff to master; I really found a benefit from the pitch conditioning. I see the same as you: after 10-20k steps the model is almost done. Quick question: did you see any improvement with positional indexing? I found some generalization problems on smaller datasets, where the voice mumbles quite a bit, especially for shorter sentences, weirdly. Also, I tried adding an STT model to the training with a CTC loss, hoping that the model is forced to be clearer; the first results seem quite promising actually.
No new experiments from my side atm.
I've checked out the samples of Glow-TTS in Mozilla TTS, but they did not really seem convincing.
Intuitively I wouldn't expect a world of difference from positional indexing with LSTMs though. As for prosody, did you try to use a smaller, separate duration predictor as I do? I found that the model is hugely overfitting otherwise (e.g. when duration prediction is done after the encoder). Also for prosody I have an idea I want to try out soon - similar to the pitch frequency, I want to leak some duration stats from the target, e.g. some running mean of durations to condition the duration predictor on. My hope is that the model would pick up some rhythm / prosody swings in longer sentences (similar to the pitch swings).
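The proposed conditioning feature - a running mean of per-symbol durations - could be computed like this (a sketch of the untried idea, with an arbitrary window size):

```python
import torch

def running_mean_durs(durs: torch.Tensor, win: int = 5) -> torch.Tensor:
    """Running mean of per-symbol durations (in frames) over a small
    window, as a possible extra conditioning feature for the duration
    predictor. durs: (batch, num_symbols). Window size is illustrative."""
    pad = win // 2
    kernel = torch.ones(1, 1, win) / win
    x = torch.nn.functional.pad(
        durs[:, None, :].float(), (pad, pad), mode='replicate')
    return torch.nn.functional.conv1d(x, kernel).squeeze(1)
```

At training time this would come from the target durations (teacher-leaked), while at inference it would have to be predicted or set to a speaker-level average.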
I tried running the duration predictor before the CBHG once, but the results were a bit strange. Will try again. Also wondered if it's really a good idea to train it together with the rest, or whether it should rather be separate (or at least stop training it at some earlier point). So you already added some "global" pitch stats to the model? I have to check your code.
My best results were with a mere 64-dim GRU on duration prediction with lots of dropout before it. Yes, it probably makes sense to completely separate it (to be able to compare results at different stages, at the least). Yeah, I reimplemented the FastPitch version (with minor differences) with pitch averaged over chars.
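A sketch of such a deliberately small duration predictor - heavy dropout in front of a 64-dim GRU (the input dim and dropout rate here are assumptions, not the repo's actual values):

```python
import torch
from torch import nn

class DurationPredictor(nn.Module):
    """Tiny, separate duration predictor: heavy dropout feeding a small
    bidirectional GRU, sized to resist overfitting on the durations."""

    def __init__(self, in_dim: int = 256, hidden: int = 64, p_drop: float = 0.5):
        super().__init__()
        self.dropout = nn.Dropout(p_drop)
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.lin = nn.Linear(2 * hidden, 1)

    def forward(self, x):                  # x: (batch, num_symbols, in_dim)
        out, _ = self.gru(self.dropout(x))
        return self.lin(out).squeeze(-1)   # (batch, num_symbols) durations
```

Keeping it this small (and detached from the encoder's gradients, if trained separately) is what limits its capacity to memorise training-set durations.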
I'll try the different duration model as soon as I have the capacity.
Hi,
great work!
I saw your autoregression branch and wanted to ask if it worked out?
I always wondered how big the effect of the autoregression really is (apart from the formal aspect that it then is an autoregressive, generative model P(x_i | x_{<i})), considering there are RNNs in the network anyway.
Also, wanted to point you to this paper in case you don't know it yet: https://tencent-ailab.github.io/durian/
They use, similarly to older models like those in https://github.com/CSTR-Edinburgh/merlin, an additional value in the expanded vectors to indicate the position within the current input symbol. I wonder if that would help a bit with prosody.
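Concretely, the DurIAN-style positional index appends a frame-position feature during expansion; a sketch for a single utterance (a normalised position is one simple choice - DurIAN's exact encoding may differ):

```python
import torch

def expand_with_position(x: torch.Tensor, durs) -> torch.Tensor:
    """Length-regulate a single (num_symbols, channels) sequence and
    append a position-in-symbol feature: the normalised frame index
    within each expanded symbol (0 at the start, 1 at the end)."""
    frames = []
    for vec, d in zip(x, durs):
        d = int(d)
        if d == 0:
            continue  # symbols with zero duration emit no frames
        pos = torch.arange(d, dtype=x.dtype) / max(d - 1, 1)
        frames.append(torch.cat([vec.expand(d, -1), pos[:, None]], dim=-1))
    return torch.cat(frames, dim=0)
```

The decoder then knows whether a frame sits at the onset or the tail of its phoneme, instead of seeing d identical copies.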