
Experiments / Discussion #7 (closed)

m-toman opened this issue Jun 8, 2020 · 69 comments

@m-toman

m-toman commented Jun 8, 2020

Hi,

great work!
I saw your autoregression branch and wanted to ask whether it worked out.
I've always wondered how much of an effect autoregression really has (apart from the formal aspect that the network then is an autoregressive, generative model P(x_i|x_<i)), considering there are RNNs in the network anyway.

Also, wanted to point you to this paper in case you don't know it yet: https://tencent-ailab.github.io/durian/

Similarly to older models like https://github.com/CSTR-Edinburgh/merlin, they add an extra value to the expanded vectors to indicate the position within the current input symbol. I wonder if that would help a bit with prosody.

@cschaefer26

Hi, thanks for the hint - I actually skimmed through the paper after I finished the implementation; there is a lot of overlap with ForwardTacotron, so it could definitely be worth a try. As for the autoregressive ForwardTacotron, it worked, but I found that it exhibits lower mel quality (I didn't do an exhaustive test, though). The main problem was (probably) that I trained with teacher forcing and thus got a very low loss very quickly. With additional pre-nets and dropout as in Tacotron the quality improved slightly, but it was still lower than for the non-autoregressive model.


@m-toman

m-toman commented Jun 8, 2020

Thanks. Having recently read https://arxiv.org/abs/1909.01145, I've come to think that autoregression hurts more than it helps ;).
Furthermore, considering that in the Tencent paper above they found that the power of Taco does not seem to come from attention but from the pre/post-nets, it's not surprising you ended up with this model.

I think I'll try the length regulator from your repo in a Taco2 setting (as I see it, you use the Taco1 CBHG etc. layers) and see how it goes. It also might make sense to use a classical forced aligner instead of training a vanilla Taco first just for the alignments.
I'll keep you posted when I find something interesting ;)

@cschaefer26

Hey, that paper is actually something I want to try soon for ForwardTacotron as well - although I am not sure it would be beneficial for a non-autoregressive model. Trying Taco2 definitely makes sense; I also had some success with conv-only models similar to what they use in MelGAN, so there is probably room for improvement! I also thought about using an STT model for extracting the durations. Keep us posted if you find anything interesting!

@m-toman

m-toman commented Jun 9, 2020

Oh right, I actually found your repo there:
NVIDIA/tacotron2#280

I'm currently just using DFR 0.2 without MMI because of some reports there, and I would also first have to adapt the code to the phone set instead of characters.
But this should be obsolete with an explicit duration model.

It's interesting that this duration model is trained together with the rest instead of separately.

I'm quite eager to get rid of attention as it's really the #1 source of issues I encounter.

@cschaefer26

Yes, I found that the model without attention is really robust. Getting rid of it seems to be the general trend. Also worth a look: https://arxiv.org/abs/2006.04558

@m-toman

m-toman commented Jun 9, 2020

Oh thanks, didn't see that one yet.

I got the impression that the two lines of research are either to use explicit durations (the IBM model, the new Facebook model, Tencent, etc.) or to improve on the attention mechanism, e.g. monotonic attention or https://google.github.io/tacotron/publications/location_relative_attention/index.html

But you really have to wonder how much of an attention model that still is if it only ever attends to a single input at a time.

@cschaefer26

Yeah, that's true. In my experience the duration models perform well enough, and they are much faster. The next thing I will try is a different approach for extracting durations from the data, probably a simple STT model with CTC loss.

@m-toman

m-toman commented Jun 18, 2020

Just integrated your model into my training framework and preprocessing, together with MelGAN, and it already works quite well after just 20k steps. Audible but noisy, very smooth spectra. Let's see how it evolves.

I also prepared the integration of the Taco2 layers, but first I want a baseline.

I wonder if training the duration model separately would be beneficial, but I guess it won't make a big difference.

Any reason why you picked L1 loss?

I mostly plan to try forced alignment next, perhaps subphone units like in HMM systems, and a couple of smaller things similar to DurIAN, like the skip encoder and the positional index.

@cschaefer26

cschaefer26 commented Jun 19, 2020

Cool, keep me updated - in my experience the spectra keep getting much better up to 200k steps. No special reason for L1 over L2; I don't think it makes a big difference. I am now trying to extract the durations with a simple conv-lstm STT model. I use the output log-probabilities to align the mels to the input text with a graph search algorithm. It works pretty well, but so far I don't see it performing better than the alignments extracted from the taco.
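
(For reference, swapping L1 for L2 on the mel reconstruction is a one-line change; below is a minimal sketch of a masked mel loss, purely illustrative - the tensor layout and masking are my assumptions, not the repo's actual loss code.)

    import torch

    def masked_mel_loss(pred, target, mel_lens, use_l1=True):
        # pred/target: (batch, frames, n_mels); mel_lens: (batch,) valid frame counts
        mask = torch.arange(pred.size(1), device=pred.device)[None, :] < mel_lens[:, None]
        mask = mask.unsqueeze(-1).to(pred.dtype)          # (B, T, 1), 1 for real frames
        diff = (pred - target).abs() if use_l1 else (pred - target) ** 2
        # average only over non-padded frames
        return (diff * mask).sum() / (mask.sum() * pred.size(-1))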

@m-toman

m-toman commented Jun 19, 2020

It started to converge a bit at around 100k steps (batch size 32). I stopped for now and am trying the suggestions from alexdemartos (prenet output into the duration model, duration model from FastSpeech).

Implemented the positional index here: https://github.com/vocalid/tacotron2/blob/b958c7d889b7b6161f56f36b2d525650ff55df3c/model.py#L41
But I have to see how to improve that. The torch.gather solution trains at 0.4 s/iteration while this one takes about 4 s (actually 40 s if you don't move 'expanded' to the GPU first, as in the link).

But that's next.
Do you plan to release the alignment model?

I planned to use something like https://github.com/CSTR-Edinburgh/merlin/blob/master/misc/scripts/alignment/state_align/forced_alignment.py
which worked quite well on small datasets in my experience.

@cschaefer26

cschaefer26 commented Jun 19, 2020

Yeah, if the model works well I'm surely going to open source it. If you're interested, check out the (research-y) branch 'aligner' and run train_aligner.py and then force_alignment.py. The HMM models seem to be the standard for extracting alignments; I want something independent from third parties though. I'd be really interested in how the HMM approach works for you!

@cschaefer26

Hi, just to share with you - I did a couple of tests extracting the durations with an STT model (a simple conv-lstm, much like a standard OCR model). I overfitted the STT model on the train set, extracted the prediction probabilities, and used a graph-search method to align phonemes and mel steps (based on maximising the prediction probability of the current phoneme at each mel step). It works pretty well and the results are intelligible, but the prosody is slightly worse and more robotic than with the taco-extracted durations. Any luck yet on your side with the forced alignment?
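
As an illustration of the idea (maximising the predicted probability of the current phoneme at each mel step), here is a rough, unvectorised Viterbi-style sketch; it is not the repo's force_alignment.py, and the shapes and variable names are assumptions.

    import numpy as np

    def align_durations(log_probs, phoneme_ids):
        # log_probs: (T, vocab) frame-wise log-probabilities from the STT model
        # phoneme_ids: (S,) target phoneme indices for the utterance, S <= T
        T, S = log_probs.shape[0], len(phoneme_ids)
        emit = log_probs[:, phoneme_ids]                  # score of phoneme s at frame t
        score = np.full((T, S), -np.inf)
        score[0, 0] = emit[0, 0]
        for t in range(1, T):                             # monotonic: stay or advance by one
            for s in range(S):
                stay = score[t - 1, s]
                advance = score[t - 1, s - 1] if s > 0 else -np.inf
                score[t, s] = emit[t, s] + max(stay, advance)
        # backtrack: count how many frames were assigned to each phoneme
        durations = np.zeros(S, dtype=np.int64)
        s = S - 1
        for t in range(T - 1, -1, -1):
            durations[s] += 1
            if t > 0 and s > 0 and score[t - 1, s - 1] >= score[t - 1, s]:
                s -= 1
        return durations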

@m-toman

m-toman commented Jun 28, 2020

Hey, haven't tried it yet. I ran lots of variations: Taco1 vs. Taco2 postnet, prenet vs. no prenet.
I found that the prenet before the CBHG didn't really make a difference, and neither did the postnet choice.

Generally I often see differences in the training loss but none in the validation loss.

The biggest difference came from exchanging the duration predictor for the FastSpeech-style one and putting it after the CBHG instead of before it. Unfortunately I haven't yet tested which of the two is the more important modification.

Generally I see a lot more artefacts than with our Taco2 model. I train MelGAN on mel spectra generated with ground-truth durations and it reconstructs them very well. Once I feed it mel spectra generated with predicted durations, things get ugly.

Also training with positional indices at the moment, but no significant difference there either.

I'm also interested in trying a generative model for durations as in https://www.ibm.com/blogs/research/2019/09/tts-using-lpcnet/

Another interesting aspect: I see the training loss still decreasing after 500k+ steps, but the validation loss is pretty much stable after about 50k or so. Seems a bit early for overfitting to me.

@cschaefer26

Thanks for the update. As for the duration predictor - I had the problem of overfitting when I put it after the prenet, plus the mels looked worse. As for the increasing validation loss - I think this is kind of normal, as the model is not teacher-forced and the predicted patterns differ from the ground truth; the audio quality still improves up until 200k steps or so in my experience. To be honest, I normally don't even look at the validation loss and judge more by the audio. Also, I have seen quite some artifacts with a pretrained LJSpeech MelGAN + forward taco, but fewer with WaveRNN. On our custom male dataset there are quite few artifacts with MelGAN - I increased MelGAN's receptive field, maybe that helps...

@cschaefer26

One more thing: when I train the MelGAN, I usually mix ground-truth mels with predicted ones, as I think it makes training more stable - this could also be worth a try. If you only train on predicted spectra, the problem could be that they differ too much from the GT since they are not really teacher-forced (e.g. the pitch could be different, etc.).
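
A minimal sketch of that mixing idea for the vocoder data loader (file layout, names and the 50/50 ratio are assumptions, not the actual MelGAN training code):

    import random
    import numpy as np
    from torch.utils.data import Dataset

    class MixedMelDataset(Dataset):
        # serves ground-truth mels part of the time and TTS-predicted (GTA) mels otherwise
        def __init__(self, gt_paths, gta_paths, wav_paths, gta_prob=0.5):
            self.gt_paths, self.gta_paths, self.wav_paths = gt_paths, gta_paths, wav_paths
            self.gta_prob = gta_prob

        def __len__(self):
            return len(self.wav_paths)

        def __getitem__(self, idx):
            # coin flip per sample: predicted or ground-truth mel, same target waveform
            use_gta = random.random() < self.gta_prob
            mel_path = self.gta_paths[idx] if use_gta else self.gt_paths[idx]
            mel = np.load(mel_path)                  # (n_mels, T)
            wav = np.load(self.wav_paths[idx])       # matching waveform
            return mel, wav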

@m-toman

m-toman commented Jun 28, 2020

Yeah, I've also increased the receptive field as proposed in a paper I forgot. I didn't see the huge improvement they saw, but well... Regarding mixing GT with GTA - I could have sworn I did that, but strangely I only find it in my WaveRNN codebase.
And yes, it seemed to make it more robust.

I'll also try multispeaker. With vanilla Taco I could never get it to learn attention well for all speakers, but the spectra were pretty good, so I guess it should work well with this model - especially considering that the models I trained using Merlin (mostly just 3 LSTMs on HTS labels) were easily able to produce and mix more than 1000 voices in combination with WORLD.

@cschaefer26

cschaefer26 commented Jun 29, 2020

I would also assume that the model works well for multispeaker; that's quite some work though. For this it probably makes sense to first find a quicker way of extracting durations. I am running another LJSpeech training now with MelGAN. I see improvements in audio quality up to 400k steps on the forward model when I test it with the standard pretrained MelGAN (fewer squeezy artefacts).

@m-toman

m-toman commented Jun 29, 2020

Seems it was an issue of patience again. The MelGAN loss is hard to interpret, and just letting it run often helps. So I let it run over the weekend - the acoustic model to 500k steps, MelGAN just another day or so more - and it's definitely better now.
https://drive.google.com/file/d/1YBsS7sxus_tw9PQdr0HVtScGo8Ccuolw/view?usp=sharing

Prosody is not yet at the level of the Taco2 model I trained, but we're getting closer.

And yes, definitely have to work on the aligner first before tackling multispeaker.

@cschaefer26

cschaefer26 commented Jun 30, 2020

That's not too bad for MelGAN and the inference speed you get with both models. I would assume that the durations could be overfitted (did you check the duration validation loss?). I am also testing some model variations and found that it helps to concat the LSTM output with the prenet output:

    x = self.prenet(x)
    x = self.lr(x, dur)                # length-regulate: expand encoder outputs by durations

    x_p, _ = self.lstm(x)
    x_p = F.dropout(x_p,
                    p=self.dropout,
                    training=self.training)
    x_p = torch.cat([x, x_p], dim=-1)  # concat the LSTM output with its (expanded) input
    x_p = self.lin(x_p)

This is closer to the Taco architecture, where the attention context is also concatenated with the LSTM output.

@m-toman

m-toman commented Jun 30, 2020

[screenshot: training vs. validation loss curves]

Well, validation loss is strange for me ;)

EDIT: zooming in doesn't really help
[screenshot: zoomed-in validation loss curve]

@cschaefer26

Seems like instant overfitting...

@cschaefer26

cschaefer26 commented Jul 1, 2020

https://drive.google.com/file/d/1S__-0_3N2swYCsWu4TciZ6O9owkxX7eK/view?usp=sharing
https://drive.google.com/file/d/1fIU8SfijwsUg_vEOSykgs7hXb3lO1Fh0/view?usp=sharing

Here are some results with the updated model trained for 320k steps, together with the pretrained MelGAN from the repo.

@m-toman

m-toman commented Jul 8, 2020

Currently investigating this "overfitting" issue. I've been plotting the pre- and post-net mel validation error now; it is at its lowest point at around 10k steps and then gradually increases.

Looking at mel spectra from the validation set, this is after 12k steps:
[predicted mel spectrogram, 12k steps]

After 81k steps:
[predicted mel spectrogram, 81k steps]

Definitely more detail.

Then I've also plotted the error here:
[error plot, 12k steps]

[error plot, 81k steps]

It seems the error in the formant structure really is higher.
I would assume that there might just be some ... shift on the frequency axis that messes up the loss but obviously still sounds fine.

Ground truth:
[ground-truth mel spectrogram]

@cschaefer26

cschaefer26 commented Jul 8, 2020

Very cool. That's what I expected too: the structure gets more pronounced but may vary from the ground truth (e.g. a different pitch, or the voice going a bit up instead of down) - as the model is not teacher-forced.

BTW, I found that with the MelGAN preprocessing it is necessary to do some cherry-picking with the TTS model, but training to 400k steps is definitely worth it.

@m-toman

m-toman commented Jul 8, 2020

Oh, I got the alignment with HTK to work, and while it generally works fine, I'm currently getting more "raspy" voices and I'm not completely sure whether it's because of the alignment.
My main issue is that I'm not completely sure how to best handle word boundaries. Tacotron usually works fine with spaces as word-boundary symbols, but they mess up the aligner in most cases, except when there really is a pause between words.

I think the best solution might be to not have them during alignment and then use a skip encoder like in DurIAN, where they keep the word boundaries as separate symbols until the state expansion (see the sketch below).
If I just drop them completely, it strings the words together without any pause at all, and it sounds awful.

Well, having diacritics as separate symbols is not really helpful either...
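
A rough sketch of that DurIAN-style idea - encode with the boundary symbols present, then drop them just before the length regulation (the function and mask names are assumptions, not DurIAN's or the repo's actual module):

    def skip_boundaries(encoder_out, durations, is_boundary):
        # encoder_out: (B, S, D), durations: (B, S), is_boundary: (B, S) bool mask
        kept_out, kept_dur = [], []
        for enc, dur, bnd in zip(encoder_out, durations, is_boundary):
            keep = ~bnd                       # keep only real phoneme symbols
            kept_out.append(enc[keep])        # boundary symbols influenced the encoder,
            kept_dur.append(dur[keep])        # but are not expanded to mel frames
        return kept_out, kept_dur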

@cschaefer26

That's also my experience with durations from an STT model. I tried 1) generating phoneme boundaries (and word boundaries) from the output probabilities and 2) extracting the exact phoneme positions in time and splitting right between them. Both resulted in lower mel quality.

@m-toman

m-toman commented Jul 11, 2020

Should change the name of this issue ;)

Still seeing generalization issues. I implemented multispeaker by injecting speaker codes after the CBHG, but in general it always defaults to one voice (or, I'm not fully sure, maybe an average voice), except if I pick a sentence from the training set together with the respective speaker code. Strange, as the code is fed directly into the LSTM.
Pretty strange, considering I previously used similar 3-layer LSTM networks where it worked without issues.
I'm currently adding residual connections - the concat you suggested above around the LSTM, and also an additive one after the postnet as in Taco2 - but it still seems to do the same thing. Even more interesting: if I pick a sentence from the training set with a specific speaker and just change one word, it sort of interpolates the whole sentence.

Hmm, I should try synthesizing from the pre-postnet mel spectra to see if it makes a difference. Update: nope, it sounds a bit different but has already lost the speaker information.
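
For context, injecting a speaker code after the CBHG usually amounts to something like the hypothetical module below - shown only to illustrate the kind of conditioning being described, not the code in question:

    import torch
    import torch.nn as nn

    class SpeakerConditioning(nn.Module):
        # learn a speaker embedding and concatenate it to every encoder frame
        def __init__(self, n_speakers, spk_dim=64):
            super().__init__()
            self.spk_emb = nn.Embedding(n_speakers, spk_dim)

        def forward(self, encoder_out, speaker_ids):
            # encoder_out: (B, S, D), speaker_ids: (B,)
            spk = self.spk_emb(speaker_ids)                           # (B, spk_dim)
            spk = spk.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
            return torch.cat([encoder_out, spk], dim=-1)              # (B, S, D + spk_dim)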

cschaefer26 changed the title from "Autoregressive training" to "Experiments / Discussion" on Jul 12, 2020
@cschaefer26

Renamed - good research. Did you use durations extracted by a separate Tacotron trained on each dataset? As for the overfitting - I see the same issues; you mentioned that having a pre-net with heavy dropout did not help, did it? I conducted a lot of unsuccessful experiments, mainly trying various forms of duration prediction, e.g. a separate autoregressive duration predictor (heavily overfitted). Currently I am experimenting with GANs again and got them to work quite well, although the voice quality is not yet better than with the standard L1 loss.

@cschaefer26

Ah cool. I'm also planning on messing with a pitch predictor; the FastPitch samples seem quite convincing. Let me know how it goes!

@m-toman

m-toman commented Sep 8, 2020

Oh, it worked quite well without any real issues. Which made me wonder all the more why the model ignores my speaker IDs while that part works nicely.
I'm just at 20k steps with the modifications above, but it also seems to output some random voice; checking the embeddings again, weird.

Update 2:
OMG, I think I got it. Such a stupid bug. I checked the embeddings at inference to see if they match the speaker, I checked them in the data loader to see if they match the filename, etc. I checked in forward() whether they vary in each batch and whether the repeat does its work correctly.
But I did NOT check the collate function 😱

Such a simple, stupid bug, and it took me longer than that X11 forwarding syscall issue here: mozilla/TTS#417 (comment) ;)

😠

Retraining now....

Update 3:
Working, 3 samples after just 2k steps (like 10 minutes of training)
ms.zip
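
For anyone hitting the same collate pitfall: the crucial point is that every per-item field, including the speaker ID, must be reordered and stacked together with the padded sequences. A hypothetical sketch (the field names are assumptions, not this codebase's actual collate function):

    import torch
    from torch.nn.utils.rnn import pad_sequence

    def collate_tts(batch):
        # sort by text length so later packing/padding stays consistent
        batch = sorted(batch, key=lambda item: len(item['text']), reverse=True)
        texts = pad_sequence([item['text'] for item in batch], batch_first=True)
        mels = pad_sequence([item['mel'] for item in batch], batch_first=True)
        durs = pad_sequence([item['dur'] for item in batch], batch_first=True)
        # easy to forget: reorder/stack this together with everything else,
        # otherwise every sample silently gets the wrong speaker
        speaker_ids = torch.stack([item['speaker_id'] for item in batch])
        return texts, mels, durs, speaker_ids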

@ghost

ghost commented Sep 20, 2020

Hi there @cschaefer26, nice project! Regarding #7 (comment), while integrating fatchord's tacotron model with https://github.com/CorentinJ/Real-Time-Voice-Cloning (my work is in #472), I've also encountered the same problems you had with LibriTTS, which are mainly caused by the highly inconsistent prosody between speakers. You can get much better results by preprocessing or curating the dataset (either trimming mid-sentence pauses or discarding utterances when that occurs). VCTK works a lot better if you trim the silence at the beginning and end of each file. I can go into more detail if it is helpful.

The baseline tacotron requires very clean data for multispeaker, and even then I'm having trouble producing a decent model. Which is what leads me to your repo. :) I will be trying it out. Keep up the great work!

@cschaefer26

cschaefer26 commented Sep 22, 2020

Hi @m-toman, I totally missed your update. Sounds really good - I assume it's MelGAN? I got some OK results using VCTK; as @blue-fish states, the datasets require some good trimming etc. I found this to be really helpful: https://github.com/resemble-ai/Resemblyzer/blob/cf57923d50c9faa7b5f7ea1740f288aa279edbd6/resemblyzer/audio.py#L57

Any updates? We are also looking into adding GST. The main problem I have right now is that I would need really clean German datasets to benefit from transfer learning for our use case. I also looked into other open-source repos and tested Taco2 etc., but found they don't really perform much better.

Currently I am looking into some different preprocessing, e.g. mean-variance scaling, to improve the voice quality.

@m-toman

m-toman commented Sep 22, 2020

Hi.
I mixed in VCTK (also with this trimming ;)), but I felt that the larger speakers I added lost a bit of prosody / felt flatter than when trained individually. I wanted to investigate further but haven't gotten to it yet. Yeah, it's MelGAN.

Yeah, it's interesting. Since I can't really believe it, I regularly compare against Taco2 and other more complex methods out there (https://github.com/tugstugi/dl-colab-notebooks), but neither attention nor autoregression really seems to make a significant difference.

Regarding styles, I would have considered something like the simple method presented in DurIAN, which is mostly just a style embedding. I also read the Flowtron paper again and thought about wrapping the whole model in such a flow formulation, but after listening to the samples again I felt it's probably not worth it versus just playing with the pitch predictor I have (it might also be possible to predict and sample F0 from a Gaussian, where you could then play with the variance).

I would have to read the GST paper again, but if I remember correctly the control is a bit too random? In the sense that the tokens are hard to interpret and probably different with each run?

@cschaefer26

Yeah, exactly - although they show some impressive results absorbing background noise into the tokens. I'd still think that pitch prediction is the lowest-hanging fruit of them all...

@m-toman

m-toman commented Sep 22, 2020

I feel we're getting to a similar state of saturation as before deep learning entered the speech synthesis field. The HMM-based methods became so loaded with more and more tricks and features that the complexity was insane. The training script I used during my PhD consisted of, I think, 120 separate steps in the end, each calling HTS tools with dozens of command-line parameters and additional script files ;).
Recently there have been so many approaches to making attention work better for this use case, like the monotonic methods that force it to either take a step or stay in the current state and attend to only a single input - with lots of weird tricks to keep it differentiable, etc.
At that point it's so far from the origins that it seems awkward to even use attention.
The seq2seq AR approach also means depending on the stop-token prediction (whoever never ended up with 30 seconds of garbled speech, please raise your hand ;)).
https://arxiv.org/abs/1909.01145 was an interesting paper, but it's yet another rather complicated workaround for the issues introduced by AR, besides scheduled sampling / curriculum learning (which introduces new robustness issues) and gradually decreasing r and stopping at r=2 (although that works quite well) to keep the model from predicting from the previous samples while ignoring the conditioning information.

I admit I wasn't brave enough to just try what you did and throw all that stuff out. Surely the thousand people at Google would have done that already, right? :)
Enough ranting - curious what you will achieve with GST; I'll play with multispeaker again soon.

@cschaefer26

Good rant though! The more I test the autoregressive stuff, the less impressed I am. It's basically not usable for us in production (we are trying to synthesize long German politics articles). The forward models are pretty robust though. I wish I could get rid of the AR model for extracting durations; we experimented with the aligner module from Google's EATS, which didn't work. Extracting with an STT model worked, but the quality was worse. Today I spent the whole day debugging why my forward model all of a sudden sucked badly and found that the Tacotron alignments were shifted - somehow I got unlucky; increasing the batch size solved it. When I started with TTS I wondered why people got so interested in these attention plots; now I know - watching a Tacotron give birth to attention is one of my good moments :D .. Honestly, the forward models seem to be SOTA now and are probably used in production by Microsoft, AWS, Google...

@m-toman

m-toman commented Sep 23, 2020

My samples above used alignment via HTS and I didn't notice a difference from the taco-attention ones. I used these scripts: https://github.com/CSTR-Edinburgh/merlin/tree/master/misc/scripts/alignment/state_align
Just a bit annoying to set up.

I think it's mostly Google that still clings to it: https://arxiv.org/abs/1910.10288

Yeah, I was astonished to see that Springer does TTS ;).

My ex-colleagues recorded this corpus: https://speech.kfs.oeaw.ac.at/mmascs/
Unfortunately it's too small for deep learning stuff (it was fine for HMM-based synthesis), but it's good quality and, with its different speaking rates, it might be useful for the duration model.
We did lots of mocap recordings back then, it was fun ;)

Edit: and obviously it's Austrian German (Vorarlberg in this case ;)).
Here are some fun dialect-interpolation samples: http://mtoman.neuratec.com/thesis/interpolation/

@cschaefer26

Cool. I actually just found a glitch in the duration extraction of my STT models, and it seems fine now. I'll probably release that, as it's cumbersome to train a Tacotron just for extraction. Good stuff! I'll keep you updated on how it goes with pitch prediction, multispeaker, etc.

@m-toman

m-toman commented Sep 28, 2020

I've struggled for a week now with suddenly getting a burp sound at the end of many sentences. Honestly, I still have no idea why; it seems to happen sometimes. I now force the ending to silence... My assumption was that the final punctuation symbol is usually aligned to silence, but if the silence trimming trims too aggressively, the aligner has to align that symbol with voiced audio. So I added a little bit of silence myself (it's actually quite common in older systems to have a silence symbol at the beginning and end and to prepend and append artificial silence), but that didn't help. Now I force the ending to silence after synthesis, but I still have no idea where it's coming from...

Anyway, did you ever get the validation loss to make sense? For me it still gradually increases, although at probably 1/10th of the rate at which the training loss decreases. I tried really small model sizes and more dropout, but still. Even the multispeaker model I'm currently training on 100k sentences does it, though admittedly less pronounced.

@cschaefer26

Did you check the padding? I had a similar problem once and found that the padding values were at zero (and not at -11.5, which is silence in log space). The validation loss in this case does not mean anything imo, since the model has too much freedom in predicting intonation, pitch etc. without teacher forcing. I don't even look at it (for durations it still makes sense though, imo).
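
A minimal sketch of padding with the log-domain silence floor instead of zero (the -11.5 value is the one mentioned above; use whatever your own normalisation maps silence to, e.g. -4):

    import numpy as np

    def pad_mel(mel, max_len, silence_value=-11.5):
        # mel: (n_mels, T); pad the time axis with the silence value, not 0,
        # since 0 is a fairly loud level in log space
        n_mels, t = mel.shape
        padded = np.full((n_mels, max_len), silence_value, dtype=mel.dtype)
        padded[:, :t] = mel
        return padded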

@m-toman

m-toman commented Sep 29, 2020

Hmm, good idea, thanks.
You mean the mel padding here, right?

mel = [pad2d(x[1], max_spec_len) for x in batch]

But I have to check how my mel representation differs from yours.

The loss masking should work now, so it should mostly be about context from the LSTMs and convolutions "leaking" into the actual speech.
Strangely, I never had any issues until recently, and I've checked all commits a dozen times: no model/dataset changes. I've retrained at different commits, but the answer was never clear. Sometimes there were slight issues at the end of the sentence; 3 of 4 trainings on the original commit were fine, but one had slight issues. Sometimes it occurs at some point during training and is then gone again; sometimes it gradually worsens.

@m-toman

m-toman commented Sep 29, 2020

OK, I checked it - for me silence is -4 - and modified the padding.
But then I noticed I made a mistake in my post above: the convs/RNNs apply to the input text, not to the mel spectra. So with the loss masking fixed this should not have any effect, or am I missing something?

@cschaefer26

cschaefer26 commented Oct 1, 2020

Any improvements with the padding? I agree that the glitch can't come from the loss directly, due to the masking, but I found something else: in my case the length regulator is slightly over-expanding, i.e. attaching some extra repeats of the last input vectors due to the padding within batches during training. I added something like this to set the repeated inputs to zero:

for i in range(x.size(0)):

Imo this could be one of the causes of a 'leak' in the RNNs.
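
The linked snippet is truncated above; here is a minimal sketch of the same idea under my own assumptions (zeroing length-regulator outputs past each utterance's true mel length, so the padded repeats can't feed the RNN useful-looking context) - not the repo's exact fix:

    import torch

    def mask_expanded(x, mel_lens):
        # x: (B, T_max, D) expanded encoder outputs, mel_lens: (B,) true mel lengths
        mask = torch.arange(x.size(1), device=x.device)[None, :] < mel_lens[:, None]
        return x * mask.unsqueeze(-1).to(x.dtype)   # zero everything past the true length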

@cschaefer26

cschaefer26 commented Oct 1, 2020

Oh, btw, I am now also adding a pitch module, and the first results seem very promising. I might add an 'energy' vector as in FastSpeech 2 as well, although in their ablation study the gain was pretty small. I also wondered whether, instead of calculating F0, one could just use the mean of the frequency distribution along the mel axis? Imo this should be pretty similar and wouldn't require an external lib for the calculation.
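
That mel-axis mean is essentially a per-frame spectral centroid over the mel bins; a small sketch of the idea (it tracks spectral brightness rather than true F0, so treat it as a rough approximation):

    import torch

    def mel_centroid(mel):
        # mel: (B, n_mels, T) log-mel spectrogram; returns (B, T) mean bin index per frame
        energy = mel.exp()                                    # back to a linear-ish scale
        bins = torch.arange(mel.size(1), device=mel.device, dtype=mel.dtype)
        weights = energy / (energy.sum(dim=1, keepdim=True) + 1e-8)
        return (weights * bins[None, :, None]).sum(dim=1)     # energy-weighted mean bin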

@m-toman

m-toman commented Oct 1, 2020

I'll check the LR thing above. I'm not sure now whether I used your solution or wrote something myself, because I added the positional index (I didn't see a huge difference though, if any).
It's strange that I never noticed it for the first months and then it suddenly appeared. Resetting to the commit from my last OK training really did reproduce a non-burpy version... so I checked and checked again, but there were no modifications to the model, data loader, or training procedure. But run 4 then also produced burps. It feels semi-random. For now I just overwrite the last symbol with silence (in combination with forcing the last symbol to be punctuation and making sure there is silence in the audio at the end), and that fixes the symptoms, but it still bugs me ;).

I'll post a sample when back home.

For pitch I used the approach from FastPitch (the repo is out there), which works fine with the proposed mean pitch per symbol, but I am thinking about a more complex parameterization that also allows controlling some delta features (perhaps just categories of falling/rising/flat pitch or so).
EDIT: here it is, after changing the mel padding: https://drive.google.com/file/d/1p9dJjLzJ0p0R3v0XLz-Z6hr1Xwhj-5Gd/view?usp=sharing
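
The FastPitch-style per-symbol pitch target mentioned above boils down to averaging a frame-level F0 track over each symbol's duration; a rough sketch (unvoiced frames assumed to be marked with F0 = 0, which is a convention, not necessarily FastPitch's exact code):

    import torch

    def average_pitch_per_symbol(pitch, durations):
        # pitch: (T,) frame-level F0 for one utterance; durations: (S,) frames per symbol
        out = torch.zeros(len(durations), dtype=pitch.dtype)
        start = 0
        for i, d in enumerate(durations.tolist()):
            d = int(d)
            seg = pitch[start:start + d]
            voiced = seg[seg > 0]                 # ignore unvoiced frames
            if len(voiced) > 0:
                out[i] = voiced.mean()
            start += d
        return out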

@m-toman

m-toman commented Oct 1, 2020

So far it seems to be better: https://drive.google.com/file/d/1LkusT0VO8cKw3nI5jJ1GBmDLCu4jP_vv/view?usp=sharing
Just 7k steps; it often started to happen only after 60k steps.
Generally I often feel there is not much improvement after 10k steps of training, which is quite cool considering how long Taco takes to get the attention right (if at all).

@cschaefer26

cschaefer26 commented Oct 8, 2020

That's only 7k steps? Impressive. I found no real improvement from a pitch module, although it's fun to play around with the pitch. That may be a limitation of our dataset though (only 8 hrs). I also tested the NVIDIA FastPitch implementation - not better. To be honest, I'm thinking about looking more into the data, e.g. cleaning inconsistent pronunciations with an STT model.

@m-toman

m-toman commented Oct 8, 2020

Yeah, I didn't see an improvement either, but we need it for implementing SSML tags, and it definitely works better than synthesis-resynthesis methods, which introduce too much noise. And yeah, one can probably think about some generative model to produce/sample interesting pitch contours.

My results were a bit strange: with both padding modifications above I started to get weird pauses/prosody. I trained 3 times, at different stages of training, to verify it's not some random effect.

Integrating DiffWave might be interesting as well.

@cschaefer26

No improvements - as in, with the pitch? Generally I have the same problems as you: it seems that trainings can vary by large degrees; there is probably some randomness in what the model really fits...

@m-toman

m-toman commented Oct 9, 2020

Yeah, it sounds pretty much the same with and without the pitch model.
Well, it's still much more robust than most Taco2 implementations. Quite a few smaller datasets that did not work at all before (in the sense that the output was cut off, words were broken, etc.) now work well. Sometimes the prosody is not as natural (rarely), but that's better than generating garbage.

I'll try multispeaker again soon, but it seemed to me as if it averages a bit too much.

@cschaefer26

Sounds good. I regularly compare the model to other architectures, and I find that the LSTM produces somewhat more fluent output but tends to mumble more compared to a transformer-based model à la FastSpeech. Multispeaker didn't really add any benefit so far, but that could still be due to a lack of data.

@cschaefer26

Just a quick update: I merged all the pitch stuff to master; I really found a benefit from the pitch conditioning. I see the same as you - after 10-20k steps the model is almost done. Quick question: did you see any improvement with positional indexing? I found some generalization problems on smaller datasets, where the voice mumbles quite a bit, weirdly especially for shorter sentences. Also, I added an STT model with a CTC loss to the training, hoping that the model is forced to be clearer; the first results actually seem quite promising.

@m-toman

m-toman commented Nov 2, 2020

No new experiments from my side atm.
I am not fully sure about the positional index; I added it before all the other stuff and kept it in, although I didn't hear a major difference.

    def expand(self, x, durations):
        # repeat each encoder output vector according to its (integer) duration
        idx = self.build_index(durations, x)
        y = torch.gather(x, 1, idx)
        if self.posidx:
            # append a positional-index feature: for every expanded frame, its
            # relative position (0..1) within the current input symbol
            duration_sums = durations.sum(dim=1)
            max_target_len = duration_sums.max().int().item()
            batch_size = durations.shape[0]
            position_encodings = torch.zeros(
                batch_size, max_target_len, dtype=torch.float32).to(durations.device)
            for obj_idx in range(batch_size):
                positions = torch.cat([torch.linspace(0, 1, steps=int(dur), dtype=torch.float32)
                                       for dur in durations[obj_idx]]).to(durations.device)
                position_encodings[obj_idx, :positions.shape[0]] = positions
            y = torch.cat([y, position_encodings.unsqueeze(dim=2)], dim=2)
        return y

I've checked out the Glow-TTS samples in Mozilla TTS, but they did not really seem convincing.
My main issue atm is that the prosody could be better.

@cschaefer26

Intuitively I wouldn't expect a world of difference from positional indexing with LSTMs though. As for prosody, did you try using a smaller, separate duration predictor as I do? I found that the model hugely overfits otherwise (e.g. when duration prediction is done after the encoder). Also for prosody I have an idea I want to try out soon: similar to the pitch frequency, I want to leak some duration stats from the target, e.g. some running mean of durations, to condition the duration predictor on. My hope is that the model would pick up some rhythm/prosody swings in longer sentences (similar to the pitch swings).
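
One simple way such a running duration statistic could be computed, sketched under my own assumptions (the idea above is a proposal, not something implemented in the repo):

    import torch

    def running_duration_mean(durations):
        # durations: (B, S) per-symbol durations; returns the cumulative running mean
        steps = torch.arange(1, durations.size(1) + 1,
                             device=durations.device, dtype=durations.dtype)
        return durations.cumsum(dim=1) / steps[None, :]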

@m-toman

m-toman commented Nov 2, 2020

I tried running the duration predictor before the CBHG once, but the results were a bit strange. I'll try again. I also wondered whether it's really a good idea to train it together with the rest, or whether it should rather be separate (or at least stop training it at some earlier point).

So you already added some "global" pitch stats to the model? I have to check your code.
One could probably, instead of just expanding the states according to the duration value, also feed the duration value itself to the mel predictor - I don't know whether that would help.

@cschaefer26

My best results were with a mere 64-dim GRU for duration prediction, with lots of dropout before it. Yes, it probably makes sense to completely separate it (to be able to compare results at different stages, at the least). Yeah, I reimplemented the FastPitch version (with minor differences), with pitch averaged over characters.
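
A minimal sketch of such a small, heavily regularised duration predictor (layer sizes, the final projection, and the exact placement are assumptions, not the repo's implementation):

    import torch.nn as nn

    class SmallDurationPredictor(nn.Module):
        # dropout -> tiny GRU -> linear projection to one duration per symbol
        def __init__(self, in_dim, hidden=64, dropout=0.5):
            super().__init__()
            self.dropout = nn.Dropout(dropout)
            self.gru = nn.GRU(in_dim, hidden, batch_first=True)
            self.lin = nn.Linear(hidden, 1)

        def forward(self, x):
            # x: (B, S, in_dim) encoder outputs (detach them if trained separately)
            x = self.dropout(x)
            x, _ = self.gru(x)
            return self.lin(x).squeeze(-1)        # (B, S) predicted durations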

@m-toman

m-toman commented Nov 9, 2020

I'll try the different duration model as soon as I have the capacity.
I also wanted to try out https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/
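
For reference, a minimal stochastic weight averaging loop with torch.optim.swa_utils, roughly following that blog post; the model, loader, loss_fn, and the schedule values are placeholders, not ForwardTacotron specifics:

    import torch
    from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

    def train_with_swa(model, loader, loss_fn, epochs=100, swa_start=75):
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        swa_model = AveragedModel(model)                # keeps the running weight average
        swa_scheduler = SWALR(optimizer, swa_lr=5e-5)   # SWA learning-rate schedule

        for epoch in range(epochs):
            for batch in loader:
                optimizer.zero_grad()
                loss = loss_fn(model, batch)
                loss.backward()
                optimizer.step()
            if epoch >= swa_start:
                swa_model.update_parameters(model)      # fold current weights into the average
                swa_scheduler.step()

        update_bn(loader, swa_model)                    # recompute batch-norm stats for the average
        return swa_model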

m-toman closed this as completed on May 5, 2022