GravesAttention with Tacotron 1 yields empty alignment plots during training and throws an AttributeError during inference #383
Comments
So, if I comment out the line …
I'll take a look at Graves attention, but in the meantime you can try DDC or DCA models for solving attention.
Thanks, yeah, just checked and the attention plots are still empty; moving on to DDC/DCA now.
Just a side note: DCA is faster and uses less memory, but DDC gives better quality.
@erogol thanks, if I wanted to use DDC, I understand I set …
it should be like:

```
// TACOTRON ATTENTION
"attention_type": "original", // 'original', 'graves', 'dynamic_convolution'
"attention_heads": 4, // number of attention heads (only for 'graves')
"attention_norm": "sigmoid", // softmax or sigmoid.
"windowing": false, // enables attention windowing. Used only in eval mode.
"use_forward_attn": false, // whether to use forward attention. In general, it aligns faster.
"forward_attn_mask": false, // additional masking forcing monotonicity, only in eval mode.
"transition_agent": false, // enable/disable transition agent of forward attention.
"location_attn": true, // enable/disable location-sensitive attention. It is enabled for TACOTRON by default.
"bidirectional_decoder": false, // use https://arxiv.org/abs/1907.09006. Use it if attention does not work well with your dataset.
"double_decoder_consistency": true, // use DDC, explained here: https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency-draft/
"ddc_r": 6, // reduction rate for the coarse decoder.
```
Thank you!
@erogol Graves is working; something was off in my dataset and/or config that has since been solved, and training is yielding alignments after 5-10K steps. The inference issue is still there; I'll open a PR just deleting that one line mentioned above.
Good to hear that, but I am personally not sure the implementation is right compared to this paper: https://arxiv.org/abs/1910.10288. AFAIK this is the most robust Graves attention proposed for TTS so far. It may be wrong. It'd be nice if you could double-check.
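(Editorial note: for readers unfamiliar with the mechanism, below is a minimal sketch of one Graves-style GMM attention step in the monotonic spirit of the paper above. All names, shapes, and the parameter layout are illustrative assumptions, not this repo's implementation.)

```python
import torch
import torch.nn.functional as F

def gmm_attention_step(params, mu_prev, memory_len):
    """One Graves-style GMM attention step (illustrative sketch).

    params:   (B, 3K) raw attention-network outputs (hypothetical layout)
    mu_prev:  (B, K) mixture means from the previous decoder step
    returns:  alpha (B, T) attention weights and the updated means mu (B, K)
    """
    w_hat, delta_hat, sigma_hat = params.chunk(3, dim=-1)
    w = torch.softmax(w_hat, dim=-1)       # mixture weights, sum to 1
    delta = F.softplus(delta_hat)          # positive step -> monotonic means
    sigma = F.softplus(sigma_hat) + 1e-5   # positive standard deviations
    mu = mu_prev + delta                   # means can only move forward
    j = torch.arange(memory_len, dtype=torch.float32)  # encoder positions (T,)
    # evaluate each Gaussian component at every encoder position: (B, K, T)
    z = (j - mu.unsqueeze(-1)) / sigma.unsqueeze(-1)
    alpha = (w.unsqueeze(-1) * torch.exp(-0.5 * z ** 2)).sum(dim=1)  # (B, T)
    return alpha, mu

# tiny smoke test with random parameters
B, K, T = 2, 4, 50
alpha, mu = gmm_attention_step(torch.randn(B, 3 * K), torch.zeros(B, K), T)
print(alpha.shape, mu.shape)  # torch.Size([2, 50]) torch.Size([2, 4])
```

The monotonically increasing means (enforced by the softplus on the step size) are what make this family of attention mechanisms robust for long-form synthesis, which is the paper's central point.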
Closing this because the AttributeError bug was fixed in #479, and GMM (Graves) attention will be looked at in a separate discussion.
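(Editorial note: the one-line workaround mentioned above is deleting the `init_win_idx()` call; guarding the call, sketched below, would avoid the crash as well. This is an illustration of the idea, not necessarily what #479 actually changed.)

```python
class GravesAttention:
    """Stand-in for the real class, which (per the traceback) lacks init_win_idx."""

def maybe_init_win_idx(attention):
    # Windowing state only exists on attention types that implement it,
    # so guard the call instead of assuming every class defines the method.
    if hasattr(attention, "init_win_idx"):
        attention.init_win_idx()

maybe_init_win_idx(GravesAttention())  # no-op instead of an AttributeError
```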
I've trained a model using T1 with GST and GravesAttention. During training, all training and eval alignment plots have been empty (trained 80k+ steps). The model produced audio in the TensorBoard; however, using the logic from one of the notebooks to evaluate a model and synthesize speech, it threw me the following error:

`AttributeError: 'GravesAttention' object has no attribute 'init_win_idx'`

referring to `layers/tacotron.py`:

--> 478 self.attention.init_win_idx()

I suspect that the Tacotron 1 model is not configured to use GravesAttention, because some of the methods called in `layers/tacotron.py` do not exist in the `GravesAttention` class.

Config:

```json
{
"model": "Tacotron",
"run_name": "blizzard-gts",
"run_description": "tacotron GST.",
"audio": {
},
"distributed": {
"backend": "nccl",
"url": "tcp://localhost:54321"
},
"reinit_layers": [],
"batch_size": 128,
"eval_batch_size": 16,
"r": 7,
"gradual_training": [
[0, 7, 64],
[1, 5, 64],
[50000, 3, 32],
[130000, 2, 32],
[290000, 1, 32]
],
"mixed_precision": true,
"loss_masking": false,
"decoder_loss_alpha": 0.5,
"postnet_loss_alpha": 0.25,
"postnet_diff_spec_alpha": 0.25,
"decoder_diff_spec_alpha": 0.25,
"decoder_ssim_alpha": 0.5,
"postnet_ssim_alpha": 0.25,
"ga_alpha": 5.0,
"stopnet_pos_weight": 15.0,
"run_eval": true,
"test_delay_epochs": 10,
"test_sentences_file": null,
"noam_schedule": false,
"grad_clip": 1.0,
"epochs": 300000,
"lr": 0.0001,
"wd": 0.000001,
"warmup_steps": 4000,
"seq_len_norm": false,
"memory_size": -1,
"prenet_type": "original",
"prenet_dropout": true,
"attention_type": "graves",
"attention_heads": 4,
"attention_norm": "sigmoid",
"windowing": false,
"use_forward_attn": false,
"forward_attn_mask": false,
"transition_agent": false,
"location_attn": true,
"bidirectional_decoder": false,
"double_decoder_consistency": false,
"ddc_r": 7,
"stopnet": true,
"separate_stopnet": true,
"print_step": 25,
"tb_plot_step": 100,
"print_eval": false,
"save_step": 5000,
"checkpoint": true,
"tb_model_param_stats": false,
"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"num_loader_workers": 8,
"num_val_loader_workers": 8,
"batch_group_size": 4,
"min_seq_len": 6,
"max_seq_len": 153,
"compute_input_seq_cache": false,
"use_noise_augment": true,
"output_path": "/home/big-boy/Models/Blizzard/",
"phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/",
"use_phonemes": true,
"phoneme_language": "en-us",
"use_speaker_embedding": false,
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "../../speakers-vctk-en.json",
"gst": {
"gst_style_input": null,
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},
"datasets":
[{
"name": "ljspeech",
"path": "/Data/blizzard2013/segmented/",
"meta_file_train": "metadata.csv",
"meta_file_val": null
}]
}
```
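(Editorial note: one non-obvious field above is `gradual_training`. Each entry is, to my understanding, `[first_step, r, batch_size]`, with the last entry whose first step has been reached taking effect. The sketch below illustrates that assumed semantics; it is not code from the repo.)

```python
def resolve_gradual_training(schedule, global_step):
    """Pick (r, batch_size) from the last phase whose first_step <= global_step.

    Assumes entries of the form [first_step, r, batch_size]."""
    _, r, batch_size = schedule[0]
    for first_step, phase_r, phase_bs in schedule:
        if global_step >= first_step:
            r, batch_size = phase_r, phase_bs
    return r, batch_size

schedule = [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]]
print(resolve_gradual_training(schedule, 60000))   # -> (3, 32)
print(resolve_gradual_training(schedule, 300000))  # -> (1, 32)
```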
Alignment plots: (attached images not preserved in this transcript; the plots were empty, as described above)