
GravesAttention with Tacotron 1 yields empty alignment plots during training and throws a "no attribute" error during inference #383

Closed
a-froghyar opened this issue Mar 16, 2021 · 11 comments


a-froghyar (Contributor) commented Mar 16, 2021

I've trained a model using Tacotron 1 with GST and GravesAttention. During training, all training and eval alignment plots were empty (trained 80k+ steps). The model produced audio in TensorBoard; however, when I used the logic from one of the notebooks to evaluate the model and synthesize speech, it threw the following error: AttributeError: 'GravesAttention' object has no attribute 'init_win_idx', referring to layers/tacotron.py --> 478 self.attention.init_win_idx(). I suspect the Tacotron 1 model is not configured to use GravesAttention, because some of the methods called in layers/tacotron.py do not exist in the GravesAttention class.

Config:

{
  "model": "Tacotron",
  "run_name": "blizzard-gts",
  "run_description": "tacotron GST.",

  "audio": {
    "fft_size": 1024,
    "win_length": 1024,
    "hop_length": 256,
    "frame_length_ms": null,
    "frame_shift_ms": null,

    "sample_rate": 24000,
    "preemphasis": 0.0,
    "ref_level_db": 20,

    "do_trim_silence": true,
    "trim_db": 60,

    "power": 1.5,
    "griffin_lim_iters": 60,

    "num_mels": 80,
    "mel_fmin": 95.0,
    "mel_fmax": 12000.0,
    "spec_gain": 20,

    "signal_norm": true,
    "min_level_db": -100,
    "symmetric_norm": true,
    "max_norm": 4.0,
    "clip_norm": true,
    "stats_path": null
  },

  "distributed": {
    "backend": "nccl",
    "url": "tcp://localhost:54321"
  },

  "reinit_layers": [],

  "batch_size": 128,
  "eval_batch_size": 16,
  "r": 7,
  "gradual_training": [
    [0, 7, 64],
    [1, 5, 64],
    [50000, 3, 32],
    [130000, 2, 32],
    [290000, 1, 32]
  ],
  "mixed_precision": true,

  "loss_masking": false,
  "decoder_loss_alpha": 0.5,
  "postnet_loss_alpha": 0.25,
  "postnet_diff_spec_alpha": 0.25,
  "decoder_diff_spec_alpha": 0.25,
  "decoder_ssim_alpha": 0.5,
  "postnet_ssim_alpha": 0.25,
  "ga_alpha": 5.0,
  "stopnet_pos_weight": 15.0,

  "run_eval": true,
  "test_delay_epochs": 10,
  "test_sentences_file": null,

  "noam_schedule": false,
  "grad_clip": 1.0,
  "epochs": 300000,
  "lr": 0.0001,
  "wd": 0.000001,
  "warmup_steps": 4000,
  "seq_len_norm": false,

  "memory_size": -1,
  "prenet_type": "original",
  "prenet_dropout": true,

  "attention_type": "graves",
  "attention_heads": 4,
  "attention_norm": "sigmoid",
  "windowing": false,
  "use_forward_attn": false,
  "forward_attn_mask": false,
  "transition_agent": false,
  "location_attn": true,
  "bidirectional_decoder": false,
  "double_decoder_consistency": false,
  "ddc_r": 7,

  "stopnet": true,
  "separate_stopnet": true,

  "print_step": 25,
  "tb_plot_step": 100,
  "print_eval": false,
  "save_step": 5000,
  "checkpoint": true,
  "tb_model_param_stats": false,

  "text_cleaner": "phoneme_cleaners",
  "enable_eos_bos_chars": false,
  "num_loader_workers": 8,
  "num_val_loader_workers": 8,
  "batch_group_size": 4,
  "min_seq_len": 6,
  "max_seq_len": 153,
  "compute_input_seq_cache": false,
  "use_noise_augment": true,

  "output_path": "/home/big-boy/Models/Blizzard/",

  "phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/",
  "use_phonemes": true,
  "phoneme_language": "en-us",

  "use_speaker_embedding": false,
  "use_gst": true,
  "use_external_speaker_embedding_file": false,
  "external_speaker_embedding_file": "../../speakers-vctk-en.json",
  "gst": {
    "gst_style_input": null,
    "gst_embedding_dim": 512,
    "gst_num_heads": 4,
    "gst_style_tokens": 10,
    "gst_use_speaker_embedding": false
  },

  "datasets": [
    {
      "name": "ljspeech",
      "path": "/Data/blizzard2013/segmented/",
      "meta_file_train": "metadata.csv",
      "meta_file_val": null
    }
  ]
}

Alignment plots:

(image: empty alignment plots)

@erogol added, then removed, the "feature request" label on Apr 6, 2021.
a-froghyar (Contributor, Author) commented:

So, if I comment out the line self.attention.init_win_idx() in tacotron.py, I expect inference will stop complaining; when OriginalAttention() is used, the init_win_idx() method is called inside the OriginalAttention() module itself (line 226). I'm going to launch a training run to see whether the empty alignment plots persist, reporting back later today.
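A less invasive alternative to deleting the call would be to guard it, since only the original attention module implements the windowing helpers. A minimal sketch of the idea; the stub classes and the `init_decoder_states` function below are hypothetical stand-ins, not the repo's real modules:

```python
class OriginalAttentionStub:
    """Hypothetical stand-in for OriginalAttention: defines windowing helpers."""
    def init_win_idx(self):
        self.win_idx = -1

class GravesAttentionStub:
    """Hypothetical stand-in for GravesAttention: no windowing helpers."""
    pass

def init_decoder_states(attention):
    # Only call init_win_idx() on attention modules that actually define it,
    # instead of deleting the line outright.
    if hasattr(attention, "init_win_idx"):
        attention.init_win_idx()

init_decoder_states(OriginalAttentionStub())  # sets win_idx
init_decoder_states(GravesAttentionStub())    # no AttributeError
```

This way inference works for both attention types without changing the OriginalAttention code path.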

erogol (Member) commented Apr 13, 2021

I'll take a look at Graves attention, but in the meantime you can try DDC or DCA models to solve the attention problems.

a-froghyar (Contributor, Author) commented:

Thanks, yeah, I just checked and the attention plots are still empty; moving on to DDC/DCA now.

erogol (Member) commented Apr 13, 2021

Just a side note: DCA is faster and uses less memory, but DDC gives better quality.

a-froghyar (Contributor, Author) commented:

@erogol thanks. If I want to use DDC, I understand I should set "double_decoder_consistency": true, but then which attention_type should I choose?

erogol (Member) commented Apr 13, 2021

It should look like this:

    // TACOTRON ATTENTION
    "attention_type": "original",  // 'original' , 'graves', 'dynamic_convolution'
    "attention_heads": 4,          // number of attention heads (only for 'graves')
    "attention_norm": "sigmoid",   // softmax or sigmoid.
    "windowing": false,            // Enables attention windowing. Used only in eval mode.
    "use_forward_attn": false,     // whether to use forward attention. In general, it aligns faster.
    "forward_attn_mask": false,    // Additional masking forcing monotonicity only in eval mode.
    "transition_agent": false,     // enable/disable transition agent of forward attention.
    "location_attn": true,         // enable/disable location-sensitive attention. It is enabled for TACOTRON by default.
    "bidirectional_decoder": false,  // use https://arxiv.org/abs/1907.09006. Use it if attention does not work well with your dataset.
    "double_decoder_consistency": true,  // use DDC explained here https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency-draft/
    "ddc_r": 6,                           // reduction rate for coarse decoder.

a-froghyar (Contributor, Author) commented:

Thank you!

a-froghyar (Contributor, Author) commented:

@erogol Graves is working: something was off in my dataset and/or config. That's been solved, and training now yields alignments after 5-10K steps. The inference issue is still there; I'll open a PR that just deletes the one line mentioned above.

a-froghyar (Contributor, Author) commented:

After 43K steps the alignments are still a bit wonky, but no longer empty.

(screenshot from 2021-04-29: partially formed alignment plot)

erogol (Member) commented Apr 29, 2021

Good to hear, but I'm personally not sure whether the implementation is correct compared to this paper: https://arxiv.org/abs/1910.10288

AFAIK this is the most robust Graves attention proposed for TTS so far. Our implementation may be wrong.

It'd be nice if you could double-check.
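For reference, the paper above describes GMM-based attention variants in which the attention weights over encoder positions come from a mixture of K Gaussians whose means can only move forward at each decoder step, which is what enforces monotonic alignment. A minimal NumPy sketch of one such update, under my own reading of that parameterization (softplus for the step size and width, softmax for the mixture weights; the function name and the 1e-5 floor are my own, not the paper's or the repo's):

```python
import numpy as np

def gmm_attention_step(mu_prev, delta_hat, sigma_hat, w_hat, memory_len):
    """One decoder step of a GMM attention sketch.

    mu_prev, delta_hat, sigma_hat, w_hat: arrays of shape (K,)
    Returns (alpha, mu): weights over encoder positions, updated means.
    """
    delta = np.log1p(np.exp(delta_hat))          # softplus: step size > 0
    sigma = np.log1p(np.exp(sigma_hat)) + 1e-5   # softplus: width > 0
    w = np.exp(w_hat) / np.exp(w_hat).sum()      # softmax mixture weights
    mu = mu_prev + delta                         # means only move forward
    j = np.arange(memory_len)[None, :]           # encoder positions (1, N)
    phi = w[:, None] * np.exp(-0.5 * ((j - mu[:, None]) / sigma[:, None]) ** 2) \
          / (sigma[:, None] * np.sqrt(2.0 * np.pi))
    alpha = phi.sum(axis=0)                      # sum over K components
    return alpha, mu
```

Comparing the repo's GravesAttention forward pass against a formulation like this (activation functions, whether the Gaussians are normalized, how the means are accumulated) would be one concrete way to double-check the implementation.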

a-froghyar (Contributor, Author) commented:

Closing this, because the "no attribute" bug was fixed in #479 and GMM (Graves) attention will be looked at in a separate discussion.
