Svd not converging during validation phase #279

JRMeyer · 2021-03-07T08:58:29Z

JRMeyer
Mar 7, 2021
Maintainer

>>> LegendBegins
[October 3, 2020, 4:59pm]

Hi,

I'm trying to fine-tune this Tacotron 2 model using a voice from the
libri_tts dataset. However, whenever training reaches the validation
phase, it raises the following error:

numpy.linalg.linalg.LinAlgError: SVD did not converge

I looked into it, and it appears that the mel basis generated in
audio.py's _build_mel_basis function is not being inverted properly in
_mel_to_linear. I tried isolating the variables and creating the matrix
in a standalone Python file, where pinv was able to invert it without
any issues. I've successfully fine-tuned one of the Tacotron 1 models
with this dataset, but I can't find a fix for this one. I've reproduced
the functions below, along with my config file. The only major change I
made was setting the sample rate to match the new data. I appreciate any
and all advice.

def _mel_to_linear(self, mel_spec):
    # Pseudo-invert the mel filterbank; this is the call that runs the SVD.
    inv_mel_basis = np.linalg.pinv(self._build_mel_basis())
    return np.maximum(1e-10, np.dot(inv_mel_basis, mel_spec))

def _build_mel_basis(self):
    n_fft = (self.num_freq - 1) * 2
    if self.mel_fmax is not None:
        assert self.mel_fmax <= self.sample_rate // 2
    return librosa.filters.mel(
        self.sample_rate,
        n_fft,
        n_mels=self.num_mels,
        fmin=self.mel_fmin,
        fmax=self.mel_fmax)
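One way to narrow this down is to check the matrix handed to pinv for NaN or Inf entries before inverting it, since non-finite input is one common trigger for this error. A minimal sketch, assuming only NumPy; `checked_pinv` is a hypothetical helper that is not part of TTS, and the random matrix is a stand-in for the real mel basis:

```python
import numpy as np


def checked_pinv(matrix):
    """Pseudo-invert `matrix`, failing early with a readable message.

    Hypothetical helper (not part of TTS): it only wraps np.linalg.pinv
    with a finiteness check on the input.
    """
    matrix = np.asarray(matrix, dtype=np.float64)
    bad = ~np.isfinite(matrix)
    if bad.any():
        rows, cols = np.nonzero(bad)
        raise ValueError(
            f"{bad.sum()} non-finite entries in a {matrix.shape} matrix; "
            f"first at row {rows[0]}, column {cols[0]}"
        )
    return np.linalg.pinv(matrix)


# Stand-in for the mel basis: a random 80 x 1025 matrix (assumption).
rng = np.random.default_rng(0)
basis = rng.random((80, 1025))
inv_basis = checked_pinv(basis)
print(inv_basis.shape)  # (1025, 80)
```

Calling this in place of np.linalg.pinv inside _mel_to_linear would show whether the matrix itself is damaged at validation time or whether the failure comes from the SVD routine on otherwise valid input.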

{
'github_branch':'* dev',
'restore_path':'A:\Other\Installations\chatbot\Speech_Synthesis\TTS_Final_Load_Test\TTS\models\best_model.pth.tar',
'github_branch':'* dev',
'model': 'Tacotron2', // one of the model in models/
'run_name': 'ljspeech-bn',
'run_description': 'tacotron2 basline finetuned with BN prenet',

// AUDIO PARAMETERS
'audio':{
// Audio processing parameters
'num_mels': 80, // size of the mel spec frame.
'num_freq': 1025, // number of stft frequency levels. Size of the linear spectrogram frame.
'sample_rate': 24000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
'frame_length_ms': 50.0, // stft window length in ms.
'frame_shift_ms': 12.5, // stft window hop-length in ms.
'preemphasis': 0.98, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
'min_level_db': -100, // normalization range
'ref_level_db': 20, // reference level db, theoretically 20db is the sound of air.
'power': 1.5, // value to sharpen wav signals after GL algorithm.
'griffin_lim_iters': 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
// Normalization parameters
'signal_norm': true, // normalize the spec values in range [0, 1]
'symmetric_norm': true, // move normalization to range [-1, 1]
'max_norm': 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
'clip_norm': true, // clip normalized values into the range.
'mel_fmin': 0.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
'mel_fmax': 8000.0, // maximum freq level for mel-spec. Tune for dataset!!
'do_trim_silence': true // enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
},

// DISTRIBUTED TRAINING
'distributed':{
'backend': 'nccl',
'url': 'tcp:\/\/localhost:54321'
},

'reinit_layers': [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

// TRAINING
'batch_size': 32, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
'eval_batch_size':16,
'r': 7, // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
'gradual_training': null, //[[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 16], [290000, 1, 32]], // ONLY TACOTRON - set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled.
'loss_masking': true, // enable / disable loss masking against the sequence padding.

// VALIDATION
'run_eval': true,
'test_delay_epochs': 10, //Until attention is aligned, testing only wastes computation time.
'test_sentences_file': null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

// OPTIMIZER
'noam_schedule': false, // use noam warmup and lr schedule.
'grad_clip': 1, // upper limit for gradients for clipping.
'epochs': 1000, // total number of epochs to train.
'lr': 0.0001, // Initial learning rate. If Noam decay is active, maximum learning rate.
'lr_decay': false, // if true, Noam learning rate decaying is applied through training.
'wd': 0.000001, // Weight decay weight.
'warmup_steps': 4000, // Noam decay steps to increase the learning rate from 0 to 'lr'
'seq_len_norm': false, // Normalize each sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has skewed distribution of sequence lengths.

// TACOTRON PRENET
'memory_size': -1, // ONLY TACOTRON - size of the memory queue used for storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame.
'prenet_type': 'bn', // 'original' or 'bn'.
'prenet_dropout': false, // enable/disable dropout at prenet.

// ATTENTION
'attention_type': 'original', // 'original' or 'graves'
'attention_heads': 5, // number of attention heads (only for 'graves')
'attention_norm': 'sigmoid', // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
'windowing': false, // Enables attention windowing. Used only in eval mode.
'use_forward_attn': false, // if it uses forward attention. In general, it aligns faster.
'forward_attn_mask': false, // Additional masking forcing monotonicity only in eval mode.
'transition_agent': false, // enable/disable transition agent of forward attention.
'location_attn': true, // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
'bidirectional_decoder': false, // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.

// STOPNET
'stopnet': true, // Train stopnet predicting the end of synthesis.
'separate_stopnet': true, // Train stopnet separately if 'stopnet==true'. It prevents stopnet loss from influencing the rest of the model. It causes a better model, but it trains SLOWER.

// TENSORBOARD and LOGGING
'print_step': 25, // Number of steps to log training on console.
'save_step': 10000, // Number of training steps expected to save training stats and checkpoints.
'checkpoint': true, // If true, it saves checkpoints per 'save_step'
'tb_model_param_stats': false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

// DATA LOADING
'text_cleaner': 'phoneme_cleaners',
'enable_eos_bos_chars': false, // enable/disable beginning of sentence and end of sentence chars.
'num_loader_workers': 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
'num_val_loader_workers': 4, // number of evaluation data loader processes.
'batch_group_size': 0, //Number of batches to shuffle after bucketing.
'min_seq_len': 6, // DATASET-RELATED: minimum text length to use in training
'max_seq_len': 150, // DATASET-RELATED: maximum text length

// PATHS
'output_path': 'A:\Other\Installations\chatbot\Speech_Synthesis\TTS_Final_Load_Test\TTS\Outputs', // DATASET-RELATED: output path for all training outputs.
//'output_path': '/media/erogol/data_ssd/Models/runs/',

// PHONEMES
'phoneme_cache_path': 'ljspeech_ph_cache', // phoneme computation is slow, therefore, it caches results in the given folder.
'use_phonemes': true, // use phonemes instead of raw characters. It is suggested for better pronunciation.
'phoneme_language': 'en-us', // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages
// MULTI-SPEAKER and GST
'use_speaker_embedding': false, // use speaker embedding to enable multi-speaker learning.
'style_wav_for_test': null, // path to style wav file to be used in TacotronGST inference.
'use_gst': false, // TACOTRON ONLY: use global style tokens

// DATASETS
'datasets': // List of datasets. They all merged and they get different speaker_ids.
[
{
'name': 'libri_tts',
'path': 'A:\Other\Installations\chatbot\Speech_Synthesis\TTS_Old_2\TTS\RC_Voice_Source',
//'path': '/home/erogol/Data/LJSpeech-1.1',
'meta_file_train': null,
'meta_file_val': null
}
]

}
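The audio settings above imply the following STFT sizes. This is a small sketch of the arithmetic, assuming the usual millisecond-to-sample conversion and the n_fft formula from _build_mel_basis; `stft_params` is a hypothetical helper, not a function from TTS:

```python
def stft_params(sample_rate, num_freq, frame_length_ms, frame_shift_ms):
    """Derive STFT sizes in samples from the config values."""
    n_fft = (num_freq - 1) * 2
    win_length = int(frame_length_ms / 1000.0 * sample_rate)
    hop_length = int(frame_shift_ms / 1000.0 * sample_rate)
    return n_fft, win_length, hop_length


sample_rate = 24000
n_fft, win_length, hop_length = stft_params(sample_rate, 1025, 50.0, 12.5)
print(n_fft, win_length, hop_length)  # 2048 1200 300

# The assertion in _build_mel_basis: mel_fmax must not exceed Nyquist.
mel_fmax = 8000.0
print(mel_fmax <= sample_rate // 2)  # True (Nyquist is 12000 Hz)
```

With these values the mel basis has shape (80, 1025), and the assertion on mel_fmax passes, so the config alone does not rule out a valid filterbank.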

[This is an archived TTS discussion thread from discourse.mozilla.org/t/svd-not-converging-during-validation-phase]

>>> othiele
[October 4, 2020, 1:23pm]

What branch are you using? Most recent - and working - code is in the
dev branch.

>>> LegendBegins
[October 4, 2020, 3:54pm]

I was using the branch linked with the model (specifically this
one). I ran a quick test
with the latest version, and this model's architecture appears to be
incompatible with the most recent TTS. Is there a commit I missed that
fixes this issue without breaking compatibility?

>>> othiele
[October 4, 2020, 6:17pm]

As I said, don't use anything but dev currently. The code you are
referencing is 8 months old. As long as there is no release, stick to
dev. I tried patching in between, and it is a nightmare. I know it's
hard, but throw everything away and start over.

>>> LegendBegins
[October 4, 2020, 7:12pm]

It's not a matter of throwing away work; switching between versions of
TTS is only a few minutes of effort. I'm using a pretrained model and
fine-tuning it, and those models are only compatible with the
architecture that was present when they were released. In other words,
if I switch to the most recent version of the codebase, I would have to
scrap this model that I didn't generate or have the means to generate,
because each model is locked to the version of Mozilla TTS it was
created in. The reason the models in the wiki provide a link to the
branch they were created under is that they usually stop working
after a significant update to the architecture.

>>> othiele
[October 4, 2020, 7:24pm]

I understand completely and I meant exactly that: throwing away your
previous model.

is already in another role at Mozilla, so he doesn't have much time to
support here, and I don't see why code that was training before doesn't
train now. Did continuing a training run work back then? Maybe try the
old data to see whether this happens, or give some more info, and maybe
erogol has a good answer. But it is old code and a strange error.

>>> sanjaesc
[October 4, 2020, 10:15pm]

Did you try fine-tuning this model?

github.com/mozilla/TTS

#### Model Release: Tacotron2 with Forward Attention - LJSpeech

Model Link: https://drive.google.com/open?id=10ymOlWHutqTtfDYhIbHULn2IKDKP0O9m
Colab example: https://colab.research.google.com/drive/1cpofjnfKSpFhiREgExENIsum4MrqxyPR
This model is trained with Forward Attention enabled until ~400K iters
and then finetuned with Batch Norm ...

>>> LegendBegins
[October 5, 2020, 6:54am]

I just loaded that model and its corresponding version, and it raises
the same exception.

>>> othiele
[October 5, 2020, 8:51am]

Please give some more info. I guess you haven't used your own model?
Then this would suggest that the error stems from one of your
dependencies. Try to load everything in Google Colab, which would be
shareable so we can help, or use GatherUp.

>>> erogol
[October 5, 2020, 12:21pm]

I think SVD is running to estimate the inverse mel filters, which convert
mel-specs to linear-specs before running the GL (Griffin-Lim) algorithm.
I've never encountered that error personally. It is more likely to be
about the numpy or scipy version. Please try to reinstall the environment
using the requirements.txt file, and try different versions if that does
not work.
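For context, the pseudo-inverse that maps mel frames back to linear-frequency bins is computed from a singular value decomposition, which is where the SVD in the error message comes from. A minimal NumPy sketch of that relationship, using random matrices as stand-ins for the mel basis and the mel spectrogram; `pinv_via_svd` and its cutoff are illustrative assumptions rather than NumPy's exact implementation:

```python
import numpy as np


def pinv_via_svd(matrix, rcond=1e-15):
    """Moore-Penrose pseudo-inverse from the SVD: A+ = V diag(1/s) U^T."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Invert only singular values that are not negligible relative to
    # the largest one; the rest stay at zero.
    s_inv = np.zeros_like(s)
    keep = s > rcond * s.max()
    s_inv[keep] = 1.0 / s[keep]
    return (vt.T * s_inv) @ u.T


rng = np.random.default_rng(0)
mel_basis = rng.random((80, 1025))   # stand-in, shape (num_mels, num_freq)
mel_spec = rng.random((80, 10))      # 10 fake mel frames

inv_mel_basis = pinv_via_svd(mel_basis)
linear = np.maximum(1e-10, inv_mel_basis @ mel_spec)
print(linear.shape)  # (1025, 10)
```

The last two lines mirror what _mel_to_linear does: project the mel frames through the pseudo-inverse and clamp the result to a small positive floor.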

>>> LegendBegins
[October 5, 2020, 5:35pm]

Correct, I'm not training a model from scratch, and I don't have the
data necessary to create one (which is why I'm trying to fine-tune an
already successful model). This dataset worked on one of the Tacotron 1
models as well. I'd be more than glad to run the tool, but keep in mind
that I've tried well over a dozen dependency combinations, so it might
not provide a comprehensive view of what I've loaded in previous trials.

I reinstalled the environment from scratch and iterated through versions
of scipy, librosa, and numpy (down to 1.15.0), all of which generate the
same error. I tried modifying torch's version for good measure as well.

Looking at the _mel_to_linear function, the error arises when it tries
to pseudo-invert the mel_basis, which is the following:

librosa.filters.mel(self.sample_rate, n_fft, n_mels=self.num_mels, fmin=self.mel_fmin, fmax=self.mel_fmax)

Substituting in my own parameters results in this:

librosa.filters.mel(24000, 2048, n_mels=80, fmin=0, fmax=8000.0)

I'm not familiar with the internals of this function, but given that
it's provided constants (and that these constants generate a valid
pseudo-invertible matrix in a standalone Python file), I assume that the
output of librosa.filters.mel depends on something else in the
program that isn't passed as an argument. If not, I'm at a loss.
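One way to test that hypothesis is to log a compact summary of the matrix from inside the training process and from the standalone script, then compare the two. A sketch under the assumption that NumPy is available in both environments; `matrix_fingerprint` is a hypothetical helper, and the small arrays below are placeholders for the real mel basis:

```python
import hashlib

import numpy as np


def matrix_fingerprint(matrix):
    """Summarize a matrix so two runs can be compared field by field."""
    matrix = np.ascontiguousarray(matrix)
    finite = np.isfinite(matrix)
    return {
        "shape": matrix.shape,
        "dtype": str(matrix.dtype),
        "non_finite": int((~finite).sum()),
        "min": float(matrix[finite].min()) if finite.any() else None,
        "max": float(matrix[finite].max()) if finite.any() else None,
        "sha256": hashlib.sha256(matrix.tobytes()).hexdigest(),
    }


# Placeholder matrices (assumption): in practice, pass the result of
# self._build_mel_basis() in one run and the standalone matrix in the other.
a = np.linspace(0.0, 1.0, 12).reshape(3, 4)
b = a.copy()
b[1, 2] = np.nan
print(matrix_fingerprint(a)["non_finite"])  # 0
print(matrix_fingerprint(b)["non_finite"])  # 1
print(matrix_fingerprint(a)["sha256"] == matrix_fingerprint(b)["sha256"])  # False
```

If the two fingerprints match, the mel basis is the same in both settings and the difference would have to lie elsewhere, for example in the numerical libraries loaded by each process.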
