Distributed Training failing. #649
Unanswered
luthes asked this question in General Q&A
I seem to be having an issue getting the GPUs working; these are V100s on AWS.
Commands:
Error:
Config:
With a single GPU on the same machine (the one with 8), I don't seem to have any issues. Setting the environment variable to 0, even with 8 GPUs, it starts training, or at least gets past this point. On my machine with a single GPU, I have no issues running

python training_tacotron2.py

with a generic Tacotron2 JSON config that I translated. I'm thinking I'm missing something, or maybe something isn't documented in the Wiki.
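For reference, here is roughly what the working single-GPU run looks like (a minimal sketch: CUDA_VISIBLE_DEVICES is my guess at the environment variable mentioned above, and the PyTorch one-liner is just a generic sanity check, not something from this repo):

```bash
# Sanity check: does PyTorch see all 8 GPUs, and is the NCCL backend available?
python -c "import torch; print(torch.cuda.device_count(), torch.distributed.is_nccl_available())"

# Restrict the process to GPU 0 only; with this set, training starts
# even on the 8-GPU machine.
CUDA_VISIBLE_DEVICES=0 python training_tacotron2.py
```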