training does not work #18

Closed
listener17 opened this issue Jul 3, 2023 · 8 comments

listener17 commented Jul 3, 2023

Hi all,

Did anyone manage to start the training?
If yes, could you please share your environment?

I created a separate virtual environment (Python 3.10.11). I'm using CUDA 11.4 on Ubuntu 20.04.2 LTS.
I followed all the instructions:
pip install git+https://github.com/descriptinc/descript-audio-codec

Encoding + decoding works!

Then I did the training prerequisites step:
pip install -e ".[dev]"

When I start training:

export CUDA_VISIBLE_DEVICES=0
python scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

It gets stuck for a long time displaying the output below, and then exits:

[11:50:04] Saving audio samples to TensorBoard                                     decorators.py:220

─────────────────────────────────────────── train_loop() ───────────────────────────────────────────

╭─────────────────────────────────────────── Progress ────────────────────────────────────────────╮
│                                              train                                              │
│                                             ╷                      ╷                            │
│       key                                   │ value                │ mean                       │
│     ╶───────────────────────────────────────┼──────────────────────┼──────────────────────╴     │
│       adv/disc_loss                         │   7.859243           │   7.859243                 │
│       adv/feat_loss                         │  22.607071           │  22.607071                 │
│       adv/gen_loss                          │   7.853064           │   7.853064                 │
│       loss                                  │ 157.455719           │ 157.455719                 │
│       mel/loss                              │   6.766288           │   6.766288                 │
│       other/batch_size                      │  12.000000           │  12.000000                 │
│       other/grad_norm                       │        nan           │   0.000000                 │
│       other/grad_norm_d                     │        inf           │   0.000000                 │
│       other/learning_rate                   │   0.000100           │   0.000100                 │
│       stft/loss                             │   6.870012           │   6.870012                 │
│       vq/codebook_loss                      │   2.315346           │   2.315346                 │
│       vq/commitment_loss                    │   2.315346           │   2.315346                 │
│       waveform/loss                         │   0.116164           │   0.116164                 │
│       time/train_loop                       │  19.523806           │  19.523806                 │
│                                             ╵                      ╵                            │
│                                                                                                 │
│     ⠏ Iteration (train) 1/250000 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:03:17 / -:--:--     │
│     ⠏ Iteration (val)   0/63     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:03:17 / -:--:--     │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

Any idea why training is not working?

pseeth (Contributor) commented Jul 3, 2023

Hey @listener17, sorry about that! I'll look into it this week.

For now, can you try launching via torchrun, even if on a single GPU? The relevant command is:

torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

Just curious if that works.

listener17 (Author) commented

@pseeth: thanks.
It's still stuck for ages, even though I use a smaller batch size (4) and only one discriminator of each of the two discriminator types.

user@v100:~/user/descript-audio-codec$ torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/
Accelerator(
  amp : bool = False
)


listener17 commented Jul 4, 2023

@pseeth:
If I use this:

export CUDA_VISIBLE_DEVICES=2
torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

I get this error:

Accelerator(
  amp : bool = False
)
Traceback (most recent call last):
  File "/home/user/descript-audio-codec/scripts/train.py", line 433, in <module>
    with Accelerator() as accel:
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/argbind/argbind.py", line 159, in cmd_func
    return func(*cmd_args, **kwargs)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/audiotools/ml/accelerator.py", line 71, in __init__
    torch.cuda.device(self.local_rank) if torch.cuda.is_available() else None
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in __init__
    self.idx = _get_device_index(device, optional=True)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/cuda/_utils.py", line 26, in _get_device_index
    device = torch.device(device)
RuntimeError: Invalid device string: '0'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5276) of binary: /home/user/anaconda3/envs/dac/bin/python
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/dac/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-04_06:07:46
  host      : v100.com.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5276)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
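
For reference, this "Invalid device string: '0'" failure usually means a string local rank (torchrun exports LOCAL_RANK to each worker as an environment variable, which is always a string) is being passed to torch.device without conversion: torch.device('0') raises exactly this error, while torch.device(0) and torch.device('cuda:0') are both valid. A minimal sketch of the distinction; the environment-variable handling below is an illustration, not the audiotools code:

import os
import torch

# torchrun sets LOCAL_RANK in each worker's environment -- always as a string.
local_rank = os.environ.get("LOCAL_RANK", "0")

# torch.device(local_rank)  # fails: RuntimeError: Invalid device string: '0'
device = torch.device(f"cuda:{local_rank}")  # valid: explicit "cuda:N" string
# torch.device(int(local_rank)) is also valid: a bare integer index

if torch.cuda.is_available():
    torch.cuda.set_device(int(local_rank))  # bind this worker to its GPU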

listener17 (Author) commented

@pseeth:
The error seems to be at line 235 of https://github.com/descriptinc/descript-audio-codec/blob/main/scripts/train.py:
out = state.generator(signal.audio_data, signal.sample_rate)

Exception has occurred: TypeError
'int' object is not iterable
  File "/home/user/descript-audio-codec/scripts/train.py", line 235, in train_loop
    out = state.generator(signal.audio_data, signal.sample_rate)
  File "/home/user/descript-audio-codec/scripts/train.py", line 412, in train
    train_loop(state, batch, accel, lambdas)
  File "/home/user/descript-audio-codec/scripts/train.py", line 437, in <module>
    train(args, accel)
TypeError: 'int' object is not iterable
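
One way to narrow this down is to call the generator outside the training loop on random audio; if the bare forward pass also fails, the problem is in the model or the environment rather than in the data pipeline. A minimal sketch, assuming the dac package from this repo is importable and that the generator's forward takes (audio_data, sample_rate) and returns a dict with an "audio" entry, as train.py expects:

import torch
import dac

# Build a generator with default settings (train.py constructs it via argbind instead).
model = dac.DAC()

# One second of random mono audio at 44.1 kHz, shaped (batch, channels, samples).
x = torch.randn(1, 1, 44100)

with torch.no_grad():
    out = model(x, 44100)  # same call pattern as line 235 of scripts/train.py

print(out["audio"].shape)  # assumption: forward returns a dict with an "audio" entry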

listener17 (Author) commented

On a different GPU server, I'm getting a similar (but different) error message at the same place:

Exception has occurred: TypeError
zip argument #1 must support iteration
  File "/home/user/descript-audio-codec/scripts/train.py", line 235, in train_loop
    out = state.generator(signal.audio_data, signal.sample_rate)
  File "/home/user/descript-audio-codec/scripts/train.py", line 412, in train
    train_loop(state, batch, accel, lambdas)
  File "/home/user/descript-audio-codec/scripts/train.py", line 437, in <module>
    train(args, accel)
TypeError: zip argument #1 must support iteration


listener17 commented Jul 7, 2023

@pseeth and @eeishaan:

FYI: python -m pytest tests is also not working.

But if I add:

import sys
sys.stdout.reconfigure(encoding="utf-8")

at the top of https://github.com/descriptinc/descript-audio-codec/blob/main/tests/test_train.py, then python -m pytest tests passes!

I tried the same trick with train.py, but the training still does not work. Maybe it gives you some hints, though.
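
For what it's worth, that workaround points at console encoding: the test and training scripts print progress tables full of box-drawing characters, which can fail to encode on a non-UTF-8 stdout. A slightly more defensive variant of the same trick (a sketch, not the project's code) guards the call and covers stderr as well:

import sys

# Only reconfigure streams that actually support it (captured or redirected
# streams may not be TextIOWrapper instances).
for stream in (sys.stdout, sys.stderr):
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")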

listener17 (Author) commented

I created a clean conda environment, followed your installation steps, and... it was not working.

However, by luck, the training was working in my colleague's (unclean) environment. So I simply used exactly that environment... and the training works now :-)


zaptrem commented Aug 23, 2023

> I created a clean conda environment, followed your installation steps, and... it was not working.
>
> However, by luck, the training was working in my colleague's (unclean) environment. So I simply used exactly that environment... and the training works now :-)

Can you share the environment and reopen the issue? We're hitting the same thing.

Edit: My colleague says adding the following fixed it:

import matplotlib
matplotlib.use('Agg')
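
For anyone else hitting the Tcl_AsyncDelete: async handler deleted by the wrong thread crash above: that message typically comes from matplotlib's interactive Tk backend being torn down from a non-main thread, which the headless Agg backend avoids entirely. The backend switch has to happen before anything imports pyplot; a sketch of where such a switch would typically go (an illustration, not necessarily where my colleague placed it):

# At the very top of scripts/train.py, before any module pulls in matplotlib.pyplot:
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: renders to buffers/files, no Tk windows

import matplotlib.pyplot as plt  # now safe to import; figures can be saved but not shown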
