training does not work #18

Closed
listener17 opened this issue Jul 3, 2023 · 8 comments

listener17 commented Jul 3, 2023

Hi all,

Did anyone manage to start the training?
If yes, could you please share your environment?

I created a separate virtual environment (Python 3.10.11). I'm using CUDA 11.4 on Ubuntu 20.04.2 LTS.
I followed all the instructions:
pip install git+https://github.com/descriptinc/descript-audio-codec

Encoding + decoding works!

Then I did the training prerequisites step:
pip install -e ".[dev]"

When I start training:

export CUDA_VISIBLE_DEVICES=0
python scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

It gets stuck for a long time displaying the output below, and then exits:

[11:50:04] Saving audio samples to TensorBoard                                     decorators.py:220

─────────────────────────────────────────── train_loop() ───────────────────────────────────────────

╭─────────────────────────────────────────── Progress ────────────────────────────────────────────╮
│                                              train                                              │
│                                             ╷                      ╷                            │
│       key                                   │ value                │ mean                       │
│     ╶───────────────────────────────────────┼──────────────────────┼──────────────────────╴     │
│       adv/disc_loss                         │   7.859243           │   7.859243                 │
│       adv/feat_loss                         │  22.607071           │  22.607071                 │
│       adv/gen_loss                          │   7.853064           │   7.853064                 │
│       loss                                  │ 157.455719           │ 157.455719                 │
│       mel/loss                              │   6.766288           │   6.766288                 │
│       other/batch_size                      │  12.000000           │  12.000000                 │
│       other/grad_norm                       │        nan           │   0.000000                 │
│       other/grad_norm_d                     │        inf           │   0.000000                 │
│       other/learning_rate                   │   0.000100           │   0.000100                 │
│       stft/loss                             │   6.870012           │   6.870012                 │
│       vq/codebook_loss                      │   2.315346           │   2.315346                 │
│       vq/commitment_loss                    │   2.315346           │   2.315346                 │
│       waveform/loss                         │   0.116164           │   0.116164                 │
│       time/train_loop                       │  19.523806           │  19.523806                 │
│                                             ╵                      ╵                            │
│                                                                                                 │
│     ⠏ Iteration (train) 1/250000 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:03:17 / -:--:--     │
│     ⠏ Iteration (val)   0/63     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:03:17 / -:--:--     │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯
Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

Any idea why training is not working?

pseeth (Contributor) commented Jul 3, 2023

Hey @listener17, sorry about that! I'll look into it this week.

For now, can you try launching via torchrun, even if on a single GPU? The relevant command is:

torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

Just curious if that works.

listener17 (Author) commented

@pseeth: thanks.
It's still stuck for ages, even though I use a smaller batch size (4) and only one discriminator of each of the two discriminator types.

user@v100:~/user/descript-audio-codec$ torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/
Accelerator(
  amp : bool = False
)


listener17 commented Jul 4, 2023

@pseeth:
If I use this:

export CUDA_VISIBLE_DEVICES=2
torchrun --nproc_per_node 1 scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

I get this error:

Accelerator(
  amp : bool = False
)
Traceback (most recent call last):
  File "/home/user/descript-audio-codec/scripts/train.py", line 433, in <module>
    with Accelerator() as accel:
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/argbind/argbind.py", line 159, in cmd_func
    return func(*cmd_args, **kwargs)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/audiotools/ml/accelerator.py", line 71, in __init__
    torch.cuda.device(self.local_rank) if torch.cuda.is_available() else None
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/cuda/__init__.py", line 312, in __init__
    self.idx = _get_device_index(device, optional=True)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/cuda/_utils.py", line 26, in _get_device_index
    device = torch.device(device)
RuntimeError: Invalid device string: '0'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5276) of binary: /home/user/anaconda3/envs/dac/bin/python
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/dac/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-04_06:07:46
  host      : v100.com.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5276)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
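
For reference, this "Invalid device string: '0'" failure usually means a string local rank (torchrun exports LOCAL_RANK to each worker as an environment variable, which is always a string) is being passed to torch.device without conversion: torch.device('0') raises exactly this error, while torch.device(0) and torch.device('cuda:0') are both valid. A minimal sketch of the distinction; the environment-variable handling below is an illustration, not the audiotools code:

import os
import torch

# torchrun sets LOCAL_RANK in each worker's environment -- always as a string.
local_rank = os.environ.get("LOCAL_RANK", "0")

# torch.device(local_rank)  # fails: RuntimeError: Invalid device string: '0'
device = torch.device(f"cuda:{local_rank}")  # valid: explicit "cuda:N" string
# torch.device(int(local_rank)) is also valid: a bare integer index

if torch.cuda.is_available():
    torch.cuda.set_device(int(local_rank))  # bind this worker to its GPU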

listener17 (Author) commented

@pseeth:
The error seems to be at line 235 of https://github.com/descriptinc/descript-audio-codec/blob/main/scripts/train.py:
out = state.generator(signal.audio_data, signal.sample_rate)

Exception has occurred: TypeError
'int' object is not iterable
  File "/home/user/descript-audio-codec/scripts/train.py", line 235, in train_loop
    out = state.generator(signal.audio_data, signal.sample_rate)
  File "/home/user/descript-audio-codec/scripts/train.py", line 412, in train
    train_loop(state, batch, accel, lambdas)
  File "/home/user/descript-audio-codec/scripts/train.py", line 437, in <module>
    train(args, accel)
TypeError: 'int' object is not iterable
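
One way to narrow this down is to call the generator outside the training loop on random audio; if the bare forward pass also fails, the problem is in the model or the environment rather than in the data pipeline. A minimal sketch, assuming the dac package from this repo is importable and that the generator's forward takes (audio_data, sample_rate) and returns a dict with an "audio" entry, as train.py expects:

import torch
import dac

# Build a generator with default settings (train.py constructs it via argbind instead).
model = dac.DAC()

# One second of random mono audio at 44.1 kHz, shaped (batch, channels, samples).
x = torch.randn(1, 1, 44100)

with torch.no_grad():
    out = model(x, 44100)  # same call pattern as line 235 of scripts/train.py

print(out["audio"].shape)  # assumption: forward returns a dict with an "audio" entry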

listener17 (Author) commented

On a different GPU server, I'm getting a similar (but different) error message at the same place:

Exception has occurred: TypeError
zip argument #1 must support iteration
  File "/home/user/descript-audio-codec/scripts/train.py", line 235, in train_loop
    out = state.generator(signal.audio_data, signal.sample_rate)
  File "/home/user/descript-audio-codec/scripts/train.py", line 412, in train
    train_loop(state, batch, accel, lambdas)
  File "/home/user/descript-audio-codec/scripts/train.py", line 437, in <module>
    train(args, accel)
TypeError: zip argument #1 must support iteration


listener17 commented Jul 7, 2023

@pseeth and @eeishaan:

FYI: python -m pytest tests is also not working.

But if I add:

import sys
sys.stdout.reconfigure(encoding="utf-8")

at the top of https://github.com/descriptinc/descript-audio-codec/blob/main/tests/test_train.py, then python -m pytest tests passes!

I tried the same trick with train.py, but the training still does not work. Maybe it gives you some hints, though.
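
For what it's worth, that workaround points at console encoding: the test and training scripts print progress tables full of box-drawing characters, which can fail to encode on a non-UTF-8 stdout. A slightly more defensive variant of the same trick (a sketch, not the project's code) guards the call and covers stderr as well:

import sys

# Only reconfigure streams that actually support it (captured or redirected
# streams may not be TextIOWrapper instances).
for stream in (sys.stdout, sys.stderr):
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")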

listener17 (Author) commented

I created a clean conda environment, followed your installation steps, and... it was not working.

However, by luck, the training was working in my colleague's (unclean) environment. So I simply used exactly that environment... and the training works now :-)


zaptrem commented Aug 23, 2023

> I created a clean conda environment, followed your installation steps, and... it was not working.
>
> However, by luck, the training was working in my colleague's (unclean) environment. So I simply used exactly that environment... and the training works now :-)

Can you share the environment and reopen the issue? We're hitting the same thing.

Edit: My colleague says adding the following fixed it:

import matplotlib
matplotlib.use('Agg')
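
For anyone else hitting the Tcl_AsyncDelete: async handler deleted by the wrong thread crash above: that message typically comes from matplotlib's interactive Tk backend being torn down from a non-main thread, which the headless Agg backend avoids entirely. The backend switch has to happen before anything imports pyplot; a sketch of where such a switch would typically go (an illustration, not necessarily where my colleague placed it):

# At the very top of scripts/train.py, before any module pulls in matplotlib.pyplot:
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: renders to buffers/files, no Tk windows

import matplotlib.pyplot as plt  # now safe to import; figures can be saved but not shown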
