Dimension problem with Multi-GPU training in fastspeech2 #4228

IrisLhy · 2022-04-01T04:11:47Z

Describe the bug
The model runs on a single GPU, but fails on multiple GPUs.

Basic environments:

OS information: [e.g., Linux amax 4.15.0-45-generic #48~16.04.1-Ubuntu SMP Tue Jan 29 18:03:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux]
python version: [e.g. 3.8.5 ]
espnet version: [e.g. espnet 1.10.6]
pytorch version [e.g. pytorch 1.7.1]

Python version: 3.8.5
Is CUDA available: Yes
CUDA runtime version: 11.1

Nvidia driver version: 455.23.04

To work around the line
x_masks = make_non_pad_mask(ilens).to(next(self.parameters()).device)
in FastSpeech2, I changed the code to
x_masks = make_non_pad_mask(ilens).to(xs.device)
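
A minimal sketch of the idea behind this change, for readers who want to try it in isolation. The `non_pad_mask` helper below is a hypothetical, simplified stand-in and not the ESPnet `make_non_pad_mask` function. The assumption is that under DataParallel each replica receives its inputs already on its own GPU, so taking the device from `xs` keeps the mask next to the data it masks:

```python
import torch


def non_pad_mask(ilens, device):
    # Hypothetical stand-in for make_non_pad_mask:
    # True where a frame is real, False where it is padding.
    max_len = int(ilens.max())
    positions = torch.arange(max_len, device=device)
    return positions[None, :] < ilens.to(device)[:, None]


xs = torch.zeros(2, 5, dtype=torch.long)  # padded token ids, (batch, time)
ilens = torch.tensor([5, 3])              # true lengths per utterance
x_masks = non_pad_mask(ilens, xs.device)  # the mask lives wherever xs lives
```

The sketch runs on CPU; on a multi-GPU machine `xs.device` would be the replica's own device.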

However, I still get an error similar to the one described in https://discuss.pytorch.org/t/dimension-problem-by-multiple-gpus/76075.

The error log is:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/bme2/.pycharm_helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home/bme2/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/data/data-lhy/emg2speech/05_Mandarin_dataset/fastspeech2/train.py", line 249, in <module>
    output = model(xs=xs, ilens=ilens, ys=ys, olens=olens, ds=ds, ps=ps, es=es,spembs=spembs)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 1 has invalid shape [4, 170, 80], but expected [4, 303, 80]


IrisLhy · 2022-04-06T13:58:05Z

Sorry, I found that
from espnet2.torch_utils.device_funcs import force_gatherable
solves my problem.
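
For anyone hitting the same RuntimeError, here is a minimal CPU-only sketch of what seems to go wrong and two possible workarounds. It is an illustration under assumptions, not the code used in this issue: the tensors are dummy outputs with the shapes from the error message, `torch.cat` stands in for DataParallel's gather step, and `MAX_LEN`, `pad_time`, and `as_gatherable` are hypothetical names. As far as I understand, `force_gatherable` does something similar to `as_gatherable` (moving values to the given device and turning scalars into 1-dim tensors), so that `forward` can return a loss instead of variable-length outputs.

```python
import torch
import torch.nn.functional as F

# Dummy per-replica outputs with the shapes from the error message.
# Each replica padded its sub-batch to its own longest utterance.
out_gpu0 = torch.zeros(4, 303, 80)
out_gpu1 = torch.zeros(4, 170, 80)

# Gathering concatenates along dim 0, so every other dim must match.
# torch.cat is used here as a CPU stand-in for that step.
try:
    torch.cat([out_gpu0, out_gpu1], dim=0)
    gather_failed = False
except RuntimeError:
    gather_failed = True

# Workaround A (assumption): pad the time axis to one fixed length.
MAX_LEN = 303  # hypothetical global maximum length


def pad_time(x, max_len):
    # F.pad lists pads from the last dim backwards:
    # (feat_left, feat_right, time_left, time_right)
    return F.pad(x, (0, 0, 0, max_len - x.size(1)))


gathered = torch.cat(
    [pad_time(out_gpu0, MAX_LEN), pad_time(out_gpu1, MAX_LEN)], dim=0
)


# Workaround B (assumption): return only a loss from forward,
# reshaped to 1-dim so that it can be concatenated across replicas.
def as_gatherable(loss):
    return loss[None] if loss.dim() == 0 else loss


losses = torch.cat(
    [as_gatherable(torch.tensor(0.5)), as_gatherable(torch.tensor(0.7))]
)
```

With workaround B the training script would average `losses` after the model call instead of collecting the full output tensors.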

IrisLhy added the Bug label Apr 1, 2022

IrisLhy closed this as completed Apr 6, 2022
