Dimension problem with Multi-GPU training in fastspeech2 #4228

Closed
IrisLhy opened this issue Apr 1, 2022 · 1 comment
Labels
Bug bug should be fixed

Comments

IrisLhy commented Apr 1, 2022

Describe the bug
The model runs on a single GPU, but fails on multiple GPUs.

Basic environments:

  • OS information: [e.g., Linux amax 4.15.0-45-generic #48~16.04.1-Ubuntu SMP Tue Jan 29 18:03:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux]
  • python version: [e.g. 3.8.5 ]
  • espnet version: [e.g. espnet 1.10.6]
  • pytorch version [e.g. pytorch 1.7.1]

Python version: 3.8.5
Is CUDA available: Yes
CUDA runtime version: 11.1

Nvidia driver version: 455.23.04

To work around the problem with the line
x_masks = make_non_pad_mask(ilens).to(next(self.parameters()).device) in FastSpeech2,
I changed the code to
x_masks = make_non_pad_mask(ilens).to(xs.device)

However, I still hit a problem similar to https://discuss.pytorch.org/t/dimension-problem-by-multiple-gpus/76075.
The error log is:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/bme2/.pycharm_helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home/bme2/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/data/data-lhy/emg2speech/05_Mandarin_dataset/fastspeech2/train.py", line 249, in <module>
    output = model(xs=xs, ilens=ilens, ys=ys, olens=olens, ds=ds, ps=ps, es=es,spembs=spembs)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/bme2/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 1 has invalid shape [4, 170, 80], but expected [4, 303, 80]
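
The shapes in the error are consistent with each DataParallel replica padding its output only to the longest sequence in its own chunk of the batch, so the per-replica outputs cannot be concatenated along the batch dimension. The sketch below illustrates that with shapes alone, in plain Python, so it runs without GPUs. The helper names and the list of output lengths are hypothetical; only 303, 170, 80 and the per-replica batch size of 4 come from the error message.

```python
import math


def scatter_chunks(items, n_replicas):
    """Split a batch into contiguous chunks, similar to how DataParallel splits dim 0."""
    size = math.ceil(len(items) / n_replicas)
    return [items[i:i + size] for i in range(0, len(items), size)]


def replica_output_shape(olens_chunk, odim):
    """Output shape of one replica if it pads only to its own longest sequence."""
    return (len(olens_chunk), max(olens_chunk), odim)


def gatherable(shapes, dim=0):
    """Concatenating along `dim` requires all other dimensions to match."""
    def strip(shape):
        return shape[:dim] + shape[dim + 1:]
    return all(strip(s) == strip(shapes[0]) for s in shapes)


# Hypothetical output lengths for a batch of 8 utterances (assumption).
olens = [303, 250, 210, 180, 170, 150, 120, 90]
odim = 80

shapes = [replica_output_shape(chunk, odim) for chunk in scatter_chunks(olens, 2)]
print(shapes)              # [(4, 303, 80), (4, 170, 80)]
print(gatherable(shapes))  # False

# Padding every replica to one shared length (here the global maximum)
# makes the non-batch dimensions agree.
global_max = max(olens)
padded = [(len(chunk), global_max, odim) for chunk in scatter_chunks(olens, 2)]
print(gatherable(padded))  # True
```

Under this reading, the gather step would succeed if forward returned only values whose shape does not depend on the per-replica maximum length (for example a loss), or if variable-length outputs were padded to a common length before being returned.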

@IrisLhy IrisLhy added the Bug bug should be fixed label Apr 1, 2022

IrisLhy commented Apr 6, 2022

Sorry, I found that
from espnet2.torch_utils.device_funcs import force_gatherable
solves my problem.
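
The thread does not show how force_gatherable was applied. As far as I know, ESPnet2 models call it as force_gatherable(data, device) on the (loss, stats, weight) tuple returned from forward, so that every returned value is a tensor with at least one dimension. The sketch below is a simplified, hypothetical stand-in (to_gatherable is not an ESPnet function) that shows the idea on CPU; the loss and stats values are made up.

```python
import torch


def to_gatherable(value, device):
    """Hypothetical, simplified stand-in for the idea behind force_gatherable."""
    if isinstance(value, dict):
        return {k: to_gatherable(v, device) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(to_gatherable(v, device) for v in value)
    if isinstance(value, float):
        return torch.tensor([value], dtype=torch.float, device=device)
    if isinstance(value, int):
        return torch.tensor([value], dtype=torch.long, device=device)
    if isinstance(value, torch.Tensor):
        # A 0-dim scalar becomes shape (1,), so replicas can be concatenated.
        return (value[None] if value.dim() == 0 else value).to(device)
    return value


# Pretend two replicas each return (loss, stats, batch_size); values are made up.
replica_outputs = [
    to_gatherable((torch.tensor(1.5), {"l1_loss": 0.25}, 4), "cpu"),
    to_gatherable((torch.tensor(2.5), {"l1_loss": 0.75}, 4), "cpu"),
]

# Every replica now contributes a (1,)-shaped loss, so concatenation works.
losses = torch.cat([out[0] for out in replica_outputs])
print(losses.shape)  # torch.Size([2])
```

Note that this only fixes the shape of scalar-like values. It does not pad variable-length tensors, so under this approach forward would return the loss and statistics rather than the padded mel outputs.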

@IrisLhy IrisLhy closed this as completed Apr 6, 2022