
A multi-GPU bug #136

Open
DingJuPeng1 opened this issue Apr 21, 2022 · 4 comments

@DingJuPeng1

If I use batch_bins in ESPnet, it triggers a multi-GPU bug.
For example, with two GPUs and a final batch size of 61 under DataParallel, the batch is split into 30 and 31.
When I then try to use torch-audiomentations, it triggers the error shown below.
[screenshot of the error traceback]

Does the batch size on each card have to be the same, or is there another way to avoid this bug?

Looking forward to a reply.
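For reference, a minimal sketch of the setup being described (the module name and tensor shapes are hypothetical, it assumes two GPUs, and Gain stands in for the actual transforms since it needs no asset files):

```python
import torch
import torch.nn as nn
from torch_audiomentations import Compose, Gain

class AugmentModule(nn.Module):
    """Hypothetical wrapper that applies torch-audiomentations inside forward()."""

    def __init__(self):
        super().__init__()
        self.transforms = Compose(
            transforms=[Gain(min_gain_in_db=-6.0, max_gain_in_db=6.0, p=1.0)]
        )

    def forward(self, samples):
        # samples: (batch, channels, time)
        return self.transforms(samples, sample_rate=16000)

model = nn.DataParallel(AugmentModule().cuda(), device_ids=[0, 1])
# A batch of 61 mono waveforms: DataParallel scatters it unevenly (31 + 30) across the two GPUs.
batch = torch.rand(61, 1, 16000, device="cuda") * 2 - 1
out = model(batch)
```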

@iver56
Collaborator

iver56 commented Apr 21, 2022

Could you provide a snippet of code that reproduces the problem? If possible, make one that doesn't need a multi-GPU setup to reproduce it, as I don't have such a setup available at the moment.

@DingJuPeng1
Author

I apologize, but it is not easy to extract a small piece of reproducible code, because this code is built on ESPnet, which wraps everything.
When I run the same code on a single GPU it works fine, but with two GPUs under DataParallel it triggers this problem. It seems to mix up the two sub-batch sizes, which leads to a shape mismatch.
The problem occurs when I run this code:
[screenshot of the calling code]
"self.transforms" includes "ApplyImpulseResponse" and "AddBackgroundNoise".
[screenshot]
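For context, constructing the transforms mentioned above would look roughly like this (the paths are placeholders, and parameter names follow the torch-audiomentations README, so they may differ between versions):

```python
from torch_audiomentations import Compose, ApplyImpulseResponse, AddBackgroundNoise

# "rirs/" and "noises/" are placeholder directories of impulse responses and noise files.
transforms = Compose(
    transforms=[
        ApplyImpulseResponse(ir_paths="rirs/", p=0.5),
        AddBackgroundNoise(background_paths="noises/", p=0.5),
    ]
)
```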

@iver56
Collaborator

iver56 commented Apr 21, 2022

OK, but without a code example and a multi-GPU setup, I won't be able to reproduce the bug at the moment.

Does this bug apply to all transforms, or just ApplyImpulseResponse and/or AddBackgroundNoise?

Is there a way you can work around it? For example, always give it a batch size that is divisible by your number of GPUs; a fixed batch size should do the trick.
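A sketch of that workaround, assuming a standard PyTorch DataLoader feeds the model (the dataset here is a stand-in): pick a batch size that is a multiple of the GPU count and drop the final partial batch so every scatter is even.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

num_gpus = max(torch.cuda.device_count(), 1)
batch_size = 8 * num_gpus  # always a multiple of the GPU count

dataset = TensorDataset(torch.rand(1000, 1, 16000))  # stand-in for the real dataset
loader = DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,  # discard the final partial batch so it can never split unevenly
)
```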

Or would you like to make a PR that fixes the bug?

Or maybe I should add a known limitation to the README saying that multi-GPU with "uneven" batch sizes isn't officially supported?

@DingJuPeng1
Author

When I use batch sizes that are divisible by the number of GPUs, it works normally. The problem only seems to occur when the per-GPU batch sizes are uneven. Since you can't reproduce the bug, I think I will have a try at fixing it myself first.
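An alternative sketch of the same idea, applied where the batch is built (purely illustrative, with a hypothetical helper): trim each batch to a multiple of the GPU count so DataParallel always scatters equal chunks.

```python
import torch

def trim_to_multiple(batch: torch.Tensor, num_gpus: int) -> torch.Tensor:
    """Drop trailing examples so the batch size is divisible by num_gpus."""
    usable = (batch.shape[0] // num_gpus) * num_gpus
    return batch[:usable]

batch = torch.rand(61, 1, 16000)         # uneven batch from the data pipeline
even_batch = trim_to_multiple(batch, 2)  # 61 -> 60, so each of 2 GPUs gets 30
```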
