low speedup ratio when using multi-GPUs and multi-machines #4968
Use `device` instead of `local`.
Please try `local` -> `device` and `dist_sync` -> `dist_device_sync`.
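The rename suggested above can be summarized as a small lookup table; a minimal sketch (the helper function is illustrative, not part of MXNet):

```python
# Mapping from single-machine kv_store types to their distributed
# counterparts, as suggested in the comment above. The 'device' variants
# aggregate gradients on GPU, which usually scales better than doing
# the aggregation on CPU.
SINGLE_TO_DIST = {
    "local": "dist_sync",
    "device": "dist_device_sync",
}

def distributed_kvstore(single_machine_store):
    """Return the distributed kv_store name matching a single-machine one."""
    return SINGLE_TO_DIST[single_machine_store]

print(distributed_kvstore("device"))  # dist_device_sync
```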
@piiswrong @mli I also tried the suggested settings.

##### My question
##### Results
a). single machine with batch size per GPU = 32 and kv_store = 'device'
b). single machine with batch size per GPU = 64 and kv_store = 'device'
c). 2 machines with batch size per GPU = 32 and kv_store = 'dist_device_sync'
d). 2 machines with batch size per GPU = 64 and kv_store = 'dist_device_sync'
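The speedup ratio discussed in this issue can be computed from measured throughput; a small sketch (the img/sec values below are hypothetical placeholders, not numbers from the tables):

```python
# Hypothetical throughput numbers (images/sec) for illustration only.
def speedup(throughput_1gpu, throughput_ngpu):
    """Speedup ratio relative to a single GPU."""
    return throughput_ngpu / throughput_1gpu

def efficiency(throughput_1gpu, throughput_ngpu, n):
    """Parallel efficiency: 1.0 means perfectly linear scaling."""
    return speedup(throughput_1gpu, throughput_ngpu) / n

# e.g. if 1 GPU processed 100 img/s and 4 GPUs processed 260 img/s:
print(speedup(100.0, 260.0))        # 2.6
print(efficiency(100.0, 260.0, 4))  # 0.65
```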
@Agoniii How about train and test accuracy?
This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
##### Environment info
Operating System: CentOS 7.1
CUDA version: 7.5
MXNet version: v0.9
Python version: v2.7
##### Error Message
I am trying to use multiple GPUs and multiple machines to train an image classification task with a ResNet network, but the performance is not as expected. I have summarized the data in the tables below:
multiple GPUs in a single machine
a). batch size per GPU = 32
b). batch size per GPU = 64
2 machines
c). batch size per GPU = 64
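If the `dist_sync` equivalence claimed in multi_devices.md holds (n machines at batch size b behave like `local` at batch size n*b), tables (b) and (c) should be directly comparable; a minimal sketch of the effective-batch-size arithmetic (the helper name is my own):

```python
# Effective batch size under dist_sync: n machines, each running a
# per-machine batch of b, should behave like `local` with batch n * b.
def effective_batch_size(num_machines, batch_per_machine):
    return num_machines * batch_per_machine

# Table (c): 2 machines, 4 GPUs each, 64 images per GPU -> 256 per machine.
per_machine = 4 * 64
print(effective_batch_size(2, per_machine))  # 512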
##### My problem
I expect performance to scale roughly linearly with the number of GPUs. But when I increase the number of GPUs from 1 to 2 with `kv_store=local`, as in tables (a) and (b), the performance is not improved but is actually worse when the batch size is 32. Going from 1 to 4 GPUs, the performance is only slightly better. My results in tables (b) and (c) are also inconsistent with "if there are n machines and we use batch size b, then `dist_sync` behaves equally to `local` with batch size n*b" (multi_devices.md).

##### Minimum reproducible example
```
python train_imagenet.py --batch-size 256 \
    --num-classes 1000 --num-examples 50000 \
    --gpus 0,1,2,3 --num-epoch 80 --network resnet \
    --kv-store local --data-train data/testimage-val.rec
```
`hosts` contains the private IPs of the 2 computers.

##### Steps to reproduce
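For the 2-machine run, MXNet distributed jobs are normally started through `tools/launch.py`; a minimal sketch assuming the `hosts` file above (exact flags may differ across MXNet versions, and the IPs are placeholders):

```
# hosts file: one private IP per line, e.g.
#   192.168.0.1
#   192.168.0.2
python tools/launch.py -n 2 -H hosts \
    python train_imagenet.py --batch-size 256 \
        --gpus 0,1,2,3 --network resnet \
        --kv-store dist_device_sync \
        --data-train data/testimage-val.rec
```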
The output:
single machine & 4 GPUs & batch-size 256
2 machines & 4 GPUs & batch-size 256