
low speedup ratio when using multi-GPUs and multi-machines #4968

Closed
Agoniii opened this issue Feb 10, 2017 · 5 comments

@Agoniii

Agoniii commented Feb 10, 2017

Environment info

- Operating System: CentOS 7.1
- CUDA: 7.5
- MXNet version: v0.9
- Python version: v2.7

Error Message:

I am training an image classification model with a ResNet network on multiple GPUs and multiple machines, but the throughput does not scale as expected. I have summarized the results in the tables below:

multiple GPUs in a single machine

a). batch size per GPU = 32

| kv_store | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|----------|------------|------------|---------------------|---------------|
| local    | 1          | 32         | 95.381              | 1.000         |
| local    | 2          | 64         | 76.490              | 0.802         |
| local    | 4          | 128        | 142.664             | 1.496         |

b). batch size per GPU = 64

| kv_store | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|----------|------------|------------|---------------------|---------------|
| local    | 1          | 64         | 98.701              | 1.000         |
| local    | 2          | 128        | 112.307             | 1.138         |
| local    | 4          | 256        | 217.973             | 2.208         |

2 machines

c). batch size per GPU = 64

| kv_store  | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|-----------|------------|------------|---------------------|---------------|
| dist_sync | 1          | 64         | 73.082              | 1.000         |
| dist_sync | 2          | 128        | 138.605             | 1.897         |
| dist_sync | 4          | 256        | 291.866             | 3.994         |

My problem

  1. I expected throughput to scale roughly linearly with the number of GPUs. But when I increase the number of GPUs from 1 to 2 with kv_store=local, as in tables (a) and (b), throughput does not improve, and with a per-GPU batch size of 32 it actually drops. Going from 1 GPU to 4, throughput improves only slightly (see the quick check after this list).

  2. My results in tables (b) and (c) are inconsistent with the statement in multi_devices.md that "if there are n machines and we use batch size b, then dist_sync behaves equally to local with batch size n*b".
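
For reference, the speedup_ratio column above is just throughput divided by the single-GPU throughput; a minimal check using the table (a) numbers:

```python
# speedup_ratio = speed with n GPUs / speed with 1 GPU (numbers from table (a))
speeds = {1: 95.381, 2: 76.490, 4: 142.664}  # samples/sec, kv_store=local
for n in sorted(speeds):
    print("%d GPU(s): %.3fx (ideal: %dx)" % (n, speeds[n] / speeds[1], n))
# 1 GPU(s): 1.000x (ideal: 1x)
# 2 GPU(s): 0.802x (ideal: 2x)
# 4 GPU(s): 1.496x (ideal: 4x)
```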

Minimum reproducible example

- single machine:

  ```
  python train_imagenet.py --batch-size 256 \
      --num-classes 1000 --num-examples 50000 \
      --gpus 0,1,2,3 --num-epoch 80 --network resnet \
      --kv-store local --data-train data/testimage-val.rec
  ```

- 2 machines (hosts contains the private IPs of the 2 computers):

  ```
  ../../tools/launch.py -H hosts -n 2 \
      python train_imagenet.py --batch-size 256 \
        --num-classes 1000 --num-examples 50000 \
        --gpus 0,1,2,3 --num-epoch 80 --network resnet \
        --kv-store dist_sync --data-train data/testimage-val.rec
  ```
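
For context, here is a minimal sketch of how the --kv-store flag maps onto the MXNet v0.9 Python API; the network and data iterator below are toy placeholders, not the actual train_imagenet.py code:

```python
import mxnet as mx

# --kv-store maps to mx.kvstore.create(); the kvstore determines where and how
# gradients are aggregated: 'local', 'device', 'dist_sync', 'dist_device_sync', ...
kv = mx.kvstore.create('local')

# Toy stand-ins for the ResNet symbol and the record iterator that
# train_imagenet.py builds from --network and --data-train.
data = mx.sym.Variable('data')
net = mx.sym.SoftmaxOutput(mx.sym.FullyConnected(data, num_hidden=1000), name='softmax')
train_iter = mx.io.NDArrayIter(mx.nd.zeros((256, 3, 224, 224)),
                               mx.nd.zeros((256,)), batch_size=64)

# Each batch is split evenly across the listed contexts (here 16 images per GPU),
# and the kvstore reduces the per-GPU gradients after every step.
mod = mx.mod.Module(net, context=[mx.gpu(i) for i in range(4)])
mod.fit(train_iter, num_epoch=1, kvstore=kv)
```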

Steps to reproduce

The output of the commands above:

  1. single machine & 4 gpus & batch-size 256

```
INFO:root:Start training with [gpu(0), gpu(1), gpu(2), gpu(3)]
INFO:root:Epoch[0] Batch [20]	Speed: 240.12 samples/sec	Train-accuracy=0.002148
INFO:root:Epoch[0] Batch [40]	Speed: 223.57 samples/sec	Train-accuracy=0.001367
INFO:root:Epoch[0] Batch [60]	Speed: 222.98 samples/sec	Train-accuracy=0.001563
INFO:root:Epoch[0] Batch [80]	Speed: 218.49 samples/sec	Train-accuracy=0.001953
INFO:root:Epoch[0] Batch [100]	Speed: 219.13 samples/sec	Train-accuracy=0.001367
INFO:root:Epoch[0] Batch [120]	Speed: 216.96 samples/sec	Train-accuracy=0.001172
INFO:root:Epoch[0] Batch [140]	Speed: 216.20 samples/sec	Train-accuracy=0.001758
INFO:root:Epoch[0] Batch [160]	Speed: 217.95 samples/sec	Train-accuracy=0.003320
INFO:root:Epoch[0] Batch [180]	Speed: 218.57 samples/sec	Train-accuracy=0.003320
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=229.735
INFO:root:Epoch[1] Batch [20]	Speed: 225.11 samples/sec	Train-accuracy=0.003516
INFO:root:Epoch[1] Batch [40]	Speed: 217.89 samples/sec	Train-accuracy=0.003906
INFO:root:Epoch[1] Batch [60]	Speed: 218.94 samples/sec	Train-accuracy=0.003711
INFO:root:Epoch[1] Batch [80]	Speed: 217.06 samples/sec	Train-accuracy=0.005078
INFO:root:Epoch[1] Batch [100]	Speed: 218.70 samples/sec	Train-accuracy=0.004687
INFO:root:Epoch[1] Batch [120]	Speed: 218.88 samples/sec	Train-accuracy=0.006250
INFO:root:Epoch[1] Batch [140]	Speed: 221.11 samples/sec	Train-accuracy=0.007031
INFO:root:Epoch[1] Batch [160]	Speed: 217.54 samples/sec	Train-accuracy=0.006836
INFO:root:Epoch[1] Batch [180]	Speed: 217.40 samples/sec	Train-accuracy=0.006250
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=229.037
```

  2. 2 machines & 4 gpus & batch-size 256

```
INFO:root:Start training with [gpu(0), gpu(1), gpu(2), gpu(3)]
INFO:root:Start training with [gpu(0), gpu(1), gpu(2), gpu(3)]
INFO:root:Epoch[0] Batch [20]	Speed: 248.57 samples/sec	Train-accuracy=0.001172
INFO:root:Epoch[0] Batch [20]	Speed: 239.79 samples/sec	Train-accuracy=0.000586
INFO:root:Epoch[0] Batch [40]	Speed: 175.89 samples/sec	Train-accuracy=0.001367
INFO:root:Epoch[0] Batch [40]	Speed: 175.36 samples/sec	Train-accuracy=0.000391
INFO:root:Epoch[0] Batch [60]	Speed: 146.04 samples/sec	Train-accuracy=0.001758
INFO:root:Epoch[0] Batch [60]	Speed: 145.87 samples/sec	Train-accuracy=0.001172
INFO:root:Epoch[0] Batch [80]	Speed: 138.28 samples/sec	Train-accuracy=0.003320
INFO:root:Epoch[0] Batch [80]	Speed: 137.40 samples/sec	Train-accuracy=0.002930
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=157.144
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=157.551
INFO:root:Epoch[1] Batch [20]	Speed: 144.78 samples/sec	Train-accuracy=0.003320
INFO:root:Epoch[1] Batch [20]	Speed: 145.04 samples/sec	Train-accuracy=0.003906
INFO:root:Epoch[1] Batch [40]	Speed: 134.42 samples/sec	Train-accuracy=0.004102
INFO:root:Epoch[1] Batch [40]	Speed: 135.15 samples/sec	Train-accuracy=0.004687
INFO:root:Epoch[1] Batch [60]	Speed: 135.63 samples/sec	Train-accuracy=0.004687
INFO:root:Epoch[1] Batch [60]	Speed: 134.72 samples/sec	Train-accuracy=0.005469
INFO:root:Epoch[1] Batch [80]	Speed: 135.11 samples/sec	Train-accuracy=0.006641
INFO:root:Epoch[1] Batch [80]	Speed: 135.41 samples/sec	Train-accuracy=0.006055
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=185.209
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=185.342
```
@piiswrong
Contributor

Use `device` instead of `local`.

@mli
Member

mli commented Feb 10, 2017

Please try local -> device and dist_sync -> dist_device_sync. Adding device makes MXNet try to use GPU peer-to-peer (P2P) communication for gradient aggregation; see the command sketch below.
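
Concretely, that amounts to re-running the commands from the reproducible example with only the kv-store flag changed (a sketch; every other flag as in the original commands):

```
python train_imagenet.py --batch-size 256 \
    --num-classes 1000 --num-examples 50000 \
    --gpus 0,1,2,3 --num-epoch 80 --network resnet \
    --kv-store device --data-train data/testimage-val.rec

../../tools/launch.py -H hosts -n 2 \
    python train_imagenet.py --batch-size 256 \
      --num-classes 1000 --num-examples 50000 \
      --gpus 0,1,2,3 --num-epoch 80 --network resnet \
      --kv-store dist_device_sync --data-train data/testimage-val.rec
```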

@Agoniii
Author

Agoniii commented Feb 11, 2017

@piiswrong @mli I also tried device and dist_device_sync, and with them throughput does scale roughly linearly with the number of GPUs. But I still have several questions; please help me.

My questions

  1. Why is the performance of local not good? Is that normal?
  2. How should I choose the batch size when using multiple machines? Does dist_sync with batch size 32 on each of 2 machines behave equally to local with batch size 64 on a single machine (multi_devices.md)? My results in tables (b) and (c) of my first comment do not show that; see the quick check after this list.
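
A quick check of that claim against the first comment's numbers, reading it as a statement about throughput (both rows below have a total batch size of 128):

```python
local_b128 = 112.307  # samples/sec, kv_store=local, batch_size=128 (table (b))
dist_b128 = 138.605   # samples/sec, kv_store=dist_sync, batch_size=128 (table (c))
print(dist_b128 / local_b128)  # ~1.234, not ~1.0, hence the inconsistency
```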

Results

a). single machine with batch size per GPU = 32 and kv_store = 'device'

| kv_store | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|----------|------------|------------|---------------------|---------------|
| device   | 1          | 32         | 96.999              | 1.000         |
| device   | 2          | 64         | 188.268             | 1.941         |
| device   | 4          | 128        | 379.780             | 3.915         |

b). single machine with batch size per GPU = 64 and kv_store = 'device'

| kv_store | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|----------|------------|------------|---------------------|---------------|
| device   | 1          | 64         | 98.709              | 1.000         |
| device   | 2          | 128        | 197.371             | 2.000         |
| device   | 4          | 256        | 393.581             | 3.987         |

c). 2 machines with batch size per GPU = 32 and kv_store = 'dist_device_sync'

| kv_store         | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|------------------|------------|------------|---------------------|---------------|
| dist_device_sync | 2          | 128        | 120.133             | 1.000         |
| dist_device_sync | 4          | 256        | 243.574             | 2.028         |

d). 2 machines with batch size per GPU = 64 and kv_store = 'dist_device_sync'

| kv_store         | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|------------------|------------|------------|---------------------|---------------|
| dist_device_sync | 1          | 64         | 74.280              | 1.000         |
| dist_device_sync | 2          | 128        | 139.467             | 1.878         |
| dist_device_sync | 4          | 256        | 392.148             | 5.279         |

@Vogen

Vogen commented Apr 1, 2017

@Agoniii How about the train and test accuracy?

@szha
Member

szha commented Sep 29, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!

@szha szha closed this as completed Sep 29, 2017