
low speedup ratio when using multi-GPUs and multi-machines #4968

Closed
Agoniii opened this issue Feb 10, 2017 · 5 comments

@Agoniii

Agoniii commented Feb 10, 2017

Environment info

- Operating System: CentOS 7.1
- CUDA: 7.5
- MXNet version: v0.9
- Python version: v2.7

Error Message:

I am training an image classification model with a ResNet network on multiple GPUs and multiple machines, but the throughput does not scale as expected. I have summarized the results in the tables below:

multiple GPUs in a single machine

a). batch size per GPU = 32

| kv_store | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|----------|------------|------------|---------------------|---------------|
| local    | 1          | 32         | 95.381              | 1.000         |
| local    | 2          | 64         | 76.490              | 0.802         |
| local    | 4          | 128        | 142.664             | 1.496         |

b). batch size per GPU = 64

| kv_store | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|----------|------------|------------|---------------------|---------------|
| local    | 1          | 64         | 98.701              | 1.000         |
| local    | 2          | 128        | 112.307             | 1.138         |
| local    | 4          | 256        | 217.973             | 2.208         |

2 machines

c). batch size per GPU = 64

| kv_store  | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|-----------|------------|------------|---------------------|---------------|
| dist_sync | 1          | 64         | 73.082              | 1.000         |
| dist_sync | 2          | 128        | 138.605             | 1.897         |
| dist_sync | 4          | 256        | 291.866             | 3.994         |

My problem

  1. I expected throughput to scale roughly linearly with the number of GPUs. But when I increase the number of GPUs from 1 to 2 with kv_store=local, as in tables (a) and (b), throughput does not improve, and with a per-GPU batch size of 32 it actually drops. Going from 1 GPU to 4, throughput improves only slightly (see the quick check after this list).

  2. My results in tables (b) and (c) are inconsistent with the statement in multi_devices.md that "if there are n machines and we use batch size b, then dist_sync behaves equally to local with batch size n*b".
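
For reference, the speedup_ratio column above is just throughput divided by the single-GPU throughput; a minimal check using the table (a) numbers:

```python
# speedup_ratio = speed with n GPUs / speed with 1 GPU (numbers from table (a))
speeds = {1: 95.381, 2: 76.490, 4: 142.664}  # samples/sec, kv_store=local
for n in sorted(speeds):
    print("%d GPU(s): %.3fx (ideal: %dx)" % (n, speeds[n] / speeds[1], n))
# 1 GPU(s): 1.000x (ideal: 1x)
# 2 GPU(s): 0.802x (ideal: 2x)
# 4 GPU(s): 1.496x (ideal: 4x)
```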

Minimum reproducible example

- single machine:

  ```
  python train_imagenet.py --batch-size 256 \
      --num-classes 1000 --num-examples 50000 \
      --gpus 0,1,2,3 --num-epoch 80 --network resnet \
      --kv-store local --data-train data/testimage-val.rec
  ```

- 2 machines (hosts contains the private IPs of the 2 computers):

  ```
  ../../tools/launch.py -H hosts -n 2 \
      python train_imagenet.py --batch-size 256 \
        --num-classes 1000 --num-examples 50000 \
        --gpus 0,1,2,3 --num-epoch 80 --network resnet \
        --kv-store dist_sync --data-train data/testimage-val.rec
  ```
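
For context, here is a minimal sketch of how the --kv-store flag maps onto the MXNet v0.9 Python API; the network and data iterator below are toy placeholders, not the actual train_imagenet.py code:

```python
import mxnet as mx

# --kv-store maps to mx.kvstore.create(); the kvstore determines where and how
# gradients are aggregated: 'local', 'device', 'dist_sync', 'dist_device_sync', ...
kv = mx.kvstore.create('local')

# Toy stand-ins for the ResNet symbol and the record iterator that
# train_imagenet.py builds from --network and --data-train.
data = mx.sym.Variable('data')
net = mx.sym.SoftmaxOutput(mx.sym.FullyConnected(data, num_hidden=1000), name='softmax')
train_iter = mx.io.NDArrayIter(mx.nd.zeros((256, 3, 224, 224)),
                               mx.nd.zeros((256,)), batch_size=64)

# Each batch is split evenly across the listed contexts (here 16 images per GPU),
# and the kvstore reduces the per-GPU gradients after every step.
mod = mx.mod.Module(net, context=[mx.gpu(i) for i in range(4)])
mod.fit(train_iter, num_epoch=1, kvstore=kv)
```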

Steps to reproduce

The output of the commands above:

  1. single machine & 4 gpus & batch-size 256

```
INFO:root:Start training with [gpu(0), gpu(1), gpu(2), gpu(3)]
INFO:root:Epoch[0] Batch [20]	Speed: 240.12 samples/sec	Train-accuracy=0.002148
INFO:root:Epoch[0] Batch [40]	Speed: 223.57 samples/sec	Train-accuracy=0.001367
INFO:root:Epoch[0] Batch [60]	Speed: 222.98 samples/sec	Train-accuracy=0.001563
INFO:root:Epoch[0] Batch [80]	Speed: 218.49 samples/sec	Train-accuracy=0.001953
INFO:root:Epoch[0] Batch [100]	Speed: 219.13 samples/sec	Train-accuracy=0.001367
INFO:root:Epoch[0] Batch [120]	Speed: 216.96 samples/sec	Train-accuracy=0.001172
INFO:root:Epoch[0] Batch [140]	Speed: 216.20 samples/sec	Train-accuracy=0.001758
INFO:root:Epoch[0] Batch [160]	Speed: 217.95 samples/sec	Train-accuracy=0.003320
INFO:root:Epoch[0] Batch [180]	Speed: 218.57 samples/sec	Train-accuracy=0.003320
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=229.735
INFO:root:Epoch[1] Batch [20]	Speed: 225.11 samples/sec	Train-accuracy=0.003516
INFO:root:Epoch[1] Batch [40]	Speed: 217.89 samples/sec	Train-accuracy=0.003906
INFO:root:Epoch[1] Batch [60]	Speed: 218.94 samples/sec	Train-accuracy=0.003711
INFO:root:Epoch[1] Batch [80]	Speed: 217.06 samples/sec	Train-accuracy=0.005078
INFO:root:Epoch[1] Batch [100]	Speed: 218.70 samples/sec	Train-accuracy=0.004687
INFO:root:Epoch[1] Batch [120]	Speed: 218.88 samples/sec	Train-accuracy=0.006250
INFO:root:Epoch[1] Batch [140]	Speed: 221.11 samples/sec	Train-accuracy=0.007031
INFO:root:Epoch[1] Batch [160]	Speed: 217.54 samples/sec	Train-accuracy=0.006836
INFO:root:Epoch[1] Batch [180]	Speed: 217.40 samples/sec	Train-accuracy=0.006250
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=229.037
```

  2. 2 machines & 4 gpus & batch-size 256

```
INFO:root:Start training with [gpu(0), gpu(1), gpu(2), gpu(3)]
INFO:root:Start training with [gpu(0), gpu(1), gpu(2), gpu(3)]
INFO:root:Epoch[0] Batch [20]	Speed: 248.57 samples/sec	Train-accuracy=0.001172
INFO:root:Epoch[0] Batch [20]	Speed: 239.79 samples/sec	Train-accuracy=0.000586
INFO:root:Epoch[0] Batch [40]	Speed: 175.89 samples/sec	Train-accuracy=0.001367
INFO:root:Epoch[0] Batch [40]	Speed: 175.36 samples/sec	Train-accuracy=0.000391
INFO:root:Epoch[0] Batch [60]	Speed: 146.04 samples/sec	Train-accuracy=0.001758
INFO:root:Epoch[0] Batch [60]	Speed: 145.87 samples/sec	Train-accuracy=0.001172
INFO:root:Epoch[0] Batch [80]	Speed: 138.28 samples/sec	Train-accuracy=0.003320
INFO:root:Epoch[0] Batch [80]	Speed: 137.40 samples/sec	Train-accuracy=0.002930
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=157.144
INFO:root:Epoch[0] Resetting Data Iterator
INFO:root:Epoch[0] Time cost=157.551
INFO:root:Epoch[1] Batch [20]	Speed: 144.78 samples/sec	Train-accuracy=0.003320
INFO:root:Epoch[1] Batch [20]	Speed: 145.04 samples/sec	Train-accuracy=0.003906
INFO:root:Epoch[1] Batch [40]	Speed: 134.42 samples/sec	Train-accuracy=0.004102
INFO:root:Epoch[1] Batch [40]	Speed: 135.15 samples/sec	Train-accuracy=0.004687
INFO:root:Epoch[1] Batch [60]	Speed: 135.63 samples/sec	Train-accuracy=0.004687
INFO:root:Epoch[1] Batch [60]	Speed: 134.72 samples/sec	Train-accuracy=0.005469
INFO:root:Epoch[1] Batch [80]	Speed: 135.11 samples/sec	Train-accuracy=0.006641
INFO:root:Epoch[1] Batch [80]	Speed: 135.41 samples/sec	Train-accuracy=0.006055
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=185.209
INFO:root:Epoch[1] Resetting Data Iterator
INFO:root:Epoch[1] Time cost=185.342
```
@piiswrong
Contributor

Use `device` instead of `local`.

@mli
Member

mli commented Feb 10, 2017

Please try local -> device and dist_sync -> dist_device_sync. Adding device makes MXNet try to use GPU peer-to-peer (P2P) communication for gradient aggregation; see the command sketch below.
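
Concretely, that amounts to re-running the commands from the reproducible example with only the kv-store flag changed (a sketch; every other flag as in the original commands):

```
python train_imagenet.py --batch-size 256 \
    --num-classes 1000 --num-examples 50000 \
    --gpus 0,1,2,3 --num-epoch 80 --network resnet \
    --kv-store device --data-train data/testimage-val.rec

../../tools/launch.py -H hosts -n 2 \
    python train_imagenet.py --batch-size 256 \
      --num-classes 1000 --num-examples 50000 \
      --gpus 0,1,2,3 --num-epoch 80 --network resnet \
      --kv-store dist_device_sync --data-train data/testimage-val.rec
```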

@Agoniii
Author

Agoniii commented Feb 11, 2017

@piiswrong @mli I also tried device and dist_device_sync, and with them throughput does scale roughly linearly with the number of GPUs. But I still have several questions; please help me.

My questions

  1. Why is the performance of local not good? Is that normal?
  2. How should I choose the batch size when using multiple machines? Does dist_sync with batch size 32 on each of 2 machines behave equally to local with batch size 64 on a single machine (multi_devices.md)? My results in tables (b) and (c) of my first comment do not show that; see the quick check after this list.
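
A quick check of that claim against the first comment's numbers, reading it as a statement about throughput (both rows below have a total batch size of 128):

```python
local_b128 = 112.307  # samples/sec, kv_store=local, batch_size=128 (table (b))
dist_b128 = 138.605   # samples/sec, kv_store=dist_sync, batch_size=128 (table (c))
print(dist_b128 / local_b128)  # ~1.234, not ~1.0, hence the inconsistency
```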

Results

a). single machine with batch size per GPU = 32 and kv_store = 'device'

| kv_store | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|----------|------------|------------|---------------------|---------------|
| device   | 1          | 32         | 96.999              | 1.000         |
| device   | 2          | 64         | 188.268             | 1.941         |
| device   | 4          | 128        | 379.780             | 3.915         |

b). single machine with batch size per GPU = 64 and kv_store = 'device'

| kv_store | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|----------|------------|------------|---------------------|---------------|
| device   | 1          | 64         | 98.709              | 1.000         |
| device   | 2          | 128        | 197.371             | 2.000         |
| device   | 4          | 256        | 393.581             | 3.987         |

c). 2 machines with batch size per GPU = 32 and kv_store = 'dist_device_sync'

| kv_store         | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|------------------|------------|------------|---------------------|---------------|
| dist_device_sync | 2          | 128        | 120.133             | 1.000         |
| dist_device_sync | 4          | 256        | 243.574             | 2.028         |

d). 2 machines with batch size per GPU = 64 and kv_store = 'dist_device_sync'

| kv_store         | num_of_gpu | batch_size | speed (samples/sec) | speedup_ratio |
|------------------|------------|------------|---------------------|---------------|
| dist_device_sync | 1          | 64         | 74.280              | 1.000         |
| dist_device_sync | 2          | 128        | 139.467             | 1.878         |
| dist_device_sync | 4          | 256        | 392.148             | 5.279         |

@Vogen

Vogen commented Apr 1, 2017

@Agoniii How about the train and test accuracy?

@szha
Member

szha commented Sep 29, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!

@szha szha closed this as completed Sep 29, 2017