
Balance Allocation for Multiple Cards On Device KVStore #252

Closed
junranhe opened this issue Oct 10, 2015 · 7 comments

Comments

@junranhe
Contributor

I'm training an 11-layer VGGNet with a total batch size of 36 on three GTX 780 GPUs, with conv workspace set to 256 and CUDA 7.0, but without cuDNN (I figured this avoids some nondeterminism in memory allocation). I see GPU memory usage of 2795 MB, 2661 MB, and 2383 MB. What causes this decreasing difference of 100+ MB between cards? I'd like the difference to be smaller so that memory utilization is higher: since the cards are identical, usable memory is bounded by the most-loaded GPU, and with a batch size that is too small my training sometimes fails to converge :(

@tqchen tqchen changed the title from "可不可以让显存分配均匀一些?" ("Can the GPU memory allocation be made more even?") to Balance Allocation for Multiple Cards on Oct 10, 2015
@tqchen tqchen changed the title from Balance Allocation for Multiple Cards to Balance Allocation for Multiple Cards On Device KVStore on Oct 10, 2015
@tqchen
Member

tqchen commented Oct 10, 2015

This could be due to the memory allocation policy used by the distributed KVStore under mode kvstore_type = 'device'. I guess this should not be the case for kvstore_type = 'local'.

With a device-type KVStore, we need to allocate the temporary reduction memory on each of the devices. We currently do this by random assignment, here: https://github.com/dmlc/mxnet/blob/master/src/kvstore/kvstore_device.h#L36, to balance the temporary weight memory across devices.

If the weights are not uniformly sized (e.g., one weight is a particularly big chunk), this can cause the imbalance.
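For illustration, here is a minimal C++ sketch of the random-assignment scheme described above. This is not the actual kvstore_device.h code; the function name, the vector of per-key sizes, and the seed parameter are all hypothetical:

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical sketch: randomly pick a device for each key's temporary
// reduction buffer. A fixed seed makes the assignment reproducible.
std::vector<int> RandomAssign(const std::vector<int64_t>& sizes,
                              int num_devices, uint32_t seed = 0) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> pick(0, num_devices - 1);
  std::vector<int> device_of(sizes.size());
  for (size_t i = 0; i < sizes.size(); ++i) {
    device_of[i] = pick(rng);  // ignores sizes, so large weights can cluster
  }
  return device_of;
}
```

Because the draw ignores the sizes, a single very large weight landing on one device produces exactly the kind of imbalance observed above.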

@junranhe
Contributor Author

Yes, I'm using kvstore=device. Could we add a more deterministic allocation strategy? For example, a greedy approach: sort the weights from largest to smallest and always place each one on the device that currently holds the least total weight. I think many people would like to fill up GPU memory as much as possible, and with random assignment there's always the worry of occasionally running out of memory. Even if it stays random, it would be best to have a default seed for the allocation, so we don't have to worry about memory usage differing from run to run :)
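For illustration, a minimal C++ sketch of the greedy strategy proposed here (sort weights from largest to smallest, always place the next one on the currently least-loaded device); the interface is hypothetical, not MXNet's actual API:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical sketch of the proposed greedy balancing (the classic LPT
// rule): process weights in decreasing size order and assign each to the
// device whose total assigned size is currently smallest.
std::vector<int> GreedyAssign(const std::vector<int64_t>& sizes,
                              int num_devices) {
  std::vector<size_t> order(sizes.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](size_t a, size_t b) { return sizes[a] > sizes[b]; });
  std::vector<int64_t> load(num_devices, 0);
  std::vector<int> device_of(sizes.size());
  for (size_t i : order) {
    int dev = static_cast<int>(
        std::min_element(load.begin(), load.end()) - load.begin());
    device_of[i] = dev;  // deterministic: same placement on every run
    load[dev] += sizes[i];
  }
  return device_of;
}
```

Since the placement is fully deterministic, it also removes the run-to-run variation that a seedless random assignment has.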

@tqchen
Member

tqchen commented Oct 10, 2015

This seems to be a good idea. The allocation strategy code is here: https://github.com/dmlc/mxnet/blob/master/src/kvstore/kvstore_device.h#L36

Maybe you can consider hacking on it a bit and contributing back? :)

@junranhe
Contributor Author

Yes, I will.

@mli
Contributor

mli commented Oct 10, 2015

Another possible way: do random assignment if the size is less than bigarray_bound_; otherwise, evenly split the array into num_dev parts and assign one part to each device.
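A rough C++ sketch of the splitting half of this idea; the helper below is hypothetical (bigarray_bound_ and num_dev come from the comment above), and the random-assignment path for small arrays is omitted:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: cut [0, size) into num_dev near-equal contiguous
// slices, one per device; the first (size % num_dev) slices get one extra
// element so the lengths differ by at most one.
std::vector<std::pair<int64_t, int64_t>> EvenSlices(int64_t size, int num_dev) {
  std::vector<std::pair<int64_t, int64_t>> slices;
  int64_t base = size / num_dev, rem = size % num_dev, begin = 0;
  for (int d = 0; d < num_dev; ++d) {
    int64_t len = base + (d < rem ? 1 : 0);  // spread the remainder
    slices.emplace_back(begin, begin + len);
    begin += len;
  }
  return slices;
}
```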

@tqchen
Member

tqchen commented Oct 10, 2015

The dependency checking on a slice might block other slices, as we do not distinguish between non-overlapping slices. So the current way is easier.

@tqchen
Member

tqchen commented Oct 16, 2015

This issue is fixed by contribution #256. Thanks to @junranhe!

@tqchen tqchen closed this as completed Oct 16, 2015