This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

gpu mem pool strategy #11041

Merged
merged 3 commits into from Jun 14, 2018

Conversation

szha (Member) commented May 24, 2018

Description

adjust GPU memory pool strategy

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • add knob for minimum memory pool chunk size
  • add option (MXNET_GPU_MEM_POOL_TYPE="Round") for using nearest power of 2 size for better memory reuse
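For illustration, here is a minimal sketch (assumed, not the PR's exact code) of how these two knobs might be read; dmlc::GetEnv and the default values mirror the diff discussed in the review below:

#include <dmlc/parameter.h>
#include <string>

void ReadPoolKnobs() {
  // "Round" selects GPUPooledRoundedStorageManager (nearest power of 2);
  // anything else falls back to the naive pool.
  std::string strategy =
      dmlc::GetEnv("MXNET_GPU_MEM_POOL_TYPE", std::string("Naive"));
  // Smallest chunk the pool will hand out, in bytes.
  size_t min_chunk = dmlc::GetEnv("MXNET_GPU_MEM_POOL_MIN_CHUNK", 4096);
  (void)strategy; (void)min_chunk;
}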

Comments

@szha force-pushed the mem_strategy branch 10 times, most recently from fd64b96 to b8b942e on May 25, 2018 03:10
@szha changed the title from "[WIP] gpu mem pool strategy" to "gpu mem pool strategy" on May 25, 2018
LOG(INFO) << "Using GPUPooledRoundedStorageManager.";
} else {
if (strategy != "Naive") {
LOG(INFO) << "Unknown memory pool strategy specified: " << strategy << ".";
Member: log(fatal)?

@szha force-pushed the mem_strategy branch 2 times, most recently from bcba6e2 to de2a823 on May 25, 2018 21:26
zhreshold (Member) commented:

Still no clue what's going wrong with this PR. Nothing specific to Windows; weirdly, python2-GPU-win is good. I will try it on a local Windows PC.

@@ -71,7 +78,7 @@ class GPUPooledStorageManager final : public StorageManager {
 private:
  void DirectFreeNoLock(Storage::Handle handle) {
    cudaError_t err = cudaFree(handle.dptr);
    size_t size = handle.size + NDEV;
Contributor: are you sure +NDEV is not needed any more? What if NDEV=32, min_chunk=33, and handle.size=30? The original code would allocate 62; the new code would allocate 33.

szha (author): cc'd @ptrendx. My understanding was that there need to be enough bytes so that, across 32 devices, each device gets at least 1 byte for NCCL scattering. Could you confirm, @ptrendx?

ptrendx (Member): Yes, that is correct.
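To make the arithmetic in the reviewer's example concrete, here is a sketch with hypothetical values (not the PR's code): the +NDEV padding guarantees each of 32 devices can receive at least one byte during an NCCL scatter, which rounding up to min_chunk alone would not.

#include <algorithm>
#include <cstddef>

constexpr size_t kNDev = 32;  // number of devices to pad for

size_t PaddedSize(size_t requested, size_t min_chunk) {
  // requested=30, min_chunk=33  ->  max(30 + 32, 33) = 62, as the old code did;
  // rounding only to min_chunk would return 33 and lose the guarantee.
  return std::max(requested + kNDev, min_chunk);
}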

@@ -52,6 +54,11 @@ class GPUPooledStorageManager final : public StorageManager {
   */
  GPUPooledStorageManager() {
    reserve_ = dmlc::GetEnv("MXNET_GPU_MEM_POOL_RESERVE", 5);
    min_chunk_ = dmlc::GetEnv("MXNET_GPU_MEM_POOL_MIN_CHUNK", 4096);
Contributor: page size instead of min chunk?

@@ -82,19 +89,19 @@ class GPUPooledStorageManager final : public StorageManager {
 private:
  void ReleaseAll();
  // used memory
- size_t used_memory_ = 0;
+ size_t used_memory_ = 0, min_chunk_;
Contributor: put min_chunk_ on a new line.

private:
#if __SIZEOF_SIZE_T__ == __SIZEOF_LONG__

#if defined(__clang__) || defined(__GNUC__)
Contributor: does this need to be so complicated? You just need to take the highest bit and shift left by 1 if it's smaller than the size. This is called finding the MSB (most significant bit).

szha (author): these builtins utilize hardware instructions when available.

Contributor: Is it really faster? It looks too complicated.

Contributor: Also, the default implementation with pow and log is really slow.

szha (author): I will change the default implementation to use bit shifting and then do a comparison.

szha (author): I compared my current solution, the bit shifting, and static_cast<int>(std::ceil(std::log2(s))), with -O3 turned on, on my Mac (clang). The timings:

Running 10000000 iters.
Addr width 64
It took me 0.00981569 seconds. result: 223222785
It took me 0.128623 seconds. result: 223222785
It took me 0.0801588 seconds. result: 223222785
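For readers following along, here is a sketch of the three approaches being timed (assumed reconstructions, not the PR's exact benchmark code); all compute ceil(log2(s)) for s > 0, and the builtin variant assumes a 64-bit long, matching the __SIZEOF_SIZE_T__ == __SIZEOF_LONG__ guard shown earlier.

#include <cmath>
#include <cstddef>

// 1) compiler builtin: __builtin_clzl maps to a hardware bit-scan on most
//    targets (clang/GCC only; undefined behavior for s == 0).
inline int CeilLog2Builtin(size_t s) {
  int msb = 63 - __builtin_clzl(s);                 // index of highest set bit
  return ((size_t{1} << msb) < s) ? msb + 1 : msb;  // bump if s is not a power of 2
}

// 2) portable bit shifting: smallest k with 2^k >= s.
inline int CeilLog2Shift(size_t s) {
  int bits = 0;
  while ((size_t{1} << bits) < s) ++bits;
  return bits;
}

// 3) floating point: concise, but criticized in the thread as slow.
inline int CeilLog2Float(size_t s) {
  return static_cast<int>(std::ceil(std::log2(s)));
}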

@szha force-pushed the mem_strategy branch 10 times, most recently from 0319b42 to 63aac3f on June 4, 2018 20:39
szha (Member, author) commented Jun 6, 2018:

I've simplified the implementation to exclude the optimizations using intrinsics and bit scans. They are backed up in https://github.com/szha/mxnet/tree/mem_strategy_backup

@@ -23,7 +23,7 @@
 import platform

 blacklist = [
-    'Windows.h', 'cublas_v2.h', 'cuda/tensor_gpu-inl.cuh',
+    'Windows.h', 'intrin.h', 'cublas_v2.h', 'cuda/tensor_gpu-inl.cuh',
Contributor: revert

@szha force-pushed the mem_strategy branch 2 times, most recently from e57bae9 to 9b39b72 on June 8, 2018 22:11

TEST(GPUStorage, Round_GPU) {
  if (mxnet::test::unitTestsWithCuda) {
    putenv("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=20");
Contributor: How long does this variable persist? It could have side effects on other tests.
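One way to limit the lifetime the reviewer asks about (a sketch, not the PR's code): putenv alters the process environment for every later test, so the variable could be explicitly restored when the test finishes. setenv/unsetenv are POSIX; Windows would need _putenv instead.

#include <cstdlib>
#include <gtest/gtest.h>

TEST(GPUStorage, Round_GPU_Scoped) {
  setenv("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF", "20", /*overwrite=*/1);
  // ... exercise the rounded pool here ...
  unsetenv("MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF");  // undo the side effect
}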

#include <gtest/gtest.h>
#include <dmlc/logging.h>
#include <mxnet/storage.h>
#include <cstdio>
#include "test_util.h"
#include "storage/pooled_storage_manager.h"
marcoabreu (Contributor) commented Jun 9, 2018: Duplicate import? I think it's already part of the storage namespace at mxnet/storage.h

@marcoabreu dismissed their stale review June 9, 2018 11:03: "Didn't want to block"

@szha force-pushed the mem_strategy branch 2 times, most recently from d0d8bf7 to 00086f1 on June 11, 2018 02:17
@@ -16,7 +16,7 @@
 # under the License.

 from mxnet.test_utils import *
-from common import setup_module, with_seed
+from common import setup_module, with_seed, teardown
Contributor: Is it really necessary to import this in every single test? Looks a bit ugly, to be honest.

szha (author): Applying this change allows all tests within a module to finish before moving on to the next module, eliminating the case where side effects of tests in one module spill over into the next. In terms of testing practice, including a setup/teardown is common.

Contributor: Yeah, but we're not actually using it in most files, right?

szha (author): now we are

Contributor: Ah, in common.py :) But isn't it sufficient to import it there?

szha (author): Unfortunately no; it is the same case as setup_module. The test runner only picks up module-level fixtures that exist in each test module's own namespace, so defining them only in common.py is not enough.

Contributor: argh :/

@szha force-pushed the mem_strategy branch 6 times, most recently from 37ecc98 to 72b386f on June 12, 2018 03:08
size_t free, total;
cudaMemGetInfo(&free, &total);
if (free <= total * reserve_ / 100 || size > free - total * reserve_ / 100)
  ReleaseAll();
Contributor: What will happen to the storage handles currently pointing to some of the memory?
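A standalone restatement of the check above may help (a sketch, with an assumed helper name): keep reserve_ percent of total GPU memory unused, and if the pool would violate that, or the request cannot fit beside the reserve, flush the pool. Flushing only returns cached blocks sitting in the pool's free lists; handles still held by users are not in those lists, so they remain valid.

#include <cuda_runtime.h>
#include <cstddef>

bool ShouldFlushPool(size_t request, int reserve_percent) {
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  const size_t reserve = total_bytes * reserve_percent / 100;
  // Flush if we are already inside the reserve, or the request would push us there.
  return free_bytes <= reserve || request > free_bytes - reserve;
}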

std::lock_guard<std::mutex> lock(Storage::Get()->GetMutex(Context::kGPU));
int bucket = get_bucket(handle->size);
size_t size = get_size(bucket);
auto&& reuse_pool = memory_pool_[bucket];
Contributor: Even if it's not an error (the rvalue reference will be deduced to a normal lvalue reference), it's better to be explicit and use auto&.
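The reviewer's point in miniature (a sketch with a stand-in pool type): on an lvalue, auto&& collapses to an lvalue reference anyway, so auto& says the same thing more directly.

#include <vector>

void Touch(std::vector<std::vector<int>>& memory_pool, int bucket) {
  auto&& pool_rref = memory_pool[bucket];  // deduces to std::vector<int>&
  auto&  pool_ref  = memory_pool[bucket];  // identical binding, clearer intent
  pool_ref.push_back(42);
  (void)pool_rref;
}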

@szha merged commit bf26886 into apache:master on Jun 14, 2018
@leezu mentioned this pull request Jun 22, 2018
ThomasDelteil (Contributor) commented:

@szha should we document this new env variable, or is it still experimental?

@szha deleted the mem_strategy branch on June 25, 2018 00:21
szha (Member, author) commented Jun 25, 2018:

@ThomasDelteil I intended to have people experiment with this first.

zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018:

* use nearest power of 2 for gpu memory pool sizes
* add linear
* add test

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018:

* use nearest power of 2 for gpu memory pool sizes
* add linear
* add test
Successfully merging this pull request may close these issues:

  • Bug of CuDNN RNN with variable sequence length