OOM when allocating tensor #7

Open
drobertduke opened this Issue Sep 17, 2016 · 8 comments

drobertduke commented Sep 17, 2016

I have a 12GB GPU, but attempting to train anything with the default settings produces an OOM on the first epoch. I had to dial batch_size and dilation_depth way down before training would even start. What settings are you using when you train?

I tensorflow/core/common_runtime/bfc_allocator.cc:689]      Summary of in-use Chunks by size:
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 83 Chunks of size 256 totalling 20.8KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 512 totalling 512B
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 15 Chunks of size 1024 totalling 15.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1280 totalling 1.2KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 65536 totalling 64.0KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 59 Chunks of size 262144 totalling 14.75MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 520704 totalling 508.5KiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 105 Chunks of size 524288 totalling 52.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 13 Chunks of size 67108864 totalling 832.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 2 Chunks of size 67174400 totalling 128.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 67239936 totalling 64.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 67371008 totalling 64.25MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 67633152 totalling 64.50MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 68157440 totalling 65.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 134479872 totalling 128.25MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 269484032 totalling 257.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 541065216 totalling 516.00MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 1090519040 totalling 1.02GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 2147483648 totalling 2.00GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 2214592512 totalling 2.06GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:692] 1 Chunks of size 3726535936 totalling 3.47GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] Sum Total of in-use chunks: 10.68GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:698] Stats:
Limit:                 11715375924
InUse:                 11472467200
MaxInUse:              11473515776
NumAllocs:                     563
MaxAllocSize:           3980291328

W tensorflow/core/common_runtime/bfc_allocator.cc:270] ****************************************************************************************xxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:271] Ran out of memory trying to allocate 2.00GiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:968] Resource exhausted: OOM when allocating tensor with shape[65536,256,32,1]
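
A quick back-of-the-envelope check (not part of the original report): assuming float32 elements, the tensor shape in the last log line accounts for exactly the 2.00GiB the allocator failed to find:

```python
import math

# Shape of the tensor from the OOM message: [65536, 256, 32, 1]
shape = (65536, 256, 32, 1)
bytes_per_elem = 4  # assuming float32 activations

total_bytes = math.prod(shape) * bytes_per_elem
print(total_bytes)           # 2147483648
print(total_bytes / 2**30)   # 2.0 (GiB), matching "trying to allocate 2.00GiB"
```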

basveeling (Owner) commented Sep 17, 2016

I haven't trained with tensorflow yet, but I'll look into it. In the meantime, try using theano with cnmem enabled: THEANO_FLAGS='lib.cnmem=1' KERAS_BACKEND=theano python wavenet.py

@basveeling basveeling closed this Sep 17, 2016

@basveeling basveeling reopened this Sep 17, 2016

ibab commented Sep 17, 2016

The reason for this might be the same as for the tensorflow implementation here: ibab/tensorflow-wavenet#4 (comment)
(I haven't looked at how keras implements AtrousConvolution1D, though).

basveeling (Owner) commented Sep 17, 2016

I was wondering why keras requires the dilation values to be equal in both dimensions when using tensorflow; it uses tf.nn.atrous_conv2d. Thanks for the heads-up, and nice work on the fix! :)
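
For context on why dilation_depth matters so much for memory: WaveNet stacks dilated convolutions whose dilation doubles at each layer, so the receptive field (and the intermediate activations the backend materializes) grows exponentially with dilation_depth. A rough sketch of the receptive-field arithmetic (illustrative helper, not code from this repo; assumes kernel size 2 as in WaveNet):

```python
def receptive_field(dilation_depth, n_stacks=1, kernel_size=2):
    # WaveNet-style dilation schedule: 1, 2, 4, ..., 2**dilation_depth,
    # repeated once per stack.
    dilations = [2 ** d for d in range(dilation_depth + 1)] * n_stacks
    return (kernel_size - 1) * sum(dilations) + 1

print(receptive_field(9))  # 1024 samples: sum(1 + 2 + ... + 512) = 1023, plus 1
```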

basveeling (Owner) commented Feb 6, 2017

I'm closing this on the assumption that it has been fixed in tensorflow by now. If not, please let me know.

@basveeling basveeling closed this Feb 6, 2017

Shoshin23 commented Mar 10, 2017

Nope, this is not fixed in Tensorflow yet. I'm getting the exact same error as the OP. I'm trying the theano backend to see if it works.

raavianvesh commented Sep 16, 2017

Did it work with the theano backend?

meridion commented Aug 29, 2018

I'd like to let you all know this is fixed in TensorFlow 1.10; it works like a charm. I'm using the unmodified current master (well, technically I modified a single line in dataset to make the code work under Python 3.x).

basveeling (Owner) commented Sep 3, 2018

@meridion Thanks! Would you mind sending a pull request so other users can easily benefit from your fix?

@basveeling basveeling reopened this Sep 3, 2018
