Change filter type back to NCHW, since it supports more algos. #53
Conversation
Convert assertion to warning, although I am not sure this works, since if an algorithm fails to be found because of an out-of-memory error, it is likely to fail at training time for the same reason.
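A minimal sketch of the change this message describes, with hypothetical variable names (the actual assertion in the PR may differ): rather than aborting when no algorithm is returned, log a warning and continue, accepting that training may later fail for the same reason:

// Before: KALDI_ASSERT(returned_algo_count > 0);
// After: warn and continue.
if (returned_algo_count == 0)
  KALDI_WARN << "cuDNN returned no convolution algorithms; "
             << "training may fail later for the same reason.";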
I did, on a K10. A quick fix is probably to try a newer GPU with more memory, like the K80.
Same problem on a 1080 Ti, so it's not likely to be running out of memory.
There's a bug in CUDNN. Calling:

CUDNN_SAFE_CALL(
    cudnnSetTensor4dDescriptor(input_desc_, CUDNN_TENSOR_NHWC,
                               CUDNN_DATA_BASEFLOAT, c.num_images,
                               c.num_channels_in, c.input_image_width,
                               c.input_image_height));

sets strides inappropriately, as if the data format were NCHW, from what I can tell. We can work around this by using cudnnSetTensor4dDescriptorEx directly, so we can set the strides appropriately. Although the fact that this bug has been sitting around makes me a little bit concerned about the accuracy of the documentation saying which formats are compatible with which algos. If I change the two occurrences of CUDNN_TENSOR_NHWC to CUDNN_TENSOR_NCHW in this PR, the test will pass.

cc @danpovey
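A minimal sketch of the proposed workaround, assuming the descriptor and config names used elsewhere in this PR: since cudnnSetTensor4dDescriptorEx() takes explicit strides rather than a format flag, we can compute the NHWC strides ourselves, where the channel is the fastest-varying dimension:

// Dimensions as passed in the call above.
int n = c.num_images, ch = c.num_channels_in,
    h = c.input_image_width, w = c.input_image_height;
// NHWC strides: channels contiguous, then w, then h, then n.
int n_stride = h * w * ch;
int c_stride = 1;
int h_stride = w * ch;
int w_stride = ch;
CUDNN_SAFE_CALL(
    cudnnSetTensor4dDescriptorEx(input_desc_, CUDNN_DATA_BASEFLOAT,
                                 n, ch, h, w,
                                 n_stride, c_stride, h_stride, w_stride));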
OK, thanks. For nnet3 reasons, CUDNN_TENSOR_NCHW won't work. Also, it seems to me there must be more than one bug in CUDNN, because apart from setting those strides inappropriately, it also segfaults later on, which surely it shouldn't do.

Regarding checking those things instead of crashing: since we are not specifying the memory limit (memoryLimitInBytes) to the cudnnGet*Algorithm() calls, I don't think it would fail for a memory reason, and I think it's valid to just crash. Initially it was failing to get any backward algorithm at all when I used the other layout for the filters. So that would be considered a code error, presumably.
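For reference, a minimal sketch of the algorithm query being discussed, using the cuDNN 7 API that this PR targets; the handle and descriptor names here are assumptions, not this PR's exact code. With CUDNN_CONVOLUTION_FWD_PREFER_FASTEST, the memoryLimitInBytes argument is ignored, which is why the call should not fail for a memory reason:

cudnnConvolutionFwdAlgo_t fwd_algo;
CUDNN_SAFE_CALL(
    cudnnGetConvolutionForwardAlgorithm(
        handle_, input_desc_, params_desc_, conv_desc_, output_desc_,
        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
        0,  // memoryLimitInBytes; only consulted with SPECIFY_WORKSPACE_LIMIT
        &fwd_algo));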
And let me know how you found that bug, i.e. from that debugging output, which specific line told you there was a bug, and what it would have been if it were correct.
Yes, my initial expectation about memory was wrong. I know we can't use CUDNN_TENSOR_NCHW for nnet3 reasons for the input and output images (but it's okay for filters, right? Please confirm.)

For debugging, my biggest recommendation is this:

CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stderr ./convolution-cudnn-test

It logs every CUDNN call made (even those made internally by the library itself), including the formal parameters to those calls. I then reasoned about what was calling each CUDNN function, to see where the invalid inputs were coming from. In this case, I saw this excerpt:

I! CuDNN (v7102) function cudnnSetTensor4dDescriptor() called:
i! format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NHWC (1);
i! dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i! n: type=int; val=8;
i! c: type=int; val=20;
i! h: type=int; val=13;
i! w: type=int; val=12;
i! Time: 2018-10-23T12:31:54.659314 (0d+0h+0m+2s since start)
i! Process=48634; Thread=48634; Handle=NULL; StreamId=NULL.

I! CuDNN (v7102) function cudnnSetTensor4dDescriptorEx() called:
i! dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i! n: type=int; val=8;
i! c: type=int; val=20;
i! h: type=int; val=13;
i! w: type=int; val=12;
i! nStride: type=int; val=3120;
i! cStride: type=int; val=1;
i! hStride: type=int; val=240;
i! wStride: type=int; val=20;
i! Time: 2018-10-23T12:31:54.659350 (0d+0h+0m+2s since start)
i! Process=48634; Thread=48634; Handle=NULL; StreamId=NULL.

I! CuDNN (v7102) function cudnnSetTensorNdDescriptor() called:
i! dataType: type=cudnnDataType_t; val=CUDNN_DATA_FLOAT (0);
i! nbDims: type=int; val=4;
i! dimA: type=int; val=[8,20,13,12];
i! strideA: type=int; val=[3120,1,240,20];
i! Time: 2018-10-23T12:31:54.659393 (0d+0h+0m+2s since start)
i! Process=48634; Thread=48634; Handle=NULL; StreamId=NULL.

where only the first call is a call that we make from Kaldi (everything else is just functions deferring to one another). The strides looked incorrect to me. The bug is that cudnnSetTensor4dDescriptor() seems to think that we specified CUDNN_TENSOR_NCHW, rather than CUDNN_TENSOR_NHWC. As I mentioned, we can work around this by calling cudnnSetTensor4dDescriptorEx() directly, with the appropriate strides for CUDNN_TENSOR_NHWC. I just haven't done it yet since I need to get to work.
OK, thanks a lot for the info. For the filters, we can use any setup. I already had to change it to get around the previous problem where none of the backward algorithms were available.
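A minimal sketch of what 'any setup' for the filters means under cuDNN 7: the filter descriptor carries its own format flag, independent of the input/output tensor layout. The descriptor and field names here are assumptions, not this PR's exact code:

CUDNN_SAFE_CALL(
    cudnnSetFilter4dDescriptor(params_desc_, CUDNN_DATA_BASEFLOAT,
                               CUDNN_TENSOR_NCHW,  // supports more algos, per the PR title
                               c.num_channels_out, c.num_channels_in,
                               c.filter_height, c.filter_width));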
Use cudnnSetTensor4dDescriptorEx with strides we calculate ourselves instead.
@danpovey Okay, the current code will run without any errors.
Fantastic!!!!
I'll deal with merging changes tomorrow when I have time to look at this.
Convert assertion to warning, although I am not sure this works, since if an algorithm fails to be found because of an out-of-memory error, it is likely to fail at training time for the same reason.
This is the current error I am getting, after these changes, when I run:

make -j convolution-cudnn-test && CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stderr ./convolution-cudnn-test 2>&1 | tee