
CUDNN_STATUS_BAD_PARAM exceptions on some gradient checks #4863

Closed
AlexDBlack opened this issue Mar 28, 2018 · 3 comments

Labels
Bug Bugs and problems

Comments

@AlexDBlack
Contributor
A number of the CNN gradient checks here, when run with CuDNN, give CuDNN error = 3: CUDNN_STATUS_BAD_PARAM on the forward pass:

https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-core/src/test/java/org/deeplearning4j/gradientcheck/CNNGradientCheckTest.java

@AlexDBlack AlexDBlack added the Bug Bugs and problems label Mar 28, 2018
@AlexDBlack AlexDBlack self-assigned this Mar 28, 2018
@AlexDBlack
Contributor Author

See also: #4864

I've been digging into this for a number of hours - so far, no luck.
This is with the latest CuDNN 7.1 and CUDA 9.1 (including its 3 patch updates) on Windows 10 (Titan X).

We're basically getting this on a number of the unit tests:

java.lang.RuntimeException: java.lang.RuntimeException: CuDNN error = 3: CUDNN_STATUS_BAD_PARAM during forward pass - step cudnnGetConvolutionForwardWorkspaceSize: inputShape=[1, 1, 4, 5], weightsShape=[2, 1, 2, 2], biasShape=[1, 2], kernel=[2, 2], stride=[1, 1], padding=[1, 1], dilation=[1, 1], AlgoMode=USER_SPECIFIED, fwdAlgo=IMPLICIT_GEMM, convolutionMode=Same

	at org.deeplearning4j.nn.layers.convolution.ConvolutionLayer.preOutput(ConvolutionLayer.java:369)

(Note that USER_SPECIFIED + IMPLICIT_GEMM is the fallback; originally the algo mode was set to PREFER_FASTEST.)
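
For context, here's roughly the sequence of cuDNN calls involved - a minimal C sketch using the shapes from the exception above, not the actual DL4J/JavaCPP code path, and with assumed padding values:

```c
// Minimal C sketch of the failing sequence (assumed; not the actual DL4J code),
// using the shapes from the exception: input [1,1,4,5], weights [2,1,2,2],
// kernel 2x2, stride 1, dilation 1, SAME mode.
#include <cudnn.h>

cudnnStatus_t queryWorkspace(cudnnHandle_t handle, size_t *wsBytes) {
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // Input NCHW [1,1,4,5]; weights [outC,inC,kH,kW] = [2,1,2,2]
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 1, 4, 5);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 2, 1, 2, 2);

    // SAME mode: output spatial size = ceil(in/stride) = 4x5, so DL4J allocates [1,2,4,5]
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 2, 4, 5);

    // cuDNN takes a single pad value per spatial dimension, so only the top/left
    // padding goes in (assumed 0/0 here; SAME with a 2x2 kernel wants a 0/1 split).
    cudnnSetConvolution2dDescriptor(convDesc, /*pad_h*/ 0, /*pad_w*/ 0,
                                    /*strideH,W*/ 1, 1, /*dilationH,W*/ 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // The failing step: cuDNN derives its own output dims from xDesc + convDesc,
    // and (presumably) rejects the call if they don't match yDesc.
    return cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc,
            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, wsBytes);
}
```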

Why? Thus far, I'm not sure.
I've ruled out bad args and have checked the CuDNN 7.1 developer guide extensively - none of the conditions that can supposedly lead to CUDNN_STATUS_BAD_PARAM are actually present (I've checked all of them manually).

It's somehow related to the input size - not all combinations fail.
[N,C,H,W]=[5,1,4,5] + same mode + kernel 2: Fails
[N,C,H,W]=[5,1,4,5] + same mode + kernel 3: Passes
[N,C,H,W]=[5,1,5,5] + same mode + kernel 2: Passes
[N,C,H,W]=[5,1,5,5] + same mode + kernel 3: Passes
[N,C,H,W]=[5,1,64,64] + same mode + kernel 2: Fails
[N,C,H,W]=[5,1,64,64] + same mode + kernel 3: Passes
[N,C,H,W]=[1,1,64,64] + same mode + kernel 3: Passes
[N,C,H,W]=[1,1,64,64] + same mode + kernel 2: Fails
[N,C,H,W]=[1,1,16,16] + same mode + kernel 2: Fails

[N,C,H,W]=[1,1,16,16] + truncate mode + kernel 2: Passes
[N,C,H,W]=[1,1,64,64] + truncate mode + kernel 2: Passes

Note that manually setting the forward algorithm - or using the fallback in the aforementioned PR - only helps a bit: failures then occur on cudnnGetConvolutionForwardWorkspaceSize instead of cudnnGetConvolutionForwardAlgorithm.

I've also looked through the CuDNN 7 release notes - nothing there (known issues, bugs, fixes, etc.) looks like this.

The only conclusion I've got so far: it's related to SAME mode padding.
Specifically: I haven't checked all cases, but so far the failing cases I've checked have different top and bottom padding... the passing cases have identical top/bottom padding.
https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/ConvolutionMode.java

@AlexDBlack
Contributor Author

AlexDBlack commented Mar 28, 2018

Yeah, SAME mode padding - having different top/left and bottom/right values - definitely seems to be the cause here...

Note that CuDNN only supports 2 padding values - not 4 - which is why we pass only the top/left values.
Note also that when the top/left and bottom/right values differ, they differ by exactly 1: i.e., if top and bottom padding differ, then bottom padding is equal to top padding + 1.
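
To make that concrete, here's the arithmetic as I understand it (a rough C sketch, not the actual ConvolutionMode/ConvolutionUtils code): SAME output size is ceil(in/stride), and when the total padding required is odd it can't be split evenly, so the bottom/right side gets the extra pixel:

```c
// Sketch of the SAME-mode padding split (assumed; not the actual DL4J code).
// out = ceil(in / stride); totalPad = (out - 1) * stride + kernel - in;
// top/left gets totalPad / 2, bottom/right gets the remainder.
#include <stdio.h>

static void samePad(int in, int kernel, int stride) {
    int out = (in + stride - 1) / stride;            // ceil(in / stride)
    int total = (out - 1) * stride + kernel - in;    // total padding required
    if (total < 0) total = 0;
    int topLeft = total / 2;
    int bottomRight = total - topLeft;               // = topLeft + 1 when total is odd
    printf("in=%d kernel=%d stride=%d -> pad %d/%d%s\n", in, kernel, stride,
           topLeft, bottomRight, topLeft != bottomRight ? "  <-- asymmetric" : "");
}

int main(void) {
    samePad(4, 2, 1);   // e.g. the height of the [1,1,4,5] case above: 0/1, asymmetric
    samePad(4, 3, 1);   // odd kernel: 1/1, symmetric
    return 0;
}
```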

(I'm guessing the CuDNN builds for CUDA 9/9.1 add some extra checks, whereas previous versions handled this implicitly - hence why we're only seeing this now...)

Of course, the next question is: how do we fix this?

Some options:

  1. Bad option 1: don't use CuDNN in this condition (or resort to fallback)
  2. Bad option 2: Manually add the padding values (physically create a new array with zero padding, copy over the original values, and pass the new array + a "padding" arg of 0 to CuDNN - roughly as sketched below)
  3. Work out some way to do this properly with CuDNN
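
To illustrate, option 2 would look something like this conceptually (a rough C sketch of the idea, not real DL4J code - in practice it would be an INDArray op / device-side copy rather than a host loop):

```c
// Rough sketch of option 2 (assumed, illustrative only): copy the NCHW input into
// a zero-initialised buffer that already includes the asymmetric SAME padding,
// then hand the padded buffer to cuDNN with pad_h = pad_w = 0.
#include <stdlib.h>
#include <string.h>

float *padInputNCHW(const float *in, int n, int c, int h, int w,
                    int padTop, int padBottom, int padLeft, int padRight) {
    int ph = h + padTop + padBottom;
    int pw = w + padLeft + padRight;
    float *out = (float *) calloc((size_t) n * c * ph * pw, sizeof(float)); // zero-filled
    for (int ni = 0; ni < n; ni++) {
        for (int ci = 0; ci < c; ci++) {
            for (int hi = 0; hi < h; hi++) {
                const float *srcRow = in  + (((size_t) ni * c + ci) * h + hi) * w;
                float *dstRow = out + (((size_t) ni * c + ci) * ph + hi + padTop) * pw + padLeft;
                memcpy(dstRow, srcRow, (size_t) w * sizeof(float));  // copy one input row
            }
        }
    }
    return out;  // padded input [n, c, ph, pw]; caller passes padding 0 to cuDNN
}
```

The obvious downside (and why it's a "bad" option) is an extra allocation and copy of the activations on every forward pass.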

Option 3 is obviously best - but so far I have found no way to actually do that... for example, cudnnSetConvolution2dDescriptor (the function that takes the padding values) only supports an identical amount of padding on top/bottom and left/right:

pad_h: Input. zero-padding height: number of rows of zeros implicitly concatenated onto the top and onto the bottom of input images.
pad_w: Input. zero-padding width: number of columns of zeros implicitly concatenated onto the left and onto the right of input images.
Even cudnnSetConvolutionNdDescriptor doesn't seem to support this...
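
For reference, the signatures as I read them in the cuDNN 7 headers - neither one has a per-side padding parameter:

```c
// cuDNN 7 convolution-descriptor setters (as reproduced from cudnn.h, as I read it):
// a single pad value per spatial dimension, applied to both sides.
cudnnStatus_t cudnnSetConvolution2dDescriptor(
        cudnnConvolutionDescriptor_t convDesc,
        int pad_h, int pad_w,            // one value each for height and width
        int u, int v,                    // vertical / horizontal filter stride
        int dilation_h, int dilation_w,
        cudnnConvolutionMode_t mode,
        cudnnDataType_t computeType);

cudnnStatus_t cudnnSetConvolutionNdDescriptor(
        cudnnConvolutionDescriptor_t convDesc,
        int arrayLength,                 // number of spatial dimensions
        const int padA[],                // still one pad value per dimension
        const int filterStrideA[],
        const int dilationA[],
        cudnnConvolutionMode_t mode,
        cudnnDataType_t computeType);
```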

Yet I assume this must be possible - unless all other libraries that use CuDNN resort to options 1 or 2??

@lock

lock bot commented Sep 23, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 23, 2018