
CUDNN_STATUS_BAD_PARAM exceptions on some gradient checks #4863

Closed
AlexDBlack opened this issue Mar 28, 2018 · 3 comments

Labels
Bug Bugs and problems

Comments

@AlexDBlack
Contributor
A number of the CNN gradient checks here, when run with CuDNN, give CuDNN error = 3: CUDNN_STATUS_BAD_PARAM on the forward pass:

https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-core/src/test/java/org/deeplearning4j/gradientcheck/CNNGradientCheckTest.java

@AlexDBlack AlexDBlack added the Bug Bugs and problems label Mar 28, 2018
@AlexDBlack AlexDBlack self-assigned this Mar 28, 2018
@AlexDBlack
Contributor Author

See also: #4864

I've been digging into this for a number of hours - so far, no luck.
This is with the latest CuDNN 7.1 and CUDA 9.1 (including its 3 patch updates) on Windows 10 (Titan X).

We're basically getting this on a number of the unit tests:

java.lang.RuntimeException: java.lang.RuntimeException: CuDNN error = 3: CUDNN_STATUS_BAD_PARAM during forward pass - step cudnnGetConvolutionForwardWorkspaceSize: inputShape=[1, 1, 4, 5], weightsShape=[2, 1, 2, 2], biasShape=[1, 2], kernel=[2, 2], stride=[1, 1], padding=[1, 1], dilation=[1, 1], AlgoMode=USER_SPECIFIED, fwdAlgo=IMPLICIT_GEMM, convolutionMode=Same

	at org.deeplearning4j.nn.layers.convolution.ConvolutionLayer.preOutput(ConvolutionLayer.java:369)

(Note that USER_SPECIFIED + IMPLICIT_GEMM is the fallback; originally the algo mode was set to PREFER_FASTEST.)
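
For context, here's roughly the sequence of cuDNN calls involved - a minimal C sketch using the shapes from the exception above, not the actual DL4J/JavaCPP code path, and with assumed padding values:

```c
// Minimal C sketch of the failing sequence (assumed; not the actual DL4J code),
// using the shapes from the exception: input [1,1,4,5], weights [2,1,2,2],
// kernel 2x2, stride 1, dilation 1, SAME mode.
#include <cudnn.h>

cudnnStatus_t queryWorkspace(cudnnHandle_t handle, size_t *wsBytes) {
    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // Input NCHW [1,1,4,5]; weights [outC,inC,kH,kW] = [2,1,2,2]
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 1, 4, 5);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 2, 1, 2, 2);

    // SAME mode: output spatial size = ceil(in/stride) = 4x5, so DL4J allocates [1,2,4,5]
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 2, 4, 5);

    // cuDNN takes a single pad value per spatial dimension, so only the top/left
    // padding goes in (assumed 0/0 here; SAME with a 2x2 kernel wants a 0/1 split).
    cudnnSetConvolution2dDescriptor(convDesc, /*pad_h*/ 0, /*pad_w*/ 0,
                                    /*strideH,W*/ 1, 1, /*dilationH,W*/ 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    // The failing step: cuDNN derives its own output dims from xDesc + convDesc,
    // and (presumably) rejects the call if they don't match yDesc.
    return cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc,
            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM, wsBytes);
}
```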

Why? Thus far, I'm not sure.
I've ruled out bad args and have checked the CuDNN 7.1 developer guide extensively - none of the conditions that can supposedly lead to CUDNN_STATUS_BAD_PARAM are actually present (I've checked all of them manually).

It's somehow related to the input size - not all combinations fail.
[N,C,H,W]=[5,1,4,5] + same mode + kernel 2: Fails
[N,C,H,W]=[5,1,4,5] + same mode + kernel 3: Passes
[N,C,H,W]=[5,1,5,5] + same mode + kernel 2: Passes
[N,C,H,W]=[5,1,5,5] + same mode + kernel 3: Passes
[N,C,H,W]=[5,1,64,64] + same mode + kernel 2: Fails
[N,C,H,W]=[5,1,64,64] + same mode + kernel 3: Passes
[N,C,H,W]=[1,1,64,64] + same mode + kernel 3: Passes
[N,C,H,W]=[1,1,64,64] + same mode + kernel 2: Fails
[N,C,H,W]=[1,1,16,16] + same mode + kernel 2: Fails

[N,C,H,W]=[1,1,16,16] + truncate mode + kernel 2: Passes
[N,C,H,W]=[1,1,64,64] + truncate mode + kernel 2: Passes

Note that manually setting the forward algorithm - or using the fallback in the aforementioned PR - only helps a bit: failures then occur on cudnnGetConvolutionForwardWorkspaceSize instead of cudnnGetConvolutionForwardAlgorithm.

I've also looked through the CuDNN 7 release notes - nothing there (known issues, bugs, fixes, etc.) looks like this.

The only conclusion I've got so far: it's related to SAME mode padding.
Specifically: I haven't checked all cases, but so far the failing cases I've checked have different top and bottom padding... the passing cases have identical top/bottom padding.
https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/ConvolutionMode.java

@AlexDBlack
Contributor Author

AlexDBlack commented Mar 28, 2018

Yeah, SAME mode padding - having different top/left and bottom/right values - definitely seems to be the cause here...

Note that CuDNN only supports 2 padding values - not 4 - which is why we pass only the top/left values.
Note also that when the top/left and bottom/right values differ, they differ by exactly 1: i.e., if top and bottom padding differ, then bottom padding is equal to top padding + 1.
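
To make that concrete, here's the arithmetic as I understand it (a rough C sketch, not the actual ConvolutionMode/ConvolutionUtils code): SAME output size is ceil(in/stride), and when the total padding required is odd it can't be split evenly, so the bottom/right side gets the extra pixel:

```c
// Sketch of the SAME-mode padding split (assumed; not the actual DL4J code).
// out = ceil(in / stride); totalPad = (out - 1) * stride + kernel - in;
// top/left gets totalPad / 2, bottom/right gets the remainder.
#include <stdio.h>

static void samePad(int in, int kernel, int stride) {
    int out = (in + stride - 1) / stride;            // ceil(in / stride)
    int total = (out - 1) * stride + kernel - in;    // total padding required
    if (total < 0) total = 0;
    int topLeft = total / 2;
    int bottomRight = total - topLeft;               // = topLeft + 1 when total is odd
    printf("in=%d kernel=%d stride=%d -> pad %d/%d%s\n", in, kernel, stride,
           topLeft, bottomRight, topLeft != bottomRight ? "  <-- asymmetric" : "");
}

int main(void) {
    samePad(4, 2, 1);   // e.g. the height of the [1,1,4,5] case above: 0/1, asymmetric
    samePad(4, 3, 1);   // odd kernel: 1/1, symmetric
    return 0;
}
```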

(I'm guessing the CuDNN builds for CUDA 9/9.1 add some extra checks, whereas previous versions handled this implicitly - hence why we're only seeing this now...)

Of course, the next question is: how do we fix this?

Some options:

  1. Bad option 1: don't use CuDNN in this condition (or resort to fallback)
  2. Bad option 2: Manually add the padding values (physically create a new array with zero padding, copy over the original values, and pass the new array + a "padding" arg of 0 to CuDNN - roughly as sketched below)
  3. Work out some way to do this properly with CuDNN
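
To illustrate, option 2 would look something like this conceptually (a rough C sketch of the idea, not real DL4J code - in practice it would be an INDArray op / device-side copy rather than a host loop):

```c
// Rough sketch of option 2 (assumed, illustrative only): copy the NCHW input into
// a zero-initialised buffer that already includes the asymmetric SAME padding,
// then hand the padded buffer to cuDNN with pad_h = pad_w = 0.
#include <stdlib.h>
#include <string.h>

float *padInputNCHW(const float *in, int n, int c, int h, int w,
                    int padTop, int padBottom, int padLeft, int padRight) {
    int ph = h + padTop + padBottom;
    int pw = w + padLeft + padRight;
    float *out = (float *) calloc((size_t) n * c * ph * pw, sizeof(float)); // zero-filled
    for (int ni = 0; ni < n; ni++) {
        for (int ci = 0; ci < c; ci++) {
            for (int hi = 0; hi < h; hi++) {
                const float *srcRow = in  + (((size_t) ni * c + ci) * h + hi) * w;
                float *dstRow = out + (((size_t) ni * c + ci) * ph + hi + padTop) * pw + padLeft;
                memcpy(dstRow, srcRow, (size_t) w * sizeof(float));  // copy one input row
            }
        }
    }
    return out;  // padded input [n, c, ph, pw]; caller passes padding 0 to cuDNN
}
```

The obvious downside (and why it's a "bad" option) is an extra allocation and copy of the activations on every forward pass.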

Option 3 is obviously best - but so far I have found no way to actually do that... for example, cudnnSetConvolution2dDescriptor (the function that takes the padding values) only supports an identical amount of padding on top/bottom and left/right:

pad_h: Input. zero-padding height: number of rows of zeros implicitly concatenated onto the top and onto the bottom of input images.
pad_w: Input. zero-padding width: number of columns of zeros implicitly concatenated onto the left and onto the right of input images.
Even cudnnSetConvolutionNdDescriptor doesn't seem to support this...
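
For reference, the signatures as I read them in the cuDNN 7 headers - neither one has a per-side padding parameter:

```c
// cuDNN 7 convolution-descriptor setters (as reproduced from cudnn.h, as I read it):
// a single pad value per spatial dimension, applied to both sides.
cudnnStatus_t cudnnSetConvolution2dDescriptor(
        cudnnConvolutionDescriptor_t convDesc,
        int pad_h, int pad_w,            // one value each for height and width
        int u, int v,                    // vertical / horizontal filter stride
        int dilation_h, int dilation_w,
        cudnnConvolutionMode_t mode,
        cudnnDataType_t computeType);

cudnnStatus_t cudnnSetConvolutionNdDescriptor(
        cudnnConvolutionDescriptor_t convDesc,
        int arrayLength,                 // number of spatial dimensions
        const int padA[],                // still one pad value per dimension
        const int filterStrideA[],
        const int dilationA[],
        cudnnConvolutionMode_t mode,
        cudnnDataType_t computeType);
```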

Yet I assume this must be possible - unless all other libraries that use CuDNN resort to options 1 or 2??

@lock

lock bot commented Sep 23, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 23, 2018