Float16 CUDA conv broken on 5D tensors #505

Open
nikopj opened this issue Feb 16, 2023 · 8 comments
@nikopj
Contributor

nikopj commented Feb 16, 2023

Float16 CUDA conv seems to be broken for 5D tensors, but not for 3D or 4D tensors. See FluxML/Flux.jl#2184.

(Using Julia 1.8.3 on an A100 GPU.)

julia> conv(rand(Float16, 16, 16, 1, 1) |> gpu, rand(Float16, 3,3,1,1) |> gpu)
14×14×1×1 CuArray{Float16, 4, CUDA.Mem.DeviceBuffer}:
[...]
julia> conv(rand(Float16, 16, 16, 16, 1, 1) |> gpu, rand(Float16, 3,3,3, 1,1) |> gpu)
ERROR: CUDNNError: CUDNN_STATUS_NOT_SUPPORTED (code 9)
Stacktrace:
  [1] throw_api_error(res::cuDNN.cudnnStatus_t)
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:11
  [2] macro expansion
    @ /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:24 [inlined]
  [3] cudnnConvolutionForward(handle::Ptr{cuDNN.cudnnContext}, alpha::Base.RefValue{Float32}, xDesc::cuDNN.cudnnTensorDescriptor, x::CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, wDesc::cuDNN.cudnnFilterDescriptor, w::CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, convDesc::cuDNN.cudnnConvolutionDescriptor, algo::cuDNN.cudnnConvolutionFwdAlgo_t, workSpace::CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer}, workSpaceSizeInBytes::Int64, beta::Base.RefValue{Float32}, yDesc::cuDNN.cudnnTensorDescriptor, y::CuArray{Float16, 5, CUDA.Mem.DeviceBuffer})
    @ cuDNN /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:26
  [4] (::cuDNN.var"#1153#1155"{CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnActivationMode_t, cuDNN.cudnnConvolutionDescriptor, cuDNN.cudnnFilterDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnConvolutionFwdAlgoPerfStruct})(workspace::CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer})
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:105
@ToucheSir
Member

Can you set JULIA_DEBUG=CUDA and post the debug output after running the second conv call?
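For reference, a minimal, illustrative way to enable that output (the variable can also be exported in the shell before launching Julia):

julia> ENV["JULIA_DEBUG"] = "CUDA"   # enable @debug logging for the CUDA module
"CUDA"

julia> conv(rand(Float16, 16, 16, 16, 1, 1) |> gpu, rand(Float16, 3, 3, 3, 1, 1) |> gpu)   # rerun the failing call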

ToucheSir transferred this issue from FluxML/Flux.jl Feb 16, 2023
@nikopj
Contributor Author

nikopj commented Feb 17, 2023

julia> conv(rand(Float16, 16, 16, 16, 1, 1) |> gpu, rand(Float16, 3, 3, 3, 1, 1) |> gpu)
┌ Warning: No valid algorithm found, probably bad params for convolution.
└ @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:276
┌ Debug:  cuBLAS (v11.8) function cublasStatus_t cublasGetVersion_v2(cublasHandle_t, int*) called:
│   handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0x23907a00)
│   version: type=int; val=POINTER (IN HEX:0x0x7ffcee3e1fbc)
│  Time: 2023-02-16T23:53:39 elapsed from start 0.883333 minutes or 53.000000 seconds
│ Process=2461157; Thread=22746666410368; GPU=0; Handle=POINTER (IN HEX:0x0x23907a00); StreamId=POINTER (IN HEX:0x0x4db8580); MathMode=CUBLAS_DEFAULT_MATH
│  COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/cublas/CUBLAS.jl:224
ERROR: ┌ Debug:  cuBLAS (v11.8) function cublasStatus_t cublasGetVersion_v2(cublasHandle_t, int*) called:
│   handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0x23907a00)
│   version: type=int; val=POINTER (IN HEX:0x0x7ffcee3e1fbc)
│  Time: 2023-02-16T23:53:39 elapsed from start 0.883333 minutes or 53.000000 seconds
│ Process=2461157; Thread=22746666410368; GPU=0; Handle=POINTER (IN HEX:0x0x23907a00); StreamId=POINTER (IN HEX:0x0x4db8580); MathMode=CUBLAS_DEFAULT_MATH
│  COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/cublas/CUBLAS.jl:224
CUDNNError: ┌ Debug:  cuBLAS (v11.8) function cublasStatus_t cublasGetVersion_v2(cublasHandle_t, int*) called:
│   handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0x23907a00)
│   version: type=int; val=POINTER (IN HEX:0x0x7ffcee3e1fbc)
│  Time: 2023-02-16T23:53:39 elapsed from start 0.883333 minutes or 53.000000 seconds
│ Process=2461157; Thread=22746666410368; GPU=0; Handle=POINTER (IN HEX:0x0x23907a00); StreamId=POINTER (IN HEX:0x0x4db8580); MathMode=CUBLAS_DEFAULT_MATH
│  COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
└ @ CUDA.CUBLAS /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/cublas/CUBLAS.jl:224
CUDNN_STATUS_NOT_SUPPORTED┌ Debug:  cuBLAS (v11.8) function cublasStatus_t cublasGetVersion_v2(cublasHandle_t, int*) called:
│   handle: type=cublasHandle_t; val=POINTER (IN HEX:0x0x23907a00)
│   version: type=int; val=POINTER (IN HEX:0x0x7ffcee3e1fbc)
│  Time: 2023-02-16T23:53:39 elapsed from start 0.883333 minutes or 53.000000 seconds
│ Process=2461157; Thread=22746666410368; GPU=0; Handle=POINTER (IN HEX:0x0x23907a00); StreamId=POINTER (IN HEX:0x0x4db8580); MathMode=CUBLAS_DEFAULT_MATH
│  COMPILED WITH: GNU GCC/G++ / 6.3.1 20170216 (Red Hat 6.3.1-3)
│
└ @ CUDA.CUBLAS /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/cublas/CUBLAS.jl:224
 (code 9)
Stacktrace:
  [1] throw_api_error(res::cuDNN.cudnnStatus_t)
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:11
  [2] macro expansion
    @ /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:24 [inlined]
  [3] cudnnConvolutionForward(handle::Ptr{cuDNN.cudnnContext}, alpha::Base.RefValue{Float32}, xDesc::cuDNN.cudnnTensorDescriptor, x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, wDesc::cuDNN.cudnnFilterDescriptor, w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, convDesc::cuDNN.cudnnConvolutionDescriptor, algo::cuDNN.cudnnConvolutionFwdAlgo_t, workSpace::CUDA.CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer}, workSpaceSizeInBytes::Int64, beta::Base.RefValue{Float32}, yDesc::cuDNN.cudnnTensorDescriptor, y::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer})
    @ cuDNN /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:26
  [4] (::cuDNN.var"#1153#1155"{CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnActivationMode_t, cuDNN.cudnnConvolutionDescriptor, cuDNN.cudnnFilterDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnConvolutionFwdAlgoPerfStruct})(workspace::CUDA.CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer})
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:105
  [5] with_workspace(f::cuDNN.var"#1153#1155"{CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnActivationMode_t, cuDNN.cudnnConvolutionDescriptor, cuDNN.cudnnFilterDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnConvolutionFwdAlgoPerfStruct}, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing; keep::Bool)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:77
  [6] with_workspace(f::Function, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:56
  [7] #with_workspace#1
    @ /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:53 [inlined]
  [8] with_workspace(f::Function, size::UInt64, fallback::Nothing) (repeats 2 times)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:53
  [9] cudnnConvolutionForwardAD(w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, bias::Nothing, z::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}; y::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, activation::cuDNN.cudnnActivationMode_t, convDesc::cuDNN.cudnnConvolutionDescriptor, wDesc::cuDNN.cudnnFilterDescriptor, xDesc::cuDNN.cudnnTensorDescriptor, yDesc::cuDNN.cudnnTensorDescriptor, zDesc::cuDNN.cudnnTensorDescriptor, biasDesc::Nothing, alpha::Base.RefValue{Float32}, beta::Base.RefValue{Float32}, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any}, dready::Base.RefValue{Bool})
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:103
 [10] cudnnConvolutionForwardWithDefaults(w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}; padding::Int64, stride::Int64, dilation::Int64, mode::cuDNN.cudnnConvolutionMode_t, mathType::cuDNN.cudnnMathType_t, reorderType::cuDNN.cudnnReorderType_t, group::Int64, format::cuDNN.cudnnTensorFormat_t, convDesc::cuDNN.cudnnConvolutionDescriptor, xDesc::cuDNN.cudnnTensorDescriptor, wDesc::cuDNN.cudnnFilterDescriptor, y::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, yDesc::cuDNN.cudnnTensorDescriptor, alpha::Int64, beta::Int64, bias::Nothing, z::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, biasDesc::Nothing, zDesc::cuDNN.cudnnTensorDescriptor, activation::cuDNN.cudnnActivationMode_t, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any})
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:96
 [11] #cudnnConvolutionForward!#1150
    @ /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:53 [inlined]
 [12] conv!(y::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{3, 3, 3, 6, 3}; alpha::Int64, beta::Int64, algo::Int64)
    @ NNlibCUDA /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:67
 [13] conv!
    @ /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:58 [inlined]
 [14] #conv#233
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:88 [inlined]
 [15] conv
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:83 [inlined]
 [16] #conv#231
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:56 [inlined]
 [17] conv(x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer})
    @ NNlib /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:50
 [18] top-level scope
    @ REPL[3]:1
 [19] top-level scope
    @ /scratch/npj226/.julia/packages/CUDA/ZdCxS/src/initialization.jl:155

@nikopj
Contributor Author

nikopj commented Feb 17, 2023

There is a similar error when taking gradients through conv with Float16 inputs, for 3D/4D/5D tensors alike.

julia> w = rand(Float16, 3, 1, 1) |> gpu;

julia> gradient(x->sum(conv(x, w)), rand(Float16, 16, 1, 1) |> gpu)
┌ Warning: CuDNN (v8600) function cudnnGetConvolutionForwardAlgorithmMaxCount() called:
│     Info: Traceback contains 44 message(s)
│         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: false == cudnn::cnn::isForwardSupported(handle, xDesc, wDesc, cDesc, yDesc, algo)
│         Warning: CUDNN_STATUS_NOT_SUPPORTED; Reason: T_ENGINEMAP::isLegacyAlgoSupported(handle, xDesc, wDesc, cDesc, yDesc, algo)
   [...]
│ Time: 2023-02-16T23:53:39.684290 (0d+0h+0m+48s since start)
│ Process=2461157; Thread=2461157; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/cuDNN.jl:151
ERROR: CUDNNError: CUDNN_STATUS_BAD_PARAM (code 3)
Stacktrace:
  [1] throw_api_error(res::cuDNN.cudnnStatus_t)
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:11
  [2] macro expansion
    @ /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:24 [inlined]
  [3] cudnnConvolutionBackwardFilter(handle::Ptr{cuDNN.cudnnContext}, alpha::Base.RefValue{Float32}, xDesc::cuDNN.cudnnTensorDescriptor, x::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, dyDesc::cuDNN.cudnnTensorDescriptor, dy::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, convDesc::cuDNN.cudnnConvolutionDescriptor, algo::cuDNN.cudnnConvolutionBwdFilterAlgo_t, workSpace::CUDA.CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer}, workSpaceSizeInBytes::Int64, beta::Base.RefValue{Float32}, dwDesc::cuDNN.cudnnFilterDescriptor, dw::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer})
    @ cuDNN /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:26
  [4] #36
    @ /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:120 [inlined]
  [5] with_workspace(f::NNlibCUDA.var"#36#38"{Base.RefValue{Float32}, Base.RefValue{Float32}, CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnConvolutionBwdFilterAlgoPerfStruct, cuDNN.cudnnFilterDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnConvolutionDescriptor}, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing; keep::Bool)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:77
  [6] with_workspace(f::Function, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:56
  [7] #with_workspace#1
    @ /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:53 [inlined]
  [8] with_workspace(f::Function, size::UInt64, fallback::Nothing) (repeats 2 times)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:53
  [9] ∇conv_filter!(dw::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, dy::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{1, 1, 1, 2, 1}; alpha::Int64, beta::Int64, algo::Int64)
    @ NNlibCUDA /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:119
 [10] ∇conv_filter!
    @ /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:107 [inlined]
 [11] #∇conv_filter#237
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:112 [inlined]
 [12] ∇conv_filter
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:107 [inlined]
 [13] #375
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:351 [inlined]
 [14] unthunk
    @ /scratch/npj226/.julia/packages/ChainRulesCore/a4mIA/src/tangent_types/thunks.jl:204 [inlined]
 [15] wrap_chainrules_output
    @ /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/chainrules.jl:110 [inlined]
 [16] map
    @ ./tuple.jl:223 [inlined]
 [17] map
    @ ./tuple.jl:224 [inlined]
 [18] wrap_chainrules_output
    @ /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/chainrules.jl:111 [inlined]
 [19] ZBack
    @ /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/chainrules.jl:211 [inlined]
 [20] Pullback
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:56 [inlined]
 [21] (::typeof((#conv#231)))(Δ::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/interface2.jl:0
 [22] Pullback
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:50 [inlined]
 [23] Pullback
    @ ./REPL[27]:1 [inlined]
 [24] (::Zygote.var"#60#61"{typeof((#30))})(Δ::Float16)
    @ Zygote /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/interface.jl:45
 [25] gradient(::Function, ::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, ::Vararg{CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}})
    @ Zygote /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/interface.jl:97
 [26] top-level scope
    @ REPL[27]:1
 [27] top-level scope
    @ /scratch/npj226/.julia/packages/CUDA/ZdCxS/src/initialization.jl:155

@ToucheSir
Member

Ok, that means this might be easier to solve, then, if it's not dimension-specific. I also forgot that the cuDNN functionality had been spun off into its own package, sorry. Do you mind rerunning the test with JULIA_DEBUG=cuDNN instead?

@nikopj
Contributor Author

nikopj commented Feb 17, 2023

Ok, JULIA_DEBUG=cuDNN for the 5D conv and 3D gradient cases:

julia> conv(rand(Float16, 16, 16, 16, 1, 1) |> gpu, rand(Float16, 3, 3, 3, 1, 1) |> gpu)
┌ Warning: No valid algorithm found, probably bad params for convolution.
└ @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:276
┌ Debug: CuDNN (v8600) function cudnnCreateConvolutionDescriptor() called:
│     convDesc: location=host; addr=0x1526f7168c80;
│ Time: 2023-02-17T00:25:16.053149 (0d+0h+0m+40s since start)
│ Process=2464303; Thread=2464303; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/cuDNN.jl:149
ERROR: CUDNNError: CUDNN_STATUS_NOT_SUPPORTED┌ Debug: CuDNN (v8600) function cudnnSetConvolutionNdDescriptor() called:
│     convDesc: location=host; addr=0x75faa90;
│     arrayLength: type=int; val=2;
│     padA: type=int; val=[0,0];
│     strideA: type=int; val=[1,1];
│     dilationA: type=int; val=[1,1];
│     mode: type=cudnnConvolutionMode_t; val=CUDNN_CONVOLUTION (0);
│     dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
│ Time: 2023-02-17T00:25:16.118505 (0d+0h+0m+40s since start)
│ Process=2464303; Thread=2464303; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/cuDNN.jl:149
 (code 9)
Stacktrace:
  [1] throw_api_error(res::cuDNN.cudnnStatus_t)
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:11
  [2] macro expansion
    @ /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:24 [inlined]
  [3] cudnnConvolutionForward(handle::Ptr{cuDNN.cudnnContext}, alpha::Base.RefValue{Float32}, xDesc::cuDNN.cudnnTensorDescriptor, x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, wDesc::cuDNN.cudnnFilterDescriptor, w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, convDesc::cuDNN.cudnnConvolutionDescriptor, algo::cuDNN.cudnnConvolutionFwdAlgo_t, workSpace::CUDA.CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer}, workSpaceSizeInBytes::Int64, beta::Base.RefValue{Float32}, yDesc::cuDNN.cudnnTensorDescriptor, y::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer})
    @ cuDNN /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:26
  [4] (::cuDNN.var"#1153#1155"{CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnActivationMode_t, cuDNN.cudnnConvolutionDescriptor, cuDNN.cudnnFilterDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnConvolutionFwdAlgoPerfStruct})(workspace::CUDA.CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer})
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:105
  [5] with_workspace(f::cuDNN.var"#1153#1155"{CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnActivationMode_t, cuDNN.cudnnConvolutionDescriptor, cuDNN.cudnnFilterDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnTensorDescriptor, Base.RefValue{Float32}, Base.RefValue{Float32}, CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnConvolutionFwdAlgoPerfStruct}, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing; keep::Bool)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:77
  [6] with_workspace(f::Function, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:56
  [7] #with_workspace#1
    @ /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:53 [inlined]
  [8] with_workspace(f::Function, size::UInt64, fallback::Nothing) (repeats 2 times)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:53
  [9] cudnnConvolutionForwardAD(w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, bias::Nothing, z::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}; y::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, activation::cuDNN.cudnnActivationMode_t, convDesc::cuDNN.cudnnConvolutionDescriptor, wDesc::cuDNN.cudnnFilterDescriptor, xDesc::cuDNN.cudnnTensorDescriptor, yDesc::cuDNN.cudnnTensorDescriptor, zDesc::cuDNN.cudnnTensorDescriptor, biasDesc::Nothing, alpha::Base.RefValue{Float32}, beta::Base.RefValue{Float32}, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any}, dready::Base.RefValue{Bool})
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:103
 [10] cudnnConvolutionForwardWithDefaults(w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}; padding::Int64, stride::Int64, dilation::Int64, mode::cuDNN.cudnnConvolutionMode_t, mathType::cuDNN.cudnnMathType_t, reorderType::cuDNN.cudnnReorderType_t, group::Int64, format::cuDNN.cudnnTensorFormat_t, convDesc::cuDNN.cudnnConvolutionDescriptor, xDesc::cuDNN.cudnnTensorDescriptor, wDesc::cuDNN.cudnnFilterDescriptor, y::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, yDesc::cuDNN.cudnnTensorDescriptor, alpha::Int64, beta::Int64, bias::Nothing, z::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, biasDesc::Nothing, zDesc::cuDNN.cudnnTensorDescriptor, activation::cuDNN.cudnnActivationMode_t, dw::Base.RefValue{Any}, dx::Base.RefValue{Any}, dz::Base.RefValue{Any}, dbias::Base.RefValue{Any})
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:96
 [11] #cudnnConvolutionForward!#1150
    @ /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/convolution.jl:53 [inlined]
 [12] conv!(y::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{3, 3, 3, 6, 3}; alpha::Int64, beta::Int64, algo::Int64)
    @ NNlibCUDA /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:67
 [13] conv!
    @ /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:58 [inlined]
 [14] #conv#233
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:88 [inlined]
 [15] conv
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:83 [inlined]
 [16] #conv#231
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:56 [inlined]
 [17] conv(x::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer}, w::CUDA.CuArray{Float16, 5, CUDA.Mem.DeviceBuffer})
    @ NNlib /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:50
 [18] top-level scope
    @ REPL[4]:1
 [19] top-level scope
    @ /scratch/npj226/.julia/packages/CUDA/ZdCxS/src/initialization.jl:155

julia> w = rand(Float16, 3, 1, 1) |> gpu;

julia> gradient(x->sum(conv(x, w)), rand(Float16, 16, 1, 1) |> gpu)
┌ Debug: CuDNN (v8600) function cudnnSetConvolutionMathType() called:
│     convDesc: location=host; addr=0x75faa90;
│     mathType: type=cudnnMathType_t; val=CUDNN_TENSOR_OP_MATH (1);
│ Time: 2023-02-17T00:25:16.118532 (0d+0h+0m+40s since start)
│ Process=2464303; Thread=2464303; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/cuDNN.jl:149
ERROR: ┌ Debug: CuDNN (v8600) function cudnnCreateTensorDescriptor() called:
│ Time: 2023-02-17T00:25:16.237211 (0d+0h+0m+40s since start)
│ Process=2464303; Thread=2464303; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/cuDNN.jl:149
CUDNNError: ┌ Debug: CuDNN (v8600) function cudnnSetTensorNdDescriptorEx() called:
│     format: type=cudnnTensorFormat_t; val=CUDNN_TENSOR_NCHW (0);
│     dataType: type=cudnnDataType_t; val=CUDNN_DATA_HALF (2);
│     nbDims: type=int; val=4;
│     dimA: type=int; val=[1,1,16,16];
│ Time: 2023-02-17T00:25:16.252704 (0d+0h+0m+40s since start)
│ Process=2464303; Thread=2464303; GPU=NULL; Handle=NULL; StreamId=NULL.
└ @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/cuDNN.jl:149
CUDNN_STATUS_BAD_PARAM (code 3)
Stacktrace:
  [1] throw_api_error(res::cuDNN.cudnnStatus_t)
    @ cuDNN /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:11
  [2] macro expansion
    @ /scratch/npj226/.julia/packages/cuDNN/7X4E7/src/libcudnn.jl:24 [inlined]
  [3] cudnnConvolutionBackwardFilter(handle::Ptr{cuDNN.cudnnContext}, alpha::Base.RefValue{Float32}, xDesc::cuDNN.cudnnTensorDescriptor, x::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, dyDesc::cuDNN.cudnnTensorDescriptor, dy::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, convDesc::cuDNN.cudnnConvolutionDescriptor, algo::cuDNN.cudnnConvolutionBwdFilterAlgo_t, workSpace::CUDA.CuArray{UInt8, 1, CUDA.Mem.DeviceBuffer}, workSpaceSizeInBytes::Int64, beta::Base.RefValue{Float32}, dwDesc::cuDNN.cudnnFilterDescriptor, dw::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer})
    @ cuDNN /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:26
  [4] #36
    @ /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:120 [inlined]
  [5] with_workspace(f::NNlibCUDA.var"#36#38"{Base.RefValue{Float32}, Base.RefValue{Float32}, CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, cuDNN.cudnnConvolutionBwdFilterAlgoPerfStruct, cuDNN.cudnnFilterDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnTensorDescriptor, cuDNN.cudnnConvolutionDescriptor}, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing; keep::Bool)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:77
  [6] with_workspace(f::Function, eltyp::Type{UInt8}, size::CUDA.APIUtils.var"#2#3"{UInt64}, fallback::Nothing)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:56
  [7] #with_workspace#1
    @ /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:53 [inlined]
  [8] with_workspace(f::Function, size::UInt64, fallback::Nothing) (repeats 2 times)
    @ CUDA.APIUtils /scratch/npj226/.julia/packages/CUDA/ZdCxS/lib/utils/call.jl:53
  [9] ∇conv_filter!(dw::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, x::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, dy::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer}, cdims::DenseConvDims{1, 1, 1, 2, 1}; alpha::Int64, beta::Int64, algo::Int64)
    @ NNlibCUDA /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:119
 [10] ∇conv_filter!
    @ /scratch/npj226/.julia/packages/NNlibCUDA/C6t0p/src/cudnn/conv.jl:107 [inlined]
 [11] #∇conv_filter#237
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:112 [inlined]
 [12] ∇conv_filter
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:107 [inlined]
 [13] #375
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:351 [inlined]
 [14] unthunk
    @ /scratch/npj226/.julia/packages/ChainRulesCore/a4mIA/src/tangent_types/thunks.jl:204 [inlined]
 [15] wrap_chainrules_output
    @ /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/chainrules.jl:110 [inlined]
 [16] map
    @ ./tuple.jl:223 [inlined]
 [17] map
    @ ./tuple.jl:224 [inlined]
 [18] wrap_chainrules_output
    @ /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/chainrules.jl:111 [inlined]
 [19] ZBack
    @ /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/chainrules.jl:211 [inlined]
 [20] Pullback
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:56 [inlined]
 [21] (::typeof((#conv#231)))(Δ::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/interface2.jl:0
 [22] Pullback
    @ /scratch/npj226/.julia/packages/NNlib/TZPiH/src/conv.jl:50 [inlined]
 [23] Pullback
    @ ./REPL[6]:1 [inlined]
 [24] (::Zygote.var"#60#61"{typeof((#3))})(Δ::Float16)
    @ Zygote /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/interface.jl:45
 [25] gradient(f::Function, args::CUDA.CuArray{Float16, 3, CUDA.Mem.DeviceBuffer})
    @ Zygote /scratch/npj226/.julia/packages/Zygote/g2w9o/src/compiler/interface.jl:97
 [26] top-level scope
    @ REPL[6]:1
 [27] top-level scope
    @ /scratch/npj226/.julia/packages/CUDA/ZdCxS/src/initialization.jl:155

ToucheSir added the bug label Feb 23, 2023
@ToucheSir
Member

ToucheSir commented Feb 24, 2023

I've been looking into this but haven't found anything conclusive yet. Can you test with NNlibCUDA v0.2.6 and see if it has the same issue? Verifying whether it's a CUDA lib version issue should help us narrow down the possibilities significantly.

Edit: just tested myself and hit the same issue. This is strange, because when I log all the descriptors everything looks fine, yet for whatever reason the algo search at https://github.com/JuliaGPU/CUDA.jl/blob/a70c83e2cbe978873a7aa74f2493838b509aa42c/lib/cudnn/src/convolution.jl#L193 returns CUDNN_STATUS_NOT_SUPPORTED. It feels like I'm missing something blindingly obvious, but I'm not sure what; nothing stands out in the cuDNN docs.

Edit2: right after I posted the last edit, I realized that Table 30 under https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnConvolutionForward notes that 3D convs only support PSEUDO_HALF_CONFIG and not TRUE_HALF_CONFIG, whereas 2D convs (Table 29) support both. The main difference is that we'd have to set the conv descriptor's dataType to CUDNN_DATA_FLOAT instead of CUDNN_DATA_HALF. This is currently matched to the eltype of x in https://github.com/JuliaGPU/CUDA.jl/blob/a70c83e2cbe978873a7aa74f2493838b509aa42c/lib/cudnn/src/convolution.jl#L69, and my question is whether it makes more sense to have cuDNN.jl or NNlibCUDA check for this (cc @maleadt for thoughts).

P.S. @mcabbott you may be interested in Tables like no. 25 and 26 in https://docs.nvidia.com/deeplearning/cudnn/api/index.html. We were wondering what mixtures of datatypes people might use in the wild and I think such tables provide a pretty exhaustive list.
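In the meantime, here is a minimal user-side workaround sketch consistent with the PSEUDO_HALF observation: promote the Float16 arrays to Float32 before the 3D conv and cast the result back. This is not the descriptor-level fix discussed above (it makes Float32 copies of the data rather than only changing the compute type), and the helper name is made up for illustration:

using Flux, CUDA  # assumes the same environment as the snippets above (Flux + CUDA + NNlibCUDA)

# Hypothetical helper, not part of NNlib: compute a Float16 conv in Float32,
# since cuDNN has no TRUE_HALF algorithms for 3D (5-dimensional) convolutions.
function conv_pseudo_half(x::CuArray{Float16}, w::CuArray{Float16}; kwargs...)
    y = conv(Float32.(x), Float32.(w); kwargs...)
    return Float16.(y)
end

x = rand(Float16, 16, 16, 16, 1, 1) |> gpu;
w = rand(Float16, 3, 3, 3, 1, 1) |> gpu;
conv_pseudo_half(x, w)  # 14×14×14×1×1 CuArray{Float16, 5, ...}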

@nikopj
Contributor Author

nikopj commented May 4, 2023

@ToucheSir I'm back to being able to help (busy semester). Do you still want a test with NNlibCUDA v0.2.6?

@ToucheSir
Member

No, per the edits in the above post I think I've reproduced it. Re-reading the CUDA.jl -> NNlibCUDA integration code, I think https://github.com/FluxML/NNlibCUDA.jl/blob/82ba6cb4ef6c6ed11d93c6bd7e72a8eb3cb2234a/src/cudnn/conv.jl#L46-L56 would have to be special-cased for 3D convs + Float16 inputs. Two main driving questions there: is it fine to do this silently, without warning users or letting them opt for an error, and what is the least tedious way to do it (I don't want to hard-code all the valid configurations in Tables 26-30 unless absolutely necessary)?
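To make the special-casing idea concrete, here is a rough, untested sketch of the shape such a method could take. It is not the linked conv.jl code: it sidesteps the descriptor question entirely by promoting the arrays to Float32 and writing the result back into the Float16 output, trading memory for simplicity, whereas the real fix would presumably keep the data in Float16 and only set the convolution descriptor's dataType to CUDNN_DATA_FLOAT:

using CUDA, NNlib  # assumes NNlibCUDA is also loaded, so the Float32 call below hits the cuDNN path

# Sketch only: catch 5-dimensional Float16 CuArrays (3D convs), where cuDNN has
# no TRUE_HALF algorithms, and fall back to Float32 compute.
function NNlib.conv!(y::CuArray{Float16,5}, x::CuArray{Float16,5},
                     w::CuArray{Float16,5}, cdims::NNlib.DenseConvDims; kwargs...)
    # One could emit a one-time @warn here instead of promoting silently.
    y32 = similar(y, Float32)
    NNlib.conv!(y32, Float32.(x), Float32.(w), cdims; kwargs...)
    y .= y32   # cast the result back to Float16
    return y
end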

CarloLucibello transferred this issue from FluxML/NNlibCUDA.jl Jun 24, 2023