
NNlibCUDA Heisenbug in conv! with nonzero beta #37

Closed
jw3126 opened this issue Apr 8, 2021 · 7 comments · Fixed by JuliaGPU/CUDA.jl#1536
Comments

jw3126 commented Apr 8, 2021

See JuliaGPU/CUDA.jl#736

Describe the bug

When using the beta keyword of NNlib.conv! on a CuArray, there are rare, possibly non-deterministic, absurd results.
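
For reference, the alpha/beta keywords follow the usual cuDNN/BLAS-style accumulation convention, so with nonzero beta the previous contents of y matter (a sketch of the intended semantics, not the implementation):

# Intended semantics of conv! with the alpha/beta keywords:
#   y .= alpha .* conv(x, w, cdims) .+ beta .* y
# With beta = 1f0, whatever is already in y must be preserved and added to.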

To reproduce
Run the following in a fresh Julia session:

using CUDA
using NNlib
x_cpu = fill(1f0, 1,1,1)
w_cpu = fill(1f0, 1,1,1)
x_gpu = CuArray(x_cpu)
w_gpu = CuArray(w_cpu)
cdims = NNlib.DenseConvDims(x_cpu, w_cpu)
y_cpu = fill(0f0, 1,1,1)
y_gpu = CuArray(y_cpu)
NNlib.conv!(y_cpu, x_cpu, w_cpu, cdims, alpha=1f0, beta=1f0)  # expected: 1*conv(x,w) + 1*0 = 1
NNlib.conv!(y_gpu, x_gpu, w_gpu, cdims, alpha=1f0, beta=1f0)  # same call on the GPU
@show y_cpu
@show y_gpu

y_cpu = Float32[1.0]
y_gpu = Float32[2.0]

If I run it again, y_gpu gives the correct result. Fuzz testing suggests that, for a given set of array sizes, at least the first conv! operation goes wrong. It may not be only the first operation that goes wrong, but the first one goes wrong reliably.

using NNlib
using CUDA
using Test
using LinearAlgebra

function fuzz(;max_fails, max_iter)
    fails = 0
    for i in 1:max_iter
        nspacedims = rand(1:1)  # fixed at one spatial dimension; widen the range to fuzz more shapes
        spacesize  = tuple(rand(1:3, nspacedims)...)
        nb         = rand(1:3)
        ncin       = rand(1:3)
        ncout      = rand(1:3)
        
        x_cpu = randn(Float32, spacesize...,ncin, nb)
        kernel_size = ntuple(_->1, nspacedims)
        w_cpu = randn(Float32, kernel_size...,ncin, ncout)
        x_gpu = CuArray(x_cpu)
        w_gpu = CuArray(w_cpu)
        
        cdims = NNlib.DenseConvDims(x_cpu, w_cpu)
        y_cpu = randn(Float32, spacesize...,ncout, nb)
        y_gpu = CuArray(y_cpu)
        NNlib.conv!(y_cpu, x_cpu, w_cpu, cdims, alpha=1f0, beta=1f0)
        NNlib.conv!(y_gpu, x_gpu, w_gpu, cdims, alpha=1f0, beta=1f0)
        if !(collect(y_gpu) ≈ y_cpu)
            @show i
            #@show x_cpu
            #@show x_gpu
            #@show w_cpu
            #@show w_gpu
            #@show y_cpu
            #@show y_gpu
            @show size(x_cpu)
            @show size(w_cpu)
            @show norm(collect(y_gpu) - y_cpu)
            fails += 1
        end
        if fails >= max_fails
            break
        end
    end
    @show fails
end

fuzz(max_fails=1000, max_iter=10000)

I am using the current master of CUDA.jl.

Manifest.toml

[[AbstractFFTs]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "8ed9de2f1b1a9b1dee48582ad477c6e67b83eb2c"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "1.0.0"

[[Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "ffcfa2d345aaee0ef3d8346a073d5dd03c983ebe"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "3.2.0"

[[ArgTools]]
uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f"

[[Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

[[BFloat16s]]
deps = ["LinearAlgebra", "Test"]
git-tree-sha1 = "4af69e205efc343068dc8722b8dfec1ade89254a"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.1.0"

[[Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"

[[CEnum]]
git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.1"

[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "MacroTools", "Memoize", "NNlib", "Printf", "Random", "Reexport", "Requires", "SparseArrays", "Statistics", "TimerOutputs"]
git-tree-sha1 = "d891e403471f04266c80a03ecf247d9aff6e7879"
repo-rev = "master"
repo-url = "https://github.com/JuliaGPU/CUDA.jl.git"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "2.6.0"

[[ChainRulesCore]]
deps = ["Compat", "LinearAlgebra", "SparseArrays"]
git-tree-sha1 = "de4f08843c332d355852721adb1592bce7924da3"
uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
version = "0.9.29"

[[Compat]]
deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"]
git-tree-sha1 = "919c7f3151e79ff196add81d7f4e45d91bbf420b"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "3.25.0"

[[CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"

[[DataStructures]]
deps = ["Compat", "InteractiveUtils", "OrderedCollections"]
git-tree-sha1 = "4437b64df1e0adccc3e5d1adbc3ac741095e4677"
uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
version = "0.18.9"

[[Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[DelimitedFiles]]
deps = ["Mmap"]
uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab"

[[Distributed]]
deps = ["Random", "Serialization", "Sockets"]
uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b"

[[Downloads]]
deps = ["ArgTools", "LibCURL", "NetworkOptions"]
uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6"

[[ExprTools]]
git-tree-sha1 = "10407a39b87f29d47ebaca8edbc75d7c302ff93e"
uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04"
version = "0.1.3"

[[GPUArrays]]
deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization"]
git-tree-sha1 = "f99a25fe0313121f2f9627002734c7d63b4dd3bd"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "6.2.0"

[[GPUCompiler]]
deps = ["DataStructures", "ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "Scratch", "Serialization", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "ef2839b063e158672583b9c09d2cf4876a8d3d55"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.10.0"

[[InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"

[[LLVM]]
deps = ["CEnum", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "b616937c31337576360cb9fb872ec7633af7b194"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "3.6.0"

[[LazyArtifacts]]
deps = ["Artifacts", "Pkg"]
uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3"

[[LibCURL]]
deps = ["LibCURL_jll", "MozillaCACerts_jll"]
uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21"

[[LibCURL_jll]]
deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"]
uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0"

[[LibGit2]]
deps = ["Base64", "NetworkOptions", "Printf", "SHA"]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"

[[LibSSH2_jll]]
deps = ["Artifacts", "Libdl", "MbedTLS_jll"]
uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8"

[[Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[LinearAlgebra]]
deps = ["Libdl"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[MacroTools]]
deps = ["Markdown", "Random"]
git-tree-sha1 = "6a8a2a625ab0dea913aba95c11370589e0239ff0"
uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
version = "0.5.6"

[[Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"

[[MbedTLS_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1"

[[Memoize]]
deps = ["MacroTools"]
git-tree-sha1 = "2b1dfcba103de714d31c033b5dacc2e4a12c7caa"
uuid = "c03570c3-d221-55d1-a50c-7939bbd78826"
version = "0.4.4"

[[Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[MozillaCACerts_jll]]
uuid = "14a3606d-f60d-562e-9121-12d972cd8159"

[[NNlib]]
deps = ["ChainRulesCore", "Compat", "LinearAlgebra", "Pkg", "Requires", "Statistics"]
git-tree-sha1 = "df42d0816edfc24f5b82a728f46381613c4dff79"
uuid = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
version = "0.7.14"

[[NetworkOptions]]
uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908"

[[OrderedCollections]]
git-tree-sha1 = "4fa2ba51070ec13fcc7517db714445b4ab986bdf"
uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
version = "1.4.0"

[[Pkg]]
deps = ["Artifacts", "Dates", "Downloads", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"

[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[REPL]]
deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"]
uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[Random]]
deps = ["Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[Reexport]]
git-tree-sha1 = "57d8440b0c7d98fc4f889e478e80f268d534c9d5"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "1.0.0"

[[Requires]]
deps = ["UUIDs"]
git-tree-sha1 = "cfbac6c1ed70c002ec6361e7fd334f02820d6419"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "1.1.2"

[[SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"

[[Scratch]]
deps = ["Dates"]
git-tree-sha1 = "ad4b278adb62d185bbcb6864dc24959ab0627bf6"
uuid = "6c6a2e73-6563-6170-7368-637461726353"
version = "1.0.3"

[[Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[SharedArrays]]
deps = ["Distributed", "Mmap", "Random", "Serialization"]
uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383"

[[Sockets]]
uuid = "6462fe0b-24de-5631-8697-dd941f90decc"

[[SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"

[[Tar]]
deps = ["ArgTools", "SHA"]
uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e"

[[Test]]
deps = ["InteractiveUtils", "Logging", "Random", "Serialization"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[TimerOutputs]]
deps = ["Printf"]
git-tree-sha1 = "3318281dd4121ecf9713ce1383b9ace7d7476fdd"
uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
version = "0.5.7"

[[UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[Zlib_jll]]
deps = ["Libdl"]
uuid = "83775a58-1f1d-513f-b197-d71354ab007a"

[[nghttp2_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d"

Version info

Details on Julia:

Julia Version 1.6.0-rc1
Commit a58bdd9010 (2021-02-06 15:49 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen 9 3900 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
Environment:
  JULIA_NUM_THREADS = 24

Details on CUDA:

CUDA toolkit 11.2.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.32.3

Libraries: 
- CUBLAS: 11.4.1
- CURAND: 10.2.3
- CUFFT: 10.4.0
- CUSOLVER: 11.1.0
- CUSPARSE: 11.4.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.32.3
- CUDNN: 8.10.0 (for CUDA 11.2.0)
- CUTENSOR: 1.2.2 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-rc1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Preferences:
- Memory pool: None
- Async allocation: true

1 device:
  0: GeForce RTX 2060 (sm_75, 4.931 GiB / 5.787 GiB available)

ToucheSir transferred this issue from FluxML/NNlib.jl on Jan 9, 2022

maxfreu (Contributor) commented May 31, 2022

I can replicate it with:

[052768ef] CUDA v3.10.1
[872c559c] NNlib v0.8.3
[a00861dc] NNlibCUDA v0.2.3
CUDA 11.4.0
CUDNN 8.30.2
julia-1.6.3

ToucheSir (Member) commented

The first call to conv! is special because it will reliably trigger an algorithm search. What happens if you go a level lower and call the CUDA.jl functions?

maxfreu (Contributor) commented Jun 2, 2022

This happens:

using CUDA, NNlibCUDA, NNlib
x = CUDA.ones(1,1,1)
w = CUDA.ones(1,1,1)
y = CUDA.zeros(1,1,1)
cdims = NNlib.DenseConvDims(x,w)
d, x, _ = NNlibCUDA.cudnnConvolutionDescriptorAndPaddedInput(cdims, x)
CUDA.CUDNN.cudnnConvolutionForward!(y, w, x, d; alpha=1f0, beta=1f0, z=y) # returns [2.0] (wrong)
y = CUDA.zeros(1,1,1)
CUDA.CUDNN.cudnnConvolutionForward!(y, w, x, d; alpha=1f0, beta=1f0, z=y) # returns [1.0] (correct)

I suspect that the crux is here. Instead of y, a similar array should be allocated and used in cudnnFindConvolutionForwardAlgorithmEx. @maleadt, what do you think? The cuDNN docs state that the contents of y will be overwritten with arbitrary values during the algorithm search.
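
A minimal sketch of that change (argument names follow the cuDNN C API for cudnnFindConvolutionForwardAlgorithmEx; the surrounding handles, descriptors, and workspace are assumed, not quoted from CUDA.jl):

# Hedged sketch, not the actual CUDA.jl code: hand the algorithm search a
# throwaway buffer so it cannot clobber the caller's y.
y_scratch = similar(y)   # the search may leave arbitrary values in here
cudnnFindConvolutionForwardAlgorithmEx(handle, xDesc, x, wDesc, w, convDesc,
                                       yDesc, y_scratch,   # previously: y
                                       requestedAlgoCount, returnedAlgoCount,
                                       perfResults, workspace, sizeof(workspace))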

ToucheSir (Member) commented

I would expect y to be re-overwritten during the actual forward pass, though? Otherwise, using similar to allocate that array would lead to strange results for all conv calls, not just the first one.

maxfreu (Contributor) commented Jun 3, 2022

We have to differentiate between the actual convolution and the algorithm search. The convolution needs an output buffer with well-defined contents (zero-initialized in the example above), which it accumulates into. The algorithm search also needs an output buffer for its benchmark, but at the end the values in that buffer are arbitrary. If you use the same buffer for both, the search leaves arbitrary values in it and the convolution then adds to them, producing garbage. As you said, subsequent calls skip the algorithm search, which makes this a semi-heisenbug. Using this branch https://github.com/maxfreu/CUDA.jl/tree/conv-algosearch I get zero errors in the fuzzer.
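
To make the failure mode concrete (illustrative numbers only; in the 1x1x1 repro the search plausibly leaves the conv result in the buffer, which would explain the observed 2.0):

# Illustrative only: the real call reusing the buffer the search scribbled on.
alpha, beta = 1f0, 1f0
conv_result = 1f0   # true convolution output in the 1x1x1 example
leftover    = 1f0   # arbitrary value left in y by the algorithm search
y = alpha * conv_result + beta * leftover   # 2.0 instead of the expected 1.0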

ToucheSir (Member) commented

That's what I was missing: y and dy are accumulated into, rather than completely overwritten, when beta is non-zero. Can we tweak the call chain so that an extra output buffer is only allocated for the algorithm search in that case? My worry is that the search is already causing OOMs for users, so allocating more for it when not required is not ideal.

maxfreu (Contributor) commented Jun 3, 2022

Hmm, I wouldn't have expected the search to cause OOMs, as the buffer should be freed right after the search. Apart from the input, weight, and output tensors, the search also needs a "workspace", the size of which is calculated here. It already seems to be quite small, but I haven't thought it through; maybe not small enough? Anyway, it should be possible to allocate only if beta != 0, assuming that is the only case in which y is accumulated into. Should I mark the PR as draft?
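
A hedged sketch of that conditional allocation (illustrative, not the PR's code):

# Only pay for the extra buffer when y's contents must survive the search,
# i.e. when the real call will accumulate into y.
y_search = beta == 0 ? y : similar(y)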
