
NNlibCUDA Heisenbug in conv! with nonzero beta #37

Closed
jw3126 opened this issue Apr 8, 2021 · 7 comments · Fixed by JuliaGPU/CUDA.jl#1536
Comments

jw3126 commented Apr 8, 2021

See JuliaGPU/CUDA.jl#736

Describe the bug

When using the beta keyword of NNlib.conv! on a CuArray, there are rare, possibly non-deterministic, absurd results.
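
For reference, the alpha/beta keywords follow the usual cuDNN/BLAS-style accumulation convention, so with nonzero beta the previous contents of y matter (a sketch of the intended semantics, not the implementation):

# Intended semantics of conv! with the alpha/beta keywords:
#   y .= alpha .* conv(x, w, cdims) .+ beta .* y
# With beta = 1f0, whatever is already in y must be preserved and added to.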

To reproduce
Run the following in a fresh Julia session:

using CUDA
using NNlib
x_cpu = fill(1f0, 1,1,1)
w_cpu = fill(1f0, 1,1,1)
x_gpu = CuArray(x_cpu)
w_gpu = CuArray(w_cpu)
cdims = NNlib.DenseConvDims(x_cpu, w_cpu)
y_cpu = fill(0f0, 1,1,1)
y_gpu = CuArray(y_cpu)
NNlib.conv!(y_cpu, x_cpu, w_cpu, cdims, alpha=1f0, beta=1f0)  # expected: 1*conv(x,w) + 1*0 = 1
NNlib.conv!(y_gpu, x_gpu, w_gpu, cdims, alpha=1f0, beta=1f0)  # same call on the GPU
@show y_cpu
@show y_gpu

y_cpu = Float32[1.0]
y_gpu = Float32[2.0]

If I run it again, y_gpu gives the correct result. Fuzz testing suggests that, for a given set of array sizes, at least the first conv! operation goes wrong. It may not be only the first operation that goes wrong, but the first one goes wrong reliably.

using NNlib
using CUDA
using Test
using LinearAlgebra

function fuzz(;max_fails, max_iter)
    fails = 0
    for i in 1:max_iter
        nspacedims = rand(1:1)  # fixed at one spatial dimension; widen the range to fuzz more shapes
        spacesize  = tuple(rand(1:3, nspacedims)...)
        nb         = rand(1:3)
        ncin       = rand(1:3)
        ncout      = rand(1:3)
        
        x_cpu = randn(Float32, spacesize...,ncin, nb)
        kernel_size = ntuple(_->1, nspacedims)
        w_cpu = randn(Float32, kernel_size...,ncin, ncout)
        x_gpu = CuArray(x_cpu)
        w_gpu = CuArray(w_cpu)
        
        cdims = NNlib.DenseConvDims(x_cpu, w_cpu)
        y_cpu = randn(Float32, spacesize...,ncout, nb)
        y_gpu = CuArray(y_cpu)
        NNlib.conv!(y_cpu, x_cpu, w_cpu, cdims, alpha=1f0, beta=1f0)
        NNlib.conv!(y_gpu, x_gpu, w_gpu, cdims, alpha=1f0, beta=1f0)
        if !(collect(y_gpu) ≈ y_cpu)
            @show i
            #@show x_cpu
            #@show x_gpu
            #@show w_cpu
            #@show w_gpu
            #@show y_cpu
            #@show y_gpu
            @show size(x_cpu)
            @show size(w_cpu)
            @show norm(collect(y_gpu) - y_cpu)
            fails += 1
        end
        if fails >= max_fails
            break
        end
    end
    @show fails
end

fuzz(max_fails=1000, max_iter=10000)

I am using the current master of CUDA.jl.

Manifest.toml

[[AbstractFFTs]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "8ed9de2f1b1a9b1dee48582ad477c6e67b83eb2c"
uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c"
version = "1.0.0"

[[Adapt]]
deps = ["LinearAlgebra"]
git-tree-sha1 = "ffcfa2d345aaee0ef3d8346a073d5dd03c983ebe"
uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
version = "3.2.0"

[[ArgTools]]
uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f"

[[Artifacts]]
uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33"

[[BFloat16s]]
deps = ["LinearAlgebra", "Test"]
git-tree-sha1 = "4af69e205efc343068dc8722b8dfec1ade89254a"
uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b"
version = "0.1.0"

[[Base64]]
uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f"

[[CEnum]]
git-tree-sha1 = "215a9aa4a1f23fbd05b92769fdd62559488d70e9"
uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82"
version = "0.4.1"

[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "MacroTools", "Memoize", "NNlib", "Printf", "Random", "Reexport", "Requires", "SparseArrays", "Statistics", "TimerOutputs"]
git-tree-sha1 = "d891e403471f04266c80a03ecf247d9aff6e7879"
repo-rev = "master"
repo-url = "https://github.com/JuliaGPU/CUDA.jl.git"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "2.6.0"

[[ChainRulesCore]]
deps = ["Compat", "LinearAlgebra", "SparseArrays"]
git-tree-sha1 = "de4f08843c332d355852721adb1592bce7924da3"
uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4"
version = "0.9.29"

[[Compat]]
deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", "Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"]
git-tree-sha1 = "919c7f3151e79ff196add81d7f4e45d91bbf420b"
uuid = "34da2185-b29b-5c13-b0c7-acf172513d20"
version = "3.25.0"

[[CompilerSupportLibraries_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae"

[[DataStructures]]
deps = ["Compat", "InteractiveUtils", "OrderedCollections"]
git-tree-sha1 = "4437b64df1e0adccc3e5d1adbc3ac741095e4677"
uuid = "864edb3b-99cc-5e75-8d2d-829cb0a9cfe8"
version = "0.18.9"

[[Dates]]
deps = ["Printf"]
uuid = "ade2ca70-3891-5945-98fb-dc099432e06a"

[[DelimitedFiles]]
deps = ["Mmap"]
uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab"

[[Distributed]]
deps = ["Random", "Serialization", "Sockets"]
uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b"

[[Downloads]]
deps = ["ArgTools", "LibCURL", "NetworkOptions"]
uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6"

[[ExprTools]]
git-tree-sha1 = "10407a39b87f29d47ebaca8edbc75d7c302ff93e"
uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04"
version = "0.1.3"

[[GPUArrays]]
deps = ["AbstractFFTs", "Adapt", "LinearAlgebra", "Printf", "Random", "Serialization"]
git-tree-sha1 = "f99a25fe0313121f2f9627002734c7d63b4dd3bd"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "6.2.0"

[[GPUCompiler]]
deps = ["DataStructures", "ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "Scratch", "Serialization", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "ef2839b063e158672583b9c09d2cf4876a8d3d55"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.10.0"

[[InteractiveUtils]]
deps = ["Markdown"]
uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240"

[[LLVM]]
deps = ["CEnum", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "b616937c31337576360cb9fb872ec7633af7b194"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "3.6.0"

[[LazyArtifacts]]
deps = ["Artifacts", "Pkg"]
uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3"

[[LibCURL]]
deps = ["LibCURL_jll", "MozillaCACerts_jll"]
uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21"

[[LibCURL_jll]]
deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"]
uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0"

[[LibGit2]]
deps = ["Base64", "NetworkOptions", "Printf", "SHA"]
uuid = "76f85450-5226-5b5a-8eaa-529ad045b433"

[[LibSSH2_jll]]
deps = ["Artifacts", "Libdl", "MbedTLS_jll"]
uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8"

[[Libdl]]
uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb"

[[LinearAlgebra]]
deps = ["Libdl"]
uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"

[[Logging]]
uuid = "56ddb016-857b-54e1-b83d-db4d58db5568"

[[MacroTools]]
deps = ["Markdown", "Random"]
git-tree-sha1 = "6a8a2a625ab0dea913aba95c11370589e0239ff0"
uuid = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
version = "0.5.6"

[[Markdown]]
deps = ["Base64"]
uuid = "d6f4376e-aef5-505a-96c1-9c027394607a"

[[MbedTLS_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1"

[[Memoize]]
deps = ["MacroTools"]
git-tree-sha1 = "2b1dfcba103de714d31c033b5dacc2e4a12c7caa"
uuid = "c03570c3-d221-55d1-a50c-7939bbd78826"
version = "0.4.4"

[[Mmap]]
uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[MozillaCACerts_jll]]
uuid = "14a3606d-f60d-562e-9121-12d972cd8159"

[[NNlib]]
deps = ["ChainRulesCore", "Compat", "LinearAlgebra", "Pkg", "Requires", "Statistics"]
git-tree-sha1 = "df42d0816edfc24f5b82a728f46381613c4dff79"
uuid = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
version = "0.7.14"

[[NetworkOptions]]
uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908"

[[OrderedCollections]]
git-tree-sha1 = "4fa2ba51070ec13fcc7517db714445b4ab986bdf"
uuid = "bac558e1-5e72-5ebc-8fee-abe8a469f55d"
version = "1.4.0"

[[Pkg]]
deps = ["Artifacts", "Dates", "Downloads", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs"]
uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"

[[Printf]]
deps = ["Unicode"]
uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7"

[[REPL]]
deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"]
uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb"

[[Random]]
deps = ["Serialization"]
uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[[Reexport]]
git-tree-sha1 = "57d8440b0c7d98fc4f889e478e80f268d534c9d5"
uuid = "189a3867-3050-52da-a836-e630ba90ab69"
version = "1.0.0"

[[Requires]]
deps = ["UUIDs"]
git-tree-sha1 = "cfbac6c1ed70c002ec6361e7fd334f02820d6419"
uuid = "ae029012-a4dd-5104-9daa-d747884805df"
version = "1.1.2"

[[SHA]]
uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce"

[[Scratch]]
deps = ["Dates"]
git-tree-sha1 = "ad4b278adb62d185bbcb6864dc24959ab0627bf6"
uuid = "6c6a2e73-6563-6170-7368-637461726353"
version = "1.0.3"

[[Serialization]]
uuid = "9e88b42a-f829-5b0c-bbe9-9e923198166b"

[[SharedArrays]]
deps = ["Distributed", "Mmap", "Random", "Serialization"]
uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383"

[[Sockets]]
uuid = "6462fe0b-24de-5631-8697-dd941f90decc"

[[SparseArrays]]
deps = ["LinearAlgebra", "Random"]
uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[[Statistics]]
deps = ["LinearAlgebra", "SparseArrays"]
uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[[TOML]]
deps = ["Dates"]
uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76"

[[Tar]]
deps = ["ArgTools", "SHA"]
uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e"

[[Test]]
deps = ["InteractiveUtils", "Logging", "Random", "Serialization"]
uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[[TimerOutputs]]
deps = ["Printf"]
git-tree-sha1 = "3318281dd4121ecf9713ce1383b9ace7d7476fdd"
uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"
version = "0.5.7"

[[UUIDs]]
deps = ["Random", "SHA"]
uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4"

[[Unicode]]
uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5"

[[Zlib_jll]]
deps = ["Libdl"]
uuid = "83775a58-1f1d-513f-b197-d71354ab007a"

[[nghttp2_jll]]
deps = ["Artifacts", "Libdl"]
uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d"

Version info

Details on Julia:

Julia Version 1.6.0-rc1
Commit a58bdd9010 (2021-02-06 15:49 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: AMD Ryzen 9 3900 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, znver2)
Environment:
  JULIA_NUM_THREADS = 24

Details on CUDA:

CUDA toolkit 11.2.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.32.3

Libraries: 
- CUBLAS: 11.4.1
- CURAND: 10.2.3
- CUFFT: 10.4.0
- CUSOLVER: 11.1.0
- CUSPARSE: 11.4.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.32.3
- CUDNN: 8.10.0 (for CUDA 11.2.0)
- CUTENSOR: 1.2.2 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-rc1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Preferences:
- Memory pool: None
- Async allocation: true

1 device:
  0: GeForce RTX 2060 (sm_75, 4.931 GiB / 5.787 GiB available)

ToucheSir transferred this issue from FluxML/NNlib.jl on Jan 9, 2022

maxfreu (Contributor) commented May 31, 2022

I can replicate it with:

[052768ef] CUDA v3.10.1
[872c559c] NNlib v0.8.3
[a00861dc] NNlibCUDA v0.2.3
CUDA 11.4.0
CUDNN 8.30.2
julia-1.6.3

ToucheSir (Member) commented

The first call to conv! is special because it will reliably trigger an algorithm search. What happens if you go a level lower and call the CUDA.jl functions?

maxfreu (Contributor) commented Jun 2, 2022

This happens:

using CUDA, NNlibCUDA, NNlib
x = CUDA.ones(1,1,1)
w = CUDA.ones(1,1,1)
y = CUDA.zeros(1,1,1)
cdims = NNlib.DenseConvDims(x,w)
d, x, _ = NNlibCUDA.cudnnConvolutionDescriptorAndPaddedInput(cdims, x)
CUDA.CUDNN.cudnnConvolutionForward!(y, w, x, d; alpha=1f0, beta=1f0, z=y) # returns [2.0] (wrong)
y = CUDA.zeros(1,1,1)
CUDA.CUDNN.cudnnConvolutionForward!(y, w, x, d; alpha=1f0, beta=1f0, z=y) # returns [1.0] (correct)

I suspect that the crux is here. Instead of y, a similar array should be allocated and used in cudnnFindConvolutionForwardAlgorithmEx. @maleadt, what do you think? The cuDNN docs state that the contents of y will be overwritten with arbitrary values during the algorithm search.
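
A minimal sketch of that change (argument names follow the cuDNN C API for cudnnFindConvolutionForwardAlgorithmEx; the surrounding handles, descriptors, and workspace are assumed, not quoted from CUDA.jl):

# Hedged sketch, not the actual CUDA.jl code: hand the algorithm search a
# throwaway buffer so it cannot clobber the caller's y.
y_scratch = similar(y)   # the search may leave arbitrary values in here
cudnnFindConvolutionForwardAlgorithmEx(handle, xDesc, x, wDesc, w, convDesc,
                                       yDesc, y_scratch,   # previously: y
                                       requestedAlgoCount, returnedAlgoCount,
                                       perfResults, workspace, sizeof(workspace))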

ToucheSir (Member) commented

I would expect y to be re-overwritten during the actual forward pass, though? Otherwise, using similar to allocate that array would lead to strange results for all conv calls, not just the first one.

maxfreu (Contributor) commented Jun 3, 2022

We have to differentiate between the actual convolution and the algorithm search. The convolution needs an output buffer with well-defined contents (zero-initialized in the example above), which it accumulates into. The algorithm search also needs an output buffer for its benchmark, but at the end the values in that buffer are arbitrary. If you use the same buffer for both, the search leaves arbitrary values in it and the convolution then adds to them, producing garbage. As you said, subsequent calls skip the algorithm search, which makes this a semi-heisenbug. Using this branch https://github.com/maxfreu/CUDA.jl/tree/conv-algosearch I get zero errors in the fuzzer.
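
To make the failure mode concrete (illustrative numbers only; in the 1x1x1 repro the search plausibly leaves the conv result in the buffer, which would explain the observed 2.0):

# Illustrative only: the real call reusing the buffer the search scribbled on.
alpha, beta = 1f0, 1f0
conv_result = 1f0   # true convolution output in the 1x1x1 example
leftover    = 1f0   # arbitrary value left in y by the algorithm search
y = alpha * conv_result + beta * leftover   # 2.0 instead of the expected 1.0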

ToucheSir (Member) commented

That's what I was missing: y and dy are accumulated into, rather than completely overwritten, when beta is non-zero. Can we tweak the call chain so that an extra output buffer is only allocated for the algorithm search in that case? My worry is that the search is already causing OOMs for users, so allocating more for it when not required is not ideal.

maxfreu (Contributor) commented Jun 3, 2022

Hmm, I wouldn't have expected the search to cause OOMs, as the buffer should be freed right after the search. Apart from the input, weight, and output tensors, the search also needs a "workspace", the size of which is calculated here. It already seems to be quite small, but I haven't thought it through; maybe not small enough? Anyway, it should be possible to allocate only if beta != 0, assuming that is the only case in which y is accumulated into. Should I mark the PR as draft?
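
A hedged sketch of that conditional allocation (illustrative, not the PR's code):

# Only pay for the extra buffer when y's contents must survive the search,
# i.e. when the real call will accumulate into y.
y_search = beta == 0 ? y : similar(y)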
