Seg fault after upgrading to Julia 1.5 #589

Closed
andevellicus opened this issue Aug 3, 2020 · 18 comments

@andevellicus

Code was working fine until Julia 1.5. Looks like there's something not right with AutoGrad and unpool? Stack trace below:

Illegal inttoptr
	  %12 = ptrtoint %jl_value_t addrspace(10)* addrspace(13)* %32 to i64, !dbg !28
Illegal inttoptr
	  %13 = inttoptr i64 %12 to %jl_value_t addrspace(10)*, !dbg !28

signal (6): Aborted
in expression starting at xxx
gsignal at /usr/bin/../lib/libc.so.6 (unknown line)
abort at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x7f4930c412c9)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /usr/bin/../lib/libLLVM-10.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /usr/bin/../lib/libLLVM-10.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /usr/bin/../lib/libLLVM-10.so (unknown line)
unknown function (ip: 0x7f4930d480fb)
unknown function (ip: 0x7f4930d4affb)
unknown function (ip: 0x7f4930d4ccf8)
unknown function (ip: 0x7f4930d4d9e2)
unknown function (ip: 0x7f4930d4eb88)
unknown function (ip: 0x7f4930ccf77f)
jl_invoke at /usr/bin/../lib/libjulia.so.1 (unknown line)
forw##kw at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:65 [inlined]
#unpool#440 at ./none:0
unknown function (ip: 0x7f48da0ea425)
unpool##kw at ./none:0
unknown function (ip: 0x7f48da0ea205)
Sampling at /home/xxx/.julia/packages/KnetLayers/zfhNR/src/cnn.jl:12
applychain at /home/xxx/.julia/packages/KnetLayers/zfhNR/src/chain.jl:25
Chain at /home/andevellicus/.julia/packages/KnetLayers/zfhNR/src/chain.jl:27
unknown function (ip: 0x7f48da0e9fff)

#2 at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:205
unknown function (ip: 0x7f48da0c032c)
#differentiate#3 at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:144
differentiate at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]

unknown function (ip: 0x7f48da0bea0c)

unknown function (ip: 0x7f48da76ad0c)
unknown function (ip: 0x7f4930ce65c5)
unknown function (ip: 0x7f4930ce624e)
unknown function (ip: 0x7f4930ce6d90)
unknown function (ip: 0x7f4930ce7840)
unknown function (ip: 0x7f4930d03bd1)
unknown function (ip: 0x7f4930cd9bc2)
jl_load_rewrite at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f492044db98)
unknown function (ip: 0x7f492044d652)
unknown function (ip: 0x7f492002e339)
unknown function (ip: 0x7f492003b69e)
unknown function (ip: 0x7f492003b7f5)
unknown function (ip: 0x55bab43ab4fe)
unknown function (ip: 0x55bab43ab0a7)
__libc_start_main at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x55bab43ab15d)
Allocations: 122169744 (Pool: 122133980; Big: 35764); GC: 291
@denizyuret
Owner

The latest tests seem to be ok: https://gitlab.com/JuliaGPU/Knet.jl/-/jobs/669021495
What version of AutoGrad and Knet are you using?

@andevellicus
Author

andevellicus commented Aug 3, 2020

Knet v1.3.9
KnetLayers v0.2.0
AutoGrad v1.2.3

I tried Chain(x -> unpool(x)) instead of calling Chain(UnPool()), and still got the same problem.

Looks like there are some C libraries involved too, which is over my head...

@denizyuret
Owner

denizyuret commented Aug 4, 2020 via email

@andevellicus
Author

andevellicus commented Aug 4, 2020

Looks like it's something with unpool. If I replace Chain(UnPool, myconvblock) with Chain(x->unpool(x)) or even just x->unpool(x) I get the same fault.
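
A rough sketch of the substitutions just described (hedged illustration; myconvblock is the placeholder name used above for the surrounding conv layers):

model = Chain(UnPool(), myconvblock)        # original layer stack
model = Chain(x -> unpool(x), myconvblock)  # substitution: same fault under @diff
model = x -> unpool(x)                      # even a bare closure hits the same fault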

From the stacktrace, it looks like AutoGrad and unpool might be at fault? Maybe something in forwargs... Not sure what changed between Julia 1.4 and 1.5 to cause that; it's a little over my head.

 63 # forw() is called with primitive functions that have Tracked or Bcasted args
 64 function forw(f, args...; kwargs...)
 65     @timer "forwargs"        ((f, nobcast, novalue) = forwargs(f, args))
 66     @timer ftimer(f,novalue) (v = f(novalue...; kwargs...))
 67     if recording()
 68         if v isa Broadcasted
 69             @timer "unfuse"  (v = copy(v))
 70         end
 71         if novalue !== nobcast  # there are tracked args
 72             @timer "record"  (v = Result(v, f, nobcast, kwargs))
 73         end
 74         if nobcast !== args     # there are bcasted args
 75             @timer "bcasted" (v = Bcasted(v))
 76         end
 77     end
 78     return v
 79 end

Having said that, when I just do a plain UnPool()(KnetArray(rand(100, 100, 100, 1, 1))) in the REPL it works fine, so I suppose it must be running into issues when taking the gradient.
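
For reference, a rough sketch of the two cases just described (assuming Knet's Param and @diff and KnetLayers' UnPool; only the array size above is used):

x = KnetArray(rand(100, 100, 100, 1, 1))
UnPool()(x)              # plain forward pass in the REPL: works fine

p = Param(x)
@diff sum(UnPool()(p))   # the same op under the gradient tape is where things blow up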

@andevellicus
Author

andevellicus commented Aug 4, 2020

Looking into it more, the only thing I can think of that changed with Julia 1.5 is the way structs are allocated.

From the release notes:
"Immutable structs (including tuples) that contain references can now be allocated on the stack, and allocated inline within arrays and other structs (#33886). This significantly reduces the number of heap allocations in some workloads. Code that requires assumptions about object layout and addresses (usually for interoperability with C or other languages) might need to be updated; for example any object that needs a stable address should be a mutable struct. As a result, Array views no longer allocate (#34126)."

Not sure what's so special about unpool that it'd mess AutoGrad up, but the line that seems to be at issue is

@timer "forwargs"        ((f, nobcast, novalue) = forwargs(f, args))

in AutoGrad's core.jl. I have a pretty complex network, so I'll have to see if I can reproduce it with something simpler.

@andevellicus
Author

Interestingly, if I remove unpool completely and replace it with a hand-written upsampling layer, I still get a similar error, only this time it's within my loss function.

dice(x, y; smooth::Float32=1.f0) = (2*sum(y .* x) + smooth) / (sum(y.^2) + sum(x.^2) + smooth)
loss(x, y) = 1 - dice(x, y)

Stack trace is:

Illegal inttoptr
	  %12 = ptrtoint %jl_value_t addrspace(10)* addrspace(13)* %32 to i64, !dbg !28
Illegal inttoptr
	  %13 = inttoptr i64 %12 to %jl_value_t addrspace(10)*, !dbg !28

signal (6): Aborted
in expression starting at /home/xxx/Programming/ML/julia/Knet/train_unet.jl:102
gsignal at /usr/bin/../lib/libc.so.6 (unknown line)
abort at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x7f26a37922c9)
_ZN4llvm13FPPassManager13runOnFunctionERNS_8FunctionE at /usr/bin/../lib/libLLVM-10.so (unknown line)
_ZN4llvm13FPPassManager11runOnModuleERNS_6ModuleE at /usr/bin/../lib/libLLVM-10.so (unknown line)
_ZN4llvm6legacy15PassManagerImpl3runERNS_6ModuleE at /usr/bin/../lib/libLLVM-10.so (unknown line)
unknown function (ip: 0x7f26a38990fb)
unknown function (ip: 0x7f26a389bffb)
unknown function (ip: 0x7f26a389dcf8)
unknown function (ip: 0x7f26a389e9e2)
unknown function (ip: 0x7f26a389fb88)
unknown function (ip: 0x7f26a382077f)
jl_invoke at /usr/bin/../lib/libjulia.so.1 (unknown line)
forw at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:65 [inlined]
#sum#61 at ./none:0 [inlined]
sum at ./none:0
#dice#1 at /home/xxx/Programming/ML/julia/Knet/train_unet.jl:22
dice at /home/xxx/Programming/ML/julia/Knet/train_unet.jl:22 [inlined]
loss at /home/xxx/Programming/ML/julia/Knet/train_unet.jl:23
#2 at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:205
unknown function (ip: 0x7f2630084cdc)
#differentiate#3 at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:144
differentiate at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
minimize! at /home/xxx/Programming/ML/julia/Knet/train_unet.jl:27
unknown function (ip: 0x7f263008358c)
train at /home/xxx/Programming/ML/julia/Knet/train_unet.jl:78
unknown function (ip: 0x7f26300527cc)
unknown function (ip: 0x7f26a38375c5)
unknown function (ip: 0x7f26a383724e)
unknown function (ip: 0x7f26a3837d90)
unknown function (ip: 0x7f26a3838840)
unknown function (ip: 0x7f26a3854bd1)
unknown function (ip: 0x7f26a382abc2)
jl_load_rewrite at /usr/bin/../lib/libjulia.so.1 (unknown line)
unknown function (ip: 0x7f2692f9eb98)
unknown function (ip: 0x7f2692f9e652)
unknown function (ip: 0x7f2692b7f339)
unknown function (ip: 0x7f2692b8c69e)
unknown function (ip: 0x7f2692b8c7f5)
unknown function (ip: 0x562cd957d4fe)
unknown function (ip: 0x562cd957d0a7)
__libc_start_main at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x562cd957d15d)
Allocations: 153300284 (Pool: 153252213; Big: 48071); GC: 275
zsh: abort (core dumped)  julia train_unet.jl

Should I move this over to AutoGrad.jl?

@denizyuret
Owner

I need a small example to replicate the error. For example, none of the following gives me issues:

julia> x = randn(8,8,3,10);

julia> y = unpool(x);

julia> p = Param(x);

julia> @diff sum(unpool(p))
T(-153.84529796846022)

julia> dice(x, y; smooth::Float32=1.f0) = (2*sum(y .* x) + smooth) / (sum(y.^2) + sum(x.^2) + smooth)

julia> loss(x, y) = 1 - dice(x, y)

julia> y = randn(8,8,3,10);

julia> q = Param(y);

julia> loss(p,q)
1.0116619302912517

julia> @diff loss(p,q)
T(1.0116619302912517)

julia> pk = Param(KnetArray(x));

julia> qk = Param(KnetArray(y));

julia> @diff loss(pk,qk)
T(1.0116619302912517)

julia> versioninfo()
Julia Version 1.5.0

julia> pkg"st"
Status `/dev/shm/dyuret/.julia/environments/v1.5/Project.toml`
  [6710c13c] AutoGrad v1.2.3
  [052768ef] CUDA v1.2.1
  [0c68f7d7] GPUArrays v5.0.0
  [1902f260] Knet v1.3.9
  [295af30f] Revise v2.7.3


@andevellicus
Author

I think I may have an idea of where the problem is. Below is some example code that throws the error on my computer:

using Knet
using KnetLayers
using ResumableFunctions

const opt = Adam(lr=0001f0)

global arrtype = gpu()>=0 ? KnetArray{Float32} : Array{Float32}

setoptim!(M, optimizer) = for p in params(M); p.opt = Knet.clone(optimizer); end

Knet.gpu(1)
using CUDA

dice_loss(x, y; smooth::Float32=1.f0) = 1 - (2*sum(y .* x) + smooth) / (sum(y.^2) + sum(x.^2) + smooth)

@resumable function datagen()
	while true
		x = rand(Float32, 8, 8, 1, 1)
		y = rand(Float32, 8, 8, 1, 1)
		@yield (x, y)
	end
end

struct test_mod
    layers::Chain
end

function test_mod()
	w = param(2, 2, 1, 8)
	c = Chain(
	    x->conv4(w, x),
	    x->unpool(x),
	)
	test_mod(c)
end

function (m::test_mod)(x)
	m.layers(x)
end

function calc(model, x, y)
	ld = @diff dice_loss(Array(model(x)), y)
	for w in params(model)
		Knet.update!(w, grad(ld, w))
	end
	ld = value(ld)
	return ld
end

function on_batch(model, generator)
	for batch in generator
		x = KnetArray(batch[1])
		y = KnetArray(batch[2])
		@time loss = calc(model, x, y)
		println(loss)
	end
end

function train()
    model = test_mod()

    setoptim!(model, opt)
    train_gen = datagen()

    epochs = 10
    best_test_loss = 0f0
    for i in 1:epochs
        @info "Epoch $i of $epochs"
        on_batch(model, train_gen)
    end
end

train()

I think w = param(2, 2, 1, 8) might be causing issues; I used that here because that's how KnetLayers inits the weights in the Filtering function:

mutable struct Filtering{T<:Function,P,A<:ActOrNothing,V<:Bias} <: Layer
    weight::P
    bias::V
    activation::A
    options::NamedTuple
end

function Filtering{T}(;height::Integer, width::Integer, inout::Pair=1=>1,
                       activation::ActOrNothing=NonAct(),
                       winit=xavier, binit=zeros,
                       atype=arrtype,
                       opts...) where T <: Function

    wsize = T===typeof(conv4) ? inout : reverse(inout)
    w = param(height,width,wsize...; init=winit, atype=atype)
    b = binit !== nothing ? Bias(1,1,inout[2],1; init=binit, atype=atype) : Bias(nothing)
    Filtering{T}(w, b, activation; opts...)

end
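
For context, a small sketch of what param produces in the snippet above (assuming Knet's default xavier init and the global array type; just illustrative):

w = param(2, 2, 1, 8)   # Param wrapping a 2×2×1×8 xavier-initialized array
w isa Param             # true: this is what AutoGrad tracks inside @diff
size(value(w))          # (2, 2, 1, 8)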

@andevellicus
Author

Tried this after the new Knet 1.4 update... still the same issue.

@denizyuret
Owner

In your MWE you call dice_loss with an Array and a KnetArray, which results in a type mismatch. Even when I fix that, x and y are (8,8,1,1) whereas model(x) is (14,14,8,1). These are not broadcastable shapes, so dice_loss throws an error. Can you: (1) not use KnetLayers for now to isolate the problem (you only use Chain, which can be defined in a few lines, e.g. here), and (2) send me an MWE that fails with Knet 1.4 and Julia 1.5.
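
For illustration, a minimal sketch of the type-mismatch fix mentioned above (hedged; the x/y shapes would still need to match the model output for the broadcasts in dice_loss to work):

ld = @diff dice_loss(model(x), y)          # keep the model output as a KnetArray
# instead of
ld = @diff dice_loss(Array(model(x)), y)   # mixes a CPU Array with a KnetArray y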

@andevellicus
Author

andevellicus commented Aug 22, 2020

I tried to run my MWE in #602 with Julia 1.5 and Knet 1.4. It doesn't call KnetLayers per se, but essentially uses almost identical code... still the same error. Will try to provide a version of the model without the KnetLayers-style code.

@andevellicus
Author

andevellicus commented Aug 22, 2020

Ok, using the following code:

using Knet
using ProgressMeter

setoptim!(m, optimizer) = for p in params(m); p.opt = Knet.clone(optimizer); end

dice(x, y; smooth::Float32=1.f0) = (2*sum(y .* x) + smooth) / (sum(y.^2) + sum(x.^2) + smooth)
loss(x, y) = 1 - dice(x, y)

# Calculates loss and updates the model's parameters
function minimize!(model, x::KnetArray, y::KnetArray)
    ld = @diff loss(model(x), y)
    for w in params(model)
	Knet.update!(w, grad(ld, w))
    end
    return value(ld)
end


# Define a chain of layers :
struct Chain; layers; end
(c::Chain)(x) = (for l in c.layers; x = l(x); end; x)

struct test_model; c; end
function (m::test_model)(x)
    x = m.c(x)
    return x
end

function test_model()
    w = param(3, 3, 3, 1, 8) 
    c = Chain((
	      x->conv4(w, x, stride=2, padding=1),
	      x->unpool(x)
	      ))
    test_model(c)
end

# Main training loop
function main()

    Knet.gpu(1)

    # Get model
    model = test_model()
    setoptim!(model, Adam())

    # Kick off the training loop
    for i in 1:5
        @info "Epoch $i of 5"

        p = Progress(Int(floor(5)),
                     dt=0.5, barglyphs=BarGlyphs("[=> ]"), barlen=50, color=:yellow)
        for i in 1:5
            x = rand(Float32, 32, 32, 32, 1, 1)
            y = rand(Float32, 32, 32, 32, 1, 1)
            train_loss = minimize!(model, KnetArray(x), KnetArray(y))
            next!(p, showvalues=[(:loss, train_loss)])
        end

        println("")
    end
end

main()

I get these results:

  • Julia 1.5, Knet 1.4, CUDA 1.3.1: Same error in first post
Illegal inttoptr
	  %12 = ptrtoint %jl_value_t addrspace(10)* addrspace(13)* %32 to i64, !dbg !28
Illegal inttoptr
	  %13 = inttoptr i64 %12 to %jl_value_t addrspace(10)*, !dbg !28
...
...
  • Julia 1.4, Knet 1.4, CUDA 1.3.1: No problems.

Here are the rest of my packages:

(@v1.5) pkg> status
Status `~/.julia/environments/v1.5/Project.toml`
  [c7e460c6] ArgParse v1.1.0
  [fbb218c0] BSON v0.2.6
  [6e4b80f9] BenchmarkTools v0.5.0
  [336ed68f] CSV v0.7.7
  [052768ef] CUDA v1.3.1
  [159f3aea] Cairo v1.0.5
  [88353bc9] ConfParser v0.1.2
  [150eb455] CoordinateTransformations v0.6.0
  [a93c6f00] DataFrames v0.21.6
  [5789e2e9] FileIO v1.4.1
  [587475ba] Flux v0.11.1
  [186bb1d3] Fontconfig v0.4.0
  [c91e804a] Gadfly v1.3.0
  [f67ccb44] HDF5 v0.13.5
  [6a3955dd] ImageFiltering v0.6.15
  [787d08f9] ImageMorphology v0.2.8
  [80713f31] ImageSegmentation v1.4.5
  [02fcd773] ImageTransformations v0.8.5
  [916415d5] Images v0.22.4
  [a98d9a8b] Interpolations v0.12.10
  [1902f260] Knet v1.4.0
  [2b0e0bc5] LanguageServer v3.2.0
  [a3a9e032] NIfTI v0.4.1
  [d96e819e] Parameters v0.12.1
  [92933f4c] ProgressMeter v1.3.2
  [189a3867] Reexport v0.2.0
  [c5292f4c] ResumableFunctions v0.5.1
  [6038ab10] Rotations v1.0.1
  [b3cc710f] StaticLint v4.5.0
  [cf896787] SymbolServer v5.1.0

@andevellicus
Author

Ok, more digging...

When I change

train_loss = minimize!(model, KnetArray(x), KnetArray(y))

to

adam(model, (x, y))

the code works fine. So that leads me to believe there is an issue in

function minimize!(model, x::KnetArray, y::KnetArray)
    ld = @diff loss(model(x), y)
    for w in params(model)
	Knet.update!(w, grad(ld, w))
    end
    return value(ld)
end

I commented out the update part, leaving just the @diff macro, and I was back to getting errors... so I think there's something going on in AutoGrad.
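
A rough sketch of that reduction (hypothetical function name; same loss and model as in the MWE above):

function diff_only!(model, x::KnetArray, y::KnetArray)
    ld = @diff loss(model(x), y)   # no update! calls; this alone still aborts here
    return value(ld)
end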

@denizyuret
Owner

Unfortunately I cannot get your code to fail ;(

I tried it on Windows, thinking maybe it is an OS difference. It finished without any errors. My setup:

  • windows10
  • julia 1.5.0
  • [6710c13c] AutoGrad v1.2.3
  • [052768ef] CUDA v1.3.1
  • [1902f260] Knet v1.4.0
  • [92933f4c] ProgressMeter v1.3.2
julia> include(Knet.dir("test/gpu.jl"))
Knet.LibKnet8.libknet8 = "C:\\Users\\deniz\\.julia\\artifacts\\5e1e317677e88277f0ee67ab9e17587a8edc4f7a\\libknet8"
readdir(artifact"libknet8") = ["libknet8.dll", "libknet8.exp", "libknet8.lib"]
CuDevice(0): GeForce GTX 1060 with Max-Q Design
length(CUDA.devices()) = 1
CUDA.capability(CUDA.device()) = v"6.1.0"
CUDA.warpsize(CUDA.device()) = 32
CUDA.find_toolkit() = ["C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"]
CUDA.version() = v"11.0.0"
Mem.info() = (5265096704, 6442450944)
CUDA.synchronize() = nothing
# then some error about not finding nvml because it doesn't exist in windows

@andevellicus
Author

andevellicus commented Aug 23, 2020

I found the solution to my error, though why it's specific to my system I have no idea. Once I changed

function minimize!(model, x::KnetArray, y::KnetArray)
    ld = @diff loss(model(x), y)
    for w in params(model)
	Knet.update!(w, grad(ld, w))
    end
    return value(ld)
end

to

function minimize!(model, x::KnetArray, y::KnetArray)
    x = model(x) # This fixes the issue
    ld = @diff loss(x, y)
    for w in params(model)
	Knet.update!(w, grad(ld, w))
    end
    return value(ld)
end

everything works fine. I have no earthly idea why that matters, but holy @#*$ am I glad it was a simple solution. I was despairing that it was something inherent to my box and OS, and there was much anguish and wringing of hands as well as expletives. Life is good now :)

Thanks for taking the time to look into it. Will close now.

@andevellicus
Author

I'm an idiot, just realized that taking the model call out of @diff means that differentiation doesn't get applied to the model. Never mind...
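
A short sketch of why that change only hid the problem (assuming AutoGrad's usual behavior for parameters that never enter the tape):

x2 = model(x)                      # computed outside @diff, so it is never recorded
ld = @diff loss(x2, y)             # only loss itself is differentiated
grad(ld, first(params(model)))     # nothing: the model's weights got no gradient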

@andevellicus
Author

So I found the source of the issue. Looks like for some reason on my system @diff misbehaves. Once I did

function main()

    # Get model
    model = test_model()

    # Kick off the training loop
    for i in 1:5
        println("Epoch $i of 5")

        for j in 1:5
            x = KnetArray(rand(Float32, 32, 32, 32, 1, 1))
            y = KnetArray(rand(Float32, 32, 32, 32, 1, 1))
            #@diff loss(model(x), y)
            #ex = macroexpand(debug, :(@diff loss(model(x), y)))
            ex = @macroexpand @diff loss(model(x), y)
            println(ex)
        end

        println("")
    end
end

No more errors. Any suggestions on how to actually collect the loss from ex? Or would this require modifying the @diff macro?

@andevellicus
Author

Welp, turns out it's an Arch Linux issue, as you suggested. The distribution's default Julia packages are messed up; when I install the official binaries everything is just dandy.

Apologies for the wild goose chase, and thanks for your patience.
