Preallocation in activations #25

Open
lruthotto opened this issue Jan 28, 2018 · 10 comments

@lruthotto
Contributor

Maybe we should write two versions of the activation functions (and other functions as well): one that allocates and one that operates in place. See the following example that Eran and I have put together.

Y  = randn(10*768, 2*512);
dY = zeros(size(Y));

# NOTE: doDerivative is accepted for interface compatibility but not used below.
function myTanhActivation!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    for k = 1:length(A)
        @inbounds A[k]  = tanh(A[k])
        @inbounds dA[k] = 1 - A[k]^2
    end
    return A, dA
end

function myTanhActivation(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    return myTanhActivation!(copy(A), copy(dA), doDerivative)
end

function myReluActivation!(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    for k = 1:length(Y)
        @inbounds Y[k]  = max(Y[k], zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y, dA
end

function myReluActivation(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    return myReluActivation!(copy(Y), copy(dA), doDerivative)
end

# warm up (trigger compilation) before timing
X  = copy(Y);
t1 = myTanhActivation!(X, X, true);
t2 = myTanhActivation(X, X, true);
t1 = myReluActivation!(X, X, true);
t2 = myReluActivation(X, X, true);
t1 = [];
t2 = [];
gc();   # force garbage collection (GC.gc() on Julia >= 0.7)

@time for k = 1:10; t2 = myTanhActivation(Y, dY, true); end
@time for k = 1:10; t2 = myTanhActivation!(Y, dY, true); end
@time for k = 1:10; t2 = myReluActivation(Y, dY, true); end
@time for k = 1:10; t2 = myReluActivation!(Y, dY, true); end
@klensink
Contributor

I really like this; there are lots of places where we can do things in place and save some GC time.

This, and the sparse kernels, are also really good places to play around with multithreading the for loops. I'll try to whip up an example before the call today; I tested it quickly yesterday and got great results.

@klensink
Contributor

Multithreading example:

using BenchmarkTools

Y = randn(10*768,2*512);
dY = zeros(size(Y));

function myTanhActivation_mt!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    Threads.@threads for k = 1:length(A)
        @inbounds A[k]  = tanh(A[k])
        @inbounds dA[k] = 1 - A[k]^2
    end
    return A, dA
end

function myTanhActivation!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    for k = 1:length(A)
        @inbounds A[k]  = tanh(A[k])
        @inbounds dA[k] = 1 - A[k]^2
    end
    return A, dA
end

function myTanhActivation(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    return myTanhActivation!(copy(A), copy(dA), doDerivative)
end

function myReluActivation!(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    for k = 1:length(Y)
        @inbounds Y[k]  = max(Y[k], zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y, dA
end

function myReluActivation_mt!(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    Threads.@threads for k = 1:length(Y)
        @inbounds Y[k]  = max(Y[k], zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y, dA
end

function myReluActivation(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    return myReluActivation!(copy(Y), copy(dA), doDerivative)
end

t1 = @benchmark myTanhActivation!($Y, $dY, true) 
t2 = @benchmark myTanhActivation_mt!($Y, $dY, true) 

r1 = @benchmark myReluActivation!($Y, $dY, true) 
r2 = @benchmark myReluActivation_mt!($Y, $dY, true) 

println("--- Tanh ---")
display(judge(median(t2), median(t1)))
println("--- RELU ---")
display(judge(median(r2), median(r1)))

Before starting Julia you need to give it more threads, e.g. export JULIA_NUM_THREADS=4 on OSX/Linux.
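A quick way to confirm the setting took effect (a plain REPL check, nothing Meganet-specific):

julia> Threads.nthreads()   # should report 4 after export JULIA_NUM_THREADS=4
4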

--- Tanh ---
BenchmarkTools.TrialJudgement: 
  time:   -76.94% => improvement (5.00% tolerance)
  memory: +150.00% => regression (1.00% tolerance)
--- RELU ---
BenchmarkTools.TrialJudgement: 
  time:   -55.77% => improvement (5.00% tolerance)
  memory: +150.00% => regression (1.00% tolerance)

julia> t1
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     143.728 ms (0.00% GC)
  median time:      175.690 ms (0.00% GC)
  mean time:        163.612 ms (0.00% GC)
  maximum time:     178.066 ms (0.00% GC)
  --------------
  samples:          31
  evals/sample:     1

julia> t2
BenchmarkTools.Trial: 
  memory estimate:  80 bytes
  allocs estimate:  2
  --------------
  minimum time:     40.205 ms (0.00% GC)
  median time:      40.517 ms (0.00% GC)
  mean time:        40.570 ms (0.00% GC)
  maximum time:     43.109 ms (0.00% GC)
  --------------
  samples:          124
  evals/sample:     1

@lruthotto
Contributor Author

Sorry, I only saw this now. It looks really promising, I agree!

Question: How stable are the threads? I remember this being an experimental feature. As long as it also works stably for more complicated functions, I'm fine with using it.

@klensink
Contributor

I have the same question/concern as you. I've only used it before for simple loops, so we will need to do some testing to see how it handles more complicated loops.

It also seems to be a little finicky about your hardware. I've had no problems on Macs, but I had some problems yesterday on my Ubuntu machine.

@lruthotto
Contributor Author

Another question is how Threads performs when it's being used at different layers. Say, threading over all the examples in the forward propagation and then also using threading in the activation might cause problems, right?

@klensink
Contributor

Yes, you need to be careful how you layer them. If you end up using more threads than you have, they block each other and the program grinds to a halt.

The same goes for BLAS calls inside the loop: the number of BLAS threads times the number of Julia threads shouldn't exceed the number of threads your machine has.
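For what it's worth, here is a minimal sketch of keeping the two thread pools from oversubscribing each other (the array and loop body are only illustrative; on Julia 0.6 the BLAS functions live under Base.LinAlg.BLAS rather than LinearAlgebra.BLAS):

using LinearAlgebra

batch = randn(512, 256)      # illustrative data: 256 columns to process
out   = zeros(256)

# With JULIA_NUM_THREADS=4, pin BLAS to one thread so that
# (Julia threads) x (BLAS threads) does not exceed the core count.
BLAS.set_num_threads(1)

Threads.@threads for j = 1:size(batch, 2)
    # per-column work that itself dispatches to BLAS (dot uses BLAS for Float64)
    out[j] = dot(view(batch, :, j), view(batch, :, j))
end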

@klensink
Contributor

klensink commented Feb 9, 2018

On a similar note, couldn't we also create in-place apply functions for the normalization layers? I tried adding this and the derivative tests failed, so I assume it isn't always going to work. But in some places, like the example below, it seems like it shouldn't be an issue because the data is being overwritten anyway.

https://github.com/XtractOpen/Meganet.jl/blob/master/src/layers/singleLayer.jl#L37

@eldadHaber
Contributor

eldadHaber commented Feb 9, 2018 via email

@lruthotto
Contributor Author

I think an in-place apply function would be good to have for other Meganet elements as well. In particular, I'm talking about those that do not change the dimension of the input features.

Why don't we rename the current method to apply!, with the understanding that it will overwrite the input features? Then let's also add a small wrapper apply that just copies the input features and calls apply!.

I'm not sure if this works, but apply! could also accept an additional, preallocated argument that gets overwritten with the output features; this is similar to A_mul_B! in Julia. This version would also allow pre-allocation for Meganet elements that change the dimension of the input features. Maybe this is a better option? (See the sketch below.)
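To make the proposal concrete, here is a minimal sketch of the three-method convention (hypothetical signatures; Meganet's actual apply takes more arguments, and the tanh body is only a stand-in for a real layer):

# overwrite the input features (only valid when the layer keeps the dimensions)
function apply!(layer, Y::Array{T}) where {T}
    Y .= tanh.(Y)                  # placeholder for the actual layer computation
    return Y
end

# allocating convenience wrapper: copy, then call the in-place version
apply(layer, Y::Array{T}) where {T} = apply!(layer, copy(Y))

# A_mul_B!-style variant: write into a preallocated output buffer, which also
# works for layers that change the feature dimension
function apply!(layer, Yout::Array{T}, Yin::Array{T}) where {T}
    Yout .= tanh.(Yin)             # placeholder for the actual layer computation
    return Yout
end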

@erantreister
Contributor

Hi Guys,

I wasn't aware of the conversations here, but here is a suggestion for you to check:

function reluActivation!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    A .= max.(A, zero(T));
    if doDerivative
        dA .= sign.(A);
    else
        dA = zeros(T, 0)
    end
    return A, dA
end

The .= operator updates the array in place (no allocation), and internally it uses vectorization and some multi-threading. I think this is the best way to go - I've had a really good experience with it so far. I only found out about it recently.
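For example, the tanh activation from the earlier posts would look like this with the same fused in-place broadcasting (just a sketch, not wired into Meganet's interface):

function tanhActivation!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    A .= tanh.(A)            # fused broadcast, written straight into A
    if doDerivative
        dA .= 1 .- A.^2
    else
        dA = zeros(T, 0)     # no derivative requested: return an empty array
    end
    return A, dA
end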

If you wish to work with Threads, that's an option, but in my experience they are far less efficient than OpenMP threads in C. I still think that Julia workers are the best way to go (maybe don't use all possible workers, and leave some for the automatic internal multithreading in BLAS and the like). I can explain this further on Skype if you wish.
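In case it helps the comparison, here is a minimal sketch of the worker-based route using SharedArrays so that every worker writes into the same preallocated memory (assumes Julia was started with several worker processes, e.g. julia -p 4; all names are illustrative):

using Distributed, SharedArrays
@everywhere using SharedArrays

# each worker fills its own chunk of A and dA in place
@everywhere function tanh_chunk!(A, dA, r)
    for k in r
        @inbounds A[k]  = tanh(A[k])
        @inbounds dA[k] = 1 - A[k]^2
    end
    return nothing
end

A  = SharedArray{Float64}(10*768, 2*512)
dA = SharedArray{Float64}(10*768, 2*512)
A .= randn(size(A)...)

# split the linear index range into one contiguous chunk per worker
chunks = collect(Iterators.partition(1:length(A), cld(length(A), nworkers())))
@sync for (p, r) in zip(workers(), chunks)
    @async remotecall_wait(tanh_chunk!, p, A, dA, r)
end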
