Preallocation in activations #25

Open
lruthotto opened this issue Jan 28, 2018 · 10 comments

@lruthotto
Contributor

Maybe we should write two versions of the activation functions (and other functions as well): one that allocates and one that operates in place. See the following example that Eran and I have put together.

Y  = randn(10*768, 2*512);
dY = zeros(size(Y));

# NOTE: doDerivative is accepted for interface compatibility but not used below.
function myTanhActivation!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    for k = 1:length(A)
        @inbounds A[k]  = tanh(A[k])
        @inbounds dA[k] = 1 - A[k]^2
    end
    return A, dA
end

function myTanhActivation(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    return myTanhActivation!(copy(A), copy(dA), doDerivative)
end

function myReluActivation!(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    for k = 1:length(Y)
        @inbounds Y[k]  = max(Y[k], zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y, dA
end

function myReluActivation(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    return myReluActivation!(copy(Y), copy(dA), doDerivative)
end

# warm up (trigger compilation) before timing
X  = copy(Y);
t1 = myTanhActivation!(X, X, true);
t2 = myTanhActivation(X, X, true);
t1 = myReluActivation!(X, X, true);
t2 = myReluActivation(X, X, true);
t1 = [];
t2 = [];
gc();   # force garbage collection (GC.gc() on Julia >= 0.7)

@time for k = 1:10; t2 = myTanhActivation(Y, dY, true); end
@time for k = 1:10; t2 = myTanhActivation!(Y, dY, true); end
@time for k = 1:10; t2 = myReluActivation(Y, dY, true); end
@time for k = 1:10; t2 = myReluActivation!(Y, dY, true); end
@klensink
Contributor

I really like this; there are lots of places where we can do things in place and save some GC time.

This, and the sparse kernels, are also really good places to play around with multithreading the for loops. I'll try to whip up an example before the call today; I tested it quickly yesterday and got great results.

@klensink
Contributor

Multithreading example:

using BenchmarkTools

Y = randn(10*768,2*512);
dY = zeros(size(Y));

function myTanhActivation_mt!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    Threads.@threads for k = 1:length(A)
        @inbounds A[k]  = tanh(A[k])
        @inbounds dA[k] = 1 - A[k]^2
    end
    return A, dA
end

function myTanhActivation!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    for k = 1:length(A)
        @inbounds A[k]  = tanh(A[k])
        @inbounds dA[k] = 1 - A[k]^2
    end
    return A, dA
end

function myTanhActivation(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T<:Real}
    return myTanhActivation!(copy(A), copy(dA), doDerivative)
end

function myReluActivation!(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    for k = 1:length(Y)
        @inbounds Y[k]  = max(Y[k], zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y, dA
end

function myReluActivation_mt!(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    Threads.@threads for k = 1:length(Y)
        @inbounds Y[k]  = max(Y[k], zero(T));
        @inbounds dA[k] = sign(Y[k]);
    end
    return Y, dA
end

function myReluActivation(Y::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    return myReluActivation!(copy(Y), copy(dA), doDerivative)
end

t1 = @benchmark myTanhActivation!($Y, $dY, true) 
t2 = @benchmark myTanhActivation_mt!($Y, $dY, true) 

r1 = @benchmark myReluActivation!($Y, $dY, true) 
r2 = @benchmark myReluActivation_mt!($Y, $dY, true) 

println("--- Tanh ---")
display(judge(median(t2), median(t1)))
println("--- RELU ---")
display(judge(median(r2), median(r1)))

Before starting Julia you need to give it more threads, e.g. export JULIA_NUM_THREADS=4 on OSX/Linux.
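A quick way to confirm the setting took effect (a plain REPL check, nothing Meganet-specific):

julia> Threads.nthreads()   # should report 4 after export JULIA_NUM_THREADS=4
4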

--- Tanh ---
BenchmarkTools.TrialJudgement: 
  time:   -76.94% => improvement (5.00% tolerance)
  memory: +150.00% => regression (1.00% tolerance)
--- RELU ---
BenchmarkTools.TrialJudgement: 
  time:   -55.77% => improvement (5.00% tolerance)
  memory: +150.00% => regression (1.00% tolerance)

julia> t1
BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     143.728 ms (0.00% GC)
  median time:      175.690 ms (0.00% GC)
  mean time:        163.612 ms (0.00% GC)
  maximum time:     178.066 ms (0.00% GC)
  --------------
  samples:          31
  evals/sample:     1

julia> t2
BenchmarkTools.Trial: 
  memory estimate:  80 bytes
  allocs estimate:  2
  --------------
  minimum time:     40.205 ms (0.00% GC)
  median time:      40.517 ms (0.00% GC)
  mean time:        40.570 ms (0.00% GC)
  maximum time:     43.109 ms (0.00% GC)
  --------------
  samples:          124
  evals/sample:     1

@lruthotto
Contributor Author

Sorry, I only saw this now. It looks really promising, I agree!

Question: How stable are the threads? I remember this being an experimental feature. As long as it also works stably for more complicated functions, I'm fine with using it.

@klensink
Contributor

I have the same question/concern as you. I've only used it before for simple loops, so we will need to do some testing to see how it handles more complicated loops.

It also seems to be a little finicky about your hardware. I've had no problems on Macs, but I had some problems yesterday on my Ubuntu machine.

@lruthotto
Contributor Author

Another question is how Threads performs when it's being used at different layers. Say, threading over all the examples in the forward propagation and then also using threading in the activation might cause problems, right?

@klensink
Contributor

Yes, you need to be careful how you layer them. If you end up using more threads than you have, they block each other and the program grinds to a halt.

The same goes for BLAS calls inside the loop: the number of BLAS threads times the number of Julia threads shouldn't exceed the number of threads your machine has.
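For what it's worth, here is a minimal sketch of keeping the two thread pools from oversubscribing each other (the array and loop body are only illustrative; on Julia 0.6 the BLAS functions live under Base.LinAlg.BLAS rather than LinearAlgebra.BLAS):

using LinearAlgebra

batch = randn(512, 256)      # illustrative data: 256 columns to process
out   = zeros(256)

# With JULIA_NUM_THREADS=4, pin BLAS to one thread so that
# (Julia threads) x (BLAS threads) does not exceed the core count.
BLAS.set_num_threads(1)

Threads.@threads for j = 1:size(batch, 2)
    # per-column work that itself dispatches to BLAS (dot uses BLAS for Float64)
    out[j] = dot(view(batch, :, j), view(batch, :, j))
end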

@klensink
Contributor

klensink commented Feb 9, 2018

On a similar note, couldn't we also create in-place apply functions for the normalization layers? I tried adding this and the derivative tests failed, so I assume it isn't always going to work. But in some places, like the example below, it seems like it shouldn't be an issue because the data is being overwritten anyway.

https://github.com/XtractOpen/Meganet.jl/blob/master/src/layers/singleLayer.jl#L37

@eldadHaber
Contributor

eldadHaber commented Feb 9, 2018 via email

@lruthotto
Contributor Author

I think an in-place apply function would be good to have for other Meganet elements as well. In particular, I'm talking about those that do not change the dimension of the input features.

Why don't we rename the current method to apply!, with the understanding that it will overwrite the input features? Then let's also add a small wrapper apply that just copies the input features and calls apply!.

I'm not sure if this works, but apply! could also accept an additional, preallocated argument that gets overwritten with the output features; this is similar to A_mul_B! in Julia. This version would also allow pre-allocation for Meganet elements that change the dimension of the input features. Maybe this is a better option? (See the sketch below.)
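To make the proposal concrete, here is a minimal sketch of the three-method convention (hypothetical signatures; Meganet's actual apply takes more arguments, and the tanh body is only a stand-in for a real layer):

# overwrite the input features (only valid when the layer keeps the dimensions)
function apply!(layer, Y::Array{T}) where {T}
    Y .= tanh.(Y)                  # placeholder for the actual layer computation
    return Y
end

# allocating convenience wrapper: copy, then call the in-place version
apply(layer, Y::Array{T}) where {T} = apply!(layer, copy(Y))

# A_mul_B!-style variant: write into a preallocated output buffer, which also
# works for layers that change the feature dimension
function apply!(layer, Yout::Array{T}, Yin::Array{T}) where {T}
    Yout .= tanh.(Yin)             # placeholder for the actual layer computation
    return Yout
end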

@erantreister
Contributor

Hi Guys,

I wasn't aware of the conversations here, but here is a suggestion for you to check:

function reluActivation!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    A .= max.(A, zero(T));
    if doDerivative
        dA .= sign.(A);
    else
        dA = zeros(T, 0)
    end
    return A, dA
end

The .= operator updates the array in place (no allocation), and internally it uses vectorization and some multi-threading. I think this is the best way to go - I've had a really good experience with it so far. I only found out about it recently.
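For example, the tanh activation from the earlier posts would look like this with the same fused in-place broadcasting (just a sketch, not wired into Meganet's interface):

function tanhActivation!(A::Array{T}, dA::Array{T}, doDerivative::Bool=false) where {T}
    A .= tanh.(A)            # fused broadcast, written straight into A
    if doDerivative
        dA .= 1 .- A.^2
    else
        dA = zeros(T, 0)     # no derivative requested: return an empty array
    end
    return A, dA
end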

If you wish to work with Threads, that's an option, but in my experience they are far less efficient than OpenMP threads in C. I still think that Julia workers are the best way to go (maybe don't use all possible workers, and leave some for the automatic internal multithreading in BLAS and the like). I can explain this further on Skype if you wish.
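In case it helps the comparison, here is a minimal sketch of the worker-based route using SharedArrays so that every worker writes into the same preallocated memory (assumes Julia was started with several worker processes, e.g. julia -p 4; all names are illustrative):

using Distributed, SharedArrays
@everywhere using SharedArrays

# each worker fills its own chunk of A and dA in place
@everywhere function tanh_chunk!(A, dA, r)
    for k in r
        @inbounds A[k]  = tanh(A[k])
        @inbounds dA[k] = 1 - A[k]^2
    end
    return nothing
end

A  = SharedArray{Float64}(10*768, 2*512)
dA = SharedArray{Float64}(10*768, 2*512)
A .= randn(size(A)...)

# split the linear index range into one contiguous chunk per worker
chunks = collect(Iterators.partition(1:length(A), cld(length(A), nworkers())))
@sync for (p, r) in zip(workers(), chunks)
    @async remotecall_wait(tanh_chunk!, p, A, dA, r)
end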
