add batchnorm #139
Comments
Batchnorm is divided into 2 phases: train and test. During the test phase, the moments accumulated during training are used instead of the current batch's statistics. |
@CarloLucibello if we are going to have it, it's better to have it with a cuDNN backend. By the way, I'm using that ms array as a queue, so when the tape computation finishes you will always have those momenta values in order. You don't need to pass and maintain indices, I mean which momenta belong to which layer. This was the handy solution I found back then. |
You could perform a moving average calculation like value = (iter * value + this_value) / (iter+1)
Yep. I didn't delve into how Knet interfaces with CUDA though, so I don't think I'm able to deal with that. In any case I'd also like to have CPU support.
Having a queue is risky though, because if you train more than you test your queue will explode. It also seems incorrect, since in the test phase you want to average the moments over all the batches in your train set. |
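For reference, a minimal sketch of that incremental update in Julia 0.6-era syntax (the names running_average, iter and this_value are only illustrative): after n updates, value equals the mean over everything seen so far, which is the "average over all batches" behaviour wanted at test time, without keeping a queue.
# cumulative running average: value stays equal to the mean of all samples folded in so far
function running_average(value, this_value, iter)
    # iter = number of samples already folded into value (0 on the first call)
    return (iter * value + this_value) / (iter + 1)
end
# tiny check of the claim above
vs = [1.0, 2.0, 6.0]
value = 0.0
for (i, v) in enumerate(vs)
    value = running_average(value, v, i - 1)
end
value ≈ sum(vs) / length(vs)   # true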
I was thinking of this sort of design:
type BatchMoments
μs::Vector
σs::Vector
count::Int
end
BatchMoments(n::Integer) = BatchMoments(Vector(n), Vector(n), 0)
function Base.push!(ms::BatchMoments, μ, σ)
n = length(ms.μs)
ms.count = ms.count == n ? 1 : ms.count + 1
ms.μs[ms.count] = μ
ms.σs[ms.count] = σ
end
getmoments(ms::BatchMoments) = (mean(ms.μs), mean(ms.σs))
# Batch Normalization Layer
# works both for convolutional and fully connected layers
# mode, 0=>train, 1=>test
function batchnorm(w, x, ms::BatchMoments; mode=1, ϵ=1e-5)
if mode == 0
d = ndims(x) == 4 ? (1,2,4) : (2,)
s = prod(size(x,d...))
μ = sum(x,d) / s
x0 = x .- μ
x1 = x0 .* x0
σ = sqrt(ϵ + (sum(x1, d)) / s)
elseif mode == 1
μ, σ = getmoments(ms)
end
# getval strips the AutoGrad tape, so ms stores plain values even during backpropagation
push!(ms, AutoGrad.getval(μ), AutoGrad.getval(σ))
xhat = (x .- μ) ./ σ
return w[1] .* xhat .+ w[2]
end
function predict(w, ms::Vector{BatchMoments}, x; mode=0)
for i=1:3:length(w)-2
x = conv4(w[i], x)
x = batchnorm(w[i+1:i+2], x, ms[i÷3+1], mode=mode)
x = relu(x)
end
x = mat(x)
return w[end-1]*x .+ w[end]
end
where you initialize ms = [BatchMoments(num_train_batches) for _=1:length(w)÷3]. Does it seem good/correct? |
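A minimal sketch of how the proposed interface might be driven (w, opts, trainbatches, testbatches and num_train_batches are placeholders; the Julia 0.6-era Knet calls logp, grad and update! are assumed):
# build one BatchMoments buffer per conv block, matching the weight layout of predict above
ms = [BatchMoments(num_train_batches) for _ = 1:length(w)÷3]
function loss(w, ms, x, ygold; mode=0)
    ypred = predict(w, ms, x, mode=mode)
    ynorm = logp(ypred, 1)
    return -sum(ygold .* ynorm) / size(ygold, 2)
end
lossgradient = grad(loss)   # gradient w.r.t. the first argument, w
# training: mode=0 makes each batchnorm record its batch moments in ms
for (x, y) in trainbatches
    g = lossgradient(w, ms, x, y, mode=0)
    update!(w, g, opts)
end
# testing: mode=1 makes each batchnorm use the moments averaged over the stored batches
for (x, y) in testbatches
    scores = predict(w, ms, x, mode=1)
    # ... compute accuracy / test loss from scores ...
end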
@CarloLucibello I really like this interface. Did you try to use it in any of your real models? Are we sure about its correctness? These days none of the big models works without batchnorm. |
It would be better if there were a full ResNet example that not only classifies an image using pretrained weights but also trains the network. |
Yup, I agree with @AStupidBear. Maybe @CarloLucibello can integrate his solution into one of the models in the examples repository? The Lenet example looks like a good candidate since it has both conv and linear layers. |
AutoGrad does not yet support weights in user defined types (don't know how
to construct the gradient object in that case)
There is an implementation by @cangumeli that you might want to check out,
here is his message:
Those who need batch normalization for CNNs may check my implementation at
https://github.com/cangumeli/ResNets.jl/blob/master/resnet.jl (bnorm
function implements this layer). It uses running averages with exponential decay, similar to the Torch implementation, so it is not necessary to run additional iterations with frozen parameters. If you need batch norm for FC layers, just change the reduction dimensions to (2,), the batch dimension, in sum and sumabs2.
Thanks.
Can
|
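For concreteness, a minimal sketch of the Torch-style exponential-decay running statistics described above (illustrative names only; this is not the bnorm code from the linked repository, and channels is a placeholder):
# per-channel running statistics for a W x H x C x N input (reduction dims (1,2,4))
state = Dict(:mean => zeros(1, 1, channels, 1), :var => ones(1, 1, channels, 1))
# blend the current batch statistics into the running ones with exponential decay
function update_running!(state, μ, σ²; momentum=0.1)
    state[:mean] = (1 - momentum) .* state[:mean] .+ momentum .* μ
    state[:var]  = (1 - momentum) .* state[:var]  .+ momentum .* σ²
    return state
end
# at test time normalize with the stored statistics instead of batch statistics,
# so no extra iterations with frozen parameters are needed
testnorm(w, x, state; ϵ=1e-5) =
    w[1] .* (x .- state[:mean]) ./ sqrt.(state[:var] .+ ϵ) .+ w[2]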
It turns out that the code in my comment above is working (autograd doesn't have to access the user-defined types in training mode). The only problem was an ambiguity error between AutoGrad.Rec and size (I'll file an issue over there), so I had to change one line to
s = prod(_size(x,d...))
and define
_size(x::AutoGrad.Rec, a...) = size(AutoGrad.getval(x), a...)
_size(x, a...) = size(AutoGrad.getval(x), a...)
I tested the code on a fully convolutional network, both on CPU and GPU, and it is working quite well. It is impressive to see how much batch normalization accelerates training. I think it would be nice to have either this version or an exponential-decay one in Knet. While I appreciate the transparency of Knet a lot, it would be nice to have some common patterns in the library itself, rather than having to cut/paste or rewrite things over and over. If I get a green light I can file a PR with the code above (or a variation with exponential decay) and add an mnist+lenet+batchnorm example (resnet would be too involved). |
I think a PR would be great. We could start a new file under src
(modules.jl?) that has batchnorm, lstm, etc. defined.
|
@CarloLucibello I think your implementation is network dependent. I have another implementation which works well for me:
global moments = []
# mode: -1 => initialization, 0 => train, 1 => test
function batchnorm(w, x; λ = 0.9, mode = 0)
μ, σ = nothing, nothing
if mode == -1 || mode == 0
d = ndims(x) == 4 ? (1, 2, 4) : (2,)
s = prod(size(AutoGrad.getval(x), d...))
μ = sum(x, d) / s
x̄ = x .- μ
x₂ = x̄ .* x̄
σ = sqrt(1e-5 + sum(x₂, d) / s)
x̂ = (x .- μ) ./ σ
if mode == 0
μ = λ * μ + (1 - λ) * shift!(moments)
σ = λ * σ + (1 - λ) * shift!(moments)
end
elseif mode == 1
μ, σ = shift!(moments), shift!(moments)
x̂ = (x .- μ) ./ σ
end
push!(moments, AutoGrad.getval(μ), AutoGrad.getval(σ))
return w[1] .* x̂ .+ w[2]
end |
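One way this global-queue interface could be driven (illustrative only; predict, lossgradient, opts and the batch iterators are placeholders, and every pass must call batchnorm for the layers in the same order so the queue stays aligned with them):
# 1. seed the queue: one forward pass with mode = -1 pushes initial (μ, σ) for every layer
x0, _ = first(trainbatches)
predict(w, x0, mode=-1)
# 2. train: mode = 0 pops the stored moments, blends them with the batch moments, pushes the result back
for (x, y) in trainbatches
    g = lossgradient(w, x, y, mode=0)
    update!(w, g, opts)
end
# 3. test: mode = 1 pops the stored moments, normalizes with them, and pushes them back unchanged
for (x, y) in testbatches
    scores = predict(w, x, mode=1)
    # ... evaluate scores ...
end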
OK, the solution in my head was something like this:
function loss(w,x,ygold,ms=[]; predict=resnet101)
ypred = predict(w,x,ms)
ynorm = logp(ypred,1) # ypred .- log(sum(exp(ypred),1))
-sum(ygold .* ynorm) / size(ygold,2)
end
lossgradient = grad(loss)
# one minibatch training
function train!(w,x,ygold,moments,opts; lambda=0.99)
this_moments = []
gloss = lossgradient(w,x,ygold,this_moments)
update!(w, gloss, opts)
update_moments!(moments,this_moments,lambda)
end
function update_moments!(moments,this_moments,lambda)
for k = 1:length(moments)
moments[k] = lambda * moments[k] + (1-lambda) * this_moments[k]
end
end
I think they are all similar solutions. Which one you use is just a matter of choice. |
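One detail the sketch above leaves implicit is that moments must already have the right length before update_moments! blends into it; an illustrative (not the author's) way to seed it from the first minibatch:
# seed `moments` with one extra forward pass on the first minibatch, then train as above
moments = Any[]
firstbatch = true
for (x, y) in trainbatches
    if firstbatch
        this = Any[]
        loss(w, x, y, this)          # forward pass only, just to collect the layer moments
        append!(moments, this)
        firstbatch = false
    end
    train!(w, x, y, moments, opts)
end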
Will revisit this issue once the Modular Interface is in place. |
Closing this, CUDNN based batchnorm added in latest master. |
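For anyone landing here later, the cuDNN-backed layer mentioned above is exposed roughly as follows (signatures recalled from the Knet documentation of that period, so treat them as an assumption and check ?batchnorm, ?bnmoments and ?bnparams):
using Knet
# assumed API: bnmoments() holds the running mean/var, bnparams gives scale+bias per channel
moments = bnmoments()
params  = bnparams(Float32, 16)                         # for a 16-channel feature map
x = KnetArray(rand(Float32, 8, 8, 16, 4))               # W x H x C x N input (use Array on CPU)
ytrain = batchnorm(x, moments, params)                  # training: batch statistics, moments updated
ytest  = batchnorm(x, moments, params; training=false)  # test: uses the stored running statistics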
Could the batch normalization code from the resnet example
https://github.com/denizyuret/Knet.jl/blob/f3c2887c4f0b1cde6e571aa35f097ccc3710ebc0/examples/resnet.jl
be added and exported by Knet? I can eventually file a PR. It is not clear to me how the storage of the momenta in ms could be handled in a robust way, though.