add batchnorm #139

Closed
CarloLucibello opened this issue Jun 24, 2017 · 14 comments

@CarloLucibello
Collaborator

CarloLucibello commented Jun 24, 2017

Could the code from resnet example
https://github.com/denizyuret/Knet.jl/blob/f3c2887c4f0b1cde6e571aa35f097ccc3710ebc0/examples/resnet.jl
for batch normalization be added and exported by Knet?

# Batch Normalization Layer
# works both for convolutional and fully connected layers
# mode, 0=>train, 1=>test
function batchnorm(w, x, ms; mode=1, epsilon=1e-5)
    mu, sigma = nothing, nothing
    if mode == 0
        d = ndims(x) == 4 ? (1,2,4) : (2,)
        s = prod(size(x,d...))
        mu = sum(x,d) / s
        x0 = x .- mu
        x1 = x0 .* x0
        sigma = sqrt(epsilon + (sum(x1, d)) / s)
    elseif mode == 1
        mu = shift!(ms)
        sigma = shift!(ms)
    end

    # we need getval in backpropagation
    push!(ms, AutoGrad.getval(mu), AutoGrad.getval(sigma))
    xhat = (x.-mu) ./ sigma
    return w[1] .* xhat .+ w[2]
end
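
For context, ms acts as a FIFO queue here: a train-mode call pushes the batch statistics, and a test-mode call pops them back off in the same order. A minimal usage sketch, assuming plain CPU arrays and hypothetical shapes:

ms = Any[]                                # shared statistics queue
w  = Any[ones(1,1,3,1), zeros(1,1,3,1)]   # scale and shift for a 3-channel conv layer (hypothetical)
x  = randn(8,8,3,16)                      # a W×H×C×N feature map (hypothetical)

ytrain = batchnorm(w, x, ms; mode=0)      # computes mu, sigma from the batch and enqueues them
ytest  = batchnorm(w, x, ms; mode=1)      # dequeues the stored mu, sigma and re-enqueues them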

I can eventually file a PR.

It is not clear to me, though, how the storage of the momenta in ms could be handled in a robust way.

@AStupidBear

Batchnorm is divided into two phases: train and test. During the test phase, ms is calculated from the whole dataset. However, the resnet example doesn't show how to calculate ms in an elegant way: I have to manually accumulate and average ms over the whole dataset.
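
Concretely, the manual accumulation I mean looks roughly like this (a sketch; train_batches and the predict signature are hypothetical):

ms_sum, nbatches = Any[], 0
for (x, _) in train_batches              # hypothetical iterator over training minibatches
    ms = Any[]
    predict(w, x, ms; mode=0)            # train-mode forward pass fills ms for every batchnorm layer
    ms_sum = isempty(ms_sum) ? ms : Any[a + b for (a, b) in zip(ms_sum, ms)]
    nbatches += 1
end
ms_avg = Any[m / nbatches for m in ms_sum]   # reused at test time with mode=1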

@ilkerkesen
Collaborator

You could perform a moving average calculation like value = (iter * value + this_value) / (iter+1)
At least, this is what I'd do if I were using a batch normalization layer in the training phase. @AStupidBear
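
In code, applied to each stored moment, that would be something like (a sketch; running_ms, this_ms, and iter are hypothetical names):

for k in 1:length(running_ms)
    running_ms[k] = (iter * running_ms[k] + this_ms[k]) / (iter + 1)
end
iter += 1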

@CarloLucibello if we are going to have it, it's better to have it with a cuDNN backend. By the way, I'm using that ms array as a queue. So, when the tape computation finishes, you will always have those momenta values ordered. You don't need to pass and maintain indices, i.e. which momenta belong to which layer. This was the handiest solution I could find at the time.

@CarloLucibello
Collaborator Author

> You could perform a moving average calculation like value = (iter * value + this_value) / (iter+1)
> At least, this is what I'd do if I were using a batch normalization layer in the training phase. @AStupidBear

> @CarloLucibello if we are going to have it, it's better to have it with a cuDNN backend.

Yep. I didn't delve into how Knet interfaces with CUDA though, so I don't think I'm able to deal with that. In any case, I'd also like to have CPU support.

> By the way, I'm using that ms array as a queue. So, when the tape computation finishes, you will always have those momenta values ordered. You don't need to pass and maintain indices, i.e. which momenta belong to which layer. This was the handiest solution I could find at the time.

Having a queue is risky though, because if you train more than you test, your queue will explode. It also seems incorrect, since at test time you want to average the moments over all the batches in your train set.

@CarloLucibello
Collaborator Author

CarloLucibello commented Jun 29, 2017

I was thinking of this sort of design:

type BatchMoments
    μs::Vector
    σs::Vector
    count::Int
end

BatchMoments(n::Integer) = BatchMoments(Vector(n), Vector(n), 0)

function Base.push!(ms::BatchMoments, μ, σ)
    n = length(ms.μs)
    ms.count = ms.count == n ? 1 : ms.count + 1
    ms.μs[ms.count] = μ
    ms.σs[ms.count] = σ
end

getmoments(ms::BatchMoments) = (mean(ms.μs), mean(ms.σs))

# Batch Normalization Layer
# works both for convolutional and fully connected layers
# mode, 0=>train, 1=>test
function batchnorm(w, x, ms::BatchMoments; mode=1, ϵ=1e-5)
    if mode == 0
        d = ndims(x) == 4 ? (1,2,4) : (2,)
        s = prod(size(x,d...))
        μ = sum(x,d) / s
        x0 = x .- μ
        x1 = x0 .* x0
        σ = sqrt(ϵ + (sum(x1, d)) / s)
    elseif mode == 1
        μ, σ = getmoments(ms)
    end
    # we need getval in backpropagation
    push!(ms, AutoGrad.getval(μ), AutoGrad.getval(σ))
    xhat = (x .- μ) ./ σ
    return w[1] .* xhat .+ w[2]
end

function predict(w, ms::Vector{BatchMoments}, x; mode=0)
    for i=1:3:length(w)-2
        x = conv4(w[i], x)
        x = batchnorm(w[i+1:i+2], x, ms[i÷3+1], mode=mode)
        x = relu(x)
    end
    x = mat(x)
    return w[end-1]*x .+ w[end]
end

where you initialize ms with

ms = [BatchMoments(num_train_batches) for _=1:length(w)÷3] 

Does it seem good/correct?
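
For concreteness, a rough usage sketch under these assumptions (the data iterator, loss, and opts are hypothetical):

ms = [BatchMoments(num_train_batches) for _=1:length(w)÷3]
lossgradient = grad(loss)                    # loss(w, ms, x, ygold) would call predict(w, ms, x; mode=0)
for epoch = 1:nepochs, (x, ygold) in train_batches
    g = lossgradient(w, ms, x, ygold)
    update!(w, g, opts)
end
ypred = predict(w, ms, xtest; mode=1)        # test time: moments averaged by getmoments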

@mambuDL

mambuDL commented Jun 29, 2017

@CarloLucibello I really like this interface. Did you try to use it in any of your real models? Are we sure about its correctness? These days none of the big models work without batchnorm.

@AStupidBear

It would be better if there were a full ResNet example which not only classifies an image using pretrained weights and ms, but also trains a large net on an appropriate dataset and collects the running-average ms in an elegant way.

@mambuDL

mambuDL commented Jun 29, 2017

Yup, I agree with @AStupidBear. Maybe @CarloLucibello can integrate his solution into one of the models in the examples repository? The LeNet example looks like a good candidate since it has both conv and linear layers.

@denizyuret
Owner

denizyuret commented Jun 29, 2017 via email

@CarloLucibello
Collaborator Author

CarloLucibello commented Jun 29, 2017

> AutoGrad does not yet support weights in user defined types (don't know how to construct the gradient object in that case)

It turns out that the code in my comment above works (AutoGrad doesn't have to access the user-defined types in training mode). The only problem was a method ambiguity error with size on AutoGrad.Rec (I'll file an issue over there), so I had to change a line to

s = prod(_size(x,d...))

and define

_size(x::AutoGrad.Rec, a...) = size(getval(x), a...)
_size(x, a...) = size(getval(x), a...)

I tested the code on a fully convolutional network, both on CPU and GPU, and it works quite well. It is impressive to see how much batch normalization accelerates training.

I think it would be nice to have either this version or an exponential-decay one in Knet. While I very much appreciate Knet's transparency, it would be nice to have some common patterns in the library itself, rather than having to cut/paste or rewrite things over and over.

If I get a green light, I can file a PR with the code above (or a variation with exponential decay) and add an mnist+lenet+batchnorm example (resnet would be too involved).

@denizyuret
Owner

denizyuret commented Jun 30, 2017 via email

@AStupidBear

@CarloLucibello I think your implementation is network dependent. I have another implementation which works well for me.

global moments = []
# mode: initialization => -1, 0 => train, 1 => test
function batchnorm(w, x; λ = 0.9, mode = 0)
  μ, σ = nothing, nothing
  if mode == -1 || mode == 0 
    d = ndims(x) == 4 ? (1, 2, 4) : (2,)
    s = prod(size(AutoGrad.getval(x), d...))
    μ = sum(x, d) / s
    x̄ = x .- μ
    x₂ = x̄ .* x̄
    σ = sqrt(1e-5 + sum(x₂, d) / s)
    x̂ = (x .- μ) ./ σ
    if mode == 0
      μ = λ * μ + (1 - λ) * shift!(moments)
      σ = λ * σ + (1 - λ) * shift!(moments)
    end
  elseif mode == 1
    μ, σ = shift!(moments), shift!(moments)
    x̂ = (x .- μ) ./ σ
  end
  push!(moments, AutoGrad.getval(μ), AutoGrad.getval(σ))
  return w[1] .* x̂ .+ w[2]
end
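
The intended calling pattern is roughly (a sketch with one batchnorm layer; w and the inputs are hypothetical):

empty!(moments)
batchnorm(w, x1; mode = -1)        # first batch: seed the queue with raw batch statistics
batchnorm(w, x2; mode = 0)         # later batches: blend new statistics with the stored ones
y = batchnorm(w, xtest; mode = 1)  # test: reuse the running mu, sigma unchanged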

@ilkerkesen
Collaborator

OK, the solution in my head was something like this:

function loss(w,x,ygold,ms=[]; predict=resnet101)
    ypred = predict(w,x,ms)
    ynorm = logp(ypred,1)  # ypred .- log(sum(exp(ypred),1))
    -sum(ygold .* ynorm) / size(ygold,2)
end

lossgradient = grad(loss)

# one minibatch training
function train!(w,x,ygold,moments,opts; lambda=0.99)
    this_moments = []
    gloss = lossgradient(w,x,ygold,this_moments)
    update!(w, gloss, opts)
    update_moments!(moments,this_moments,lambda)
end

function update_moments!(moments,this_moments,lambda)
    for k = 1:length(moments)
        moments[k] = lambda * moments[k] + (1-lambda) * this_moments[k]
    end
end
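
One detail the sketch above leaves implicit: moments has to be seeded on the first minibatch, otherwise update_moments! loops over an empty vector and nothing is ever stored. Roughly (a hypothetical driver):

moments = []
for (x, ygold) in train_batches          # hypothetical minibatch iterator
    if isempty(moments)
        this_moments = []
        gloss = lossgradient(w, x, ygold, this_moments)
        update!(w, gloss, opts)
        append!(moments, this_moments)   # seed the running moments once
    else
        train!(w, x, ygold, moments, opts)
    end
end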

I think they are all similar solutions. Which one you use is just a matter of choice.

@denizyuret
Owner

Will revisit this issue once the Modular Interface is in place.

@denizyuret
Owner

Closing this; cuDNN-based batchnorm was added in the latest master.
