add batchnorm #139

Closed
CarloLucibello opened this issue Jun 24, 2017 · 14 comments

@CarloLucibello
Collaborator

CarloLucibello commented Jun 24, 2017

Could the code from resnet example
https://github.com/denizyuret/Knet.jl/blob/f3c2887c4f0b1cde6e571aa35f097ccc3710ebc0/examples/resnet.jl
for batch normalization be added and exported by Knet?

# Batch Normalization Layer
# works both for convolutional and fully connected layers
# mode, 0=>train, 1=>test
function batchnorm(w, x, ms; mode=1, epsilon=1e-5)
    mu, sigma = nothing, nothing
    if mode == 0
        d = ndims(x) == 4 ? (1,2,4) : (2,)
        s = prod(size(x,d...))
        mu = sum(x,d) / s
        x0 = x .- mu
        x1 = x0 .* x0
        sigma = sqrt(epsilon + (sum(x1, d)) / s)
    elseif mode == 1
        mu = shift!(ms)
        sigma = shift!(ms)
    end

    # we need getval in backpropagation
    push!(ms, AutoGrad.getval(mu), AutoGrad.getval(sigma))
    xhat = (x.-mu) ./ sigma
    return w[1] .* xhat .+ w[2]
end
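
For context, ms acts as a FIFO queue here: a train-mode call pushes the batch statistics, and a test-mode call pops them back off in the same order. A minimal usage sketch, assuming plain CPU arrays and hypothetical shapes:

ms = Any[]                                # shared statistics queue
w  = Any[ones(1,1,3,1), zeros(1,1,3,1)]   # scale and shift for a 3-channel conv layer (hypothetical)
x  = randn(8,8,3,16)                      # a W×H×C×N feature map (hypothetical)

ytrain = batchnorm(w, x, ms; mode=0)      # computes mu, sigma from the batch and enqueues them
ytest  = batchnorm(w, x, ms; mode=1)      # dequeues the stored mu, sigma and re-enqueues them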

I can eventually file a PR.

It is not clear to me, though, how the storage of the momenta in ms could be handled in a robust way.

@AStupidBear

Batchnorm is divided into two phases: train and test. During the test phase, ms is calculated from the whole dataset. However, the resnet example doesn't show how to calculate ms in an elegant way: I have to manually accumulate and average ms over the whole dataset.
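
Concretely, the manual accumulation I mean looks roughly like this (a sketch; train_batches and the predict signature are hypothetical):

ms_sum, nbatches = Any[], 0
for (x, _) in train_batches              # hypothetical iterator over training minibatches
    ms = Any[]
    predict(w, x, ms; mode=0)            # train-mode forward pass fills ms for every batchnorm layer
    ms_sum = isempty(ms_sum) ? ms : Any[a + b for (a, b) in zip(ms_sum, ms)]
    nbatches += 1
end
ms_avg = Any[m / nbatches for m in ms_sum]   # reused at test time with mode=1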

@ilkerkesen
Collaborator

You could perform a moving average calculation like value = (iter * value + this_value) / (iter+1)
At least, this is what I'd do if I were using a batch normalization layer in the training phase. @AStupidBear
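
In code, applied to each stored moment, that would be something like (a sketch; running_ms, this_ms, and iter are hypothetical names):

for k in 1:length(running_ms)
    running_ms[k] = (iter * running_ms[k] + this_ms[k]) / (iter + 1)
end
iter += 1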

@CarloLucibello if we are going to have it, it's better to have it with a cuDNN backend. By the way, I'm using that ms array as a queue. So, when the tape computation finishes, you will always have those momenta values ordered. You don't need to pass and maintain indices, i.e. which momenta belong to which layer. This was the handiest solution I could find at the time.

@CarloLucibello
Collaborator Author

> You could perform a moving average calculation like value = (iter * value + this_value) / (iter+1)
> At least, this is what I'd do if I were using a batch normalization layer in the training phase. @AStupidBear

> @CarloLucibello if we are going to have it, it's better to have it with a cuDNN backend.

Yep. I didn't delve into how Knet interfaces with CUDA though, so I don't think I'm able to deal with that. In any case, I'd also like to have CPU support.

> By the way, I'm using that ms array as a queue. So, when the tape computation finishes, you will always have those momenta values ordered. You don't need to pass and maintain indices, i.e. which momenta belong to which layer. This was the handiest solution I could find at the time.

Having a queue is risky though, because if you train more than you test, your queue will explode. It also seems incorrect, since at test time you want to average the moments over all the batches in your train set.

@CarloLucibello
Collaborator Author

CarloLucibello commented Jun 29, 2017

I was thinking of this sort of design:

type BatchMoments
    μs::Vector
    σs::Vector
    count::Int
end

BatchMoments(n::Integer) = BatchMoments(Vector(n), Vector(n), 0)

function Base.push!(ms::BatchMoments, μ, σ)
    n = length(ms.μs)
    ms.count = ms.count == n ? 1 : ms.count + 1
    ms.μs[ms.count] = μ
    ms.σs[ms.count] = σ
end

getmoments(ms::BatchMoments) = (mean(ms.μs), mean(ms.σs))

# Batch Normalization Layer
# works both for convolutional and fully connected layers
# mode, 0=>train, 1=>test
function batchnorm(w, x, ms::BatchMoments; mode=1, ϵ=1e-5)
    if mode == 0
        d = ndims(x) == 4 ? (1,2,4) : (2,)
        s = prod(size(x,d...))
        μ = sum(x,d) / s
        x0 = x .- μ
        x1 = x0 .* x0
        σ = sqrt(ϵ + (sum(x1, d)) / s)
    elseif mode == 1
        μ, σ = getmoments(ms)
    end
    # we need getval in backpropagation
    push!(ms, AutoGrad.getval(μ), AutoGrad.getval(σ))
    xhat = (x .- μ) ./ σ
    return w[1] .* xhat .+ w[2]
end

function predict(w, ms::Vector{BatchMoments}, x; mode=0)
    for i=1:3:length(w)-2
        x = conv4(w[i], x)
        x = batchnorm(w[i+1:i+2], x, ms[i÷3+1], mode=mode)
        x = relu(x)
    end
    x = mat(x)
    return w[end-1]*x .+ w[end]
end

where you initialize ms with

ms = [BatchMoments(num_train_batches) for _=1:length(w)÷3] 

Does it seem good/correct?
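
For concreteness, a rough usage sketch under these assumptions (the data iterator, loss, and opts are hypothetical):

ms = [BatchMoments(num_train_batches) for _=1:length(w)÷3]
lossgradient = grad(loss)                    # loss(w, ms, x, ygold) would call predict(w, ms, x; mode=0)
for epoch = 1:nepochs, (x, ygold) in train_batches
    g = lossgradient(w, ms, x, ygold)
    update!(w, g, opts)
end
ypred = predict(w, ms, xtest; mode=1)        # test time: moments averaged by getmoments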

@mambuDL

mambuDL commented Jun 29, 2017

@CarloLucibello I really like this interface. Did you try to use it in any of your real models? Are we sure about its correctness? These days none of the big models work without batchnorm.

@AStupidBear

It would be better if there were a full ResNet example which not only classifies an image using pretrained weights and ms, but also trains a large net on an appropriate dataset and collects the running-average ms in an elegant way.

@mambuDL

mambuDL commented Jun 29, 2017

Yup, I agree with @AStupidBear. Maybe @CarloLucibello can integrate his solution into one of the models in the examples repository? The LeNet example looks like a good candidate since it has both conv and linear layers.

@denizyuret
Owner

denizyuret commented Jun 29, 2017 via email

@CarloLucibello
Collaborator Author

CarloLucibello commented Jun 29, 2017

> AutoGrad does not yet support weights in user defined types (don't know how to construct the gradient object in that case)

It turns out that the code in my comment above works (AutoGrad doesn't have to access the user-defined types in training mode). The only problem was a method ambiguity error with size on AutoGrad.Rec (I'll file an issue over there), so I had to change a line to

s = prod(_size(x,d...))

and define

_size(x::AutoGrad.Rec, a...) = size(getval(x), a...)
_size(x, a...) = size(getval(x), a...)

I tested the code on a fully convolutional network, both on CPU and GPU, and it works quite well. It is impressive to see how much batch normalization accelerates training.

I think it would be nice to have either this version or an exponential-decay one in Knet. While I very much appreciate Knet's transparency, it would be nice to have some common patterns in the library itself, rather than having to cut/paste or rewrite things over and over.

If I get a green light, I can file a PR with the code above (or a variation with exponential decay) and add an mnist+lenet+batchnorm example (resnet would be too involved).

@denizyuret
Owner

denizyuret commented Jun 30, 2017 via email

@AStupidBear

@CarloLucibello I think your implementation is network dependent. I have another implementation which works well for me.

global moments = []
# mode: initialization => -1, 0 => train, 1 => test
function batchnorm(w, x; λ = 0.9, mode = 0)
  μ, σ = nothing, nothing
  if mode == -1 || mode == 0 
    d = ndims(x) == 4 ? (1, 2, 4) : (2,)
    s = prod(size(AutoGrad.getval(x), d...))
    μ = sum(x, d) / s
    x̄ = x .- μ
    x₂ = x̄ .* x̄
    σ = sqrt(1e-5 + sum(x₂, d) / s)
    x̂ = (x .- μ) ./ σ
    if mode == 0
      μ = λ * μ + (1 - λ) * shift!(moments)
      σ = λ * σ + (1 - λ) * shift!(moments)
    end
  elseif mode == 1
    μ, σ = shift!(moments), shift!(moments)
    x̂ = (x .- μ) ./ σ
  end
  push!(moments, AutoGrad.getval(μ), AutoGrad.getval(σ))
  return w[1] .* x̂ .+ w[2]
end
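
The intended calling pattern is roughly (a sketch with one batchnorm layer; w and the inputs are hypothetical):

empty!(moments)
batchnorm(w, x1; mode = -1)        # first batch: seed the queue with raw batch statistics
batchnorm(w, x2; mode = 0)         # later batches: blend new statistics with the stored ones
y = batchnorm(w, xtest; mode = 1)  # test: reuse the running mu, sigma unchanged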

@ilkerkesen
Collaborator

OK, the solution in my head was something like this:

function loss(w,x,ygold,ms=[]; predict=resnet101)
    ypred = predict(w,x,ms)
    ynorm = logp(ypred,1)  # ypred .- log(sum(exp(ypred),1))
    -sum(ygold .* ynorm) / size(ygold,2)
end

lossgradient = grad(loss)

# one minibatch training
function train!(w,x,ygold,moments,opts; lambda=0.99)
    this_moments = []
    gloss = lossgradient(w,x,ygold,this_moments)
    update!(w, gloss, opts)
    update_moments!(moments,this_moments,lambda)
end

function update_moments!(moments,this_moments,lambda)
    for k = 1:length(moments)
        moments[k] = lambda * moments[k] + (1-lambda) * this_moments[k]
    end
end
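
One detail the sketch above leaves implicit: moments has to be seeded on the first minibatch, otherwise update_moments! loops over an empty vector and nothing is ever stored. Roughly (a hypothetical driver):

moments = []
for (x, ygold) in train_batches          # hypothetical minibatch iterator
    if isempty(moments)
        this_moments = []
        gloss = lossgradient(w, x, ygold, this_moments)
        update!(w, gloss, opts)
        append!(moments, this_moments)   # seed the running moments once
    else
        train!(w, x, ygold, moments, opts)
    end
end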

I think they are all similar solutions. Which one you use is just a matter of choice.

@denizyuret
Owner

Will revisit this issue once the Modular Interface is in place.

@denizyuret
Owner

Closing this; cuDNN-based batchnorm was added in the latest master.
