Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reversePush results in Stack overflow for larger than small networks #7

Closed
dynamical-butter-system opened this issue Jul 14, 2015 · 13 comments

Comments

@dynamical-butter-system

Since reversePush is not tail recursive, it results in a stack overflow for even simple networks when you push in a lot of data/use a bunch of memory. Here's the code I used to produce this (backprop is modified to allow a linear output layer and different activation functions but that's not relevant to the issue).

   let trainLine = [|
      for i in 0.0..0.01..18. -> 
            i, cos i|] 

   let testLine = [|
      for i in 0.0..0.01..24. -> 
            i, cos i|] 

   let train = trainLine |> Array.map (fun (x,y) -> vector [D x] , vector [D y])

   let net2 = createNetwork [|1;3; 1|]
   let train2 = backprop true sigmoid net2 1e-3 0.005 10000 train 

   let errs = train2 |> Seq.takeOrMax 600 |> Seq.toArray
@gbaydin
Copy link
Member

gbaydin commented Jul 14, 2015

Hi jragonslayer, thank you very much for reporting this!

There are three main things I should tell you:

  • For replicating the problem on my system and fixing it, could you please provide a full self-contained code, including your version of backprop and any other functions?
  • Are you running this as an fsx script or as a compiled executable? I've had stack overflow problems with F# in the past. Running code in F# Interactive, compiling with Debug or Build, or setting some compiler flags have effect on recursion limits. (Some relevant discussion: http://stackoverflow.com/questions/7947446/why-does-f-impose-a-low-limit-on-stack-size)
  • Most importantly, I should point out that the neural network code and the other examples in the "Examples" section are provided only as little demonstrations/tutorials on how to go about adding AD to your projects. They are VERY basic bare-bones implementations to show the basics of each method, kept simple and unoptimized for illustrative purposes. (I am actually working on a machine learning library built on top of DiffSharp, where neural networks and other things are fully implemented/tested and ready to work out of the box. I hope to release it soon.)

@dynamical-butter-system
Copy link
Author

Thanks! First of all, I want to say this is an awesome library; together with your Hybrid Monte Carlo implementation, you've basically removed a great deal of the burdensome boilerplate that drags down a lot of ML implementations.

I am actually working on a machine learning library built on top of DiffSharp, where neural networks and other things are fully implemented/tested and ready to work out of the box. I hope to release it soon.

Oh cool, I'd been looking to extend the example to support different loss functions, training methods (adagrad and momentum for a start) and the option to apply drop out. As well as trying out a Simple RNN, something which your library and functional programming make really easy to define. But since you're already doing a bunch, I'll wait until you're done to see where things go. You're right that the NN example is not sufficiently flexible, plus I haven't fully understood the workings of the library yet.

I also see that you're experimenting with MKL support (I've also gotten OpenBLAS to work, much friendlier licence) and that you plan to apply CUDA. I'd be happy to extend that to OpenCL (a couple good library choices available) if you have no plans for that. Sorry, I'm rambling now. Back to the issue.

@dynamical-butter-system
Copy link
Author

Here's the full code, crashes whether one uses the interactive, Debug or Release mode.

open DiffSharp.AD
open FsAlg.Generic 
open System   

let rnd = System.Random()

let rec rndn() =
    let x, y = rnd.NextDouble() * 2.0 - 1.0, rnd.NextDouble() * 2.0 - 1.0
    let s = x * x + y *y
    if s > 1.0 then rndn() else x * sqrt (-2.0 * (log s) / s)             


let takeOrMax n (seqs:'a seq) =
   let counter = ref 0
   let en = seqs.GetEnumerator()   
   seq {  
      while !counter < n && en.MoveNext() do 
         incr counter
         yield en.Current }  


// A layer of neurons
type Layer =
    {W:Matrix<D>  // Weight matrix
     b:Vector<D>} // Bias vector

// A feedforward network of neuron layers
type Network =
    {layers:Layer[]} // The layers forming this network                    \

let sigmoid (x:D) = 1. / (1. + exp -x)

let rect (x:D) = log (1. + exp x)

let runLayer flatlast f n (c, (x:Vector<D>)) (l:Layer) =
    let next_output = l.W * x + l.b     

    c + 1, 
    (if flatlast && c = n then next_output 
     else next_output |> Vector.map f) 

let runNetwork flatlast f (x:Vector<D>) (n:Network) =
    n.layers 
    |> Array.fold (runLayer flatlast f (n.layers.Length - 1)) (0,x) 
    |> snd     

let createNetwork (l:int[]) =
    {layers = Array.init (l.Length - 1) (fun i ->
        {W = Matrix.init l.[i + 1] l.[i] (fun _ _ -> D <| rndn())//(-0.5 + rnd.NextDouble()))
         b = Vector.init l.[i + 1] (fun _ -> D (rndn()))})}

let backprop flatlast f (n:Network) (eta:float) epsilon timeout (t:(Vector<_>*Vector<_>)[]) =
    let i = DiffSharp.Util.GlobalTagger.Next
    let atten_eta = ref eta
    let alpha = 1.
    let last_err = ref Double.MaxValue
    seq {for j in 0 .. timeout do
            for l in n.layers do
                l.W |> Matrix.replace (makeDR i)
                l.b |> Vector.replace (makeDR i) 

            let error = t |> Array.sumBy (fun (x, y) -> 
               let v = runNetwork flatlast f x n
               Vector.normSq (y - v))

            error |> reverseProp (D 1.) // Propagate adjoint value 1 backward
            let diff = float error - !last_err 

            if diff > 10. then atten_eta := !atten_eta * 0.999
            last_err := float error

            for l in n.layers do
                l.W |> Matrix.replace (fun (x:D) -> 
                  x.P - !atten_eta * x.A)
                l.b |> Vector.replace (fun (x:D) -> 
                  x.P - !atten_eta * x.A)
            //if j%50=0 then atten_eta := !atten_eta * 0.9
            if j = timeout then printfn "Failed to converge within %i steps." timeout
            printfn "Step: %d, Error %A, eta %A" j (float error) !atten_eta
            yield float error}
    |> Seq.takeWhile ((<) epsilon)


let trainLine = [|
   for i in 0.0..0.01..18. -> 
       i, cos i|] 

let testLine = [|
   for i in 0.0..0.01..24. -> 
       i, cos i|] 

let train = trainLine |> Array.map (fun (x,y) -> vector [D x] , vector [D y])

let net2 = createNetwork [|1;3; 1|]
let train2 = backprop true sigmoid net2 1e-3 0.005 10000 train//.[..500] 

let errs = train2 |> takeOrMax 60 |> Seq.toArray

testLine |> Array.map (fun (x,_) -> 
   let y = runNetwork true sigmoid (vector [D x]) net2
   x, float y.FirstItem)

@gbaydin
Copy link
Member

gbaydin commented Jul 15, 2015

Hi again, thank you for sharing the code. :)

I don't get stack overflow with this code. What I get is that backprop gradient descent seems to be diverging with nan values for the error and the network weights and thus terminates prematurely. (Divergence can happen due to many reasons and doesn't necessarily indicate there is something wrong with the derivatives or their calculation by the library.) Please tell me if I'm missing something.

Anyway, I'm looking into this!

@dynamical-butter-system
Copy link
Author

Ah, I don't know why it doesn't replicate. I'm using .net 4.5.1 and F# 3.1, tried 64 and 32 bit modes...maybe you have different settings for stack size? But it is true that the function not being tail recursive means it's at risk of blowing the stack given enough data, no?

(Yeh, I don't think the divergence issue has anything to do with the library, more a step size on gradient issue.)

@gbaydin
Copy link
Member

gbaydin commented Jul 15, 2015

I've been experimenting quite a bit with this code.

The divergence in the training seems to be caused by the modified activation of the last layer (enabled when flatlast = true). When unmodified, this particular combination of initial weight ranges, network structure, and training data is ill-conditioned and backpropagation diverges. This is what I get:

image

I was able prevent divergence in several ways.Trying different activation functions, learning rates, initial weight ranges, and training data ranges helps. For example, training works with

let train2 = backprop false sigmoid net2 1e-3 0.005 10000 train

image

On my system (also .NET 4.5.1 and F# 3.1) I'm not getting stack overflow. But I'm putting the tail recursion issue on my list of things to look at. Thank you.

Also, there is much room for improvement in this backprop function (momentum, better adaptive learning rates). I'm now writing a better backprop function and I will share it with you here a bit later today. I can also add it to the examples page.

@gbaydin
Copy link
Member

gbaydin commented Jul 15, 2015

Here are some modifications:

  • I changed the backprop function to have a "bold driver" type adaptive learning rate: if the error has decreased, slowly increase the learning rate; if the error has increased, sharply decrease the learning rate and undo the last update.
  • This backprop also has momentum (with the mu parameter)
  • I also modified the Layer type to keep different activation functions associated with each layer (a:D->D). You can create networks like this:
let net = createNetwork [|1; 3; 1|] [|id; tanh; sigmoid|]

This has tanh activations in the hidden layer, sigmoid in the final layer. Or you can

let net = createNetwork [|1; 3; 1|] [|id; sigmoid; id|]

to replicate the case in the first version of your code. (id is the identity function.)

Here is the full code:

open DiffSharp.AD
open FsAlg.Generic 
open System   
open FSharp.Charting


let rnd = System.Random()

let rec rndn() =
    let x, y = rnd.NextDouble() * 2.0 - 1.0, rnd.NextDouble() * 2.0 - 1.0
    let s = x * x + y *y
    if s > 1.0 then rndn() else x * sqrt (-2.0 * (log s) / s)             

type Layer =
    {W:Matrix<D> // Weight matrix
     b:Vector<D> // Bias vector
     a:D->D}     // Activation function

type Network =
    {l:Layer[]} // The layers forming this network
    member n.P = {l = n.l |> Array.map (fun l -> {W = l.W |> Matrix.map primal; b = l.b |> Vector.map primal; a = l.a})}
    member n.A = {l = n.l |> Array.map (fun l -> {W = l.W |> Matrix.map adjoint; b = l.b |> Vector.map adjoint; a = l.a})}
    member n.map(f:D->D) = {l = n.l |> Array.map (fun l -> {W = l.W |> Matrix.map f; b = l.b |> Vector.map f; a = l.a})}
    static member (*) (c:float, n:Network) = {l = n.l |> Array.map (fun l -> {W = l.W * D c; b = l.b * D c; a = l.a})}
    static member (+) (n1:Network, n2:Network) = {l = Array.map2 (fun l1 l2 -> {W = l1.W + l2.W; b = l1.b + l2.b; a = l1.a}) n1.l n2.l}
    static member (-) (n1:Network, n2:Network) = {l = Array.map2 (fun l1 l2 -> {W = l1.W - l2.W; b = l1.b - l2.b; a = l1.a}) n1.l n2.l}

let sigmoid (x:D) = 1. / (1. + exp -x)

let rect (x:D) = log (1. + exp x)

let runLayer (x:Vector<D>) (l:Layer) =
    l.W * x + l.b |> Vector.map l.a

let runNetwork (x:Vector<D>) (n:Network) =
    n.l |> Array.fold runLayer x

let createNetwork (l:int[]) (a:(D->D)[]) =
    {l = Array.init (l.Length - 1) (fun i ->
        {W = Matrix.init l.[i + 1] l.[i] (fun _ _ -> D <| rndn())
         b = Vector.init l.[i + 1] (fun _ -> D <| rndn())
         a = a.[i + 1]})}

let backprop (n:Network) (eta:float) (mu:float) timeout (t:(Vector<_>*Vector<_>)[]) =
    let i = DiffSharp.Util.GlobalTagger.Next
    let eta = ref eta
    let prev_error = ref Double.MaxValue
    let n = ref n
    let prev_n = ref ((!n).P)
    let update = ref (0. * (!n).P)
    for j in 0 .. timeout do
        n:= (!n).map(makeDR i)

        let error = t |> Array.sumBy (fun (x, y) -> 
            let v = runNetwork x (!n)
            Vector.normSq (y - v))

        let diff = float error - !prev_error
        prev_error := float error

        // "Bold driver" adaptive learning rate
        if diff < 0. then
            eta := !eta * 1.1 // Error decreased: increase learning rate
        else
            n := (!prev_n) // Error increased: undo last weight update, decrease learning rate
            eta := !eta * 0.5

        error |> reverseProp (D 1.) // Backpropagation
        prev_n := (!n).P // Backup before update, for possible future reversal
        update := -(!eta) * (!n).A - mu * (!update) // Momentum
        n := (!n).P + (!update)

        printfn "Step: %d, Error %2f, eta %2f" j (float error) !eta
    !n // Return trained network

let trainLine = [|
   for i in 0.0..0.01..6.28 -> 
       i, cos i|] 

let testLine = [|
   for i in 0.0..0.01..24. -> 
       i, cos i|] 

let train = trainLine |> Array.map (fun (x,y) -> vector [D x] , vector [D y])

let net = createNetwork [|1; 3; 1|] [|id; tanh; tanh|]
let net_trained = backprop net 1e-3 0.25 1000 train

let test = testLine |> Array.map (fun (x,_) -> 
   let y = runNetwork (vector [D x]) net_trained
   x, float y.FirstItem)

Chart.Line test

When I run this, training converges

image

and I get something like this:

image

@dynamical-butter-system
Copy link
Author

Great! Those are neat generalizations of the code, thanks.

Setting linear_outputlayer=false will affect the network's ability to learn a regression problem for ranges outside the activation output. I noticed most of the time, divergance occurs if the step size is too large and, the more complex the network, the smaller the step size should be. Do you think this might be due to autodiff and would normally not happened with a hand derived derivative? Thank you again.

@gbaydin
Copy link
Member

gbaydin commented Jul 16, 2015

The result of AD should be exactly the same with a correctly derived manual derivative (symbolic, not numerical approximation). If it's not the same, it should be a bug. :)

I was able to get the training converge with flatlast = true in your original code. Sometimes it converges to a very small error, but if you continue the training past that point, it quickly diverges. You can experiment with let net = createNetwork [|1; 3; 1|] [|id; sigmoid; id|] using the new code.

@dynamical-butter-system
Copy link
Author

~Yep I saw, Thanks. Eagerly awaiting the ML ML library plus the ability to hook up OpenBlas...that order of magnitude improvement will be much appreciated.

@gbaydin
Copy link
Member

gbaydin commented Sep 4, 2015

Hi again!

I wanted to let you know that you were right about the tail recursion problem.

I've been working on a major new version of the library and I also ran experiments with different forms of reversePush and reverseReset. I can confirm that the tail recursion issue was the cause of stack overflow problems. I have a fully tail recursive reverse AD prototype that fixes the issue and can run very large problems without overflow.

I will push the code soon and also put a notice here. Thank you very much!

@dynamical-butter-system
Copy link
Author

Awesome, glad to hear the library continues to progress and improve!

@gbaydin
Copy link
Member

gbaydin commented Sep 29, 2015

The tail recursion bug is fixed in version 0.7.0.

@gbaydin gbaydin closed this as completed Sep 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants