reversePush results in Stack overflow for larger than small networks #7
Hi jragonslayer, thank you very much for reporting this! There are three main things I should tell you:
Thanks! First of all, I want to say this is an awesome library; together with your Hybrid Monte Carlo implementation, you've basically removed a great deal of the burdensome boilerplate that drags down a lot of ML implementations.
Oh cool, I'd been looking to extend the example to support different loss functions, training methods (AdaGrad and momentum for a start), and the option to apply dropout, as well as trying out a simple RNN, something which your library and functional programming make really easy to define. But since you're already doing a bunch, I'll wait until you're done to see where things go. You're right that the NN example is not sufficiently flexible, and I haven't fully understood the workings of the library yet. I also see that you're experimenting with MKL support (I've also gotten OpenBLAS to work; much friendlier licence) and that you plan to add CUDA support. I'd be happy to extend that to OpenCL (a couple of good library choices are available) if you have no plans for that. Sorry, I'm rambling now. Back to the issue.
Here's the full code; it crashes whether run in F# Interactive, Debug, or Release mode.
Hi again, thank you for sharing the code. :) I don't get a stack overflow with this code. What I get is that the backprop gradient descent seems to be diverging. Anyway, I'm looking into this!
Ah, I don't know why it doesn't replicate. I'm using .NET 4.5.1 and F# 3.1, and I tried both 64- and 32-bit modes... maybe you have different settings for stack size? But it is true that the function not being tail recursive means it's at risk of blowing the stack given enough data, no? (Yeah, I don't think the divergence issue has anything to do with the library; it's more a step-size-on-gradient issue.)
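The tail-recursion concern can be illustrated in isolation. This is a hypothetical sketch, not DiffSharp's actual `reversePush`: a naive recursive traversal that does work after the recursive call grows the stack linearly with the data, while an accumulator-based tail-recursive version is compiled to a loop and uses constant stack.

```fsharp
// Hypothetical sketch (not library code): traversing a chain of n nodes.
let rec sumChain n =
    if n = 0 then 0L
    else 1L + sumChain (n - 1)        // work remains after the call: NOT tail recursive

let sumChainTail n =
    let rec loop acc n =
        if n = 0 then acc
        else loop (acc + 1L) (n - 1)  // call in tail position: compiled to a loop
    loop 0L n

// For a large enough n, sumChain overflows the default stack,
// while sumChainTail runs in constant stack space.
```

The same distinction applies to a reverse-AD tape walk: if each pushed adjoint triggers a nested (non-tail) recursive call, the stack depth tracks the size of the computation graph.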
I've been experimenting quite a bit with this code. The divergence in the training seems to be caused by the modified activation of the last layer. I was able to prevent divergence in several ways: trying different activation functions, learning rates, initial weight ranges, and training data ranges helps. For example, training works with `let train2 = backprop false sigmoid net2 1e-3 0.005 10000 train`. On my system (also .NET 4.5.1 and F# 3.1) I'm not getting a stack overflow, but I'm putting the tail recursion issue on my list of things to look at. Thank you. Also, there is much room for improvement in this backprop function (momentum, better adaptive learning rates). I'm now writing a better backprop function and I will share it with you here a bit later today. I can also add it to the examples page.
Here are some modifications:
The network here is created with `let net = createNetwork [|1; 3; 1|] [|id; tanh; tanh|]`; you can change it to `let net = createNetwork [|1; 3; 1|] [|id; sigmoid; id|]` to replicate the case in the first version of your code. Here is the full code:

```fsharp
open DiffSharp.AD
open FsAlg.Generic
open System
open FSharp.Charting

let rnd = System.Random()

// Normally distributed random numbers (Marsaglia polar method)
let rec rndn() =
    let x, y = rnd.NextDouble() * 2.0 - 1.0, rnd.NextDouble() * 2.0 - 1.0
    let s = x * x + y * y
    if s > 1.0 then rndn() else x * sqrt (-2.0 * (log s) / s)

type Layer =
    {W:Matrix<D>  // Weight matrix
     b:Vector<D>  // Bias vector
     a:D->D}      // Activation function

type Network =
    {l:Layer[]} // The layers forming this network
    member n.P = {l = n.l |> Array.map (fun l -> {W = l.W |> Matrix.map primal; b = l.b |> Vector.map primal; a = l.a})}
    member n.A = {l = n.l |> Array.map (fun l -> {W = l.W |> Matrix.map adjoint; b = l.b |> Vector.map adjoint; a = l.a})}
    member n.map(f:D->D) = {l = n.l |> Array.map (fun l -> {W = l.W |> Matrix.map f; b = l.b |> Vector.map f; a = l.a})}
    static member (*) (c:float, n:Network) = {l = n.l |> Array.map (fun l -> {W = l.W * D c; b = l.b * D c; a = l.a})}
    static member (+) (n1:Network, n2:Network) = {l = Array.map2 (fun l1 l2 -> {W = l1.W + l2.W; b = l1.b + l2.b; a = l1.a}) n1.l n2.l}
    static member (-) (n1:Network, n2:Network) = {l = Array.map2 (fun l1 l2 -> {W = l1.W - l2.W; b = l1.b - l2.b; a = l1.a}) n1.l n2.l}

let sigmoid (x:D) = 1. / (1. + exp -x)
let rect (x:D) = log (1. + exp x)

let runLayer (x:Vector<D>) (l:Layer) =
    l.W * x + l.b |> Vector.map l.a

let runNetwork (x:Vector<D>) (n:Network) =
    n.l |> Array.fold runLayer x

let createNetwork (l:int[]) (a:(D->D)[]) =
    {l = Array.init (l.Length - 1) (fun i ->
        {W = Matrix.init l.[i + 1] l.[i] (fun _ _ -> D <| rndn())
         b = Vector.init l.[i + 1] (fun _ -> D <| rndn())
         a = a.[i + 1]})}

let backprop (n:Network) (eta:float) (mu:float) timeout (t:(Vector<_>*Vector<_>)[]) =
    let i = DiffSharp.Util.GlobalTagger.Next
    let eta = ref eta
    let prev_error = ref Double.MaxValue
    let n = ref n
    let prev_n = ref ((!n).P)
    let update = ref (0. * (!n).P)
    for j in 0 .. timeout do
        n := (!n).map(makeDR i)
        let error = t |> Array.sumBy (fun (x, y) ->
            let v = runNetwork x (!n)
            Vector.normSq (y - v))
        let diff = float error - !prev_error
        prev_error := float error
        // "Bold driver" adaptive learning rate
        if diff < 0. then
            eta := !eta * 1.1 // Error decreased: increase learning rate
        else
            n := (!prev_n)    // Error increased: undo last weight update,
            eta := !eta * 0.5 // decrease learning rate
        error |> reverseProp (D 1.) // Backpropagation
        prev_n := (!n).P // Backup before update, for possible future reversal
        update := -(!eta) * (!n).A - mu * (!update) // Momentum
        n := (!n).P + (!update)
        printfn "Step: %d, Error %2f, eta %2f" j (float error) !eta
    !n // Return trained network

let trainLine = [| for i in 0.0..0.01..6.28 -> i, cos i |]
let testLine  = [| for i in 0.0..0.01..24.  -> i, cos i |]

let train = trainLine |> Array.map (fun (x, y) -> vector [D x], vector [D y])

let net = createNetwork [|1; 3; 1|] [|id; tanh; tanh|]
let net_trained = backprop net 1e-3 0.25 1000 train

let test = testLine |> Array.map (fun (x, _) ->
    let y = runNetwork (vector [D x]) net_trained
    x, float y.FirstItem)

Chart.Line test
```

When I run this, training converges and I get something like this:
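For reference, the "bold driver" rule used inside the training loop above can be isolated as a tiny function. This is just a sketch of the same heuristic, not part of the library:

```fsharp
// "Bold driver" heuristic from the backprop loop above, in isolation:
// grow the learning rate while the error keeps falling,
// halve it when the error rises.
let boldDriver (eta:float) (prevError:float) (error:float) =
    if error < prevError then eta * 1.1
    else eta * 0.5
```

In the full loop this is paired with restoring `prev_n`, so the weight update that increased the error is undone before retrying with the smaller rate.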
Great! Those are neat generalizations of the code, thanks. Setting `linear_outputlayer = false` will affect the network's ability to learn a regression problem for target ranges outside the activation's output range. I noticed that most of the time divergence occurs if the step size is too large, and the more complex the network, the smaller the step size should be. Do you think this might be due to autodiff, and would it normally not happen with a hand-derived derivative? Thank you again.
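The step-size effect is independent of how the gradient is computed: even with an exact gradient, plain gradient descent diverges once the step is too large. A minimal illustration on f(x) = x², whose exact gradient is 2x:

```fsharp
// Gradient descent on f(x) = x^2 with the exact gradient 2x.
// Each step multiplies x by (1 - 2*eta), so the iteration converges
// for eta < 1 and diverges for eta > 1, regardless of whether the
// gradient came from AD or from a hand derivation.
let descend eta steps =
    let mutable x = 1.0
    for _ in 1 .. steps do
        x <- x - eta * 2.0 * x
    x

// descend 0.1 100  -> ~0.0  (factor 0.8 per step: converges)
// descend 1.1 100  -> huge magnitude (factor -1.2 per step: diverges)
```

Roughly, deeper or more complex networks have larger curvature along some directions, which shrinks the largest stable step size in the same way.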
The result of AD should be exactly the same as a correctly derived manual derivative (symbolic, not a numerical approximation). If it's not the same, it would be a bug. :) I was able to get the training to converge.
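This exactness can be sanity-checked on a small function. The sketch below assumes the `diff` operator from DiffSharp's AD API of that era (the exact module and operator names may differ between versions): the AD derivative of sin(x²) should match the hand-derived 2x·cos(x²) to machine precision, unlike a finite-difference estimate.

```fsharp
// Sketch assuming DiffSharp's `diff` operator (module names vary by version).
open DiffSharp.AD

let f (x:D) = sin (x * x)

let x0 = 2.0
let adDeriv = float (diff f (D x0))      // derivative computed by AD
let manual  = 2.0 * x0 * cos (x0 * x0)   // hand-derived: d/dx sin(x^2) = 2x cos(x^2)

// adDeriv and manual should agree to machine precision, since AD applies
// the chain rule to exact elementary derivatives rather than approximating.
```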
Yep, I saw. Thanks. Eagerly awaiting the ML library, plus the ability to hook up OpenBLAS... that order-of-magnitude improvement will be much appreciated.
Hi again! I wanted to let you know that you were right about the tail recursion problem. I've been working on a major new version of the library, and I ran experiments with different forms of reversePush and reverseReset. I can confirm that the tail recursion issue was the cause of the stack overflow problems. I have a fully tail-recursive reverse AD prototype that fixes the issue and can run very large problems without overflow. I will push the code soon and post a notice here. Thank you very much!
Awesome, glad to hear the library continues to progress and improve! |
The tail recursion bug is fixed in version 0.7.0. |
Since reversePush is not tail recursive, it results in a stack overflow even for simple networks when you push in a lot of data or use a lot of memory. Here's the code I used to reproduce this (backprop is modified to allow a linear output layer and different activation functions, but that's not relevant to the issue).