Added CHECK if loss is NaN by sguada · Pull Request #1479 · BVLC/caffe

sguada · 2014-11-26T10:48:34Z

As mentioned in #1349, if the loss is NaN it doesn't make sense to keep training.

netheril96 · 2014-11-27T06:11:24Z

I have 74 different softmax loss layers in one network, and NaNs in some of them can be safely ignored. So this PR will negatively impact my currently working network training.

This should be an option in solver instead.

sguada · 2014-11-27T09:28:39Z

@netheril96 OK, I guess in that case the training can recover from NaNs. I will see how easy it is to add it as an option in the solver.
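The option discussed here can be sketched in a few lines. This is a minimal Python illustration, not the PR's C++ code: `check_loss` and `NanLossError` are hypothetical names, and `stop_on_nan` borrows the parameter name from the PR's commit message.

```python
import math


class NanLossError(RuntimeError):
    """Hypothetical error raised when training should stop on a NaN loss."""


def check_loss(loss, stop_on_nan=True):
    """Return True if the loss is usable, False if it is NaN and tolerated.

    Raises NanLossError when the loss is NaN and stop_on_nan is set.
    """
    if math.isnan(loss):
        if stop_on_nan:
            raise NanLossError("loss is NaN; stopping training")
        return False  # the caller may choose to ignore this loss
    return True
```

With `stop_on_nan=False`, a network that can recover from NaNs (as in the 74-loss-layer case above) keeps training and simply sees `False` for the affected iteration.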

longjon · 2014-12-01T06:54:18Z

Could "due to" be misleading here? It may be that the NaN appears in some low layer but gets propagated up to the loss... (there are really two possible meanings of "due to" here, and I know which one you mean now, but I'm worried I'll forget when I see this message later).

seanbell · 2015-04-26T01:28:30Z

More generally, perhaps this should die on any loss that isn't finite (i.e. NaN, +infinity, or -infinity).
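The difference between a NaN-only check and a finiteness check can be illustrated as follows. This is a small Python sketch rather than the solver's C++ code, and `loss_is_finite` is a hypothetical helper name.

```python
import math


def loss_is_finite(loss):
    """True only for ordinary numbers; False for NaN, +infinity, and -infinity."""
    return math.isfinite(loss)


for value in (0.25, float("nan"), float("inf"), float("-inf")):
    # math.isnan alone flags only NaN; the finiteness check also flags both infinities.
    print(value, "isnan:", math.isnan(value), "finite:", loss_is_finite(value))
```

A check based only on `isnan` would let an infinite loss through, which is the gap this comment points out.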

PiranjaF · 2015-07-09T09:57:06Z

Thanks a lot - this is highly needed! I've spent way too long trying to understand why pycaffe was crashing when the network was diverging. Quite annoying when trying to do a parameter search over a large parameter space.

PiranjaF · 2015-07-09T17:23:17Z

@sguada, @longjon: How would I go about stopping training more silently? With this implementation any script in pycaffe would still crash immediately. It should be possible to catch the error, or perhaps to set an option so that training stops without stopping everything else.

seanbell · 2015-07-10T06:38:56Z

@PiranjaF you could change it to break from the loop instead of crashing.
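A rough sketch of the "break instead of crash" idea, written in Python for brevity. The `train` function and its `step_fn` callback are hypothetical stand-ins for a solver loop, not Caffe's actual solver code.

```python
import math


def train(step_fn, max_iter):
    """Call step_fn(iteration) up to max_iter times.

    Stops early, without raising, as soon as a loss is not finite.
    Returns (iterations_completed, diverged).
    """
    completed = 0
    diverged = False
    for it in range(max_iter):
        loss = step_fn(it)
        if not math.isfinite(loss):
            diverged = True
            break  # leave the loop quietly instead of aborting the process
        completed += 1
    return completed, diverged


# Example: a fake step function whose loss becomes NaN at iteration 3.
print(train(lambda it: float("nan") if it == 3 else 1.0, max_iter=10))
```

Returning a flag leaves the decision to the caller, so a parameter search can record the diverged run and move on to the next setting.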

PiranjaF · 2015-07-11T22:14:40Z

@seanbell I did not think it could be done that easily. Thanks.
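For the parameter-search case raised in this exchange, another way to keep one diverging run from taking down the whole script, without modifying the solver, is to run each trial in its own process. The sketch below assumes that a failed check aborts the process; `run_trial` is a hypothetical helper, and `os.abort()` stands in for the crash.

```python
import subprocess
import sys


def run_trial(code):
    """Run a snippet of Python in a child process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", code])
    return result.returncode


# A child that aborts (standing in for a failed check) does not stop the parent.
crashed = run_trial("import os; os.abort()")
finished = run_trial("print('trial finished')")
print("crashed trial exit code:", crashed)
print("normal trial exit code:", finished)
```

A nonzero exit code marks the trial as failed, and the parent script continues with the next parameter setting.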

shelhamer · 2015-08-26T00:24:54Z

Closing since the dev branch is deprecated. Please send PRs to master.

longjon reviewed Dec 1, 2014

Added CHECK to Solver to stop training when isnan(loss), unless param…

4c174c5

… stop_on_nan is false

sguada force-pushed the check_nan_loss branch from df2e1f8 to 4c174c5 Compare December 26, 2014 18:54

shelhamer added the JL label Mar 10, 2015

jeffdonahue mentioned this pull request Apr 25, 2015

Die on NaN loss #2363

Closed

longjon added the needs rebase label Apr 25, 2015

shelhamer closed this Aug 26, 2015

sguada wants to merge 1 commit into BVLC:dev from sguada:check_nan_loss

6 participants
