Added CHECK if loss is NaN #1479

Closed
sguada wants to merge 1 commit into BVLC:dev from sguada:check_nan_loss

@sguada
Contributor

@sguada sguada commented Nov 26, 2014

As mentioned in #1349, if the loss is NaN it doesn't make sense to keep training.

@netheril96
Contributor

I have 74 different softmax loss layers in one network, and some of them being NaN can be safely ignored. So this PR would negatively impact my currently working network training.

This should be an option in the solver instead.
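A minimal sketch of what such an opt-in check could look like, written as framework-neutral Python rather than Caffe code. The names `NanCheckOptions`, `enabled`, `ignore`, and `offending_losses` are hypothetical and are not fields of Caffe's solver configuration; the loss names in the usage note are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class NanCheckOptions:
    # Hypothetical option names; NOT part of Caffe's SolverParameter.
    enabled: bool = True                # turn the NaN check on or off
    ignore: frozenset = frozenset()     # loss names whose NaNs are tolerated


def offending_losses(losses, options):
    """Return the names of NaN losses that the options do not exempt.

    `losses` maps a loss name to its current scalar value.
    """
    if not options.enabled:
        return []
    return [
        name
        for name, value in losses.items()
        if math.isnan(value) and name not in options.ignore
    ]
```

With many loss layers, a caller could list the ones that are allowed to go NaN in `ignore` and stop training only when `offending_losses(...)` returns a non-empty list.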

@sguada
Contributor Author

sguada commented Nov 27, 2014

@netheril96 OK, I guess in that case the training can recover from NaNs. I will see how easy it is to add it as an option in the solver.

Comment thread include/caffe/layer.hpp Outdated
Contributor

Could "due to" be misleading here? It may be that the NaN appears in some low layer but gets propagated up to the loss... (there are really two possible meanings of "due to" here, and I know which one you mean now, but I'm worried I'll forget when I see this message later).

@seanbell

More generally, perhaps this should die on any loss that isn't finite (i.e. NaN, +infinity, or -infinity).
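A minimal sketch of the broader check, in plain Python rather than the PR's C++. The function name `check_loss_is_finite` and the choice of `FloatingPointError` are assumptions made for illustration; the point is only that a finiteness test covers NaN and both infinities, whereas a NaN-only test lets an infinite loss through.

```python
import math


def check_loss_is_finite(loss, iteration):
    """Return `loss` unchanged, or raise if it is NaN, +inf, or -inf.

    Hypothetical helper: the name and the exception type are illustrative.
    """
    if not math.isfinite(loss):
        raise FloatingPointError(
            f"non-finite loss {loss!r} at iteration {iteration}"
        )
    return loss
```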

@PiranjaF

PiranjaF commented Jul 9, 2015

Thanks a lot - this is highly needed! I've spent way too long trying to understand why pycaffe was crashing when the network was diverging. Quite annoying when trying to do a parameter search over a large parameter space.

@PiranjaF

PiranjaF commented Jul 9, 2015

@sguada, @longjon: How would I go about stopping training more silently? With this implementation any script in pycaffe would still crash immediately. It should be possible to catch the error or perhaps set as an option that training stops without stopping everything else.

@seanbell

@PiranjaF you could change it to break from the loop instead of crash.
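A rough sketch of the "break instead of crash" idea, as a generic training loop in Python and not Caffe's solver code. The names `train` and `step_fn` are hypothetical: `step_fn(it)` stands in for one solver step that returns the scalar loss. For context, Caffe's `CHECK` macros come from glog, where a failed `CHECK` terminates the process, which would be consistent with a pycaffe script dying rather than raising a Python exception.

```python
import math


def train(step_fn, max_iter):
    """Run up to `max_iter` steps and stop early on a non-finite loss.

    Returns (iterations_completed, diverged) instead of aborting, so the
    caller decides what to do next.
    """
    for it in range(max_iter):
        loss = step_fn(it)
        if not math.isfinite(loss):
            # Leave the loop quietly; the process keeps running.
            return it, True
    return max_iter, False
```

In a parameter search, the caller could record `diverged` for the current configuration and move on to the next one rather than losing the whole run.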

@PiranjaF

@seanbell I did not think it could be done that easily. Thanks.

@shelhamer
Member

Closing since the dev branch is deprecated. Please send PRs to master.

@shelhamer shelhamer closed this Aug 26, 2015
6 participants