
Add extension to kill training when NaN or Inf is detected (FailOnNonNumber) #4545

Merged
merged 14 commits into chainer:master on Apr 10, 2018

Conversation

@rezoo (Member) commented Mar 29, 2018

Saving Earth's resources is especially important for human beings. By reducing the unnecessary computation produced by large job queues, we can use computational resources more effectively and help prevent global warming. This PR aims to reduce unnecessary computation by raising an exception when the parameters in the optimizer contain NaN.
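For illustration, a minimal sketch of the behaviour the PR describes (hypothetical code under assumed names; it may differ from the actual implementation, and it assumes a standard Trainer/Updater setup):

from chainer.training import extension


class NaNKiller(extension.Extension):

    """Abort training once any optimizer parameter becomes NaN (sketch)."""

    def __call__(self, trainer):
        for name, optimizer in trainer.updater.get_all_optimizers().items():
            for param in optimizer.target.params():
                if param.data is None:
                    continue  # skip uninitialized parameters
                xp = param.xp  # numpy or cupy, depending on the device
                if xp.isnan(param.data).any():
                    raise RuntimeError(
                        'NaN detected in the parameters of optimizer '
                        '{!r}'.format(name))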

@kmaehashi (Member) commented

Thanks for the PR!
I think it would be better to implement this feature as a Trigger rather than an Extension (like EarlyStoppingTrigger).
What do you think?

@rezoo (Member, Author) commented Apr 3, 2018

I think an extension would be better because:

  1. It is a common case that both EarlyStoppingTrigger and NaNKiller are used together, and the current Chainer doesn't support such a combination.
  2. The only thing to do when NaN occurs is to die; nobody wants to take snapshots that contain NaN or reduce the learning rate.

I would rather hear the specific advantages of implementing a NaNKillerTrigger.

@beam2d (Member) commented Apr 3, 2018

I agree with @rezoo that the killer feature should be provided as an extension rather than a stopping trigger.

I still think it would be good to provide the NaN detection as a trigger, though. NaN detection may be used in other ways, e.g. combined with another extension, say UnwindTraining, that rewinds the training to the last snapshot. Using this, once NaN is detected, one can automatically retry the recent updates with fresh randomness that may avoid the NaN (I have read some papers that use this kind of trick to cope with unstable training). In that case, we can provide a Killer extension that can be used with the NaNDetector trigger. It would still be good to provide a NaNKiller extension that combines them for convenience.
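(To make the split concrete, a purely hypothetical sketch; neither NaNDetectorTrigger nor Killer exists in Chainer, and the names are only illustrative:)

from chainer.training import extension


class NaNDetectorTrigger(object):

    """Fires when any optimizer parameter contains NaN (sketch)."""

    def __call__(self, trainer):
        for optimizer in trainer.updater.get_all_optimizers().values():
            for param in optimizer.target.params():
                if param.data is not None and param.xp.isnan(param.data).any():
                    return True
        return False


class Killer(extension.Extension):

    """Simply aborts the training loop when invoked (sketch)."""

    def __call__(self, trainer):
        raise RuntimeError('training aborted: trigger fired')


# The convenience combination would then be roughly:
#     trainer.extend(Killer(), trigger=NaNDetectorTrigger())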

@kmaehashi (Member) commented

Thank you! I understand the intention. Let's go with an Extension.
Maybe we can later refactor this so it can be used as a trigger, or even within a custom training loop, once the discussion in #3013 has settled.

@kmaehashi (Member) left a review:

LGTM except for comments.
Could you resolve conflicts?

from chainer.training import extension


class NaNKiller(extension.Extension):
Member:

I think it would be better to name it NaNDetector or something, because it kills the training loop, not NaN itself (NaNKiller sounds like it removes NaNs).

@rezoo (Member, Author) replied Apr 4, 2018:

The name NaNDetector sounds like it detects NaN and reports it without killing the process. How about NaNProcessKiller or NaNModelKiller?

class NaNKiller(extension.Extension):
"""Trainer extension to raise RuntimeError if parameters contain NaN.

Although parameters including NaN are unnecessary for many developers,
Member:

I think the word "developers" is ambiguous. How about "in most cases" instead of "for many developers"?

self.dataset, 1, shuffle=False)

def prepare(self, device=None):
tempdir = tempfile.mkdtemp()
Member:

Please clean up the temp directory used for the tests.
How about creating it in the last step of setUp and using a tearDown method to remove it?
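A sketch of that pattern (illustrative only; the test class and attribute names here are assumptions, not the actual test code):

import shutil
import tempfile
import unittest


class TestNaNKiller(unittest.TestCase):  # hypothetical test class name

    def setUp(self):
        # ... build the model, optimizer, and dataset first ...
        self.temp_dir = tempfile.mkdtemp()  # create the temp dir last

    def tearDown(self):
        # Always removed, even when a test method fails.
        shutil.rmtree(self.temp_dir, ignore_errors=True)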

@kmaehashi (Member) commented

Thanks for the update!

I briefly googled how other frameworks handle this issue and found that TensorFlow provides the tfdbg.has_inf_or_nan API.
https://www.tensorflow.org/api_docs/python/tfdbg/has_inf_or_nan
https://www.tensorflow.org/programmers_guide/debugger#finding_nans_and_infs

Do you think it would help users to detect inf in addition to nan, in order to catch exploding parameters?
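For reference, a single finiteness check covers both cases at once; a small NumPy sketch (not the code in this PR):

import numpy as np


def contains_nan_or_inf(array):
    # numpy.isfinite is False for NaN, +inf, and -inf alike,
    # so one check detects both kinds of non-finite values.
    return not np.isfinite(array).all()


assert contains_nan_or_inf(np.array([1.0, np.nan]))
assert contains_nan_or_inf(np.array([1.0, np.inf]))
assert not contains_nan_or_inf(np.array([1.0, 2.0]))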

@kmaehashi added the cat:feature (Implementation that introduces new interfaces.) and to-be-backported (Pull request that should be backported.) labels on Apr 5, 2018
@rezoo (Member, Author) commented Apr 5, 2018

Although I have never encountered a case where the loss contains inf, including inf is fine with me.
But then we have to decide the name of this extension carefully.

@tkerola (Contributor) commented Apr 5, 2018

How about naming it BadParameterDetector and adding an action argument to decide what should be done when a nan or inf is detected? The default action could be raise, but maybe we could also have a warn option that just shows a warning that a nan or inf was detected.

@rezoo (Member, Author) commented Apr 5, 2018

I cannot imagine a case where we would want to raise only a warning, and the name BadParameter seems a little ambiguous. In that sense, NaNKiller is grammatically strange, but its meaning is easy to understand ("kill the process if NaN occurs; that's all").

@kmaehashi (Member) commented

Regarding names, we could use an abstract name like BadParam (and document what it means in the pydoc), or use NaNOrInf to make it explicit. (I personally prefer the former.)

How about the class name FailOnBadParam?

@hvy (Member) commented Apr 5, 2018

If you don't want to be explicit in the name (which I personally think is completely fine in this case), "non-numbers" is a term that covers both inf and nan. So *NonNumber(s)* is another suggestion.

@rezoo (Member, Author) commented Apr 5, 2018

I have the opposite opinion (I prefer the latter). As the Zen of Python says, explicit is better than implicit.

@beam2d (Member) commented Apr 6, 2018

We have to state the situation precisely. NaNKiller is explicit about which value is taken into account, but implicit about which variable is checked. FailOnBadParam is implicit about the former, but explicit about the latter.

I agree that explicit is better, but I also think simplicity is important, and in that sense FailOnBadParam is good. One idea I have to avoid the ambiguity is FailOnDivergedParam (I expect users to imagine NaN and inf from the word "diverged"), though it's less simple.

@rezoo (Member, Author) commented Apr 6, 2018

I think Bad is too vague (What is bad? An extension that kills the process based on accuracy? An extension that checks some property of the current state? Let's look at the documentation... ah, it just kills training when the model contains NaN). It's better to name the extension so that we can tell exactly what the instance does, because people tend to read the code before the documentation.

I agree with using FailOn, but I would argue that FailOnDivergedParam, FailOnNanInf, FailOnNanOrInf, or FailOnNonNumber is better than FailOnBadParam.

@kmaehashi (Member) commented

Thank you. I think FailOnNonNumber sounds self-explanatory and also simple enough for users.

@rezoo (Member, Author) commented Apr 9, 2018

Good. I changed the filename and added the tests.
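(Once merged under the new name, the extension would be attached like any other trainer extension; a usage sketch, assuming a trainer object has already been set up:)

from chainer.training import extensions

# Raise RuntimeError as soon as any optimizer parameter becomes NaN or Inf,
# checked each time the extension's trigger fires.
trainer.extend(extensions.FailOnNonNumber())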

@kmaehashi (Member) left a review:

LGTM except for minor comments.



class FailOnNonNumber(extension.Extension):
"""Trainer extension to raise RuntimeError if parameters contain NaN or Inf
Member:

Add . to the end of the line.

@@ -67,6 +67,7 @@ The typical use case is to use :class:`~chainer.training.extensions.Evaluator` t
chainer.training.extensions.Evaluator
chainer.training.extensions.MicroAverage

chainer.training.extensions.NaNKiller
Member:

Please update the class name.

@kmaehashi kmaehashi added this to the v5.0.0a1 milestone Apr 10, 2018
@rezoo (Member, Author) commented Apr 10, 2018

Fixed

@kmaehashi added the st:test-and-merge (State indicating that pull request is approved by a reviewer and can be merged after CI passes.) label on Apr 10, 2018
@kmaehashi kmaehashi merged commit 3e97337 into chainer:master Apr 10, 2018
@kmaehashi (Member) commented

LGTM!

@kmaehashi changed the title from "Add extensions.NaNKiller" to "Add extension to kill training when NaN or Inf is detected (FailOnNonNumber)" on Apr 10, 2018
kmaehashi added a commit to kmaehashi/chainer that referenced this pull request Apr 10, 2018
@rezoo rezoo deleted the nan-killer branch April 11, 2018 15:14