
Add extension to kill training when NaN or Inf is detected (FailOnNonNumber) #4545

Merged
merged 14 commits into chainer:master on Apr 10, 2018

Conversation

@rezoo (Member) commented Mar 29, 2018

Saving Earth's resources is especially important for human beings. By reducing the unnecessary computation produced by large job queues, we can use computational resources more effectively and help prevent global warming. This PR aims to reduce unnecessary computation by raising an exception when the parameters in the optimizer contain NaN.
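For illustration, a minimal sketch of the behaviour the PR describes (hypothetical code under assumed names; it may differ from the actual implementation, and it assumes a standard Trainer/Updater setup):

from chainer.training import extension


class NaNKiller(extension.Extension):

    """Abort training once any optimizer parameter becomes NaN (sketch)."""

    def __call__(self, trainer):
        for name, optimizer in trainer.updater.get_all_optimizers().items():
            for param in optimizer.target.params():
                if param.data is None:
                    continue  # skip uninitialized parameters
                xp = param.xp  # numpy or cupy, depending on the device
                if xp.isnan(param.data).any():
                    raise RuntimeError(
                        'NaN detected in the parameters of optimizer '
                        '{!r}'.format(name))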

@kmaehashi (Member) commented

Thanks for the PR!
I think it would be better to implement this feature as a Trigger rather than an Extension (like EarlyStoppingTrigger).
What do you think?

@rezoo (Member, Author) commented Apr 3, 2018

I think an extension would be better because:

  1. It is a common case that both EarlyStoppingTrigger and NaNKiller are used together, and the current Chainer doesn't support such a combination.
  2. The only thing to do when NaN occurs is to die; nobody wants to take snapshots that contain NaN or reduce the learning rate.

I would rather hear the specific advantages of implementing a NaNKillerTrigger.

@beam2d (Member) commented Apr 3, 2018

I agree with @rezoo that the killer feature should be provided as an extension rather than a stopping trigger.

I still think it would be good to provide the NaN detection as a trigger, though. NaN detection may be used in other ways, e.g. combined with another extension, say UnwindTraining, that rewinds the training to the last snapshot. Using this, once NaN is detected, one can automatically retry the recent updates with fresh randomness that may avoid the NaN (I have read some papers that use this kind of trick to cope with unstable training). In that case, we can provide a Killer extension that can be used with the NaNDetector trigger. It would still be good to provide a NaNKiller extension that combines them for convenience.
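(To make the split concrete, a purely hypothetical sketch; neither NaNDetectorTrigger nor Killer exists in Chainer, and the names are only illustrative:)

from chainer.training import extension


class NaNDetectorTrigger(object):

    """Fires when any optimizer parameter contains NaN (sketch)."""

    def __call__(self, trainer):
        for optimizer in trainer.updater.get_all_optimizers().values():
            for param in optimizer.target.params():
                if param.data is not None and param.xp.isnan(param.data).any():
                    return True
        return False


class Killer(extension.Extension):

    """Simply aborts the training loop when invoked (sketch)."""

    def __call__(self, trainer):
        raise RuntimeError('training aborted: trigger fired')


# The convenience combination would then be roughly:
#     trainer.extend(Killer(), trigger=NaNDetectorTrigger())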

@kmaehashi (Member) commented

Thank you! I understand the intention. Let's go with an Extension.
Maybe we can later refactor this so it can be used as a trigger, or even within a custom training loop, once the discussion in #3013 has settled.

@kmaehashi (Member) left a review:

LGTM except for comments.
Could you resolve conflicts?

from chainer.training import extension


class NaNKiller(extension.Extension):
Member:

I think it would be better to name it NaNDetector or something, because it kills the training loop, not NaN itself (NaNKiller sounds like it removes NaNs).

@rezoo (Member, Author) replied Apr 4, 2018:

The name NaNDetector sounds like it detects NaN and reports it without killing the process. How about NaNProcessKiller or NaNModelKiller?

class NaNKiller(extension.Extension):
"""Trainer extension to raise RuntimeError if parameters contain NaN.

Although parameters including NaN are unnecessary for many developers,
Member:

I think the word "developers" is ambiguous. How about "in most cases" instead of "for many developers"?

self.dataset, 1, shuffle=False)

def prepare(self, device=None):
tempdir = tempfile.mkdtemp()
Member:

Please clean up the temp directory used for the tests.
How about creating it in the last step of setUp and using a tearDown method to remove it?
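A sketch of that pattern (illustrative only; the test class and attribute names here are assumptions, not the actual test code):

import shutil
import tempfile
import unittest


class TestNaNKiller(unittest.TestCase):  # hypothetical test class name

    def setUp(self):
        # ... build the model, optimizer, and dataset first ...
        self.temp_dir = tempfile.mkdtemp()  # create the temp dir last

    def tearDown(self):
        # Always removed, even when a test method fails.
        shutil.rmtree(self.temp_dir, ignore_errors=True)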

@kmaehashi (Member) commented

Thanks for the update!

I briefly googled how other frameworks handle this issue and found that TensorFlow provides the tfdbg.has_inf_or_nan API.
https://www.tensorflow.org/api_docs/python/tfdbg/has_inf_or_nan
https://www.tensorflow.org/programmers_guide/debugger#finding_nans_and_infs

Do you think it would help users to detect inf in addition to nan, in order to catch exploding parameters?
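For reference, a single finiteness check covers both cases at once; a small NumPy sketch (not the code in this PR):

import numpy as np


def contains_nan_or_inf(array):
    # numpy.isfinite is False for NaN, +inf, and -inf alike,
    # so one check detects both kinds of non-finite values.
    return not np.isfinite(array).all()


assert contains_nan_or_inf(np.array([1.0, np.nan]))
assert contains_nan_or_inf(np.array([1.0, np.inf]))
assert not contains_nan_or_inf(np.array([1.0, 2.0]))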

@kmaehashi added the cat:feature (Implementation that introduces new interfaces.) and to-be-backported (Pull request that should be backported.) labels on Apr 5, 2018
@rezoo (Member, Author) commented Apr 5, 2018

Although I have never encountered a case where the loss contains inf, including inf is fine with me.
But then we have to decide the name of this extension carefully.

@tkerola (Contributor) commented Apr 5, 2018

How about naming it BadParameterDetector and adding an action argument to decide what should be done when a nan or inf is detected? The default action could be raise, but maybe we could also have a warn option that just shows a warning that a nan or inf was detected.

@rezoo (Member, Author) commented Apr 5, 2018

I cannot imagine a case where we would want to raise only a warning, and the name BadParameter seems a little ambiguous. In that sense, NaNKiller is grammatically strange, but its meaning is easy to understand ("kill the process if NaN occurs; that's all").

@kmaehashi (Member) commented

Regarding names, we could use an abstract name like BadParam (and document what it means in the pydoc), or use NaNOrInf to make it explicit. (I personally prefer the former.)

How about the class name FailOnBadParam?

@hvy (Member) commented Apr 5, 2018

If you don't want to be explicit in the name (which I personally think is completely fine in this case), "non-numbers" is a term that covers both inf and nan. So *NonNumber(s)* is another suggestion.

@rezoo (Member, Author) commented Apr 5, 2018

I have the opposite opinion (I prefer the latter). As the Zen of Python says, explicit is better than implicit.

@beam2d (Member) commented Apr 6, 2018

We have to state the situation precisely. NaNKiller is explicit about which value is taken into account, but implicit about which variable is checked. FailOnBadParam is implicit about the former, but explicit about the latter.

I agree that explicit is better, but I also think simplicity is important, and in that sense FailOnBadParam is good. One idea I have to avoid the ambiguity is FailOnDivergedParam (I expect users to imagine NaN and inf from the word "diverged"), though it's less simple.

@rezoo (Member, Author) commented Apr 6, 2018

I think Bad is too vague (What is bad? An extension that kills the process based on accuracy? An extension that checks some property of the current state? Let's look at the documentation... ah, it just kills training when the model contains NaN). It's better to name the extension so that we can tell exactly what the instance does, because people tend to read the code before the documentation.

I agree with using FailOn, but I would argue that FailOnDivergedParam, FailOnNanInf, FailOnNanOrInf, or FailOnNonNumber is better than FailOnBadParam.

@kmaehashi (Member) commented

Thank you. I think FailOnNonNumber sounds self-explanatory and also simple enough for users.

@rezoo (Member, Author) commented Apr 9, 2018

Good. I changed the filename and added the tests.
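(Once merged under the new name, the extension would be attached like any other trainer extension; a usage sketch, assuming a trainer object has already been set up:)

from chainer.training import extensions

# Raise RuntimeError as soon as any optimizer parameter becomes NaN or Inf,
# checked each time the extension's trigger fires.
trainer.extend(extensions.FailOnNonNumber())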

@kmaehashi (Member) left a review:

LGTM except for minor comments.



class FailOnNonNumber(extension.Extension):
"""Trainer extension to raise RuntimeError if parameters contain NaN or Inf
Member:

Add . to the end of the line.

@@ -67,6 +67,7 @@ The typical use case is to use :class:`~chainer.training.extensions.Evaluator` t
chainer.training.extensions.Evaluator
chainer.training.extensions.MicroAverage

chainer.training.extensions.NaNKiller
Member:

Please update the class name.

@kmaehashi kmaehashi added this to the v5.0.0a1 milestone Apr 10, 2018
@rezoo (Member, Author) commented Apr 10, 2018

Fixed

@kmaehashi added the st:test-and-merge (State indicating that pull request is approved by a reviewer and can be merged after CI passes.) label on Apr 10, 2018
@kmaehashi kmaehashi merged commit 3e97337 into chainer:master Apr 10, 2018
@kmaehashi (Member) commented

LGTM!

@kmaehashi changed the title from "Add extensions.NaNKiller" to "Add extension to kill training when NaN or Inf is detected (FailOnNonNumber)" on Apr 10, 2018
kmaehashi added a commit to kmaehashi/chainer that referenced this pull request Apr 10, 2018
@rezoo rezoo deleted the nan-killer branch April 11, 2018 15:14