Add extension to kill training when NaN or Inf is detected (FailOnNonNumber) #4545
Conversation
Thanks for the PR!
I think an extension would be better because: …
I would rather know the specific advantages of implementing …
I agree with @rezoo that the kill feature should be provided as an extension rather than a stopping trigger. I still think it is good to provide the NaN detection as a trigger, though. NaN detection may be used in another way, e.g. by combining it with another extension, say …
Thank you! I understand the intention. Let's go with Extensions.
LGTM except for comments.
Could you resolve conflicts?
from chainer.training import extension


class NaNKiller(extension.Extension):
I think it would be better to name it NaNDetector or something similar, because it kills the training loop, not NaN itself (NaNKiller sounds like it removes NaNs).
It seems that NaNDetector detects NaN and reports it without killing the process. How about NaNProcessKiller or NaNModelKiller?
class NaNKiller(extension.Extension):
    """Trainer extension to raise RuntimeError if parameters contain NaN.

    Although parameters including NaN are unnecessary for many developers,
I think the word "developers" is ambiguous. How about "in most cases" instead of "for many developers"?
            self.dataset, 1, shuffle=False)

    def prepare(self, device=None):
        tempdir = tempfile.mkdtemp()
Please clean up the temporary directory used for the tests. How about creating it in the last step of setUp and using the tearDown method to remove it?
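The suggested pattern could look like the following sketch. Note that `TestFailOnNonNumber` and its test body are illustrative placeholders, not the PR's actual test code; only the setUp/tearDown structure reflects the reviewer's suggestion:

```python
import os
import shutil
import tempfile
import unittest


class TestFailOnNonNumber(unittest.TestCase):

    def setUp(self):
        # Create the temporary directory as the last step of setUp,
        # as the reviewer suggests.
        self.tempdir = tempfile.mkdtemp()

    def tearDown(self):
        # tearDown runs even when a test fails, so the directory
        # is always removed and never leaks between test runs.
        shutil.rmtree(self.tempdir, ignore_errors=True)

    def test_tempdir_exists(self):
        # Hypothetical test body; the real tests would serialize a
        # trainer snapshot into self.tempdir instead.
        self.assertTrue(os.path.isdir(self.tempdir))
```

This guarantees cleanup regardless of test outcome, which `tempfile.mkdtemp()` alone does not provide.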
Thanks for the update! I briefly googled how other frameworks handle this issue and found that TensorFlow provides … Do you think it helps users to detect …?
Although I have never encountered a case where the loss contains Inf, including Inf is fine with me.
How about naming it …?
I cannot imagine a case where we would want to raise only warnings, and the name of …
Regarding names, we can use some abstract name like … How about the class name …?
If you don't want to be explicit about the name (which I personally think is completely fine in this case), "…non-numbers" is used here to address both …
I have the opposite opinion (I prefer the latter one). As shown in the Zen of Python, explicit is better than implicit.
We have to precisely explain the situation. I agree that explicit is better, but I think simplicity is also important, and in that sense, …
I think that I agree to use …
Thank you. I think …
Good. I changed the filename and added the tests.
LGTM except for minor comments.
class FailOnNonNumber(extension.Extension):
    """Trainer extension to raise RuntimeError if parameters contain NaN or Inf
Add "." to the end of the line.
docs/source/reference/training.rst (Outdated)
@@ -67,6 +67,7 @@ The typical use case is to use :class:`~chainer.training.extensions.Evaluator` t
   chainer.training.extensions.Evaluator
   chainer.training.extensions.MicroAverage
   chainer.training.extensions.NaNKiller
Please update the class name.
Fixed.
LGTM!
Add extensions.NaNKiller
Saving the Earth's resources is especially significant for human beings. By reducing the unnecessary computation produced by large job queues, we can use computational resources more effectively and help prevent global warming. This PR aims to reduce unnecessary computation by raising an exception when the parameters in the optimizer contain NaN.
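The core check the PR describes can be illustrated with a framework-free sketch. The function name `check_params` and its `(name, values)` input format are illustrative assumptions; the actual extension walks the parameters of the optimizer's target link inside the trainer loop:

```python
import math


def check_params(params):
    """Raise RuntimeError if any parameter value is NaN or Inf.

    `params` is assumed to be an iterable of (name, values) pairs,
    where `values` is a flat list of floats. The real FailOnNonNumber
    extension instead iterates over the optimizer target's parameter
    arrays on each trainer invocation.
    """
    for name, values in params:
        if not all(math.isfinite(v) for v in values):
            # Stop training early instead of wasting compute on a
            # diverged model.
            raise RuntimeError(
                'Parameter {} contains a non-number (NaN or Inf); '
                'stopping training.'.format(name))
```

Checking with `math.isfinite` covers both NaN and Inf in one test, which matches the discussion above about handling "non-numbers" rather than NaN alone.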