Memory Corruption? when trying to use external errors with ComputationGraph #4539

Closed
treo opened this issue Jan 23, 2018 · 8 comments

Comments

@treo
Member

treo commented Jan 23, 2018

Issue Description

I've been trying to find out why the following sometimes results in NaNs in the gradient:
https://gist.github.com/Broele/22b5f7e9bde28a8ca4b58c41ddd343e3

The example is pretty nonsensical, as in: it doesn't do anything useful other than reproduce the bug. I asked @Broele to reduce his problematic code as far as possible while still keeping the problematic behavior.

While stepping through the code I've found one weird thing:
silentOutput ignores the train flag when calling feedForward:
https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/graph/ComputationGraph.java#L1631

I found this when I realized that the inputs to each Vertex / Layer aren't reset when calling

            Gradient gradient = model.backpropGradient(error);

Most of the time when the gradient ended up with NaNs, the DenseLayer input was already corrupted as well, containing either very large numbers or NaNs. However, whatever I tried, I couldn't work out at which point that corruption happens. During the run of feedForward, neither the output nor the input is corrupted, but as soon as feedForward has finished, the input seems to get corrupted.

My best guess is that it is somehow related to Workspaces.

I could work around the problem by adding a "public API" feedForward call before running backpropGradient:

            model.setInputs(input);
            model.feedForward(true);
            Gradient gradient = model.backpropGradient(error);

This results in a valid gradient, even over many runs.
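For context, here is a minimal self-contained sketch of that workaround. The toy graph, layer sizes, and variable names below are made up purely for illustration and are not taken from the gist:

    import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.DenseLayer;
    import org.deeplearning4j.nn.gradient.Gradient;
    import org.deeplearning4j.nn.graph.ComputationGraph;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class ExternalErrorWorkaround {
        public static void main(String[] args) {
            // Toy graph: a single input feeding one dense layer named "out"
            ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
                    .graphBuilder()
                    .addInputs("in")
                    .addLayer("out", new DenseLayer.Builder().nIn(4).nOut(3)
                            .activation(Activation.IDENTITY).build(), "in")
                    .setOutputs("out")
                    .build();
            ComputationGraph model = new ComputationGraph(conf);
            model.init();

            INDArray input = Nd4j.rand(8, 4);
            INDArray externalError = Nd4j.rand(8, 3); // dL/dOutput, computed outside DL4J

            // Workaround: run the public feedForward(train=true) so every vertex holds
            // fresh activations before backpropagating the externally computed error
            model.setInputs(input);
            model.feedForward(true);
            Gradient gradient = model.backpropGradient(externalError);
        }
    }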

@raver119, do you have an idea what is going on there?

Version Information


  • Deeplearning4j version: 0.9.1
  • Platform: Windows 10
  • Backend: CPU
@raver119
Contributor

Which workspaceMode is used?

@Broele

Broele commented Jan 23, 2018

I did not explicitly set one. Could that already be the source of the problem?

@raver119
Contributor

No, it shouldn't be.

What does NAN_PANIC mode say?
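(For anyone following along: NAN_PANIC is ND4J's fail-fast profiling mode that throws as soon as an op output contains NaNs. A sketch of enabling it, assuming the 0.9.1-era profiling API:)

    import org.nd4j.linalg.api.ops.executioner.OpExecutioner;
    import org.nd4j.linalg.factory.Nd4j;

    // Every op's output is checked; execution throws on the first NaN encountered
    Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC);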

@AlexDBlack
Contributor

AlexDBlack commented Jan 23, 2018

I'd say this is very likely workspaces-related. The code is obviously from 0.9.1.
Edit: a likely workaround is to set both the training and inference workspaces to NONE (the inference WS default is SEPARATE).
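A sketch of that workaround, assuming the standard 0.9.1 configuration builder (the graph body itself is elided; layer wiring would be as in the gist):

    import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.WorkspaceMode;

    // Disable workspaces for both training and inference
    // (the inference workspace default is SEPARATE, as noted above)
    ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
            .trainingWorkspaceMode(WorkspaceMode.NONE)
            .inferenceWorkspaceMode(WorkspaceMode.NONE)
            .graphBuilder()
            // ... addInputs / addLayer / setOutputs as in the gist ...
            .build();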

I ran this on current master.

I found this when I realized that the inputs to each Vertex / Layer aren't reset when calling

This has changed on master... now they are cleared, which results in the following error:

Exception in thread "main" java.lang.IllegalStateException: Cannot do backward pass: inputs not set. Layer output (idx 4) numInputs 1
	at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doBackward(LayerVertex.java:137)
	at org.deeplearning4j.nn.graph.ComputationGraph.calcBackpropGradients(ComputationGraph.java:1744)
	at org.deeplearning4j.nn.graph.ComputationGraph.backpropGradient(ComputationGraph.java:1672)
	at org.deeplearning4j.Example.main(Example.java:83)

I'll take a look at this today.

@AlexDBlack AlexDBlack self-assigned this Jan 23, 2018
@treo
Member Author

treo commented Jan 24, 2018

According to the example on external error usage, I would guess that they shouldn't actually be cleared.

Running it with NAN_PANIC, it panics on the backpropGradient call (still on 0.9.1):

Exception in thread "main" org.nd4j.linalg.exception.ND4JIllegalStateException: P.A.N.I.C.! Op.Z() contains 15 NaN value(s): 
	at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:71)
	at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForAny(OpExecutionerUtil.java:75)
	at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm(BaseLevel3.java:65)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli(BaseNDArray.java:3011)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.mmul(BaseNDArray.java:2812)
	at org.deeplearning4j.nn.layers.BaseLayer.preOutput(BaseLayer.java:317)
	at org.deeplearning4j.nn.layers.BaseLayer.backpropGradient(BaseLayer.java:92)
	at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doBackward(LayerVertex.java:125)
	at org.deeplearning4j.nn.graph.ComputationGraph.calcBackpropGradients(ComputationGraph.java:1664)
	at org.deeplearning4j.nn.graph.ComputationGraph.backpropGradient(ComputationGraph.java:1596)
	at org.deeplearning4j.nn.graph.Example.main(Example.java:81)

So, that's another pointer to it being related to workspaces.

@Broele

Broele commented Jan 24, 2018

@AlexDBlack I set both workspaces to NONE and that seems to be a workaround. "Seems", because the error occurs randomly and I am not 100% sure that it's gone; maybe I just lowered the probability. But it looks good.

Yet another pointer to the workspaces.

@AlexDBlack
Contributor

@treo Thanks, though I'm aware there's still an issue here - #4542 and #4541

If you use SCOPE_PANIC on that code, you'll very likely run into the issue I'm working on in the links above.
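(For reference: SCOPE_PANIC is the profiling mode that validates workspace usage, analogous to NAN_PANIC above. A sketch, assuming the same profiling API:)

    // Throws when an INDArray is used outside of the workspace scope it belongs to
    Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.SCOPE_PANIC);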

@lock

lock bot commented Sep 23, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 23, 2018