Memory Corruption? when trying to use external errors with ComputationGraph #4539

Closed
treo opened this issue Jan 23, 2018 · 8 comments

Comments

@treo
Member

treo commented Jan 23, 2018

Issue Description

I've been trying to find out why the following sometimes results in NaNs in the gradient:
https://gist.github.com/Broele/22b5f7e9bde28a8ca4b58c41ddd343e3

The example is pretty nonsensical, as in: it doesn't do anything useful other than reproduce the bug. I asked @Broele to reduce his problematic code as far as possible while still keeping the problematic behavior.

While stepping through the code I've found one weird thing:
silentOutput ignores the train flag when calling feedForward:
https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/graph/ComputationGraph.java#L1631

I found this when I realized that the inputs to each Vertex / Layer aren't reset when calling

            Gradient gradient = model.backpropGradient(error);

Most of the time when the gradient ended up with NaNs, the DenseLayer input was already corrupted as well, containing either very large numbers or NaNs. However, whatever I tried, I couldn't work out at which point that corruption happens. During the run of feedForward, neither the output nor the input is corrupted, but as soon as feedForward has finished, the input seems to get corrupted.

My best guess is that it is somehow related to Workspaces.

I could work around the problem by adding a "public API" feedForward call before running backpropGradient:

            model.setInputs(input);
            model.feedForward(true);
            Gradient gradient = model.backpropGradient(error);

This results in a valid gradient, even over many runs.
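For context, here is a minimal self-contained sketch of that workaround. The toy graph, layer sizes, and variable names below are made up purely for illustration and are not taken from the gist:

    import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.DenseLayer;
    import org.deeplearning4j.nn.gradient.Gradient;
    import org.deeplearning4j.nn.graph.ComputationGraph;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;

    public class ExternalErrorWorkaround {
        public static void main(String[] args) {
            // Toy graph: a single input feeding one dense layer named "out"
            ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
                    .graphBuilder()
                    .addInputs("in")
                    .addLayer("out", new DenseLayer.Builder().nIn(4).nOut(3)
                            .activation(Activation.IDENTITY).build(), "in")
                    .setOutputs("out")
                    .build();
            ComputationGraph model = new ComputationGraph(conf);
            model.init();

            INDArray input = Nd4j.rand(8, 4);
            INDArray externalError = Nd4j.rand(8, 3); // dL/dOutput, computed outside DL4J

            // Workaround: run the public feedForward(train=true) so every vertex holds
            // fresh activations before backpropagating the externally computed error
            model.setInputs(input);
            model.feedForward(true);
            Gradient gradient = model.backpropGradient(externalError);
        }
    }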

@raver119, do you have an idea what is going on there?

Version Information


  • Deeplearning4j version: 0.9.1
  • Platform: Windows 10
  • Backend: CPU
@raver119
Contributor

Which workspaceMode is used?

@Broele

Broele commented Jan 23, 2018

I did not explicitly set one. Could that already be the source of the problem?

@raver119
Contributor

No, it shouldn't be.

What does NAN_PANIC mode say?
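(For anyone following along: NAN_PANIC is ND4J's fail-fast profiling mode that throws as soon as an op output contains NaNs. A sketch of enabling it, assuming the 0.9.1-era profiling API:)

    import org.nd4j.linalg.api.ops.executioner.OpExecutioner;
    import org.nd4j.linalg.factory.Nd4j;

    // Every op's output is checked; execution throws on the first NaN encountered
    Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.NAN_PANIC);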

@AlexDBlack
Contributor

AlexDBlack commented Jan 23, 2018

I'd say this is very likely workspaces-related. The code is obviously from 0.9.1.
Edit: a likely workaround is to set both the training and inference workspaces to NONE (the inference WS default is SEPARATE).
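A sketch of that workaround, assuming the standard 0.9.1 configuration builder (the graph body itself is elided; layer wiring would be as in the gist):

    import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.WorkspaceMode;

    // Disable workspaces for both training and inference
    // (the inference workspace default is SEPARATE, as noted above)
    ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
            .trainingWorkspaceMode(WorkspaceMode.NONE)
            .inferenceWorkspaceMode(WorkspaceMode.NONE)
            .graphBuilder()
            // ... addInputs / addLayer / setOutputs as in the gist ...
            .build();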

I ran this on current master.

I found this when I realized that the inputs to each Vertex / Layer aren't reset when calling

This has changed on master... now they are cleared, which results in the following error:

Exception in thread "main" java.lang.IllegalStateException: Cannot do backward pass: inputs not set. Layer output (idx 4) numInputs 1
	at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doBackward(LayerVertex.java:137)
	at org.deeplearning4j.nn.graph.ComputationGraph.calcBackpropGradients(ComputationGraph.java:1744)
	at org.deeplearning4j.nn.graph.ComputationGraph.backpropGradient(ComputationGraph.java:1672)
	at org.deeplearning4j.Example.main(Example.java:83)

I'll take a look at this today.

@AlexDBlack AlexDBlack self-assigned this Jan 23, 2018
@treo
Member Author

treo commented Jan 24, 2018

According to the example on external error usage, I would guess that they shouldn't actually be cleared.

Running it with NAN_PANIC, it panics on the backpropGradient call (still on 0.9.1):

Exception in thread "main" org.nd4j.linalg.exception.ND4JIllegalStateException: P.A.N.I.C.! Op.Z() contains 15 NaN value(s): 
	at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForNaN(OpExecutionerUtil.java:71)
	at org.nd4j.linalg.api.ops.executioner.OpExecutionerUtil.checkForAny(OpExecutionerUtil.java:75)
	at org.nd4j.linalg.api.blas.impl.BaseLevel3.gemm(BaseLevel3.java:65)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.mmuli(BaseNDArray.java:3011)
	at org.nd4j.linalg.api.ndarray.BaseNDArray.mmul(BaseNDArray.java:2812)
	at org.deeplearning4j.nn.layers.BaseLayer.preOutput(BaseLayer.java:317)
	at org.deeplearning4j.nn.layers.BaseLayer.backpropGradient(BaseLayer.java:92)
	at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doBackward(LayerVertex.java:125)
	at org.deeplearning4j.nn.graph.ComputationGraph.calcBackpropGradients(ComputationGraph.java:1664)
	at org.deeplearning4j.nn.graph.ComputationGraph.backpropGradient(ComputationGraph.java:1596)
	at org.deeplearning4j.nn.graph.Example.main(Example.java:81)

So, that's another pointer to it being related to workspaces.

@Broele

Broele commented Jan 24, 2018

@AlexDBlack I set both workspaces to NONE and that seems to be a workaround. "Seems", because the error occurs randomly and I am not 100% sure that it's gone; maybe I just lowered the probability. But it looks good.

Yet another pointer to the workspaces.

@AlexDBlack
Contributor

@treo Thanks, though I'm aware there's still an issue here - #4542 and #4541

If you use SCOPE_PANIC on that code, you'll very likely run into the issue I'm working on in the links above.
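(For reference: SCOPE_PANIC is the profiling mode that validates workspace usage, analogous to NAN_PANIC above. A sketch, assuming the same profiling API:)

    // Throws when an INDArray is used outside of the workspace scope it belongs to
    Nd4j.getExecutioner().setProfilingMode(OpExecutioner.ProfilingMode.SCOPE_PANIC);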

@lock

lock bot commented Sep 23, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 23, 2018