DL4J/Arbiter - cuDNN exception during evaluation #4659

Closed
AlexDBlack opened this Issue Feb 15, 2018 · 4 comments

AlexDBlack commented Feb 15, 2018

Job: Arbiter training (multi-GPU), Linux, CUDA 9 + cuDNN 7.
Not sure whether this is a DL4J issue or a threading issue in Arbiter (the latter is quite possible).

o.d.o.l.PerformanceListener - Device: [4]; ETL: 0 ms; iteration 6250; iteration time: 201 ms; samples/sec: 318.408; batches/sec: 4.975;
o.d.o.l.PerformanceListener - Device: [3]; ETL: 0 ms; iteration 6300; iteration time: 218 ms; samples/sec: 293.578; batches/sec: 4.587;
o.d.o.l.PerformanceListener - Device: [2]; ETL: 0 ms; iteration 6250; iteration time: 210 ms; samples/sec: 304.762; batches/sec: 4.762;
o.d.o.l.PerformanceListener - Device: [4]; ETL: 0 ms; iteration 6300; iteration time: 212 ms; samples/sec: 301.887; batches/sec: 4.717;
o.d.o.l.PerformanceListener - Device: [3]; ETL: 0 ms; iteration 6350; iteration time: 214 ms; samples/sec: 299.065; batches/sec: 4.673;
o.d.o.l.PerformanceListener - Device: [2]; ETL: 0 ms; iteration 6300; iteration time: 409 ms; samples/sec: 156.479; batches/sec: 2.445;
o.d.o.l.PerformanceListener - Device: [4]; ETL: 0 ms; iteration 6350; iteration time: 228 ms; samples/sec: 280.702; batches/sec: 4.386;
o.d.o.l.PerformanceListener - Device: [2]; ETL: 0 ms; iteration 6350; iteration time: 210 ms; samples/sec: 304.762; batches/sec: 4.762;
i.s.o.ArbiterLocal - Starting evaluation for candidate 3
i.s.o.TrainLocal - --- Starting evaluation ---
Exception in thread "main" java.lang.RuntimeException: cuDNN status = 8: CUDNN_STATUS_EXECUTION_FAILED
        at org.deeplearning4j.nn.layers.BaseCudnnHelper.checkCudnn(BaseCudnnHelper.java:46)
        at org.deeplearning4j.nn.layers.recurrent.CudnnLSTMHelper.activate(CudnnLSTMHelper.java:450)
        at org.deeplearning4j.nn.layers.recurrent.LSTMHelpers.activateHelper(LSTMHelpers.java:182)
        at org.deeplearning4j.nn.layers.recurrent.LSTM.activateHelper(LSTM.java:183)
        at org.deeplearning4j.nn.layers.recurrent.LSTM.activate(LSTM.java:158)
        at org.deeplearning4j.nn.layers.recurrent.BidirectionalLayer.activate(BidirectionalLayer.java:198)
        at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doForward(LayerVertex.java:106)
        at org.deeplearning4j.nn.graph.ComputationGraph.feedForward(ComputationGraph.java:1665)
        at org.deeplearning4j.nn.graph.ComputationGraph.feedForward(ComputationGraph.java:1551)
        at org.deeplearning4j.nn.graph.ComputationGraph.silentOutput(ComputationGraph.java:1820)
        at org.deeplearning4j.nn.graph.ComputationGraph.doEvaluation(ComputationGraph.java:3450)
        at org.deeplearning4j.nn.graph.ComputationGraph.doEvaluation(ComputationGraph.java:3390)
        at io.skymind.orange.TrainLocal.evaluate(TrainLocal.java:121)
        at io.skymind.orange.ArbiterLocal$EvalListener.onCandidateStatusChange(ArbiterLocal.java:204)
        at org.deeplearning4j.arbiter.optimize.runner.BaseOptimizationRunner.processReturnedTask(BaseOptimizationRunner.java:229)
        at org.deeplearning4j.arbiter.optimize.runner.BaseOptimizationRunner.execute(BaseOptimizationRunner.java:131)
        at io.skymind.orange.ArbiterLocal.entryPoint(ArbiterLocal.java:156)
        at io.skymind.orange.ArbiterLocal.main(ArbiterLocal.java:74)
saudet commented Feb 15, 2018

AlexDBlack commented Feb 15, 2018

Yeah, I've confirmed that training works fine... I suspect what is happening here (yet to confirm) is that we're passing the network back from the training thread to the main thread for evaluation. If so, I just need to make sure that only the training thread actually touches the instantiated network...

AlexDBlack commented Feb 16, 2018

Fixed here: deeplearning4j/Arbiter#138
It was simply a threading issue: the network was being passed between threads (trained in one thread, with the listener/evaluation invoked in another).
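
For anyone hitting the same exception, here is a minimal sketch of the general idea (keep every operation on the network on one thread). It is not the actual change in deeplearning4j/Arbiter#138, and it uses only java.util.concurrent. The ThreadConfined class and its call method are hypothetical names invented for this illustration, and M stands in for whatever network type is being trained.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical helper: confines an object of type M to one dedicated worker thread. */
final class ThreadConfined<M> implements AutoCloseable {
    // A single-thread executor guarantees every submitted task runs on the same thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private M model; // only read or written from the worker thread

    ThreadConfined(Supplier<M> factory) {
        // Build the model on the worker thread so no other thread ever touches it.
        run(() -> { model = factory.get(); return null; });
    }

    /** Runs a task (e.g. training or evaluation) against the model on the worker thread. */
    <R> R call(Function<M, R> task) {
        return run(() -> task.apply(model));
    }

    // Submits the job to the worker and blocks the caller until it finishes.
    private <R> R run(Callable<R> job) {
        try {
            return worker.submit(job).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the worker", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Task failed on the worker thread", e.getCause());
        }
    }

    @Override
    public void close() {
        worker.shutdown();
    }
}
```

With something like this, a listener called from the main thread would go through call(...) to run the evaluation instead of using the network directly, so training and evaluation both execute on the thread that created the network. The task should return plain results (a score, a string) rather than the model itself, otherwise the model escapes to the calling thread again.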

lock bot commented Sep 23, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Sep 23, 2018
