Unable to train ComputationGraph with SharedTrainingMaster #8765
Closed
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue on Mar 13, 2020 (Signed-off-by: Alex Black <blacka101@gmail.com>)
Thanks for the report, and for pointing out the location of the problem (you were right about the cause). A fix has been implemented here: KonduitAI#318
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue on Mar 20, 2020 (Signed-off-by: Alex Black <blacka101@gmail.com>)
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue on Mar 20, 2020 (Signed-off-by: Alex Black <blacka101@gmail.com>)
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue on Mar 25, 2020 (Signed-off-by: Alex Black <blacka101@gmail.com>)
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue on Mar 26, 2020 (Signed-off-by: Alex Black <blacka101@gmail.com>)
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue on Mar 26, 2020:
* deeplearning4j#8777 MultiLayerNetwork.evaluate(MultiDataSetIterator) overload
* deeplearning4j#8768 SameDiff.equals
* deeplearning4j#8750 shade freemarker library and switch to it in DL4J UI
* deeplearning4j#8704 DL4J UI redirect
* deeplearning4j#8776 RecordReaderDataSetIterator builder collectMetaData fix
* deeplearning4j#8718 Fix DL4J doEvaluation metadata
* deeplearning4j#8715 ArchiveUtils - Add option to not log every extracted file
* No exception for evaluations that don't support metadata
* Fixes
* deeplearning4j#8765 CompGraph+MDS fix for SharedTrainingMaster
* small fix
* Timeout
* Ignore
* Revert freemarker shading
* Ignore
Each item is signed off with: Signed-off-by: Alex Black <blacka101@gmail.com>
Hello everyone, I'm having trouble training a `ComputationGraph` on Spark with the gradient sharing implementation.
I'm using Spark 2.4.4 on a YARN cluster. The executor threads stay in a TIMED_WAITING state, the model is never fit, and the weights are never updated.
I don't have this problem with the ParameterAveraging method, and I don't think it's a network issue. In fact, after attaching a debugger to a Spark executor, I think the problem is here:
https://github.com/eclipse/deeplearning4j/blob/58aa5a3a9bbd2c1cbe041c81a57182e4c5d62606/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-parameterserver/src/main/java/org/deeplearning4j/spark/parameterserver/pw/SharedTrainingWrapper.java#L105
The variable `iteratorMDS` is not initialized, unlike the variable `iteratorDS`. Later, when fitting on a `MultiDataSet` in the `run` method, my condition is never met, since `iteratorDS` is empty and `iteratorMDS` is null.
I use a `MultiDataSet` because I'm training a `ComputationGraph`. As expected, it works when I use a `MultiLayerNetwork`.
Version Information
1.0.0-beta6
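The failure mode described in this report can be modelled in plain Java without any DL4J dependency. The sketch below is only an illustration: `IteratorSelectionSketch`, `VirtualIterator`, `Wrapper`, and `trainOnMultiDataSets` are hypothetical names, and the selection condition inside `run` is an assumption based on the description above, not a copy of the code in `SharedTrainingWrapper`. It shows how batches queued for the multi-dataset path are silently skipped when only the single-dataset iterator was initialized.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class IteratorSelectionSketch {

    /** Hypothetical iterator that reads from a queue filled after construction. */
    static final class VirtualIterator {
        private final Queue<String> source;

        VirtualIterator(Queue<String> source) {
            this.source = source;
        }

        boolean hasNext() {
            return !source.isEmpty();
        }

        String next() {
            return source.remove();
        }
    }

    /** Hypothetical stand-in for the worker-side wrapper (not the DL4J class). */
    static final class Wrapper {
        final Queue<String> dsQueue = new ArrayDeque<>();
        final Queue<String> mdsQueue = new ArrayDeque<>();

        // Always initialized, mirroring the iteratorDS field in the report.
        final VirtualIterator iteratorDS = new VirtualIterator(dsQueue);
        // Left null unless explicitly initialized, mirroring iteratorMDS.
        final VirtualIterator iteratorMDS;

        Wrapper(boolean initializeMds) {
            this.iteratorMDS = initializeMds ? new VirtualIterator(mdsQueue) : null;
        }

        /** Returns how many batches were consumed (assumed selection logic). */
        int run() {
            VirtualIterator chosen = null;
            if (iteratorDS != null && iteratorDS.hasNext()) {
                chosen = iteratorDS;
            } else if (iteratorMDS != null && iteratorMDS.hasNext()) {
                chosen = iteratorMDS;
            }
            if (chosen == null) {
                return 0; // neither condition met: nothing is ever fit
            }
            int consumed = 0;
            while (chosen.hasNext()) {
                chosen.next();
                consumed++;
            }
            return consumed;
        }
    }

    /** Queues the given number of multi-dataset batches and runs the wrapper. */
    static int trainOnMultiDataSets(boolean initializeMds, int batches) {
        Wrapper wrapper = new Wrapper(initializeMds);
        for (int i = 0; i < batches; i++) {
            wrapper.mdsQueue.add("mds-batch-" + i);
        }
        return wrapper.run();
    }

    public static void main(String[] args) {
        System.out.println("iteratorMDS null:        " + trainOnMultiDataSets(false, 3));
        System.out.println("iteratorMDS initialized: " + trainOnMultiDataSets(true, 3));
    }
}
```

With `iteratorMDS` left null the sketch consumes 0 of the 3 queued batches; with it initialized alongside `iteratorDS` all 3 are consumed. That is consistent with training working for a `MultiLayerNetwork` (single-dataset path) but not for a `ComputationGraph` fed with `MultiDataSet`.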