Unable to train ComputationGraph with SharedTrainingMaster #8765

Closed
ChrisYohann opened this issue Mar 8, 2020 · 1 comment · Fixed by KonduitAI/deeplearning4j#318

@ChrisYohann
Hello everyone, I'm having trouble training a ComputationGraph on Spark with the gradient sharing implementation.

I'm using Spark 2.4.4 on a YARN cluster. The executor threads stay in the TIMED_WAITING state, the model is never fit, and the weights are never updated.

I don't have this problem with the ParameterAveraging method, and I don't think it's a network issue. After attaching a debugger to a Spark executor, I think the problem is here:

https://github.com/eclipse/deeplearning4j/blob/58aa5a3a9bbd2c1cbe041c81a57182e4c5d62606/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-parameterserver/src/main/java/org/deeplearning4j/spark/parameterserver/pw/SharedTrainingWrapper.java#L105

The variable iteratorMDS is not initialized, unlike the variable iteratorDS. Later, when fitting on a MultiDataSet, the run method does this:

    while ((iteratorDS != null && iteratorDS.hasNext()) || (iteratorMDS != null && iteratorMDS.hasNext())) {
        // Loop as a guard against concurrent modifications and RCs

        if (wrapper != null) {
            if (iteratorDS != null)
                wrapper.fit(iteratorDS);
            else
                wrapper.fit(iteratorMDS);
        } else {
            // if wrapper is null, we're fitting a standalone model
            if (iteratorDS != null) {
                if (model instanceof ComputationGraph) {
                    ((ComputationGraph) originalModel).fit(iteratorDS);
                } else if (model instanceof MultiLayerNetwork) {
                    ((MultiLayerNetwork) originalModel).fit(iteratorDS);
                }
            } else {
                if (model instanceof ComputationGraph) {
                    ((ComputationGraph) originalModel).fit(iteratorMDS);
                } else if (model instanceof MultiLayerNetwork) {
                    ((MultiLayerNetwork) originalModel).fit(iteratorMDS);
                }
            }
        }
    }

The loop condition is never met, since iteratorDS is empty and iteratorMDS is null.
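To make the failure mode concrete, here is a minimal sketch that reproduces only the boolean shape of the while condition quoted above, using plain java.util iterators instead of DL4J types. The class name LoopGuardSketch, the helper shouldEnterLoop, and the string "batches" are hypothetical and exist only for illustration; this is not the DL4J code or its fix.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical stand-in for the loop guard; no DL4J classes are involved.
class LoopGuardSketch {

    // Same boolean shape as the while condition in the quoted run method.
    static boolean shouldEnterLoop(Iterator<?> iteratorDS, Iterator<?> iteratorMDS) {
        return (iteratorDS != null && iteratorDS.hasNext())
                || (iteratorMDS != null && iteratorMDS.hasNext());
    }

    public static void main(String[] args) {
        // Situation described in the report: DataSet iterator is empty,
        // MultiDataSet iterator was never assigned.
        Iterator<String> emptyDS = Collections.emptyIterator();
        System.out.println(shouldEnterLoop(emptyDS, null)); // false: loop body is skipped

        // Assumed contrast: if the MultiDataSet iterator were initialized
        // with data, the same guard would let the loop run.
        Iterator<String> mds = Arrays.asList("mds-batch-0", "mds-batch-1").iterator();
        System.out.println(shouldEnterLoop(emptyDS, mds)); // true
    }
}
```

With an empty iteratorDS and a null iteratorMDS, both halves of the condition are false, so no fit call is ever reached, which matches the symptom of weights that are never updated.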

I use a MultiDataSet because I'm training a ComputationGraph. As expected, it works when I use a MultiLayerNetwork.

Version Information

  • Deeplearning4j version: 1.0.0-beta6
  • Platform information: Amazon EMR and macOS 10.15.3 (tested on both)
@AlexDBlack AlexDBlack added this to the 1.0.0-beta7 milestone Mar 13, 2020
@AlexDBlack AlexDBlack self-assigned this Mar 13, 2020
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 13, 2020
Signed-off-by: Alex Black <blacka101@gmail.com>
@AlexDBlack
Contributor

Thanks for the report - and for pointing out the location of the problem (you were right about the cause).

A fix has been implemented here: KonduitAI#318
It will be merged back to eclipse/deeplearning4j soon (next week at the latest), and will be in snapshots soon after that - https://deeplearning4j.konduit.ai/config/config-snapshots
We'll be doing a release soon (2-3 weeks hopefully) as well.

AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 20, 2020
Signed-off-by: Alex Black <blacka101@gmail.com>
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 20, 2020
Signed-off-by: Alex Black <blacka101@gmail.com>
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 25, 2020
Signed-off-by: Alex Black <blacka101@gmail.com>
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 26, 2020
Signed-off-by: Alex Black <blacka101@gmail.com>
AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 26, 2020
* deeplearning4j#8777 MultiLayerNetwork.evaluate(MultiDataSetIterator) overload
* deeplearning4j#8768 SameDiff.equals
* deeplearning4j#8750 shade freemarker library and switch to it in DL4J UI
* deeplearning4j#8704 DL4J UI redirect
* deeplearning4j#8776 RecordReaderDataSetIterator builder collectMetaData fix
* deeplearning4j#8718 Fix DL4J doEvaluation metadata
* deeplearning4j#8715 ArchiveUtils - Add option to not log every extracted file
* No exception for evaluations that don't support metadata
* Fixes
* deeplearning4j#8765 CompGraph+MDS fix for SharedTrainingMaster
* small fix
* Timeout
* Ignore
* Revert freemarker shading
* Ignore

Signed-off-by: Alex Black <blacka101@gmail.com>