Unable to train ComputationGraph with SharedTrainingMaster #8765

ChrisYohann · 2020-03-08T10:35:13Z

Hello everyone, I'm facing some issues with training a ComputationGraph in Spark using the gradient sharing implementation.

I'm using Spark 2.4.4 on a YARN cluster. The executor threads stay in a TIMED_WAITING state, the model is never fit, and the weights are never updated.

I don't have this problem with the ParameterAveraging method, and I don't think it's a network issue. In fact, after attaching a debugger to a Spark executor, I think the problem is here:

https://github.com/eclipse/deeplearning4j/blob/58aa5a3a9bbd2c1cbe041c81a57182e4c5d62606/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-parameterserver/src/main/java/org/deeplearning4j/spark/parameterserver/pw/SharedTrainingWrapper.java#L105

The variable iteratorMDS is not initialized, unlike the variable iteratorDS. Later, when fitting on a MultiDataSet in the run method, the training loop looks like this:

                while((iteratorDS != null && iteratorDS.hasNext()) || (iteratorMDS != null && iteratorMDS.hasNext())) {
                    //Loop as a guard against concurrent modifications and RCs

                    if (wrapper != null) {
                        if (iteratorDS != null)
                            wrapper.fit(iteratorDS);
                        else
                            wrapper.fit(iteratorMDS);
                    } else {
                        // if wrapper is null, we're fitting standalone model then
                        if (iteratorDS != null) {
                            if (model instanceof ComputationGraph) {
                                ((ComputationGraph) originalModel).fit(iteratorDS);
                            } else if (model instanceof MultiLayerNetwork) {
                                ((MultiLayerNetwork) originalModel).fit(iteratorDS);
                            }
                        } else {
                            if (model instanceof ComputationGraph) {
                                ((ComputationGraph) originalModel).fit(iteratorMDS);
                            } else if (model instanceof MultiLayerNetwork) {
                                ((MultiLayerNetwork) originalModel).fit(iteratorMDS);
                            }
                        }
                    }

The loop condition is never met, since iteratorDS is empty and iteratorMDS is null.
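To make the failure mode concrete, here is a minimal, dependency-free sketch of the same guard. It is not the DL4J code and not the fix that was later merged: the class name `IteratorGuardSketch`, the method `countFitted`, and the use of plain `java.util.Iterator` in place of the DL4J iterator types are all assumptions made for illustration. It also consumes one batch per pass, whereas the real `fit(iterator)` call consumes the whole iterator.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class IteratorGuardSketch {

    /**
     * Hypothetical stand-in for the quoted loop: same guard, but it only
     * counts how many batches would have been passed to fit().
     */
    public static <D, M> int countFitted(Iterator<D> iteratorDS, Iterator<M> iteratorMDS) {
        int fitted = 0;
        while ((iteratorDS != null && iteratorDS.hasNext())
                || (iteratorMDS != null && iteratorMDS.hasNext())) {
            if (iteratorDS != null && iteratorDS.hasNext()) {
                iteratorDS.next();   // stands in for fit(iteratorDS)
            } else {
                iteratorMDS.next();  // stands in for fit(iteratorMDS)
            }
            fitted++;
        }
        return fitted;
    }

    public static void main(String[] args) {
        // Placeholder batches; real code would hold MultiDataSet objects.
        List<String> batches = Arrays.asList("mds-0", "mds-1", "mds-2");

        // Reported state: the DataSet iterator exists but is empty,
        // and the MultiDataSet iterator was never assigned.
        int reported = countFitted(Collections.<String>emptyIterator(), null);

        // Assumed healthy state: the MultiDataSet iterator is wired to the data.
        int wired = countFitted(null, batches.iterator());

        System.out.println("reported state, batches fitted: " + reported); // 0
        System.out.println("wired state, batches fitted: " + wired);       // 3
    }
}
```

In the reported state both halves of the condition are false on the first check, so the body never runs and no batch reaches the model, which is consistent with weights that never change.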

I use a MultiDataSet since I'm training a ComputationGraph. As expected, it works when I use a MultiLayerNetwork.

Version Information

Deeplearning4j version: 1.0.0-beta6
Platform information: Amazon EMR and macOS 10.15.3 (tested on both)


AlexDBlack · 2020-03-13T13:12:57Z

Thanks for the report - and for pointing out the location of the problem (you were right about the cause).

A fix has been implemented here: KonduitAI#318
It will be merged back to eclipse/deeplearning4j soon (next week at the latest), and will be in snapshots soon after that - https://deeplearning4j.konduit.ai/config/config-snapshots
We'll be doing a release soon (2-3 weeks hopefully) as well.

* deeplearning4j#8777 MultiLayerNetwork.evaluate(MultiDataSetIterator) overload
* deeplearning4j#8768 SameDiff.equals
* deeplearning4j#8750 shade freemarker library and switch to it in DL4J UI
* deeplearning4j#8704 DL4J UI redirect
* deeplearning4j#8776 RecordReaderDataSetIterator builder collectMetaData fix
* deeplearning4j#8718 Fix DL4J doEvaluation metadata
* deeplearning4j#8715 ArchiveUtils - Add option to not log every extracted file
* No exception for evaluations that don't support metadata
* Fixes
* deeplearning4j#8765 CompGraph+MDS fix for SharedTrainingMaster
* small fix
* Timeout
* Ignore
* Revert freemarker shading
* Ignore

Signed-off-by: Alex Black <blacka101@gmail.com>

AlexDBlack added this to the 1.0.0-beta7 milestone Mar 13, 2020

AlexDBlack self-assigned this Mar 13, 2020

AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 13, 2020

deeplearning4j#8765 CompGraph+MDS fix for SharedTrainingMaster

ff02213

Signed-off-by: Alex Black <blacka101@gmail.com>

AlexDBlack mentioned this issue Mar 13, 2020

Assorted fixes KonduitAI/deeplearning4j#318

Merged

AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 20, 2020

deeplearning4j#8765 CompGraph+MDS fix for SharedTrainingMaster

7c7b46a

Signed-off-by: Alex Black <blacka101@gmail.com>

AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 20, 2020

deeplearning4j#8765 CompGraph+MDS fix for SharedTrainingMaster

5b6cfdd

Signed-off-by: Alex Black <blacka101@gmail.com>

AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 25, 2020

deeplearning4j#8765 CompGraph+MDS fix for SharedTrainingMaster

fc08b1e

Signed-off-by: Alex Black <blacka101@gmail.com>

AlexDBlack added a commit to KonduitAI/deeplearning4j that referenced this issue Mar 26, 2020

deeplearning4j#8765 CompGraph+MDS fix for SharedTrainingMaster

0bda5e3

Signed-off-by: Alex Black <blacka101@gmail.com>

AlexDBlack closed this as completed in KonduitAI/deeplearning4j#318 Mar 26, 2020
