
Distributed training using parameter averaging training master gives java.lang.IllegalStateException: Cannot perform backprop: Dropout mask array is absent (already cleared?) in beta3 #6841

Closed
kjun9 opened this issue Dec 12, 2018 · 5 comments

@kjun9

commented Dec 12, 2018

The following example used to work in beta, but it fails after upgrading to beta3.

package com.kjun9.example

import org.apache.spark.sql.SparkSession
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
import org.deeplearning4j.nn.graph.ComputationGraph
import org.deeplearning4j.nn.weights.WeightInit
import org.deeplearning4j.spark.impl.graph.SparkComputationGraph
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.dataset.{MultiDataSet, api}
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.learning.config.Adam
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction

import scala.util.Random

object SparkLearner {

  val spark = SparkSession.builder().master("local[*]").getOrCreate

  def main(args: Array[String]): Unit = {

    val batchSize = 5
    val featSize = 10
    val numBatches = 1000
    val numEpochs = 5
    val random = new Random(0)

    // dummy training data: one input feature matrix and one label column per MultiDataSet
    val trainingData: Seq[api.MultiDataSet] = (0 until numBatches).map { _ =>
      val features =
        Array(Nd4j.create(Array.fill(batchSize * featSize)(random.nextDouble), Array(batchSize, featSize)))
      val labels =
        Array(Nd4j.create(Array.fill(batchSize)(random.nextDouble), Array(batchSize, 1)))
      new MultiDataSet(features, labels)
    }
    val trainingRdd = spark.sparkContext.parallelize(trainingData)

    // simple model
    val modelConf = new NeuralNetConfiguration.Builder()
      .updater(new Adam(0.01))
      .weightInit(WeightInit.XAVIER_UNIFORM)
      .biasInit(0)
      .graphBuilder()
      .addInputs("input")
      .addLayer(
        "dense",
        new DenseLayer.Builder()
          .nIn(10)
          .nOut(10)
          .activation(Activation.RELU)
          .hasBias(true)
          .dropOut(0.9)
          .build,
        "input"
      )
      .addLayer(
        "output",
        new OutputLayer.Builder()
          .nIn(10)
          .nOut(1)
          .lossFunction(LossFunction.XENT)
          .activation(Activation.SIGMOID)
          .hasBias(false)
          .build,
        "dense"
      )
      .setOutputs("output")
      .build()

    val model = new ComputationGraph(modelConf)
    model.init()

    // training without spark works
//    (0 until numEpochs).foreach(_ =>
//      (0 until numBatches).foreach(i => model.fit(trainingData(i)))
//    )

    val trainingMaster = new ParameterAveragingTrainingMaster.Builder(batchSize).build
    val sparkModel = new SparkComputationGraph(spark.sparkContext, model, trainingMaster)
    (0 until numEpochs).foreach(_ => sparkModel.fitMultiDataSet(trainingRdd))
  }

}

Here's the error I get:

18/12/12 16:13:20 ERROR Executor: Exception in task 1.0 in stage 3.0 (TID 25)
java.lang.IllegalStateException: Cannot perform backprop: Dropout mask array is absent (already cleared?)
        at org.nd4j.base.Preconditions.throwStateEx(Preconditions.java:641)
        at org.nd4j.base.Preconditions.checkState(Preconditions.java:268)
        at org.deeplearning4j.nn.conf.dropout.Dropout.backprop(Dropout.java:154)
        at org.deeplearning4j.nn.layers.AbstractLayer.backpropDropOutIfPresent(AbstractLayer.java:309)
        at org.deeplearning4j.nn.layers.BaseLayer.backpropGradient(BaseLayer.java:106)
        at org.deeplearning4j.nn.graph.vertex.impl.LayerVertex.doBackward(LayerVertex.java:148)
        at org.deeplearning4j.nn.graph.ComputationGraph.calcBackpropGradients(ComputationGraph.java:2621)
        at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1369)
        at org.deeplearning4j.nn.graph.ComputationGraph.computeGradientAndScore(ComputationGraph.java:1329)
        at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(BaseOptimizer.java:160)
        at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(StochasticGradientDescent.java:63)
        at org.deeplearning4j.optimize.Solver.optimize(Solver.java:52)
        at org.deeplearning4j.nn.graph.ComputationGraph.fitHelper(ComputationGraph.java:1149)
        at org.deeplearning4j.nn.graph.ComputationGraph.fit(ComputationGraph.java:1098)
        at org.deeplearning4j.nn.graph.ComputationGraph.fit(ComputationGraph.java:1006)
        at org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingWorker.processMinibatch(ParameterAveragingTrainingWorker.java:232)
        at org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingWorker.processMinibatch(ParameterAveragingTrainingWorker.java:57)
        at org.deeplearning4j.spark.api.worker.ExecuteWorkerMultiDataSetFlatMapAdapter.call(ExecuteWorkerMultiDataSetFlatMap.java:124)
        at org.deeplearning4j.spark.api.worker.ExecuteWorkerMultiDataSetFlatMapAdapter.call(ExecuteWorkerMultiDataSetFlatMap.java:56)
        at org.deeplearning4j.spark.api.worker.ExecuteWorkerPathMDSFlatMapAdapter.call(ExecuteWorkerPathMDSFlatMap.java:95)
        at org.deeplearning4j.spark.api.worker.ExecuteWorkerPathMDSFlatMapAdapter.call(ExecuteWorkerPathMDSFlatMap.java:57)
        at org.datavec.spark.transform.BaseFlatMapFunctionAdaptee.call(BaseFlatMapFunctionAdaptee.java:40)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$4$1.apply(JavaRDDLike.scala:153)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

This seems to be the same as, or related to, the error in #6326 and #6756, but I didn't see any mention of Spark in those previous issues.

@lichaoir

commented Dec 12, 2018

I have tried the alternative approach described in #6756, but it didn't work. When I retrieved the layers from the computation graph, they were null.

@kjun9

Author

commented Dec 12, 2018

Some more info that might be relevant:

  • I'm building the example with sbt-assembly into a fat jar that excludes Spark (a sketch of such a build is below)
  • I'm running the example with spark-submit on Spark 2.3.2

I also tried sbt runMain with Spark 2.3.2 included as a dependency, and it shows the same behaviour.
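
For readers trying to reproduce this setup, here is a minimal sketch of a build along those lines (a build.sbt with the sbt-assembly plugin enabled); the project name, dependency versions, artifact names, and merge strategy are assumptions added for illustration, not taken from the reporter's actual build:

// build.sbt (sketch): Spark is marked Provided so it is left out of the assembly
// and supplied at runtime by spark-submit.
name := "dl4j-spark-repro"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.2" % Provided,
  "org.deeplearning4j" % "deeplearning4j-core" % "1.0.0-beta3",
  "org.deeplearning4j" % "dl4j-spark_2.11" % "1.0.0-beta3_spark_2",
  "org.nd4j" % "nd4j-native-platform" % "1.0.0-beta3"
)

// sbt-assembly: discard duplicated META-INF entries that would otherwise break the merge
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}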

@AlexDBlack AlexDBlack self-assigned this Dec 12, 2018

@lichaoir

commented Dec 17, 2018

I have also tried the snapshot build as recommended in #6756. Unfortunately it didn't work.
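
For reference, a minimal sketch of how a snapshot build might be pulled in via sbt; the repository URL, version string, and artifact choice are assumptions added for illustration, not details taken from this thread:

// build.sbt (sketch): snapshot coordinates are assumptions based on DL4J's usual snapshot publishing
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"

libraryDependencies ++= Seq(
  "org.deeplearning4j" % "deeplearning4j-core" % "1.0.0-SNAPSHOT",
  "org.nd4j" % "nd4j-native-platform" % "1.0.0-SNAPSHOT"
)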

AlexDBlack added a commit that referenced this issue Dec 21, 2018

@AlexDBlack

Member

commented Dec 21, 2018

Thanks for reporting and for the code. I was able to reproduce the problem: it turns out the dropout instance was not being cloned when configurations are cloned, so it was incorrectly shared by multiple networks when using parameter averaging.
The fix will be merged soon: #6858
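
For anyone stuck on beta3 until that fix lands, one possible workaround (an assumption added here for illustration, not something recommended in this thread) is to leave dropout out of the configuration, since the failure originates in Dropout.backprop. Replacing the dense layer in the repro above with a dropout-free one would look like this:

import org.deeplearning4j.nn.conf.layers.DenseLayer
import org.nd4j.linalg.activations.Activation

// Same dense layer as in the repro, but without .dropOut(0.9); this avoids the
// shared dropout state entirely, at the cost of losing dropout regularization.
val denseWithoutDropout = new DenseLayer.Builder()
  .nIn(10)
  .nOut(10)
  .activation(Activation.RELU)
  .hasBias(true)
  .build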

AlexDBlack added a commit that referenced this issue Dec 22, 2018

[WIP] Misc fixes (#6858)
* Javadoc fixes and small cleanup

* More minor fixes

* #6790 Nd4j.arange javadoc

* Clean up + MLN javadoc pass

* #6890 Nd4j.writeAsNumpy javadoc

* #6859 Fix UIServer.detach

* #6811 ElementWiseVertex single input fix

* #6804 Transforms.dot, Transforms.cross

* #6852 Throw original exception on I18N loading errors

* #6841 Fix dropout instances not being cloned when config is cloned

* #6475 INDArray.isInfinite, INDArray.isNaN

* #6856 Allow scoped out in feedForwardToActivationDetached

* Minor test fixes

printomi added a commit to printomi/deeplearning4j that referenced this issue Jan 7, 2019

[WIP] Misc fixes (deeplearning4j#6858): same change list as the commit referenced above.

@lock

commented Jan 21, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jan 21, 2019
