[jvm-package] scala rabit tracker underflow #1925

Closed
geoHeil opened this issue Jan 2, 2017 · 9 comments

@geoHeil
Contributor

geoHeil commented Jan 2, 2017

The Scala Rabit tracker crashes with a java.nio.BufferUnderflowException.

Environment info

Operating System: osx 10.12.2

Compiler: gcc6

Package used (python/R/jvm/C++): jvm-package

xgboost version used: current master branch

If installing from source, please provide

  1. The commit hash (git rev-parse HEAD)
  2. Logs will be helpful (If logs are large, please upload as attachment).
[INFO] [01/02/2017 22:11:24.468] [RabitTracker-akka.actor.default-dispatcher-7] [akka://RabitTracker/user/Handler] [65] train-error:0.423498
[INFO] [01/02/2017 22:11:24.587] [RabitTracker-akka.actor.default-dispatcher-9] [akka://RabitTracker/user/Handler] [66] train-error:0.422569
[INFO] [01/02/2017 22:11:24.688] [RabitTracker-akka.actor.default-dispatcher-11] [akka://RabitTracker/user/Handler] [67]        train-error:0.422327
[ERROR] [01/02/2017 22:11:24.790] [RabitTracker-akka.actor.default-dispatcher-9] [akka://RabitTracker/user/Handler/ConnectionHandler-03b85111-7230-4f10-bf03-05c62bb97e30] null
java.nio.BufferUnderflowException
        at java.nio.Buffer.nextGetIndex(Buffer.java:506)
        at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:361)
        at ml.dmlc.xgboost4j.scala.rabit.util.RabitTrackerHelpers$ByteBufferHelpers.getString(RabitTrackerHelpers.scala:33)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler.decodeCommand(RabitWorkerHandler.scala:102)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler$$anonfun$3.applyOrElse(RabitWorkerHandler.scala:124)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler$$anonfun$3.applyOrElse(RabitWorkerHandler.scala:119)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at akka.actor.FSM$class.processEvent(FSM.scala:604)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler.processEvent(RabitWorkerHandler.scala:39)
        at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:598)
        at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:592)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler.aroundReceive(RabitWorkerHandler.scala:39)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[ERROR] [01/02/2017 22:11:24.798] [RabitTracker-akka.actor.default-dispatcher-11] [LocalActorRefProvider(akka://RabitTracker)] guardian failed, shutting down system
java.lang.AssertionError: assertion failed
        at scala.Predef$.assert(Predef.scala:156)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler$$anonfun$2.applyOrElse(RabitWorkerHandler.scala:110)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler$$anonfun$2.applyOrElse(RabitWorkerHandler.scala:108)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at akka.actor.FSM$class.processEvent(FSM.scala:604)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler.processEvent(RabitWorkerHandler.scala:39)
        at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:598)
        at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:592)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler.aroundReceive(RabitWorkerHandler.scala:39)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[ERROR] [01/02/2017 22:11:24.799] [RabitTracker-akka.actor.default-dispatcher-11] [akka://RabitTracker/user] assertion failed
java.lang.AssertionError: assertion failed
        at scala.Predef$.assert(Predef.scala:156)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler$$anonfun$2.applyOrElse(RabitWorkerHandler.scala:110)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler$$anonfun$2.applyOrElse(RabitWorkerHandler.scala:108)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at akka.actor.FSM$class.processEvent(FSM.scala:604)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler.processEvent(RabitWorkerHandler.scala:39)
        at akka.actor.FSM$class.akka$actor$FSM$$processMsg(FSM.scala:598)
        at akka.actor.FSM$$anonfun$receive$1.applyOrElse(FSM.scala:592)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at ml.dmlc.xgboost4j.scala.rabit.handler.RabitWorkerHandler.aroundReceive(RabitWorkerHandler.scala:39)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

The good thing is that the Scala implementation seems to be a lot quicker; the downside is that it crashes.

edit

I noticed that the problem only occurs when XGBoostEstimator is used in a cross-validation pipeline. When it is used directly (still in a pipeline, but without cross-validation), it seems to work just fine:

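import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// xgbBaseParams, numRound, numWorkers, vectorAssembler and train are defined elsewhere
val xgbEstimator = new XGBoostEstimator(xgbBaseParams)
  .setFeaturesCol("features")
  .setLabelCol("TARGET_column")

// Single-point grid: one value each for round and nWorkers
val simplePipeParams = new ParamGridBuilder()
  .addGrid(xgbEstimator.round, Array(numRound))
  .addGrid(xgbEstimator.nWorkers, Array(numWorkers)) // unsure about setting nWorkers here
  .build()

val simplPipe = new Pipeline()
  .setStages(Array(vectorAssembler, xgbEstimator))

val numberOfFolds = 5
val cv = new CrossValidator()
  .setEstimator(simplPipe)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setLabelCol("TARGET_column")
    .setRawPredictionCol("prediction"))
  .setEstimatorParamMaps(simplePipeParams)
  .setNumFolds(numberOfFolds)
  .setSeed(12345)

// This call triggers the crash
val cvModel = cv.fit(train)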

By contrast, simplPipe.fit(train).transform(test).show works just fine.

@geoHeil
Contributor Author

geoHeil commented Jan 3, 2017

Here is a working example that reproduces the problem: https://gist.github.com/geoHeil/5e355bc7666294574ae9432dd912806f @xydrolase I hope this helps to track down the problem.

@xydrolase
Contributor

Will look into this. Thanks.

@xydrolase
Contributor

Thanks for providing the example for reproducing the error.
I've been trying to replicate the bug, but so far I haven't run into any issues.

Do you consistently run into the same buffer issue? I can guess where and why it threw that exception: the parser for worker commands is not exhaustive and may fail under some conditions.

I'll use your stack trace to try to pinpoint the issue while I keep experimenting with your example to see whether I can replicate it.

@xydrolase
Contributor

This happens when the worker sends a "print" command message that arrives in multiple packets. The "print" command is special because it carries a string: the message to print.

The command validator only verifies the part common to all commands and ignores the optional message. Because the validation is generic to all commands, it has no additional logic for the "print" command, so a partially received message passes validation and the decoder then reads past the end of the buffer. I'll fix this this week.
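
For illustration, the failure mode and one possible guard can be sketched as follows. This is a minimal sketch, not the tracker's actual code: the wire layout (a 4-byte little-endian length followed by UTF-8 bytes) and the names PrintCommandFraming, encode, and tryReadString are assumptions made for the example. It shows that a naive getInt on a short packet underflows, and that checking the remaining bytes first lets the caller wait for the rest of the message.

```scala
import java.nio.{BufferUnderflowException, ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

object PrintCommandFraming {
  // Assumed layout for this sketch only: [int32 length][UTF-8 bytes], little-endian.
  private val Order = ByteOrder.LITTLE_ENDIAN

  // Builds one complete length-prefixed frame for the given message.
  def encode(s: String): Array[Byte] = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    ByteBuffer.allocate(4 + bytes.length).order(Order).putInt(bytes.length).put(bytes).array()
  }

  // Returns Some(message) only if the whole frame is already in the buffer.
  // Otherwise returns None and leaves the position untouched, so the caller
  // can append the next packet and try again.
  def tryReadString(buf: ByteBuffer): Option[String] = {
    buf.order(Order)
    if (buf.remaining() < 4) None // the length prefix itself is incomplete
    else {
      val len = buf.getInt(buf.position()) // absolute read: does not advance the position
      require(len >= 0, s"negative string length: $len")
      if (buf.remaining() - 4 < len) None // the payload is incomplete
      else {
        buf.getInt() // consume the length prefix
        val bytes = new Array[Byte](len)
        buf.get(bytes)
        Some(new String(bytes, StandardCharsets.UTF_8))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val frame = encode("[67] train-error:0.422327")

    // Naive decoding of a packet that holds only 2 bytes of the length prefix.
    val tooShort = ByteBuffer.wrap(frame, 0, 2)
    val underflowed =
      try { tooShort.getInt(); false }
      catch { case _: BufferUnderflowException => true }
    println(s"naive getInt underflowed: $underflowed")

    // Guarded decoding: a partial frame yields None, the full frame yields the message.
    println(tryReadString(ByteBuffer.wrap(frame, 0, 10)))
    println(tryReadString(ByteBuffer.wrap(frame)))
  }
}
```

With a guard of this kind, the handler would keep buffering incoming bytes until tryReadString returns a value instead of decoding as soon as the common command header has been validated.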

@geoHeil
Contributor Author

geoHeil commented Jan 25, 2017

@xydrolase is there any progress on this issue?

@xydrolase
Contributor

xydrolase commented Jan 25, 2017

Sorry, pretty hectic recently. It has definitely been 2 weeks since my original promise of "this week".

It's an easy fix, so I will for sure allocate some time Friday to submit a PR to fix this issue.
Ping me if I languish again.

@xydrolase
Contributor

@geoHeil I've fixed it locally. Will push it tomorrow after merging upstream commits.

@xydrolase
Contributor

@geoHeil I've submitted the PR that hopefully fixes the issue for good.

@geoHeil
Contributor Author

geoHeil commented Jan 30, 2017

@xydrolase thanks a lot.
