[jvm-package] discussion on evaluation of xgboost4j-spark #1849

Closed · fromradio opened this issue Dec 8, 2016 · 17 comments
@fromradio (Contributor) commented Dec 8, 2016

Hi, all. Recently I have been reading the source code of xgboost4j-spark, and I found that the output of the predict methods is hard to use.

  def predict(testSet: RDD[MLDenseVector], missingValue: Float): RDD[Array[Array[Float]]] = {
    val broadcastBooster = testSet.sparkContext.broadcast(_booster)
    testSet.mapPartitions { testSamples =>
      val sampleArray = testSamples.toList
      val numRows = sampleArray.size
      val numColumns = sampleArray.head.size
      if (numRows == 0) {
        Iterator()
      } else {
        val rabitEnv = Map("DMLC_TASK_ID" -> TaskContext.getPartitionId().toString)
        Rabit.init(rabitEnv.asJava)
        // translate to required format
        val flatSampleArray = new Array[Float](numRows * numColumns)
        for (i <- flatSampleArray.indices) {
          flatSampleArray(i) = sampleArray(i / numColumns).values(i % numColumns).toFloat
        }
        val dMatrix = new DMatrix(flatSampleArray, numRows, numColumns, missingValue)
        Rabit.shutdown()
        Iterator(broadcastBooster.value.predict(dMatrix))
      }
    }
  }

The output is an RDD whose elements are matrices, one per partition, where each row holds the prediction result for one input row (when I print the result, each row seems to contain just one Float). So you only get each partition's prediction matrix, and you lose the correspondence with the original data. I don't know what can be done with such a result. I am wondering whether we should delete these methods or replace them with others. I suggest offering two methods: one returns the input together with the prediction, the other returns the label together with the prediction (enough for most evaluation).
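To make the shape problem concrete, here is a minimal illustration, reusing the model and testSet names from the code above (a sketch, not part of any patch):

val preds: RDD[Array[Array[Float]]] = model.predict(testSet, missingValue)
// one matrix per non-empty partition, not one row of scores per input row,
// so preds cannot be zipped back against testSet directly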

BTW I can offer my implementation.

@CodingCat (Member)

Are you sure you have labels in the test data?

@CodingCat (Member)

For the input case, use RDD's zip method to produce the expected format...

reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel
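A minimal sketch of the zip idea, assuming a predict that returns one Float per input row (all names here are illustrative):

import org.apache.spark.rdd.RDD

// zip requires the two RDDs to have the same number of partitions and the
// same number of elements per partition; this holds when the predictions
// are derived from the cached inputs one-per-row
def withPredictions[V](features: RDD[V])(predict: RDD[V] => RDD[Float]): RDD[(V, Float)] = {
  val cached = features.cache() // stable partitioning so zip lines up
  cached.zip(predict(cached))
}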

@fromradio (Contributor, Author)

I see. I pasted the predict method from Spark's GeneralizedLinearModel below:

  /**
   * Predict the result given a data point and the weights learned.
   *
   * @param dataMatrix Row vector containing the features for this data point
   * @param weightMatrix Column vector containing the weights of the model
   * @param intercept Intercept of the model.
   */
  protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double

  /**
   * Predict values for the given data set using the model trained.
   *
   * @param testData RDD representing data points to be predicted
   * @return RDD[Double] where each entry contains the corresponding prediction
   *
   */
  @Since("1.0.0")
  def predict(testData: RDD[Vector]): RDD[Double] = {
    // A small optimization to avoid serializing the entire model. Only the weightsMatrix
    // and intercept is needed.
    val localWeights = weights
    val bcWeights = testData.context.broadcast(localWeights)
    val localIntercept = intercept
    testData.mapPartitions { iter =>
      val w = bcWeights.value
      iter.map(v => predictPoint(v, w, localIntercept))
    }
  }

So it seems that we just need to unfold the Array[Array[Float]] into an Iterator, and the correspondence is kept?
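Something like the following sketch, where predictPartition stands for the per-partition scoring step (a hypothetical helper, not an existing API):

import org.apache.spark.rdd.RDD

// flattening the per-partition matrix row by row preserves the row order
// within each partition, so the element-wise correspondence with the input
// is kept and the result can be zipped against it
def unfold[V](testSet: RDD[V])(predictPartition: Iterator[V] => Array[Array[Float]]): RDD[Float] =
  testSet.mapPartitions { rows =>
    if (rows.isEmpty) Iterator.empty
    else predictPartition(rows).iterator.map(_(0))
  }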

@fromradio (Contributor, Author) commented Dec 9, 2016

Here is how predict is used in Spark:

object LogisticRegressionWithLBFGSExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogisticRegressionWithLBFGSExample")
    val sc = new SparkContext(conf)

    // $example on$
    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10)
      .run(training)

    // Compute raw scores on the test set.
    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      val prediction = model.predict(features)
      (prediction, label)
    }

    // Get evaluation metrics.
    val metrics = new MulticlassMetrics(predictionAndLabels)
    val accuracy = metrics.accuracy
    println(s"Accuracy = $accuracy")

    // Save and load model
    model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
    val sameModel = LogisticRegressionModel.load(sc,
      "target/tmp/scalaLogisticRegressionWithLBFGSModel")
    // $example off$

    sc.stop()
  }
}

I think we should provide a predict method similar to Spark's, which takes a vector as input rather than a matrix; the downstream processing would then be easier. If you think this is necessary, I will try to do it~

@CodingCat (Member)

XGBoost does not support prediction for a single instance efficiently... so predict(v: Vector) does not make sense.

@fromradio (Contributor, Author)

I see. So how about implementing a method that returns a flattened RDD, like Spark does? For instance:

def predict(testSet: RDD[Vector], useExternalCache: Boolean = false): RDD[Float] = {
  import DataUtils._
  val broadcastBooster = testSet.sparkContext.broadcast(_booster)
  val appName = testSet.context.appName
  testSet.mapPartitions { testSamples =>
    if (testSamples.hasNext) {
      val rabitEnv = Array("DMLC_TASK_ID" -> TaskContext.getPartitionId().toString).toMap
      Rabit.init(rabitEnv.asJava)
      val cacheFileName = {
        if (useExternalCache) {
          s"$appName-dtest_cache-${TaskContext.getPartitionId()}"
        } else {
          null
        }
      }
      val dMatrix = new DMatrix(new JDMatrix(testSamples, cacheFileName))
      val res = broadcastBooster.value.predict(dMatrix)
      require(res(0).length == 1, "The prediction of the tree should be scalar")
      Rabit.shutdown()
      res.map { arr => arr(0) }.iterator
    } else {
      Iterator()
    }
  }
}

Now we can use zip to combine the input and the output, for example:
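(model and testData are illustrative names; testData is an RDD[LabeledPoint]. This mirrors the evaluation pattern from the Spark example above.)

import org.apache.spark.mllib.evaluation.MulticlassMetrics

val cached = testData.cache() // keep partitioning stable so zip aligns rows
val predictions = model.predict(cached.map(_.features))
val predictionAndLabels = predictions.map(_.toDouble).zip(cached.map(_.label))
val metrics = new MulticlassMetrics(predictionAndLabels)
println(s"Accuracy = ${metrics.accuracy}")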

@CodingCat (Member)

That's on the todo list, but I haven't had a chance to do it.

You can deprecate the current one and write a new one.

@xydrolase (Contributor) commented Dec 9, 2016

> I think we should provide a predict method similar to Spark's, which takes a vector as input rather than a matrix; the downstream processing would then be easier. If you think this is necessary, I will try to do it~

@fromradio You would need to implement the GBT logic completely in Java/Scala without calling any JNI functions.
It is certainly doable, but typically you only need prediction for single vectors in a production environment, in which case you don't necessarily need or want to ship a fat jar with all the components of XGBoost. A rough sketch of what that involves is below.
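A minimal sketch of pure-JVM GBT scoring, with illustrative types rather than xgboost4j's actual model format (it ignores missing values, the base score, and the objective's link function):

sealed trait Node
case class Leaf(value: Float) extends Node
case class Split(featureIndex: Int, threshold: Float, left: Node, right: Node) extends Node

// walk one regression tree down to a leaf
def scoreTree(node: Node, features: Array[Float]): Float = node match {
  case Leaf(v)           => v
  case Split(i, t, l, r) => scoreTree(if (features(i) < t) l else r, features)
}

// a GBT prediction is the sum of the leaf values over all trees
def predictSingle(trees: Seq[Node], features: Array[Float]): Float =
  trees.map(scoreTree(_, features)).sum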

@fromradio (Contributor, Author)
@xydrolase I am actually trying to build a project containing various cluster-based machine learning algorithms, and I found that the API of xgboost4j-spark is not consistent with Spark's. Currently I don't have much time to rewrite the GBT logic in Scala, but I will try when I'm free.

@fromradio (Contributor, Author)
@CodingCat I have finished this on xgboost-0.6; I will open a pull request against 0.7 after I finish~

@geoHeil (Contributor) commented Dec 13, 2016

@fromradio I found this Java library: https://github.com/komiya-atsushi/xgboost-predictor-java. Maybe it is interesting to you.

@fromradio (Contributor, Author)
@geoHeil Thanks a lot. It seems to be what I want! I will run a test comparing prediction speed against the xgboost4j-spark version I modified.

@geoHeil (Contributor) commented Dec 13, 2016 via email

@fromradio (Contributor, Author)
@geoHeil I have tested on a dataset of 200,000 instances on Spark. xgboost4j-spark took 1775736 milliseconds including implicit data transformations. xgboost-predictor-java took 4620104 milliseconds including data transformations, and 2907550 milliseconds without them. I think xgboost4j's prediction on a batch is faster, so I will keep using xgboost4j.

@geoHeil (Contributor) commented Dec 13, 2016 via email

@fromradio (Contributor, Author)
@geoHeil Not yet. xgboost4j does not offer such an API. As discussed above, this is because there is no fast solution yet. I think batch prediction can handle most situations.

@jiang1983
I met this problem too. After a week of testing, I found that if you use this API, you should make sure the input RDD is persisted, so the partitioning does not change; then you can zip the result with the input labels!
If your input data is loaded from HDFS, you need to do nothing. Otherwise, you should persist or cache the data. A sketch of the workaround is below.
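A minimal sketch of the workaround just described, where predict stands for the current RDD[Array[Array[Float]]] API discussed above and the other names are illustrative:

import org.apache.spark.mllib.linalg.{Vector => MLVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def labelsWithScores(
    data: RDD[LabeledPoint],
    predict: RDD[MLVector] => RDD[Array[Array[Float]]]): RDD[(Double, Float)] = {
  // persist first so the partitions are never recomputed differently
  val cached = data.persist(StorageLevel.MEMORY_AND_DISK)
  // flatten the per-partition matrices back to one score per row,
  // which restores the element counts that zip requires
  val scores = predict(cached.map(_.features)).flatMap(_.iterator.map(_.head))
  cached.map(_.label).zip(scores)
}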
