[jvm-package] discussion on evaluation of xgboost4j-spark #1849

Closed · fromradio opened this issue Dec 8, 2016 · 17 comments
@fromradio (Contributor) commented Dec 8, 2016

Hi, all. Recently I have been reading the source code of xgboost4j-spark, and I found that the output of the predict methods is hard to use.

  def predict(testSet: RDD[MLDenseVector], missingValue: Float): RDD[Array[Array[Float]]] = {
    val broadcastBooster = testSet.sparkContext.broadcast(_booster)
    testSet.mapPartitions { testSamples =>
      val sampleArray = testSamples.toList
      val numRows = sampleArray.size
      val numColumns = sampleArray.head.size
      if (numRows == 0) {
        Iterator()
      } else {
        val rabitEnv = Map("DMLC_TASK_ID" -> TaskContext.getPartitionId().toString)
        Rabit.init(rabitEnv.asJava)
        // translate to required format
        val flatSampleArray = new Array[Float](numRows * numColumns)
        for (i <- flatSampleArray.indices) {
          flatSampleArray(i) = sampleArray(i / numColumns).values(i % numColumns).toFloat
        }
        val dMatrix = new DMatrix(flatSampleArray, numRows, numColumns, missingValue)
        Rabit.shutdown()
        Iterator(broadcastBooster.value.predict(dMatrix))
      }
    }
  }

The output is an RDD whose elements are matrices, one per partition, where each row holds the prediction result for one input row (when I print the result, each row seems to contain just one Float). So you only get each partition's prediction matrix, and you lose the correspondence with the original data. I don't know what can be done with such a result. I am wondering whether we should delete these methods or replace them with others. I suggest offering two methods: one returns the input together with the prediction, the other returns the label together with the prediction (enough for most evaluation).
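To make the shape problem concrete, here is a minimal illustration, reusing the model and testSet names from the code above (a sketch, not part of any patch):

val preds: RDD[Array[Array[Float]]] = model.predict(testSet, missingValue)
// one matrix per non-empty partition, not one row of scores per input row,
// so preds cannot be zipped back against testSet directly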

BTW I can offer my implementation.

@CodingCat (Member)

Are you sure you have labels in the test data?

@CodingCat (Member)

For the input case, use RDD's zip method to produce the expected format...

reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel
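A minimal sketch of the zip idea, assuming a predict that returns one Float per input row (all names here are illustrative):

import org.apache.spark.rdd.RDD

// zip requires the two RDDs to have the same number of partitions and the
// same number of elements per partition; this holds when the predictions
// are derived from the cached inputs one-per-row
def withPredictions[V](features: RDD[V])(predict: RDD[V] => RDD[Float]): RDD[(V, Float)] = {
  val cached = features.cache() // stable partitioning so zip lines up
  cached.zip(predict(cached))
}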

@fromradio (Contributor, Author)

I see. I pasted the predict method from Spark's GeneralizedLinearModel below:

  /**
   * Predict the result given a data point and the weights learned.
   *
   * @param dataMatrix Row vector containing the features for this data point
   * @param weightMatrix Column vector containing the weights of the model
   * @param intercept Intercept of the model.
   */
  protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double

  /**
   * Predict values for the given data set using the model trained.
   *
   * @param testData RDD representing data points to be predicted
   * @return RDD[Double] where each entry contains the corresponding prediction
   *
   */
  @Since("1.0.0")
  def predict(testData: RDD[Vector]): RDD[Double] = {
    // A small optimization to avoid serializing the entire model. Only the weightsMatrix
    // and intercept is needed.
    val localWeights = weights
    val bcWeights = testData.context.broadcast(localWeights)
    val localIntercept = intercept
    testData.mapPartitions { iter =>
      val w = bcWeights.value
      iter.map(v => predictPoint(v, w, localIntercept))
    }
  }

So it seems that we just need to unfold the Array[Array[Float]] into an Iterator, and the correspondence is kept?
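Something like the following sketch, where predictPartition stands for the per-partition scoring step (a hypothetical helper, not an existing API):

import org.apache.spark.rdd.RDD

// flattening the per-partition matrix row by row preserves the row order
// within each partition, so the element-wise correspondence with the input
// is kept and the result can be zipped against it
def unfold[V](testSet: RDD[V])(predictPartition: Iterator[V] => Array[Array[Float]]): RDD[Float] =
  testSet.mapPartitions { rows =>
    if (rows.isEmpty) Iterator.empty
    else predictPartition(rows).iterator.map(_(0))
  }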

@fromradio (Contributor, Author) commented Dec 9, 2016

Here is how predict is used in Spark:

object LogisticRegressionWithLBFGSExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogisticRegressionWithLBFGSExample")
    val sc = new SparkContext(conf)

    // $example on$
    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10)
      .run(training)

    // Compute raw scores on the test set.
    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      val prediction = model.predict(features)
      (prediction, label)
    }

    // Get evaluation metrics.
    val metrics = new MulticlassMetrics(predictionAndLabels)
    val accuracy = metrics.accuracy
    println(s"Accuracy = $accuracy")

    // Save and load model
    model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
    val sameModel = LogisticRegressionModel.load(sc,
      "target/tmp/scalaLogisticRegressionWithLBFGSModel")
    // $example off$

    sc.stop()
  }
}

I think we should provide a predict method similar to Spark's, which takes a vector as input rather than a matrix; the downstream processing would then be easier. If you think this is necessary, I will try to do it~

@CodingCat (Member)

XGBoost does not support prediction for a single instance efficiently... so predict(v: Vector) does not make sense.

@fromradio (Contributor, Author)

I see. So how about implementing a method that returns a flattened RDD, like Spark does? For instance:

def predict(testSet: RDD[Vector], useExternalCache: Boolean = false): RDD[Float] = {
  import DataUtils._
  val broadcastBooster = testSet.sparkContext.broadcast(_booster)
  val appName = testSet.context.appName
  testSet.mapPartitions { testSamples =>
    if (testSamples.hasNext) {
      val rabitEnv = Array("DMLC_TASK_ID" -> TaskContext.getPartitionId().toString).toMap
      Rabit.init(rabitEnv.asJava)
      val cacheFileName = {
        if (useExternalCache) {
          s"$appName-dtest_cache-${TaskContext.getPartitionId()}"
        } else {
          null
        }
      }
      val dMatrix = new DMatrix(new JDMatrix(testSamples, cacheFileName))
      val res = broadcastBooster.value.predict(dMatrix)
      require(res(0).length == 1, "The prediction of the tree should be scalar")
      Rabit.shutdown()
      res.map { arr => arr(0) }.iterator
    } else {
      Iterator()
    }
  }
}

Now we can use zip to combine the input and the output, for example:
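(model and testData are illustrative names; testData is an RDD[LabeledPoint]. This mirrors the evaluation pattern from the Spark example above.)

import org.apache.spark.mllib.evaluation.MulticlassMetrics

val cached = testData.cache() // keep partitioning stable so zip aligns rows
val predictions = model.predict(cached.map(_.features))
val predictionAndLabels = predictions.map(_.toDouble).zip(cached.map(_.label))
val metrics = new MulticlassMetrics(predictionAndLabels)
println(s"Accuracy = ${metrics.accuracy}")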

@CodingCat (Member)

That's on the todo list, but I haven't had a chance to do it.

You can deprecate the current one and write a new one.

@xydrolase (Contributor) commented Dec 9, 2016

> I think we should provide a predict method similar to Spark's, which takes a vector as input rather than a matrix; the downstream processing would then be easier. If you think this is necessary, I will try to do it~

@fromradio You would need to implement the GBT logic completely in Java/Scala without calling any JNI functions.
It is certainly doable, but typically you only need prediction for single vectors in a production environment, in which case you don't necessarily need or want to ship a fat jar with all the components of XGBoost. A rough sketch of what that involves is below.
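A minimal sketch of pure-JVM GBT scoring, with illustrative types rather than xgboost4j's actual model format (it ignores missing values, the base score, and the objective's link function):

sealed trait Node
case class Leaf(value: Float) extends Node
case class Split(featureIndex: Int, threshold: Float, left: Node, right: Node) extends Node

// walk one regression tree down to a leaf
def scoreTree(node: Node, features: Array[Float]): Float = node match {
  case Leaf(v)           => v
  case Split(i, t, l, r) => scoreTree(if (features(i) < t) l else r, features)
}

// a GBT prediction is the sum of the leaf values over all trees
def predictSingle(trees: Seq[Node], features: Array[Float]): Float =
  trees.map(scoreTree(_, features)).sum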

@fromradio (Contributor, Author)
@xydrolase I am actually trying to build a project containing various cluster-based machine learning algorithms, and I found that the API of xgboost4j-spark is not consistent with Spark's. Currently I don't have much time to rewrite the GBT logic in Scala, but I will try when I'm free.

@fromradio (Contributor, Author)
@CodingCat I have finished this on xgboost-0.6; I will open a pull request against 0.7 after I finish~

@geoHeil (Contributor) commented Dec 13, 2016

@fromradio I found this Java library: https://github.com/komiya-atsushi/xgboost-predictor-java. Maybe it is interesting to you.

@fromradio (Contributor, Author)
@geoHeil Thanks a lot. It seems to be what I want! I will run a test comparing prediction speed against the xgboost4j-spark version I modified.

@geoHeil (Contributor) commented Dec 13, 2016 via email

@fromradio (Contributor, Author)
@geoHeil I have tested on a dataset of 200,000 instances on Spark. xgboost4j-spark took 1775736 milliseconds including implicit data transformations. xgboost-predictor-java took 4620104 milliseconds including data transformations, and 2907550 milliseconds without them. I think xgboost4j's prediction on a batch is faster, so I will keep using xgboost4j.

@geoHeil (Contributor) commented Dec 13, 2016 via email

@fromradio (Contributor, Author)
@geoHeil Not yet. xgboost4j does not offer such an API. As discussed above, this is because there is no fast solution yet. I think batch prediction can handle most situations.

@jiang1983
I met this problem too. After a week of testing, I found that if you use this API, you should make sure the input RDD is persisted, so the partitioning does not change; then you can zip the result with the input labels!
If your input data is loaded from HDFS, you need to do nothing. Otherwise, you should persist or cache the data. A sketch of the workaround is below.
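A minimal sketch of the workaround just described, where predict stands for the current RDD[Array[Array[Float]]] API discussed above and the other names are illustrative:

import org.apache.spark.mllib.linalg.{Vector => MLVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def labelsWithScores(
    data: RDD[LabeledPoint],
    predict: RDD[MLVector] => RDD[Array[Array[Float]]]): RDD[(Double, Float)] = {
  // persist first so the partitions are never recomputed differently
  val cached = data.persist(StorageLevel.MEMORY_AND_DISK)
  // flatten the per-partition matrices back to one score per row,
  // which restores the element counts that zip requires
  val scores = predict(cached.map(_.features)).flatMap(_.iterator.map(_.head))
  cached.map(_.label).zip(scores)
}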
