[jvm-package] discussion on evaluation of xgboost4j-spark #1849
Comments
Are you sure you have labels in the test data? |
For the input case, use the RDD zip method to produce the expected format... |
I see. I read through the predict method in Spark:
/**
 * Predict the result given a data point and the weights learned.
 *
 * @param dataMatrix Row vector containing the features for this data point
 * @param weightMatrix Column vector containing the weights of the model
 * @param intercept Intercept of the model.
 */
protected def predictPoint(dataMatrix: Vector, weightMatrix: Vector, intercept: Double): Double

/**
 * Predict values for the given data set using the model trained.
 *
 * @param testData RDD representing data points to be predicted
 * @return RDD[Double] where each entry contains the corresponding prediction
 *
 */
@Since("1.0.0")
def predict(testData: RDD[Vector]): RDD[Double] = {
  // A small optimization to avoid serializing the entire model. Only the weightsMatrix
  // and intercept is needed.
  val localWeights = weights
  val bcWeights = testData.context.broadcast(localWeights)
  val localIntercept = intercept
  testData.mapPartitions { iter =>
    val w = bcWeights.value
    iter.map(v => predictPoint(v, w, localIntercept))
  }
}
So it seems that we just need to unfold the Array[Array[Float]] into an Iterator, and the correspondence with the input is kept? |
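The flattening asked about here can be sketched in plain Scala. This is only an illustration, not xgboost4j-spark code: `flattenPredictions` is a hypothetical helper, and it assumes each inner array holds exactly one score and that the rows come back in the same order as the inputs of that partition.

```scala
object FlattenSketch {
  /** Hypothetical helper: turn one partition's Array[Array[Float]] into an Iterator[Float].
    * Order is preserved, so the i-th output still belongs to the i-th input row. */
  def flattenPredictions(batch: Array[Array[Float]]): Iterator[Float] =
    batch.iterator.map { row =>
      // Assumption: one scalar score per input row
      require(row.length == 1, "expected a scalar prediction per row")
      row(0)
    }

  def main(args: Array[String]): Unit = {
    val inputs = Seq("row-a", "row-b", "row-c")
    val batch = Array(Array(0.1f), Array(0.8f), Array(0.4f))
    // Pair each input with its prediction by position
    inputs.iterator.zip(flattenPredictions(batch)).foreach(println)
  }
}
```

Because both iterators are consumed in the same order, pairing by position is enough as long as the model emits exactly one row per input.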
Here is how they use predict in Spark:
object LogisticRegressionWithLBFGSExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LogisticRegressionWithLBFGSExample")
    val sc = new SparkContext(conf)

    // $example on$
    // Load training data in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

    // Split data into training (60%) and test (40%).
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    // Run training algorithm to build the model
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10)
      .run(training)

    // Compute raw scores on the test set.
    val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
      val prediction = model.predict(features)
      (prediction, label)
    }

    // Get evaluation metrics.
    val metrics = new MulticlassMetrics(predictionAndLabels)
    val accuracy = metrics.accuracy
    println(s"Accuracy = $accuracy")

    // Save and load model
    model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
    val sameModel = LogisticRegressionModel.load(sc,
      "target/tmp/scalaLogisticRegressionWithLBFGSModel")
    // $example off$

    sc.stop()
  }
}
I think that we should provide a |
XGBoost does not support efficient prediction for a single instance, so predict(v: Vector) does not make sense. |
I see. So how about implementing a method that returns a flattened RDD, like Spark does? For instance:
def predict(testSet: RDD[Vector], useExternalCache: Boolean = false): RDD[Float] = {
  import DataUtils._
  val broadcastBooster = testSet.sparkContext.broadcast(_booster)
  val appName = testSet.context.appName
  testSet.mapPartitions { testSamples =>
    if (testSamples.hasNext) {
      val rabitEnv = Array("DMLC_TASK_ID" -> TaskContext.getPartitionId().toString).toMap
      Rabit.init(rabitEnv.asJava)
      val cacheFileName = {
        if (useExternalCache) {
          s"$appName-dtest_cache-${TaskContext.getPartitionId()}"
        } else {
          null
        }
      }
      val dMatrix = new DMatrix(new JDMatrix(testSamples, cacheFileName))
      val res = broadcastBooster.value.predict(dMatrix)
      require(res(0).length == 1, "The prediction of the tree should be scalar")
      Rabit.shutdown()
      // Flatten to one Float per input row, in input order
      res.map { arr => arr(0) }.iterator
    } else {
      Iterator()
    }
  }
}
Now we can use zip to combine input and output. |
That's on the to-do list, but I have had no chance to do it. You can deprecate the current one and write a new one. |
@fromradio You will need to implement the GBT logic completely in Java/Scala without calling any JNI functions. |
@xydrolase I am actually trying to build a project containing various cluster-based machine learning algorithms, and I found that the API of xgboost4j-spark is not consistent with Spark's. Currently I do not have much time to rewrite the GBT logic in Scala; I will try when I'm free. |
@CodingCat I have finished this on xgboost-0.6. I will open a pull request against 0.7 once I finish. |
@fromradio I found this Java library, https://github.com/komiya-atsushi/xgboost-predictor-java, which may be interesting to you. |
@geoHeil Thanks a lot. It seems to be what I want! I will compare its prediction speed against the xgboost4j-spark I modified. |
Cool. I would be interested to know about your results.
(Replying by email to Ruimin Wang's comment of Tue, 13 Dec 2016, 11:18.) |
@geoHeil I have tested on a dataset (200,000 records) on Spark. xgboost4j-spark took 1775736 milliseconds, including implicit data transformations. xgboost-predictor-java took 4620104 milliseconds including data transformations and 2907550 milliseconds without them. I think xgboost4j's batch prediction is faster, so I will keep using xgboost4j. |
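To put the reported timings side by side, the arithmetic can be written out as a small Scala snippet. The figures are copied from the comment above; the object and helper names are made up for illustration.

```scala
object TimingComparison {
  // Figures as reported in the comment (milliseconds, 200,000 records)
  val xgboost4jSparkMs = 1775736L
  val predictorWithTransformMs = 4620104L
  val predictorNoTransformMs = 2907550L

  /** Convert milliseconds to minutes. */
  def minutes(ms: Long): Double = ms / 60000.0

  /** How many times longer `a` took than `b`. */
  def ratio(a: Long, b: Long): Double = a.toDouble / b

  def main(args: Array[String]): Unit = {
    println(f"xgboost4j-spark: ${minutes(xgboost4jSparkMs)}%.1f min")
    println(f"predictor (with transformations): ${ratio(predictorWithTransformMs, xgboost4jSparkMs)}%.2fx")
    println(f"predictor (without transformations): ${ratio(predictorNoTransformMs, xgboost4jSparkMs)}%.2fx")
  }
}
```

By these numbers, xgboost-predictor-java took about 2.6 times as long as xgboost4j-spark with transformations included and about 1.6 times as long without them.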
Thanks. Did you test predicting a single record as well?
(Replying by email to Ruimin Wang's comment of Tue, 13 Dec 2016, 12:32.) |
@geoHeil Not yet. xgboost4j does not offer such an API. As discussed before, this is because there is no fast solution yet. I think batch prediction can handle most situations. |
I ran into this problem too. After a week of testing, I found that if you use this API, you should make sure the input RDD is persisted so that the partitioning does not change. Then you can zip the results with the input labels. |
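The persist-then-zip approach described above might look like the following sketch. It is not part of xgboost4j-spark: `labelsWithPredictions` is a hypothetical helper, the `predict` parameter stands in for whichever batch predict method is used, and the mllib `LabeledPoint` and `Vector` types are assumed because they appear in the Spark example quoted earlier. `RDD.zip` requires both RDDs to have the same number of partitions and the same number of elements in each partition, so `predict` is assumed to emit exactly one value per input row.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object PersistAndZipSketch {
  /** Hypothetical helper: pair each label with the prediction for the same row. */
  def labelsWithPredictions(
      test: RDD[LabeledPoint],
      predict: RDD[Vector] => RDD[Float]): RDD[(Double, Float)] = {
    // Persist so that both passes below read identical partitions
    val persisted = test.persist()
    val predictions = predict(persisted.map(_.features))
    // zip pairs elements by partition and by position within the partition
    persisted.map(_.label).zip(predictions)
  }
}
```

Persisting matters when the input would otherwise be recomputed in a way that does not reproduce the same rows per partition, in which case the zipped pairs could be misaligned or the zip could fail.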
Hi all. I have recently been reading the source code of xgboost4j-spark, and I found that the output of the predict methods is hard to use.
The output is an RDD of matrices in which each row holds the prediction for one record (when I print it, each row seems to contain a single Float). So you get each partition's prediction results, but you do not know how they correspond to the original data, and I don't know what I can do with such a result. I am wondering whether we should delete these methods or replace them with others. I suggest offering two methods: one that returns the input together with the prediction, and one that returns the label together with the prediction (enough for most evaluation).
BTW, I can offer my implementation.
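To illustrate why label and prediction pairs are enough for most evaluation, here is a minimal plain-Scala sketch. The helper names and the 0.5 threshold are assumptions made for illustration; on a cluster the same computation would run over an RDD of pairs rather than a local Seq.

```scala
object EvaluationSketch {
  /** Root mean squared error over (label, prediction) pairs. */
  def rmse(pairs: Seq[(Double, Float)]): Double = {
    require(pairs.nonEmpty, "need at least one pair")
    math.sqrt(pairs.map { case (y, p) => (y - p) * (y - p) }.sum / pairs.size)
  }

  /** Fraction of pairs whose thresholded score disagrees with a 0/1 label. */
  def errorRate(pairs: Seq[(Double, Float)], threshold: Float = 0.5f): Double = {
    require(pairs.nonEmpty, "need at least one pair")
    val wrong = pairs.count { case (y, p) =>
      val predictedClass = if (p > threshold) 1.0 else 0.0
      predictedClass != y
    }
    wrong.toDouble / pairs.size
  }
}
```

Neither metric needs the original feature vectors, which is why a method returning only labels and predictions covers the common case.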