[jvm-packages] call setGroup for ranking task #2066

Merged
merged 12 commits into from Mar 6, 2017
@@ -107,8 +107,8 @@ object XGBoost extends Serializable {
// to workaround the empty partitions in training dataset,
// this might not be the best efficient implementation, see
// (https://github.com/dmlc/xgboost/issues/1277)
partitionedTrainingSet.mapPartitions {
trainingSamples =>
partitionedTrainingSet.mapPartitionsWithIndex {
Member

Please put the original comments back.

case (partIndex, trainingSamples) =>
Member

I just noticed this change. You don't need mapPartitionsWithIndex; you can call TaskContext.getPartitionId() directly to get the partition index.
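To illustrate the suggestion above, here is a minimal sketch (not code from this PR) comparing the two ways of obtaining the partition index. It assumes Spark is on the classpath and uses a local master; the object name `PartitionIdSketch` and the toy RDD are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object PartitionIdSketch {
  // Returns (partition index, element count) pairs computed two ways.
  def run(): (Array[(Int, Int)], Array[(Int, Int)]) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-id-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 8, numSlices = 4)

      // Variant in the diff: the index arrives as an argument.
      val viaIndex = rdd.mapPartitionsWithIndex {
        case (partIndex, samples) => Iterator((partIndex, samples.size))
      }.collect()

      // Variant suggested in review: ask the running task for its partition id.
      val viaContext = rdd.mapPartitions { samples =>
        Iterator((TaskContext.getPartitionId(), samples.size))
      }.collect()

      (viaIndex, viaContext)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaIndex, viaContext) = run()
    // Both variants should report the same index for every partition.
    assert(viaIndex.sameElements(viaContext))
  }
}
```

Since the diff already calls `TaskContext.getPartitionId()` for `DMLC_TASK_ID`, reusing it would keep the original `mapPartitions` signature unchanged.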

rabitEnv.put("DMLC_TASK_ID", TaskContext.getPartitionId().toString)
Rabit.init(rabitEnv)
var booster: Booster = null
@@ -123,6 +123,11 @@ object XGBoost extends Serializable {
}
val partitionItr = fromDenseToSparseLabeledPoints(trainingSamples, missing)
val trainingSet = new DMatrix(new JDMatrix(partitionItr, cacheFileName))
if (xgBoostConfMap.isDefinedAt("groupData")
&& xgBoostConfMap.get("groupData").get != null) {
trainingSet.setGroup(
xgBoostConfMap.get("groupData").get.asInstanceOf[Seq[Seq[Int]]](partIndex).toArray)
}
Member

What is the difference between serializing groupData as part of xgBoostConfMap and serializing it as an independent variable?

Contributor Author

Sorry, I am not sure I understand the intent of the question.
If my answer does not make sense, please let me know.

I want to call setGroup(group: Array[Int]).

Checking xgBoostConfMap is costly.

So I check xgBoostConfMap and create an independent variable.
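One way to read the approach described in this reply is sketched below, under assumptions: the map is inspected once, the result is kept as an `Option`, and each partition then picks its own group sizes as the `Array[Int]` that `setGroup` takes. The object and method names are hypothetical, and this is not the code that was merged.

```scala
object GroupLookupSketch {
  // Resolve the optional "groupData" entry once; None when the key is absent or null.
  // The cast mirrors the asInstanceOf in the diff and is unchecked at runtime.
  def resolveGroupData(conf: Map[String, Any]): Option[Seq[Seq[Int]]] =
    conf.get("groupData").flatMap(Option(_)).map(_.asInstanceOf[Seq[Seq[Int]]])

  // Group sizes for one partition; None when there is no entry for that index.
  def groupsFor(groupData: Option[Seq[Seq[Int]]], partIndex: Int): Option[Array[Int]] =
    groupData.flatMap(_.lift(partIndex)).map(_.toArray)

  def main(args: Array[String]): Unit = {
    val conf = Map[String, Any]("objective" -> "rank:pairwise",
      "groupData" -> Seq(Seq(3, 2), Seq(4)))
    val groupData = resolveGroupData(conf)
    // Inside a partition one would call trainingSet.setGroup(...) on the result.
    groupsFor(groupData, 0).foreach(g => println(g.mkString(",")))
  }
}
```

Using `Option` avoids the `isDefinedAt` / `.get.get` / `null` combination, and `lift` guards against a partition index with no corresponding entry.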

Member

Say 'xgBoostConfMap' has M bytes and 'groupData' has N bytes. By a naive (and not exact) estimate, xgBoostConfMapFiltered will have M - N bytes (in practice it should be larger than M - N, considering JVM overhead, etc.).

You still call groupData(partIndex) in the closure of mapPartitions, so groupData (N bytes) will still be serialized and passed to the executors. xgBoostConfMapFiltered is also referenced within mapPartitions, so another (M - N) bytes will be serialized and passed to the executors.

I don't see any difference here.
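The size argument above can be checked roughly with plain Java serialization, which is what Spark uses for task closures. The sketch below is illustrative only: the data, the object name, and the helper `serializedSize` are made up, and the exact byte counts depend on the Scala and JVM versions.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializedSizeSketch {
  // Java-serialize a value and return the number of bytes produced.
  def serializedSize(value: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    bytes.size()
  }

  // Returns (whole map, filtered map plus groupData, groupData alone) in bytes.
  def sizes(): (Int, Int, Int) = {
    // Hypothetical data: 4 partitions with 1000 groups of size 10 each.
    val groupData: Seq[Seq[Int]] = Vector.fill(4)(Vector.fill(1000)(10))
    val confMap: Map[String, Any] =
      Map("objective" -> "rank:pairwise", "eta" -> 0.1, "groupData" -> groupData)
    val confMapFiltered = confMap - "groupData"

    val whole = serializedSize(confMap)                // roughly M
    val groupOnly = serializedSize(groupData)          // roughly N
    val split = serializedSize(confMapFiltered) + groupOnly // roughly (M - N) + N
    (whole, split, groupOnly)
  }

  def main(args: Array[String]): Unit = {
    val (whole, split, groupOnly) = sizes()
    println(s"whole map: $whole bytes, filtered map + groupData: $split bytes, groupData: $groupOnly bytes")
  }
}
```

If both the filtered map and `groupData` are captured by the closure, the N bytes of `groupData` are shipped either way, which is the reviewer's point.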

booster = SXGBoost.train(trainingSet, xgBoostConfMap, round,
watches = new mutable.HashMap[String, DMatrix] {
put("train", trainingSet)
@@ -53,7 +53,14 @@ trait LearningTaskParams extends Params {
s" {${LearningTaskParams.supportedEvalMetrics.mkString(",")}}",
(value: String) => LearningTaskParams.supportedEvalMetrics.contains(value))

setDefault(objective -> "reg:linear", baseScore -> 0.5, numClasses -> 2)
/**
* group data specify each group sizes for ranking task. To correspond to partition of
* training data, it is nested.
*/
val groupData = new Param[Seq[Seq[Int]]](this, "groupData", "group data specify each group size" +
" for ranking task. To correspond to partition of training data, it is nested.")

setDefault(objective -> "reg:linear", baseScore -> 0.5, numClasses -> 2, groupData -> null)
}

private[spark] object LearningTaskParams {
7,423 changes: 7,423 additions & 0 deletions jvm-packages/xgboost4j-spark/src/test/resources/rank-demo-0.txt.train

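For readers unfamiliar with the nested `groupData` layout described in `LearningTaskParams`, the following sketch shows one possible shape: the outer sequence has one entry per training partition, and each inner sequence lists the query group sizes for that partition. The sanity check assumes that the group sizes of a partition should add up to its row count; the helper name and the sample numbers are hypothetical.

```scala
object GroupDataSketch {
  // Indices of partitions whose group sizes do not sum to the partition's row count.
  def mismatchedPartitions(groupData: Seq[Seq[Int]], rowsPerPartition: Seq[Int]): Seq[Int] =
    groupData.zip(rowsPerPartition).zipWithIndex.collect {
      case ((groups, rows), idx) if groups.sum != rows => idx
    }

  def main(args: Array[String]): Unit = {
    // Partition 0 holds two query groups (3 and 2 rows); partition 1 holds one group of 4 rows.
    val groupData: Seq[Seq[Int]] = Seq(Seq(3, 2), Seq(4))
    val rowsPerPartition = Seq(5, 4)
    val bad = mismatchedPartitions(groupData, rowsPerPartition)
    println(if (bad.isEmpty) "group sizes match row counts" else s"mismatch in partitions: $bad")
  }
}
```

A check like this could run on the driver before training, because a repartition of the training data would silently invalidate the nesting.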