[jvm-packages] call setGroup for ranking task #2066
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -107,8 +107,8 @@ object XGBoost extends Serializable { | |
// to workaround the empty partitions in training dataset, | ||
// this might not be the best efficient implementation, see | ||
// (https://github.com/dmlc/xgboost/issues/1277) | ||
partitionedTrainingSet.mapPartitions { | ||
trainingSamples => | ||
partitionedTrainingSet.mapPartitionsWithIndex { | ||
case (partIndex, trainingSamples) => | ||
Review comment: I just noticed this change. You don't need to use mapPartitionsWithIndex; you can just directly use |
||
rabitEnv.put("DMLC_TASK_ID", TaskContext.getPartitionId().toString) | ||
Rabit.init(rabitEnv) | ||
var booster: Booster = null | ||
|
@@ -123,6 +123,11 @@ object XGBoost extends Serializable { | |
} | ||
val partitionItr = fromDenseToSparseLabeledPoints(trainingSamples, missing) | ||
val trainingSet = new DMatrix(new JDMatrix(partitionItr, cacheFileName)) | ||
if (xgBoostConfMap.isDefinedAt("groupData") | ||
&& xgBoostConfMap.get("groupData").get != null) { | ||
trainingSet.setGroup( | ||
xgBoostConfMap.get("groupData").get.asInstanceOf[Seq[Seq[Int]]](partIndex).toArray) | ||
} | ||
Review comment: What's the difference between serializing groupData as part of xgBoostConfMap and as an independent variable?

Reply: Sorry, I don't fully understand the intent of the question. I want to call setGroup(group: Array[Int]), and checking xgBoostConfMap is costly, so I check xgBoostConfMap once and make an independent variable.

Review comment: Say xgBoostConfMap has M bytes and groupData has N bytes. By a naive (and incorrect) estimate, xgBoostConfMapFiltered will have M - N bytes (in practice it should be larger than M - N, considering JVM overhead, etc.). You still call groupData(partIndex) in the closure of mapPartitions, so groupData (N bytes) will still be serialized and passed to executors. xgBoostConfMapFiltered is also used within mapPartitions, so another M - N bytes will be serialized and passed to executors. I don't see any difference here.
||
booster = SXGBoost.train(trainingSet, xgBoostConfMap, round, | ||
watches = new mutable.HashMap[String, DMatrix] { | ||
put("train", trainingSet) | ||
|
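The serialization argument in the review thread can be checked with a rough, Spark-free sketch. It uses plain Java serialization as a stand-in for what gets shipped to executors, and compares the size of a map that contains groupData with the size of a filtered map plus groupData as a separate value. The object name `ClosurePayloadSketch`, the helper `serializedSize`, and the sample data are all hypothetical; only `xgBoostConfMap`, `groupData`, and `xgBoostConfMapFiltered` come from the discussion.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical sketch, not code from this PR.
object ClosurePayloadSketch {
  // Size in bytes of an object under plain Java serialization.
  def serializedSize(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.size()
  }

  def main(args: Array[String]): Unit = {
    // Assumed sample data: 4 partitions with 1000 groups of size 10 each.
    val groupData: Seq[Seq[Int]] = Vector.fill(4)(Vector.fill(1000)(10))
    val xgBoostConfMap: Map[String, Any] = Map(
      "objective" -> "rank:pairwise",
      "eta" -> 0.1,
      "groupData" -> groupData)

    // The variant discussed in the thread: strip groupData out of the map
    // and keep it as an independent variable.
    val xgBoostConfMapFiltered = xgBoostConfMap - "groupData"

    val combined = serializedSize(xgBoostConfMap)
    val split = serializedSize(xgBoostConfMapFiltered) + serializedSize(groupData)

    println(s"map including groupData:      $combined bytes")
    println(s"filtered map + groupData:     $split bytes")
  }
}
```

If both values are referenced inside the partition function, both end up in the serialized payload either way, so the two totals should be of the same order. That is the reviewer's point: splitting groupData out of the map does not by itself reduce what is sent to executors.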
Review comment: Please put the original comments back.