[SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to rdd.treeAggregate #14628
WeichenXu123 wants to merge 1 commit into apache:master from
Test build #63718 has finished for PR 14628 at commit
|
|
A grep shows there is also a call to aggregate in LDAModel.scala:
mllib/src/main/scala/org/apache/spark/mllib/clustering]$ git grep -E '\<aggregate\>' | grep -v //
KMeans.scala:417: .aggregate(new Array[Double](runs))(
LDAModel.scala:756: graph.vertices.aggregate(0.0)(seqOp, _ + _)
|
@lins05 OK, give me some time to check whether the one in LDAModel should also use treeAggregate.
|
Awesome, thanks for taking the time to do this. A few follow-up questions:
|
|
@holdenk |
|
@lins05 I think the LDA model data is usually much smaller than the training data, so there seems to be no need to change to treeAggregate here (for DistributedLDAModel). Thanks!
|
Because the KMeans algorithm is being optimized in another task, I am closing this PR for now. Once that one is merged, I'll check whether this still needs to be optimized.
What changes were proposed in this pull request?
KMeans uses aggregate to compute the points cost. Because the data RDD is huge, it is better to use treeAggregate.
How was this patch tested?
Existing test.
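For readers unfamiliar with the distinction in the description: `RDD.aggregate(zero)(seqOp, combOp)` merges every partition's partial result in one final step on the driver, while `RDD.treeAggregate(zero)(seqOp, combOp, depth)` first combines partials in intermediate levels. The sketch below is a simplified, Spark-free model of that idea and is not the code from this PR or Spark's actual implementation. All names in it (`AggregateSketch`, `pointCost`, `partials`, `flatAggregate`, `treeAggregate`, `fanIn`) are hypothetical and chosen only for illustration.

```scala
// Simplified, Spark-free model of flat vs. tree aggregation of a k-means cost.
// Every name here is hypothetical; this is not Spark's implementation.
object AggregateSketch {
  // Squared Euclidean distance from a point to its nearest center.
  def pointCost(centers: Seq[Array[Double]], p: Array[Double]): Double =
    centers.map { c =>
      c.zip(p).map { case (a, b) => (a - b) * (a - b) }.sum
    }.min

  // One partial cost per partition, as seqOp would produce on each executor.
  def partials(parts: Seq[Seq[Array[Double]]],
               centers: Seq[Array[Double]]): Seq[Double] =
    parts.map(_.foldLeft(0.0)((acc, p) => acc + pointCost(centers, p)))

  // Flat aggregate: a single step merges every partial.
  // Returns (total cost, number of partials merged in the final step).
  def flatAggregate(ps: Seq[Double]): (Double, Int) = (ps.sum, ps.size)

  // Tree aggregate: merge partials in groups of `fanIn` until at most
  // `fanIn` remain, then do the final merge.
  // Returns (total cost, number of partials merged in the final step).
  def treeAggregate(ps: Seq[Double], fanIn: Int): (Double, Int) = {
    require(fanIn >= 2, "fanIn must be at least 2")
    var level = ps.toVector
    while (level.size > fanIn) {
      level = level.grouped(fanIn).map(_.sum).toVector
    }
    (level.sum, level.size)
  }
}
```

Both functions return the same total, but in the tree version the final step handles at most `fanIn` partials rather than one per partition. That is the motivation given in the description for a huge data RDD. Whether it pays off depends on the number of partitions and the size of each partial result.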