Skip to content

[SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to rdd.treeAggregate#14628

Closed
WeichenXu123 wants to merge 1 commit intoapache:masterfrom
WeichenXu123:improve_kmean_aggregate
Closed

[SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to rdd.treeAggregate#14628
WeichenXu123 wants to merge 1 commit intoapache:masterfrom
WeichenXu123:improve_kmean_aggregate

Conversation

@WeichenXu123
Copy link
Contributor

What changes were proposed in this pull request?

The kmean use aggregate to compute points cost. As the data RDD is huge so it is better to use treeAggregate.

How was this patch tested?

Existing test.

@SparkQA
Copy link

SparkQA commented Aug 13, 2016

Test build #63718 has finished for PR 14628 at commit 3edc7e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lins05
Copy link
Contributor

lins05 commented Aug 13, 2016

A grep shows there is also call to RDD.aggregate in LDAModel, should we fix it here as well?

mllib/src/main/scala/org/apache/spark/mllib/clustering]$ git grep -E '\<aggregate\>' |grep -v // 
KMeans.scala:417:        .aggregate(new Array[Double](runs))(
LDAModel.scala:756:    graph.vertices.aggregate(0.0)(seqOp, _ + _)

@WeichenXu123
Copy link
Contributor Author

@lins05 Ok, give me some time to check whether the one in LDAModel is also proper to use treeAggregate....

@WeichenXu123 WeichenXu123 changed the title [SPARK-17033][Follow-up][ML][MLLib] Improve kmean aggregate to treeAggregate [SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to rdd.treeAggregate Aug 14, 2016
@holdenk
Copy link
Contributor

holdenk commented Aug 15, 2016

Awesome thanks for taking the time to do this. A few follow up questions:

  1. So this is happening with the default tree depth (2) did you try it with other depths?
  2. Have you had a chance to run it with spark-perf or otherwise benchmark the change to validate that it does actually result in the expected performance improvement/does not lead to a regression?
    Thanks!

@WeichenXu123
Copy link
Contributor Author

@holdenk
I think depth (2) is enough to handle large RDD and bigger depth may add cost. I'll append test result later. Thanks!

@WeichenXu123
Copy link
Contributor Author

@lins05 I think the LDA model data size usually will be much smaller than training data, so that here it seems no need to change to treeAggregate ( for DistributedLDAModel ) . thanks!

@WeichenXu123
Copy link
Contributor Author

because KMeans algo is being optimized by another task I close this PR for now and when that one merged I'll check for whether this need to be optimized.

@WeichenXu123 WeichenXu123 deleted the improve_kmean_aggregate branch April 24, 2019 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants