[SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to rdd.treeAggregate by WeichenXu123 · Pull Request #14628 · apache/spark

WeichenXu123 · 2016-08-13T07:42:51Z

What changes were proposed in this pull request?

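KMeans uses aggregate to compute the point costs. Because the data RDD can be huge, it is better to use treeAggregate, which combines partial results in intermediate stages instead of merging every partition's result on the driver.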
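As a rough illustration of the swap (not the actual diff in this PR), the sketch below sums a cost per run index with both RDD.aggregate and RDD.treeAggregate. The object name, the helper sumCosts, the toy (run, cost) pairs, and the local master setting are all hypothetical; it assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TreeAggregateSketch {
  // Hypothetical helper: sums a cost per run index into an Array[Double].
  // The accumulator shape is an assumption made for this illustration.
  def sumCosts(costs: RDD[(Int, Double)], runs: Int, useTree: Boolean): Array[Double] = {
    // Fold one (run, cost) pair into a partition-local accumulator.
    val seqOp = (acc: Array[Double], p: (Int, Double)) => { acc(p._1) += p._2; acc }
    // Merge two accumulators element-wise.
    val combOp = (a: Array[Double], b: Array[Double]) => {
      var i = 0
      while (i < a.length) { a(i) += b(i); i += 1 }
      a
    }
    if (useTree) {
      // Partial results are combined in intermediate stages before the driver.
      costs.treeAggregate(new Array[Double](runs))(seqOp, combOp, depth = 2)
    } else {
      // Every partition's result is merged directly on the driver.
      costs.aggregate(new Array[Double](runs))(seqOp, combOp)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("treeAggregate-sketch")
    val sc = new SparkContext(conf)
    try {
      val runs = 3
      // Toy data standing in for per-point costs: each point contributes 1.0 to one run.
      val costs = sc.parallelize(0 until 1000, 16).map(i => (i % runs, 1.0))
      println(sumCosts(costs, runs, useTree = false).mkString(", "))
      println(sumCosts(costs, runs, useTree = true).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Both calls should return the same totals; the difference is only in where the partial arrays are merged, which matters when there are many partitions or the accumulator is large.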

How was this patch tested?

Existing tests.

SparkQA · 2016-08-13T08:33:07Z

Test build #63718 has finished for PR 14628 at commit 3edc7e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

lins05 · 2016-08-13T09:00:32Z

A grep shows there is also a call to RDD.aggregate in LDAModel. Should we fix it here as well?

mllib/src/main/scala/org/apache/spark/mllib/clustering]$ git grep -E '\<aggregate\>' |grep -v // 
KMeans.scala:417:        .aggregate(new Array[Double](runs))(
LDAModel.scala:756:    graph.vertices.aggregate(0.0)(seqOp, _ + _)

WeichenXu123 · 2016-08-13T09:07:19Z

@lins05 OK, give me some time to check whether the one in LDAModel is also a good fit for treeAggregate.

holdenk · 2016-08-15T17:18:28Z

Awesome, thanks for taking the time to do this. A few follow-up questions:

This uses the default tree depth (2). Did you try it with other depths?
Have you had a chance to run it with spark-perf, or otherwise benchmark the change, to validate that it actually gives the expected performance improvement and does not lead to a regression?
Thanks!

WeichenXu123 · 2016-08-16T04:13:23Z

@holdenk
I think the default depth (2) is enough to handle a large RDD, and a bigger depth may add cost because each extra level adds another round of intermediate aggregation. I'll append test results later. Thanks!
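To make the depth trade-off concrete, here is a toy model of multi-level combining. It is a simplification for illustration, not Spark's exact scheduling logic; the object name TreeDepthModel and the function levels are hypothetical. It assumes each intermediate round shrinks the number of partial results by a factor of roughly numPartials^(1/depth), with depth - 1 such rounds before the final merge.

```scala
object TreeDepthModel {
  /**
   * Returns the number of partial results at each level, starting with the
   * per-partition partials and ending with how many the final step merges.
   * Toy model only: depth - 1 intermediate rounds, each dividing by `scale`.
   */
  def levels(numPartials: Int, depth: Int): List[Int] = {
    require(numPartials >= 1 && depth >= 1, "numPartials and depth must be positive")
    // Fan-in per round, chosen so that `depth` rounds cover all partials.
    val scale = math.max(math.ceil(math.pow(numPartials.toDouble, 1.0 / depth)).toInt, 2)
    (1 until depth).foldLeft(List(numPartials)) { (acc, _) =>
      val next = math.max(1, math.ceil(acc.head.toDouble / scale).toInt)
      next :: acc
    }.reverse
  }

  def main(args: Array[String]): Unit = {
    // Print how the partial count shrinks for a few depths.
    for (d <- 1 to 4) {
      println(s"depth=$d: ${levels(1024, d).mkString(" -> ")}")
    }
  }
}
```

In this model, depth 1 corresponds to a flat aggregate where the final step merges all 1024 partials, while depth 2 leaves 32 for the final step at the price of one extra round. Deeper trees reduce the final fan-in further but add more rounds, which is the extra cost mentioned above.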

WeichenXu123 · 2016-08-28T01:53:10Z

@lins05 I think the LDA model data is usually much smaller than the training data, so there seems to be no need to change it to treeAggregate (for DistributedLDAModel). Thanks!

WeichenXu123 · 2016-08-31T14:34:38Z

Because the KMeans algorithm is being optimized by another task, I am closing this PR for now. Once that one is merged, I'll check whether this still needs to be optimized.

update kmean aggregate to treeAggregate

3edc7e0

WeichenXu123 changed the title ~~[SPARK-17033][Follow-up][ML][MLLib] Improve kmean aggregate to treeAggregate~~ [SPARK-17050][ML][MLLib] Improve kmean rdd.aggregate to rdd.treeAggregate Aug 14, 2016

WeichenXu123 closed this Aug 31, 2016

WeichenXu123 deleted the improve_kmean_aggregate branch April 24, 2019 21:18

WeichenXu123 wants to merge 1 commit into apache:master from WeichenXu123:improve_kmean_aggregate

4 participants