[SPARK-4708][MLLib] Make k-mean runs two/three times faster with dense/sparse sample #3565

dbtsai · 2014-12-03T01:22:21Z

Note that the usage of breezeSquaredDistance in
org.apache.spark.mllib.util.MLUtils.fastSquaredDistance
is in the critical path, and breezeSquaredDistance is slow.
We should replace it with our own implementation.
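
For readers following along, here is a minimal sketch of the kind of hand-written distance routine this refers to. It is not the code from this PR. It works on plain dense arrays, and the names SquaredDistanceSketch, dot, directSquaredDistance, and squaredDistanceWithNorms are hypothetical. It assumes the norms of both points are computed once and reused, so each distance costs one dot product via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 (a . b).

```scala
// Illustrative sketch only: plain arrays, hypothetical names, not the PR's code.
object SquaredDistanceSketch {

  /** Dot product of two dense arrays of equal length. */
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      sum += a(i) * b(i)
      i += 1
    }
    sum
  }

  /** Direct sum of squared differences, kept as a reference implementation. */
  def directSquaredDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /**
   * Squared distance from precomputed L2 norms:
   * ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b).
   * The result is clamped at zero because rounding can make it slightly negative.
   */
  def squaredDistanceWithNorms(
      a: Array[Double], normA: Double,
      b: Array[Double], normB: Double): Double = {
    val sq = normA * normA + normB * normB - 2.0 * dot(a, b)
    math.max(sq, 0.0)
  }
}
```

The expanded form can lose precision through cancellation when the two points are very close, so a production version would likely fall back to the direct sum in that case.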

Here is the benchmark against the mnist8m dataset.

Before
DenseVector: 70.04secs
SparseVector: 59.05secs

With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs

mengxr · 2014-12-03T01:24:50Z

mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala

-        (p1._1 += p2._1, p1._2 + p2._2)
+        require(p1._1.size == p2._1.size)
+        var i = 0
+        while(i < p1._1.size) {


We can use BLAS.axpy here.
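
As a rough illustration of the suggestion, axpy is the level-1 BLAS operation y := a * x + y, which covers the element-wise accumulation the while loop in the diff performs. The sketch below is standalone and uses plain arrays. AxpySketch and mergeSums are hypothetical names, the (sum, count) pair with a Long count is an assumption about the reduce step, and the signature of Spark's own BLAS.axpy is not shown in this thread and is not assumed here.

```scala
// Illustrative sketch only: a hand-written axpy on plain arrays.
object AxpySketch {

  /** In-place y := a * x + y for dense arrays of equal length. */
  def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
    require(x.length == y.length, "dimension mismatch")
    var i = 0
    while (i < x.length) {
      y(i) += a * x(i)
      i += 1
    }
  }

  /**
   * Merge two (sum, count) pairs, as a k-means reduce step might:
   * the second sum is accumulated into the first in place, and counts are added.
   */
  def mergeSums(
      p1: (Array[Double], Long),
      p2: (Array[Double], Long)): (Array[Double], Long) = {
    axpy(1.0, p2._1, p1._1)
    (p1._1, p1._2 + p2._2)
  }
}
```

With a = 1.0 the call reduces to a plain vector sum, so the explicit loop and the size check collapse into one call.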

SparkQA · 2014-12-03T01:30:07Z

Test build #24065 has started for PR 3565 at commit b185a77.

This patch merges cleanly.

SparkQA · 2014-12-03T01:31:25Z

Test build #24065 has finished for PR 3565 at commit b185a77.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable

AmplabJenkins · 2014-12-03T01:31:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24065/

dbtsai · 2014-12-03T01:58:14Z

Calling BLAS adds a very small extra overhead. The benchmark is now:

DenseVector: 33.19secs
SparseVector: 22.05secs

SparkQA · 2014-12-03T02:00:15Z

Test build #24067 has started for PR 3565 at commit de24662.

This patch merges cleanly.

SparkQA · 2014-12-03T02:15:11Z

Test build #24068 has started for PR 3565 at commit 08bc068.

This patch merges cleanly.

SparkQA · 2014-12-03T03:20:07Z

Test build #24067 has finished for PR 3565 at commit de24662.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable

AmplabJenkins · 2014-12-03T03:20:10Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24067/

SparkQA · 2014-12-03T03:35:38Z

Test build #24068 has finished for PR 3565 at commit 08bc068.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable

AmplabJenkins · 2014-12-03T03:35:41Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24068/

…e/sparse sample

Note that the usage of `breezeSquaredDistance` in
`org.apache.spark.mllib.util.MLUtils.fastSquaredDistance`
is in the critical path, and `breezeSquaredDistance` is slow.
We should replace it with our own implementation.

Here is the benchmark against mnist8m dataset.

Before
DenseVector: 70.04secs
SparseVector: 59.05secs

With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3565 from dbtsai/kmean and squashes the following commits:

08bc068 [DB Tsai] restyle
de24662 [DB Tsai] address feedback
b185a77 [DB Tsai] cleanup
4554ddd [DB Tsai] first commit

(cherry picked from commit 7fc49ed)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

mengxr · 2014-12-03T11:03:58Z

LGTM. Merged into master and branch-1.2. Thanks!

DB Tsai added 2 commits December 2, 2014 17:05

first commit

4554ddd

cleanup

b185a77

mengxr reviewed Dec 3, 2014

address feedback

de24662

restyle

08bc068

asfgit closed this in 7fc49ed Dec 3, 2014

dbtsai deleted the kmean branch December 3, 2014 19:21
