[SPARK-4708][MLLib] Make k-mean runs two/three times faster with dense/sparse sample #3565

Closed
wants to merge 4 commits into from

dbtsai
Member

@dbtsai dbtsai commented Dec 3, 2014

`breezeSquaredDistance`, as used in
`org.apache.spark.mllib.util.MLUtils.fastSquaredDistance`,
is on the critical path, and it is slow.
This PR replaces it with our own implementation.
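
For readers unfamiliar with what a hand-written replacement might look like, here is a minimal sketch of a squared Euclidean distance over plain arrays. It is not the code from this PR: the object name `SquaredDistanceSketch`, the method names, and the array-based sparse layout (sorted indices plus parallel values) are assumptions made for illustration, and it deliberately avoids any Spark or Breeze types.

```scala
object SquaredDistanceSketch {

  /** Squared Euclidean distance between two dense vectors stored as arrays. */
  def dense(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same size")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /**
   * Squared Euclidean distance between two sparse vectors.
   * Assumed layout: strictly increasing indices with parallel value arrays.
   */
  def sparse(
      ai: Array[Int], av: Array[Double],
      bi: Array[Int], bv: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    var j = 0
    // Walk both index arrays in lockstep, like a sorted merge.
    while (i < ai.length && j < bi.length) {
      if (ai(i) == bi(j)) {
        val d = av(i) - bv(j)
        sum += d * d
        i += 1
        j += 1
      } else if (ai(i) < bi(j)) {
        sum += av(i) * av(i) // entry present only in a
        i += 1
      } else {
        sum += bv(j) * bv(j) // entry present only in b
        j += 1
      }
    }
    // Remaining entries exist in only one of the two vectors.
    while (i < ai.length) { sum += av(i) * av(i); i += 1 }
    while (j < bi.length) { sum += bv(j) * bv(j); j += 1 }
    sum
  }
}
```

The plain `while` loops avoid allocating intermediate vectors, which is one plausible reason a hand-written version can beat a generic library call on a hot path.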

Here is the benchmark against the mnist8m dataset.

Before
DenseVector: 70.04secs
SparseVector: 59.05secs

With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs

(p1._1 += p2._1, p1._2 + p2._2)
require(p1._1.size == p2._1.size)
var i = 0
while(i < p1._1.size) {
Contributor

We can use `BLAS.axpy` here.
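
To make the suggestion concrete, here is a rough sketch of how an axpy-style update (`y := a * x + y`) can replace the element-wise `while` loop when merging two partial results. The `axpy` below is a hand-written stand-in over plain arrays, not MLlib's `BLAS.axpy`; the object name `AxpySketch`, the `merge` helper, and the `(Array[Double], Long)` pair type are assumptions for illustration.

```scala
object AxpySketch {

  /** Stand-in for a BLAS-style axpy: updates y in place to a * x + y. */
  def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
    require(x.length == y.length, "vectors must have the same size")
    var i = 0
    while (i < x.length) {
      y(i) += a * x(i)
      i += 1
    }
  }

  /**
   * Merges two (sum, count) partial results.
   * The sum of p2 is accumulated into p1's array in place, so p1 is mutated.
   */
  def merge(
      p1: (Array[Double], Long),
      p2: (Array[Double], Long)): (Array[Double], Long) = {
    axpy(1.0, p2._1, p1._1) // p1.sum += p2.sum
    (p1._1, p1._2 + p2._2)
  }
}
```

With `a = 1.0` the call is simply an in-place vector addition, so the size check and the loop both move into one reusable routine.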

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24065 has started for PR 3565 at commit b185a77.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24065 has finished for PR 3565 at commit b185a77.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable
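
One common reason to store a vector together with its norm is that the squared distance can then be computed from a single dot product, using the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b. The sketch below illustrates that idea only; whether and how this PR applies it is not stated in this conversation. `ArrayWithNorm` is a hypothetical, simplified stand-in for `VectorWithNorm` that holds a dense array, and all other names are assumptions.

```scala
object NormSketch {

  /** Hypothetical simplified stand-in for VectorWithNorm, backed by a dense array. */
  final case class ArrayWithNorm(values: Array[Double], norm: Double)

  /** Computes the Euclidean norm once so it can be reused across many distance calls. */
  def withNorm(values: Array[Double]): ArrayWithNorm = {
    var s = 0.0
    var i = 0
    while (i < values.length) {
      s += values(i) * values(i)
      i += 1
    }
    ArrayWithNorm(values, math.sqrt(s))
  }

  private def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same size")
    var s = 0.0
    var i = 0
    while (i < a.length) {
      s += a(i) * b(i)
      i += 1
    }
    s
  }

  /** Squared distance via cached norms: |a|^2 + |b|^2 - 2 * (a . b). */
  def squaredDistance(a: ArrayWithNorm, b: ArrayWithNorm): Double = {
    val d = a.norm * a.norm + b.norm * b.norm - 2.0 * dot(a.values, b.values)
    // Rounding can push the result slightly below zero for near-identical vectors.
    math.max(d, 0.0)
  }
}
```

This identity can lose precision when the two vectors are very close to each other, so a production implementation would need a guard, for example falling back to a direct element-wise computation.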

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24065/

@dbtsai
Member Author

dbtsai commented Dec 3, 2014

Calling BLAS adds a very small overhead. The benchmark is now:

DenseVector: 33.19secs
SparseVector: 22.05secs

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24067 has started for PR 3565 at commit de24662.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24068 has started for PR 3565 at commit 08bc068.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24067 has finished for PR 3565 at commit de24662.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24067/

@SparkQA

SparkQA commented Dec 3, 2014

Test build #24068 has finished for PR 3565 at commit 08bc068.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24068/

asfgit pushed a commit that referenced this pull request Dec 3, 2014
…e/sparse sample

Note that the usage of `breezeSquaredDistance` in
`org.apache.spark.mllib.util.MLUtils.fastSquaredDistance`
is in the critical path, and `breezeSquaredDistance` is slow.
We should replace it with our own implementation.

Here is the benchmark against the mnist8m dataset.

Before
DenseVector: 70.04secs
SparseVector: 59.05secs

With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3565 from dbtsai/kmean and squashes the following commits:

08bc068 [DB Tsai] restyle
de24662 [DB Tsai] address feedback
b185a77 [DB Tsai] cleanup
4554ddd [DB Tsai] first commit

(cherry picked from commit 7fc49ed)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in 7fc49ed Dec 3, 2014
@mengxr
Contributor

mengxr commented Dec 3, 2014

LGTM. Merged into master and branch-1.2. Thanks!

@dbtsai dbtsai deleted the kmean branch December 3, 2014 19:21