[SPARK-4708][MLLib] Make k-mean runs two/three times faster with dense/sparse sample #3565
(p1._1 += p2._1, p1._2 + p2._2)
require(p1._1.size == p2._1.size)
var i = 0
while(i < p1._1.size) {
We can use BLAS.axpy here.
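To illustrate the suggestion, here is a minimal sketch of what an axpy-style update (y := a * x + y) does when merging two (sum, count) accumulators. It works on plain arrays rather than MLlib vectors; `AxpySketch`, `axpy`, and `merge` are hypothetical names for illustration and are not the code from this PR or Spark's internal helper.

```scala
object AxpySketch {
  // Hypothetical stand-in for an axpy routine: y := a * x + y, updated in place.
  def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
    require(x.length == y.length, "vectors must have the same length")
    var i = 0
    while (i < x.length) {
      y(i) += a * x(i)
      i += 1
    }
  }

  // Merge two (sum, count) accumulators. The first sum is mutated and reused,
  // so no new array is allocated per merge.
  def merge(
      p1: (Array[Double], Long),
      p2: (Array[Double], Long)): (Array[Double], Long) = {
    axpy(1.0, p2._1, p1._1)
    (p1._1, p1._2 + p2._2)
  }
}
```

Routing the update through a single helper keeps the length check and the in-place loop in one place, at the cost of one extra call per merge.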
Test build #24065 has started for PR 3565 at commit
Test build #24065 has finished for PR 3565 at commit
Test FAILed.
Calling BLAS adds a very small amount of extra overhead. With that change, the benchmark is now DenseVector: 33.19secs
Test build #24067 has started for PR 3565 at commit
Test build #24068 has started for PR 3565 at commit
Test build #24067 has finished for PR 3565 at commit
Test PASSed.
Test build #24068 has finished for PR 3565 at commit
Test PASSed.
…e/sparse sample
Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and `breezeSquaredDistance` is slow. We should replace it with our own implementation.
Here is the benchmark against mnist8m dataset.
Before
DenseVector: 70.04secs
SparseVector: 59.05secs
With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs
Author: DB Tsai <dbtsai@alpinenow.com>
Closes #3565 from dbtsai/kmean and squashes the following commits:
08bc068 [DB Tsai] restyle
de24662 [DB Tsai] address feedback
b185a77 [DB Tsai] cleanup
4554ddd [DB Tsai] first commit
(cherry picked from commit 7fc49ed)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
LGTM. Merged into master and branch-1.2. Thanks!
Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and `breezeSquaredDistance` is slow. We should replace it with our own implementation.
Here is the benchmark against the mnist8m dataset.
Before
DenseVector: 70.04secs
SparseVector: 59.05secs
With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs
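As a rough illustration of what a hand-written replacement for `breezeSquaredDistance` could look like, the sketch below computes squared Euclidean distance with plain `while` loops, once for dense arrays and once for sparse vectors stored as sorted index and value arrays. The object and method names (`SqDistSketch`, `sqdistDense`, `sqdistSparse`) and the array-based representation are assumptions made for this example; it is not the implementation merged in this PR.

```scala
object SqDistSketch {
  // Squared Euclidean distance between two dense vectors of equal length.
  def sqdistDense(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  // Squared Euclidean distance between two sparse vectors.
  // Assumes each index array is strictly increasing and aligned with its values.
  def sqdistSparse(
      i1: Array[Int], v1: Array[Double],
      i2: Array[Int], v2: Array[Double]): Double = {
    var sum = 0.0
    var k1 = 0
    var k2 = 0
    // Walk both index arrays in step, like the merge phase of merge sort.
    while (k1 < i1.length || k2 < i2.length) {
      var d = 0.0
      if (k2 >= i2.length || (k1 < i1.length && i1(k1) < i2(k2))) {
        // Index present only in the first vector.
        d = v1(k1)
        k1 += 1
      } else if (k1 >= i1.length || i2(k2) < i1(k1)) {
        // Index present only in the second vector.
        d = v2(k2)
        k2 += 1
      } else {
        // Index present in both vectors.
        d = v1(k1) - v2(k2)
        k1 += 1
        k2 += 1
      }
      sum += d * d
    }
    sum
  }
}
```

The sparse variant only visits stored entries, so its cost scales with the number of non-zeros rather than the vector dimension.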