Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-6685][ML]Use DSYRK to compute AtA in ALS #19536

Closed
wants to merge 14 commits into from
Closed

Conversation

mpjlu
Copy link

@mpjlu mpjlu commented Oct 19, 2017

What changes were proposed in this pull request?

This is a reopen of PR: #13891 with mima fix and code optimization.
There is about 30-40% performance improvement comparing with #13891

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

@mpjlu
Copy link
Author

mpjlu commented Oct 19, 2017

Hi @yanboliang , I reopen this PR per your suggestion, thanks.
I have tested the code, the performance result is matched with @hqzizania 's results.
Thanks @hqzizania .

@hqzizania
Copy link
Contributor

Wow, thank you for reopening. LOL @mpjlu

@@ -589,6 +602,9 @@ class ALS(@Since("1.4.0") override val uid: String) extends Estimator[ALSModel]
@Since("2.2.0")
def setColdStartStrategy(value: String): this.type = set(coldStartStrategy, value)

@Since("2.3.0")
def setThreshold(value: Int): this.type = set(threshold, value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure callers can meaningfully understand and set this. Can't we pick a threshold programmatically?

Copy link
Author

@mpjlu mpjlu Oct 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my test result is better than the previous result, especially for native BLAS. I will update my test results here soon, and I will change this set.

@@ -1043,6 +1043,9 @@ object MimaExcludes {
// [SPARK-21680][ML][MLLIB]optimzie Vector coompress
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.mllib.linalg.Vector.toSparseWithSize"),
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.ml.linalg.Vector.toSparseWithSize")
) ++ Seq(
// [SPARK-6685][ML]Use DSYRK to compute AtA in ALS
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.recommendation.ALS.train")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks an API unnecessarily, even though it's a dev API. I think we instead need to remove this as a user-facing param and avoid it altogether.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok to remove this, and use a loose threshold (e.g. 100), which is helpful for most cases. How about it?

@SparkQA
Copy link

SparkQA commented Oct 19, 2017

Test build #82905 has finished for PR 19536 at commit 4fdcbe0.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Copy link
Author

mpjlu commented Oct 24, 2017

image

100k*100k, sparsity 0.05 old old New
rank f2j mkl mkl
20 4 9 4
50 13 12 13
100 40 60 30
200 156 90 66
500 900 720 270

This is the performance data of this PR.
The date size is 100,000*100,000, the data sparsity is 0.05.
The test environment is:
3 workers: 35 cores/worker, 210G memory/worker, 1 executor/worker.
The native BLAS is Intel MKL

@mpjlu
Copy link
Author

mpjlu commented Oct 24, 2017

cc @srowen @mengxr @yanboliang
The performance data is updated, thanks.

@mpjlu
Copy link
Author

mpjlu commented Jan 15, 2018

Because I don't have the environment to continue this work, I will close it. Thanks.

@mpjlu mpjlu closed this Jan 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants