[SPARK-19759][ML] not using blas in ALSModel.predict for optimization #19685
Test build #83551 has finished for PR 19685.
Yes, it seems people have often found in the past that level 1 BLAS ops are not actually faster. Here, using BLAS requires the overhead of converting to an array, so I can believe the loop is faster. Usually a while loop is a little faster than a for loop.
Interestingly, there is another PR suggesting that level 1 BLAS is faster when native libs are available.
// potential optimization.
blas.sdot(rank, featuresA.toArray, 1, featuresB.toArray, 1)
var dotProduct = 0.0f
for(i <- 0 until rank) {
You should use `while` instead of `for`.
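A minimal sketch of what the suggested `while` version could look like, written as a standalone helper rather than as the code actually merged. The names `DotProduct` and `dotWhile` are hypothetical, and the inputs are assumed to be indexed sequences of at least `rank` elements. In Scala 2, `for (i <- 0 until rank)` desugars to a `foreach` call on a `Range` with a closure, which is the overhead a plain `while` loop avoids.

```scala
object DotProduct {
  // Hypothetical helper (not from the PR): dot product of the first `rank`
  // elements of two feature vectors, using a while loop instead of a
  // for comprehension.
  // Assumes both inputs support constant-time indexing and have length >= rank.
  def dotWhile(featuresA: IndexedSeq[Float], featuresB: IndexedSeq[Float], rank: Int): Float = {
    var dotProduct = 0.0f
    var i = 0
    while (i < rank) {
      dotProduct += featuresA(i) * featuresB(i)
      i += 1
    }
    dotProduct
  }
}
```

For example, `DotProduct.dotWhile(Vector(1f, 2f, 3f), Vector(4f, 5f, 6f), 3)` computes 1*4 + 2*5 + 3*6.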
Have you run any tests to check the performance difference for this?
@srowen I tried enabling native BLAS, but the native BLAS implementation is still much slower: the average over 10 runs is 2529.922753 ms against 515.510185 ms for the for loop. For reference, I am using OS X on a 2.5 GHz Intel Core i7. @WeichenXu123 In the description of the PR and here you can see the tests I made. Do you think something else is needed?
Test build #83598 has finished for PR 19685.
Merged to master.
What changes were proposed in this pull request?
In `ALS.predict` we currently use the `blas.sdot` function to perform a dot product on two `Seq`s. It turns out that this is not the most efficient way. I compared the implementations with a small benchmark.
Thus this PR proposes the old-style for loop implementation for performance reasons.
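The benchmark code from the original description is not available on this page. As a rough illustration only, a comparison of this kind could be sketched as below. Everything here is an assumption: the object and method names, the `rank` of 10, the iteration counts, and the use of `System.nanoTime` for timing. The sketch compares only the `for` and `while` loops, since calling `blas.sdot` would require the BLAS dependency, and a naive JVM timing loop like this one is sensitive to warm-up and JIT effects.

```scala
object DotBench {
  // Dot product with a for comprehension, mirroring the loop in the diff.
  def dotFor(a: IndexedSeq[Float], b: IndexedSeq[Float], rank: Int): Float = {
    var dotProduct = 0.0f
    for (i <- 0 until rank) {
      dotProduct += a(i) * b(i)
    }
    dotProduct
  }

  // Same computation with a while loop, as suggested in review.
  def dotWhile(a: IndexedSeq[Float], b: IndexedSeq[Float], rank: Int): Float = {
    var dotProduct = 0.0f
    var i = 0
    while (i < rank) {
      dotProduct += a(i) * b(i)
      i += 1
    }
    dotProduct
  }

  // Average wall-clock time in milliseconds over `runs` executions of `body`.
  def averageMillis(runs: Int)(body: => Unit): Double = {
    var total = 0L
    var r = 0
    while (r < runs) {
      val start = System.nanoTime()
      body
      total += System.nanoTime() - start
      r += 1
    }
    total / 1e6 / runs
  }

  def main(args: Array[String]): Unit = {
    val rank = 10            // assumed rank, not taken from the PR
    val iterations = 1000000 // assumed workload size per run
    val rnd = new scala.util.Random(42)
    val a = Vector.fill(rank)(rnd.nextFloat())
    val b = Vector.fill(rank)(rnd.nextFloat())
    var sink = 0.0f          // accumulates results so the work is not optimized away

    val forMs = averageMillis(10) {
      var k = 0
      while (k < iterations) { sink += dotFor(a, b, rank); k += 1 }
    }
    val whileMs = averageMillis(10) {
      var k = 0
      while (k < iterations) { sink += dotWhile(a, b, rank); k += 1 }
    }

    println(f"for loop:   $forMs%.3f ms")
    println(f"while loop: $whileMs%.3f ms")
    println(s"checksum: $sink")
  }
}
```

Running `DotBench.main(Array.empty)` prints the two averages. The absolute numbers depend on the machine and JVM, so they should not be read as reproducing the figures quoted in the conversation.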
How was this patch tested?
existing UTs