[SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot products #7222
Merged build triggered. |
Merged build started. |
The speedup is not that impressive, but I get roughly a 10x speedup averaged over 100 iterations. Configurations benchmarked:
Dot products: length 50000, n_values 500; length 500000, n_values 50000
squared_distance: length 50000, n_values 500; length 50000, n_values 5000
Test build #36529 has started for PR 7222 at commit |
Test build #36529 has finished for PR 7222 at commit
Merged build finished. Test FAILed. |
Merged build triggered. |
Merged build started. |
Test build #36530 has started for PR 7222 at commit |
Test build #36530 has finished for PR 7222 at commit
Merged build finished. Test PASSed. |
I verified that the benchmarks remain the same after the bug fix.
result += other.values[j] * other.values[j]
j += 1
return result
self_cmind = np.in1d(self.indices, other.indices)
Could we simply do the following?
return np.dot(self.values, self.values) + np.dot(other.values, other.values) - 2.0 * self.dot(other)
It might generate some numeric errors when both vectors are long and very close to each other. It should be sufficient for normal use cases.
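For reference, a minimal standalone sketch of the identity suggested above, ||a - b||^2 = a·a + b·b - 2 a·b. This is not the PySpark implementation: the function names and the bare (indices, values) array arguments are hypothetical stand-ins for SparseVector, and each index array is assumed to be sorted with no duplicates. The final subtraction is where the numeric error mentioned above would come from, since it can cancel badly when the two vectors are nearly equal.

```python
import numpy as np


def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors given as (indices, values) arrays.

    Assumes each index array is sorted and free of duplicates, so the
    masked values line up pairwise.
    """
    mask_a = np.isin(idx_a, idx_b)  # entries of a whose index also occurs in b
    mask_b = np.isin(idx_b, idx_a)  # entries of b whose index also occurs in a
    return float(np.dot(val_a[mask_a], val_b[mask_b]))


def squared_distance(idx_a, val_a, idx_b, val_b):
    """Squared Euclidean distance via a.a + b.b - 2 * a.b (hypothetical helper)."""
    return float(
        np.dot(val_a, val_a)
        + np.dot(val_b, val_b)
        - 2.0 * sparse_dot(idx_a, val_a, idx_b, val_b)
    )


# a = [1, 0, 0, 2, 0] and b = [0, 0, 0, 5, 1] in dense form
idx_a, val_a = np.array([0, 3]), np.array([1.0, 2.0])
idx_b, val_b = np.array([3, 4]), np.array([5.0, 1.0])
dist = squared_distance(idx_a, val_a, idx_b, val_b)
```

The result can be checked against the dense computation `np.sum((a - b) ** 2)` on small inputs like the one above.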
@MechCoder Thanks for the benchmark! The changes to |
@mengxr Thanks for the comments. I think we can merge this basic improvement for now and we can try out your idea and then reiterate on a new branch. wdyt? |
We should split this PR, keep only the changes to |
OK, done. I forgot that |
Merged build triggered. |
Merged build started. |
Test build #36689 has started for PR 7222 at commit |
Test build #36689 has finished for PR 7222 at commit
Merged build finished. Test PASSed. |
LGTM. Merged into master. Please create a new JIRA for |
Follow up for #5946
Currently we iterate over the indices and values in SparseVector element by element; these operations can be vectorized.
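To illustrate the kind of change described, here is a rough sketch comparing an element-by-element dot product with a vectorized one. It is not the code from this PR: the function names and the plain (indices, values) array arguments are hypothetical, and each index array is assumed to be sorted without duplicates. The diff under review uses `np.in1d`; the sketch uses `np.isin`, which gives the same result for 1-D input.

```python
import numpy as np


def sparse_dot_loop(idx_a, val_a, idx_b, val_b):
    """Two-pointer walk over both sorted index arrays, one element at a time."""
    result = 0.0
    i = j = 0
    while i < len(idx_a) and j < len(idx_b):
        if idx_a[i] == idx_b[j]:
            result += val_a[i] * val_b[j]
            i += 1
            j += 1
        elif idx_a[i] < idx_b[j]:
            i += 1
        else:
            j += 1
    return result


def sparse_dot_vectorized(idx_a, val_a, idx_b, val_b):
    """Same result, with the index matching done inside NumPy."""
    mask_a = np.isin(idx_a, idx_b)  # positions in a that share an index with b
    mask_b = np.isin(idx_b, idx_a)  # positions in b that share an index with a
    # Sorted, duplicate-free indices guarantee the selected values align.
    return float(np.dot(val_a[mask_a], val_b[mask_b]))


idx_a, val_a = np.array([0, 3, 5, 8]), np.array([1.0, 2.0, 3.0, 4.0])
idx_b, val_b = np.array([3, 4, 8]), np.array([10.0, 20.0, 30.0])
loop_result = sparse_dot_loop(idx_a, val_a, idx_b, val_b)
vec_result = sparse_dot_vectorized(idx_a, val_a, idx_b, val_b)
```

Both functions should agree on any input that meets the sorted-indices assumption; the vectorized one moves the matching work out of the Python interpreter loop.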