[SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist between SparseVector and DenseVector #5946
Conversation
@jkbradley I was also thinking of vectorizing the sparse dot. I came up with this. Can you think of anything better?
Test build #32010 has finished for PR 5946 at commit
Test build #32014 has finished for PR 5946 at commit
OK, now I'll take a look at this one. (But if you're busy & can't update, no problem)
Test build #778 has finished for PR 5946 at commit
@@ -430,13 +430,19 @@ def dot(self, other):

        assert len(self) == _vector_size(other), "dimension mismatch"

        if type(other) in (np.ndarray, array.array, DenseVector):
            if type(other) == array.array:
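The vectorized sparse-times-dense dot product this diff moves toward can be sketched as follows. This is a minimal standalone sketch using NumPy; `sparse_dense_dot` is a hypothetical helper for illustration, not the actual MLlib code:

```python
import numpy as np

def sparse_dense_dot(indices, values, dense):
    """Dot product of a sparse vector (given as parallel arrays of
    nonzero indices and values) with a dense 1-D numpy array.

    Replaces a Python-level loop over the nonzero entries with a
    single fancy-indexing gather plus one np.dot call.
    """
    # Gather only the dense entries at the sparse vector's nonzero
    # positions, then reduce them with a vectorized dot.
    return np.dot(values, dense[indices])

# Example: sparse vector [0, 2, 0, 4] dotted with [1, 1, 1, 1]
indices = np.array([1, 3])
values = np.array([2.0, 4.0])
dense = np.ones(4)
result = sparse_dense_dot(indices, values, dense)  # 2.0 + 4.0 = 6.0
```

The gather `dense[indices]` and the `np.dot` reduction both run in C, which is where the speedup over an element-by-element Python loop comes from.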
Should this use isinstance too?
Please make sure unit/doc tests test all of these types. (I think ndarray is missing from one.) Can you please run some timing tests to ensure the changes do speed things up? Local ones should suffice. Thanks!
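For context on the `isinstance` suggestion above: a `type(...) ==` check rejects subclasses, while `isinstance` accepts them, which is usually what dispatch code wants. A small illustration (the `MyArray` subclass is invented for the example):

```python
import numpy as np

class MyArray(np.ndarray):
    """A trivial ndarray subclass, only for demonstration."""
    pass

a = np.asarray([1.0, 2.0]).view(MyArray)

# An exact-type check does not recognize the subclass:
exact_match = (type(a) == np.ndarray)      # False
# isinstance follows the class hierarchy:
instance_match = isinstance(a, np.ndarray) # True
```

With the exact-type check, an ndarray subclass passed into `dot` would silently fall through to a different (and likely slower or failing) code path.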
@jkbradley I tested it with a number of situations using this script, but surprisingly there is no noticeable improvement from vectorizing over the indices and values.
Test build #33211 has finished for PR 5946 at commit
@MechCoder Could you post your results here?
@MechCoder Sorry for the long delay! Responding to your first question which I must have missed before, I am worried about trying to vectorize sparse operations. If you can't see significant speedups, it might make sense to leave those parts as they are. It would be great if you could post timing results here--thanks!
Oh just a second. I do get some improvements. I was benching the wrong things in the previous comment. These are averaged over 10 runs.

Dot product:
- A slightly less optimistic sparse vector of length 50000 and 500 values and a random DenseVector of the same length
- With length 50000 and 5000 values

Squared distance:
- With length 50000 and 5000 values
- With length 500000 and 50000 values

Timings in master: 2.352340269088745

Looks like we have a winner here. Do you want me to bench on anything more specific?
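A benchmark along these lines could be reproduced with a small `timeit` harness. This is a hypothetical sketch, not the script used in the PR; the vector size and nonzero count mirror the ones quoted above:

```python
import timeit
import numpy as np

n, nnz = 50000, 5000
rng = np.random.RandomState(0)
# Random sorted nonzero positions and values for the sparse vector,
# plus a random dense vector of the same length.
indices = np.sort(rng.choice(n, nnz, replace=False))
values = rng.rand(nnz)
dense = rng.rand(n)

def loop_dot():
    # The old style: a Python-level loop over the nonzeros.
    return sum(values[i] * dense[indices[i]] for i in range(nnz))

def vec_dot():
    # The vectorized style: one gather and one np.dot.
    return np.dot(values, dense[indices])

t_loop = min(timeit.repeat(loop_dot, number=10, repeat=3))
t_vec = min(timeit.repeat(vec_dot, number=10, repeat=3))
print("loop: %.6fs  vectorized: %.6fs" % (t_loop, t_vec))
```

Taking the minimum over several repeats, as `timeit.repeat` allows, reduces noise from other processes on the machine.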
@jkbradley Meanwhile I have addressed your other comments as well.
Test build #36388 has finished for PR 5946 at commit
Well actually I removed the code that iterates on …
jenkins retest this please
Test build #36391 has finished for PR 5946 at commit
Test build #36394 has finished for PR 5946 at commit
Test build #36390 has finished for PR 5946 at commit
Test build #36397 has finished for PR 5946 at commit
Test build #36401 has finished for PR 5946 at commit
self.assertEquals(10.0, sv.dot(dv))
self.assertTrue(array_equal(array([3., 6., 9., 12.]), sv.dot(mat)))
self.assertEquals(30.0, dv.dot(dv))
self.assertTrue(array_equal(array([10., 20., 30., 40.]), dv.dot(mat)))
self.assertEquals(30.0, lst.dot(dv))
self.assertTrue(array_equal(array([10., 20., 30., 40.]), lst.dot(mat)))
self.assertTrue(7.0, sv.dot(arr))
assertEquals
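The line the reviewer flags, `self.assertTrue(7.0, sv.dot(arr))`, always passes regardless of what the dot product returns, because `assertTrue` treats its first argument as the condition and the second as the failure message. A minimal standalone demonstration:

```python
import unittest

# A bare TestCase instance is enough to call the assertion helpers.
tc = unittest.TestCase()

# assertTrue(condition, msg): 7.0 is truthy, so this "test" passes
# even though 7.0 is being compared against nothing at all.
tc.assertTrue(7.0, 999.0)

# assertEqual(first, second) actually compares the two values and
# would catch a regression in the dot product.
tc.assertEqual(7.0, 7.0)
```

That is why the review comment asks for `assertEquals` here: with `assertTrue`, a broken `sv.dot(arr)` would never fail the test.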
The benchmarks look good now for sure. Are those averaged over many iterations (say 100 or 1000)? Also, are your benchmarks the same for the latest version of the code?
Test build #36495 has finished for PR 5946 at commit
Yes, I confirm that the benchmarks are the same (posted below). Also, in the last commit I removed some overhead for ndim=2 numpy arrays.

Dot operations:
- Vector size 50000, values = 5000, iterations = 100
- Vector size 50000, values = 500, iterations = 100

Squared distance calculation:
- Vector size 50000, values = 5000, iterations = 100
- Vector size 50000, values = 500, iterations = 100

I can get almost a 100x speedup for the dot and a 500x speedup for the squared distances.
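The squared-distance speedup comes from the same idea as the dot product: expand ||s − d||² algebraically so that only the nonzero entries of the sparse vector need to be touched individually. A standalone sketch (`sparse_dense_sqdist` is a hypothetical helper, not the actual MLlib implementation):

```python
import numpy as np

def sparse_dense_sqdist(indices, values, dense):
    """Squared Euclidean distance between a sparse vector
    (indices, values) and a dense 1-D numpy array.

    Uses ||s - d||^2 = ||d||^2 - 2 * (s . d) + ||s||^2, where the
    cross term and ||s||^2 only involve the nonzero entries of s.
    """
    return (np.dot(dense, dense)
            - 2.0 * np.dot(values, dense[indices])
            + np.dot(values, values))

# Example: s = [0, 2, 0, 4], d = [1, 1, 1, 1]
# squared distance = 1 + 1 + 1 + 9 = 12
indices = np.array([1, 3])
values = np.array([2.0, 4.0])
dense = np.ones(4)
result = sparse_dense_sqdist(indices, values, dense)  # 12.0
```

All three terms are single vectorized reductions, which is consistent with the larger speedup reported for squared distance than for the plain dot.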
Also, I believe we can leverage some speedup on sparse.dot(sparse). We can do that after this PR is merged.
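One way the sparse.dot(sparse) case could be vectorized, as a sketch only (this assumes NumPy >= 1.15 for `return_indices`; it is not what this PR implements):

```python
import numpy as np

def sparse_sparse_dot(ind1, val1, ind2, val2):
    """Dot product of two sparse vectors, each given as sorted,
    unique nonzero indices with parallel value arrays.

    Intersects the two index sets in C instead of walking both
    index arrays in a Python loop.
    """
    # Positions present in both vectors; i1/i2 locate them in the
    # respective value arrays.
    _, i1, i2 = np.intersect1d(ind1, ind2, assume_unique=True,
                               return_indices=True)
    return np.dot(val1[i1], val2[i2])

# Example: overlap at indices 2 and 4 -> 2*10 + 3*30 = 110
a_ind, a_val = np.array([0, 2, 4]), np.array([1.0, 2.0, 3.0])
b_ind, b_val = np.array([2, 3, 4]), np.array([10.0, 20.0, 30.0])
result = sparse_sparse_dot(a_ind, a_val, b_ind, b_val)  # 110.0
```

A merge-style walk with `np.searchsorted` would be an alternative that avoids sorting work when one vector is much sparser than the other.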
LGTM, the improvements are awesome, merging into master, thanks!
Thanks for the merge!
…ducts

Follow up for #5946. Currently we iterate over indices and values in SparseVector and can be vectorized.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7222 from MechCoder/sparse_optim and squashes the following commits:

dcb51d3 [MechCoder] [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot product
Currently we iterate over indices, which can be vectorized.