
[SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist between SparseVector and DenseVector #5946

Closed
wants to merge 5 commits into from

Conversation

MechCoder
Contributor

Currently we iterate over the indices, which can be vectorized.

@MechCoder
Contributor Author

@jkbradley I was also thinking of vectorizing the sparse dot. I came up with this. Can you think of anything better?

def dot(self, other):
    # Mask entries whose index also appears in the other vector;
    # since SparseVector indices are sorted, the masked values align pairwise.
    cm1 = np.in1d(self.indices, other.indices)
    cm2 = np.in1d(other.indices, self.indices)
    return np.dot(self.values[cm1], other.values[cm2])
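For reference, a minimal standalone sketch of this in1d approach on two hypothetical sparse vectors (index/value arrays made up for illustration; it relies on both index arrays being sorted, which SparseVector guarantees):

```python
import numpy as np

# Hypothetical sparse vectors: sorted index arrays with matching value arrays.
ind_a = np.array([0, 3, 5, 9])
val_a = np.array([1.0, 2.0, 3.0, 4.0])
ind_b = np.array([3, 4, 9])
val_b = np.array([10.0, 20.0, 30.0])

# Boolean masks selecting entries whose index appears in the other vector.
cm1 = np.in1d(ind_a, ind_b)  # [False, True, False, True]
cm2 = np.in1d(ind_b, ind_a)  # [True, False, True]

# Because both index arrays are sorted, the masked values line up pairwise.
result = np.dot(val_a[cm1], val_b[cm2])  # 2*10 + 4*30 = 140.0
```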

@SparkQA

SparkQA commented May 6, 2015

Test build #32010 has finished for PR 5946 at commit c862759.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2015

Test build #32014 has finished for PR 5946 at commit 3bea3ea.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

OK, now I'll take a look at this one. (But if you're busy & can't update, no problem)

@SparkQA

SparkQA commented May 7, 2015

Test build #778 has finished for PR 5946 at commit 3bea3ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -430,13 +430,19 @@ def dot(self, other):

assert len(self) == _vector_size(other), "dimension mismatch"

if type(other) in (np.ndarray, array.array, DenseVector):
if type(other) == array.array:
Member
Should this use isinstance too?

@jkbradley
Member

Please make sure unit/doc tests test all of these types. (I think ndarray is missing from one.)

Can you please run some timing tests to ensure the changes do speed things up? Local ones should suffice.

Thanks!

@MechCoder
Contributor Author

@jkbradley I tested a number of situations using this script, but surprisingly there is no noticeable improvement, regardless of the number of indices and values.

from pyspark.mllib.linalg import SparseVector
import numpy as np
from time import time

# Two random sparse vectors in a 5M-dimensional space; indices must be sorted.
rng = np.random.RandomState(0)
ind1 = np.sort(rng.choice(5000000, 500000, replace=False))
ind2 = np.sort(rng.choice(5000000, 5000, replace=False))
val1 = rng.rand(500000)
val2 = rng.rand(5000)

v = SparseVector(5000000, ind1, val1)
v1 = SparseVector(5000000, ind2, val2)

# dot/squared_distance are the current implementations;
# dot1/squared_distance1 are the vectorized candidates in this branch.
t_ = 0.0
for i in xrange(10):
    t = time()
    tmp = v.dot(v1)
    t_ += time() - t
print t_

t_ = 0.0
for i in xrange(10):
    t = time()
    tmp = v.dot1(v1)
    t_ += time() - t
print t_

t_ = 0.0
for i in xrange(10):
    t = time()
    tmp = v.squared_distance(v1)
    t_ += time() - t
print t_

t_ = 0.0
for i in xrange(10):
    t = time()
    tmp = v.squared_distance1(v1)
    t_ += time() - t
print t_

@SparkQA

SparkQA commented May 21, 2015

Test build #33211 has finished for PR 5946 at commit 39d0051.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented May 21, 2015

@MechCoder Could you post your results here?

@jkbradley
Member

@MechCoder Sorry for the long delay!

Responding to your first question which I must have missed before, I am worried about trying to vectorize sparse operations. If you can't see significant speedups, it might make sense to leave those parts as they are.

It would be great if you could post timing results here--thanks!

@MechCoder
Contributor Author

Oh just a second. I do get some improvements. I was benching the wrong things in the previous comment.

These are averaged over 10 runs.

Dot product
For a random sparse vector of length 500000 with 50000 values and a random DenseVector of the same length:
Timings: 0.06 s in master
Timings: 0.0006 s in this branch

A slightly less optimistic case, a sparse vector of length 50000 with 500 values and a random DenseVector of the same length:
Timings: 0.0005 s in master
Timings: 2.39e-05 s in this branch

With length 50000 and 5000 values:
Timings: 0.0058 s in master
Timings: 7.47e-05 s in this branch

Squared distance
With length 50000 and 500 values:
Timings: 0.254 s in master
Timings: 0.0003 s in this branch

With length 50000 and 5000 values:
Timings: 0.2686 s in master
Timings: 0.0008 s in this branch

With length 500000 and 50000 values:
Timings: 2.3523 s in master
Timings: 0.0048 s in this branch

Looks like we have a winner here. Do you want me to bench on anything more specific?
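The core of the speedup can be sketched as follows. This is a minimal standalone sketch (not the PR's exact code; the function names are hypothetical): instead of looping over the sparse entries in Python, gather the dense entries at the sparse indices with NumPy fancy indexing and reduce in one vectorized call.

```python
import numpy as np

def sparse_dense_dot(indices, values, dense):
    # Gather the dense entries at the sparse indices, then take
    # one NumPy dot product instead of a Python-level loop.
    return np.dot(values, dense[indices])

def sparse_dense_sq_dist(indices, values, dense):
    # ||s - d||^2 = ||d||^2 - 2 * s.d + ||s||^2, all via vectorized ops;
    # entries of d outside the sparse indices contribute only to ||d||^2.
    return (np.dot(dense, dense)
            - 2.0 * np.dot(values, dense[indices])
            + np.dot(values, values))

indices = np.array([1, 3])
values = np.array([2.0, 4.0])
dense = np.array([1.0, 1.0, 1.0, 1.0])
sparse_dense_dot(indices, values, dense)      # 2*1 + 4*1 = 6.0
sparse_dense_sq_dist(indices, values, dense)  # [1,-1,1,-3] -> 1+1+1+9 = 12.0
```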

@MechCoder MechCoder changed the title [SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist [SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist for operations between SparseVector and DenseVector Jul 2, 2015
@MechCoder MechCoder changed the title [SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist for operations between SparseVector and DenseVector [SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist between SparseVector and DenseVector Jul 2, 2015
@MechCoder
Contributor Author

@jkbradley Meanwhile I have addressed your other comments as well.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36388 has finished for PR 5946 at commit c5772a9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

Well, actually I removed the code that iterates over array.array and list and used _convert_to_vector instead.

@MechCoder
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36391 has finished for PR 5946 at commit e8fb5ee.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36394 has finished for PR 5946 at commit e8fb5ee.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36390 has finished for PR 5946 at commit 4f213f9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36397 has finished for PR 5946 at commit 53272d7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36401 has finished for PR 5946 at commit bce2b07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

self.assertEquals(10.0, sv.dot(dv))
self.assertTrue(array_equal(array([3., 6., 9., 12.]), sv.dot(mat)))
self.assertEquals(30.0, dv.dot(dv))
self.assertTrue(array_equal(array([10., 20., 30., 40.]), dv.dot(mat)))
self.assertEquals(30.0, lst.dot(dv))
self.assertTrue(array_equal(array([10., 20., 30., 40.]), lst.dot(mat)))
self.assertTrue(7.0, sv.dot(arr))
Member
assertEquals

@jkbradley
Member

The benchmarks look good now for sure. Are those averaged over many iterations (say 100 or 1000)?

Also, are your benchmarks the same for the latest version of the code?

@SparkQA

SparkQA commented Jul 3, 2015

Test build #36495 has finished for PR 5946 at commit 034d086.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

Yes, I confirm that the benchmarks are the same (posted below). Also, in the last commit I removed some overhead for ndim=2 numpy arrays.

Dot operations:
Vector size 500000, values = 50000, iterations = 100
In this branch: 0.0006000828742980957
In master: 0.06196121454238892

Vector size 50000, values = 5000, iterations = 100
In this branch: 5.4757595062255856e-05
In master: 0.005893096923828125

Vector size 50000, values=500, iterations = 100
In this branch: 2.2442340850830077e-05
In master: 0.0006871128082275391

Squared distance calculation:
Vector size 500000, values = 50000, iterations = 100
In this branch: 0.0045609426498413085
In master: 2.4458935689926147

Vector size 50000, values = 5000, iterations = 100
In this branch: 0.0005040478706359864
In master: 0.2419515371322632

Vector size 50000, values=500, iterations = 100
In this branch: 0.0007156062126159667
In master: 0.24092751741409302

I can get almost a 100x speedup for the dot and 500x speedup for the squared distances.

@MechCoder
Contributor Author

Also, I believe we can get a similar speedup for sparse.dot(sparse). We can do that after this PR is merged.

@davies
Contributor

davies commented Jul 3, 2015

LGTM, the improvements are awesome, merging into master, thanks!

@asfgit asfgit closed this in f0fac2a Jul 3, 2015
@MechCoder
Contributor Author

Thanks for the merge!

@MechCoder MechCoder deleted the spark-7203 branch July 4, 2015 03:11
asfgit pushed a commit that referenced this pull request Jul 7, 2015
…ducts

Follow up for #5946

Currently we iterate over indices and values in SparseVector, which can be vectorized.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7222 from MechCoder/sparse_optim and squashes the following commits:

dcb51d3 [MechCoder] [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot product