
[SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist between SparseVector and DenseVector #5946

Closed
wants to merge 5 commits into from

Conversation

MechCoder
Contributor

Currently we iterate over the indices, which can be vectorized.

@MechCoder
Contributor Author

@jkbradley I was also thinking of vectorizing the sparse dot. I came up with this. Can you think of anything better?

def dot(self, other):
    # Mask entries whose index also appears in the other vector;
    # since SparseVector indices are sorted, the masked values align pairwise.
    cm1 = np.in1d(self.indices, other.indices)
    cm2 = np.in1d(other.indices, self.indices)
    return np.dot(self.values[cm1], other.values[cm2])
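For reference, a minimal standalone sketch of this in1d approach on two hypothetical sparse vectors (index/value arrays made up for illustration; it relies on both index arrays being sorted, which SparseVector guarantees):

```python
import numpy as np

# Hypothetical sparse vectors: sorted index arrays with matching value arrays.
ind_a = np.array([0, 3, 5, 9])
val_a = np.array([1.0, 2.0, 3.0, 4.0])
ind_b = np.array([3, 4, 9])
val_b = np.array([10.0, 20.0, 30.0])

# Boolean masks selecting entries whose index appears in the other vector.
cm1 = np.in1d(ind_a, ind_b)  # [False, True, False, True]
cm2 = np.in1d(ind_b, ind_a)  # [True, False, True]

# Because both index arrays are sorted, the masked values line up pairwise.
result = np.dot(val_a[cm1], val_b[cm2])  # 2*10 + 4*30 = 140.0
```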

@SparkQA

SparkQA commented May 6, 2015

Test build #32010 has finished for PR 5946 at commit c862759.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 6, 2015

Test build #32014 has finished for PR 5946 at commit 3bea3ea.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

OK, now I'll take a look at this one. (But if you're busy & can't update, no problem)

@SparkQA

SparkQA commented May 7, 2015

Test build #778 has finished for PR 5946 at commit 3bea3ea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -430,13 +430,19 @@ def dot(self, other):

assert len(self) == _vector_size(other), "dimension mismatch"

if type(other) in (np.ndarray, array.array, DenseVector):
if type(other) == array.array:
Member
Should this use isinstance too?

@jkbradley
Member

Please make sure unit/doc tests test all of these types. (I think ndarray is missing from one.)

Can you please run some timing tests to ensure the changes do speed things up? Local ones should suffice.

Thanks!

@MechCoder
Contributor Author

@jkbradley I tested a number of situations using this script, but surprisingly there is no noticeable improvement, regardless of the number of indices and values.

from pyspark.mllib.linalg import SparseVector
import numpy as np
from time import time

# Two random sparse vectors in a 5M-dimensional space; indices must be sorted.
rng = np.random.RandomState(0)
ind1 = np.sort(rng.choice(5000000, 500000, replace=False))
ind2 = np.sort(rng.choice(5000000, 5000, replace=False))
val1 = rng.rand(500000)
val2 = rng.rand(5000)

v = SparseVector(5000000, ind1, val1)
v1 = SparseVector(5000000, ind2, val2)

# dot/squared_distance are the current implementations;
# dot1/squared_distance1 are the vectorized candidates in this branch.
t_ = 0.0
for i in xrange(10):
    t = time()
    tmp = v.dot(v1)
    t_ += time() - t
print t_

t_ = 0.0
for i in xrange(10):
    t = time()
    tmp = v.dot1(v1)
    t_ += time() - t
print t_

t_ = 0.0
for i in xrange(10):
    t = time()
    tmp = v.squared_distance(v1)
    t_ += time() - t
print t_

t_ = 0.0
for i in xrange(10):
    t = time()
    tmp = v.squared_distance1(v1)
    t_ += time() - t
print t_

@SparkQA

SparkQA commented May 21, 2015

Test build #33211 has finished for PR 5946 at commit 39d0051.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented May 21, 2015

@MechCoder Could you post your results here?

@jkbradley
Member

@MechCoder Sorry for the long delay!

Responding to your first question which I must have missed before, I am worried about trying to vectorize sparse operations. If you can't see significant speedups, it might make sense to leave those parts as they are.

It would be great if you could post timing results here--thanks!

@MechCoder
Contributor Author

Oh just a second. I do get some improvements. I was benching the wrong things in the previous comment.

These are averaged over 10 runs.

Dot product
For a random sparse vector of length 500000 with 50000 values and a random DenseVector of the same length:
Timings: 0.06 s in master
Timings: 0.0006 s in this branch

A slightly less optimistic case, a sparse vector of length 50000 with 500 values and a random DenseVector of the same length:
Timings: 0.0005 s in master
Timings: 2.39e-05 s in this branch

With length 50000 and 5000 values:
Timings: 0.0058 s in master
Timings: 7.47e-05 s in this branch

Squared distance
With length 50000 and 500 values:
Timings: 0.254 s in master
Timings: 0.0003 s in this branch

With length 50000 and 5000 values:
Timings: 0.2686 s in master
Timings: 0.0008 s in this branch

With length 500000 and 50000 values:
Timings: 2.3523 s in master
Timings: 0.0048 s in this branch

Looks like we have a winner here. Do you want me to bench on anything more specific?
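The core of the speedup can be sketched as follows. This is a minimal standalone sketch (not the PR's exact code; the function names are hypothetical): instead of looping over the sparse entries in Python, gather the dense entries at the sparse indices with NumPy fancy indexing and reduce in one vectorized call.

```python
import numpy as np

def sparse_dense_dot(indices, values, dense):
    # Gather the dense entries at the sparse indices, then take
    # one NumPy dot product instead of a Python-level loop.
    return np.dot(values, dense[indices])

def sparse_dense_sq_dist(indices, values, dense):
    # ||s - d||^2 = ||d||^2 - 2 * s.d + ||s||^2, all via vectorized ops;
    # entries of d outside the sparse indices contribute only to ||d||^2.
    return (np.dot(dense, dense)
            - 2.0 * np.dot(values, dense[indices])
            + np.dot(values, values))

indices = np.array([1, 3])
values = np.array([2.0, 4.0])
dense = np.array([1.0, 1.0, 1.0, 1.0])
sparse_dense_dot(indices, values, dense)      # 2*1 + 4*1 = 6.0
sparse_dense_sq_dist(indices, values, dense)  # [1,-1,1,-3] -> 1+1+1+9 = 12.0
```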

@MechCoder MechCoder changed the title [SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist [SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist for operations between SparseVector and DenseVector Jul 2, 2015
@MechCoder MechCoder changed the title [SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist for operations between SparseVector and DenseVector [SPARK-7401] [MLlib] [PySpark] Vectorize dot product and sq_dist between SparseVector and DenseVector Jul 2, 2015
@MechCoder
Contributor Author

@jkbradley Meanwhile I have addressed your other comments as well.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36388 has finished for PR 5946 at commit c5772a9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

Well, actually I removed the code that iterates over array.array and list and used _convert_to_vector instead.

@MechCoder
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36391 has finished for PR 5946 at commit e8fb5ee.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36394 has finished for PR 5946 at commit e8fb5ee.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36390 has finished for PR 5946 at commit 4f213f9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36397 has finished for PR 5946 at commit 53272d7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2015

Test build #36401 has finished for PR 5946 at commit bce2b07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

self.assertEquals(10.0, sv.dot(dv))
self.assertTrue(array_equal(array([3., 6., 9., 12.]), sv.dot(mat)))
self.assertEquals(30.0, dv.dot(dv))
self.assertTrue(array_equal(array([10., 20., 30., 40.]), dv.dot(mat)))
self.assertEquals(30.0, lst.dot(dv))
self.assertTrue(array_equal(array([10., 20., 30., 40.]), lst.dot(mat)))
self.assertTrue(7.0, sv.dot(arr))
Member
assertEquals

@jkbradley
Member

The benchmarks look good now for sure. Are those averaged over many iterations (say 100 or 1000)?

Also, are your benchmarks the same for the latest version of the code?

@SparkQA

SparkQA commented Jul 3, 2015

Test build #36495 has finished for PR 5946 at commit 034d086.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

Yes, I confirm that the benchmarks are the same (posted below). Also, in the last commit I removed some overhead for ndim=2 numpy arrays.

Dot operations:
Vector size 500000, values = 50000, iterations = 100
In this branch: 0.0006000828742980957
In master: 0.06196121454238892

Vector size 50000, values = 5000, iterations = 100
In this branch: 5.4757595062255856e-05
In master: 0.005893096923828125

Vector size 50000, values=500, iterations = 100
In this branch: 2.2442340850830077e-05
In master: 0.0006871128082275391

Squared distance calculation:
Vector size 500000, values = 50000, iterations = 100
In this branch: 0.0045609426498413085
In master: 2.4458935689926147

Vector size 50000, values = 5000, iterations = 100
In this branch: 0.0005040478706359864
In master: 0.2419515371322632

Vector size 50000, values=500, iterations = 100
In this branch: 0.0007156062126159667
In master: 0.24092751741409302

I can get almost a 100x speedup for the dot and 500x speedup for the squared distances.

@MechCoder
Contributor Author

Also, I believe we can get a similar speedup for sparse.dot(sparse). We can do that after this PR is merged.

@davies
Contributor

davies commented Jul 3, 2015

LGTM, the improvements are awesome, merging into master, thanks!

@asfgit asfgit closed this in f0fac2a Jul 3, 2015
@MechCoder
Contributor Author

Thanks for the merge!

@MechCoder MechCoder deleted the spark-7203 branch July 4, 2015 03:11
asfgit pushed a commit that referenced this pull request Jul 7, 2015
…ducts

Follow up for #5946

Currently we iterate over indices and values in SparseVector, which can be vectorized.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7222 from MechCoder/sparse_optim and squashes the following commits:

dcb51d3 [MechCoder] [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot product