
[SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot products #7222

Closed
wants to merge 1 commit into from

Conversation

MechCoder
Contributor

Follow-up to #5946.

Currently we iterate over the indices and values of a SparseVector in a Python loop; this can be vectorized with NumPy.
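For context, the vectorized approach can be sketched as below. This is an illustrative helper (`sparse_dot` is a hypothetical name, not the actual patch), assuming each vector is given as a sorted, duplicate-free index array plus a value array:

```python
import numpy as np

def sparse_dot(ind_a, val_a, ind_b, val_b):
    """Dot product of two sparse vectors, each given as a (sorted index
    array, value array) pair, with no Python-level loop."""
    # Boolean masks of the positions whose indices occur in both vectors.
    a_in_b = np.in1d(ind_a, ind_b)
    b_in_a = np.in1d(ind_b, ind_a)
    # Because both index arrays are sorted, the masked values line up
    # pairwise, so a single vectorized dot finishes the job.
    return np.dot(val_a[a_in_b], val_b[b_in_a])
```

The alignment step relies on the SparseVector invariant that indices are sorted and unique; without it the masked value arrays would not correspond element by element.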

@MechCoder MechCoder changed the title [SPARK-8823] Optimizations for SparseVector dot products [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot products Jul 4, 2015
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@MechCoder
Contributor Author

The speedup is not dramatic in every case, but averaged over 100 iterations I get roughly a 10x speedup:

Dot products (seconds, averaged over 100 iterations)

Length 50000, n_values 5000
In master: 0.0315
In this branch: 0.0016

Length 50000, n_values 500
In master: 0.0033
In this branch: 0.00065

Length 500000, n_values 50000
In master: 0.0463
In this branch: 0.0146

squared_distance (seconds, averaged over 100 iterations)

Length 500000, n_values 50000
In master: 0.158
In this branch: 0.0178

Length 50000, n_values 500
In master: 0.0017
In this branch: 0.00075

Length 50000, n_values 5000
In master: 0.0158
In this branch: 0.0017
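A harness in the spirit of the numbers above might look like the following. This is a hypothetical sketch (the PR does not include the benchmark script); `random_sparse` is an assumed helper, and absolute times will differ by machine:

```python
import timeit
import numpy as np

def random_sparse(length, n_values, rng):
    # Sorted, duplicate-free indices with random values,
    # mimicking a SparseVector of the given length and density.
    indices = np.sort(rng.choice(length, size=n_values, replace=False))
    values = rng.standard_normal(n_values)
    return indices, values

rng = np.random.default_rng(0)
ia, va = random_sparse(50000, 5000, rng)
ib, vb = random_sparse(50000, 5000, rng)

def vectorized_dot():
    return np.dot(va[np.in1d(ia, ib)], vb[np.in1d(ib, ia)])

# Average wall-clock time per call over 100 iterations, as in the
# figures quoted above.
elapsed = timeit.timeit(vectorized_dot, number=100) / 100
```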

@SparkQA

SparkQA commented Jul 4, 2015

Test build #36529 has started for PR 7222 at commit 4030782.

@SparkQA

SparkQA commented Jul 4, 2015

Test build #36529 has finished for PR 7222 at commit 4030782.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 4, 2015

Test build #36530 has started for PR 7222 at commit 68fd92f.

@SparkQA

SparkQA commented Jul 4, 2015

Test build #36530 has finished for PR 7222 at commit 68fd92f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@MechCoder
Contributor Author

I verified that the benchmarks remain the same even after the bug fix.

Diff context under review:

    result += other.values[j] * other.values[j]
    j += 1
    return result
    self_cmind = np.in1d(self.indices, other.indices)
Contributor

Could we simply do the following?

return np.dot(self.values, self.values) + np.dot(other.values, other.values) - 2.0 * self.dot(other)

It might generate some numeric errors when both vectors are long and very close to each other. It should be sufficient for normal use cases.
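The suggestion relies on the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y. A minimal sketch of both forms for dense NumPy arrays (illustrative only, not the MLlib code), including the numerically safer direct version for comparison:

```python
import numpy as np

def squared_distance_via_identity(x, y):
    # ||x - y||^2 = x.x + y.y - 2 x.y. Cheap (reuses dot), but can lose
    # precision when x and y are long and nearly equal, because two large
    # quantities are subtracted (catastrophic cancellation).
    return np.dot(x, x) + np.dot(y, y) - 2.0 * np.dot(x, y)

def squared_distance_direct(x, y):
    # Reference implementation: subtract first, then square.
    d = x - y
    return np.dot(d, d)
```

For vectors that are not nearly identical, the two agree to within ordinary floating-point tolerance, which is the "sufficient for normal use cases" point made above.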

@mengxr
Contributor

mengxr commented Jul 7, 2015

@MechCoder Thanks for the benchmark! The changes to dot look good to me, except for an extra option we should try. I would recommend replacing the squared_distance implementation with a simpler version. If that takes longer to discuss, we can split this PR into two.

@MechCoder
Contributor Author

@mengxr Thanks for the comments. I think we can merge this basic improvement for now, try out your idea, and then iterate on a new branch. wdyt?

@mengxr
Contributor

mengxr commented Jul 7, 2015

We should split this PR, keep only the changes to dot, and test the assume_unique flag. In a separate PR, let's test the performance of squared_distance. The current implementation of squared_distance is not trivial to understand; we should iterate more before merging.

@MechCoder
Contributor Author

OK, done. I had forgotten that assume_unique=True means that each array must contain unique elements, not that the two arrays must be different from one another.
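To illustrate the assume_unique semantics mentioned above (it asserts that each input array is internally duplicate-free, which holds for SparseVector indices, and says nothing about the two arrays differing):

```python
import numpy as np

a = np.array([0, 2, 5, 9])   # sorted, duplicate-free indices
b = np.array([2, 5, 7])      # also duplicate-free

# Safe here: both inputs contain unique elements, so NumPy may take
# the faster set-membership path without changing the result.
mask = np.in1d(a, b, assume_unique=True)
# mask -> array([False,  True,  True, False])
```

If either array contained duplicates, passing assume_unique=True could silently return wrong results, which is why the flag needs testing rather than blind enabling.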

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36689 has started for PR 7222 at commit dcb51d3.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36689 has finished for PR 7222 at commit dcb51d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@mengxr
Contributor

mengxr commented Jul 7, 2015

LGTM. Merged into master. Please create a new JIRA for squared_distance. Thanks!

@asfgit asfgit closed this in 738c107 Jul 7, 2015
@MechCoder MechCoder deleted the sparse_optim branch July 7, 2015 16:23