
[SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot products #7222

Closed
wants to merge 1 commit into from

Conversation

MechCoder
Contributor

Follow-up to #5946.

Currently we iterate over the indices and values of a SparseVector in a Python loop; this can be vectorized with NumPy.
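For context, the vectorized approach can be sketched as below. This is an illustrative helper (`sparse_dot` is a hypothetical name, not the actual patch), assuming each vector is given as a sorted, duplicate-free index array plus a value array:

```python
import numpy as np

def sparse_dot(ind_a, val_a, ind_b, val_b):
    """Dot product of two sparse vectors, each given as a (sorted index
    array, value array) pair, with no Python-level loop."""
    # Boolean masks of the positions whose indices occur in both vectors.
    a_in_b = np.in1d(ind_a, ind_b)
    b_in_a = np.in1d(ind_b, ind_a)
    # Because both index arrays are sorted, the masked values line up
    # pairwise, so a single vectorized dot finishes the job.
    return np.dot(val_a[a_in_b], val_b[b_in_a])
```

The alignment step relies on the SparseVector invariant that indices are sorted and unique; without it the masked value arrays would not correspond element by element.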

@MechCoder MechCoder changed the title [SPARK-8823] Optimizations for SparseVector dot products [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot products Jul 4, 2015
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@MechCoder
Contributor Author

The speedup is not dramatic in every case, but averaged over 100 iterations I get roughly a 10x speedup:

Dot products (seconds, averaged over 100 iterations)

Length 50000, n_values 5000
In master: 0.0315
In this branch: 0.0016

Length 50000, n_values 500
In master: 0.0033
In this branch: 0.00065

Length 500000, n_values 50000
In master: 0.0463
In this branch: 0.0146

squared_distance (seconds, averaged over 100 iterations)

Length 500000, n_values 50000
In master: 0.158
In this branch: 0.0178

Length 50000, n_values 500
In master: 0.0017
In this branch: 0.00075

Length 50000, n_values 5000
In master: 0.0158
In this branch: 0.0017
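A harness in the spirit of the numbers above might look like the following. This is a hypothetical sketch (the PR does not include the benchmark script); `random_sparse` is an assumed helper, and absolute times will differ by machine:

```python
import timeit
import numpy as np

def random_sparse(length, n_values, rng):
    # Sorted, duplicate-free indices with random values,
    # mimicking a SparseVector of the given length and density.
    indices = np.sort(rng.choice(length, size=n_values, replace=False))
    values = rng.standard_normal(n_values)
    return indices, values

rng = np.random.default_rng(0)
ia, va = random_sparse(50000, 5000, rng)
ib, vb = random_sparse(50000, 5000, rng)

def vectorized_dot():
    return np.dot(va[np.in1d(ia, ib)], vb[np.in1d(ib, ia)])

# Average wall-clock time per call over 100 iterations, as in the
# figures quoted above.
elapsed = timeit.timeit(vectorized_dot, number=100) / 100
```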

@SparkQA

SparkQA commented Jul 4, 2015

Test build #36529 has started for PR 7222 at commit 4030782.

@SparkQA

SparkQA commented Jul 4, 2015

Test build #36529 has finished for PR 7222 at commit 4030782.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 4, 2015

Test build #36530 has started for PR 7222 at commit 68fd92f.

@SparkQA

SparkQA commented Jul 4, 2015

Test build #36530 has finished for PR 7222 at commit 68fd92f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@MechCoder
Contributor Author

I verified that the benchmarks remain the same even after the bug fix.

Diff context under review:

    result += other.values[j] * other.values[j]
    j += 1
    return result
    self_cmind = np.in1d(self.indices, other.indices)
Contributor

Could we simply do the following?

return np.dot(self.values, self.values) + np.dot(other.values, other.values) - 2.0 * self.dot(other)

It might generate some numeric errors when both vectors are long and very close to each other. It should be sufficient for normal use cases.
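The suggestion relies on the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y. A minimal sketch of both forms for dense NumPy arrays (illustrative only, not the MLlib code), including the numerically safer direct version for comparison:

```python
import numpy as np

def squared_distance_via_identity(x, y):
    # ||x - y||^2 = x.x + y.y - 2 x.y. Cheap (reuses dot), but can lose
    # precision when x and y are long and nearly equal, because two large
    # quantities are subtracted (catastrophic cancellation).
    return np.dot(x, x) + np.dot(y, y) - 2.0 * np.dot(x, y)

def squared_distance_direct(x, y):
    # Reference implementation: subtract first, then square.
    d = x - y
    return np.dot(d, d)
```

For vectors that are not nearly identical, the two agree to within ordinary floating-point tolerance, which is the "sufficient for normal use cases" point made above.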

@mengxr
Contributor

mengxr commented Jul 7, 2015

@MechCoder Thanks for the benchmark! The changes to dot look good to me, except for an extra option we should try. I would recommend replacing the squared_distance implementation with a simpler version. If that takes longer to discuss, we can split this PR into two.

@MechCoder
Contributor Author

@mengxr Thanks for the comments. I think we can merge this basic improvement for now, try out your idea, and then iterate on a new branch. wdyt?

@mengxr
Contributor

mengxr commented Jul 7, 2015

We should split this PR, keep only the changes to dot, and test the assume_unique flag. In a separate PR, let's test the performance of squared_distance. The current implementation of squared_distance is not trivial to understand; we should iterate more before merging.

@MechCoder
Contributor Author

OK, done. I had forgotten that assume_unique=True means that each array must contain unique elements, not that the two arrays must be different from one another.
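To illustrate the assume_unique semantics mentioned above (it asserts that each input array is internally duplicate-free, which holds for SparseVector indices, and says nothing about the two arrays differing):

```python
import numpy as np

a = np.array([0, 2, 5, 9])   # sorted, duplicate-free indices
b = np.array([2, 5, 7])      # also duplicate-free

# Safe here: both inputs contain unique elements, so NumPy may take
# the faster set-membership path without changing the result.
mask = np.in1d(a, b, assume_unique=True)
# mask -> array([False,  True,  True, False])
```

If either array contained duplicates, passing assume_unique=True could silently return wrong results, which is why the flag needs testing rather than blind enabling.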

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36689 has started for PR 7222 at commit dcb51d3.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36689 has finished for PR 7222 at commit dcb51d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@mengxr
Contributor

mengxr commented Jul 7, 2015

LGTM. Merged into master. Please create a new JIRA for squared_distance. Thanks!

@asfgit asfgit closed this in 738c107 Jul 7, 2015
@MechCoder MechCoder deleted the sparse_optim branch July 7, 2015 16:23