[MLlib] [SPARK-2885] DIMSUM: All-pairs similarity #1778

Closed: rezazadeh wants to merge 33 commits into base: master from rezazadeh:dimsumv2
9 participants
@rezazadeh
Contributor

rezazadeh commented Aug 5, 2014

All-pairs similarity via DIMSUM

Compute all pairs of similar vectors using a brute-force approach, and also using the DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.
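As a small worked example of the oversampling parameter, the snippet below evaluates ɣ = 4 log(n)/s for a few thresholds. It assumes the natural logarithm, which the description does not state explicitly, and the function name `oversampling_gamma` is made up for illustration.

```python
import math

def oversampling_gamma(n_cols: int, threshold: float) -> float:
    """Return 4 * ln(n) / s. The log base (natural log) is an assumption."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return 4.0 * math.log(n_cols) / threshold

# Lower similarity thresholds ask for a larger oversampling parameter.
for s in (0.1, 0.5, 0.9):
    print(s, round(oversampling_gamma(1000, s), 1))
```

The takeaway is only the trend: ɣ grows logarithmically with the number of columns n and inversely with the threshold s.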

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

[Figure "dimsumv2": the DIMSUM mapper and reducer]
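Since the figure is not reproduced in this text, here is a minimal single-machine sketch of one commonly described formulation of the sampling mapper with a summation reducer. The sampling rule (keep column j with probability min(1, sqrt(ɣ)/||c_j||) and divide by min(sqrt(ɣ), ||c_j||)) is an assumption, not a transcription of the figure or of the code in this PR, and all function names are hypothetical.

```python
import math
import random
from collections import defaultdict

def column_norms(rows, n_cols):
    """L2 norm of each column of a dense, row-major matrix."""
    sq = [0.0] * n_cols
    for row in rows:
        for j, a in enumerate(row):
            sq[j] += a * a
    return [math.sqrt(x) for x in sq]

def dimsum_mapper(row, norms, gamma, rng):
    """Emit ((j, k), value) for sampled pairs of nonzero entries, j < k."""
    sg = math.sqrt(gamma)
    nz = [(j, a) for j, a in enumerate(row) if a != 0.0]
    for idx, (j, a_j) in enumerate(nz):
        # Assumed rule: keep entry j with probability min(1, sqrt(gamma)/||c_j||).
        if rng.random() >= min(1.0, sg / norms[j]):
            continue
        for k, a_k in nz[idx + 1:]:
            if rng.random() < min(1.0, sg / norms[k]):
                # The scaling cancels the sampling probability in expectation,
                # leaving a_ij * a_ik / (||c_j|| * ||c_k||).
                scale = min(sg, norms[j]) * min(sg, norms[k])
                yield (j, k), a_j * a_k / scale

def dimsum(rows, n_cols, gamma, seed=0):
    """Run the mapper over every row and sum the emissions per column pair."""
    rng = random.Random(seed)
    norms = column_norms(rows, n_cols)
    sums = defaultdict(float)
    for row in rows:
        for key, value in dimsum_mapper(row, norms, gamma, rng):
            sums[key] += value  # the summation reducer
    return dict(sums)

rows = [[1.0, 0.0, 2.0], [0.0, 3.0, 4.0], [5.0, 6.0, 0.0]]
exact = dimsum(rows, 3, float("inf"))  # no sampling: exact cosine similarities
approx = dimsum(rows, 3, 4.0)          # sampled estimate
```

With ɣ set to infinity every pair is kept and the sums are the exact cosine similarities. With a finite ɣ, each emission is kept with probability p_j p_k and divided by a factor chosen so that its expected value is a_ij a_ik / (||c_j|| ||c_k||), which sums over rows to the cosine similarity of columns j and k.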

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082

Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.
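For readers unfamiliar with the two norms, the following sketch computes per-column L1 and L2 norms in a single pass over the rows. It is plain Python for illustration only, not the MLlib implementation, and the function name is hypothetical.

```python
import math

def column_norms_l1_l2(rows, n_cols):
    """Return (l1, l2) lists holding the L1 and L2 norm of each column."""
    l1 = [0.0] * n_cols
    sq = [0.0] * n_cols
    for row in rows:
        for j, a in enumerate(row):
            l1[j] += abs(a)   # L1: sum of absolute values
            sq[j] += a * a    # L2: square root of the sum of squares
    return l1, [math.sqrt(x) for x in sq]

l1, l2 = column_norms_l1_l2([[3.0, -1.0], [4.0, 2.0]], 2)
```

Both accumulators are simple running sums, so they can be updated one row at a time and merged across partitions by addition before the final square root.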

SparkQA commented Aug 5, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17926/consoleFull

SparkQA commented Aug 5, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17926/consoleFull

SparkQA commented Aug 5, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17929/consoleFull

SparkQA commented Aug 5, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17929/consoleFull

Contributor

rezazadeh commented Aug 5, 2014

The binary backwards compatibility check doesn't like adding a new method to the trait MultivariateStatisticalSummary. Suggestions on binary compatibility welcome, @mengxr

Member

srowen commented Aug 5, 2014

As a meta-question, what's the theory about what implementations should go into Spark, and which should be external? Not everything needs to be in a "core" library like MLlib. I know Mahout suffered mightily from adding a lot of implementations without much regard to their use or support. I'm not suggesting anything either way about this algorithm. If there's a working theory about what's in and out of scope, I'd love to see it and maybe make sure that people don't implement things for contribution that are too niche.

Contributor

rezazadeh commented Aug 5, 2014

Having all-pairs similarity in Spark has been requested several times, e.g. http://bit.ly/XAFGs8, and also by @freeman-lab. This algorithm is also a part of Scalding: twitter/scalding#833

Contributor

pwendell commented Aug 5, 2014

@rezazadeh mind putting [MLlib] in the title here? That way it gets sorted correctly by our internal review tools.

@rezazadeh rezazadeh changed the title from DIMSUM: Dimension Independent Matrix Square using Mapreduce to [MLlib] DIMSUM: Dimension Independent Matrix Square using Mapreduce Aug 5, 2014

Contributor

freeman-lab commented Aug 6, 2014

@srowen agreed the core vs external library question is important. The requirements here seem reasonable, but there's still gray area. For example, we have lots of analyses that are known / accepted but should I think remain external because they are for specific data types (images & time series). Re: this particular algorithm, it's definitely something we're interested in using, sounds like others are too.

Contributor

mengxr commented Aug 6, 2014

@rezazadeh Do you mind creating a JIRA for this and then adding [SPARK-####] to the title? We also want to learn more about the theory, especially the relation between storage/computation complexity and failure rate.

Btw, to me, finding similar rows (observations) is more natural than finding similar columns.

@rezazadeh rezazadeh changed the title from [MLlib] DIMSUM: Dimension Independent Matrix Square using Mapreduce to [MLlib] [SPARK-2885] DIMSUM: Dimension Independent Matrix Square using Mapreduce Aug 6, 2014

SparkQA commented Aug 6, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18061/consoleFull

Contributor

rezazadeh commented Aug 6, 2014

@mengxr Updated the PR to compute column magnitude as a method in RowMatrix so that binary compatibility shouldn't be a problem. This allowed me to use breeze too, which should take advantage of hardware acceleration when possible.

@srowen @freeman-lab @mengxr I added a JIRA for this PR, and clearly laid out why it is worthwhile adding this functionality to MLlib. https://issues.apache.org/jira/browse/SPARK-2885

SparkQA commented Aug 6, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18061/consoleFull

SparkQA commented Aug 6, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18076/consoleFull

SparkQA commented Aug 7, 2014

QA results for PR 1778:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18076/consoleFull

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit aea0247.

  • This patch does not merge cleanly!
Merge remote-tracking branch 'upstream/master' into dimsumv2
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 3467cff.

  • This patch merges cleanly.

Contributor

rezazadeh commented Sep 26, 2014

@mengxr I also added broadcasting of p and v to further optimize space usage. Also now we're avoiding divide by zero if there is a column with zero magnitude.
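To illustrate the divide-by-zero point, here is a hedged sketch of per-column lookup tables that are computed once and then shared read-only with every mapper, which is the role broadcasting plays in Spark. The comment does not define p and v, so the two tables below (a keep probability and a scaling factor per column) and the function name are assumptions for illustration.

```python
import math

def per_column_tables(norms, gamma):
    """Build per-column sampling tables, guarding against zero-magnitude columns."""
    sg = math.sqrt(gamma)
    # A zero-magnitude column has no nonzero entries, so it is never sampled;
    # substituting 1.0 only keeps the divisions below well defined.
    safe = [c if c != 0.0 else 1.0 for c in norms]
    keep_prob = [min(1.0, sg / c) for c in safe]
    scale = [min(sg, c) for c in safe]
    return keep_prob, scale

keep_prob, scale = per_column_tables([0.0, 2.0, 8.0], 16.0)
```

Computing these tables once per matrix, rather than once per row, means each mapper only performs lookups.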


SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit aea0247.

  • This patch fails unit tests.
  • This patch does not merge cleanly!

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 3467cff.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

mengxr commented Sep 26, 2014

test this please


SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit ee8bd65.

  • This patch merges cleanly.

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit ee8bd65.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

rezazadeh commented Sep 26, 2014

Only the binary compatibility test is failing, which is expected.

Contributor

mengxr commented Sep 26, 2014

@rezazadeh Could you set the exclusion rules in dev/MimaExcludes.scala?

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 4eb71c6.

  • This patch merges cleanly.

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 4eb71c6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

rezazadeh commented Sep 26, 2014

Jenkins, test this please.

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 4eb71c6.

  • This patch merges cleanly.

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 4eb71c6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 404c64c.

  • This patch merges cleanly.

SparkQA commented Sep 27, 2014

QA tests have finished for PR 1778 at commit 404c64c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

mengxr commented Sep 29, 2014

LGTM. Merged into master! Thanks @rezazadeh !

@asfgit asfgit closed this in 587a0cd Sep 29, 2014

Contributor

rezazadeh commented Sep 29, 2014

Thanks for the review @mengxr !

@rezazadeh rezazadeh deleted the rezazadeh:dimsumv2 branch Sep 30, 2014

dgshep pushed a commit to dgshep/spark that referenced this pull request Dec 8, 2014

[MLlib] [SPARK-2885] DIMSUM: All-pairs similarity
# All-pairs similarity via DIMSUM
Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png)

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082

# Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.

Author: Reza Zadeh <rizlar@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits:

404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2
ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero.
3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity
9fe17c0 [Xiangrui Meng] organize imports
2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2
254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
f2947e4 [Xiangrui Meng] some optimization
3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2
0e4eda4 [Reza Zadeh] Use partition index for RNG
251bb9c [Reza Zadeh] Documentation
25e9d0d [Reza Zadeh] Line length for style
fb296f6 [Reza Zadeh] renamed to normL1 and normL2
3764983 [Reza Zadeh] Documentation
e9c6791 [Reza Zadeh] New interface and documentation
613f261 [Reza Zadeh] Column magnitude summary
75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle
0f12ade [Reza Zadeh] Style changes
eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max
f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer
dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix
41e8ece [Reza Zadeh] style changes
139c8e1 [Reza Zadeh] Syntax changes
029aa9c [Reza Zadeh] javadoc and new test
75edb25 [Reza Zadeh] All tests passing!
05e59b8 [Reza Zadeh] Add test
502ce52 [Reza Zadeh] new interface
654c4fb [Reza Zadeh] default methods
3726ca9 [Reza Zadeh] Remove MatrixAlgebra
6bebabb [Reza Zadeh] remove changes to MatrixSuite
5b8cd7d [Reza Zadeh] Initial files

appierys commented Oct 21, 2016

Does anyone know how to extend this to the 'Cross Product' case as mentioned in the paper?

tdas pushed a commit to tdas/spark that referenced this pull request May 29, 2018

Revert "[SC-8135][PART1] Check in DataFrame + RDD ProtoSerializer" on dbr-branch-4.x

## What changes were proposed in this pull request?

This PR reverts #1766.

The reason is same as #1777, but it targets to dbr-branch-4.x.

## How was this patch tested?

N/A.

Author: Cheng Lian <lian@databricks.com>

Closes #1778 from liancheng/revert-pr-1766.