Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MLlib] [SPARK-2885] DIMSUM: All-pairs similarity #1778

Closed
wants to merge 33 commits into from

Conversation

rezazadeh
Copy link
Contributor

All-pairs similarity via DIMSUM

Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

dimsumv2

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082

Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.

@SparkQA
Copy link

SparkQA commented Aug 5, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17926/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 5, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17926/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 5, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17929/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 5, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17929/consoleFull

@rezazadeh
Copy link
Contributor Author

The binary backwards compatibility check doesn't like adding a new method to the trait MultivariateStatisticalSummary. Suggestions on binary compatibility welcome, @mengxr

@srowen
Copy link
Member

srowen commented Aug 5, 2014

As a meta-question, what's the theory about what implementations should go into Spark, and which should be external? Not everything needs to be in a "core" library like MLlib. I know Mahout suffered mightily from adding a lot of implementations without much regard to their use or support. I'm not suggesting anything either way about this algorithm. If there's a working theory about what's in and out of scope, I'd love to see it and maybe make sure that people don't implement things for contribution that are too niche.

@rezazadeh
Copy link
Contributor Author

Having all-pairs similarity in spark has been requested several times. e.g. http://bit.ly/XAFGs8 , and also by @freeman-lab . This algorithm is also a part of Scalding: twitter/scalding#833

@pwendell
Copy link
Contributor

pwendell commented Aug 5, 2014

@rezazadeh mind putting [MLlib] in the title here? That way it gets sorted correctly by our internal reivew tools.

@rezazadeh rezazadeh changed the title DIMSUM: Dimension Independent Matrix Square using Mapreduce [MLlib] DIMSUM: Dimension Independent Matrix Square using Mapreduce Aug 5, 2014
@freeman-lab
Copy link
Contributor

@srowen agreed the core vs external library question is important. The requirements here seem reasonable, but there's still gray area. For example, we have lots of analyses that are known / accepted but should I think remain external because they are for specific data types (images & time series). Re: this particular algorithm, it's definitely something we're interested in using, sounds like others are too.

@mengxr
Copy link
Contributor

mengxr commented Aug 6, 2014

@rezazadeh Do you mind creating a JIRA for this and then add [SPARK-####] to the title? We also want to learn more about the theory, especially the relation between storage/computation complexity and failure rate.

Btw, to me, finding similar rows (observations) is more natural than finding similar columns.

@rezazadeh rezazadeh changed the title [MLlib] DIMSUM: Dimension Independent Matrix Square using Mapreduce [MLlib] [SPARK-2885] DIMSUM: Dimension Independent Matrix Square using Mapreduce Aug 6, 2014
@SparkQA
Copy link

SparkQA commented Aug 6, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18061/consoleFull

@rezazadeh
Copy link
Contributor Author

@mengxr Updated the PR to compute column magnitude as a method in RowMatrix so that binary compatibility shouldn't be a problem. This allowed me to use breeze too, which should take advantage of hardware acceleration when possible.

@srowen @freeman-lab @mengxr I added a JIRA for this PR, and clearly laid out why it is worthwhile adding this functionality to MLlib. https://issues.apache.org/jira/browse/SPARK-2885

@SparkQA
Copy link

SparkQA commented Aug 6, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18061/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 6, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18076/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 7, 2014

QA results for PR 1778:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18076/consoleFull

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit aea0247.

  • This patch does not merge cleanly!

Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 3467cff.

  • This patch merges cleanly.

@rezazadeh
Copy link
Contributor Author

@mengxr I also added broadcasting of p and v to further optimize space usage. Also now we're avoiding divide by zero if there is a column with zero magnitude.

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit aea0247.

  • This patch fails unit tests.
  • This patch does not merge cleanly!

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 3467cff.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Sep 26, 2014

test this please

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit ee8bd65.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit ee8bd65.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rezazadeh
Copy link
Contributor Author

Only the binary compatibility test is failing, which is expected.

@mengxr
Copy link
Contributor

mengxr commented Sep 26, 2014

@rezazadeh Could you set the exclusion rules in dev/MimaExcludes.scala?

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 4eb71c6.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 4eb71c6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rezazadeh
Copy link
Contributor Author

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 4eb71c6.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 4eb71c6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 404c64c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 27, 2014

QA tests have finished for PR 1778 at commit 404c64c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Sep 29, 2014

LGTM. Merged into master! Thanks @rezazadeh !

@asfgit asfgit closed this in 587a0cd Sep 29, 2014
@rezazadeh
Copy link
Contributor Author

Thanks for the review @mengxr !

@rezazadeh rezazadeh deleted the dimsumv2 branch September 30, 2014 04:37
@appierys
Copy link

Does anyone know how to extend this to the 'Cross Product' case as mentioned in the paper?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants