[MLlib] [SPARK-2885] DIMSUM: All-pairs similarity #1778
QA tests have started for PR 1778. This patch merges cleanly. |
QA results for PR 1778: |
The binary backwards compatibility check doesn't like adding a new method to the trait MultivariateStatisticalSummary. Suggestions on binary compatibility welcome, @mengxr
As a meta-question, what's the theory about what implementations should go into Spark, and which should be external? Not everything needs to be in a "core" library like MLlib. I know Mahout suffered mightily from adding a lot of implementations without much regard to their use or support. I'm not suggesting anything either way about this algorithm. If there's a working theory about what's in and out of scope, I'd love to see it and maybe make sure that people don't implement things for contribution that are too niche.
Having all-pairs similarity in Spark has been requested several times, e.g. http://bit.ly/XAFGs8, and also by @freeman-lab. This algorithm is also a part of Scalding: twitter/scalding#833
@rezazadeh mind putting |
@srowen agreed, the core vs external library question is important. The requirements here seem reasonable, but there's still a gray area. For example, we have lots of analyses that are known / accepted but should, I think, remain external because they are for specific data types (images & time series). Re: this particular algorithm, it's definitely something we're interested in using, and it sounds like others are too.
@rezazadeh Do you mind creating a JIRA for this and then add Btw, to me, finding similar rows (observations) is more natural than finding similar columns. |
@mengxr Updated the PR to compute column magnitude as a method in RowMatrix so that binary compatibility shouldn't be a problem. This allowed me to use breeze too, which should take advantage of hardware acceleration when possible. @srowen @freeman-lab @mengxr I added a JIRA for this PR, and clearly laid out why it is worthwhile adding this functionality to MLlib. https://issues.apache.org/jira/browse/SPARK-2885
QA tests have started for PR 1778 at commit
|
Conflicts: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
@mengxr I also added broadcasting of p and v to further optimize space usage. We also now avoid division by zero if there is a column with zero magnitude.
QA tests have finished for PR 1778 at commit
|
test this please |
Only the binary compatibility test is failing, which is expected.
@rezazadeh Could you set the exclusion rules in |
Jenkins, test this please. |
LGTM. Merged into master! Thanks @rezazadeh!
Thanks for the review @mengxr!
Does anyone know how to extend this to the 'Cross Product' case as mentioned in the paper?
All-pairs similarity via DIMSUM
Computes all pairs of similar vectors using both a brute-force approach and the DIMSUM sampling approach.
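For reference, the brute-force approach can be sketched in a few lines of plain Python. This is only an illustration of exact all-pairs column cosine similarity on a small dense matrix, not the RowMatrix code in this PR; the function name `brute_force_column_similarities` and the list-of-rows input format are assumptions made for the example. Zero-magnitude columns are skipped so that no division by zero occurs.

```python
import math


def brute_force_column_similarities(rows):
    """Exact cosine similarity for every pair of columns (j, k) with j < k.

    `rows` is a list of equal-length lists of floats (a small dense matrix).
    Hypothetical helper for illustration; not part of MLlib.
    """
    n = len(rows[0])
    # L2 norm (magnitude) of each column.
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n)]
    sims = {}
    for j in range(n):
        for k in range(j + 1, n):
            if norms[j] == 0.0 or norms[k] == 0.0:
                continue  # cosine is undefined for a zero column; skip it
            dot = sum(row[j] * row[k] for row in rows)
            sims[(j, k)] = dot / (norms[j] * norms[k])
    return sims
```

This does work proportional to m·n² on a dense matrix, which is the cost the sampling approach is meant to reduce.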
Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled γ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.
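As a small worked example of the oversampling setting, the helper below evaluates 4 log(n)/s. The function name `oversampling_gamma` is hypothetical, and the use of the natural logarithm is an assumption, since the description does not state the base.

```python
import math


def oversampling_gamma(n, s):
    """Oversampling parameter 4 * log(n) / s for n columns and threshold s.

    Assumes the natural logarithm; hypothetical helper for illustration.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if s <= 0:
        raise ValueError("similarity threshold s must be positive")
    return 4.0 * math.log(n) / s


# A lower similarity threshold calls for more oversampling.
gamma_strict = oversampling_gamma(1000, 0.5)
gamma_loose = oversampling_gamma(1000, 0.1)
```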
The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.
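The sampling scheme can be sketched in single-machine Python as follows. This follows my reading of the mapper and reducer in [1]: for each pair of nonzero entries in a row, emit their product with probability min(1, γ/(||c_j|| ||c_k||)), sum per pair, then divide by min(γ, ||c_j|| ||c_k||). It is an illustrative sketch, not the distributed implementation merged in this PR, and all function names here are hypothetical.

```python
import math
import random
from collections import defaultdict


def column_l2_norms(rows):
    """L2 norm of every column of a small dense matrix (list of rows)."""
    n = len(rows[0])
    return [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n)]


def dimsum_mapper(row, norms, gamma, rng):
    """Emit ((j, k), a_ij * a_ik) for sampled pairs of nonzero entries."""
    nonzero = [(j, v) for j, v in enumerate(row) if v != 0.0]
    for a in range(len(nonzero)):
        j, a_ij = nonzero[a]
        for b in range(a + 1, len(nonzero)):
            k, a_ik = nonzero[b]
            # Pairs of high-magnitude columns are sampled less often.
            keep_probability = min(1.0, gamma / (norms[j] * norms[k]))
            if rng.random() < keep_probability:
                yield (j, k), a_ij * a_ik


def dimsum_similarities(rows, gamma, seed=0):
    """Estimate cosine similarities between columns via sampling."""
    norms = column_l2_norms(rows)
    rng = random.Random(seed)
    sums = defaultdict(float)
    for row in rows:
        for key, value in dimsum_mapper(row, norms, gamma, rng):
            sums[key] += value  # the summation reducer
    # Rescale so that the estimate equals the cosine similarity in expectation.
    return {
        (j, k): total / min(gamma, norms[j] * norms[k])
        for (j, k), total in sums.items()
    }
```

When γ is at least ||c_j|| ||c_k|| for every pair, each pair is emitted with probability 1 and the output matches the exact cosine similarities; smaller γ trades accuracy for less emitted data.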
[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467
[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082
Testing
Tests for all invocations included.
Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.
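To make the two norms concrete, here is a minimal single-pass sketch of per-column L1 and L2 norms in plain Python. The function name `column_l1_l2_norms` is hypothetical and this is not the MultivariateStatisticalSummary API; it only shows what is being computed.

```python
import math


def column_l1_l2_norms(rows):
    """Return (l1, l2): per-column L1 and L2 norms of a small dense matrix.

    Hypothetical helper for illustration; `rows` is a list of equal-length lists.
    """
    n = len(rows[0])
    l1 = [0.0] * n
    l2_squared = [0.0] * n
    for row in rows:
        for j, value in enumerate(row):
            l1[j] += abs(value)           # sum of absolute values
            l2_squared[j] += value * value  # sum of squares
    return l1, [math.sqrt(x) for x in l2_squared]
```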