# [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity #1778

Closed
wants to merge 33 commits into
from
+259 −6

None yet

Contributor

# All-pairs similarity via DIMSUM

Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082

# Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.

added some commits Aug 4, 2014
 rezazadeh Initial files 5b8cd7d rezazadeh remove changes to MatrixSuite 6bebabb rezazadeh Remove MatrixAlgebra 3726ca9 rezazadeh default methods 654c4fb rezazadeh new interface 502ce52 rezazadeh Add test 05e59b8 rezazadeh All tests passing! 75edb25 rezazadeh javadoc and new test 029aa9c rezazadeh Syntax changes 139c8e1
commented Aug 5, 2014
 QA tests have started for PR 1778. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17926/consoleFull
commented Aug 5, 2014
 QA results for PR 1778:- This patch FAILED unit tests.- This patch merges cleanly- This patch adds no public classesFor more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17926/consoleFull
 rezazadeh style changes 41e8ece
commented Aug 5, 2014
 QA tests have started for PR 1778. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17929/consoleFull
commented Aug 5, 2014
 QA results for PR 1778:- This patch FAILED unit tests.- This patch merges cleanly- This patch adds no public classesFor more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17929/consoleFull
Contributor
 The binary backwards compatibility check doesn't like adding a new method to the trait MultivariateStatisticalSummary. Suggestions on binary compatibility welcome, @mengxr
Member
commented Aug 5, 2014
 As a meta-question, what's the theory about what implementations should go into Spark, and which should be external? Not everything needs to be in a "core" library like MLlib. I know Mahout suffered mightily from adding a lot of implementations without much regard to their use or support. I'm not suggesting anything either way about this algorithm. If there's a working theory about what's in and out of scope, I'd love to see it and maybe make sure that people don't implement things for contribution that are too niche.
Contributor
 Having all-pairs similarity in spark has been requested several times. e.g. http://bit.ly/XAFGs8 , and also by @freeman-lab . This algorithm is also a part of Scalding: twitter/scalding#833/
Contributor
commented Aug 5, 2014
 @rezazadeh mind putting [MLlib] in the title here? That way it gets sorted correctly by our internal reivew tools.
changed the title from DIMSUM: Dimension Independent Matrix Square using Mapreduce to [MLlib] DIMSUM: Dimension Independent Matrix Square using Mapreduce Aug 5, 2014
Contributor
 @srowen agreed the core vs external library question is important. The requirements here seem reasonable, but there's still gray area. For example, we have lots of analyses that are known / accepted but should I think remain external because they are for specific data types (images & time series). Re: this particular algorithm, it's definitely something we're interested in using, sounds like others are too.
Contributor
commented Aug 6, 2014
 @rezazadeh Do you mind creating a JIRA for this and then add [SPARK-####] to the title? We also want to learn more about the theory, especially the relation between storage/computation complexity and failure rate. Btw, to me, finding similar rows (observations) is more natural than finding similar columns.
changed the title from [MLlib] DIMSUM: Dimension Independent Matrix Square using Mapreduce to [MLlib] [SPARK-2885] DIMSUM: Dimension Independent Matrix Square using Mapreduce Aug 6, 2014
added some commits Aug 6, 2014
 rezazadeh Make colMagnitudes a method in RowMatrix dbc55ba rezazadeh Remove changes to MultivariateOnlineSummarizer f56a882
commented Aug 6, 2014
 QA tests have started for PR 1778. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18061/consoleFull
Contributor
 @mengxr Updated the PR to compute column magnitude as a method in RowMatrix so that binary compatibility shouldn't be a problem. This allowed me to use breeze too, which should take advantage of hardware acceleration when possible. @srowen @freeman-lab @mengxr I added a JIRA for this PR, and clearly laid out why it is worthwhile adding this functionality to MLlib. https://issues.apache.org/jira/browse/SPARK-2885
commented Aug 6, 2014
 QA results for PR 1778:- This patch FAILED unit tests.- This patch merges cleanly- This patch adds no public classesFor more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18061/consoleFull
commented Aug 6, 2014
 QA tests have started for PR 1778. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18076/consoleFull
commented Aug 7, 2014
 QA results for PR 1778:- This patch PASSES unit tests.- This patch merges cleanly- This patch adds no public classesFor more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18076/consoleFull
 rezazadeh Use Double.PositiveInfinity instead of Double.Max eb1dc20
commented Aug 7, 2014
 QA tests have started for PR 1778. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18099/consoleFull
commented Aug 7, 2014
 QA results for PR 1778:- This patch FAILED unit tests.- This patch merges cleanly- This patch adds no public classesFor more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18099/consoleFull
Contributor
commented Aug 7, 2014
 QA tests have started for PR 1778. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18107/consoleFull
commented Aug 7, 2014
 QA results for PR 1778:- This patch PASSES unit tests.- This patch merges cleanly- This patch adds no public classesFor more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18107/consoleFull
changed the title from [MLlib] [SPARK-2885] DIMSUM: Dimension Independent Matrix Square using Mapreduce to [MLlib] [SPARK-2885] All-pairs similarity Aug 7, 2014
changed the title from [MLlib] [SPARK-2885] All-pairs similarity to [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity Aug 7, 2014
referenced this pull request Aug 27, 2014
Closed

#### Dimension Independent Matrix Square Using MapReduce #336

Contributor
commented Aug 30, 2014
 @rezazadeh Could you update the PR to follow Spark Code Style Guide? Thanks!
 rezazadeh Style changes 0f12ade
Contributor
 Style changes made. Experimental results below. We run DIMSUM daily on a production-scale ads dataset. After replacing the traditional cosine similarity computation in late June, we observe 40% improvement in several performance measures, plotted below. The top of the y-axis is around 100TB. For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf
commented Aug 31, 2014
 QA tests have started for PR 1778 at commit 0f12ade. This patch merges cleanly.
 rezazadeh Use Ints instead of Longs in the shuffle 75a0b51
commented Aug 31, 2014
 QA tests have started for PR 1778 at commit 75a0b51. This patch merges cleanly.
commented Aug 31, 2014
 QA tests have finished for PR 1778 at commit 0f12ade. This patch passes unit tests. This patch merges cleanly. This patch adds the following public classes (experimental): class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,
commented Aug 31, 2014
 QA tests have finished for PR 1778 at commit 75a0b51. This patch passes unit tests. This patch merges cleanly. This patch adds no public classes.
and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 import org.apache.spark.Logging import org.apache.spark.mllib.rdd.RDDFunctions._ import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, MultivariateStatisticalSummary} +import scala.collection.mutable.ListBuffer
 mengxr Contributor Move Scala imports in front of third-party imports. rezazadeh Contributor Done
and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 @@ -390,6 +393,79 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Find all similar columns using cosine similarity. + * + * @return An n x n sparse matrix of cosine similarities between columns of this matrix. + */ + def similarColumns(): CoordinateMatrix = {
 mengxr Contributor When gamma is infinite, does the name similarColumns still make sense? rezazadeh Contributor Renamed to columnSimilarities()
and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 @@ -390,6 +393,79 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Find all similar columns using cosine similarity. + * + * @return An n x n sparse matrix of cosine similarities between columns of this matrix. + */ + def similarColumns(): CoordinateMatrix = { + similarColumns(Double.PositiveInfinity) + } + + /** + * Find all similar columns using the DIMSUM sampling algorithm, described in + * http://arxiv.org/abs/1304.1467 + * + * @param gamma The oversampling parameter. For provable results, set to 4 * log(n) / s,
 mengxr Contributor It would be nice to tell users how gamma affects the cost. From users' perspective, s is more friendly than gamma. rezazadeh Contributor Changed the interface to the threshold s instead of gamma.
and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 + * http://arxiv.org/abs/1304.1467 + * + * @param gamma The oversampling parameter. For provable results, set to 4 * log(n) / s, + * where s is the smallest similarity score to be estimated, + * and n is the number of columns + * @return An n x n sparse matrix of cosine similarities between columns of this matrix. + */ + def similarColumns(gamma: Double): CoordinateMatrix = { + similarColumnsDIMSUM(columnMagnitudes(), gamma) + } + + /** + * Return 2-norm of the columns of this matrix. + * @return an array of column magnitudes + */ + def columnMagnitudes(): Array[Double] = {
 mengxr Contributor Mark it private? rezazadeh Contributor Removed this altogether, and moved the functionality to MultivariateOnlineSummarizer, calling the functionality normL1 and normL2
and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 + * and n is the number of columns + * @return An n x n sparse matrix of cosine similarities between columns of this matrix. + */ + def similarColumns(gamma: Double): CoordinateMatrix = { + similarColumnsDIMSUM(columnMagnitudes(), gamma) + } + + /** + * Return 2-norm of the columns of this matrix. + * @return an array of column magnitudes + */ + def columnMagnitudes(): Array[Double] = { + rows.map { x => + val brzX = x.toBreeze + brzX.:*(brzX) + }.fold(BDV.zeros[Double](numCols().toInt))(_ + _).toArray.map(math.sqrt(_))
 mengxr Contributor It will create many temp objects, which we should try to avoid. We need to also double check brzX.:*(brzX) handles sparse vectors correctly. Using aggregate and then activeIterator may be a better solution. Another issue is overflow. rezazadeh Contributor Moved into MultivariateOnlineSummarizer
and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 + val brzX = x.toBreeze + brzX.:*(brzX) + }.fold(BDV.zeros[Double](numCols().toInt))(_ + _).toArray.map(math.sqrt(_)) + } + + /** + * Find all similar columns using the DIMSUM sampling algorithm, described in + * http://arxiv.org/abs/1304.1467 + * + * @param colMags A vector of column magnitudes + * @param gamma The oversampling parameter. For provable results, set to 4 * log(n) / s, + * where s is the smallest similarity score to be estimated, + * and n is the number of columns + * @return An n x n sparse matrix of cosine similarities between columns of this matrix. + */ + def similarColumnsDIMSUM(colMags: Array[Double], gamma: Double): CoordinateMatrix = {
 mengxr Contributor Mark it private? rezazadeh Contributor Marked as private to mllib
and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 + def similarColumnsDIMSUM(colMags: Array[Double], gamma: Double): CoordinateMatrix = { + require(gamma > 1.0, s"Oversampling should be greater than 1: $gamma") + require(colMags.size == this.numCols(), "Number of magnitudes didn't match column dimension") + + val sg = math.sqrt(gamma) // sqrt(gamma) used many times + + val sims = rows.flatMap { row => + val buf = new ListBuffer[((Int, Int), Double)]() + row.toBreeze.activeIterator.foreach { + case (_, 0.0) => // Skip explicit zero elements. + case (i, iVal) => + if (Math.random < sg / colMags(i)) { + row.toBreeze.activeIterator.foreach { + case (_, 0.0) => // Skip explicit zero elements. + case (j, jVal) => + if (Math.random < sg / colMags(j)) {  mengxr Contributor Need to fix the random seed per partition. rezazadeh Contributor Fixed by seeding the RNG with the row. and 1 other commented on an outdated diff Sep 2, 2014 ...e/spark/mllib/linalg/distributed/RowMatrixSuite.scala  + for (mat <- Seq(denseMat, sparseMat)) { + val G = mat.similarColumns(150.0) + assert(closeToZero(G.toBreeze() - expected)) + } + + for (mat <- Seq(denseMat, sparseMat)) { + val G = mat.similarColumns() + assert(closeToZero(G.toBreeze() - expected)) + } + + for (mat <- Seq(denseMat, sparseMat)) { + val G = mat.similarColumnsDIMSUM(colMags.toArray, 150.0) + assert(closeToZero(G.toBreeze() - expected)) + } + } +  mengxr Contributor There is no test for obtaining partial similar pairs. Do you mind adding one? rezazadeh Contributor Added test for partial similar pairs. mengxr Contributor Does the test output only a subset of column pairs? rezazadeh Contributor The test is estimating some column pairs i.e. I looked at the output and checked that some pairs don't have their similarity perfectly computed. Contributor commented Sep 2, 2014  @rezazadeh I understand that it is easier to implement the algorithm on a row-oriented format to compute similar columns. But it still sounds more natural to me to compute similar rows on a row-oriented format. For example, we have a billion small images and we want to pick out similar ones, without computing 10^9 x 10^9 inner products. Does DIMSUM help? and 1 other commented on an outdated diff Sep 3, 2014 ...apache/spark/mllib/linalg/distributed/RowMatrix.scala  + require(colMags.size == this.numCols(), "Number of magnitudes didn't match column dimension") + + val sg = math.sqrt(gamma) // sqrt(gamma) used many times + + val sims = rows.flatMap { row => + val buf = new ListBuffer[((Int, Int), Double)]() + row.toBreeze.activeIterator.foreach { + case (_, 0.0) => // Skip explicit zero elements. + case (i, iVal) => + if (Math.random < sg / colMags(i)) { + row.toBreeze.activeIterator.foreach { + case (_, 0.0) => // Skip explicit zero elements. + case (j, jVal) => + if (Math.random < sg / colMags(j)) { + val contrib = ((i, j), (iVal * jVal) / + (math.min(sg, colMags(i)) * math.min(sg, colMags(j))))  CanoeFZH It wastes that you calculate both colMags(i) and colMags(j) twice. rezazadeh Contributor Changed to computing them only once. added some commits Sep 11, 2014  rezazadeh Column magnitude summary 613f261 rezazadeh New interface and documentation e9c6791 rezazadeh Documentation 3764983 rezazadeh renamed to normL1 and normL2 fb296f6 Contributor  @mengxr All requested changes made. All tests are passing locally. However, I expect Jenkins to complain because of the new normL1 and normL2 methods added to MultivariateStatisticalSummary. As we discussed, it is worth adding L1 and L2 norms to MultivariateStatisticalSummary because they are fundamental summaries of all vectors, columns of RowMatrix are no exception. commented Sep 14, 2014  QA tests have started for PR 1778 at commit fb296f6. This patch merges cleanly. commented Sep 14, 2014  QA tests have finished for PR 1778 at commit fb296f6. This patch fails unit tests. This patch merges cleanly. This patch adds no public classes.  rezazadeh Line length for style 25e9d0d commented Sep 14, 2014  QA tests have started for PR 1778 at commit 25e9d0d. This patch merges cleanly.  rezazadeh Documentation 251bb9c commented Sep 15, 2014  QA tests have finished for PR 1778 at commit 25e9d0d. This patch fails unit tests. This patch merges cleanly. This patch adds no public classes. Contributor commented Sep 16, 2014  test this please commented Sep 16, 2014  QA tests have started for PR 1778 at commit 25e9d0d. This patch merges cleanly. and 1 other commented on an outdated diff Sep 16, 2014 ...apache/spark/mllib/linalg/distributed/RowMatrix.scala  @@ -18,6 +18,7 @@ package org.apache.spark.mllib.linalg.distributed import java.util.Arrays +import scala.collection.mutable.ListBuffer  mengxr Contributor separate scala imports from java ones rezazadeh Contributor Separated and 1 other commented on an outdated diff Sep 16, 2014 ...apache/spark/mllib/linalg/distributed/RowMatrix.scala  import org.apache.spark.Logging import org.apache.spark.mllib.rdd.RDDFunctions._ import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, MultivariateStatisticalSummary} +  mengxr Contributor remove extra empty line rezazadeh Contributor Removed extra empty line and 1 other commented on an outdated diff Sep 16, 2014 ...apache/spark/mllib/linalg/distributed/RowMatrix.scala  @@ -390,6 +393,113 @@ class RowMatrix( new RowMatrix(AB, nRows, B.numCols) } + /** + * Compute all cosine similarities between columns of this matrix using the brute-force + * approach of computing normalized dot products. + * + * @return An n x n sparse upper-triangular matrix of cosine similarities between + * columns of this matrix. + */ + def columnSimilarities(): CoordinateMatrix = { + similarColumns(0.0) + } + + /** + * Compute all similarities between columns of this matrix using a sampling approach.  mengxr Contributor all? rezazadeh Contributor Removed all and 1 other commented on an outdated diff Sep 16, 2014 ...apache/spark/mllib/linalg/distributed/RowMatrix.scala  + * Compute all similarities between columns of this matrix using a sampling approach. + * + * The threshold parameter is a trade-off knob between estimate quality and computational cost. + * + * Setting a threshold of 0 guarantees deterministic correct results, but comes at exactly + * the same cost as the brute-force approach. Setting the threshold to positive values + * incurs strictly less computational cost than the brute-force approach, however the + * similarities computed will be estimates. + * + * The sampling guarantees relative-error correctness for those pairs of columns that have + * similarity greater than the given similarity threshold. + * + * To describe the guarantee, we set some notation: + * Let A be the smallest in magnitude non-zero element of this matrix. + * Let B be the largest in magnitude non-zero element of this matrix. + * Let L be the number of non-zeros per row.  mengxr Contributor Is it average or max number of nonzeros? rezazadeh Contributor Max. Added that. and 1 other commented on an outdated diff Sep 16, 2014 ...apache/spark/mllib/linalg/distributed/RowMatrix.scala  + * For example, for {0,1} matrices: A=B=1. + * Another example, for the Netflix matrix: A=1, B=5 + * + * For those column pairs that are above the threshold, + * the computed similarity is correct to within 20% relative error with probability + * at least 1 - (0.981)^(100/B) + * + * The shuffle size is bounded by the *smaller* of the following two expressions: + * + * O(n log(n) L / (threshold * A)) + * O(m L^2) + * + * The latter is the cost of the brute-force approach, so for non-zero thresholds, + * the cost is always cheaper than the brute-force approach. + * + * @param threshold Similarities above this threshold are probably computed correctly.  mengxr Contributor Elaborate more on probably computed correctly? rezazadeh Contributor Elaborated by referring to the method description. and 1 other commented on an outdated diff Sep 16, 2014 ...apache/spark/mllib/linalg/distributed/RowMatrix.scala  + * + * O(n log(n) L / (threshold * A)) + * O(m L^2) + * + * The latter is the cost of the brute-force approach, so for non-zero thresholds, + * the cost is always cheaper than the brute-force approach. + * + * @param threshold Similarities above this threshold are probably computed correctly. + * Set to 0 for deterministic guaranteed correctness. + * @return An n x n sparse upper-triangular matrix of cosine similarities + * between columns of this matrix. + */ + def similarColumns(threshold: Double): CoordinateMatrix = { + require(threshold >= 0 && threshold <= 1, s"Threshold not in [0,1]:$threshold") + + val gamma = if (math.abs(threshold) < 1e-6) {
 mengxr Contributor remove math.abs rezazadeh Contributor Removed math.abs
and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 + + /** + * Find all similar columns using the DIMSUM sampling algorithm, described in two papers + * + * http://arxiv.org/abs/1206.2082 + * http://arxiv.org/abs/1304.1467 + * + * @param colMags A vector of column magnitudes + * @param gamma The oversampling parameter. For provable results, set to 100 * log(n) / s, + * where s is the smallest similarity score to be estimated, + * and n is the number of columns + * @return An n x n sparse upper-triangular matrix of cosine similarities + * between columns of this matrix. + */ + private[mllib] def similarColumnsDIMSUM(colMags: Array[Double], + gamma: Double): CoordinateMatrix = {
 mengxr Contributor 4-space indentation rezazadeh Contributor Used 4-space indentation for for two-lined columns. Pity intellij didn't catch this.
and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
 + * @return An n x n sparse upper-triangular matrix of cosine similarities + * between columns of this matrix. + */ + private[mllib] def similarColumnsDIMSUM(colMags: Array[Double], + gamma: Double): CoordinateMatrix = { + require(gamma > 1.0, s"Oversampling should be greater than 1: \$gamma") + require(colMags.size == this.numCols(), "Number of magnitudes didn't match column dimension") + + val sg = math.sqrt(gamma) // sqrt(gamma) used many times + + val sims = rows.flatMap { row => + val buf = new ListBuffer[((Int, Int), Double)]() + row.toBreeze.activeIterator.foreach { + case (_, 0.0) => // Skip explicit zero elements. + case (i, iVal) => + val rand = new scala.util.Random(iVal.toLong)
 mengxr Contributor It won't give you a pseudo random sequence. Think about the case when all values are the same. The seed should be set on a partition level. Could you update this block with mapPartitionsWithIndex? rezazadeh Contributor Updated to use mapPartitionsWithIndex with the partition index as the seed
commented on the diff Sep 16, 2014
...spark/mllib/stat/MultivariateStatisticalSummary.scala
 @@ -53,4 +53,14 @@ trait MultivariateStatisticalSummary { * Minimum value of each column. */ def min: Vector + + /** + * Euclidean magnitude of each column + */ + def normL2: Vector + + /** + * L1 norm of each column + */ + def normL1: Vector
 mengxr Contributor mean, stddev, and variance are summary statistics of the random variable. The 1-norm of a random variable should be E[|X|] instead of \sum_i |x_i|. See http://www.math.uah.edu/stat/expect/Spaces.html . I suggest changing the method names to norm1 and norm2 and outputs the average instead. rezazadeh Contributor For general vectors and matrices, L1 and L2 norms are widely accepted as summaries of a vector and are standard linear algebra: http://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm RowMatrix is a general matrix, not necessarily holding random variable samples in its columns. By using the names normL1 and normL2 it is clear what we're talking about - there is no mistaking what L1 means. Whereas if I were to name them norm1 and use some averaging, I think the name is ambiguous and requires the user to know we're assuming the matrix holds random variable samples in its columns. There is no need to confuse the user to make an assumption we don't need.
and 1 other commented on an outdated diff Sep 16, 2014
...e/spark/mllib/linalg/distributed/RowMatrixSuite.scala
 @@ -95,6 +95,40 @@ class RowMatrixSuite extends FunSuite with LocalSparkContext { } } + test("similar columns") { + val colMags = Vectors.dense(Math.sqrt(126), Math.sqrt(66), Math.sqrt(94)) + val expected = BDM( + (0.0, 54.0, 72.0), + (0.0, 0.0, 78.0), + (0.0, 0.0, 0.0)) + + for(i <- 0 until n) for(j <- 0 until n) { + expected(i, j) /= (colMags(i) * colMags(j)) + } + + for (mat <- Seq(denseMat, sparseMat)) { + val G = mat.similarColumns(0.1).toBreeze() + for(i <- 0 until n) for(j <- 0 until n) {
 mengxr Contributor for (i <- 0 until n; j <- 0 until n) { (space after for and merge two for statements) rezazadeh Contributor Merged and added space.
commented Sep 16, 2014
 QA tests have finished for PR 1778 at commit 25e9d0d. This patch fails unit tests. This patch merges cleanly. This patch adds no public classes.
 rezazadeh Use partition index for RNG 0e4eda4
commented Sep 20, 2014
 QA tests have started for PR 1778 at commit 0e4eda4. This patch merges cleanly.
commented Sep 20, 2014
 QA tests have finished for PR 1778 at commit 0e4eda4. This patch fails unit tests. This patch merges cleanly. This patch adds no public classes.
 The colMags right now have sqrt(sum(column1)^2 + sum(column2)^2 + ... + sum(columnN)^2) It will be good to have (sum(column1) + sum(column2) + ... + sum(columnN)) Can I add this logic in OnlineSummarizer normL1 ? I saw normL1 is not implemented right now...may be this is sum....This is the support number that's commonly used in market basket analysis...will be very useful to have it...I am right now reconstructing it using similarity_{ij}_colMags(i)_colMags(j) which can produce rounding errors sometime..
Contributor
 Why do you say normL1 is not implemented? I have implemented normL1 in MultivariateOnlineSummarizer, with tests. Do you want a version without absolute values? If so, just unnormalize mean, which is also in MultivariateOnlineSummarizer
 ahh I saw it in the code now...that will do...no need for absolute values...numbers are positive for me..thanks..
 The code looks good for sparse input but for dense input is there any issue with using activeIterator ? I understand that due to dimsum threshold check we have to iterate over column and rows independently but if there is dense data, is it possible for having a dimsum threshold check high up and use dense*dense blas ?
added some commits Sep 24, 2014
 mengxr Merge branch 'master' into rezazadeh-dimsumv2 3c4cf41 mengxr some optimization f2947e4
Contributor
commented Sep 25, 2014
 @rezazadeh I made some changes in a local branch: https://github.com/mengxr/spark/blob/rezazadeh-dimsumv2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala . Could you merge the latest master to your branch? Then it is easy to compare the diff. I changed the following: similarColumns -> columnSimilarities remove activeIterator and specialize for dense and sparse vectors cache the probabilities and denominators Those should increase the performance by ~5x. But the shuffle is still expensive, because the records are very small. Another question I have is on the sparsity of the result. I ran some tests locally and found that even with gamma = 1.0, the result is still dense (containing all (i, j) pairs) though the shuffle size is smaller.
 rezazadeh Merge remote-tracking branch 'upstream/master' into dimsumv2 254ca08
Contributor
 @mengxr Thanks for the optimizations. I merged the latest master into my branch and pushed to here. Would you like me to merge your branch into mine? There is no guarantee on sparsity, only on lower shuffle size, so what you observed is exactly what is expected. We can allow the user to promote sparsity by allowing them to set the threshold to values greater than 1. In this case, we can't provably guarantee correctness, but it will be useful for users who have very dense matrices, because the guarantees are pessimistic and even after promoting sparsity the result will likely be useful to them. With this in mind, I propose allowing the threshold to be above 1, and when it is, instead of giving an error, give a warning that results are not guaranteed correct, but the computational savings will be useful. Shall I do that?
added some commits Sep 25, 2014
 mengxr Merge branch 'rezazadeh-dimsumv2' into dimsumv2 2196ba5 mengxr organize imports 9fe17c0
Contributor
commented Sep 25, 2014
 @rezazadeh I sent a PR to your repo at: rezazadeh#1 . Could you check the changes and merge it if they are correct (hopefully) and look good to you? Thanks! Thanks for explaining how threshold affects the computation. I think it is nice to allow it to be above 1, given that it is documented accurately.
 rezazadeh Allow large thresholds to promote sparsity aea0247
Contributor
 @mengxr Merged in your changes and added ability for the threshold to be larger with a warning. Tests pass.
commented Sep 26, 2014
 QA tests have started for PR 1778 at commit aea0247. This patch does not merge cleanly!
 rezazadeh Merge remote-tracking branch 'upstream/master' into dimsumv2 Conflicts: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala 3467cff
commented Sep 26, 2014
 QA tests have started for PR 1778 at commit 3467cff. This patch merges cleanly.
 rezazadeh Broadcast colMags. Avoid div by zero. 976ddd4
Contributor
 @mengxr I also added broadcasting of p and v to further optimize space usage. Also now we're avoiding divide by zero if there is a column with zero magnitude.
commented Sep 26, 2014
 QA tests have finished for PR 1778 at commit aea0247. This patch fails unit tests. This patch does not merge cleanly!
commented Sep 26, 2014
 QA tests have finished for PR 1778 at commit 3467cff. This patch fails unit tests. This patch merges cleanly. This patch adds no public classes.
 rezazadeh Merge remote-tracking branch 'upstream/master' into dimsumv2 ee8bd65
Contributor
commented Sep 26, 2014
commented Sep 26, 2014
 QA tests have started for PR 1778 at commit ee8bd65. This patch merges cleanly.
commented Sep 26, 2014
 QA tests have finished for PR 1778 at commit ee8bd65. This patch fails unit tests. This patch merges cleanly. This patch adds no public classes.
Contributor
 Only the binary compatibility test is failing, which is expected.
Contributor
commented Sep 26, 2014
 @rezazadeh Could you set the exclusion rules in dev/MimaExcludes.scala?
 rezazadeh Add excludes for normL1 and normL2 4eb71c6
commented Sep 26, 2014
 QA tests have started for PR 1778 at commit 4eb71c6. This patch merges cleanly.
commented Sep 26, 2014
 QA tests have finished for PR 1778 at commit 4eb71c6. This patch fails unit tests. This patch merges cleanly. This patch adds no public classes.
Contributor
commented Sep 26, 2014
 QA tests have started for PR 1778 at commit 4eb71c6. This patch merges cleanly.
commented Sep 26, 2014
 QA tests have finished for PR 1778 at commit 4eb71c6. This patch fails unit tests. This patch merges cleanly. This patch adds no public classes.
 rezazadeh Merge remote-tracking branch 'upstream/master' into dimsumv2 404c64c
commented Sep 26, 2014
 QA tests have started for PR 1778 at commit 404c64c. This patch merges cleanly.
commented Sep 27, 2014
 QA tests have finished for PR 1778 at commit 404c64c. This patch passes unit tests. This patch merges cleanly. This patch adds no public classes.
Contributor
commented Sep 29, 2014
 LGTM. Merged into master! Thanks @rezazadeh !
pushed a commit that closed this pull request Sep 29, 2014
 rezazadeh + mengxr [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity # All-pairs similarity via DIMSUM Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach. Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold. The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities. ![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png) [1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467 [2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082 # Testing Tests for all invocations included. Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them. Author: Reza Zadeh Author: Xiangrui Meng Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits: 404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2 ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero. 3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity 9fe17c0 [Xiangrui Meng] organize imports 2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2 254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 f2947e4 [Xiangrui Meng] some optimization 3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2 0e4eda4 [Reza Zadeh] Use partition index for RNG 251bb9c [Reza Zadeh] Documentation 25e9d0d [Reza Zadeh] Line length for style fb296f6 [Reza Zadeh] renamed to normL1 and normL2 3764983 [Reza Zadeh] Documentation e9c6791 [Reza Zadeh] New interface and documentation 613f261 [Reza Zadeh] Column magnitude summary 75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle 0f12ade [Reza Zadeh] Style changes eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix 41e8ece [Reza Zadeh] style changes 139c8e1 [Reza Zadeh] Syntax changes 029aa9c [Reza Zadeh] javadoc and new test 75edb25 [Reza Zadeh] All tests passing! 05e59b8 [Reza Zadeh] Add test 502ce52 [Reza Zadeh] new interface 654c4fb [Reza Zadeh] default methods 3726ca9 [Reza Zadeh] Remove MatrixAlgebra 6bebabb [Reza Zadeh] remove changes to MatrixSuite 5b8cd7d [Reza Zadeh] Initial files 587a0cd
closed this in 587a0cd Sep 29, 2014
Contributor
 Thanks for the review @mengxr !
deleted the rezazadeh:dimsumv2 branch Sep 30, 2014
pushed a commit to dgshep/spark that referenced this pull request Dec 8, 2014
 rezazadeh + Davis Shepherd [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity # All-pairs similarity via DIMSUM Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach. Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold. The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities. ![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png) [1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467 [2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082 # Testing Tests for all invocations included. Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them. Author: Reza Zadeh Author: Xiangrui Meng Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits: 404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2 ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero. 3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity 9fe17c0 [Xiangrui Meng] organize imports 2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2 254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2 f2947e4 [Xiangrui Meng] some optimization 3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2 0e4eda4 [Reza Zadeh] Use partition index for RNG 251bb9c [Reza Zadeh] Documentation 25e9d0d [Reza Zadeh] Line length for style fb296f6 [Reza Zadeh] renamed to normL1 and normL2 3764983 [Reza Zadeh] Documentation e9c6791 [Reza Zadeh] New interface and documentation 613f261 [Reza Zadeh] Column magnitude summary 75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle 0f12ade [Reza Zadeh] Style changes eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix 41e8ece [Reza Zadeh] style changes 139c8e1 [Reza Zadeh] Syntax changes 029aa9c [Reza Zadeh] javadoc and new test 75edb25 [Reza Zadeh] All tests passing! 05e59b8 [Reza Zadeh] Add test 502ce52 [Reza Zadeh] new interface 654c4fb [Reza Zadeh] default methods 3726ca9 [Reza Zadeh] Remove MatrixAlgebra 6bebabb [Reza Zadeh] remove changes to MatrixSuite 5b8cd7d [Reza Zadeh] Initial files 0b638a9
 Does anyone know how to extend this to the 'Cross Product' case as mentioned in the paper?