[MLlib] [SPARK-2885] DIMSUM: All-pairs similarity #1778

Closed. 33 commits, 9 participants.

@rezazadeh
Contributor

All-pairs similarity via DIMSUM

Compute all pairs of similar vectors using a brute-force approach, and also using the DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

[Image: dimsumv2, the DIMSUM mapper and reducer]
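As a hint for the exercise above, here is a short derivation. It assumes the sampling rule that appears in the similarColumnsDIMSUM diff quoted later in this thread: each entry is kept with probability min(1, sqrt(ɣ)/||c_j||), and an emitted product is scaled by min(sqrt(ɣ), ||c_j||) min(sqrt(ɣ), ||c_k||). The symbols p_j, q_j and b_jk are notation introduced here for illustration and are not taken from the papers.

```latex
% Assumed sampling rule: entry a_{ij} of row r_i is kept with probability p_j,
% with independent draws for the two members of a pair (j, k).
p_j = \min\!\left(1, \frac{\sqrt{\gamma}}{\lVert c_j \rVert}\right), \qquad
q_j = \min\!\left(\sqrt{\gamma}, \lVert c_j \rVert\right), \qquad
\frac{p_j}{q_j} = \frac{1}{\lVert c_j \rVert} \quad \text{(check both cases of the min)}

% b_{jk} is the reducer output: the sum of all emitted values for the pair (j, k).
\mathbb{E}[b_{jk}]
  = \sum_{i=1}^{m} p_j \, p_k \, \frac{a_{ij} a_{ik}}{q_j q_k}
  = \frac{\sum_{i=1}^{m} a_{ij} a_{ik}}{\lVert c_j \rVert \, \lVert c_k \rVert}
  = \cos(c_j, c_k)
```

The key step is that the sampling probability and the scaling factor cancel to 1/||c_j|| whether or not sqrt(ɣ) exceeds the column magnitude, so the estimate is unbiased for any ɣ.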

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082

Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.

@SparkQA
SparkQA commented Aug 5, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17926/consoleFull

@SparkQA
SparkQA commented Aug 5, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17926/consoleFull

@SparkQA
SparkQA commented Aug 5, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17929/consoleFull

@SparkQA
SparkQA commented Aug 5, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17929/consoleFull

@rezazadeh
Contributor

The binary backwards compatibility check doesn't like adding a new method to the trait MultivariateStatisticalSummary. Suggestions on binary compatibility welcome, @mengxr

@srowen
Member
srowen commented Aug 5, 2014

As a meta-question, what's the theory about what implementations should go into Spark, and which should be external? Not everything needs to be in a "core" library like MLlib. I know Mahout suffered mightily from adding a lot of implementations without much regard to their use or support. I'm not suggesting anything either way about this algorithm. If there's a working theory about what's in and out of scope, I'd love to see it and maybe make sure that people don't implement things for contribution that are too niche.

@rezazadeh
Contributor

Having all-pairs similarity in Spark has been requested several times, e.g. http://bit.ly/XAFGs8, and also by @freeman-lab. This algorithm is also part of Scalding: twitter/scalding#833

@pwendell
Contributor
pwendell commented Aug 5, 2014

@rezazadeh mind putting [MLlib] in the title here? That way it gets sorted correctly by our internal review tools.

@rezazadeh rezazadeh changed the title from DIMSUM: Dimension Independent Matrix Square using Mapreduce to [MLlib] DIMSUM: Dimension Independent Matrix Square using Mapreduce Aug 5, 2014
@freeman-lab
Contributor

@srowen agreed the core vs external library question is important. The requirements here seem reasonable, but there's still gray area. For example, we have lots of analyses that are known / accepted but should I think remain external because they are for specific data types (images & time series). Re: this particular algorithm, it's definitely something we're interested in using, sounds like others are too.

@mengxr
Contributor
mengxr commented Aug 6, 2014

@rezazadeh Do you mind creating a JIRA for this and then adding [SPARK-####] to the title? We also want to learn more about the theory, especially the relation between storage/computation complexity and failure rate.

Btw, to me, finding similar rows (observations) is more natural than finding similar columns.

@rezazadeh rezazadeh changed the title from [MLlib] DIMSUM: Dimension Independent Matrix Square using Mapreduce to [MLlib] [SPARK-2885] DIMSUM: Dimension Independent Matrix Square using Mapreduce Aug 6, 2014
@SparkQA
SparkQA commented Aug 6, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18061/consoleFull

@rezazadeh
Contributor

@mengxr Updated the PR to compute column magnitude as a method in RowMatrix so that binary compatibility shouldn't be a problem. This allowed me to use breeze too, which should take advantage of hardware acceleration when possible.

@srowen @freeman-lab @mengxr I added a JIRA for this PR, and clearly laid out why it is worthwhile adding this functionality to MLlib. https://issues.apache.org/jira/browse/SPARK-2885

@SparkQA
SparkQA commented Aug 6, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18061/consoleFull

@SparkQA
SparkQA commented Aug 6, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18076/consoleFull

@SparkQA
SparkQA commented Aug 7, 2014

QA results for PR 1778:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18076/consoleFull

@SparkQA
SparkQA commented Aug 7, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18099/consoleFull

@SparkQA
SparkQA commented Aug 7, 2014

QA results for PR 1778:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18099/consoleFull

@rezazadeh
Contributor

Jenkins, retest this please.

@SparkQA
SparkQA commented Aug 7, 2014

QA tests have started for PR 1778. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18107/consoleFull

@SparkQA
SparkQA commented Aug 7, 2014

QA results for PR 1778:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18107/consoleFull

@rezazadeh rezazadeh changed the title from [MLlib] [SPARK-2885] DIMSUM: Dimension Independent Matrix Square using Mapreduce to [MLlib] [SPARK-2885] All-pairs similarity Aug 7, 2014
@rezazadeh rezazadeh changed the title from [MLlib] [SPARK-2885] All-pairs similarity to [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity Aug 7, 2014
@mengxr
Contributor
mengxr commented Aug 30, 2014

@rezazadeh Could you update the PR to follow Spark Code Style Guide? Thanks!

@rezazadeh
Contributor

Style changes made. Experimental results below.

We run DIMSUM daily on a production-scale ads dataset. After replacing the traditional cosine similarity computation in late June, we observed a 40% improvement in several performance measures, plotted below. The top of the y-axis is around 100TB.

[Two images: plots of the performance measures]

For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf

@SparkQA
SparkQA commented Aug 31, 2014

QA tests have started for PR 1778 at commit 0f12ade.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Aug 31, 2014

QA tests have started for PR 1778 at commit 75a0b51.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Aug 31, 2014

QA tests have finished for PR 1778 at commit 0f12ade.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader,
@SparkQA
SparkQA commented Aug 31, 2014

QA tests have finished for PR 1778 at commit 75a0b51.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@mengxr mengxr and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
import org.apache.spark.Logging
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, MultivariateStatisticalSummary}
+import scala.collection.mutable.ListBuffer
@mengxr
mengxr Sep 2, 2014 Contributor

Move Scala imports in front of third-party imports.

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Done

@mengxr mengxr and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
@@ -390,6 +393,79 @@ class RowMatrix(
new RowMatrix(AB, nRows, B.numCols)
}
+ /**
+ * Find all similar columns using cosine similarity.
+ *
+ * @return An n x n sparse matrix of cosine similarities between columns of this matrix.
+ */
+ def similarColumns(): CoordinateMatrix = {
@mengxr
mengxr Sep 2, 2014 Contributor

When gamma is infinite, does the name similarColumns still make sense?

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Renamed to columnSimilarities()

@mengxr mengxr and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
@@ -390,6 +393,79 @@ class RowMatrix(
new RowMatrix(AB, nRows, B.numCols)
}
+ /**
+ * Find all similar columns using cosine similarity.
+ *
+ * @return An n x n sparse matrix of cosine similarities between columns of this matrix.
+ */
+ def similarColumns(): CoordinateMatrix = {
+ similarColumns(Double.PositiveInfinity)
+ }
+
+ /**
+ * Find all similar columns using the DIMSUM sampling algorithm, described in
+ * http://arxiv.org/abs/1304.1467
+ *
+ * @param gamma The oversampling parameter. For provable results, set to 4 * log(n) / s,
@mengxr
mengxr Sep 2, 2014 Contributor

It would be nice to tell users how gamma affects the cost. From users' perspective, s is more friendly than gamma.

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Changed the interface to the threshold s instead of gamma.

@mengxr mengxr and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ * http://arxiv.org/abs/1304.1467
+ *
+ * @param gamma The oversampling parameter. For provable results, set to 4 * log(n) / s,
+ * where s is the smallest similarity score to be estimated,
+ * and n is the number of columns
+ * @return An n x n sparse matrix of cosine similarities between columns of this matrix.
+ */
+ def similarColumns(gamma: Double): CoordinateMatrix = {
+ similarColumnsDIMSUM(columnMagnitudes(), gamma)
+ }
+
+ /**
+ * Return 2-norm of the columns of this matrix.
+ * @return an array of column magnitudes
+ */
+ def columnMagnitudes(): Array[Double] = {
@mengxr
mengxr Sep 2, 2014 Contributor

Mark it private?

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Removed this altogether, and moved the functionality to MultivariateOnlineSummarizer, calling the functionality normL1 and normL2

@mengxr mengxr and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ * and n is the number of columns
+ * @return An n x n sparse matrix of cosine similarities between columns of this matrix.
+ */
+ def similarColumns(gamma: Double): CoordinateMatrix = {
+ similarColumnsDIMSUM(columnMagnitudes(), gamma)
+ }
+
+ /**
+ * Return 2-norm of the columns of this matrix.
+ * @return an array of column magnitudes
+ */
+ def columnMagnitudes(): Array[Double] = {
+ rows.map { x =>
+ val brzX = x.toBreeze
+ brzX.:*(brzX)
+ }.fold(BDV.zeros[Double](numCols().toInt))(_ + _).toArray.map(math.sqrt(_))
@mengxr
mengxr Sep 2, 2014 Contributor

It will create many temp objects, which we should try to avoid. We need to also double check brzX.:*(brzX) handles sparse vectors correctly. Using aggregate and then activeIterator may be a better solution.

Another issue is overflow.
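To illustrate the suggestion above, here is a minimal sketch in plain Scala (no Spark or Breeze) of computing column 2-norms in one pass while updating a single accumulator, instead of allocating a squared vector per row. The `SparseRow` type, the object name, and the function name are hypothetical stand-ins, not part of the PR. The sketch does not address the overflow concern.

```scala
object ColumnNorms {
  // Hypothetical stand-in for a sparse vector: (columnIndex, value) pairs.
  type SparseRow = Seq[(Int, Double)]

  /** One pass over the rows, accumulating squared values into a single array. */
  def columnNormsL2(rows: Iterable[SparseRow], numCols: Int): Array[Double] = {
    val sumSq = new Array[Double](numCols)
    for (row <- rows; (j, v) <- row if v != 0.0) {
      sumSq(j) += v * v // only active (non-zero) entries are touched
    }
    sumSq.map(math.sqrt(_))
  }
}
```

With Spark, the same per-row update would typically be the `seqOp` of an `aggregate` call, with element-wise addition of two accumulators as the `combOp`.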

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Moved into MultivariateOnlineSummarizer

@mengxr mengxr and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ val brzX = x.toBreeze
+ brzX.:*(brzX)
+ }.fold(BDV.zeros[Double](numCols().toInt))(_ + _).toArray.map(math.sqrt(_))
+ }
+
+ /**
+ * Find all similar columns using the DIMSUM sampling algorithm, described in
+ * http://arxiv.org/abs/1304.1467
+ *
+ * @param colMags A vector of column magnitudes
+ * @param gamma The oversampling parameter. For provable results, set to 4 * log(n) / s,
+ * where s is the smallest similarity score to be estimated,
+ * and n is the number of columns
+ * @return An n x n sparse matrix of cosine similarities between columns of this matrix.
+ */
+ def similarColumnsDIMSUM(colMags: Array[Double], gamma: Double): CoordinateMatrix = {
@mengxr
mengxr Sep 2, 2014 Contributor

Mark it private?

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Marked as private to mllib

@mengxr mengxr and 1 other commented on an outdated diff Sep 2, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ def similarColumnsDIMSUM(colMags: Array[Double], gamma: Double): CoordinateMatrix = {
+ require(gamma > 1.0, s"Oversampling should be greater than 1: $gamma")
+ require(colMags.size == this.numCols(), "Number of magnitudes didn't match column dimension")
+
+ val sg = math.sqrt(gamma) // sqrt(gamma) used many times
+
+ val sims = rows.flatMap { row =>
+ val buf = new ListBuffer[((Int, Int), Double)]()
+ row.toBreeze.activeIterator.foreach {
+ case (_, 0.0) => // Skip explicit zero elements.
+ case (i, iVal) =>
+ if (Math.random < sg / colMags(i)) {
+ row.toBreeze.activeIterator.foreach {
+ case (_, 0.0) => // Skip explicit zero elements.
+ case (j, jVal) =>
+ if (Math.random < sg / colMags(j)) {
@mengxr
mengxr Sep 2, 2014 Contributor

Need to fix the random seed per partition.

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Fixed by seeding the RNG with the row.

@mengxr mengxr and 1 other commented on an outdated diff Sep 2, 2014
...e/spark/mllib/linalg/distributed/RowMatrixSuite.scala
+ for (mat <- Seq(denseMat, sparseMat)) {
+ val G = mat.similarColumns(150.0)
+ assert(closeToZero(G.toBreeze() - expected))
+ }
+
+ for (mat <- Seq(denseMat, sparseMat)) {
+ val G = mat.similarColumns()
+ assert(closeToZero(G.toBreeze() - expected))
+ }
+
+ for (mat <- Seq(denseMat, sparseMat)) {
+ val G = mat.similarColumnsDIMSUM(colMags.toArray, 150.0)
+ assert(closeToZero(G.toBreeze() - expected))
+ }
+ }
+
@mengxr
mengxr Sep 2, 2014 Contributor

There is no test for obtaining partial similar pairs. Do you mind adding one?

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Added test for partial similar pairs.

@mengxr
mengxr Sep 16, 2014 Contributor

Does the test output only a subset of column pairs?

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

The test estimates some column pairs, i.e. I looked at the output and checked that some pairs do not have their similarity computed exactly.

@mengxr
Contributor
mengxr commented Sep 2, 2014

@rezazadeh I understand that it is easier to implement the algorithm on a row-oriented format to compute similar columns. But it still sounds more natural to me to compute similar rows on a row-oriented format. For example, we have a billion small images and we want to pick out similar ones, without computing 10^9 x 10^9 inner products. Does DIMSUM help?

@CanoeFZH CanoeFZH and 1 other commented on an outdated diff Sep 3, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ require(colMags.size == this.numCols(), "Number of magnitudes didn't match column dimension")
+
+ val sg = math.sqrt(gamma) // sqrt(gamma) used many times
+
+ val sims = rows.flatMap { row =>
+ val buf = new ListBuffer[((Int, Int), Double)]()
+ row.toBreeze.activeIterator.foreach {
+ case (_, 0.0) => // Skip explicit zero elements.
+ case (i, iVal) =>
+ if (Math.random < sg / colMags(i)) {
+ row.toBreeze.activeIterator.foreach {
+ case (_, 0.0) => // Skip explicit zero elements.
+ case (j, jVal) =>
+ if (Math.random < sg / colMags(j)) {
+ val contrib = ((i, j), (iVal * jVal) /
+ (math.min(sg, colMags(i)) * math.min(sg, colMags(j))))
@CanoeFZH
CanoeFZH Sep 3, 2014

It is wasteful to compute both colMags(i) and colMags(j) twice.

@rezazadeh
rezazadeh Sep 14, 2014 Contributor

Changed to computing them only once.

@rezazadeh
Contributor

@mengxr All requested changes made. All tests are passing locally. However, I expect Jenkins to complain because of the new normL1 and normL2 methods added to MultivariateStatisticalSummary. As we discussed, it is worth adding L1 and L2 norms to MultivariateStatisticalSummary because they are fundamental summaries of all vectors; columns of a RowMatrix are no exception.

@SparkQA
SparkQA commented Sep 14, 2014

QA tests have started for PR 1778 at commit fb296f6.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Sep 14, 2014

QA tests have finished for PR 1778 at commit fb296f6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA
SparkQA commented Sep 14, 2014

QA tests have started for PR 1778 at commit 25e9d0d.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Sep 15, 2014

QA tests have finished for PR 1778 at commit 25e9d0d.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@mengxr
Contributor
mengxr commented Sep 16, 2014

test this please

@SparkQA
SparkQA commented Sep 16, 2014

QA tests have started for PR 1778 at commit 25e9d0d.

  • This patch merges cleanly.
@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
@@ -18,6 +18,7 @@
package org.apache.spark.mllib.linalg.distributed
import java.util.Arrays
+import scala.collection.mutable.ListBuffer
@mengxr
mengxr Sep 16, 2014 Contributor

separate scala imports from java ones

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Separated

@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
import org.apache.spark.Logging
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, MultivariateStatisticalSummary}
+
@mengxr
mengxr Sep 16, 2014 Contributor

remove extra empty line

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Removed extra empty line

@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
@@ -390,6 +393,113 @@ class RowMatrix(
new RowMatrix(AB, nRows, B.numCols)
}
+ /**
+ * Compute all cosine similarities between columns of this matrix using the brute-force
+ * approach of computing normalized dot products.
+ *
+ * @return An n x n sparse upper-triangular matrix of cosine similarities between
+ * columns of this matrix.
+ */
+ def columnSimilarities(): CoordinateMatrix = {
+ similarColumns(0.0)
+ }
+
+ /**
+ * Compute all similarities between columns of this matrix using a sampling approach.
@mengxr
mengxr Sep 16, 2014 Contributor

all?

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Removed all

@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ * Compute all similarities between columns of this matrix using a sampling approach.
+ *
+ * The threshold parameter is a trade-off knob between estimate quality and computational cost.
+ *
+ * Setting a threshold of 0 guarantees deterministic correct results, but comes at exactly
+ * the same cost as the brute-force approach. Setting the threshold to positive values
+ * incurs strictly less computational cost than the brute-force approach, however the
+ * similarities computed will be estimates.
+ *
+ * The sampling guarantees relative-error correctness for those pairs of columns that have
+ * similarity greater than the given similarity threshold.
+ *
+ * To describe the guarantee, we set some notation:
+ * Let A be the smallest in magnitude non-zero element of this matrix.
+ * Let B be the largest in magnitude non-zero element of this matrix.
+ * Let L be the number of non-zeros per row.
@mengxr
mengxr Sep 16, 2014 Contributor

Is it average or max number of nonzeros?

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Max. Added that.

@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ * For example, for {0,1} matrices: A=B=1.
+ * Another example, for the Netflix matrix: A=1, B=5
+ *
+ * For those column pairs that are above the threshold,
+ * the computed similarity is correct to within 20% relative error with probability
+ * at least 1 - (0.981)^(100/B)
+ *
+ * The shuffle size is bounded by the *smaller* of the following two expressions:
+ *
+ * O(n log(n) L / (threshold * A))
+ * O(m L^2)
+ *
+ * The latter is the cost of the brute-force approach, so for non-zero thresholds,
+ * the cost is always cheaper than the brute-force approach.
+ *
+ * @param threshold Similarities above this threshold are probably computed correctly.
@mengxr
mengxr Sep 16, 2014 Contributor

Elaborate more on probably computed correctly?

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Elaborated by referring to the method description.

@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ *
+ * O(n log(n) L / (threshold * A))
+ * O(m L^2)
+ *
+ * The latter is the cost of the brute-force approach, so for non-zero thresholds,
+ * the cost is always cheaper than the brute-force approach.
+ *
+ * @param threshold Similarities above this threshold are probably computed correctly.
+ * Set to 0 for deterministic guaranteed correctness.
+ * @return An n x n sparse upper-triangular matrix of cosine similarities
+ * between columns of this matrix.
+ */
+ def similarColumns(threshold: Double): CoordinateMatrix = {
+ require(threshold >= 0 && threshold <= 1, s"Threshold not in [0,1]: $threshold")
+
+ val gamma = if (math.abs(threshold) < 1e-6) {
@mengxr
mengxr Sep 16, 2014 Contributor

remove math.abs

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Removed math.abs
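For readers following the switch from gamma to a threshold, here is a minimal sketch of how a threshold s could be mapped to the oversampling parameter. The function name `gammaFor` and the parameter `c` are hypothetical. The constant is left as a parameter because this thread quotes both 4 * log(n) / s (PR description) and 100 * log(n) / s (a later doc comment). The upper bound on the threshold is omitted, since the discussion later allows values above 1.

```scala
object GammaFromThreshold {
  /**
   * Map a similarity threshold s to gamma = c * log(n) / s.
   * A threshold of (almost) zero means brute force, i.e. infinite gamma.
   */
  def gammaFor(threshold: Double, numCols: Long, c: Double = 4.0): Double = {
    require(threshold >= 0.0, s"Threshold must be non-negative: $threshold")
    if (threshold < 1e-6) Double.PositiveInfinity // keep every pair
    else c * math.log(numCols.toDouble) / threshold
  }
}
```

Since the threshold is already required to be non-negative, the `math.abs` in the comparison is redundant, which is the point of the review comment.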

@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+
+ /**
+ * Find all similar columns using the DIMSUM sampling algorithm, described in two papers
+ *
+ * http://arxiv.org/abs/1206.2082
+ * http://arxiv.org/abs/1304.1467
+ *
+ * @param colMags A vector of column magnitudes
+ * @param gamma The oversampling parameter. For provable results, set to 100 * log(n) / s,
+ * where s is the smallest similarity score to be estimated,
+ * and n is the number of columns
+ * @return An n x n sparse upper-triangular matrix of cosine similarities
+ * between columns of this matrix.
+ */
+ private[mllib] def similarColumnsDIMSUM(colMags: Array[Double],
+ gamma: Double): CoordinateMatrix = {
@mengxr
mengxr Sep 16, 2014 Contributor

4-space indentation

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Used 4-space indentation for two-lined columns. Pity IntelliJ didn't catch this.

@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...apache/spark/mllib/linalg/distributed/RowMatrix.scala
+ * @return An n x n sparse upper-triangular matrix of cosine similarities
+ * between columns of this matrix.
+ */
+ private[mllib] def similarColumnsDIMSUM(colMags: Array[Double],
+ gamma: Double): CoordinateMatrix = {
+ require(gamma > 1.0, s"Oversampling should be greater than 1: $gamma")
+ require(colMags.size == this.numCols(), "Number of magnitudes didn't match column dimension")
+
+ val sg = math.sqrt(gamma) // sqrt(gamma) used many times
+
+ val sims = rows.flatMap { row =>
+ val buf = new ListBuffer[((Int, Int), Double)]()
+ row.toBreeze.activeIterator.foreach {
+ case (_, 0.0) => // Skip explicit zero elements.
+ case (i, iVal) =>
+ val rand = new scala.util.Random(iVal.toLong)
@mengxr
mengxr Sep 16, 2014 Contributor

It won't give you a pseudo random sequence. Think about the case when all values are the same. The seed should be set on a partition level. Could you update this block with mapPartitionsWithIndex?

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Updated to use mapPartitionsWithIndex with the partition index as the seed
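The seeding issue can be sketched without Spark. In the hypothetical example below, partitions are plain sequences and `zipWithIndex` stands in for `mapPartitionsWithIndex`. One generator is created per partition and seeded with the partition index, so rows with identical values still receive different draws and reruns are reproducible. All names are illustrative.

```scala
import scala.util.Random

object PartitionSeeding {
  /** Keep each element with probability keepProb, using one RNG per partition. */
  def sampleRows(partitions: Seq[Seq[Double]], keepProb: Double): Seq[Seq[Double]] =
    partitions.zipWithIndex.map { case (part, index) =>
      val rand = new Random(index) // seeded once per partition, not once per value
      part.filter(_ => rand.nextDouble() < keepProb)
    }
}
```

Seeding from the element value, as in the earlier revision, makes every equal value produce the same first draw, which is what the review comment objects to.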

@mengxr mengxr commented on the diff Sep 16, 2014
...spark/mllib/stat/MultivariateStatisticalSummary.scala
@@ -53,4 +53,14 @@ trait MultivariateStatisticalSummary {
* Minimum value of each column.
*/
def min: Vector
+
+ /**
+ * Euclidean magnitude of each column
+ */
+ def normL2: Vector
+
+ /**
+ * L1 norm of each column
+ */
+ def normL1: Vector
@mengxr
mengxr Sep 16, 2014 Contributor

mean, stddev, and variance are summary statistics of the random variable. The 1-norm of a random variable should be E[|X|] instead of \sum_i |x_i|. See http://www.math.uah.edu/stat/expect/Spaces.html . I suggest changing the method names to norm1 and norm2 and outputting the average instead.

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

For general vectors and matrices, L1 and L2 norms are widely accepted as summaries of a vector and are standard linear algebra: http://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm

RowMatrix is a general matrix, not necessarily holding random variable samples in its columns.
By using the names normL1 and normL2 it is clear what we're talking about - there is no mistaking what L1 means. Whereas if I were to name them norm1 and use some averaging, I think the name is ambiguous and requires the user to know we're assuming the matrix holds random variable samples in its columns. There is no need to confuse the user to make an assumption we don't need.
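The two conventions under discussion differ only by normalization. A small sketch, using a hypothetical column (0, 3, 6, 9) chosen so that its L2 norm is the sqrt(126) seen in the test in this thread; the function names are illustrative and are not the MLlib API:

```scala
object NormConventions {
  // Linear-algebra convention: plain sums over the entries.
  def normL1(col: Seq[Double]): Double = col.map(x => math.abs(x)).sum
  def normL2(col: Seq[Double]): Double = math.sqrt(col.map(x => x * x).sum)

  // Random-variable convention: averages over the samples.
  def meanAbs(col: Seq[Double]): Double = normL1(col) / col.size
  def rootMeanSquare(col: Seq[Double]): Double = math.sqrt(col.map(x => x * x).sum / col.size)
}
```

For this column, `normL1` is 18 and `meanAbs` is 4.5, while `normL2` is sqrt(126) and `rootMeanSquare` is sqrt(126 / 4). The cosine similarity computation needs the sum-based L2 value.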

@mengxr mengxr and 1 other commented on an outdated diff Sep 16, 2014
...e/spark/mllib/linalg/distributed/RowMatrixSuite.scala
@@ -95,6 +95,40 @@ class RowMatrixSuite extends FunSuite with LocalSparkContext {
}
}
+ test("similar columns") {
+ val colMags = Vectors.dense(Math.sqrt(126), Math.sqrt(66), Math.sqrt(94))
+ val expected = BDM(
+ (0.0, 54.0, 72.0),
+ (0.0, 0.0, 78.0),
+ (0.0, 0.0, 0.0))
+
+ for(i <- 0 until n) for(j <- 0 until n) {
+ expected(i, j) /= (colMags(i) * colMags(j))
+ }
+
+ for (mat <- Seq(denseMat, sparseMat)) {
+ val G = mat.similarColumns(0.1).toBreeze()
+ for(i <- 0 until n) for(j <- 0 until n) {
@mengxr
mengxr Sep 16, 2014 Contributor

for (i <- 0 until n; j <- 0 until n) { (space after for and merge two for statements)

@rezazadeh
rezazadeh Sep 20, 2014 Contributor

Merged and added space.

@SparkQA
SparkQA commented Sep 16, 2014

QA tests have finished for PR 1778 at commit 25e9d0d.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA
SparkQA commented Sep 20, 2014

QA tests have started for PR 1778 at commit 0e4eda4.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Sep 20, 2014

QA tests have finished for PR 1778 at commit 0e4eda4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@debasish83

The colMags right now have sqrt(sum(column1)^2 + sum(column2)^2 + ... + sum(columnN)^2)
It will be good to have (sum(column1) + sum(column2) + ... + sum(columnN))

Can I add this logic in OnlineSummarizer normL1? I saw normL1 is not implemented right now... maybe this is the sum... This is the support number that's commonly used in market basket analysis... it will be very useful to have it... Right now I am reconstructing it using similarity_{ij} * colMags(i) * colMags(j), which can sometimes produce rounding errors.

@rezazadeh
Contributor

Why do you say normL1 is not implemented? I have implemented normL1 in MultivariateOnlineSummarizer, with tests. Do you want a version without absolute values? If so, just unnormalize mean, which is also in MultivariateOnlineSummarizer

@debasish83

ahh I saw it in the code now...that will do...no need for absolute values...numbers are positive for me..thanks..

@debasish83

The code looks good for sparse input, but for dense input is there any issue with using activeIterator? I understand that due to the DIMSUM threshold check we have to iterate over columns and rows independently, but if the data is dense, is it possible to have the DIMSUM threshold check higher up and use dense*dense BLAS?

@mengxr
Contributor
mengxr commented Sep 25, 2014

@rezazadeh I made some changes in a local branch: https://github.com/mengxr/spark/blob/rezazadeh-dimsumv2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala . Could you merge the latest master to your branch? Then it is easy to compare the diff. I changed the following:

  1. similarColumns -> columnSimilarities
  2. remove activeIterator and specialize for dense and sparse vectors
  3. cache the probabilities and denominators

Those should increase the performance by ~5x. But the shuffle is still expensive, because the records are very small.

Another question I have is on the sparsity of the result. I ran some tests locally and found that even with gamma = 1.0, the result is still dense (containing all (i, j) pairs) though the shuffle size is smaller.
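The idea of caching the probabilities and denominators can be sketched in plain Scala as follows. This is a local, single-machine illustration based on the sampling rule in the diff quoted earlier in this thread, not the code from either branch. `SparseRow`, the object name, and the use of a single seeded generator are assumptions made for the example.

```scala
import scala.util.Random

object DimsumSketch {
  // Hypothetical stand-in for a sparse row: (columnIndex, value) pairs.
  type SparseRow = Seq[(Int, Double)]

  def columnSimilarities(
      rows: Seq[SparseRow],
      colMags: Array[Double],
      gamma: Double,
      seed: Long): Map[(Int, Int), Double] = {
    val sg = math.sqrt(gamma)
    // Computed once per column rather than once per emitted pair.
    val p = colMags.map(c => math.min(1.0, sg / c)) // sampling probabilities
    val q = colMags.map(c => math.min(sg, c))       // denominators
    val rand = new Random(seed)
    // Mapper: sample entry i, then entry j, and emit the scaled product.
    val emitted = for {
      row <- rows
      (i, iVal) <- row
      if iVal != 0.0 && rand.nextDouble() < p(i)
      (j, jVal) <- row
      if i < j && jVal != 0.0 && rand.nextDouble() < p(j)
    } yield ((i, j), iVal * jVal / (q(i) * q(j)))
    // Reducer: plain summation per (i, j) key, upper triangle only.
    emitted.groupBy(_._1).map { case (pair, vs) => pair -> vs.map(_._2).sum }
  }
}
```

With an infinite gamma every probability is 1 and every denominator is the column magnitude, so the result is the exact cosine similarity. With a finite gamma, a pair (i, j) still appears in the output whenever at least one row survives sampling, which is consistent with the dense result observed above.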

@rezazadeh
Contributor

@mengxr Thanks for the optimizations. I merged the latest master into my branch and pushed to here. Would you like me to merge your branch into mine?

There is no guarantee on sparsity, only on lower shuffle size, so what you observed is exactly what is expected. We can allow the user to promote sparsity by allowing them to set the threshold to values greater than 1. In this case, we can't provably guarantee correctness, but it will be useful for users who have very dense matrices, because the guarantees are pessimistic and even after promoting sparsity the result will likely be useful to them.

With this in mind, I propose allowing the threshold to be above 1, and when it is, instead of giving an error, give a warning that results are not guaranteed correct, but the computational savings will be useful. Shall I do that?

@mengxr
Contributor
mengxr commented Sep 25, 2014

@rezazadeh I sent a PR to your repo at: rezazadeh#1 . Could you check the changes and merge it if they are correct (hopefully) and look good to you? Thanks!

Thanks for explaining how threshold affects the computation. I think it is nice to allow it to be above 1, given that it is documented accurately.

@rezazadeh
Contributor

@mengxr Merged in your changes and added ability for the threshold to be larger with a warning. Tests pass.

@SparkQA
SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit aea0247.

  • This patch does not merge cleanly!
@rezazadeh rezazadeh Merge remote-tracking branch 'upstream/master' into dimsumv2
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
3467cff
@SparkQA
SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 3467cff.

  • This patch merges cleanly.
@rezazadeh
Contributor

@mengxr I also added broadcasting of p and v to further optimize space usage. Also now we're avoiding divide by zero if there is a column with zero magnitude.

@SparkQA
SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit aea0247.

  • This patch fails unit tests.
  • This patch does not merge cleanly!
@SparkQA
SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 3467cff.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@mengxr
Contributor
mengxr commented Sep 26, 2014

test this please

@SparkQA
SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit ee8bd65.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit ee8bd65.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@rezazadeh
Contributor

Only the binary compatibility test is failing, which is expected.

@mengxr
Contributor
mengxr commented Sep 26, 2014

@rezazadeh Could you set the exclusion rules in dev/MimaExcludes.scala?

@SparkQA
SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 4eb71c6.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 4eb71c6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@rezazadeh
Contributor

Jenkins, test this please.

@SparkQA
SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 4eb71c6.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Sep 26, 2014

QA tests have finished for PR 1778 at commit 4eb71c6.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA
SparkQA commented Sep 26, 2014

QA tests have started for PR 1778 at commit 404c64c.

  • This patch merges cleanly.
@SparkQA
SparkQA commented Sep 27, 2014

QA tests have finished for PR 1778 at commit 404c64c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@mengxr
Contributor
mengxr commented Sep 29, 2014

LGTM. Merged into master! Thanks @rezazadeh !

@asfgit asfgit pushed a commit that closed this pull request Sep 29, 2014
@rezazadeh @mengxr rezazadeh + mengxr [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity
# All-pairs similarity via DIMSUM
Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach.

Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.

The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why in expectation, the map-reduce below outputs cosine similarities.

![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png)

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467

[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082
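The exercise above can be checked numerically. The Python sketch below is not the code from this PR; it assumes the sampling rule from the cited papers as I read it (keep entry `a_ij` with probability `min(1, sqrt(gamma)/||c_j||)` and divide emitted products by `min(sqrt(gamma), ||c_j||)` for each column involved), works on small dense matrices with no all-zero columns, and uses hypothetical function names throughout.

```python
import math
import random


def column_norms(rows):
    """L2 norm of every column of a dense, row-major matrix."""
    n = len(rows[0])
    return [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n)]


def cosine_similarities(rows):
    """Exact cosine similarity for every column pair j < k (no zero columns)."""
    norms = column_norms(rows)
    n = len(norms)
    return {
        (j, k): sum(row[j] * row[k] for row in rows) / (norms[j] * norms[k])
        for j in range(n)
        for k in range(j + 1, n)
    }


def _tables(norms, gamma):
    # Assumed sampling probabilities p and scaling divisors q per column.
    sg = math.sqrt(gamma)
    return [min(1.0, sg / c) for c in norms], [min(sg, c) for c in norms]


def dimsum_expected(rows, gamma):
    """Exact expectation of the reducer output for every pair j < k."""
    p, q = _tables(column_norms(rows), gamma)
    n = len(p)
    return {
        (j, k): sum(p[j] * p[k] * row[j] * row[k] / (q[j] * q[k]) for row in rows)
        for j in range(n)
        for k in range(j + 1, n)
    }


def dimsum_sample(rows, gamma, rng):
    """One randomized run: the mapper emits per row, the reducer sums per key."""
    p, q = _tables(column_norms(rows), gamma)
    out = {}
    for row in rows:
        for j, a_ij in enumerate(row):
            if a_ij == 0.0 or rng.random() >= p[j]:
                continue  # entry a_ij was not sampled
            for k in range(j + 1, len(row)):
                a_ik = row[k]
                if a_ik != 0.0 and rng.random() < p[k]:
                    key = (j, k)
                    out[key] = out.get(key, 0.0) + a_ij * a_ik / (q[j] * q[k])
    return out
```

Under these assumptions `p[j] / q[j]` equals `1 / ||c_j||` whether or not the column norm exceeds `sqrt(gamma)`, so `dimsum_expected(rows, gamma)` matches `cosine_similarities(rows)` for any positive `gamma`. Calling `dimsum_sample(rows, math.inf, random.Random(0))` keeps every entry and reproduces the brute-force result.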

# Testing

Tests for all invocations included.

Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.
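For reference, the two column-wise norms can be accumulated in a single pass over the rows. This is a plain-Python sketch with a hypothetical function name, not the summarizer code added in this PR (the commit list below refers to the new fields as `normL1` and `normL2`).

```python
import math


def column_norms_l1_l2(rows):
    """Single pass over dense rows; returns (L1 norms, L2 norms) per column."""
    n = len(rows[0])
    abs_sum = [0.0] * n  # running sum of |a_ij| for each column j
    sq_sum = [0.0] * n   # running sum of a_ij^2 for each column j
    for row in rows:
        for j, value in enumerate(row):
            abs_sum[j] += abs(value)
            sq_sum[j] += value * value
    return abs_sum, [math.sqrt(s) for s in sq_sum]
```

The L2 norms are the column magnitudes that the similarity computation normalizes by.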

Author: Reza Zadeh <rizlar@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits:

404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2
ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero.
3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity
9fe17c0 [Xiangrui Meng] organize imports
2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2
254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
f2947e4 [Xiangrui Meng] some optimization
3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2
0e4eda4 [Reza Zadeh] Use partition index for RNG
251bb9c [Reza Zadeh] Documentation
25e9d0d [Reza Zadeh] Line length for style
fb296f6 [Reza Zadeh] renamed to normL1 and normL2
3764983 [Reza Zadeh] Documentation
e9c6791 [Reza Zadeh] New interface and documentation
613f261 [Reza Zadeh] Column magnitude summary
75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle
0f12ade [Reza Zadeh] Style changes
eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max
f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer
dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix
41e8ece [Reza Zadeh] style changes
139c8e1 [Reza Zadeh] Syntax changes
029aa9c [Reza Zadeh] javadoc and new test
75edb25 [Reza Zadeh] All tests passing!
05e59b8 [Reza Zadeh] Add test
502ce52 [Reza Zadeh] new interface
654c4fb [Reza Zadeh] default methods
3726ca9 [Reza Zadeh] Remove MatrixAlgebra
6bebabb [Reza Zadeh] remove changes to MatrixSuite
5b8cd7d [Reza Zadeh] Initial files
587a0cd
@asfgit asfgit closed this in 587a0cd Sep 29, 2014
@rezazadeh
Contributor

Thanks for the review @mengxr !

@rezazadeh rezazadeh deleted the rezazadeh:dimsumv2 branch Sep 30, 2014
@dgshep dgshep pushed a commit to dgshep/spark that referenced this pull request Dec 8, 2014
@rezazadeh rezazadeh + Davis Shepherd [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity
0b638a9
@appierys

Does anyone know how to extend this to the 'Cross Product' case as mentioned in the paper?
