[SPARK-3974][MLlib] Distributed Block Matrix Abstractions #3200

brkyvz · 2014-11-11T05:38:57Z

This pull request includes the abstractions for the distributed BlockMatrix representation.
BlockMatrix lets users store very large matrices as small blocks of local matrices. Specialized partitioners, such as RowBasedPartitioner and ColumnBasedPartitioner, are implemented to optimize the addition and multiplication operations that will be added in a follow-up PR.
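
To make the idea of storing a matrix as small local blocks concrete, here is a minimal sketch in plain Scala with no Spark dependency. `BlockSplitSketch`, `Block`, and `splitIntoBlocks` are hypothetical names for illustration and are not the classes added by this PR; the sketch only assumes a dense, row-major local matrix.

```scala
// Hypothetical illustration only: not the BlockMatrix implementation from this PR.
object BlockSplitSketch {
  /** A sub-matrix tagged with its position in the block grid (illustrative names). */
  final case class Block(blockRowIndex: Int, blockColIndex: Int, values: Array[Array[Double]])

  /** Cuts a row-major local matrix into blocks of at most rowsPerBlock x colsPerBlock. */
  def splitIntoBlocks(
      matrix: Array[Array[Double]],
      rowsPerBlock: Int,
      colsPerBlock: Int): Seq[Block] = {
    require(rowsPerBlock > 0 && colsPerBlock > 0, "block dimensions must be positive")
    val numRows = matrix.length
    val numCols = if (numRows == 0) 0 else matrix(0).length
    for {
      (rowStart, i) <- (0 until numRows by rowsPerBlock).zipWithIndex
      (colStart, j) <- (0 until numCols by colsPerBlock).zipWithIndex
    } yield {
      // slice clamps at the array bounds, so edge blocks may be smaller.
      val rows = matrix.slice(rowStart, rowStart + rowsPerBlock)
      Block(i, j, rows.map(_.slice(colStart, colStart + colsPerBlock)))
    }
  }
}
```

In a distributed setting, each `Block` would presumably become one record of an RDD keyed by its block indices, which is what the partitioners discussed in this thread operate on.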

This work is based on the ml-matrix repo developed at the AMPLab at UC Berkeley, CA.
https://github.com/amplab/ml-matrix

Additional thanks to @rezazadeh, @shivaram, and @mengxr for guidance on the design.

SparkQA · 2014-11-11T05:45:03Z

Test build #23196 has started for PR 3200 at commit f378e16.

This patch merges cleanly.

SparkQA · 2014-11-11T07:10:11Z

Test build #23196 has finished for PR 3200 at commit f378e16.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class BlockPartition(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable
- case class BlockPartitionInfo(
- abstract class BlockMatrixPartitioner(
- class GridPartitioner(
- class RowBasedPartitioner(
- class ColumnBasedPartitioner(
- class BlockMatrix(

AmplabJenkins · 2014-11-11T07:10:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23196/
Test PASSed.

SparkQA · 2014-11-11T19:37:38Z

Test build #23220 has started for PR 3200 at commit aa8f086.

This patch merges cleanly.

SparkQA · 2014-11-11T21:04:17Z

Test build #23220 has finished for PR 3200 at commit aa8f086.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-11T21:04:21Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23220/
Test PASSed.

mengxr · 2014-11-13T22:41:22Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala

+ * @param blockIdCol The column index of this block
+ * @param mat The underlying local matrix
+ */
+case class BlockPartition(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable


The name BlockPartition is a little confusing. Is it a partition? Do we allow multiple blocks per partition? If this is just talking about a block, we can call it Submatrix (see: http://en.wikipedia.org/wiki/Block_matrix).

Should the name be blockRowIndex instead of blockIdRow? Id is not the same as Index.

mengxr · 2014-11-13T22:49:11Z

@brkyvz If we have two block matrices, A and B, and A's column block partitioning matches B's row block partitioning, can we take advantage of this fact in computing A * B? I support having only one block matrix partitioner implementation. Then we do the following:

if (A.partitioner.colBlockPartitioner == B.partitioner.rowBlockPartitioner) {
  // zip ...
} else {
  ...
}
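
For readers following along, the pairing that a block multiplication needs can be sketched on local collections. This is only an illustration of the general (non-zip) branch: blocks of A are matched with blocks of B by the shared inner block index, the way a key-based join would match them. `BlockMultiplySketch` and everything inside it are hypothetical names, not part of this PR or of Spark, and the zip fast path is not shown because it depends on how the RDDs are partitioned.

```scala
// Hypothetical illustration only: local collections stand in for distributed blocks.
object BlockMultiplySketch {
  type Mat = Array[Array[Double]]
  type Blocks = Map[(Int, Int), Mat] // key: (blockRowIndex, blockColIndex)

  /** Plain dense product of two local blocks. */
  def multiplyLocal(a: Mat, b: Mat): Mat = {
    require(a.forall(_.length == b.length), "inner dimensions must agree")
    val cols = if (b.isEmpty) 0 else b(0).length
    Array.tabulate(a.length, cols) { (i, j) =>
      b.indices.map(k => a(i)(k) * b(k)(j)).sum
    }
  }

  /** Element-wise sum of two blocks of equal shape. */
  def addLocal(x: Mat, y: Mat): Mat =
    x.zip(y).map { case (r1, r2) => r1.zip(r2).map { case (u, v) => u + v } }

  /** C(i, j) = sum over k of A(i, k) * B(k, j), pairing blocks by the inner index k. */
  def multiply(a: Blocks, b: Blocks): Blocks = {
    // Index B's blocks by block-row so each A block finds its partners by key.
    val bByRow: Map[Int, Seq[((Int, Int), Mat)]] =
      b.toSeq.groupBy { case ((k, _), _) => k }
    val partials = for {
      ((i, k), aBlock) <- a.toSeq
      ((_, j), bBlock) <- bByRow.getOrElse(k, Seq.empty)
    } yield ((i, j), multiplyLocal(aBlock, bBlock))
    // Sum the partial products that land on the same output block.
    partials.groupBy(_._1).map { case (key, ps) => key -> ps.map(_._2).reduce(addLocal) }
  }
}
```

When A's column blocks and B's row blocks are already laid out identically, the lookup by key could in principle be replaced by a positional pairing, which is the optimization the question above is about.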

rezazadeh · 2014-11-13T22:49:50Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala

+  override def numCols(): Long = dims._2
+
+  if (partitioner.name.equals("column")) {
+    require(numColBlocks == partitioner.numPartitions, "The number of column blocks should match" +


Should the error message output the non-equal parameters here?
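
One way this suggestion could look is sketched below. `RequireMessageSketch` and `checkColumnBlocks` are hypothetical, and the message wording is invented for illustration rather than taken from the patch; the only library behavior relied on is that Scala's `require` throws an `IllegalArgumentException` whose message starts with "requirement failed: ".

```scala
// Hypothetical illustration only: the wording is not the message used in the patch.
object RequireMessageSketch {
  /** Fails with both values in the message when the two counts differ. */
  def checkColumnBlocks(numColBlocks: Int, numPartitions: Int): Unit =
    require(
      numColBlocks == numPartitions,
      s"Column block count must equal partition count, but got " +
        s"numColBlocks: $numColBlocks, numPartitions: $numPartitions")
}
```

Including both values means a failing job reports which side was wrong without the user having to re-run with extra logging.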

brkyvz · 2014-11-13T22:58:03Z

@mengxr

> If we have two block matrices, A and B, and A's column block partitioning matches B's row block partitioning, can we take advantage of this fact in computing A * B? I support having only one block matrix partitioner implementation. Then we do the following:
>
> if (A.partitioner.colBlockPartitioner == B.partitioner.rowBlockPartitioner) {
>   // zip ...
> } else {
>   ...
> }

By partitioner.rowBlockPartitioner and partitioner.colBlockPartitioner, do you mean checking that both the number of blocks along the rows and the number of rows per block match?

One problem with zip was that I couldn't guarantee data locality. I tried to force it, but the best way to force it turns out to be a join...

SparkQA · 2014-11-14T19:20:02Z

Test build #23378 has started for PR 3200 at commit 589fbb6.

This patch merges cleanly.

SparkQA · 2014-11-14T19:27:41Z

Test build #23379 has started for PR 3200 at commit 19c17e8.

This patch merges cleanly.

SparkQA · 2014-11-14T20:11:29Z

Test build #23378 has finished for PR 3200 at commit 589fbb6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SubMatrix(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable
- case class SubMatrixInfo(
- abstract class BlockMatrixPartitioner(
- class GridPartitioner(
- class RowBasedPartitioner(
- class ColumnBasedPartitioner(
- class BlockMatrix(

AmplabJenkins · 2014-11-14T20:11:39Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23378/
Test FAILed.

SparkQA · 2014-11-14T20:19:01Z

Test build #23379 has finished for PR 3200 at commit 19c17e8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-14T20:19:04Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23379/
Test FAILed.

SparkQA · 2014-11-14T20:22:41Z

Test build #23382 has started for PR 3200 at commit b05aabb.

This patch merges cleanly.

mengxr · 2015-01-27T02:01:55Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala

+  }
+
+  /** Partitions sub-matrices as blocks with neighboring sub-matrices. */
+  private def getBlockId(blockRowIndex: Int, blockColIndex: Int): Int = {


Should it be called getPartition or getPartitionId?
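
As context for the naming question, the mapping under discussion takes a block's grid coordinates and returns the partition that holds it, so that neighboring sub-matrices share a partition. The sketch below is a hypothetical, standalone version of such a mapping; `GridPartitionSketch`, `Grid`, and the parameters `rowsPerPart` and `colsPerPart` are assumptions for illustration and not necessarily the code in this patch.

```scala
// Hypothetical illustration only: a standalone grid-to-partition mapping.
object GridPartitionSketch {
  /**
   * @param numRowBlocks number of blocks along the rows of the block grid
   * @param numColBlocks number of blocks along the columns of the block grid
   * @param rowsPerPart  how many block rows are grouped into one partition
   * @param colsPerPart  how many block columns are grouped into one partition
   */
  final case class Grid(numRowBlocks: Int, numColBlocks: Int, rowsPerPart: Int, colsPerPart: Int) {
    require(numRowBlocks > 0 && numColBlocks > 0, "grid dimensions must be positive")
    require(rowsPerPart > 0 && colsPerPart > 0, "partition dimensions must be positive")

    // Integer ceiling division: partitions needed along each axis.
    private val rowPartitions = (numRowBlocks + rowsPerPart - 1) / rowsPerPart
    private val colPartitions = (numColBlocks + colsPerPart - 1) / colsPerPart

    val numPartitions: Int = rowPartitions * colPartitions

    /** Maps block coordinates to a partition id in [0, numPartitions). */
    def getPartition(blockRowIndex: Int, blockColIndex: Int): Int = {
      require(0 <= blockRowIndex && blockRowIndex < numRowBlocks,
        s"blockRowIndex $blockRowIndex out of range [0, $numRowBlocks)")
      require(0 <= blockColIndex && blockColIndex < numColBlocks,
        s"blockColIndex $blockColIndex out of range [0, $numColBlocks)")
      blockRowIndex / rowsPerPart + (blockColIndex / colsPerPart) * rowPartitions
    }
  }
}
```

With a 4 x 4 block grid and 2 x 2 blocks per partition, blocks (0, 0) and (1, 1) land in the same partition, which is the "neighboring sub-matrices" behavior the doc comment describes.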

SparkQA · 2015-01-27T19:42:54Z

Test build #26178 has started for PR 3200 at commit 5eecd48.

This patch merges cleanly.

SparkQA · 2015-01-27T20:53:35Z

Test build #26178 has finished for PR 3200 at commit 5eecd48.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DenseMatrix(
- class BlockMatrix(

AmplabJenkins · 2015-01-27T20:53:39Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26178/
Test PASSed.

SparkQA · 2015-01-28T15:47:50Z

Test build #26225 has started for PR 3200 at commit a8eace2.

This patch merges cleanly.

brkyvz · 2015-01-28T15:59:30Z

@mengxr I don't know if rows and cols will be confusing in terms of naming in GridPartitioner...
However, since it is private and internal, maybe it's not that big of a problem?

SparkQA · 2015-01-28T16:56:48Z

Test build #26225 has finished for PR 3200 at commit a8eace2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class BlockMatrix(

AmplabJenkins · 2015-01-28T16:56:52Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26225/
Test PASSed.

mengxr · 2015-01-28T18:07:20Z

LGTM. Merged into master. Thanks!

Burak Yavuz added 2 commits November 10, 2014 19:58

Ready for Pull request

b693209

[SPARK-3974] Block Matrix Abstractions ready

f378e16

[SPARK-3974] Additional comments added

aa8f086

mengxr reviewed Nov 13, 2014

rezazadeh reviewed Nov 13, 2014

[SPARK-3974] Code review feedback addressed

589fbb6

[SPARK-3974] Changed blockIdRow and blockIdCol

19c17e8

[SPARK-3974] Updated tests to reflect changes

b05aabb

brkyvz added 2 commits November 14, 2014 12:35

[SPARK-3974] Pull latest master

645afbe

[SPARK-3974] Updated testing utils from master

49b9586

mengxr reviewed Jan 27, 2015

brkyvz added 3 commits January 26, 2015 22:48

almost finished addressing comments

1694c9e

Merge branch 'master' of github.com:apache/spark into SPARK-3974

140f20e

fixed gridPartitioner and added tests

5eecd48

mengxr and others added 4 commits January 27, 2015 23:15

update grid partitioner

24ec7b8

minor updates

e1d3ee8

update tests

feb32a7

Merge pull request #2 from mengxr/brkyvz-SPARK-3974

a8eace2

Simplify GridPartitioner partitioning

asfgit closed this in eeb53bf Jan 28, 2015

brkyvz deleted the SPARK-3974 branch January 30, 2015 20:47
