[SPARK-3974][MLlib] Distributed Block Matrix Abstractions #3200

Closed
brkyvz wants to merge 24 commits

brkyvz
Contributor

@brkyvz brkyvz commented Nov 11, 2014

This pull request adds the abstractions for the distributed BlockMatrix representation.
BlockMatrix lets users store very large matrices as small blocks of local matrices. Dedicated partitioners, such as RowBasedPartitioner and ColumnBasedPartitioner, are implemented to optimize the addition and multiplication operations that will be added in a follow-up PR.

This work is based on the ml-matrix repo developed at the AMPLab at UC Berkeley, CA.
https://github.com/amplab/ml-matrix

Additional thanks to @rezazadeh, @shivaram, and @mengxr for guidance on the design.
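
As an illustration of the block layout described above, here is a minimal, Spark-free sketch in plain Scala. The names `BlockSplitSketch`, `Block`, and `split` are hypothetical and are not taken from this patch; the sketch only shows how a matrix could be cut into small local blocks tagged with their grid coordinates, assuming row-major storage.

```scala
object BlockSplitSketch {
  // Hypothetical stand-in for one block: grid coordinates plus a small
  // row-major local matrix.
  final case class Block(blockRow: Int, blockCol: Int, rows: Int, cols: Int, values: Vector[Double])

  /** Cuts a row-major numRows x numCols matrix into blocks of at most
    * rowsPerBlock x colsPerBlock entries. Blocks on the right and bottom
    * edges may be smaller. */
  def split(data: Vector[Double], numRows: Int, numCols: Int,
            rowsPerBlock: Int, colsPerBlock: Int): Seq[Block] = {
    require(data.length == numRows * numCols,
      s"expected ${numRows * numCols} values, got ${data.length}")
    require(rowsPerBlock > 0 && colsPerBlock > 0, "block dimensions must be positive")
    // Ceiling division: number of blocks along each dimension.
    val numRowBlocks = (numRows + rowsPerBlock - 1) / rowsPerBlock
    val numColBlocks = (numCols + colsPerBlock - 1) / colsPerBlock
    for {
      bi <- 0 until numRowBlocks
      bj <- 0 until numColBlocks
    } yield {
      val r0 = bi * rowsPerBlock                  // first matrix row of this block
      val c0 = bj * colsPerBlock                  // first matrix column of this block
      val r = math.min(rowsPerBlock, numRows - r0)
      val c = math.min(colsPerBlock, numCols - c0)
      val values = for { i <- 0 until r; j <- 0 until c } yield data((r0 + i) * numCols + (c0 + j))
      Block(bi, bj, r, c, values.toVector)
    }
  }
}
```

For example, splitting a 3 x 4 matrix into 2 x 2 blocks yields four blocks, with the bottom row of blocks holding a single matrix row each.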

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23196 has started for PR 3200 at commit f378e16.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23196 has finished for PR 3200 at commit f378e16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BlockPartition(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable
    • case class BlockPartitionInfo(
    • abstract class BlockMatrixPartitioner(
    • class GridPartitioner(
    • class RowBasedPartitioner(
    • class ColumnBasedPartitioner(
    • class BlockMatrix(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23196/

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23220 has started for PR 3200 at commit aa8f086.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23220 has finished for PR 3200 at commit aa8f086.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23220/

* @param blockIdCol The column index of this block
* @param mat The underlying local matrix
*/
case class BlockPartition(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable
Contributor

The name BlockPartition is a little confusing. Is it a partition? Do we allow multiple blocks per partition? If this is just talking about a block, we can call it Submatrix (see: http://en.wikipedia.org/wiki/Block_matrix).

Should the name be blockRowIndex instead of blockIdRow? Id is not the same as Index.

@mengxr
Contributor

mengxr commented Nov 13, 2014

@brkyvz If we have two block matrices, A and B, and A's column block partitioning matches B's row block partitioning, can we take advantage of this fact in computing A * B? I support having only one block matrix partitioner implementation. Then we do the following:

if (A.partitioner.colBlockPartitioner == B.partitioner.rowBlockPartitioner) {
  // zip ...
} else {
  ...
}
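
To make the idea above concrete, here is a small local sketch of the aligned case, written in plain Scala without Spark. All names (`BlockMultiplySketch`, `Block`, `Blocked`, `multiplyAligned`) are hypothetical and do not come from this patch. It assumes every grid position holds a block and that there is at least one block along the shared dimension. When A's column block sizes equal B's row block sizes, block (i, k) of A pairs directly with block (k, j) of B; otherwise the sketch returns None, standing in for the branch that would need re-blocking.

```scala
object BlockMultiplySketch {
  // Hypothetical local block: row-major values with explicit dimensions.
  final case class Block(rows: Int, cols: Int, values: Vector[Double])

  // Hypothetical block matrix: blocks keyed by (blockRow, blockCol), plus the
  // size of each row block and each column block.
  final case class Blocked(rowSizes: Vector[Int], colSizes: Vector[Int],
                           blocks: Map[(Int, Int), Block])

  // Plain dense multiply of two local blocks.
  private def multiply(a: Block, b: Block): Block = {
    require(a.cols == b.rows, s"inner dimensions differ: ${a.cols} vs ${b.rows}")
    val out = for { i <- 0 until a.rows; j <- 0 until b.cols } yield
      (0 until a.cols).map(k => a.values(i * a.cols + k) * b.values(k * b.cols + j)).sum
    Block(a.rows, b.cols, out.toVector)
  }

  // Element-wise sum of two blocks of the same shape.
  private def add(x: Block, y: Block): Block =
    Block(x.rows, x.cols, x.values.zip(y.values).map { case (p, q) => p + q })

  /** Returns Some(A * B) when A's column blocks line up with B's row blocks,
    * and None otherwise. */
  def multiplyAligned(a: Blocked, b: Blocked): Option[Blocked] =
    if (a.colSizes != b.rowSizes) None
    else {
      val blocks = for {
        i <- a.rowSizes.indices
        j <- b.colSizes.indices
      } yield {
        // Sum over the shared block index k of A(i, k) * B(k, j).
        val partials = a.colSizes.indices.map(k => multiply(a.blocks((i, k)), b.blocks((k, j))))
        ((i, j), partials.reduce(add))
      }
      Some(Blocked(a.rowSizes, b.colSizes, blocks.toMap))
    }
}
```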

override def numCols(): Long = dims._2

if (partitioner.name.equals("column")) {
require(numColBlocks == partitioner.numPartitions, "The number of column blocks should match" +
Contributor

Output the non-equal parameters here?
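
One way to follow this suggestion is to interpolate both values into the `require` message. The sketch below is only an assumption about what that could look like: `RequireMessageSketch` and `checkColumnBlocks` are hypothetical names, and the message wording is illustrative rather than the text used in the patch.

```scala
object RequireMessageSketch {
  /** Hypothetical check in the spirit of the reviewed line: the number of
    * column blocks must equal the number of partitions. On failure, the
    * message reports both values so the mismatch is visible at a glance. */
  def checkColumnBlocks(numColBlocks: Int, numPartitions: Int): Unit =
    require(numColBlocks == numPartitions,
      "Column block count and partition count differ. " +
        s"numColBlocks: $numColBlocks, numPartitions: $numPartitions")
}
```

`require` throws an `IllegalArgumentException` whose message is prefixed with `requirement failed: `, so the two values appear directly in the stack trace.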

@brkyvz
Contributor Author

brkyvz commented Nov 13, 2014

@mengxr

If we have two block matrices, A and B, and A's column block partitioning matches B's row block partitioning, can we take advantage of this fact in computing A * B? I support having only one block matrix partitioner implementation. Then we do the following:

if (A.partitioner.colBlockPartitioner == B.partitioner.rowBlockPartitioner) {
  // zip ...
} else {
  ...
}

By partitioner.rowBlockPartitioner and partitioner.colBlockPartitioner, do you mean checking that both the number of blocks that form the rows and the number of rows per block match?

One problem with zip was that I couldn't guarantee data locality. I tried to force it, but the best way to force it turns out to be a join...
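
The difference between the two pairing strategies can be sketched with local collections. This is only an analogy: it does not model Spark partitions or data locality, and `KeyedPairingSketch` and `joinByKey` are hypothetical names. It shows that positional pairing (zip) depends on both sides arriving in the same order, while pairing by block key (join) does not.

```scala
object KeyedPairingSketch {
  type BlockKey = (Int, Int)

  /** Inner join on block key: pairs values that share a key, regardless of
    * the order in which either side lists them. Assumes keys are unique
    * within each side. */
  def joinByKey[A, B](left: Seq[(BlockKey, A)],
                      right: Seq[(BlockKey, B)]): Map[BlockKey, (A, B)] = {
    val rightByKey = right.toMap
    left.flatMap { case (key, a) =>
      rightByKey.get(key).map(b => (key, (a, b)))
    }.toMap
  }
}
```

With `left` ordered as (0, 0), (0, 1) and `right` ordered as (0, 1), (0, 0), `left.zip(right)` pairs mismatched blocks, whereas `joinByKey(left, right)` pairs each block with its counterpart.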

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23378 has started for PR 3200 at commit 589fbb6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23379 has started for PR 3200 at commit 19c17e8.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23378 has finished for PR 3200 at commit 589fbb6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SubMatrix(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable
    • case class SubMatrixInfo(
    • abstract class BlockMatrixPartitioner(
    • class GridPartitioner(
    • class RowBasedPartitioner(
    • class ColumnBasedPartitioner(
    • class BlockMatrix(

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23378/

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23379 has finished for PR 3200 at commit 19c17e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23379/

@SparkQA

SparkQA commented Nov 14, 2014

Test build #23382 has started for PR 3200 at commit b05aabb.

  • This patch merges cleanly.

}

/** Partitions sub-matrices as blocks with neighboring sub-matrices. */
private def getBlockId(blockRowIndex: Int, blockColIndex: Int): Int = {
Contributor

Should it be called getPartition or getPartitionId?
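
For context on what such a method computes, here is a minimal sketch of a grid-style mapping from block coordinates to a partition id, in which each partition holds a rectangular tile of neighboring blocks. The object name, the parameter list, and the row-major numbering of partitions are assumptions made for illustration; the patch's own method body is not shown in this thread.

```scala
object GridPartitionSketch {
  /** Maps block (blockRowIndex, blockColIndex) to a partition id so that each
    * partition holds a tile of up to rowsPerPart x colsPerPart neighboring
    * blocks. Partitions are numbered row by row across the grid. */
  def getPartitionId(blockRowIndex: Int, blockColIndex: Int,
                     numRowBlocks: Int, numColBlocks: Int,
                     rowsPerPart: Int, colsPerPart: Int): Int = {
    require(rowsPerPart > 0 && colsPerPart > 0, "tile dimensions must be positive")
    require(0 <= blockRowIndex && blockRowIndex < numRowBlocks,
      s"blockRowIndex $blockRowIndex is outside [0, $numRowBlocks)")
    require(0 <= blockColIndex && blockColIndex < numColBlocks,
      s"blockColIndex $blockColIndex is outside [0, $numColBlocks)")
    // Number of tiles along the column dimension (ceiling division).
    val colParts = (numColBlocks + colsPerPart - 1) / colsPerPart
    (blockRowIndex / rowsPerPart) * colParts + blockColIndex / colsPerPart
  }
}
```

With a 4 x 4 grid of blocks and 2 x 2 tiles, blocks (0, 0) and (1, 1) land in partition 0, while block (3, 3) lands in partition 3.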

@SparkQA

SparkQA commented Jan 27, 2015

Test build #26178 has started for PR 3200 at commit 5eecd48.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 27, 2015

Test build #26178 has finished for PR 3200 at commit 5eecd48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DenseMatrix(
    • class BlockMatrix(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26178/

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26225 has started for PR 3200 at commit a8eace2.

  • This patch merges cleanly.

@brkyvz
Contributor Author

brkyvz commented Jan 28, 2015

@mengxr I'm not sure whether the names rows and cols will be confusing in GridPartitioner.
However, since it is private and internal, maybe it's not a big problem?

@SparkQA

SparkQA commented Jan 28, 2015

Test build #26225 has finished for PR 3200 at commit a8eace2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BlockMatrix(

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26225/

@mengxr
Contributor

mengxr commented Jan 28, 2015

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in eeb53bf Jan 28, 2015
@brkyvz brkyvz deleted the SPARK-3974 branch January 30, 2015 20:47
7 participants