-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3974][MLlib] Distributed Block Matrix Abstractions #3200
Conversation
Test build #23196 has started for PR 3200 at commit
|
Test build #23196 has finished for PR 3200 at commit
|
Test PASSed. |
Test build #23220 has started for PR 3200 at commit
|
Test build #23220 has finished for PR 3200 at commit
|
Test PASSed. |
* @param blockIdCol The column index of this block | ||
* @param mat The underlying local matrix | ||
*/ | ||
case class BlockPartition(blockIdRow: Int, blockIdCol: Int, mat: DenseMatrix) extends Serializable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name BlockPartition
is a little confusing. Is it a partition? Do we allow multiple blocks per partition? If this is just talking about a block, we can call it Submatrix
(see: http://en.wikipedia.org/wiki/Block_matrix).
Should the name be blockRowIndex
instead of blockIdRow
? Id
is not the same as Index
.
@brkyvz If we have two block matrices, A and B, and A's column block partitioning matches B's row block partitioning, can we take advantage of this fact in computing A * B? I support having only one block matrix partitioner implementation. Then we do the following:
|
override def numCols(): Long = dims._2 | ||
|
||
if (partitioner.name.equals("column")) { | ||
require(numColBlocks == partitioner.numPartitions, "The number of column blocks should match" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Output the non-equal parameters here?
By One problem with zip was that I couldn't guarantee data locality. I tried to force it, but the best way to force it turns out to be a join... |
Test build #23378 has started for PR 3200 at commit
|
Test build #23379 has started for PR 3200 at commit
|
Test build #23378 has finished for PR 3200 at commit
|
Test FAILed. |
Test build #23379 has finished for PR 3200 at commit
|
Test FAILed. |
Test build #23382 has started for PR 3200 at commit
|
} | ||
|
||
/** Partitions sub-matrices as blocks with neighboring sub-matrices. */ | ||
private def getBlockId(blockRowIndex: Int, blockColIndex: Int): Int = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be called getPartition
or getPartitionId
?
Test build #26178 has started for PR 3200 at commit
|
Test build #26178 has finished for PR 3200 at commit
|
Test PASSed. |
Simplify GridPartitioner partitioning
Test build #26225 has started for PR 3200 at commit
|
@mengxr I don't know if |
Test build #26225 has finished for PR 3200 at commit
|
Test PASSed. |
LGTM. Merged into master. Thanks! |
This pull request includes the abstractions for the distributed BlockMatrix representation.
BlockMatrix
will allow users to store very large matrices in small blocks of local matrices. Specific partitioners, such asRowBasedPartitioner
andColumnBasedPartitioner
, are implemented in order to optimize addition and multiplication operations that will be added in a following PR.This work is based on the ml-matrix repo developed at the AMPLab at UC Berkeley, CA.
https://github.com/amplab/ml-matrix
Additional thanks to @rezazadeh, @shivaram, and @mengxr for guidance on the design.