-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-3919][flink-ml] Distributed Linear Algebra: row-based matrix #1996
Conversation
Hi @chobeat, thanks for opening pull request. I would like to shepherd this PR. |
trait DistributedMatrix { | ||
|
||
def getNumRows: Int | ||
def getNumCols: Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing the return type of getNumRows
and getNumCols
to DataSet[Long]
? Returning number of elements directly in distributed data in Flink is expensive operation. We can use broadcast variable to transfer the numbers to future operation.
There are some public methods ( |
} | ||
|
||
case class IndexedRow(rowIndex: Int, values: Vector) | ||
extends Ordered[IndexedRow] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does we need to extend Ordered
trait? It seems unnecessary but if it is necessary, please explain reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was the most elegant and generic way to sort a list of IndexedRow. If it's not the suggested way in Flink's codebase, I can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, okay. Let's leave it.
Double format: Int + DataSet[Int]
|
||
def compare(other: IndexedRow) = this.rowIndex.compare(other.rowIndex) | ||
|
||
override def toString: String = s"($rowIndex,${values.toString}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a closing parentheses
Hi @chobeat, thanks for update PR. After addressing comments on source code, I think the last thing to merge this is adding documentation for this. But you can add the documentation after block-based matrix is merged. |
@chiwanpark Yeah I thought I could write the documentation as a third PR but I would like to review the block matrix first because it may change in structure. Anyway I will soon begin working on a doc page with the general structure and some examples for the row-based matrix. |
@chiwanpark before merging I need to go over the numRows/numCols issue again because I noticed they create problems in another project of mine. I think that the |
@chiwanpark I think it should be better to leave to the user the computation of the dimensionality. I tried different options and all of them are sub-optimal. I would leave this feature for later as a next step if it's ok. |
@chobeat Okay. Then we should force user to calculate dimensionality of matrix by changing type of number parameters in constructor. |
|
||
object DistributedRowMatrix { | ||
|
||
type MatrixRowIndex = Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for changing type of matrix row index in future? But there is no usage in other parts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, I tried to stay general but I'm not sure I've been disciplined in using MatrixRowIndex instead of Int so maybe I should remove it and leave to an eventual refactoring the generalization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to replace Int
used for row index to MatrixRowIndex
. It makes changing type of row index in future easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the same for columns and I moved the definition to the DistributedMatrix
trait.
* @param other | ||
* @return | ||
*/ | ||
def sum(other: DistributedRowMatrix): DistributedRowMatrix = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is add
more proper name for this method? Please do not update this PR if you agree with me but just notify me because I'm rebasing this PR on current master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reply probably got lost because I posted it on Jira instead of Github, sorry.
"Umh, probably you're right. I checked breeze and they use addition for matrix addition and sum for element-wise sum."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not your fault. :-) It seems my fault. Sorry. Let's merge this to master.
Looks good to me, +1. I'll merge this in few hours. But I think we should change the type of row index to |
First PR of the Distributed Linear Algebra contribution.
It contains a minimal implementation of a row-based distributed matrix and the following operations:
Build from COO
Conversion from distributed to:
by-row UDF on two matrices
sum
subtraction