Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-3919][flink-ml] Distributed Linear Algebra: row-based matrix #1996

Closed
wants to merge 10 commits into from

Conversation

chobeat
Copy link
Contributor

@chobeat chobeat commented May 17, 2016

First PR of the Distributed Linear Algebra contribution.

It contains a minimal implementation of a row-based distributed matrix and the following operations:

Build from COO

Conversion from distributed to:

  • local dense format
  • local sparse format
  • local COO format.

by-row UDF on two matrices
sum
subtraction

@chiwanpark
Copy link
Member

Hi @chobeat, thanks for opening pull request. I would like to shepherd this PR.

trait DistributedMatrix {

def getNumRows: Int
def getNumCols: Int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about changing the return type of getNumRows and getNumCols to DataSet[Long]? Returning number of elements directly in distributed data in Flink is expensive operation. We can use broadcast variable to transfer the numbers to future operation.

@chiwanpark
Copy link
Member

There are some public methods (sum, subtract, byRowOperation, etc.) without scaladoc. Please add scaladoc for the methods.

}

case class IndexedRow(rowIndex: Int, values: Vector)
extends Ordered[IndexedRow] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does we need to extend Ordered trait? It seems unnecessary but if it is necessary, please explain reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was the most elegant and generic way to sort a list of IndexedRow. If it's not the suggested way in Flink's codebase, I can remove it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, okay. Let's leave it.

Double format: Int +  DataSet[Int]

def compare(other: IndexedRow) = this.rowIndex.compare(other.rowIndex)

override def toString: String = s"($rowIndex,${values.toString}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a closing parentheses

@chiwanpark
Copy link
Member

Hi @chobeat, thanks for update PR. After addressing comments on source code, I think the last thing to merge this is adding documentation for this. But you can add the documentation after block-based matrix is merged.

@chobeat
Copy link
Contributor Author

chobeat commented Jun 1, 2016

@chiwanpark Yeah I thought I could write the documentation as a third PR but I would like to review the block matrix first because it may change in structure. Anyway I will soon begin working on a doc page with the general structure and some examples for the row-based matrix.

@chobeat
Copy link
Contributor Author

chobeat commented Jun 1, 2016

@chiwanpark before merging I need to go over the numRows/numCols issue again because I noticed they create problems in another project of mine. I think that the collect() there is too risky and obscure so I would like to refactor it. I'm working on it right now.

@chobeat
Copy link
Contributor Author

chobeat commented Jun 1, 2016

@chiwanpark I think it should be better to leave to the user the computation of the dimensionality. I tried different options and all of them are sub-optimal. I would leave this feature for later as a next step if it's ok.

@chiwanpark
Copy link
Member

@chobeat Okay. Then we should force user to calculate dimensionality of matrix by changing type of number parameters in constructor.


object DistributedRowMatrix {

type MatrixRowIndex = Int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for changing type of matrix row index in future? But there is no usage in other parts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, I tried to stay general but I'm not sure I've been disciplined in using MatrixRowIndex instead of Int so maybe I should remove it and leave to an eventual refactoring the generalization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to replace Int used for row index to MatrixRowIndex. It makes changing type of row index in future easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the same for columns and I moved the definition to the DistributedMatrix trait.

* @param other
* @return
*/
def sum(other: DistributedRowMatrix): DistributedRowMatrix = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is add more proper name for this method? Please do not update this PR if you agree with me but just notify me because I'm rebasing this PR on current master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reply probably got lost because I posted it on Jira instead of Github, sorry.

"Umh, probably you're right. I checked breeze and they use addition for matrix addition and sum for element-wise sum."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not your fault. :-) It seems my fault. Sorry. Let's merge this to master.

@chiwanpark
Copy link
Member

Looks good to me, +1. I'll merge this in few hours.

But I think we should change the type of row index to Long in near future. I think we can deal with the incompatibility problem with local matrices by assuming that the number of rows is less than Int.MaxValue and converting row index type from Long to Int in collecting time. I'll submit an issue related to this to JIRA after merge this.

@asfgit asfgit closed this in 63504a3 Jun 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants