Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYSTEMML-294] Print matrix capability #120

Closed
wants to merge 7 commits into from

Conversation

nakul02
Copy link
Member

@nakul02 nakul02 commented Apr 15, 2016

Can pass a matrix object to the print statement

  • More complex things like print(X[1:10, 1:10]) do not work. The correct
    sequence of instructions is not being generated.
  • Also Java style string concatenation does not work. print (x + X) does
    not work. (x=scalar, X=matrix).

I realize the matrix printing may not be the best format. Here is what it looks like for now:

For this DML code:

X = rand(rows=1000, cols=1000, min=0, max=4, pdf="uniform", sparsity=0.2)
Z = X[1:10,1:10]
print(Z)

screenshot

@deroneriksson, @niketanpansare, @mboehm7 - thoughts?

@nakul02
Copy link
Member Author

nakul02 commented Apr 15, 2016

Jenkins Build

@mboehm7
Copy link
Contributor

mboehm7 commented Apr 15, 2016

  1. Sparsity issue: First of all there seems to be an issue with the rand builtin as this matrix should never be represented in dense.
  2. Format: Further, I would recommend to always format each number to a fixed length string and always output it in dense (including 0s). By simply using the toString method (as shown in the screenshot) you would get the dense or sparse representation which might be harder to understand for users.
  3. This functionality should also automatically obtain a small subset of the matrix in order to avoid that users pull huge matrices into memory.

@deroneriksson
Copy link
Member

Very cool! I think there is tremendous potential here. Perhaps some additional delineation could help? Something like:

+--------+--------+--------+--------+--------+
| 1.0    | 2.0    | 2.1234 | 5.4321 | 1.0    |
+--------+--------+--------+--------+--------+
| 1.0    | 2.0    | 3.0    | 0.0    | 1.0    |
+--------+--------+--------+--------+--------+
| 1.0    | 2.0    | 1.0    | 1.0    | 1.0    |
+--------+--------+--------+--------+--------+
| 1.0    | 2.0    | 1.0    | 1.0    | 1.0    |
+--------+--------+--------+--------+--------+
| 1.0    | 2.0    | 1.0    | 1.0    | 1.0    |
+--------+--------+--------+--------+--------+

@mboehm7
Copy link
Contributor

mboehm7 commented Apr 16, 2016

just a correction wrt to my point (1) - this is fine since the memory estimate for dense is still smaller.

@Wenpei
Copy link
Contributor

Wenpei commented Apr 16, 2016

@nakul02 Cool! It's my pain point when work on dml.
I suggest with simple format parameter like
maxRow=10, maxCol=10, dense=true, decimal=3
Then we can get well format matrix for print.

@mboehm7
Copy link
Contributor

mboehm7 commented Apr 18, 2016

  1. Please also consider to implement it as matrix to string cast (with special handling of 1x1 matrices) which would resolve the previously discussed issue of print("result = "+X[1,1]) as well.

@nakul02
Copy link
Member Author

nakul02 commented Apr 18, 2016

Thanks for the feedback. I'll work on these suggestions.

@Wenpei - after speaking about your suggestion for maxRow=x, maxCol=y with @niketanpansare, IMO, it maybe better to express that as print (X[1:x, 1:y]) rather than explicitly specifying a parameter. I can make sense of what the decimal argument might do, but what does the dense argument do?

@deroneriksson - what do you think of @mboehm7's suggestion, do you still want the bounding boxes?

@mboehm7 - WRT to point 3 - Should we print out the entire matrix or only a fixed number of rows and columns. Do we want to buffer parts of the matrix as we print it out?

@mboehm7
Copy link
Contributor

mboehm7 commented Apr 18, 2016

ad 3) I would recommend to define a maxrow/col threshold (let's say 1k) - if the matrix is larger, we have to generate indexing operations.

@deroneriksson
Copy link
Member

@nakul02 I guess my thinking is that I would want it to be as visually clear and powerful as possible. It's actually probably hard to do without introducing something like a printMatrix function. For instance, I could see situations where those bounding boxes are nice (view in console), and I could also see where simple comma-separated values would be nice (copy/paste into Excel). There can be situations where conciseness is nice (and have only a single number to the right of the decimal), and other situations where precision is nice (show lots of numbers to the right of the decimal). Also, it might even be nice to have a dense/sparse option (as Wenpei mentioned), where sparse displays something like i,j,v format rather than a dense 'spreadsheet' look.

However, it's great to be able to display any kind of matrix representation via print(), since currently this has not been possible. This is a big win.

Perhaps both print() of a matrix (for basic situations) and a printMatrix() (for more advanced output) would be nice.

@Wenpei
Copy link
Contributor

Wenpei commented Apr 19, 2016

@nakul02
maxRow/maxCol is not range but more like a threshold as @mboehm7 said. We has default value but user can set it when necessary.
dense means "dense matrix", if it's false, means "sparse matrix" here we may want print as:
i, j, value

@bertholdreinwald
Copy link
Contributor

Printing large dense matrices will be too overwhelming. Users will always use range indexing to display parts of a large matrix, or other ops such as head, tail, firstCols, lastCols, etc. If users print "too much", then there should be some built-in guards to avoid thrashing. This is quite common across systems.
Printing large sparse matrices, i.e. ultra sparse matrices, in <i, j, v> format will be helpful if you don't know the range of where the values are. Yes, there are ways to find out, however having these convenience built-in capabilities will definitely improve the user experience.
Btw, will that apply to frames as well?

@nakul02
Copy link
Member Author

nakul02 commented Apr 19, 2016

@bertholdreinwald - For this PR, I am dealing with matrices only. I'd suggest we deal with frames in a separate JIRA.

@niketanpansare
Copy link
Contributor

niketanpansare commented Apr 19, 2016

I like the idea of toString() method that Matthias suggested and this method can be extended to Frames later:

X = rand(rows=10000, cols=10000, min=0, max=4, pdf="uniform", sparsity=0.2)
print(toString(X, format="csv", rows=1000, cols=10))

Internally, toString() method could have built-in guards that Berthold is suggesting. I like Matthias's idea of implicitly introducing indexing op, which performs following logic (but at hop level):

rlen = nrow(X)
if(nrow(X) > rowThreshold) {
   rlen = rowThreshold
}
clen = ncol(X)
if(ncol(X) > colThreshold) {
   clen = colThreshold
}
print(toString(X[1:rlen, 1:clen], ...))

The next question is why not do implicit conversion instead of toString() method. I like implicit conversion iff we can handle following DML code:

y = " " + X
X1 = y * X

Also, I would like to point out that at language level we have a string representation of matrix, i.e. matrix("1 2 3 4", rows=2 ,cols=2). By having a pretty printing builtin function like toString(), we don't have to worry about incorrect implicit conversions.

Though I don't have a strong opinion about format type, I would suggest sticking to format that we expose in read/write methods (i.e. "csv" for dense and "text" for sparse) rather than call then as dense or sparse OR we can add the new pretty printed matrix format into read/write as well.

@nakul02
Copy link
Member Author

nakul02 commented Apr 21, 2016

Based on the discussion here, offline discussion with @niketanpansare and some thought, IMHO, here are a two of ways of printing matrices:

1. Introduce DML functions - as.string, castAsString

X = rand(rows=1000, cols=1000, min=0, max=4, pdf="uniform", sparsity=0.2)
Z = as.string(X[1:10,1:10])
print(Z)

The as.string can have additional arguments that @Wenpei suggested (i.e. maxRows, maxCols, dense & decimel with default values for each)

Y = as.string(X, maxRows=10, maxCols=20, decimel=4)
print("The matrix is " + Y)

This method has the problem of explicitly converting matrices to strings before passing them to the print statement.
Since the entire system assumes that any binary operation with a matrix yields a matrix, checks for that would not have to be modified throughout the system.
Also as.string would extend to frames and other data structures as they come along.

2. Support it directly from the print statement (as is done in this PR)

This code would work directly because the builtin print function would know how to handle matrices.

X = rand(rows=1000, cols=1000, min=0, max=4, pdf="uniform", sparsity=0.2)
Z = X[1:10,1:10]
print(Z)

Additional arguments can be passed to the print statement like so

print (Z, maxRows=10, maxCols=5, dense=true)

This method would have the problem of changing the underlying systems assumption that any binary operation with a matrix yields a matrix.
The advantage is that it would "feel" more natural.

Thoughts?

@mboehm7
Copy link
Contributor

mboehm7 commented Apr 21, 2016

from my side a +1 for option one - this allows much more flexibility.

However, in terms of language-level integration, we need to think about better alternatives because as.logical / as.integer / as.double and the proposed as.string are pure value-type casts whereas the proposed functionality covers both value-type and data-type cast, i.e., double - matrix into string - scalar. Also, please under no circumstance introduce anything like castAsString - the existing castAsScalar is just kept for historic reasons (its introduction was a mistake but when we replaced it with as.scalar we had to keep it for backwards compatibility because it was already used by external users).

@mboehm7
Copy link
Contributor

mboehm7 commented Apr 21, 2016

one more note: although the scalar-string cast is a much better solution in terms of runtime flexibility - for convenience you might want to automatically compile it whenever somebody feeds a matrix directly into a print.

@deroneriksson
Copy link
Member

From a usability standpoint, it sure would be nice if the following didn't blow up and instead gave the user some type of useful output:

m=matrix("1 2 3 4", rows=2, cols=2);
print(m);
print('this too ' + m);
print(m + ' and this');
print('and ' + m + ' this too');

@bertholdreinwald
Copy link
Contributor

+1 for first option. Please make sure that formatting options, e.g. separator, EoL, can be provided for convenient import into Excel, etc. A one-dimensional sequence of numbers w/o formatting may not be helpful.

@dusenberrymw
Copy link
Contributor

dusenberrymw commented Apr 22, 2016

Let's please keep this as simple as possible at the language level, and not force usage in DML of anything like as.string(...) with a multitude of parameters just for simply printing a matrix (which will just be used for small scale data exploration and algorithm writing/debugging). It's already annoying enough for users to have to convert a 1x1 matrix X to a scalar with as.scalar(X) every time we want to print it or assign it to a scalar variable (i.e. when performing a full right index X[1,5]). That being said, we can definitely use the first option internally for cleanliness within the compiler.

@deroneriksson
Copy link
Member

deroneriksson commented Apr 22, 2016

Could we have both 1 (with no options) and 2?
print(m) simply prints out something (ANYTHING besides throwing an error), and if the user decides this is not sufficient, the user can do a print(as.string(m, maxRows=10, maxCols=20, decimal=4)) to specify some specifics?

@nakul02
Copy link
Member Author

nakul02 commented Apr 27, 2016

I've implemented an initial version of as.string.
It supports the following named arguments:

Argument Description Default
"rows" Maximum number of rows to print 100
"cols" Maximum number of columns to print 100
"decimal" Number after decimal to print, 0 padded 3
"sparse" Whether to print in sparse format, which is Row Index, Col Index, Value FALSE
"separator" Separator String between elements in a row or between the three columns in sparse Format " "
"lineseparator" Separator String between rows or between each line in sparse format "\n"

Here are some examples in use:

Program:

X = rand(rows=10, cols=10, min=0, max=4, pdf="uniform", sparsity=0.2)
print ("matrix \n" + as.string(X))

screen shot 2016-04-27 at 11 50 35 am

Program:

X = rand(rows=10, cols=10, min=0, max=4, pdf="uniform", sparsity=0.2)
x1 = as.string(X, decimal=5)
print ("matrix \n" + x1 )

screen shot 2016-04-27 at 12 03 51 pm

Program:

X = rand(rows=10, cols=10, min=0, max=4, pdf="uniform", sparsity=0.2)
x1 = as.string(X, rows=5, cols=5)
print ("matrix \n" + x1)

screen shot 2016-04-27 at 12 04 55 pm

Program:

X = rand(rows=10, cols=10, min=0, max=4, pdf="uniform", sparsity=0.2)
x1 = as.string(X, sparse=TRUE, rows=5, cols=10)
print ("matrix \n" + x1)

screen shot 2016-04-27 at 12 07 04 pm

Program:

X = rand(rows=10, cols=10, min=0, max=4, pdf="uniform", sparsity=0.2)
x1 = as.string(X, separator=" | ")
print ("matrix \n" + x1)

screen shot 2016-04-27 at 12 08 23 pm

Program:

X = rand(rows=10, cols=10, min=0, max=4, pdf="uniform", sparsity=0.2)
x1 = as.string(X, lineseparator=" | ", rows=4, cols=4)
print ("matrix \n" + x1)

screen shot 2016-04-27 at 12 09 19 pm

@nakul02 nakul02 changed the title [SYSTEMML-294_print_matrix][WIP|] Print matrix capability [SYSTEMML-294_print_matrix][WIP] Print matrix capability Apr 27, 2016
@deroneriksson
Copy link
Member

This is great!

@nakul02
Copy link
Member Author

nakul02 commented Apr 27, 2016

Thanks @deroneriksson.
@Wenpei, @mboehm7, @bertholdreinwald - thoughts?
@deroneriksson, @dusenberrymw - IMHO, we can have as.string automatically invoked when a print is called on a matrix, as a separate JIRA/PR.

@deroneriksson
Copy link
Member

@nakul02 I also really like your idea of the automatic invocation of as.string in a print call on a matrix as a separate PR. That way users have both 1) ease-of-use (print(m)) and 2) flexibility (print(as.string(m,...))).

@Wenpei
Copy link
Contributor

Wenpei commented Apr 28, 2016

@nakul02 LGTM

@bertholdreinwald
Copy link
Contributor

looks good. One last thought looking at your green on black MVS terminal output (saying that, does that date me), I couldn't help but count the rows and columns for identification. Would it be useful to include row# and col# to the output?, i.e.

0   1   2

0: 20 21 22
1: 10 32 33
2: 12 33 99

@nakul02
Copy link
Member Author

nakul02 commented Apr 28, 2016

@bertholdreinwald - I like the green on black, easy on the eyes :)

Just as a thought, if you intend on copy-pasting this into excel or another program, the prepended row numbers would be a hindrance and excel would provide that info anyways. But i leave the decision up to you. We could also make the output fancier by taking in more named parameters.

Another thing you might have noticed is that I begin row and column numbering from 0.
This may not be what a DML user is expecting since programmatically, they begin numbering from 1.
I can, and will fix this before the final commit.
Also each row ends with the separator string. Maybe I should change this, I probably will, unless you or someone else thinks its not a good idea.

@mboehm7
Copy link
Contributor

mboehm7 commented Apr 28, 2016

it's a good start @nakul02 but I would recommend a rework of the actual builtin function and it's implementation:

  1. Builtin function: As mentioned before as.string will later be used as a cast of value types e.g., for scalars/frames (no change of the data type!). Could you please find another builtin name that conveys the meaning better. How about R's cat or toString (please have a look at the docs)? We should only introduce custom functions if there is a real good argument for it. Also, in relation to existing functions like read, write and transformmeta it would be good to use 'sep' instead of 'separator'.

  2. ExecType selection: Please fix the execution type selection (independent of memory, see transformmeta). Right now the compilation would crash for large matrices or in forced exec type hadoop/spark.

  3. String conversion: Please move the print out of core data structures like MatrixBlock, SparseBlock and all its implementations. This functionality is not performance critical
    and should be implemented against basic sparse iterators or get. Maybe you can integrate it into DataConverter (there you will find also examples how to use the generic internal apis).

@dusenberrymw
Copy link
Contributor

I think the next step in a separate PR could be automatically compiling the toString(...) construct when a user passes a matrix directly to print, such as print(X) or print("Matrix: \n" + X).

@mboehm7
Copy link
Contributor

mboehm7 commented May 5, 2016

@nakul02 sorry that it took that long - just a couple of minor comments:

  1. Please replace StringBuffer with StringBuilder in TestUtils.
  2. The memory estimates might produce negative results if either rows or cols are unknown.

I agree with @dusenberrymw that we should directly follow up with a PR to rewrite this pattern into toString.

@nakul02 nakul02 force-pushed the SYSTEMML-294_print_matrix branch from 514e0ab to 3c7aa0f Compare May 6, 2016 18:05
@nakul02
Copy link
Member Author

nakul02 commented May 6, 2016

@mboehm7 - fixed issues.

Started Jenkins Build

@nakul02 nakul02 force-pushed the SYSTEMML-294_print_matrix branch from 3c7aa0f to def5ee9 Compare May 6, 2016 18:10
@nakul02
Copy link
Member Author

nakul02 commented May 6, 2016

Test passed.

@nakul02 nakul02 changed the title [SYSTEMML-294_print_matrix][WIP] Print matrix capability [SYSTEMML-294] Print matrix capability May 6, 2016
@nakul02
Copy link
Member Author

nakul02 commented May 8, 2016

@mboehm7 - if this is ok by you, we can merge this and I can take up the rewrite "print(matrix)" task

@mboehm7
Copy link
Contributor

mboehm7 commented May 8, 2016

LGTM - but please fix the memory estimate (see 2). If the number of nonzeros is unknown the memory estimate would still be negative for sparse print. You could simply set the nnz to rows*cols in case of unknowns.

@deroneriksson
Copy link
Member

@nakul02 Looking forward to this one! Thank you.

@mboehm7
Copy link
Contributor

mboehm7 commented May 13, 2016

@nakul02 do you have an update here?

- Parameters supported : "rows", "cols", "decimal", "sparse",
  "separator", "lineseparator"
- Default parameter values: rows=cols=100, decimal=3, sparse=FALSE,
  separator=" ", lineseparator="\n"
- Example program:

X = rand(rows=10, cols=10, min=0, max=4, pdf="uniform", sparsity=0.2)
x1 = as.string(X, decimal=10, rows=5, cols=5, sparse=TRUE)
print ("matrix " + x1)

- TODO : Add tests
- TODO : Add memory estimation
@nakul02 nakul02 force-pushed the SYSTEMML-294_print_matrix branch from def5ee9 to c3dace4 Compare May 13, 2016 05:33
@nakul02
Copy link
Member Author

nakul02 commented May 13, 2016

@mboehm7 - sorry for the delay, was traveling. Updated, submitted Jenkins build

@mboehm7
Copy link
Contributor

mboehm7 commented May 13, 2016

LGTM - great, thanks @nakul02 - welcome back. :-)

@nakul02
Copy link
Member Author

nakul02 commented May 13, 2016

@mboehm7 - thanks :)

@dusenberrymw
Copy link
Contributor

Test build passed.

@asfgit asfgit closed this in ea49e62 May 13, 2016
@nakul02
Copy link
Member Author

nakul02 commented May 13, 2016

Thanks @dusenberrymw

@deroneriksson
Copy link
Member

Works great! THANK YOU @nakul02

@nakul02
Copy link
Member Author

nakul02 commented May 13, 2016

Thanks @deroneriksson

niketanpansare pushed a commit to niketanpansare/systemml that referenced this pull request May 13, 2016
This adds the ability to print a matrix by introducing a new `toString(...)` function that takes in a matrix and a set of optional arguments, and then outputs a string representation of the matrix for printing.

* Parameters supported : "rows", "cols", "decimal", "sparse",
  "separator", "lineseparator"
* Default parameter values: rows=cols=100, decimal=3, sparse=FALSE,
  separator=" ", lineseparator="\n"

Closes apache#120.
@nakul02 nakul02 deleted the SYSTEMML-294_print_matrix branch February 20, 2017 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants