[SYSTEMML-294] Print matrix capability #120
Very cool! I think there is tremendous potential here. Perhaps some additional delineation could help? Something like:
Just a correction with respect to my point (1): this is fine, since the memory estimate for dense is still smaller.
@nakul02 Cool! This has been a pain point for me when working on DML.
Thanks for the feedback. I'll work on these suggestions.
@Wenpei - after speaking about your suggestion for maxRow=x, maxCol=y with @niketanpansare, IMO it may be better to express that as
@deroneriksson - what do you think of @mboehm7's suggestion? Do you still want the bounding boxes?
@mboehm7 - WRT point 3 - should we print out the entire matrix or only a fixed number of rows and columns? Do we want to buffer parts of the matrix as we print it out?
ad 3) I would recommend to define a maxrow/col threshold (let's say 1k) - if the matrix is larger, we have to generate indexing operations. |
@nakul02 I guess my thinking is that I would want it to be as visually clear and powerful as possible. It's actually probably hard to do without introducing something like a printMatrix function. For instance, I could see situations where those bounding boxes are nice (view in console), and I could also see where simple comma-separated values would be nice (copy/paste into Excel). There can be situations where conciseness is nice (and have only a single number to the right of the decimal), and other situations where precision is nice (show lots of numbers to the right of the decimal). Also, it might even be nice to have a dense/sparse option (as Wenpei mentioned), where sparse displays something like i,j,v format rather than a dense 'spreadsheet' look. However, it's great to be able to display any kind of matrix representation via print(), since currently this has not been possible. This is a big win. Perhaps both print() of a matrix (for basic situations) and a printMatrix() (for more advanced output) would be nice. |
Printing large dense matrices will be too overwhelming. Users will always use range indexing to display parts of a large matrix, or other ops such as head, tail, firstCols, lastCols, etc. If users print "too much", then there should be some built-in guards to avoid thrashing. This is quite common across systems. |
@bertholdreinwald - For this PR, I am dealing with matrices only. I'd suggest we deal with frames in a separate JIRA. |
I like the idea of the toString() method that Matthias suggested, and this method can be extended to frames later:

    X = rand(rows=10000, cols=10000, min=0, max=4, pdf="uniform", sparsity=0.2)
    print(toString(X, format="csv", rows=1000, cols=10))

Internally, toString() could have the built-in guards that Berthold is suggesting. I like Matthias's idea of implicitly introducing an indexing op that performs the following logic (but at the HOP level):

    rlen = nrow(X)
    if(nrow(X) > rowThreshold) {
      rlen = rowThreshold
    }
    clen = ncol(X)
    if(ncol(X) > colThreshold) {
      clen = colThreshold
    }
    print(toString(X[1:rlen, 1:clen], ...))

The next question is why not do implicit conversion instead of a toString() method. I like implicit conversion iff we can handle the following DML code:

    y = " " + X
    X1 = y * X

Also, I would like to point out that at the language level we have a string representation of a matrix, i.e.
Though I don't have a strong opinion about the format type, I would suggest sticking to the formats that we expose in the read/write methods (i.e. "csv" for dense and "text" for sparse) rather than calling them dense or sparse, or we can add the new pretty-printed matrix format to read/write as well.
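The threshold guard discussed above can be sketched outside DML as well. The following is a minimal Python illustration, not SystemML code: the function names `clamp_print_range` and `guarded_slice` are hypothetical, and the default threshold of 1000 simply echoes the "let's say 1k" suggestion earlier in the thread.

```python
def clamp_print_range(nrow, ncol, row_threshold=1000, col_threshold=1000):
    """Return (rlen, clen): how many rows and columns would be printed.

    Mirrors the DML pseudo-logic: use the full dimension unless it
    exceeds the threshold, in which case fall back to the threshold.
    """
    rlen = min(nrow, row_threshold)
    clen = min(ncol, col_threshold)
    return rlen, clen


def guarded_slice(matrix, row_threshold=1000, col_threshold=1000):
    """Return the top-left block of a list-of-lists matrix.

    This plays the role of the implicit X[1:rlen, 1:clen] indexing op,
    so the formatter never sees more than threshold-many rows/columns.
    """
    nrow = len(matrix)
    ncol = len(matrix[0]) if nrow else 0
    rlen, clen = clamp_print_range(nrow, ncol, row_threshold, col_threshold)
    return [row[:clen] for row in matrix[:rlen]]
```

With this shape, a small matrix passes through unchanged, while a 10000 x 10000 input would be reduced to a 1000 x 1000 block before any string is built.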
Based on the discussion here, an offline discussion with @niketanpansare, and some thought, IMHO, here are two ways of printing matrices: 1. Introduce DML functions -
From my side, +1 for option one - this allows much more flexibility. However, in terms of language-level integration, we need to think about better alternatives, because as.logical / as.integer / as.double and the proposed as.string are pure value-type casts, whereas the proposed functionality covers both a value-type and a data-type cast, i.e., double - matrix into string - scalar. Also, please under no circumstances introduce anything like castAsString - the existing castAsScalar is kept only for historic reasons (its introduction was a mistake, but when we replaced it with as.scalar we had to keep it for backwards compatibility because it was already used by external users).
one more note: although the scalar-string cast is a much better solution in terms of runtime flexibility - for convenience you might want to automatically compile it whenever somebody feeds a matrix directly into a print. |
From a usability standpoint, it sure would be nice if the following didn't blow up and instead gave the user some type of useful output:
+1 for first option. Please make sure that formatting options, e.g. separator, EoL, can be provided for convenient import into Excel, etc. A one-dimensional sequence of numbers w/o formatting may not be helpful. |
Let's please keep this as simple as possible at the language level, and not force usage in DML of anything like |
Could we have both 1 (with no options) and 2? |
2e41ec1 to 05ce73b
I've implemented an initial version of
Here are some examples in use: Program:
This is great! |
Thanks @deroneriksson. |
@nakul02 I also really like your idea of the automatic invocation of as.string in a print call on a matrix as a separate PR. That way users have both 1) ease-of-use (print(m)) and 2) flexibility (print(as.string(m,...))). |
@nakul02 LGTM |
Looks good. One last thought: looking at your green-on-black MVS terminal output (does saying that date me?), I couldn't help but count the rows and columns for identification. Would it be useful to include row# and col# in the output, i.e.
0: 20 21 22 |
@bertholdreinwald - I like the green on black, easy on the eyes :) Just as a thought, if you intend to copy-paste this into Excel or another program, the prepended row numbers would be a hindrance, and Excel would provide that info anyway. But I leave the decision up to you. We could also make the output fancier by taking in more named parameters. Another thing you might have noticed is that I begin row and column numbering from 0.
It's a good start @nakul02, but I would recommend a rework of the actual builtin function and its implementation:
I think the next step in a separate PR could be automatically compiling the |
@nakul02 sorry that it took that long - just a couple of minor comments:
I agree with @dusenberrymw that we should directly follow up with a PR to rewrite this pattern into toString. |
514e0ab to 3c7aa0f
@mboehm7 - fixed issues. Started Jenkins Build |
3c7aa0f to def5ee9
Test passed. |
@mboehm7 - if this is ok by you, we can merge this and I can take up the rewrite "print(matrix)" task |
LGTM - but please fix the memory estimate (see 2). If the number of nonzeros is unknown the memory estimate would still be negative for sparse print. You could simply set the nnz to rows*cols in case of unknowns. |
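The unknown-nnz issue raised here can be illustrated with a small Python sketch. This is not the SystemML estimator: the function name, the use of a negative `nnz` as the "unknown" marker, and the flat `bytes_per_cell` cost are all assumptions made for illustration. The only point taken from the comment is the fallback of treating an unknown nnz as rows*cols.

```python
def estimate_string_size_bytes(rows, cols, nnz, sparse, bytes_per_cell=16):
    """Rough upper bound on the size of the printed matrix string.

    nnz < 0 is treated as "unknown" (an assumption of this sketch).
    Without the fallback below, a sparse estimate would multiply by a
    negative nnz and come out negative.
    """
    if nnz < 0:
        # Worst case: assume every cell is a nonzero.
        nnz = rows * cols
    # Sparse output emits one entry per nonzero; dense emits every cell.
    cells = nnz if sparse else rows * cols
    return cells * bytes_per_cell
```

Falling back to the dense worst case keeps the estimate conservative, which is usually the safer direction for a memory budget.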
@nakul02 Looking forward to this one! Thank you. |
@nakul02 do you have an update here? |
- Parameters supported: "rows", "cols", "decimal", "sparse", "separator", "lineseparator"
- Default parameter values: rows=cols=100, decimal=3, sparse=FALSE, separator=" ", lineseparator="\n"
- Example program:

      X = rand(rows=10, cols=10, min=0, max=4, pdf="uniform", sparsity=0.2)
      x1 = as.string(X, decimal=10, rows=5, cols=5, sparse=TRUE)
      print ("matrix " + x1)

- TODO: Add tests
- TODO: Add memory estimation
def5ee9 to c3dace4
@mboehm7 - sorry for the delay, was traveling. Updated, submitted Jenkins build |
LGTM - great, thanks @nakul02 - welcome back. :-) |
@mboehm7 - thanks :) |
Test build passed. |
Thanks @dusenberrymw |
Works great! THANK YOU @nakul02 |
Thanks @deroneriksson |
This adds the ability to print a matrix by introducing a new `toString(...)` function that takes in a matrix and a set of optional arguments, and then outputs a string representation of the matrix for printing.
* Parameters supported: "rows", "cols", "decimal", "sparse", "separator", "lineseparator"
* Default parameter values: rows=cols=100, decimal=3, sparse=FALSE, separator=" ", lineseparator="\n"

Closes apache#120.
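To make the parameter list above concrete, here is a minimal Python sketch of a formatter with the same six options and defaults. It is an illustration only, not the SystemML implementation: the function name `matrix_to_string` is hypothetical, the sparse form is assumed to be one "i j v" entry per nonzero with 1-based indices (matching DML indexing), and no trailing line separator is emitted.

```python
def matrix_to_string(matrix, rows=100, cols=100, decimal=3, sparse=False,
                     separator=" ", lineseparator="\n"):
    """Format a list-of-lists matrix as a string.

    rows/cols cap how much of the matrix is printed, decimal sets the
    number of digits after the decimal point, and separator and
    lineseparator control the cell and line delimiters.
    """
    block = [row[:cols] for row in matrix[:rows]]
    lines = []
    if sparse:
        # Assumed sparse layout: "i<sep>j<sep>v" per nonzero, 1-based.
        for i, row in enumerate(block, start=1):
            for j, v in enumerate(row, start=1):
                if v != 0:
                    lines.append(
                        separator.join([str(i), str(j), f"{v:.{decimal}f}"]))
    else:
        # Dense layout: every cell of every printed row.
        for row in block:
            lines.append(separator.join(f"{v:.{decimal}f}" for v in row))
    return lineseparator.join(lines)
```

Passing `separator=","` gives output that pastes cleanly into a spreadsheet, which is the use case raised earlier in the thread.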
Can pass a matrix object to the print statement
sequence of instructions is not being generated.
not work. (x=scalar, X=matrix).
I realize the matrix printing may not be the best format. Here is what it looks like for now:
For this DML code:
@deroneriksson, @niketanpansare, @mboehm7 - thoughts?