[SYSTEMML-446] [SYSTEMML-702] Updated the sparse matrix multiplication to minimize sparse-to-dense as well as dense-to-sparse conversion #686
Conversation
Updated the sparse matrix multiplication to minimize sparse-to-dense conversion.
- The only case that requires additional memory, beyond what is used internally by CuBLAS/CuSPARSE and by the input and output matrices, is dense-sparse matrix multiplication.
- Since there is no CuSPARSE kernel for dense-sparse matrix multiplication, we either have to transpose the output after performing a sparse-dense matrix multiplication, or perform a dense-dense matrix multiplication after converting the sparse input to dense format.
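To spell out the transpose alternative, here is the standard identity it relies on (my gloss, not text from the PR):

```latex
% A is dense, B is sparse; there is no dense-sparse kernel, but
% B^T (sparse) times A^T (dense) is a sparse-dense product:
\[ C = AB = \left(B^{\top} A^{\top}\right)^{\top} \]
```

So the dense-sparse case reduces to one sparse-dense multiplication plus a single transpose of the result.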
Refer to this link for build results (access rights to CI server needed).
For correctness, I ran …
I compared the last two results against CP with epsilon (…).
The above setup covers every combination of sparse-dense, matrix-vector, and matrix-matrix inputs. All the results match for every combination with every epsilon, except the following, and that too only for …
For a lower epsilon (…):
… transposed; we use cusparseDcsrmv rather than cusparseDcsrmm2 in this case.
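To illustrate the kernel dispatch being discussed, a minimal sketch, assuming illustrative names for the CSR buffers and dimensions (this is not SystemML's actual code):

```java
import jcuda.Pointer;
import jcuda.jcusparse.JCusparse;
import jcuda.jcusparse.cusparseHandle;
import jcuda.jcusparse.cusparseMatDescr;
import static jcuda.jcusparse.cusparseOperation.CUSPARSE_OPERATION_NON_TRANSPOSE;

// Sparse A (m x k, CSR) times dense B (k x n, column-major) into dense out.
// When B is a single column, the matrix-vector kernel avoids csrmm2 overhead.
static void sparseDenseProduct(cusparseHandle handle, cusparseMatDescr descr,
        int m, int k, int n, int nnz, Pointer alpha, Pointer beta,
        Pointer csrVal, Pointer csrRowPtr, Pointer csrColInd,
        Pointer B, Pointer out) {
    if (n == 1) {
        JCusparse.cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                m, k, nnz, alpha, descr, csrVal, csrRowPtr, csrColInd,
                B, beta, out);
    } else {
        JCusparse.cusparseDcsrmm2(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                CUSPARSE_OPERATION_NON_TRANSPOSE, m, n, k, nnz,
                alpha, descr, csrVal, csrRowPtr, csrColInd,
                B, k, beta, out, m);
    }
}
```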
I also ran a performance comparison similar to the native BLAS PR and also checked correctness. All the below results match the results of CP up to … For the below results, I used 100 iterations and the below DML script:
For traditional ML algorithms, sparse-dense matrix multiplication helps, especially for the matrix-matrix case …
For squared matrix multiplication, in some cases (such as …).
For DL, cuBLAS is preferred in most cases:
@nakul02 can you please review this PR?
Refer to this link for build results (access rights to CI server needed).
Here is a summary of the results for different sparsities (from 0.1 to 0.000001):
Sparse-dense
Sparse-sparse
Dense-sparse
The header contains the sparsity of the left and right matrices. The values in the cells denote the time in seconds for 100 iterations of the corresponding matrix multiplication for …
There are some inconsistent styling issues (two statements on a single line; the closing brace on a different line than the beginning of the else). Please selectively run the code through either the Eclipse or the IntelliJ formatter.
// Convenient methods to swap two values
// Usage: y = swap(x, x=y);
private static long swap(long x, long y) {
This seems very confusing to me. Is doing swap this way idiomatic? It relies heavily on the parameter evaluation order, and, combined with the signature of the function swap(long, long), it makes for unreadable code.
Yes, it is possibly the only way to do swap in Java. Though the swap function itself is difficult to read, it makes the other methods more readable, instead of having a 3-line swap dispersed all over the code.
I am somewhat against this. In an endeavor towards brevity, we are losing clarity. In my Google searches, I've seen this as a clever StackOverflow answer for swapping two primitives, given that there is no concept of call-by-reference for primitives in Java. It does not seem like an idiomatic way of doing swaps.
A 3-line swap, or even having 3 statements on the same line, would be cleaner IMHO.
Let's agree to disagree here.
	return x;
}
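For anyone revisiting this thread, a self-contained demo of why the idiom works: Java guarantees left-to-right evaluation of method arguments, so the old value of x is captured before the embedded assignment x = y runs.

```java
public class SwapDemo {
    // Returns its first argument; the swap happens at the call site.
    private static long swap(long x, long y) {
        return x;
    }

    public static void main(String[] args) {
        long x = 1, y = 2;
        // Argument 1 evaluates to the old x (1); then x = y sets x to 2.
        // swap(...) returns the old x, which is assigned to y.
        y = swap(x, x = y);
        System.out.println(x + " " + y); // prints: 2 1
    }
}
```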
private static int cusparseOp(boolean isTransposed) {
This also seems a bit confusing to me. From the function name, it is unclear what it does.
Also, the number of invocations is so few that it's better to just have them inline (as it was before).
I would prefer to keep these methods, as they make the weird tricks explicit. Added documentation to make the function name clear.
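For reference, a minimal sketch of such a helper; the constants are JCuda's cusparseOperation values, and the body is inferred from the discussion rather than copied from the PR:

```java
import static jcuda.jcusparse.cusparseOperation.CUSPARSE_OPERATION_NON_TRANSPOSE;
import static jcuda.jcusparse.cusparseOperation.CUSPARSE_OPERATION_TRANSPOSE;

/**
 * Maps a boolean transpose flag to the CuSPARSE operation constant
 * expected by kernels such as cusparseDcsrmm2.
 */
private static int cusparseOp(boolean isTransposed) {
    return isTransposed ? CUSPARSE_OPERATION_TRANSPOSE : CUSPARSE_OPERATION_NON_TRANSPOSE;
}
```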
private static final Log LOG = LogFactory.getLog(LibMatrixCuMatMult.class.getName());

private static class CuMatMultParameters {
	public int m; public int n; public int k;
Could you please split out each of the declarations onto a separate line (does the Java formatter do this automatically?). Also, please put in a comment on what each of these fields means and when they can possibly (if at all) hold special values like -1.
Updated the documentation
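A sketch of what the documented class might look like after the split; the field semantics follow standard GEMM conventions and are my reading, not the committed text:

```java
private static class CuMatMultParameters {
    /** Rows of op(left) and of the output matrix. */
    public int m;
    /** Columns of op(right) and of the output matrix. */
    public int n;
    /** Inner dimension: columns of op(left) = rows of op(right). */
    public int k;
}
```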
* Memory Requirements -
* Both dense - inputs, output, no intermediate
* Both sparse - inputs, output, no intermediate
* One sparse, one dense - inputs, output, intermediates - (input_dim1 * input_dim2) OR (input_dim1 * input_dim2 + input in sparse format)
Does this documentation still hold true after your changes (memory estimates)?
No, updated the documentation
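Based on the summary at the top of this PR, the updated comment presumably reads roughly as follows (a paraphrase, not the committed text):

```java
/*
 * Memory requirements (beyond CuBLAS/CuSPARSE-internal workspace):
 * - dense-dense:   inputs and output only, no intermediate
 * - sparse-sparse: inputs and output only, no intermediate
 * - sparse-dense:  inputs and output only, no intermediate
 * - dense-sparse:  requires an intermediate, since CuSPARSE has no kernel
 *   for this case; we either transpose the output of a sparse-dense
 *   multiplication or convert the sparse input to dense first.
 */
```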
right.getNumRows(), right.getNumColumns(), isLeftTransposed, isRightTransposed);

if(isM1Sparse && isM2Sparse) {
	// -------------------------------------------------------------------------------------
This might be a good opportunity to create a sparseSparseMatMult function which encapsulates the relevant bit of code in this if guard.
I didn't add it since it is a one-line method. No guard is required, due to CuMatMultParameters.
The guard I am referring to is at line 130 - the particular if branch for when both matrices are sparse. This is keeping in line with having functions for sparseDenseMatMult, denseSparseMatMult and denseDenseMatMult. The function would contain, as you rightly pointed out, just the one line of cusparseDcsrgemm and the surrounding advanced timer code.
I have no strong opinion here.
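A sketch of that helper under stated assumptions: CSRPointer is assumed to expose val, rowPtr, colInd, nnz and a descriptor descr, and the timer plumbing is reduced to a comment:

```java
import jcuda.jcusparse.JCusparse;
import jcuda.jcusparse.cusparseHandle;

// Encapsulates the sparse-sparse branch: one cusparseDcsrgemm call,
// computing C = op(A) * op(B) with all three matrices in CSR format.
private static void sparseSparseMatMult(cusparseHandle handle, int transA, int transB,
        int m, int n, int k, CSRPointer A, CSRPointer B, CSRPointer C) {
    long t0 = System.nanoTime();
    JCusparse.cusparseDcsrgemm(handle, transA, transB, m, n, k,
            A.descr, (int) A.nnz, A.val, A.rowPtr, A.colInd,
            B.descr, (int) B.nnz, B.val, B.rowPtr, B.colInd,
            C.descr, C.val, C.rowPtr, C.colInd);
    long elapsedNanos = System.nanoTime() - t0;
    // record elapsedNanos under the advanced-statistics counters here
}
```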
	return output;
}

private static void sparseDenseMatMult(GPUContext gCtx, String instName, Pointer C, CSRPointer A, Pointer B,
Could you please make use of the newly created CuMatMultParameters in the signature of this function as well as its invocation? While you are at it, could you please add javadoc to this method, outlining the assumptions and the strategy used?
Also, I like the reduction in complexity of this method by invoking the already existing denseSparseMatMult 👍
Done.
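A sketch of the refactored signature; the extra parameter fields (transpose flags, leading dimensions) and gCtx.getCusparseHandle() are assumptions about surrounding code, not the committed implementation:

```java
/**
 * C = op(A) %*% op(B), where A is sparse (CSR) and B is dense.
 * All dimensions and transpose flags travel in CuMatMultParameters.
 */
private static void sparseDenseMatMult(GPUContext gCtx, String instName,
        Pointer C, CSRPointer A, Pointer B, CuMatMultParameters p) {
    JCusparse.cusparseDcsrmm2(gCtx.getCusparseHandle(),
            cusparseOp(p.isLeftTransposed), cusparseOp(p.isRightTransposed),
            p.m, p.n, p.k, (int) A.nnz, one(), A.descr,
            A.val, A.rowPtr, A.colInd, B, p.ldb, zero(), C, p.ldc);
}
```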
else {
	// -------------------------------------------------------------------------------------
	// dense-dense matrix multiplication
	if(isM1Sparse && !isM2Sparse) {
Is this if guard needed? This seems to have been covered at line 172.
In my earlier commit, I was trying to deal with the memory tradeoffs in this PR, but decided against it so as not to make this PR overly complicated.
JCublas2.cublasDgeam(gCtx.getCublasHandle(), cublasOperation.CUBLAS_OP_T, cublasOperation.CUBLAS_OP_T,
		toInt(outCLen), toInt(outRLen), one(), output,
		toInt(outRLen), zero(), new Pointer(), toInt(outRLen), C, toInt(outCLen));
gCtx.cudaFreeHelper(output, true);
Do you need to force this cudaFreeHelper call to use eager evaluation? Maybe you could let the user override it with a configuration option instead?
Please add some timers here so that we can tell from the advanced stats whether this code path was taken.
Ideally I would have liked to encapsulate this transpose into a helper function somewhere and invoke that. The timers could live inside that transpose function, and the resulting code would be easier to read.
Let's clean up other methods in a subsequent PR and focus only on matmult.
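For reference, the encapsulation suggested above could be a helper along these lines (name and timer placement are assumptions; the cublasDgeam arguments mirror the snippet under review):

```java
// Out-of-place transpose of a dense column-major rows x cols matrix:
// out = 1 * in^T + 0 * (dummy), so out is cols x rows.
private static void transpose(GPUContext gCtx, String instName,
        Pointer in, Pointer out, int rows, int cols) {
    long t0 = System.nanoTime();
    JCublas2.cublasDgeam(gCtx.getCublasHandle(),
            cublasOperation.CUBLAS_OP_T, cublasOperation.CUBLAS_OP_T,
            cols, rows, one(), in, rows, zero(), new Pointer(), rows,
            out, cols);
    long elapsedNanos = System.nanoTime() - t0;
    // report elapsedNanos via the advanced-statistics timers here
}
```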
Could you please encapsulate the correctness testing you did into the unit tests?
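Such a test might take roughly this shape; runOnCP and runOnGPU are hypothetical stand-ins for SystemML's actual test harness:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class MatMultGpuCorrectnessTest {
    @Test
    public void sparseDenseMatMultMatchesCP() {
        double eps = 1e-9; // tolerance for comparing the two backends
        double[][] cp  = runOnCP("R = X %*% Y");
        double[][] gpu = runOnGPU("R = X %*% Y");
        for (int i = 0; i < cp.length; i++)
            for (int j = 0; j < cp[i].length; j++)
                assertEquals(cp[i][j], gpu[i][j], eps);
    }

    // Hypothetical helpers: run the script on the CP / GPU backend and return R.
    private double[][] runOnCP(String script)  { throw new UnsupportedOperationException("stub"); }
    private double[][] runOnGPU(String script) { throw new UnsupportedOperationException("stub"); }
}
```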
In master, if either of the matrices is sparse, the output matrix may be sparse.
Updated the title of the PR to make the goal of this PR explicit: minimize conversions. We can deal with tradeoffs in a different PR.
Thank you for the documentation additions!
Updated the JUnit tests. I would prefer to keep …
Ok, other than the disagreeable swap and cusparseOp, everything LGTM.
Thanks. I will merge :)
Also adding the detailed stats for each value in the above tables for future reference:
Refer to this link for build results (access rights to CI server needed). Failed Tests: 1 (SystemML-PullRequestBuilder / org.apache.systemml:systemml: 1)
Updated the sparse matrix multiplication to minimize sparse-to-dense as well as dense-to-sparse conversion.
1. The goal of this PR is not to improve performance (for example, by considering the cost of sparse-to-dense conversion vs. the FLOPs required given a memory budget), but instead to minimize sparse-to-dense conversion in the GPU matrix multiplication operator.
2. If matmult uses unnecessary sparse-to-dense conversions, we run the risk of one of two situations: in the best case, some of the matmults won't be pushed to the GPU under the worst-case memory budget; on the other hand, if these conversions are not accounted for, they may cause OOMs.
3. Every operator (except dense-sparse matrix multiplication) uses only the memory allocated to the input and output matrices.
4. Since there is no CuSPARSE kernel for the dense-sparse matrix multiplication operator, we either have to transpose the output after performing a sparse-dense matrix multiplication, or perform a dense-dense matrix multiplication after converting the sparse input to dense format.

Closes apache#686.
1. The goal of this PR is not to improve performance (for example, by considering the cost of sparse-to-dense conversion vs. the FLOPs required given a memory budget), but instead to minimize sparse-to-dense conversion in the GPU matrix multiplication operator.
2. If matmult uses unnecessary sparse-to-dense conversions, we run the risk of one of two situations:
   - In the best case, some of the matmults won't be pushed to the GPU under the worst-case memory budget.
   - If these conversions are not accounted for, they may cause OOMs.
3. Every operator (except dense-sparse matrix multiplication) uses only the memory allocated to the input and output matrices.
4. Since there is no CuSPARSE kernel for the dense-sparse matrix multiplication operator, we either have to transpose the output after performing a sparse-dense matrix multiplication, or perform a dense-dense matrix multiplication after converting the sparse input to dense format.

@nakul02 I will post the correctness and performance comparison in a few hours. Please review this PR then :)