
[SYSTEMML-446] [SYSTEMML-702] Updated the sparse matrix multiplication to minimize sparse-to-dense as well as dense-to-sparse conversion #686

Closed

Conversation

niketanpansare
Contributor

  1. The goal of this PR is not to improve performance (for example, by weighing the cost of sparse-to-dense conversion against the FLOPs required given a memory budget) but instead to minimize sparse-to-dense conversion in the GPU matrix multiplication operator.

  2. If matmult uses unnecessary sparse-to-dense conversions, we run the risk of one of two situations:

  • In the best case, some of the matmults won't be pushed to the GPU under the worst-case memory budget.
  • On the other hand, if these conversions are not accounted for, they may cause OOMs.

  3. Every operator (except dense-sparse matrix multiplication) uses only the memory allocated to the input and output matrices.

  4. Since there is no cuSPARSE kernel for the dense-sparse matrix multiplication operator, we either have to transpose the output after performing a sparse-dense matrix multiplication or perform a dense-dense matrix multiplication after converting the sparse input to dense format.

@nakul02 I will post the correctness and performance comparison in a few hours. Please review this PR then :)

to minimize sparse-to-dense conversion

- The only case that requires additional memory, beyond that used
internally by cuBLAS/cuSPARSE and by the input and output matrices, is
dense-sparse matrix multiplication.
- Since there is no cuSPARSE kernel for dense-sparse matrix
multiplication, we either have to transpose the output after performing
a sparse-dense matrix multiplication or perform a dense-dense matrix
multiplication after converting the sparse input to dense format.
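The transpose workaround described in the commit message rests on the identity A * B = (B^T * A^T)^T, which lets a dense-sparse product be served by a sparse-dense kernel plus one extra transpose of the output. A minimal sketch with plain arrays (an illustration of the identity only, not the actual CUDA code; all names here are hypothetical):

```java
public class DenseSparseViaTranspose {
    // Plain dense multiply: C(m x k) = A(m x n) * B(n x k).
    static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, n = b.length, k = b[0].length;
        double[][] c = new double[m][k];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int l = 0; l < k; l++)
                    c[i][l] += a[i][j] * b[j][l];
        return c;
    }

    static double[][] transpose(double[][] a) {
        double[][] t = new double[a[0].length][a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[0].length; j++)
                t[j][i] = a[i][j];
        return t;
    }

    // A * B computed as (B^T * A^T)^T. In the PR, the inner product would be
    // the existing sparse-dense kernel and the final transpose a cuBLAS geam.
    static double[][] multiplyViaTranspose(double[][] a, double[][] b) {
        return transpose(multiply(transpose(b), transpose(a)));
    }
}
```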
@akchinSTC
Contributor

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/2182/

@akchinSTC
Contributor

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/2184/

@akchinSTC
Contributor

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/2185/

@niketanpansare
Contributor Author

For correctness, I ran Z = X %*% Y, where X is i x j with sparsity sp1 and Y is j x k with sparsity sp2, using:

  • CP (apache master)
  • With -gpu (this PR)
  • With -gpu (apache master)

I compared the last two results with CP at each epsilon (0.1, 0.0001, 0.0000001, 0.0000000001, and 0.0000000000001) and populated the last two columns for:

for i in 5000 1000 1; do
  for j in 5000 1000 1; do
    for k in 5000 1000 1; do
      for sp1 in 0.9 0.2 0.1 0.01; do
        for sp2 in 0.9 0.2 0.1 0.01; do
          ...
        done
      done
    done
  done
done

The above setup covers every combination of sparse/dense inputs across matrix-vector and matrix-matrix shapes.

All the results match for every combination at every epsilon except the following, and only at the 1.0E-13 epsilon:

leftNumRows,leftNumCols,rightNumCols,leftSparsity,rightSparsity
5000,5000,5000,0.9,0.1
5000,5000,1000,0.9,0.1
5000,1000,5000,0.9,0.2
5000,1000,1000,0.9,0.2
1000,5000,5000,0.9,0.1
1000,5000,1000,0.9,0.1
1000,1000,5000,0.9,0.2
1000,1000,1000,0.9,0.2
1,5000,5000,0.9,0.1
1,5000,1000,0.9,0.1

At the looser epsilon of 0.0000000001, all the above cases match as well.
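The element-wise epsilon check used above can be sketched as follows (a generic absolute-tolerance comparison on flattened matrices; this is not SystemML's actual test harness, and the names are hypothetical):

```java
public class EpsilonCompare {
    // Returns true if every pair of corresponding entries differs by at
    // most eps in absolute value.
    static boolean matchesWithEps(double[] a, double[] b, double eps) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++)
            if (Math.abs(a[i] - b[i]) > eps) return false;
        return true;
    }
}
```

A result that matches at 1e-10 but not at 1e-13, as in the cases listed above, indicates differences on the order of accumulated floating-point rounding rather than a correctness bug.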

transposed ... we use cusparseDcsrmv rather than cusparseDcsrmm2 in this
case
@niketanpansare
Contributor Author

niketanpansare commented Oct 20, 2017

I also ran a performance comparison similar to the native BLAS PR and again checked correctness. All the results below match the CP results up to an epsilon of 1e-13. Only [64,196608] x [196608,512] with sparsity 0.99 and 0.01 respectively failed to match CP at epsilon 1e-13 with -gpu (on both current master and this PR).

For the results below, I used 100 iterations and the following DML script:

X = read("X.mtx")
Z = read("Z.mtx")
Z1 = matrix(0, rows=nrow(X), cols=ncol(Z))
for(i in 1:num_iters) {
        Z1 = X %*% Z
        Z1 = Z1 * 1.2
}

For traditional ML algorithms, the sparse-dense matrix multiplication helps, especially in the matrix-matrix 1M x 1K x 20 case:

Matrix-vector 1M x 1K
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.01 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         2.588    100              aqmd[0.001s,100], aqrd[0.003s,100], rlswr[0.003s,0], rlsi[0.006s,200], aqrs[0.935s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.257    100   aqrd[0.004s,1], Msmdv[0.004s,100], H2D[0.018s,2], ad[0.092s,100], sync[0.164s,100], aqrs[0.810s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     1.076    100 az[0.002s,100], aqrd[0.003s,1], Msmdv[0.004s,100], H2D[0.031s,2], a[0.092s,100], sync[0.172s,100], a
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         2.348    100              aqmd[0.001s,100], aqrd[0.003s,100], rlswr[0.004s,0], rlsi[0.005s,200], aqrs[0.853s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.281    100   aqrd[0.003s,1], Msmdv[0.004s,100], H2D[0.018s,2], ad[0.091s,100], sync[0.164s,100], aqrs[0.835s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     1.043    100 az[0.001s,100], aqrd[0.003s,1], Msmdv[0.004s,100], H2D[0.022s,2], a[0.089s,100], sync[0.169s,100], a

Matrix-matrix 1M x 1K x 20
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.01 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         8.954    100              aqmd[0.001s,100], aqrd[0.005s,100], rlswr[0.005s,0], rlsi[0.006s,200], aqrs[0.943s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     3.444    100   sync[0.001s,100], Msmdm[0.004s,100], aqrd[0.005s,1], H2D[0.018s,2], aqrs[0.688s,1], ad[1.323s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]   152.226    100 sync[0.001s,100], aqrd[0.005s,1], H2D[0.018s,2], aqrs[0.643s,1], f[22.784s,100], s2d[1.496s,100], Md
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.01 and sp2=0.01
CP_MASTER   2  ba+* [test.dml 12:6-12:12]         9.270    100                                aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.005s,200], aqrs[0.933s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     4.428    100                 sync[0.001s,100], H2D[0.019s,2], aqrs[0.597s,2], Msmsm[1.079s,100], Msao[2.562s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     4.478    100                 sync[0.001s,100], H2D[0.018s,2], aqrs[0.664s,2], Msmsm[1.077s,100], Msao[2.559s,100]

For squared matrix multiplication, in some cases (such as [3000, 3000] %*% [3000, 3000] with sp1=0.99 and sp2=0.01), cuSPARSE is far slower than sparse-to-dense conversion followed by cuBLAS:

Squared matrix 3K x 3K
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.01 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         6.134    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.006s,200], aqrs[0.078s,100], aqrd[0.157s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.724    100   sync[0.001s,100], Msmdm[0.005s,100], H2D[0.015s,2], aqrs[0.088s,1], aqrd[0.143s,1], ad[0.613s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     5.451    100 sync[0.001s,100], Mdmdm[0.010s,100], H2D[0.011s,2], ad[0.068s,100], aqrs[0.086s,1], aqrd[0.164s,1],
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        12.887    100                                aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.092s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.988    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.100s,2], Msmsm[1.372s,100], Msao[1.488s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     2.987    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.116s,2], Msmsm[1.370s,100], Msao[1.477s,100]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.99 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         6.982    100              aqmd[0.002s,100], rlswr[0.003s,0], rlsi[0.006s,200], aqrs[0.048s,100], aqrd[0.151s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    45.634    100  sync[0.001s,100], H2D[0.017s,2], aqrs[0.051s,1], aqrd[0.182s,1], ad[0.592s,100], Msmdm[44.764s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     6.139    100 sync[0.001s,100], Mdmdm[0.008s,100], H2D[0.017s,2], aqrs[0.049s,1], ad[0.072s,100], aqrd[0.157s,1],

For DL, cuBLAS is preferred in most cases:

64 rows x 196,608 columns %*% 196,608 x 512
Stats for [64, 196608] %*% [196608, 512] with sp1=0.01 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         2.529    100              aqmd[0.002s,100], rlswr[0.003s,0], rlsi[0.006s,200], aqrs[0.157s,100], aqrd[0.606s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.338    100   sync[0.001s,100], Msmdm[0.003s,100], ad[0.031s,100], aqrs[0.095s,1], H2D[0.116s,2], aqrd[0.662s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     4.657    100 f[2.874s,100], sync[0.001s,100], Mdmdm[0.010s,100], ad[0.029s,100], H2D[0.116s,2], aqrs[0.159s,1], a
Stats for [64, 196608] %*% [196608, 512] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.882    100                                aqmd[0.001s,100], rlsi[0.002s,200], rlswr[0.002s,0], aqrs[0.256s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.538    100                 sync[0.001s,100], H2D[0.003s,2], aqrs[0.206s,2], Msao[0.669s,100], Msmsm[1.596s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     2.567    100                 sync[0.001s,100], H2D[0.003s,2], aqrs[0.227s,2], Msao[0.668s,100], Msmsm[1.599s,100]
Stats for [64, 196608] %*% [196608, 512] with sp1=0.99 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         3.380    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.005s,200], aqrs[0.115s,100], aqrd[0.176s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    22.312    100  sync[0.001s,100], H2D[0.017s,2], ad[0.032s,100], aqrs[0.090s,1], aqrd[0.229s,1], Msmdm[21.875s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    12.014    100 H2D[0.017s,2], aqrs[0.143s,1], aqrd[0.228s,1], f[5.034s,100], s2d[6.480s,100], sync[0.001s,100], Mdm

100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1)
Stats for [100, 600] %*% [600, 205800] with sp1=0.99 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        14.573    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.005s,200], aqrd[0.040s,100], aqrs[0.335s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    26.187    100  sync[0.001s,100], H2D[0.003s,2], aqrd[0.040s,1], aqrs[0.332s,1], ad[1.339s,100], Msmdm[24.438s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     5.810    100 f[4.036s,100], sync[0.001s,100], H2D[0.004s,2], Mdmdm[0.008s,100], aqrd[0.041s,1], ad[0.084s,100], a
Stats for [100, 600] %*% [600, 205800] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        17.214    100                                aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.264s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    21.147    100                Msao[9.523s,100], Msmsm[11.275s,100], sync[0.001s,100], H2D[0.003s,2], aqrs[0.305s,2]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    21.288    100                Msao[9.532s,100], Msmsm[11.302s,100], sync[0.001s,100], H2D[0.003s,2], aqrs[0.411s,2]

100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1)
Stats for [100, 900] %*% [900, 205800] with sp1=0.99 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        19.557    100              aqmd[0.002s,100], rlswr[0.003s,0], rlsi[0.005s,200], aqrd[0.034s,100], aqrs[0.319s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    37.719    100  sync[0.001s,100], H2D[0.004s,2], aqrd[0.038s,1], aqrs[0.311s,1], ad[1.326s,100], Msmdm[35.991s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     8.012    100 sync[0.002s,100], H2D[0.005s,2], Mdmdm[0.009s,100], aqrd[0.044s,1], ad[0.085s,100], aqrs[0.397s,1],
Stats for [100, 900] %*% [900, 205800] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        19.671    100                                aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.423s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    16.614    100                 Msao[7.313s,100], sync[0.001s,100], H2D[0.004s,2], Msmsm[8.889s,100], aqrs[0.367s,2]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    16.731    100                 Msao[7.315s,100], sync[0.002s,100], H2D[0.004s,2], Msmsm[8.902s,100], aqrs[0.471s,2]

@niketanpansare
Contributor Author

@nakul02 can you please review this PR?

@akchinSTC
Contributor

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/2186/

@niketanpansare
Contributor Author

niketanpansare commented Oct 20, 2017

Here is a summary of the results for different sparsities (from 0.1 down to 0.000001):

Sparse-dense

| | 0.1/0.99 | 0.01/0.99 | 0.001/0.99 | 0.0001/0.99 | 0.00001/0.99 | 0.000001/0.99 |
| --- | --- | --- | --- | --- | --- | --- |
| Matrix-vector 1M x 1K | 2.646, 1.077, 1.184 | 1.539, 0.455, 0.492 | 1.111, 0.289, 0.288 | 1.063, 0.262, 0.247 | 1.619, 0.245, 0.252 | |
| Matrix-matrix 1M x 1K x 20 | 16.804, 9.877, 223.848 | 9.466, 3.470, 152.291 | 9.091, 2.484, 148.133 | 6.622, 2.200, 147.446 | 10.150, 2.114, 2.303 | 9.570, 2.091, 2.104 |
| Squared matrix 3K x 3K | 14.212, 8.051, 5.676 | 6.913, 1.773, 5.429 | 4.558, 1.283, 5.355 | 8.146, 1.227, 5.288 | 6.698, 1.133, 3.341 | 0.969, 1.188, 3.226 |
| 64 rows x 196,608 columns %*% 196,608 x 512 | 11.388, 4.556, 4.883 | 2.471, 1.319, 4.614 | 1.107, 0.951, 4.470 | 1.196, 0.902, 4.470 | 1.670, 0.915, 26.518 | 1.068, 0.979, 26.468 |
| 100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1) | 111.711, 41.010, 8.335 | 15.119, 26.078, 5.822 | 8.325, 9.989, 5.545 | 16.178, 2.693, 5.425 | 12.600, 1.775, 11.463 | 12.319, 1.699, 2.943 |
| 100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1) | 223.880, 59.703, 11.942 | 20.375, 37.762, 7.943 | 9.710, 13.873, 7.509 | 18.504, 3.168, 7.390 | 13.499, 1.850, 21.292 | 11.961, 1.698, 3.907 |

Sparse-sparse

| | 0.1/0.1 | 0.01/0.01 | 0.001/0.001 | 0.0001/0.0001 | 0.00001/0.00001 | 0.000001/0.000001 |
| --- | --- | --- | --- | --- | --- | --- |
| Matrix-vector 1M x 1K | 2.459, 1.056, 1.049 | 1.455, 0.559, 0.621 | 1.029, 0.313, 0.308 | 1.042, 0.248, 0.271 | 1.010, 0.247, 0.247 | |
| Matrix-matrix 1M x 1K x 20 | 17.116, 10.527, 226.077 | 9.129, 4.663, 4.473 | 9.699, 2.604, 2.607 | 10.088, 2.104, 2.107 | 8.572, 1.905, 1.922 | 1.560, 1.866, 1.864 |
| Squared matrix 3K x 3K | 7.011, 23.627, 23.694 | 12.415, 2.940, 2.917 | 6.233, 0.969, 0.964 | 4.717, 0.801, 0.799 | 4.892, 0.790, 0.780 | 0.382, 0.775, 0.774 |
| 64 rows x 196,608 columns %*% 196,608 x 512 | 3.260, 10.280, 10.200 | 0.737, 2.506, 2.508 | 0.491, 0.409, 0.418 | 0.402, 0.237, 0.222 | 0.416, 0.211, 0.215 | 0.398, 0.210, 0.210 |
| 100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1) | 17.562, 20.960, 20.850 | 17.242, 21.352, 21.313 | 11.496, 4.588, 4.584 | 9.132, 1.564, 1.572 | 0.558, 1.509, 1.521 | 0.433, 1.514, 1.520 |
| 100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1) | 21.455, 27.985, 27.866 | 18.948, 16.483, 16.567 | 11.452, 5.588, 5.575 | 8.453, 1.614, 1.618 | 0.645, 1.530, 1.535 | 0.398, 1.511, 1.504 |

Dense-sparse

| | 0.99/0.1 | 0.99/0.01 | 0.99/0.001 | 0.99/0.0001 | 0.99/0.00001 | 0.99/0.000001 |
| --- | --- | --- | --- | --- | --- | --- |
| Matrix-vector 1M x 1K | | | | | | |
| Matrix-matrix 1M x 1K x 20 | | | | | | |
| Squared matrix 3K x 3K | 23.185, 73.655, 6.490 | 7.835, 45.697, 6.180 | 4.782, 17.049, 6.002 | 11.358, 3.196, 5.834 | 7.324, 1.396, 17.673 | 0.983, 1.203, 6.896 |
| 64 rows x 196,608 columns %*% 196,608 x 512 | 12.555, 24.412, 13.287 | 3.452, 22.268, 11.987 | 1.870, 9.672, 11.687 | 1.486, 1.620, 11.582 | 1.461, 0.716, 58.097 | 1.513, 0.725, 53.831 |
| 100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1) | | | | | | |
| 100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1) | | | | | | |

The header gives the sparsity of the left and right matrix. Each cell gives the time in seconds for 100 iterations of the corresponding matrix multiplication as (CP, GPU with this PR, GPU master). It shows that for low sparsity, cuSPARSE outperforms cuBLAS in almost all cases, but at moderate sparsity there are cases where converting to dense and performing a cuBLAS matrix multiplication helps.

Member

@nakul02 nakul02 left a comment

There are some inconsistent styling issues (two statements on a single line, closing brace on a different line than the beginning of the else). Please selectively run the code through either the Eclipse or the IntelliJ formatter.


// Convenient methods to swap two values
// Usage: y = swap(x, x=y);
private static long swap(long x, long y) {
Member

This seems very confusing to me.
Is doing swap this way idiomatic?
This relies heavily on the parameter evaluation order, and combined with the signature of the function swap(long, long), it makes for unreadable code.
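The idiom under discussion depends on Java's guaranteed left-to-right evaluation of method arguments. A minimal standalone sketch (class name hypothetical, visibility widened for demonstration; the PR's actual method is private):

```java
public class SwapDemo {
    // Returns its first argument unchanged. Used as: y = swap(x, x = y);
    // Java evaluates arguments left to right, so the old value of x is
    // captured before the assignment x = y takes effect.
    static long swap(long x, long y) {
        return x;
    }

    public static void main(String[] args) {
        long a = 1, b = 2;
        b = swap(a, a = b); // one-line swap
        System.out.println(a + " " + b);
    }
}
```

The readability objection is that the second argument exists only for its assignment side effect, which is easy to misread.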

Contributor Author

Yes, it is possibly the only way to do a swap in Java. Though the swap function itself is difficult to read, it makes the other methods more readable, instead of having a three-line swap dispersed all over the code.

Member

I am somewhat against this.
In an endeavor towards brevity, we are losing clarity. In my Google searches for this, I've seen it as a clever Stack Overflow answer for swapping two primitives, given that there is no concept of call by reference for primitives in Java. It does not seem like an idiomatic way of doing swaps.
A three-line swap, or even having three statements on the same line, would be cleaner IMHO.

Contributor Author

Let's agree to disagree here.

return x;
}

private static int cusparseOp(boolean isTransposed) {
Member

This also seems a bit confusing to me; from the function name it is unclear what it is doing.
Also, the number of invocations of this is so few, it's better just to have them inline (as it was before).

Contributor Author

I would prefer to keep these methods, as they make the weird tricks explicit. Added documentation to make the function name clear.
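For reference, the helper being discussed presumably just maps a transpose flag to the matching cuSPARSE operation code. A self-contained sketch (the constant values mirror cuSPARSE's cusparseOperation_t enum, where NON_TRANSPOSE is 0 and TRANSPOSE is 1; the real code would use the JCusparse cusparseOperation constants):

```java
public class CusparseOpDemo {
    // Stand-ins for cusparseOperation_t; defined locally so the sketch
    // runs without the JCuda dependency.
    static final int CUSPARSE_OPERATION_NON_TRANSPOSE = 0;
    static final int CUSPARSE_OPERATION_TRANSPOSE = 1;

    // Maps a transpose flag to the corresponding cuSPARSE operation code.
    static int cusparseOp(boolean isTransposed) {
        return isTransposed ? CUSPARSE_OPERATION_TRANSPOSE
                            : CUSPARSE_OPERATION_NON_TRANSPOSE;
    }
}
```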

private static final Log LOG = LogFactory.getLog(LibMatrixCuMatMult.class.getName());

private static class CuMatMultParameters {
public int m; public int n; public int k;
Member

Could you please split each of these declarations onto a separate line? (Does the Java formatter do this automatically?)

Also, please put in a comment on what each of these fields means and when they can possibly (if at all) hold special values like -1.

Contributor Author

Updated the documentation

* Memory Requirements -
* Both dense - inputs, output, no intermediate
* Both sparse - inputs, output, no intermediate
* One sparse, one dense - inputs, output, intermediates - (input_dim1 * input_dim2) OR (input_dim1 * input_dim2 + input in sparse format)
Member

Does this documentation still hold true after your changes (memory estimates)?

Contributor Author

No, updated the documentation

right.getNumRows(), right.getNumColumns(), isLeftTransposed, isRightTransposed);

if(isM1Sparse && isM2Sparse) {
// -------------------------------------------------------------------------------------
Member

This might be a good opportunity to create a sparseSparseMatMult function which encapsulates the relevant bit of code in this if guard.

Contributor Author

I didn't add it since it is a one-line method. No guard is required, thanks to CuMatMultParameters.

Member

@nakul02 nakul02 Oct 20, 2017

The guard I am referring to is at line 130 - the particular if branch for when both matrices are sparse. This is keeping in line with having functions for sparseDenseMatMult, denseSparseMatMult and denseDenseMatMult.
The function would contain, as you rightly pointed out - just the one line of cusparseDcsrgemm and the surrounding advanced timer code.

I have no strong opinion here.

return output;
}

private static void sparseDenseMatMult(GPUContext gCtx, String instName, Pointer C, CSRPointer A, Pointer B,
Member

Could you please make use of the newly created CuMatMultParameters in the signature of this function as well as its invocation?
While you are at it, could you please add javadoc to this method, while outlining the assumptions and the strategy used?

Also, I like the reduction in complexity of this method by invoking the already existing denseSparseMatMult 👍

Contributor Author

Done.

else {
// -------------------------------------------------------------------------------------
// dense-dense matrix multiplication
if(isM1Sparse && !isM2Sparse) {
Member

Is this if guard needed? This seems to have been covered at line 172.

Contributor Author

In my earlier commit, I was trying to deal with the memory tradeoffs in this PR, but decided against it so as not to make this PR overly complicated.

JCublas2.cublasDgeam(gCtx.getCublasHandle(), cublasOperation.CUBLAS_OP_T, cublasOperation.CUBLAS_OP_T,
toInt(outCLen), toInt(outRLen), one(), output,
toInt(outRLen), zero(), new Pointer(), toInt(outRLen), C, toInt(outCLen));
gCtx.cudaFreeHelper(output, true);
Member

Do you need to force this cudaFreeHelper call to use eager evaluation? Maybe you could let the user override it with a configuration option instead?

Member

Please add some timers here so that we can tell from the advanced stats whether this code path was taken.
Ideally I would have liked to encapsulate this transpose into a helper function and invoke that. The timers could live inside that transpose function, and the resulting code would be easier to read.

Contributor Author

Let's clean up the other methods in a subsequent PR and focus only on matmult here.

@nakul02
Member

nakul02 commented Oct 20, 2017

Could you please encapsulate the correctness testing you did into the unit tests?

@nakul02
Member

nakul02 commented Oct 20, 2017

In master, if either of the matrices is sparse, the output matrix may be sparse.
With this PR, if either of the matrices is dense, the output is always dense. There is a memory trade-off to consider, besides the performance.
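The memory side of that trade-off can be made concrete with standard sizing formulas for double-precision matrices (a back-of-the-envelope sketch with hypothetical names, not SystemML's actual memory estimator):

```java
public class MemTradeoff {
    // Size of an m x n dense matrix of doubles, in bytes.
    static long denseBytes(long m, long n) {
        return m * n * 8L;
    }

    // Size of the same matrix in CSR with int indices: 8 bytes per
    // nonzero value, 4 bytes per column index, plus (m + 1) 4-byte
    // row pointers.
    static long csrBytes(long m, long nnz) {
        return nnz * 12L + (m + 1) * 4L;
    }
}
```

For a 3000 x 3000 matrix at sparsity 0.01, dense storage is 72 MB while CSR is about 1.1 MB, so always producing a dense output can cost substantial GPU memory even when it is faster.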

@niketanpansare niketanpansare changed the title [SYSTEMML-446] [SYSTEMML-702] Updated the sparse matrix multiplication to minimize sparse-to-dense conversion [SYSTEMML-446] [SYSTEMML-702] Updated the sparse matrix multiplication to minimize sparse-to-dense as well as dense-to-sparse conversion Oct 20, 2017
@niketanpansare
Contributor Author

Updated the title of the PR to make the goal of this PR explicit: minimize conversions. We can deal with tradeoffs in a different PR.

@nakul02
Member

nakul02 commented Oct 20, 2017

Thank you for the documentation additions!
Could you please also take care of the swap and cusparseOp like helper functions and add/edit unit tests?

@niketanpansare
Contributor Author

Updated the JUnit tests. I would prefer to keep the swap and cusparseOp methods as-is.

@nakul02
Member

nakul02 commented Oct 20, 2017

Ok, other than the disagreeable swap and cusparseOp, everything LGTM. Please feel free to merge.

@niketanpansare
Contributor Author

Thanks. I will merge :)

@niketanpansare
Contributor Author

Also adding the detailed stats for each value in the above tables, for future reference:

Matrix-vector 1M x 1K
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.1 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]       346.542    100              aqrs[0.001s,100], aqmd[0.003s,100], aqrd[0.004s,100], rlswr[0.007s,0], rlsi[0.026s,200]
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.1 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]       361.407    100              aqrs[0.001s,100], aqrd[0.004s,100], aqmd[0.004s,100], rlswr[0.005s,0], rlsi[0.019s,200]

Matrix-matrix 1M x 1K x 20
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.1 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        16.804    100              aqmd[0.001s,100], aqrd[0.005s,100], rlswr[0.006s,0], rlsi[0.006s,200], aqrs[1.124s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     9.877    100   sync[0.001s,100], Msmdm[0.005s,100], aqrd[0.005s,1], H2D[0.221s,2], aqrs[0.889s,1], ad[1.320s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]   223.848    100 sync[0.002s,100], aqrd[0.005s,1], H2D[0.168s,2], aqrs[1.097s,1], f[22.873s,100], s2d[1.496s,100], Md
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.1 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        17.116    100              aqmd[0.001s,100], rlswr[0.003s,0], aqrd[0.004s,100], rlsi[0.006s,200], aqrs[1.137s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    10.527    100   sync[0.001s,100], Msmdm[0.004s,100], aqrd[0.004s,1], H2D[0.282s,2], aqrs[0.871s,1], ad[1.311s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]   226.077    100 sync[0.002s,100], aqrd[0.005s,1], H2D[0.222s,2], aqrs[1.417s,1], f[22.940s,100], s2d[1.497s,100], Md

Squared matrix 3K x 3K
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.1 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        14.212    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.005s,200], aqrd[0.178s,100], aqrs[0.265s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     8.051    100   sync[0.001s,100], Msmdm[0.005s,100], H2D[0.012s,2], aqrd[0.154s,1], aqrs[0.362s,1], ad[0.588s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     5.676    100 sync[0.001s,100], Mdmdm[0.009s,100], H2D[0.012s,2], ad[0.066s,100], aqrd[0.159s,1], aqrs[0.164s,1],
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.1 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         7.011    100                                aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.337s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    23.627    100                Msao[2.968s,100], Msmsm[20.267s,100], sync[0.002s,100], H2D[0.004s,2], aqrs[0.334s,2]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    23.694    100                Msao[2.977s,100], Msmsm[20.350s,100], sync[0.001s,100], H2D[0.004s,2], aqrs[0.320s,2]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.99 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        23.185    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.005s,200], aqrd[0.160s,100], aqrs[0.171s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    73.655    100  Msmdm[72.502s,100], sync[0.002s,100], H2D[0.017s,2], aqrd[0.240s,1], aqrs[0.282s,1], ad[0.577s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     6.490    100 sync[0.001s,100], Mdmdm[0.008s,100], H2D[0.013s,2], ad[0.066s,100], aqrd[0.164s,1], aqrs[0.235s,1],

64 rows x 196,608 columns %*% 196,608 x 512
Stats for [64, 196608] %*% [196608, 512] with sp1=0.1 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        11.388    100              aqmd[0.002s,100], rlswr[0.004s,0], rlsi[0.006s,200], aqrs[0.269s,100], aqrd[0.612s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     4.556    100   sync[0.001s,100], Msmdm[0.003s,100], ad[0.030s,100], H2D[0.121s,2], aqrs[0.281s,1], aqrd[0.612s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     4.883    100 f[2.984s,100], sync[0.001s,100], Mdmdm[0.009s,100], ad[0.029s,100], H2D[0.121s,2], aqrs[0.262s,1], a
Stats for [64, 196608] %*% [196608, 512] with sp1=0.1 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         3.260    100                                aqmd[0.002s,100], rlswr[0.003s,0], rlsi[0.006s,200], aqrs[1.094s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    10.280    100                 Msmsm[8.040s,100], sync[0.001s,100], H2D[0.020s,2], aqrs[1.005s,2], Msao[1.069s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    10.200    100                 Msmsm[8.067s,100], sync[0.001s,100], H2D[0.020s,2], aqrs[0.873s,2], Msao[1.075s,100]
Stats for [64, 196608] %*% [196608, 512] with sp1=0.99 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        12.555    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.006s,200], aqrd[0.179s,100], aqrs[0.767s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    24.412    100  sync[0.001s,100], H2D[0.032s,2], ad[0.036s,100], aqrd[0.191s,1], aqrs[0.520s,1], Msmdm[23.498s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    13.287    100 H2D[0.033s,2], aqrd[0.280s,1], aqrs[0.503s,1], f[5.787s,100], s2d[6.497s,100], sync[0.001s,100], Mdm

100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1)
Stats for [100, 600] %*% [600, 205800] with sp1=0.99 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]       111.711    100              aqmd[0.002s,100], rlswr[0.006s,0], rlsi[0.008s,200], aqrd[0.038s,100], aqrs[2.102s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    41.010    100  Msmdm[37.911s,100], sync[0.001s,100], H2D[0.023s,2], aqrd[0.042s,1], ad[1.340s,100], aqrs[1.614s,1]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     8.335    100 sync[0.001s,100], Mdmdm[0.008s,100], H2D[0.022s,2], aqrd[0.045s,1], ad[0.088s,100], f[5.278s,100], s
Stats for [100, 600] %*% [600, 205800] with sp1=0.1 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        17.562    100                                aqrs[2.595s,200], aqmd[0.001s,100], rlsi[0.006s,200], rlswr[0.006s,0]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    20.960    100                aqrs[1.589s,2], Msmsm[15.794s,100], Msao[3.471s,100], sync[0.002s,100], H2D[0.024s,2]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    20.850    100                aqrs[1.462s,2], Msmsm[15.791s,100], Msao[3.480s,100], sync[0.001s,100], H2D[0.033s,2]

100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1)
Stats for [100, 900] %*% [900, 205800] with sp1=0.99 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]       223.880    100              aqrs[2.518s,100], aqmd[0.003s,100], rlswr[0.006s,0], rlsi[0.021s,200], aqrd[0.034s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    59.703    100  sync[0.002s,100], H2D[0.032s,2], aqrd[0.042s,1], Msmdm[56.236s,100], ad[1.337s,100], aqrs[1.922s,1]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]    11.942    100 f[7.916s,100], sync[0.001s,100], Mdmdm[0.010s,100], H2D[0.033s,2], aqrd[0.042s,1], ad[0.083s,100], s
Stats for [100, 900] %*% [900, 205800] with sp1=0.1 and sp2=0.1
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        21.455    100                                aqrs[2.857s,200], aqmd[0.002s,100], rlsi[0.005s,200], rlswr[0.008s,0]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    27.985    100                aqrs[2.228s,2], Msao[3.984s,100], sync[0.001s,100], H2D[0.033s,2], Msmsm[21.621s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    27.866    100                Msao[3.987s,100], sync[0.001s,100], H2D[0.047s,2], Msmsm[21.645s,100], aqrs[2.055s,2]

Matrix-vector 1M x 1K
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.01 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         2.646    100              aqmd[0.001s,100], aqrd[0.003s,100], rlswr[0.004s,0], rlsi[0.004s,200], aqrs[0.975s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.077    100   aqrd[0.003s,1], Msmdv[0.004s,100], H2D[0.019s,2], ad[0.097s,100], sync[0.174s,100], aqrs[0.617s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     1.184    100 az[0.002s,100], aqrd[0.003s,1], Msmdv[0.004s,100], H2D[0.018s,2], a[0.093s,100], sync[0.171s,100], a
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         2.459    100              aqmd[0.001s,100], aqrd[0.003s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[1.001s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.056    100   aqrd[0.003s,1], Msmdv[0.007s,100], H2D[0.020s,2], ad[0.092s,100], sync[0.164s,100], aqrs[0.602s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     1.049    100 az[0.002s,100], aqrd[0.003s,1], Msmdv[0.004s,100], H2D[0.018s,2], a[0.095s,100], sync[0.170s,100], a

Matrix-matrix 1M x 1K x 20
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.01 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         9.466    100              aqmd[0.002s,100], rlswr[0.004s,0], aqrd[0.005s,100], rlsi[0.005s,200], aqrs[0.992s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     3.470    100   sync[0.001s,100], Msmdm[0.004s,100], aqrd[0.005s,1], H2D[0.018s,2], aqrs[0.715s,1], ad[1.317s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]   152.291    100 sync[0.002s,100], aqrd[0.005s,1], H2D[0.018s,2], aqrs[0.637s,1], f[22.822s,100], s2d[1.497s,100], Md
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.01 and sp2=0.01
CP_MASTER   2  ba+* [test.dml 12:6-12:12]         9.129    100                                aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.006s,200], aqrs[1.021s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     4.663    100                 sync[0.001s,100], H2D[0.018s,2], aqrs[0.815s,2], Msmsm[1.101s,100], Msao[2.569s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     4.473    100                 sync[0.002s,100], H2D[0.022s,2], aqrs[0.593s,2], Msmsm[1.101s,100], Msao[2.590s,100]

Squared matrix 3K x 3K
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.01 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         6.913    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.005s,200], aqrs[0.078s,100], aqrd[0.169s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.773    100   sync[0.002s,100], Msmdm[0.005s,100], H2D[0.012s,2], aqrs[0.092s,1], aqrd[0.186s,1], ad[0.616s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     5.429    100 sync[0.002s,100], Mdmdm[0.010s,100], H2D[0.011s,2], ad[0.069s,100], aqrs[0.091s,1], aqrd[0.151s,1],
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        12.415    100                                aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.115s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.940    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.118s,2], Msmsm[1.344s,100], Msao[1.448s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     2.917    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.096s,2], Msmsm[1.342s,100], Msao[1.452s,100]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.99 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         7.835    100              aqmd[0.002s,100], rlswr[0.004s,0], rlsi[0.006s,200], aqrs[0.067s,100], aqrd[0.193s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    45.697    100  sync[0.002s,100], H2D[0.011s,2], aqrs[0.058s,1], aqrd[0.214s,1], ad[0.587s,100], Msmdm[44.802s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     6.180    100 sync[0.001s,100], Mdmdm[0.008s,100], H2D[0.014s,2], aqrs[0.056s,1], ad[0.069s,100], aqrd[0.202s,1],

64 rows x 196,608 columns %*% 196,608 x 512
Stats for [64, 196608] %*% [196608, 512] with sp1=0.01 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         2.471    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.006s,200], aqrs[0.087s,100], aqrd[0.652s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.319    100   sync[0.001s,100], Msmdm[0.003s,100], ad[0.031s,100], aqrs[0.094s,1], H2D[0.115s,2], aqrd[0.624s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     4.614    100 f[2.873s,100], sync[0.001s,100], Mdmdm[0.009s,100], ad[0.028s,100], aqrs[0.098s,1], H2D[0.119s,2], a
Stats for [64, 196608] %*% [196608, 512] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.737    100                                aqmd[0.001s,100], rlsi[0.002s,200], rlswr[0.003s,0], aqrs[0.192s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.506    100                 sync[0.002s,100], H2D[0.004s,2], aqrs[0.201s,2], Msao[0.665s,100], Msmsm[1.568s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     2.508    100                 sync[0.001s,100], H2D[0.003s,2], aqrs[0.195s,2], Msao[0.660s,100], Msmsm[1.569s,100]
Stats for [64, 196608] %*% [196608, 512] with sp1=0.99 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         3.452    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.006s,200], aqrs[0.118s,100], aqrd[0.234s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    22.268    100  sync[0.001s,100], H2D[0.017s,2], ad[0.032s,100], aqrs[0.094s,1], aqrd[0.205s,1], Msmdm[21.864s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    11.987    100 H2D[0.023s,2], aqrs[0.113s,1], aqrd[0.230s,1], f[5.043s,100], s2d[6.485s,100], sync[0.001s,100], Mdm

100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1)
Stats for [100, 600] %*% [600, 205800] with sp1=0.99 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        15.119    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.005s,200], aqrd[0.033s,100], aqrs[0.258s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    26.078    100  sync[0.002s,100], H2D[0.003s,2], aqrd[0.042s,1], aqrs[0.229s,1], ad[1.343s,100], Msmdm[24.429s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     5.822    100 f[4.040s,100], sync[0.001s,100], H2D[0.003s,2], Mdmdm[0.007s,100], aqrd[0.043s,1], ad[0.083s,100], a
Stats for [100, 600] %*% [600, 205800] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        17.242    100                                aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.346s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    21.352    100                Msao[9.591s,100], Msmsm[11.391s,100], sync[0.001s,100], H2D[0.003s,2], aqrs[0.326s,2]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    21.313    100                Msao[9.592s,100], Msmsm[11.404s,100], sync[0.001s,100], H2D[0.003s,2], aqrs[0.278s,2]
100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1)
Stats for [100, 900] %*% [900, 205800] with sp1=0.99 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        20.375    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.006s,200], aqrd[0.042s,100], aqrs[0.400s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    37.762    100  sync[0.002s,100], H2D[0.004s,2], aqrd[0.042s,1], aqrs[0.369s,1], ad[1.328s,100], Msmdm[35.972s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     7.943    100 sync[0.001s,100], H2D[0.004s,2], Mdmdm[0.010s,100], aqrd[0.043s,1], ad[0.085s,100], aqrs[0.336s,1],
Stats for [100, 900] %*% [900, 205800] with sp1=0.01 and sp2=0.01
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        18.948    100                                aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.475s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    16.483    100                 Msao[7.284s,100], sync[0.001s,100], H2D[0.004s,2], Msmsm[8.796s,100], aqrs[0.360s,2]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    16.567    100                 Msao[7.283s,100], sync[0.001s,100], H2D[0.004s,2], Msmsm[8.790s,100], aqrs[0.455s,2]

Matrix-vector 1M x 1K
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.539    100              aqmd[0.001s,100], aqrd[0.003s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.342s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.455    100   H2D[0.003s,2], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.075s,100], ad[0.097s,100], aqrs[0.205s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.492    100 az[0.001s,100], aqrd[0.003s,1], H2D[0.003s,2], Msmdv[0.003s,100], sync[0.075s,100], a[0.091s,100], a
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.001 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.455    100              aqmd[0.001s,100], aqrd[0.003s,100], rlswr[0.003s,0], rlsi[0.005s,200], aqrs[0.285s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.559    100   H2D[0.003s,2], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.075s,100], ad[0.099s,100], aqrs[0.306s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.621    100 az[0.001s,100], H2D[0.003s,2], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.077s,100], a[0.093s,100], a

Matrix-matrix 1M x 1K x 20
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.001 and sp2=0.99
CP_MASTER   2  ba+* [test.dml 12:6-12:12]         9.091    100              aqmd[0.001s,100], rlswr[0.003s,0], aqrd[0.004s,100], rlsi[0.006s,200], aqrs[0.359s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.484    100   sync[0.001s,100], H2D[0.003s,2], Msmdm[0.004s,100], aqrd[0.006s,1], aqrs[0.273s,1], ad[1.329s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]   148.133    100 sync[0.002s,100], s2d[1.495s,100], f[22.993s,100], Mdmdm[123.207s,100], H2D[0.003s,2], aqrd[0.005s,1
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.001 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         9.699    100                                aqms[0.001s,100], rlsi[0.004s,200], aqrs[0.248s,200], rlswr[0.400s,0]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     2.604    100                 sync[0.001s,100], H2D[0.003s,2], aqrs[0.296s,2], Msmsm[0.352s,100], Msao[1.868s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     2.607    100                 sync[0.001s,100], H2D[0.003s,2], aqrs[0.329s,2], Msmsm[0.350s,100], Msao[1.851s,100]

Squared matrix 3K x 3K
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         4.558    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.005s,200], aqrs[0.048s,100], aqrd[0.131s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.283    100   sync[0.001s,100], Msmdm[0.005s,100], H2D[0.011s,2], aqrs[0.057s,1], aqrd[0.142s,1], ad[0.628s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     5.355    100 sync[0.001s,100], Mdmdm[0.010s,100], H2D[0.015s,2], aqrs[0.056s,1], ad[0.068s,100], f[4.432s,100], a
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.001 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         6.233    100                                aqms[0.001s,100], rlsi[0.004s,200], rlswr[0.038s,0], aqrs[0.064s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     0.969    100                 H2D[0.000s,2], sync[0.001s,100], aqrs[0.069s,2], Msmsm[0.094s,100], Msao[0.780s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     0.964    100                 H2D[0.000s,2], sync[0.001s,100], aqrs[0.064s,2], Msmsm[0.093s,100], Msao[0.778s,100]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.99 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         4.782    100              aqmd[0.002s,100], rlswr[0.004s,0], rlsi[0.005s,200], aqrs[0.021s,100], aqrd[0.217s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    17.049    100  Msmdm[16.213s,100], sync[0.002s,100], H2D[0.011s,2], aqrs[0.022s,1], aqrd[0.201s,1], ad[0.574s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     6.002    100 sync[0.001s,100], Mdmdm[0.008s,100], H2D[0.017s,2], aqrs[0.021s,1], ad[0.068s,100], aqrd[0.162s,1],

64 rows x 196,608 columns %*% 196,608 x 512
Stats for [64, 196608] %*% [196608, 512] with sp1=0.001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.107    100              aqmd[0.001s,100], rlswr[0.002s,0], rlsi[0.003s,200], aqrs[0.049s,100], aqrd[0.616s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.951    100   sync[0.001s,100], Msmdm[0.003s,100], ad[0.031s,100], aqrs[0.058s,1], H2D[0.136s,2], aqrd[0.624s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     4.470    100 f[2.797s,100], sync[0.001s,100], Mdmdm[0.009s,100], ad[0.029s,100], aqrs[0.064s,1], H2D[0.115s,2], a
Stats for [64, 196608] %*% [196608, 512] with sp1=0.001 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.491    100                                aqms[0.001s,100], rlsi[0.002s,200], rlswr[0.002s,0], aqrs[0.102s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.409    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.089s,2], Msmsm[0.104s,100], Msao[0.173s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.418    100                 sync[0.001s,100], H2D[0.001s,2], aqrs[0.094s,2], Msmsm[0.110s,100], Msao[0.170s,100]
Stats for [64, 196608] %*% [196608, 512] with sp1=0.99 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.870    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.006s,200], aqrs[0.058s,100], aqrd[0.186s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     9.672    100   sync[0.001s,100], H2D[0.020s,2], ad[0.033s,100], aqrs[0.052s,1], aqrd[0.272s,1], Msmdm[9.255s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    11.687    100 sync[0.001s,100], H2D[0.015s,2], aqrd[0.218s,1], f[4.815s,100], s2d[6.514s,100], Mdmdm[0.010s,100],

100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1)
Stats for [100, 600] %*% [600, 205800] with sp1=0.99 and sp2=0.001
CP_MASTER   2  ba+* [test.dml 12:6-12:12]         8.325    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.005s,200], aqrd[0.044s,100], aqrs[0.120s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     9.989    100   Msmdm[8.484s,100], H2D[0.001s,2], sync[0.002s,100], aqrd[0.035s,1], aqrs[0.105s,1], ad[1.345s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     5.545    100 f[3.905s,100], H2D[0.001s,2], sync[0.001s,100], Mdmdm[0.008s,100], aqrd[0.040s,1], ad[0.083s,100], a
Stats for [100, 600] %*% [600, 205800] with sp1=0.001 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        11.496    100                                aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.143s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     4.588    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.127s,2], Msmsm[1.363s,100], Msao[3.075s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     4.584    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.118s,2], Msmsm[1.365s,100], Msao[3.080s,100]

100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1)
Stats for [100, 900] %*% [900, 205800] with sp1=0.99 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         9.710    100              aqmd[0.001s,100], rlswr[0.004s,0], rlsi[0.005s,200], aqrd[0.042s,100], aqrs[0.119s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]    13.873    100  Msmdm[12.351s,100], H2D[0.001s,2], sync[0.001s,100], aqrd[0.042s,1], aqrs[0.131s,1], ad[1.328s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     7.509    100 H2D[0.001s,2], sync[0.001s,100], Mdmdm[0.009s,100], aqrd[0.038s,1], ad[0.082s,100], aqrs[0.127s,1],
Stats for [100, 900] %*% [900, 205800] with sp1=0.001 and sp2=0.001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        11.452    100                                aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.148s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     5.588    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.157s,2], Msmsm[1.800s,100], Msao[3.604s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     5.575    100                 H2D[0.001s,2], sync[0.001s,100], aqrs[0.144s,2], Msmsm[1.801s,100], Msao[3.604s,100]

Matrix-vector 1M x 1K
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.0001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.111    100              aqmd[0.001s,100], rlswr[0.002s,0], aqrd[0.003s,100], rlsi[0.003s,200], aqrs[0.111s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.289    100   H2D[0.002s,2], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.059s,100], aqrs[0.087s,1], ad[0.099s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.288    100 H2D[0.001s,2], az[0.001s,100], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.058s,100], aqrs[0.092s,1],
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.0001 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.029    100              aqmd[0.001s,100], rlswr[0.002s,0], aqrd[0.002s,100], rlsi[0.003s,200], aqrs[0.091s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.313    100   H2D[0.001s,2], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.057s,100], ad[0.097s,100], aqrs[0.113s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.308    100 H2D[0.001s,2], az[0.002s,100], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.058s,100], a[0.094s,100], a

Matrix-matrix 1M x 1K x 20
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.0001 and sp2=0.99
CP_MASTER   2  ba+* [test.dml 12:6-12:12]         6.622    100              aqmd[0.001s,100], rlswr[0.004s,0], aqrd[0.005s,100], rlsi[0.007s,200], aqrs[0.106s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.200    100   sync[0.001s,100], H2D[0.001s,2], Msmdm[0.004s,100], aqrd[0.005s,1], aqrs[0.100s,1], ad[1.331s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]   147.446    100 s2d[1.497s,100], f[23.020s,100], Mdmdm[122.715s,100], H2D[0.001s,2], sync[0.002s,100], aqrd[0.005s,1
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.0001 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        10.088    100                                aqms[0.001s,100], rlsi[0.004s,200], aqrs[0.110s,200], rlswr[0.281s,0]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     2.104    100                 sync[0.001s,100], H2D[0.001s,2], aqrs[0.104s,2], Msmsm[0.235s,100], Msao[1.717s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     2.107    100                 sync[0.001s,100], H2D[0.002s,2], aqrs[0.094s,2], Msmsm[0.236s,100], Msao[1.731s,100]

Squared matrix 3K x 3K
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.0001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         8.146    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.045s,100], aqrd[0.163s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.227    100   sync[0.001s,100], Msmdm[0.005s,100], H2D[0.011s,2], aqrs[0.049s,1], aqrd[0.165s,1], ad[0.636s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     5.288    100 sync[0.001s,100], Mdmdm[0.010s,100], H2D[0.011s,2], f[4.307s,100], aqrs[0.049s,1], ad[0.066s,100], a
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.0001 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         4.717    100                                aqms[0.001s,100], rlsi[0.004s,200], rlswr[0.012s,0], aqrs[0.056s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     0.801    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.015s,100], aqrs[0.060s,2], Msao[0.699s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     0.799    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.015s,100], aqrs[0.061s,2], Msao[0.698s,100]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.99 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        11.358    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.014s,100], aqrd[0.191s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     3.196    100   sync[0.001s,100], H2D[0.014s,2], aqrs[0.019s,1], aqrd[0.210s,1], ad[0.613s,100], Msmdm[2.319s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     5.834    100 sync[0.001s,100], Mdmdm[0.008s,100], H2D[0.011s,2], aqrs[0.016s,1], ad[0.066s,100], aqrd[0.260s,1],

64 rows x 196,608 columns %*% 196,608 x 512
Stats for [64, 196608] %*% [196608, 512] with sp1=0.0001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.196    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.003s,200], aqrs[0.049s,100], aqrd[0.666s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.902    100   sync[0.001s,100], Msmdm[0.003s,100], ad[0.030s,100], aqrs[0.052s,1], H2D[0.120s,2], aqrd[0.632s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     4.470    100 f[2.802s,100], sync[0.001s,100], Mdmdm[0.010s,100], ad[0.029s,100], aqrs[0.050s,1], H2D[0.115s,2], a
Stats for [64, 196608] %*% [196608, 512] with sp1=0.0001 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.402    100                                aqms[0.001s,100], rlsi[0.002s,200], rlswr[0.004s,0], aqrs[0.060s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.237    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.015s,100], aqrs[0.074s,2], Msao[0.116s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.222    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.015s,100], aqrs[0.060s,2], Msao[0.117s,100]
Stats for [64, 196608] %*% [196608, 512] with sp1=0.99 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.486    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.021s,100], aqrd[0.277s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.620    100   sync[0.001s,100], H2D[0.015s,2], aqrs[0.019s,1], ad[0.035s,100], aqrd[0.195s,1], Msmdm[1.328s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    11.582    100 sync[0.001s,100], H2D[0.015s,2], aqrd[0.212s,1], f[4.761s,100], s2d[6.506s,100], Mdmdm[0.010s,100],

100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1)
Stats for [100, 600] %*% [600, 205800] with sp1=0.99 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        16.178    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.034s,100], aqrd[0.041s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.693    100   H2D[0.001s,2], sync[0.002s,100], aqrs[0.032s,1], aqrd[0.035s,1], Msmdm[1.229s,100], ad[1.373s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     5.425    100 f[3.883s,100], H2D[0.001s,2], sync[0.002s,100], Mdmdm[0.008s,100], aqrs[0.028s,1], aqrd[0.039s,1], a
Stats for [100, 600] %*% [600, 205800] with sp1=0.0001 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         9.132    100                                aqms[0.001s,100], rlsi[0.005s,200], rlswr[0.008s,0], aqrs[0.062s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     1.564    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.031s,100], aqrs[0.063s,2], Msao[1.446s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     1.572    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.032s,100], aqrs[0.068s,2], Msao[1.453s,100]

100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1)
Stats for [100, 900] %*% [900, 205800] with sp1=0.99 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        18.504    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrd[0.042s,100], aqrs[0.043s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     3.168    100   H2D[0.001s,2], sync[0.001s,100], aqrs[0.034s,1], aqrd[0.039s,1], ad[1.377s,100], Msmdm[1.693s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     7.390    100 H2D[0.001s,2], sync[0.001s,100], Mdmdm[0.009s,100], aqrs[0.041s,1], aqrd[0.042s,1], ad[0.081s,100],
Stats for [100, 900] %*% [900, 205800] with sp1=0.0001 and sp2=0.0001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         8.453    100                                aqms[0.001s,100], rlsi[0.004s,200], rlswr[0.005s,0], aqrs[0.080s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     1.614    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.049s,100], aqrs[0.066s,2], Msao[1.472s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     1.618    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.048s,100], aqrs[0.069s,2], Msao[1.475s,100]

Matrix-vector 1M x 1K
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.00001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.063    100              aqmd[0.001s,100], rlswr[0.002s,0], aqrd[0.003s,100], rlsi[0.003s,200], aqrs[0.073s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.262    100   H2D[0.001s,2], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.047s,100], aqrs[0.079s,1], ad[0.097s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.247    100 H2D[0.001s,2], az[0.002s,100], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.050s,100], aqrs[0.062s,1],
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.00001 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.042    100              aqmd[0.001s,100], aqrd[0.002s,100], rlswr[0.002s,0], rlsi[0.003s,200], aqrs[0.081s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.248    100   H2D[0.001s,2], aqrd[0.002s,1], Msmdv[0.004s,100], sync[0.049s,100], aqrs[0.061s,1], ad[0.100s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.271    100 az[0.001s,100], H2D[0.002s,2], aqrd[0.002s,1], Msmdv[0.004s,100], sync[0.049s,100], aqrs[0.083s,1],

Matrix-matrix 1M x 1K x 20
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.00001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        10.150    100              aqms[0.002s,100], rlsi[0.004s,200], aqrd[0.005s,100], aqrs[0.060s,100], rlswr[0.562s,0]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.114    100   sync[0.001s,100], Msmdm[0.004s,100], aqrd[0.005s,1], H2D[0.007s,2], aqrs[0.076s,1], ad[1.334s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     2.303    100 sync[0.001s,100], H2D[0.001s,2], aqrd[0.006s,1], f[0.026s,200], dtl[0.032s,100], aqrs[0.075s,1], Msm
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.00001 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         8.572    100                                aqms[0.002s,100], rlsi[0.005s,200], aqrs[0.063s,200], rlswr[0.295s,0]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     1.905    100                 sync[0.001s,100], H2D[0.001s,2], aqrs[0.069s,2], Msmsm[0.159s,100], Msao[1.644s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     1.922    100                 sync[0.001s,100], H2D[0.001s,2], aqrs[0.085s,2], Msmsm[0.159s,100], Msao[1.643s,100]

Squared matrix 3K x 3K
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.00001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         6.698    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.036s,100], aqrd[0.133s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.133    100   sync[0.001s,100], Msmdm[0.005s,100], H2D[0.011s,2], aqrs[0.047s,1], aqrd[0.133s,1], ad[0.615s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     3.341    100 sync[0.001s,100], H2D[0.011s,2], aqrs[0.046s,1], aqrd[0.130s,1], Msao[0.311s,100], Msmsm[0.322s,100]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.00001 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         4.892    100                                aqms[0.001s,100], rlsi[0.006s,200], rlswr[0.012s,0], aqrs[0.050s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     0.790    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.011s,100], aqrs[0.061s,2], Msao[0.693s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     0.780    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.011s,100], aqrs[0.059s,2], Msao[0.686s,100]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.99 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         7.324    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.005s,200], aqrs[0.013s,100], aqrd[0.228s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.396    100   sync[0.002s,100], H2D[0.012s,2], aqrs[0.017s,1], aqrd[0.189s,1], Msmdm[0.526s,100], ad[0.629s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    17.673    100 d2s[0.909s,100], f[1.221s,200], Msmsm[6.856s,100], Msao[7.817s,100], sync[0.001s,100], H2D[0.011s,2]

64 rows x 196,608 columns %*% 196,608 x 512
Stats for [64, 196608] %*% [196608, 512] with sp1=0.00001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.670    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.006s,200], aqrs[0.039s,100], aqrd[0.647s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.915    100   sync[0.001s,100], Msmdm[0.003s,100], ad[0.032s,100], aqrs[0.047s,1], H2D[0.119s,2], aqrd[0.651s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    26.518    100 sync[0.001s,100], aqrs[0.051s,1], H2D[0.115s,2], Msao[0.140s,100], Msmsm[0.235s,100], d2s[9.018s,100
Stats for [64, 196608] %*% [196608, 512] with sp1=0.00001 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.416    100                                                 aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.049s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.211    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.011s,100], aqrs[0.059s,2], Msao[0.113s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.215    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.011s,100], aqrs[0.059s,2], Msao[0.114s,100]
Stats for [64, 196608] %*% [196608, 512] with sp1=0.99 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.461    100              aqmd[0.001s,100], rlswr[0.003s,0], rlsi[0.005s,200], aqrs[0.015s,100], aqrd[0.175s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.716    100   sync[0.001s,100], H2D[0.015s,2], aqrs[0.018s,1], ad[0.032s,100], aqrd[0.184s,1], Msmdm[0.445s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    58.097    100 f[1.679s,200], Msmsm[23.476s,100], d2s[10.945s,100], Msao[21.033s,100], sync[0.001s,100], H2D[0.015s

100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1)
Stats for [100, 600] %*% [600, 205800] with sp1=0.99 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        12.600    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.023s,100], aqrd[0.040s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.775    100   H2D[0.000s,2], sync[0.002s,100], aqrs[0.024s,1], aqrd[0.040s,1], Msmdm[0.299s,100], ad[1.386s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    11.463    100 H2D[0.000s,2], sync[0.001s,100], aqrs[0.023s,1], f[0.027s,200], dtl[0.032s,100], aqrd[0.045s,1], Msa
Stats for [100, 600] %*% [600, 205800] with sp1=0.00001 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.558    100                                                 aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.048s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     1.509    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.010s,100], aqrs[0.058s,2], Msao[1.417s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     1.521    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.011s,100], aqrs[0.061s,2], Msao[1.430s,100]

100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1)
Stats for [100, 900] %*% [900, 205800] with sp1=0.99 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        13.499    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.024s,100], aqrd[0.039s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.850    100   H2D[0.001s,2], sync[0.001s,100], aqrs[0.025s,1], aqrd[0.042s,1], Msmdm[0.360s,100], ad[1.399s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    21.292    100 H2D[0.001s,2], sync[0.001s,100], aqrs[0.029s,1], aqrd[0.043s,1], f[0.065s,200], d2s[0.324s,100], Msa
Stats for [100, 900] %*% [900, 205800] with sp1=0.00001 and sp2=0.00001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.645    100                                                 aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.051s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     1.530    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.014s,100], aqrs[0.061s,2], Msao[1.434s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     1.535    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.014s,100], aqrs[0.070s,2], Msao[1.431s,100]

Matrix-vector 1M x 1K
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.000001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.619    100              aqmd[0.001s,100], aqrd[0.002s,100], rlswr[0.003s,0], rlsi[0.003s,200], aqrs[0.054s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.245    100   H2D[0.001s,2], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.044s,100], aqrs[0.061s,1], ad[0.097s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.252    100 H2D[0.001s,2], az[0.001s,100], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.045s,100], aqrs[0.074s,1],
Stats for [1000000, 1000] %*% [1000, 1] with sp1=0.000001 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.010    100              aqmd[0.001s,100], rlswr[0.002s,0], aqrd[0.002s,100], rlsi[0.003s,200], aqrs[0.053s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.247    100   H2D[0.001s,2], aqrd[0.003s,1], Msmdv[0.004s,100], sync[0.046s,100], aqrs[0.062s,1], ad[0.100s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.247    100 H2D[0.001s,2], az[0.001s,100], aqrd[0.002s,1], Msmdv[0.004s,100], sync[0.046s,100], aqrs[0.063s,1],

Matrix-matrix 1M x 1K x 20
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.000001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         9.570    100              aqms[0.002s,100], aqrd[0.004s,100], rlsi[0.005s,200], aqrs[0.052s,100], rlswr[0.311s,0]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     2.091    100   sync[0.001s,100], H2D[0.001s,2], Msmdm[0.004s,100], aqrd[0.005s,1], aqrs[0.074s,1], ad[1.337s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     2.104    100 sync[0.001s,100], H2D[0.001s,2], f[0.005s,200], aqrd[0.006s,1], dtl[0.034s,100], aqrs[0.079s,1], Msm
Stats for [1000000, 1000] %*% [1000, 20] with sp1=0.000001 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.560    100                                aqms[0.001s,100], rlsi[0.003s,200], aqrs[0.060s,200], rlswr[0.283s,0]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     1.866    100                 sync[0.001s,100], H2D[0.001s,2], aqrs[0.061s,2], Msmsm[0.144s,100], Msao[1.623s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     1.864    100                 sync[0.001s,100], H2D[0.001s,2], aqrs[0.060s,2], Msmsm[0.144s,100], Msao[1.628s,100]

Squared matrix 3K x 3K
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.000001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.969    100              aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.041s,100], rlswr[0.042s,0], aqrd[0.162s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.188    100   sync[0.001s,100], Msmdm[0.005s,100], H2D[0.011s,2], aqrs[0.042s,1], aqrd[0.173s,1], ad[0.630s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     3.226    100 sync[0.001s,100], H2D[0.011s,2], aqrs[0.045s,1], aqrd[0.159s,1], Msmsm[0.221s,100], Msao[0.272s,100]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.000001 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.382    100                                                 aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.044s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     0.775    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.010s,100], aqrs[0.057s,2], Msao[0.682s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     0.774    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.010s,100], aqrs[0.052s,2], Msao[0.682s,100]
Stats for [3000, 3000] %*% [3000, 3000] with sp1=0.99 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.983    100              aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.012s,100], rlswr[0.077s,0], aqrd[0.163s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.203    100   sync[0.001s,100], H2D[0.011s,2], aqrs[0.014s,1], aqrd[0.199s,1], Msmdm[0.315s,100], ad[0.641s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     6.896    100 sync[0.001s,100], H2D[0.011s,2], aqrs[0.018s,1], aqrd[0.173s,1], dtl[0.592s,100], d2s[0.908s,100], f

64 rows x 196,608 columns %*% 196,608 x 512
Stats for [64, 196608] %*% [196608, 512] with sp1=0.000001 and sp2=0.99
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.068    100              aqms[0.001s,100], rlsi[0.002s,200], rlswr[0.003s,0], aqrs[0.040s,100], aqrd[0.655s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.979    100   sync[0.001s,100], Msmdm[0.003s,100], ad[0.032s,100], aqrs[0.049s,1], H2D[0.119s,2], aqrd[0.706s,1]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    26.468    100 sync[0.001s,100], aqrs[0.047s,1], Msmsm[0.049s,100], Msao[0.126s,100], H2D[0.195s,2], d2s[9.067s,100
Stats for [64, 196608] %*% [196608, 512] with sp1=0.000001 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.398    100                                                 aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.045s,200]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.210    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.011s,100], aqrs[0.061s,2], Msao[0.108s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     0.210    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.010s,100], aqrs[0.061s,2], Msao[0.108s,100]
Stats for [64, 196608] %*% [196608, 512] with sp1=0.99 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         1.513    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.011s,100], aqrd[0.175s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     0.725    100   sync[0.001s,100], aqrs[0.014s,1], H2D[0.021s,2], ad[0.032s,100], aqrd[0.257s,1], Msmdm[0.379s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]    53.831    100 f[1.689s,200], Msao[19.213s,100], d2s[10.976s,100], Msmsm[21.033s,100], sync[0.001s,100], aqrs[0.011

100 rows x 600 columns %*% 600 x 205800 (K:100, CRS=1x2x300=600, NPQ=100x2058x1)
Stats for [100, 600] %*% [600, 205800] with sp1=0.99 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        12.319    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.018s,100], aqrd[0.032s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.699    100   H2D[0.000s,2], sync[0.001s,100], aqrs[0.019s,1], aqrd[0.043s,1], Msmdm[0.207s,100], ad[1.406s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     2.943    100 H2D[0.000s,2], sync[0.001s,100], aqrs[0.019s,1], f[0.028s,200], dtl[0.035s,100], aqrd[0.038s,1], Msm
Stats for [100, 600] %*% [600, 205800] with sp1=0.000001 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.433    100                                                 aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.045s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     1.514    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.011s,100], aqrs[0.057s,2], Msao[1.427s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     1.520    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.009s,100], aqrs[0.058s,2], Msao[1.428s,100]

100 rows x 900 columns %*% 900 x 205800 (K:100, CRS=1x3x300=900, NPQ=100x2058x1)
Stats for [100, 900] %*% [900, 205800] with sp1=0.99 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]        11.961    100              aqms[0.001s,100], rlswr[0.003s,0], rlsi[0.004s,200], aqrs[0.021s,100], aqrd[0.036s,100]
GPU_PR   1  gpu_ba+* [test.dml 12:6-12:12]     1.698    100   H2D[0.001s,2], sync[0.001s,100], aqrs[0.022s,1], aqrd[0.045s,1], Msmdm[0.209s,100], ad[1.400s,100]
GPU_MASTER   1  gpu_ba+* [test.dml 12:6-12:12]     3.907    100 H2D[0.001s,2], sync[0.002s,100], aqrs[0.023s,1], aqrd[0.045s,1], f[0.068s,200], d2s[0.337s,100], Msa
Stats for [100, 900] %*% [900, 205800] with sp1=0.000001 and sp2=0.000001
CP_MASTER   1  ba+* [test.dml 12:6-12:12]         0.398    100                                                 aqms[0.001s,100], rlsi[0.002s,200], aqrs[0.046s,200]
GPU_PR   2  gpu_ba+* [test.dml 12:6-12:12]     1.511    100                 H2D[0.000s,2], sync[0.001s,100], Msmsm[0.009s,100], aqrs[0.052s,2], Msao[1.426s,100]
GPU_MASTER   2  gpu_ba+* [test.dml 12:6-12:12]     1.504    100                 H2D[0.001s,2], sync[0.001s,100], Msmsm[0.010s,100], aqrs[0.056s,2], Msao[1.416s,100]
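As a back-of-the-envelope illustration of why the d2s conversions dominate the GPU_MASTER timings above, compare the dense and CSR footprints of one of the benchmark inputs. This is a hypothetical sketch assuming 8-byte double values and 4-byte indices/row pointers; SystemML's actual GPU CSR layout may differ.

```python
def dense_bytes(rows, cols):
    # Double-precision dense storage: one 8-byte value per cell.
    return rows * cols * 8

def csr_bytes(rows, nnz):
    # CSR: 8-byte values + 4-byte column indices per nonzero,
    # plus a 4-byte row pointer per row (and one sentinel).
    return nnz * (8 + 4) + (rows + 1) * 4

# The [64, 196608] %*% [196608, 512] case above with sp2=0.00001:
rows, cols, sparsity = 196608, 512, 0.00001
nnz = int(rows * cols * sparsity)

print(dense_bytes(rows, cols))  # ~805 MB if converted to dense
print(csr_bytes(rows, nnz))     # well under 1 MB kept in CSR
```

Under these assumptions, converting this operand to dense inflates it by roughly three orders of magnitude, which is the memory pressure the PR tries to avoid.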

@asfgit asfgit closed this in 6de8f05 Oct 20, 2017
j143-zz pushed a commit to j143-zz/systemml that referenced this pull request Nov 4, 2017
…n to minimize sparse-to-dense as well as dense-to-sparse conversion

1. The goal of this PR is not to improve performance (for example, by weighing the cost of sparse-to-dense conversion against the FLOPs required given a memory budget) but instead to minimize sparse-to-dense conversion in the GPU matrix multiplication operator.

2. If matmult performs unnecessary sparse-to-dense conversions, we run the
  risk of one of two situations:
- In the best case, some of the matmults won't be pushed to the GPU under the worst-case memory budget.
- On the other hand, if these conversions are not accounted for, they may cause OOMs.

3. Every operator (except dense-sparse matrix multiplication) uses only the memory allocated to the input and output matrices.

4. Since there is no CuSPARSE kernel for the dense-sparse matrix multiplication operator, we either have to transpose the output after performing sparse-dense matrix multiplication or perform dense-dense matrix multiplication after converting the sparse input to dense format.
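To make item 4 concrete, here is a small NumPy/SciPy sketch (illustrative only, not SystemML's CUDA code path) of the transpose identity that the sparse-dense route relies on: a dense-sparse product A %*% B can be computed as t(t(B) %*% t(A)), which puts the sparse operand on the left, where a sparse-dense kernel is available, at the cost of a transpose of the output.

```python
import numpy as np
from scipy.sparse import random as sprand

# Dense A (m x k) and sparse B (k x n). There is no dense-sparse kernel,
# but sparse-dense exists, so compute C = A %*% B as t(t(B) %*% t(A)).
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 512))                                  # dense input
B = sprand(512, 128, density=0.001, format="csr", random_state=0)   # sparse input

# Sparse-dense multiplication on the transposed operands,
# then transpose the result back.
C_via_transpose = np.asarray((B.T @ A.T).T)

# Reference: convert the sparse operand to dense and multiply densely
# (the alternative, memory-hungry route the PR tries to avoid).
C_dense = A @ B.toarray()

assert np.allclose(C_via_transpose, C_dense)
```

The trade-off described in the PR is exactly this choice: pay for the output transpose, or pay the memory cost of `B.toarray()`.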

Closes apache#686.