[SYSTEMML-769] [WIP] Adding support for native BLAS #344
Conversation
Will we produce different jar releases for different platforms? If so, it might help with JCuda as well.
No, I would recommend packaging the .so, .dll, etc. in a single jar, similar to the JCuda approach. Since the JNI API is extremely lightweight, the overhead would be minimal.
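For illustration, a minimal sketch of what such single-jar packaging implies on the Java side — extract the platform-specific shared library from the classpath and load it; the resource path, class name, and library names are assumptions, not the PR's actual code:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical loader: resource path and library names are illustrative assumptions.
public class BundledBlasLoader {
  public static boolean tryLoad() {
    String os = System.getProperty("os.name").toLowerCase();
    String libName = os.contains("win") ? "systemml_blas.dll" : "libsystemml_blas.so";
    try (InputStream in = BundledBlasLoader.class.getResourceAsStream("/lib/" + libName)) {
      if (in == null)
        return false;                                // no bundled library for this platform
      Path tmp = Files.createTempFile("systemml_", libName);
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      System.load(tmp.toAbsolutePath().toString());  // load the extracted native library
      return true;
    } catch (Throwable t) {
      return false;                                  // fall back to the Java-based implementation
    }
  }
}
```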
Thanks @niketanpansare for sharing your proposal. Can you share the benefits of using native BLAS over Java-based BLAS, other than the two below?
How much performance improvement does native BLAS give over Java-based BLAS? In my opinion, we would have to write a CPP interface for every BLAS library, for every needed piece of functionality, on top of JNI code to interact with the generic SystemML matrix in CPP, which would burden SystemML development and increase the test effort. If there is a significant performance improvement, then it is worth it. Another thing to consider is whether there is a benefit from hardware improvements in other BLAS libraries; either those would already be available in a Java-based BLAS, or we could add them to a Java-based BLAS (assuming those are open-source libraries). Why can't a Java-based BLAS support different types of BLAS? If it doesn't, shouldn't that be the place for such code? I would be cautious and want to understand the whole proposal before we take this into consideration.
Please see PR 307 for the performance numbers. The difference is about 2x to 5x for matrix multiplication and about 3x for conv2d. If the concern is about development overhead, there are two solutions:
Let me re-emphasize that the development overhead of maintaining the JNI layer is far less than that of continuously supporting a Java-based BLAS. If you are concerned about testing, adding native BLAS in fact simplifies it, as Intel will already have done performance testing on different shapes and sizes on different CPUs. Think about testing interfaces vs. rigorous performance testing for matrix multiplication, tsmm, etc. I don't understand a few of your questions:
I see that you've put a lot of effort in here, but I'm still not convinced by the experiments conducted so far. Instead of trying many combinations of small row/column dimensions (which have limited impact on real workloads), could we please select a few representative scenarios and thoroughly evaluate them (incl. local and distributed operations, different BLAS libraries, different sparse representations, etc.)? Let's use the following scenarios, where (1) and (2) are supposed to be memory-bandwidth bound, whereas (3)-(5) are compute-bound; (1) and (2) are representative of regression and binomial classification, (3) and (4) of multinomial classification, and (5) represents deep learning:
(1) Matrix-vector/vector-matrix 1M x 1K, dense (~8GB)
Sure, I will run experiments for shapes (1) to (5). Since I am proposing to support only dense local BLAS and to fall back to Java otherwise, sparsity will not be useful, as I would end up reporting the same numbers. Also, since we are officially supporting Intel MKL for performance, to simplify testing (I am OK with removing support for OpenBLAS, btw) using it as the baseline makes more sense; otherwise we might end up circling around the argument that our BLAS is better than BLAS X in certain scenarios Y but not Z, and not make much progress here. Before we invest time in these and possibly some more experiments, let's agree on the end goal:
Let's run all five scenarios (with left/right exchange, and local/distributed operations) - if we add support for BLAS libraries, then anyway for both dense and sparse. Also, let's run at least with MKL and OpenBLAS, as it would ensure that the integration is sufficiently general. Personally, I would be fine with adding BLAS support (enabled by default) if it consistently achieves speedups of more than 2-3x. If the results are a mixed bag, we might still consider it as an optional feature (disabled by default), but it would certainly not yield simplifications, as we would still need to maintain our own Java-based library.
Thanks, that clarifies a lot. A 2-3x threshold seems reasonable for deciding whether to keep it optional or make it the default. I will work on the experiments today. Since this PR does not change sparse matmult, would you still want me to test with sparsity=0.1?
Right now, we're talking only about micro-benchmarks - I would recommend integrating the calls to sparse BLAS operations over a CSR representation just to be able to run these experiments. This should be pretty straightforward and not much effort, right?
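For reference, a CSR (compressed sparse row) representation is just three arrays over the non-zeros; a minimal generic sketch (field names are illustrative, not SystemML's actual sparse block classes):

```java
// Generic CSR layout for a numRows x numCols matrix with nnz non-zeros.
public class CsrMatrix {
  int numRows, numCols;
  double[] values;      // the nnz non-zero values, stored row by row
  int[] colIndexes;     // column index of each non-zero (length nnz)
  int[] rowPointers;    // length numRows+1; row i occupies positions [rowPointers[i], rowPointers[i+1])
}
```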
OK, I don't see any BLAS-specific interface. I am assuming these BLAS libraries have common interfaces, so my concern over maintaining code specific to each BLAS is resolved. Based on the code change, you have checked in generated library files (.so). Thanks
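To make concrete why the per-BLAS maintenance surface stays small, here is a hedged sketch of the kind of thin boundary being discussed — one native declaration per operation plus the existing Java fallback; class and method names are assumptions, not the PR's actual API:

```java
// Illustrative only: names and signatures are assumptions, not the PR's actual interface.
public class NativeMatrixMult {
  // Set once at startup, e.g. after the shared library was loaded successfully.
  private static volatile boolean nativeLoaded = false;

  // Thin JNI entry point; the C++ side would simply forward to the BLAS dgemm routine.
  private static native void dmmdd(double[] a, double[] b, double[] c, int m, int k, int n);

  public static void matrixMult(double[] a, double[] b, double[] c, int m, int k, int n) {
    if (nativeLoaded)
      dmmdd(a, b, c, m, k, n);          // native BLAS path
    else
      javaMatrixMult(a, b, c, m, k, n); // existing pure-Java fallback
  }

  private static void javaMatrixMult(double[] a, double[] b, double[] c, int m, int k, int n) {
    for (int i = 0; i < m; i++)         // naive triple loop, standing in for the Java library
      for (int j = 0; j < n; j++)
        for (int l = 0; l < k; l++)
          c[i * n + j] += a[i * k + l] * b[l * n + j];
  }
}
```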
@mboehm7 I would prefer to get this PR (with dense BLAS) in first, before starting to work on sparse BLAS and/or adding other operations (such as solve, dsyrk, etc.). For 10 operations:
Matrix-vector/vector-matrix: 1M x 1K, dense:
Java BLAS: 2.252 sec, MKL: 5.403 sec and OpenBLAS: 23.477 sec
Matrix-matrix/matrix-Matrix 1M x 1K x 20, dense:
Java BLAS: 12.204 sec, MKL: 18.927 sec and OpenBLAS: 33.840 sec
Matrix-matrix/matrix-Matrix 1M x 1K x 100, dense (just out of curiosity):
Java BLAS: 53.218 sec, MKL: 17.090 sec and OpenBLAS: 102.779 sec
Matrix-matrix/matrix-Matrix 3K x 3K x 3K, dense:
Java BLAS: 12.948 sec, MKL: 2.612 sec and OpenBLAS: 2.828 sec
Thanks for the initial results. A couple of comments:
Additional considerations are:
while thinking about potential "heuristics for switching", we should also ask ourselves if these compute-intensive cases are not already covered by the in-progress GPU backend. |
@mboehm7 - we should keep in mind that a GPU may not be as ubiquitous or easy to find as an installed BLAS library (or MKL) |
Yes, but with more and more workloads being migrated to the cloud, this becomes less of an issue, as we can actually request preferred node types.
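For concreteness, one possible shape of the "heuristics for switching" mentioned above — purely illustrative; the threshold and the names are assumptions, not something this PR implements:

```java
// Illustrative switching heuristic: prefer native BLAS only when the multiply is
// compute-bound; the 32 flops-per-byte threshold is an assumption and would need
// calibration per machine.
static boolean useNativeBlas(long m, long k, long n, boolean denseInputs, boolean nativeLoaded) {
  if (!nativeLoaded || !denseInputs)
    return false;                                 // stay on the Java library
  double flops = 2.0 * m * k * n;                 // multiply-add count of the dgemm
  double bytes = 8.0 * (m * k + k * n + m * n);   // dense double inputs plus output
  return flops / bytes > 32;                      // memory-bandwidth-bound shapes stay in Java
}
```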
Where did you get that the JNI call would not create a copy? Of course it does (and the numbers reflect that). Since the JVM is free to compose a logical array from multiple physical fragments, the JNI call must create a copy; otherwise, it could not provide a contiguous array. Even if the array is not fragmented, it is likely always copied, because the asynchronous garbage collector is free to re-arrange the array at any time.
@mboehm7 - Please take a look at this page.
So how many JVMs do you think actually support pinning of large arrays? I tend to believe that all major JVMs actually store arrays in multiple fragments. Take a look at http://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.70.doc/diag/understanding/jni_copypin.html and https://www.ibm.com/developerworks/library/j-jni/ - usually only small arrays are returned as a direct pointer due to the issues mentioned above. How else would you explain the 2.5x slowdown for matrix-vector, which is a trivial operation? |
@bertholdreinwald Please see below for the results of the single-threaded implementation. For 10 operations (single-threaded):
Matrix-vector/vector-matrix: 1M x 1K, dense:
Matrix-matrix/matrix-Matrix 1M x 1K x 20, dense:
Matrix-matrix/matrix-Matrix 1M x 1K x 100, dense (just out of curiosity):
Matrix-matrix/matrix-Matrix 3K x 3K x 3K, dense:
As I said before, please specify the HW; otherwise we can't really interpret these numbers. |
Please see below the hardware specification:
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
CPU(s): 24
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
Memory allocated in the above tests: 20G
Thanks @niketanpansare, that helps. However, it also shows measurement errors: how do you explain a speed-up of 34 for the Java matrix-vector given 24 virtual cores? Furthermore, and just to avoid people drawing wrong conclusions here: distributed operations will look very similar to the multi-threaded (not single-threaded!) operations because both become memory-bandwidth bound. In any case, we need to run the distributed operations as well - there are no shortcuts.
The numbers reported here are execution time for b+* from our Statistics and are fairly easy to reproduce. All I did was redirect matrixMult(MatrixBlock m1, MatrixBlock m2, MatrixBlock ret, int k) to matrixMult(MatrixBlock m1, MatrixBlock m2, MatrixBlock ret) ... Since there is fairly complicated logic (for example: recomputeNonZeros in parallel vs sequential, interleaving of memory access), I would not be suspicious of the speedup.
That is a fair point; one cannot directly correlate single-threaded performance with distributed performance. However, as I said earlier, "For simplifying the performance testing, I am OK with paying the penalty for few cases where we are better optimized than Intel MKL."
Well, I'm not fine with it, because a 2-3x slowdown on scenarios (1) and (2) directly affects the end-to-end performance of common algorithms such as LinregCG, GLM, L2SVM, MSVM, MLogreg, Kmeans, and PageRank.
Then please feel free to update the memory-bound logic when the PR is merged. Though slightly tricky, Statistics will still help us with performance debugging. I hope we both agree that Intel MKL is much better optimized than our LibMatrixMult.
Also, seeing a speedup of 34 for matrix-vector on 24 virtual cores should raise red flags. Usually, we only see very moderate speedups of 3-5x for matrix-vector because, again, at some point this operation becomes memory-bandwidth bound. I suspect that garbage collection or just-in-time compilation interferes with the measurement here. Hence, I would recommend setting up a proper micro-benchmark, with isolated block operations, a number of warm-up runs for just-in-time compilation, and the verbose GC flag to exclude runs that overlapped with GC.
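A hedged sketch of such a harness — warm-up runs for the JIT, a GC between repetitions, and the median over several timed runs; the class below is generic, not an existing SystemML utility:

```java
import java.util.Arrays;

// Generic micro-benchmark harness; run the JVM with -verbose:gc and discard
// repetitions whose timing window overlapped a GC pause.
public class MicroBench {
  public static double medianMillis(Runnable op, int warmupRuns, int timedRuns) {
    for (int i = 0; i < warmupRuns; i++)
      op.run();                               // warm-up so the JIT compiles the hot path
    double[] millis = new double[timedRuns];
    for (int i = 0; i < timedRuns; i++) {
      System.gc();                            // reduce the chance of a GC inside the timed region
      long start = System.nanoTime();
      op.run();                               // a single isolated block operation
      millis[i] = (System.nanoTime() - start) / 1e6;
    }
    Arrays.sort(millis);
    return millis[timedRuns / 2];             // median is more robust to outliers than the mean
  }
}
```

A call such as `medianMillis(() -> LibMatrixMult.matrixMult(m1, m2, ret, k), 10, 30)` (with pre-allocated blocks) would then time exactly one block operation in isolation.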
That could be the case. Again, I don't dispute that LibMatrixMult is the better option for memory-bound cases, and identifying those cases should not stop us from integrating Intel MKL.
Compiled mac binaries with new changes for sparse
Here are the updated results on Lenet for 2000 iterations: With MKL on Linux:
Iter:2000.0, training loss:0.01354500195846287, training accuracy:100.0
Iter:2000.0, validation loss:211.49319894899025, validation accuracy:97.07991803278688
SystemML Statistics:
Total elapsed time: 480.432 sec.
Total compilation time: 0.000 sec.
Total execution time: 480.432 sec.
Number of compiled Spark inst: 79.
Number of executed Spark inst: 2.
Native mkl calls (LibMatrixMult/LibMatrixDNN): 4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS): 147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.103/0.074/2.183/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time: 6.609 sec.
Spark ctx create time (lazy): 0.022 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 36.785 sec.
Total JVM GC count: 4566.
Total JVM GC time: 72.493 sec.
Heavy hitter instructions (name, time, count):
-- 1) sel+ 56.005 sec 6283
-- 2) conv2d_backward_filter 51.247 sec 4026
-- 3) conv2d_bias_add 48.540 sec 4514
-- 4) -* 35.711 sec 16104
-- 5) + 34.503 sec 30566
-- 6) +* 34.500 sec 8052
-- 7) maxpooling_backward 34.057 sec 4026
-- 8) ba+* 31.741 sec 12566
-- 9) conv2d_backward_data 28.271 sec 2013
-- 10) relu_backward 26.899 sec 6039
With Java on Linux:
Iter:2000.0, training loss:0.0059023118210415025, training accuracy:100.0
Iter:2000.0, validation loss:151.31859200647978, validation accuracy:97.46413934426229
SystemML Statistics:
Total elapsed time: 654.523 sec.
Total compilation time: 0.000 sec.
Total execution time: 654.523 sec.
Number of compiled Spark inst: 79.
Number of executed Spark inst: 2.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS): 147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.097/0.073/2.021/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time: 6.575 sec.
Spark ctx create time (lazy): 0.024 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 39.636 sec.
Total JVM GC count: 7094.
Total JVM GC time: 98.133 sec.
Heavy hitter instructions (name, time, count):
-- 1) conv2d_bias_add 149.524 sec 4514
-- 2) conv2d_backward_filter 113.605 sec 4026
-- 3) sel+ 54.545 sec 6283
-- 4) conv2d_backward_data 50.524 sec 2013
-- 5) ba+* 41.264 sec 12566
-- 6) -* 36.036 sec 16104
-- 7) +* 33.817 sec 8052
-- 8) + 32.274 sec 30566
-- 9) * 26.529 sec 35275
-- 10) maxpooling_backward 25.734 sec 4026
This PR has been open since January. We should consider merging or closing it before the 1.0.0 release, with significant time for testing if it is merged.
@deroneriksson I absolutely agree - something like this needs to go in at least a month or two before a release. Let's resolve all remaining build and deploy issues and then bring it in (along with a testsuite that checks for existing libraries and only runs if they are available; of course we should install it on our build server).
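One common way for such a testsuite to skip cleanly on machines without a native BLAS is a JUnit assumption; a sketch, reusing the NativeHelper check that already appears in this PR (the test class itself is hypothetical):

```java
import org.junit.Assume;
import org.junit.Test;

public class NativeBlasSanityTest {
  @Test
  public void denseMatrixMultMatchesJava() {
    // Skip (rather than fail) when no native BLAS library is installed.
    Assume.assumeTrue(NativeHelper.isNativeLibraryLoaded());
    // ... run a small dense matrix multiply and compare the native and Java results ...
  }
}
```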
Also, added a lot of documentation.
@deroneriksson @mboehm7 I think this PR has addressed all the deploy/build issues on Linux. Remaining issues can be addressed later as bugfixes in a subsequent PR. @bertholdreinwald @nakul02 After spending a significant amount of time on trial and error for Mac (and Windows), I think we should only support BLAS on Linux machines, for the following reasons:
Since we already have the cmake setup in place, we can always support BLAS on Mac and Windows in the future, when the OpenMP-related issues are resolved. For your reference, please see below for instructions for compiling BLAS-enabled SystemML on Mac and Windows:
64-bit x86 Windows
64-bit x86 Mac
The version of clang that ships with Mac does not come with OpenMP.
(with gcc-6):
(with gcc-6):
src/main/cpp/libmatrixdnn.cpp
#include <mkl.h>
#include <mkl_service.h>
#endif
Documentation?
}
}
}
Documentation?
}
}
}
Documentation?
}
}
Documentation?
std::memset(temp, 0, numTempElem*numOpenMPThreads*sizeof(double));
double* rotatedDoutPtrArrays = new double[numRotatedElem*numOpenMPThreads];
double* loweredMatArrays = new double[numIm2ColElem*numOpenMPThreads];
Briefly talk about the parallelization strategy here.
@@ -288,6 +293,17 @@ public void processBiasMultiplyInstruction(ExecutionContext ec) throws DMLRuntim
ec.setMatrixOutput(getOutputVariableName(), outputBlock);
}

// Assumption: enableNative && NativeHelper.isNativeLibraryLoaded() is true
// This increases the number of native calls. For example:the cases where filter is sparse but input is dense
Can you please convert this to Javadoc? Also talk about the parameters and return types.
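For reference, one possible Javadoc shape for that comment — the parameter and exception lines below are hypothetical; only the assumption text comes from the existing code:

```java
/**
 * Native-backed variant of this instruction's processing.
 * <p>
 * Assumption: {@code enableNative && NativeHelper.isNativeLibraryLoaded()} is true.
 * Note that this can increase the number of native calls, e.g. in the cases
 * where the filter is sparse but the input is dense.
 *
 * @param ec execution context providing the input matrix blocks and receiving the output
 * @throws DMLRuntimeException if the native call or the block handling fails
 */
```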
@@ -65,6 +65,9 @@

<!-- if codegen.enabled, compile literals as constants: 1..heuristic, 2..always -->
<codegen.literals>1</codegen.literals>

<!-- enables native blas for matrix multiplication and convolution, experimental feature -->
I would recommend this be named "systemml.native.blas", but it's your call here. Going forward, I would like all systemml-specific properties to be prefixed with "systemml." so that when we read the properties files from other projects we depend on, we don't have name clashes.
docs/blas.md
@@ -0,0 +1,162 @@
<!--
My recommendation is to name this file "native-backend.md".
"blas.md" doesn't cover the topics.
outputBlock.recomputeNonZeros();
}

// Single-threaded matrix multiplication
A javadoc will be nice here.
@@ -0,0 +1,259 @@
/*
Could you please add some javadoc to this file?
I agree with your assessment; only supporting the native backend for Linux is fine. This PR looks good to me 👍
With OpenBLAS on Lenet on the same machine
Iter:2000.0, training loss:0.009108877504719245, training accuracy:100.0
Iter:2000.0, validation loss:166.14787969186656, validation accuracy:97.87397540983606
SystemML Statistics:
Total elapsed time: 476.380 sec.
Total compilation time: 0.000 sec.
Total execution time: 476.380 sec.
Number of compiled Spark inst: 79.
Number of executed Spark inst: 2.
Native openblas calls (LibMatrixMult/LibMatrixDNN): 4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS): 147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.111/0.068/2.028/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time: 6.428 sec.
Spark ctx create time (lazy): 0.023 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 40.452 sec.
Total JVM GC count: 4557.
Total JVM GC time: 71.705 sec.
Heavy hitter instructions (name, time, count):
-- 1) sel+ 55.054 sec 6283
-- 2) conv2d_bias_add 49.798 sec 4514
-- 3) ba+* 46.230 sec 12566
-- 4) conv2d_backward_filter 41.418 sec 4026
-- 5) -* 36.461 sec 16104
-- 6) +* 32.389 sec 8052
-- 7) maxpooling_backward 32.378 sec 4026
-- 8) + 32.204 sec 30566
-- 9) relu_backward 26.499 sec 6039
-- 10) conv2d_backward_data 26.011 sec 2013
Here is another run with MKL to double-check if the performance is reproducible
Iter:2000.0, training loss:0.013034947944848269, training accuracy:100.0
Iter:2000.0, validation loss:274.1293912979084, validation accuracy:96.28586065573771
SystemML Statistics:
Total elapsed time: 482.600 sec.
Total compilation time: 0.000 sec.
Total execution time: 482.600 sec.
Number of compiled Spark inst: 79.
Number of executed Spark inst: 2.
Native mkl calls (LibMatrixMult/LibMatrixDNN): 4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS): 147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.149/0.085/2.201/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time: 6.666 sec.
Spark ctx create time (lazy): 0.022 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 39.626 sec.
Total JVM GC count: 4503.
Total JVM GC time: 71.892 sec.
Heavy hitter instructions (name, time, count):
-- 1) sel+ 55.166 sec 6283
-- 2) conv2d_backward_filter 51.424 sec 4026
-- 3) conv2d_bias_add 47.443 sec 4514
-- 4) -* 38.041 sec 16104
-- 5) +* 35.513 sec 8052
-- 6) + 34.988 sec 30566
-- 7) maxpooling_backward 34.836 sec 4026
-- 8) ba+* 31.071 sec 12566
-- 9) conv2d_backward_data 28.552 sec 2013
-- 10) relu_backward 27.039 sec 6039
- Both MKL and OpenBLAS show 2-3x performance benefits on conv2d operators on Lenet.
- There are several OpenMP-related issues on Mac, hence we have explicitly disabled native support for Mac and Windows. Since we already have the cmake setup in place, we can always support BLAS on Mac and Windows in the future when the OpenMP-related issues are resolved.
Closes #344.
Based on the discussion with @frreiss and @bertholdreinwald, I am proposing to switch our default BLAS from Java-based BLAS to native BLAS. We will recommend using Intel MKL and provide optional support for other BLAS libraries such as OpenBLAS (and possibly Accelerate). Also, if no BLAS is installed, we will fall back to Java-based BLAS. This future-proofs SystemML against hardware improvements made in other BLAS libraries and also simplifies testing.
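A hedged sketch of the fallback chain this describes — prefer MKL, then OpenBLAS, then the Java-based implementation; the method names are illustrative, and the library names are only what a typical Linux install would expose:

```java
// Illustrative order of preference when choosing the BLAS backend at startup.
static String chooseBlasBackend() {
  if (tryLoad("mkl_rt"))     // recommended: Intel MKL (single dynamic library)
    return "mkl";
  if (tryLoad("openblas"))   // optional: OpenBLAS
    return "openblas";
  return "java";             // no native BLAS installed: use the Java-based BLAS
}

static boolean tryLoad(String name) {
  try {
    System.loadLibrary(name);  // resolves e.g. libmkl_rt.so or libopenblas.so on Linux
    return true;
  } catch (UnsatisfiedLinkError e) {
    return false;
  }
}
```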
The proposed solution in this jar will work:
Since we are not including any external dependency (such as a BLAS library), this PR adds no additional overhead to the release process. Also, when we add support for TSMM, SYSTEMML-1166 will be resolved, which will again future-proof SystemML on related issues.
The initial performance numbers are the same as those of #307.
@mboehm7 @dusenberrymw @nakul02 @lresende @deroneriksson @asurve @fschueler