[SYSTEMML-769] [WIP] Adding support for native BLAS #344
Conversation
Will we produce different jar releases for different platforms? If so, it might help with JCuda as well.
No, I would recommend packaging the .so, .dll, etc. in a single jar, similar to the JCuda approach. Since the JNI API is extremely lightweight, the overhead would be minimal.
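For illustration, a minimal sketch of what such single-jar packaging implies on the Java side — extract the platform-specific shared library from the classpath and load it; the resource path, class name, and library names are assumptions, not the PR's actual code:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical loader: resource path and library names are illustrative assumptions.
public class BundledBlasLoader {
  public static boolean tryLoad() {
    String os = System.getProperty("os.name").toLowerCase();
    String libName = os.contains("win") ? "systemml_blas.dll" : "libsystemml_blas.so";
    try (InputStream in = BundledBlasLoader.class.getResourceAsStream("/lib/" + libName)) {
      if (in == null)
        return false;                                // no bundled library for this platform
      Path tmp = Files.createTempFile("systemml_", libName);
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      System.load(tmp.toAbsolutePath().toString());  // load the extracted native library
      return true;
    } catch (Throwable t) {
      return false;                                  // fall back to the Java-based implementation
    }
  }
}
```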
Thanks @niketanpansare for sharing your proposal. Can you share the benefits of using native BLAS over Java-based BLAS, other than the two below?
How much performance improvement does native BLAS give over Java-based BLAS? In my opinion, we would have to write a CPP interface for every BLAS library, for every needed piece of functionality, on top of JNI code to interact with the generic SystemML matrix in CPP, which would burden SystemML development and increase the test effort. If there is a significant performance improvement, then it is worth it. Another thing to consider is whether there is a benefit from hardware improvements in other BLAS libraries; either those would already be available in a Java-based BLAS, or we could add them to a Java-based BLAS (assuming those are open-source libraries). Why can't a Java-based BLAS support different types of BLAS? If it doesn't, shouldn't that be the place for such code? I would be cautious and want to understand the whole proposal before we take this into consideration.
Please see PR 307 for the performance numbers. The difference is about 2x to 5x for matrix multiplication and about 3x for conv2d. If the concern is about development overhead, there are two solutions:
Let me re-emphasize that the development overhead of maintaining the JNI layer is far less than that of continuously supporting a Java-based BLAS. If you are concerned about testing, adding native BLAS in fact simplifies it, as Intel will already have done performance testing on different shapes and sizes on different CPUs. Think about testing interfaces vs. rigorous performance testing for matrix multiplication, tsmm, etc. I don't understand a few of your questions:
I see that you've put a lot of effort in here, but I'm still not convinced by the experiments conducted so far. Instead of trying many combinations of small row/column dimensions (which have limited impact on real workloads), could we please select a few representative scenarios and thoroughly evaluate them (incl. local and distributed operations, different BLAS libraries, different sparse representations, etc.)? Let's use the following scenarios, where (1) and (2) are supposed to be memory-bandwidth bound, whereas (3)-(5) are compute-bound; (1) and (2) are representative of regression and binomial classification, (3) and (4) of multinomial classification, and (5) represents deep learning:
(1) Matrix-vector/vector-matrix 1M x 1K, dense (~8GB)
Sure, I will run experiments for shapes (1) to (5). Since I am proposing to support only dense local BLAS and to fall back to Java otherwise, sparsity will not be useful, as I would end up reporting the same numbers. Also, since we are officially supporting Intel MKL for performance, to simplify testing (I am OK with removing support for OpenBLAS, btw) using it as the baseline makes more sense; otherwise we might end up circling around the argument that our BLAS is better than BLAS X in certain scenarios Y but not Z, and not make much progress here. Before we invest time in these and possibly some more experiments, let's agree on the end goal:
Let's run all five scenarios (with left/right exchange, and local/distributed operations) - if we add support for BLAS libraries, then anyway for both dense and sparse. Also, let's run at least with MKL and OpenBLAS, as it would ensure that the integration is sufficiently general. Personally, I would be fine with adding BLAS support (enabled by default) if it consistently achieves speedups of more than 2-3x. If the results are a mixed bag, we might still consider it as an optional feature (disabled by default), but it would certainly not yield simplifications, as we would still need to maintain our own Java-based library.
Thanks, that clarifies a lot. A 2-3x threshold seems reasonable for deciding whether to keep it optional or make it the default. I will work on the experiments today. Since this PR does not change sparse matmult, would you still want me to test with sparsity=0.1?
Right now, we're talking only about micro-benchmarks - I would recommend integrating the calls to sparse BLAS operations over a CSR representation just to be able to run these experiments. This should be pretty straightforward and not much effort, right?
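For reference, a CSR (compressed sparse row) representation is just three arrays over the non-zeros; a minimal generic sketch (field names are illustrative, not SystemML's actual sparse block classes):

```java
// Generic CSR layout for a numRows x numCols matrix with nnz non-zeros.
public class CsrMatrix {
  int numRows, numCols;
  double[] values;      // the nnz non-zero values, stored row by row
  int[] colIndexes;     // column index of each non-zero (length nnz)
  int[] rowPointers;    // length numRows+1; row i occupies positions [rowPointers[i], rowPointers[i+1])
}
```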
OK, I don't see any BLAS-specific interface. I am assuming these BLAS libraries have common interfaces, so my concern over maintaining code specific to each BLAS is resolved. Based on the code change, you have checked in generated library files (.so). Thanks
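To make concrete why the per-BLAS maintenance surface stays small, here is a hedged sketch of the kind of thin boundary being discussed — one native declaration per operation plus the existing Java fallback; class and method names are assumptions, not the PR's actual API:

```java
// Illustrative only: names and signatures are assumptions, not the PR's actual interface.
public class NativeMatrixMult {
  // Set once at startup, e.g. after the shared library was loaded successfully.
  private static volatile boolean nativeLoaded = false;

  // Thin JNI entry point; the C++ side would simply forward to the BLAS dgemm routine.
  private static native void dmmdd(double[] a, double[] b, double[] c, int m, int k, int n);

  public static void matrixMult(double[] a, double[] b, double[] c, int m, int k, int n) {
    if (nativeLoaded)
      dmmdd(a, b, c, m, k, n);          // native BLAS path
    else
      javaMatrixMult(a, b, c, m, k, n); // existing pure-Java fallback
  }

  private static void javaMatrixMult(double[] a, double[] b, double[] c, int m, int k, int n) {
    for (int i = 0; i < m; i++)         // naive triple loop, standing in for the Java library
      for (int j = 0; j < n; j++)
        for (int l = 0; l < k; l++)
          c[i * n + j] += a[i * k + l] * b[l * n + j];
  }
}
```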
@mboehm7 I would prefer to get this PR (with dense BLAS) in first, before starting to work on sparse BLAS and/or adding other operations (such as solve, dsyrk, etc.). For 10 operations:
Matrix-vector/vector-matrix: 1M x 1K, dense:
Java BLAS: 2.252 sec, MKL: 5.403 sec and OpenBLAS: 23.477 sec
Matrix-matrix/matrix-Matrix 1M x 1K x 20, dense:
Java BLAS: 12.204 sec, MKL: 18.927 sec and OpenBLAS: 33.840 sec
Matrix-matrix/matrix-Matrix 1M x 1K x 100, dense (just out of curiosity):
Java BLAS: 53.218 sec, MKL: 17.090 sec and OpenBLAS: 102.779 sec
Matrix-matrix/matrix-Matrix 3K x 3K x 3K, dense:
Java BLAS: 12.948 sec, MKL: 2.612 sec and OpenBLAS: 2.828 sec
Thanks for the initial results. A couple of comments:
Additional considerations are:
while thinking about potential "heuristics for switching", we should also ask ourselves if these compute-intensive cases are not already covered by the in-progress GPU backend. |
@mboehm7 - we should keep in mind that a GPU may not be as ubiquitous or easy to find as an installed BLAS library (or MKL) |
Yes, but with more and more workloads being migrated to the cloud, this becomes less of an issue, as we can actually request preferred node types.
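For concreteness, one possible shape of the "heuristics for switching" mentioned above — purely illustrative; the threshold and the names are assumptions, not something this PR implements:

```java
// Illustrative switching heuristic: prefer native BLAS only when the multiply is
// compute-bound; the 32 flops-per-byte threshold is an assumption and would need
// calibration per machine.
static boolean useNativeBlas(long m, long k, long n, boolean denseInputs, boolean nativeLoaded) {
  if (!nativeLoaded || !denseInputs)
    return false;                                 // stay on the Java library
  double flops = 2.0 * m * k * n;                 // multiply-add count of the dgemm
  double bytes = 8.0 * (m * k + k * n + m * n);   // dense double inputs plus output
  return flops / bytes > 32;                      // memory-bandwidth-bound shapes stay in Java
}
```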
Where did you get that the JNI call would not create a copy? Of course it does (and the numbers reflect that). Since the JVM is free to compose a logical array from multiple physical fragments, the JNI call must create a copy; otherwise, it could not provide a contiguous array. Even if the array is not fragmented, it is likely always copied, because the asynchronous garbage collector is free to re-arrange the array at any time.
@mboehm7 - Please take a look at this page.
So how many JVMs do you think actually support pinning of large arrays? I tend to believe that all major JVMs actually store arrays in multiple fragments. Take a look at http://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.70.doc/diag/understanding/jni_copypin.html and https://www.ibm.com/developerworks/library/j-jni/ - usually only small arrays are returned as a direct pointer due to the issues mentioned above. How else would you explain the 2.5x slowdown for matrix-vector, which is a trivial operation? |
@bertholdreinwald Please see below for the results of the single-threaded implementation. For 10 operations (single-threaded):
Matrix-vector/vector-matrix: 1M x 1K, dense:
Matrix-matrix/matrix-Matrix 1M x 1K x 20, dense:
Matrix-matrix/matrix-Matrix 1M x 1K x 100, dense (just out of curiosity):
Matrix-matrix/matrix-Matrix 3K x 3K x 3K, dense:
As I said before, please specify the HW; otherwise we can't really interpret these numbers. |
Please see below the hardware specification:
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
CPU(s): 24
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
Memory allocated in the above tests: 20G
Thanks @niketanpansare, that helps. However, it also shows measurement errors: how do you explain a speed-up of 34 for the Java matrix-vector given 24 virtual cores? Furthermore, and just to avoid people drawing wrong conclusions here: distributed operations will look very similar to the multi-threaded (not single-threaded!) operations because both become memory-bandwidth bound. In any case, we need to run the distributed operations as well - there are no shortcuts.
The numbers reported here are execution time for b+* from our Statistics and are fairly easy to reproduce. All I did was redirect matrixMult(MatrixBlock m1, MatrixBlock m2, MatrixBlock ret, int k) to matrixMult(MatrixBlock m1, MatrixBlock m2, MatrixBlock ret) ... Since there is fairly complicated logic (for example: recomputeNonZeros in parallel vs sequential, interleaving of memory access), I would not be suspicious of the speedup.
That is a fair point; one cannot directly correlate single-threaded performance with distributed performance. However, as I said earlier, "For simplifying the performance testing, I am OK with paying the penalty for few cases where we are better optimized than Intel MKL."
Well, I'm not fine with it, because a 2-3x slowdown on scenarios (1) and (2) directly affects the end-to-end performance of common algorithms such as LinregCG, GLM, L2SVM, MSVM, MLogreg, Kmeans, and PageRank.
Then please feel free to update the memory-bound logic when the PR is merged. Though slightly tricky, Statistics will still help us with performance debugging. I hope we both agree that Intel MKL is much better optimized than our LibMatrixMult.
Also, seeing a speedup of 34 for matrix-vector on 24 virtual cores should raise red flags. Usually, we only see very moderate speedups of 3-5x for matrix-vector because, again, at some point this operation becomes memory-bandwidth bound. I suspect that garbage collection or just-in-time compilation interferes with the measurement here. Hence, I would recommend setting up a proper micro-benchmark, with isolated block operations, a number of warm-up runs for just-in-time compilation, and the verbose GC flag to exclude runs that overlapped with GC.
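A hedged sketch of such a harness — warm-up runs for the JIT, a GC between repetitions, and the median over several timed runs; the class below is generic, not an existing SystemML utility:

```java
import java.util.Arrays;

// Generic micro-benchmark harness; run the JVM with -verbose:gc and discard
// repetitions whose timing window overlapped a GC pause.
public class MicroBench {
  public static double medianMillis(Runnable op, int warmupRuns, int timedRuns) {
    for (int i = 0; i < warmupRuns; i++)
      op.run();                               // warm-up so the JIT compiles the hot path
    double[] millis = new double[timedRuns];
    for (int i = 0; i < timedRuns; i++) {
      System.gc();                            // reduce the chance of a GC inside the timed region
      long start = System.nanoTime();
      op.run();                               // a single isolated block operation
      millis[i] = (System.nanoTime() - start) / 1e6;
    }
    Arrays.sort(millis);
    return millis[timedRuns / 2];             // median is more robust to outliers than the mean
  }
}
```

A call such as `medianMillis(() -> LibMatrixMult.matrixMult(m1, m2, ret, k), 10, 30)` (with pre-allocated blocks) would then time exactly one block operation in isolation.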
That could be the case. Again, I don't dispute that LibMatrixMult is the better option for memory-bound cases, and identifying those cases should not stop us from integrating Intel MKL.
Compiled mac binaries with new changes for sparse
Here are the updated results on Lenet for 2000 iterations: With MKL on Linux:
Iter:2000.0, training loss:0.01354500195846287, training accuracy:100.0
Iter:2000.0, validation loss:211.49319894899025, validation accuracy:97.07991803278688
SystemML Statistics:
Total elapsed time: 480.432 sec.
Total compilation time: 0.000 sec.
Total execution time: 480.432 sec.
Number of compiled Spark inst: 79.
Number of executed Spark inst: 2.
Native mkl calls (LibMatrixMult/LibMatrixDNN): 4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS): 147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.103/0.074/2.183/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time: 6.609 sec.
Spark ctx create time (lazy): 0.022 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 36.785 sec.
Total JVM GC count: 4566.
Total JVM GC time: 72.493 sec.
Heavy hitter instructions (name, time, count):
-- 1) sel+ 56.005 sec 6283
-- 2) conv2d_backward_filter 51.247 sec 4026
-- 3) conv2d_bias_add 48.540 sec 4514
-- 4) -* 35.711 sec 16104
-- 5) + 34.503 sec 30566
-- 6) +* 34.500 sec 8052
-- 7) maxpooling_backward 34.057 sec 4026
-- 8) ba+* 31.741 sec 12566
-- 9) conv2d_backward_data 28.271 sec 2013
-- 10) relu_backward 26.899 sec 6039
With Java on Linux:
Iter:2000.0, training loss:0.0059023118210415025, training accuracy:100.0
Iter:2000.0, validation loss:151.31859200647978, validation accuracy:97.46413934426229
SystemML Statistics:
Total elapsed time: 654.523 sec.
Total compilation time: 0.000 sec.
Total execution time: 654.523 sec.
Number of compiled Spark inst: 79.
Number of executed Spark inst: 2.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS): 147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.097/0.073/2.021/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time: 6.575 sec.
Spark ctx create time (lazy): 0.024 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 39.636 sec.
Total JVM GC count: 7094.
Total JVM GC time: 98.133 sec.
Heavy hitter instructions (name, time, count):
-- 1) conv2d_bias_add 149.524 sec 4514
-- 2) conv2d_backward_filter 113.605 sec 4026
-- 3) sel+ 54.545 sec 6283
-- 4) conv2d_backward_data 50.524 sec 2013
-- 5) ba+* 41.264 sec 12566
-- 6) -* 36.036 sec 16104
-- 7) +* 33.817 sec 8052
-- 8) + 32.274 sec 30566
-- 9) * 26.529 sec 35275
-- 10) maxpooling_backward 25.734 sec 4026
This PR has been open since January. We should consider merging or closing it before the 1.0.0 release, with significant time for testing if it is merged.
@deroneriksson I absolutely agree - something like this needs to go in at least a month or two before a release. Let's resolve all remaining build and deploy issues and then bring it in (along with a testsuite that checks for existing libraries and only runs if they are available; of course we should install it on our build server).
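One common way for such a testsuite to skip cleanly on machines without a native BLAS is a JUnit assumption; a sketch, reusing the NativeHelper check that already appears in this PR (the test class itself is hypothetical):

```java
import org.junit.Assume;
import org.junit.Test;

public class NativeBlasSanityTest {
  @Test
  public void denseMatrixMultMatchesJava() {
    // Skip (rather than fail) when no native BLAS library is installed.
    Assume.assumeTrue(NativeHelper.isNativeLibraryLoaded());
    // ... run a small dense matrix multiply and compare the native and Java results ...
  }
}
```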
Also, added a lot of documentation.
@deroneriksson @mboehm7 I think this PR has addressed all the deploy/build issues on Linux. Remaining issues can be addressed later as bugfixes in a subsequent PR. @bertholdreinwald @nakul02 After spending a significant amount of time on trial and error for Mac (and Windows), I think we should only support BLAS on Linux machines, for the following reasons:
Since we already have the cmake setup in place, we can always support BLAS on Mac and Windows in the future, when the OpenMP-related issues are resolved. For your reference, please see below for instructions for compiling BLAS-enabled SystemML on Mac and Windows:
64-bit x86 Windows
64-bit x86 Mac
The version of clang that ships with Mac does not come with OpenMP.
(with gcc-6):
(with gcc-6):
src/main/cpp/libmatrixdnn.cpp
#include <mkl.h>
#include <mkl_service.h>
#endif
Documentation?
}
}
}
Documentation?
}
}
}
Documentation?
}
}
Documentation?
std::memset(temp, 0, numTempElem*numOpenMPThreads*sizeof(double));
double* rotatedDoutPtrArrays = new double[numRotatedElem*numOpenMPThreads];
double* loweredMatArrays = new double[numIm2ColElem*numOpenMPThreads];
Briefly talk about the parallelization strategy here.
@@ -288,6 +293,17 @@ public void processBiasMultiplyInstruction(ExecutionContext ec) throws DMLRuntim
ec.setMatrixOutput(getOutputVariableName(), outputBlock);
}

// Assumption: enableNative && NativeHelper.isNativeLibraryLoaded() is true
// This increases the number of native calls. For example:the cases where filter is sparse but input is dense
Can you please convert this to Javadoc? Also talk about the parameters and return types.
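For reference, one possible Javadoc shape for that comment — the parameter and exception lines below are hypothetical; only the assumption text comes from the existing code:

```java
/**
 * Native-backed variant of this instruction's processing.
 * <p>
 * Assumption: {@code enableNative && NativeHelper.isNativeLibraryLoaded()} is true.
 * Note that this can increase the number of native calls, e.g. in the cases
 * where the filter is sparse but the input is dense.
 *
 * @param ec execution context providing the input matrix blocks and receiving the output
 * @throws DMLRuntimeException if the native call or the block handling fails
 */
```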
@@ -65,6 +65,9 @@

<!-- if codegen.enabled, compile literals as constants: 1..heuristic, 2..always -->
<codegen.literals>1</codegen.literals>

<!-- enables native blas for matrix multiplication and convolution, experimental feature -->
I would recommend this be named "systemml.native.blas", but it's your call here. Going forward, I would like all systemml-specific properties to be prefixed with "systemml." so that when we read the properties files from other projects we depend on, we don't have name clashes.
docs/blas.md
@@ -0,0 +1,162 @@
<!--
My recommendation is to name this file "native-backend.md".
"blas.md" doesn't cover the topics.
outputBlock.recomputeNonZeros();
}

// Single-threaded matrix multiplication
A javadoc will be nice here.
@@ -0,0 +1,259 @@
/*
Could you please add some javadoc to this file?
I agree with your assessment; only supporting the native backend for Linux is fine. This PR looks good to me 👍
With OpenBLAS on Lenet on the same machine
Iter:2000.0, training loss:0.009108877504719245, training accuracy:100.0
Iter:2000.0, validation loss:166.14787969186656, validation accuracy:97.87397540983606
SystemML Statistics:
Total elapsed time: 476.380 sec.
Total compilation time: 0.000 sec.
Total execution time: 476.380 sec.
Number of compiled Spark inst: 79.
Number of executed Spark inst: 2.
Native openblas calls (LibMatrixMult/LibMatrixDNN): 4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS): 147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.111/0.068/2.028/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time: 6.428 sec.
Spark ctx create time (lazy): 0.023 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 40.452 sec.
Total JVM GC count: 4557.
Total JVM GC time: 71.705 sec.
Heavy hitter instructions (name, time, count):
-- 1) sel+ 55.054 sec 6283
-- 2) conv2d_bias_add 49.798 sec 4514
-- 3) ba+* 46.230 sec 12566
-- 4) conv2d_backward_filter 41.418 sec 4026
-- 5) -* 36.461 sec 16104
-- 6) +* 32.389 sec 8052
-- 7) maxpooling_backward 32.378 sec 4026
-- 8) + 32.204 sec 30566
-- 9) relu_backward 26.499 sec 6039
-- 10) conv2d_backward_data 26.011 sec 2013
Here is another run with MKL to double-check if the performance is reproducible
Iter:2000.0, training loss:0.013034947944848269, training accuracy:100.0
Iter:2000.0, validation loss:274.1293912979084, validation accuracy:96.28586065573771
SystemML Statistics:
Total elapsed time: 482.600 sec.
Total compilation time: 0.000 sec.
Total execution time: 482.600 sec.
Number of compiled Spark inst: 79.
Number of executed Spark inst: 2.
Native mkl calls (LibMatrixMult/LibMatrixDNN): 4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS): 147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.149/0.085/2.201/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time: 6.666 sec.
Spark ctx create time (lazy): 0.022 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time: 39.626 sec.
Total JVM GC count: 4503.
Total JVM GC time: 71.892 sec.
Heavy hitter instructions (name, time, count):
-- 1) sel+ 55.166 sec 6283
-- 2) conv2d_backward_filter 51.424 sec 4026
-- 3) conv2d_bias_add 47.443 sec 4514
-- 4) -* 38.041 sec 16104
-- 5) +* 35.513 sec 8052
-- 6) + 34.988 sec 30566
-- 7) maxpooling_backward 34.836 sec 4026
-- 8) ba+* 31.071 sec 12566
-- 9) conv2d_backward_data 28.552 sec 2013
-- 10) relu_backward 27.039 sec 6039
- Both MKL and OpenBLAS show 2-3x performance benefits on conv2d operators on Lenet.
- There are several OpenMP-related issues on Mac, hence we have explicitly disabled native support for Mac and Windows. Since we already have the cmake setup in place, we can always support BLAS on Mac and Windows in the future when the OpenMP-related issues are resolved.
Closes #344.
Based on the discussion with @frreiss and @bertholdreinwald, I am proposing to switch our default BLAS from Java-based BLAS to native BLAS. We will recommend using Intel MKL and provide optional support for other BLAS libraries such as OpenBLAS (and possibly Accelerate). Also, if no BLAS is installed, we will fall back to Java-based BLAS. This future-proofs SystemML against hardware improvements made in other BLAS libraries and also simplifies testing.
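A hedged sketch of the fallback chain this describes — prefer MKL, then OpenBLAS, then the Java-based implementation; the method names are illustrative, and the library names are only what a typical Linux install would expose:

```java
// Illustrative order of preference when choosing the BLAS backend at startup.
static String chooseBlasBackend() {
  if (tryLoad("mkl_rt"))     // recommended: Intel MKL (single dynamic library)
    return "mkl";
  if (tryLoad("openblas"))   // optional: OpenBLAS
    return "openblas";
  return "java";             // no native BLAS installed: use the Java-based BLAS
}

static boolean tryLoad(String name) {
  try {
    System.loadLibrary(name);  // resolves e.g. libmkl_rt.so or libopenblas.so on Linux
    return true;
  } catch (UnsatisfiedLinkError e) {
    return false;
  }
}
```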
The proposed solution in this jar will work:
Since we are not including any external dependency (such as a BLAS library), this PR adds no additional overhead to the release process. Also, when we add support for TSMM, SYSTEMML-1166 will be resolved, which will again future-proof SystemML on related issues.
The initial performance numbers are the same as those of #307.
@mboehm7 @dusenberrymw @nakul02 @lresende @deroneriksson @asurve @fschueler