
[SYSTEMML-769] [WIP] Adding support for native BLAS #344

Closed
32 commits

Conversation

@niketanpansare (Contributor) commented Jan 15, 2017

Based on the discussion with @frreiss and @bertholdreinwald, I am proposing to switch our default BLAS from the Java-based implementation to native BLAS. We will recommend using Intel MKL and provide optional support for other BLAS libraries such as OpenBLAS (and possibly Accelerate). If no BLAS is installed, we will fall back to the Java-based implementation. This future-proofs SystemML against hardware-specific improvements made in native BLAS libraries and also simplifies testing.

The proposed solution in this PR will work:

  1. In a distributed setting, with no additional dependency other than the BLAS library.
  2. On a hybrid cluster with different types of BLAS.
  3. With parfor.
  4. Without an additional artifact such as systemml-accelerator.jar (the approach used in #307, [SYSTEMML-769] Support for native BLAS and simplify deployment for GPU backend).

Since we are not bundling any external dependency (such as a BLAS library), this PR adds no additional overhead to the release process. Also, once we add support for TSMM, SYSTEMML-1166 will be resolved, which again future-proofs SystemML against related issues.

The initial performance numbers are the same as those of #307.

@mboehm7 @dusenberrymw @nakul02 @lresende @deroneriksson @asurve @fschueler

@nakul02 (Member) commented Jan 15, 2017

Will we produce different jar releases for different platforms? If so, it might help with JCuda as well.

@niketanpansare (Contributor, Author) commented:

No, I would recommend packaging the .so, .dll, etc. in a single jar, similar to the JCuda approach. Since the JNI API is extremely lightweight, the overhead would be minimal.
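For illustration, here is a minimal sketch of such a thin JNI wrapper around cblas_dgemm; the Java class and method names below are hypothetical and not necessarily those used in this PR:

#include <jni.h>
#include <cblas.h>   // or the MKL equivalent when building against MKL

// Hypothetical wrapper for a Java declaration like:
//   public static native void dgemm(double[] a, double[] b, double[] c, int m, int k, int n);
extern "C" JNIEXPORT void JNICALL
Java_org_apache_sysml_utils_NativeHelper_dgemm(JNIEnv* env, jclass,
    jdoubleArray a, jdoubleArray b, jdoubleArray c, jint m, jint k, jint n) {
  jdouble* A = env->GetDoubleArrayElements(a, nullptr);
  jdouble* B = env->GetDoubleArrayElements(b, nullptr);
  jdouble* C = env->GetDoubleArrayElements(c, nullptr);
  // C = A (m x k) * B (k x n), row-major, no transposition
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k, 1.0, A, k, B, n, 0.0, C, n);
  env->ReleaseDoubleArrayElements(a, A, JNI_ABORT);  // inputs: discard, no write-back
  env->ReleaseDoubleArrayElements(b, B, JNI_ABORT);
  env->ReleaseDoubleArrayElements(c, C, 0);          // output: write result back
}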

@asurve (Member) commented Jan 16, 2017

Thanks @niketanpansare for sharing your proposal.

Can you share the benefits of using a native BLAS over the Java-based implementation, other than the two below?

  • First, you mentioned that hardware improvements made in other BLAS libraries could benefit SystemML.
  • Second, you mentioned that it simplifies testing.

How much performance improvement does native BLAS give over the Java-based implementation?

In my opinion, we would have to write a C++ interface for every BLAS library and for every needed piece of functionality, on top of the JNI code interacting with the generic SystemML matrix C++ code, which will burden SystemML development and increase the testing effort. If there is a significant performance improvement, then it is worth it.

Another thing to consider: if there is a benefit from hardware improvements in other BLAS libraries, those could either already be available in the Java-based implementation, or we could add them to it (assuming those are open-source libraries).

Why can't the Java-based implementation support different types of BLAS? If it doesn't, shouldn't that be the place to put such code?

I would be cautious and want to understand the whole proposal before we take this into consideration.

@niketanpansare (Contributor, Author) commented:

Please see PR #307 for the performance numbers. The difference is about 2x to 5x for matrix multiplication and about 3x for conv2d.

If the concern is about development overhead, there are two solutions:

  1. Only support and test with the community version of Intel MKL.
  2. The only code that varies across BLAS libraries is the threading logic to enforce a sequential BLAS (see the sketch below). If necessary, we can factor that out, but it would still require as many shared libraries as there are types of BLAS.
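A rough sketch of that BLAS-specific threading logic, assuming the cmake options map to preprocessor flags; the set_num_threads functions are the real MKL/OpenBLAS entry points, everything else is illustrative:

#ifdef USE_INTEL_MKL
  #include <mkl.h>
  #include <mkl_service.h>
#else
  #include <cblas.h>
  extern "C" void openblas_set_num_threads(int numThreads);
#endif

// Force the underlying BLAS to run single-threaded, e.g. when SystemML
// already parallelizes at a higher level (parfor, multi-threaded ops).
static void setSequentialBLAS() {
#ifdef USE_INTEL_MKL
  mkl_set_num_threads(1);
#else
  openblas_set_num_threads(1);
#endif
}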

Let me re-emphasize that the overhead of maintaining the JNI wrappers is far less than that of continuously maintaining our own Java-based kernels.

If you are concerned about testing, adding native BLAS actually simplifies it, as Intel has already done performance testing on different shapes and sizes on different CPUs. Think about testing interfaces versus rigorous performance testing for matrix multiplication, tsmm, etc.

I don't understand a few of your questions:
Why can't the Java-based BLAS support different types of BLAS?
Adding hardware improvements to the Java-based BLAS? (Maybe we are not on the same page. I would encourage you to look at the LibMatrixMult code and at how people support AVX, FMA, NUMA awareness, etc. in C++; a flavor of that kind of code is sketched below. I actually think our Java-based implementation is much better than any other Java-based BLAS out there, especially in cache friendliness, but there is a cost to maintaining it as hardware evolves.)
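As a flavor of what that means in practice, here is a small illustrative AVX2/FMA kernel (not from this PR) of the kind native BLAS libraries maintain and retune for each CPU generation:

#include <immintrin.h>  // requires compilation with -mavx2 -mfma

// Dot product of two dense vectors using 256-bit fused multiply-add.
double dotFMA(const double* a, const double* b, int n) {
  __m256d acc = _mm256_setzero_pd();
  int i = 0;
  for (; i <= n - 4; i += 4) {
    __m256d va = _mm256_loadu_pd(a + i);
    __m256d vb = _mm256_loadu_pd(b + i);
    acc = _mm256_fmadd_pd(va, vb, acc);  // acc += va * vb
  }
  double tmp[4];
  _mm256_storeu_pd(tmp, acc);
  double sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
  for (; i < n; i++)  // scalar tail
    sum += a[i] * b[i];
  return sum;
}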

@mboehm7 (Contributor) commented Jan 16, 2017

I see that you've put a lot of effort in here but I'm still not convinced by the experiments conducted so far.

Instead of trying many combinations of small row/column dimensions (which have limited impact on real workloads), could we please select five representative scenarios and evaluate them thoroughly (incl. local and distributed operations, different BLAS libraries, different sparse representations, etc.)? Let's use the following scenarios, where (1) and (2) are supposed to be memory-bandwidth bound, whereas (3)-(5) are compute-bound; (1) and (2) are representative of regression and binomial classification, (3) and (4) are representative of multinomial classification, and (5) represents deep learning:

(1) Matrix-vector/vector-matrix 1M x 1K, dense (~8GB)
(2) Matrix-vector/vector-matrix 1M x 10K, sparsity=0.1 (~12GB)
(3) Matrix-matrix 1M x 1K x 20, dense (40 GFLOPs)
(4) Matrix-matrix 1M x 10K (sparsity=0.1) x 20 (40 GFLOPs)
(5) Square matrix-matrix 3K x 3K x 3K, dense (54 GFLOPs)

@niketanpansare (Contributor, Author) commented:

Sure, I will run experiments for shapes (1) to (5). Since I am proposing to support only dense local BLAS and to fall back to Java otherwise, the sparse cases will not be useful, as I will end up reporting the same numbers. Also, since we are officially supporting Intel MKL for performance to simplify testing (I am OK with removing support for OpenBLAS, by the way), using it as the baseline makes more sense; otherwise we might end up circling around the argument that our BLAS is better than BLAS X in certain scenarios Y but not Z, and not make much progress here.

Before we invest time in these and possibly some more experiments, let's agree on the end goal:

  1. We want to switch to Intel MKL BLAS if the performance improvement is ??x. If that is the case, let's agree on the factor at which we would feel comfortable switching.
  2. We want to switch to Intel MKL BLAS because it reduces development and performance-testing overhead, as Intel has already done that work for us. All we need to write and test are the JNI wrappers, not each individual matrix case.

@mboehm7 (Contributor) commented Jan 16, 2017

Let's run all five scenarios (with left/right exchange, and local/distributed operations) - if we add support for BLAS libraries, then in any case for both dense and sparse. Also, let's run at least with MKL and OpenBLAS, as that would ensure the integration is sufficiently general. Personally, I would be fine with adding BLAS support (enabled by default) if it consistently achieves speedups of more than 2-3x. If the results are a mixed bag, we might still consider it as an optional feature (disabled by default), but it would certainly not yield simplifications, as we would still need to maintain our own Java-based library.

@niketanpansare (Contributor, Author) commented:

Thanks, that clarifies a lot. 2x to 3x seems a reasonable threshold for deciding whether to keep it optional or make it the default. I will work on the experiments today. Since this PR does not change sparse matmult, would you still want me to test with sparsity=0.1?

@mboehm7 (Contributor) commented Jan 16, 2017

Right now, we're talking only about micro-benchmarks - I would recommend integrating the calls to sparse BLAS operations over a CSR representation just to be able to run these experiments. This should be pretty straightforward and not much effort, right?
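For reference, a sketch of the CSR (compressed sparse row) layout such an integration would hand to a sparse BLAS routine; the struct below is purely illustrative, and the library-specific sparse BLAS call itself is omitted:

#include <vector>

// Compressed sparse row representation of a sparse matrix.
struct CSRMatrix {
  int rows, cols;
  std::vector<double> values;  // non-zero values, stored row by row
  std::vector<int> colIdx;     // column index of each non-zero
  std::vector<int> rowPtr;     // rows+1 offsets into values/colIdx
};

// Example: the 2 x 3 matrix [[10, 0, 20], [0, 30, 0]] in CSR form.
CSRMatrix example() {
  CSRMatrix m;
  m.rows = 2; m.cols = 3;
  m.values = {10, 20, 30};
  m.colIdx = {0, 2, 1};
  m.rowPtr = {0, 2, 3};
  return m;
}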

@asurve (Member) commented Jan 16, 2017

OK, I don't see any BLAS-specific interface; I am assuming these BLAS libraries have common interfaces. So my concern about maintaining BLAS-specific code is resolved.

Based on the code changes, you have checked in generated library files (.so).
I would prefer not to check generated output (.so files) into the repository; rather, there should be a build process to generate these libraries on the required platform.
If someone does not have access to the target platform to build such a library, they can get a released version of the library from the Maven repository.
Again, this is nothing to worry about at this point, until you finalize the approach.
But then the first step would be to provide instructions and/or scripts to build such libraries, and the next step would be to automate running that script as part of the build process.

Thanks
Arvind

@niketanpansare (Contributor, Author) commented Jan 17, 2017

@mboehm7 I would prefer to get this PR (with dense BLAS) in first, before starting to work on sparse BLAS and/or adding other operations (such as solve, dsyrk, etc.).

For 10 operations:

Matrix-vector/vector-matrix: 1M x 1K, dense: 
Java BLAS: 2.252 sec, MKL: 5.403 sec and OpenBLAS: 23.477 sec

Matrix-matrix/matrix-Matrix 1M x 1K x 20, dense:
Java BLAS: 12.204 sec, MKL: 18.927 sec and OpenBLAS: 33.840 sec

Matrix-matrix/matrix-Matrix 1M x 1K x 100, dense (just out of curiosity):
Java BLAS: 53.218 sec, MKL: 17.090 sec and OpenBLAS: 102.779 sec

Matrix-matrix/matrix-Matrix 3K x 3K x 3K, dense:
Java BLAS: 12.948 sec, MKL: 2.612 sec and OpenBLAS: 2.828 sec

@mboehm7 (Contributor) commented Jan 17, 2017

Thanks for the initial results. A couple of comments:

  • Could you please specify on what kind of hardware you ran these experiments?
  • Please don't call it "Java BLAS", as we're not implementing the BLAS interface.
  • Let's add the sparse operations - otherwise we cannot make an informed decision on whether or not we want to integrate BLAS at all.

@bertholdreinwald (Contributor) commented:

Additional considerations are:

  • For high-dimensional matrices that are compute-intensive, the benefits of using MKL/OpenBLAS seem to be there. However, for smaller matrices/vectors (m <= 20 in your posting), they are slower. What will be the heuristic for switching?
  • It would be good to see single-threaded experiments as well, to decide whether there could be any benefit to using MKL in our distributed operations.
  • I am a little concerned that MKL operations will use memory that is not reflected in our memory estimates, and hence may lead to OOMs.
  • I am OK with initially doing it for dense matrices only, as sparse ops have much more variety and will require more effort, for which we may not have the bandwidth right now.

@mboehm7 (Contributor) commented Jan 17, 2017

While thinking about potential "heuristics for switching", we should also ask ourselves whether these compute-intensive cases are not already covered by the in-progress GPU backend.

@nakul02 (Member) commented Jan 17, 2017

@mboehm7 - we should keep in mind that a GPU may not be as ubiquitous or easy to find as an installed BLAS library (or MKL)

@mboehm7 (Contributor) commented Jan 17, 2017

Yes, but with more and more workloads being migrated to the cloud, this becomes less of an issue, as we can actually request preferred node types.

@niketanpansare (Contributor, Author) commented:

  1. To simplify performance testing, I am OK with paying the penalty for the few cases where we are better optimized than Intel MKL. At best, we can push the matrix-vector operation to the Java matmult and the rest to Intel MKL.

  2. Since we officially support compute-intensive operators on CP (e.g. the deep learning builtin functions, matrix multiplication in the case of an autoencoder, etc.), we should definitely try to optimize these cases (either by improving the Java matmult or by adding support for Intel MKL).

  3. Regarding OOMs, I don't think MKL makes a copy (at least the specification doesn't say it does). Whether a particular JDK makes a copy for the JNI call is a fair concern. The JDK specification (as well as the IBM JDK) indicates that no copy is created; if a particular JDK disregards that specification and we get an OOM because of it, then it's a JDK bug.

@mboehm7 (Contributor) commented Jan 17, 2017

Where did you get the idea that the JNI call would not create a copy? Of course it does (and the numbers reflect that). Since the JVM is free to compose a logical array from multiple physical fragments, it must create a copy in that case; otherwise, it could not provide a contiguous array. Even if the array is not fragmented, it is most likely still copied, because the asynchronous garbage collector is free to rearrange the array at any time.

@nakul02 (Member) commented Jan 18, 2017

@mboehm7 - Please take a look at this page.
There is a note there which clearly says:

As of JDK/JRE 1.1, programmers can use Get/Release<primitivetype>ArrayElements functions to obtain a pointer to primitive array elements. If the VM supports pinning, the pointer to the original data is returned; otherwise, a copy is made.

New functions introduced as of JDK/JRE 1.3 allow native code to obtain a direct pointer to array elements even if the VM does not support pinning.
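For context, a small sketch (not from this PR) of the two access paths under discussion; the isCopy flag reports whether the VM pinned the array or handed the native code a copy:

#include <jni.h>
#include <cstdio>

void inspectArray(JNIEnv* env, jdoubleArray arr) {
  jboolean isCopy = JNI_FALSE;

  // JDK/JRE 1.1 path: may pin or copy, depending on the VM and the array.
  jdouble* p = env->GetDoubleArrayElements(arr, &isCopy);
  std::printf("GetDoubleArrayElements copied: %s\n", isCopy ? "yes" : "no");
  env->ReleaseDoubleArrayElements(arr, p, JNI_ABORT);  // discard, no write-back

  // JDK/JRE 1.3 path: usually a direct pointer even without pinning support,
  // but GC is restricted and no other JNI calls are allowed in between.
  jdouble* c = static_cast<jdouble*>(env->GetPrimitiveArrayCritical(arr, &isCopy));
  std::printf("GetPrimitiveArrayCritical copied: %s\n", isCopy ? "yes" : "no");
  env->ReleasePrimitiveArrayCritical(arr, c, JNI_ABORT);
}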

@mboehm7 (Contributor) commented Jan 18, 2017

So how many JVMs do you think actually support pinning of large arrays? I tend to believe that all major JVMs actually store arrays in multiple fragments. Take a look at http://www.ibm.com/support/knowledgecenter/SSYKE2_7.0.0/com.ibm.java.aix.70.doc/diag/understanding/jni_copypin.html and https://www.ibm.com/developerworks/library/j-jni/ - usually only small arrays are returned as a direct pointer due to the issues mentioned above. How else would you explain the 2.5x slowdown for matrix-vector, which is a trivial operation?

@niketanpansare (Contributor, Author) commented:

@bertholdreinwald Please see below for the results of the single-threaded implementation:

For 10 operations (single-threaded):

Matrix-vector/vector-matrix: 1M x 1K, dense:
LibMatrixMult: 75.753 sec, MKL: 71.533 sec and OpenBLAS: 62.354 sec

Matrix-matrix/matrix-Matrix 1M x 1K x 20, dense:
LibMatrixMult: 209.818 sec, MKL: 93.709 sec and OpenBLAS: 91.460 sec

Matrix-matrix/matrix-Matrix 1M x 1K x 100, dense (just out of curiosity):
LibMatrixMult: 648.379 sec, MKL: 146.276 sec and OpenBLAS: 161.662 sec

Matrix-matrix/matrix-Matrix 3K x 3K x 3K, dense:
LibMatrixMult: 138.946 sec, MKL: 15.938 sec and OpenBLAS: 16.843 sec

@mboehm7 (Contributor) commented Jan 19, 2017

As I said before, please specify the hardware; otherwise we can't really interpret these numbers.

@niketanpansare (Contributor, Author) commented:

Please see below the hardware specification:

Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
CPU(s):                24
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K

Memory allocated in the above tests: 20G

@mboehm7 (Contributor) commented Jan 19, 2017

Thanks @niketanpansare, that helps. However, it also points to measurement errors: how do you explain a speedup of 34x for the Java matrix-vector case, given 24 virtual cores?

Furthermore, and just to avoid people drawing the wrong conclusions here: distributed operations will look very similar to the multi-threaded (not single-threaded!) operations, because both become memory-bandwidth bound. In any case, we need to run the distributed operations as well - there are no shortcuts.

@niketanpansare (Contributor, Author) commented:

> Thanks @niketanpansare, that helps. However, it also points to measurement errors: how do you explain a speedup of 34x for the Java matrix-vector case, given 24 virtual cores?

The numbers reported here are the execution times for b+* from our Statistics output and are fairly easy to reproduce. All I did was redirect matrixMult(MatrixBlock m1, MatrixBlock m2, MatrixBlock ret, int k) to matrixMult(MatrixBlock m1, MatrixBlock m2, MatrixBlock ret) ... Since there is fairly complicated logic involved (for example, recomputeNonZeros in parallel vs. sequential, and the interleaving of memory accesses), I would not be suspicious of the speedup.

> Furthermore, distributed operations will look very similar to the multi-threaded (not single-threaded!) operations, because both become memory-bandwidth bound. In any case, we need to run the distributed operations as well - there are no shortcuts.

That is a fair point; one cannot directly correlate single-threaded performance with distributed performance. However, as I said earlier, "to simplify performance testing, I am OK with paying the penalty for the few cases where we are better optimized than Intel MKL."

@mboehm7 (Contributor) commented Jan 19, 2017

Well, I'm not fine with that, because a 2-3x slowdown on scenarios (1) and (2) directly affects the end-to-end performance of common algorithms such as LinregCG, GLM, L2SVM, MSVM, MLogreg, Kmeans, and PageRank.

@niketanpansare (Contributor, Author) commented:

Then please feel free to update the memory-bound logic once the PR is merged. Though slightly tricky, Statistics will still help us with performance debugging. I hope we both agree that Intel MKL is much better optimized than our LibMatrixMult.

@mboehm7 (Contributor) commented Jan 19, 2017

Also, seeing a speedup of 34x for matrix-vector on 24 virtual cores should raise red flags. Usually, we only see very moderate speedups of 3-5x for matrix-vector because, again, at some point this operation becomes memory-bandwidth bound.

I suspect that garbage collection or just-in-time compilation is interfering with the measurements here. Hence, I would recommend setting up a proper micro-benchmark with isolated block operations, a number of warm-up runs for just-in-time compilation, and the verbose GC flag, so that runs overlapping with GC can be excluded.

@niketanpansare (Contributor, Author) commented:

That could be the case. Again, I don't dispute that LibMatrixMult is the better option for memory-bound cases, and identifying those cases should not stop us from integrating Intel MKL.

@niketanpansare (Contributor, Author) commented:

Here are the updated results on Lenet for 2000 iterations:

With MKL on Linux:
Iter:2000.0, training loss:0.01354500195846287, training accuracy:100.0
Iter:2000.0, validation loss:211.49319894899025, validation accuracy:97.07991803278688
SystemML Statistics:
Total elapsed time:             480.432 sec.
Total compilation time:         0.000 sec.
Total execution time:           480.432 sec.
Number of compiled Spark inst:  79.
Number of executed Spark inst:  2.
Native mkl calls (LibMatrixMult/LibMatrixDNN):  4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS):    147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.103/0.074/2.183/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time:        6.609 sec.
Spark ctx create time (lazy):   0.022 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         36.785 sec.
Total JVM GC count:             4566.
Total JVM GC time:              72.493 sec.
Heavy hitter instructions (name, time, count):
-- 1)   sel+    56.005 sec      6283
-- 2)   conv2d_backward_filter  51.247 sec      4026
-- 3)   conv2d_bias_add         48.540 sec      4514
-- 4)   -*      35.711 sec      16104
-- 5)   +       34.503 sec      30566
-- 6)   +*      34.500 sec      8052
-- 7)   maxpooling_backward     34.057 sec      4026
-- 8)   ba+*    31.741 sec      12566
-- 9)   conv2d_backward_data    28.271 sec      2013
-- 10)  relu_backward   26.899 sec      6039

With Java on Linux:
Iter:2000.0, training loss:0.0059023118210415025, training accuracy:100.0
Iter:2000.0, validation loss:151.31859200647978, validation accuracy:97.46413934426229
SystemML Statistics:
Total elapsed time:             654.523 sec.
Total compilation time:         0.000 sec.
Total execution time:           654.523 sec.
Number of compiled Spark inst:  79.
Number of executed Spark inst:  2.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS):    147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.097/0.073/2.021/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time:        6.575 sec.
Spark ctx create time (lazy):   0.024 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         39.636 sec.
Total JVM GC count:             7094.
Total JVM GC time:              98.133 sec.
Heavy hitter instructions (name, time, count):
-- 1)   conv2d_bias_add         149.524 sec     4514
-- 2)   conv2d_backward_filter  113.605 sec     4026
-- 3)   sel+    54.545 sec      6283
-- 4)   conv2d_backward_data    50.524 sec      2013
-- 5)   ba+*    41.264 sec      12566
-- 6)   -*      36.036 sec      16104
-- 7)   +*      33.817 sec      8052
-- 8)   +       32.274 sec      30566
-- 9)   *       26.529 sec      35275
-- 10)  maxpooling_backward     25.734 sec      4026

@akchinSTC (Contributor) commented:

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1419/

@deroneriksson (Member) commented:

Currently this PR is up to:
Conversation 102 | Commits 26 | Files changed 51

It has been open since January. We should consider merging or closing it before the 1.0.0 release, leaving significant time for testing if it is merged.

@bertholdreinwald @frreiss @mboehm7 @nakul02

@akchinSTC (Contributor) commented:

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1420/

@mboehm7 (Contributor) commented Apr 29, 2017

@deroneriksson I absolutely agree - something like this needs to go in at least a month or two before a release. Let's resolve all remaining build and deploy issues and then bring it in (along with a test suite that checks for the existing libraries and only runs if they are available; of course we should install them on our build server).

@niketanpansare (Contributor, Author) commented Apr 29, 2017

@deroneriksson @mboehm7 I think this PR has addressed all the deploy/build issues on Linux. The remaining issues can be addressed later as bug fixes in a subsequent PR.

@bertholdreinwald @nakul02 After spending a significant amount of time on trial and error for Mac (and Windows), I think we should only support BLAS on Linux machines, for the following reasons:

  • GNU OpenMP is not enabled by default on Mac, and it requires either downloading GCC or using experimental clang-openmp.
  • GNU OpenMP is not supported by Intel MKL on Mac. Hence we see potential degradation due to switching between GNU OpenMP and Intel OpenMP for operations such as conv2d_backward_data.
  • To compile with only Intel OpenMP, we need to buy the Intel compiler; GCC and Clang do not allow us to compile with Intel OpenMP.
  • Even if we buy the Intel compiler, the problem with OpenBLAS+OpenMP support remains.

Since we already have the cmake setup in place, we can always add BLAS support on Mac and Windows in the future, once the OpenMP-related issues are resolved.


For your reference, please see below for instructions for compiling BLAS-enabled SystemML on Mac and Windows:

64-bit x86 Windows

  • Install MKL or download the OpenBLAS binary.
  • Install Visual Studio Community Edition (tested with VS 2017).
  • In the CMake GUI, select the source directory and the output directory.
  • Press the Configure button, set the generator, and use the default native compilers option.
  • Set CMAKE_BUILD_TYPE to Release; this sets the appropriate optimization flags.
  • By default, USE_INTEL_MKL is selected; if you want to use OpenBLAS, deselect USE_INTEL_MKL and select USE_OPEN_BLAS.
  • You might run into errors a couple of times; select the appropriate library and include files/directories (for MKL or OpenBLAS) each time, and the errors should go away.
  • Then press Generate. This will generate Visual Studio project files, which you can open in VS 2017 to compile the libraries.
    The current set of dependencies is as follows:
  • MKL: mkl_rt.dll (for functions: mkl_set_num_threads and cblas_dgemm).
  • OpenBLAS: libopenblas.dll (for functions: openblas_set_num_threads and cblas_dgemm).
  • Visual C OpenMP: vcomp140.dll (for function: omp_get_thread_num).
  • Visual C Runtime: vcruntime140.dll (for functions: memcpy and memset).
  • API-MS-WIN-CRT-ENVIRONMENT-L1-1-0.DLL, API-MS-WIN-CRT-RUNTIME-L1-1-0.DLL and API-MS-WIN-CRT-HEAP-L1-1-0.DLL (for functions: malloc and free).
  • KERNEL32.dll
    If you get an error Error LNK1181 cannot open input file 'C:/Program.obj',
    you may have to use quotation marks around the path in Visual Studio project properties.
  • Property Pages > C/C++ > General > Additional Include Directories
  • Property Pages > Linker > Command Line
    If you get an error CMake Error at cmake/FindOpenBLAS.cmake:71 (MESSAGE): Could not find OpenBLAS,
    please set the environment variable OpenBLAS_HOME or edit the variables OpenBLAS_INCLUDE_DIR and OpenBLAS_LIB
    to point to the include directory and libopenblas.dll.a respectively.
    If you get the error install Library TARGETS given no DESTINATION!, you can comment out install(TARGETS systemml preload LIBRARY DESTINATION lib)
    in CMakeLists.txt and manually place the compiled DLL in the src/main/cpp/lib directory.

64-bit x86 Mac

The version of clang that ships with Mac does not come with OpenMP. Use brew to install either llvm or g++. The instructions that follow are for llvm:

  1. Intel MKL - CMake should detect the MKL installation path; otherwise it can be specified via the environment variable MKLROOT. To use (with clang):
mkdir INTEL && cd INTEL
CXX=/usr/local/opt/llvm/bin/clang++ CC=/usr/local/opt/llvm/bin/clang LDFLAGS=-L/usr/local/opt/llvm/lib CPPFLAGS=-I/usr/local/opt/llvm/include cmake  -DUSE_INTEL_MKL=ON -DCMAKE_BUILD_TYPE=Release ..
make install

(with gcc-6):

mkdir INTEL && cd INTEL
CXX=g++-6 CC=gcc-6 cmake  -DUSE_INTEL_MKL=ON -DCMAKE_BUILD_TYPE=Release ..
make install
  2. OpenBLAS - CMake should be able to detect the path of OpenBLAS. If it can't, set the OpenBLAS_HOME environment variable. If using brew to install OpenBLAS, set OpenBLAS_HOME to /usr/local/opt/openblas/. To use (with clang):
export OpenBLAS_HOME=/usr/local/opt/openblas/
mkdir OPENBLAS && cd OPENBLAS
CXX=/usr/local/opt/llvm/bin/clang++ CC=/usr/local/opt/llvm/bin/clang LDFLAGS=-L/usr/local/opt/llvm/lib CPPFLAGS=-I/usr/local/opt/llvm/include cmake  -DUSE_OPEN_BLAS=ON -DCMAKE_BUILD_TYPE=Release ..
make install

(with gcc-6):

export OpenBLAS_HOME=/usr/local/opt/openblas/
mkdir OPENBLAS && cd OPENBLAS
CXX=g++-6 CC=gcc-6 cmake -DUSE_OPEN_BLAS=ON -DCMAKE_BUILD_TYPE=Release ..
make install

#include <mkl.h>
#include <mkl_service.h>
#endif

Review comment (Member): Documentation?

}
}
}

Review comment (Member): Documentation?

}
}
}

Review comment (Member): Documentation?

}
}


Review comment (Member): Documentation?

std::memset(temp, 0, numTempElem*numOpenMPThreads*sizeof(double));
double* rotatedDoutPtrArrays = new double[numRotatedElem*numOpenMPThreads];
double* loweredMatArrays = new double[numIm2ColElem*numOpenMPThreads];

Review comment (Member):
Briefly talk about the parallelization strategy here.
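A plausible reading of the per-thread buffers above (this is an assumption for illustration, not the author's documentation): the loop over images is parallelized with OpenMP, and each thread works only in its own preallocated slice of the im2col/rotated-dout scratch buffers. A sketch:

#include <omp.h>
#include <cstddef>
#include <cstring>

// Illustrative only: per-thread scratch slices indexed by the OpenMP
// thread id, so threads never share temporary memory.
void processImages(int numImages, double* loweredMatArrays, std::size_t numIm2ColElem) {
  #pragma omp parallel for
  for (int img = 0; img < numImages; img++) {
    double* loweredMat = loweredMatArrays + numIm2ColElem * omp_get_thread_num();
    std::memset(loweredMat, 0, numIm2ColElem * sizeof(double));  // stand-in for im2col
    // ... im2col for image 'img' into loweredMat, then a dense GEMM on this slice ...
  }
}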

@@ -288,6 +293,17 @@ public void processBiasMultiplyInstruction(ExecutionContext ec) throws DMLRuntim
ec.setMatrixOutput(getOutputVariableName(), outputBlock);
}

// Assumption: enableNative && NativeHelper.isNativeLibraryLoaded() is true
// This increases the number of native calls, for example in the cases where the filter is sparse but the input is dense
Review comment (Member):
Can you please convert this to Javadoc? Also talk about the parameters and return types.

@@ -65,6 +65,9 @@

<!-- if codegen.enabled, compile literals as constants: 1..heuristic, 2..always -->
<codegen.literals>1</codegen.literals>

<!-- enables native blas for matrix multiplication and convolution, experimental feature -->
Review comment (Member):
I would recommend this be named "systemml.native.blas", but it's your call here. Going forward, I would like all SystemML-specific properties to be prefixed with "systemml." so that, when we read properties files from other projects we depend on, we don't have name clashes.

docs/blas.md Outdated
@@ -0,0 +1,162 @@
<!--
Review comment (Member):
My recommendation is to name this file "native-backend.md".

"blas.md" doesn't cover the topics.

outputBlock.recomputeNonZeros();
}

// Single-threaded matrix multiplication
Review comment (Member): A javadoc will be nice here.

@@ -0,0 +1,259 @@
/*
Review comment (Member): Could you please add some javadoc to this file?

@nakul02 (Member) commented Apr 29, 2017

I agree with your assessment; only supporting the native backend on Linux is fine. This PR looks good to me 👍
You can address my comments in this PR or in a subsequent PR.

@akchinSTC (Contributor) commented:

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1423/

@niketanpansare (Contributor, Author) commented:

With OpenBLAS on Lenet on the same machine:
Iter:2000.0, training loss:0.009108877504719245, training accuracy:100.0
Iter:2000.0, validation loss:166.14787969186656, validation accuracy:97.87397540983606
SystemML Statistics:
Total elapsed time:             476.380 sec.
Total compilation time:         0.000 sec.
Total execution time:           476.380 sec.
Number of compiled Spark inst:  79.
Number of executed Spark inst:  2.
Native openblas calls (LibMatrixMult/LibMatrixDNN):     4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS):    147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.111/0.068/2.028/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time:        6.428 sec.
Spark ctx create time (lazy):   0.023 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         40.452 sec.
Total JVM GC count:             4557.
Total JVM GC time:              71.705 sec.
Heavy hitter instructions (name, time, count):
-- 1)   sel+    55.054 sec      6283
-- 2)   conv2d_bias_add         49.798 sec      4514
-- 3)   ba+*    46.230 sec      12566
-- 4)   conv2d_backward_filter  41.418 sec      4026
-- 5)   -*      36.461 sec      16104
-- 6)   +*      32.389 sec      8052
-- 7)   maxpooling_backward     32.378 sec      4026
-- 8)   +       32.204 sec      30566
-- 9)   relu_backward   26.499 sec      6039
-- 10)  conv2d_backward_data    26.011 sec      2013

Here is another run with MKL, to double-check that the performance is reproducible:
Iter:2000.0, training loss:0.013034947944848269, training accuracy:100.0
Iter:2000.0, validation loss:274.1293912979084, validation accuracy:96.28586065573771
SystemML Statistics:
Total elapsed time:             482.600 sec.
Total compilation time:         0.000 sec.
Total execution time:           482.600 sec.
Number of compiled Spark inst:  79.
Number of executed Spark inst:  2.
Native mkl calls (LibMatrixMult/LibMatrixDNN):  4270/10553.
Cache hits (Mem, WB, FS, HDFS): 281999/0/0/0.
Cache writes (WB, FS, HDFS):    147642/0/0.
Cache times (ACQr/m, RLS, EXP): 0.149/0.085/2.201/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/11305.
HOP DAGs recompile time:        6.666 sec.
Spark ctx create time (lazy):   0.022 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Total JIT compile time:         39.626 sec.
Total JVM GC count:             4503.
Total JVM GC time:              71.892 sec.
Heavy hitter instructions (name, time, count):
-- 1)   sel+    55.166 sec      6283
-- 2)   conv2d_backward_filter  51.424 sec      4026
-- 3)   conv2d_bias_add         47.443 sec      4514
-- 4)   -*      38.041 sec      16104
-- 5)   +*      35.513 sec      8052
-- 6)   +       34.988 sec      30566
-- 7)   maxpooling_backward     34.836 sec      4026
-- 8)   ba+*    31.071 sec      12566
-- 9)   conv2d_backward_data    28.552 sec      2013
-- 10)  relu_backward   27.039 sec      6039

@akchinSTC (Contributor) commented:

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1425/

@akchinSTC (Contributor) commented:

Refer to this link for build results (access rights to CI server needed):
https://sparktc.ibmcloud.com/jenkins/job/SystemML-PullRequestBuilder/1426/

@asfgit closed this in 39a37ae on Apr 30, 2017
asfgit pushed a commit that referenced this pull request May 1, 2017
- Both MKL and OpenBLAS show 2-3x performance benefits on conv2d
  operators on Lenet.
- There are several OpenMP-related issues on Mac, hence we have explicitly
  disabled native support for Mac and Windows. Since we already have the
  cmake setup in place, we can always support BLAS on Mac and Windows in
  the future, once the OpenMP-related issues are resolved.

Closes #344.
j143-zz pushed a commit to j143-zz/systemml that referenced this pull request Nov 4, 2017