[SYSTEMML-769] Support for native BLAS and simplify deployment for GPU backend #307
Conversation
Overall, I'm quite pleased to see that we can achieve ~equal performance to TensorFlow, particularly on case 4, which is the most probable scenario of the given tests. 26 seconds vs. 71 seconds in SystemML today is a huge difference, and this is even with a small image size (28x28) compared to, say, 256x256x3, and an even larger number of channels for intermediate layers. I'd also be interested in a comparison of a full network, such as the LeNet example in the SystemML-NN package, for the various SystemML setups. As for the BLAS integration, I think it would be a good idea to generically target a BLAS implementation, vs. a specific one. A lot of people will be using OpenBLAS, for example.
Once we all agree that adding support for native BLAS is a good idea, we can add the remaining operations and do a comparison on a full network 👍
I agree. In fact, the code is written against the cblas_* interface to make this happen. To switch over to a generic BLAS, all we need to change is:
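For context, here is a minimal sketch of a dense matmult written against the generic CBLAS interface (my own illustration, not the PR's actual change). Because MKL, OpenBLAS, ATLAS, and Accelerate all export these symbols, switching implementations becomes a link-time decision:

#include <cblas.h>  // header shipped by MKL, OpenBLAS, ATLAS, Accelerate, ...

// C = A * B for dense row-major matrices: (m x k) * (k x n) -> (m x n).
void denseMatMult(const double* A, const double* B, double* C,
                  int m, int n, int k) {
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k,
              1.0, A, k,   // alpha, A, leading dimension of A
              B, n,        // B, leading dimension of B
              0.0, C, n);  // beta, C, leading dimension of C
}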
The results look great 👍 I agree with @dusenberrymw's comments about generalizing the implementation to plug in any BLAS. We cannot assume the availability of Intel MKL on every system.
(force-pushed from 2dc5d86 to 4cc0484)
@bertholdreinwald @mboehm7 @fschueler I ran experiments comparing the performance of matrix multiplication for 1000 iterations. The DML script used for this comparison is as follows:

X = matrix(0.1, rows=$1, cols=$2)
Y = matrix(0.2, rows=$2, cols=$3)
Z = matrix(0, rows=$1, cols=$3)
for(i in 1:1000) {
  Z = Z + X %*% Y
}
print(as.scalar(Z[1,1]))

The values provided to $1, $2, and $3 are listed in the table below. For example, the datapoint in the red square (the rows marked <<< below) refers to the speedup of native ba+* over LibMatrixMult for the $2 = 5000 case.

Setup, $1, $2, $2, $3, Time in seconds
Native,1000,1,1,100,0.499
CP,1000,1,1,100,0.377
Native,1000,10,10,100,0.522
CP,1000,10,10,100,0.955
Native,1000,100,100,100,0.472
CP,1000,100,100,100,1.780
Native,1000,1000,1000,100,1.286
CP,1000,1000,1000,100,6.849
Native,1000,2000,2000,100,2.196
CP,1000,2000,2000,100,12.234
Native,1000,5000,5000,100,5.870 <<<
CP,1000,5000,5000,100,28.050 <<<
Native,1000,10000,10000,100,10.249
CP,1000,10000,10000,100,55.804

It is important to note that we are competitive with the native library for matrix-vector multiplication.
I'm a bit puzzled about the experimental setting and results. Are you comparing here sparse-sparse matrix multiply and sparse-sparse element-wise addition? If so, are you also using Sparse BLAS and the same sparse matrix representation for both (e.g., CSR)? Which fraction of the reported time is actually spent in matrix multiplication (excluding output allocation etc.)?
I am only comparing the time taken for dense matrix multiplication. The time reported is from our heavy hitters, so it only includes ba+* time.
Time for output allocation is included in both native and CP. I only edited LibMatrixMult to redirect to native for a fair comparison.
Hm, but why do you force dense matrix multiply on a sparse scenario of these shapes? Anyway, could you please factor out a couple of things: on which scenarios did we apply multi-threading, how much time was spent in output allocation, how much time was spent in the actual matrix multiply, and how much time was spent in computing/maintaining the nnz of the output? Thanks.
Maybe I am missing something: why do you think the scenario should be sparse? If you prefer, I can test with random matrices instead of all 0.1 or 0.2. I didn't check each case, but for a larger case, both Berthold and I noticed that all cores were being used by CP. Also, I tried to ensure that the overhead (output allocation and nnz maintenance) happens in both cases; please see LibMatrixNative in this PR. I suspect that if we factor out the overhead, the speedup will be even larger. Unfortunately, I won't be able to do any more experiments for at least another week, but I can share the setup if you or someone else wants to take the lead until then :)
Sorry, my bad. When skimming the scenario, I mistakenly read it as rand with sparsity 0.1 and 0.2; in that case you can safely ignore my previous comments. However, it would still be useful to understand the details of where the time goes and why the results show such non-monotonic behavior.
Just to clarify the issue of multi-threading: we have a parallelization threshold of 2 MFLOPs, as this is roughly the point where the creation of a thread pool is amortized. We experimented with a shared thread pool, but the integration, especially with parfor, was not very clean, so we did not put it into master. Please just annotate the cases where we don't use multi-threading to help understand the large variation.
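For intuition, the threshold logic amounts to something like the following sketch (my illustration, not SystemML's actual code; the 2 MFLOP constant is the figure quoted above):

#include <cstdint>

// A dense (m x k) %*% (k x n) matmult costs roughly 2*m*k*n FLOPs.
// Below ~2 MFLOPs, creating a thread pool costs more than it saves,
// so the single-threaded path is used.
bool useMultiThreadedMatMult(int64_t m, int64_t k, int64_t n) {
  const double PAR_THRESHOLD_FLOPS = 2e6;  // ~2 MFLOPs
  double flops = 2.0 * static_cast<double>(m)
                     * static_cast<double>(k)
                     * static_cast<double>(n);
  return flops >= PAR_THRESHOLD_FLOPS;
}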
Can we add a "WIP" tag to this PR?
(force-pushed from 4cc0484 to 4e9a033)
@mboehm7 I have created a branch with my experimental setup that annotates the cases where multi-threading was not enabled. Please see https://github.com/niketanpansare/incubator-systemml/blob/matmult_cpp_experiments/scripts/perftest/runMatMultExperiments.sh in case you want to reproduce the results on a different machine. I am running the experiments now and will provide an updated graph soon. Here is a preview of the results:

Java,SingleThreaded,1,1000,100,0.087730
IntelMKL,MultiThreaded,1,1000,1000,0.137696
OpenBLAS,MultiThreaded,1,1000,1000,1.190570
Java,SingleThreaded,1,1000,1000,0.619783
IntelMKL,MultiThreaded,1,1000,2000,0.220410
OpenBLAS,MultiThreaded,1,1000,2000,3.202002
Java,MultiThreaded,1,1000,2000,1.553647
IntelMKL,MultiThreaded,1,1000,5000,2.011847
OpenBLAS,MultiThreaded,1,1000,5000,9.932969
Java,MultiThreaded,1,1000,5000,2.930450
IntelMKL,MultiThreaded,1,1000,10000,7.010215
OpenBLAS,MultiThreaded,1,1000,10000,19.757398
Java,MultiThreaded,1,1000,10000,5.434230
IntelMKL,MultiThreaded,1,2000,1,0.023460
OpenBLAS,MultiThreaded,1,2000,1,0.019632
Java,SingleThreaded,1,2000,1,0.014827
IntelMKL,MultiThreaded,1,2000,10,0.065613
OpenBLAS,MultiThreaded,1,2000,10,0.033173
Java,SingleThreaded,1,2000,10,0.045123
IntelMKL,MultiThreaded,1,2000,100,0.118278
OpenBLAS,MultiThreaded,1,2000,100,0.239601
Java,SingleThreaded,1,2000,100,0.181720
IntelMKL,MultiThreaded,1,2000,1000,0.288750
OpenBLAS,MultiThreaded,1,2000,1000,3.235483
Java,MultiThreaded,1,2000,1000,1.489389
IntelMKL,MultiThreaded,1,2000,2000,2.614357
OpenBLAS,MultiThreaded,1,2000,2000,7.843344

@dusenberrymw @nakul02 @fschueler As an FYI, in some preliminary experiments, Java is outperforming OpenBLAS even with:

g++ -o /home/[user]/libsystemml.so systemml-cpp/systemml.cpp -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.x86_64/include -Isystemml-cpp -I/opt/openblas/include -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.x86_64/include/linux -lopenblas -lpthread -lm -ldl -DUSE_OPEN_BLAS -L/opt/openblas/lib -fopenmp -O3 -shared -fPIC -mavx -mfma

@fschueler Either I am not compiling OpenBLAS in an optimal manner (i.e., AVX/SSE2/FMA are not getting turned on by default) or your previous experiments might have been using OpenCL?
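One way to rule out the compiler side is a quick sanity check of the predefined macros (a hypothetical helper of mine; compile it with the same flags as libsystemml.so above). Note that OpenBLAS selects its own kernels when OpenBLAS itself is built, independently of these flags:

#include <cstdio>

int main() {
#ifdef __AVX__
  std::puts("__AVX__ defined: AVX code generation is on");
#else
  std::puts("__AVX__ not defined");
#endif
#ifdef __FMA__
  std::puts("__FMA__ defined: FMA code generation is on");
#else
  std::puts("__FMA__ not defined");
#endif
  return 0;
}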
@mboehm7 Please see below the graphs comparing speedup with Intel MKL and OpenBLAS, respectively. As per your suggestion, I have marked the cases where multi-threading was not enabled. @fschueler I think we do not see the speedup in OpenBLAS because it is likely not compiled with AVX. The R script used to generate the above graphs:

library(ggplot2)
t = read.csv("time1.txt", header=TRUE)
t = subset(t, BLAS != "IntelMKL" & Time > 0)
temp = subset(t, BLAS == "Java")
for(i in 1:nrow(temp)) {
  M1 = temp[i,]$M
  N2 = temp[i,]$N
  K1 = temp[i,]$K
  temp[i,]$Time = subset(t, BLAS == "Java" & M == M1 & N == N2 & K == K1)$Time / subset(t, BLAS != "Java" & M == M1 & N == N2 & K == K1)$Time
}
qplot(N, Time, data=temp, color=IsSingleThreaded, facets=M~K, xlab="Common dimension (N)", ylab="Speedup", main="Matrix multiplication (Speedup = CP/OpenBLAS)")

@nakul02 @fschueler @dusenberrymw Can you help me with the Mac compilation commands for OpenBLAS and Intel MKL? See https://github.com/niketanpansare/incubator-systemml/blob/for_cpp/src/main/python/setup.py#L70
@fschueler I used the following steps to compile OpenBLAS:

git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS
make FC=gfortran
sudo make PREFIX=/opt/openblas install

Also, here is the output of ./getarch 0:

$ ./getarch 0
CORE=HASWELL
LIBCORE=haswell
NUM_CORES=24
HAVE_MMX=1
HAVE_SSE=1
HAVE_SSE2=1
HAVE_SSE3=1
HAVE_SSSE3=1
HAVE_SSE4_1=1
HAVE_SSE4_2=1
HAVE_AVX=1
HAVE_FMA3=1
(force-pushed from 597757d to 46d9b04)
To summarize, I tried the following with OpenBLAS, but still saw performance degradation:
My guess is that this is due to a performance bug in OpenBLAS on multi-socket machines: OpenMathLib/OpenBLAS#611

$ cat /proc/cpuinfo | grep "physical id" | sort -u | wc -l
2
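If the multi-socket bug is indeed the culprit, one workaround worth trying (my own suggestion, untested in this thread) is to cap OpenBLAS at one socket's worth of threads, e.g. 12 of the 24 cores reported by getarch above, and additionally pin the process to one socket with taskset -c 0-11:

#include <cblas.h>

// Exported by OpenBLAS itself; caps the number of BLAS worker threads.
extern "C" void openblas_set_num_threads(int num_threads);

int main() {
  openblas_set_num_threads(12);  // one socket's worth of cores (assumption)
  double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4];
  // Tiny 2x2 multiply just to exercise the configured library.
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
  return 0;
}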
(force-pushed from 29066ad to 263e32d)
I have gotten this PR to a functionally complete state. Here are the features included as part of this PR:
@nakul02 @mboehm7 @bertholdreinwald @dusenberrymw @deroneriksson @asurve @gweidner @fschueler Can you please try the following script in your environment and comment back on this PR with the log lines prefixed by accelerator (as in the examples below)?

git clone https://github.com/niketanpansare/incubator-systemml.git
cd incubator-systemml
git checkout for_cpp
mvn clean package -P distribution
pip install target/systemml-0.12.0-incubating-SNAPSHOT-python.tgz
pyspark
>>> from systemml import random
>>> m1 = random.uniform(size=(1000,1000))
>>> m2 = random.uniform(size=(1000,1000))
>>> m3 = m1.dot(m2).toNumPy()

Here are the logs from my setups:
On machine 1:
16/12/22 11:21:38 INFO accelerator.LibraryLoader: Unable to load MKL:no mkl_rt in java.library.path
16/12/22 11:21:38 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 11:21:38 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas
On machine 2 (SYSTEMML_GPU and SYSTEMML_BLAS not set):
16/12/22 14:30:46 INFO accelerator.BLASHelper: Found BLAS: mkl
16/12/22 14:30:46 INFO accelerator.BLASHelper: Successfully loaded systemml library with mkl
16/12/22 14:30:52 INFO accelerator.JCudaHelper: Total number of GPUs on the machine: 2
16/12/22 14:30:52 INFO accelerator.JCudaHelper: GPU is enabled
On machine 2 (SYSTEMML_GPU not set and SYSTEMML_BLAS set to openblas):
16/12/22 14:33:07 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 14:33:07 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas
16/12/22 14:33:12 INFO accelerator.JCudaHelper: Total number of GPUs on the machine: 2
16/12/22 14:33:12 INFO accelerator.JCudaHelper: GPU is enabled
On machine 2 (SYSTEMML_GPU set to none and SYSTEMML_BLAS set to openblas):
16/12/22 14:34:05 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 14:34:05 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas
16/12/22 14:34:05 INFO accelerator.JCudaHelper: Not loading JCUDA as SYSTEMML_GPU=none
On machine 2 (SYSTEMML_GPU set to none and SYSTEMML_BLAS set to none):
16/12/22 14:34:54 INFO accelerator.BLASHelper: Not loading native BLAS as SYSTEMML_BLAS=none
16/12/22 14:34:54 INFO accelerator.JCudaHelper: Not loading JCUDA as SYSTEMML_GPU=none

If you have a few more free cycles, please run the microbenchmarks: https://github.com/niketanpansare/incubator-systemml/tree/for_cpp/scripts/perftest/microbenchmarks
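For readers following along, the behavior implied by the logs above amounts to something like this sketch (hypothetical and Linux-only; link with -ldl; the actual loading happens on the Java side via java.library.path): SYSTEMML_BLAS=none disables native BLAS entirely, otherwise MKL is preferred over OpenBLAS, and if neither loads, SystemML falls back to the Java LibMatrixMult.

#include <cstdlib>
#include <cstring>
#include <dlfcn.h>

// Returns a handle to a native BLAS, or nullptr to signal fallback to Java.
void* loadNativeBLAS() {
  const char* pref = std::getenv("SYSTEMML_BLAS");
  if (pref != nullptr && std::strcmp(pref, "none") == 0)
    return nullptr;                                   // explicitly disabled
  if (pref == nullptr || std::strcmp(pref, "mkl") == 0)
    if (void* h = dlopen("libmkl_rt.so", RTLD_LAZY))  // try MKL first
      return h;
  return dlopen("libopenblas.so", RTLD_LAZY);         // then OpenBLAS
}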
All tests passed here.
(force-pushed from 78c7ae4 to e2d9a16)
@niketanpansare W.r.t. BLAS on OS X / macOS: currently, the preferred and default BLAS implementation is Apple Accelerate. Like ATLAS, OpenBLAS, Intel MKL, etc., Accelerate implements the BLAS API, and in terms of performance, Accelerate is generally the fastest BLAS implementation on OS X / macOS. Also, just to be clear, Accelerate is part of OS X / macOS and does not require any installation. For example:

import numpy as np
np.__config__.show()
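And since Accelerate exports the standard CBLAS symbols, the same cblas_* code path should work on macOS with no extra installation (a sketch under that assumption; compile with clang++ test.cpp -framework Accelerate):

#include <Accelerate/Accelerate.h>  // provides the CBLAS interface on macOS
#include <cstdio>

int main() {
  double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
  // Dot product via Accelerate's CBLAS: 1*4 + 2*5 + 3*6 = 32
  std::printf("%f\n", cblas_ddot(3, x, 1, y, 1));
  return 0;
}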
@niketanpansare I ran the quick test above, and I did not see any of the accelerator log lines.
Can you double-check whether OpenBLAS is available to Java? See https://github.com/niketanpansare/incubator-systemml/blob/9f325e1ac7155bf6a1cf98fc8e314ac339a198a0/docs/accelerator.md#frequently-asked-questions Also, I agree with you that we should support Accelerate once this PR is in.
Well, I didn't install OpenBLAS, but I'll look into it on a Linux box. As for the rest of this PR, the
Thanks @dusenberrymw. We have three options:
(force-pushed from 1ef5a13 to e93b6ef)
…nd GPU backend
1. Support for automatic detection of BLAS (MKL and OpenBLAS) and GPU backend.
2. Added native matmult and conv2d functions. If the native library is not available, we fall back to the Java implementation.
3. This will allow us to explore a distributed GPU solution.
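To illustrate the shape of such a native binding (illustrative names only, e.g. the hypothetical class example.NativeBLAS; this is not the PR's actual JNI signature), a Java-callable matmult that delegates to CBLAS looks roughly like:

#include <jni.h>
#include <cblas.h>

// Hypothetical JNI entry point: Java passes row-major double[] buffers,
// we run dgemm, and return; if this library fails to load, the Java side
// simply keeps using its own LibMatrixMult implementation.
extern "C" JNIEXPORT void JNICALL
Java_example_NativeBLAS_dmmult(JNIEnv* env, jclass,
                               jdoubleArray a, jdoubleArray b,
                               jdoubleArray c, jint m, jint k, jint n) {
  jdouble* A = static_cast<jdouble*>(env->GetPrimitiveArrayCritical(a, nullptr));
  jdouble* B = static_cast<jdouble*>(env->GetPrimitiveArrayCritical(b, nullptr));
  jdouble* C = static_cast<jdouble*>(env->GetPrimitiveArrayCritical(c, nullptr));
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k, 1.0, A, k, B, n, 0.0, C, n);
  env->ReleasePrimitiveArrayCritical(c, C, 0);          // commit the result
  env->ReleasePrimitiveArrayCritical(b, B, JNI_ABORT);  // read-only inputs
  env->ReleasePrimitiveArrayCritical(a, A, JNI_ABORT);
}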
(force-pushed from 496b311 to 1da242e)
This is a standing PR to facilitate discussion on whether or not we should support native BLAS in SystemML. After discussion, and after resolving the issues with deployment, we can decide whether to turn on this feature by default. Since I wanted feedback from the community before proceeding, I did not complete the PR. The remaining tasks are:
I ran some preliminary performance experiments comparing conv2d with/without sparse+caching and with/without native BLAS. I provided a fairly large memory budget (-Xmx20g -Xms20g -Xmn2048m -server) and used OpenJDK 1.8 (64-Bit Server VM). The script tested the performance of conv2d for 1000 iterations using four commonly used setups:
To compile the native SystemML library, please use:
Please see below the results of the experiments. Both sparse and caching are disabled for the setups SystemML_native and SystemML_CP.

,Number of Iterations,Setup,Time in seconds
SystemML_native,1000,1,7.103096398
SystemML_CP,1000,1,6.498525426
SystemML_CP_WithCacheNSparseEnabled,1000,1,7.195620854
Tensorflow,1000,1,4.071731716
SystemML_native,1000,2,31.315343223
SystemML_CP,1000,2,81.769984552
SystemML_CP_WithCacheNSparseEnabled,1000,2,101.274622939
Tensorflow,1000,2,33.476548341
SystemML_native,1000,3,7.662274848
SystemML_CP,1000,3,6.355272119
SystemML_CP_WithCacheNSparseEnabled,1000,3,7.607337158
Tensorflow,1000,3,3.837932081
SystemML_native,1000,4,26.638438614
SystemML_CP,1000,4,49.716594505
SystemML_CP_WithCacheNSparseEnabled,1000,4,71.542244484
Tensorflow,1000,4,26.395180006
There are some additional overhead costs (such as initial compilation/validation, reuse of previously allocated but non-zeroed arrays, dynamic recompilation, GC, etc.) which we have not yet optimized. These costs are beyond the scope of this PR, and some of them are inherent to our design principles. We can work on them in a separate PR :)
@mboehm7 @bertholdreinwald @dusenberrymw @frreiss @prithvirajsen @fschueler @nakul02 @asurve @deroneriksson I understand the above experiments might not be sufficient to accept the change, and I would welcome your feedback on additional experiments/setups. I would also appreciate it if some of you were willing to help me with these experiments ;)
Here are the shapes of the matrix multiplications for the four setups:
I will provide an update soon comparing the results of the above matrix multiplications. If you are interested, here are the respective code paths for the matrix multiplications:
CP: https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixDNN.java#L327
Native: https://github.com/niketanpansare/incubator-systemml/blob/for_cpp/src/main/cpp/systemml.cpp#L163