[SYSTEMML-769] Support for native BLAS and simplify deployment for GPU backend #307
Conversation
Overall, I'm quite pleased to see that we can achieve ~equal performance to TensorFlow, particularly on case 4, which is the most probable scenario of the given tests. 26 seconds vs. 71 seconds in SystemML today is a huge difference, and this is even with a small image size (28x28) compared to, say, 256x256x3, and an even larger number of channels for intermediate layers. I'd also be interested in a comparison of a full network, such as the LeNet example in the SystemML-NN package, for the various SystemML setups. As for the BLAS integration, I think it would be a good idea to generically target a BLAS implementation, vs. a specific one. A lot of people will be using OpenBLAS, for example.
Once we all agree that adding support for native BLAS is a good idea, we can add the remaining operations and do a comparison on a full network 👍
I agree. In fact, the code is written against the cblas_* interface to make this happen. To switch over to a generic BLAS, all we need to change is:
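For context, here is a minimal sketch of a dense matmult written against the generic CBLAS interface (my own illustration, not the PR's actual change). Because MKL, OpenBLAS, ATLAS, and Accelerate all export these symbols, switching implementations becomes a link-time decision:

#include <cblas.h>  // header shipped by MKL, OpenBLAS, ATLAS, Accelerate, ...

// C = A * B for dense row-major matrices: (m x k) * (k x n) -> (m x n).
void denseMatMult(const double* A, const double* B, double* C,
                  int m, int n, int k) {
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k,
              1.0, A, k,   // alpha, A, leading dimension of A
              B, n,        // B, leading dimension of B
              0.0, C, n);  // beta, C, leading dimension of C
}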
The results look great 👍 I agree with @dusenberrymw's comments about generalizing the implementation to plug in any BLAS. We cannot assume the availability of Intel MKL on every system.
(force-pushed from 2dc5d86 to 4cc0484)
@bertholdreinwald @mboehm7 @fschueler I ran experiments comparing the performance of matrix multiplication for 1000 iterations. The DML script used for this comparison is as follows:

X = matrix(0.1, rows=$1, cols=$2)
Y = matrix(0.2, rows=$2, cols=$3)
Z = matrix(0, rows=$1, cols=$3)
for(i in 1:1000) {
  Z = Z + X %*% Y
}
print(as.scalar(Z[1,1]))

The values provided to $1, $2, and $3 are listed in the table below. For example, the datapoint in the red square (the rows marked <<< below) refers to the speedup of native ba+* over LibMatrixMult for the $2 = 5000 case.

Setup, $1, $2, $2, $3, Time in seconds
Native,1000,1,1,100,0.499
CP,1000,1,1,100,0.377
Native,1000,10,10,100,0.522
CP,1000,10,10,100,0.955
Native,1000,100,100,100,0.472
CP,1000,100,100,100,1.780
Native,1000,1000,1000,100,1.286
CP,1000,1000,1000,100,6.849
Native,1000,2000,2000,100,2.196
CP,1000,2000,2000,100,12.234
Native,1000,5000,5000,100,5.870 <<<
CP,1000,5000,5000,100,28.050 <<<
Native,1000,10000,10000,100,10.249
CP,1000,10000,10000,100,55.804

It is important to note that we are competitive with the native library for matrix-vector multiplication.
I'm a bit puzzled about the experimental setting and results. Are you comparing here sparse-sparse matrix multiply and sparse-sparse element-wise addition? If so, are you also using Sparse BLAS and the same sparse matrix representation for both (e.g., CSR)? Which fraction of the reported time is actually spent in matrix multiplication (excluding output allocation etc.)?
I am only comparing the time taken for dense matrix multiplication. The time reported is from our heavy hitters, so it only includes ba+* time.
Time for output allocation is included in both native and CP. I only edited LibMatrixMult to redirect to native for a fair comparison.
Hm, but why do you force dense matrix multiply on a sparse scenario of these shapes? Anyway, could you please factor out a couple of things: on which scenarios did we apply multi-threading, how much time was spent in output allocation, how much time was spent in the actual matrix multiply, and how much time was spent in computing/maintaining the nnz of the output? Thanks.
Maybe I am missing something: why do you think the scenario should be sparse? If you prefer, I can test with random matrices instead of all 0.1 or 0.2. I didn't check each case, but for a larger case, both Berthold and I noticed that all cores were being used by CP. Also, I tried to ensure that the overhead (output allocation and nnz maintenance) happens in both cases; please see LibMatrixNative in this PR. I suspect that if we factor out the overhead, the speedup will be even larger. Unfortunately, I won't be able to do any more experiments for at least another week, but I can share the setup if you or someone else wants to take the lead until then :)
Sorry, my bad. When skimming the scenario, I mistakenly read it as rand with sparsity 0.1 and 0.2; in that case you can safely ignore my previous comments. However, it would still be useful to understand the details of where the time goes and why the results show such non-monotonic behavior.
Just to clarify the issue of multi-threading: we have a parallelization threshold of 2 MFLOPs, as this is roughly the point where the creation of a thread pool is amortized. We experimented with a shared thread pool, but the integration, especially with parfor, was not very clean, so we did not put it into master. Please just annotate the cases where we don't use multi-threading to help understand the large variation.
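For intuition, the threshold logic amounts to something like the following sketch (my illustration, not SystemML's actual code; the 2 MFLOP constant is the figure quoted above):

#include <cstdint>

// A dense (m x k) %*% (k x n) matmult costs roughly 2*m*k*n FLOPs.
// Below ~2 MFLOPs, creating a thread pool costs more than it saves,
// so the single-threaded path is used.
bool useMultiThreadedMatMult(int64_t m, int64_t k, int64_t n) {
  const double PAR_THRESHOLD_FLOPS = 2e6;  // ~2 MFLOPs
  double flops = 2.0 * static_cast<double>(m)
                     * static_cast<double>(k)
                     * static_cast<double>(n);
  return flops >= PAR_THRESHOLD_FLOPS;
}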
Can we add a "WIP" tag to this PR?
(force-pushed from 4cc0484 to 4e9a033)
@mboehm7 I have created a branch with my experimental setup that annotates the cases where multi-threading was not enabled. Please see https://github.com/niketanpansare/incubator-systemml/blob/matmult_cpp_experiments/scripts/perftest/runMatMultExperiments.sh in case you want to reproduce the results on a different machine. I am running the experiments now and will provide an updated graph soon. Here is a preview of the results:

Java,SingleThreaded,1,1000,100,0.087730
IntelMKL,MultiThreaded,1,1000,1000,0.137696
OpenBLAS,MultiThreaded,1,1000,1000,1.190570
Java,SingleThreaded,1,1000,1000,0.619783
IntelMKL,MultiThreaded,1,1000,2000,0.220410
OpenBLAS,MultiThreaded,1,1000,2000,3.202002
Java,MultiThreaded,1,1000,2000,1.553647
IntelMKL,MultiThreaded,1,1000,5000,2.011847
OpenBLAS,MultiThreaded,1,1000,5000,9.932969
Java,MultiThreaded,1,1000,5000,2.930450
IntelMKL,MultiThreaded,1,1000,10000,7.010215
OpenBLAS,MultiThreaded,1,1000,10000,19.757398
Java,MultiThreaded,1,1000,10000,5.434230
IntelMKL,MultiThreaded,1,2000,1,0.023460
OpenBLAS,MultiThreaded,1,2000,1,0.019632
Java,SingleThreaded,1,2000,1,0.014827
IntelMKL,MultiThreaded,1,2000,10,0.065613
OpenBLAS,MultiThreaded,1,2000,10,0.033173
Java,SingleThreaded,1,2000,10,0.045123
IntelMKL,MultiThreaded,1,2000,100,0.118278
OpenBLAS,MultiThreaded,1,2000,100,0.239601
Java,SingleThreaded,1,2000,100,0.181720
IntelMKL,MultiThreaded,1,2000,1000,0.288750
OpenBLAS,MultiThreaded,1,2000,1000,3.235483
Java,MultiThreaded,1,2000,1000,1.489389
IntelMKL,MultiThreaded,1,2000,2000,2.614357
OpenBLAS,MultiThreaded,1,2000,2000,7.843344

@dusenberrymw @nakul02 @fschueler As an FYI, in some preliminary experiments, Java is outperforming OpenBLAS even with:

g++ -o /home/[user]/libsystemml.so systemml-cpp/systemml.cpp -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.x86_64/include -Isystemml-cpp -I/opt/openblas/include -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.x86_64/include/linux -lopenblas -lpthread -lm -ldl -DUSE_OPEN_BLAS -L/opt/openblas/lib -fopenmp -O3 -shared -fPIC -mavx -mfma

@fschueler Either I am not compiling OpenBLAS in an optimal manner (i.e., AVX/SSE2/FMA are not getting turned on by default) or your previous experiments might have been using OpenCL?
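One way to rule out the compiler side is a quick sanity check of the predefined macros (a hypothetical helper of mine; compile it with the same flags as libsystemml.so above). Note that OpenBLAS selects its own kernels when OpenBLAS itself is built, independently of these flags:

#include <cstdio>

int main() {
#ifdef __AVX__
  std::puts("__AVX__ defined: AVX code generation is on");
#else
  std::puts("__AVX__ not defined");
#endif
#ifdef __FMA__
  std::puts("__FMA__ defined: FMA code generation is on");
#else
  std::puts("__FMA__ not defined");
#endif
  return 0;
}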
@mboehm7 Please see below the graphs comparing speedup with Intel MKL and OpenBLAS, respectively. As per your suggestion, I have marked the cases where multi-threading was not enabled. @fschueler I think we do not see the speedup in OpenBLAS because it is likely not compiled with AVX. The R script used to generate the above graphs:

library(ggplot2)
t = read.csv("time1.txt", header=TRUE)
t = subset(t, BLAS != "IntelMKL" & Time > 0)
temp = subset(t, BLAS == "Java")
for(i in 1:nrow(temp)) {
  M1 = temp[i,]$M
  N2 = temp[i,]$N
  K1 = temp[i,]$K
  temp[i,]$Time = subset(t, BLAS == "Java" & M == M1 & N == N2 & K == K1)$Time / subset(t, BLAS != "Java" & M == M1 & N == N2 & K == K1)$Time
}
qplot(N, Time, data=temp, color=IsSingleThreaded, facets=M~K, xlab="Common dimension (N)", ylab="Speedup", main="Matrix multiplication (Speedup = CP/OpenBLAS)")

@nakul02 @fschueler @dusenberrymw Can you help me with the Mac compilation commands for OpenBLAS and Intel MKL? See https://github.com/niketanpansare/incubator-systemml/blob/for_cpp/src/main/python/setup.py#L70
@fschueler I used the following steps to compile OpenBLAS:

git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS
make FC=gfortran
sudo make PREFIX=/opt/openblas install

Also, here is the output of ./getarch 0:

$ ./getarch 0
CORE=HASWELL
LIBCORE=haswell
NUM_CORES=24
HAVE_MMX=1
HAVE_SSE=1
HAVE_SSE2=1
HAVE_SSE3=1
HAVE_SSSE3=1
HAVE_SSE4_1=1
HAVE_SSE4_2=1
HAVE_AVX=1
HAVE_FMA3=1
(force-pushed from 597757d to 46d9b04)
To summarize, I tried the following with OpenBLAS, but still saw performance degradation:
My guess is that this is due to a performance bug in OpenBLAS on multi-socket machines: OpenMathLib/OpenBLAS#611

$ cat /proc/cpuinfo | grep "physical id" | sort -u | wc -l
2
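If the multi-socket bug is indeed the culprit, one workaround worth trying (my own suggestion, untested in this thread) is to cap OpenBLAS at one socket's worth of threads, e.g. 12 of the 24 cores reported by getarch above, and additionally pin the process to one socket with taskset -c 0-11:

#include <cblas.h>

// Exported by OpenBLAS itself; caps the number of BLAS worker threads.
extern "C" void openblas_set_num_threads(int num_threads);

int main() {
  openblas_set_num_threads(12);  // one socket's worth of cores (assumption)
  double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4];
  // Tiny 2x2 multiply just to exercise the configured library.
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
  return 0;
}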
(force-pushed from 29066ad to 263e32d)
I have gotten this PR to a functionally complete state. Here are the features included as part of this PR:
@nakul02 @mboehm7 @bertholdreinwald @dusenberrymw @deroneriksson @asurve @gweidner @fschueler Can you please try the following script in your environment and comment back on this PR with the log lines prefixed by accelerator (as in the examples below)?

git clone https://github.com/niketanpansare/incubator-systemml.git
cd incubator-systemml
git checkout for_cpp
mvn clean package -P distribution
pip install target/systemml-0.12.0-incubating-SNAPSHOT-python.tgz
pyspark
>>> from systemml import random
>>> m1 = random.uniform(size=(1000,1000))
>>> m2 = random.uniform(size=(1000,1000))
>>> m3 = m1.dot(m2).toNumPy()

Here are the logs from my setups:
On machine 1:
16/12/22 11:21:38 INFO accelerator.LibraryLoader: Unable to load MKL:no mkl_rt in java.library.path
16/12/22 11:21:38 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 11:21:38 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas
On machine 2 (SYSTEMML_GPU and SYSTEMML_BLAS not set):
16/12/22 14:30:46 INFO accelerator.BLASHelper: Found BLAS: mkl
16/12/22 14:30:46 INFO accelerator.BLASHelper: Successfully loaded systemml library with mkl
16/12/22 14:30:52 INFO accelerator.JCudaHelper: Total number of GPUs on the machine: 2
16/12/22 14:30:52 INFO accelerator.JCudaHelper: GPU is enabled
On machine 2 (SYSTEMML_GPU not set and SYSTEMML_BLAS set to openblas):
16/12/22 14:33:07 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 14:33:07 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas
16/12/22 14:33:12 INFO accelerator.JCudaHelper: Total number of GPUs on the machine: 2
16/12/22 14:33:12 INFO accelerator.JCudaHelper: GPU is enabled
On machine 2 (SYSTEMML_GPU set to none and SYSTEMML_BLAS set to openblas):
16/12/22 14:34:05 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 14:34:05 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas
16/12/22 14:34:05 INFO accelerator.JCudaHelper: Not loading JCUDA as SYSTEMML_GPU=none
On machine 2 (SYSTEMML_GPU set to none and SYSTEMML_BLAS set to none):
16/12/22 14:34:54 INFO accelerator.BLASHelper: Not loading native BLAS as SYSTEMML_BLAS=none
16/12/22 14:34:54 INFO accelerator.JCudaHelper: Not loading JCUDA as SYSTEMML_GPU=none

If you have a few more free cycles, please run the microbenchmarks: https://github.com/niketanpansare/incubator-systemml/tree/for_cpp/scripts/perftest/microbenchmarks
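For readers following along, the behavior implied by the logs above amounts to something like this sketch (hypothetical and Linux-only; link with -ldl; the actual loading happens on the Java side via java.library.path): SYSTEMML_BLAS=none disables native BLAS entirely, otherwise MKL is preferred over OpenBLAS, and if neither loads, SystemML falls back to the Java LibMatrixMult.

#include <cstdlib>
#include <cstring>
#include <dlfcn.h>

// Returns a handle to a native BLAS, or nullptr to signal fallback to Java.
void* loadNativeBLAS() {
  const char* pref = std::getenv("SYSTEMML_BLAS");
  if (pref != nullptr && std::strcmp(pref, "none") == 0)
    return nullptr;                                   // explicitly disabled
  if (pref == nullptr || std::strcmp(pref, "mkl") == 0)
    if (void* h = dlopen("libmkl_rt.so", RTLD_LAZY))  // try MKL first
      return h;
  return dlopen("libopenblas.so", RTLD_LAZY);         // then OpenBLAS
}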
All tests passed here.
(force-pushed from 78c7ae4 to e2d9a16)
@niketanpansare W.r.t. BLAS on OS X / macOS: currently, the preferred and default BLAS implementation is Apple Accelerate. Like ATLAS, OpenBLAS, Intel MKL, etc., Accelerate implements the BLAS API, and in terms of performance, Accelerate is generally the fastest BLAS implementation on OS X / macOS. Also, just to be clear, Accelerate is part of OS X / macOS and does not require any installation. For example:

import numpy as np
np.__config__.show()
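And since Accelerate exports the standard CBLAS symbols, the same cblas_* code path should work on macOS with no extra installation (a sketch under that assumption; compile with clang++ test.cpp -framework Accelerate):

#include <Accelerate/Accelerate.h>  // provides the CBLAS interface on macOS
#include <cstdio>

int main() {
  double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
  // Dot product via Accelerate's CBLAS: 1*4 + 2*5 + 3*6 = 32
  std::printf("%f\n", cblas_ddot(3, x, 1, y, 1));
  return 0;
}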
@niketanpansare I ran the quick test above, and I did not see any of the accelerator log lines.
Can you double-check whether OpenBLAS is available to Java? See https://github.com/niketanpansare/incubator-systemml/blob/9f325e1ac7155bf6a1cf98fc8e314ac339a198a0/docs/accelerator.md#frequently-asked-questions Also, I agree with you that we should support Accelerate once this PR is in.
Well, I didn't install OpenBLAS, but I'll look into it on a Linux box. As for the rest of this PR, the
Thanks @dusenberrymw. We have three options:
(force-pushed from 1ef5a13 to e93b6ef)
…nd GPU backend
1. Support for automatic detection of BLAS (MKL and OpenBLAS) and GPU backend.
2. Added native matmult and conv2d functions. If the native library is not available, we fall back to the Java implementation.
3. This will allow us to explore a distributed GPU solution.
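To illustrate the shape of such a native binding (illustrative names only, e.g. the hypothetical class example.NativeBLAS; this is not the PR's actual JNI signature), a Java-callable matmult that delegates to CBLAS looks roughly like:

#include <jni.h>
#include <cblas.h>

// Hypothetical JNI entry point: Java passes row-major double[] buffers,
// we run dgemm, and return; if this library fails to load, the Java side
// simply keeps using its own LibMatrixMult implementation.
extern "C" JNIEXPORT void JNICALL
Java_example_NativeBLAS_dmmult(JNIEnv* env, jclass,
                               jdoubleArray a, jdoubleArray b,
                               jdoubleArray c, jint m, jint k, jint n) {
  jdouble* A = static_cast<jdouble*>(env->GetPrimitiveArrayCritical(a, nullptr));
  jdouble* B = static_cast<jdouble*>(env->GetPrimitiveArrayCritical(b, nullptr));
  jdouble* C = static_cast<jdouble*>(env->GetPrimitiveArrayCritical(c, nullptr));
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, n, k, 1.0, A, k, B, n, 0.0, C, n);
  env->ReleasePrimitiveArrayCritical(c, C, 0);          // commit the result
  env->ReleasePrimitiveArrayCritical(b, B, JNI_ABORT);  // read-only inputs
  env->ReleasePrimitiveArrayCritical(a, A, JNI_ABORT);
}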
(force-pushed from 496b311 to 1da242e)
This is a standing PR to facilitate discussion on whether or not we should support native BLAS in SystemML. After discussion, and after resolving the issues with deployment, we can decide whether to turn on this feature by default. Since I wanted feedback from the community before proceeding, I did not complete the PR. The remaining tasks are:
I ran some preliminary performance experiments comparing conv2d with/without sparse+caching and with/without native BLAS. I provided a fairly large memory budget (-Xmx20g -Xms20g -Xmn2048m -server) and used OpenJDK 1.8 (64-Bit Server VM). The script tested the performance of conv2d for 1000 iterations using four commonly used setups:
To compile the native SystemML library, please use:
Please see below the results of the experiments. Both sparse and caching are disabled for the setups SystemML_native and SystemML_CP.

,Number of Iterations,Setup,Time in seconds
SystemML_native,1000,1,7.103096398
SystemML_CP,1000,1,6.498525426
SystemML_CP_WithCacheNSparseEnabled,1000,1,7.195620854
Tensorflow,1000,1,4.071731716
SystemML_native,1000,2,31.315343223
SystemML_CP,1000,2,81.769984552
SystemML_CP_WithCacheNSparseEnabled,1000,2,101.274622939
Tensorflow,1000,2,33.476548341
SystemML_native,1000,3,7.662274848
SystemML_CP,1000,3,6.355272119
SystemML_CP_WithCacheNSparseEnabled,1000,3,7.607337158
Tensorflow,1000,3,3.837932081
SystemML_native,1000,4,26.638438614
SystemML_CP,1000,4,49.716594505
SystemML_CP_WithCacheNSparseEnabled,1000,4,71.542244484
Tensorflow,1000,4,26.395180006
There are some additional overhead costs (such as initial compilation/validation, reuse of previously allocated but non-zeroed arrays, dynamic recompilation, GC, etc.) which we have not yet optimized. These costs are beyond the scope of this PR, and some of them are inherent to our design principles. We can work on them in a separate PR :)
@mboehm7 @bertholdreinwald @dusenberrymw @frreiss @prithvirajsen @fschueler @nakul02 @asurve @deroneriksson I understand the above experiments might not be sufficient to accept the change, and I would welcome your feedback on additional experiments/setups. I would also appreciate it if some of you were willing to help me with these experiments ;)
Here are the shapes of the matrix multiplications for the four setups:
I will provide an update soon comparing the results of the above matrix multiplications. If you are interested, here are the respective code paths for the matrix multiplications:
CP: https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/matrix/data/LibMatrixDNN.java#L327
Native: https://github.com/niketanpansare/incubator-systemml/blob/for_cpp/src/main/cpp/systemml.cpp#L163