[SYSTEMML-769] Support for native BLAS and simplify deployment for GPU backend #307

Closed
wants to merge 8 commits

Conversation

niketanpansare
Contributor

This is a standing PR to facilitate discussion on whether or not we should support native BLAS in SystemML. After discussion and after resolving the issues with deployment, we can decide whether to turn on this feature by default. Since I wanted feedback from the community before proceeding, I have not completed the PR. The remaining tasks are:

  • Generalize to other BLAS, not just MKL. This would also involve completing the CMake file.
  • Add other operations: conv2d_backward_*, etc.

I ran some preliminary performance experiments comparing conv2d with/without sparse+caching and with/without native BLAS. I provided a fairly large memory budget (-Xmx20g -Xms20g -Xmn2048m -server) and used the OpenJDK 1.8 64-Bit Server VM. The script tested the performance of conv2d using four commonly used setups for 1000 iterations:

max_iterations = 1000
setup = $2
numFilters = -1
numChannels = -1
filterSize = -1
pad = -1
if(setup == 1) {
        numFilters = 20
        numChannels = 1
        filterSize = 5
        pad = 0
}
else if(setup == 2) {
        numFilters = 50
        numChannels = 20
        filterSize = 5
        pad = 0
}
else if(setup == 3) {
        numFilters = 20
        numChannels = 1
        filterSize = 3
        pad = 1
}
else if(setup == 4) {
        numFilters = 50
        numChannels = 20
        filterSize = 3
        pad = 1
}
else {
        stop('Incorrect setup (needs to be [1, 4]).')
}
imgSize = 28
n = 60000
X = rand(rows=n, cols=numChannels*imgSize*imgSize)
batch_size = 64
w = rand(rows=numFilters, cols=numChannels*filterSize*filterSize)
P = (imgSize + 2 * pad - filterSize)  + 1
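# e.g., setup 1 (imgSize=28, pad=0, filterSize=5): P = 28 + 0 - 5 + 1 = 24,
# so each conv2d output row has numFilters*P*P = 20*24*24 = 11520 entries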
foo = matrix(0, rows=n, cols=numFilters*P*P)
for(iter in 1:max_iterations) {
        beg = (iter * batch_size) %% n + 1
        end = min(n, beg + batch_size)
        X_batch = X[beg:end, ]
        n_batch = nrow(X_batch)
        convOut_1 = conv2d(X_batch, w, input_shape=[n_batch,numChannels,imgSize,imgSize], filter_shape=[numFilters,numChannels,filterSize,filterSize], padding=[pad,pad], stride=[1,1])
        foo = convOut_1
}
print(sum(foo))

To compile the native SystemML library, please use:

export MKLROOT=/opt/intel/mkl
export JAVA_HOME=....
# Please go to https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor to find LINKER_OPTIONS and COMPILER_OPTIONS
export LINKER_OPTIONS=" -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_rt -lpthread -lm -ldl"
export COMPILER_OPTIONS=" -m64 -I${MKLROOT}/include"
g++ -shared -fPIC -o libsystemml.so systemml.cpp -I. -I$JAVA_HOME/include -I$JAVA_HOME/include/linux -lm -fopenmp -O3 $LINKER_OPTIONS $COMPILER_OPTIONS 

Please see below the results of the experiments. Both sparse and caching are disabled for the setups SystemML_native and SystemML_CP.

Configuration, Number of Iterations, Setup, Time in seconds
SystemML_native,1000,1,7.103096398
SystemML_CP,1000,1,6.498525426
SystemML_CP_WithCacheNSparseEnabled,1000,1,7.195620854
Tensorflow,1000,1,4.071731716
SystemML_native,1000,2,31.315343223
SystemML_CP,1000,2,81.769984552
SystemML_CP_WithCacheNSparseEnabled,1000,2,101.274622939
Tensorflow,1000,2,33.476548341
SystemML_native,1000,3,7.662274848
SystemML_CP,1000,3,6.355272119
SystemML_CP_WithCacheNSparseEnabled,1000,3,7.607337158
Tensorflow,1000,3,3.837932081
SystemML_native,1000,4,26.638438614
SystemML_CP,1000,4,49.716594505
SystemML_CP_WithCacheNSparseEnabled,1000,4,71.542244484
Tensorflow,1000,4,26.395180006

There are some additional overhead costs (such as initial compilation/validation, reuse of previously allocated but non-zeroed arrays, dynamic recompilation, GC, etc.) which we have not yet optimized. These costs are beyond the scope of this PR and some of them are inherent to our design principles. We can work on them in a separate PR :)

@mboehm7 @bertholdreinwald @dusenberrymw @frreiss @prithvirajsen @fschueler @nakul02 @asurve @deroneriksson I understand the above experiments might not be sufficient to accept the change and would welcome your feedback on additional experiments/setups. I would also appreciate it if some of you are willing to help me with these experiments ;)

Here are the shapes of the matrix multiplications for the four setups (a GEMM sketch follows the list):

Setup 1:
64 parallel matrix multiplications of shape (20, 25) %*% (25, 576) executed 1000 times.

Setup 2:
64 parallel matrix multiplications of shape (50, 500) %*% (500, 576) executed 1000 times.

Setup 3:
64 parallel matrix multiplications of shape (20, 25) %*% (25, 784) executed 1000 times.

Setup 4:
64 parallel matrix multiplications of shape (50, 500) %*% (500, 784) executed 1000 times.
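To make the mapping to BLAS concrete, here is a minimal, illustrative C++ sketch (my own sketch, not the PR's actual systemml.cpp) of setup 1: each image in the 64-image mini-batch contributes one (20, 25) %*% (25, 576) GEMM, issued via cblas_dgemm and parallelized across the batch with OpenMP.

// Illustrative sketch only: setup 1 lowered to one cblas_dgemm call per image,
// i.e. 64 parallel (20, 25) %*% (25, 576) multiplications per mini-batch.
#include <cblas.h>
#include <vector>

int main() {
  const int batch = 64;   // mini-batch size
  const int F = 20;       // numFilters
  const int CKK = 25;     // numChannels * filterSize * filterSize
  const int PQ = 576;     // P * P = 24 * 24 for setup 1

  std::vector<double> filter(F * CKK, 0.1);           // shared filter matrix (F x CKK)
  std::vector<double> im2col(batch * CKK * PQ, 0.2);  // one im2col buffer per image (CKK x PQ)
  std::vector<double> out(batch * F * PQ, 0.0);       // conv2d output per image (F x PQ)

  #pragma omp parallel for
  for (int i = 0; i < batch; i++) {
    // out_i = filter %*% im2col_i
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                F, PQ, CKK,
                1.0, filter.data(), CKK,
                im2col.data() + i * CKK * PQ, PQ,
                0.0, out.data() + i * F * PQ, PQ);
  }
  return 0;
}

The other setups only change F, CKK and PQ; compiling and linking this against MKL or OpenBLAS (as in the g++ commands elsewhere in this thread) reproduces the per-setup GEMM shapes listed above.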

I will provide an update soon comparing the results of the above matrix multiplications. If you are interested, here are the respective code paths for the matrix multiplications:

@dusenberrymw
Contributor

Overall, I'm quite pleased to see that we can achieve ~equal performance to TensorFlow, particularly on case 4, which is the most probable scenario of the given tests. 26 seconds vs 71 seconds in SystemML today is a huge difference, and this is even with a small image size (28*28) compared to, say, (256*256*3), and an even larger number of channels for intermediate layers.

I'd also be interested in a comparison of a full network, such as the LeNet example in the SystemML-NN package for the various SystemML setups.

As for the BLAS integration, I think it would be a good idea to generically target a BLAS implementation, vs. a specific one. A lot of people will be using OpenBLAS, for example.

@niketanpansare
Contributor Author

I'd also be interested in a comparison of a full network, such as the LeNet example in the SystemML-NN package for the various SystemML setups.

Once we all agree that adding support for native BLAS is a good idea, we can add remaining operations and do comparison for a full network 👍

As for the BLAS integration, I think it would be a good idea to generically target a BLAS implementation, vs. a specific one. A lot of people will be using OpenBLAS, for example.

I agree. In fact, the code is written against the cblas_* interface to make this happen. To switch over to a generic BLAS, all we need to change is:

  1. Generalize two methods for setting the number of threads via conditional compilation (a sketch follows below this list): https://github.com/apache/incubator-systemml/pull/307/files#diff-601ed9dd7a60c7a3f26c91e73f68ea5cR48
  2. Update the headers: https://github.com/apache/incubator-systemml/pull/307/files#diff-601ed9dd7a60c7a3f26c91e73f68ea5cR32
  3. Edit CMake file: https://github.com/niketanpansare/incubator-systemml/blob/2dc5d862c107e6906c04a8ca66210a88b8547d2b/src/main/cpp/CMakeLists.txt
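For point 1, here is a minimal sketch of what the conditional compilation could look like (USE_OPEN_BLAS matches the -DUSE_OPEN_BLAS flag used later in this thread for the OpenBLAS build; USE_INTEL_MKL and the helper names are assumptions for illustration, not the PR's actual code):

// Sketch only: BLAS-agnostic helpers for the thread-count calls, selected at compile time.
#ifdef USE_INTEL_MKL            // assumed macro name for the MKL build
  #include <mkl.h>
  #include <mkl_service.h>
  static void setBLASNumThreads(int numThreads) { mkl_set_num_threads(numThreads); }
  static int  getBLASNumThreads()               { return mkl_get_max_threads(); }
#elif defined(USE_OPEN_BLAS)    // passed via -DUSE_OPEN_BLAS when compiling against OpenBLAS
  #include <cblas.h>
  extern "C" void openblas_set_num_threads(int num_threads);
  extern "C" int  openblas_get_num_threads(void);
  static void setBLASNumThreads(int numThreads) { openblas_set_num_threads(numThreads); }
  static int  getBLASNumThreads()               { return openblas_get_num_threads(); }
#else
  // Generic CBLAS has no portable thread-count API, so these become no-ops.
  #include <cblas.h>
  static void setBLASNumThreads(int numThreads) { (void) numThreads; }
  static int  getBLASNumThreads()               { return 1; }
#endif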

@nakul02
Member

nakul02 commented Dec 7, 2016

The results look great 👍

I agree with @dusenberrymw's comments about generalizing the implementation to plug in any BLAS. We cannot assume the availability of Intel MKL on every system.

@niketanpansare
Contributor Author

@bertholdreinwald @mboehm7 @fschueler I ran experiments comparing the performance of matrix multiplication for 1000 iterations. The DML script used for this comparison is as follows:

X = matrix(0.1, rows=$1, cols=$2)
Y = matrix(0.2, rows=$2, cols=$3)
Z = matrix(0, rows=$1, cols=$3)
for(i in 1:1000) {
        Z = Z + X %*% Y
}
print(as.scalar(Z[1,1]))

The values provided for $1, $2, and $3 are drawn from {1, 10, 100, 1000, 2000, 5000, 10000}, allowing us to test the performance of matrix multiplication for different shapes. I then extracted the time required for "ba+*" from the heavy hitters of each case and plotted it. The common dimension (i.e. $2) is plotted along the X-axis and the speedup (time for ba+* in CP divided by time for ba+* with native BLAS) is plotted along the Y-axis. The number of rows of X (i.e. $1) and the number of columns of Y (i.e. $3) are used as facets.

For example, the datapoint in the red square refers to the speedup of native ba+* over LibMatrixMult for the case (1000, 5000) %*% (5000, 100). Here are the results for the cases in that box:

Setup, $1, $2, $2, $3, Time in seconds
Native,1000,1,1,100,0.499
CP,1000,1,1,100,0.377
Native,1000,10,10,100,0.522
CP,1000,10,10,100,0.955
Native,1000,100,100,100,0.472
CP,1000,100,100,100,1.780
Native,1000,1000,1000,100,1.286
CP,1000,1000,1000,100,6.849
Native,1000,2000,2000,100,2.196
CP,1000,2000,2000,100,12.234
Native,1000,5000,5000,100,5.870      <<<
CP,1000,5000,5000,100,28.050         <<<
Native,1000,10000,10000,100,10.249
CP,1000,10000,10000,100,55.804

[Figure: matmultspeedup - speedup of native ba+* over CP, faceted by $1 and $3]

It is important to note that we are competitive with the native library for matrix-vector multiplication.

@mboehm7
Contributor

mboehm7 commented Dec 8, 2016

I'm a bit puzzled about the experimental setting and results. Are you comparing here sparse-sparse matrix multiply and sparse-sparse element-wise addition? If so, are you also using Sparse BLAS and the same sparse matrix representation for both (e.g., CSR)? Which fraction of the reported time is actually spent in matrix multiplication (excluding output allocation etc)?

@niketanpansare
Contributor Author

I am only comparing the time taken for dense matrix multiplication. The time reported is from our heavy hitters, so only includes ba+* time.

@niketanpansare
Contributor Author

Time for output allocation is included in both native and CP. I only edited LibMatrixMult to redirect to native for fair comparison.

@mboehm7
Contributor

mboehm7 commented Dec 8, 2016

Hm, but why do you force dense matrix multiply on a sparse scenario of these shapes?

Anyway, could you please factor out a couple of things: on which scenarios did we apply multi-threading, how much time was spent in output allocation, how much time was spent in the actual matrix multiply, and how much time was spent in computing/maintaining the nnz in the output. Thanks.

@niketanpansare
Contributor Author

Maybe I am missing something, but why do you think the scenario should be sparse? If you prefer, I can test with random matrices instead of all 0.1 or 0.2. I didn't check each case, but for a larger case both Berthold and I noticed all cores being used by CP. Also, I tried to ensure that the overhead (output allocation and nnz maintenance) happens in both cases ... please see LibMatrixNative in this PR. I suspect that if we factor out the overhead, the speedup will be even greater. Unfortunately, I won't be able to do any more experiments for at least another week, but I can share the setup if you or someone else wants to take the lead until then :)

@mboehm7
Contributor

mboehm7 commented Dec 8, 2016

Sorry, my bad - when skimming the scenario, I mistakenly read it as rand with sparsity 0.1 and 0.2. In that case you can safely ignore my previous comments. However, it would still be useful to actually understand the details of where the time goes and why the results show such non-monotonic behavior.

@mboehm7
Contributor

mboehm7 commented Dec 8, 2016

Just to clarify the issue of multi-threading: we have a parallelization threshold of 2 MFLOPs, as this is roughly the point where the creation of a thread pool is amortized. We experimented with a shared thread pool, but the integration, especially with parfor, was not very clean, so we did not put it into master. Please just annotate the cases where we don't use multi-threading to help understand the large variation.
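For context, that threshold is essentially a FLOP estimate of the multiply; a hypothetical sketch of the decision (illustrative helper, not SystemML's actual LibMatrixMult code):

// Hypothetical illustration of the ~2 MFLOP parallelization threshold described above.
// A dense (m x k) %*% (k x n) multiply costs roughly 2*m*k*n FLOPs; below the threshold,
// spinning up a thread pool costs more than it saves, so a single thread is used.
static bool useMultiThreadedMatMult(long m, long k, long n) {
  const double kParallelizationThresholdFlops = 2e6;  // ~2 MFLOPs
  return 2.0 * (double) m * k * n >= kParallelizationThresholdFlops;
}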

@dusenberrymw
Contributor

Can we add a "WIP" tag to this PR?

@niketanpansare niketanpansare changed the title [SYSTEMML-769] Support for native BLAS in SystemML [SYSTEMML-769] [WIP] Support for native BLAS in SystemML Dec 16, 2016
@niketanpansare
Contributor Author

niketanpansare commented Dec 18, 2016

@mboehm7 I have created a branch with my experimental setup that annotates the cases where multi-threading was not enabled. Please see https://github.com/niketanpansare/incubator-systemml/blob/matmult_cpp_experiments/scripts/perftest/runMatMultExperiments.sh in case you want to reproduce the results on a different machine.

I am running the experiments now and will provide an updated graph soon. Here is a preview of the results:

Java,SingleThreaded,1,1000,100,0.087730
IntelMKL,MultiThreaded,1,1000,1000,0.137696
OpenBLAS,MultiThreaded,1,1000,1000,1.190570
Java,SingleThreaded,1,1000,1000,0.619783
IntelMKL,MultiThreaded,1,1000,2000,0.220410
OpenBLAS,MultiThreaded,1,1000,2000,3.202002
Java,MultiThreaded,1,1000,2000,1.553647
IntelMKL,MultiThreaded,1,1000,5000,2.011847
OpenBLAS,MultiThreaded,1,1000,5000,9.932969
Java,MultiThreaded,1,1000,5000,2.930450
IntelMKL,MultiThreaded,1,1000,10000,7.010215
OpenBLAS,MultiThreaded,1,1000,10000,19.757398
Java,MultiThreaded,1,1000,10000,5.434230
IntelMKL,MultiThreaded,1,2000,1,0.023460
OpenBLAS,MultiThreaded,1,2000,1,0.019632
Java,SingleThreaded,1,2000,1,0.014827
IntelMKL,MultiThreaded,1,2000,10,0.065613
OpenBLAS,MultiThreaded,1,2000,10,0.033173
Java,SingleThreaded,1,2000,10,0.045123
IntelMKL,MultiThreaded,1,2000,100,0.118278
OpenBLAS,MultiThreaded,1,2000,100,0.239601
Java,SingleThreaded,1,2000,100,0.181720
IntelMKL,MultiThreaded,1,2000,1000,0.288750
OpenBLAS,MultiThreaded,1,2000,1000,3.235483
Java,MultiThreaded,1,2000,1000,1.489389
IntelMKL,MultiThreaded,1,2000,2000,2.614357
OpenBLAS,MultiThreaded,1,2000,2000,7.843344

@dusenberrymw @nakul02 @fschueler As an FYI, in some preliminary experiments Java is outperforming OpenBLAS (but not Intel MKL), even with the avx and fma flags. The command used to compile with OpenBLAS is:

g++ -o /home/[user]/libsystemml.so systemml-cpp/systemml.cpp  -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.x86_64/include -Isystemml-cpp -I/opt/openblas/include -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-1.b15.el7_2.x86_64/include/linux -lopenblas -lpthread -lm -ldl -DUSE_OPEN_BLAS -L/opt/openblas/lib -fopenmp -O3 -shared -fPIC -mavx -mfma

@fschueler Either I am not compiling OpenBLAS in an optimal manner (i.e., AVX/SSE2/FMA are not getting turned on by default), or your previous experiments might have been using OpenCL?

@niketanpansare
Contributor Author

@mboehm7 Please see below the graphs comparing speedup with Intel MKL and OpenBLAS respectively. As per your suggestion, I have marked the cases where multi-threading was not enabled.

@fschueler I think we do not see a speedup with OpenBLAS because it is likely not compiled with AVX.

[Figure: dec18_mkl - speedup with Intel MKL]
[Figure: dec18_openblas - speedup with OpenBLAS]

The R script used to generate the above graphs:

library(ggplot2)  # required for qplot

t = read.csv("time1.txt", header=TRUE)
t = subset(t, BLAS != "IntelMKL" & Time > 0)
temp = subset(t, BLAS == "Java")
# Replace each Java row's Time with the speedup CP (Java) time / OpenBLAS time for the same (M, N, K)
for(i in 1:nrow(temp)) {
  M1 = temp[i,]$M
  N2 = temp[i,]$N
  K1 = temp[i,]$K
  temp[i,]$Time = subset(t, BLAS == "Java" & M == M1 & N == N2 & K == K1)$Time /  subset(t, BLAS != "Java" & M == M1 & N == N2 & K == K1)$Time
}
qplot(N, Time, data=temp, color=IsSingleThreaded, facets=M~K, xlab="Common dimension (N)", ylab="Speedup", main="Matrix multiplication (Speedup = CP/OpenBLAS)")

@nakul02 @fschueler @dusenberrymw Can you help me with the Mac commands for OpenBLAS and Intel MKL? https://github.com/niketanpansare/incubator-systemml/blob/for_cpp/src/main/python/setup.py#L70

@niketanpansare
Contributor Author

@fschueler I used the following steps to compile OpenBLAS:

git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS
make FC=gfortran
sudo make PREFIX=/opt/openblas install

Also, the output of the getarch utility included in the OpenBLAS directory is as follows:

$ ./getarch 0
CORE=HASWELL
LIBCORE=haswell
NUM_CORES=24
HAVE_MMX=1
HAVE_SSE=1
HAVE_SSE2=1
HAVE_SSE3=1
HAVE_SSSE3=1
HAVE_SSE4_1=1
HAVE_SSE4_2=1
HAVE_AVX=1
HAVE_FMA3=1

@niketanpansare
Contributor Author

To summarize, I tried the following with OpenBLAS, but still saw performance degradation:

  1. Compiling OpenBLAS from source:

    git clone https://github.com/xianyi/OpenBLAS
    cd OpenBLAS
    make FC=gfortran
    sudo make PREFIX=/opt/openblas install
  2. Using released OpenBLAS:

    sudo yum install openblas
    sudo ln -s /lib64/libopenblas.so.0 /usr/lib/libopenblas.so
  3. Also, commented out any threading logic in systemml.cpp to rule out the possibility of a bug due to the extension APIs:

        //if(NUM_THREADS == -1) {
        //      NUM_THREADS = openblas_get_num_threads();
        //}
        // openblas_set_num_threads(NUM_THREADS);
  4. Even though I was using a newer CentOS 7 and gcc 4.8 (both supporting AVX instructions), I also tried updating binutils and reinstalling OpenBLAS as per https://github.com/xianyi/OpenBLAS/wiki/faq#binutils

My guess is that this is due to a performance bug in OpenBLAS on multi-socket machines: OpenMathLib/OpenBLAS#611

$ cat /proc/cpuinfo | grep "physical id" | sort -u | wc -l
2

@niketanpansare niketanpansare changed the title [SYSTEMML-769] [WIP] Support for native BLAS in SystemML [SYSTEMML-769] Support for native BLAS in SystemML Dec 22, 2016
@niketanpansare
Contributor Author

niketanpansare commented Dec 22, 2016

I have gotten this PR to a functionally complete state. Here are the features included as part of this PR:

  1. Automatic detection of BLAS and GPU, as well as inclusion of the correct dependency. Please see http://niketanpansare.github.io/incubator-systemml/accelerator for usage.
  • By default, if the user invokes SystemML via the command line or the Scala shell, everything works as in our existing setup. Only when the user provides systemml-accelerator.jar do GPU and BLAS potentially get enabled.
  • No additional steps to include the JCuda jars and native libraries are required.
  • This setup will also work in hybrid cluster environments (with/without GPUs, different BLAS).
  • We support GPU on Linux (x86_64 and powerpc), Mac (x86_64), Windows (x86_64) and BLAS on Linux (x86_64), Mac (x86_64), Windows (x86_64). For other setups, we fall back to LibMatrixMult and the non-GPU backend.
  2. Support for BLAS-based matrix multiplication, conv2d and conv2d_backward_data.

  3. Performance comparison of LibMatrixMult with MKL and OpenBLAS.

@nakul02 @mboehm7 @bertholdreinwald @dusenberrymw @deroneriksson @asurve @gweidner @fschueler Can you please try the following script in your environment and comment back on this PR with the log lines prefixed by INFO accelerator? This should help us identify the setups on which this PR has been tested.

git clone https://github.com/niketanpansare/incubator-systemml.git
cd incubator-systemml
git checkout for_cpp
mvn clean package -P distribution
pip install target/systemml-0.12.0-incubating-SNAPSHOT-python.tgz
pyspark
>>> from systemml import random
>>> m1 = random.uniform(size=(1000,1000))
>>> m2 = random.uniform(size=(1000,1000))
>>> m3 = m1.dot(m2).toNumPy()

Here are the logs from my setups:

  • machine 1: installed openblas, no gpu, linux
  • machine 2: installed openblas, mkl, cuda 8, cudnn 5.1, linux
On machine 1:
16/12/22 11:21:38 INFO accelerator.LibraryLoader: Unable to load MKL:no mkl_rt in java.library.path
16/12/22 11:21:38 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 11:21:38 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas

On machine 2 (SYSTEMML_GPU and SYSTEMML_BLAS not set):
16/12/22 14:30:46 INFO accelerator.BLASHelper: Found BLAS: mkl
16/12/22 14:30:46 INFO accelerator.BLASHelper: Successfully loaded systemml library with mkl
16/12/22 14:30:52 INFO accelerator.JCudaHelper: Total number of GPUs on the machine: 2
16/12/22 14:30:52 INFO accelerator.JCudaHelper: GPU is enabled

On machine 2 (SYSTEMML_GPU not set and SYSTEMML_BLAS set to openblas):
16/12/22 14:33:07 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 14:33:07 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas
16/12/22 14:33:12 INFO accelerator.JCudaHelper: Total number of GPUs on the machine: 2
16/12/22 14:33:12 INFO accelerator.JCudaHelper: GPU is enabled

On machine 2 (SYSTEMML_GPU set to none and SYSTEMML_BLAS set to openblas):
16/12/22 14:34:05 INFO accelerator.BLASHelper: Found BLAS: openblas
16/12/22 14:34:05 INFO accelerator.BLASHelper: Successfully loaded systemml library with openblas
16/12/22 14:34:05 INFO accelerator.JCudaHelper: Not loading JCUDA as SYSTEMML_GPU=none

On machine 2 (SYSTEMML_GPU set to none and SYSTEMML_BLAS set to none):
16/12/22 14:34:54 INFO accelerator.BLASHelper: Not loading native BLAS as SYSTEMML_BLAS=none
16/12/22 14:34:54 INFO accelerator.JCudaHelper: Not loading JCUDA as SYSTEMML_GPU=none

If you have a few more free cycles, please run the microbenchmarks: https://github.com/niketanpansare/incubator-systemml/tree/for_cpp/scripts/perftest/microbenchmarks

@niketanpansare niketanpansare changed the title [SYSTEMML-769] Support for native BLAS in SystemML [SYSTEMML-769] Support for native BLAS and simplify deployment for GPU backend Dec 23, 2016
@niketanpansare
Contributor Author

All tests passed here.

@dusenberrymw
Contributor

dusenberrymw commented Jan 9, 2017

@niketanpansare W.r.t. BLAS on OS X / macOS, currently the preferred and default BLAS implementation is Apple Accelerate. Like CBLAS, ATLAS, OpenBLAS, Intel MKL, etc., Accelerate implements the BLAS API, and in terms of performance, Accelerate is generally the fastest BLAS implementation on OS X / macOS. Also, just to be clear, Accelerate is part of OS X / macOS and does not require any installation.

For example, a pip install numpy on OS X / macOS will result in NumPy being automatically linked against Accelerate:

import numpy as np
np.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
blas_opt_info:
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
openblas_lapack_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_compile_args = ['-msse3']
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]

@dusenberrymw
Contributor

@niketanpansare I ran the quick test above, and I did not see any INFO accelerator ... lines. I did have the following though on OS X / macOS, no GPU:

17/01/09 15:36:17 INFO SparkContext: Added JAR /usr/local/lib/python2.7/site-packages/systemml/systemml-java/systemml-0.12.0-incubating-SNAPSHOT.jar at http://localhost:50037/jars/systemml-0.12.0-incubating-SNAPSHOT.jar with timestamp 1484004977792
17/01/09 15:36:17 INFO SparkContext: Added JAR /usr/local/lib/python2.7/site-packages/systemml/systemml-java/systemml-accelerator.jar at http://localhost:50037/jars/systemml-accelerator.jar with timestamp 1484004977848
17/01/09 15:36:35 INFO LibraryLoader: Unable to load MKL:no mkl_rt in java.library.path
17/01/09 15:36:35 INFO LibraryLoader: Unable to load OpenBLAS:no openblas in java.library.path
17/01/09 15:36:36 INFO LibraryLoader: Unable to load CUDA:no cuda in java.library.path

@niketanpansare
Contributor Author

Can you double-check whether OpenBLAS is available to Java? https://github.com/niketanpansare/incubator-systemml/blob/9f325e1ac7155bf6a1cf98fc8e314ac339a198a0/docs/accelerator.md#frequently-asked-questions

Also, I agree with you that we should support Accelerate once this PR is in.

@dusenberrymw
Contributor

dusenberrymw commented Jan 10, 2017

Well, I didn't install OpenBLAS, but I'll look into it on a Linux box. As for the rest of this PR, the systemml-accelerator.jar idea is awesome, but I don't think we will be able to ship a release that contains that JAR due to all of the binary files included. Thoughts?

cc @lresende, @deroneriksson

@niketanpansare
Contributor Author

Thanks @dusenberrymw .... We have three options:

  1. Separate project option: We keep a separate project, https://github.com/niketanpansare/systemml-accelerator, and either:
  • Provided dependency: We host systemml-accelerator.jar on Maven Central once this PR is accepted and provide the link in the documentation. I can also create a pip installer for it.
  • Compile dependency: We include a fat artifact where we compile systemml-accelerator.jar into SystemML, thereby not requiring a release of systemml-accelerator.jar.
  2. One project option: We include the files from https://github.com/niketanpansare/systemml-accelerator in our project and have a separate profile to compile the libraries conditionally.

Niketan Pansare added 7 commits January 10, 2017 19:16
…nd GPU backend

1. Support for automatic detection of BLAS (MKL and OpenBLAS) and GPU
backend.
2. Added native matmult and conv2d functions. If the native library is not available, we fall back to the Java implementation.
3. This will allow us to explore a distributed GPU solution.
@niketanpansare
Contributor Author

Closing this PR in favor of #344 and #291
