This repository has been archived by the owner. It is now read-only.

Support for OpenCL #27

Open
agibsonccc opened this Issue Mar 6, 2015 · 36 comments

@agibsonccc
Member

agibsonccc commented Mar 6, 2015

This will be the thread for AMD GPUs.

@agibsonccc

Member

agibsonccc commented Apr 17, 2015

Initial WIP has landed: converted most of the .cu files to .kh and mapped the terminology. Next up is integrating JOCL.
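
For anyone following along, here is a rough sketch of what that terminology mapping and the JOCL integration look like. This is a hypothetical, self-contained vector-add example rather than the actual nd4j kernels: CUDA's __global__, threadIdx and blockIdx map to OpenCL's __kernel and get_global_id, and the host side goes through JOCL instead of the CUDA runtime.

```java
import static org.jocl.CL.*;

import org.jocl.*;

// Hypothetical vector-add showing the CUDA -> OpenCL mapping at the kernel level:
//   CUDA:   __global__ void add(const float *a, ...)
//           int i = blockIdx.x * blockDim.x + threadIdx.x;
//   OpenCL: __kernel, __global pointers, and get_global_id(0) instead.
public class JoclVectorAdd {
    private static final String KERNEL_SRC =
        "__kernel void add(__global const float *a,"
      + "                  __global const float *b,"
      + "                  __global       float *c) {"
      + "    int i = get_global_id(0);"      // == blockIdx.x * blockDim.x + threadIdx.x
      + "    c[i] = a[i] + b[i];"
      + "}";

    public static void main(String[] args) {
        final int n = 1024;
        float[] a = new float[n], b = new float[n], c = new float[n];
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

        CL.setExceptionsEnabled(true);

        // Pick the first platform/device; a real backend would enumerate and choose.
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 1, devices, null);

        cl_context context = clCreateContext(null, 1, devices, null, null, null);
        cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);

        // Device buffers: roughly cudaMalloc + cudaMemcpy in CUDA terms.
        cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_float * n, Pointer.to(a), null);
        cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                Sizeof.cl_float * n, Pointer.to(b), null);
        cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                Sizeof.cl_float * n, null, null);

        // OpenCL compiles the kernel source at runtime; CUDA kernels are usually built ahead of time.
        cl_program program = clCreateProgramWithSource(context, 1, new String[]{KERNEL_SRC}, null, null);
        clBuildProgram(program, 0, null, null, null, null);
        cl_kernel kernel = clCreateKernel(program, "add", null);

        clSetKernelArg(kernel, 0, Sizeof.cl_mem, Pointer.to(bufA));
        clSetKernelArg(kernel, 1, Sizeof.cl_mem, Pointer.to(bufB));
        clSetKernelArg(kernel, 2, Sizeof.cl_mem, Pointer.to(bufC));

        // Global/local work sizes play the role of CUDA's grid/block dimensions.
        clEnqueueNDRangeKernel(queue, kernel, 1, null, new long[]{n}, new long[]{64}, 0, null, null);
        clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, Sizeof.cl_float * n, Pointer.to(c), 0, null, null);

        System.out.println("c[10] = " + c[10]);  // expect 30.0

        clReleaseMemObject(bufA); clReleaseMemObject(bufB); clReleaseMemObject(bufC);
        clReleaseKernel(kernel); clReleaseProgram(program);
        clReleaseCommandQueue(queue); clReleaseContext(context);
    }
}
```

The work-group size stands in for CUDA's block size; choosing it per device is exactly the kind of detail that makes a full port more than a mechanical rename.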

@peterstyles

peterstyles commented May 29, 2015

+1 for this - cheers for putting this all together!

@agibsonccc

Member

agibsonccc commented May 29, 2015

Quick update on this: the CUDA architecture is baked.

I am going to base the OpenCL support off of the work I did for CUDA. The OpenCL kernels are written already; I just need to finish porting the CUDA stuff over now.

@Riotsup

Riotsup commented Jul 6, 2015

Hello, is this already supported for AMD? Asking because I'm planning on buying a Fury X for deep learning, and for the lulz.

@agibsonccc

Member

agibsonccc commented Jul 6, 2015

OpenCL supports AMD, yes. I need to port the work I did for CUDA over to OpenCL, though. The kernels are already written; we just need the bandwidth.

@Riotsup

Riotsup commented Jul 7, 2015

Great, but what does nd4j do for CUDA that it does not (yet) do for AMD?

@agibsonccc

Member

agibsonccc commented Jul 7, 2015

Nd4j's CUDA backend is actually implemented. I don't have the BLAS operations for OpenCL yet.
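
To make the gap concrete: what is missing is a set of BLAS kernels (GEMM and friends) plus the host-side wiring, either hand-written or wrapped from a library such as clBLAS. A hypothetical, deliberately naive OpenCL SGEMM kernel (illustrative only; nothing like a tuned BLAS and not nd4j code), kept as source that JOCL host code could build and launch, might look like this:

```java
// Hypothetical, untuned SGEMM (C = alpha*A*B + beta*C, row-major) held as OpenCL C
// source in a Java constant. A real backend would use tiling, local memory and
// per-device tuning, or simply wrap an existing library such as clBLAS.
public final class NaiveSgemmKernel {
    public static final String SOURCE =
        "__kernel void sgemm(const int M, const int N, const int K,\n"
      + "                    const float alpha, const float beta,\n"
      + "                    __global const float *A,\n"
      + "                    __global const float *B,\n"
      + "                    __global       float *C) {\n"
      + "    int row = get_global_id(0);\n"   // one work-item per output element
      + "    int col = get_global_id(1);\n"
      + "    if (row >= M || col >= N) return;\n"
      + "    float acc = 0.0f;\n"
      + "    for (int k = 0; k < K; k++) {\n"
      + "        acc += A[row * K + k] * B[k * N + col];\n"
      + "    }\n"
      + "    C[row * N + col] = alpha * acc + beta * C[row * N + col];\n"
      + "}\n";

    private NaiveSgemmKernel() {}
}
```

Building and launching it follows the same clCreateProgramWithSource / clBuildProgram / clSetKernelArg / clEnqueueNDRangeKernel sequence as the vector-add sketch earlier in the thread, with a 2-D global work size of (M, N).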

@agibsonccc

Member

agibsonccc commented Jun 2, 2016

@rkraneis

rkraneis commented Jul 21, 2016

Hi, what is the current state of this? Manually compiling against clBLAS? nd4j-0.4 seems to have dropped anything JOCL-related...

@agibsonccc

Member

agibsonccc commented Jul 21, 2016

We never really finished it, and we don't have customers asking for it. You are welcome to send us a pull request (hint: no one steps up).

If you want it, we will point you in the right direction. Beyond that we don't have the bandwidth.


@raver119

Contributor

raver119 commented Jul 21, 2016

Just a side note: OpenCL is viable for Xeon Phi...

However, I'm not sure whether Xeon Phis are widely used in practice.

@agibsonccc

Member

agibsonccc commented Jul 21, 2016

Problem is versioning. What versions do we support? Where do we expect it to run? WHAT should we support? There's an insane amount of fragmentation out there.

@raver119

Contributor

raver119 commented Jul 21, 2016

Yeah, but right now I'm not talking about AMD GPUs at all; AMD lost that battle and I'm fine with that. I was speaking only about Xeon Phi and its use with nd4j :)

@saudet

Member

saudet commented Jul 22, 2016

AFAIK, Intel is pushing more for OpenMP than OpenCL there, e.g.:

Q3: How does working with MIC/GPU compare to OpenCL in terms of performance and coding complexity?
A3: Good question. The feedback we got from customers was, e.g.: given one large C++ application, it takes 6 months to re-implement it with OpenCL, but about 6 weeks using OpenMP with code modernization efforts to achieve better performance. In addition, based on a set of workloads, OpenCL performance is ~10-30% below OpenMP code on MIC.

https://software.intel.com/en-us/videos/new-era-for-openmp-beyond-traditional-shared-memory-parallel-programming

@rkraneis

rkraneis commented Jul 22, 2016

I'm mostly interested in increased performance of deeplearning4j on AMD/Intel GPUs. Are there benchmarks on the improvement from using, e.g., Intel HD graphics over a plain Core i7 (Skylake)? Or is it just not worth the hassle?

@saudet: so you would recommend implementing nd4j on OpenMP rather than OpenCL, at least for Intel GPUs?

@agibsonccc

Member

agibsonccc commented Jul 22, 2016

We already did OpenMP.


@rkraneis

rkraneis commented Jul 22, 2016

@agibsonccc, do you mean the "Linking with MKL" part?

@agibsonccc

Member

agibsonccc commented Jul 22, 2016

Right. So we already have MKL in place.
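
For anyone wondering what "MKL in place" means in practice: the same ND4J user code runs unchanged against whichever backend is on the classpath, and with the native backend the heavy linear algebra is serviced by whatever BLAS the C++ layer was linked against (MKL, OpenBLAS, ...). A minimal sketch, assuming the nd4j-native backend is on the classpath; the sizes and timing are purely illustrative:

```java
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class BackendAgnosticMatmul {
    public static void main(String[] args) {
        // Plain ND4J code with no MKL/CUDA/OpenCL-specific calls. The backend on the
        // classpath (e.g. nd4j-native linked against MKL or OpenBLAS, or nd4j-cuda)
        // decides where this GEMM actually executes.
        INDArray a = Nd4j.rand(1024, 1024);
        INDArray b = Nd4j.rand(1024, 1024);

        long start = System.nanoTime();
        INDArray c = a.mmul(b);   // dispatched to the backend's BLAS gemm
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("1024x1024 mmul took " + elapsedMs + " ms, sum = " + c.sumNumber());
    }
}
```

An OpenCL backend would slot in at the same level: user code stays the same, and only the classpath dependency changes.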

@saudet

Member

saudet commented Jul 22, 2016

@rkraneis Xeon Phi is more CPU than GPU, and in that case Intel seems to lean toward OpenMP, yes. We'll obviously still need to optimize for the Xeon Phi, though. As for actual GPUs from Intel, those will require OpenCL, just like AMD's. Although, if OpenCL ends up being useful only for GPUs, there's also Vulkan we should take a look at...

@raver119

Contributor

raver119 commented Jul 22, 2016

I'm afraid that "optimize for Xeon Phi" will mostly come down to memory management, since automatic offload mode will run into the cost of PCIe transfers. It's the same story as with CUDA: nd4j issues many atomic operations on relatively small data. That's why I mentioned OpenCL there...

@bhack

bhack commented Jul 22, 2016

They are pushing OpenMP for MKL-DNN, and they will open-source the DNN component in Q3.

smaryn pushed a commit that referenced this issue Dec 19, 2016

@Lord-of-the-Galaxy

Lord-of-the-Galaxy commented Jun 1, 2017

Is this still on?

@AlexDBlack

Member

AlexDBlack commented Jun 13, 2017

@Lord-of-the-Galaxy It's not on the immediate roadmap; it's a low priority for us currently (and would require a lot of engineering time), given that the main use case for OpenCL (i.e., AMD GPUs) has very little uptake in commercial contexts or cloud computing (AWS, Azure, etc.).

@Lord-of-the-Galaxy

Lord-of-the-Galaxy commented Jun 18, 2017

I see. I thought OpenCL might gain priority as neural networks start to be used in more and more programs that run on consumers' devices (there are some games using neural networks extensively), but I guess I was wrong.

@agibsonccc

Member

agibsonccc commented Jun 18, 2017

NVIDIA effectively has a chokehold on the space. Not only are they giving away GPUs, they are also involved in integrating into a lot of the open source packages (us notwithstanding; we write our own).

Honestly, until Mesos recognizes what an AMD GPU is, it's not really interesting for us.

OpenCL is a horribly fragmented standard, where it would mean different things on different devices.

NVIDIA still has their OpenCL support on 2.0. The only reason AMD even pushes "open" is because they are the underdog. A lot of HPC code also doesn't tend to need to be "cross platform".

"Cross platform" tends to mean "NVIDIA". As far as other kinds of devices like Android are concerned, it's a lot of the same story.

@raver119

Contributor

raver119 commented Jun 18, 2017

Right, we could add support for OpenCL, since all our important stuff is C++ anyway, but I'm afraid fragmentation would make it a nightmare for users.

@xvolks

xvolks commented Jul 9, 2017

The upcoming iMac Pro may make it worth the effort to run DL4J on AMD GPUs. But Apple may decide to ship its own Keras/Caffe support for that machine.

@agibsonccc

Member

agibsonccc commented Jul 9, 2017

@xvolks One thing here: Apple implemented a "backend" called CoreML with support for various framework formats. CoreML is mainly for inference on iPhones right now.

Do some research before you make claims, please :).

@xvolks

xvolks commented Jul 10, 2017

@InonS

InonS commented Jul 13, 2017

I'm not sure where things stand right now. The README.md modules section clearly states:

Several of these modules are different backend options for ND4J (including GPUs).
...
jocl-parent = Java bindings for OpenCL

from which it is understood that ND4J currently supports an OpenCL backend via JOCL.

However, looking into ND4J's backend implementations package, I see only native and CUDA.

What happened to JOCL?

@agibsonccc

Member

agibsonccc commented Jul 14, 2017

Apologies, that's actually out of date. We had started an implementation of that a while back and ended up cancelling after seeing what it would take to build.

@Lord-of-the-Galaxy

Lord-of-the-Galaxy commented Sep 28, 2017

And are there any plans to attempt to use Vulkan? (Is it even possible? I'm mostly uninformed on the topic.)

@agibsonccc

Member

agibsonccc commented Sep 28, 2017

We're really only looking at CUDA for the short term.

@raver119

Contributor

raver119 commented Sep 28, 2017

At this moment in time, Vulkan isn't really suited for math; at least, that's what I got from reading its documentation. It's tailored mostly for graphics, and future revisions might change that.

@hristo-vrigazov

hristo-vrigazov commented Mar 30, 2018

I have seen people build pretty decent machines in terms of teraflops per dollar with AMD GPUs, but unfortunately that forces you to use some Caffe fork, since it is the only framework that supports them. Check this post out: https://medium.com/intuitionmachine/building-a-50-teraflops-amd-vega-deep-learning-box-for-under-3k-ebdd60d4a93c

@treo

Member

treo commented Mar 30, 2018

Due to the current crypto mining craze, I would actually argue that - at least for single precision performance - AMD isn't that much better, and its 8 GB of memory, even though it is HBM2, is pretty small.

The half precision story is a bit different, but I'd argue that it still isn't enough to make supporting it worth the effort.
