
Documentation for the callbacks on the CBLAS-like API #15

Closed
fommil opened this issue Sep 10, 2013 · 17 comments


fommil commented Sep 10, 2013

I just got clBLAS compiling on OS X (see #7) and did a very simple DGEMM test, see https://github.com/fommil/netlib-java/

The results are, frankly, a little unbelievable... so I'm going to have to check that the DGEMM is actually being performed.

However, as part of the setup I found it very hard to understand the clblasDgemm API. It looks like you've added offsets to the arrays (which doesn't make much sense in C, since this can be done by just moving the pointer) and also added the following parameters:

    cl_uint numCommandQueues,
    cl_command_queue *commandQueues,
    cl_uint numEventsInWaitList,
    const cl_event *eventWaitList,
    cl_event *events

I just set these to NULL or 0 as appropriate: https://github.com/fommil/netlib-java/blob/master/perf/src/main/c/clwrapper.c

What are these for and what are sensible defaults just to get me up and running?


kknox commented Sep 11, 2013

Hi @fommil,
Because OpenCL is meant to operate on devices with disparate memory address spaces, it treats memory in a 'black box' fashion. Allocating OpenCL memory does not in fact return a pointer; it returns a handle that is meant to be used in further OpenCL operations. These handles are cl_mem objects, and our APIs take cl_mem objects as parameters. Since you cannot apply pointer arithmetic to a handle, we add an extra offset parameter for every cl_mem parameter, to allow a user to specify a starting offset into the buffer.
The extra parameters appended to the BLAS API are the OpenCL objects that control the execution of the OpenCL kernels. If you set them to NULL, the API will not do anything, and I'm sure it will appear to run very fast.
We provide library documentation for our API, but it assumes that you are already familiar and comfortable with OpenCL. If you would like to start learning OpenCL, the OpenCL specification is not a terrible read, and AMD has additional resources for developers.


fommil commented Sep 11, 2013

@kknox a little example of how to call the BLAS functions wouldn't go amiss. The equivalent cuBLAS functions are much more closely aligned with the original BLAS API in comparison... although it is rather frustrating that neither library actually implements the BLAS interface that decades of middleware have conformed to. Hence my wrapper layer.


fommil commented Sep 11, 2013

You don't have an explicit dgemm example, but the C examples you pointed me at were useful.

It looks to me like you're still some way from users being able to call clBLAS as BLAS. I'll attempt to wrap DDOT and DGEMM over the coming months, but I'll pause at that point to see where to go.


kknox commented Sep 11, 2013

Hi @fommil
If you are looking for code examples of how to call the BLAS functions, take a look at the samples directory of the repository; we have simple examples of calling almost every routine that we support, in single precision. You should be able to compile a sample, step through it in a debugger, and see what is needed to initialize OpenCL and call into a BLAS routine.

We recognize that the clBLAS API is slightly different from the traditional Netlib BLAS definition; we did not break the BLAS API lightly or arbitrarily. Designing for heterogeneous platforms, such as modern GPUs, forces different decisions than were made 30 years ago for homogeneous platforms like traditional CPU servers. There is a heavy cost in transferring data to and from the heterogeneous device (i.e. the GPU, over the PCI Express bus), and if data is managed carelessly, performance will actually be worse than not having offloaded the computation in the first place.

Our API, built on top of OpenCL, allows our clients to manage their own data: they control when and where data is transferred to and from the heterogeneous device. This is the reason we added the extra OpenCL parameters to the BLAS APIs; the user manages the OpenCL state and passes it into the library, which ultimately generates OpenCL kernels and enqueues them into the command queue. With this API, the client controls when data is transferred to the device, executes a series of BLAS calls (or user-defined kernels) while the data remains on the device, and transfers data back to the host only when done processing. Otherwise, data is transferred in a round trip to the device and back on every BLAS call, and you find yourself in the uncomfortable situation where you would have been better off not offloading to the device in the first place 😃


fommil commented Sep 11, 2013

@kknox can you please take a look at this? It's a translation of your sgemm sample.

https://github.com/fommil/netlib-java/blob/master/perf/src/main/c/clwrapper.c

When I run my test file

https://github.com/fommil/netlib-java/blob/master/perf/src/main/c/dgemmtest.c

(compilation instructions at the top)

I see this :-(

found 1 OpenCL platforms
found 1 OpenCL devices
created context
created command queue
setup clblas
created buffers
enqueud buffers
Segmentation fault: 11

I'm on OS X. Note that I changed CL_DEVICE_TYPE_GPU, as I was getting 0 devices with it. I have another machine that I can try this out on... perhaps my laptop doesn't support OpenCL on the GPU (first I've heard of it! It's an Intel HD Graphics 3000).


fommil commented Sep 11, 2013

for completeness, I thought I would note that my Macbook Air doesn't seem to support OpenCL on the GPU :-( http://forums.macrumors.com/showthread.php?t=1119312

simonmcs commented Sep 11, 2013

Apple will provide OpenCL 1.2 support on the integrated Iris graphics of Haswell-based MBAs in Mavericks, when that's released soon. Sounds like you'll be justified in treating yourself to a new laptop! ;-)

http://forums.macrumors.com/showthread.php?t=1620203

http://docs.huihoo.com/apple/wwdc/2013/session_508__working_with_opencl.pdf

Simon



fommil commented Sep 11, 2013

@simonmcs heh, nah... I've got a relatively new iMac that I'll use for GPU performance tests. And clBLAS needs to work without segfaults before I can rationalise a frivolous upgrade :-P


pavanky commented Sep 13, 2013

@fommil

I don't understand what you are trying to do here

size_t off  = 1;
size_t offA = K + 1;   /* K + off */
size_t offB = N + 1;   /* N + off */
size_t offC = N + 1;   /* N + off */

To use clBLAS, all you need to do is make the offsets 0 and pass the other parameters as-is. You are making it more complicated than it is worth. The segmentation fault is likely occurring because you are using your CPU as your OpenCL device and the wrapper code you have written is trying to access elements that are out of bounds.


fommil commented Sep 13, 2013

@pavanky I am copying the code from the example. I didn't understand why the offsets are +1! I thought it was some device-specific nonsense.


pavanky commented Sep 13, 2013

The example has the following line.
/* Call clblas extended function. Perform gemm for the lower right sub-matrices */

Since you want matrix multiplication on the entire matrix, try setting offsets to 0 for your case. Use M, N, K, LDA, LDB, LDC directly.


fommil commented Sep 13, 2013

oh, I missed that bit :-D

now, why would a gemm example not do gemm?


pavanky commented Sep 13, 2013

@fommil it is doing gemm, but only on the bottom right corner of the buffers.

The equivalent in standard gemm would have used something like A + offA, B + offB and C + offC.

This kind of API is necessary for OpenCL because such pointer offsets are not possible from the host side, yet they are required by some libraries downstream of BLAS (such as various LAPACK implementations).


fommil commented Sep 13, 2013

@pavanky I'm still getting the segfault with no offsets. Actually, this happened last night too and that's why I added all the offsets (I thought it was some hocus pocus and didn't see the note about sub matrices).


fommil commented Sep 15, 2013

I get the segfaults when on a GPU device as well. I won't be able to test this again until next weekend.


fommil commented Sep 16, 2013

@kknox @pavanky I'm still unable to get results with clBLAS, but I've been able to run some DGEMM tests with CUDA to confirm your comments about the memory overhead. Indeed, it is pretty spectacular. Turquoise (the light blue below the red lines, keeping pace with the green ATLAS line) is CUDA including transfer overhead; dark blue is just the CUDA dgemm call (and I checked that it computes the result correctly!).

[chart: dgemm benchmark comparison]


kknox commented Jan 2, 2014

Closing old clBLAS issues for the new year

I believe that this question has been answered, in part here and in part with the comments in #12.
