

This project is copyright 2012 by Chris Jang (

How to use tools

List compute devices                    LSdevice.amd

If you only have one vendor's OpenCL SDK installed, then most likely you do not
have to worry about setting the LD_LIBRARY_PATH (or something else like
SHLIB_PATH if this is ported to another platform besides Linux).

With OpenCL SDKs from multiple vendors or several versions of a vendor SDK, you
may have to worry about the shared library path. On my GPU testing machine, all
three OpenCL vendor SDKs (AMD, Intel, NVIDIA) are installed. I do not need to
set the LD_LIBRARY_PATH. Everything works.

However, when I was testing with multiple versions of the ATI/AMD SDK, I often
neglected to fully uninstall one version before installing the next. If the
LD_LIBRARY_PATH was not set, the result was chimera binaries: an application
compiled against one vendor SDK that dynamically links against another at run
time, which ends in a segfault.

I think the vendors are getting smarter about this, so there is a good chance
everything will work for you. If not, an easy way to explicitly set the
LD_LIBRARY_PATH is to source the ENV.<VENDOR> file generated by the build.

    (optional) . ../ENV.<VENDOR>
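If you would rather set the path by hand, the following is a minimal sketch.
The function name prepend_ld_path and the directory in the usage comment are
hypothetical. What the generated ENV.<VENDOR> files actually contain is not
shown here, so this is not a substitute for them.

```shell
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
# Usage (hypothetical path): prepend_ld_path /path/to/vendor/sdk/lib
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*)
            ;;  # already present, nothing to do
        *)
            # Put the new directory first so the dynamic linker searches it first.
            LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}
```

On Linux, running "ldd ./LSdevice.amd | grep -i opencl" afterwards should show
which OpenCL shared library the binary would pick up at run time.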

If the vendor SDK(s) are installed properly and device drivers are running,
all expected compute devices should be listed. OpenCL uses a standard registry
in /etc/OpenCL/vendors where vendor SDKs install an .icd file. It does not
matter which LSdevice.<VENDOR> binary you use (LSdevice.amd or LSdevice.nvidia,
for example). The listed devices should be the same. Every vendor SDK sees all
compute devices.
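
To see which vendor SDKs have registered themselves, you can look at the
registry directly. A minimal sketch, assuming each .icd file holds the name or
path of the vendor's OpenCL library on its first line (list_icds is a
hypothetical helper, not part of this project):

```shell
# List the OpenCL ICD registry entries and the vendor library each one names.
# Usage: list_icds [registry_dir]   (defaults to /etc/OpenCL/vendors)
list_icds() {
    dir="${1:-/etc/OpenCL/vendors}"
    for f in "$dir"/*.icd; do
        # If the glob matched nothing, the pattern is left unexpanded; skip it.
        [ -f "$f" ] || continue
        # Print "file -> first line of file" for each registered vendor.
        printf '%s -> %s\n' "$(basename "$f")" "$(head -n 1 "$f")"
    done
}
```

An empty listing suggests that no vendor SDK is registered, which would also
explain an empty device list.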


This is what I see on my GPU testing machine. It has two discrete GPUs, an ATI
Radeon HD 5870 and an NVIDIA GTX 480. The CPU, a Core i7 920, is also a compute
device.

    $ ./LSdevice.amd 
    0 | GPU | GeForce GTX 480 | NVIDIA Corporation
    1 | GPU | Cypress | Advanced Micro Devices, Inc.
    2 | CPU | Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz | GenuineIntel

If you don't see all the compute devices you expect (e.g. you have an AMD or
NVIDIA GPU and it does not appear in the list), that means an SDK is either
installed incorrectly or the device driver is not loaded/running. Don't panic.
My experience is that everything does work, although you may have to fully
uninstall before re-installing.

Auto-tune matrix multiply (GEMM)        ATmatmul.amd

(Note: examples use the ".amd" version for no particular reason)

To print the usage summary, just run the tool without arguments. You may need
to set the LD_LIBRARY_PATH as described above.

    $ ./ATmatmul.amd
    usage: ./ATmatmul.amd -f configspec -p UISD{3} -m M [-n N -k K] [-t NT{2}] [-c batching] [-g|-G] [-v 042|x1x|777] [-s] [-r]

A sample configspec file is ../devspec.cfg. This is what I use during
development-integration-test. If you have the same compute devices that I use,
it will work for you unchanged. More likely, you will have different CPUs and
GPUs than mine, so you will need to write your own configspec file.

OpenCL doesn't have a standardized namespace or brokering mechanism for compute
devices. The API ultimately relies on grepping strings for device and vendor
names. That's why the LSdevice.<vendor> tool is useful. It prints out those
strings. It's not difficult to find a set of keywords that uniquely identifies
a compute device.
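
One way to try out a set of keywords is to filter the LSdevice output with
them. A minimal sketch (match_devices is a hypothetical helper and uses plain
fixed-string matching, which may differ from how the configspec is matched):

```shell
# Print the lines from stdin that contain every keyword given as an argument.
# Usage: ./LSdevice.amd | match_devices GPU NVIDIA
match_devices() {
    lines=$(cat)
    for kw in "$@"; do
        # Keep only the lines that also contain this keyword (fixed string).
        lines=$(printf '%s\n' "$lines" | grep -F -- "$kw" || true)
    done
    if [ -n "$lines" ]; then
        printf '%s\n' "$lines"
    fi
}
```

If exactly one line comes back, the keywords single out one compute device.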

The format of the configspec file should be straightforward. I am testing with
the three major vendors: AMD, Intel, NVIDIA. So you can add any new compute
devices and follow the pattern in the ../devspec.cfg file I use.

If your configspec file is myspec.cfg, then to auto-tune SGEMM for 80 x 80
square matrices:

    $ ./ATmatmul.amd -f myspec.cfg -p SSS -m 80

This generates a file with a name of the form: journal<DEVICE>.<VENDOR> . This
file is the journalled memo used by the JIT for auto-tuning.

Another example, GEMM with single precision A, double precision B and C, and
matrix B transposed:

    $ ./ATmatmul.amd -f myspec.cfg -p SDD -m 80 -tNT -G

Now the same except including the I/O for sending data to the compute device
and reading it back:

    $ ./ATmatmul.amd -f myspec.cfg -p SDD -m 80 -tNT -G -s -r

Here's an example of multiplication with rectangular matrices:

    $ ./ATmatmul.amd -f myspec.cfg -p DDD -m 32 -n 64 -k 256

Now the same except with three kernels batched together:

    $ ./ATmatmul.amd -f myspec.cfg -p DDD -m 32 -n 64 -k 256 -c 3

Here's an example where A must be an image and B must be a memory buffer:

    $ ./ATmatmul.amd -f myspec.cfg -p DDD -m 256 -v 07x

It is useful to have these options for narrowing the auto-tuning search.
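
The -p and -t values in the examples above follow the usage notation. Reading
UISD{3} as "three characters, each one of U, I, S, D" and NT{2} as "two
characters, each one of N, T" is an assumption based on those examples (SSS,
SDD, DDD, NT). Under that assumption, a minimal sketch for checking arguments
before starting a long auto-tuning run (both function names are hypothetical):

```shell
# Return 0 if the argument looks like a valid -p value: three of U, I, S, D.
valid_precision() {
    case "$1" in
        [UISD][UISD][UISD]) return 0 ;;
        *) return 1 ;;
    esac
}

# Return 0 if the argument looks like a valid ATmatmul -t value: two of N, T.
valid_transpose() {
    case "$1" in
        [NT][NT]) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "valid_precision SDD && valid_transpose NT" succeeds, while
"valid_precision SD" fails because it has only two characters.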

Note the configspec file is order dependent. If several compute devices could
be used, the tool takes the first one it finds in the file with "Evergreen"
capability.

(What is this "Evergreen"? It is an ATI/AMD hardware architecture for a family
 of GPU models. I designed the parameterized kernel families for GEMM and GEMV
 for the Evergreen architecture. Later, I discovered that the code also worked
 on Intel and NVIDIA. That's why non-AMD compute devices have the "Evergreen"
 capability in the ../devspec.cfg file.)

Auto-tune matrix vector multiply (GEMV) ATmatvec.amd

(Note: examples use the ".amd" version for no particular reason)

To print the usage summary, just run the tool without arguments. You may need
to set the LD_LIBRARY_PATH as described above.

    $ ./ATmatvec.amd
    usage: ./ATmatvec.amd -f configspec -p UISD{3} -m M [-n N] [-t NT] [-c batching] [-g|-G] [-v 042|x1x|777] [-s] [-r]

This is very similar to ATmatmul as described above. The differences are the
same as between GEMM and GEMV.

GEMM has three problem dimensions in M, N, K.
GEMV has two problem dimensions in M, N.

GEMM may transpose A and B.
GEMV may transpose A. (Transposing B, a vector, doesn't make sense here.)

Otherwise, ATmatvec works in the same way as ATmatmul.

Also note that ATmatvec uses the same journal file as ATmatmul. That's ok and
what we want. Part of the key to the journal/memo is the kernel algorithm name.
Journal/memos are particular to the combination of compute device and vendor
SDK/driver. Each journal/memo contains entries for all kernels for that device
and vendor pair.

Retry wrapper script                    retry

The retry wrapper script automates restarting the auto-tuning applications
ATmatmul and ATmatvec if they exit because of a watchdog timeout or crash with
a segmentation fault.

For example, if the following repeatedly fails to complete:

    $ ./ATmatmul -f configspec -p DDD -m 640

then instead of restarting it by hand, just run it using the wrapper:

    $ ./retry ./ATmatmul -f configspec -p DDD -m 640

    Why is this even necessary?

    The OpenCL compilers in vendor SDKs sometimes hang or crash. Auto-tuning
    optimizes kernel performance. It is also a stress test of the software
    stack and tends to find bugs in shader compilers and device drivers.
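
The idea behind such a wrapper can be sketched in a few lines of shell. This is
not the project's retry script, only an illustration of the technique. The
function name retry_cmd and the attempt limit are assumptions.

```shell
# Re-run a command until it exits with status 0 or the attempt limit is hit.
# Usage: retry_cmd MAX_ATTEMPTS command [args...]
retry_cmd() {
    max="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        else
            status=$?   # exit status of the failed command
        fi
        echo "attempt $attempt of $max failed (exit status $status)" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

A call would look like "retry_cmd 10 ./ATmatmul -f configspec -p DDD -m 640".
Restarting only makes progress if each run can pick up where the last one
stopped.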