FAQ

Frequently Asked Questions

Getting Started

Which GEMM variant should I use?

It depends on your precision needs and hardware:

| Scenario | Recommended variant |
| --- | --- |
| Maximum accuracy | `aocl_gemm_f32f32f32of32` |
| Good accuracy, less memory (Zen4+) | `aocl_gemm_bf16bf16f32of32` |
| Good accuracy (Zen1-3, no AVX512_BF16) | `aocl_gemm_bf16bf16f32of32` (auto-falls back to f32) |
| Quantized inference | `aocl_gemm_u8s8s32os32` or `aocl_gemm_s8s8s32os32` |
| Weight-only quantization | `aocl_gemm_bf16s4f32of32` |
| Half-precision pipeline (Zen5+) | `aocl_gemm_f16f16f16of16` (or `aocl_gemm_f16f16f16of32` for F32 output) |
| FP16 weights, F32 activations (Zen4+) | `aocl_gemm_f32f16f32of32` |
| Quantized inference, FP16 output | `aocl_gemm_u8s8s32of16` or `aocl_gemm_s8s8s32of16` |

See the GEMM Guide for the full data type matrix.

How do I check if my CPU supports AVX512_BF16?

# Linux
grep -o 'avx512_bf16' /proc/cpuinfo | head -1

# If output is empty, your CPU does not have native BF16 support.
# AOCL-DLP will automatically fall back to f32 kernels.

Can I use AOCL-DLP on Intel CPUs?

Yes. AOCL-DLP is optimized for AMD processors but is compatible with any x86_64 CPU that meets the minimum ISA requirements (AVX2 for f32/bf16, AVX512_VNNI for integer). Performance is tuned for AMD microarchitectures, so you may see different performance characteristics on Intel hardware.

Building & Linking

Do I need `--whole-archive` for static linking?

Yes. When statically linking AOCL-DLP, the --whole-archive flag is required. Without it, the linker may discard constructor functions that initialize internal kernel dispatch tables, leading to silent performance degradation or runtime failures.

# Correct
gcc -o app main.c -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive -lstdc++ -lm -fopenmp

# Wrong -- may silently break
gcc -o app main.c -laocl-dlp_static -lstdc++ -lm -fopenmp

With CMake 3.24+, use $<LINK_LIBRARY:WHOLE_ARCHIVE,...>. See the Integration Guide for full details.

CMake can't find AoclDlp

Set CMAKE_PREFIX_PATH to the AOCL-DLP install location:

cmake -DCMAKE_PREFIX_PATH=/usr/local ..

Library not found at runtime

Add the library directory of the install location to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

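Alternatively, embed an rpath at link time (for example, -Wl,-rpath,/usr/local/lib) so the loader can find the library without LD_LIBRARY_PATH. See Integration Guide - Troubleshooting.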

Threading

What is the difference between `DLP_NUM_THREADS` and `OMP_NUM_THREADS`?

Both control the thread count, but DLP_NUM_THREADS takes precedence over OMP_NUM_THREADS. The full order, from highest to lowest precedence:

1. API calls (dlp_thread_set_num_threads()) -- highest
2. DLP_NUM_THREADS -- library-specific, overrides OpenMP
3. OpenMP API (omp_set_num_threads())
4. OMP_NUM_THREADS -- OpenMP environment variable
5. System default -- number of available cores

Use DLP_NUM_THREADS when you want DLP threading independent of your application's OpenMP settings.

What are `DLP_JC_NT` and `DLP_IC_NT`?

These control 2D thread decomposition for the GEMM loops:

DLP_JC_NT -- threads for the outer (JC) loop
DLP_IC_NT -- threads for the inner (IC) loop

When both are set, DLP_NUM_THREADS is ignored and total threads = JC_NT * IC_NT.

My application uses OpenMP too. Will threads conflict?

Potentially. If both your application and DLP use OpenMP, the combined thread count can exceed the number of available cores (over-subscription). To reduce the risk:

Set DLP_NUM_THREADS to control DLP independently
Avoid calling DLP from within an OpenMP parallel region
Use dlp_thread_set_num_threads() to limit DLP threads when needed

How many threads are used by default?

It depends on the build's threading model. OpenMP builds default to the OpenMP thread count (typically the number of available cores). Serial (non-OpenMP) builds run single-threaded -- the default thread count is 1.

See Environment Variables for full threading configuration.

Performance

Why is my BF16 code running slower than expected?

On CPUs without native AVX512_BF16 (Zen1-3, Intel pre-Cooper Lake), AOCL-DLP transparently falls back to f32 kernels. This means:

1. BF16 inputs are converted to f32 before computation
2. Computation runs on f32 kernels
3. Output is converted back to bf16 if needed

This fallback produces correct results but adds conversion overhead, and because f32 values are twice the size of bf16 values, the converted data needs roughly 2x the memory bandwidth. Check your hardware with grep avx512_bf16 /proc/cpuinfo. See Library Overview for details.

How do I get the best performance on multi-socket systems?

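Use NUMA-aware thread binding. The example below binds execution and memory to NUMA node 1; adjust the thread count and node number for your system: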

OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application

See the Performance Guide and Environment Variables for detailed tuning.

Should I reorder my weight matrices?

Yes, if you reuse the same weight matrix across multiple GEMM calls (common in inference). Reordering transforms the matrix into a cache-friendly layout that matches the GEMM kernel's access pattern. The one-time cost of the reorder call is amortized across the subsequent GEMM calls.

See GEMM Guide - Matrix Reordering.

API Usage

How do I get the library version at runtime?

int major, minor, patch;
dlp_version_query(&major, &minor, &patch);
printf("AOCL-DLP version: %d.%d.%d\n", major, minor, patch);

See the version.c example in the examples directory.

How do I check for errors after a GEMM call?

Pass a dlp_metadata_t struct and inspect error_hndl.error_code:

dlp_metadata_t meta = {0};
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N', b, ldb, 'N', 0.0f, c, ldc, &meta);

if (meta.error_hndl.error_code != DLP_CLSC_SUCCESS) {
    printf("Error: %d\n", meta.error_hndl.error_code);
}

Error codes are defined in dlp_errors.h. See the GEMM Guide for the full list.
