Nallani Bhaskar edited this page Jun 15, 2026 · 4 revisions

Frequently Asked Questions

Getting Started

Which GEMM variant should I use?

It depends on your precision needs and hardware:

Scenario                                Recommended variant
--------------------------------------  ------------------------------------------------
Maximum accuracy                        aocl_gemm_f32f32f32of32
Good accuracy, less memory (Zen4+)      aocl_gemm_bf16bf16f32of32
Good accuracy (Zen1-3, no AVX512_BF16)  aocl_gemm_bf16bf16f32of32 (auto-falls back to f32)
Quantized inference                     aocl_gemm_u8s8s32os32 or aocl_gemm_s8s8s32os32
Weight-only quantization                aocl_gemm_bf16s4f32of32
Half-precision pipeline (Zen5+)         aocl_gemm_f16f16f16of16 (or aocl_gemm_f16f16f16of32 for F32 output)
FP16 weights, F32 activations (Zen4+)   aocl_gemm_f32f16f32of32
Quantized inference, FP16 output        aocl_gemm_u8s8s32of16 or aocl_gemm_s8s8s32of16

See the GEMM Guide for the full data type matrix.

How do I check if my CPU supports AVX512_BF16?

# Linux
grep -o 'avx512_bf16' /proc/cpuinfo | head -1

# If output is empty, your CPU does not have native BF16 support.
# AOCL-DLP will automatically fall back to f32 kernels.
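
The same check can be generalized to several ISA flags at once. The following is a minimal sketch, not part of AOCL-DLP: has_cpu_flag is a hypothetical helper, and the flag names are the ones Linux reports in /proc/cpuinfo. It matches whole words only, so a query for avx512 does not match avx512_bf16.

```shell
# has_cpu_flag FLAG [CPUINFO_FILE]
# Hypothetical helper: succeeds if FLAG appears as a whole word on the
# first "flags" line of the cpuinfo file (default: /proc/cpuinfo).
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    grep -m1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw -- "$flag"
}

# Report a few flags relevant to kernel selection.
for f in avx2 avx512_vnni avx512_bf16; do
    if has_cpu_flag "$f"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

The optional second argument lets you point the helper at a saved copy of /proc/cpuinfo from another machine.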

Can I use AOCL-DLP on Intel CPUs?

Yes. AOCL-DLP is optimized for AMD processors but is compatible with any x86_64 CPU that meets the minimum ISA requirements (AVX2 for f32/bf16, AVX512_VNNI for integer). Performance is tuned for AMD microarchitectures, so you may see different performance characteristics on Intel hardware.

Building & Linking

Do I need --whole-archive for static linking?

Yes. When statically linking AOCL-DLP, the --whole-archive flag is required. Without it, the linker may discard constructor functions that initialize internal kernel dispatch tables, leading to silent performance degradation or runtime failures.

# Correct
gcc -o app main.c -Wl,--whole-archive -laocl-dlp_static -Wl,--no-whole-archive -lstdc++ -lm -fopenmp

# Wrong -- may silently break
gcc -o app main.c -laocl-dlp_static -lstdc++ -lm -fopenmp

With CMake 3.24+, use $<LINK_LIBRARY:WHOLE_ARCHIVE,...>. See the Integration Guide for full details.

CMake can't find AoclDlp

Set CMAKE_PREFIX_PATH to the AOCL-DLP install location:

cmake -DCMAKE_PREFIX_PATH=/usr/local ..

Library not found at runtime

Set LD_LIBRARY_PATH to include the install directory:

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH

Alternatively, embed the library path in the binary with an rpath at link time (for example, -Wl,-rpath,/usr/local/lib). See Integration Guide - Troubleshooting.
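
To see which shared libraries the loader fails to resolve, you can filter the output of ldd. This is a small sketch: list_missing_libs is a hypothetical helper, and ./app stands in for your own binary.

```shell
# list_missing_libs: reads `ldd` output on stdin and prints the name of
# every library the dynamic loader reported as "not found".
list_missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (./app is a placeholder for your binary):
#   ldd ./app | list_missing_libs
```

If the AOCL-DLP library shows up in the list, the install directory is not yet on the loader's search path.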

Threading

What is the difference between DLP_NUM_THREADS and OMP_NUM_THREADS?

Both control the thread count, but DLP_NUM_THREADS takes precedence over OMP_NUM_THREADS. The full order, from highest to lowest priority:

  1. API calls (dlp_thread_set_num_threads()) -- highest
  2. DLP_NUM_THREADS -- library-specific, overrides OpenMP
  3. OpenMP API (omp_set_num_threads())
  4. OMP_NUM_THREADS -- OpenMP environment variable
  5. System default -- number of available cores

Use DLP_NUM_THREADS when you want DLP threading independent of your application's OpenMP settings.
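
The environment-variable levels of this order can be sketched in shell. resolve_env_threads is a hypothetical helper, not part of AOCL-DLP. It models only levels 2, 4, and 5; runtime API calls (levels 1 and 3) cannot be observed from the shell and still override whatever it prints.

```shell
# resolve_env_threads: prints the thread count implied by the environment
# alone: DLP_NUM_THREADS first, then OMP_NUM_THREADS, then the number of
# online cores. API calls made inside the application are NOT modeled.
resolve_env_threads() {
    if [ -n "${DLP_NUM_THREADS:-}" ]; then
        echo "$DLP_NUM_THREADS"
    elif [ -n "${OMP_NUM_THREADS:-}" ]; then
        echo "$OMP_NUM_THREADS"
    else
        nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
    fi
}

echo "thread count implied by environment: $(resolve_env_threads)"
```

For example, launching with OMP_NUM_THREADS=16 DLP_NUM_THREADS=4 would give the application's own OpenMP regions 16 threads while DLP uses 4.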

What are DLP_JC_NT and DLP_IC_NT?

These control 2D thread decomposition for the GEMM loops:

  • DLP_JC_NT -- threads for the outer (JC) loop
  • DLP_IC_NT -- threads for the inner (IC) loop

When both are set, DLP_NUM_THREADS is ignored and the total thread count is DLP_JC_NT * DLP_IC_NT.
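
As a quick arithmetic check of the product rule, here is a sketch that computes the total from the two variables. dlp_2d_threads is a hypothetical helper, and the values 4 and 8 are arbitrary.

```shell
# dlp_2d_threads: prints DLP_JC_NT * DLP_IC_NT when both variables are set.
# Returns 1 and prints nothing if either one is missing.
dlp_2d_threads() {
    if [ -n "${DLP_JC_NT:-}" ] && [ -n "${DLP_IC_NT:-}" ]; then
        echo $(( ${DLP_JC_NT} * ${DLP_IC_NT} ))
    else
        return 1
    fi
}

# 4 threads across the JC loop and 8 across the IC loop.
DLP_JC_NT=4 DLP_IC_NT=8 dlp_2d_threads   # prints 32
```

Pick a factorization whose product does not exceed the number of cores you intend to use.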

My application uses OpenMP too. Will threads conflict?

Potentially. If both your application and DLP use OpenMP, the threads can over-subscribe the available cores. To reduce the risk:

  • Set DLP_NUM_THREADS to control DLP independently
  • Avoid calling DLP from within an OpenMP parallel region
  • Use dlp_thread_set_num_threads() to limit DLP threads when needed

How many threads are used by default?

It depends on the build's threading model. OpenMP builds default to the OpenMP thread count (typically the number of available cores). Serial (non-OpenMP) builds run single-threaded, so the thread count is always 1.

See Environment Variables for full threading configuration.

Performance

Why is my BF16 code running slower than expected?

On CPUs without native AVX512_BF16 (Zen1-3, Intel pre-Cooper Lake), AOCL-DLP transparently falls back to f32 kernels. This means:

  • BF16 inputs are converted to f32 before computation
  • Computation runs on f32 kernels
  • Output is converted back to bf16 if needed

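This fallback produces correct results but incurs conversion overhead and uses 2x memory bandwidth, since each converted f32 element occupies 4 bytes instead of the 2 bytes of a bf16 element. Check your hardware with grep avx512_bf16 /proc/cpuinfo. See Library Overview for details.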
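
One way to read the 2x figure is through element sizes: bf16 stores 2 bytes per element and f32 stores 4. The sketch below works this out for an arbitrary 4096 x 4096 matrix; matrix_bytes is a hypothetical helper and the dimensions are illustrative only.

```shell
# matrix_bytes ROWS COLS BYTES_PER_ELEMENT
# Prints the storage size of a dense matrix in bytes.
matrix_bytes() {
    echo $(( $1 * $2 * $3 ))
}

rows=4096
cols=4096
bf16=$(matrix_bytes "$rows" "$cols" 2)   # bf16: 2 bytes per element
f32=$(matrix_bytes "$rows" "$cols" 4)    # f32: 4 bytes per element
echo "bf16: $bf16 bytes, f32: $f32 bytes, ratio: $(( f32 / bf16 ))x"
```

For these dimensions the bf16 matrix is 32 MiB and its f32 copy is 64 MiB, so every pass over the converted data moves twice as many bytes.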

How do I get the best performance on multi-socket systems?

Use NUMA-aware thread binding:

OMP_WAIT_POLICY=active \
OMP_NUM_THREADS=128 \
OMP_PLACES=cores \
OMP_PROC_BIND=close \
numactl --cpunodebind=1 --interleave=1 \
./your_application

See the Performance Guide and Environment Variables for detailed tuning.
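
Before choosing values for --cpunodebind and --interleave, it helps to know how many NUMA nodes the machine exposes. The sketch below counts them from sysfs on Linux; count_numa_nodes is a hypothetical helper, and numactl --hardware reports the same topology in more detail.

```shell
# count_numa_nodes [SYSFS_NODE_DIR]
# Counts nodeN directories under the sysfs NUMA directory
# (default: /sys/devices/system/node). Prints 0 if none exist.
count_numa_nodes() {
    base=${1:-/sys/devices/system/node}
    n=0
    for d in "$base"/node[0-9]*; do
        if [ -d "$d" ]; then
            n=$(( n + 1 ))
        fi
    done
    echo "$n"
}

echo "NUMA nodes: $(count_numa_nodes)"
```

Node numbers passed to numactl start at 0, so a machine reporting 2 nodes accepts 0 and 1.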

Should I reorder my weight matrices?

Yes, if you reuse the same weight matrix across multiple GEMM calls (common in inference). Reordering transforms the matrix into a cache-friendly layout that the GEMM kernel accesses optimally. The overhead of a single reorder call is amortized across many subsequent GEMM calls.

See GEMM Guide - Matrix Reordering.

API Usage

How do I get the library version at runtime?

int major, minor, patch;
dlp_version_query(&major, &minor, &patch);
printf("AOCL-DLP version: %d.%d.%d\n", major, minor, patch);

See the version.c example in the examples directory.

How do I check for errors after a GEMM call?

Pass a dlp_metadata_t struct and inspect error_hndl.error_code:

dlp_metadata_t meta = {0};
aocl_gemm_f32f32f32of32('R', 'N', 'N', m, n, k,
    1.0f, a, lda, 'N', b, ldb, 'N', 0.0f, c, ldc, &meta);

if (meta.error_hndl.error_code != DLP_CLSC_SUCCESS) {
    printf("Error: %d\n", meta.error_hndl.error_code);
}

Error codes are defined in dlp_errors.h. See the GEMM Guide for the full list.

