Abhiram S edited this page Oct 23, 2025 · 11 revisions

AOCL-DLP Documentation Hub

Welcome to the AOCL-DLP (AMD Optimizing CPU Libraries - Deep Learning Primitives) documentation hub! This wiki serves as your central navigation point to all AOCL-DLP documentation resources.

About AOCL-DLP

AOCL-DLP is a high-performance library of optimized deep learning primitives for AMD processors. It implements Low Precision GEMM (LPGEMM) operations for machine learning applications, with support for multiple data types, pre-operations, and post-operations. The library is tuned for the AVX2, AVX512, AVX512_VNNI, and AVX512_BF16 instruction sets available on AMD hardware.

Key Features

  • Highly Optimized GEMM Operations - High-performance matrix multiplication targeting AMD CPUs
  • Multiple Data Type Support - Various precision formats (float, bfloat16, int8, uint8, int32)
  • Comprehensive Post-Operations - SUM, ELTWISE, BIAS, SCALE, MATRIX_ADD, MATRIX_MUL
  • Batch GEMM Support - Optimized for handling multiple GEMM operations
  • Symmetric Quantization - Specialized routines for quantized operations
  • Thread Optimization - Parallel execution via OpenMP

📚 Documentation Resources

Project Information

Community & Support

Development Tools & Coverage

🔧 API Documentation

Core APIs (C Interface)

GEMM API

High-performance General Matrix Multiplication operations supporting various data type combinations:

  • Float Operations: f32f32f32of32
  • Bfloat16 Operations: bf16bf16f32of32, bf16bf16f32obf16, bf16s4f32of32, bf16s4f32obf16
  • Integer Operations (u8 × s8): u8s8s32os32, u8s8s32os8, u8s8s32ou8, u8s8s32of32, u8s8s32obf16
  • Integer Operations (s8 × s8): s8s8s32os32, s8s8s32os8, s8s8s32ou8, s8s8s32of32, s8s8s32obf16
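
The type-combination names above appear to encode, in order, the A matrix type, the B matrix type, the accumulation type, and (after the `o`) the output type. Under that reading, `u8s8s32os32` multiplies a uint8 matrix by an int8 matrix, accumulates in int32, and stores int32. The sketch below is an unoptimized reference for that one combination. The function name `ref_gemm_u8s8s32os32` and the row-major layout are assumptions made for illustration; this is not the AOCL-DLP API, whose actual signatures are in the GEMM API documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference (unoptimized) GEMM for the u8s8s32os32 type combination:
 * A is uint8, B is int8, products accumulate in int32, C is int32.
 * Hypothetical helper for illustration; not an AOCL-DLP function.
 * All matrices are row-major: A is m x k, B is k x n, C is m x n. */
void ref_gemm_u8s8s32os32(size_t m, size_t n, size_t k,
                          const uint8_t *a, const int8_t *b, int32_t *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0; /* the "s32" accumulator */
            for (size_t p = 0; p < k; ++p) {
                /* Widen both operands before multiplying. */
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            }
            c[i * n + j] = acc; /* "os32": store the accumulator as-is */
        }
    }
}
```

The other output variants (`os8`, `ou8`, `of32`, `obf16`) would differ only in how the int32 accumulator is converted before the final store.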

📖 Detailed API Reference: GEMM API Documentation

Batch GEMM API

Optimized batch processing for multiple GEMM operations with support for all data type combinations.
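
Conceptually, a batch GEMM runs many independent GEMM problems in one call, which lets the implementation spread them across threads. The sketch below shows that idea for float matrices. The struct `ref_gemm_problem`, the function names, and the OpenMP scheduling choice are all assumptions for illustration and do not reflect the library's actual batch interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one row-major f32 GEMM problem in a batch. */
typedef struct {
    size_t m, n, k;
    const float *a; /* m x k */
    const float *b; /* k x n */
    float *c;       /* m x n, overwritten with A * B */
} ref_gemm_problem;

/* Naive single-problem GEMM used by the batch driver below. */
static void ref_gemm_f32(const ref_gemm_problem *g)
{
    for (size_t i = 0; i < g->m; ++i) {
        for (size_t j = 0; j < g->n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < g->k; ++p) {
                acc += g->a[i * g->k + p] * g->b[p * g->n + j];
            }
            g->c[i * g->n + j] = acc;
        }
    }
}

/* Run every problem in the batch. Problems are independent, so the
 * loop over the batch can be parallelized when OpenMP is enabled;
 * otherwise it simply runs serially. */
void ref_batch_gemm_f32(const ref_gemm_problem *batch, size_t count)
{
    assert(batch != NULL || count == 0);
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic)
#endif
    for (ptrdiff_t g = 0; g < (ptrdiff_t)count; ++g) {
        ref_gemm_f32(&batch[g]);
    }
}
```

Each problem may have its own dimensions, which is why the sketch uses dynamic scheduling rather than assuming equal work per entry.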

📖 Detailed API Reference: Batch GEMM Documentation

Post-Operations API

Comprehensive post-processing operations that can be chained with GEMM:

  • SUM: Element-wise addition with scaling and zero point
  • ELTWISE: Activation functions (RELU, PRELU, GELU_TANH, GELU_ERF, CLIP, SWISH, TANH, SIGMOID)
  • BIAS: Bias addition to results
  • SCALE: Scaling operations
  • MATRIX_ADD/MUL: Matrix operations with scaling
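
To make "chained" concrete, the sketch below applies three of the listed post-operations to a GEMM result in sequence: BIAS, then an ELTWISE RELU, then SCALE. It runs them as separate passes over the output purely for readability. The function names are hypothetical, and how AOCL-DLP describes, orders, and executes a post-op chain is documented in the Post-Operations reference, not here.

```c
#include <assert.h>
#include <stddef.h>

/* BIAS: add a per-column bias vector (length n) to each row of C (m x n). */
void ref_postop_bias_f32(float *c, size_t m, size_t n, const float *bias)
{
    assert(c != NULL && bias != NULL);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            c[i * n + j] += bias[j];
        }
    }
}

/* ELTWISE (RELU): clamp negative values to zero. */
void ref_postop_relu_f32(float *c, size_t count)
{
    assert(c != NULL);
    for (size_t i = 0; i < count; ++i) {
        if (c[i] < 0.0f) {
            c[i] = 0.0f;
        }
    }
}

/* SCALE: multiply every element by a single scale factor. */
void ref_postop_scale_f32(float *c, size_t count, float scale)
{
    assert(c != NULL);
    for (size_t i = 0; i < count; ++i) {
        c[i] *= scale;
    }
}

/* Hypothetical chain: BIAS -> RELU -> SCALE applied to an m x n result. */
void ref_apply_bias_relu_scale(float *c, size_t m, size_t n,
                               const float *bias, float scale)
{
    ref_postop_bias_f32(c, m, n, bias);
    ref_postop_relu_f32(c, m * n);
    ref_postop_scale_f32(c, m * n, scale);
}
```

Order matters in such a chain: applying RELU before BIAS would generally give a different result than the order shown.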

📖 Detailed API Reference: Post-Operations Documentation

Quantization API

Specialized symmetric quantization routines:

  • s8s8s32of32_sym_quant
  • s8s8s32obf16_sym_quant
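
Symmetric quantization maps floating-point values to int8 with a scale factor and no zero point, so that 0.0 always maps to 0. The sketch below shows one common scheme (a single per-tensor scale chosen so the largest magnitude maps to 127) and how an int32 accumulator from an s8 × s8 product can be converted back to float. The helper names and the per-tensor granularity are assumptions; the routines listed above may use different scale granularity or rounding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick a per-tensor scale so that the largest |x| maps to 127.
 * Returns 1.0f for an all-zero (or empty) input to avoid dividing by zero. */
float ref_sym_quant_scale(const float *x, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float v = x[i] < 0.0f ? -x[i] : x[i];
        if (v > max_abs) {
            max_abs = v;
        }
    }
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

/* Quantize to int8 with round-half-away-from-zero and clamping to
 * [-127, 127], keeping the range symmetric around zero. */
void ref_sym_quantize_s8(const float *x, size_t n, float scale, int8_t *q)
{
    assert(scale > 0.0f);
    for (size_t i = 0; i < n; ++i) {
        float v = x[i] / scale;
        if (v > 127.0f) {
            v = 127.0f;
        } else if (v < -127.0f) {
            v = -127.0f;
        }
        q[i] = (int8_t)(v >= 0.0f ? (int)(v + 0.5f) : (int)(v - 0.5f));
    }
}

/* Convert an int32 accumulator of s8 x s8 products back to float by
 * applying the scales of both quantized operands. */
float ref_sym_dequantize(int32_t acc, float scale_a, float scale_b)
{
    return (float)acc * scale_a * scale_b;
}
```

Because there is no zero point, dequantizing the accumulator needs only a multiplication by the two scales, which is what makes the symmetric scheme cheap to fuse after an integer GEMM.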

📖 Detailed API Reference: Quantization Documentation

Utility Functions API

Standalone mathematical operations:

  • gelu_tanh_f32 - GELU activation with tanh approximation
  • gelu_erf_f32 - GELU activation using the erf formulation
  • softmax_f32 - Softmax function for float
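
For reference, the standard textbook definitions of these three functions are given below. Whether the library uses exactly these constants, or subtracts the maximum in softmax for numerical stability as written here, is an assumption; consult the Utility Functions reference for the implemented behavior.

```latex
\begin{align*}
\mathrm{GELU}_{\tanh}(x) &= \tfrac{1}{2}\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right) \\
\mathrm{GELU}_{\mathrm{erf}}(x) &= \tfrac{1}{2}\,x\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right) \\
\mathrm{softmax}(x)_i &= \frac{e^{x_i - m}}{\sum_{j} e^{x_j - m}}, \qquad m = \max_j x_j
\end{align*}
```

The erf form is the exact definition of GELU, while the tanh form is a cheaper approximation of it.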

📖 Detailed API Reference: Utility Functions Documentation

Eltwise Operations API

Specialized element-wise operations supporting various input/output type combinations:

  • bf16of32, bf16obf16, f32of32, f32obf16, f32os32, f32os8
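
These names appear to follow an input-type, `o`, output-type pattern, so `f32obf16` would read float input and write bfloat16 output. The conversion at the heart of such a routine can be sketched as below: bfloat16 keeps the upper 16 bits of an IEEE-754 float, and the sketch rounds to nearest-even when narrowing. The helper names are hypothetical, and the rounding mode the library uses is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Narrow f32 to bf16 (stored as uint16_t) with round-to-nearest-even. */
uint16_t ref_f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        /* NaN: truncate and force a mantissa bit so it stays a NaN. */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }
    /* Add 0x7FFF plus the lowest kept bit, then truncate: ties go to even. */
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* Widen bf16 to f32; exact, since bf16 is the top half of an f32. */
float ref_bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical element-wise f32 -> bf16 pass over n values. */
void ref_eltwise_f32obf16(const float *in, uint16_t *out, size_t n)
{
    assert(n == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < n; ++i) {
        out[i] = ref_f32_to_bf16(in[i]);
    }
}
```

Widening is lossless, while narrowing discards 16 mantissa bits, which is why only the f32 to bf16 direction needs a rounding rule.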

📖 Detailed API Reference: Eltwise Operations Documentation

Library Management API

Essential functions for library initialization, cleanup, and configuration.

📖 Detailed API Reference: Library Interface Documentation

📋 Complete API Reference

Comprehensive Documentation

🚀 Getting Started

Quick Start

  • Quick Start Guide - NEW USERS START HERE! Get up and running in 5 minutes
    • Installation
    • Your first AOCL-DLP program
    • Build and run examples
    • Common first-time issues

Integration Guide

  • 📖 Integration Guide - COMPREHENSIVE REFERENCE for integrating AOCL-DLP into your application
    • CMake package integration (recommended)
    • Manual linking instructions
    • Static vs dynamic linking (including the critical --whole-archive flag)
    • Complete working examples
    • Troubleshooting & FAQ

💡 Examples and Tutorials

Code Examples

Explore practical implementations and usage patterns:

Tutorial Documentation

🧪 Testing and Validation

Testing Framework

Comprehensive testing suite for validation and performance benchmarking:

  • Integration Tests - End-to-end workflow testing
  • Performance Benchmarks - Speed and accuracy measurements
  • Correctness Validation - Numerical accuracy verification

📖 Testing Documentation: Testing Framework Guide

🏗️ Development Resources

Build System

  • CMake Configuration - Modern CMake-based build system
  • Cross-platform Support - Linux and Windows
  • Dependency Management - Automated dependency resolution
  • Build Options - Customizable build configurations

Development Tools

  • Code Formatting - Consistent code style with clang-format
  • Pre-commit Hooks - Automated code quality checks
  • Continuous Integration - Automated testing and validation

Advanced Development

⚡ Performance and Hardware Support

Optimized for AMD Hardware

  • Zen1+ Architecture - AVX2/FMA3 support
  • Zen4+ Architecture - AVX512, AVX512_VNNI, AVX512_BF16 support
  • x86_64 Compatibility - Runs on any compatible x86_64 CPU
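
A library that targets several instruction-set tiers typically checks the CPU at run time and picks the best available kernel. The sketch below shows one way an application could perform a similar check itself, using the GCC/Clang builtin `__builtin_cpu_supports`. The enum and function names are hypothetical, this is not how AOCL-DLP is known to dispatch internally, and AVX512_BF16 detection is left out because support for that feature string varies with compiler version.

```c
#include <assert.h>

/* Hypothetical ISA tiers, ordered from least to most capable. */
typedef enum {
    REF_ISA_GENERIC = 0,
    REF_ISA_AVX2,
    REF_ISA_AVX512,
    REF_ISA_AVX512_VNNI
} ref_isa_level;

/* Detect the highest tier supported by the running CPU.
 * Falls back to REF_ISA_GENERIC on non-x86_64 targets or when the
 * GCC/Clang CPU-detection builtins are unavailable. */
ref_isa_level ref_detect_isa(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) {
        return __builtin_cpu_supports("avx512vnni") ? REF_ISA_AVX512_VNNI
                                                    : REF_ISA_AVX512;
    }
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")) {
        return REF_ISA_AVX2;
    }
#endif
    return REF_ISA_GENERIC;
}

/* Human-readable name for a tier, e.g. for logging. */
const char *ref_isa_name(ref_isa_level level)
{
    static const char *const names[] = {
        "generic", "avx2", "avx512", "avx512_vnni"
    };
    assert((int)level >= 0 && (int)level < 4);
    return names[level];
}
```

Calling `ref_isa_name(ref_detect_isa())` at start-up gives a quick indication of which tier from the list above the current machine falls into.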

Performance Features

  • NUMA Awareness - Optimized memory access patterns
  • Thread Scaling - Efficient parallel execution
  • Memory Management - Optimized data layouts and caching

This documentation hub is continuously updated. For the latest information, please refer to the linked resources above.
