• AutoGEMM
  • I - Introduction to AutoGEMM
  • II - Achieving Peak GEMM Performance
    • 1. Global memory bandwidth as a limiting factor
    • 2. Global memory latency as a limiting factor
    • 3. Local memory bandwidth as a limiting factor
    • 4. Local memory latency as a limiting factor
    • 5. Kernel throughput is high
    • 6. High percentage of MADD instructions
    • 7. Special considerations for small MxN matrices
    • 8. Special considerations for small K matrices
    • 9. Special considerations for "skinny" matrices
  • III - Architecture
    • AutoGEMM Parameters
    • Kernel selection data
    • Generating the kernel selection data
    • AutoGEMM generated files
    • "Building" AutoGEMM with CMake for clBLAS
  • IV - Customizing AutoGEMM to your needs
    • Customize kernel assortment for an application
    • Tuning kernel selection to new GPU