
# SYCL Introduction Extras



# Profiling SYCL applications on DevCloud
The first steps for running the profiler are described in the [Intel® VTune™ Profiler](https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/using-vtune-server-with-vs-code-intel-devcloud.html#FINISH) documentation.

I outline the steps I took to simplify matters for others looking to do the same. *Here we go!*


# oneMKL
oneMKL, or oneAPI Math Kernel Library, is a comprehensive and highly optimized library developed by Intel. It offers a wide range of mathematical functions designed to maximize performance on various computing architectures, including CPUs, GPUs, and other accelerators.

A description of **oneMKL**, together with the exercise and example, is available here: [oneMKL](https://github.com/codeplaysoftware/syclacademy/tree/main/Code_Exercises/OneMKL_gemm).

Navigate into `syclacademy/Code_Exercises/OneMKL_gemm/` to find the exercise files `source_onemkl_buffer_gemm.cpp` and `source_onemkl_usm_gemm.cpp`, one for *buffers* and one for *USM*.

I explain the solution below. If you want to work on the exercise yourself, do so now before continuing.

### Matrix product using buffers

1. Initialize random matrices and compute the reference product on the host:

```cpp
  // A(M, N)
  for (size_t i = 0; i < M; i++)
    for (size_t j = 0; j < N; j++)
      A[i * N + j] = dis(gen);
  // B(N, P)
  for (size_t i = 0; i < N; i++)
    for (size_t j = 0; j < P; j++)
      B[i * P + j] = dis(gen);

  // Reference result computed serially on the host: C_host = A * B
  for (size_t i = 0; i < M; i++) {
    for (size_t j = 0; j < P; j++) {
      for (size_t d = 0; d < N; d++) {
        C_host[i * P + j] += A[i * N + d] * B[d * P + j];
      }
    }
  }
```
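The snippet above uses `dis` and `gen` without showing where they come from. Here is a minimal sketch of the host-side setup, assuming a `std::mt19937` generator and a uniform distribution on [0, 1); the exercise may use a different range or seed. The helper names `make_random_matrix` and `matmul_reference` are hypothetical and only serve this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fill a rows x cols row-major matrix with random values.
// The fixed seed and the [0, 1) range are assumptions, chosen for reproducibility.
template <typename T>
std::vector<T> make_random_matrix(std::size_t rows, std::size_t cols,
                                  unsigned seed) {
  std::mt19937 gen(seed);
  std::uniform_real_distribution<T> dis(T(0), T(1));
  std::vector<T> mat(rows * cols);
  for (auto &x : mat) x = dis(gen);
  return mat;
}

// Hypothetical helper: serial row-major reference product
// C(M, P) = A(M, N) * B(N, P).
template <typename T>
std::vector<T> matmul_reference(const std::vector<T> &A,
                                const std::vector<T> &B, std::size_t M,
                                std::size_t N, std::size_t P) {
  // C must start at zero because the inner loop accumulates into it.
  std::vector<T> C(M * P, T(0));
  for (std::size_t i = 0; i < M; i++)
    for (std::size_t j = 0; j < P; j++)
      for (std::size_t d = 0; d < N; d++)
        C[i * P + j] += A[i * N + d] * B[d * P + j];
  return C;
}
```

Note that the accumulation with `+=` only gives the right answer if the result matrix is zero-initialized first, which is why the sketch allocates `C` with an explicit `T(0)` fill.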

2. Declare Queue and device:

```cpp
  // Create a SYCL in-order queue targeting a GPU device
  sycl::queue Q{sycl::gpu_selector_v, sycl::property::queue::in_order{}};
  // Prints some basic info related to the hardware
  print_device_info(Q);
```

3. Declare buffers:

```cpp
  // Create 1D buffers for the matrices, each bound to its host array.
  // While these buffers are alive, access the data through accessors only,
  // not through A, B, or C_host directly.
  sycl::buffer<T, 1> a{A.data(), sycl::range<1>{M * N}};
  sycl::buffer<T, 1> b{B.data(), sycl::range<1>{N * P}};
  sycl::buffer<T, 1> c{C_host.data(), sycl::range<1>{M * P}};
```

4. Use the **oneMKL GEMM** routine for the matrix multiplication:

```cpp
  // Use the oneMKL GEMM buffer API; note that b is passed before a
  oneapi::mkl::transpose transA = oneapi::mkl::transpose::nontrans;
  oneapi::mkl::transpose transB = oneapi::mkl::transpose::nontrans;
  oneapi::mkl::blas::column_major::gemm(Q, transA, transB, n, m, k, alpha, b,
                                        ldB, a, ldA, beta, c, ldC);
  Q.wait();
  sycl::host_accessor C_device{c};
```
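The call above passes `b` before `a` even though the goal is `C = A * B`. The snippet does not show how `m`, `n`, `k` and the leading dimensions are defined, but a likely reason is the usual layout trick: a row-major `M x P` matrix occupies the same memory as a column-major `P x M` matrix, so a column-major GEMM that computes `C^T = B^T * A^T` writes exactly the row-major `C`. Under that reading, the routine would receive the dimensions `P`, `M`, `N` and the leading dimensions `P` (for `b`), `N` (for `a`) and `P` (for `c`). The sketch below checks the idea on the host with a naive column-major GEMM; `gemm_col_major` and `matmul_row_major_via_col_major` are hypothetical stand-ins, not oneMKL APIs:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive column-major GEMM without transposition (host stand-in, not oneMKL):
// C(m x n) = alpha * A(m x k) * B(k x n) + beta * C,
// where element (i, j) of a matrix lives at index [i + j * ld].
template <typename T>
void gemm_col_major(std::size_t m, std::size_t n, std::size_t k, T alpha,
                    const T *A, std::size_t lda, const T *B, std::size_t ldb,
                    T beta, T *C, std::size_t ldc) {
  for (std::size_t j = 0; j < n; j++)
    for (std::size_t i = 0; i < m; i++) {
      T acc = T(0);
      for (std::size_t d = 0; d < k; d++)
        acc += A[i + d * lda] * B[d + j * ldb];
      C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
    }
}

// Row-major C(M, P) = A(M, N) * B(N, P), computed as the column-major
// product C^T = B^T * A^T by swapping the operands.
template <typename T>
std::vector<T> matmul_row_major_via_col_major(const std::vector<T> &A,
                                              const std::vector<T> &B,
                                              std::size_t M, std::size_t N,
                                              std::size_t P) {
  std::vector<T> C(M * P, T(0));
  // Row-major B(N, P) read as column-major is B^T (P x N), leading dim P.
  // Row-major A(M, N) read as column-major is A^T (N x M), leading dim N.
  // The column-major result C^T (P x M) has leading dim P.
  gemm_col_major<T>(P, M, N, T(1), B.data(), P, A.data(), N, T(0), C.data(),
                    P);
  return C;
}
```

The advantage of this approach is that no explicit transpose or copy is needed: only the argument order and the dimension arguments change.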

Routines like **GEMM** are a good example of why such libraries are useful: vendors provide efficient, optimized implementations for their specific hardware, and we can use them through oneMKL.

> We don't have to do the heavy lifting ourselves; we let **oneMKL** do it for us.
 

# TODO
1. check devices 
2. Review performance 
3. Time output check
