# 01 - Introduction to SYCL Programming for GPUs

Argonne Leadership Computing Facility, UChicago Argonne, LLC, All rights reserved

Heterogeneous systems, in which CPUs work alongside GPUs and other
accelerators, are now common in high-performance computing, and using
them well has become increasingly important. **SYCL** is a
single-source programming model built on modern C++ that targets such
systems. It abstracts much of the complexity of programming
accelerators directly, which makes it approachable for both novice and
experienced developers.

### What is SYCL?

SYCL is an open standard developed by the Khronos Group, the same group
responsible for OpenGL and OpenCL. It allows developers to write code
for heterogeneous systems in standard C++. The same source can target
CPUs, GPUs, DSPs, FPGAs, and other types of accelerators, provided a
SYCL implementation supports the device. SYCL builds upon the
foundation laid by OpenCL, offering a higher level of abstraction and
deeper integration with C++.

<img width="800" src=https://www.khronos.org/assets/uploads/apis/2022-sycl-diagram.jpg>
Image Source. https://www.khronos.org/sycl/

<img src=https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/495ff2bb29b50698e9c6d3b12f7d8cf476e73d02/DirectProgramming/C++SYCL/Jupyter/oneapi-essentials-training/01_oneAPI_Intro/Assets/oneapi1.png>

### Advantages of SYCL

One of the primary advantages of SYCL is that it is built on modern
C++. SYCL 2020 is based on C++17, so SYCL code can use features such as
lambda functions, type deduction with `auto`, and templates. This
integration improves the readability of the code and lets it benefit
from the type safety and compiler optimizations of modern C++. Key
benefits include:

* **Single-Source Development**: Host and device code live in the same
  C++ source files, instead of separate code bases for different
  architectures. This simplifies development and reduces maintenance
  burdens.
* **Cross-Platform Portability**: SYCL code can be executed on any
  device for which a compatible SYCL implementation is available.
* **Performance**: SYCL allows fine control over parallel execution and
  memory management, which are critical for good performance on GPUs,
  so portability does not have to come at the cost of performance.

As GPUs continue to play a crucial role in fields ranging from
scientific computing to machine learning, mastering SYCL can provide
developers with the tools needed to fully exploit the capabilities of
these powerful devices. The following sections will guide you through
setting up your development environment, understanding the core concepts
of SYCL, and walking you through practical examples to kickstart your
journey in high-performance computing with SYCL.

------------------------------------------------------------------------


# Basics of a SYCL Kernel 

In SYCL, all computations are submitted through a queue. A queue is associated with one device, and any computation submitted to the queue is executed on that device.
The device is chosen with a device selector. A queue built from `sycl::gpu_selector_v` throws a `sycl::exception` at construction if no GPU is available:
```c++
// Choose ONE device selector (the alternatives are shown commented out)
auto selector = sycl::gpu_selector_v;            // a GPU
// auto selector = sycl::default_selector_v;     // the implementation's default device
// auto selector = sycl::cpu_selector_v;         // a CPU
// auto selector = sycl::accelerator_selector_v; // a device of type accelerator
```

# Creating a queue
```c++
// Create a queue using the GPU selector
auto myQueue = sycl::queue{selector};
```



# Understanding SYCL Kernel Command Group Execution

 A command group is a fundamental construct that encapsulates a set of operations meant to be executed on a device.

```c++
myQueue.submit([&](sycl::handler &cgh) {
  /* Command group function */
});
```

<img width="255" alt="" src="images/image11.png" >

> The diagram illustrates the process of defining and submitting a SYCL command group.
> It begins with a call to the `submit` function on a SYCL queue, which initiates the creation of a command group.
> The `submit` function takes a command group function as its argument. The runtime creates a command group handler and passes it to that function as `cgh`.
> Inside the command group function, the handler is used to specify dependencies, set up accessors for the memory objects that the kernel will use, and define the kernel function. Once these elements are defined, the command group is assembled and ready for execution on the device.



# Enqueuing A Kernel


SYCL offers two methods for managing data: 

1. **Buffer/Accessor Model:** This model uses buffers to store data and
   accessors to read or write data, ensuring safe memory management and
   synchronization.

2. **Unified Shared Memory (USM) Model:** This model allows for direct
   data sharing between the host and device, simplifying memory
   management by eliminating the need for explicit buffers and
   accessors.

# Scheduling

A scheduler is a component responsible for managing the order and
execution of tasks on computational resources.

#### Scheduling Overview
<img width="600" src="images/image33.png">

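-   When the **submit** function is called, the runtime creates a
    command group handler (**`cgh`**), runs the command group function
    with it, and passes the resulting command group to the scheduler.
-   The scheduler is responsible for executing the commands on the
    designated target device once their dependencies are satisfied.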

# Command Groups

A command group is a fundamental construct that encapsulates a set of
operations meant to be executed on a device.

<img width="305" src="images/image11.png" >



-   Command groups are defined by calling the **submit** function on the
    queue.
-   The **submit** function takes a command group function, typically a
    lambda, as its argument.
-   The runtime creates a command group handler and passes it to that
    function as the parameter `cgh`.
-   The function then uses `cgh` to assemble the command group, for
    example by declaring accessors and enqueuing a kernel.

``` c++
myQueue.submit([&](sycl::handler &cgh) {
  /* Command group function */
});
```

### Lambda functions 

In SYCL, lambdas play a crucial role similar to their use in general programming, but they are specifically tailored for defining operations on data that will be executed on parallel devices like GPUs and CPUs. Like in other programming contexts, lambdas in SYCL allow for writing concise, anonymous functions. This capability is especially valuable in SYCL due to the nature of parallel computing, where operations often need to be defined locally and executed across a range of data elements.

Lambdas in SYCL are structured similarly to standard C++ lambdas, but are specifically utilized within the SYCL framework to define the functionality of kernels that execute on parallel compute devices. The basic syntax of a lambda in SYCL can be summarized as follows:

```cpp
[capture_clause](input_signature) -> return_specification {
    // execution_block
}
```

In the context of SYCL you typically encounter the following types of captures:

- `[]` : Captures nothing from the enclosing scope. This is used when the lambda does not need to access any external variables.

- `[&]` : Captures all accessible variables from the surrounding scope by reference. Useful when you need to modify the external variables or when copying them is expensive. In SYCL this is the usual choice for the command group function passed to `submit`, which runs on the host.

- `[=]` : Captures all accessible variables from the surrounding scope by value. This is safe when the lambda is executed asynchronously or on a separate device, because it works with its own consistent copy of the data. SYCL kernel lambdas must capture by value.
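The three capture modes can be tried out in plain C++ before using them in a kernel. The following is a minimal host-only sketch with no SYCL dependency; the function names `square`, `sum_by_reference`, and `scale_by_value` are hypothetical and exist only for this illustration.

```cpp
#include <vector>

// [] captures nothing: the lambda only uses its own parameter.
int square(int x) {
    auto f = [](int v) { return v * v; };
    return f(x);
}

// [&] captures by reference: the lambda modifies the caller's variable.
int sum_by_reference(const std::vector<int>& values) {
    int total = 0;
    auto add = [&](int v) { total += v; };
    for (int v : values) {
        add(v);
    }
    return total;
}

// [=] captures by value: the lambda keeps its own copy of `factor`,
// so a later change to the original variable is not seen by the lambda.
int scale_by_value(int x) {
    int factor = 2;
    auto scale = [=](int v) { return v * factor; };
    factor = 10;  // does not affect the copy held by `scale`
    return scale(x);
}
```

The by-value behavior in `scale_by_value` is the property that matters for device code: the lambda carries its own copy of the data and does not refer back to host variables.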

For example, when defining a SYCL kernel, a developer might use a lambda to specify the computation that each thread should perform on the elements of a buffer. This lambda can capture necessary variables from its surrounding scope to use within the kernel execution:

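```c++
// Assumes `data` points to `data_size` floats on the host and `myQueue` is a sycl::queue
sycl::buffer<float, 1> buf(data, sycl::range<1>(data_size));
myQueue.submit([&](sycl::handler& cgh) {
    // The command group lambda runs on the host and captures by reference
    sycl::accessor acc{buf, cgh, sycl::read_write};
    // The kernel lambda runs on the device and captures by value
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] *= 2; // Example operation: double each element
    });
});
```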

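## Enqueuing a SYCL Kernel Function: `single_task` Example

`single_task` enqueues a kernel that runs exactly once on the device, as a single work-item. The example below selects a GPU, creates a queue, and submits a command group whose kernel prints a message through a `sycl::stream`, which buffers output written inside a kernel.

```c++
// Select a GPU device
auto gpu_selector = sycl::gpu_selector_v;

auto myQueue = sycl::queue{gpu_selector};

// Submit a command group to the queue
myQueue.submit([&](sycl::handler &cgh) {
    // Create a stream for output within the kernel
    // (total buffer size, per-work-item buffer size, handler)
    auto os = sycl::stream{128, 128, cgh};
    // Execute a single task
    cgh.single_task([=]() {
      os << "Hello World!" << sycl::endl;
    });
}).wait(); // Wait for the submitted command group to complete
```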

**<font color="red">SEE example [00-hello.ipynb](examples/00-hello.ipynb)</font>**

# Managing Data


## Buffers & Accessors

Buffers and accessors are used in SYCL for managing and accessing data
across different computing devices, including CPUs, GPUs, and other
accelerators:


<img width="600" src="images/image22.png">
Diagram illustrating the relationship between buffers, accessors, and
devices

-   **Buffers**: Buffers are used to manage data across the host and
    various devices. A buffer abstractly represents a block of data and
    handles the storage, synchronization, and consistency of this data
    across different memory environments. When a buffer object is
    constructed, it does not immediately allocate or copy data to the
    device memory. This allocation or transfer only occurs when the
    runtime determines that a device needs to access the data,
    optimizing memory usage and data transfer.


-   **Accessors**: Accessors are used to request access to data that is
    stored in buffers. They specify how and when the data in a buffer
    should be accessed by a kernel function, either on the host or a
    specific device. Accessors help in defining the required access
    pattern (read, write, or read/write) and are crucial for ensuring
    data consistency and coherency between the host and devices.


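**Example: Buffer/Accessor Model**
```c++
constexpr size_t N = 1024;       // Number of elements (illustrative value)
std::vector<int> vectorA(N, 1);  // Vector A filled with 1s

// Buffer wrapping the host data in vectorA
sycl::buffer<int> bufA{vectorA.data(), sycl::range<1>{vectorA.size()}};
// or, with class template argument deduction:
// auto bufA = sycl::buffer{vectorA.data(), sycl::range{N}};

// Accessor (constructed inside a command group, where cgh is the sycl::handler)
sycl::accessor accA{bufA, cgh, sycl::read_only};
```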

### Buffers

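A buffer can also be created without host data, in which case the runtime provides uninitialized storage of the requested size. A kernel fills the buffer through an accessor, and the host reads the result through a `host_accessor`.

```c++
constexpr size_t size = 10;
// A buffer is the memory object used to transfer data between host and device
sycl::buffer<int> A{sycl::range<1>{size}};
// cgh is the handler that defines the command group containing the kernel
myQueue.submit([&](sycl::handler &cgh) {
    // The accessor gives the kernel access to the buffer elements
    sycl::accessor accA{A, cgh};
    cgh.parallel_for(sycl::range<1>{size}, [=](sycl::id<1> i) {
        accA[i] = static_cast<int>(i[0]);
    });
});

// host_accessor lets the host access the buffer; it waits for the kernel to finish
sycl::host_accessor result(A);
```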

## **Unified Shared Memory (USM) Model:** 
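This model allows for direct data sharing between the host and device, simplifying memory management by eliminating the need for explicit buffers and accessors. Compared with the buffer/accessor model, the data is allocated and initialized as follows:

```c++
// Allocate shared memory with USM: the pointer is usable on both host and device
float* usmA = sycl::malloc_shared<float>(N, myQueue);

// Initialize the USM allocation from the host
std::copy(vectorA.begin(), vectorA.end(), usmA);

// ... submit kernels that use usmA ...

// Release the allocation when it is no longer needed
sycl::free(usmA, myQueue);
```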

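# How to compile SYCL code

With the Intel oneAPI DPC++/C++ compiler (`icpx`), the `-fsycl` flag enables SYCL compilation. Here `compute.cpp` is the source file and `a.out` is the resulting executable:
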
```bash
icpx -fsycl compute.cpp -o ./a.out
```


### Parallel_for

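`parallel_for` launches the kernel once for every index in the given range, and the invocations may run in parallel. Each invocation receives its own index, which it typically uses to select the element it works on. In the example below, `bufA` is a `sycl::buffer<int>` with `N` elements:

```c++
myQueue.submit([&](sycl::handler &cgh) {
    sycl::accessor accA{bufA, cgh, sycl::write_only};
    // One kernel invocation per index in [0, N); each writes its own index
    cgh.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> idx) {
        accA[idx] = static_cast<int>(idx[0]);
    });
});
```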
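
As a mental model, a one-dimensional `parallel_for` over `N` elements calls the kernel once for each index from `0` to `N - 1`, with no guaranteed order between calls. The sketch below is plain C++ with no SYCL dependency. `host_parallel_for` and `fill_with_indices` are hypothetical helpers written only for this illustration, and the helper runs the calls serially on the host, whereas a SYCL device may run them in parallel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for a 1-D parallel_for:
// calls kernel(i) once for every i in [0, n), one after another.
template <typename Kernel>
void host_parallel_for(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) {
        kernel(i);
    }
}

// Fills a vector so that element i holds the value i, mirroring a
// kernel in which every invocation writes its own index.
std::vector<int> fill_with_indices(std::size_t n) {
    std::vector<int> out(n);
    int* data = out.data();  // captured by value, as a kernel lambda would do
    host_parallel_for(n, [=](std::size_t i) { data[i] = static_cast<int>(i); });
    return out;
}
```

Because each call touches only its own element, the result does not depend on the order of the calls. That independence is what allows a device to run the invocations concurrently.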