# 1. Introduction to SYCL Programming for GPUs

Argonne Leadership Computing Facility, UChicago Argonne, LLC, All rights reserved

In today's fast-paced world of HPC, **heterogeneous systems**, which combine CPUs, GPUs, and other
accelerators, are crucial for scaling computational workloads. However, programming for such diverse
architectures can be a challenge. This is where **SYCL** comes in.

**SYCL (pronounced "sickle")** is an open standard for single-source programming designed to simplify
development for heterogeneous systems. SYCL allows you to write **modern C++** code that runs on CPUs,
GPUs, and other accelerators without having to write separate code for each architecture. Whether you're
a seasoned developer or just starting out, SYCL helps you focus on writing algorithms, while it handles the
underlying complexities of device management and memory allocation.

## What is SYCL?

SYCL is an open standard developed by the **Khronos Group**, the same group
responsible for **OpenCL**. It allows developers to write code for
heterogeneous systems using completely standard C++. This means that the
same code can target CPUs, GPUs, DSPs, FPGAs, and other types of
accelerators without modification. 

SYCL builds upon the foundation laid by OpenCL, offering a higher level of abstraction and deeper integration with C++. While OpenCL requires developers to manage host and device code separately, SYCL allows both to be written in a single, unified C++ source file. This enables a more intuitive and efficient programming experience, making it easier to develop portable and high-performance applications for diverse hardware platforms.

<img width="800" src="https://www.khronos.org/assets/uploads/apis/2022-sycl-diagram.jpg">

Image source: https://www.khronos.org/sycl/

<img src=https://raw.githubusercontent.com/oneapi-src/oneAPI-samples/495ff2bb29b50698e9c6d3b12f7d8cf476e73d02/DirectProgramming/C++SYCL/Jupyter/oneapi-essentials-training/01_oneAPI_Intro/Assets/oneapi1.png>

## Advantages of SYCL

One of the primary advantages of SYCL is its ability to integrate
seamlessly with C++17 and later versions, enabling features like
lambda functions, `auto` type deduction, and templates. This integration not only
improves the programmability and readability of the code but also
leverages the **type safety** and **performance optimizations** provided by
modern C++. Here are a few key benefits:

* **Single-Source Development**: Unlike traditional approaches that might require maintaining separate code bases for different architectures (e.g., separate code for CPUs and GPUs), SYCL unifies the code into a **single source**. This simplifies development and reduces maintenance burdens, making it easier to write code that works across different devices without duplication.
* **Cross-Platform Portability**: SYCL code can be executed on any device that has a compatible SYCL runtime, providing true cross-platform capabilities. Whether you're working with **Intel GPUs, AMD GPUs, NVIDIA GPUs,** or even FPGAs, the same SYCL codebase can be compiled and executed, ensuring broad compatibility.
* **Performance**: With SYCL, developers do not have to sacrifice performance for portability. It allows fine control over **parallel execution** and **memory management**, which are critical for achieving optimal performance on GPUs and other accelerators. SYCL's abstraction ensures you can write high-level code without losing the ability to perform low-level optimizations when needed.

As GPUs continue to play a crucial role in fields ranging from
**scientific computing** to **machine learning**, mastering SYCL can provide
developers with the tools needed to fully exploit the capabilities of
these powerful devices. The following sections will guide you through
setting up your development environment, understanding the core concepts
of SYCL, and walking you through practical examples to kickstart your
journey in high-performance computing with SYCL.


# 2. Setting Up a SYCL Program 
In SYCL, computations are submitted to a **queue**, which is associated with a specific device (such as a CPU, GPU, or accelerator). The **queue** is the core abstraction that handles task submission and ensures that your kernels (functions that run on the device) are executed on the chosen hardware.

## Device Selection
One of SYCL's strengths is the ability to choose the device on which your code will execute. This selection is made through **device selectors**, which help determine whether the code runs on a CPU, GPU, or another accelerator.

Here's how you can request a GPU and initialize a SYCL queue that runs on it:
```c++
// Request a GPU device; constructing the queue throws a sycl::exception if none is found
auto myQueue = sycl::queue{sycl::gpu_selector_v};   // Create a queue for the GPU
```
SYCL 2020 provides the following device selectors (they replace the older class-based forms such as `sycl::gpu_selector{}`, which are now deprecated):
* `sycl::default_selector_v`: Automatically selects the best available device according to the implementation's heuristics (typically a GPU if available, otherwise a CPU or another device).
* `sycl::gpu_selector_v`: Specifically selects a GPU if present.
* `sycl::cpu_selector_v`: Selects a CPU for execution.
* `sycl::accelerator_selector_v`: Selects a specialized accelerator device like an FPGA.
  
Each selector is designed to give developers control over the type of hardware used for computations. If you don't have a GPU or need to run the code on a CPU for testing purposes, `sycl::cpu_selector_v` provides an easy fallback.

## Creating a Queue
After selecting a device, the next step is to create a **queue**. The queue manages the execution tasks (such as kernel functions) on the chosen device. Once a queue is created, you can submit tasks to it.
```c++
// Create a queue using the GPU selector
auto myQueue = sycl::queue{sycl::gpu_selector_v};
```
Here, the queue is associated with a GPU. If no GPU is available, SYCL will throw an exception, which you can handle to provide a fallback, such as using the CPU.


# 3. Managing Data

In SYCL, efficient data management between the host (CPU) and device (GPU or other accelerators) is critical for performance. SYCL provides two main models for managing memory: the **Buffer/Accessor Model** and **Unified Shared Memory (USM)**. Each model has its advantages, and the right choice depends on how much control you need over memory management and the complexity of your application.

Let’s walk through both approaches, how they work, and when you might choose one over the other.

## Buffers & Accessors

The **Buffer/Accessor Model** is the more traditional way of handling memory in SYCL and is often the easiest to use. It abstracts away much of the complexity, allowing you to focus on your computation rather than on manually managing memory.

So, what does this model look like?

### Buffers

A **buffer** represents a block of data, such as an array or vector, that can be accessed by both the host and the device. Buffers are created on the host but can be used by the device when necessary. One of the key benefits of using buffers in SYCL is that they don’t immediately allocate memory on the device. Instead, SYCL waits until the data is actually needed on the device before performing any memory transfers, minimizing unnecessary data movement.

This means SYCL will handle the heavy lifting of moving data between the host and the device, and you won’t have to manage it manually. You can think of buffers as containers that efficiently handle data movement in the background.

### Accessors

Once you’ve created a buffer, the device still needs a way to access that data. That’s where **accessors** come into play. Accessors allow a kernel (the function running on the device) to access the data stored in the buffer. With an accessor, you specify how the data should be accessed—whether it’s **read**, **write**, or **read/write**.

Accessors ensure that the data is properly synchronized between the host and device, which is especially important when you’re working with multiple kernels or devices. SYCL automatically takes care of ensuring that the right data is available at the right time on the device.

Here’s a diagram that shows how buffers and accessors work together in SYCL:

<img width="600" src="images/image22.png">

The diagram above shows the relationship between **buffers**, **accessors**, and the device. Notice how buffers are created on the host, but the actual data may be needed on the device. SYCL takes care of the underlying memory transfers, optimizing when and how data is moved.

**Keep in mind:**
* **Automatic memory management**: You don’t have to manually transfer data between host and device. SYCL takes care of it for you.
* **Data synchronization**: Accessors make sure that the data stays consistent across host and device.
* **Efficient data movement**: SYCL minimizes unnecessary data transfers by only moving data when needed.

## Unified Shared Memory (USM) Model

Now, what if you need more control over how and when data moves between the host and device? That’s where **Unified Shared Memory (USM)** comes in.

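With USM, you allocate memory directly and work with ordinary pointers instead of buffers and accessors. USM offers three kinds of allocation: **host** allocations (`sycl::malloc_host`) live in host memory but can be accessed by the device, **device** allocations (`sycl::malloc_device`) live on the device and must be copied to or from the host explicitly, and **shared** allocations (`sycl::malloc_shared`) can be accessed from both host and device, with the runtime migrating the data as needed. Working with pointers can feel more intuitive for developers used to traditional C++ or other pointer-based programming.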

In contrast to the **Buffer/Accessor Model**, where SYCL manages the memory transfers for you, USM allows you to take control of memory movement. You decide when and how the memory is shared between the host and device. This can be particularly useful when optimizing for performance, or when porting existing code that uses raw pointers.

To make it clearer:
* **Buffers/Accessors**: SYCL moves memory as needed, and you don’t have to worry about it.
* **USM**: You manage memory manually, which gives you more control but also more responsibility.

## Choosing Between Buffer/Accessor Model and USM

Now that you’ve been introduced to both models, how do you decide which one to use? It really depends on how much control you want over memory management and what your application needs.

**Use the Buffer/Accessor Model when:**
* You prefer **automatic memory management**. SYCL handles memory transfers for you, ensuring the correct data is available on the device when needed. This model simplifies development, allowing you to focus on writing kernels rather than managing memory.
* You’re working with **complex data dependencies**. When your application involves multiple kernels or devices, accessors simplify specifying how and when data is accessed. This ensures everything stays synchronized across host and device.
* You need **robust data synchronization**. SYCL ensures that data is automatically synchronized when using accessors, making it easier to manage consistency across different devices.

Think of the **Buffer/Accessor Model** as a fully-managed solution. SYCL does the hard work of managing memory and keeping data synchronized, making it a great choice for most applications, especially when you want to focus on the computation rather than memory management.

**Use USM when:**
* You need **fine-grained control** over memory. USM allows you to directly manage when and how memory is transferred between host and device, which can be useful for performance optimizations or working with specific hardware configurations.
* You want to work with **pointers**. USM gives you direct access to memory using pointers, which can be more intuitive if you’re coming from a background in C++ or CUDA, or if you’re porting existing code.
* You’re working on **performance-critical applications**. USM allows you to fine-tune when data is copied between host and device, giving you more control over the overall performance of your application.

**USM** is perfect for advanced users who need more direct control over memory management and want to optimize the performance of their application.

# 4. Basics of SYCL Kernels

In SYCL, **kernels** are the functions that run on a device (such as a GPU or CPU). A kernel contains the actual code that will be executed in parallel across many processing units, allowing you to take advantage of the computational power of accelerators. To submit kernels for execution, we rely on **command groups**, which organize the work to be performed on the device.

Let’s break this down step by step.

## Understanding SYCL Kernel Command Group Execution

A **command group** in SYCL is like a container that holds everything the device needs to execute—this can include the kernel itself (the code to run), memory management tasks, and any synchronization that needs to happen between the host and device.

When you submit a command group to a **queue**, it gets sent to the device for execution. Within the command group, SYCL manages dependencies between tasks and ensures that operations occur in the right order. This is critical when you’re dealing with parallel processing, as you want to make sure that the right data is available when the kernel needs it and that no tasks are executed out of order.

## Submitting a Command Group

To submit a kernel for execution, you use a **command group**. The process starts with the `submit()` function of the queue. Inside this function, you define the tasks that will run on the device. Here’s a simple example:

```c++
// Submit a command group to the queue
myQueue.submit([&](sycl::handler &cgh) {
    // Command Group Function:
    // Inside this lambda, we define the operations to be performed on the device
    // For example, kernel execution, data transfers, etc.
    // Lambda functions are explored further below
});
```
The `submit()` function takes a **lambda function** as its argument. This lambda function contains the actual operations that will be performed on the device—whether it's running a kernel, moving data, or synchronizing tasks. The **command group handler (cgh)** inside the lambda acts as the go-between for the host (CPU) and the device (GPU, FPGA, etc.). You use the cgh to define:

* **Kernel execution**: Specify the computation that will be carried out on the device.
* **Data dependencies**: Ensure that the right data is available on the device when the kernel runs.
* **Memory management**: Manage access to memory resources like buffers and other data objects.

Here's an example of a command group submission that includes a kernel:

```c++
myQueue.submit([&](sycl::handler &cgh) {
    // Define the operations inside this lambda
    cgh.single_task([=]() {
        // Kernel code: executed on the device
        // Your parallel task or operation goes here
    });
}).wait();  // Wait for the operation to complete
```
In this case, the **kernel** (a function that performs a task) is defined within **single_task()**. The **wait()** function ensures that the host waits for the kernel to finish executing on the device before moving on.

<img width="255" alt="" src="images/image11.png" >

> The diagram above illustrates how SYCL handles command group submission. First, the `submit()` function initiates the process. The command group handler (`cgh`) is created inside the lambda, where you can specify dependencies, define kernels, and set up memory access. Once everything is set, SYCL moves the command group to the device for execution.

## Lambda Functions in SYCL 

SYCL relies heavily on **lambda functions**, which are anonymous functions that allow you to define operations locally. Lambdas are especially useful in SYCL because they make it easy to define what each element in a data set should do in parallel. They’re also a key part of how we submit kernels to the device.

Lambdas in SYCL are similar to those in standard C++, but they have an important role in parallel computing. When you submit a kernel to SYCL, you typically use a lambda to specify the work that needs to be done on each data element.

Here’s the basic structure of a lambda function in SYCL:

```cpp
[capture_clause](input_signature) -> return_specification {
    // execution_block
}
```
* **Capture clause**: Defines what variables from the surrounding scope the lambda can access.
* **Input signature**: Describes the inputs for the lambda (if any).
* **Return specification**: The optional return type. It is usually omitted; SYCL kernel lambdas return `void`.
* **Execution block**: The actual code that will be executed inside the lambda.

In SYCL, you’ll typically see these types of capture clauses:

- `[]` : Captures nothing from the enclosing scope. This is used when the lambda does not need to access any external variables.

- `[&]` : Captures all accessible variables from the surrounding scope by reference. Useful when you need to modify the external variables or when copying them is expensive. In SYCL this form is used for the command group lambda passed to `submit()`, which runs on the host.

- `[=]` : Captures all accessible variables from the surrounding scope by value. This is the form used for kernel lambdas: the kernel runs asynchronously on a separate device, so it must work with its own consistent copy of the captured data.
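
The difference between capturing by value and by reference can be seen in plain C++, without any SYCL involved. The following is a small sketch; `capture_demo` is a hypothetical function name used only for this illustration.

```cpp
#include <vector>

// Returns {value seen by the by-value lambda, value seen by the by-reference lambda}.
std::vector<int> capture_demo() {
    int x = 1;
    auto by_value = [=]() { return x; };      // copies x at this point
    auto by_reference = [&]() { return x; };  // refers to the live variable x
    x = 42;                                   // change x after both lambdas exist
    return {by_value(), by_reference()};      // yields {1, 42}
}
```

The by-value lambda still sees the old value because it holds its own copy, which is the behaviour you want for code that runs later on another device.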

Let’s look at an example where we define a parallel kernel using a lambda:

```c++
// `data` is a host array of floats and `data_size` its length (defined elsewhere)
sycl::buffer<float, 1> buf(data, sycl::range<1>(data_size));
myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] *= 2; // Example operation: double each element
    });
});
```
In this example:
* We create a **buffer** to hold the data.
* We use an **accessor** (`acc`) to get read/write access to the buffer.
* The `parallel_for()` function executes the kernel in parallel across each element in the data set, multiplying each element by 2.

## Quick Review

* **Command groups** allow you to organize and submit work to the device. Inside a command group, you define the operations that need to happen on the device, from running kernels to managing memory.
* **Lambda functions** are used to define these operations locally. In SYCL, lambdas allow you to write concise code that can execute in parallel across many elements.
* SYCL’s structure makes it easy to submit tasks for parallel execution while managing dependencies and memory efficiently.

# 5. Command Groups in Detail

In SYCL, **command groups** are the heart of the process for submitting work to a device. They contain everything needed for the device to execute tasks, including the kernel itself, memory management, and data dependencies. Understanding how command groups work is crucial for writing effective SYCL programs.

## What is a Command Group?

A **command group** is a construct that organizes and submits a set of operations to be performed on a device. Think of it as a “task package” that SYCL sends to the device for execution. Inside this package, you can define what computations need to run and how memory should be accessed or synchronized.

When you submit a command group, you're essentially telling SYCL, “Here’s what I want to do on the device. Now, take care of it.”

```c++
myQueue.submit([&](sycl::handler &cgh) {
  /* Command group function */
});
```

Let’s break this down:

* **Command groups** are submitted via the `submit` function, which belongs to a `queue` (that’s where the work is scheduled).
* The `submit` function takes a **lambda function** where you define the operations to be performed on the device.
* Inside this lambda, a **command group handler (cgh)** is created. This handler is responsible for specifying the kernel execution and managing access to the memory (buffers, accessors, etc.).

The command group is then sent to the device for execution.

<img width="305" src="images/image11.png" >

> The diagram shows the flow of a command group being defined, submitted to a queue, and eventually executed on a device.

## Defining Dependencies in Command Groups

In parallel computing, it’s essential to manage the order in which tasks execute and how data is transferred between the host (CPU) and device (GPU). This is where **defining dependencies** comes into play.

Within a command group, you define the **data dependencies** between kernels and memory. This ensures that all necessary data is available on the device when the kernel runs, and the execution happens in the correct order. SYCL manages this using **accessors**, which act as bridges between memory buffers and kernels.

For example, before running a kernel that modifies data in a buffer, you want to make sure that the correct data has been transferred from the host to the device. Likewise, once the kernel finishes, SYCL needs to transfer the modified data back to the host if it’s required. This automatic data synchronization is one of SYCL’s strengths.

## Command Group Handler (`cgh`)

The **command group handler (cgh)** is like the “director” of a command group. It directs the operations that will take place on the device. The `cgh` gives you the tools to:

* **Specify kernel execution**: This is where you tell the device what computation to run.
* **Manage memory access**: You use `cgh` to set up access to memory resources (buffers, accessors) so the kernel can read or write data.
* **Synchronize data**: Ensures the data dependencies between tasks are managed correctly.

For example, you could define a command group that specifies how data is read from a buffer, processed in a kernel, and then written back:

```c++
myQueue.submit([&](sycl::handler &cgh) {
    // `buf` is a sycl::buffer created earlier; request read/write access for this kernel
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] += 1;  // Increment each element in the buffer
    });
});
```
Here:
* We use `get_access` to allow the kernel to read and modify the buffer.
* The kernel itself (defined in `parallel_for`) will perform the operation.

The `cgh` ensures that these steps happen in the correct order and that the data in the buffer is synchronized between the host and device as needed.

## Enqueuing SYCL Kernel Function: `single_task` Example

Let’s walk through an example of how to use the `single_task` kernel in SYCL. The `single_task` function is the simplest way to run a task on a device, particularly when you don’t need parallel execution. This is a great place to start if you want to run basic operations on a GPU or another accelerator.

Here’s how it looks in code:
```c++
// Select a GPU device and create a queue for it
auto myQueue = sycl::queue{sycl::gpu_selector_v};

// Submit a command group to the queue
myQueue.submit([&](sycl::handler &cgh) {
    // Create a stream for output within kernel
    auto os = sycl::stream{128, 128, cgh};
    
    // Execute a single task
    cgh.single_task([=]() {
      os << "Hello World!" << sycl::endl;  // Output to the stream
    });
    
}).wait();  // Wait for the GPU to finish the task
```
Let’s break it down:
* **Device Selection**: We use the **GPU selector** to make sure SYCL runs on a GPU. If a GPU is available, this code will run the kernel on it. Otherwise, it will throw an exception, which can be handled to fall back to a CPU or other device.

* **Command Group Submission**: We submit a command group to the queue using the `submit()` function. Inside this function, we define the task using `single_task()`.

* **SYCL Stream**: We introduce `sycl::stream`, which lets you output data directly from within the kernel. This can be useful for debugging or logging. In our case, we’re printing **"Hello World!"** from the GPU, which confirms the device is working.

* **Wait for Completion**: We use `wait()` to ensure that the CPU doesn’t move on until the GPU finishes executing the kernel.

### Why Use `single_task`?

The `single_task` function is ideal when you want to perform a single operation on a device without the need for parallel execution. It’s simple and straightforward—perfect for testing or running a basic task.

While more complex kernels (such as those using `parallel_for`) allow you to split work across many threads or processing units, `single_task` runs only once. This makes it perfect for small, one-off operations, such as initializing device memory, writing test kernels, or printing debug information.

For instance, when you’re getting started with SYCL, running a `single_task` that prints “Hello World” from the GPU is an excellent way to confirm that your SYCL environment is set up correctly.

### Waiting for Completion

In SYCL, after submitting a kernel to a device, you might need to ensure that the device finishes the work before the host (CPU) continues executing the rest of the program. That’s where `wait()` comes in. Without calling `wait()`, the CPU could continue executing subsequent code while the GPU is still processing, which might cause inconsistencies or unexpected results, especially in more complex applications.

By using `wait()`, you force the host to pause and wait for the device to finish its work, ensuring that your program runs in the correct sequence.

## Introducing `parallel_for` for Parallel Execution

While `single_task` is useful for executing a simple task on a device, SYCL really shines when it comes to parallel execution. This is where `parallel_for` comes in.

The `parallel_for` function allows you to break down a task into multiple smaller tasks, each of which can run concurrently on different parts of the data. This is what enables SYCL to harness the computational power of GPUs and other accelerators for high-performance parallel computing.

In SYCL, `parallel_for` executes the same operation across a **range of work-items**—these work-items can be thought of as individual "threads" working in parallel. Each work-item processes a different portion of the data, allowing the device to work on many elements at the same time.

Let’s walk through a simple example of how to use `parallel_for` in SYCL:

```c++
sycl::buffer<float, 1> buf(data, sycl::range<1>(data_size)); // Create a buffer to hold the data

myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    
    // Submit a parallel task to the queue
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] *= 2;  // Multiply each element by 2
    });
});
```

In this example:
* We create a **buffer** to store our data, as in the earlier lambda example.
* We use `get_access` to allow the device to read and write to the buffer.
* The `parallel_for` function splits the task of multiplying each element by 2 across all the data points, running the operation in parallel on the device.

### Why Use `parallel_for`?

The power of `parallel_for` lies in its ability to run the same operation on multiple data points simultaneously, which is ideal for data-parallel tasks. Instead of processing each data point sequentially on the CPU, `parallel_for` enables the GPU to work on many data points in parallel, drastically speeding up execution for large datasets.

Here’s why you’d use `parallel_for`:

* **Data parallelism**: When your computation involves the same operation across multiple elements (like in the above example of multiplying every element in an array), `parallel_for` can distribute the work across the available hardware, making it much faster.
* **Efficient use of hardware**: GPUs are designed to execute thousands of threads in parallel. By using `parallel_for`, you can fully utilize the computational power of your device.
* **Scalability**: As the size of your data grows, `parallel_for` scales well because it can distribute the work evenly across the available processing units.

### `parallel_for` Example: A Deeper Look

Let’s dig a little deeper into how `parallel_for` works in practice. In SYCL, `parallel_for` can be used with a **range** of work-items, and each work-item is identified by an **id**. The **id** represents the index of the current work-item, so each work-item processes a different portion of the data.

Here’s an example where we square each element in a buffer:

```c++
sycl::buffer<float, 1> buf(data, sycl::range<1>(data_size));  // Buffer to hold data

myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);  // Get access to the buffer
    
    // Use parallel_for to run the task in parallel
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] = acc[idx] * acc[idx];  // Square each element
    });
});
```

In this example:

* We define a **range** of size `data_size` to indicate how many work-items we want. Each work-item will be responsible for squaring one element in the buffer.
* The **id** parameter in the lambda function (`sycl::id<1> idx`) represents the index of the current work-item, allowing each work-item to access a different element in the buffer.

Each work-item runs the same operation (`acc[idx] = acc[idx] * acc[idx];`), but on a different element of the data. This enables parallelism and makes your program much more efficient when dealing with large datasets.

## Quick Review

In SYCL, command groups are where you define the work that will be sent to a device. The **command group handler (cgh)** allows you to manage dependencies, define kernel execution, and control access to memory.

Using simple constructs like `single_task`, you can start running basic tasks on accelerators like GPUs. As you become more comfortable, you’ll explore more complex kernel execution patterns, like `parallel_for`, that let you fully harness the power of parallel devices.

The `parallel_for` function is a cornerstone of parallel computing in SYCL. It allows you to leverage the full computational power of devices like GPUs by breaking tasks into smaller parts and processing them in parallel. You’ll often use `parallel_for` when your task involves data parallelism, where the same operation needs to be performed across many data points.

In the next section, we’ll dive deeper into **scheduling and execution**, where we’ll explore how SYCL manages the execution of parallel tasks, dependencies, and synchronization across multiple kernels or command groups.

# 6. Scheduling and Execution in SYCL

When you write SYCL programs, one of the key concepts to grasp is how tasks are scheduled and executed on different devices (e.g., GPUs, CPUs, FPGAs). SYCL abstracts much of the complexity involved in managing the execution of tasks, but understanding how the SYCL scheduler works—and how task dependencies and parallel execution are handled—will help you write more efficient and optimized code.

Let’s take a deeper dive into how scheduling, task dependencies, and parallel execution work in SYCL.

## Scheduling in SYCL

SYCL’s scheduler is the engine behind managing task execution. Its job is to ensure that tasks (command groups) are executed on the target devices in the correct order, respecting any dependencies between tasks and data.

Here’s how the scheduling process works:

* **Task Submission**: When you submit a command group to a queue using the `submit()` function, that command group is handed off to the scheduler.
* **Dependency Resolution**: The scheduler evaluates the dependencies between tasks to determine which tasks need to wait for others to finish before they can execute. Dependencies can arise from shared data (buffers, USM) or synchronization requirements between command groups.
* **Resource Allocation**: The scheduler allocates resources on the target device (such as processing units and memory) and prepares the task for execution.
* **Task Execution**: Once dependencies are resolved and resources are allocated, the scheduler sends the task to the target device, where it executes in parallel with other tasks, if possible.

The diagram below illustrates this process:

<img width="600" src="images/image33.png">

## Task Dependencies in SYCL

In SYCL, dependencies between tasks must be carefully managed to ensure correct program execution. These dependencies arise when one task needs to wait for the completion of another task, usually due to shared data.

SYCL provides two main mechanisms for managing task dependencies:

* **Accessors (Buffer/Accessor Model)**: Accessors are used to specify how a kernel accesses data in a buffer. SYCL automatically resolves dependencies based on these access modes (e.g., `read`, `write`, or `read_write`) and on the order in which command groups are submitted. For example, if a kernel that writes to a buffer is submitted before a kernel that reads from the same buffer, SYCL ensures that the writer finishes before the reader starts.

* **Events**: Events provide a more explicit way to manage dependencies. When you submit a task, you can create an event that represents its execution. You can then make other tasks wait on this event, ensuring that tasks are completed in the right order.

Here’s an example that shows how dependencies work using accessors:

```c++
sycl::buffer<int, 1> buf(data, sycl::range<1>(data_size));

myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] = acc[idx] * 2;  // First kernel: double each element
    });
});

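myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read>(cgh);
    // std::cout cannot be used in device code; sycl::stream is the device-side
    // replacement. The buffer sizes (in bytes) are chosen for a small example.
    sycl::stream out(1024, 256, cgh);
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        out << acc[idx] << " ";  // Second kernel: print each element (order not guaranteed)
    });
});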
```

In this example:

* The first command group writes to the buffer (`read_write` access mode).
* The second command group reads from the buffer (`read` access mode).
* SYCL’s scheduler will ensure that the first kernel finishes before the second kernel starts reading from the buffer, as there’s a data dependency.
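The ordering rule at work here can be modeled in a few lines of plain C++. This is only a conceptual sketch of the rule, not how any SYCL runtime is actually implemented, and the names `Mode`, `Access`, `conflicts`, and `dependencies` are hypothetical stand-ins invented for illustration. The rule it encodes: two command groups must be ordered when they touch the same buffer and at least one of them writes.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for SYCL access modes; not the SYCL types themselves.
enum class Mode { read, write, read_write };

// One command group's request: which buffer it touches and how.
struct Access {
    int buffer_id;
    Mode mode;
};

// Two accesses conflict when they touch the same buffer and at least one writes.
bool conflicts(const Access& a, const Access& b) {
    bool both_read = (a.mode == Mode::read && b.mode == Mode::read);
    return a.buffer_id == b.buffer_id && !both_read;
}

// For each command group (in submission order), list the earlier
// command groups it has to wait for.
std::vector<std::vector<std::size_t>> dependencies(const std::vector<Access>& submitted) {
    std::vector<std::vector<std::size_t>> deps(submitted.size());
    for (std::size_t later = 0; later < submitted.size(); ++later)
        for (std::size_t earlier = 0; earlier < later; ++earlier)
            if (conflicts(submitted[earlier], submitted[later]))
                deps[later].push_back(earlier);
    return deps;
}
```

Feeding it the two command groups from the example (a `read_write` access followed by a `read` access on the same buffer) reports that the second must wait for the first, while two `read` accesses on the same buffer produce no dependency at all.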

## Parallel Execution in SYCL

Now that we’ve introduced `parallel_for` in the earlier section, let’s expand on how SYCL handles parallel execution and how the scheduler manages parallel tasks.

Parallel execution is one of SYCL’s most powerful features. By distributing work across many **work-items** on a GPU (or other parallel devices), SYCL can perform computations on large datasets much faster than on a CPU alone. The `parallel_for` function is how SYCL accomplishes this.

In SYCL, parallelism can occur at different levels:

* **Work-Item Level**: Each work-item represents a single unit of work (like processing one element in an array). SYCL runs multiple work-items concurrently, each executing the same kernel function on different portions of the data.
* **Work-Group Level**: Work-items are grouped into **work-groups**, which share local memory and can synchronize more efficiently.
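To make the relationship between the two levels concrete, here is a small plain C++ sketch of how a one-dimensional global index splits into a work-group index and a local index inside that group. In a real kernel you would query these values from a `sycl::nd_item` (for example with `get_global_id`, `get_group`, and `get_local_id`) rather than compute them yourself; the `WorkItemIds`, `decompose`, and `global_index` names below are hypothetical and exist only for illustration. The sketch assumes a nonzero work-group size.

```c++
#include <cassert>
#include <cstddef>

struct WorkItemIds {
    std::size_t group;  // which work-group the work-item belongs to
    std::size_t local;  // position of the work-item inside that work-group
};

// Split a 1-D global index into (group, local) for a fixed work-group size.
WorkItemIds decompose(std::size_t global_id, std::size_t work_group_size) {
    return {global_id / work_group_size, global_id % work_group_size};
}

// Inverse mapping: rebuild the global index from (group, local).
std::size_t global_index(WorkItemIds ids, std::size_t work_group_size) {
    return ids.group * work_group_size + ids.local;
}
```

With a work-group size of 64, global index 70 belongs to work-group 1 at local position 6.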

### Deep Dive into `parallel_for`

The `parallel_for` function splits a task across many work-items. Each work-item is assigned a unique **ID**, which it uses to determine which part of the data to process. The scheduler ensures that these work-items run in parallel, taking advantage of the hardware’s compute capabilities.

Let’s revisit the `parallel_for` example:

```c++
sycl::buffer<float, 1> buf(data, sycl::range<1>(data_size));

myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    
    // Each work-item multiplies one element in the buffer
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] *= 2;
    });
});
```

In this example:

* `sycl::range<1>(data_size)` defines how many work-items are created. In this case, one work-item for each element in the buffer.
* `sycl::id<1> idx` gives each work-item a unique ID, allowing it to process a different element of the buffer.
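One way to reason about a `parallel_for` is to picture its serial equivalent on the host: the kernel body is simply called once for every index in the range. The sketch below is plain C++ with hypothetical helper names (`serial_for`, `doubled`); it is a mental model and a handy way to compute reference results, not a description of how a device executes the kernel.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for a 1-D parallel_for: call `kernel` once per index.
// A real device runs these calls concurrently and in no guaranteed order,
// which is safe only because each call touches a different element.
template <typename Kernel>
void serial_for(std::size_t range, Kernel kernel) {
    for (std::size_t idx = 0; idx < range; ++idx)
        kernel(idx);
}

// Same operation as the kernel above: double every element.
std::vector<float> doubled(std::vector<float> data) {
    serial_for(data.size(), [&](std::size_t idx) { data[idx] *= 2; });
    return data;
}
```

If a kernel only gives the right answer when the loop runs in a particular order, it is not data-parallel and needs restructuring before it can be handed to `parallel_for`.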

## How the Scheduler Manages Parallel Execution

SYCL’s scheduler is responsible for distributing work-items across the available compute units. This process involves:

1. **Dividing the work**: The scheduler breaks down the range of work-items and distributes them across the available compute resources.
2. **Work-Item Execution**: Work-items execute concurrently, with each processing a different portion of the data.
3. **Synchronization**: SYCL ensures that tasks are properly synchronized, using either explicit events or implicit synchronization based on data dependencies.

The benefit of this approach is that you can easily write parallel code without worrying about low-level details like thread management. SYCL abstracts these complexities and allows you to focus on the computation itself.

### When to Use `parallel_for`

Use `parallel_for` when:

* You need to process large amounts of data simultaneously (e.g., array operations, matrix multiplications).
* The task is **data-parallel**—meaning the same operation is applied to multiple elements independently.
* You want to leverage the computational power of devices like GPUs that are optimized for parallel workloads.

## Quick Review

Understanding SYCL’s scheduler, task dependencies, and parallel execution model is key to writing efficient and scalable SYCL programs. Here’s what we’ve covered:

* **SYCL’s scheduler** ensures tasks are executed in the right order and optimally distributed across devices.
* **Task dependencies** are managed automatically by SYCL’s runtime, using accessors or events to synchronize tasks.
* `parallel_for` allows you to break down large tasks into smaller parallel units, distributing the work across multiple work-items for faster execution.

In the next section, we’ll explore **advanced execution patterns**, including multi-kernel execution and strategies for optimizing task scheduling.

**<font color="red">SEE example [00-hello.ipynb](examples/00-hello.ipynb)</font>**

# 7. Advanced Kernel Execution Patterns

By now, you’ve seen how to submit simple tasks to a device and run parallel computations using `single_task` and `parallel_for`. But SYCL offers a lot more flexibility when it comes to organizing and coordinating complex computations across multiple kernels. In this section, we’ll explore some of these advanced patterns, focusing on:

* How to design command groups for more complex workflows.
* How to execute multiple kernels and ensure they run in the correct sequence.
* How to synchronize data and manage dependencies between different tasks.

## Advanced Command Group Patterns

A **command group** is the basic unit of work that SYCL submits to a device. As we’ve seen, it allows you to define both the computation (the kernel) and the data dependencies (through accessors or USM). But when you're working with more complex applications, you’ll likely need more control over how these command groups interact with one another. This is where advanced command group patterns come into play.

### Combining Several Steps in One Command Group

A command group can contain only one kernel invocation: the SYCL 2020 specification allows a single action (a kernel launch such as `parallel_for`, or an explicit memory operation) per `submit()` call, so calling `parallel_for` twice on the same handler is an error. When you want to apply a series of transformations to the same buffer, you can either fuse them into one kernel or submit one command group per step (see Sequential Execution with Dependencies later in this section).

Here’s an example where two transformations are fused into a single kernel:

```c++
sycl::buffer<float, 1> buf(data, sycl::range<1>(data_size));

myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    
    // One kernel: multiply each element by 2, then add 10
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] = acc[idx] * 2 + 10;
    });
});
```

Because both steps run inside the same kernel, each work-item applies them in order to its own element, and every element is read and written only once. Fusion works here because each result depends only on the element at the same index. If a later step needs values produced by other work-items, use separate command groups instead and let the runtime order them.

### Command Groups with Multiple Accessors

Another advanced pattern involves managing multiple data dependencies across different buffers or memory spaces. You might have a situation where a command group operates on several buffers, each with different access modes.

Here’s an example of using multiple buffers in a command group:

```c++
sycl::buffer<float, 1> bufA(dataA, sycl::range<1>(sizeA));
sycl::buffer<float, 1> bufB(dataB, sycl::range<1>(sizeB));

myQueue.submit([&](sycl::handler &cgh) {
    auto accA = bufA.get_access<sycl::access::mode::read>(cgh);   // Read buffer A
    auto accB = bufB.get_access<sycl::access::mode::write>(cgh);  // Write buffer B
    
    cgh.parallel_for(sycl::range<1>(sizeA), [=](sycl::id<1> idx) {
        accB[idx] = accA[idx] * 2.0f;  // Double values from A and store in B
    });
});
```

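In this case, we have two buffers: `bufA` for reading and `bufB` for writing. Declaring the access modes tells the SYCL runtime that this command group only reads `bufA` and only writes `bufB`, so it can order the command group correctly against any other work that touches either buffer. Note that the kernel indexes both buffers over `sizeA`, so `sizeB` must be at least `sizeA`.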

## Multi-Kernel Execution

In real-world applications, it's common to run multiple kernels in sequence or in parallel, depending on the nature of the task. SYCL provides several mechanisms to support multi-kernel execution, and mastering this can help you create more efficient and scalable applications.

### Sequential Execution with Dependencies

When executing multiple kernels, you often need to ensure that one kernel finishes before the next starts. This can be managed with SYCL’s dependency system, which automatically handles dependencies through accessors or events. We saw earlier how SYCL can resolve dependencies between command groups that share access to the same data.

Here’s a more complex example where multiple kernels operate on shared data, but must run in sequence:

```c++
sycl::buffer<float, 1> buf(data, sycl::range<1>(data_size));

myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    
    // First kernel: multiply each element by 2
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] *= 2;
    });
});

myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    
    // Second kernel: add 10 to each element
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] += 10;
    });
});
```

In this example:

* The first kernel doubles each element in the buffer.
* The second kernel adds 10 to each element.

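Because both kernels operate on the same buffer with `read_write` access, SYCL will ensure that the second kernel waits for the first to finish before it starts. The dependency comes from the shared buffer: each command group creates its own accessor, and the runtime compares what those accessors request.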
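The reason the order has to be enforced is that the two steps do not commute. The plain C++ sketch below (hypothetical helper names, host code only) computes what the two-kernel sequence should produce and what would come out if the kernels ran in the opposite order. A host reference like this is also a convenient way to check device results.

```c++
#include <cassert>
#include <vector>

// Host reference for the submission order used above: double, then add 10.
std::vector<float> scale_then_shift(std::vector<float> values) {
    for (float& v : values) v *= 2;   // what the first kernel does
    for (float& v : values) v += 10;  // what the second kernel does
    return values;
}

// The same two steps in the opposite order give a different answer.
std::vector<float> shift_then_scale(std::vector<float> values) {
    for (float& v : values) v += 10;
    for (float& v : values) v *= 2;
    return values;
}
```

For the input `{1, 2}`, the first function returns `{12, 14}` and the second returns `{22, 24}`.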

### Parallel Execution of Independent Kernels

There are also cases where you might want to run multiple kernels in parallel, especially if they operate on independent data. SYCL can execute these kernels concurrently, maximizing the use of available hardware resources.

Here’s how you could structure a SYCL program to run two independent kernels in parallel:

```c++
sycl::buffer<float, 1> bufA(dataA, sycl::range<1>(sizeA));
sycl::buffer<float, 1> bufB(dataB, sycl::range<1>(sizeB));

// First kernel works on buffer A
myQueue.submit([&](sycl::handler &cgh) {
    auto accA = bufA.get_access<sycl::access::mode::read_write>(cgh);
    
    cgh.parallel_for(sycl::range<1>(sizeA), [=](sycl::id<1> idx) {
        accA[idx] *= 2;
    });
});

// Second kernel works on buffer B
myQueue.submit([&](sycl::handler &cgh) {
    auto accB = bufB.get_access<sycl::access::mode::read_write>(cgh);
    
    cgh.parallel_for(sycl::range<1>(sizeB), [=](sycl::id<1> idx) {
        accB[idx] += 10;
    });
});
```

In this example, the first kernel operates on `bufA` and the second kernel operates on `bufB`. Since these kernels don’t share any data, SYCL is free to execute them concurrently on a default (out-of-order) queue, depending on the available hardware resources.

Running independent tasks concurrently can significantly boost performance, especially on GPUs that are designed to handle massive parallelism.

## Data Synchronization

Synchronization is critical in any parallel program, ensuring that tasks accessing the same data do so in the correct order. SYCL provides several mechanisms to handle synchronization:

1. **Automatic Synchronization with Accessors**: SYCL automatically synchronizes data between the host and the device using accessors. When a kernel requests access to a buffer, SYCL ensures that the data is up-to-date and correctly synchronized across all devices before the kernel runs.

2. **Explicit Synchronization with Events**: In more complex scenarios, where you need finer control over when kernels or tasks execute, SYCL provides **events** to manage synchronization explicitly. Events are returned by the `submit()` function and can be used to wait for the completion of a task.

Here’s an example using events for explicit synchronization:

```c++
sycl::event event1 = myQueue.submit([&](sycl::handler &cgh) {
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] *= 2;
    });
});

myQueue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(event1);  // Wait for the first kernel to finish
    auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
    cgh.parallel_for(sycl::range<1>(data_size), [=](sycl::id<1> idx) {
        acc[idx] += 10;
    });
});
```

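Here, `event1` is used to ensure that the second kernel doesn’t start until the first has completed. Because both kernels access `buf` through accessors, the runtime would already enforce this order on its own; `depends_on` matters most when the runtime cannot see the dependency, for example when kernels share data through USM pointers.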
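The idea behind explicit dependencies can be sketched as a tiny task graph in plain C++. This is a toy model with hypothetical names (`Task`, `run_all`), not the SYCL event API: each task lists the tasks it depends on, and a task runs only once all of those have finished. The sketch assumes the dependencies contain no cycles and runs everything on one thread.

```c++
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical model of a task submitted with explicit dependencies.
struct Task {
    std::vector<std::size_t> depends_on;  // indices of tasks that must finish first
    std::function<void()> work;           // the work to perform
};

// Run every task, starting each one only after all tasks it depends on
// have completed. Returns the order in which the tasks actually ran.
std::vector<std::size_t> run_all(const std::vector<Task>& tasks) {
    std::vector<bool> done(tasks.size(), false);
    std::vector<std::size_t> order;
    while (order.size() < tasks.size()) {
        bool progressed = false;
        for (std::size_t i = 0; i < tasks.size(); ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (std::size_t d : tasks[i].depends_on)
                if (!done[d]) ready = false;
            if (!ready) continue;
            tasks[i].work();
            done[i] = true;
            order.push_back(i);
            progressed = true;
        }
        if (!progressed) break;  // dependency cycle: nothing left can run
    }
    return order;
}
```

Even if the "add 10" task is listed first, declaring that it depends on the "multiply by 2" task forces the multiplication to run before it, which is the same guarantee `depends_on(event1)` gives in the example above.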

## Quick Review

In this section, we’ve explored some of the more advanced features of SYCL’s command group and kernel execution system:

* We’ve seen how you can structure command groups for multi-step work and manage data dependencies across several buffers.
* We’ve explored multi-kernel execution, showing how SYCL handles both sequential and parallel kernel execution.
* And we’ve discussed how SYCL provides automatic and explicit synchronization mechanisms to ensure correct task execution.

These patterns are essential for building more complex and efficient SYCL applications. In the next section, we’ll look at how to compile and run SYCL programs.

# 8. Compiling and Running SYCL Programs

Now that we've explored writing SYCL kernels and managing memory, the next step is figuring out how to actually **compile and run your code**. Compiling SYCL programs isn’t too different from regular C++, but there are a few extra things happening behind the scenes since we’re working with heterogeneous systems (like CPUs, GPUs, and other accelerators).

## How to Compile SYCL Code

If you're using **Intel's DPC++ (Data Parallel C++) compiler**, which is part of Intel's oneAPI toolkit, the command for compiling SYCL code looks like this:

```bash
icpx -fsycl compute.cpp -o ./a.out
```

Let’s break this down so you understand what’s happening:

* `icpx`: This is the driver for the **Intel oneAPI DPC++/C++ compiler**. It’s built on LLVM/Clang and is designed to handle SYCL code. If you’ve used `g++` or `clang++` to compile regular C++ code, this will feel familiar.

* `-fsycl`: This flag tells the compiler that you’re working with SYCL. It enables the compilation of SYCL kernels and ensures that the compiler targets both the **host** (like your CPU) and the **device** (like a GPU or accelerator). Without this flag, the compiler would treat your code like regular C++.

* `compute.cpp`: This is the **source file** that contains your SYCL code. In this example, it’s the file where we’ve written our SYCL kernel.

* `-o ./a.out`: This flag specifies the **output file**. After compiling, your program will be saved as `a.out`, which is the file you’ll run later. You can change the name to anything you want (e.g., `my_program`) by modifying this part.

## Behind the Scenes: What Happens When You Compile SYCL Code

SYCL programs have two distinct parts: **host code** (which runs on the CPU) and **device code** (which runs on the GPU, FPGA, or other accelerators). When you use the `-fsycl` flag, the compiler automatically generates code for both the host and the device.

Here’s what’s happening under the hood:

1. **Host Compilation**: The host code—like setting up the queue, submitting command groups, and managing buffers—is compiled into regular machine code that runs on your CPU.
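2. **Device Code Generation**: SYCL kernels, which are written to run on a GPU or other accelerator, are compiled separately for the device. By default, `icpx -fsycl` embeds them in the executable in a portable intermediate form (SPIR-V), which is translated to device-specific machine code at runtime for the device you’ve selected (e.g., a GPU). Ahead-of-time compilation for a specific target is also possible.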
3. **Device Offloading**: During runtime, the SYCL runtime system manages the execution of the device code, ensuring it’s sent to the correct device (GPU, FPGA, etc.) for execution. This is what makes SYCL powerful—it abstracts away a lot of the complexity of manually offloading computation to a device.

## Running Your Program

Once your code is compiled, running your SYCL program is just like running any other program on your system:

```bash
./a.out
```

This command executes your SYCL program. If your system is set up correctly, the oneAPI runtime will take care of everything—selecting the appropriate device (CPU, GPU, etc.) and executing your kernels.

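**Tip**: If you’ve explicitly selected a device using a **device selector** (like `gpu_selector` or `cpu_selector`), your program will execute on that device. Otherwise, SYCL will choose a device for you, often based on default settings or system configuration. Note that SYCL 2020 deprecates these selector classes in favor of the `sycl::gpu_selector_v` and `sycl::cpu_selector_v` objects, so newer compilers may warn about the older spelling.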

## Handling Device Selection

You might recall from earlier sections that SYCL allows you to specify which device you want to run on by using **device selectors**. Here’s a quick refresher on how to specify a device:

```c++
sycl::queue myQueue{sycl::gpu_selector{}};
```

In this case, you’re telling SYCL to use a **GPU** if one is available. If no GPU is found, the runtime will throw an exception, so it’s a good idea to handle that with a fallback option:

```c++
sycl::queue myQueue;  // Declared outside the try block so it stays in scope afterwards
try {
    myQueue = sycl::queue{sycl::gpu_selector{}};
} catch (const sycl::exception& e) {
    std::cerr << "No GPU available, falling back to CPU." << std::endl;
    myQueue = sycl::queue{sycl::cpu_selector{}};
}
```

This ensures that your program doesn’t crash if the selected device isn’t available. Instead, it falls back to a CPU or another available device, keeping things flexible.

## Compilation and Execution on Different Platforms

If you're working with another SYCL implementation, such as **hipSYCL** (since renamed AdaptiveCpp) or **ComputeCpp**, the compilation commands will be slightly different. Here's an example for **hipSYCL**:

```bash
syclcc -o my_program compute.cpp
```

Regardless of the SYCL implementation you use, the concept remains the same—your code is compiled to run on both the host and the device, and SYCL’s runtime will handle the execution across the selected devices.

## Debugging and Profiling SYCL Code

Once your program is running, you’ll want to debug and profile it to ensure everything is working as expected. SYCL itself doesn’t define debugging or profiling tools, but SYCL toolchains such as oneAPI ship with several that help:

### Debugging

For debugging, you can use the same tools you’d use with C++ programs, such as **gdb** or **lldb**. These debuggers work on the host side of your code, allowing you to step through the queue submissions and memory management.

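However, debugging device code (e.g., kernels running on a GPU) requires specialized tools. The oneAPI toolkit includes the **Intel Distribution for GDB** (`gdb-oneapi`), which can step through kernels running on supported Intel GPUs. **Intel VTune Profiler** and **Intel Inspector**, also associated with oneAPI, are analysis tools rather than debuggers: VTune helps you track performance and spot bottlenecks in code running on accelerators, while Inspector checks for memory and threading errors.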

### Profiling

Profiling your SYCL program is a great way to identify performance bottlenecks and optimize your code for faster execution. Intel’s **VTune Profiler** is one of the best tools for this, providing deep insights into how your SYCL kernels are performing across both host and device. VTune allows you to:

* Analyze kernel execution times.
* Monitor memory transfers between host and device.
* Identify hotspots in your device code.

If you're using another SYCL implementation, check their documentation for profiling tools, but in general, the principles of profiling remain the same: you want to ensure your kernels are running efficiently and that you're minimizing memory transfer overhead.

## Final Thoughts

To summarize, here are the key steps for compiling and running your SYCL program:

1. **Compile your code**: Use the `icpx -fsycl` command to compile your SYCL program into an executable.
2. **Run your program**: Execute the compiled program using `./a.out`, and let SYCL manage device selection and execution.
3. **Debug and Profile**: Use tools like **gdb**, **VTune Profiler**, or other available profiling tools to optimize and debug your program.

SYCL makes the process of running on heterogeneous devices seamless, but understanding what’s happening behind the scenes will help you write better, more efficient code.