# SYCL 概述 | SYCL Program Structure

##### 章节 | Sections
- [什么是数据并行 C++ 和SYCL？ | What is Data Parallel C++ and SYCL?](#What-is-Data-Parallel-C++-and-SYCL?)
- __代码 | Code:__ [设备选择器 | Device Selector](#Device-Selector)
- [数据并行内核 | Data Parallel Kernels](#Parallel-Kernels)
- [SYCL代码剖析 | SYCL Code Anatomy](#SYCL-Code-Anatomy)
- __代码 | Code:__ [访问器的隐式依赖 | Implicit dependency with Accessors](#Implicit-dependency-with-Accessors)
- __代码 | Code:__ [同步：主机访问器 | Synchronization: Host Accessor](#Synchronization:-Host-Accessor)
- __代码 | Code:__ [同步：缓冲区销毁 | Synchronization: Buffer Destruction](#Synchronization:-Buffer-Destruction)
- __代码 | Code:__ [自定义设备选择器 | Custom Device Selector](#Custom-Device-Selector)
- __代码 | Code:__ [复数乘法 | Complex Number Multiplication](#Code-Sample:-Complex-Number-Multiplication)
- __动手实验 | Lab Exercise:__ [矢量相加 | Vector Add](#Lab-Exercise:-Vector-Add)

## 学习目标 | Learning Objectives
* 解释 __SYCL__ 基本类 | Explain the __SYCL__ fundamental classes
* 使用 __设备选择__ 卸载内核工作负载 | Use __device selection__ to offload kernel workloads
* 决定何时使用 __基本并行内核__ 和 __ND Range 内核__ | Decide when to use __basic parallel kernels__ and __ND Range Kernels__
* 创建 __主机访问器__ | Create a __host Accessor__
* 通过动手实验练习构建样例 __SYCL应用__ | Build a sample __SYCL application__ through hands-on lab exercises

<a id="What-is-Data-Parallel-C++-and-SYCL?"> </a>
## 什么是数据并行 C++和SYCL | What is Data Parallel C++ and SYCL?

__Data Parallel C++ (DPC++)__ 是oneAPI基于SYCL的实现。它基于现代 C++ 的效率优势和众所周知的构造，并结合了用于数据并行性和异构编程的 __SYCL*__ 标准。SYCL 是一个 __单源__ 语言，可将 __主机代码__ 和 __异构加速器内核__ 混编在同一个源文件中。在主机上调用 SYCL 程序，并将计算卸载到加速器。程序员可以使用熟悉的 C++ 和库构造及多项新增功能（如用于工作定向的 __队列(queue)__ 、用于数据管理的 __缓冲区(buffer)__ 和用于并行性的 __parallel_for__ ）来管理可以分流卸载的部分计算和数据。

__Data Parallel C++ (DPC++)__ is oneAPI's implementation of SYCL. It builds on the productivity benefits and familiar constructs of modern C++ and incorporates the __SYCL__ standard for data parallelism and heterogeneous programming. SYCL is a __single-source__ language in which __host code__ and __heterogeneous accelerator kernels__ can be mixed in the same source files. A SYCL program is invoked on the host computer and offloads computation to an accelerator. Programmers use familiar C++ and library constructs, with added functionality such as a __queue__ for work targeting, a __buffer__ for data management, and __parallel_for__ for parallelism, to direct which parts of the computation and data should be offloaded.

## 设备(Device) | Device

__device__ class 表示在一个使用英特尔&reg; oneAPI 工具套件的系统中加速器的能力特性。设备类包含用于查询设备信息的成员函数，此类信息用于支持创建多个设备的 SYCL 程序。
 * 函数 __get_info__ 提供以下设备信息：
   * 设备名称、厂商名称、和设备的版本
   * 局部和全局工作项的ID
   * 内置类型的位宽、时钟频率、缓存位宽和大小，在线或离线


```cpp
queue q;
device my_device = q.get_device();
std::cout << "Device: " << my_device.get_info<info::device::name>() << std::endl;
```

The __device__ class represents the capabilities of the accelerators in a system utilizing Intel&reg; oneAPI Toolkits. The device class contains member functions for querying information about the device, which is useful for SYCL programs where multiple devices are created.
* The function __get_info__ gives information about the device:
  * Name, vendor, and version of the device
  * The local and global work item IDs
  * Width for built-in types, clock frequency, cache width and sizes, online or offline
 
```cpp
queue q;
device my_device = q.get_device();
std::cout << "Device: " << my_device.get_info<info::device::name>() << "\n";
```


<a id="Device-Selector"> </a>
## 设备选择器(Device Selector) | Device Selector
这些类能在运行的时侯选择根据用户提供的启发式信息在特定设备上运行内核。以下代码示例显示了标准设备选择器 (__default_selector_v, cpu_selector_v, gpu_selector_v, accelerator_selector_v__) 和派生的 device_selector 的用法

These classes enable the runtime selection of a particular device to execute kernels based upon user-provided heuristics. The following code sample shows the use of the standard device selectors (__default_selector_v, cpu_selector_v, gpu_selector_v, accelerator_selector_v__):

```cpp
queue q(gpu_selector_v);
//queue q(cpu_selector_v);
//queue q(accelerator_selector_v);
//queue q(default_selector_v);
//queue q;

std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";
```

下面的 SYCL 代码显示了不同的设备选择器：检查代码，没有进行任何修改的必要：
1. 检查下面的代码单元，然后单击运行 ▶，将代码保存到文件中
2. 接下来，在代码下面的 __构建并运行__ 部分中运行 ▶该单元，以编译和执行代码。

The SYCL code below shows different device selectors: Inspect code, there are no modifications necessary:
1. Inspect the code cell below and click run ▶ to save the code to file
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/gpu_sample.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>

using namespace sycl;

int main() {
  //# Create a device queue with device selector
  
  queue q(gpu_selector_v);
  //queue q(cpu_selector_v);
  //queue q(accelerator_selector_v);
  //queue q(default_selector_v);
  //queue q;

  //# Print the device name
  std::cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

  return 0;
}

### 构建并运行 | Build and Run
选择下面的单元，然后点击运行 ▶，以编译并执行代码：

Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_gpu.sh;if [ -x "$(command -v qsub)" ]; then ./q run_gpu.sh; else ./run_gpu.sh; fi

## 队列(Queue) | Queue
 __队列__ 提交要由 SYCL 运行时执行的命令组。队列是一种 __将工作提交到__ 设备的机制。一个队列映射到一个设备，多个队列可以映射到同一设备。

A __queue__ submits command groups to be executed by the SYCL runtime. It is the mechanism through which __work is submitted__ to a device. A queue maps to one device, and multiple queues can map to the same device.

```cpp
q.submit([&](handler& h) {
    //COMMAND GROUP CODE
});
```


## 内核(Kernel) | Kernel

__kernel__ 类封装了实例化命令组(command group)时在设备上执行代码的方法和数据。内核对象不是由用户明确构造的，而是在调用内核调度函数（如 __parallel_for__）时构造的 

The __kernel__ class encapsulates methods and data for executing code on the device when a command group is instantiated. A kernel object is not explicitly constructed by the user; it is constructed when a kernel dispatch function, such as __parallel_for__, is called.

```cpp
q.submit([&](handler& h) {
  h.parallel_for(range<1>(N), [=](id<1> i) {
    A[i] = B[i] + C[i];
  });
});
```


## 选择设备内核的运行位置 | Choosing where device kernels run

工作被提交到队列中，并且每个队列仅与一个设备（例如特定的 GPU 或 FPGA）关联。 您可以决定与队列关联的设备（如果需要），并拥有所需的队列数量，以便在异构系统中分配工作。     

Work is submitted to queues and each queue is associated with exactly one device (e.g. a specific GPU or FPGA). You can decide which device a queue is associated with (if you want) and have as many queues as desired for dispatching work in heterogeneous systems.        


|目标设备|队列|
|-----|-------|
|创建针对任何设备的队列(queue): | queue() |
|创建针对一个预先配置完成的设备类型的队列: | queue(cpu_selector_v); queue(gpu_selector_v); queue(accelerator_selector_v); queue(default_selector_v);|
|创建针对特定设备(自定义标准): | queue(custom_selector); |       



|Target Device |Queue|
|-----|-------|
|Create queue targeting any device: | queue() |
|Create queue targeting a pre-configured class of devices: | queue(cpu_selector_v); queue(gpu_selector_v); queue(accelerator_selector_v); queue(default_selector_v);|
|Create queue targeting specific device (custom criteria): | queue(custom_selector); |                    
                                                          
                                                               







        
<img src="Assets/queue.png">

## SYCL语言和运行时 | SYCL Language and Runtime

SYCL语言和运行时由一组 C++ 类、模板和库组成。

The SYCL language and runtime consist of a set of C++ classes, templates, and libraries.

  __应用范围__ 和 __命令组范围__ ：
 * 在主机上执行的代码
 * C++ 的全部功能可在应用和命令组范围内使用 
 
 __Application scope__ and __command group scope__:
 * Code that executes on the host
 * The full capabilities of C++ are available at application and command group scope 

 __内核__ 范围：
 * 在设备上执行的代码。 
 * 在 __内核__范围 内，接受的 C++ 存在 __局限性__

__Kernel__ scope:
 * Code that executes on the device. 
 * At __kernel__ scope there are __limitations__ in accepted C++


<a id="Parallel-Kernels"> </a>
## 并行内核 | Parallel Kernels

__Parallel Kernel__ 允许一个操作的多个实例并行执行。这对于 __卸载__ 基本 __for-loop__ 的并行化执行很有用。在该循环中，每个迭代都是完全独立的，并且不分顺序。使用 __parallel_for__ 函数表示并行内核。
C++ 应用中一个的简单‘for’循环如下所示

A __parallel kernel__ allows multiple instances of an operation to execute in parallel. This is useful to __offload__ the parallel execution of a basic __for-loop__ in which each iteration is completely independent and can run in any order. Parallel kernels are expressed using the __parallel_for__ function.

A simple 'for' loop in a C++ application is written as follows:

```cpp
for (int i = 0; i < 1024; i++) {
    a[i] = b[i] + c[i];
}
```

以下是卸载到加速器的方法

The same loop can be offloaded to an accelerator as follows:

```cpp
h.parallel_for(range<1>(1024), [=](id<1> i){
    A[i] =  B[i] + C[i];
});
```
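One way to build intuition for what `parallel_for` promises is a host-only sketch in plain C++, with no SYCL runtime involved. The helper `emulate_parallel_for` is a hypothetical name introduced here for illustration only. It calls the kernel once per index, deliberately in reverse order, to show that a correct kernel cannot depend on the order of invocations. The sketch assumes both input vectors have the same size.

```cpp
#include <cstddef>
#include <vector>

// Host-only stand-in for parallel_for over a 1-D range (hypothetical helper,
// not part of SYCL): calls kernel(i) exactly once for every index in [0, n).
// Indices are visited in reverse to stress that order must not matter.
template <typename Kernel>
void emulate_parallel_for(std::size_t n, Kernel kernel) {
  for (std::size_t i = n; i-- > 0;) kernel(i);
}

// Element-wise vector addition written in the "one invocation per index" style.
// Assumes b and c have the same number of elements.
std::vector<int> vector_add(const std::vector<int>& b,
                            const std::vector<int>& c) {
  std::vector<int> a(b.size());
  emulate_parallel_for(b.size(),
                       [&](std::size_t i) { a[i] = b[i] + c[i]; });
  return a;
}
```

A real device may run these invocations concurrently and in any order; the sketch only shows that the result does not depend on that order.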


## 基本并行内核 | Basic Parallel Kernels

基本并行内核的功能通过 __range__ 、 __id__ 和 __item__ 类提供。__Range__ 类用于描述并行执行的 __迭代空间__ ，__id__ 类用于为并行执行中内核的单个实例 __创建索引__

The functionality of basic parallel kernels is exposed via the __range__, __id__, and __item__ classes. The __range__ class describes the __iteration space__ of parallel execution, and the __id__ class is used to __index__ an individual instance of a kernel in a parallel execution.


```cpp
h.parallel_for(range<1>(1024), [=](id<1> i){
// CODE THAT RUNS ON DEVICE 
});

```

如果您只需要 __index (id)__，那么上面的示例就可以了。但如果您需要内核代码中的 __range__ 值，那么你可以使用 __item__ 类，而不是 __id__ 类。您可以使用如下所示方式对于 __range__ 进行查询。__item__ 类代表内核函数的 __单个实例__ ，向执行范围内的属性查询提供其他函数

The above example is sufficient if all you need is the __index (id)__. If you also need the __range__ value in your kernel code, use the __item__ class instead of the __id__ class; it lets you query the __range__ as shown below. The __item__ class represents an __individual instance__ of a kernel function and exposes additional functions to query properties of the execution range.


```cpp
h.parallel_for(range<1>(1024), [=](item<1> item){
    auto i = item.get_id();
    auto R = item.get_range();
    // CODE THAT RUNS ON DEVICE
    
    
});

```

## ND RANGE 内核 | ND RANGE KERNELS

基本并行内核是实现 for-loop (for循环) 并行化的简便方法，但不允许在硬件级别进行性能优化。__ND-Range 内核__ 是另一种表示并行性的方法，通过提供对 __本地内存的访问以及将执行映射__ 到硬件上的计算单元实现底层性能调整。整个迭代空间被划分为称为 __工作组__ 的小群组，一个工作组内的 __工作项__ 安排在硬件的单个计算单元上。

Basic parallel kernels are an easy way to parallelize a for-loop, but they do not allow performance optimization at the hardware level. The __ND-range kernel__ is another way to express parallelism; it enables low-level performance tuning by providing access to __local memory and mapping executions__ to compute units on the hardware. The entire iteration space is divided into smaller groups called __work-groups__, and the __work-items__ within a work-group are scheduled on a single compute unit on the hardware.

通过将内核执行分组到工作组中，您将可以控制资源使用和负载均衡工作分配。nd_range 内核的功能通过 __nd_range__ 和 __nd_item__ 类提供。__nd_range__ 类表示使用每个工作组的全局执行范围和本地执行范围的 __分组执行范围__ 。__nd_item__ 类表示内核函数的 __单个实例__ ，并允许查询工作组范围和索引。 

Grouping kernel executions into work-groups gives you control over resource usage and lets you balance the distribution of work. The functionality of nd_range kernels is exposed via the __nd_range__ and __nd_item__ classes. The __nd_range__ class represents a __grouped execution range__, defined by the global execution range and the local execution range of each work-group. The __nd_item__ class represents an __individual instance__ of a kernel function and allows you to query the work-group range and index.

```cpp
h.parallel_for(nd_range<1>(range<1>(1024),range<1>(64)), [=](nd_item<1> item){
    auto idx = item.get_global_id();
    auto local_id = item.get_local_id();
    // CODE THAT RUNS ON DEVICE
});
```
<img src="Assets/ndrange.png">
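The relationship between the global id, the work-group, and the local id can be sketched in plain C++ on the host. The names `WorkItemIndex`, `decompose`, and `num_groups` are hypothetical helpers introduced for this illustration. The sketch assumes a one-dimensional range with no offset and a global size that is a multiple of the work-group size, as in the `nd_range<1>(range<1>(1024), range<1>(64))` example above.

```cpp
#include <cstddef>

// Position of one work-item inside a 1-D ND-range (hypothetical helper type).
struct WorkItemIndex {
  std::size_t group_id;  // which work-group the item belongs to
  std::size_t local_id;  // position of the item inside that work-group
};

// Number of work-groups; assumes global_size is a multiple of local_size.
constexpr std::size_t num_groups(std::size_t global_size,
                                 std::size_t local_size) {
  return global_size / local_size;
}

// Split a global id so that
// global_id == group_id * local_size + local_id.
constexpr WorkItemIndex decompose(std::size_t global_id,
                                  std::size_t local_size) {
  return {global_id / local_size, global_id % local_size};
}
```

With a global size of 1024 and a work-group size of 64, `num_groups` yields 16 work-groups, and every work-item in a group shares that group's local memory.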

## 缓冲区模型 | Buffer Model

__缓冲区将数据封装__ 在跨设备和主机的 SYCL 应用中。__Accessor(访问器)__ 是访问缓冲区数据的机制。

__Buffers encapsulate__ data in a SYCL application across both devices and host. __Accessors__ are the mechanism to access buffer data.

<a id="SYCL-Code-Anatomy"> </a>
## SYCL代码剖析 | SYCL Code Anatomy

使用 oneAPI 的程序需要包含 __sycl/sycl.hpp__ 。建议使用 using namespace 语句，以免重复键入 sycl 命名空间的限定名。

Programs which utilize oneAPI need to include __sycl/sycl.hpp__. It is recommended to employ a using-namespace statement to avoid typing repeated references to the sycl namespace.

```cpp
#include <sycl/sycl.hpp>
using namespace sycl;
```

__SYCL程序__ 是标准 C++。该程序在 __主机__ 上调用，并将计算卸载到 __加速器__。程序员可以使用SYCL的 __queue(队列)、buffer(缓冲区)、device(设备)和内核抽象__ 来管理可以分流卸载的部分计算和数据。

作为SYCL程序的第一步，我们创建一个 __queue(队列)__ 。通过将任务提交到队列，我们可以将计算卸载至 __device(设备)__ 。程序员可以通过 __selector(选择器)__ 选择 CPU、GPU、FPGA 和其他设备。该程序在此处使用默认的 q，这意味着SYCL运行时(runtime)会使用默认选择器选择功能最强大的设备。我们将在后面的模块中讨论设备、设备选择器以及缓冲区、访问器和内核的概念。下面是一个简单的SYCL程序，可帮助您快速了解上述概述。

设备(device)和主机(host)可以共享物理 __内存__ 或拥有截然不同的内存。当内存不同时，卸载计算需要 __在主机和设备之间复制数据__ 。 SYCL无需程序员来管理数据复制。通过创建 __缓冲区(buffer)和访问器(accessor)__ ，SYCL能够确保相应的数据无需您的接入也可供主机和设备使用。SYCL还允许程序员在必要时对于数据的移动进行显式地控制，以实现最佳性能。


在一个SYCL程序中，我们定义了一个 __kernel(内核)__ ，该内核将应用于索引空间中的每个点。对于本例这样的简单程序，索引空间直接映射到数组的元素。内核封装在 __C++ lambda函数__ 中。将lambda函数传递给索引空间中的一个点作为坐标数组。对于这一简单程序，索引空间坐标与数组索引相同。以下程序中的 __parallel_for__ 将lambda应用于索引空间。索引空间在parallel_for的第一个参数中定义为 __从0到 N-1的一维范围__ 。


__SYCL programs__ are standard C++. The program is invoked on the __host__ computer, and offloads computation to the __accelerator__. A programmer uses SYCL’s __queue, buffer, device, and kernel abstractions__ to direct which parts of the computation and data should be offloaded.

As a first step in a SYCL program we create a __queue__. We offload computation to a __device__ by submitting tasks to a queue. The programmer can choose CPU, GPU, FPGA, and other devices through the __selector__. This program uses a default-constructed queue q, which means the SYCL runtime selects the most capable device available at runtime by using the default selector. We will talk about devices, device selectors, and the concepts of buffers, accessors and kernels in the upcoming modules; below is a simple SYCL program to get you started with these concepts.

Device and host can either share physical __memory__ or have distinct memories. When the memories are distinct, offloading computation requires __copying data between host and device__. SYCL does not require the programmer to manage the data copies. By creating __Buffers and Accessors__, SYCL ensures that the data is available to host and device without any programmer effort. SYCL also gives the programmer explicit control over data movement when it is necessary to achieve the best performance.

In a SYCL program, we define a __kernel__, which is applied to every point in an index space. For simple programs like this one, the index space maps directly to the elements of the array. The kernel is encapsulated in a __C++ lambda function__. The lambda function is passed a point in the index space as an array of coordinates. For this simple program, the index space coordinate is the same as the array index. The __parallel_for__ in the below program applies the lambda to the index space. The index space is defined in the first argument of the parallel_for as a 1 dimensional __range from 0 to N-1__.

下面的代码展示了使用SYCL的简单矢量加法。请阅读第 1 步至第 6 步中的注释。

The code below shows a simple vector addition using SYCL. Read through the comments for step 1 through step 6.

```cpp
void SYCL_code(int* a, int* b, int* c, int N) {
  //Step 1: create a device queue
  //(developer can specify a device type via device selector or use default selector)
  auto R = range<1>(N);
  queue q;
  //Step 2: create buffers (represent both host and device memory)
  buffer buf_a(a, R);
  buffer buf_b(b, R);
  buffer buf_c(c, R);
  //Step 3: submit a command for (asynchronous) execution
  q.submit([&](handler &h){
    //Step 4: create buffer accessors to access buffer data on the device
    accessor A(buf_a,h,read_only);
    accessor B(buf_b,h,read_only);
    accessor C(buf_c,h,write_only);

    //Step 5: send a kernel (lambda) for execution
    h.parallel_for(range<1>(N), [=](auto i){
      //Step 6: write a kernel
      //Kernel invocations are executed in parallel
      //Kernel is invoked for each element of the range
      //Kernel invocation has access to the invocation id
      C[i] = A[i] + B[i];
    });
  });
}
```

<a id="Implicit-dependency-with-Accessors"> </a>
## 访问器(Accessor)的隐式依赖 | Implicit dependency with Accessors

* 访问器在 SYCL 图中创建 __数据依赖关系__ ，以对内核执行进行排序
* 如果两个内核使用相同的缓冲区，则第二个内核需要等待第一个内核完成，以避免争用。

* Accessors create __data dependencies__ in the SYCL graph that order kernel executions
* If two kernels use the same buffer, the second kernel needs to wait for the completion of the first kernel to avoid race conditions. 


<img src="Assets/buffer1.png">
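The ordering rule can be illustrated with a simplified host-side model in plain C++. This is only a sketch of the idea, not the algorithm a SYCL runtime actually uses: the `Task` type and the `dependencies` and `sample_tasks` functions are hypothetical names, and buffers are represented by strings. A later task is taken to depend on an earlier one whenever they touch the same buffer and at least one of them writes it. The task list mirrors the four kernels of the buffer sample in this section, under the assumption that kernel 2 both reads and writes A.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a command group: the buffers it reads and writes.
struct Task {
  std::string name;
  std::set<std::string> reads;
  std::set<std::string> writes;
};

// True if the two sets have at least one buffer name in common.
bool overlaps(const std::set<std::string>& a, const std::set<std::string>& b) {
  for (const auto& x : a)
    if (b.count(x)) return true;
  return false;
}

// For each task, list the indices of all earlier tasks it conflicts with.
// Two tasks conflict when one writes a buffer the other reads or writes.
std::vector<std::vector<std::size_t>> dependencies(
    const std::vector<Task>& tasks) {
  std::vector<std::vector<std::size_t>> deps(tasks.size());
  for (std::size_t j = 0; j < tasks.size(); ++j) {
    for (std::size_t i = 0; i < j; ++i) {
      const Task& earlier = tasks[i];
      const Task& later = tasks[j];
      bool conflict = overlaps(earlier.writes, later.reads) ||
                      overlaps(earlier.writes, later.writes) ||
                      overlaps(earlier.reads, later.writes);
      if (conflict) deps[j].push_back(i);
    }
  }
  return deps;
}

// The four kernels of the buffer sample, described by the buffers they touch.
std::vector<Task> sample_tasks() {
  return {
      {"kernel1", {}, {"A"}},          // writes A
      {"kernel2", {"A"}, {"A"}},       // updates A in place
      {"kernel3", {}, {"B"}},          // writes B, independent of A
      {"kernel4", {"A", "B"}, {"B"}},  // reads A, updates B
  };
}
```

In this model kernel 3 has no dependencies and is free to run alongside kernels 1 and 2, while kernel 4 conflicts with all three earlier kernels. Waiting for kernel 2 already implies waiting for kernel 1, so it is enough for kernel 4 to wait for kernels 2 and 3.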

下面的SYCL代码演示了访问器的隐式依赖关系：检查代码，没有必要进行任何修改：

1. 检查下面的代码单元，然后单击运行 ▶，将代码保存到文件

2. 接下来，在代码下面的 __构建并运行__ 部分中运行 ▶单元，以编译和执行代码。

The SYCL code below demonstrates Implicit dependency with Accessors: Inspect code, there are no modifications necessary:

1. Inspect the code cell below and click run ▶ to save the code to file

2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/buffer_sample.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>

constexpr int num=16;
using namespace sycl;

int main() {
  auto R = range<1>{ num };
  //Create Buffers A and B
  buffer<int> A{ R }, B{ R };
  //Create a device queue
  queue Q;
  //Submit Kernel 1
  Q.submit([&](handler& h) {
    //Accessor for buffer A
    accessor out(A,h,write_only);
    h.parallel_for(R, [=](auto idx) {
      out[idx] = idx[0]; }); });
  //Submit Kernel 2
  Q.submit([&](handler& h) {
    //This task will wait till kernel 1 is complete
    //Read-write access is needed because the kernel updates A in place
    accessor out(A,h);
    h.parallel_for(R, [=](auto idx) {
      out[idx] += idx[0]; }); });
  //Submit Kernel 3
  Q.submit([&](handler& h) { 
    //Accessor for Buffer B
    accessor out(B,h,write_only);
    h.parallel_for(R, [=](auto idx) {
      out[idx] = idx[0]; }); });
  //Submit task 4
  Q.submit([&](handler& h) {
   //This task will wait till kernel 2 and 3 are complete
   accessor in (A,h,read_only);
   accessor inout(B,h);
  h.parallel_for(R, [=](auto idx) {
    inout[idx] *= in[idx]; }); }); 
      
 // Back on the host: creating the host accessor waits for the kernels above
 host_accessor result(B,read_only);
  for (int i=0; i<num; ++i)
    std::cout << result[i] << "\n";      
  return 0;
}

###  构建并运行 | Build and Run

选择下面的单元，然后点击运行 ▶，以编译并执行代码：

Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_buffer.sh;if [ -x "$(command -v qsub)" ]; then ./q run_buffer.sh; else ./run_buffer.sh; fi

<a id="Synchronization:-Host-Accessor"> </a>
## 主机访问器 | Host Accessors

主机访问器是使用主机缓冲区访问目标的访问器。它是在命令组范围之外创建的，它可以访问的数据将可以在主机上使用。这些用于通过构建主机访问器对象将数据同步回主机。缓冲区销毁是将数据同步回主机的另一种方法。

A host accessor is an accessor that uses the host buffer access target. It is created outside the scope of a command group, and the data it gives access to is available on the host. Constructing a host accessor object synchronizes the data back to the host. Buffer destruction is the other way to synchronize the data back to the host.


## 同步：主机访问器 | Synchronization: Host Accessor

缓冲区掌控着存储在矢量中的数据。创建主机访问器是一个 __阻塞请求__ ，只有任何队列中修改相同缓冲区的所有排队的SYCL内核均完成执行并且数据可通过该主机访问器提供给主机后，才会返回。

The buffer takes ownership of the data stored in the vector. Creating a host accessor is a __blocking call__: it returns only after all enqueued SYCL kernels that modify the same buffer, in any queue, have completed execution and the data is available to the host via this host accessor.

下面的SYCL代码演示了与主机访问器的同步：检查代码，没有必要进行任何修改：
1. 检查下面的代码单元，然后单击运行 ▶，将代码保存到文件

2. 接下来，在代码下面的 __构建并运行__ 部分中运行 ▶单元，以编译和执行代码。

The SYCL code below demonstrates Synchronization with Host Accessor: Inspect code, there are no modifications necessary:

1. Inspect the code cell below and click run ▶ to save the code to file

2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.


In [None]:
%%writefile lab/host_accessor_sample.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;

int main() {
  constexpr int N = 16;
  auto R = range<1>(N);
  std::vector<int> v(N, 10);
  queue q;
  // Buffer takes ownership of the data stored in vector.  
  buffer buf(v);
  q.submit([&](handler& h) {
    accessor a(buf,h);
    h.parallel_for(R, [=](auto i) { a[i] -= 2; });
  });
  // Creating host accessor is a blocking call and will only return after all
  // enqueued SYCL kernels that modify the same buffer in any queue completes
  // execution and the data is available to the host via this host accessor.
  host_accessor b(buf,read_only);
  for (int i = 0; i < N; i++) std::cout << b[i] << " ";
  return 0;
}

### 构建并运行 | Build and Run

选择下面的单元，然后点击运行 ▶，以编译并执行代码：

Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_host_accessor.sh;if [ -x "$(command -v qsub)" ]; then ./q run_host_accessor.sh; else ./run_host_accessor.sh; fi

<a id="Synchronization:-Buffer-Destruction"> </a>
## 同步：缓冲区销毁 | Synchronization: Buffer Destruction

在下面的示例中，缓冲区创建在单独的函数范围内进行。当执行超出此 __函数范围__ 时，将调用缓冲区析构函数，该析构函数将放弃数据的所有权并将其复制回主机内存。

In the example below, buffer creation happens within a separate function scope. When execution advances beyond this __function scope__, the buffer destructor is invoked, which relinquishes ownership of the data and copies the data back to host memory.
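The write-back-on-destruction behaviour can be pictured with an analogy in plain C++. The class `WriteBackBuffer` and the function `subtract_two` are hypothetical names used only for this sketch; they are not part of SYCL. The sketch assumes the "device" works on a private copy of the data, and it leaves out the fact that a real buffer destructor also waits for outstanding kernels to finish.

```cpp
#include <vector>

// Hypothetical stand-in for a buffer: it works on a private copy of the host
// data and writes that copy back to the host vector when it is destroyed.
class WriteBackBuffer {
 public:
  explicit WriteBackBuffer(std::vector<int>& host)
      : host_(host), device_copy_(host) {}

  // The copy back to host memory happens here, at end of scope.
  ~WriteBackBuffer() { host_ = device_copy_; }

  std::vector<int>& data() { return device_copy_; }

 private:
  std::vector<int>& host_;        // memory owned by the caller
  std::vector<int> device_copy_;  // private working copy
};

// Mirrors the shape of the sample: the buffer lives only inside this function.
void subtract_two(std::vector<int>& v) {
  WriteBackBuffer buf(v);
  for (int& x : buf.data()) x -= 2;
  // v is still unchanged at this point; it is updated when buf is destroyed.
}
```

After `subtract_two(v)` returns, `v` holds the updated values, which is the same reason the sample can read the vector directly once the function that owns the buffer has returned.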

下面的SYCL代码演示了缓冲区破坏同步：检查代码，没有必要进行任何修改：

1. 检查下面的代码单元，然后单击运行 ▶，将代码保存到文件

2. 接下来，在代码下面的 __构建并运行__ 部分中运行 ▶单元，以编译和执行代码。

The SYCL code below demonstrates Synchronization with Buffer Destruction: Inspect code, there are no modifications necessary:

1. Inspect the code cell below and click run ▶ to save the code to a file.

2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/buffer_destruction2.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>
constexpr int N = 16;
using namespace sycl;

// Buffer creation happens within a separate function scope.
void SYCL_code(std::vector<int> &v, queue &q) {
  auto R = range<1>(N);
  buffer buf(v);
  q.submit([&](handler &h) {
    accessor a(buf,h);
    h.parallel_for(R, [=](auto i) { a[i] -= 2; });
  });
}
int main() {
  std::vector<int> v(N, 10);
  queue q;
  SYCL_code(v, q);
  // When execution advances beyond this function scope, buffer destructor is
  // invoked which relinquishes the ownership of data and copies back the data to
  // the host memory.
  for (int i = 0; i < N; i++) std::cout << v[i] << " ";
  return 0;
}

### 构建并运行 | Build and Run

选择下面的单元，然后点击运行 ▶，以编译并执行代码：

Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_buffer_destruction.sh;if [ -x "$(command -v qsub)" ]; then ./q run_buffer_destruction.sh; else ./run_buffer_destruction.sh; fi

<a id="Custom-Device-Selector"> </a>
## 自定义设备选择器 | Custom Device Selector

以下代码展示了使用你自己逻辑的自定义设备选择器。缺省的设备选择器会优先考虑GPU，因为该设备返回的整数评级比CPU或其他加速器高。

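The following code shows a custom device selector that uses your own logic. A device selector returns an integer rating for each device, and the device with the highest rating is selected; a selector that prioritizes GPUs therefore returns a higher rating for a GPU than for a CPU or other accelerator.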

#### 指定特定供应商名称(vendor name)的自定义设备选择器样例 | Example of custom device selector with specific vendor name
```cpp
// Return 1 if the vendor name contains "Intel", 0 otherwise.
// A rating of 0 does not prevent another device from being picked as a second choice.
int custom_device_selector(const sycl::device& d) {
  return d.get_info<sycl::info::device::vendor>().find("Intel") != std::string::npos;
}

sycl::device preferred_device { custom_device_selector };
sycl::queue q(preferred_device);
```

#### 指定特定GPU设备名称(device name)的自定义设备选择器样例 | Example of custom device selector with specific GPU device name
```cpp
// Return 1 if the device is a GPU and its name contains "Intel", 0 otherwise.
int custom_device_selector(const sycl::device& d) {
  return d.is_gpu() && (d.get_info<sycl::info::device::name>().find("Intel") != std::string::npos);
}

sycl::device preferred_device { custom_device_selector };
sycl::queue q(preferred_device);
```

#### 基于设备种类优先顺序(device name)的自定义设备选择器样例 | Example of custom device selector with priority based on device
```cpp
// Highest priority for a Xeon device, then any GPU, then any CPU.
int custom_device_selector(const sycl::device& d) {
  int rating = 0;
  if (d.get_info<sycl::info::device::name>().find("Xeon") != std::string::npos) rating = 3;
  else if (d.is_gpu()) rating = 2;
  else if (d.is_cpu()) rating = 1;
  return rating;
}

sycl::device preferred_device { custom_device_selector };
sycl::queue q(preferred_device);
```
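How a rating turns into a choice can be sketched on the host without any SYCL device. The `FakeDevice` type and the `rate` and `pick_device` functions are hypothetical names introduced for this illustration; `rate` repeats the priority logic shown above, and `pick_device` returns the index of the highest-rated device. The sketch assumes that ties go to the first device seen, whereas a real runtime decides ties itself, and that a negative rating excludes a device.

```cpp
#include <string>
#include <vector>

// Hypothetical description of a device; a real program would query
// sycl::device instead.
struct FakeDevice {
  std::string name;
  bool is_gpu;
  bool is_cpu;
};

// Same priority logic as the selector above, applied to FakeDevice:
// a "Xeon" name rates 3, any GPU rates 2, any CPU rates 1, anything else 0.
int rate(const FakeDevice& d) {
  if (d.name.find("Xeon") != std::string::npos) return 3;
  if (d.is_gpu) return 2;
  if (d.is_cpu) return 1;
  return 0;
}

// Index of the device with the highest non-negative rating, or -1 if the
// list is empty. Ties keep the first device seen (an assumption of this sketch).
int pick_device(const std::vector<FakeDevice>& devices) {
  int best = -1;
  int best_score = -1;
  for (int i = 0; i < static_cast<int>(devices.size()); ++i) {
    int score = rate(devices[i]);
    if (score > best_score) {
      best = i;
      best_score = score;
    }
  }
  return best;
}
```

A device that rates 0 can still be picked when nothing rates higher, which matches the note that a rating of 0 does not rule a device out.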

以下SYCL代码展示了自定义设备选择器：检查代码，没有必要进行任何修改：

1. 检查下面的代码单元，然后单击运行 ▶，将代码保存到文件。

2. 接下来，在代码下面的 __构建并运行__ 部分中运行 ▶单元，以编译和执行代码。


The SYCL code below demonstrates Custom Device Selector: Inspect code, there are no modifications necessary:

1. Inspect the code cell below and click run ▶ to save the code to a file.

2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/custom_device_sample.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;
class my_device_selector {
public:
    my_device_selector(std::string vendorName) : vendorName_(vendorName){};
    int operator()(const device& dev) const {
    int rating = 0;
    //We are querying for the custom device specific to a Vendor and if it is a GPU device we
    //are giving the highest rating as 3 . The second preference is given to any GPU device and the third preference is given to
    //CPU device.
    if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) != std::string::npos))
        rating = 3;
    else if (dev.is_gpu()) rating = 2;
    else if (dev.is_cpu()) rating = 1;
    return rating;
    };
    
private:
    std::string vendorName_;
};
int main() {
    //pass in the name of the vendor for which the device you want to query 
    std::string vendor_name = "Intel";
    //std::string vendor_name = "AMD";
    //std::string vendor_name = "Nvidia";
    my_device_selector selector(vendor_name);
    queue q(selector);
    std::cout << "Device: "
    << q.get_device().get_info<info::device::name>() << "\n";
    return 0;
}


### 构建并运行 | Build and Run

选择下面的单元，然后点击运行 ▶，以编译并执行代码：

Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_custom_device.sh;if [ -x "$(command -v qsub)" ]; then ./q run_custom_device.sh; else ./run_custom_device.sh; fi

<a id="Code-Sample:-Complex-Number-Multiplication"> </a>
# 代码样例：复数乘法 | Code Sample: Complex Number Multiplication

以下是表示复数的自定义类类型的定义。
* [Complex.hpp](./src/Complex.hpp) 文件定义了 Complex2 类。
* Complex2 类有两个 int 类型的成员变量："real” 和 "imag"。
* Complex2 类有一个用于执行复数相乘的成员函数。函数 complex_mul 返回执行两个复数相乘的 Complex2 类的对象。
* 我们要通过SYCL代码调用 complex_mul 函数。

The following is the definition of a custom class type that represents complex numbers.
* The file [Complex.hpp](./src/Complex.hpp) defines the Complex2 class.
* The Complex2 class has two member variables, "real" and "imag", of type int.
* The Complex2 class has a member function for performing complex number multiplication. The function complex_mul returns an object of type Complex2 that holds the product of two complex numbers.
* We are going to call the complex_mul function from our SYCL code.
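Since Complex.hpp itself is not reproduced here, the following is only a minimal sketch of what a class with the described members might look like. The name `Complex2Sketch` is hypothetical and chosen so it cannot be confused with the real Complex2 class, which may differ in its details. It applies the usual rule (a + bi)(c + di) = (ac - bd) + (ad + bc)i.

```cpp
// Hypothetical stand-in for the class described above; the real Complex.hpp
// may differ (for example, it also provides comparison and stream output).
struct Complex2Sketch {
  int real;
  int imag;

  // (a + bi) * (c + di) = (ac - bd) + (ad + bc)i
  Complex2Sketch complex_mul(const Complex2Sketch& other) const {
    return {real * other.real - imag * other.imag,
            real * other.imag + imag * other.real};
  }
};
```

Because the function only reads its operands and returns a new value, each multiplication is independent of the others, which is what makes it suitable for calling from a `parallel_for` kernel.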

1. 检查下面的代码单元，然后单击运行 ▶，将代码保存到文件
2. 接下来，在代码下面的 __构建并运行__ 部分中运行 ▶单元，以编译和执行代码。


1. Inspect the code cell below, click run ▶ to save the code to file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/complex_mult.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iomanip>
#include <vector>
#include "Complex.hpp"

using namespace sycl;
using namespace std;

// Number of complex numbers passing to the SYCL code
static const int num_elements = 10000;

class CustomDeviceSelector {
 public:
  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};
  int operator()(const device &dev) {
    int device_rating = 0;
    //We are querying for the custom device specific to a Vendor and if it is a GPU device we
    //are giving the highest rating as 3 . The second preference is given to any GPU device and the third preference is given to
    //CPU device. 
    if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) !=
                        std::string::npos))
      device_rating = 3;
    else if (dev.is_gpu())
      device_rating = 2;
    else if (dev.is_cpu())
      device_rating = 1;
    return device_rating;
  };

 private:
  std::string vendorName_;
};

// in_vect1 and in_vect2 are the vectors with num_elements complex numbers and
// are inputs to the parallel function
void SYCLParallel(queue &q, std::vector<Complex2> &in_vect1,
                   std::vector<Complex2> &in_vect2,
                   std::vector<Complex2> &out_vect) {
  auto R = range(in_vect1.size());
  if (in_vect2.size() != in_vect1.size() || out_vect.size() != in_vect1.size()){ 
    std::cout << "ERROR: Vector sizes do not  match"<< "\n";
    return;
  }
  // Setup input buffers
  buffer bufin_vect1(in_vect1);
  buffer bufin_vect2(in_vect2);

  // Setup Output buffers 
  buffer bufout_vect(out_vect);

  std::cout << "Target Device: "
            << q.get_device().get_info<info::device::name>() << "\n";
  // Submit Command group function object to the queue
  q.submit([&](auto &h) {
    // Accessors set as read mode
    accessor V1(bufin_vect1,h,read_only);
    accessor V2(bufin_vect2,h,read_only);
    // Accessor set to Write mode
    accessor V3 (bufout_vect,h,write_only);
    h.parallel_for(R, [=](auto i) {
      V3[i] = V1[i].complex_mul(V2[i]);
    });
  });
  q.wait_and_throw();
}
void Scalar(std::vector<Complex2> &in_vect1,
                 std::vector<Complex2> &in_vect2,
                 std::vector<Complex2> &out_vect) {
  if ((in_vect2.size() != in_vect1.size()) || (out_vect.size() != in_vect1.size())){
    std::cout<<"ERROR: Vector sizes do not match"<<"\n";
    return;
    }
  for (int i = 0; i < in_vect1.size(); i++) {
    out_vect[i] = in_vect1[i].complex_mul(in_vect2[i]);
  }
}
// Compare the results of the two output vectors from parallel and scalar. They
// should be equal
int Compare(std::vector<Complex2> &v1, std::vector<Complex2> &v2) {
  int ret_code = 1;
  if(v1.size() != v2.size()){
    // Sizes differ: report failure without indexing past the shorter vector
    return -1;
  }
  for (int i = 0; i < v1.size(); i++) {
    if (v1[i] != v2[i]) {
      ret_code = -1;
      break;
    }
  }
  return ret_code;
}
int main() {
  // Declare your Input and Output vectors of the Complex2 class
  vector<Complex2> input_vect1;
  vector<Complex2> input_vect2;
  vector<Complex2> out_vect_parallel;
  vector<Complex2> out_vect_scalar;

  for (int i = 0; i < num_elements; i++) {
    input_vect1.push_back(Complex2(i + 2, i + 4));
    input_vect2.push_back(Complex2(i + 4, i + 6));
    out_vect_parallel.push_back(Complex2(0, 0));
    out_vect_scalar.push_back(Complex2(0, 0));
  }

  // Initialize your Input and Output Vectors. Inputs are initialized as below.
  // Outputs are initialized with 0
  try {
    // Pass in the name of the vendor whose device you want to query
    std::string vendor_name = "Intel";
    // std::string vendor_name = "AMD";
    // std::string vendor_name = "Nvidia";
    CustomDeviceSelector selector(vendor_name);
    queue q(selector);
    // Call the SYCLParallel with the required inputs and outputs
    SYCLParallel(q, input_vect1, input_vect2, out_vect_parallel);
  } catch (...) {
    // An exception was thrown during device selection or kernel execution
    std::cout << "Failure" << "\n";
    std::terminate();
  }

  std::cout
      << "****************************************Multiplying Complex numbers "
         "in Parallel********************************************************"
      << "\n";
  // Print the outputs of the Parallel function
  int indices[]{0, 1, 2, 3, 4, (num_elements - 1)};
  constexpr size_t indices_size = sizeof(indices) / sizeof(int);

  for (size_t i = 0; i < indices_size; i++) {
    int j = indices[i];
    if (i == indices_size - 1) std::cout << "...\n";
    std::cout << "[" << j << "] " << input_vect1[j] << " * " << input_vect2[j]
              << " = " << out_vect_parallel[j] << "\n";
  }
  // Call the Scalar function with the required input and outputs
  Scalar(input_vect1, input_vect2, out_vect_scalar);

  // Compare the outputs from the parallel and the scalar functions. They should
  // be equal

  int ret_code = Compare(out_vect_parallel, out_vect_scalar);
  if (ret_code == 1) {
    std::cout << "Complex multiplication successfully run on the device"
              << "\n";
  } else
    std::cout
        << "*********************************************Verification Failed. Results are "
           "not matched**************************"
        << "\n";

  return 0;
}


### 构建并运行 | Build and Run
选择下面的单元，然后点击运行 ▶，以编译并执行代码：

Select the cell below and click run ▶ to compile and execute the code:

In [None]:
! chmod 755 q; chmod 755 run_complex_mult.sh; if [ -x "$(command -v qsub)" ]; then ./q run_complex_mult.sh; else ./run_complex_mult.sh; fi
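
To check the printed results by hand, the first element can be worked out on paper: `(2 + 4i) * (4 + 6i) = (2*4 - 4*6) + (2*6 + 4*4)i = -16 + 28i`. The sketch below reproduces that arithmetic in plain C++ without SYCL. It assumes that `complex_mul` implements the standard complex product; `ComplexInt`, `ComplexMul`, `ExpectedProduct`, and `CheckWorkedExample` are hypothetical names used only for illustration and are not part of the sample.

```cpp
#include <cassert>

// Hypothetical stand-in for the sample's Complex2 class; it carries only
// the two integer parts that the worked example needs.
struct ComplexInt {
  int real;
  int imag;
};

// Standard complex product: (a + bi)(c + di) = (ac - bd) + (ad + bc)i
constexpr ComplexInt ComplexMul(ComplexInt x, ComplexInt y) {
  return {x.real * y.real - x.imag * y.imag,
          x.real * y.imag + x.imag * y.real};
}

// Element i of the sample's inputs is (i + 2, i + 4) and (i + 4, i + 6).
constexpr ComplexInt ExpectedProduct(int i) {
  return ComplexMul({i + 2, i + 4}, {i + 4, i + 6});
}

// Checks the two elements worked out by hand.
inline void CheckWorkedExample() {
  assert(ExpectedProduct(0).real == -16 && ExpectedProduct(0).imag == 28);
  assert(ExpectedProduct(1).real == -20 && ExpectedProduct(1).imag == 46);
}
```

If the values printed for indices 0 and 1 differ from these, the mismatch is more likely in the kernel or the data transfer than in the arithmetic itself.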

<a id="Lab-Exercise:-Vector-Add"> </a>

## 实验练习: 矢量相加 | Lab Exercise: Vector Add

使用学习到的SYCL中Buffer(缓冲区)及Accessor(访问器)，完成下面的代码练习：
- 代码中有一个矢量 `vector1`，已在主机端(host)初始化。
- 内核代码将 `vector1` 的每个元素加 1。
- 创建第二个矢量 `vector2`，并将其初始化为 20。
- 为第二个矢量创建 SYCL 缓冲区（buffer）。
- 在内核代码中，为第二个矢量的缓冲区创建第二个访问器（accessor）。
- 将矢量增量运算变为相加运算，将 `vector2` 与 `vector1` 相加。

Complete the coding exercise below using SYCL Buffer and Accessor concepts:
- The code has a vector `vector1` initialized on the host.
- The kernel code increments each element of `vector1` by 1.
- Create a second vector `vector2` and initialize it to the value 20.
- Create a SYCL buffer for the second vector.
- In the kernel code, create a second accessor for the second vector's buffer.
- Modify the vector increment to a vector add by adding `vector2` to `vector1`.
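
As a reference for what the finished exercise should print, the following host-only sketch performs the same element-wise addition sequentially. It deliberately uses no SYCL, so it does not show the buffer or accessor steps; `ReferenceVectorAdd` is a hypothetical helper name introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential host-only reference for the lab's expected result:
// returns a copy of v1 with the matching element of v2 added to each entry.
inline std::vector<int> ReferenceVectorAdd(std::vector<int> v1,
                                           const std::vector<int> &v2) {
  assert(v1.size() == v2.size());
  for (std::size_t i = 0; i < v1.size(); ++i) {
    v1[i] += v2[i];
  }
  return v1;
}
```

With `vector1` filled with 10 and `vector2` filled with 20, as in the lab, every element of the result is 30, which is the value the completed SYCL version should print for all 256 entries.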


1. 根据下面每一步的指示，编辑并完成在代码单元框中的代码，然后点击运行 ▶，将代码保存到文件中
2. 然后在代码下面的 __构建并运行__ 部分中运行 ▶ 单元代码，以编译并执行代码


1. Edit the code cell below by following the steps and then click run ▶ to save the code to a file.
2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.

In [None]:
%%writefile lab/vector_add.cpp
//==============================================================
// Copyright © Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

using namespace sycl;

int main() {
    const int N = 256;
    
    //# Initialize a vector and print values
    std::vector<int> vector1(N, 10);
    std::cout<<"\nInput Vector1: ";    
    for (int i = 0; i < N; i++) std::cout << vector1[i] << " ";

    //# STEP 1 : Create second vector, initialize to 20 and print values

    //# YOUR CODE GOES HERE
    
    
    
    
    //# Create Buffer
    
    buffer vector1_buffer(vector1);

    //# STEP 2 : Create buffer for second vector 

    //# YOUR CODE GOES HERE




    //# Submit task to add vector
    queue q;
    q.submit([&](handler &h) {
      //# Create accessor for vector1_buffer
      accessor vector1_accessor (vector1_buffer,h);
      
      //# STEP 3 - add second accessor for second buffer

      //# YOUR CODE GOES HERE



      h.parallel_for(range<1>(N), [=](id<1> index) {

        //# STEP 4 : Modify the code below to add the second vector to first one

        vector1_accessor[index] += 1;



      });
    });

    //# Create a host accessor to copy data from device to host
    host_accessor h_a(vector1_buffer, read_only);

    //# Print Output values through the host accessor
    std::cout << "\nOutput Values: ";
    for (int i = 0; i < N; i++) std::cout << h_a[i] << " ";
    std::cout << "\n";

    return 0;
}


### 构建并运行 | Build and Run
选择下面的单元，然后点击运行 ▶，以编译并执行代码：

Select the cell below and click Run ▶ to compile and execute the code above:

In [None]:
! chmod 755 q; chmod 755 run_vector_add.sh; if [ -x "$(command -v qsub)" ]; then ./q run_vector_add.sh; else ./run_vector_add.sh; fi

***
# 总结 | Summary

在本课程中，您已学到：
* 基本 SYCL 类
* 如何选择用于卸载内核工作负载的设备
* 如何使用缓冲区、访问器、命令组句柄和内核编写SYCL程序
* 如何使用主机访问器和缓冲区销毁进行同步

In this module you learned:
* The fundamental SYCL classes
* How to select the device to offload kernel workloads to
* How to write a SYCL program using buffers, accessors, a command group handler, and a kernel
* How to use host accessors and buffer destruction for synchronization


<html><body><span style="color:green"><h1>调查 | Survey</h1></span></body></html>

[非常感谢您提供的任何反馈意见，以便我们改善整体培训质量和体验。谢谢 | We would appreciate any feedback you’d care to give, so that we can improve the overall training quality and experience. Thanks! ](https://intel.az1.qualtrics.com/jfe/form/SV_6zljPDDUQ0RBRsx)