# SYCL

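## Buffer and Accessor:
- Refer to the slides.

## Buffer:
- A buffer manages data that is shared between the host and the devices; the SYCL runtime decides when the data is copied.

## Accessor:
- An accessor requests access to a buffer's data inside a command group and states the access mode (read, write, or read_write).
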
## Buffer and Accessor in SYCL code:
- Buffer class
- Accessor class
- Buffer and accessor in SYCL code:
    - `buffer<data_type, dimensions> buff_a(range<dimensions>(size));`
    - `buffer<data_type, dimensions> buff_c(range<dimensions>(size));`
    - `auto acc_A = buff_a.get_access<access::mode::read>(cgh);`
    - `auto acc_C = buff_c.get_access<access::mode::read_write>(cgh);`


- The instructor connects to PARAM Utkarsh through a web browser.
- SSH connections use port 4422, while web-browser access uses port 8447.

- The instructor opens the script that sets up the SYCL environment:
    - `vim /scratch/parade/Laxmikant/set_path.sh`
- This script sets up icpx, Intel's compiler for SYCL.
- To load the environment, run:
    - `source /scratch/parade/Laxmikant/binaries/oneAPI_basetoolkit_inst_dir/setvars.sh`
- Running the script with `sh` also works, but it executes in a new shell, so the variables it sets are lost when that shell exits; `source` is preferred because it applies them to the current shell.
- The output of the script shows that Intel's compiler is called icpx.

## Sample SYCL program:

```c++
#include <iostream>
#include <vector>
#include <sycl/sycl.hpp>

using namespace sycl;

#define N 10

int main(){
    // this task will be executed on CPU:
    queue q(cpu_selector_v);
    // this task will be executed on GPU:
    // queue q(gpu_selector_v);
    // executed on the default device chosen by the SYCL runtime:
    // queue q;

    // Host data (example values). Buffers wrap ordinary host memory,
    // so malloc_shared is not needed in the buffer/accessor model.
    std::vector<int> A(N, 90), B(N, 10), OUT(N, 0);

    {
        buffer<int, 1> buffA(A.data(), range<1>(N));
        buffer<int, 1> buffB(B.data(), range<1>(N));
        buffer<int, 1> buffOUT(OUT.data(), range<1>(N));

        q.submit([&](handler &cgh) {
            auto accA = buffA.get_access<access::mode::read>(cgh);
            auto accB = buffB.get_access<access::mode::read>(cgh);
            auto accOUT = buffOUT.get_access<access::mode::read_write>(cgh);

            // for parallel execution (one work-item per element):
            cgh.parallel_for(range<1>(N), [=](id<1> idx){
                accOUT[idx] = accA[idx] + accB[idx];
            });

            // For serial execution: single_task takes no range,
            // so the loop goes inside the kernel.
            // cgh.single_task([=](){
            //     for(int i = 0; i < N; i++) accOUT[i] = accA[i] + accB[i];
            // });
        });
    } // the buffers are destroyed here, which copies the result back to OUT

    for(int i = 0; i < N; i++) std::cout << OUT[i] << " ";
    std::cout << std::endl;

    return 0;
}
```

- To locate the icpx compiler, run `which icpx`.
- icpx must be on your `PATH` for this to work.
- To compile:
    - `icpx -fsycl vectoradd_sycl.cpp -o sycl1_obj`
- General syntax:
    - `icpx -fsycl filename.cpp -o objectname`
- To list the resulting executable in the current directory:
    - `ls -l | grep sycl1`
- To run the executable:
    - `./sycl1_obj`


## Enqueuing a Kernel:
- Learn about queues and how to submit work to them.
- Refer to the slides.

## Queue:
- In SYCL, all work is submitted via commands to a queue.
- The queue has an associated device; any commands enqueued to the queue will target that device.
- There are several different ways to construct a queue. The most straightforward way is to default construct one.

## Command Groups:
- In the buffer/accessor model, commands must be enqueued via command groups.
- A command group consists of a series of commands to be executed by a device.
- These commands include (refer to the slides)
    

## Composing Command Groups:
- Refer to the slides.
- Calling `.wait()` on the queue, or on the event returned by `submit`, blocks the host (CPU) until the submitted device work has completed.

## SYCL Kernel function:
- SYCL kernel functions are defined using one of the kernel invoke APIs provided by the handler (for example `parallel_for` or `single_task`).
- These add a SYCL kernel function command to the command group.
- Only one SYCL kernel function command is allowed in a command group.

- Note:
- Lambda functions can capture variables from the enclosing scope through the capture clause `[]`.
- `[&]` captures all used variables by reference.
- `[=]` captures all used variables by value.
- This is why the command group lambda passed to `submit` uses `[&]`, while the kernel lambda uses `[=]`: a SYCL kernel must capture by value.
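
The difference between the two capture modes shows up when the captured variable changes after the lambda is created. A small plain C++ sketch (no SYCL needed; the function name `capture_demo` is made up for illustration):

```cpp
#include <utility>

// Returns {value seen by the by-value lambda, value seen by the
// by-reference lambda} after x is changed from 1 to 2.
std::pair<int, int> capture_demo() {
    int x = 1;
    auto by_value = [=]() { return x; };      // copies x at this point
    auto by_reference = [&]() { return x; };  // refers to the original x
    x = 2;
    return {by_value(), by_reference()};
}
```

The by-value lambda still sees 1, while the by-reference lambda sees 2.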


## Unified Shared Memory:
- Refer to the slides.
- Unified virtual address space.
- Pointer-based structure.
- Shared memory allocations.

## USM has 3 different kinds of allocations:
- Host allocations (`malloc_host`)
- Device allocations (`malloc_device`)
- Shared allocations (`malloc_shared`), which can be accessed by both the host and the device.

- `malloc_shared()` can be used instead of the buffer and accessor model.
- The `malloc_shared()` syntax is much more concise.

## Important notes on PARAM Utkarsh's login:
- The IP address to access PARAM Utkarsh has changed.
- The new IP address to access PARAM Utkarsh is: 10.180.16.12 

## Use these commands to add SYCL to environment path variables in terminal:

```bash

export PATH=/scratch/parade/gccv12.2.0_inst_dir/bin/:$PATH

export LD_LIBRARY_PATH=/scratch/parade/gccv12.2.0_inst_dir/lib64/:/scratch/parade/mpcv1.2.1._inst_dir/lib/:/scratch/parade/mpfrv4.1.0_inst_dir/lib:/scratch/parade/gmpv6.2.1_inst_dir/lib/:$LD_LIBRARY_PATH

```

## Make sure gcc is installed and on the PATH:
```bash
which gcc
```

## Finally load the libraries by running the following command:
```bash
source /scratch/parade/Laxmikant/binaries/oneAPI_basetoolkit_inst_dir/setvars.sh
```

## Check whether icpx is loaded:
```bash
which icpx
```

## icpx is installed in the following location:
- `/scratch/parade/Laxmikant/binaries/oneAPI_basetoolkit_inst_dir/compiler/2023.1.0/linux/bin/icpx`


## Compiling a SYCL program:
- use the following syntax: 
```bash
    icpx -fsycl filename.cpp -o objectname
```
- for now use:
```bash
    icpx -fsycl vectoradd_sycl_usm.cpp -o vecadd_USM
```
- create a shell script file:
```bash
    vim slurm.sh
```
- and write the following bash script inside that file:
```bash
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks-per-core=10
#SBATCH --cpus-per-task=10
#SBATCH --partition=standard
##SBATCH --gres=gpu:1
#SBATCH --job-name=sycl
#SBATCH --error=error_%j.err
#SBATCH --output=output_%j.out
#SBATCH --time=01:00:00

##SBATCH --mem=default
##SBATCH --mail-user=email@gmail.com
##SBATCH --mail-type=ALL
##SBATCH --ntasks-per-sockets=4

ulimit -s unlimited

##source /scratch/parade/Laxmikant/set_path.sh

##./SYCLvecadd_icpx

./vecadd_USM
###############################

```

- Note that this script is interpreted by Slurm.
- Lines that start with `#SBATCH` are read by Slurm as directives, while lines that start with `##` are ignored as comments.
- `ulimit -s unlimited` removes the limit on the stack size.

## Executing a SYCL program:
- Run `sinfo` to view all the partitions with their availability, number of nodes, node states, and node lists.
- type the following to submit the job:
```bash
    sbatch slurm.sh
```
- type the following to see the status of your job:
```bash
    squeue | grep Insert_JOBID_here
```
- If the job has already completed, nothing is listed. Run `ls` to find the output and error files in the current directory.
- The output file is named `output_<jobID>.out`.
- The error file is named `error_<jobID>.err`.
- use this to display the output file:
```bash
    cat output_Insert_JOBID_here.out 
```
- the following output will be displayed:
<pre>
==========================================
SLURM_CLUSTER_NAME = paramutkarsh
SLURM_ARRAY_JOB_ID = 
SLURM_ARRAY_TASK_ID = 
SLURM_ARRAY_TASK_COUNT = 
SLURM_ARRAY_TASK_MAX = 
SLURM_ARRAY_TASK_MIN = 
SLURM_JOB_ACCOUNT = nsmexternal
SLURM_JOB_ID = 158209
SLURM_JOB_NAME = sycl
SLURM_JOB_NODELIST = hm013
SLURM_JOB_USER = c-huk48
SLURM_JOB_UID = 21915
SLURM_JOB_PARTITION = standard
SLURM_TASK_PID = 38835
SLURM_SUBMIT_DIR = /home/c-huk48/SYCL
SLURM_CPUS_ON_NODE = 10
SLURM_NTASKS = 
SLURM_TASK_PID = 38835
==========================================
100 100 100 100 100 100 100 100 100 100 
</pre>

## Another example SYCL code by using buffer-accessor model:

- Save the file as `vectoradd_sycl.cpp`.

```cpp
#include<iostream>
#include<CL/sycl.hpp>

using std::cout;
using std::endl;
using namespace sycl;

#define N 10

int main(){
    int *dataA, *dataB, *dataC, i;
    queue Q;

    dataA = new int[N];
    dataB = new int[N];
    dataC = new int[N];

    for(i = 0; i < N; i++){
        dataA[i] = 90;
        dataB[i] = 10;
    }

    {
        // Keep the buffers in their own scope so that they are destroyed
        // before the host arrays are freed below.
        buffer<int, 1> buff_a(dataA, range<1>(N));
        buffer<int, 1> buff_b(dataB, range<1>(N));
        buffer<int, 1> buff_c(dataC, range<1>(N));

        Q.submit([&](handler &cgh){
            auto acc_A = buff_a.get_access<access::mode::read>(cgh);
            auto acc_B = buff_b.get_access<access::mode::read>(cgh);
            auto acc_C = buff_c.get_access<access::mode::read_write>(cgh);

            // indx is the index of the current work-item; id<1> means the index is 1-dimensional:
            cgh.parallel_for(range<1>(N), [=](id<1> indx){
                acc_C[indx] = acc_A[indx] + acc_B[indx];
            });
        });

        // A host accessor waits for the kernel to finish before exposing the data:
        auto C = buff_c.get_access<access::mode::read>();

        for(i=0; i<N; i++){
            cout << C[i] << " ";
        }

        cout << endl;
    }

    delete[] dataA;
    delete[] dataB;
    delete[] dataC;

    return 0;
}
```

- Compile using: 
```bash
    icpx -fsycl vectoradd_sycl.cpp -o vecadd_BUFACC
```

- In the Slurm script, change the executable name: `vecadd_USM` -> `vecadd_BUFACC`
- submit the job by using:
```bash
    sbatch slurm.sh
```
- Once the job is successfully submitted, you will receive its job ID.
- check the status of job by using:
```bash
    squeue | grep insert_job_id_here
```
- view the output by using the following:
```bash
    cat output_JobID.out
```
- You will receive the following output:
<pre>
==========================================
SLURM_CLUSTER_NAME = paramutkarsh
SLURM_ARRAY_JOB_ID = 
SLURM_ARRAY_TASK_ID = 
SLURM_ARRAY_TASK_COUNT = 
SLURM_ARRAY_TASK_MAX = 
SLURM_ARRAY_TASK_MIN = 
SLURM_JOB_ACCOUNT = nsmexternal
SLURM_JOB_ID = 158257
SLURM_JOB_NAME = sycl
SLURM_JOB_NODELIST = hm013
SLURM_JOB_USER = c-huk48
SLURM_JOB_UID = 21915
SLURM_JOB_PARTITION = standard
SLURM_TASK_PID = 39611
SLURM_SUBMIT_DIR = /home/c-huk48/SYCL
SLURM_CPUS_ON_NODE = 10
SLURM_NTASKS = 
SLURM_TASK_PID = 39611
==========================================
100 100 100 100 100 100 100 100 100 100 

</pre>


## Matrix multiplication in SYCL code:

```cpp
#include<iostream>
#include<CL/sycl.hpp>


using std::cout;
using std::endl;
using namespace sycl;

#define N 10


int main(){

        int i, j, k;

        // Declare queue object here:
        queue Q(cpu_selector_v);

        int A[N][N], B[N][N], C[N][N];

        // int *A = (int*)malloc_shared(sizeof(int)*N*N, Q);
        // int *B = (int*)malloc_shared(sizeof(int)*N*N, Q);
        // int *C = (int*)malloc_shared(sizeof(int)*N*N, Q);

        // Initialize the matrices:
        for(i=0; i<N; i++){
                for(j=0; j<N; j++){
                        A[i][j] = (j%10)+1;
                }
        }

        for(i=0; i<N; i++){
                for(j=0; j<N; j++){
                        B[i][j] = (j%10)+1;
                }
        }

        // Create SYCL buffers for matrix A, B and C inside a scope, so that
        // C is updated on the host when the buffers are destroyed:
        {
                buffer<int, 2> buff_A((int*)A, range<2>(N,N));
                buffer<int, 2> buff_B((int*)B, range<2>(N,N));
                buffer<int, 2> buff_C((int*)C, range<2>(N,N));
                Q.submit([&](handler &cgh){
                        // Create SYCL accessors for buffers buff_A, buff_B, buff_C here:
                        auto accessorA = buff_A.get_access<access::mode::read>(cgh);
                        auto accessorB = buff_B.get_access<access::mode::read>(cgh);
                        auto result = buff_C.get_access<access::mode::write>(cgh);
                        cgh.parallel_for<class MatrixMultiply>(range<2>(N,N), [=](item<2> item){
                                // Row and column handled by this work-item:
                                int i = item.get_id(0);
                                int j = item.get_id(1);
                                int k;
                                int sum = 0;
                                for(k=0; k<N; k++){
                                        sum += accessorA[i][k] * accessorB[k][j];
                                }
                                result[item] = sum;
                        });
                }).wait();
        }

        cout << "\n\nFirst matrix :"<<std::endl;
        for(i=0; i<N; i++){
                cout << "\t\t\t" << std::endl;
                for(j=0; j<N; j++)
                        cout << A[i][j] << " ";
        }
        cout << "\n\nSecond matrix :" << std::endl;
        for(i=0; i<N; i++){
                cout << "\t\t\t" << std::endl;
                for(j=0; j<N; j++)
                        cout << B[i][j] << " ";
        }
        cout << "\n\nResultant matrix :" << std::endl;
        for(i=0; i<N; i++){
                cout << "\t\t\t" << std::endl;
                for(j=0; j<N; j++)
                        cout << C[i][j] << " ";
        } cout << "\n\n";
        return 0;
}
```

- Compile the program as before, naming the executable `vecMUL`.
- In the Slurm script, change the executable name: `vecadd_USM` -> `vecMUL`
- submit the job by using:
```bash
    sbatch slurm.sh
```
- Once the job is successfully submitted, you will receive its job ID.
- check the status of job by using:
```bash
    squeue | grep insert_job_id_here
```
- view the output by using the following:
```bash
    cat output_JobID.out
```
- You will get the output as follows:
<pre>
==========================================
SLURM_CLUSTER_NAME = paramutkarsh
SLURM_ARRAY_JOB_ID = 
SLURM_ARRAY_TASK_ID = 
SLURM_ARRAY_TASK_COUNT = 
SLURM_ARRAY_TASK_MAX = 
SLURM_ARRAY_TASK_MIN = 
SLURM_JOB_ACCOUNT = nsmexternal
SLURM_JOB_ID = 158314
SLURM_JOB_NAME = sycl
SLURM_JOB_NODELIST = gpu006
SLURM_JOB_USER = c-huk48
SLURM_JOB_UID = 21915
SLURM_JOB_PARTITION = standard
SLURM_TASK_PID = 17139
SLURM_SUBMIT_DIR = /home/c-huk48/SYCL
SLURM_CPUS_ON_NODE = 10
SLURM_NTASKS = 
SLURM_TASK_PID = 17139
==========================================


First matrix :

1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 

Second matrix :

1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 
1 2 3 4 5 6 7 8 9 10 

Resultant matrix :

55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 
55 110 165 220 275 330 385 440 495 550 

</pre>
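
One way to check the device result is to compare it with a serial loop on the host. The sketch below is plain C++ (no SYCL needed) and stores the matrices in row-major order in a `std::vector`; the function names `matmul_reference` and `make_input` are made up for illustration.

```cpp
#include <vector>

// Serial reference: multiplies two n x n matrices stored in row-major order.
std::vector<int> matmul_reference(const std::vector<int> &A,
                                  const std::vector<int> &B, int n) {
    std::vector<int> C(n * n, 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int sum = 0;
            for (int k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}

// Builds the same input as the example above: element (i, j) is (j % 10) + 1.
std::vector<int> make_input(int n) {
    std::vector<int> M(n * n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            M[i * n + j] = (j % 10) + 1;
        }
    }
    return M;
}
```

For n = 10, element (i, j) of the product is the sum over k of (k+1)(j+1), which is 55(j+1). This matches the rows `55 110 ... 550` in the job output above.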

- For more SYCL coding examples, go to: www.github.com/codeplaysoftware/syclacademy
- To look up specific SYCL functions:
    - Go to www.khronos.org/sycl
    - Click on specifications.
    - Click on the reference guide PDF to see the reference card.
    - It lists the functions available in SYCL.
- These are the two sites to visit to learn about SYCL in depth:
    - https://github.com/codeplaysoftware/syclacademy/tree/main/Lesson_Materials
    - https://www.khronos.org/files/sycl/sycl-2020-reference-guide.pdf