Before we begin, let's run the cell below to display information about the CUDA driver and the GPUs on the server using the `nvidia-smi` command. To execute a cell, give it focus (click on it with your mouse) and press Ctrl-Enter, or press the play button in the toolbar above. If all goes well, you should see some output below the grey cell.

In [None]:
!nvidia-smi

## Learning objectives
The **goal** of this lab is to:

- Give a brief introduction to the Kokkos framework and its usage
- Understand the steps for making a sequential code parallel using the Kokkos framework

We do not intend to cover:
- Detailed optimization techniques and all the components of the Kokkos ecosystem


# Kokkos
Kokkos is a productive, portable, performant, shared-memory programming model.
- It is a C++ library, not a new language or language extension
- Lets you write algorithms once and run them on many architectures, e.g. multi-core CPU, NVIDIA GPU, Xeon Phi, etc.
- Solves the data layout problem by using multi-dimensional arrays with architecture-dependent layouts to achieve performance portability


## Kokkos Abstractions

Kokkos is built on the principle of separation of concerns. It decouples two concepts, which helps it achieve performance portability. The first is an abstract machine model, which describes the fundamental concepts required to develop future portable and performant high-performance computing applications. The second is a concrete instantiation of this programming model written in C++, which allows programmers to write to the abstract machine model. It is important to decouple these two entities because the underlying model used by Kokkos could, in the future, be instantiated in additional languages beyond C++. The diagram below shows the abstractions Kokkos provides to achieve this.

<img src="../images/kokkos_ecosystem.png">

For more information, please check out this [Kokkos Tutorial](http://on-demand.gputechconf.com/gtc/2017/presentation/s7344-christian-trott-Kokkos.pdf).

The Kokkos programming model is characterized by 6 core abstractions: Execution Spaces, Execution Patterns, Execution Policies, Memory Spaces, Memory Layout and Memory Traits. These abstraction concepts allow the formulation of generic algorithms and data structures which can then be mapped to different types of architectures.

This tutorial gives a brief introduction to the Kokkos abstractions needed to get started and run our code on a GPU using the Kokkos library. We do not cover Memory Layout and Memory Traits, which may be essential for getting the best performance out of the code. Participants are encouraged to go through the Links and Resources section below and optimize the code further.


## Concepts of threaded data parallelism

In this section we introduce the core Kokkos abstractions and concepts that we will use to port the **Pair Calculation algorithm**.

**Pattern**: In Kokkos, *Execution Patterns* are the fundamental parallel algorithms in terms of which an application has to be expressed. To list a few:
- parallel_for: Dispatches a parallel "for loop" with independent iterations. We will be using this pattern during our porting exercise.
- parallel_reduce: Combines a parallel_for execution with a reduction operation.
- parallel_scan: Combines a parallel_for operation with a prefix or postfix scan on the output values of each operation.
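To make the reduce and scan patterns concrete, here is a minimal sequential reference in standard C++ (not Kokkos code). The function names `reduce_sum`, `inclusive_scan_ref` and `exclusive_scan_ref` are hypothetical, and reading "prefix" and "postfix" as exclusive and inclusive scans is an assumption made for this sketch. The parallel patterns are expected to produce the same values, only computed concurrently.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Reduction: many inputs are combined into a single value (here, a sum).
double reduce_sum(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Inclusive scan: out[i] = v[0] + ... + v[i].
std::vector<double> inclusive_scan_ref(const std::vector<double>& v) {
    std::vector<double> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// Exclusive scan: out[i] = v[0] + ... + v[i-1], with out[0] = 0.
std::vector<double> exclusive_scan_ref(const std::vector<double>& v) {
    std::vector<double> out(v.size());
    double running = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = running;   // sum of everything before index i
        running += v[i];
    }
    return out;
}
```

For the input `{1, 2, 3, 4}` the reduction gives `10`, the inclusive scan gives `{1, 3, 6, 10}` and the exclusive scan gives `{0, 1, 3, 6}`.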


**Execution Policy:** An execution policy defines how computations are executed (static scheduling, dynamic scheduling, etc.). The simplest execution policies are _Range Policies_, which execute an operation once for each element in a range. We will be using this policy in our code. _Team Policies_, on the other hand, are used to implement hierarchical parallelism. The CUDA programming model is hierarchical in nature; in fact, it inspired Kokkos' thread team model.

**Computational Body:** Consists of code which performs each unit of work (e.g. the loop body)

Pattern and policy together drive the computational body.
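As a small illustration, the three concepts can be pointed out in an ordinary sequential loop. The sketch below is plain C++ (no Kokkos) for the daxpy operation `y = a*x + y`; the function name `daxpy_serial` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential daxpy: y = a*x + y.
void daxpy_serial(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    // Pattern: a "for" loop. Policy: the index range [0, n).
    for (std::size_t i = 0; i < n; ++i) {
        // Computational body: one unit of work, identified by index i.
        y[i] = a * x[i] + y[i];
    }
}
```

Each iteration touches only `x[i]` and `y[i]`, so the iterations are independent of one another, which is what makes this loop a candidate for the parallel_for pattern.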

The diagram below shows these three concepts in a sequential code. In our code we will choose one of the available patterns and execution policies to run in parallel on the GPU.

<img src="../images/kokkos_abstraction.png">


## Core Capabilities

Kokkos supports multiple patterns and execution policies. You will need to be familiar with the capabilities listed below in order to move a serial code to a GPU.

### Parallel Loops:
The parallel loop pattern maps work to compute cores, where:
- Each iteration of a computational body is a unit of work.
- An iteration index identifies a particular unit of work.
- An iteration range identifies a total amount of work.

We will be using Kokkos::parallel_for to map the work to cores: ```parallel_for ( ... );```

```parallel_for``` is the most common parallel dispatch operation. It corresponds to the OpenMP construct ```#pragma omp parallel for```. It splits the index range over the available hardware resources and executes the loop body in parallel. Each iteration is executed independently. Kokkos promises nothing about the loop order or the amount of work that actually runs concurrently. In particular, you must not assume that all loop iterations are active at the same time.
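To give a feel for what "splitting the index range" can mean, here is a minimal sketch in standard C++ of one possible static partition. It is an illustration only and not a description of how Kokkos schedules work internally; the helper name `chunk_bounds` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Returns the half-open index range [first, second) handled by worker w when
// n iterations are split as evenly as possible over `workers` workers.
std::pair<std::size_t, std::size_t> chunk_bounds(std::size_t n,
                                                 std::size_t workers,
                                                 std::size_t w) {
    assert(workers > 0 && w < workers);
    const std::size_t base = n / workers;    // iterations every worker gets
    const std::size_t extra = n % workers;   // the first `extra` workers get one more
    const std::size_t begin = w * base + (w < extra ? w : extra);
    const std::size_t end = begin + base + (w < extra ? 1 : 0);
    return {begin, end};
}
```

With 10 iterations and 4 workers this yields the ranges `[0, 3)`, `[3, 6)`, `[6, 8)` and `[8, 10)`. Because the loop body may not rely on any particular assignment or ordering, a runtime is free to pick a different split.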

The two key ways of writing computational bodies are: 

**Functor Based** : The functor is a common pattern in C++, and the sample code below demonstrates its use. If you are new to functors, don't worry: a functor is simply a struct (or class) with a member function that overloads the () operator.

```cpp
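struct ParallelFunctor {
    // ... member data used by the computational body ...

    // The macro (from Kokkos_Core.hpp) marks the operator as callable on the device
    KOKKOS_INLINE_FUNCTION
    void operator()(const size_t i) const {  // i: index of the assigned unit of work
        /* ... computational body ... */
    }
};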
```
A sample functor for the daxpy (```y = a*x + y```) operation is shown below:

```cpp
// Define a functor with member variables
struct Functor {
    double *_x, *_y, _a;
    Functor(double* x, double* y, double a) :
        _x(x), _y(y), _a(a) {}
    // Kokkos requires operator() to be const; the macro marks it callable on the device
    KOKKOS_INLINE_FUNCTION
    void operator()(const size_t i) const {
        _y[i] = _a * _x[i] + _y[i];
    }
};

// Create a functor object and pass it to parallel_for
Functor functor(x, y, a);
Kokkos::parallel_for(vector_size, functor);

```

**C++11 Lambda** : Lambdas were introduced in C++11 and are a very concise way of writing code. Essentially, lambdas are compiler-generated functors.

Sample usage of a lambda for our daxpy example is shown below:

```cpp 
double* x = new double[N];
double* y = new double[N];
// ... populate x and y, and set the scalar a ...
Kokkos::parallel_for(N, [=](const size_t i) {
    y[i] = a * x[i] + y[i];
});

```

Kokkos lets users choose whether to use a functor or a lambda. Lambdas are convenient for short loop bodies. For a much more complicated loop body, you might find it easier for testing to separate it out and name it as a functor.
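The equivalence of the two styles can be tried out without a Kokkos installation. The sketch below is standard C++ and uses a hypothetical helper `serial_for` as a sequential stand-in for the dispatch call; it is not the Kokkos API, it only mimics the calling convention of "index count plus callable body".

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sequential stand-in for a parallel dispatch:
// calls body(i) once for every i in [0, n).
template <class Body>
void serial_for(std::size_t n, const Body& body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

// Functor style: the data the body needs is stored in named members.
struct DaxpyFunctor {
    const double* x;
    double* y;
    double a;
    void operator()(std::size_t i) const { y[i] = a * x[i] + y[i]; }
};

void daxpy_with_functor(double a, const std::vector<double>& x, std::vector<double>& y) {
    serial_for(x.size(), DaxpyFunctor{x.data(), y.data(), a});
}

// Lambda style: the compiler generates the equivalent of DaxpyFunctor,
// with the captured variables taking the place of the members.
void daxpy_with_lambda(double a, const std::vector<double>& x, std::vector<double>& y) {
    const double* xp = x.data();
    double* yp = y.data();
    serial_for(x.size(), [=](std::size_t i) { yp[i] = a * xp[i] + yp[i]; });
}
```

Both functions produce the same result for the same input. The functor version has a name and can be constructed and called directly in a unit test, which is the testing advantage mentioned above.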

### Execution Space

An execution space defines where the parallel code will run. Types of execution spaces include Serial, Threads, OpenMP, CUDA and ROCm. An execution space can be specified, either at compile time or at run time, as part of the policy. If none is specified, the code runs in the default execution space chosen when the Kokkos Core library was compiled.


### Memory Space

Memory spaces are the places where data resides. They specify the physical location of data as well as certain access characteristics. Different physical locations correspond to things such as high-bandwidth memory, on-die scratch memory or non-volatile bulk storage. Different logical memory spaces allow for concepts such as UVM memory in the CUDA programming model, which is accessible from both host and GPU.

In the code samples above, both arrays `x` and `y` reside in CPU memory.

We need a way of storing data (multidimensional arrays) which can be communicated to an accelerator (GPU). This is done via _views_.

**Views:** A View is a lightweight C++ class with a pointer to array data and some metadata.
- Put simply, Views are *like pointers*, so they need to be copied into the functor.
- Views are multidimensional, and the number of dimensions is fixed at compile time.

For the daxpy code, the Views are created as follows:
```cpp
View < double *, ... > x (...) , y (...);
    //... populate x , y ...
    
parallel_for (N , [=] ( const size_t i) {
    // Views x and y are captured by value ( copy )
    y(i ) = a * x(i) + y(i );
});
```
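The "like pointers" behaviour can be mimicked in standard C++. The sketch below defines a hypothetical `SimpleView` class (it is not `Kokkos::View` and leaves out layouts, memory spaces and multiple dimensions) just to show why capturing a view by value in a lambda still writes to the original data: the copy is shallow and shares the same allocation.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical view-like handle: a shared pointer to the data plus its extent.
class SimpleView {
public:
    explicit SimpleView(std::size_t n)
        : data_(std::make_shared<std::vector<double>>(n, 0.0)), extent_(n) {}

    // const: indexing does not change the handle, only the data behind it.
    double& operator()(std::size_t i) const { return (*data_)[i]; }
    std::size_t extent() const { return extent_; }

private:
    std::shared_ptr<std::vector<double>> data_;  // copies share this allocation
    std::size_t extent_;
};

// Writes `value` into every element through a by-value copy of the handle.
void fill_through_copy(SimpleView v, double value) {
    auto body = [=](std::size_t i) { v(i) = value; };  // v is captured by value
    for (std::size_t i = 0; i < v.extent(); ++i) body(i);
}
```

After `SimpleView a(3); fill_through_copy(a, 2.5);` every element of `a` holds `2.5`, even though the function and the lambda each worked on their own copy of the handle.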

### Data Transfer

Every view stores its data in a memory space set at compile time.

```cpp 
View<double***,MemorySpace> data(...); 
```

If none is specified, the view uses the default memory space of the default execution space. Since views are similar to pointers, we need to perform deep copies explicitly (unless we are making use of UVM: [Unified Virtual Memory Space](../GPU_Architecture_Terminologies.ipynb) supported by CUDA).

In this example we use explicit copies. The example below shows a *view* that resides in the memory space of the default execution space, together with a mirror of it that resides in host memory. We can then copy data back and forth between the two views using the *deep_copy* API.

```cpp
// Define a View in the memory space of the default execution space.
typedef Kokkos::View<double**> ViewType;
ViewType view(...);

// Create a host mirror of the View
ViewType::HostMirror host_view = Kokkos::create_mirror_view(view);

// Copy from host to device
Kokkos::deep_copy(view, host_view);
...

// Copy from device to host
Kokkos::deep_copy(host_view, view);
```

<img src="../images/kokkos_mirror_view.png">

ref: [Kokkos Tutorial](http://on-demand.gputechconf.com/gtc/2017/presentation/s7344-christian-trott-Kokkos.pdf)

## Kokkos Initialization and Finalize

In order to use Kokkos, an initialization call is required. That call is responsible for acquiring hardware resources such as threads. Typically, this call should be placed right at the start of a program.

The simplest way to initialize Kokkos is by calling the following function:
```cpp
Kokkos::initialize(int& argc, char* argv[]); 
```

At the end of each program, Kokkos needs to be shut down in order to free resources; do this by calling 
```cpp
Kokkos::finalize();
```

## Atomic Construct

In the code you will also need one more construct to get the right results. Kokkos atomic operations ensure that a particular variable is accessed and/or updated atomically, preventing race conditions and indeterminate results. In other words, they prevent one thread from stepping on the toes of other threads that access the same variable simultaneously, which would otherwise give different results from run to run. For example, to count the number of elements that have a value greater than zero, we could write the following:


```cpp
if ( val > 0 )
{
    Kokkos::atomic_increment(&cnt);
}
```
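The same counting idea can be tried on the CPU with standard C++ threads. The sketch below uses `std::atomic` and `std::thread` as a stand-in for the Kokkos atomic call; the function name `count_positive` and the strided work split are assumptions made for this illustration. Depending on the toolchain, it may need to be compiled with `-pthread`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Counts the positive entries of vals using num_threads threads
// that all update one shared counter.
int count_positive(const std::vector<double>& vals, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<int> cnt{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) {
        pool.emplace_back([&vals, &cnt, t, num_threads]() {
            // Thread t handles indices t, t + num_threads, t + 2*num_threads, ...
            for (std::size_t i = t; i < vals.size(); i += num_threads) {
                if (vals[i] > 0) {
                    cnt.fetch_add(1);  // atomic read-modify-write on the shared counter
                }
            }
        });
    }
    for (auto& th : pool) th.join();  // wait for all threads before reading the result
    return cnt.load();
}
```

If `cnt` were a plain `int` incremented with `++cnt`, two threads could read the same old value and one increment would be lost, so the count could differ from run to run. The atomic update makes the result the same on every run.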



Now, let's start modifying the original code and add the Kokkos constructs. From the top menu, click on *File*, then *Open*, and open `rdf.cpp` and `dcdread.h` from the `C/source_code/kokkos` directory. Remember to **SAVE** your code after making changes and before running the cells below.

**Note**: Look for the *Todo* markers in your code and fill in the right execution pattern and copy directions.

### Compile and Run for NVIDIA GPU

Having added the Kokkos API calls, let us compile the code. We will be using the _nvcc_wrapper_ script, which ships with the Kokkos source code, for compilation. We link the code against a pre-compiled Kokkos library, libkokkoscore.a.

Also, to enable lambdas, we add two more compilation flags: ```--expt-extended-lambda``` and ```-std=c++11```.

In [None]:
#Compile the code for default execution space:: GPU
!cd ../../source_code/kokkos && /opt/kokkos/kokkos-master/build/install/bin/nvcc_wrapper -I/opt/kokkos/kokkos-master/build/install/include -L/opt/kokkos/kokkos-master/build/install/lib -lkokkoscore  --expt-extended-lambda -std=c++11 -lnvToolsExt  rdf.cpp

Make sure to validate the output by running the executable.

In [None]:
#Run code on default execution space
!cd ../../source_code/kokkos && ./rdf && cat Pair_entropy.dat

The output entropy value should be the following:

```
s2 value is -2.43191
s2bond value is -3.87014
```


In [None]:
#Profile and see the output of nvtx
!cd ../../source_code/kokkos && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_kokkos ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/kokkos/rdf_kokkos.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:

<img src="../images/.png">

# Kokkos Analysis

**Usage Scenarios**
- Kokkos was developed with 3 aspects at the core of its design: Performance, Portability and Productivity through abstraction.
    - It compiles and runs on multiple architectures
    - It obtains performant memory access patterns across architectures, thereby providing performance portability
    - It allows developers to utilize architecture-specific features where possible
- Kokkos has proven to provide good performance on various architectures. It is developed by Sandia National Laboratories and has active community support. LAMMPS, the most widely used MD package, has a branch which makes use of Kokkos.

**Limitations/Constraints**
1. Kokkos is primarily intended for development in C++11 and later.
2. Using Kokkos is invasive; for example, a significant part of the data structures needs to be taken over by Kokkos to get performance out of the code.

**How is Kokkos different from other directive based methods like OpenMP or OpenACC?**

- Kokkos uses C++ templates, rather than compiler pragmas, to generate parallel code for the GPU.



# Optional Exercise

## Run on a multicore execution space

Try using the multicore execution space and run the code on a multicore CPU.
You can refer to the [Kokkos Documentation](https://github.com/kokkos/kokkos/wiki/The-Kokkos-Programming-Guide) for more information.

**Understand and analyze** the code present at:

[RDF Code](../../source_code/kokkos/rdf.cpp)

[File Reader](../../source_code/kokkos/dcdread.h)

In [None]:
#Compile the code for multicore execution space
! cd ../../source_code/kokkos && /opt/kokkos/kokkos-master/build/install/bin/nvcc_wrapper -I/opt/kokkos/kokkos-master/build/install/include -L/opt/kokkos/kokkos-master/build/install/lib -lkokkoscore  --expt-extended-lambda -std=c++11 -lnvToolsExt  rdf.cpp

In [None]:
#Profile and see the output using nvtx
!cd ../../source_code/kokkos && nsys profile -t nvtx --stats=true --force-overwrite true -o rdf_kokkos_multicore ./rdf

Let's check out the profiler's report. [Download the profiler output](../../source_code/kokkos/rdf_kokkos_multicore.qdrep) and open it via the GUI. Have a look at the example expected profiler report below:

<img src="../images/.png">

Feel free to check out the [solution](../../source_code/kokkos/SOLUTION/rdf.cpp) to help you understand better or to compare your implementation with the sample solution.

## Post-Lab Summary

If you would like to download this lab for later viewing, it is recommended that you go to your browser's File menu (not the Jupyter notebook File menu) and save the complete web page. This will ensure the images are copied down as well. You can also execute the following cell block to create a zip file of the files you've been working on, and download it with the link below.

In [None]:
%%bash
cd ..
rm -f nways_files.zip
zip -r nways_files.zip *

**After** executing the above zip command, you should be able to download the zip file [here](../nways_files.zip). Let us now go back to parallelizing our code using other approaches.

**IMPORTANT**: Please click on **HOME** to go back to the main notebook for *N ways of GPU programming for MD* code.

-----

# <p style="text-align:center;border:3px; border-style:solid; border-color:#FF0000  ; padding: 1em"> <a href=../../../nways_MD_start.ipynb>HOME</a></p>

-----


# Links and Resources
[Kokkos Download](https://github.com/kokkos/kokkos)

[Kokkos Sample Codes](https://github.com/kokkos/kokkos-tutorials)

[Kokkos Tutorial](http://on-demand.gputechconf.com/gtc/2017/presentation/s7344-christian-trott-Kokkos.pdf)

[NVIDIA Nsight Systems](https://docs.nvidia.com/nsight-systems/)

**NOTE**: To view the Nsight Systems profiler output, please download the latest version of Nsight Systems from [here](https://developer.nvidia.com/nsight-systems).

Don't forget to check out additional [OpenACC Resources](https://www.openacc.org/resources) and join our [OpenACC Slack Channel](https://www.openacc.org/community#slack) to share your experience and get more help from the community.

--- 

## Licensing 

This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0).