

#### **Contents**

- Motivation, target, analysis
- Assembling our own toolchain
- Toolchain usecase: axpy example
- Development schedule

### 1. Motivation, target, analysis

## Why generation?

The need of huge numerical models porting onto GPUs:

 All individual model blocks have too small self perf impact (~10%), resulting into small speedups, if only one block is ported



## Why generation?

The need of huge numerical models porting onto GPUs:

- A lot of code requiring lots of similar transformations
- A lot of code versions with minor differences, each requiring manual testing & support
- COSMO, Meteo-France: science teams are not ready to work with new paradigms (moreover, tied with propriety products), compute teams have no resources to support a lot of new code

## Why generation?

So, in fact science groups are ready to start GPUbased modeling, if three main requirements are met:

- Model works on GPUs without specific extensions
- Model works on GPUs and gives accurate enough results in comparison with control host version
- Model works on GPUs faster

### **Our target**

Port already parallel models in Fortran onto GPUs:

- Conserving original Fortran source code (i.e. keeping all C/CUDA/OpenCL in intermediate files)
- Minimizing manual work on specific code (i.e. developed toolchain is expected to be reusable with other codes)

"Already parallel" means the model gives us some data decomposition grid to map 1 GPU onto 1 MPI process or thread.

### Similar tools: PGI CUDA Fortran

#### Not really similar:

- Same manual coding as with CUDA C we want to minimize
- PGI's own Fortran language extensions

#### Similar tools: PGI Accelerator

Very similar, but:

- Still needs to set manual annotations on loops
- Is a propriety "black box" with limited info about implemented features

#### Similar tools: PathScale HMPP

#### Almost same as PGI Accelerator:

- Also has CAPS for automatic capable loops lookup
- Introduces some inapplicable constraints on accelerated loops, for instance, being a pure function (not really a problem with partial linktime kernels compilation)
- Tries to standardize OpenHMPP
- HMPP is a "black box", but PathScale host compiler recently became open-source

### Similar tools: f2c-acc

Actually, a work-in-progress equivalent for PGI Accelerator

Still a lot of manual assistance needed

### Conclusion

- Clearly, it would be a right decision to use f2c-acc or PGI or HMPP, if they could support GPU kernels generation without directives/annotations
- While they don't, adopting models is a complicated long-term task
- Can we build our own toolchain with dependencies analysis instead of directives?

### 2. Assembling our own toolchain

## Ingredients

- Compiler split original code into host and device parts and compile them into single object
  - Code splitter (source-to-source preprocessor)
  - Target device code generator
- Runtime library implementation of specific internal functions used in generated code
  - Data management
  - → Kernel invocation
  - Kernel results verification



We start with original source code, selecting loops suitable for device acceleration.



Equivalent device code is generated for suitable loops.

(see "Code generation workflow" for details)



Equivalent device code is generated for suitable loops.

Additionally global constructors are generated to initialize configuration structures (with status, profiling, permanent dependencies, etc.) for each kernel.



























## Code generation workflow

Two parts of code generation process:

- Compile time generate kernels strictly corresponding to original host loops
- Runtime generate kernels, using additional info: inline external functions, optimize compute grid, etc.



Loops suitable for device execution are identified in original source code, their bodies are surrounded with if-statement to switch between original loop and call to device kernel for this loop. Each suitable loop is duplicated in form of subroutine in a separate compilation unit. Additionally, helper initialization anchors are generated.



Objects for host code and device code helper functions can be generated directly with CPU compiler used by application.

Device code is compiled into Low-Level Virtual Machine Intermediate representation (LLVM IR).



Code from LLVM IR is translated into C, CUDA or OpenCL using modified LLVM C Backend and compiled using the corresponding device compiler.



Finally, objects for all parts of the code are merged into single object to conserve "1 source → 1 object" layout. LLVM IR is also embedded into resulting object.

## Code generation workflow (runtime part)



Without runtime optimizations enabled, the previously compiled kernel binary could be built and executed.

# Code generation workflow (runtime part)



#### 3. Toolchain usecase

# 8 basic steps in total



#### **Example: axpy**

Consider toolchain steps in detail for the following simple test program:

```
subroutine axpy(n, a, x, y)
implicit none
integer, intent(in) :: n
real, intent(in) :: a, x(n)
real, intent(inout) :: y(n)
integer :: i
do i = 1, n
 y(i) = y(i) + a * x(i)
enddo
print *, 'Value of i after cycle = ', i
end subroutine axpy
```

kgen-gforscale -Wk,--gforscale-scene-path=/opt/kgen/transforms/split/-Wk,--gforscale-mode=tree axpy.f90

```
subroutine axpy loop 1 gforscale(n, y, a, x)
implicit none
interface
subroutine axpy loop 1 gforscale blockidx x(index, start, end)
bind(C)
use iso c binding
integer(c int) :: index
integer(c int), value :: start, end
end subroutine
subroutine axpy loop 1 gforscale blockidx y(index, start, end)
bind(C)
use iso c binding
integer(c int) :: index
integer(c int), value :: start, end
end subroutine
end interface
integer :: i
integer, intent(in) :: n
real, intent(inout) :: y(n)
real, intent(in) :: a
real, intent(in) :: x(n)
#ifdef CUDA DEVICE FUNC
call axpy loop 1 gforscale blockidx x(i, 1, n)
#else
do i = 1. n
#endif
  y(i) = y(i) + a * x(i)
#ifndef CUDA DEVICE FUNC
enddo
#endif
end subroutine axpy loop 1 gforscale
```

```
subroutine axpy(n, a, x, y)
USE GEORSCALE
implicit none
integer, intent(in) :: n
real, intent(in) :: a, x(n)
real, intent(inout) :: y(n)
integer :: i
!$GFORSCALE SELECT axpy loop 1 gforscale
if (gforscale select(0, 1, 'axpy' // char(0))) then
!$GFORSCALE CALL axpy_loop_1_gforscale
#ifdef CUDA DEVICE FUNC
  call gforscale launch('axpy loop 1 gforscale ' // char(0), &
   1, n, 0, 0, 4, n, sizeof(n), y, sizeof(y), a, sizeof(a), x, sizeof(x))
i = n + 1
#else
  call axpy loop 1 gforscale(n, y, a, x)
#endif
!$GFORSCALE END CALL axpy loop 1 gforscale
!$GFORSCALE LOOP axpy loop 1 gforscale
do i = 1. n
 y(i) = y(i) + a * x(i)
!$GFORSCALE END LOOP axpy loop 1 gforscale
endif
!$GFORSCALE END SELECT axpy loop 1 gforscale
print *. 'Value of i after cycle = '. i
end subroutine axpy
```

dragonegg-gfortran -c axpy.axpy\_loop\_1\_gforscale.F90
-D\_\_CUDA\_DEVICE\_FUNC\_\_ -ffree-line-length-none
-fplugin=/opt/llvm/dragonegg/lib64/dragonegg.so -O0 -S
-fplugin-arg-dragonegg-emit-ir -o axpy.axpy loop 1 gforscale.F90.bc

```
%10 = mul i64 %7, 4
: ModuleID = 'axpv.axpv loop 1 aforscale.F90'
                                                                                     %11 = load i32* %0, align 4
%12 = sext i32 %11 to i64
i64:64:64-f32:32:32-f64:64:64-v64:64-v128:128:128-a0:0:64-s0:64:64-
                                                                                     %13 = icmp sqe i64 %12.0
f80:128:128-f128:128:128-n8:16:32:64"
                                                                                     %14 = select i1 %13, i64 %12, i64 0
target triple = "x86 64-unknown-linux-gnu"
                                                                                     %15 = add nsw i64 %14, -1
                                                                                     %16 = mul i64 %14, 32
module asm "\09.ident\09\22GCC: (GNU) 4.5.4 20110527 (prerelease) LLVM:
                                                                                     %17 = mul i64 %14, 4
131968\22"
                                                                                     %18 = load i32* %0, align 4
                                                                                     call void (i32*, i32, i32, ...)* @axpy loop 1 gforscale blockidx x(i32*
%"integer(kind=4)" = type i32
                                                                                   noalias %memtmp, i32 1, i32 %18) nounwind
%"real(kind=4)" = type float
                                                                                     %19 = load i32* %memtmp, align 4
                                                                                     %20 = sext i32 %19 to i64
define void @axpy loop 1 gforscale (i32* %n, [0 x float]* %y, float* %a, [0
                                                                                     %21 = add nsw i64 %20, -1
x floatl* %x) nounwind {
                                                                                     %22 = load i32* %memtmp, align 4
entry:
                                                                                     %23 = sext i32 %22 to i64
  %n addr = alloca i32*, align 8
                                                                                     %24 = add nsw i64 %23, -1
  %y addr = alloca [0 x float]*, align 8
                                                                                     %25 = bitcast [0 x float]* %1 to float*
  %a addr = alloca float*, align 8
                                                                                     %26 = getelementptr float* %25, i64 %24
  %x addr = alloca [0 x float]*, align 8
                                                                                     %27 = load float* %26. align 4
  %memtmp = alloca i32
                                                                                     %28 = load float* %2. align 4
  %"alloca point" = bitcast i32 0 to i32
                                                                                     %29 = load i32* %memtmp, align 4
  store i32* %n, i32** %n addr
                                                                                     %30 = sext i32 %29 to i64
  store [0 x float]* %v, [0 x float]** %v addr
                                                                                     %31 = add nsw i64 %30, -1
  store float* %a, float** %a addr
                                                                                     %32 = bitcast [0 x float]* %3 to float*
  store [0 x float]* %x, [0 x float]** %x addr
                                                                                     %33 = getelementptr float* %32, i64 %31
  %0 = load i32** %n addr, align 64
                                                                                     %34 = load float* %33. align 4
  %1 = load [0 \times float]** %y addr, align 64
                                                                                     %35 = fmul float %28, %34
  %2 = load float** %a addr, align 64
                                                                                     %36 = fadd float %27, %35
  %3 = load [0 \times float]** %x addr. align 64
                                                                                     %37 = bitcast [0 x float]* %1 to float*
  %"ssa point" = bitcast i32 0 to i32
                                                                                     %38 = getelementptr float* %37, i64 %21
  br label %"2"
                                                                                     store float %36, float* %38, align 4
                                                                                     br label %return
                                                 ; preds = %entry
  %4 = load i32* %0. align 4
                                                                                    return:
                                                                                                                                    : preds = %"2"
  %5 = sext i32 %4 to i64
                                                                                     ret void
  %6 = icmp sae i64 %5.0
  %7 = select i1 %6, i64 %5, i64 0
  %8 = add nsw i64 \%7. -1
                                                                                   declare void @axpy loop 1 gforscale blockidx x(i32* noalias, i32, i32, ...)
  %9 = mul i64 %7, 32
```

/opt/llvm/bin/opt -std-compile-opts axpy.axpy\_loop\_1\_gforscale.F90.bc -S -o axpy.axpy loop 1 gforscale.F90.bc.opt

```
; ModuleID = 'axpy.axpy loop 1 gforscale.F90.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-
s0:64:64-f80:128:128-f128:128:128-n8:16:32:64"
target triple = "x86 64-unknown-linux-gnu"
module asm "\09.ident\09\22GCC: (GNU) 4.5.4 20110527 (prerelease) LLVM: 131968\22"
define void @axpy loop 1 gforscale (i32* nocapture %n, [0 x float]* nocapture %y, float* %a, [0 x float]* %x) nounwind {
  %memtmp = alloca i32, align 4
  %0 = load i32* %n, align 4
  call void (i32*, i32, i32, ...)* @axpy_loop_1_gforscale_blockidx_x(i32* noalias %memtmp, i32 1, i32 %0) nounwind
  %1 = load i32* %memtmp, align 4
  %2 = sext i32 %1 to i64
  %3 = add nsw i64 %2, -1
  %4 = getelementptr [0 x float]* %y, i64 0, i64 %3
  %5 = load float* %4, align 4
  %6 = load float* %a, align 4
  %7 = getelementptr [0 x float]* %x, i64 0, i64 %3
  %8 = load float* %7, align 4
  %9 = fmul float %6, %8
  %10 = fadd float %5, %9
  store float %10, float* %4, align 4
  ret void
declare void @axpy loop 1 gforscale blockidx x(i32* noalias, i32, i32, ...)
```

/opt/llvm/bin/llc -march=c axpy.axpy\_loop\_1\_gforscale.F90.bc.opt -o axpy.axpy\_loop\_1\_gforscale.F90.bc.cu

```
asm("\t.ident\t\"GCC: (GNU) 4.5.4 20110527 (prerelease) LLVM: 131968\"\n"
"");
#ifdef CUDA DEVICE FUNC
device
#endif
void axpy loop 1 gforscale (unsigned int *llvm cbe n, struct l unnamed0 (*llvm cbe y), float *llvm cbe a, struct l unnamed0 (*llvm cbe x));
#ifdef CUDA DEVICE FUNC
device
#endif
void axpy loop 1 gforscale blockidx x(unsigned int *, unsigned int , unsigned int );
void axpy loop 1 gforscale (unsigned int *llvm cbe n, struct l unnamed0 (*llvm cbe y), float *llvm cbe a, struct l unnamed0 (*llvm cbe x)) {
  unsigned int llvm cbe memtmp; /* Address-exposed local */
  unsigned int llvm cbe tmp 1;
  unsigned int llvm cbe tmp 2;
  unsigned long long llvm cbe tmp 3;
  float *llvm cbe tmp 4;
  float llvm cbe tmp 5:
  float llvm cbe tmp 6;
  float llvm cbe tmp 7;
  llvm cbe tmp 1 = *llvm cbe n;
  axpy loop 1 \overline{g}forscale \overline{b}lockidx x((&llvm cbe memtmp), \frac{1}{u}, llvm cbe tmp 1);
  llvm cbe tmp 2 = *(\&llvm cbe memtmp);
  llym cbe tmp 3 = ((unsigned long long )(((unsigned long long )(((signed long long )(signed int )llym cbe tmp 2))) + ((unsigned long long long )
18446744073709551615ull))):
 llvm cbe tmp 4 = (\&(*ilvm cbe y).array[((signed long long )llvm cbe tmp 3)]);
 llvm cbe tmp 5 = *llvm cbe tmp 4;
  llvm cbe tmp 6 = *llvm cbe a;
 llvm cbe tmp 7 = *((\&(*llvm cbe x).array[((signed long long )llvm cbe tmp 3)]));
  *llvm cbe tmp 4 = (((float )(llvm cbe tmp 5 + (((float )(llvm cbe tmp 6 * llvm cbe tmp 7))))));
  return:
```

```
asm("\t.ident\t\"GCC: (GNU) 4.5.4 20110527 (prerelease) LLVM: 131968 \"\n"
"");
#ifdef __CUDA_DEVICE_FUNC
extern "C" __global _
#endif
void axpy loop 1 gforscale (unsigned int *llvm cbe n, struct l unnamed0 (*llvm cbe y), float *llvm cbe a, struct l unnamed0 (*llvm cbe x));
#ifdef CUDA DEVICE FUNC
 device
#endif
void axpy loop 1 gforscale blockidx x(unsigned int* index, unsigned int start, unsigned int end) { *index = blockIdx.x + start; }
void axpy loop 1 gforscale (unsigned int *llvm cbe n, struct l unnamed0 (*llvm cbe y), float *llvm cbe a, struct l unnamed0 (*llvm cbe x)) {
  unsigned int llvm cbe memtmp; /* Address-exposed local */
  unsigned int llvm cbe tmp 1;
  unsigned int llvm cbe tmp 2;
  unsigned long long llvm cbe tmp 3;
  float *llvm cbe tmp 4;
  float llvm cbe tmp 5;
  float llvm cbe tmp 6;
  float llvm_cbe_tmp_7;
  llvm cbe tmp 1 = *llvm cbe n;
  axpy loop 1 gforscale blockidx x((&llvm cbe memtmp), lu, llvm cbe tmp 1);
  llvm cbe tmp 2 = *(\&llvm cbe memtmp);
  llvm cbe tmp 3 = ((unsigned long long )(((unsigned long long )(((signed long long )(signed int )llvm cbe tmp 2))) + ((unsigned long long )
18446744073709551615ull));
  llvm cbe tmp 4 = (\&(*llvm cbe y).array[(( signed long long )llvm cbe tmp 3)]);
  llvm cbe tmp 5 = *llvm cbe tmp 4:
  llvm cbe tmp 6 = *llvm cbe a;
  llvm cbe tmp 7 = *((&(*llvm cbe x).array[(( signed long long )llvm cbe tmp 3)]));
  *llvm cbe tmp 4 = (((float)(llvm cbe tmp 5 + (((float)(llvm cbe tmp 6 * llvm cbe tmp 7)))))):
  return;
}
```

#### Steps 6-8

Final compilation of host and device parts, assembling into single object file:

```
# 6
nvcc -g -c axpy.axpy_loop_1_gforscale.F90.bc.cu -D__CUDA_DEVICE_FUNC__ -G -o
axpy.axpy_loop_1_gforscale.F90.o

# 7
gfortran -g -c axpy.host.F90 -D__CUDA_DEVICE_FUNC__ -ffree-line-length-none
-l/opt/kgen/include -o axpy.host.F90.o

# 8
/usr/bin/ld --unresolved-symbols=ignore-all -r -o axpy.o_kgen axpy.host.F90.o
axpy.axpy_loop_1_gforscale.F90.o
```

#### **Testing axpy**

#### # By default – execute on CPU

```
[marcusmae@T61p axpy]$ ./axpy
Usage: ./axpy <n> <eps>
[marcusmae@T61p axpy]$ ./axpy 10 0.001
  Value of i after cycle = 11
Max abs diff = 0.000000
```

# Use GPU kernel if the corresponding environment variable set to 1 # (extra debug output showing how arguments are mapped into device memory)

```
[marcusmae@T61p axpy]$ axpy_1=1 ./axpy 1000 0.001 arg 0 maps memory segment [140735344500736 .. 140735344508928] to [1052672 .. 1060864] arg 1 maps memory segment [28172288 .. 28184576] to [1060864 .. 1073152] arg 2 reuses mapping created by arg 0 arg 3 reuses mapping created by arg 1 Value of i after cycle = 1001 Max abs diff = 0.000000
```



# **COSMO - coverage**



The generated kernel behaviour

# **COSMO - performance**



The generated kernel name

# WRF - coverage



The generated kernel behaviour

# WRF - performance



The generated kernel name

#### 6. Development schedule

#### Stage 1 (April - June)

- Put together all necessary toolchain parts, write the main script
- Test C code generation, file bugs to Ilvm, patch C backend for CUDA support
- Complete existing host-device code split transform (previously started in 2009 for CellBE)
- Implement kernel invocation runtime
- Implement kernel self-checking runtime
- Compile COSMO with toolchain and present charts showing the percentage of successfully generated kernels with checked correct results

# Stage 2 (July - October)

- Improve support/coverage
- Improve efficiency
- Compare performance with other generation tools
- Present the work and <u>carefully listen to feedback</u>

# Stage 2 (July - October)

- Improve support/coverage
  - More testing on COSMO and other models, file bugs (+2 RHM fellows)
  - Fix the most hot bugs in host-device code split transform
  - Use Polly/Pluto for more accurate capable loops recognition
  - Support link-time generation for kernels with external dependencies
- Improve efficiency
  - Use shared memory in stencils (+1 contractor)
  - Implement both zero-copy and active data synchronization modes
  - Kernel invocation configs caching
  - [variant] Consider putting serial code into single GPU thread as well, to have the whole model instance running on GPU
  - [variant] Consider selective/prioritized data synchronization support, using data dependencies lookup
  - > [variant, suggested by S.K.] CPU ↔ GPU work sharing inside MPI process
- Compare performance with other generation tools
- Present the work and <u>carefully listen to feedback</u>

#### 5. Team & resources

#### **Team**



Artem Petrov

Dr Yulia Martynova

(testing, coordination)

(WRF testing)

#### **Team**



**Artem Petrov** 

Dr Yulia Martynova

(testing, coordination)

(WRF testing)

Alexander Myltsev

**Dmitry Mikushin** 

(development, testing)

(development, planning)

#### **Team**



Artem Petrov

Dr Yulia Martynova

(testing, coordination)

(WRF testing)

Alexander Myltsev

**Dmitry Mikushin** 

(development, testing)

(development, planning)

Support from communities:



Polly



**LLVM** 

Polly/LLVM

gcc/gfortran

#### **Toolchain installation**

/opt/kgen/ bin/ q95xml-refids Source code XML-markup g95xml-tree Script performing steps 1-8 kgen kgen-gforscale Source-to-source translator kgen-gfortran Frontend ("compiler") include/ gforscale.h gforscale.mod Runtime-modules gforscale.dragonegg.mod lib/ Runtime-library libgforscale.so transforms/ Tree of code transformation split/ rules performing host/device code split