# Part I - Devito Performance modes

This tutorial illustrates the impact of several Devito optimization modes on the generated code.

In Devito, an `Operator` has two preset optimization levels: `noop` and `advanced`. With `noop`, the compiler introduces no performance optimizations. With `advanced`, several flop-reducing and data-locality optimizations are applied. Examples of flop-reducing optimizations are common sub-expression elimination and factorization; examples of data-locality optimizations are loop fusion and cache blocking. SIMD vectorization is also applied, through compiler auto-vectorization.

To choose the performance optimization level, set the environment variable `DEVITO_OPT`. By default it is set to the maximum level, `advanced`.

e.g. `export DEVITO_OPT=advanced`

Alternatively, we can set the optimization level at runtime using:

`configuration['opt'] = 'advanced'`

or pass it directly to an `Operator` like:

```
eq = Eq(u.forward, u+2)
op = Operator(eq, opt='noop')
```
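
The environment-variable route can also be driven from Python. The snippet below is a minimal sketch, assuming that Devito reads `DEVITO_OPT` when it is first imported, so the variable has to be set before `import devito`; the `opt_level` name is only illustrative.

```python
import os

# Assumption: Devito picks up DEVITO_OPT at import time, so set it first.
os.environ['DEVITO_OPT'] = 'advanced'

# import devito  # import Devito only after the variable is set

# Read the value back to confirm what the process environment now holds.
opt_level = os.environ['DEVITO_OPT']
print(opt_level)
```

Passing `opt=...` directly to an `Operator` remains the most explicit option, since it does not depend on process-wide state.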

We will use a simple `Operator` that, at each time step, updates all points in the physical domain, and we will inspect the code generated in each case.


We will incrementally apply several optimizations before reaching the `advanced` level.

In [1]:
# This function will be used to print the difference between the generated code.
def _unidiff_output(expected, actual):
    """
    Helper function. Returns a string containing the unified diff of two multiline strings.
    """
    import difflib
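    # Keep the line endings so that the diff lines can be joined back verbatim
    expected = expected.splitlines(keepends=True)
    actual = actual.splitlines(keepends=True)

    diff = difflib.unified_diff(expected, actual)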

    return ''.join(diff)
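
To see what this helper returns, here is a small standalone sketch using only the standard library. The `unidiff` function restates the same idea so the block runs on its own, and the two input strings are made-up stand-ins for generated code, not actual Devito output.

```python
import difflib


def unidiff(expected, actual):
    """Return the unified diff of two multiline strings."""
    return ''.join(difflib.unified_diff(expected.splitlines(keepends=True),
                                        actual.splitlines(keepends=True)))


# Hypothetical "before" and "after" snippets, for illustration only.
before = '#include "stdlib.h"\n#include "math.h"\n'
after = '#include "stdlib.h"\n#include "math.h"\n#include "omp.h"\n'

diff_text = unidiff(before, after)
print(diff_text)
```

Lines prefixed with `+` exist only in the second string and lines prefixed with `-` exist only in the first, which is how the diffs in the following cells should be read.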

In [2]:
from devito import clear_cache
import numpy as np
clear_cache()

In [3]:
# Import only the names used in this tutorial
from devito import Grid, TimeFunction, Eq, Operator, configuration

# Initialise our problem parameters
nx = 200
ny = 200
nz = 200

grid = Grid(shape=(nx, ny, nz))
u = TimeFunction(name='u', grid=grid, space_order=2)
eq = Eq(u.forward, u.laplace + 0.1)
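
For reference, with `space_order=2` the update defined by `eq` corresponds to the familiar second-order stencil below. This is a sketch read off the generated C code shown later in this tutorial; the indices $i, j, k$ for the grid point and $n$ for the time level are notation introduced here, and $h_x, h_y, h_z$ are the grid spacings.

$$
u^{n+1}_{i,j,k} = 0.1
+ \frac{u^{n}_{i-1,j,k} - 2u^{n}_{i,j,k} + u^{n}_{i+1,j,k}}{h_x^2}
+ \frac{u^{n}_{i,j-1,k} - 2u^{n}_{i,j,k} + u^{n}_{i,j+1,k}}{h_y^2}
+ \frac{u^{n}_{i,j,k-1} - 2u^{n}_{i,j,k} + u^{n}_{i,j,k+1}}{h_z^2}
$$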

The next cell illustrates the diff between an `Operator` that is not optimized at all and an `Operator` that exploits OpenMP parallelism. The printed diff shows that the OpenMP code has:

 - a header `#include "omp.h"`
as well as OpenMP directives:
 - `#pragma omp parallel num_threads(nthreads)`
 - `#pragma omp for collapse(1) schedule(static,1)`
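
As a rough illustration of what `schedule(static,1)` means, the sketch below distributes loop iterations over threads in chunks of size 1, assigned round-robin in thread order. It is a plain-Python model of the OpenMP static schedule, not something Devito generates; `static_schedule` and its parameters are hypothetical names.

```python
def static_schedule(n_iterations, nthreads, chunk=1):
    """Map each thread id to the iterations it runs under schedule(static, chunk)."""
    assignment = {tid: [] for tid in range(nthreads)}
    for start in range(0, n_iterations, chunk):
        # Chunks are handed out round-robin, in order of thread id.
        tid = (start // chunk) % nthreads
        assignment[tid].extend(range(start, min(start + chunk, n_iterations)))
    return assignment


assignment = static_schedule(n_iterations=8, nthreads=3)
print(assignment)
```

With 8 iterations and 3 threads, thread 0 gets iterations 0, 3 and 6, thread 1 gets 1, 4 and 7, and thread 2 gets 2 and 5.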

In [4]:
#NBVAL_IGNORE_OUTPUT
op_noop = Operator(eq, opt=('noop', {'openmp': False}))
op_noop_openmp = Operator(eq, opt=('noop', {'openmp': True}))

print(_unidiff_output(str(op_noop), str(op_noop_openmp)))

--- 
+++ 
@@ -2,6 +2,7 @@
 #include "stdlib.h"
 #include "math.h"
 #include "sys/time.h"
+#include "omp.h"
 
 struct dataobj
 {
@@ -20,7 +21,7 @@
 } ;
 
 
-int Kernel(const float h_x, const float h_y, const float h_z, struct dataobj *restrict u_vec, const int time_M, const int time_m, struct profiler * timers, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m)
+int Kernel(const float h_x, const float h_y, const float h_z, struct dataobj *restrict u_vec, const int time_M, const int time_m, struct profiler * timers, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads)
 {
   float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data;
   for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))
@@ -28,13 +29,17 @@
 

The diff above showed how the code changes when OpenMP parallelism is enabled. Let's now look at what happens when we enable some data-locality optimizations. The next cell prints the code generated when cache blocking is enabled in addition to OpenMP parallelism. Note the new `bf0` function, which iterates over blocks while still using OpenMP parallelism. By default, the block sizes are chosen by the autotuner unless explicitly set by the user.

In [5]:
#NBVAL_IGNORE_OUTPUT
op_blocking = Operator(eq, opt=('blocking', {'openmp': True}))
print(op_blocking.ccode)

#define _POSIX_C_SOURCE 200809L
#include "stdlib.h"
#include "math.h"
#include "sys/time.h"
#include "omp.h"

struct dataobj
{
  void *restrict data;
  int * size;
  int * npsize;
  int * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
} ;

struct profiler
{
  double section0;
} ;

void bf0(const float h_x, const float h_y, const float h_z, struct dataobj *restrict u_vec, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y0_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads);

int Kernel(const float h_x, const float h_y, const float h_z, struct dataobj *restrict u_vec, const int time_M, const int time_m, struct profiler * timers, const int x0_blk0_size, const int x_M, const int x_m, const int y0_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads)
{
  for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time 
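
The loop structure that cache blocking introduces can be pictured with a plain-Python sketch. This is not the code Devito emits: `blocked_points` and the block sizes used here are hypothetical, and the sketch only assumes what the text above states, namely that `x` and `y` are traversed block by block.

```python
def blocked_points(x_m, x_M, y_m, y_M, z_m, z_M, x_blk, y_blk):
    """Yield (x, y, z) in blocked order: x and y are tiled, z is left unblocked."""
    for x0 in range(x_m, x_M + 1, x_blk):              # loop over x blocks
        for y0 in range(y_m, y_M + 1, y_blk):          # loop over y blocks
            # min() clips the last block when the extent is not a multiple of the block size
            for x in range(x0, min(x0 + x_blk, x_M + 1)):
                for y in range(y0, min(y0 + y_blk, y_M + 1)):
                    for z in range(z_m, z_M + 1):      # innermost loop, not blocked
                        yield (x, y, z)


points = list(blocked_points(0, 9, 0, 9, 0, 3, x_blk=4, y_blk=4))
print(len(points))
```

Blocking changes only the order in which points are visited, not the set of points: every point in the domain is still updated exactly once per time step.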

We usually avoid blocking the innermost loop, because leaving it unblocked gives better performance through SIMD vectorization.
If desired, however, innermost-loop blocking can be enabled as shown in the next cell:

In [6]:
op_blocking_inner = Operator(eq, opt=('blocking', {'blockinner': True, 'openmp': True}))
# print(op_blocking_inner.ccode)

As mentioned earlier, we can also add SIMD vectorization to the generated code, which we prefer over innermost-loop blocking. The next cell shows the diff against plain blocking: an `omp simd` pragma is added to the innermost `z` loop.

In [7]:
#NBVAL_IGNORE_OUTPUT
op_blocking_simd = Operator(eq, opt=('blocking', 'simd', {'openmp': True}))
print(_unidiff_output(str(op_blocking), str(op_blocking_simd)))

--- 
+++ 
@@ -58,6 +58,7 @@
         {
           for (int y = y0_blk0; y <= y0_blk0 + y0_blk0_size - 1; y += 1)
           {
+            #pragma omp simd aligned(u:32)
             for (int z = z_m; z <= z_M; z += 1)
             {
               u[t1][x + 2][y + 2][z + 2] = 1.0e-1F + u[t0][x + 2][y + 2][z + 1]/pow(h_z, 2) - 2.0F*u[t0][x + 2][y + 2][z + 2]/pow(h_z, 2) + u[t0][x + 2][y + 2][z + 3]/pow(h_z, 2) + u[t0][x + 2][y + 1][z + 2]/pow(h_y, 2) - 2.0F*u[t0][x + 2][y + 2][z + 2]/pow(h_y, 2) + u[t0][x + 2][y + 3][z + 2]/pow(h_y, 2) + u[t0][x + 1][y + 2][z + 2]/pow(h_x, 2) - 2.0F*u[t0][x + 2][y + 2][z + 2]/pow(h_x, 2) + u[t0][x + 3][y + 2][z + 2]/pow(h_x, 2);



By default, however, Devito optimizes the generated code even further: as stated in the introduction, the default optimization level is `advanced`. Let's look at the diff between the version with loop blocking and SIMD vectorization and the `advanced` version.

In [8]:
#NBVAL_IGNORE_OUTPUT
op_advanced = Operator(eq, opt=('advanced', {'openmp': True}))
print(_unidiff_output(str(op_blocking_simd), str(op_advanced)))

--- 
+++ 
@@ -2,6 +2,8 @@
 #include "stdlib.h"
 #include "math.h"
 #include "sys/time.h"
+#include "xmmintrin.h"
+#include "pmmintrin.h"
 #include "omp.h"
 
 struct dataobj
@@ -24,6 +26,9 @@
 
 int Kernel(const float h_x, const float h_y, const float h_z, struct dataobj *restrict u_vec, const int time_M, const int time_m, struct profiler * timers, const int x0_blk0_size, const int x_M, const int x_m, const int y0_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads)
 {
+  /* Flush denormal numbers to zero in hardware */
+  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
+  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
   for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))
   {
     struct timeval start_section0, end_section0;
@@ -61,7 +66,8 @@
             #pragma omp simd aligned(u:32)
             for (int z = z_m; z <= z_M; z += 1)
             {
-              u[t1][x +

The code diff in the cell above depicts some differences between these two modes.
First of all, we notice the addition of 
```
+#include "xmmintrin.h"
+#include "pmmintrin.h"
```
and 
```
+  /* Flush denormal numbers to zero in hardware */
+  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
+  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
```
Denormals are normally flushed when using SSE-based instruction sets, except when compiling shared objects.
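
To make the notion of a denormal (subnormal) number concrete, here is a small sketch in plain Python. It uses Python's double-precision floats, whereas the generated kernel works in single precision, so it illustrates the concept only and not the exact values involved.

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normal double
subnormal = smallest_normal / 2        # representable only as a denormal value

# Without flush-to-zero, the subnormal survives as a tiny non-zero number.
is_nonzero = subnormal > 0.0
print(smallest_normal, subnormal, is_nonzero)
```

With flush-to-zero enabled in hardware, as in the generated code above, such values are treated as zero instead, which avoids the slow path that many CPUs take for denormal arithmetic.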

Then, in the PDE update, `float r0` is pulled out of the main update expression: it is a common sub-expression, so computing it once avoids recomputing it for every term that uses it.

This flop reduction after symbolic optimization results in an `Operator` that is cheaper to run.

In [9]:
#NBVAL_IGNORE_OUTPUT
configuration['log-level']='DEBUG'
op_blocking_simd = Operator(eq, opt=('blocking', 'simd', {'openmp': True}))
op_advanced = Operator(eq, opt=('advanced', {'openmp': True}))

Operator `Kernel` generated in 0.12 s
  * lowering.IET: 0.06 s (52.6 %)
     * specializing.IET: 0.04 s (35.1 %)
  * lowering.Clusters: 0.03 s (26.3 %)
Flops reduction after symbolic optimization: [30 --> 30]
Operator `Kernel` generated in 0.23 s
  * lowering.IET: 0.13 s (58.2 %)
     * specializing.IET: 0.11 s (49.3 %)
        * optimize_halospots: 0.07 s (31.4 %)
  * lowering.Clusters: 0.06 s (26.9 %)
     * specializing.Clusters: 0.05 s (22.4 %)
Flops reduction after symbolic optimization: [30 --> 19]
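
The flop counts can be pulled out of such a log line to quantify the saving. The sketch below parses the last line of the output above with a regular expression; the pattern is an assumption based solely on the format shown here and may not match other Devito versions.

```python
import re

# Log line copied from the output above.
line = "Flops reduction after symbolic optimization: [30 --> 19]"

match = re.search(r"\[(\d+) --> (\d+)\]", line)
before, after = int(match.group(1)), int(match.group(2))

# Fraction of floating-point operations removed per grid point update.
saving = (before - after) / before
print(f"{before} -> {after} flops, {saving:.1%} fewer")
```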


A 30-flop expression

`u[t1][x + 2][y + 2][z + 2] = 1.0e-1F + u[t0][x + 2][y + 2][z + 1]/pow(h_z, 2) - 2.0F*u[t0][x + 2][y + 2][z + 2]/pow(h_z, 2) + u[t0][x + 2][y + 2][z + 3]/pow(h_z, 2) + u[t0][x + 2][y + 1][z + 2]/pow(h_y, 2) - 2.0F*u[t0][x + 2][y + 2][z + 2]/pow(h_y, 2) + u[t0][x + 2][y + 3][z + 2]/pow(h_y, 2) + u[t0][x + 1][y + 2][z + 2]/pow(h_x, 2) - 2.0F*u[t0][x + 2][y + 2][z + 2]/pow(h_x, 2) + u[t0][x + 3][y + 2][z + 2]/pow(h_x, 2);`

is now a 19-flop pair of statements:

`
float r0 = -2.0F*u[t0][x + 2][y + 2][z + 2];
u[t1][x + 2][y + 2][z + 2] = 1.0e-1F + (r0 + u[t0][x + 2][y + 2][z + 1] + u[t0][x + 2][y + 2][z + 3])/((h_z*h_z)) + (r0 + u[t0][x + 2][y + 1][z + 2] + u[t0][x + 2][y + 3][z + 2])/((h_y*h_y)) + (r0 + u[t0][x + 1][y + 2][z + 2] + u[t0][x + 3][y + 2][z + 2])/((h_x*h_x));
`
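
A quick way to convince oneself that the two forms compute the same update is to evaluate both on sample values. The sketch below transcribes the two C expressions into plain Python; the function names and the input values are arbitrary choices made for illustration.

```python
import math


def update_original(c, zm, zp, ym, yp, xm, xp, h_x, h_y, h_z):
    """Unoptimized form: every term is divided by the squared spacing separately."""
    return (0.1 + zm/h_z**2 - 2.0*c/h_z**2 + zp/h_z**2
            + ym/h_y**2 - 2.0*c/h_y**2 + yp/h_y**2
            + xm/h_x**2 - 2.0*c/h_x**2 + xp/h_x**2)


def update_factorized(c, zm, zp, ym, yp, xm, xp, h_x, h_y, h_z):
    """Optimized form: shared term r0 computed once, one division per dimension."""
    r0 = -2.0*c
    return (0.1 + (r0 + zm + zp)/(h_z*h_z)
            + (r0 + ym + yp)/(h_y*h_y)
            + (r0 + xm + xp)/(h_x*h_x))


# Arbitrary sample values: centre point, its six neighbours, and the grid spacings.
args = dict(c=1.5, zm=0.25, zp=0.75, ym=2.0, yp=-1.0, xm=0.5, xp=3.0,
            h_x=0.1, h_y=0.2, h_z=0.4)

a = update_original(**args)
b = update_factorized(**args)
print(a, b, math.isclose(a, b, rel_tol=1e-9))
```

The two results agree up to floating-point rounding, since factorization reorders the arithmetic without changing the mathematical expression.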

We recommend always running with `opt='advanced'` (the default) unless you have a specific reason to do otherwise.