# Part I - Devito Performance modes

This tutorial illustrates the impact of several Devito optimization modes on the generated code.

An `Operator` has two preset optimization levels: `noop` and `advanced`. With `noop`, no performance optimizations are introduced by the compiler. With `advanced`, several flop-reducing and data locality optimizations are applied. Examples of flop-reducing optimizations are common sub-expression elimination and factorization; examples of data locality optimizations are loop fusion and cache blocking. SIMD vectorization is also applied through compiler auto-vectorization.

To choose the performance optimization level, set the environment variable `DEVITO_OPT`. By default it is set to the maximum level, `advanced`.

e.g. `export DEVITO_OPT=advanced`

Alternatively, we can set the optimization level at runtime using:

`configuration['opt'] = 'advanced'`

or pass it directly to an `Operator` like:

```
eq = Eq(u.forward, u+2)
op = Operator(eq, opt='advanced')
```

We will use an `Operator` with the `Equation` `eq = Eq(u.forward, u.dx.dx)` in order to .....

Optimizations will be applied incrementally, starting from the `noop` level and moving step by step towards the `advanced` level.

In [None]:
# This function will be used to print the difference between two versions of the generated code.
import difflib


def _unidiff_output(expected, actual):
    """
    Helper function. Returns a string containing the unified diff of two multiline strings.
    """
    # keepends=True preserves the newlines, so the diff lines can be joined directly
    expected = expected.splitlines(keepends=True)
    actual = actual.splitlines(keepends=True)

    diff = difflib.unified_diff(expected, actual)

    return ''.join(diff)
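
As a quick sanity check of the helper, the sketch below applies the same `difflib.unified_diff` logic to two short, made-up strings. The function name `unified_diff_of` and the C-like snippets are hypothetical and for illustration only; they are not Devito output. Lines prefixed with `+` exist only in the second string.

```python
import difflib


def unified_diff_of(expected, actual):
    """Return the unified diff of two multiline strings as one string."""
    expected = expected.splitlines(keepends=True)
    actual = actual.splitlines(keepends=True)
    return ''.join(difflib.unified_diff(expected, actual))


# Two made-up snippets; the second adds one line before the loop.
serial = "for (int x = 0; x < n; x++)\n  a[x] = b[x];\n"
parallel = "#pragma omp for\nfor (int x = 0; x < n; x++)\n  a[x] = b[x];\n"

diff = unified_diff_of(serial, parallel)
print(diff)
```

Comparing a string with itself yields an empty diff, which is a convenient way to confirm that two operators generate identical code.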

In [None]:
from devito import clear_cache
import numpy as np
clear_cache()

In [None]:
from devito import Grid, TimeFunction, Eq, Operator, configuration

# Initialise our problem parameters
nx = 200
ny = 200
nz = 200

grid = Grid(shape=(nx, ny, nz))
u = TimeFunction(name='u', grid=grid, space_order=2)
eq = Eq(u.forward, u.dx.dx)

The next cell illustrates the diff between an `Operator` that is not optimized at all and an `Operator` that exploits OpenMP parallelism. The printed diff shows that the OpenMP code has a header, `#include "omp.h"`, as well as OpenMP directives:

 - `#pragma omp parallel num_threads(nthreads)`
 - `#pragma omp for collapse(...) schedule(dynamic,1)`

In [None]:
#NBVAL_IGNORE_OUTPUT
op_noop = Operator(eq, opt=('noop', {'openmp': False}))
op_noop_openmp = Operator(eq, opt=('noop', {'openmp': True}))

print(_unidiff_output(str(op_noop), str(op_noop_openmp)))

We showed the diff in the code by enabling OpenMP parallelism. Let's have a look at the code when we enable some data-locality optimizations. The next cell prints the generated code when enabling cache blocking in addition to OpenMP parallelism. Notice the new `bf0` function where we iterate through blocks, still using OpenMP parallelism. By default, the block size is chosen by the autotuner unless the user defines it explicitly.

In [None]:
#NBVAL_IGNORE_OUTPUT
op_blocking = Operator(eq, opt=('blocking', {'openmp': True}))
print(op_blocking.ccode)
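
To make the idea of cache blocking concrete, here is a minimal pure-Python sketch of a 2D loop nest traversed block by block. The function names and block sizes are hypothetical and chosen for illustration; this is not the code Devito generates, only the same kind of iteration-space transformation.

```python
def sweep_plain(nx, ny):
    """Visit every (x, y) point in the usual row-major order."""
    return [(x, y) for x in range(nx) for y in range(ny)]


def sweep_blocked(nx, ny, bx, by):
    """Visit the same points, one bx-by-by block at a time."""
    visited = []
    for x0 in range(0, nx, bx):            # loop over blocks in x
        for y0 in range(0, ny, by):        # loop over blocks in y
            # min() handles the remainder block at the domain edge
            for x in range(x0, min(x0 + bx, nx)):
                for y in range(y0, min(y0 + by, ny)):
                    visited.append((x, y))
    return visited


plain = sweep_plain(10, 7)
blocked = sweep_blocked(10, 7, bx=4, by=3)

# Same set of points, visited in a different (block-local) order
print(sorted(blocked) == plain, blocked[:4])
```

Each block touches only a small region of the grid, so data brought into cache can be reused before it is evicted.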

For better performance, the innermost loop is not blocked, since vectorizing it generally performs better. If desired, blocking of the innermost loop can be enabled as shown in the next cell:

In [None]:
op_blocking_inner = Operator(eq, opt=('blocking', {'blockinner': True, 'openmp': True}))
# print(op_blocking_inner.ccode)

Let's add SIMD vectorization to the generated code. The next cell shows the addition of vectorization in the innermost `z` loop compared to basic loop blocking.

In [None]:
#NBVAL_IGNORE_OUTPUT
op_blocking_simd = Operator(eq, opt=('blocking', 'simd', {'openmp': True}))
print(_unidiff_output(str(op_blocking), str(op_blocking_simd)))

However, *by default* Devito optimizes the generated code even further. This happens when the Devito optimization level is set to `advanced`, as stated in the introduction. Let's have a look at the diff between the `advanced` optimization level and loop blocking with SIMD vectorization.

In [None]:
#NBVAL_IGNORE_OUTPUT
op_advanced = Operator(eq, opt=('advanced', {'openmp': True}))
print(_unidiff_output(str(op_blocking_simd), str(op_advanced)))

The code diff in the cell above shows the differences between these two versions.
First of all, we notice the addition of 
```
+#include "xmmintrin.h"
+#include "pmmintrin.h"
```

and 

```
+  /* Flush denormal numbers to zero in hardware */
+  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
+  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
```
Denormals are normally flushed when using SSE-based instruction sets, except when compiling shared objects.
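
For readers unfamiliar with denormal (subnormal) numbers, the sketch below shows one in Python, which uses double precision rather than the `float` of the generated code. It assumes a standard IEEE 754 platform where flush-to-zero is not enabled for the Python process.

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normal double
denormal = smallest_normal / 4         # falls below the normal range

# Without flush-to-zero the value is kept as a tiny nonzero number;
# with flush-to-zero enabled, the hardware would replace it by 0.0.
print(smallest_normal, denormal, denormal > 0.0)
```

On many CPUs, arithmetic on such values is considerably slower than on normal numbers, which is the motivation for flushing them to zero.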

Then, at the PDE update, we see that `float r0` is extracted from the main update expression, as it is a common sub-expression that we can avoid recomputing again and again.

This flop reduction after symbolic optimization results in an `Operator` that is computationally cheaper.

In [None]:
#NBVAL_IGNORE_OUTPUT
configuration['log-level']='DEBUG'
op_blocking_simd = Operator(eq, opt=('blocking', 'simd', {'openmp': True}))
op_advanced = Operator(eq, opt=('advanced', {'openmp': True}))

A 15-flop expression

`u[t1][x + 2][y + 2][z + 2] = -(-u[t0][x + 2][y + 2][z + 2]/h_x + u[t0][x + 3][y + 2][z + 2]/h_x)/h_x + (-u[t0][x + 3][y + 2][z + 2]/h_x + u[t0][x + 4][y + 2][z + 2]/h_x)/h_x;`

is now computed with 6 flops, as

`r2[xs][ys][z] = (-u[t0][x + 3][y + 2][z + 2] + u[t0][x + 4][y + 2][z + 2])/h_x;`

plus

`u[t1][x + 2][y + 2][z + 2] = (-r2[xs][ys][z] + r2[xs + 1][ys][z])/h_x;`
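
The rewrite above can be checked on a 1D analogue. The following is a minimal sketch, assuming a plain Python list in place of `u` and a hypothetical list `r` playing the role of the temporary `r2`; the function names are illustrative and not part of Devito.

```python
import math


def second_derivative_direct(u, j, h):
    """Expression as written before optimization (1D analogue)."""
    return (-(-u[j] / h + u[j + 1] / h) / h
            + (-u[j + 1] / h + u[j + 2] / h) / h)


def second_derivative_factored(u, h):
    """Precompute the shared first difference, then reuse it."""
    # r[j] plays the role of the temporary array r2
    r = [(-u[j] + u[j + 1]) / h for j in range(len(u) - 1)]
    # Each output point reuses two neighbouring entries of r
    return [(-r[j] + r[j + 1]) / h for j in range(len(r) - 1)]


h = 0.5
u = [float(j * j * j) for j in range(8)]   # arbitrary sample data

direct = [second_derivative_direct(u, j, h) for j in range(len(u) - 2)]
factored = second_derivative_factored(u, h)

# Largest absolute difference between the two forms
print(max(abs(a - b) for a, b in zip(direct, factored)))
```

Each entry of `r` is computed once and used by two neighbouring output points, which is where the saving in floating-point operations comes from.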



We recommend always running with `opt='advanced'` (the default) unless there is a specific reason not to.
