# Performance optimization overview

This tutorial illustrates the performance optimizations applied to the code generated by an `Operator`. As we shall see, most optimizations are automatically applied as they're known to systematically improve performance. Others, whose impact varies across different `Operator`s, are instead to be enabled through specific options.

An `Operator` has several preset **optimization levels**; the fundamental ones are `noop` and `advanced`. With `noop`, no performance optimizations are introduced by the compiler. With `advanced`, several flop-reducing and data locality _optimization passes_ are performed. Examples of flop-reducing optimization passes are common sub-expression elimination and factorization; examples of data locality optimization passes are loop fusion and cache blocking. SIMD vectorization, via compiler auto-vectorization, is enforced through OpenMP pragmas.

An optimization pass may provide knobs, or **options**, for fine-grained tuning. As explained in the next sections, some of these options are given at compile-time, others at run-time.

**\*\* Remark \*\***

Parallelism -- both shared-memory (e.g., OpenMP) and distributed-memory (MPI) -- is _by default disabled_ and is _not_ controlled via the optimization level. In this tutorial we will also show how to enable OpenMP parallelism (you'll see it's trivial!). A mini-guide about parallelism in Devito and related aspects is also available [here](https://github.com/devitocodes/devito/tree/master/benchmarks/user#a-step-back-configuring-your-machine-for-reliable-benchmarking). 

**\*\*\*\***

## API

The optimization level may be changed in various ways:

* globally, through the `DEVITO_OPT` environment variable. For example, to disable all optimizations on all `Operator`s, one should run with

```
DEVITO_OPT=noop python ...
```

* programmatically, adding the following lines to a program

```
from devito import configuration
configuration['opt'] = 'noop'
```

* locally, as an `Operator` argument

```
Operator(..., opt='noop')
```

Local takes precedence over programmatic, and programmatic takes precedence over global.
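
The precedence rule can be pictured with a small stand-alone sketch. Note that `resolve_opt` is a hypothetical helper written only for this illustration; it is not part of Devito, and the real lookup may well be implemented differently.

```python
def resolve_opt(local=None, programmatic=None, env=None, default='advanced'):
    """Return the first optimization level that is set, most specific first.

    Hypothetical helper for illustration only; not a Devito function.
    """
    for candidate in (local, programmatic, env):
        if candidate is not None:
            return candidate
    return default


# Only the environment variable (DEVITO_OPT) is set
print(resolve_opt(env='noop'))
# A programmatic setting overrides the environment
print(resolve_opt(programmatic='advanced', env='noop'))
# An Operator argument overrides everything else
print(resolve_opt(local='noop', programmatic='advanced', env='advanced'))
```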

The optimization options, instead, may only be changed locally. The syntax to specify an option is

```
Operator(..., opt=('advanced', {<optimization options>}))
```

A concrete example (you can ignore the meaning for now) is

```
Operator(..., opt=('advanced', {'blocklevels': 2}))
```

That is, options are to be specified _together with_ the optimization level.
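
To make the shape of the `opt` argument concrete, here is a minimal sketch that separates the level (or pass names) from the trailing options dictionary. The helper `split_opt` is an assumption made up for this illustration; Devito performs its own processing of `opt`, which is not shown here.

```python
def split_opt(opt):
    """Split an `opt` value into (names, options).

    Hypothetical helper for illustration only; not a Devito function.
    """
    if isinstance(opt, str):
        # A bare level such as 'advanced' carries no options
        return (opt,), {}
    *names, last = opt
    if isinstance(last, dict):
        # The options dictionary, when present, comes last
        return tuple(names), dict(last)
    return tuple(opt), {}


print(split_opt('advanced'))
print(split_opt(('advanced', {'blocklevels': 2})))
```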

## Default values

By default, all `Operator`s are run with the optimization level set to `advanced` and with all options disabled. So this

```
Operator(Eq(...))
```

is equivalent to

```
Operator(Eq(...), opt='advanced')
```

and obviously also to

```
Operator(Eq(...), opt=('advanced', {}))
```

In virtually all scenarios, regardless of application and underlying architecture, this ensures very good performance -- but not necessarily the very best.

## Utilities

The following functions will be used throughout the notebook for various purposes.

In [None]:
#TODO: George -- I guess we could (and should!) move this under examples/performance/utils.py ... 
def _unidiff_output(expected, actual):
    """
    Helper function. Returns a string containing the unified diff of two multiline strings.
    """
    import difflib
    expected = expected.splitlines(keepends=True)
    actual = actual.splitlines(keepends=True)

    diff = difflib.unified_diff(expected, actual)

    return ''.join(diff)
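
As a quick illustration of what the diff helper returns, the sketch below applies the same logic to two short strings. The helper is repeated under the name `unidiff` so that the snippet runs on its own, and the two C-like strings are made up for the example; they are not actual Devito output.

```python
import difflib


def unidiff(expected, actual):
    """Same logic as `_unidiff_output`, repeated so this snippet runs alone."""
    diff = difflib.unified_diff(expected.splitlines(keepends=True),
                                actual.splitlines(keepends=True))
    return ''.join(diff)


# Made-up code fragments, used only to exercise the helper
before = "for (int x = 0; x < 8; x += 1)\n{\n  a[x] = 1;\n}\n"
after = "#pragma omp simd\nfor (int x = 0; x < 8; x += 1)\n{\n  a[x] = 1;\n}\n"

# Lines starting with '+' exist only in `after`, lines starting with '-' only in `before`
print(unidiff(before, after))
```

Two identical inputs produce an empty string, which is a convenient way to check that two `Operator`s generate the same code.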

## Running example

Throughout the notebook we will generate `Operator`s for the following time-marching `Eq`.

In [None]:
from devito import Eq, Grid, Operator, TimeFunction

grid = Grid(shape=(80, 80, 80))
u = TimeFunction(name='u', grid=grid, space_order=2)
eq = Eq(u.forward, u.dy.dy + 1)

Despite its simplicity, this `Eq` is all we need to showcase the key components of the Devito optimization engine.

## OpenMP Parallelism

There are several ways to enable OpenMP parallelism. The one we use here consists of supplying an **option** to an `Operator`. The next cell illustrates the difference between two `Operator`s generated with the `noop` optimization level, but with OpenMP enabled on the latter `Operator`. 

In [1]:
# NOTE: we only need this cell for Continuous Integration, which runs with OpenMP enabled
from devito import configuration
configuration['language'] = 'C'

In [None]:
op0 = Operator(eq, opt='noop')
op0_omp = Operator(eq, opt=('noop', {'openmp': True}))

# print(op0)
# print(_unidiff_output(str(op0), str(op0_omp)))  # Uncomment to print out the diff only
print(op0_omp)

The OpenMP-ized `op0_omp` `Operator` includes:

 - the necessary header file, via `#include "omp.h"`
 - the `#pragma omp parallel num_threads(nthreads)` directive
 - the `#pragma omp for collapse(...) schedule(dynamic,1)` directive
 
More complex `Operator`s will have more directives, more types of directives, different iteration scheduling strategies based on heuristics and empirical tuning (e.g., `static` instead of `dynamic`), etc.

We note how the OpenMP optimization pass also introduces a new symbol, `nthreads`. This allows users to explicitly control the number of threads with which an `Operator` is run.

In [None]:
op0_omp.apply(time_M=0)  # Picks up `nthreads` from the standard environment variable OMP_NUM_THREADS
op0_omp.apply(time_M=0, nthreads=2)  # Runs with 2 threads per parallel loop

## The `advanced` mode

As already explained, `advanced` is the default optimization level in Devito. This mode performs several compilation passes to optimize the `Operator` for computation (number of flops), working set size, and data locality.

In the next paragraphs we dissect the `advanced` mode and analyze, one by one, some of its key passes.

### Loop blocking

The next cell creates a new `Operator` that adds loop blocking to what we had in `op0_omp`. 

In [None]:
op1_omp = Operator(eq, opt=('blocking', {'openmp': True}))
print(op1_omp)

**\*\* Remark \*\***

`'blocking'` is **not** an optimization level -- it rather identifies a very specific compilation pass. In other words, the `advanced` mode represents a set of passes, and `blocking` is one such pass.

**\*\*\*\***

The `blocking` pass creates additional loops over blocks. In this simple `Operator` there is just one loop nest, so only one pair of additional loops is created. In more complex `Operator`s, several loop nests may individually be blocked, whereas others may be left unblocked -- this is decided by the Devito compiler according to certain heuristics. The size of a block is represented by the symbols `x0_blk0_size` and `y0_blk0_size`, which are runtime parameters akin to `nthreads`.
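
The loop structure produced by 2D blocking can be mimicked in plain Python. This is only a sketch of the iteration order, not the code Devito generates: the function name `blocked_points` and the loop variables are assumptions, while `x0_blk0_size` and `y0_blk0_size` reuse the symbol names mentioned above.

```python
def blocked_points(x_size, y_size, x0_blk0_size=8, y0_blk0_size=8):
    """Yield every (x, y) of an x_size-by-y_size grid, one block at a time."""
    for x0_blk0 in range(0, x_size, x0_blk0_size):        # loop over blocks along x
        for y0_blk0 in range(0, y_size, y0_blk0_size):    # loop over blocks along y
            # min() clips the blocks that would overrun the grid edge
            for x in range(x0_blk0, min(x0_blk0 + x0_blk0_size, x_size)):
                for y in range(y0_blk0, min(y0_blk0 + y0_blk0_size, y_size)):
                    yield x, y


points = list(blocked_points(20, 10))
# Every grid point is visited exactly once, only the visiting order changes
print(len(points), len(set(points)))
```

The two outer loops are the "pair of additional loops"; the two inner loops sweep a single block, whose working set is small enough to stay in cache when the block shape is chosen well.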

By default, Devito applies 2D blocking and sets the default block shape to 8x8. There are two ways to set a different block shape:

* passing an explicit value. In the example below we would run with a 24x8 block shape, since `y0_blk0_size` keeps its default value of 8

```
op1_omp.apply(..., x0_blk0_size=24)
```

* letting the autotuner run for a while, looking for a better block shape. There are several autotuning modes. A short summary is available [here](https://github.com/devitocodes/devito/wiki/FAQ#devito_autotuning)

```
op1_omp.apply(..., autotune='aggressive')
```

Loop blocking also provides two optimization options:

* `blockinner={False, True}` -- to enable 3D (or any nD, n>2) blocking
* `blocklevels={int}` -- to enable hierarchical blocking, to exploit the cache hierarchy 

In the example below, we construct an `Operator` featuring six-dimensional loop blocking, where the first three loops represent outer blocks, whereas the second three loops represent inner blocks within an outer block.

In [None]:
op1_omp_6D = Operator(eq, opt=('blocking', {'blockinner': True, 'blocklevels': 2, 'openmp': True}))
print(op1_omp_6D)
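
The idea behind two levels of blocking over three dimensions can be sketched in plain Python as follows. This is an illustration of the traversal only, under the assumption that inner blocks tile each outer block; the names `blocks`, `two_level_points`, `outer` and `inner` are hypothetical and do not correspond to Devito symbols.

```python
from itertools import product


def blocks(start, stop, size):
    """Yield the (lo, hi) bounds of consecutive blocks covering [start, stop)."""
    for lo in range(start, stop, size):
        yield lo, min(lo + size, stop)


def two_level_points(shape, outer, inner):
    """Visit every point of a box through outer blocks split into inner blocks."""
    outer_ranges = [blocks(0, n, size) for n, size in zip(shape, outer)]
    for outer_block in product(*outer_ranges):       # first three loops: outer blocks
        inner_ranges = [blocks(lo, hi, size)
                        for (lo, hi), size in zip(outer_block, inner)]
        for inner_block in product(*inner_ranges):   # next three loops: inner blocks
            # Innermost loops: the points inside one inner block
            yield from product(*(range(lo, hi) for lo, hi in inner_block))


visited = list(two_level_points((10, 9, 7), outer=(8, 8, 8), inner=(4, 4, 4)))
print(len(visited), len(set(visited)))
```

Each level is typically sized for a different level of the cache hierarchy, with the inner blocks targeting the smaller, faster cache.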

### SIMD vectorization

Devito enforces SIMD vectorization through OpenMP pragmas.

In [None]:
op2_omp = Operator(eq, opt=('blocking', 'simd', {'openmp': True}))
# print(op2_omp)  # Uncomment to see the generated code
# print(_unidiff_output(str(op1_omp), str(op2_omp)))  # Uncomment to print out the diff only

By default, however, Devito optimizes the generated code even further. This happens when the optimization level is set to `advanced`, as stated in the introduction. Let's have a look at the diff between the `advanced` optimization level and the combination of loop blocking and SIMD vectorization.

In [None]:
#NBVAL_IGNORE_OUTPUT
op_advanced = Operator(eq, opt=('advanced', {'openmp': True}))
# `op2_omp` was built above with opt=('blocking', 'simd', {'openmp': True})
print(_unidiff_output(str(op2_omp), str(op_advanced)))

The code diff in the cell above shows the differences between these two versions.
First of all, we notice the addition of 
```
+#include "xmmintrin.h"
+#include "pmmintrin.h"
```

and 

```
+  /* Flush denormal numbers to zero in hardware */
+  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
+  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
```
Denormals are normally flushed when using SSE-based instruction sets, except when compiling shared objects.
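
For readers unfamiliar with denormal (subnormal) numbers, the short sketch below produces one using Python's double-precision floats. It is only meant to show what kind of value gets flushed; it assumes an IEEE 754 platform on which the Python interpreter has not enabled flush-to-zero, and it does not touch the hardware flags set by the generated code.

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive normal double
denormal = smallest_normal / 2         # representable only as a denormal

# A denormal is nonzero but smaller than any normal number
print(smallest_normal, denormal)
# With flush-to-zero enabled in hardware, a result like `denormal` becomes 0.0 instead
```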

Then, in the PDE update, we see that `float r0` has been extracted from the main update expression: it is a common sub-expression, so computing it once avoids computing it again and again.

This flop reduction, obtained through symbolic optimization, results in an `Operator` that is computationally cheaper.

In [None]:
#NBVAL_IGNORE_OUTPUT
configuration['log-level'] = 'DEBUG'
op_blocking_simd = Operator(eq, opt=('blocking', 'simd', {'openmp': True}))
op_advanced = Operator(eq, opt=('advanced', {'openmp': True}))

A 15-flop expression

`u[t1][x + 2][y + 2][z + 2] = -(-u[t0][x + 2][y + 2][z + 2]/h_x + u[t0][x + 3][y + 2][z + 2]/h_x)/h_x + (-u[t0][x + 3][y + 2][z + 2]/h_x + u[t0][x + 4][y + 2][z + 2]/h_x)/h_x;`

is now split into two expressions totalling 6 flops:

`r2[xs][ys][z] = (-u[t0][x + 3][y + 2][z + 2] + u[t0][x + 4][y + 2][z + 2])/h_x;`

plus

`u[t1][x + 2][y + 2][z + 2] = (-r2[xs][ys][z] + r2[xs + 1][ys][z])/h_x;`
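
The same transformation can be reproduced on a 1D list in plain Python. This is a sketch of the idea rather than of Devito's generated code: `update_naive` and `update_reuse` are hypothetical names, and the temporary `r` plays the role of `r2` above.

```python
def update_naive(u, h):
    """Difference of two forward differences, with every term recomputed."""
    n = len(u) - 2
    return [-(-u[i]/h + u[i + 1]/h)/h + (-u[i + 1]/h + u[i + 2]/h)/h
            for i in range(n)]


def update_reuse(u, h):
    """Same stencil, but each forward difference is computed once and stored in r."""
    r = [(-u[i] + u[i + 1])/h for i in range(len(u) - 1)]
    return [(-r[i] + r[i + 1])/h for i in range(len(u) - 2)]


u = [float(i*i) for i in range(8)]   # u = i**2, so the second difference is 2 for h = 1
print(update_naive(u, 1.0))
print(update_reuse(u, 1.0))
```

Both functions return the same values; the second one performs fewer operations per point because neighbouring points share the stored forward differences.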



We suggest always running with `opt='advanced'` (the default) unless there is a specific reason to do otherwise.
