# Performance optimization overview

The purpose of this tutorial is twofold:

* Illustrate the performance optimizations applied to the code generated by an `Operator`.
* Describe the options Devito provides to users to steer the optimization process.

As we shall see, most optimizations are applied automatically, as they are known to improve performance systematically. Others, whose impact varies from one `Operator` to another, must instead be enabled through specific flags.

An `Operator` has several preset **optimization levels**; the fundamental ones are `noop` and `advanced`. With `noop`, no performance optimizations are introduced. With `advanced`, several flop-reducing and data locality optimization passes are applied. Examples of flop-reducing optimization passes are common sub-expression elimination and factorization, while examples of data locality optimization passes are loop fusion and cache blocking. Optimization levels in Devito are conceptually akin to the `-O2, -O3, ...` flags in classic C/C++/Fortran compilers.
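To make the idea of a flop-reducing pass concrete, here is a minimal hand-written sketch in plain Python. It is not Devito-generated code; the function names and the expression are assumptions chosen only for illustration. It shows the same value computed naively, after common sub-expression elimination, and after factorization, with the operation count noted in comments.

```python
# Illustrative only: hand-written Python, not output of a Devito Operator.

def naive(a, b, c, d):
    # (a + b) is evaluated twice: 3 additions + 2 multiplications = 5 flops
    return (a + b) * c + (a + b) * d

def after_cse(a, b, c, d):
    # Common sub-expression elimination: (a + b) is evaluated once.
    # 2 additions + 2 multiplications = 4 flops
    t = a + b
    return t * c + t * d

def after_factorization(a, b, c, d):
    # Factorization: t is collected out of both products.
    # 2 additions + 1 multiplication = 3 flops
    t = a + b
    return t * (c + d)

# Integer inputs keep the comparison exact; with floating-point inputs,
# factorization may change the result in the last few bits.
print(naive(1, 2, 3, 4), after_cse(1, 2, 3, 4), after_factorization(1, 2, 3, 4))
```

The saving looks small here, but in a stencil loop the same expression is evaluated at every grid point and time step.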

An optimization pass may provide knobs, or **options**, for fine-grained tuning. As explained in the next sections, some of these options are given at compile-time, others at run-time.

**\*\* Remark \*\***

Parallelism -- both shared-memory (e.g., OpenMP) and distributed-memory (MPI) -- is _disabled by default_ and is _not_ controlled via the optimization level. In this tutorial we will also show how to enable OpenMP parallelism (you'll see it's trivial!). Another mini-guide about parallelism in Devito and related aspects is available [here](https://github.com/devitocodes/devito/tree/master/benchmarks/user#a-step-back-configuring-your-machine-for-reliable-benchmarking).

**\*\*\*\***

## Outline

* [API](#API)
* [Default values](#Default-values)
* [Running example](#Running-example)
* [OpenMP parallelism](#OpenMP-parallelism)
* [The `advanced` mode](#The-advanced-mode)
* [The `advanced-fsg` mode](#The-advanced-fsg-mode)

## API

The optimization level may be changed in various ways:

* globally, through the `DEVITO_OPT` environment variable. For example, to disable all optimizations on all `Operator`s, one could run with

```
DEVITO_OPT=noop python ...
```

* programmatically, by adding the following lines to a program

```
from devito import configuration
configuration['opt'] = 'noop'
```

* locally, as an `Operator` argument

```
Operator(..., opt='noop')
```

Local takes precedence over programmatic, and programmatic takes precedence over global.
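The precedence rule can be modelled with a few lines of plain Python. This is only a sketch of the rule as stated here, not Devito's internal logic: `resolve_opt` and its parameters are hypothetical names, and the fallback to `advanced` assumes the default level described in this tutorial.

```python
import os

DEFAULT_OPT = 'advanced'  # default optimization level, as stated in this tutorial

def resolve_opt(local=None, programmatic=None, environ=None):
    """Hypothetical helper: pick the optimization level by precedence.

    local        -- value passed as Operator(..., opt=...)
    programmatic -- value assigned to configuration['opt']
    environ      -- mapping holding DEVITO_OPT (defaults to os.environ)
    """
    environ = os.environ if environ is None else environ
    if local is not None:          # local wins over everything else
        return local
    if programmatic is not None:   # programmatic wins over global
        return programmatic
    return environ.get('DEVITO_OPT', DEFAULT_OPT)

# The environment asks for 'noop', but the local argument overrides it.
print(resolve_opt(local='advanced', environ={'DEVITO_OPT': 'noop'}))
```

Passing the environment as an explicit mapping keeps the sketch easy to experiment with, without having to modify the real process environment.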

The optimization options, instead, may only be changed locally. The syntax to specify an option is

```
Operator(..., opt=('advanced', {<optimization options>}))
```

A concrete example (you can ignore the meaning for now) is

```
Operator(..., opt=('advanced', {'blocklevels': 2}))
```

That is, options are to be specified _together with_ the optimization level (`advanced`).

## Default values

By default, all `Operator`s are run with the optimization level set to `advanced`. So this

```
Operator(Eq(...))
```

is equivalent to

```
Operator(Eq(...), opt='advanced')
```

and also to

```
Operator(Eq(...), opt=('advanced', {}))
```
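One way to see why the three forms above are equivalent is to reduce each of them to a `(level, options)` pair. The sketch below does this in plain Python; `normalize_opt` is a hypothetical helper written for illustration, not a function provided by Devito.

```python
def normalize_opt(opt='advanced'):
    """Hypothetical helper: return (level, options) for any accepted form of `opt`.

    Accepts a bare level such as 'advanced', or a (level, options) tuple
    such as ('advanced', {'blocklevels': 2}).
    """
    if isinstance(opt, str):
        # A bare level carries no options
        return opt, {}
    level, options = opt
    return level, dict(options)

# Omitted, bare-level and tuple forms all reduce to the same pair.
print(normalize_opt() == normalize_opt('advanced') == normalize_opt(('advanced', {})))
```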

In virtually all scenarios, regardless of application and underlying architecture, this ensures very good performance -- but not necessarily the very best.

## Misc

The following functions will be used throughout the notebook for printing generated code.

```
from examples.performance.utils import unidiff_output, print_kernel
```