In [1]:
function f(n=10^8)
    s = 0.0
    for k in 1:n
        s += sin(k)/k
    end
    1 + 2s
end

@time f()
@time f()
@time f()

  2.539775 seconds
  2.605501 seconds
  2.653800 seconds


3.1415926695599605

In [2]:
using LoopVectorization

function g(n=10^8)
    s = 0.0
    @turbo for k in 1:n
        s += sin(k)/k
    end
    1 + 2s
end

@time g()
@time g()
@time g()

  0.223640 seconds
  0.210904 seconds
  0.210637 seconds


3.1415926695577863
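
The two results agree to about 12 significant digits but differ in the trailing ones. A likely reason, stated here as an assumption rather than something the notebook shows, is that a vectorized reduction adds the terms in a different order, and floating-point addition is not associative. The sketch below uses plain Julia only; `lane_sum` is a hypothetical stand-in that accumulates into several partial sums, not the code `@turbo` generates.

```julia
# Sum sin(k)/k in loop order, as the plain `for` loop in f does.
function serial_sum(n)
    s = 0.0
    for k in 1:n
        s += sin(k) / k
    end
    return s
end

# Hypothetical stand-in for a vectorized reduction: `lanes` partial sums,
# filled in round-robin order and combined at the end.
function lane_sum(n; lanes = 4)
    acc = zeros(lanes)
    for k in 1:n
        acc[mod1(k, lanes)] += sin(k) / k
    end
    return sum(acc)
end

n = 10^6
a = 1 + 2 * serial_sum(n)
b = 1 + 2 * lane_sum(n)
println((a, b, abs(a - b)))   # both approximate π; any gap comes from summation order
```

Both orderings are valid approximations of the same sum, so a difference of this kind is rounding noise rather than an error in either loop.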

In [3]:
??@turbo

```julia
@turbo
```

Annotate a `for` loop, or a set of nested `for` loops whose bounds are constant across iterations, to optimize the computation. For example:

```julia
function AmulB!(C, A, B)
    @turbo for m ∈ indices((A,C), 1), n ∈ indices((B,C), 2) # indices((A,C),1) == axes(A,1) == axes(C,1)
        Cₘₙ = zero(eltype(C))
        for k ∈ indices((A,B), (2,1)) # indices((A,B), (2,1)) == axes(A,2) == axes(B,1)
            Cₘₙ += A[m,k] * B[k,n]
        end
        C[m,n] = Cₘₙ
    end
end
```

The macro models the set of nested loops, and chooses an ordering of the three loops to minimize predicted computation time.

Current limitations:

1. It assumes that loop iterations are independent.
2. It does not perform bounds checks.
3. It assumes that each loop iterates at least once. (Use `@turbo check_empty=true` to lift this assumption.)
4. It assumes that there is only one loop at each level of the nest.
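
To make the first limitation concrete, the following plain-Julia sketch (no `@turbo`; the function names are illustrative and not part of LoopVectorization) contrasts a loop whose iterations are independent with one that carries a dependency from one iteration to the next. Running each in forward and reverse order shows which of them tolerates reordering.

```julia
# Independent iterations: y[i] depends only on x[i], so any execution order works.
function scale!(y, x, order)
    for i in order
        y[i] = 2 * x[i]
    end
    return y
end

# Loop-carried dependency: y[i] reads y[i-1], which the previous iteration wrote.
function running_sum!(y, x, order)
    y[1] = x[1]
    for i in order
        i == 1 && continue
        y[i] = y[i-1] + x[i]
    end
    return y
end

x = [1.0, 2.0, 3.0, 4.0]
fwd, rev = 1:4, 4:-1:1

scale!(zeros(4), x, fwd) == scale!(zeros(4), x, rev)              # true
running_sum!(zeros(4), x, fwd) == running_sum!(zeros(4), x, rev)  # false
```

A loop of the second kind gives order-dependent results, so it should not be handed to a tool that is free to reorder iterations.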

It may also be applied to broadcasts:

```jldoctest
julia> using LoopVectorization

julia> a = rand(100);

julia> b = @turbo exp.(2 .* a);

julia> c = similar(b);

julia> @turbo @. c = exp(2a);

julia> b ≈ c
true
```

# Extended help

Advanced users can customize the implementation of the `@turbo`-annotated block using keyword arguments:

```julia
@turbo inline = false unroll = 2 thread = 4 body
```

where `body` is the code of the block (e.g., `for ... end`).

`thread` is either a Boolean, or an integer. The integer's value indicates the number of threads to use. It is clamped to be between `1` and `min(Threads.nthreads(),LoopVectorization.num_cores())`. `false` is equivalent to `1`, and `true` is equivalent to `min(Threads.nthreads(),LoopVectorization.num_cores())`.
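
The clamping rule can be written out as a small function. This is a hypothetical helper that mirrors the sentence above, not LoopVectorization's implementation; `nthreads` stands in for `Threads.nthreads()` and `ncores` for `LoopVectorization.num_cores()`, passed explicitly so the sketch needs no packages.

```julia
# Hypothetical helper mirroring the documented rule; not LoopVectorization code.
function effective_threads(thread::Integer, nthreads::Integer, ncores::Integer)
    upper = min(nthreads, ncores)
    # Bool is a subtype of Integer in Julia, so handle it first.
    thread isa Bool && return thread ? upper : 1
    return clamp(thread, 1, upper)
end

effective_threads(4, 8, 6)      # 4
effective_threads(16, 8, 6)     # 6, clamped to min(8, 6)
effective_threads(0, 8, 6)      # 1, clamped from below
effective_threads(true, 8, 6)   # 6
effective_threads(false, 8, 6)  # 1
```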

`safe` (defaults to `true`) will cause `@turbo` to fall back to `@inbounds @fastmath` if `can_turbo` returns false for any of the functions called in the loop. You can disable the associated warning with `warn_check_args=false`.

Setting the keyword argument `warn_check_args=true` (e.g. `@turbo warn_check_args=true for ...`) in a loop or broadcast statement will cause it to warn once if `LoopVectorization.check_args` fails and the fallback loop is executed instead of the LoopVectorization-optimized loop. Setting it to an integer > 0 will warn that many times, while setting it to a negative integer will warn an unlimited number of times. The default is `warn_check_args = 1`. Failure means that there may have been an array of an unsupported type, an array with unsupported element types, or (if `safe=true`) a function for which `can_turbo` returned `false`.
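
The counting rule for `warn_check_args` can be modelled as follows. This is a hypothetical model of the behavior described above, not the package's code: `setting` plays the role of `warn_check_args`, and `shown` is the number of warnings already emitted. Treating `0` as "never warn" is an assumption, since the text only covers positive and negative integers.

```julia
# Hypothetical model of the documented rule; not LoopVectorization code.
function should_warn(setting::Integer, shown::Integer)
    # Bool first: `true` warns once, `false` never warns.
    setting isa Bool && return setting && shown < 1
    # Negative integer: unlimited warnings.
    setting < 0 && return true
    # Positive integer n: at most n warnings (0 means never, by assumption).
    return shown < setting
end

should_warn(true, 0)     # true
should_warn(true, 1)     # false
should_warn(3, 2)        # true
should_warn(3, 3)        # false
should_warn(-1, 1000)    # true
```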

`inline` is a Boolean. When `true`, `body` will be directly inlined into the function (via a forced-inlining call to `_turbo_!`). When `false`, it won't force inlining of the call to `_turbo_!`, instead letting Julia's own inlining engine determine whether the call to `_turbo_!` should be inlined. (Typically, it won't.) Sometimes not inlining can lead to substantially worse code generation and regressions of more than 40%, even in very large problems (2-d convolutions are a case where this has been observed). One can find some circumstances where `inline=true` is faster, and other circumstances where `inline=false` is faster, so the best setting may require experimentation. By default, the macro tries to guess. Currently the algorithm is simple: roughly, if there are more than two dynamically sized loops and no convolutions, it will probably not force inlining. Otherwise, it probably will.

`check_empty` (default is `false`) determines whether or not it will check if any of the iterators are empty. If `false`, you must ensure yourself that they are not empty; otherwise the behavior of the loop is undefined and (as with `@inbounds`) segmentation faults are likely.
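
One way to picture the hazard is a loop that runs its body before testing the exit condition, which is what "assume at least one iteration" amounts to. The sketch below is plain Julia with illustrative function names; it is not the code `@turbo` emits, and the guarded variant only models what an emptiness check conceptually adds.

```julia
# Body runs before the exit test, so the loop silently assumes n >= 1.
function sum_first_n_unchecked(x, n)
    s = zero(eltype(x))
    i = 1
    while true
        s += x[i]        # with n == 0 this still reads x[1]
        i += 1
        i > n && break
    end
    return s
end

# Same loop with an explicit emptiness guard in front.
function sum_first_n_checked(x, n)
    n < 1 && return zero(eltype(x))
    return sum_first_n_unchecked(x, n)
end

x = [1.0, 2.0, 3.0]
sum_first_n_unchecked(x, 3)  # 6.0
sum_first_n_unchecked(x, 0)  # 1.0, wrong: the body ran once
sum_first_n_checked(x, 0)    # 0.0
```

Here the unchecked version merely returns a wrong value because `x[1]` exists; with bounds checks disabled and a genuinely empty array, the same pattern would read invalid memory.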

`unroll` is an integer that specifies the loop unrolling factor, or a tuple `(u₁, u₂) = (4, 2)` signaling that the generated code should unroll more than one loop. `u₁` is the unrolling factor for the first unrolled loop and `u₂` for the next (if present), but it applies to the loop ordering and unrolling that will be chosen by LoopVectorization, *not* the order in `body`. `uᵢ=0` (the default) indicates that LoopVectorization should pick its own value, and `uᵢ=-1` disables unrolling for the corresponding loop.
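
As a rough picture of what an unrolling factor of 2 means, here is a hand-unrolled dot product in plain Julia. It assumes 1-based vectors of equal length; the function names are illustrative, and this is not the code LoopVectorization generates.

```julia
# Plain loop: one element per iteration.
function dot_plain(a, b)
    s = 0.0
    for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    return s
end

# Hand-unrolled by 2: two independent accumulators per iteration, plus a remainder step.
function dot_unroll2(a, b)
    n = length(a)
    s1 = 0.0
    s2 = 0.0
    i = 1
    while i + 1 <= n
        s1 += a[i] * b[i]
        s2 += a[i+1] * b[i+1]
        i += 2
    end
    if i <= n            # odd length: one element left over
        s1 += a[i] * b[i]
    end
    return s1 + s2
end

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = ones(5)
dot_plain(a, b) == dot_unroll2(a, b)   # true for these exactly representable values
```

The two accumulators do not depend on each other, which is what gives the processor more independent work per iteration.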

The `@turbo` macro also checks the array arguments using `LoopVectorization.check_args` to try to determine if they are compatible with the macro. If `check_args` returns false, a fallback loop annotated with `@inbounds` and `@fastmath` is generated. Note that `VectorizationBase` provides functions such as `vadd` and `vmul` that will ignore `@fastmath`, preserving IEEE semantics both within `@turbo` and `@fastmath`. `check_args` currently returns false for some wrapper types like `LinearAlgebra.UpperTriangular`, requiring you to use their `parent`. Triangular loops aren't yet supported.
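
For reference, this is what taking the `parent` of a triangular wrapper looks like, using only the `LinearAlgebra` standard library. The sketch shows the wrapper behavior only and does not call `@turbo`.

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
U = UpperTriangular(A)   # lazy wrapper; entries below the diagonal read as zero

P = parent(U)            # the wrapped dense matrix, including the stored lower part
P === A                  # true: no copy is made
(U[2, 1], P[2, 1])       # (0.0, 3.0)

D = Matrix(U)            # dense copy with the lower triangle actually zeroed
D[2, 1]                  # 0.0
```

Because `parent` exposes whatever is stored below the diagonal, a loop over the parent sees those entries too; `Matrix(U)` is an alternative when a dense array with explicit zeros is what the computation needs.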
