# Vectorization, NumPy Universal Functions

In order to write performant code using numerical libraries, it is useful to keep the following rule in mind:
> Code that is predictable can be made fast

The most common example of predictable code is a fixed-length loop which performs the same operation at each iteration (up to changes in index).

Examples:
1. Matrix-vector multiplication
2. Functions applied element-wise to an array

Non-examples:
1. Code with branch instructions (`if`, `else`, etc.)
2. Code with recursive function calls (at least in Python)

One reason why predictable code can be fast is that most CPUs contain a [branch predictor](https://en.wikipedia.org/wiki/Branch_predictor), which guesses which way a branch will go so that the CPU can begin work on the expected path ahead of time.  If a branch is predicted incorrectly, the CPU has to discard that work and go down the correct branch, which takes time.  Code without branches avoids branch prediction errors, which speeds it up.

Another reason why predictable code can be made fast is vectorization.  If you are performing the same operation on many values in a predictable way, compiled code can use SIMD instruction sets such as [AVX](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions), which apply one instruction to several values at once and can greatly increase efficiency.

You don't need to worry about the details in Python, but it is good to know how to write code that allows libraries like NumPy to take advantage of these techniques.  Note that standard Python loops do not take advantage of them; you typically need to use libraries.
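As a rough illustration of the point about branches, the sketch below sets negative entries of an array to zero in two ways: a Python loop with an `if`/`else` at every iteration, and a single element-wise NumPy call.  The function names `clip_loop` and `clip_vectorized` are made up for this example.

```python
import numpy as np

def clip_loop(x):
    """Set negative entries to zero using an explicit branch per element."""
    z = np.empty_like(x)
    for i in range(len(x)):
        if x[i] > 0:
            z[i] = x[i]
        else:
            z[i] = 0.0
    return z

def clip_vectorized(x):
    """Same result, written as one element-wise operation with no Python-level branch."""
    return np.maximum(x, 0.0)

x = np.random.randn(1000)
print(np.array_equal(clip_loop(x), clip_vectorized(x)))
```

The second version hands the whole array to NumPy, which is then free to use whatever fast compiled loop it has available.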

## Universal Functions

NumPy universal functions (or ufuncs) are functions that are applied element-wise to an array.  Examples include most math operations and logical comparisons.  You can find additional information in [the ufunc documentation](https://numpy.org/doc/stable/reference/ufuncs.html).
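A few small examples may help make this concrete.  The arrays `a` and `b` below are arbitrary values chosen for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])

# Arithmetic operators on arrays call the corresponding ufunc.
print(np.array_equal(a + b, np.add(a, b)))

# Logical comparisons are ufuncs too and return boolean arrays.
print(np.greater(a, b))

# ufuncs are instances of the np.ufunc class.
print(isinstance(np.add, np.ufunc), isinstance(np.exp, np.ufunc))
```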




In [7]:
import numpy as np
import time

In [28]:
# set up two vectors
n = 1_000_000
x = np.random.randn(n)
y = np.random.randn(n)

In [29]:
def naive_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
        
    return z

In [30]:
start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))

time for naive add: 4.067e-01 sec
time for numpy add: 2.717e-03 sec
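A single `time.time()` measurement can be noisy, since other processes compete for the CPU.  One common approach, sketched below, is to repeat the call several times and keep the smallest time, using `time.perf_counter` as the clock.  The helper name `best_time` and its `repeats` parameter are made up for this illustration.

```python
import time
import numpy as np

def best_time(f, *args, repeats=5):
    """Return the smallest wall-clock time over several calls of f(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        f(*args)
        best = min(best, time.perf_counter() - start)
    return best

x = np.random.randn(100_000)
y = np.random.randn(100_000)
print("time for numpy add: {:0.3e} sec".format(best_time(np.add, x, y)))
```

In a notebook, the `%timeit` magic does something similar with less typing.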


### Exercise

1. Perform some timing tests that compare a naive Python loop implementation with a NumPy ufunc.

In [None]:
## Your code here


## More complicated functions

Some functions that can be sped up considerably are a bit more complicated than ufuncs.  One example is matrix-vector multiplication.  We'll use the notation `Ax = y`, where
\begin{equation}
y_i = \sum_j A_{i,j} x_j
\end{equation}

In [33]:
# set up matrix and vector for multiplication
m, n = 500, 1000
A = np.random.randn(m, n)
x = np.random.randn(n)

In [37]:
def naive_matvec(A, x):
    """
    naive matrix-vector multiplication implementation
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    
    return y

In [38]:
start = time.time()
y1 = naive_matvec(A, x)
end = time.time()
print("time for naive matvec: {:0.3e} sec".format(end - start))

start = time.time()
y2 = np.matmul(A, x)
end = time.time()
print("time for numpy matvec: {:0.3e} sec".format(end - start))

np.linalg.norm(y1 - y2)

time for naive matvec: 3.248e-01 sec
time for numpy matvec: 4.988e-04 sec


8.619051610592001e-13
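There is also a middle ground between the two versions timed above.  The sketch below keeps a Python loop over the rows of `A` but hands each inner sum to `np.dot`, so only the outer loop runs in the interpreter.  The name `rowwise_matvec` is made up for this example.

```python
import numpy as np

def rowwise_matvec(A, x):
    """Matrix-vector product with one Python loop over rows.

    Each entry y[i] = sum_j A[i, j] * x[j] is computed by np.dot.
    """
    m, n = A.shape
    y = np.empty(m)
    for i in range(m):
        y[i] = np.dot(A[i, :], x)
    return y

A = np.random.randn(50, 80)
x = np.random.randn(80)
print(np.allclose(rowwise_matvec(A, x), np.matmul(A, x)))
```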

## Numba

[Numba](https://numba.pydata.org/) is a just-in-time (JIT) compiler for Python code.  It provides several decorators which make it very easy to get speedups for numerical code in many situations.

Just-in-time compilation is an increasingly popular solution that bridges the gap between interpreted and compiled languages.  Generally:
* Interpreted languages (such as Python) simply read code line-by-line and execute as they go along.
* Compiled languages (such as C, C++, Fortran) compile code into a binary, which can be optimized to run quickly.

Compilation takes time initially but saves time when you use the binary.  Python libraries such as NumPy and SciPy use compiled libraries under the hood for speed.  Interpreted languages tend to be slower, but are easier to develop in.

Just-in-time compilation will produce a compiled version of a function the first time it is needed.  [Julia](https://julialang.org/) is a relatively new language which uses JIT to produce fast code with less development overhead.

One of the things you need to know to compile code is the types used.  If you want to use different types (e.g. single *and* double precision versions of a function), you need different compiled versions.  Python usually allows you to not worry too much about types, but this is one reason why you need to know about them anyway in scientific computing.

First:
```bash
$ conda install numba
```

Let's look at how Numba can be used with our `naive_add` function.

In [39]:
from numba import jit

@jit  # this is the only thing we do differently
def numba_add(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in range(len(x)):
        z[i] = x[i] + y[i]
        
    return z

In [40]:
# set up two vectors
n = 1000000
x = np.random.randn(n)
y = np.random.randn(n)

start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))

time for naive add: 3.657e-01 sec
time for numpy add: 2.508e-03 sec
time for numba add: 2.498e-01 sec


The `numba` JIT function runs in about the same time as the naive function.  Let's see what happens when we run the code again:

In [41]:
# set up two vectors
n = 1000000
x = np.random.randn(n)
y = np.random.randn(n)

start = time.time()
z = naive_add(x, y)
end = time.time()
print("time for naive add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))

time for naive add: 3.882e-01 sec
time for numpy add: 2.324e-03 sec
time for numba add: 3.130e-03 sec


Now the `numba` function is *much* faster.  This is because the first time the function is called, it must be compiled, and the compilation time is included in the first measurement.  Subsequent calls with the same argument types reuse the compiled version and run much faster.

The take-away is that it is advantageous to use JIT with functions you will use repeatedly, but not necessarily worth the time for functions you will only use once.

### Advanced numba

You can get a lot of mileage out of `numba` without too much trouble.  It is always good to look at [the documentation](https://numba.readthedocs.io/en/stable/index.html) to learn more.  Here are a few examples:

[Parallelization](https://numba.readthedocs.io/en/stable/user/parallel.html#numba-parallel) (this is supported on a handful of known operations).

In [44]:
from numba import prange # parallel range

@jit(nopython=True, parallel=True)
def numba_add_parallel(x, y):
    """
    add two arrays using a Python for-loop
    """
    z = np.empty_like(x)
    for i in prange(len(x)):
        z[i] = x[i] + y[i]
        
    return z

In [53]:
# set up two vectors
n = 10_000_000
x = np.random.randn(n)
y = np.random.randn(n)

z = numba_add_parallel(x, y) # precompile

start = time.time()
z = numba_add(x, y)
end = time.time()
print("time for numba add: {:0.3e} sec".format(end - start))

start = time.time()
z = np.add(x, y)
end = time.time()
print("time for numpy add: {:0.3e} sec".format(end - start))

start = time.time()
z = numba_add_parallel(x, y)
end = time.time()
print("time for numba parallel add: {:0.3e} sec".format(end - start))

time for numba add: 5.053e-02 sec
time for numpy add: 2.579e-02 sec
time for numba parallel add: 1.978e-02 sec


Parallelization of `matvec`:

In [51]:
@jit(nopython=True, parallel=True)
def numba_matvec(A, x):
    """
    naive matrix-vector multiplication implementation
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in prange(m):
        for j in range(n):
            y[i] = y[i] + A[i,j] * x[j]
    
    return y

In [61]:
# set up matrix and vector for multiplication
m, n = 2000, 1000
A = np.random.randn(m, n)
x = np.random.randn(n)

y = numba_matvec(A, x) # precompile

start = time.time()
y1 = numba_matvec(A, x)
end = time.time()
print("time for numba parallel matvec: {:0.3e} sec".format(end - start))

start = time.time()
y2 = np.matmul(A, x)
end = time.time()
print("time for numpy matvec: {:0.3e} sec".format(end - start))

np.linalg.norm(y1 - y2)

time for numba parallel matvec: 1.851e-03 sec
time for numpy matvec: 1.236e-03 sec


1.654898895484153e-12

For more information, see the Numba [performance hints](https://numba.readthedocs.io/en/stable/user/performance-tips.html).