# Hardware SIMD to NumPy cheat sheet

## What it means:

   * SSE (1999) operates on 4 floats in one hardware instruction; SSE2 added 2 doubles per instruction
   * AVX (announced 2008) and AVX2 (announced 2011) operate on 8/4 floats/doubles in one hardware instruction
   * AVX-512 (announced 2013) operates on 16/8 floats/doubles in one hardware instruction
   * NVIDIA GPUs execute 32 values in one hardware instruction ("warp")
   * AMD GPUs execute 64 values in one hardware instruction ("wavefront") on GCN-era hardware; RDNA also supports 32-wide wavefronts
   * NumPy functions operate on **an arbitrary number** (e.g. millions) of values in one **Python instruction**, and many are implemented with hardware vectorization where the build and CPU support it.
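
To make the last bullet concrete, here is a minimal sketch comparing an explicit Python loop with a single whole-array expression. The function names `scale_loop` and `scale_vectorized`, the array size, and the constants are illustrative assumptions, not part of any library.

```python
import numpy as np

# One million single-precision values (size chosen only for illustration).
x = np.arange(1_000_000, dtype=np.float32)

def scale_loop(values, a, b):
    """Explicit loop: one interpreted multiply-add per element."""
    out = np.empty_like(values)
    for i in range(len(values)):
        out[i] = a * values[i] + b
    return out

def scale_vectorized(values, a, b):
    """Whole-array expression: the per-element work runs in compiled loops."""
    return a * values + b

# A single Python expression covers all one million elements.
y = scale_vectorized(x, 2.0, 1.0)
```

Both functions compute the same values; the vectorized form simply moves the per-element loop out of the Python interpreter, which is where compiled (and possibly SIMD) code can take over.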

## Terminology:

| hardware vectorization | NumPy vectorization |
|-|-|
| mask | [fancy indexing with boolean array](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html#boolean-or-mask-index-arrays) |
| gather | [fancy indexing with integer array](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html#index-arrays) (get) |
| scatter | [fancy indexing with integer array](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html#index-arrays) (set) |
| stride | [strides](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ndarray.strides.html) (or reshape-slice-reshape) |
| structure packing | [structured array](https://docs.scipy.org/doc/numpy/user/basics.rec.html) or "recarray" |
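
The rows of the table can be sketched in a few lines each. The array contents and variable names below are arbitrary choices for illustration.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50, 60])

# mask: a boolean array selects the elements where it is True
mask = a > 25
masked = a[mask]                 # [30 40 50 60]

# gather: an integer array reads from arbitrary positions
idx = np.array([4, 0, 2])
gathered = a[idx]                # [50 10 30]

# scatter: the same integer array on the left-hand side writes to those positions
b = np.zeros(6, dtype=a.dtype)
b[idx] = [1, 2, 3]               # b is now [2 0 3 0 1 0]

# stride: a step slice is a view onto the same memory with a larger byte stride
every_other = a[::2]             # [10 30 50]
stride_ratio = every_other.strides[0] // a.strides[0]   # 2

# structure packing: a structured array stores its fields interleaved per record
points = np.zeros(3, dtype=[("x", np.float32), ("y", np.float32)])
points["x"] = [1.0, 2.0, 3.0]
```

Note that the gather and the mask produce copies, while the strided slice is a view that shares memory with `a`.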

## Bottom lines:

NumPy vectorization is the same *basic concept* as hardware vectorization, but at a higher level of abstraction: a Single (Python) Instruction on Multiple Data, rather than a Single (hardware) Instruction on Multiple Data.

Reorganizing code for NumPy speedups also tends to enable hardware vectorization, because both reward the same things: sequential memory access and simple loops with no loop-carried dependencies.
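
As a rough sketch of what "no loop-carried dependencies" means in practice, compare a loop whose iterations are independent with one where each iteration needs the previous result. The helper names `squares_loop` and `running_sum_loop` are hypothetical and exist only for this comparison.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

def squares_loop(values):
    """No loop-carried dependency: each output uses only its own input."""
    out = np.empty_like(values)
    for i in range(len(values)):
        out[i] = values[i] * values[i]
    return out

# Independent iterations collapse into one whole-array expression.
squares = x * x

def running_sum_loop(values):
    """Loop-carried dependency: iteration i needs the total from iteration i - 1."""
    out = np.empty_like(values)
    total = 0.0
    for i in range(len(values)):
        total += values[i]
        out[i] = total
    return out

# This cannot be written as a plain elementwise expression;
# NumPy provides a dedicated scan for this particular pattern.
running = np.cumsum(x)
```

The first loop maps directly onto an elementwise operation. The second needs a purpose-built function such as `np.cumsum`, and a dependency with no such function would have to stay a loop or be restructured.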