- Major revision: 2
The goal is for a non-intrusive worksharing construct that enables the use of many CPU cores, without getting in the way, and while still being as close to pure Python as we can manage. Simplicity and safety is valued above handling all cases. We divide into:
- Loops that can be trivially parallelized as they are written. These should be parallelized with a minimum of fuss.
- Loops that require a little knowledge of threading. In these cases we require one to be a little more explicit.
- "Complicated" use-cases -- these fall outside of the scope of the
prangefacility. These cases may well not even require language support in Cython, or only in indirect ways -- if you are expert enough to know about threading locks, you can always write a closure to do threading.
A very simple example is to multiply an array by a scalar:
from cython.parallel import * def func(np.ndarray[double] x, double alpha): cdef Py_ssize_t with nogil: for i in prange(x.shape): x[i] = alpha * x[i]
The only change required is
prange. This simply hints to Cython that
the iterations of the loop body can be executed in any order. If this can be proven to not be the case, Cython is even free to raise a compile-time error.
prange loop body has the restriction that any functions called must be reentrant.
In pure Python,
prange (from the shadow module) is simply implemented as
range. Using multithreading would slow down the majority of usecases anyway, because of the GIL.
Other choices of syntax, such as
parallel_for i in range(n), may be more rational, but are unable to execute as pure Python code, and are therefore ruled out.
The first implementation will use (a small subset of!) OpenMP, simply in order to get somewhere quickly. But we do keep the door open for changing that.
We do automatic inference of thread-private variables in the cases where it is safe. Consider the following example:
from cython.parallel cimport * def f(np.ndarray[double] x, double alpha): cdef double s = 0 cdef double tmp with nogil: for i in prange(x.shape): # alpha is only read, so shared # tmp assigned before being used -> safe and natural to make it implicitly thread-private tmp = alpha * i s += x[i] * tmp # turns into reduction + thread-private s += tmp * 10 # after the loop we emulate sequential loop execution(OpenMP lastprivate) return s
prange, we promise that the order of execution of each iteration is arbitrary, so
the loop body can not depend on values computed by another iteration of the same loop:
- Variables that are only read become thread-shared
- Variables that are modified in the loop body are considered thread-private (but see below).
- If Cython can prove through control flow analysis that a variable is read before it is assigned, so that it is guaranteed to "spill over" from one iteration to the next, it should raise an error.
- At any rate, at the beginning of the loop body we initialize all floats to NaN and all integers to the value furthest from 0, to increase odds of failing hard early (so NOT firstprivate).
- If a variable is only modified by using an inplace operator (same each time), it is treated as a reduction variable. It is also thread-private and can be read without changing semantics (though the value will be somewhat meaningless). Note: This is the only kind of variable that can be modified in the loop and spill over to the next iteration. Using two different inplace operators, or assigning to it, would cause a compiler error.
- It would perhaps be more consistent to disallow reading. Reading a reduction variable would then be a compiler error (because you're breaking the promise that the order of iterations does not matter).
- After the loop as well as in any
else-block, the variables have the values they would have had if the loop had executed sequentially (lastprivate).
Unlike the simple case above, one may need explicit thread-local variables, e.g., for caching purposes. Because values now carry across iterations of the loop, one needs to explicitly declare them.
from cython.parallel import prange, threadlocal def f(np.ndarray[int] ell_arr, np.ndarray[double] out): cdef Py_ssize_t i, ell cdef threadlocal(int) old_ell = -1 cdef threadlocal(double) alpha = np.nan assert ell_arr.shape == m_arr.shape == out.shape assert np.all(ell_arr > 0) with nogil: for i in prange(out.shape): ell = ell_arr[i] if ell != old_ell: alpha = function_of_ell(ell) old_ell = ell out[i] = alpha * sqrt(ell + i)
The idea is that
ell_arr has long stretches of constant value, and that
function_of_ell is slow to compute, so that each thread wants to have their own cached copy of
alpha between each iteration.
Note that this is a very different use of thread-local variables. We:
- Require explicit declaration, unlike above
- Are firstprivate, unlike above
- Are lastprivate, like above
Note that pure Python operation still works just fine.
Some threads require setup and teardown of scratch space. For these cases there's the parallel block, as well as a couple of auxiliary functions:
from cython.parallel cimport parallel, prange, threadsavailable, threadid cdef double* buf = <double*>malloc(100 * threadsavailable(schedule='dynamic') * sizeof(double)) cdef double* threadbuf with nogil, parallel: threadbuf = buf + threadid() * 100 # thread setup for i in prange(n, schedule='dynamic'): ... # any thread teardown free(buf)
There may be more than one
prange after one another; at least for the time being they'll terminate with a barrier. Non-
prange loops are executed in each thread (if you use the parallel section, you're considered an expert that can deal with this subtle difference).
The parallel block is an obvious place to support more of the OpenMP standard eventually, such as critical sections, barriers etc., if we wish to do so.
threadsavailable must take the same scheduling parameters as
prange in order to give an accurate answer. The implementation is based on simply entering an OpenMP parallel section, fetch number of threads, and return.
prange takes a number of keyword arguments. For now these would simply be a subset of the OpenMP scheduler flags. We may have a mix of arguments that are hints (backends that don't support them ignore them) and required (backends that don't support them can't be used).
with gil should be supported eventually, although it can be skipped in a first release. It can be implemented through using the master block in OpenMP. Of course, getting the GIL serializes the execution, so the likely usecase is for raising exceptions.
In the first implementation one would perhaps not support
yield is likely never supported?).
But when they get implemented:
- Flags may have to be used instead of using goto statements for
raisecan be used inside
with gilblocks. When an exception is raised by any thread, the semantics is to terminate any other thread running the same loop at a random point.
- After entering
with gil, one does an OpenMP flush and check a thread termination flag (one can do one before as well to improve performance in some cases, but the one within must still be present). This ensures only one thread gets to raise an exception.
- For more responsive threads (make sure they don't go on computing for 10 minutes before exception finally propagates), one may want to also have guards in other places. The cost of doing a flush and whether a flush is strictly needed is an open question.
It is likely OK to sacrifice a little bit of performance (on the order of 10-20%) in order to have threads be responsive and terminate quickly on an exception. If one needs top performance one can always avoid raise/continue/break and use flags manually.