- Major revision: 2
The goal is a non-intrusive worksharing construct that enables the use of many CPU cores, without getting in the way, and while still staying as close to pure Python as we can manage. Simplicity and safety are valued above handling all cases. We divide the use cases into:
- Loops that can be trivially parallelized as they are written. These should be parallelized with a minimum of fuss.
- Loops that require a little knowledge of threading. In these cases we require one to be a little more explicit.
- "Complicated" use cases -- these fall outside the scope of the `prange` facility. These cases may well not even require language support in Cython, or only in indirect ways -- if you are expert enough to know about threading locks, you can always write a closure to do the threading.
A very simple example is to multiply an array by a scalar:
```
from cython.parallel import *
cimport numpy as np

def func(np.ndarray[double] x, double alpha):
    cdef Py_ssize_t i
    with nogil:
        for i in prange(x.shape[0]):
            x[i] = alpha * x[i]
```
The only change required is `prange`. This simply hints to Cython that the iterations of the loop body can be executed in any order. If this can be proven not to be the case, Cython is even free to raise a compile-time error.
A `prange` loop body has the restriction that any functions called must be reentrant.
In pure Python, `prange` (from the shadow module) is simply implemented as `range`. Using multithreading would slow down the majority of use cases anyway, because of the GIL.
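As a sketch of what this fallback means in practice (hypothetical code, not the actual shadow-module source), `prange` in pure Python simply delegates to `range` and ignores any scheduling hints:

```python
# Hypothetical sketch of the pure-Python fallback; the real shadow
# module may differ. Scheduling hints are simply ignored when running
# as plain Python.
def prange(*args, **kwargs):
    return range(*args)

# The loop runs sequentially but produces identical results.
total = 0
for i in prange(5, schedule='dynamic'):
    total += i
```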
Other choices of syntax, such as `parallel_for i in range(n)`, may be more rational, but are unable to execute as pure Python code, and are therefore ruled out.
The first implementation will use (a small subset of!) OpenMP, simply in order to get somewhere quickly. But we do keep the door open for changing that.
We do automatic inference of thread-private variables in the cases where it is safe. Consider the following example:
```
from cython.parallel cimport *
cimport numpy as np

def f(np.ndarray[double] x, double alpha):
    cdef double s = 0
    cdef double tmp
    cdef Py_ssize_t i
    with nogil:
        for i in prange(x.shape[0]):
            # alpha is only read, so it is shared
            # tmp is assigned before being used, so it is safe and
            # natural to make it implicitly thread-private
            tmp = alpha * i
            s += x[i] * tmp  # turns into a reduction + thread-private
        # after the loop we emulate sequential loop execution (OpenMP lastprivate)
        s += tmp * 10
    return s
```
With `prange`, we promise that the order of execution of each iteration is arbitrary, so the loop body cannot depend on values computed by another iteration of the same loop:
- Variables that are only read become thread-shared
- Variables that are modified in the loop body are considered thread-private (but see below).
- If Cython can prove through control flow analysis that a variable is read before it is assigned, so that it is guaranteed to "spill over" from one iteration to the next, it should raise an error.
- At any rate, at the beginning of the loop body we initialize all floats to NaN and all integers to the value furthest from 0, to increase odds of failing hard early (so NOT firstprivate).
- If a variable is only modified by using an inplace operator (same each time), it is treated as a reduction variable. It is also thread-private and can be read without changing semantics (though the value will be somewhat meaningless). Note: This is the only kind of variable that can be modified in the loop and spill over to the next iteration. Using two different inplace operators, or assigning to it, would cause a compiler error.
- It would perhaps be more consistent to disallow reading. Reading a reduction variable would then be a compiler error (because you're breaking the promise that the order of iterations does not matter).
- After the loop, as well as in any `else`-block, the variables have the values they would have had if the loop had executed sequentially (lastprivate).
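The reduction rule can be illustrated with a plain-Python sketch (hypothetical; a real `prange` would use OpenMP reductions): partitioning the iterations among per-thread partial sums gives the same result as the sequential loop, which is exactly why the order of iterations does not matter.

```python
# Plain-Python sketch (hypothetical) of why the reduction rule is
# order-independent: each "thread" accumulates a private partial of
# `s` over its own slice of iterations, and the partials are combined
# after the loop. Any partition of the iterations gives the same sum.
def sequential_sum(xs):
    s = 0.0
    for x in xs:
        s += x
    return s

def chunked_sum(xs, nthreads=4):
    partials = [0.0] * nthreads          # one thread-private `s` each
    for t in range(nthreads):
        for x in xs[t::nthreads]:        # this "thread's" iterations
            partials[t] += x
    return sum(partials)                 # combine after the loop

xs = list(range(100))
assert chunked_sum(xs) == sequential_sum(xs) == 4950.0
```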
Unlike the simple case above, one may need explicit thread-local variables, e.g., for caching purposes. Because values now carry across iterations of the loop, one needs to explicitly declare them.
```
import numpy as np
cimport numpy as np
from libc.math cimport sqrt
from cython.parallel import prange, threadlocal

def f(np.ndarray[int] ell_arr, np.ndarray[double] out):
    cdef Py_ssize_t i, ell
    cdef threadlocal(int) old_ell = -1
    cdef threadlocal(double) alpha = np.nan
    assert ell_arr.shape[0] == out.shape[0]
    assert np.all(ell_arr > 0)
    with nogil:
        for i in prange(out.shape[0]):
            ell = ell_arr[i]
            if ell != old_ell:
                alpha = function_of_ell(ell)
                old_ell = ell
            out[i] = alpha * sqrt(ell + i)
```
The idea is that `ell_arr` has long stretches of constant value, and that `function_of_ell` is slow to compute, so each thread wants to keep its own cached copy of `alpha` between iterations.
Note that this is a very different use of thread-local variables. These variables:
- Require explicit declaration, unlike above
- Are firstprivate, unlike above
- Are lastprivate, like above
Note that pure Python operation still works just fine.
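The caching behaviour above can be emulated in plain Python with `threading.local` (a hypothetical illustration, not how Cython would implement `threadlocal`): each worker keeps its own `(old_ell, alpha)` pair, so the slow function is recomputed only when `ell` changes for that thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Plain-Python sketch (hypothetical): threading.local plays the role
# of threadlocal(...), so each worker keeps its own (old_ell, alpha)
# pair and the slow function is recomputed only when ell changes.
tls = threading.local()
calls = []                      # records every "slow" recomputation

def function_of_ell(ell):       # stand-in for the slow computation
    calls.append(ell)
    return float(ell) ** 0.5

def work(ell):
    if getattr(tls, 'old_ell', None) != ell:
        tls.alpha = function_of_ell(ell)   # recompute only on change
        tls.old_ell = ell
    return tls.alpha

with ThreadPoolExecutor(max_workers=2) as ex:
    results = list(ex.map(work, [3, 3, 3, 5, 5]))

# Long constant stretches mean far fewer slow calls than iterations:
# at most one recomputation per (thread, value) switch.
assert 2 <= len(calls) <= 4
```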
Some use cases require per-thread setup and teardown of scratch space. For these there is the parallel block, as well as a couple of auxiliary functions:
```
from cython.parallel cimport parallel, prange, threadsavailable, threadid
from libc.stdlib cimport malloc, free

cdef double* buf = <double*>malloc(100 * threadsavailable(schedule='dynamic') * sizeof(double))
cdef double* threadbuf
with nogil, parallel:
    threadbuf = buf + threadid() * 100
    # thread setup
    for i in prange(n, schedule='dynamic'):
        ...
    # any thread teardown
free(buf)
```
There may be more than one `prange` after one another; at least for the time being they'll terminate with a barrier. Non-`prange` loops are executed in each thread (if you use the parallel section, you're considered an expert who can deal with this subtle difference).
The parallel block is an obvious place to support more of the OpenMP standard eventually, such as critical sections, barriers etc., if we wish to do so.
`threadsavailable` must take the same scheduling parameters as `prange` in order to give an accurate answer. The implementation simply enters an OpenMP parallel section, fetches the number of threads, and returns.
`prange` takes a number of keyword arguments. For now these would simply be a subset of the OpenMP scheduler flags. We may have a mix of arguments that are hints (backends that don't support them ignore them) and requirements (backends that don't support them can't be used).
`with gil` should be supported eventually, although it can be skipped in a first release. It can be implemented using the `master` block in OpenMP. Of course, acquiring the GIL serializes execution, so the likely use case is raising exceptions.
In the first implementation one would perhaps not support `break`, `continue`, `raise` or `return` inside the loop (`yield` is likely never supported).
But when they get implemented:
- Flags may have to be used instead of goto statements for breaking out of the loop. `raise` can be used inside `with gil` blocks. When an exception is raised by any thread, the semantics is to terminate any other thread running the same loop at a random point.
- After entering `with gil`, one does an OpenMP flush and checks a thread-termination flag (one can do this before as well to improve performance in some cases, but the check within must still be present). This ensures that only one thread gets to raise an exception.
- For more responsive threads (making sure they don't go on computing for 10 minutes before the exception finally propagates), one may want to add guards in other places too. The cost of doing a flush, and whether a flush is strictly needed, is an open question.
It is likely OK to sacrifice a little bit of performance (on the order of 10-20%) in order to have threads be responsive and terminate quickly on an exception. If one needs top performance one can always avoid raise/continue/break and use flags manually.
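The flag-based termination scheme can be sketched in plain Python (a hypothetical illustration of the mechanism, not Cython's implementation): every iteration checks a shared flag, so once one iteration "raises", the remaining iterations on all threads short-circuit quickly.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Plain-Python sketch (hypothetical) of flag-based termination: every
# iteration checks a shared flag, so once one iteration "raises", the
# remaining iterations on all threads short-circuit quickly.
stop = threading.Event()

def body(i):
    if stop.is_set():
        return None          # terminated "at a random point"
    if i == 7:               # simulate an iteration that raises
        stop.set()
        return 'raised'
    return i * i

with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(body, range(1000)))

assert 'raised' in results   # exactly one iteration triggered the flag
```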