This is a proposal for parallelism that is much more explicit than enhancements/prange. This makes it somewhat more verbose, but also simplifies the rules.
This is starting to look like OpenMP, which may be good: why redo a widely used standard? At the same time, there is a great deal of simplification.
The intention is still to be an almost safe solution that covers 80% of the use cases for parallel computation.
    nthreads = cython.parallel.numavailable(schedule='dynamic')
    cdef double* buf = <double*>malloc(100 * nthreads * sizeof(double))
    cdef double alpha
    cdef double s = 0
    try:
        with cython.nogil, cython.parallel(schedule='dynamic') as p:
            cdef double* threadbuf = buf + p.threadid * 100
            cdef Py_ssize_t i, j
            for i in range(n): # run in parallel
                compute_frobnication(i, threadbuf)
                for j in range(100): # serial
                    s += alpha * threadbuf[j] # s is reduction variable
    finally:
        free(buf)
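For reference, the result the example is meant to produce can be written as plain serial Python. This is only a sketch of the intended semantics, not part of the proposal: compute_frobnication is not defined here, so a hypothetical stand-in is used, and n and alpha are passed in as parameters.

```python
BUF_LEN = 100  # per-thread buffer length used in the example above


def compute_frobnication(i, buf):
    """Hypothetical stand-in: the proposal does not define this function."""
    for j in range(len(buf)):
        buf[j] = float(i + j)


def serial_reference(n, alpha):
    """Serial equivalent of the example: one buffer, one accumulator, no threads."""
    buf = [0.0] * BUF_LEN
    s = 0.0
    for i in range(n):
        compute_frobnication(i, buf)
        for j in range(BUF_LEN):
            s += alpha * buf[j]
    return s
```

Any parallel execution of the block should give the same value of s, up to floating-point reordering in the reduction.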
The cython.parallel construct sets up a scope/block that is run by multiple threads in parallel. Within the block:
- One must declare thread-private variables, which have one copy per thread, within the block.
- The main block is run by all threads, and is intended for thread initialization/cleanup.
- The only looping construct allowed is a simple for-loop over a range. This will be divided amongst threads. Within the for-loop any Cython code is available.
- Variables from the outer scope are in general read-only, and assignment is a syntax error. However they can be written to in two circumstances:
- By using a single in-place operator, it is understood that a reduction should happen over the variable. It is possible to have, e.g., += in two places, but a mix of += and -= raises an error.
- In the else-block of a loop, shared variables are mutable, while the thread-private variables are also available. The else-block is executed by the thread that runs the last iteration (i == n - 1 above).
- There can be multiple for-loops in a single parallel block.
- Is there a barrier after each loop? In OpenMP this is an option: the default is to have a barrier, but you can turn it off.
- firstprivate: One simply initializes the thread-local variables when declaring them
- lastprivate: One simply writes to shared variables in the else-block
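The rules above can be emulated in plain Python with the threading module. This is a rough sketch of the semantics only, under several assumptions: compute_frobnication is a hypothetical stand-in, the loop is divided statically rather than with schedule='dynamic', and the reduction is implemented with one partial sum per thread. It is not how Cython would implement the block.

```python
import threading

BUF_LEN = 100


def compute_frobnication(i, buf):
    """Hypothetical stand-in; not defined by the proposal."""
    for j in range(len(buf)):
        buf[j] = float(i + j)


def parallel_block(n, alpha, nthreads):
    partial = [0.0] * nthreads  # one reduction slot per thread
    shared = {"last_i": None}   # shared variable, written only in the "else-block"

    def body(threadid):
        # Thread-private state, initialised when declared (firstprivate).
        threadbuf = [0.0] * BUF_LEN
        local_s = 0.0
        last_seen = None
        # Static division of range(n) among threads (an assumption).
        for i in range(threadid, n, nthreads):
            compute_frobnication(i, threadbuf)
            for j in range(BUF_LEN):
                local_s += alpha * threadbuf[j]
            last_seen = i
        # Emulates the else-block: only the thread that ran iteration
        # n - 1 may write to the shared variable (lastprivate).
        if last_seen is not None and last_seen == n - 1:
            shared["last_i"] = last_seen
        partial[threadid] = local_s

    threads = [threading.Thread(target=body, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # all threads finish before the reduction is combined
    # Combine the per-thread partial sums: the reduction implied by "s += ...".
    return sum(partial), shared["last_i"]
```

For example, parallel_block(8, 1.0, 3) splits the eight iterations over three threads; only the thread that handles i == 7 writes shared["last_i"], and the partial sums are combined once all threads have joined.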
This is Pythonic in the sense that, rather than relying on many automatic constructs, one uses simpler and more explicit ones.
This proposal is not complete, but should be understood in the context of the ongoing mailing list discussion and enhancements/prange.