
enhancements parallelblock

DagSverreSeljebotn edited this page · 6 revisions

Parallel block


This is a proposal for parallelism that is much more explicit than enhancements/prange. This makes it a bit more verbose, but also simplifies the rules a bit.

This is starting to look like OpenMP, which may be a good thing: why redo a widely used standard? At the same time, the proposal is a good deal simpler than OpenMP.

The intention is still to provide an almost-safe solution that covers 80% of the use cases for parallel computation.

The parallel block


from libc.stdlib cimport malloc

nthreads = cython.parallel.numavailable(schedule='dynamic')
cdef double* buf = <double*>malloc(100 * nthreads * sizeof(double))
cdef double alpha
cdef double s = 0
with cython.nogil, cython.parallel(schedule='dynamic') as p:
    # thread-private: declared inside the block, one copy per thread
    cdef double* threadbuf = buf + p.threadid * 100
    cdef Py_ssize_t i, j
    for i in range(n): # run in parallel
        compute_frobnication(i, threadbuf)
        for j in range(100): # serial
            s += alpha * threadbuf[j] # s is reduction variable

cython.parallel sets up a scope/block that is run by multiple threads in parallel. Within the block:

  • Thread-private variables, which have one copy per thread, must be declared within the block.

  • The main block (the code outside any for-loop) is run by all threads, and is intended for thread initialization/cleanup.

  • The only looping construct allowed is a simple for-loop over a range. This will be divided amongst threads. Within the for-loop any Cython code is available.

  • Variables from the outer scope are in general read-only, and assignment is a syntax error. However they can be written to in two circumstances:
    • When a variable is only modified with a single in-place operator, it is understood that a reduction should happen over that variable. The same operator may appear in more than one place (e.g., += twice), but a mix of += and -= raises an error.
    • In the else-block of a loop, shared variables are mutable, while the thread-private variables are also available. The else-block is executed by the thread that does the last operation (i == n - 1 above).
  • There can be multiple for-loops in a single parallel block.
    • Is there a barrier after each loop? In OpenMP this is optional: the default is to have a barrier, but you can turn it off.
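The reduction rule above can be illustrated with a small emulation in plain Python threads. This is only a sketch of the intended semantics, not of how Cython would implement it: the function name parallel_sum, the buffer size of 4, the buffer-filling line that stands in for compute_frobnication, and the static round-robin split of the range (instead of schedule='dynamic') are all assumptions made for illustration.

```python
import threading

def parallel_sum(n, nthreads, alpha=2.0):
    """Emulate the proposed block: range(n) is divided among threads
    and `s += ...` is treated as a reduction over s."""
    partial = [0.0] * nthreads            # one private accumulator per thread

    def worker(threadid):
        threadbuf = [0.0] * 4             # thread-private buffer
        local = 0.0                       # private copy of the reduction variable
        # assumed static schedule: thread t takes iterations t, t + nthreads, ...
        for i in range(threadid, n, nthreads):
            for j in range(len(threadbuf)):
                threadbuf[j] = float(i + j)   # hypothetical stand-in for compute_frobnication
            for j in range(len(threadbuf)):   # serial inner loop
                local += alpha * threadbuf[j]
        partial[threadid] = local         # each thread writes only its own slot

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)                   # combine step of the reduction
```

Each thread accumulates into its own private copy and the copies are combined once at the end, which is why only a single kind of in-place operator can be allowed per variable: the combine step has to know which operation to apply.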

Vs. OpenMP

  • firstprivate: One simply initializes the thread-local variables when declaring them
  • lastprivate: One simply writes to shared variables in the else-block
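The two correspondences above can be sketched in the same emulated style. Again this is plain Python threads standing in for the proposed block, under stated assumptions: run_block, offset, scratch and the shared dictionary are hypothetical names, and the range is split round-robin among the threads.

```python
import threading

def run_block(n, nthreads, offset):
    shared = {"last": None}               # stands in for a shared outer variable

    def worker(threadid):
        # "firstprivate": the thread-private variable is simply
        # initialized from the outer value when it is declared
        scratch = offset
        ran_last = False
        for i in range(threadid, n, nthreads):
            scratch = offset + i * i      # private per-iteration result
            ran_last = (i == n - 1)
        if ran_last:
            # "lastprivate": the else-block, run only by the thread
            # that executed the final iteration, writes the shared variable
            shared["last"] = scratch

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["last"]
```

Only one thread ever writes the shared variable, so no locking is needed in this sketch. The behaviour for an empty range is left unspecified by the proposal; here the shared variable is simply left untouched.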

This is Pythonic in the sense that, rather than relying on many automatic constructs, one uses a few simpler and more explicit ones.

The rest

This proposal is not complete, but it should be understandable in the context of the ongoing mailing-list discussion and enhancements/prange.
