OpenCL can be used to execute code in parallel on GPUs, CPUs and other accelerator devices. We can use OpenCL in two different contexts:
Cython can generate multiple implementations of code. Then the code can, depending on the size of the iteration space, on where the bulk of the data resides (e.g. in main memory or in GPU memory), and depending on the latencies of the hardware and overhead of the OpenCL implementation, decide which implementation to execute.
Preliminary testing shows non-trivial startup costs (even for the CPU), so as a rule of thumb to break-even one will need to execute at least a couple thousand operations. The implementation can, at startup or on first encounter of an OpenCL code region, figure out the exact heuristics, and pick either a native C implementation or an OpenCL one with an appropriate device.
The OpenCL code should be compiled on first encounter, and should probably be cached on disk (for bonus
See memoryviews for documentation.
The discussion on SIMD vector operations can be found here: SIMD CEP. Again, for small data sets the implementation would default to a native C implementation, possibly annotated to appease auto-vectorization with a variety of compilers. Generating native SSE (and detecting CPU capabilities etc at runtime) is kind of a pain.
The OpenCL implementations should use OpenCL vector data types whenever possible (depending on how the data is accessed (e.g. through a host pointer), whether it is contiguous, and depending on alignment). The implementation is free to copy data to contiguous and aligned memory first (in a blocking fashion), and then perform the requested operations.
The memoryviews will keep track of where their data is. As such, all, or part, of their data may live on the
GPU (or whatever accelerator device is used). Memory is copied back whenever a view is requested,
whenever the implementation wants, or whenever this is asked for explicitly (a
copy_to_host method or some
such). It could also have a method to ensure data is always in main memory and never in device memory.
Whenever programs are compiled with OpenCL support enabled, it will be the user's responsibility to either copy back memory explicitly, lock it in main memory, or not access it outside of Cython space. E.g. it is fine to index a memoryview outside of a parallel section, the compiler shall detect and correctly handle such uses (e.g. copy back data at the first encounter in the function).
Potential problems arise when there are multiple views of the same memory region (or on partly overlapping memory regions), but no memoryview object is shared. e.g.
myarray = numpy.empty(...) cdef double[:] view1 = myarray cdef double[:] view2 = myarray
Here Cython could copy the data of view1 (shared by view2!) to the GPU, with view2 totally unaware. To avoid such scenarios, the memoryview objects shall keep track of (refcounted or weak-referenced) memory regions. When view1 is obtained, it globally registers the memory area occupied, and when view2 is instantiated it will find this area and share a data structure with view1. The rules are as follows:
Using these rules the following things are enforced:
With these regions in place the runtime can efficiently apply heuristics to determine where code should be executed, and it handles consistency in the case of overlapping memory.
Note: the memory regions are associated with memoryview objects, which means operating on small subslices created in Cython space may copy a lot more data than necessary. Alternatively they could be associated with the slices themselves.
During computation memory regions are considered locked, and only in the context of cython.parallel shall concurrent (read) accesses have defined results. E.g. the following is fine
cdef Py_ssize_t i cdef double[[, ]] myslice = ... cdef double norm = 0.0 with nogil, cython.parallel.parallel(num_threads=2): if cython.parallel.threadid() % 2 == 0: norm += sum(myslice[::2] **2) # work on even rows else: norm += sum(myslice[1::2] ** 2) # work on odd rows
as the compiler will decide before entering the parallel section where data will live for all slices. The following, however, is NOT fine
import threading t = threading.Thread(target=work, args=(myslice[1::2],)) t.start() with nogil: work(myslice[::2]) t.join()
as the data may be on the GPU and the routine may decide to copy the data (or vice versa) concurrently.
The compiler can try to compile parallel sections to OpenCL, depending on several criteria:
Break, return and continue can be supported in the same conceptual way as the current OpenMP support. That is, it shall give its best effort but guarantee nothing for either break or return.
Although the GIL may not be acquired, there might still be some errors. One is bounds checking. Bounds checking should, whenever possible, be done outside the loop, but if the indices depend on unknown state, or on code with possible side effects, this optimization cannot be performed. In this case it could return an error indicator (with position and error information), which the caller could use to raise an appropriate exception.
There are a variety of different math functions available:
And now we would also have access to OpenCL's math intrinsics. That said, math functions can be used on either
scalars or on vectors (or N-dimensional matrices). I'm not entirely sure what the best way is to solve the
problem, but perhaps we should introduce yet another way:
cython.math. This module could provide basic math
functions for both vectors (elemental) and scalars. It could perhaps also provide a few reductions like
product. Depending on whether native C or OpenCL is used it could pick the right function.
Perhaps if the user has a
cimport numpy in the code, Cython could detect use of any of its math functions
(on memoryview slices), and use its
cython.math equivalent in the context of OpenCL.
This raises another issue, namely that different implementations might not provide the same numerical results.
Some environment variables could be introduced to allow the user to specify things like OpenCL compiler options, etc.
OpenCL should be something optional that can be disabled with a compile-time switch (
--no-opencl). It should
also be possible to use only the CPU with OpenCL (