Explicitly offloading nogil code to device/GPU #3342

Open
fschlimb opened this issue Jan 30, 2020 · 4 comments

@fschlimb fschlimb commented Jan 30, 2020

Allow marking nogil-code to be run on a device/GPU.

There is also a discussion on the mailing-list: https://mail.python.org/pipermail/cython-devel/2020-January/005262.html.

Older CEPs have suggested similar approaches.

A first and already very powerful step would be to let users explicitly mark code that should be offloaded, while keeping language extensions minimal and requiring no extra tools.

As a start, we could consider only parallel devices such as GPUs and use OpenMP target, since Cython already uses OpenMP for parallelism.

Let's consider a simple example for computing pairwise distances between vectors in parallel:

import numpy as np
from cython.parallel import prange
from libc.math cimport sqrt

def pairwise_distance_host(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef int i,j,k
    with nogil:
        for i in prange(M):
            for j in prange(M):
                d = 0.0
                for k in range(N):
                    tmp = X[i, k] - X[j, k]
                    d = d + tmp * tmp  # cython would interpret '+=' as a reduction variable!
                D[i, j] = sqrt(d)
    return np.asarray(D)
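
For context, calling the compiled function from Python is straightforward (shapes are illustrative):

import numpy as np

X = np.random.rand(1000, 3)      # 1000 vectors of length 3, C-contiguous float64
D = pairwise_distance_host(X)    # returns a (1000, 1000) distance matrix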

The parallel region is implicitly defined by the first (outermost) prange. For offloading, we could require that the parallel region be defined explicitly:

...
    with nogil, parallel():
        for i in prange(M):
...

Now all we need is a marker that the parallel region should be offloaded to a device:

...
    with nogil, parallel(device={}):
        for i in prange(M):
...

Cython should take care of defining the data mappings in a safe way, i.e. transferring the necessary data from the host to the device and back from the device to the host:

  • by default arrays are sent from host to device when entering the parallel region and from the device to host when exiting
  • read-only data is only sent from host to device but not from device to host
  • ideally, write-only data will not be sent from host to device but only from device to host
  • ideally, we can also detect that a variable is not used outside the parallel region so that we do not need to transfer any data (only allocate and deallocate)

Because complex indexing can make it impossible to correctly determine the best mapping, and because data movement is often the biggest performance bottleneck, we also need a way for experts to optimize the data movement. For that, device should accept a dictionary mapping variable names to a map-type (borrowed from OpenMP target map clauses):

  • "to" means host to device
  • "from" means device to host
  • "tofrom" means host to device and device to host
  • "alloc" means not data-transfer at all

In the above example, the input array/memview is read-only on the device, so we could indicate it like this:

...
    with nogil, parallel(device={X:'to'}):
        for i in prange(M):
...

Map values provided in device override whatever Cython would automatically infer.
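
For illustration, spelling out what the defaults above would infer for the first example (X is only read on the device, D is only written, assuming the ideal write-only detection), the fully explicit form would be:

...
    with nogil, parallel(device={X:'to', D:'from'}):
        for i in prange(M):
...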

Another common challenge in offloading is that computation might go back and forth between host and GPU. In such cases it is often necessary to keep data on the GPU between different GPU regions even if a host section is in between. As an example, let's take the above code and block the computation by computing only a single row of the output array at a time. Note that this will be needed anyway once the input array becomes large, since the size of the output array grows quadratically and might simply not fit on the GPU.

def pairwise_distance_row(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef double[::1] Dslice
    cdef int i,j,k
    with nogil:
        for i in range(M):
            Dslice = D[i,:]
            with parallel(device={Dslice:'from'}):
                for j in prange(M):
                    d = 0.0
                    for k in range(N):
                        tmp = X[i, k] - X[j, k]
                        d = d + tmp * tmp  # cython would interpret '+=' as a reduction variable!
                    Dslice[j] = sqrt(d)
    return np.asarray(D)  

Even though we only transfer slices of D in Dslice from device to host, the entire input array X will be sent from host to device in every iteration of the outermost loop. The suggested solution adds a data context (to be used with with) defining the lifetime of variables on the device. Let's simply use the same keyword device and let it accept the same mappings:

...
    with nogil, device({X:'to'}):
        for i in range(M):
...

Since this is an expert tool, we might not want or need to infer any map-type here and can leave it entirely to the programmer.
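
Putting the pieces together, the blocked example could then look like this (a sketch of the proposed syntax combining the device data region with the per-row parallel region; the mappings shown are illustrative):

import numpy as np
from cython.parallel import prange, parallel
from libc.math cimport sqrt

def pairwise_distance_row(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    cdef double[::1] Dslice
    cdef int i, j, k
    # 'device(...)' and the 'device=' argument below are the proposed extensions,
    # not existing Cython syntax: X stays on the device for the whole loop nest,
    # while only one row slice of D travels back to the host per iteration.
    with nogil, device({X:'to'}):
        for i in range(M):
            Dslice = D[i, :]
            with parallel(device={Dslice:'from'}):
                for j in prange(M):
                    d = 0.0
                    for k in range(N):
                        tmp = X[i, k] - X[j, k]
                        d = d + tmp * tmp
                    Dslice[j] = sqrt(d)
    return np.asarray(D)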

A first prototype is implemented here: https://github.com/fschlimb/cython/tree/offload

Limitations, open questions etc:

  • OpenMP does not allow mapping overlapping memory regions. We need to at least check that two memviews do not overlap (see the sketch after this list)
  • C-pointers are not supported (yet?)
  • Only C-contiguous memviews are supported
  • need support for setup.py/distutils
  • need tests
  • need documentation (syntax, semantics, and how to set up the offload compiler)
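
For the overlap check mentioned in the first point, a minimal sketch (a hypothetical helper, not part of the prototype; it simply compares the address ranges of two non-empty C-contiguous memviews) could look like this:

cdef bint memviews_overlap(double[:, ::1] a, double[:, ::1] b):
    # Compare the half-open address ranges [start, end) of the two buffers;
    # assumes both memviews are non-empty and C-contiguous.
    cdef size_t a_start = <size_t> &a[0, 0]
    cdef size_t a_end = a_start + a.shape[0] * a.shape[1] * sizeof(double)
    cdef size_t b_start = <size_t> &b[0, 0]
    cdef size_t b_end = b_start + b.shape[0] * b.shape[1] * sizeof(double)
    return a_start < b_end and b_start < a_end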

@ogrisel @GaelVaroquaux @oleksandr-pavlyk @DrTodd13

@ogrisel ogrisel commented Jan 31, 2020

Thank you very much @fschlimb. @jeremiedbb is likely to also be interested in trying this branch. Could you please point us to some build instructions to get it to work with CUDA-enabled or Intel GEN Graphics GPUs for testing?

@fschlimb fschlimb commented Jan 31, 2020

You need a compiler which supports OpenMP 4.5 with offloading enabled. Unfortunately most pre-compiled compilers do not have it enabled. So you probably need to build gcc or llvm yourself. I was successful only with llvm so far. See https://hpc-wiki.info/hpc/Building_LLVM/Clang_with_OpenMP_Offloading_to_NVIDIA_GPUs. Notice that even newer versions of llvm support only cuda version 9.x, not 10.x!

Once you have a working compiler, you simply need to use this clang instead of whatever default compiler and use the following CFLAGS: CFLAGS="-fopenmp -fopenmp-targets=nvptx64"

Unfortunately distutils doesn't seem to offer a command-line option to switch to clang. Setting CC=clang luckily does the job for compilation. Setting LD=clang, however, has no effect; you have to somehow tweak your setup.py or explicitly call the linker on the command line :(
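
For reference, a minimal setup.py sketch that at least passes the offload flags through to the compile and link steps (module and file names are placeholders; the compiler itself still has to be selected via the environment as described above):

# setup.py -- illustrative only; "pairwise" and pairwise.pyx are placeholder names
from setuptools import setup, Extension
from Cython.Build import cythonize

offload_flags = ["-fopenmp", "-fopenmp-targets=nvptx64"]

extensions = [
    Extension(
        "pairwise",
        ["pairwise.pyx"],
        extra_compile_args=offload_flags,
        extra_link_args=offload_flags,  # the same flags are also needed when linking
    )
]

setup(ext_modules=cythonize(extensions))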

Suggestions on how to make this work nicely are appreciated!

@jeremiedbb jeremiedbb commented Jan 31, 2020

Setting LD=clang however has no effect,

to build with clang I usually do: CC=clang LDSHARED='clang -shared'

@fschlimb fschlimb commented Jan 31, 2020

Cool, this works. Thanks @jeremiedbb!
