# Comparison between PyOpenCL, PyCUDA, Cython and NumPy

@author: Adrian Oeftiger
@date: 24.02.2017

We compare the timing of a simple entry-wise sum of two large double-precision arrays. The goal is to evaluate whether code is easier to maintain (i.e. fewer lines of code, more convenience) with PyOpenCL or with PyCUDA + Cython, while keeping an eye on performance.

In [1]:
from __future__ import absolute_import, print_function

import numpy as np

import pyopencl as cl

from pycuda.autoinit import context
from pycuda import gpuarray as gp
from pycuda.elementwise import ElementwiseKernel
from pycuda.compiler import SourceModule

The two arrays to be summed:

In [2]:
a_np = np.random.rand(10000000) # 10 million
b_np = np.random.rand(10000000)

## Available Hardware for the Test

### CPU

In [3]:
!cat /proc/cpuinfo | head -5

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 45
model name	: Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz


In [4]:
!lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Stepping:              7
CPU MHz:               2301.000
BogoMIPS:              4601.03
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23


### GPU

In [5]:
!nvidia-smi -L

GPU 0: Tesla C2075 (UUID: GPU-2005e721-47ec-a062-2010-b4ccb09bdc6a)
GPU 1: Tesla C2075 (UUID: GPU-18212ff0-da40-6804-5022-ab1b3950fba4)
GPU 2: Tesla C2075 (UUID: GPU-38f8d367-fb09-76d6-ae39-90aeb286e83e)
GPU 3: Tesla C2075 (UUID: GPU-271a5abe-433e-e72b-a9b1-855934defca8)


... so let the testing begin!

## I. NumPy

In [6]:
%timeit -n 10 a_np + b_np

10 loops, best of 3: 77.9 ms per loop
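
The `%timeit -n 10` output above reports the best of three repeats, each averaged over 10 executions. As a rough sketch of the same measurement with only the standard library: the helper name `best_of` and the list-based stand-in workload are assumptions for illustration and are not part of the notebook.

```python
import timeit


def best_of(func, number=10, repeat=3):
    """Return the best per-call time in milliseconds.

    Each of the `repeat` runs calls `func` `number` times; the fastest
    run is divided by `number` (mirrors '10 loops, best of 3').
    """
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return 1e3 * min(totals) / number


# Stand-in workload: pure Python lists instead of the NumPy arrays above.
a = [0.5] * 100000
b = [0.25] * 100000


def add_lists():
    return [x + y for x, y in zip(a, b)]


print('{:.3f} ms per loop'.format(best_of(add_lists)))
```

Taking the minimum rather than the mean reduces the influence of one-off disturbances such as other processes on the machine.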


## II. Cython

In [7]:
%load_ext Cython

In [8]:
%%cython --name add_cython

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
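def add(double[::1] a, double[::1] b, double[::1] r):
    # Py_ssize_t is the native index type for typed memoryviews
    cdef Py_ssize_t n = a.shape[0]
    cdef Py_ssize_t i
    for i in range(n):  # Cython compiles this to a plain C loop
        r[i] = a[i] + b[i]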

In [9]:
import add_cython

r_np = np.empty_like(a_np)

In [10]:
%%timeit
add_cython.add(a_np, b_np, r_np)

10 loops, best of 3: 56.7 ms per loop


In [11]:
# Check on CPU with NumPy:
print(r_np - (a_np + b_np))
print(np.linalg.norm(r_np - (a_np + b_np)))

[ 0.  0.  0. ...,  0.  0.  0.]
0.0


## III. PyOpenCL

### First on the CPU:

In [12]:
ctx = cl.create_some_context(interactive=True)
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE
)

Choose platform:
[0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ff4057ca430>
[1] <pyopencl.Platform 'NVIDIA CUDA' at 0x30b7dd0>
Choice [0]:0
Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.


In [13]:
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

prg = cl.Program(ctx, """
#pragma OPENCL EXTENSION cl_amd_fp64 : enable // AMD-specific fp64 extension
#pragma OPENCL EXTENSION cl_khr_fp64 : enable // Khronos standard fp64 extension (e.g. NVIDIA)
    
__kernel void sum(
    __global const double *a_g, __global const double *b_g, __global double *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)

event = prg.sum(queue, a_np.shape, None, a_g, b_g, res_g)
event.wait()

print('Computation took {:.3f} ms.'.format(
    1e-6*(event.profile.end - event.profile.start)))

res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)

# Check on CPU with NumPy:
print(res_np - (a_np + b_np))
print(np.linalg.norm(res_np - (a_np + b_np)))



Computation took 41.399 ms.
[ 0.  0.  0. ...,  0.  0.  0.]
0.0


### Then on the GPU:

In [14]:
ctx = cl.create_some_context(interactive=True)
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE
)

Choose platform:
[0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ff4057ca430>
[1] <pyopencl.Platform 'NVIDIA CUDA' at 0x30b7dd0>
Choice [0]:1
Choose device(s):
[0] <pyopencl.Device 'Tesla C2075' on 'NVIDIA CUDA' at 0x30b7eb0>
[1] <pyopencl.Device 'Tesla C2075' on 'NVIDIA CUDA' at 0x30b7f40>
[2] <pyopencl.Device 'Tesla C2075' on 'NVIDIA CUDA' at 0x30b7fd0>
[3] <pyopencl.Device 'Tesla C2075' on 'NVIDIA CUDA' at 0x30b8060>
Choice, comma-separated [0]:1
Set the environment variable PYOPENCL_CTX='1:1' to avoid being asked again.


In [15]:
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

prg = cl.Program(ctx, """
#pragma OPENCL EXTENSION cl_amd_fp64 : enable // AMD-specific fp64 extension
#pragma OPENCL EXTENSION cl_khr_fp64 : enable // Khronos standard fp64 extension (e.g. NVIDIA)
    
__kernel void sum(
    __global const double *a_g, __global const double *b_g, __global double *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()

res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)

event = prg.sum(queue, a_np.shape, None, a_g, b_g, res_g)
event.wait()

print('Computation took {:.3f} ms.'.format(
    1e-6*(event.profile.end - event.profile.start)))

res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)

# Check on CPU with NumPy:
print(res_np - (a_np + b_np))
print(np.linalg.norm(res_np - (a_np + b_np)))

Computation took 2.553 ms.
[ 0.  0.  0. ...,  0.  0.  0.]
0.0


## IV. PyCUDA

In [16]:
a_pyc, b_pyc = gp.to_gpu(a_np), gp.to_gpu(b_np)

### Simple GPUArray adding:

In [17]:
%%timeit 
global r_pyc
r_pyc = a_pyc + b_pyc

The slowest run took 72.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 4.47 ms per loop


In [18]:
# Check on CPU with NumPy:
print(r_pyc.get() - (a_np + b_np))
print(np.linalg.norm(r_pyc.get() - (a_np + b_np)))

[ 0.  0.  0. ...,  0.  0.  0.]
0.0


### With an ElementwiseKernel:

In [19]:
add = ElementwiseKernel(
    'double* a, double* b, double* r',
    'r[i] = a[i] + b[i];'
)

r_pyc = gp.empty_like(a_pyc)

In [20]:
%%timeit
add(a_pyc, b_pyc, r_pyc)
context.synchronize()

The slowest run took 112.97 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 3.02 ms per loop


In [21]:
# Check on CPU with NumPy:
print(r_pyc.get() - (a_np + b_np))
print(np.linalg.norm(r_pyc.get() - (a_np + b_np)))

[ 0.  0.  0. ...,  0.  0.  0.]
0.0


### And finally with a SourceModule:

In [22]:
mod = SourceModule('''
__global__ void add(int n, double* a, double* b, double* r) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x) {
        r[i] = a[i] + b[i];
    }
}
''')

add_sm = mod.get_function('add')

add_sm.prepare('iPPP')

def idivup(a, b):
    ''' Compute int(a)//int(b) and round up to next integer if a%b != 0 '''
    a = np.int32(a)
    b = np.int32(b)
    z = (a // b + 1) if (a % b != 0) else (a // b)
    return int(z)
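
The kernel above uses a grid-stride loop, so every element is processed exactly once whether or not the grid covers the whole array. A host-side sketch in pure Python can make this visible. The names `ceil_div` and `grid_stride_indices` are hypothetical helpers introduced for illustration; `ceil_div` is an assumed integer-only equivalent of `idivup` for positive arguments.

```python
def ceil_div(a, b):
    """Ceiling division for positive integers: smallest k with k * b >= a."""
    return (a + b - 1) // b


def grid_stride_indices(n, grid_dim, block_dim):
    """Emulate on the host which indices the CUDA threads would visit."""
    visited = []
    stride = block_dim * grid_dim  # total number of threads in the grid
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            while i < n:  # same condition as the kernel's for loop
                visited.append(i)
                i += stride
    return visited


n, block = 1000, 256

# Grid large enough to cover the array: each thread handles at most one index.
full = grid_stride_indices(n, ceil_div(n, block), block)

# Deliberately small grid: threads loop and still cover every index once.
small = grid_stride_indices(n, 2, block)

print(sorted(full) == list(range(n)), sorted(small) == list(range(n)))
```

With the launch configuration used in the notebook (`idivup(len(a_np), 256)` blocks of 256 threads), each thread executes the loop body at most once; the loop only matters when the grid is smaller than the array.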

In [23]:
%%timeit
add_sm.prepared_call(
    # grid, block:
    (idivup(len(a_np), 256), 1, 1), (256, 1, 1),
    # in- and outputs:
    np.int32(len(a_np)), a_pyc.gpudata, b_pyc.gpudata, r_pyc.gpudata
)
context.synchronize()

100 loops, best of 3: 2.64 ms per loop


In [24]:
# Check on CPU with NumPy:
print(r_pyc.get() - (a_np + b_np))
print(np.linalg.norm(r_pyc.get() - (a_np + b_np)))

[ 0.  0.  0. ...,  0.  0.  0.]
0.0


## Conclusion

Results:

- CPU NumPy: 77.9 ms
- CPU Cython: 56.7 ms
- CPU PyOpenCL: 41.4 ms
- GPU PyOpenCL: 2.6 ms
- GPU PyCUDA GPUArray: 4.5 ms
- GPU PyCUDA ElementwiseKernel: 3.0 ms
- GPU PyCUDA SourceModule: 2.6 ms
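
Since the sum is limited by memory traffic rather than arithmetic, the timings can also be read as an effective memory bandwidth. The sketch below assumes that exactly three arrays of 10 million doubles are streamed per call (two reads, one write) and ignores caching and allocation overhead; the function name `effective_bandwidth_gb_s` is a hypothetical helper.

```python
n = 10000000                # elements per array, as in a_np and b_np
bytes_moved = 3 * n * 8     # assumption: two reads + one write of float64

# Timings in milliseconds, copied from the results list above.
timings_ms = {
    'CPU NumPy': 77.9,
    'CPU Cython': 56.7,
    'CPU PyOpenCL': 41.4,
    'GPU PyOpenCL': 2.6,
    'GPU PyCUDA GPUArray': 4.5,
    'GPU PyCUDA ElementwiseKernel': 3.0,
    'GPU PyCUDA SourceModule': 2.6,
}


def effective_bandwidth_gb_s(time_ms):
    """Bytes moved divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (time_ms * 1e-3) / 1e9


for name, t in timings_ms.items():
    print('{:<30s} {:6.1f} GB/s'.format(name, effective_bandwidth_gb_s(t)))
```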

On the GPU, PyOpenCL and PyCUDA (with its `SourceModule`) yield essentially the same timing (about 2.6 ms). The higher-level abstractions provided by PyCUDA (the `ElementwiseKernel` and direct `GPUArray` addition) are slower. PyOpenCL involves more "clutter" in terms of object handling than PyCUDA, but it offers the chance to run on the CPU as well, using the same source code!

On the CPU, PyOpenCL is faster than NumPy by a factor of about 1.9, while Cython is faster by a factor of about 1.4. Hence, PyOpenCL is the best choice among the CPU options tested here.
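
The CPU speedup factors follow directly from the measured timings; a short sketch of the arithmetic, with NumPy as the baseline:

```python
# CPU timings in milliseconds, copied from the results list.
cpu_timings_ms = {
    'CPU NumPy': 77.9,
    'CPU Cython': 56.7,
    'CPU PyOpenCL': 41.4,
}

baseline = cpu_timings_ms['CPU NumPy']

# Speedup = baseline time / measured time (larger means faster than NumPy).
speedups = {name: baseline / t for name, t in cpu_timings_ms.items()}

for name, s in speedups.items():
    print('{:<14s} {:.2f}x'.format(name, s))
```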

### So, all in all...
...even for the present simple summation example, PyOpenCL is a viable way to reduce the total amount of code to maintain for GPU and CPU (~30 lines of PyOpenCL vs. 17 lines with Cython + 25 lines with PyCUDA).