CUDA example of matrix-matrix multiplication with Numba, taken from [this tutorial](https://nyu-cds.github.io/python-numba/05-cuda/).

In [1]:
#https://nyu-cds.github.io/python-numba/05-cuda/
from __future__ import division
from numba import cuda

import numpy
import math
import sys
print(sys.version)

import numba
print(numba.__version__)
print(cuda.gpus)

3.6.6 | packaged by conda-forge | (default, Oct 12 2018, 14:08:43) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
0.40.0
<Managed Device 0>, <Managed Device 1>, <Managed Device 2>, <Managed Device 3>, <Managed Device 4>, <Managed Device 5>, <Managed Device 6>, <Managed Device 7>


In [None]:
!numba -s  # output removed for security

In [2]:
cuda.select_device( 0 )

<weakproxy at 0x7f64f77cd728 to Device at 0x7f651f45e9b0>

In [3]:
@cuda.jit
def matmul(A, B, C):
    """Perform matrix multiplication C = A * B, one thread per element of C
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp
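
The kernel above assigns one thread to each element of `C`. As a rough illustration, the same indexing can be emulated on the CPU with plain loops, which gives a reference result to compare against without needing a GPU. The name `matmul_reference` is hypothetical and not part of the original tutorial; the input shapes and values are the ones this notebook uses.

```python
import numpy

def matmul_reference(A, B):
    """CPU emulation of the kernel: each (i, j) pair stands in for one thread."""
    C = numpy.zeros((A.shape[0], B.shape[1]))
    for i in range(C.shape[0]):        # plays the role of the first grid index
        for j in range(C.shape[1]):    # plays the role of the second grid index
            tmp = 0.
            for k in range(A.shape[1]):
                tmp += A[i, k] * B[k, j]
            C[i, j] = tmp
    return C

A = numpy.full((24, 12), 3, numpy.float64)  # matrix containing all 3's
B = numpy.full((12, 22), 4, numpy.float64)  # matrix containing all 4's
C_ref = matmul_reference(A, B)
print(C_ref.shape, C_ref[0, 0])  # (24, 22) 144.0
```

Every entry is 3 * 4 summed over 12 terms, so a correct GPU result should also be 144 everywhere.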

In [4]:
# Initialize the data arrays
A = numpy.full((24, 12), 3, numpy.float64) # matrix containing all 3's
B = numpy.full((12, 22), 4, numpy.float64) # matrix containing all 4's

# Copy the arrays to the device
A_global_mem = cuda.to_device(A)
B_global_mem = cuda.to_device(B)

# Allocate memory on the device for the result
C_global_mem = cuda.device_array((24, 22))

# Configure the blocks
threadsperblock = (16, 16)
blockspergrid_x = int(math.ceil(A.shape[0] / threadsperblock[0]))
blockspergrid_y = int(math.ceil(B.shape[1] / threadsperblock[1]))
blockspergrid = (blockspergrid_x, blockspergrid_y)

# Start the kernel 
matmul[blockspergrid, threadsperblock](A_global_mem, B_global_mem, C_global_mem)

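# Time the copy of the result back to the host
# (this measures only the device-to-host transfer, not the kernel itself)
%timeit C_global_mem.copy_to_host()

# %timeit does not keep variables assigned in the timed statement,
# so copy once more to keep the result
C = C_global_mem.copy_to_host()
#print(C)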

32.3 µs ± 57.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
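
The launch configuration in the cell above rounds the grid size up so that the blocks cover the whole output matrix. A minimal sketch of that arithmetic in plain Python, assuming the same 24 x 22 result and 16 x 16 blocks (the helper name `blocks_per_grid` is hypothetical):

```python
import math

def blocks_per_grid(shape, threadsperblock):
    """Round up so that blocks * threads covers every element along each axis."""
    return tuple(int(math.ceil(n / t)) for n, t in zip(shape, threadsperblock))

threadsperblock = (16, 16)
grid = blocks_per_grid((24, 22), threadsperblock)

# Total threads launched versus elements of C that actually need computing
total_threads = grid[0] * threadsperblock[0] * grid[1] * threadsperblock[1]
idle_threads = total_threads - 24 * 22
print(grid, total_threads, idle_threads)  # (2, 2) 1024 496
```

The 496 surplus threads fall outside `C`, which is why the kernel checks `i < C.shape[0] and j < C.shape[1]` before writing.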


Numba `vectorize` with a CUDA target (incomplete)

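```
from numba import vectorize

# CPU version: a ufunc must return its result
@vectorize(["float32(float32, float32)"])
def multiply(p, q):
    return p * q

# CUDA version: same body, compiled for the GPU (this redefines multiply)
@vectorize(["float32(float32, float32)"], target='cuda')
def multiply(p, q):
    return p * q
```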