### Common terms

• host: the CPU

• device: the GPU

• host memory: the system main memory

• device memory: onboard memory on a GPU card

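• kernel: a GPU function launched from the host and executed on the device

• device function: a GPU function that can only be called from a kernel or from another device function

• blockspergrid: the number of thread blocks in the grid

• threadsperblock: the number of threads in each block. Threads in the same block can cooperate through shared memory. The value is usually a multiple of 32 such as 128, 256, 512 or 1024, and is often chosen as large as the kernel's register and shared-memory use allows; 1024 is the per-block limit on current GPUs
    ** Both parameters can be a single integer or a tuple of 1, 2 or 3 integers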
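
A common way to pick `blockspergrid` for a 1D problem is to divide the number of elements by `threadsperblock` and round up. The sketch below is plain Python with no GPU required; the helper name `launch_config` and the default of 256 threads are illustrative assumptions, not part of Numba.

```python
def launch_config(n_elements, threadsperblock=256):
    """Return (blockspergrid, threadsperblock) for a 1D grid covering n_elements.

    Hypothetical helper for illustration; not a Numba function.
    """
    # Integer ceiling division, so a partially filled last block is still launched.
    blockspergrid = (n_elements + threadsperblock - 1) // threadsperblock
    return blockspergrid, threadsperblock


print(launch_config(1000))       # (4, 256)
print(launch_config(1024, 128))  # (8, 128)
```

The resulting pair would be used as `kernel[blockspergrid, threadsperblock](...)`. Because 4 * 256 = 1024 threads are launched for 1000 elements, the kernel needs a bounds check before touching the array.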

### Computing the current thread's position
    cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    thread index within the block + block index * number of threads per block
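
To see why this formula gives each thread a unique position, the arithmetic can be replayed on the host. This is a plain Python sketch that only imitates the indexing; the function name `global_index` and the grid of 3 blocks with 4 threads each are assumptions chosen for illustration.

```python
def global_index(thread_idx, block_idx, block_dim):
    # Same arithmetic as cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    return thread_idx + block_idx * block_dim


block_dim, grid_dim = 4, 3  # hypothetical launch: 3 blocks of 4 threads

# Enumerate every (block, thread) pair the way the hardware would schedule them.
indices = [global_index(t, b, block_dim)
           for b in range(grid_dim)
           for t in range(block_dim)]

print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Every value from 0 to 11 appears exactly once, which is what lets each thread work on its own array element.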

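### Getting the absolute position directly
    numba.cuda.grid(ndim): returns the absolute position of the current thread in the entire grid; ndim is the number of dimensions (1, 2 or 3)
    numba.cuda.gridsize(ndim): returns the absolute size of the entire grid in threads, one value per dimension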
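
These two values are often combined into a strided loop: each thread starts at `cuda.grid(1)` and advances by `cuda.gridsize(1)`, so a fixed number of threads can cover an array of any length. The sketch below imitates that loop in plain Python; `strided_indices` and the sizes used are hypothetical and stand in for what a kernel would compute.

```python
def strided_indices(start, stride, n):
    # Mirrors the kernel loop: for i in range(cuda.grid(1), n, cuda.gridsize(1))
    return list(range(start, n, stride))


n, total_threads = 10, 4  # hypothetical: 10 elements, 4 threads in the whole grid

# Elements visited by each thread id.
work = {tid: strided_indices(tid, total_threads, n) for tid in range(total_threads)}
print(work)  # {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}

# Together the threads visit every element exactly once.
covered = sorted(i for idxs in work.values() for i in idxs)
print(covered == list(range(n)))  # True
```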

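### Memory management
    numba.cuda.device_array(shape, dtype=np.float64, strides=None, order='C', stream=0): allocates an empty (uninitialized) device array, like numpy.empty()
    numba.cuda.device_array_like(ary, stream=0): same, taking shape and dtype from ary
    numba.cuda.to_device(obj, stream=0, copy=True, to=None): copies a host array to the device and returns the device array
    DeviceNDArray.copy_to_host(ary=None, stream=0): copies a device array back to the host and returns the host array
    numba.cuda.as_cuda_array(obj, sync=True): creates a DeviceNDArray from obj without copying the data
    numba.cuda.is_cuda_array(obj): tests whether obj defines __cuda_array_interface__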

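### Device arrays
    Class: class numba.cuda.cudadrv.devicearray.DeviceNDArray(shape, strides, dtype, stream=0, gpu_data=None)
    Methods: copy_to_host(ary=None, stream=0)
        is_c_contiguous(): returns True if the array is C-contiguous (row-major)
        is_f_contiguous(): returns True if the array is Fortran-contiguous (column-major)
        ravel(order='C', stream=0): flattens the array
        reshape(*newshape, **kws): returns the array with a new shape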

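### Pinned memory
    numba.cuda.pinned(*arylist): a context manager that temporarily pins (page-locks) a sequence of host ndarrays, so the operating system cannot page them out and transfers to the device can be faster
    numba.cuda.pinned_array(shape, dtype=np.float64, strides=None, order='C'): allocates an empty array in pinned memory
    numba.cuda.pinned_array_like(ary): same, taking shape and dtype from ary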

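### Mapped memory
    numba.cuda.mapped(*arylist, **kws): a context manager that temporarily maps a sequence of host ndarrays into the device address space
    numba.cuda.mapped_array(shape, dtype=np.float64, strides=None, order='C', stream=0, portable=False, wc=False): allocates an empty mapped array, accessible from both host and device
    numba.cuda.mapped_array_like(ary, stream=0, portable=False, wc=False): same, taking shape and dtype from ary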

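### Shared memory and thread synchronization
   numba.cuda.shared.array(shape, type): allocates an array shared by all threads in a block; shape is an integer or a tuple of integers  
   <font color=#FF0000 > numba.cuda.syncthreads(): synchronizes all threads in the same block </font>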

### Local memory
    numba.cuda.local.array(shape, type): allocates an array private to a single thread

### Constant memory
    numba.cuda.const.array_like(arr): read-only memory; kernels cannot write to it

### Memory deallocation
    numba.cuda.defer_cleanup(): a context manager that temporarily defers memory deallocation
    Example:
    with cuda.defer_cleanup():
        # all cleanup is deferred in here
        do_speed_critical_code()
    # cleanup can occur here

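### Kernel functions
    Cannot return a value; all inputs and outputs must be passed in the argument list

### Device functions
    Can return a value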

### Unsupported Python constructs
    • Exception handling (try .. except, try .. finally)
    • Context management (the with statement)
    • Comprehensions (either list, dict, set or generator comprehensions)
    • Generator (any yield statements)

### Supported Python constructs
    • raise
    • assert
    • Printing of strings, integers, and floats

### Supported built-in functions and modules
    • abs()
    • bool
    • complex
    • enumerate()
    • float
    • int: only the one-argument form
    • len()
    • min(): only the multiple-argument form
    • max(): only the multiple-argument form
    • pow()
    • range
    • round()
    • zip()
#### cmath module
#### math module
#### operator module

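### Supported atomic operations
    class numba.cuda.atomic: a namespace for atomic operations; call one of its methods, such as add, max or min
    Example: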

In [35]:
from numba import cuda
import numpy as np
@cuda.jit
def max_example_3d(result, values):
    """
    Find the maximum value in values and store in result[0, 1, 2].
    Both result and values are 3d arrays.
    """
    i, j, k = cuda.grid(3)
    # Atomically store to result[0,1,2] from values[i, j, k]
    cuda.atomic.max(result, (0, 1, 2), values[i, j, k])
arr = np.random.rand(1000).reshape(10,10,10)
result = np.zeros((3, 3, 3), dtype=np.float64)
max_example_3d[(2, 2, 2), (5, 5, 5)](result, arr)
print(result[0, 1, 2], '==', np.max(arr))

0.9997580056812965 == 0.9997580056812965


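### Random numbers
    numba.cuda.random.create_xoroshiro128p_states(n, seed, subsequence_start=0, stream=0): returns a new device array of n initialized random number generator states
    numba.cuda.random.init_xoroshiro128p_states(states, seed, subsequence_start=0, stream=0): initializes an existing array of generator states
    The functions below draw random numbers inside a kernel. They take the state array and an index into it (usually the thread id) and return a single random number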
    numba.cuda.random.xoroshiro128p_normal_float32(states, index)
    numba.cuda.random.xoroshiro128p_normal_float64(states, index)
    numba.cuda.random.xoroshiro128p_uniform_float32(states, index)
    numba.cuda.random.xoroshiro128p_uniform_float64(states, index)

### Generating random numbers with a 1D grid

In [3]:
from __future__ import print_function, absolute_import
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
@cuda.jit
def compute_pi(rng_states, iterations, out):
    """Estimate pi by Monte Carlo sampling; each thread stores its estimate in out[thread_id]"""
    thread_id = cuda.grid(1)
    # Compute pi by drawing random (x, y) points and finding what
    # fraction lie inside a unit circle
    inside = 0
    for i in range(iterations):
        x = xoroshiro128p_uniform_float32(rng_states, thread_id)
        y = xoroshiro128p_uniform_float32(rng_states, thread_id)
        if x**2 + y**2 <= 1.0:
            inside += 1
    out[thread_id] = 4.0 * inside / iterations
import time
start = time.time()
threads_per_block = 64
blocks = 24
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)
compute_pi[blocks, threads_per_block](rng_states, 10000, out)
print('pi:', out.mean())
print(time.time()-start)

pi: 3.1416733
0.2082970142364502


### Generating random numbers with a multi-dimensional grid

In [1]:
from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
xoroshiro128p_uniform_float32)
import numpy as np

@cuda.jit
def random_3d(arr, rng_states):
    # Per-dimension thread indices and strides
    startx, starty, startz = cuda.grid(3)
    stridex, stridey, stridez = cuda.gridsize(3)

    # Linearized thread index
    tid = (startz * stridey * stridex) + (starty * stridex) + startx

    # Use strided loops over the array to assign a random value to each entry
    for i in range(startz, arr.shape[0], stridez):
        for j in range(starty, arr.shape[1], stridey):
            for k in range(startx, arr.shape[2], stridex):
                arr[i, j, k] = xoroshiro128p_uniform_float32(rng_states, tid)

# Array dimensions
X, Y, Z = 70, 900, 719

# Block and grid dimensions
bx, by, bz = 8, 8, 8
gx, gy, gz = 16, 16, 16

# Total number of threads
nthreads = bx * by * bz * gx * gy * gz

# Initialize a state for each thread
rng_states = create_xoroshiro128p_states(nthreads, seed=1)

# Generate random numbers
arr = cuda.device_array((X, Y, Z), dtype=np.float32)
random_3d[(gx, gy, gz), (bx, by, bz)](arr, rng_states)
arr = arr.copy_to_host()
print(arr.mean())

0.50002056


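### Device management
***Device selection***   
    numba.cuda.select_device(device_id): selects the device to use on the current thread
    numba.cuda.close(): closes all contexts in the current thread  
***Listing devices***    
    numba.cuda.gpus
    numba.cuda.cudadrv.devices.gpus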

### Example 1: Matrix multiplication

In [2]:
from numba import cuda, float32
import numpy as np
import math