<a href="https://colab.research.google.com/github/applejxd/colaboratory/blob/master/numba.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Speeding up with numba

[An example of heavy computation with numpy](https://qiita.com/gyu-don/items/9d223b007ca620e95abc)

In [1]:
import sys
sys.setrecursionlimit(100000)

def ack(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))

Measure the baseline run time in plain Python.

In [2]:
import time
from contextlib import contextmanager

@contextmanager
def timer():
    t = time.perf_counter()
    yield None
    print('Elapsed:', time.perf_counter() - t)

with timer():
    ack(3, 10)

Elapsed: 15.715511877999994
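To get a feel for why `ack(3, 10)` works as a stress test, the sketch below counts recursive calls for small inputs in plain Python. The helpers `ack_counted` and `ack_stats` are illustrative names, not part of the notebook. The check against `2 ** (n + 3) - 3` relies on the known closed form of the Ackermann function for `m = 3`.

```python
def ack_counted(m, n, counter):
    """Ackermann function that also counts how many times it is called."""
    counter[0] += 1
    if m == 0:
        return n + 1
    if n == 0:
        return ack_counted(m - 1, 1, counter)
    return ack_counted(m - 1, ack_counted(m, n - 1, counter), counter)


def ack_stats(m, n):
    """Return (value, number_of_calls) for ack(m, n)."""
    counter = [0]  # a one-element list acts as a mutable call counter
    value = ack_counted(m, n, counter)
    return value, counter[0]


if __name__ == "__main__":
    # Keep n small: plain Python recursion gets slow and deep very quickly.
    for n in range(5):
        value, calls = ack_stats(3, n)
        print(f"ack(3, {n}) = {value}, calls = {calls}")
```

Printing the counts for a few values of `n` shows how quickly the amount of work grows, which is what makes the function useful for timing comparisons.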


Using numba's nopython mode (njit).

In [3]:
from numba import njit

@njit
def ack(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))

# First call: includes JIT compilation time
with timer():
    ack(3, 10)

# Second call: compilation is already done
with timer():
    ack(3, 10)

Elapsed: 1.094479265000004
Elapsed: 0.2952723059999869
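The recorded timings can be turned into a rough compilation overhead and speedup. The short sketch below only does arithmetic on the numbers printed above; treating the difference between the first and second njit call as the compilation cost is an approximation, and the variable names are illustrative.

```python
# Timings in seconds, copied from the notebook outputs above.
t_python = 15.715511877999994        # plain Python
t_first_call = 1.094479265000004     # njit, first call (includes compilation)
t_second_call = 0.2952723059999869   # njit, second call

# Approximate one-off compilation cost: difference between the two njit calls.
compile_overhead = t_first_call - t_second_call

# Speedup over plain Python, with and without the compilation cost.
speedup_first = t_python / t_first_call
speedup_steady = t_python / t_second_call

print(f"compile overhead ~ {compile_overhead:.2f} s")
print(f"speedup including compilation ~ {speedup_first:.1f}x")
print(f"speedup after compilation ~ {speedup_steady:.1f}x")
```

On this run the compiled function is roughly 53 times faster once compilation is out of the way, and still about 14 times faster when the compilation cost is included.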


See the [performance tips](https://numba.readthedocs.io/en/stable/user/performance-tips.html) in the numba documentation.
This cell tries parallelization and fastmath.

In [4]:
import numpy as np
from numba import prange


@njit
def sum_of_squares(arr):
    s = 0
    for i in range(arr.shape[0]):
        s += arr[i] ** 2
    return s

@njit(parallel=True)
def sum_of_squares_parallel(arr):
    s = 0
    for i in prange(arr.shape[0]):
        s += arr[i] ** 2
    return s

@njit(parallel=True, fastmath=True)
def sum_of_squares_fast(arr):
    s = 0
    for i in prange(arr.shape[0]):
        s += arr[i] ** 2
    return s

arr = np.random.randn(1000000)

sum_of_squares(arr)
with timer():
    sum_of_squares(arr)
    
sum_of_squares_parallel(arr)
with timer():
    sum_of_squares_parallel(arr)

sum_of_squares_fast(arr)
with timer():
    sum_of_squares_fast(arr)

Elapsed: 0.0013868989999963333
Elapsed: 0.0007452540000087993
Elapsed: 0.0026067459999978837
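A parallel reduction combines partial sums, and fastmath permits reordering of floating-point operations, so the three functions above may return values that differ in the last few bits. The pure-Python sketch below imitates that idea with chunked partial sums. `sum_of_squares_chunked` and `n_chunks` are hypothetical names, and the chunking is not a description of how numba actually schedules `prange` iterations.

```python
import math
import random


def sum_of_squares_sequential(values):
    """Left-to-right accumulation, like the plain njit loop."""
    s = 0.0
    for v in values:
        s += v * v
    return s


def sum_of_squares_chunked(values, n_chunks=4):
    """Accumulate per-chunk partial sums, then combine them.

    This only imitates the idea of a parallel reduction; it runs serially.
    """
    chunk = math.ceil(len(values) / n_chunks)
    partials = [
        sum_of_squares_sequential(values[start:start + chunk])
        for start in range(0, len(values), chunk)
    ]
    return sum(partials)


rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

seq = sum_of_squares_sequential(values)
par = sum_of_squares_chunked(values)

# Floating-point addition is not associative, so compare with a tolerance.
print(seq, par, math.isclose(seq, par, rel_tol=1e-9))
```

When checking a parallel or fastmath variant against the sequential one, a tolerance-based comparison such as `math.isclose` or `np.allclose` is therefore safer than `==`.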


numba.cuda requires [GPU-aware programming](https://co-crea.jp/wp-content/uploads/2016/07/File_2.pdf):
- [How to get a thread's position within the grid and block](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html#absolute-positions)
- [How to transfer data between the CPU and GPU](https://numba.pydata.org/numba-doc/latest/cuda/memory.html)

In [6]:
from numba import cuda
import numpy as np
import sys
sys.setrecursionlimit(100000)


# Kernel function
@cuda.jit
def add_kernel(a, b, c):
    i = cuda.grid(1)
    # Guard: the grid may contain more threads than array elements
    if i < c.size:
        c[i] = a[i] + b[i]

# Launcher function
def add_arrays(a, b):
    # Threads per block (typically 128 to 512)
    threads_per_block = 256
    # Number of blocks needed to cover every element (ceiling division)
    blocks = (a.size + threads_per_block - 1) // threads_per_block

    # Allocate device memory for the result
    result = cuda.to_device(np.zeros_like(a))
    add_kernel[blocks, threads_per_block](
        cuda.to_device(a), cuda.to_device(b), result)
    return result.copy_to_host()

array_size = 100000000
a = np.ones(array_size, dtype=np.float32)
b = np.ones(array_size, dtype=np.float32)

with timer():
    a + b

add_arrays(a, b)
with timer():
    add_arrays(a, b)

Elapsed: 0.15795063500000595
Elapsed: 0.3968715780000025
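The launch configuration used above can be checked on the CPU. The sketch below is plain Python, not CUDA: `launch_config` and `simulate_add` are hypothetical helpers that reproduce the ceiling division for the block count and the absolute thread index `block_idx * threads_per_block + thread_idx`, which is the value `cuda.grid(1)` yields inside a one-dimensional kernel.

```python
def launch_config(n_elements, threads_per_block=256):
    """Return (blocks, total_threads) using ceiling division."""
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, blocks * threads_per_block


def simulate_add(a, b, threads_per_block=256):
    """Run the kernel body once per simulated thread, serially on the CPU."""
    n = len(a)
    c = [0.0] * n
    blocks, _ = launch_config(n, threads_per_block)
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Absolute position of this simulated thread in the 1D grid.
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # threads past the end of the array do nothing
                c[i] = a[i] + b[i]
    return c


blocks, total_threads = launch_config(1000)
print(blocks, total_threads)  # more threads than elements when 256 does not divide n

result = simulate_add([1.0] * 1000, [2.0] * 1000)
print(len(result), result[0], result[-1])
```

With `array_size = 100000000` the element count is an exact multiple of 256, so every thread maps to a valid index in this particular run. For sizes that are not a multiple of the block size, some threads fall past the end of the array and the bounds check matters. Note also that the timed `add_arrays` call includes the host-to-device and device-to-host copies, which likely accounts for much of the gap to the numpy timing.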


## Speeding up with concurrent.futures

In [None]:
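from concurrent import futures

import numpy as np


def run_ack(pair):
    # Module-level helper: ProcessPoolExecutor pickles the callable,
    # and a lambda cannot be pickled.
    return ack(pair[0], pair[1])


data_list = np.ones((10, 2)) @ np.array([[2, 0], [0, 9]])
print(data_list)

with timer():
    for data in data_list:
        ack(data[0], data[1])

with timer():
    with futures.ProcessPoolExecutor(max_workers=3) as executor:
        # list() collects the results and re-raises any worker error here
        results = list(executor.map(run_ack, data_list))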


[[2. 9.]
 [2. 9.]
 [2. 9.]
 [2. 9.]
 [2. 9.]
 [2. 9.]
 [2. 9.]
 [2. 9.]
 [2. 9.]
 [2. 9.]]
Elapsed: 0.31789324100009253
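`ProcessPoolExecutor` sends each callable and its arguments to the worker processes with `pickle`, so anything passed to `executor.map` has to be picklable. The sketch below checks this for two kinds of callables; `can_pickle` is a hypothetical helper written for this illustration.

```python
import math
import pickle


def can_pickle(obj):
    """Return True if pickle.dumps accepts obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A built-in function is pickled by reference to its importable name.
print("math.sqrt:", can_pickle(math.sqrt))

# A lambda has no importable name, so pickling it fails.
print("lambda:", can_pickle(lambda x: x + 1))
```

A function defined with `def` at module level is pickled by its qualified name, which makes it the usual choice when mapping work over processes. In notebooks this also depends on the platform's process start method, so behavior may differ outside Linux environments such as Colab.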
