<a href="https://colab.research.google.com/github/applejxd/colaboratory/blob/master/rapid_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Speeding up with numba

### njit: numba's nopython mode

[An example of a heavy computation (the recursive Ackermann function)](https://qiita.com/gyu-don/items/9d223b007ca620e95abc)

In [1]:
import sys
sys.setrecursionlimit(100000)

def ack(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))

Measure the run time of the plain Python version.

In [2]:
import time
from contextlib import contextmanager

@contextmanager
def timer(target: str):
    t = time.perf_counter()
    yield None
    # Print the elapsed time with 3 significant digits
    print(f"[{target}]: {time.perf_counter() - t: .3g} s")

with timer("ack(3, 10)"):
    ack(3, 10)

[ack(3, 10)]:  13 s


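Speed it up with njit. [Types are inferred automatically, and polymorphism (one compiled specialization per argument type) is supported.](https://numba.readthedocs.io/en/stable/user/jit.html)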

In [3]:
from numba import njit

with timer("define njit function"):
    @njit(cache=True)
    def lazy_ack(m, n):
        if m == 0:
            return n + 1
        if n == 0:
            return lazy_ack(m - 1, 1)
        return lazy_ack(m - 1, lazy_ack(m, n - 1))

# Includes compile time
with timer("1st try"):
    lazy_ack(3, 10)

# Does not include compile time
with timer("2nd try"):
    lazy_ack(3, 10)

[define njit function]:  0.138 s
[1st try]:  0.825 s
[2nd try]:  0.292 s


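Specifying the types at definition time compiles the function right away (eager compilation). The total time does not change.
This also gives up polymorphism, so there is no need to use it unless, for example, you want to measure execution speed without compile time.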

In [4]:
with timer("define njit function"):
    @njit("int32(int32, int32)", cache=True)
    def eager_ack(m, n):
        if m == 0:
            return n + 1
        if n == 0:
            return eager_ack(m - 1, 1)
        return eager_ack(m - 1, eager_ack(m, n - 1))

# First call
with timer("compile time included"):
    eager_ack(3, 10)

# Already compiled at definition time, so the second and later calls take about the same time
with timer("compile time excluded"):
    eager_ack(3, 10)

[define njit function]:  0.371 s
[compile time included]:  0.315 s
[compile time excluded]:  0.301 s


### Parallelization & fastmath


[A double-loop example that is easy to verify](https://qiita.com/AnchorBlues/items/59a8543765549fe7bac0)
\begin{equation}
 S_L=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1} (i-j)
\end{equation}
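
The expected value can be checked by hand. Splitting the summand into its two terms shows that they cancel, so $S_L = 0$ for every $L$:
\begin{equation}
 S_L=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1} i-\sum_{i=0}^{L-1}\sum_{j=0}^{L-1} j
 =L\cdot\frac{L(L-1)}{2}-L\cdot\frac{L(L-1)}{2}=0
\end{equation}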

In [17]:
def double_sum(size):
    sum = 0
    for i in range(size):
        for j in range(size):
            sum += i - j
    return sum

size = 10000

with timer("pure python"):
    # Confirm that the result is 0
    print(f"ans={double_sum(size)}")

ans=0
[pure python]:  8.83 s


The same computation with numba.

In [7]:
with timer("define njit function"):
    @njit("int32(int32)", cache=True)
    def double_sum_njit(size):
        sum = 0
        for i in range(size):
            for j in range(size):
                sum += i - j
        return sum

with timer("njit enabled"):
    print(f"ans={double_sum_njit(size)}")

[define njit function]:  0.0712 s
ans=0
[njit enabled]:  3.81e-05 s


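For further speedups, use the [additional options](https://numba.readthedocs.io/en/stable/user/performance-tips.html).

The next example parallelizes the loop with prange. In this case, however, the overhead of multi-core execution outweighs the gain, so the run does not get faster.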

In [8]:
from numba import prange

with timer("define njit function"):
    @njit("int32(int32)", cache=True, parallel=True)
    def double_sum_parallel(size):
        sum = 0
        # Use prange for the loop that should be parallelized
        for i in prange(size):
            for j in range(size):
                sum += i - j
        return sum

with timer("njit with parallel"):
    print(f"ans={double_sum_parallel(size)}")

[define njit function]:  0.583 s
ans=0
[njit with parallel]:  0.000816 s
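
As a rough mental model of what a parallel reduction over the outer loop does, the sketch below splits the rows into chunks, computes one partial sum per chunk, and adds the partial sums at the end. It is plain, sequential Python written only for illustration: the names `double_sum_chunk`, `double_sum_chunked`, and `n_chunks` are hypothetical, and how numba actually schedules the iterations of a prange loop is not shown here.

```python
def double_sum_chunk(size, start, stop):
    """Partial sum over the outer-loop rows start <= i < stop."""
    total = 0
    for i in range(start, stop):
        for j in range(size):
            total += i - j
    return total


def double_sum_chunked(size, n_chunks):
    """Split the outer loop into n_chunks pieces and add the partial sums."""
    # Chunk boundaries, e.g. [0, 25, 50, 75, 100] for size=100, n_chunks=4
    bounds = [size * k // n_chunks for k in range(n_chunks + 1)]
    partials = [double_sum_chunk(size, lo, hi)
                for lo, hi in zip(bounds[:-1], bounds[1:])]
    return sum(partials), partials


total, partials = double_sum_chunked(100, 4)
print(f"partials={partials}, total={total}")
```

Each chunk is independent of the others, which is what makes the outer loop a candidate for parallel execution. The extra work of splitting and recombining is also where overhead comes from when the loop body is cheap.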


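fastmath is also available, similar to the fast-math optimization of C++ compilers. It relaxes strict floating-point semantics, so it matters mainly for floating-point code, and the integer sum here changes little.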

In [9]:
with timer("define njit function"):
    @njit("int32(int32)", cache=True, parallel=True, fastmath=True)
    def double_sum_fast(size):
        sum = 0
        for i in prange(size):
            for j in range(size):
                sum += i - j
        return sum

with timer("njit with parallel & fastmath"):
    print(f"ans={double_sum_fast(size)}")

[define njit function]:  0.387 s
ans=0
[njit with parallel & fastmath]:  0.000772 s


### numba.cuda: running kernels on the GPU

[GPU-aware programming](https://co-crea.jp/wp-content/uploads/2016/07/File_2.pdf) is required:
- [How to obtain a thread's position within the grid and block](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html#absolute-positions)
- [How to transfer data between the CPU and the GPU](https://numba.pydata.org/numba-doc/latest/cuda/memory.html)
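
The index arithmetic behind the first point can be tried out on the CPU. The following is a minimal pure-Python sketch, not GPU code: `global_index` mirrors the one-dimensional absolute position (block index times block size plus thread index within the block), and `blocks_needed` is the usual ceiling division for the number of blocks. Both function names and the sizes used are made up for this illustration.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Absolute 1D position of a thread within the whole grid."""
    return block_idx * block_dim + thread_idx


def blocks_needed(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


n, threads_per_block = 1000, 256
blocks = blocks_needed(n, threads_per_block)
launched = blocks * threads_per_block
last = global_index(blocks - 1, threads_per_block, threads_per_block - 1)

print(f"blocks={blocks}, launched threads={launched}, "
      f"idle threads={launched - n}, last index={last}")
```

When the array length is not a multiple of the block size, some launched threads get an index past the end of the array, which is why kernels usually guard array accesses with a bounds check.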

In [10]:
from numba import cuda
import numpy as np
import sys
sys.setrecursionlimit(100000)


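# Kernel function
@cuda.jit
def add_kernel(a, b, c):
    i = cuda.grid(1)
    # Skip surplus threads when the size is not a multiple of the block size
    if i < c.size:
        c[i] = a[i] + b[i]

# Launch (host-side) function
def add_arrays(a, b, threads_per_block=256):
    # threads_per_block is the number of threads per block (typically 128 to 512)
    # Compute the number of blocks needed to cover every element
    blocks = (a.size + threads_per_block - 1) // threads_per_block

    # Allocate device memory for the result
    result = cuda.to_device(np.zeros_like(a))
    add_kernel[blocks, threads_per_block](
        cuda.to_device(a), cuda.to_device(b), result)
    return result.copy_to_host()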

array_size = 100000000
a = np.ones(array_size, dtype=np.float32)
b = np.ones(array_size, dtype=np.float32)

with timer("CPU computation"):
    a + b

# Warm-up call so that JIT compilation is not included in the timing below
add_arrays(a, b)
with timer("GPU computation"):
    add_arrays(a, b)

[CPU computation]:  0.149 s
[GPU computation]:  0.498 s


## Speeding up with concurrent.futures

Check the number of CPUs.

In [11]:
import os

print(f"CPU count is {os.cpu_count()}.")
print(f"CPU for current thread is {len(os.sched_getaffinity(0))}.")

import multiprocessing
print(f"CPU count by multiprocessing is {multiprocessing.cpu_count()}.")

CPU count is 2.
CPU for current thread is 2.
CPU count by multiprocessing is 2.
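
`os.sched_getaffinity` is not available on every platform, so a portable variant may be useful when the notebook runs outside Linux. The helper below is a small sketch under that assumption; the name `available_cpus` is hypothetical and not part of any library.

```python
import os


def available_cpus():
    """Number of CPUs the current process may use, with a portable fallback."""
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity restrictions where the platform supports them
        return len(os.sched_getaffinity(0))
    # os.cpu_count() can return None, so fall back to 1 in that case
    return os.cpu_count() or 1


print(f"Usable CPUs: {available_cpus()}")
```

The returned value is a natural starting point for `max_workers` when the tasks are CPU-bound.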


ProcessPoolExecutor is effective for CPU-bound tasks.

The following sample searches for the best value of max_workers.

In [12]:
import numpy as np
import concurrent.futures

def do_something(size):
    return np.dot(np.ones((size, size)), np.ones((size, size)))

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = [1000] * 10

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
            futures = [executor.submit(do_something, task) for task in tasks]
            results = [f.result() for f in concurrent.futures.as_completed(futures)]

[max_workers=1]:  0.866 s
[max_workers=2]:  0.954 s
[max_workers=4]:  1.01 s
[max_workers=8]:  1.18 s
[max_workers=16]:  1.73 s
[max_workers=32]:  1.47 s
[max_workers=64]:  1.77 s


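The example above used the submit method.
submit is convenient when tasks are added one at a time, for example in a for loop.

When the inputs are already collected in a list or similar, the map method is convenient. Usage is shown below.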

In [14]:
import numpy as np
import concurrent.futures

def do_something(size):
    return np.dot(np.ones((size, size)), np.ones((size, size)))

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = [1000] * 10

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
            # map returns an iterator over the results in input order, not Future objects
            results = executor.map(do_something, tasks)

[max_workers=1]:  0.912 s
[max_workers=2]:  0.92 s
[max_workers=4]:  1 s
[max_workers=8]:  1.04 s
[max_workers=16]:  1.12 s
[max_workers=32]:  1.92 s
[max_workers=64]:  2.28 s
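
One practical difference between the two methods is the order in which results arrive: map yields them in input order, while as_completed yields futures as they finish. The sketch below makes this visible with artificial delays. It uses ThreadPoolExecutor only so that it runs without a `__main__` guard, on the assumption that both executors share the same submit/map interface; `slow_square` and its delays are invented for the illustration.

```python
import concurrent.futures
import time


def slow_square(x):
    # Hypothetical delays: larger inputs finish earlier, so the two orders differ
    time.sleep(0.05 * (4 - x))
    return x * x


inputs = [1, 2, 3]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map: results come back in the order of the inputs
    mapped = list(executor.map(slow_square, inputs))

    # submit + as_completed: results come back in completion order
    futures = [executor.submit(slow_square, x) for x in inputs]
    completed = [f.result() for f in concurrent.futures.as_completed(futures)]

print(f"map:          {mapped}")
print(f"as_completed: {completed}")
```

If the results have to be matched back to their inputs, map is the simpler choice. as_completed fits better when each result should be handled as soon as it is ready.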


ThreadPoolExecutor is effective for I/O-bound tasks.

In [16]:
import numpy as np
import requests
import concurrent.futures

def do_something(url):
    response = requests.get(url)
    return response.content

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = ['https://google.com', 'https://facebook.com', 'https://twitter.com',
         'https://www.youtube.com/', 'https://www.amazon.co.jp/',
         'https://github.com/']

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
            futures = [executor.submit(do_something, task) for task in tasks]
            results = [f.result() for f in concurrent.futures.as_completed(futures)]

[max_workers=1]:  2.31 s
[max_workers=2]:  1.1 s
[max_workers=4]:  1 s
[max_workers=8]:  0.841 s
[max_workers=16]:  0.86 s
[max_workers=32]:  0.919 s
[max_workers=64]:  1.34 s
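
The same effect can be reproduced without network access by replacing each request with a sleep, which stands in for the time spent waiting on I/O. This is only a sketch under that assumption: `fake_download`, `run`, and the 0.2 s delay are invented for the illustration and do not measure real network behaviour.

```python
import concurrent.futures
import time


def fake_download(delay):
    # time.sleep stands in for waiting on a network response
    time.sleep(delay)
    return delay


def run(max_workers, delays):
    """Run all fake downloads and return (elapsed seconds, results)."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
        results = list(executor.map(fake_download, delays))
    return time.perf_counter() - start, results


delays = [0.2] * 4
serial_time, _ = run(1, delays)
threaded_time, results = run(4, delays)

print(f"max_workers=1: {serial_time:.2f} s")
print(f"max_workers=4: {threaded_time:.2f} s")
```

With one worker the waits add up, while with four workers they overlap. Adding workers beyond the number of tasks brings no further gain in this sketch.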
