<a href="https://colab.research.google.com/github/applejxd/colaboratory/blob/master/rapid_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Speeding up with numba

[An example of a heavy computation with numpy](https://qiita.com/gyu-don/items/9d223b007ca620e95abc)

In [1]:
import sys
sys.setrecursionlimit(100000)

def ack(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))

Measure the baseline computation time in plain Python.

In [2]:
import time
from contextlib import contextmanager

@contextmanager
def timer(target: str):
    t = time.perf_counter()
    yield None
    # Print with 3 significant digits
    print(f"[{target}]: {time.perf_counter() - t: .3g} s")

with timer("ack(3, 10)"):
    ack(3, 10)

[ack(3, 10)]:  15.1 s


## njit: numba's nopython mode

In [3]:
from numba import njit

@njit
def ack(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))

# First call: includes compilation time
with timer("compile time included"):
    ack(3, 10)

# Second call: compilation time excluded
with timer("compile time excluded"):
    ack(3, 10)

[compile time included]:  0.808 s
[compile time excluded]:  0.289 s
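
The gap between the two timings above is the one-time JIT compilation on the first call. The pattern of discarding a warm-up call before timing can be wrapped in a small helper. The following is a minimal sketch using only the standard library; `time_after_warmup` and `slow_square` are hypothetical names, and `slow_square` is a plain-Python stand-in for a jitted function.

```python
import time


def time_after_warmup(func, *args, repeats=5):
    """Call func once as a warm-up, then return the best of `repeats` timings in seconds."""
    func(*args)  # warm-up call: for a jitted function, compilation would happen here
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def slow_square(n):
    # Hypothetical stand-in workload (plain Python, not jitted)
    return sum(i * i for i in range(n))


elapsed = time_after_warmup(slow_square, 100_000)
print(f"[slow_square after warm-up]: {elapsed: .3g} s")
```

Taking the minimum of several runs reduces the influence of noise from other processes, which matters when the measured times are in the millisecond range.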


### Parallelization & fastmath


For further speedups, use the [additional options](https://numba.readthedocs.io/en/stable/user/performance-tips.html).

In [4]:
import numpy as np
from numba import prange


@njit
def sum_of_squares(arr):
    s = 0
    for i in range(arr.shape[0]):
        s += arr[i] ** 2
    return s

@njit(parallel=True)
def sum_of_squares_parallel(arr):
    s = 0
    for i in prange(arr.shape[0]):
        s += arr[i] ** 2
    return s

@njit(parallel=True, fastmath=True)
def sum_of_squares_fast(arr):
    s = 0
    for i in prange(arr.shape[0]):
        s += arr[i] ** 2
    return s

arr = np.random.randn(1000000)

sum_of_squares(arr)
with timer("njit only"):
    sum_of_squares(arr)
    
sum_of_squares_parallel(arr)
with timer("njit with parallel"):
    sum_of_squares_parallel(arr)

sum_of_squares_fast(arr)
with timer("njit with parallel & fastmath"):
    sum_of_squares_fast(arr)

[njit only]:  0.00137 s
[njit with parallel]:  0.000741 s
[njit with parallel & fastmath]:  0.000598 s
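
Both `parallel=True` and `fastmath=True` can change the order in which the floating-point additions are carried out (per-thread partial sums, reassociation for vectorization), so the three functions may return slightly different values. A rough plain-Python illustration follows; it does not use numba, and `chunked_sum` is a hypothetical helper that only mimics per-thread partial sums.

```python
import math
import random

# Floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def chunked_sum(values, n_chunks):
    """Sum each chunk separately, then add the partial sums (hypothetical helper)."""
    size = math.ceil(len(values) / n_chunks)
    partials = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partials)


random.seed(0)
squares = [random.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]

sequential = sum(squares)
chunked = chunked_sum(squares, n_chunks=4)

# The two results agree to within rounding error, but need not be bit-identical.
print(math.isclose(sequential, chunked, rel_tol=1e-9))
```

When comparing the outputs of the jitted variants, a tolerance-based check such as `math.isclose` or `np.isclose` is therefore safer than `==`.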


### numba.cuda

[GPU-aware programming](https://co-crea.jp/wp-content/uploads/2016/07/File_2.pdf) is required:
- [How to obtain a thread's position within the grid and block](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html#absolute-positions)
- [How to transfer data between the CPU and the GPU](https://numba.pydata.org/numba-doc/latest/cuda/memory.html)
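
For a one-dimensional launch, the absolute position returned by `cuda.grid(1)` corresponds to `threadIdx.x + blockIdx.x * blockDim.x`, and the grid contains more threads than array elements unless the array size is an exact multiple of the block size. The following CPU-only sketch emulates that indexing in plain Python to show why a kernel needs a bounds check; `emulate_launch` is a hypothetical helper, not part of numba.

```python
def emulate_launch(n_elements, threads_per_block):
    """Return (blocks, in_range, out_of_range) for a 1D launch emulated on the CPU."""
    # Ceiling division: enough blocks to cover every element
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    in_range, out_of_range = [], []
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # Same formula as the absolute position of a thread in a 1D grid
            i = thread_idx + block_idx * threads_per_block
            (in_range if i < n_elements else out_of_range).append(i)
    return blocks, in_range, out_of_range


blocks, in_range, out_of_range = emulate_launch(n_elements=10, threads_per_block=4)
print(blocks)        # 3
print(in_range)      # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(out_of_range)  # [10, 11]
```

Threads whose index lands in `out_of_range` must not touch the arrays, which is what a guard of the form `if i < c.size:` in a kernel achieves.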

In [5]:
from numba import cuda
import numpy as np
import sys
sys.setrecursionlimit(100000)


# Kernel function
@cuda.jit
def add_kernel(a, b, c):
    i = cuda.grid(1)
    # Guard against threads beyond the end of the array
    if i < c.size:
        c[i] = a[i] + b[i]

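# Launch (host-side) function
def add_arrays(a, b, threads_per_block=256):
    # threads_per_block is the number of threads per block (typically 128 to 512)
    # Compute the number of GPU blocks needed to cover every element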
    blocks = (a.size + threads_per_block - 1) // threads_per_block

    # Allocate device memory to hold the result
    result = cuda.to_device(np.zeros_like(a))
    add_kernel[blocks, threads_per_block](
        cuda.to_device(a), cuda.to_device(b), result)
    return result.copy_to_host()

array_size = 100000000
a = np.ones(array_size, dtype=np.float32)
b = np.ones(array_size, dtype=np.float32)

with timer("CPU computation"):
    a + b

add_arrays(a, b)
with timer("GPU computation"):
    add_arrays(a, b)

[CPU computation]:  0.147 s
[GPU computation]:  0.552 s


## Speeding up with concurrent.futures

Check the number of CPUs.

In [6]:
import os

print(f"CPU count is {os.cpu_count()}.")
print(f"CPU for current thread is {len(os.sched_getaffinity(0))}.")

import multiprocessing
print(f"CPU count by multiprocessing is {multiprocessing.cpu_count()}.")

CPU count is 2.
CPU for current thread is 2.
CPU count by multiprocessing is 2.
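
`os.sched_getaffinity` is only available on some Unix platforms (Linux, for example), so the cell above raises `AttributeError` elsewhere. A minimal portable sketch, with `available_cpus` as a hypothetical helper name, that falls back to `os.cpu_count()`:

```python
import os


def available_cpus():
    """Number of CPUs this process may use, falling back to the total CPU count."""
    if hasattr(os, "sched_getaffinity"):
        # Respects CPU affinity restrictions placed on the current process
        return len(os.sched_getaffinity(0))
    # os.cpu_count() may return None when the count cannot be determined
    return os.cpu_count() or 1


print(f"Usable CPUs: {available_cpus()}")
```

The returned value is a reasonable starting point for `max_workers` with CPU-bound work, though the best setting still has to be measured.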


ProcessPoolExecutor is effective for CPU-bound tasks.

The following sample searches for the best value of max_workers.

In [10]:
import numpy as np
import concurrent.futures

def do_something(size):
    return np.dot(np.ones((size, size)), np.ones((size, size)))

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = [1000] * 10

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
            futures = [executor.submit(do_something, task) for task in tasks]
            results = [f.result() for f in concurrent.futures.as_completed(futures)]

[max_workers=1]:  0.891 s
[max_workers=2]:  1.45 s
[max_workers=4]:  1.4 s
[max_workers=8]:  1.13 s
[max_workers=16]:  1.17 s
[max_workers=32]:  1.49 s
[max_workers=64]:  2.06 s


The example above used the submit method.
submit is convenient when tasks are added one at a time, for example inside a for loop.

When the inputs are already collected in a list or similar container, the map method is convenient. Its usage is shown below.

In [12]:
import numpy as np
import concurrent.futures

def do_something(size):
    return np.dot(np.ones((size, size)), np.ones((size, size)))

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = [1000] * 10

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
            # map returns an iterator of results (not futures); consume it to collect them
            results = list(executor.map(do_something, tasks))

[max_workers=1]:  0.873 s
[max_workers=2]:  1.15 s
[max_workers=4]:  1.57 s
[max_workers=8]:  1.23 s
[max_workers=16]:  1.17 s
[max_workers=32]:  1.55 s
[max_workers=64]:  1.97 s
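
One practical difference between the two styles: `executor.map` yields results in the order of the inputs, whereas `as_completed` yields futures in the order they finish. The small sketch below uses a hypothetical `wait_and_return` task. It uses `ThreadPoolExecutor` and `time.sleep` only to keep the example light; the same `Executor` interface applies to `ProcessPoolExecutor`.

```python
import concurrent.futures
import time


def wait_and_return(delay):
    # Hypothetical task: sleep for `delay` seconds, then return it
    time.sleep(delay)
    return delay


delays = [0.3, 0.1, 0.2]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map: results come back in input order
    mapped = list(executor.map(wait_and_return, delays))

    # submit + as_completed: results come back in completion order
    futures = [executor.submit(wait_and_return, d) for d in delays]
    completed = [f.result() for f in concurrent.futures.as_completed(futures)]

print(mapped)     # [0.3, 0.1, 0.2]
print(completed)  # typically [0.1, 0.2, 0.3]
```

If each result has to be matched to its input, `map` keeps that pairing implicitly; with `submit`, a dictionary from future to input is a common way to keep track.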


ThreadPoolExecutor is effective for I/O-bound tasks.

In [8]:
import numpy as np
import requests
import concurrent.futures

def do_something(url):
    response = requests.get(url)
    return response.content

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = ['https://google.com', 'https://facebook.com', 'https://twitter.com',
         'https://www.youtube.com/', 'https://www.amazon.co.jp/',
         'https://github.com/']

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
            futures = [executor.submit(do_something, task) for task in tasks]
            results = [f.result() for f in concurrent.futures.as_completed(futures)]

[max_workers=1]:  3.01 s
[max_workers=2]:  1.58 s
[max_workers=4]:  1.15 s
[max_workers=8]:  1.09 s
[max_workers=16]:  0.697 s
[max_workers=32]:  0.963 s
[max_workers=64]:  0.709 s
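
The network timings above vary from run to run. The effect of threads on I/O-bound work can also be reproduced offline. The sketch below replaces the HTTP request with `time.sleep`, an assumption standing in for time spent waiting on I/O, and `fake_io` is a hypothetical name.

```python
import concurrent.futures
import time


def fake_io(delay):
    # Stand-in for a blocking I/O call; sleeping releases the GIL, as real I/O waits do
    time.sleep(delay)
    return delay


tasks = [0.1] * 8

# Run the tasks one after another
start = time.perf_counter()
sequential_results = [fake_io(t) for t in tasks]
sequential_time = time.perf_counter() - start

# Run the same tasks on a pool of 8 threads
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    threaded_results = list(executor.map(fake_io, tasks))
threaded_time = time.perf_counter() - start

print(f"[sequential]: {sequential_time: .3g} s")
print(f"[8 threads]:  {threaded_time: .3g} s")
```

Because the threads spend nearly all of their time waiting, the waits overlap and the threaded run should finish in roughly the time of a single task rather than the sum of all of them.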
