<a href="https://colab.research.google.com/github/applejxd/colaboratory/blob/master/others/rapid_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Speeding up with numba

### njit: numba's nopython mode

[An example of a heavy computation](https://qiita.com/gyu-don/items/9d223b007ca620e95abc): the recursive Ackermann function.

In [27]:
import sys
sys.setrecursionlimit(100000)

def ack(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))
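
As a sanity check on the recursion, small arguments can be compared against the well-known closed forms ack(1, n) = n + 2, ack(2, n) = 2n + 3 and ack(3, n) = 2^(n+3) - 3. This is a minimal sketch: `ack` is repeated so the block is self-contained, and `check_closed_forms` is a hypothetical helper name. Only small n is used, so the default recursion limit is sufficient.

```python
def ack(m, n):
    """Ackermann function, same definition as above."""
    if m == 0:
        return n + 1
    if n == 0:
        return ack(m - 1, 1)
    return ack(m - 1, ack(m, n - 1))


def check_closed_forms(max_n=4):
    """Compare ack(m, n) with the closed forms for m = 1, 2, 3."""
    for n in range(max_n + 1):
        assert ack(1, n) == n + 2
        assert ack(2, n) == 2 * n + 3
        assert ack(3, n) == 2 ** (n + 3) - 3
    return True


check_closed_forms()
print(ack(3, 3))  # 2**6 - 3 = 61
```

By the same formula, the ack(3, 10) call timed in this section should return 2^13 - 3 = 8189.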

Measure the baseline (pure Python) execution time.

In [28]:
import time
from contextlib import contextmanager

@contextmanager
def timer(target: str):
    t = time.perf_counter()
    yield None
    # Display with 3 significant digits
    print(f"[{target}]: {time.perf_counter() - t: .3g} s")

with timer("ack(3, 10)"):
    ack(3, 10)

[ack(3, 10)]:  7.62 s
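
The `timer` above prints nothing if the timed body raises, because the code after `yield` is skipped. A possible variant, sketched here under the hypothetical name `timer_safe`, uses try/finally so the report is always printed, and also hands the elapsed time back to the caller through a small dict.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer_safe(target: str):
    """Like timer, but reports even when the body raises (hypothetical name)."""
    result = {"elapsed": None}
    t = time.perf_counter()
    try:
        yield result
    finally:
        # Runs on normal exit and on exceptions alike
        result["elapsed"] = time.perf_counter() - t
        print(f"[{target}]: {result['elapsed']:.3g} s")


with timer_safe("sleep 0.01") as r:
    time.sleep(0.01)

# The measured value stays available after the block
print(r["elapsed"] > 0)
```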


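Speed up with njit. [Type inference is automatic, and polymorphism is supported](https://numba.readthedocs.io/en/stable/user/jit.html): the function is compiled on its first call for the argument types it receives.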

In [29]:
from numba import njit

with timer("define njit function"):
    @njit(cache=True)
    def lazy_ack(m, n):
        if m == 0:
            return n + 1
        if n == 0:
            return lazy_ack(m - 1, 1)
        return lazy_ack(m - 1, lazy_ack(m, n - 1))

# Includes compilation time
with timer("1st try"):
    lazy_ack(3, 10)

# Excludes compilation time
with timer("2nd try"):
    lazy_ack(3, 10)

[define njit function]:  0.0647 s
[1st try]:  1.36 s
[2nd try]:  0.229 s


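Passing a type signature compiles the function at definition time (eager compilation). The total time does not change, because compilation simply moves from the first call to the definition.
Polymorphism is also lost, so there is no need to use this unless, for example, you want to measure execution speed separately from compilation.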

In [30]:
with timer("define njit function"):
    @njit("int32(int32, int32)", cache=True)
    def eager_ack(m, n):
        if m == 0:
            return n + 1
        if n == 0:
            return eager_ack(m - 1, 1)
        return eager_ack(m - 1, eager_ack(m, n - 1))

# First call
with timer("compile time included"):
    eager_ack(3, 10)

# Already compiled at definition time, so later calls take about the same time
with timer("compile time excluded"):
    eager_ack(3, 10)

[define njit function]:  0.396 s
[compile time included]:  0.23 s
[compile time excluded]:  0.242 s


### Parallelization & fastmath


[An example of a double loop that is easy to verify](https://qiita.com/AnchorBlues/items/59a8543765549fe7bac0)
\begin{equation}
 S_L=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1} (i-j)
\end{equation}
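
The result is easy to verify because the two index sums cancel exactly, so $S_L = 0$ for every $L$:
\begin{equation}
 S_L=\sum_{i=0}^{L-1}\sum_{j=0}^{L-1} i-\sum_{i=0}^{L-1}\sum_{j=0}^{L-1} j
 =L\cdot\frac{L(L-1)}{2}-L\cdot\frac{L(L-1)}{2}=0
\end{equation}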

In [31]:
def double_sum(size):
    sum = 0
    for i in range(size):
        for j in range(size):
            sum += i - j
    return sum

size = 10000

with timer("pure python"):
    # Confirm that the result is 0
    print(f"ans={double_sum(size)}")

ans=0
[pure python]:  8.34 s


The same loop with numba.

In [32]:
with timer("define njit function"):
    @njit("int32(int32)", cache=True)
    def double_sum_njit(size):
        sum = 0
        for i in range(size):
            for j in range(size):
                sum += i - j
        return sum

with timer("njit enabled"):
    print(f"ans={double_sum_njit(size)}")

[define njit function]:  0.0749 s
ans=0
[njit enabled]:  3.21e-05 s


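To speed things up further, use the [additional options](https://numba.readthedocs.io/en/stable/user/performance-tips.html).

The next example parallelizes the outer loop with prange. In this case, however, the overhead of multicore execution outweighs the gain, so the code does not get faster.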

In [33]:
from numba import prange

with timer("define njit function"):
    @njit("int32(int32)", cache=True, parallel=True)
    def double_sum_parallel(size):
        sum = 0
        # Use prange for the loop that should be parallelized
        for i in prange(size):
            for j in range(size):
                sum += i - j
        return sum

with timer("njit with parallel"):
    print(f"ans={double_sum_parallel(size)}")

[define njit function]:  0.641 s
ans=0
[njit with parallel]:  0.00178 s


fastmath is also available, similar to the fast-math optimization flags of C/C++ compilers.

In [34]:
with timer("define njit function"):
    @njit("int32(int32)", cache=True, parallel=True, fastmath=True)
    def double_sum_fast(size):
        sum = 0
        for i in prange(size):
            for j in range(size):
                sum += i - j
        return sum

with timer("njit with parallel & fastmath"):
    print(f"ans={double_sum_fast(size)}")

[define njit function]:  0.765 s
ans=0
[njit with parallel & fastmath]:  0.00192 s
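
fastmath relaxes strict IEEE 754 semantics for floating-point operations, so any effect is more likely to show in a floating-point reduction than in the integer sum above. The following is a minimal sketch, assuming numba is installed. `sum_inverse_squares` is a hypothetical example function, and the try/except fallback exists only so that the block also runs as plain Python when numba is absent.

```python
import math

try:
    from numba import njit, prange
except ImportError:
    # Fallback for illustration only: without numba, run as plain Python.
    prange = range

    def njit(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit(parallel=True, fastmath=True)
def sum_inverse_squares(n):
    total = 0.0
    # Floating-point reduction over a parallel loop
    for i in prange(1, n + 1):
        x = float(i)
        total += 1.0 / (x * x)
    return total


approx = sum_inverse_squares(100_000)
print(f"approx={approx:.6f}, pi^2/6={math.pi ** 2 / 6:.6f}")
```

Because fastmath allows floating-point operations to be reordered, the last digits of the result may differ from a strictly sequential sum. Whether it is actually faster depends on the machine and should be measured with `timer`.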


### numba.cuda

[GPU-aware programming](https://co-crea.jp/wp-content/uploads/2016/07/File_2.pdf) is required:
- [How to get a thread's position within the grid and block](https://numba.pydata.org/numba-doc/latest/cuda/kernels.html#absolute-positions)
- [How to transfer data between the CPU and the GPU](https://numba.pydata.org/numba-doc/latest/cuda/memory.html)

In [35]:
from numba import cuda
import numpy as np


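# Kernel function
@cuda.jit
def add_kernel(a, b, c):
    # Absolute index of this thread in the 1D grid
    i = cuda.grid(1)
    # Guard against threads beyond the end of the array
    if i < c.size:
        c[i] = a[i] + b[i]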

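# Launcher (host-side) function
def add_arrays(a, b, threads_per_block=256):
    # threads_per_block is the number of threads per block (typically 128 to 512)
    # Compute the number of blocks to launch (ceiling division)
    blocks = (a.size + threads_per_block - 1) // threads_per_block

    # Allocate device memory for the result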
    result = cuda.to_device(np.zeros_like(a))
    add_kernel[blocks, threads_per_block](
        cuda.to_device(a), cuda.to_device(b), result)
    return result.copy_to_host()

array_size = 100000000
a = np.ones(array_size, dtype=np.float32)
b = np.ones(array_size, dtype=np.float32)

with timer("CPU computation"):
    a + b

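# Warm-up call so that JIT compilation is excluded from the timing
add_arrays(a, b)
# The timing includes the host-to-device and device-to-host transfers
with timer("GPU computation"):
    add_arrays(a, b)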

[CPU computation]:  0.193 s
[GPU computation]:  0.503 s
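
The launch arithmetic in `add_arrays` can be checked on the CPU without a GPU. The sketch below simulates a 1D launch in plain Python, where each inner loop iteration plays the role of one thread and `i` is computed the way `cuda.grid(1)` computes it. `blocks_for` and `simulate_add` are hypothetical names. The array size used above is an exact multiple of 256, so every launched thread has a valid index there. For other sizes the last block has surplus threads, which is when a bounds guard in the kernel matters.

```python
def blocks_for(n, threads_per_block=256):
    """Ceiling division: smallest block count whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def simulate_add(a, b, threads_per_block=256):
    """CPU-only simulation of a 1D launch; each iteration plays one thread."""
    n = len(a)
    c = [0.0] * n
    for block_idx in range(blocks_for(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            # Same value that cuda.grid(1) returns inside a kernel
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # surplus threads in the last block do nothing
                c[i] = a[i] + b[i]
    return c


print(blocks_for(100_000_000))  # 390625, an exact multiple of 256
print(blocks_for(1000))         # 4, the last block is only partly used
print(simulate_add([1.0] * 1000, [2.0] * 1000)[:3])
```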


## Speeding up with concurrent.futures

### The original API

Check the number of CPUs.

In [36]:
import os

print(f"CPU count is {os.cpu_count()}.")
# sched_getaffinity(0) returns the set of CPUs the current process may run on
print(f"CPU for current thread is {len(os.sched_getaffinity(0))}.")

import multiprocessing
print(f"CPU count by multiprocessing is {multiprocessing.cpu_count()}.")

CPU count is 2.
CPU for current thread is 2.
CPU count by multiprocessing is 2.
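
`os.sched_getaffinity` is not available on every platform, so a portable script may want a fallback. The following is a minimal sketch; `available_cpus` is a hypothetical helper name.

```python
import os


def available_cpus():
    """CPUs usable by this process; falls back to the machine-wide count."""
    if hasattr(os, "sched_getaffinity"):  # missing on some platforms
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1  # cpu_count() may return None


print(f"usable CPUs: {available_cpus()}")
```

The returned value is a natural starting point for `max_workers` with CPU-bound tasks.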


ProcessPoolExecutor is effective for CPU-bound tasks.

The following sample searches for the best value of max_workers.

In [37]:
import numpy as np
import concurrent.futures

def do_something(size):
    return np.dot(np.ones((size, size)), np.ones((size, size)))

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = [1000] * 10

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
            futures = [executor.submit(do_something, task) for task in tasks]
            results = [f.result() for f in concurrent.futures.as_completed(futures)]

[max_workers=1]:  0.715 s
[max_workers=2]:  0.746 s
[max_workers=4]:  1.03 s
[max_workers=8]:  0.989 s
[max_workers=16]:  1.18 s
[max_workers=32]:  2.02 s
[max_workers=64]:  3.19 s


The code above used the submit method.
submit is convenient when tasks are added one at a time, for example in a for loop.
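
`as_completed` yields futures in the order they finish, not the order they were submitted, so results have to be matched back to their inputs explicitly. A common pattern, sketched below, keeps a dict from each future to its input. `slow_square` and its delays are hypothetical. ThreadPoolExecutor is used only to keep the sketch light, and the same pattern should apply to ProcessPoolExecutor.

```python
import concurrent.futures
import time


def slow_square(x, delay):
    """Hypothetical task: sleep for `delay` seconds, then return x squared."""
    time.sleep(delay)
    return x * x


inputs = [(1, 0.03), (2, 0.02), (3, 0.01)]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # Remember which input each future belongs to
    future_to_input = {executor.submit(slow_square, x, d): x for x, d in inputs}
    results = {}
    for future in concurrent.futures.as_completed(future_to_input):
        results[future_to_input[future]] = future.result()

print(results)
```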

When the inputs are already collected in a list or similar, the map method is more convenient. It is used as follows.

In [38]:
import numpy as np
import concurrent.futures

def do_something(size):
    return np.dot(np.ones((size, size)), np.ones((size, size)))

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = [1000] * 10

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ProcessPoolExecutor(max_workers) as executor:
            results = executor.map(do_something, tasks)

[max_workers=1]:  1.12 s
[max_workers=2]:  1.2 s
[max_workers=4]:  1.04 s
[max_workers=8]:  0.965 s
[max_workers=16]:  1.14 s
[max_workers=32]:  1.44 s
[max_workers=64]:  2.02 s


ThreadPoolExecutor is effective for I/O-bound tasks.

In [39]:
import numpy as np
import requests
import concurrent.futures

def do_something(url):
    response = requests.get(url)
    return response

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = ['https://google.com', 'https://facebook.com', 'https://twitter.com',
         'https://www.youtube.com/', 'https://www.amazon.com/',
         'https://github.com/']

for max_workers in worker_values:
    with timer(f"max_workers={max_workers}"):
        with concurrent.futures.ThreadPoolExecutor(max_workers) as executor:
            results = executor.map(do_something, tasks)

[max_workers=1]:  1.26 s
[max_workers=2]:  0.828 s
[max_workers=4]:  0.71 s
[max_workers=8]:  0.505 s
[max_workers=16]:  0.655 s
[max_workers=32]:  0.502 s
[max_workers=64]:  0.584 s


The result is a generator, so it can be processed with a for loop.

In [40]:
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as executor:
    results = executor.map(do_something, tasks)

print(f"type_of_results={type(results)}")
for task, result in zip(tasks, results):
    print(f"{task}: code {result.status_code}")

type_of_results=<class 'generator'>
https://google.com: code 200
https://facebook.com: code 200
https://twitter.com: code 200
https://www.youtube.com/: code 200
https://www.amazon.com/: code 503
https://github.com/: code 200
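
Two properties of `map` are worth keeping in mind when consuming its results: they are yielded in input order, and an exception raised inside a task is re-raised only when that task's result is taken from the iterator. The sketch below uses a hypothetical `parse_int` task in place of network requests.

```python
import concurrent.futures


def parse_int(text):
    """Hypothetical task that fails for non-numeric input."""
    return int(text)


with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(parse_int, ["1", "2", "x", "4"])

collected = []
try:
    for value in results:
        collected.append(value)
except ValueError as exc:
    # Raised here, at iteration time, not when map() was called
    print(f"stopped at the failing task: {exc}")

print(collected)  # [1, 2]
```

Results after the failing task are not delivered. If every outcome is needed, one option is to catch exceptions inside the task function and return them as values.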


### Using tqdm.contrib

tqdm.contrib.concurrent.thread_map combines a thread pool with a tqdm progress bar. The notation is also simple.

In [41]:
import requests
from tqdm.contrib.concurrent import thread_map


def do_something(url):
    response = requests.get(url)
    return response

worker_values = [1, 2, 4, 8, 16, 32, 64]
tasks = ['https://google.com', 'https://facebook.com', 'https://twitter.com',
         'https://www.youtube.com/', 'https://www.amazon.com/',
         'https://github.com/']

results = list(thread_map(do_something, tasks))

  0%|          | 0/6 [00:00<?, ?it/s]

tqdm.contrib.concurrent.process_map works the same way.

In [42]:
from tqdm.contrib.concurrent import process_map

def do_something(size):
    return np.dot(np.ones((size, size)), np.ones((size, size)))

tasks = [1000] * 10
results = list(process_map(do_something, tasks))

  0%|          | 0/10 [00:00<?, ?it/s]

### Inter-process communication

[Parallelization using shared memory](https://docs.python.org/ja/3/library/multiprocessing.html#sharing-state-between-processes).

How to use this together with [concurrent.futures.ProcessPoolExecutor](https://rinoguchi.net/2020/07/python-multiprocess.html) is unclear.

In [43]:
from multiprocessing import Process, Value, Array
from tqdm.contrib.concurrent import process_map

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

num = Value('d', 0.0)
arr = Array('i', range(10))

p = Process(target=f, args=(num, arr))
p.start()
p.join()

print(num.value)
print(arr[:])

3.1415927
[0, -1, -2, -3, -4, -5, -6, -7, -8, -9]


[Manager is slower, but it manages the shared variables automatically, so it is easier to use](https://docs.python.org/ja/3/library/multiprocessing.html#sharing-state-between-processes)

In [44]:
from multiprocessing import Process, Manager
from tqdm.contrib.concurrent import process_map
import time

def f(d, l):
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    time.sleep(1)
    l.reverse()

with Manager() as manager:
    d = manager.dict()
    l = manager.list(range(100))

    p = Process(target=f, args=(d, l))
    p.start()
    p.join()

    print(d)
    print(l)

{1: '1', '2': 2, 0.25: None}
[99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 79, 78, 77, 76, 75, 74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


## Speeding up data loading

In [45]:
%%capture
!pip install dask[dataframe] orjson

In [46]:
import pandas as pd
import dask.dataframe as dd
import json
import orjson

### CSV

Prepare the CSV data.

In [47]:
import urllib.request

url = "https://github.com/mwaskom/seaborn-data/raw/refs/heads/master/diamonds.csv"

with urllib.request.urlopen(url) as web_file:
  with open("diamonds.csv", 'wb') as local_file:
    local_file.write(web_file.read())

Read with pandas.

In [48]:
%%time
df = pd.read_csv("diamonds.csv")

CPU times: user 55.7 ms, sys: 16.4 ms, total: 72.2 ms
Wall time: 99.3 ms


Read in parallel with dask. For a file this small, the run below is slower than pandas.

In [49]:
%%time
df = dd.read_csv("diamonds.csv", dtype={'table': 'float64'}).compute()

CPU times: user 366 ms, sys: 59.4 ms, total: 426 ms
Wall time: 863 ms


### JSON

Fetch the JSON data.

In [50]:
url = "https://github.com/json-iterator/test-data/raw/refs/heads/master/large-file.json"

with urllib.request.urlopen(url) as web_file:
  with open("large-file.json", 'wb') as local_file:
    local_file.write(web_file.read())

Load with the standard json library.

In [51]:
%%time
with open("large-file.json", mode="r") as f:
    data = json.load(f)

CPU times: user 452 ms, sys: 153 ms, total: 605 ms
Wall time: 755 ms


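Load with orjson, which aims to be faster. In the run below, loading was not faster than the standard library. Note that the file is opened in text mode here, while orjson.loads also accepts bytes, which avoids building an intermediate str.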

In [52]:
%%time
with open("large-file.json", mode="r") as f:
    data = orjson.loads(f.read())

CPU times: user 569 ms, sys: 235 ms, total: 804 ms
Wall time: 875 ms


Compare writing as well. First, the standard library.

In [53]:
%%time
with open("large-file.json", mode="w") as f:
    json.dump(data, f, ensure_ascii=False)

CPU times: user 1.86 s, sys: 91.3 ms, total: 1.95 s
Wall time: 2.46 s


Next, orjson.

In [54]:
%%time
with open("large-file.json", mode="wb") as f:
    f.write(orjson.dumps(data))

CPU times: user 36.5 ms, sys: 40.5 ms, total: 76.9 ms
Wall time: 114 ms
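
orjson works on bytes in both directions: `orjson.dumps` returns bytes and `orjson.loads` accepts bytes, so the file can be opened in binary mode for writing and for reading. The following is a minimal round-trip sketch on a small hypothetical payload. The try/except fallback to the standard json module exists only so the block also runs where orjson is not installed.

```python
import json
import os
import tempfile

try:
    import orjson
    loads, dumps = orjson.loads, orjson.dumps
except ImportError:
    # Fallback for illustration only: emulate the bytes-based API with json.
    loads = json.loads

    def dumps(obj):
        return json.dumps(obj, ensure_ascii=False).encode("utf-8")

# Hypothetical payload, including non-ASCII text
data = {"name": "データ", "values": [1, 2.5, None, True]}

path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, mode="wb") as f:  # dumps returns bytes, so write in binary mode
    f.write(dumps(data))

with open(path, mode="rb") as f:  # pass the raw bytes straight to loads
    restored = loads(f.read())

print(restored == data)
```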
