<a href="https://colab.research.google.com/github/dk-wei/python-multiprocessing/blob/main/Python_Multiprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Reference 1: [Parallel programming in Python: multiprocessing (part 1)](https://www.kth.se/blogs/pdc/2019/02/parallel-programming-in-python-multiprocessing-part-1/)

Reference 2: [Parallel programming in Python: multiprocessing (part 2)](https://www.kth.se/blogs/pdc/2019/02/parallel-programming-in-python-multiprocessing-part-2/)

In [None]:
#!pip install ray[default]

In [None]:
import multiprocessing as mp
from tqdm.notebook import tqdm

In [99]:
#org_list = list(range(2000000))
org_list = [3,4,5,6,5,4,3,2,1,3,4,4,3,2,4,6,2,1,32,4,55,3,2,1,4,6,78,8,5,6,0,9,8,6,3,5,6,7,7,8,8,4,2]

# Single Processing

In [None]:
%%time
res1 = []
res2 = []

for i in tqdm(org_list):
    res1.append(i**2)
    res2.append(i**10)



CPU times: user 54.4 ms, sys: 7.69 ms, total: 62.1 ms
Wall time: 95.5 ms


# Multi-processing: the easy way
## Synchronous method

- `map` / `imap`: the worker function takes a single argument
- `starmap`: the worker function takes multiple arguments

Note that both `map` and `starmap` are **synchronous** methods: the call blocks until all results are ready. If a worker process runs out of work early, it sits idle while the slower workers finish, so performance can degrade when the workload is not well balanced among the worker processes.
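
To make the blocking behaviour concrete, here is a minimal sketch that contrasts `map` with `map_async`, a `Pool` method that is not used elsewhere in this notebook. The helper `slow_square` is hypothetical, and the sketch assumes a Unix-like system where the `fork` start method is available (as on Colab).

```python
import multiprocessing as mp
import time

def slow_square(x):
    time.sleep(0.01)  # stand-in for real work
    return x * x

# Assumption: the "fork" start method exists on this platform (Linux, Colab).
ctx = mp.get_context("fork")

with ctx.Pool(processes=2) as pool:
    # `map` returns only when all 8 results are ready.
    blocking = pool.map(slow_square, range(8))
    # `map_async` returns an AsyncResult right away; `get()` is where we block.
    pending = pool.map_async(slow_square, range(8))
    non_blocking = pending.get()

print(blocking == non_blocking)
```

Both calls produce the same list; the only difference is where the caller waits.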

In [52]:
import multiprocessing

# cores = multiprocessing.cpu_count()  # number of cores on this machine

# Define the worker function before creating any Pool,
# so that the forked worker processes can see it.
def func(item):
    return [item**2, item**10]

In [None]:
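pool = multiprocessing.Pool(processes=2)
# `imap` yields results lazily, so tqdm can update as they arrive;
# `total` gives the progress bar a known length.
sim_lst1 = list(tqdm(pool.imap(func, org_list), total=len(org_list)))

pool.close()
pool.join()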





In [None]:
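pool = multiprocessing.Pool(processes=2)
# `map` returns the complete list only after all work is done,
# so this progress bar just iterates over finished results.
sim_lst2 = list(tqdm(pool.map(func, org_list)))

pool.close()
pool.join()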





In [None]:
sim_lst3 = list(tqdm(map(func, org_list)))





In [None]:
sim_lst2 == sim_lst3

True

In [None]:
def power_n(x, n):
    return x ** n

In [None]:
pool = multiprocessing.Pool(processes=2)

result = pool.starmap(power_n, [(x, 2) for x in range(20)])
print(result)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]


## Asynchronous method

- `apply_async`

The Pool class also provides the `apply_async` method that makes **asynchronous** execution of the worker processes possible. Unlike the `map` method, which executes a computational routine over a list of inputs, the `apply_async` method executes the routine only once. Therefore, in the previous example, we would need to define another routine, `power_n_list`, that computes the values of a list of numbers raised to a particular power.
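
The paragraph names a `power_n_list` routine that this notebook never defines. A minimal sketch of what it could look like follows; the function body and the way the input is split into two halves are assumptions, and the `fork` start method is assumed to be available (as on Colab).

```python
import multiprocessing as mp

def power_n_list(x_list, n):
    # Hypothetical body: raise every number in the list to the power n.
    return [x ** n for x in x_list]

# Assumption: the "fork" start method exists on this platform (Linux, Colab).
ctx = mp.get_context("fork")

with ctx.Pool(processes=2) as pool:
    # Each `apply_async` call runs the routine once, on one list.
    first = pool.apply_async(power_n_list, (range(10), 2))
    second = pool.apply_async(power_n_list, (range(10, 20), 2))
    # `get()` blocks until the corresponding result is ready.
    result = first.get() + second.get()

print(result)
```

Because each submission handles a whole list, the caller decides how to divide the input, which is the main difference from `map`.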

In [None]:
nsteps = 10000000
dx = 1.0 / nsteps
pi = 0.0

In [None]:
for i in range(nsteps):
    x = (i + 0.5) * dx
    pi += 4.0 / (1.0 + x * x)
pi *= dx
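
The loop above appears to be the midpoint rule applied to a standard integral for $\pi$ (this is our reading of the code; the notebook does not state it), with $N$ standing for `nsteps` and $\Delta x$ for `dx`:

$$
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx \Delta x \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}, \qquad x_i = \left(i + \tfrac{1}{2}\right)\Delta x, \quad \Delta x = \frac{1}{N}
$$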

In [None]:
def calc_partial_pi(rank, nprocs, nsteps, dx):
    partial_pi = 0.0
    for i in range(rank, nsteps, nprocs):
        x = (i + 0.5) * dx
        partial_pi += 4.0 / (1.0 + x * x)
    partial_pi *= dx
    return partial_pi
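
`calc_partial_pi` divides the loop by striding: rank `r` handles indices `r`, `r + nprocs`, `r + 2 * nprocs`, and so on. The sketch below checks, without starting any processes, that the strides cover every index exactly once and that the partial sums add up to π. The small `nsteps` and the tolerance are choices made for this illustration.

```python
import math

def calc_partial_pi(rank, nprocs, nsteps, dx):
    partial_pi = 0.0
    for i in range(rank, nsteps, nprocs):
        x = (i + 0.5) * dx
        partial_pi += 4.0 / (1.0 + x * x)
    return partial_pi * dx

nsteps = 1000   # kept small so the check runs instantly
nprocs = 4
dx = 1.0 / nsteps

# Indices handled by each rank, flattened and sorted.
covered = sorted(i for rank in range(nprocs)
                 for i in range(rank, nsteps, nprocs))

# Run the "ranks" one after another in this process and add them up.
total = sum(calc_partial_pi(rank, nprocs, nsteps, dx) for rank in range(nprocs))

print(covered == list(range(nsteps)), abs(total - math.pi) < 1e-6)
```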

In [None]:
import multiprocessing as mp

nprocs = mp.cpu_count()
inputs = [(rank, nprocs, nsteps, dx) for rank in range(nprocs)]

In [None]:
# starmap

pool = mp.Pool(processes=nprocs)
result = pool.starmap(calc_partial_pi, inputs)
pi = sum(result)

In [None]:
# apply_async

multi_result = [pool.apply_async(calc_partial_pi, inp) for inp in inputs]

# The tasks were already submitted by `apply_async` above;
# `get()` blocks until each result is ready
result = [p.get() for p in multi_result]  
pi = sum(result)

## Summary

- `map` and `starmap` are synchronous methods.
- `map` and `starmap` guarantee the correct order of output.
- `starmap` and `apply_async` support multiple arguments.
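
The ordering guarantee can be seen by comparing `map` with `imap_unordered`, a `Pool` method that is not used elsewhere in this notebook and yields results in completion order. This is a minimal sketch: `delayed_identity` is a hypothetical helper, and the `fork` start method is assumed to be available (as on Colab).

```python
import multiprocessing as mp
import time

def delayed_identity(x):
    time.sleep(0.01 * (5 - x))  # larger inputs sleep for less time
    return x

# Assumption: the "fork" start method exists on this platform (Linux, Colab).
ctx = mp.get_context("fork")

with ctx.Pool(processes=5) as pool:
    # `map` returns results in input order, whatever the finishing order.
    ordered = pool.map(delayed_identity, range(5))
    # `imap_unordered` yields each result as soon as it is ready.
    unordered = list(pool.imap_unordered(delayed_identity, range(5)))

print(ordered)
print(sorted(unordered) == ordered)
```

The unordered list contains the same values, but its order depends on scheduling and should not be relied on.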

# Multi-processing: the advanced way

The previous section introduced the `Pool` class of the `multiprocessing` module. This section introduces the `Process` class, which gives direct control over individual processes.


With the `Process` class you have to chunk the list and define the workflow yourself, but it is more flexible and more powerful.

- `Process`: runs the computation
- `Queue`: stores the results (`put` deposits data, `get` retrieves it)


Sample code (unordered output):

```python
import multiprocessing as mp

def square(x, q):
    q.put(x * x)

qout = mp.Queue()
processes = [mp.Process(target=square, args=(i, qout))
             for i in range(2, 10)]

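for p in processes:
    p.start()

# Drain the queue before joining: a child that has put data
# on a queue may not exit until that data has been read
result = [qout.get() for p in processes]

for p in processes:
    p.join()

print(result)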

```


Sample code (ordered output):

```python
import multiprocessing as mp
from random import randint
from time import sleep

def square(i, x, q):
    sleep(0.01 * randint(0, 100)) 
    q.put((i, x * x))

input_values = [2, 4, 6, 8, 3, 5, 7, 9]
qout = mp.Queue()
processes = [mp.Process(target=square, args=(ind, val, qout))
             for ind, val in enumerate(input_values)]

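for p in processes:
    p.start()

# Drain the queue before joining: a child that has put data
# on a queue may not exit until that data has been read
unsorted_result = [qout.get() for p in processes]

for p in processes:
    p.join()

# Each item is (index, value); sorting by index restores the input order
result = [t[1] for t in sorted(unsorted_result)]
print(result)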
```



First split the original list into several sublists:

In [100]:
def split(a, n):
    '''
    a: the large list to chunk
    n: the number of sublists to produce
    '''
    k, m = divmod(len(a), n)
    split_data = [a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n)]

    # Label each sublist with its index; the index is needed later
    # to put the results back together in the right order
    split_data_order_number = [[i, v] for i, v in enumerate(split_data)]

    return split_data_order_number
        
sub_list = split(org_list,6)
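
The slicing expression in `split` spreads the remainder over the first chunks, so chunk sizes differ by at most one. A small check on a list of the same length as `org_list` (43 items, 6 chunks) is sketched below; `split_plain` is a hypothetical helper that uses the same slicing without the index labels.

```python
def split_plain(a, n):
    # Same slicing as `split`, without the index labels.
    k, m = divmod(len(a), n)
    return [a[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

data = list(range(43))  # same length as org_list
chunks = split_plain(data, 6)

sizes = [len(c) for c in chunks]
print(sizes)  # [8, 7, 7, 7, 7, 7]

# Concatenating the chunks gives back the original list.
rejoined = [v for c in chunks for v in c]
print(rejoined == data)  # True
```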

In [101]:
#org_list

In [102]:
list(sub_list)

[[0, [3, 4, 5, 6, 5, 4, 3, 2]],
 [1, [1, 3, 4, 4, 3, 2, 4]],
 [2, [6, 2, 1, 32, 4, 55, 3]],
 [3, [2, 1, 4, 6, 78, 8, 5]],
 [4, [6, 0, 9, 8, 6, 3, 5]],
 [5, [6, 7, 7, 8, 8, 4, 2]]]

In [103]:
%%time

def func(x, q):
    index, value = x
    res1 = []
    res2 = []

    print(f'Job {index} starting\n')
    for i in tqdm(value):
        res1.append(i**2)
        res2.append(i**3)

    # Instead of returning, hand the results back with q.put().
    # The index and both result lists travel as one tuple, so results
    # from different processes cannot be paired with the wrong index.
    q.put((index, res1, res2))
    print(f'Job {index} finishing\n')

qout = mp.Queue()

# One process per chunk in `sub_list`.
# Results are stored as each process finishes, so they arrive in
# completion order, not necessarily in `sub_list` order.
processes = [mp.Process(target=func, args=(chunk, qout)) for chunk in sub_list]

CPU times: user 582 µs, sys: 942 µs, total: 1.52 ms
Wall time: 1.54 ms


In [104]:
%%time
for p in processes:
    p.daemon = True
    p.start()


# The work starts at `p.start()`; `qout.get()` only blocks until a result arrives.
# Sort by the chunk index assigned earlier to restore the input order.
unsorted_result = [qout.get() for p in processes]
result = sum([t[1] for t in sorted(unsorted_result, key=lambda t: t[0])], [])


for p in processes:
    p.join()


print('All task is done!')

Job 0 starting

Job 1 starting

Job 2 starting

Job 3 starting

Job 4 starting




Job 5 starting





Job 0 finishing








Job 1 finishing


Job 4 finishing



Job 2 finishing

Job 5 finishing


Job 3 finishing

All task is done!
CPU times: user 229 ms, sys: 241 ms, total: 470 ms
Wall time: 1.11 s


In [105]:
result == [i**2 for i in org_list]

True

Summary:

- The `Process` class makes it possible to control the processes directly.
- The `Queue` class can be used to save results from the processes.
- The processes are executed asynchronously.
- The order of the output is not guaranteed to correspond to that of the input values.

# Multi-processing (Ray)

In [106]:
import ray
import time

In [107]:
ray.shutdown()

# Start Ray.
ray.init(num_cpus=8)

2021-05-07 00:41:11,584 INFO services.py:1269 -- View the Ray dashboard at http://127.0.0.1:8265


{'metrics_export_port': 50732,
 'node_id': '51ab37dccea7212afceec20a5d91788002268bfea5cfeb98792c0551',
 'node_ip_address': '172.28.0.2',
 'object_store_address': '/tmp/ray/session_2021-05-07_00-41-09_875715_169/sockets/plasma_store',
 'raylet_ip_address': '172.28.0.2',
 'raylet_socket_name': '/tmp/ray/session_2021-05-07_00-41-09_875715_169/sockets/raylet',
 'redis_address': '172.28.0.2:6379',
 'session_dir': '/tmp/ray/session_2021-05-07_00-41-09_875715_169',
 'webui_url': '127.0.0.1:8265'}

In [112]:
%%time

@ray.remote
def func(x):
    
    index = x[0]
    value = x[1]
    res1 = []
    res2 = []
    
    #print(f'Job {index} starting\n')
    for i in value:
      res1.append(i**2)
      res2.append(i**3)
       
    #print(f'Job {index} finishing\n')

    return index, res1, res2

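# Start n tasks in parallel.
# `func.remote` returns an object reference immediately and the task runs in the background.
# Note: unlike the `multiprocessing.Process` example above, `ray.get` later returns
# the results in the same order as the submitted tasks.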
result_ids = []
for i in tqdm(sub_list):
    result_ids.append(func.remote(i))






CPU times: user 44.5 ms, sys: 7.12 ms, total: 51.6 ms
Wall time: 64.4 ms


In [113]:
%%time

# The tasks were already scheduled by `func.remote` above;
# `ray.get` blocks until they finish and retrieves the results,
# in the same order as `result_ids`.
results = ray.get(result_ids)

CPU times: user 1.06 ms, sys: 0 ns, total: 1.06 ms
Wall time: 1.32 ms


In [115]:
results

[(0, [9, 16, 25, 36, 25, 16, 9, 4], [27, 64, 125, 216, 125, 64, 27, 8]),
 (1, [1, 9, 16, 16, 9, 4, 16], [1, 27, 64, 64, 27, 8, 64]),
 (2, [36, 4, 1, 1024, 16, 3025, 9], [216, 8, 1, 32768, 64, 166375, 27]),
 (3, [4, 1, 16, 36, 6084, 64, 25], [8, 1, 64, 216, 474552, 512, 125]),
 (4, [36, 0, 81, 64, 36, 9, 25], [216, 0, 729, 512, 216, 27, 125]),
 (5, [36, 49, 49, 64, 64, 16, 4], [216, 343, 343, 512, 512, 64, 8])]