# Parallelization

- Parallel programming is hard. 
- Fortunately, many computational tasks are ["embarrassingly parallel"](https://en.wikipedia.org/wiki/Embarrassingly_parallel), and parallelization can provide great speedups at low cost.
- The main challenge is multiple tasks accessing the same resource.
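- With multithreading, the order of execution is not deterministic and may change in subtle ways from run to run.
- Parallelization usually increases memory usage, most of all when every worker holds its own copy of the data. It won't help if your processing is *memory bound*.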
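The point about execution order can be seen in a small sketch. The `work` function and its delays are made up for illustration: `pool.map` returns results in input order, while `as_completed` yields them in whatever order the tasks happen to finish.

```python
import concurrent.futures
import time


def work(delay):
    """Pretend to do some work that takes `delay` seconds (hypothetical task)."""
    time.sleep(delay)
    return delay


delays = [0.3, 0.2, 0.1]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    # map() yields results in the order of the inputs
    in_input_order = list(pool.map(work, delays))

    # as_completed() yields futures as they finish, whatever the submission order
    futures = [pool.submit(work, d) for d in delays]
    in_completion_order = [f.result() for f in concurrent.futures.as_completed(futures)]

print('input order:     ', in_input_order)
print('completion order:', in_completion_order)
```

Code that relies on results arriving in a particular order should use `map` or keep track of which future belongs to which input.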

- To parallelize your code, you can use multiple **threads** or multiple **processes**.

- Threads are lightweight and share the same memory space.
  - In Python, because of the GIL (global interpreter lock), only one thread can execute Python bytecode at a time.
  - Concurrency is achieved by switching between threads while they wait, e.g. for I/O.
  - Best used with *I/O bound* tasks (CPU load under 100% is a good indicator).
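A minimal sketch of why threads help with I/O bound work. Here `time.sleep` stands in for a network or disk wait, and `fake_io` is a hypothetical placeholder, not part of this notebook's `tools` module.

```python
import concurrent.futures
import time


def fake_io(i):
    """Stand-in for an I/O call: the thread sleeps and releases the GIL."""
    time.sleep(0.1)
    return i


items = list(range(8))

# one task after another: total time is roughly the sum of the waits
start = time.perf_counter()
sequential = [fake_io(i) for i in items]
t_sequential = time.perf_counter() - start

# all tasks wait at the same time: total time is roughly one wait
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    threaded = list(pool.map(fake_io, items))
t_threaded = time.perf_counter() - start

print(f'sequential: {t_sequential:.2f}s, threaded: {t_threaded:.2f}s')
```

The same experiment with a pure-Python computation instead of `time.sleep` would show little or no speedup, because the GIL lets only one thread run Python code at a time.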
  
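- Multiprocessing spawns subprocesses. With the `fork` start method, the initial memory state is cloned to each; with `spawn` (the default on Windows and macOS), each process starts from a fresh interpreter instead.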
  - Every process then has independent memory space. Less risk of corrupting shared state.
  - Because initial memory is copied to each process, memory usage is higher.
  - Best used with *CPU bound* tasks.
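As a rough illustration of a CPU bound task spread over processes, the sketch below counts primes by trial division. `count_primes` and the chosen limits are invented for this example. The `if __name__ == '__main__'` guard is not needed in a notebook on Linux, but is assumed here so that the code also works as a script on platforms that use `spawn`.

```python
import concurrent.futures


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def main():
    limits = [20_000, 30_000, 40_000, 50_000]
    # each limit is handled by a separate worker process, so the GIL is not a bottleneck
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as pool:
        counts = list(pool.map(count_primes, limits))
    for limit, count in zip(limits, counts):
        print(limit, count)


if __name__ == '__main__':
    main()
```

Arguments and return values are pickled and sent between processes, so this pays off only when the computation is much more expensive than moving the data.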

# Multithreading


## Example: download many files

This task is network I/O bound: the CPU idles while waiting for the next chunk of data to arrive.

[Cartographic boundary files](https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html) - Census

> The cartographic boundary files are **simplified** representations of selected geographic areas from the Census Bureau’s MAF/TIGER geographic database. These boundary files are specifically designed for small scale thematic mapping.

In [None]:
import concurrent.futures
import threading

from tools import download_file, unzip, ResourceMonitor, tracts_state_00_aa, tracts_state_aa_00

In [None]:
def download_state_tracts(state_code):
    url = f'https://www2.census.gov/geo/tiger/GENZ2019/shp/cb_2019_{state_code}_tract_500k.zip'
    f = download_file(url, f'data/tracts/{state_code}', overwrite=True, verbose=False)
    print(threading.current_thread().name, 'finished', state_code)
    return f

### Sequential

In [None]:
mon = ResourceMonitor(interval=0.5)
mon.start()

file_paths = []
for sc in tracts_state_00_aa.keys():
    file_paths.append(download_state_tracts(sc))

mon.stop()
mon.plot()

### Parallel
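With multiple threads, CPU usage can go over 100%.

The problem of shared resources shows up here as well. In this case the shared resource is the standard output stream, so lines printed by different threads may interleave.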
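One common way to keep output from different threads apart is to guard the stream with a lock. The sketch below is an assumption about how this could be done, not what `download_state_tracts` does: `report` and `task` are hypothetical helpers, and an `io.StringIO` buffer stands in for `sys.stdout` so the result can be inspected.

```python
import concurrent.futures
import io
import threading

print_lock = threading.Lock()


def report(stream, message):
    """Write one whole line to `stream`; the lock keeps lines from interleaving."""
    with print_lock:
        stream.write(message)
        stream.write('\n')


def task(stream, i):
    report(stream, f'{threading.current_thread().name} finished task {i}')
    return i


out = io.StringIO()  # stands in for sys.stdout so the output can be inspected
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda i: task(out, i), range(20)))

lines = out.getvalue().splitlines()
print(len(lines), 'complete lines written')
```

Without the lock, the message and the newline are two separate writes, and another thread can slip in between them.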

In [None]:
mon = ResourceMonitor(interval=0.2)
mon.start()

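with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = []
    # same state codes as in the sequential run
    for sc in tracts_state_00_aa.keys():
        futures.append(pool.submit(download_state_tracts, sc))
# leaving the `with` block already waits for all tasks, so this returns immediately
concurrent.futures.wait(futures)
file_paths = [f.result() for f in futures]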

mon.stop()
mon.plot()
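In the cell above, `f.result()` re-raises any exception from the worker thread, so a single failed download would abort the list comprehension. A possible way to collect failures separately is sketched below. `fetch` is a hypothetical stand-in for the download function, and the state code `'99'` is made to fail on purpose.

```python
import concurrent.futures


def fetch(code):
    """Hypothetical download step; fails for one code to simulate a bad URL."""
    if code == '99':
        raise ValueError(f'no file for state code {code}')
    return f'data/{code}.zip'


codes = ['01', '02', '99', '04']
paths, errors = {}, {}

with concurrent.futures.ThreadPoolExecutor() as pool:
    # remember which future belongs to which input
    future_to_code = {pool.submit(fetch, c): c for c in codes}
    for future in concurrent.futures.as_completed(future_to_code):
        code = future_to_code[future]
        try:
            paths[code] = future.result()  # re-raises the worker's exception
        except ValueError as e:
            errors[code] = str(e)

print('ok:', sorted(paths))
print('failed:', errors)
```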

# Multiprocessing

## Memory under multiprocessing

In [None]:
import psutil
p = psutil.Process()
p.memory_info().rss

In [None]:
import concurrent.futures
import time

In [None]:
x = [1] * 50_000_000

In [None]:
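def foo(i):
    import os
    import psutil
    p = psutil.Process()
    # stagger the workers so their output does not overlap
    time.sleep(i / 2)
    # resident memory (MiB) of this worker when the task starts
    mem = p.memory_info().rss // 2**20
    print(i, os.getpid(), os.getppid(), mem)
    time.sleep(3)
    mem = p.memory_info().rss // 2**20
    print(i, 1, mem)
    # allocate a large list inside the worker and measure again
    xx = [1] * 10_000_000
    time.sleep(3)
    mem = p.memory_info().rss // 2**20
    print(i, 2, mem)
    return i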

n = 3
with concurrent.futures.ProcessPoolExecutor(n) as pool:
    z = pool.map(foo, range(n))
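The independence of process memory can also be checked directly. In this sketch, `counter` and `bump` are made-up names: each worker increments its own copy of a module-level variable, and the parent's copy stays unchanged. The `__main__` guard is an assumption for running the code as a script on platforms that use `spawn`.

```python
import concurrent.futures
import os

counter = {'n': 0}


def bump(i):
    """Increment the module-level counter and report what this process sees."""
    counter['n'] += 1
    return os.getpid(), counter['n']


def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:
        for pid, seen in pool.map(bump, range(6)):
            print('worker', pid, 'sees counter =', seen)
    # the workers changed their own copies; the parent's copy is untouched
    print('parent', os.getpid(), 'sees counter =', counter['n'])


if __name__ == '__main__':
    main()
```

This is why results have to be returned from the worker function rather than written into shared variables.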

## Example: identify census tracts from coordinates

This requires a "point in shape" computation, a CPU intensive task, performed many times.
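To give a sense of what a single "point in shape" test involves, here is a pure-Python sketch of the classic ray casting method. It is only an illustration: `point_in_polygon` is a hypothetical helper, and the spatial join used below relies on an optimized geometry library rather than on this code.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does the edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))
print(point_in_polygon(5, 2, square))
```

The cost grows with the number of polygon vertices and the number of points, which is what makes the task CPU bound when it is repeated for every point against many tracts.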

In [None]:
import concurrent.futures

import pandas as pd
import geopandas as gpd
import fastparquet
from tools import download_file, unzip, tracts_state_00_aa, tracts_state_aa_00, ResourceMonitor

In [None]:
def unzip_tract(state_code):
    f = f'data/tracts/{state_code}/cb_2019_{state_code}_tract_500k.zip'
    unzip(f, f'data/tracts/{state_code}', overwrite=True, verbose=False)

In [None]:
%%time
for sc in tracts_state_00_aa:
    unzip_tract(sc)

In [None]:
%%time
with concurrent.futures.ProcessPoolExecutor() as pool:
    pool.map(unzip_tract, tracts_state_00_aa.keys())

In [None]:
def tracts_from_coords(state):
    state_code = tracts_state_aa_00[state]
    df = pd.read_parquet('data/synig.pq', columns=['ABI', 'LONGITUDE', 'LATITUDE'],
                         filters=[('YEAR', '==', 2020), ('STATE', '==', state)])
    if len(df) == 0:
        return
    df = gpd.GeoDataFrame(df)
    df['LONLAT'] = gpd.points_from_xy(df['LONGITUDE'], df['LATITUDE'])
    # the {'init': 'epsg:4326'} form is deprecated; pass the CRS as an authority string
    df = df.set_geometry('LONLAT', crs='EPSG:4326')
    tracts = gpd.read_file(f'data/tracts/{state_code}/cb_2019_{state_code}_tract_500k.shp')
    tracts = tracts[['GEOID', 'geometry']].to_crs('EPSG:4326')
    # left spatial join: attach the GEOID of the tract that each point lies within
    df = gpd.sjoin(df, tracts, 'left', 'within')
    return df[['ABI', 'GEOID']]

In [None]:
%%time
states = list(tracts_state_aa_00.keys())[:5]

mon = ResourceMonitor()
mon.start()
df = []
for state in states:
    print(state, end=' ')
    df.append(tracts_from_coords(state))
print()
df = pd.concat(df, ignore_index=True)
mon.stop()
mon.plot()

In [None]:
%%time
states = list(tracts_state_aa_00.keys())[:5]

with concurrent.futures.ProcessPoolExecutor(10) as pool:
    # same five states as the sequential run, so the timings are comparable
    df = pool.map(tracts_from_coords, states)
df = [x for x in df if x is not None]
df = pd.concat(df, ignore_index=True)