# Efficient Programming in Python

## Lecture 5

## Parallel processing

Parallel processing is a mode of operation where a task is executed simultaneously on multiple processors of the same computer. It is meant to reduce the overall processing time.

### What is the maximum number of parallel processes you can run?

In [1]:
import multiprocessing as mp
print("Number of processors: ", mp.cpu_count())

Number of processors:  4


### What is Synchronous and Asynchronous execution?

A **synchronous** execution is one in which the processes are completed in the same order in which they were started. This is achieved by locking the main program until the respective processes are finished.

**Asynchronous** execution, on the other hand, doesn’t involve locking. As a result, the order of results can get mixed up, but the work usually gets done quicker.

There are two main objects in `multiprocessing` for running a function in parallel: the `Pool` class and the `Process` class.

1. `Pool` Class
  - Synchronous execution
    - `Pool.map()` and `Pool.starmap()`
    - `Pool.apply()`
  - Asynchronous execution
    - `Pool.map_async()` and `Pool.starmap_async()`
    - `Pool.apply_async()`
2. `Process` Class
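
To make the distinction concrete, here is a minimal sketch contrasting a blocking call with a non-blocking one. It uses `multiprocessing.pool.ThreadPool`, which exposes the same methods as `Pool` but is backed by threads, purely so that the snippet runs anywhere without a `__main__` guard. The `square` function is a hypothetical example, not part of these notes.

```python
from multiprocessing.pool import ThreadPool  # same interface as mp.Pool, backed by threads

def square(x):
    """Hypothetical work function used only for illustration."""
    return x * x

with ThreadPool(2) as pool:
    # Synchronous: map() blocks until every result is ready
    sync_results = pool.map(square, range(5))

    # Asynchronous: map_async() returns an AsyncResult object immediately
    async_handle = pool.map_async(square, range(5))
    # ... the main program could do other work here ...
    async_results = async_handle.get()  # block only when the values are needed

print(sync_results)
print(async_results)
```

For true process-based parallelism, `ThreadPool(2)` would be swapped for `mp.Pool(2)`.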

### Problem

> Count how many numbers exist between a given range in each row.

In [2]:
import numpy as np
from time import time

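# Prepare data
# Draw from a seeded generator so the data is reproducible
# (calling np.random.RandomState(100) on its own does not seed np.random.randint)
rng = np.random.RandomState(100)
arr = rng.randint(0, 10, size=[200000, 5])
data = arr.tolist()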

#### Solution without parallelization

In [3]:
# Solution without parallelization
def howmany_within_range(row, minimum, maximum):
    """Returns how many numbers lie within `maximum` and `minimum` in a given `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

In [4]:
results = []
for row in data:
    results.append(howmany_within_range(row, minimum=4, maximum=8))

In [1]:
#print(results)

#### How to parallelize any function?

The general way to parallelize any operation is to take a particular function that should be run multiple times and make it run in parallel on different processors.

To do this, you initialize a `Pool` with n processors and pass the function you want to parallelize to one of `Pool`'s parallelization methods.

`multiprocessing.Pool()` provides the `apply()`, `map()` and `starmap()` methods to make any function run in parallel.

Both `apply()` and `map()` take the function to be parallelized as the main argument. The difference is that `apply()` takes an `args` argument holding the parameters passed to the function being parallelized, whereas `map()` takes only a single iterable as an argument.

`map()` is better suited to simple operations over a single iterable, and it does the job faster.

#### Parallelizing using `Pool.apply()`

In [None]:
# Parallelizing using Pool.apply()

import multiprocessing as mp

# Step 1: Init multiprocessing.Pool()
pool = mp.Pool(mp.cpu_count())

# Step 2: `pool.apply` the `howmany_within_range()`
results = [pool.apply(howmany_within_range, args=(row, 4, 8)) for row in data]

# Step 3: Don't forget to close
pool.close()    

print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]

#### Parallelizing using `Pool.map()`

`Pool.map()` accepts only one iterable as an argument. As a workaround, I modify the `howmany_within_range` function by giving the `minimum` and `maximum` parameters default values, creating a new `howmany_within_range_rowonly()` function that needs only a row as input, so `map()` can be fed an iterable of rows. I know this is not a nice use case for `map()`, but it clearly shows how it differs from `apply()`.

In [None]:
# Parallelizing using Pool.map()
import multiprocessing as mp

# Redefine, with only 1 mandatory argument.
def howmany_within_range_rowonly(row, minimum=4, maximum=8):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

pool = mp.Pool(mp.cpu_count())

results = pool.map(howmany_within_range_rowonly, data)

pool.close()

print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]
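
An alternative to redefining the function with default values is to bind the extra arguments with `functools.partial`. The sketch below assumes a tiny hand-made dataset (`sample_rows` is illustrative, not the data above) and uses `ThreadPool` as a stand-in that shares `Pool`'s interface, so it runs without a `__main__` guard.

```python
from functools import partial
from multiprocessing.pool import ThreadPool  # stand-in with the same API as mp.Pool

def howmany_within_range(row, minimum, maximum):
    """Count the numbers in `row` that lie between `minimum` and `maximum` (inclusive)."""
    return sum(1 for n in row if minimum <= n <= maximum)

sample_rows = [[1, 4, 8, 9, 5], [0, 0, 3, 2, 1], [4, 4, 4, 4, 4]]  # illustrative data

# Bind minimum and maximum so that map() only has to supply `row`
count_4_to_8 = partial(howmany_within_range, minimum=4, maximum=8)

with ThreadPool(2) as pool:
    results = pool.map(count_4_to_8, sample_rows)

print(results)  # [3, 0, 5]
```

`partial` objects built from module-level functions can be pickled, so the same pattern should carry over to `mp.Pool`.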

#### Parallelizing using `Pool.starmap()`

`Pool.starmap()` also accepts only one iterable as an argument, but in `starmap()` each element of that iterable is itself an iterable. You provide the arguments to the function being parallelized, in the same order, in this inner iterable; they are in turn unpacked during execution.

In [None]:
# Parallelizing with Pool.starmap()
import multiprocessing as mp

def howmany_within_range(row, minimum, maximum):
    """Returns how many numbers lie within `maximum` and `minimum` in a given `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count

if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())

    results = pool.starmap(howmany_within_range, [(row, 4, 8) for row in data])

    pool.close()

    print(results[:10])
    #> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]

#### Parallelizing with `Pool.apply_async()`

In [None]:
# Parallel processing with Pool.apply_async() without callback function

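import multiprocessing as mp

# Assumed helper (not defined in these notes): like howmany_within_range(),
# but it also returns the row index `i` so results can be matched to their rows
def howmany_within_range2(i, row, minimum, maximum):
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return (i, count)

pool = mp.Pool(mp.cpu_count())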

results = []

# call apply_async() without callback
result_objects = [pool.apply_async(howmany_within_range2, args=(i, row, 4, 8)) for i, row in enumerate(data)]

# result_objects is a list of pool.ApplyResult objects
results = [r.get()[1] for r in result_objects]

pool.close()
pool.join()
print(results[:10])
#> [3, 1, 4, 4, 4, 2, 1, 1, 3, 3]
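
The run above collects `ApplyResult` objects and calls `get()` on each. `apply_async()` also accepts a `callback` argument, and the following is a minimal sketch of that variant. The names `collect_result` and `sample_rows`, and the body of `howmany_within_range2`, are assumptions made for illustration. `ThreadPool` again stands in for `Pool` so the snippet runs without a `__main__` guard.

```python
from multiprocessing.pool import ThreadPool  # stand-in with the same API as mp.Pool

def howmany_within_range2(i, row, minimum, maximum):
    """Assumed helper: returns (row index, count) so results can be re-ordered."""
    count = sum(1 for n in row if minimum <= n <= maximum)
    return (i, count)

sample_rows = [[1, 4, 8, 9, 5], [0, 0, 3, 2, 1], [4, 4, 4, 4, 4]]  # illustrative data
collected = []

def collect_result(result):
    # Called once per completed task, in completion order (not input order)
    collected.append(result)

pool = ThreadPool(2)
for i, row in enumerate(sample_rows):
    pool.apply_async(howmany_within_range2, args=(i, row, 4, 8), callback=collect_result)
pool.close()
pool.join()  # wait until every task and its callback has finished

collected.sort(key=lambda pair: pair[0])  # restore the input order using the index
results = [count for _, count in collected]
print(results)  # [3, 0, 5]
```

Returning the index alongside the count is what makes the final sort possible, since callbacks may fire in any order.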

etc. ...

- `without`: all done at 1.32 seconds
- `apply`: all done at 306.88 seconds
- `map`: all done at 4.34 seconds
- `starmap`: all done at 5.69 seconds
- `apply_async`: all done at 36.52 seconds
- `map_async`: all done at 3.79 seconds
- `starmap_async`: all done at 5.94 seconds
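
The notes do not show how these timings were taken. One plausible way, sketched here with a hypothetical `timed` helper and a small illustrative input, is to wrap each approach in a wall-clock measurement:

```python
from time import perf_counter

def timed(label, func, *args, **kwargs):
    """Hypothetical helper: run `func` once and report its wall-clock time."""
    start = perf_counter()
    result = func(*args, **kwargs)
    elapsed = perf_counter() - start
    print(f"- `{label}`: all done at {elapsed:.2f} seconds")
    return result, elapsed

def count_all(rows, minimum, maximum):
    """Serial baseline: count the in-range numbers in every row."""
    return [sum(1 for n in row if minimum <= n <= maximum) for row in rows]

rows = [[1, 4, 8, 9, 5]] * 1000  # small illustrative input, not the lecture data
results, elapsed = timed("without", count_all, rows, 4, 8)
```

One likely reason `apply` is by far the slowest entry in the list: each `pool.apply()` call blocks until its single task returns, so the rows are processed one after another while still paying the inter-process communication cost for every row.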


## The Python Standard Library

Behind: Python syntax and philosophy

"Python" is a "batteries-included" distribution

Many powerful tools are already implemented in the standard library:

[https://docs.python.org/3.6/library/](https://docs.python.org/3.6/library/)

Assume all necessary imports have been executed

# collections
## container datatypes

## collections.namedtuple
### create tuple subclasses with named fields

In [None]:
Point = collections.namedtuple('Point', ['x', 'y'])

p = Point(11, y=22) # positional or keyword arguments

# Fields are accessible by name! "Readability counts."
-p.x, 2 * p.y # => -11, 44

# readable __repr__ with a name=value style
print(p) # Point(x=11, y=22)

In [None]:
Point = collections.namedtuple('Point', ['x', 'y'])

p = Point(11, 22)

# Subscriptable, like regular tuples
p[0] * p[1] # => 242

# Unpack, like regular tuples
x, y = p # x == 11, y == 22

# Usually don't need to unpack if attributes have names
other = Point(0, 0) # hypothetical second point, for illustration
math.hypot(p.x - other.x, p.y - other.y)

# Good Python Style:
# Use namedtuple

In [None]:
# Can you guess the context of this code?
p = (170, 0.1, 0.6)
if p[1] >= 0.5:
    print("Whew, that is bright!")
if p[2] >= 0.5:
    print("Wow, that is light!")

In [28]:
Color = collections.namedtuple("Color", ["hue", "saturation", "luminosity"])
pixel = Color(170, 0.1, 0.6)

if pixel.saturation >= 0.5:
    print("Whew, that is bright!")
if pixel.luminosity >= 0.5:
    print("Wow, that is light!")

Wow, that is light!
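
Beyond attribute access, namedtuples come with a few helper methods. A short sketch reusing the `Color` example (the `brighter` variable is illustrative):

```python
import collections

Color = collections.namedtuple("Color", ["hue", "saturation", "luminosity"])
pixel = Color(170, 0.1, 0.6)

# namedtuples are immutable; _replace() returns a modified copy
brighter = pixel._replace(saturation=0.9)

print(Color._fields)    # ('hue', 'saturation', 'luminosity')
print(pixel._asdict())  # mapping of field name -> value
print(brighter)         # Color(hue=170, saturation=0.9, luminosity=0.6)
print(pixel)            # the original is unchanged
```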


## collections.defaultdict
### dict subclass with factory function for missing values

In [None]:
# Have:
input_data = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

# Want:
output = {'blue': [2, 4], 'red': [1], 'yellow': [1, 3]}

In [3]:
input_data = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

# One approach
output = {}
for k, v in input_data:
    if k not in output:
        output[k] = []
    output[k].append(v)
        
print(output) # => {'yellow': [1, 3], 'blue': [2, 4], 'red': [1]}

{'yellow': [1, 3], 'blue': [2, 4], 'red': [1]}


In [6]:
import collections
input_data = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]

# A better approach
# accepts one argument - a zero-argument factory function to supply values for missing keys
output = collections.defaultdict(lambda: list())
for k, v in input_data:
    output[k].append(v)

print(output) # => defaultdict(..., {'yellow': [1, 3], 'blue': [2, 4], 'red': [1]})

defaultdict(<function <lambda> at 0x000001C5DB3B3730>, {'yellow': [1, 3], 'blue': [2, 4], 'red': [1]})


In [None]:
# defaultdict with default value []
collections.defaultdict(lambda: list())

# equivalent to
collections.defaultdict(list)

# defaultdict with default value 0
collections.defaultdict(lambda: 0)

# equivalent to
collections.defaultdict(int)

In [7]:
# Have: s = 'mississippi'
# Want: d = {'i': 4, 'p': 2, 'm': 1, 's': 4}
s = 'mississippi'
d = collections.defaultdict(int) # or... lambda: 0
for letter in s:
    d[letter] += 1
    
print(d)

defaultdict(<class 'int'>, {'m': 1, 'i': 4, 's': 4, 'p': 2})
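
One behaviour worth keeping in mind: a `defaultdict` stores the default as soon as a missing key is read with `[]`. A small sketch (the key names are arbitrary):

```python
import collections

d = collections.defaultdict(int)
d['a'] += 1

# Merely reading a missing key calls the factory and stores the result
_ = d['missing']
print(dict(d))         # {'a': 1, 'missing': 0}

# .get() and `in` do not trigger the factory
print(d.get('other'))  # None
print('other' in d)    # False
```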


## collections.Counter
### dict subclass for counting hashable objects

In [8]:
# Have: s = 'mississippi'
# Want: [('s', 4), ('m', 1), ('i', 4), ('p', 2)]
s = 'mississippi'

count = collections.Counter(s)

print(count) # Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
print(list(count.items())) # [('m', 1), ('i', 4), ('s', 4), ('p', 2)]

Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})
[('m', 1), ('i', 4), ('s', 4), ('p', 2)]


In [9]:
# Tally occurrences of words in a list
colors = ['red', 'blue', 'red', 'green', 'blue']

# One approach
counter = collections.Counter()
for color in colors:
    counter[color] += 1

print(counter) # Counter({'red': 2, 'blue': 2, 'green': 1})

# A better approach
counter = collections.Counter(colors)
print(counter) # Counter({'red': 2, 'blue': 2, 'green': 1})

Counter({'red': 2, 'blue': 2, 'green': 1})
Counter({'red': 2, 'blue': 2, 'green': 1})


In [12]:
# Get most common elements!
collections.Counter('abracadabra').most_common(3) # [('a', 5), ('b', 2), ('r', 2)]

# Supports basic arithmetic
collections.Counter('which') + collections.Counter('witch') # Counter({'c': 2, 'h': 3, 'i': 2, 't': 1, 'w': 2})

collections.Counter('abracadabra') - collections.Counter('alakazam') # Counter({'a': 1, 'b': 2, 'c': 1, 'd': 1, 'r': 2})

Counter({'a': 1, 'b': 2, 'c': 1, 'd': 1, 'r': 2})
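
Note that the `-` operator drops counts that fall to zero or below. The sketch below contrasts it with the in-place `subtract()` method, which keeps them, and shows `elements()`. The variable names are illustrative.

```python
import collections

a = collections.Counter('abracadabra')
b = collections.Counter('alakazam')

# The - operator keeps only positive counts
difference = a - b
print(difference)

# subtract() works in place and keeps zero and negative counts
c = collections.Counter(a)
c.subtract(b)
print(c['a'], c['l'])  # 1 -1

# elements() repeats each key as many times as its (positive) count
letters = sorted(difference.elements())
print(letters)         # ['a', 'b', 'b', 'c', 'd', 'r', 'r']
```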

## re
### Regular expression operations

In [16]:
import re

In [None]:
# Search for pattern match anywhere in string; return None if not found
m = re.search(r"(\w+) (\w+)", "Physicist Isaac Newton")
m.group(0) # "Physicist Isaac" - the entire match
m.group(1) # "Physicist" - first parenthesized subgroup
m.group(2) # "Isaac" - second parenthesized subgroup

# Match pattern against start of string; return None if not found
m = re.match(r"(?P<fname>\w+) (?P<lname>\w+)", "Malcolm Reynolds")
m.group('fname') # => 'Malcolm'
m.group('lname') # => 'Reynolds'

In [21]:
m.group(3) 

IndexError: no such group

In [None]:
# Substitute occurrences of one pattern with another
re.sub(r'@\w+\.com', '@stanford.edu', 'sam@go.com poohbear@bears.com')
# => sam@stanford.edu poohbear@stanford.edu

pattern = re.compile(r'[a-z]+[0-9]{3}') # compile pattern for fast ops
match = re.search(pattern, '@@@abc123') # pattern is first argument
match.span() # (3, 9)
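
A compiled pattern also exposes the matching functions as methods. A brief sketch with the same pattern, where the `text` string is made up for illustration:

```python
import re

pattern = re.compile(r'[a-z]+[0-9]{3}')

# Same search as above, called as a method on the compiled pattern
match = pattern.search('@@@abc123')
print(match.span())  # (3, 9)

text = 'ids: abc123, xy999, q1'  # illustrative input

# findall() returns every non-overlapping match as a list of strings
found = pattern.findall(text)
print(found)         # ['abc123', 'xy999']

# finditer() yields match objects, which carry positions as well
positions = [(m.group(0), m.start()) for m in pattern.finditer(text)]
print(positions)     # [('abc123', 5), ('xy999', 13)]

# fullmatch() succeeds only if the whole string matches
print(pattern.fullmatch('abc1234'))  # None
```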

## itertools
### iterators for efficient looping

### Combinatorics

In [None]:
def view(it): print(*[''.join(els) for els in it])
view(itertools.product('ABCD', 'EFGH'))
# => AE AF AG AH BE BF BG BH CE CF CG CH DE DF DG DH

view(itertools.product('ABCD', repeat=2))
# => AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD

view(itertools.permutations('ABCD', 2))
# => AB AC AD BA BC BD CA CB CD DA DB DC

view(itertools.combinations('ABCD', 2))
# => AB AC AD BC BD CD

view(itertools.combinations_with_replacement('ABCD', 2))
# => AA AB AC AD BB BC BD CC CD DD

In [None]:
# start, [step] -> start, start + step, ...
itertools.count(10) # -> 10, 11, 12, 13, 14, ...

# Cycle through elements of an iterable
itertools.cycle('ABC') # -> 'A', 'B', 'C', 'A', ...

# Repeat a single element over and over.
itertools.repeat(10) # -> 10, 10, 10, 10, ...
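
These iterators never terminate on their own, so passing one straight to `list()` would not finish. A sketch of taking a finite prefix with `itertools.islice` (the variable names are illustrative):

```python
import itertools

# Take the first n items of an infinite iterator
first_counts = list(itertools.islice(itertools.count(10), 5))
print(first_counts)   # [10, 11, 12, 13, 14]

stepped = list(itertools.islice(itertools.count(10, 3), 4))
print(stepped)        # [10, 13, 16, 19]

cycled = list(itertools.islice(itertools.cycle('ABC'), 7))
print(cycled)         # ['A', 'B', 'C', 'A', 'B', 'C', 'A']

# repeat() accepts an optional count, which makes it finite
repeated = list(itertools.repeat(10, 3))
print(repeated)       # [10, 10, 10]
```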

## random
### Generate pseudo-random numbers

In [None]:
# Random float x with 0.0 <= x < 1.0
random.random() # => 0.37444887175646646

# Random float x, 1.0 <= x <= 10.0 (the upper end may or may not be reached, depending on rounding)
random.uniform(1, 10) # => 1.1800146073117523

# Random integer from 1 to 6 (inclusive)
random.randint(1, 6) # => 4 (https://xkcd.com/221/)

# Random integer from 0 to 9 (inclusive)
random.randrange(10) # => 7

# Random even integer from 0 to 100 (inclusive)
random.randrange(0, 101, 2) # => 26

In [1]:
import random
# Choose a single element
random.choice('abcdefghij') # => 'c'

items = [1, 2, 3, 4, 5, 6, 7]
random.shuffle(items)
print(items) # => [7, 3, 2, 5, 6, 4, 1]

# k samples without replacement
random.sample(range(5), k=3) # => [3, 1, 4]

# Sample from statistical distributions (others exist)
random.normalvariate(mu=0, sigma=3) # => 2.373780578271

[7, 5, 4, 3, 2, 1, 6]


0.24707776115117297
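
The values above change on every run. When repeatable results are needed (for tests, say), the generator can be seeded. A minimal sketch, with the seed value 42 chosen arbitrarily:

```python
import random

# Seeding the module-level generator makes the sequence repeatable
random.seed(42)
first = [random.randint(1, 6) for _ in range(5)]

random.seed(42)
second = [random.randint(1, 6) for _ in range(5)]

print(first == second)  # True

# An independent generator avoids touching the shared module-level state
rng = random.Random(42)
third = [rng.randint(1, 6) for _ in range(5)]
print(third == first)   # True
```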

# Builtin Functions

In [None]:
any([True, True, False]) # => True
all([True, True, False]) # => False

In [None]:
int('45') # => 45
int('0x2a', 16) # => 42
int('1011', 2) # => 11
hex(42) # => '0x2a'
bin(42) # => '0b101010'

In [None]:
ord('a') # => 97
chr(97) # => 'a'

In [5]:
#round(123.45, 1) # => 123.5 (the float 123.45 is stored slightly above its decimal value)
round(123.45, 0) # => 123.0

123.0

In [6]:
max(2, 3) # => 3
max([0, 4, 1]) # => 4
min(['apple', 'banana', 'pear'], key=len) # => 'pear'

'pear'

In [22]:
sum([3, 5, 7]) # => 15

15

In [23]:
sum([3, 5, 7], 10) # => 25

25

In [None]:
pow(3, 5) # => 243 (= 3 ** 5)
pow(3, 5, 10) # => 3 (= (3 ** 5) % 10, efficiently)

In [None]:
quotient, remainder = divmod(10, 6)
# quotient, remainder => (1, 4)

In [None]:
# Flatten a list of lists (slower than itertools.chain)
sum([[3, 5], [1, 7], [4]], []) # => [3, 5, 1, 7, 4]
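
For comparison, the `itertools.chain` alternative mentioned in the comment could look like this (a sketch; the `nested` name is illustrative):

```python
import itertools

nested = [[3, 5], [1, 7], [4]]

# chain.from_iterable() walks each inner list lazily, without building
# the intermediate lists that repeated + on lists would create
flat = list(itertools.chain.from_iterable(nested))
print(flat)  # [3, 5, 1, 7, 4]

# A nested list comprehension gives the same result
flat_comprehension = [x for inner in nested for x in inner]
print(flat_comprehension)
```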

# Other Modules

- 6.1. string — Common string operations
- 7.1. struct — Interpret bytes as packed binary data
- 8.1. datetime — Basic date and time types
- 9.5. fractions — Rational numbers
- 9.7. statistics — Mathematical statistics functions
- 10.3. operator — Standard operators as functions
- 12.1. pickle — Python object serialization
- 14.1. csv — CSV File Reading and Writing
- 16.1. os — Miscellaneous operating system interfaces

- 16.3. time — Time access and conversions
- 16.4. argparse — Parser for command-line options, arguments and sub-commands
- 16.6. logging — Logging facility for Python
- 17.1. threading — Thread-based parallelism
- 17.2. multiprocessing — Process-based parallelism
- 18.1. socket — Low-level networking interface 
- 18.5. asyncio – Asynchronous I/O, event loop, coroutines and tasks

- 18.8. signal — Set handlers for asynchronous events
- 26.3. unittest — Unit testing framework
- 26.6. 2to3 - Automated Python 2 to 3 code translation
- 27.3. pdb — The Python Debugger
- 27.6. trace — Trace or track Python statement execution
- 29.12. inspect — Inspect live objects

[https://www.youtube.com/watch?v=o9pEzgHorH0](https://www.youtube.com/watch?v=o9pEzgHorH0)

[https://www.youtube.com/watch?v=JYuE8ZiDPl4&list=PLs4CJRBY5F1KsK4AbFaPsUT8X8iXc7X84&index=2](https://www.youtube.com/watch?v=JYuE8ZiDPl4&list=PLs4CJRBY5F1KsK4AbFaPsUT8X8iXc7X84&index=2)