# Multiprocessing

This is a very brief example of a very complex topic: multiprocessing and multithreading.
The core idea is simple, though: if you loop over, say, a list of objects to process them, you don't always have to wait for the first object to finish before starting on the second. Instead, you can process them in parallel, which is especially useful if you have multiple cores.

In [1]:
import re  # needed for our tokenizer later, not for multiprocessing itself

Example from the Python documentation (https://docs.python.org/3/library/multiprocessing.html):

In [3]:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':        # in our course, we didn't discuss the if __name__=='__main__' construct yet
    with Pool(5) as p:            # create the pool inside the guard; it is cleaned up when the block ends
        print(p.map(f, [1, 2, 3]))

[1, 4, 9]


## Now for us!
Let's apply it to our context:

In [5]:

def my_preprocessing(t):
    '''takes a string t and returns a very basic tokenized list'''
    return [e for e in re.split(r"\W", t.lower()) if len(e)>2]

mydata = ["This is some text", "This is some other text", "Number three", "Good morning! This is text","Bye!"]

result = [my_preprocessing(text) for text in mydata]
print(result)

[['this', 'some', 'text'], ['this', 'some', 'other', 'text'], ['number', 'three'], ['good', 'morning', 'this', 'text'], ['bye']]


In [6]:
# now with multiprocessing instead

N_PROCESSES = 3

with Pool(N_PROCESSES) as p:
    result2 = p.map(my_preprocessing, mydata)
print(result2)

[['this', 'some', 'text'], ['this', 'some', 'other', 'text'], ['number', 'three'], ['good', 'morning', 'this', 'text'], ['bye']]


In [7]:
result == result2

True

Of course, this does not pay off for so little data: the overhead of starting the worker processes and passing data between them outweighs the time gained.
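
To see this overhead for yourself, you could time both variants. The following is only a minimal sketch, not a proper benchmark: the helper names `tokenize`, `run_serial`, `run_parallel` and `timed` are made up for this illustration, the tokenizer repeats the logic of `my_preprocessing`, and the repeated sample texts are an arbitrary choice. The timings depend entirely on your machine, so none are shown here.

```python
import re
import time
from multiprocessing import Pool


def tokenize(text):
    """Lowercase, split on non-word characters, keep tokens longer than 2 characters."""
    return [tok for tok in re.split(r"\W", text.lower()) if len(tok) > 2]


def run_serial(texts):
    """Process all texts one after another in the current process."""
    return [tokenize(t) for t in texts]


def run_parallel(texts, n_processes=3):
    """Process the texts with a pool of worker processes."""
    with Pool(n_processes) as pool:
        return pool.map(tokenize, texts)


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # hypothetical test data: a few short texts, repeated many times
    texts = ["This is some text", "Good morning! This is text", "Bye!"] * 2000

    serial, t_serial = timed(run_serial, texts)
    parallel, t_parallel = timed(run_parallel, texts)

    print(serial == parallel)  # both variants should give the same result
    print(f"serial: {t_serial:.4f}s, parallel: {t_parallel:.4f}s")
```

With a function as cheap as this tokenizer, the parallel variant may well be slower even for thousands of texts. Multiprocessing tends to help more when each individual call does a substantial amount of work.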