# Overcoming the Limitations of Threads

## The GIL

In the last mission, we learned about I/O bounds, threads, and locks. We went through a few exercises where we turned code into its threaded equivalent to see if we gained any speed. Unfortunately, even when we ran multiple threads, we didn't see the performance gains we intuitively expected. For example, **`2` threads didn't make the program twice as fast**.<br>

There are two main reasons for this:
* The [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock), or [GIL](https://en.wikipedia.org/wiki/CPython), in CPython.
* The number of cores in the underlying processor.

In the last mission, we learned about the idea of thread safety -- how if two threads write to the same resource at the same time, they can cause conflicts. We saw an example of this when multiple threads wrote to the system standard output at the same time. This caused issues with output appearing out of order, or newlines not showing up after a string.<br>

[CPython](https://en.wikipedia.org/wiki/CPython), the most commonly used [Python interpreter](http://docs.python-guide.org/en/latest/starting/which-python/), does not do memory management in a thread safe way. This means that CPython needs to prevent multiple threads from executing Python code at the same time. Although you can initialize multiple threads and execute code using them, the GIL only allows one of those threads at a time to execute Python code, via a locking mechanism.<br>

This makes threads perform poorly when parallelizing CPU-bound tasks, since only one thread can execute at a time. However, the GIL is released in certain situations, including when doing I/O operations such as waiting for network activity or reading a file. This makes threads more performant in I/O bound situations.<br>

We can demonstrate this by reading a file with two different threads, and seeing how long it takes. If they take less than twice the time it took to run the code with a single thread, then you know you're seeing some benefit. The larger the file, the greater the benefit you'll see from multiple threads, since the GIL will be released for longer, as the file data is transferred.



* Write a function that opens and reads the file `Emails.csv`.
* Run the function `100` times and time it each time.
  * Assign the times to the list `times`.
* Create two threads that both call the function and run them `100` times.
  * Assign the `times` to the list `threaded_times`.
* Find the median of `times` and `threaded_times`.
* What do the results tell you? Were they what you expected?

In [1]:
import threading
import time
import statistics

In [17]:
def open_emails():
    # Read the whole file into memory and return its contents.
    with open('../data/Emails.csv') as f:
        data = f.read()
    return data

In [18]:
times = []

for i in range(100):
    start = time.time()
    open_emails()
    times.append(time.time() - start)
    
print(statistics.median(times))

0.043221473693847656


In [19]:
threaded_times = []

for i in range(100):
    
    start = time.time()
    
    thread1 = threading.Thread(target=open_emails)
    thread2 = threading.Thread(target=open_emails)
    thread1.start()
    thread2.start()
    thread1.join()
    thread2.join()
    
    threaded_times.append(time.time() - start)
    
print(statistics.median(threaded_times))

0.08028936386108398


## Python Interpreters

Before we dive more into the GIL and how it works, let's take a minute to discuss Python interpreters. A good way to think about interpreters is by thinking about how your brain interprets the words you're seeing right now. If you didn't know English, these letters would just be drawings on a page. If you look at letters in a language you don't know, they won't appear to have any meaning:

![alphabet-interpret](https://s3.amazonaws.com/dq-content/169/esperanto.gif)

The above is the alphabet of an obscure language called [Esperanto](https://en.wikipedia.org/wiki/Esperanto). If you understand English, some of the letters look familiar, but many have strange symbols above them, or don't "click" in your mind.<br>

A Python program is fundamentally just a series of symbols arranged according to specific rules. Here's an example:

```python
def loop():
    for i in range(5):
        print(i)
```

If you understand Python code already, the above example is easy to read. If you don't, the example might look foreign or intimidating. Just like you need lessons before you can understand Esperanto, a computer needs to be told how to parse the syntax of the Python language and execute it. Without an interpreter, there's no way for a computer to execute your Python code.<br>

Computers don't natively understand Python. When your program "runs", what actually happens is that your Python program is translated into code the machine can run. The code is then executed, and the response is translated back into a value you can understand.<br>

The CPython interpreter takes care of translating your code into Python bytecode. The bytecode is then executed by the interpreter's virtual machine, which runs the machine code that corresponds to each bytecode instruction. We'll cover this more in a bit. Here's a diagram:

![cpython-interpret-diagram](https://s3.amazonaws.com/dq-content/168/bytecode.svg)

Translating to bytecode first makes it easier for the CPython interpreter to associate each command with the machine code that it needs to run. The components that translate Python code to bytecode and bytecode to machine code are written in a low-level language called **[C](https://en.wikipedia.org/wiki/C)**.

There's a small set of potential bytecode operations, and each one is mapped to machine code ahead of time. This makes running programs as fast as possible. We can use the [dis](https://docs.python.org/3/library/dis.html) module to look at the bytecode generated for our programs. The [dis.dis()](https://docs.python.org/3/library/dis.html#dis.dis) function will disassemble a Python object and display its bytecode.<br>

Here's what our example function from above would look like in bytecode:

```
  4           0 SETUP_LOOP              30 (to 33)
              3 LOAD_GLOBAL              0 (range)
              6 LOAD_CONST               1 (5)
              9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             12 GET_ITER
        >>   13 FOR_ITER                16 (to 32)
             16 STORE_FAST               0 (i)

  5          19 LOAD_GLOBAL              1 (print)
             22 LOAD_FAST                0 (i)
             25 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             28 POP_TOP
             29 JUMP_ABSOLUTE           13
        >>   32 POP_BLOCK
        >>   33 LOAD_CONST               0 (None)
             36 RETURN_VALUE
```

We won't dive into how to read bytecode here, but you can find a good guide [here](http://akaptur.com/blog/2013/11/15/introduction-to-the-python-interpreter/). After the bytecode is generated, the Python interpreter loops through the bytecode, and runs a pre-generated snippet of code that corresponds to each bytecode instruction. You can [read more here](http://www.devshed.com/c/a/python/how-python-runs-programs/), or see the source code that matches the bytecode to [machine code here](https://github.com/python/cpython/blob/51ba5b7d0c194b0c2b1e5d647e70e3538b8dde3e/Python/ceval.c#L1357). It's not usually important to dive deeply into how CPython interprets code. The important things to understand are:<br>
* Python enables us to write at a high abstraction layer. This means that our code can be extremely terse, but still achieve a lot. Imagine having to write the raw bytecode above every time you wanted to write a simple loop!
* Python builds on a lot of other tools, and sometimes the performance of Python can be impacted by those tools. In this case, CPython does a lot of work for us, and is a huge help, but it also introduces some limitations, like the GIL.

Although CPython is widely used because it gets updates the fastest, and is the "official" interpreter, there are other Python interpreters written in other languages, such as:<br>
* [Jython](http://www.jython.org/) -- A Python interpreter that runs in the [JVM](https://en.wikipedia.org/wiki/Java_virtual_machine).
* [PyPy](https://pypy.org/) -- a faster Python interpreter.
* [IronPython](http://ironpython.net/) -- Python running on the [.NET](https://en.wikipedia.org/wiki/.NET_Framework) framework.

Each interpreter has its own tradeoffs, but it's usually best to just go with CPython. Before we move on, let's take a quick look at some bytecode using dis.
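
As a quick illustration, here's a minimal sketch of inspecting bytecode with `dis`, using the `loop` function from earlier. The exact opcodes and offsets depend on your Python version, so the output will likely differ from the listing shown above.

```python
import dis

def loop():
    for i in range(5):
        print(i)

# Print a human-readable disassembly of the function's bytecode.
dis.dis(loop)

# Collect the opcode names programmatically instead of printing them.
opnames = [instruction.opname for instruction in dis.get_instructions(loop)]
print(opnames)
```

`dis.get_instructions()` is handy when you want to work with the instructions as data, for example to count how many bytecode operations a function compiles to.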


## Clinton Emails



Before we dive into the GIL, let's introduce the dataset we'll use in this mission, and do a quick exercise with it. In this mission, we'll be analyzing almost `8000` emails that the [US State Department](https://www.state.gov/) released from [Hillary Clinton's](https://en.wikipedia.org/wiki/Hillary_Clinton) private email server. These emails, and others, caused several major controversies during her [2016 campaign for President](https://en.wikipedia.org/wiki/United_States_presidential_election,_2016).<br>

The data is available [here](https://data.world/briangriffey/clinton-emails) for download in CSV format. We'll be working primarily with the `Emails.csv` in this mission. Here are the first few rows:

In [3]:
import pandas as pd
pd.read_csv('../data/Emails.csv').head(5)

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,SenderPersonId,MetadataDateSent,MetadataDateReleased,MetadataPdfLink,MetadataCaseNumber,...,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,RawText
0,1,C05739545,WOW,H,"Sullivan, Jacob J",87.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545...,F-2015-04841,...,,"Sullivan, Jacob J <Sullivan11@state.gov>",,"Wednesday, September 12, 2012 10:16 AM",F-2015-04841,C05739545,05/13/2015,RELEASE IN FULL,,UNCLASSIFIED\nU.S. Department of State\nCase N...
1,2,C05739546,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,H,,,2011-03-03T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739546...,F-2015-04841,...,,,,,F-2015-04841,C05739546,05/13/2015,RELEASE IN PART,"B6\nThursday, March 3, 2011 9:45 PM\nH: Latest...",UNCLASSIFIED\nU.S. Department of State\nCase N...
2,3,C05739547,CHRIS STEVENS,;H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739547...,F-2015-04841,...,B6,"Mills, Cheryl D <MillsCD@state.gov>","Abedin, Huma","Wednesday, September 12, 2012 11:52 AM",F-2015-04841,C05739547,05/14/2015,RELEASE IN PART,Thx,UNCLASSIFIED\nU.S. Department of State\nCase N...
3,4,C05739550,CAIRO CONDEMNATION - FINAL,H,"Mills, Cheryl D",32.0,2012-09-12T04:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739550...,F-2015-04841,...,,"Mills, Cheryl D <MillsCD@state.gov>","Mitchell, Andrew B","Wednesday, September 12,2012 12:44 PM",F-2015-04841,C05739550,05/13/2015,RELEASE IN PART,,UNCLASSIFIED\nU.S. Department of State\nCase N...
4,5,C05739554,H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...,"Abedin, Huma",H,80.0,2011-03-11T05:00:00+00:00,2015-05-22T04:00:00+00:00,DOCUMENTS/HRC_Email_1_296/HRCH1/DOC_0C05739554...,F-2015-04841,...,,,,,F-2015-04841,C05739554,05/13/2015,RELEASE IN PART,"H <hrod17@clintonemail.com>\nFriday, March 11,...",B6\nUNCLASSIFIED\nU.S. Department of State\nCa...


The main column we're interested in is `RawText`, which is the raw text of the entire email.<br>

In order to establish a baseline for single-threaded performance, let's see how long it takes to count the number of capital letters in each email.

* Loop through each email in the `RawText` column, and:
  * Count up the number of capital letters in the email using a list comprehension.
  * Append the count to `capital_letters`.
* Make sure to profile the loop, and assign the total running time to `total`.
* Print `total`, and see how long it took.

In [4]:
emails = pd.read_csv('../data/Emails.csv')
capital_letters = []

In [8]:
emails['RawText'].head()

0    UNCLASSIFIED\nU.S. Department of State\nCase N...
1    UNCLASSIFIED\nU.S. Department of State\nCase N...
2    UNCLASSIFIED\nU.S. Department of State\nCase N...
3    UNCLASSIFIED\nU.S. Department of State\nCase N...
4    B6\nUNCLASSIFIED\nU.S. Department of State\nCa...
Name: RawText, dtype: object

In [9]:
start = time.time()
for rawtxt in emails['RawText']:
    
    capital_letters.append(len([char for char in rawtxt
                               if char.isupper()]))
            
            
total = time.time() - start
print(total)

2.1561992168426514


## How The GIL Works

The GIL prevents multiple threads from executing Python code at once. Think of this as a lock, like this:

```python
def count_capitals(email):
    return len([letter for letter in email if letter.isupper()])

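# emails is a DataFrame, so select the RawText column before indexing by position.
t1 = threading.Thread(target=count_capitals, args=(emails["RawText"][0],))
t2 = threading.Thread(target=count_capitals, args=(emails["RawText"][1],))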

t1.start()
t2.start()
```

* Write a function that counts up the number of capital letters in a range of emails.
  * The function should take three inputs, `start`, `finish`, and `capital_letters`:
    * `start` -- what index in the RawText column to start from.
    * `finish` -- what index in the RawText column to end on.
    * `capital_letters` -- a list to append the capital letter counts for each email to.
  * The function should loop through each element in `emails["RawText"]`, from `start` to `finish`, and count up the number of capital letters.
  * Append the count to `capital_letters`.
* Create a thread called `t1` that calls the function with the arguments `0`, `3972`, and `capital_letters1`.
* Create a thread called `t2` that calls the function with the arguments `3972`, `7946`, and `capital_letters2`.
* Start both threads.
* Call the [threading.Thread.join()](https://docs.python.org/3/library/threading.html#threading.Thread.join) method on both threads.
* Make sure to time how long the above takes, and assign the result to `total`.
* Print `total`, and see how long it took. How does it compare to the last screen?

In [10]:
capital_letters1 = []
capital_letters2 = []

In [11]:
def count_capitals(start, finish, capital_letters):
    
    for email in emails['RawText'][start:finish]:
        capital_letters.append(len([char for char in email
                                   if char.isupper()]))
    
    

In [14]:
print(len(emails['RawText']))
print(len(emails['RawText'])/2)

7945
3972.5


In [12]:
t1 = threading.Thread(target=count_capitals, args=(0,3972,capital_letters1))
t2 = threading.Thread(target=count_capitals, args=(3972,7946,capital_letters2))

start = time.time()
t1.start();t2.start();t1.join();t2.join()
total = time.time() - start

print(total)

2.3887863159179688


## Processes

You may have noticed that the code we wrote in the last screen was around the same speed as the code we wrote two screens ago. This is because of the GIL and overhead associated with threading. The GIL prevents multiple threads from executing at once, which makes the performance of threaded CPU bound programs at best equivalent to single threaded performance.<br>

However, there's also a small bit of overhead associated with each thread. This overhead comes from:
* The Python interpreter switching between threads as locks get released / acquired.
* Creating the threads and starting them.
* Joining the threads and terminating them.

Each of the above steps takes a tiny amount of time. This means that threads are good for situations where:
* You have long-running I/O bound tasks.

Threads aren't so good for situations where:
* You have CPU-bound tasks.
* You have tasks that will run very quickly (so the overhead outweighs the gains).
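
To get a rough feel for the overhead described above, here's a minimal sketch that runs a deliberately trivial task directly, and then again with one new thread per call. The names `tiny_task`, `time_direct`, and `time_threaded` are made up for this illustration, and the timings will vary from machine to machine.

```python
import threading
import time

def tiny_task():
    # Deliberately trivial, so thread overhead dominates the running time.
    return sum(range(10))

def time_direct(repeats):
    # Call the task directly in the current thread.
    start = time.perf_counter()
    for _ in range(repeats):
        tiny_task()
    return time.perf_counter() - start

def time_threaded(repeats):
    # Create, start, and join a new thread for every call.
    start = time.perf_counter()
    for _ in range(repeats):
        thread = threading.Thread(target=tiny_task)
        thread.start()
        thread.join()
    return time.perf_counter() - start

direct = time_direct(1000)
threaded = time_threaded(1000)
print(f"direct:   {direct:.4f}s")
print(f"threaded: {threaded:.4f}s")
```

On most machines the threaded version should take noticeably longer, since creating and joining a thread costs far more than the task itself.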

Although the GIL makes it harder to execute CPU bound tasks more quickly using threads, it doesn't prevent us from parallelizing CPU bound code using Python. To do so, we'll need to create multiple processes, and execute one piece of our code in each process.<br>

Whenever you launch [Microsoft Word](https://en.wikipedia.org/wiki/Microsoft_Word), [Spotify](https://en.wikipedia.org/wiki/Spotify), or [Google Chrome](https://en.wikipedia.org/wiki/Google_Chrome), the [Operating System](https://en.wikipedia.org/wiki/Operating_system) **creates a separate process for the program**, giving it a secure place to execute. Launching a Python interpreter also launches a process, with its own memory, and its own GIL. **Each process can contain one or more threads**. Here's a diagram:

![process-threads](https://s3.amazonaws.com/dq-content/168/processes_threads.svg)

In the above example, we have three processes, `Process 1`, `Process 2`, and `Process 3`. Each process contains its own Python interpreter (what would happen when you type `python script.py` in the terminal, for instance). The first process is running `3` threads, and the second process is running `1` thread. The threads inside each process are still subject to the GIL, but each process has its own GIL. **For example, `Thread 1` in `Process 1`, and `Thread 2` in `Process 2` can be running at the same time**.<br>

Threads share memory, which isn't a security risk because each thread is created by the same program. For example, it isn't a problem if the Google Chrome browser creates a new thread to render a background tab, because Google Chrome can trust its own code.<br>

Allowing processes to share memory is a huge security and reliability risk. What if Google Chrome could change the in-memory values for [Mozilla Firefox](https://en.wikipedia.org/wiki/Firefox)? Imagine if someone else could change the values of your variables while a program you wrote was running. This would make your program run unpredictably, and could cause major problems if your program was doing something critical. Thus, **processes each have their own memory, which isn't shared between processes**.<br>

The important takeaways here are:
* Threads run inside processes.
* Each process has its own memory, and all the threads inside share the same memory.
* One thread can be running inside each Python interpreter at a time, so starting multiple processes enables us to avoid the GIL.
* **Creating a process is a relatively "heavy" operation, and takes time. Threads, since they're inside processes, are much faster to make**.

## Multiprocessing

The easiest way to work with multiple processes in Python is to use the [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) library. This library takes care of creating multiple processes for you, and makes it easy to pass data between them. Operating systems, like Linux and Microsoft Windows, have built-in ways to create new processes already. Because the Python interpreter uses these built-in methods to create new processes, the specifics of how processes are created vary by operating system. You can read more [here](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods), but here's a summary of the `3` methods of creating processes:
* `spawn` -- creates a new Python interpreter process, and only passes in a minimal set of resources from the parent (omitting open files, etc). Default on Windows, and on macOS since Python 3.8.
* `fork` -- copies the parent process and its memory state, enabling the child to access functions, etc, that are defined in the parent. Historically the default on Unix, though newer Python versions have moved away from it.
* `forkserver` -- creates a fork server, which is then used as the base process for future processes.

Note that all of the above methods use the underlying operating system (OS) to create the new process. We could manually start new processes using these OS mechanisms, but it would be very inconvenient (it would be like writing bytecode on your own when you can just write Python instead).<br>
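
If you're curious which start methods your own system supports, the sketch below shows one way to check. It only uses functions from the `multiprocessing` module, but the values printed depend on your operating system and Python version.

```python
import multiprocessing

# Start methods supported on this platform, some subset of
# 'fork', 'spawn', and 'forkserver'.
available = multiprocessing.get_all_start_methods()
print(available)

# The method used when none is requested explicitly. Note that calling
# this without allow_none=True also fixes that default for the program.
print(multiprocessing.get_start_method())

# A context object offers the same API as the multiprocessing module,
# but bound to one specific start method. 'spawn' exists on every platform.
spawn_context = multiprocessing.get_context("spawn")
print(spawn_context.get_start_method())
```

Using a context such as `spawn_context.Process(...)` lets one part of a program pick a start method without changing the default for everything else.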

We refer to the process that creates the new processes as the parent, and the new processes as children. Here's a diagram of how `fork` works:

![diagram-how-fork-works](https://s3.amazonaws.com/dq-content/168/parent_process.svg)

Note how when a process is forked, its memory is copied into the new process, so the new process has access to all of the same variables. Also note that per the above diagram, after the process is forked, the memory states aren't linked, and each process can independently modify the values in memory.<br>

When we worked with threads, we used the `threading.Thread` class, and the `start()` and `join()` methods. With `multiprocessing`, we can use the [multiprocessing.Process](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process) class, along with the [multiprocessing.Process.start()](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.start) method, and the [multiprocessing.Process.join()](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.join) method:

```python
import multiprocessing

def task(email):
    print(email)

process = multiprocessing.Process(target=task, args=(email,))
process.start()
process.join()
```

Because of how creating new processes works (by copying in-memory values, as explained above), you need to be careful to define any functions or variables you want to use in a process before creating it, which is why the `task` definition is above the process creation in the example.


* Swap your code from the last screen to use `multiprocessing`.
* Make sure to profile the code, assign the result to `total`, and print the total time it took to run.
* Print out `capital_letters` when you're done. Is anything off about the result? Do you have any theories about why?

In [15]:
import multiprocessing

capital_letters1 = []
capital_letters2 = []

In [16]:
process1 = multiprocessing.Process(target=count_capitals,
                                  args=(0,3972,capital_letters1))
process2 = multiprocessing.Process(target=count_capitals,
                                  args=(3972,7946,capital_letters2))

start = time.time()
process1.start();process2.start();process1.join();process2.join()
print(time.time() - start)

1.5566089153289795


In [17]:
capital_letters1

[]

In [18]:
capital_letters2

[]

### Note.
* Calculation performance improved (checked).
* However, the capital letter counts were not stored in the parent's lists.

## Multiple Cores

You should have noticed two things in the last screen:
* The total runtime of the code was much lower than when we used threads.
* The `capital_letters1` list was empty.

We'll briefly touch on the first point. Although the overhead to set up and remove processes is higher, each process has its own GIL, which means that code can be executed on multiple CPU cores at once. If you've bought a computer in the past few years, you may have noticed that computers are now advertised as having multi-core processors. The [Intel i5 processors](http://www.intel.com/content/www/us/en/products/processors/core/i5-processors.html) are one example of this, as they have `2` to `4` cores.<br>

Think of each core as its own CPU. A single process can only run on a single CPU. Here's an example of two processes sharing a single CPU:

![single-core-process](../img/5.png)

Each CPU can only run a certain number of operations per second. **If one of the programs wants to run more operations than the CPU can handle, the other program will run more slowly**. Think of this model as each process getting `50%` of the capacity of the CPU. Given this, we'd expect each process to run at `50%` of the speed it could otherwise run.<br>

But if your program creates multiple processes, each process can be run on a different CPU:

![single-core-process2](../img/6.png)

This makes your program much faster, since it's effectively running twice as many operations per second (due to using two cores). However, due to the overhead of creating and removing processes, you may not see a direct 2x performance gain.<br>

**There aren't many programs which are properly written to consistently utilize high numbers of cores, which is why a `4` core processor isn't twice as fast as a `2` core processor for everything**. There's a tricky relationship between [clock speed](https://en.wikipedia.org/wiki/Clock_rate) of a processor, which is how fast each CPU executes, and the number of cores. Generally, single-threaded or single-process applications will run faster on a CPU with a higher clock speed, even if it has fewer cores, whereas the reverse is true for programs that use multiple cores.<br>

Because the `multiprocessing` package sidesteps the GIL by using processes, which **don't share memory**, you **can't modify values in the process and expect those values to show up in the main program thread**. This is why our `capital_letters1` list was empty. We'll sidestep this issue for now, and focus on seeing how performance changes when we try to use more processes.
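
The following sketch isolates the behavior described above in a few lines. The names `append_value` and `run_demo` are made up for this illustration. The child appends to the list it was given, but the list in the parent stays empty.

```python
import multiprocessing

def append_value(target_list):
    # Runs in the child process, so it changes the child's copy of the list.
    target_list.append(42)
    print("child sees:", target_list)

def run_demo():
    results = []
    process = multiprocessing.Process(target=append_value, args=(results,))
    process.start()
    process.join()
    # The parent's list is untouched, because the child worked on its own copy.
    return results

if __name__ == "__main__":
    print("parent sees:", run_demo())
```

The `if __name__ == "__main__":` guard keeps the demo from running again if the child process re-imports the script, which can happen with the `spawn` start method.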

* Change the code from the last screen to use `4` processes across the email list.
* You don't need to pass in `capital_letters1`, `capital_letters2`, etc, since we've seen that they don't work.
* Make sure to profile the code, assign the result to `total`, and print the total time it took to run.
* Did anything change about the total time? Why do you think this is?

In [22]:
print(len(emails)/4)
print(len(emails)/4*2)
print(len(emails)/4*3)
print(len(emails))

1986.25
3972.5
5958.75
7945


In [28]:
def count_capitals(start, finish):
    capitals = []
    for email in emails['RawText'][start:finish]:
        capitals.append(len([char for char in email
                                   if char.isupper()]))

    return capitals

In [29]:
process1 = multiprocessing.Process(target=count_capitals,
                                  args=(0,1986))
process2 = multiprocessing.Process(target=count_capitals,
                                  args=(1986,3972))
process3 = multiprocessing.Process(target=count_capitals,
                                  args=(3972,5958))
process4 = multiprocessing.Process(target=count_capitals,
                                  args=(5958,7946))

processes = [process1, process2, process3, process4]

start = time.time()

for p in processes:
    p.start()
for p in processes:
    p.join()

print(time.time() - start)

1.5478589534759521


## Inter-Process Communication

You may have noticed in the last screen that our program didn't get any faster with `4` processes than it did earlier with `2` processes (both runs took about `1.55` seconds). This is because the underlying number of processors we have access to stayed constant, so additional processes just added more overhead without enabling more processors to work on our program. In the Dataquest environment, you always have access to `1.5` processors. This means that we created too many processes for our processor counts. It's usually a good idea to have the number of processes you create match the number of processor cores on your system.<br>
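
One way to follow that advice is to ask the operating system how many cores it reports. The sketch below uses `os.cpu_count()`. The helper `pick_process_count` is a made-up name for this illustration. In a restricted environment, the reported count may be higher than what your program can actually use.

```python
import os

def pick_process_count(requested=None):
    # os.cpu_count() can return None when the count can't be determined.
    cores = os.cpu_count() or 1
    if requested is None:
        return cores
    # Use at least 1 process, and no more processes than reported cores.
    return max(1, min(requested, cores))

print(os.cpu_count())
print(pick_process_count())
print(pick_process_count(4))
```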

You may have also noticed that no matter how many processes we use, it's not particularly useful unless we have a way of getting values back. Our list `capital_letters1` was empty. This is because although the parent's memory state was copied to the child when we first created the new process, the child had no way to get modifications made to variables back to the parent, because the memory isn't shared.<br>

In order for the child and the parent to communicate while the child is running, we need to use [Inter-process Communication](https://en.wikipedia.org/wiki/Inter-process_communication), or IPC. As we mentioned earlier, processes can't share memory due to security reasons. In fact, for security, processes can only communicate in very constrained ways. The simplest method of inter-process communication is called a [pipe](https://en.wikipedia.org/wiki/Pipeline). Pipes usually work by having the operating system copy the output written by one process (for example, its standard output) to the input of another process (for example, its standard input).<br>

You may recall using the pipe command (`|`) in the command line course like this:

```bash
ls -l | grep csv
```

The above command will **list all the files in the current directory, then pass the file list into the grep command, so that only lines that contain `csv` are output**. This is a simple example of two processes, `ls -l` and `grep`, communicating. The pipe is unidirectional -- data comes in from `ls -l` via the standard output, and is fed into `grep` via the standard input.<br>

`multiprocessing` uses the same concept of the pipe to communicate between processes, and implements it via the [Pipe](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Pipe) class.<br>

We can use a pipe to pass data between processes like this:

```python
import multiprocessing

def echo_email(email, conn):
    # Sends the email through the pipe to the parent process.
    conn.send(email)
    # Close the connection, since the process will terminate.
    conn.close()

# Creates a parent connection (which we'll use in this thread), and a child connection (which we'll pass in).
parent_conn, child_conn = multiprocessing.Pipe()
# Pass the child connection into the child process.
p = multiprocessing.Process(target=echo_email, args=(email, child_conn,))
# Start the process.
p.start()
# Block until we get data from the child.
print(parent_conn.recv())
# Wait for the process to finish.
p.join()
```

Here's a diagram of what happens in the above example:

![diagram-for-pipe-example](https://s3.amazonaws.com/dq-content/168/pipe.svg)
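
By default, `multiprocessing.Pipe()` returns two connections that can both send and receive. If data only needs to flow from the child to the parent, as with the shell pipe above, you can pass `duplex=False`. Here's a minimal sketch. The names `send_lengths` and `collect_lengths` are made up for this illustration.

```python
import multiprocessing

def send_lengths(texts, conn):
    # Send one result back to the parent, then close our end of the pipe.
    conn.send([len(text) for text in texts])
    conn.close()

def collect_lengths(texts):
    # With duplex=False, the first connection can only receive
    # and the second can only send.
    receiver, sender = multiprocessing.Pipe(duplex=False)
    process = multiprocessing.Process(target=send_lengths, args=(texts, sender))
    process.start()
    # Receive before joining, so the child is never stuck waiting to send.
    lengths = receiver.recv()
    process.join()
    return lengths

if __name__ == "__main__":
    print(collect_lengths(["abc", "hello"]))
```

Calling `recv()` before `join()` matters more as the payload grows, since a child that is still waiting to send data can't exit.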

* Modify the below code to pass two lists `capital_letters1`, and `capital_letters2`, back from the processes:
  * Create two sets of connections using `multiprocessing.Pipe`.
  * Modify `count_capitals_in_emails` to pass data back:
    * Accept an additional argument, `conn`.
    * Initialize a list `capital_letters`, in the function.
      * Append to it in the loop.
    * Send `capital_letters` back over `conn`.
    * Close `conn`.
  * Pass the right arguments into the processes when you create them.
  * Receive the data from each process.
* Print out `capital_letters1` to verify that everything worked.

In [None]:
##### ORIGINAL CODE (before modifying) #####

import multiprocessing

def count_capital_letters(email):
    return len([letter for letter in email if letter.isupper()])

def count_capitals_in_emails(start, finish):
    for email in emails["RawText"][start:finish]:
        capital_letters.append(count_capital_letters(email))

start = time.time()
p1 = multiprocessing.Process(target=count_capitals_in_emails, args=(0, 3972))
p2 = multiprocessing.Process(target=count_capitals_in_emails, args=(3972, 7946))
p1.start()
p2.start()

for process in [p1, p2]:
    process.join()
    
total = time.time() - start

print(total)

In [39]:
##### MODIFIED CODE (after modifying) #####

import multiprocessing

def count_capital_letters(email):
    return len([letter for letter in email if letter.isupper()])

def count_capitals_in_emails(start, finish, conn):
    capital_letters = []
    for email in emails["RawText"][start:finish]:
        capital_letters.append(count_capital_letters(email))
    
    conn.send(capital_letters)
    conn.close()
        
start = time.time()
#p1 = multiprocessing.Process(target=count_capitals_in_emails, args=(0, 3972))
#p2 = multiprocessing.Process(target=count_capitals_in_emails, args=(3972, 7946))
#p1.start()
#p2.start()
#for process in [p1, p2]:
#    process.join()

parent_conn1, child_conn1 = multiprocessing.Pipe()
parent_conn2, child_conn2 = multiprocessing.Pipe()

p1 = multiprocessing.Process(target=count_capitals_in_emails, 
                            args=(0, 3972, child_conn1))
p2 = multiprocessing.Process(target=count_capitals_in_emails, 
                            args=(3972, 7946, child_conn2))

# Start the process.
p1.start();p2.start()

# Block until we get data from the child.
capital_letters1 = parent_conn1.recv()
capital_letters2 = parent_conn2.recv()

# Wait for the process to finish.
p1.join();p2.join()

total = time.time() - start

print(total)

1.6029572486877441


## Worker Pools

Using the `Pipe` class worked well as a form of IPC, and it gave us back the data we wanted. However, it was also very cumbersome to use. **As we scale up our process count, manually creating and starting processes and managing pipes becomes increasingly tedious**.<br>

This is where the idea of the worker pool, as implemented by the [multiprocessing.Pool class](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Pool), comes in. The `Pool` class builds on the low-level process and `Pipe` infrastructure to give us a simple interface for working with processes.<br>

The concept of the worker pool recurs throughout data engineering, so it's an important one to be able to reason about. As a thought experiment, let's say that we want to get `90` documents translated from Spanish to English. We have the money to hire `3` translators, and each translator can translate 1 document at a time. Our stack and translators:

![stack-and-translators](../img/7.png)


Each document may take a different amount of time to translate, and each translator may work at a different speed, so we don't want to just divide up the work beforehand. That could lead to situations like this:

![translators-1](https://s3.amazonaws.com/dq-content/168/translators_1.svg)

Translators `2` and `3` just sat around after they finished translating, waiting for `1` to finish. It would have been more efficient to allow each translator to take whatever work was left at the top of the pile:

![translators-2](https://s3.amazonaws.com/dq-content/168/translators_2.svg)

By sharing the stack, and allowing each translator that needs work to take whatever document is on top of the pile, we cut the total translation time from `60` minutes to `45` minutes.<br>

Earlier, when we divided up the work beforehand, and passed many emails into a process at once, we used a model analogous to the slower method above. Worker pools function more like the second, faster approach. In a worker pool, you create several processes, then pass tasks to the available workers until all of the work is done.<br>
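The difference between the two approaches can be sketched with a small simulation. The task durations below are hypothetical (they are not the ones from the figures above), and `static_split_time` and `shared_stack_time` are illustrative helper names, not part of `multiprocessing`:

```python
import heapq
import math

def static_split_time(durations, n_workers):
    """Total time when the work is divided into contiguous chunks upfront."""
    chunk = math.ceil(len(durations) / n_workers)
    # Each worker handles its own chunk; the slowest worker sets the total time.
    return max(sum(durations[i:i + chunk])
               for i in range(0, len(durations), chunk))

def shared_stack_time(durations, n_workers):
    """Total time when each free worker takes the next task from the pile."""
    # The heap holds the time at which each worker next becomes free.
    free_at = [0] * n_workers
    heapq.heapify(free_at)
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)

# Hypothetical document translation times, in minutes.
durations = [30, 30, 10, 10, 10, 10]

print(static_split_time(durations, 3))  # 60
print(shared_stack_time(durations, 3))  # 40
```

With these made-up durations, the first worker in the static split is stuck with both long documents, while the shared stack lets the other workers pick up the remaining short ones.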

To initialize a pool of workers with `multiprocessing`, you just pass the number of workers you want into the `Pool` class:

```python
from multiprocessing import Pool

# Create a pool of workers.
p = Pool(5)
```

Once we've created a pool, we can use it to perform work by calling the [multiprocessing.Pool.map()](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.map) method. The `map` method takes a function and a list as input. It passes each element of the list into the function, assembles all of the outputs, and returns them as another list, in the same order as the inputs.<br>

Here's an example:

```python
from multiprocessing import Pool

# Create a pool of workers.
p = Pool(5)

def double(x):
    return x * 2

p.map(double, [1,4,5])

```

The above code will output `[2, 8, 10]`.

* Create a `Pool` with `2` workers.
* Use the Pool to find the number of capital letters in each email. Assign the result to `capital_letters`.
* Make sure to profile your code, and assign the total time to `total`.
* Print total. Is the time what you expected?

In [40]:
from multiprocessing import Pool

In [48]:
def count_capital_letters(email):
    return len([letter for letter in email if letter.isupper()])

p = Pool(2)

start = time.time()
capital_letters = p.map(count_capital_letters, emails['RawText'])
total = time.time() - start

print(total)

1.4367668628692627


## Deadlocks

Even though we used a worker pool, which is theoretically more efficient, **you may have noticed that our last code example was actually slower than dividing up the tasks beforehand**. This is because there's a small overhead to a worker retrieving a task from the "stack" of tasks. When each individual unit of work is large, such as a `10` minute document translation, the small overhead doesn't make a big difference.<br>

When you're instead trying to find capital letters, which takes a small fraction of a second for each email, the overhead becomes significant relative to the runtime of each task (in our example, retrieving the task takes `~40%` of the time of actually running the task, which is large). It's important to keep this in mind, and think about dividing tasks upfront when each task will run relatively quickly.<br>
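One way to keep the `Pool` interface while cutting down on this overhead is to hand each worker a batch of emails at a time. `Pool.map()` accepts an optional `chunksize` argument that sets the approximate size of those batches. The sketch below assumes `emails` is any sequence of strings; `even_chunksize` and `count_capitals_with_pool` are hypothetical helper names, and the actual timings will vary by machine:

```python
import math
from multiprocessing import Pool

def count_capital_letters(email):
    return len([letter for letter in email if letter.isupper()])

def even_chunksize(n_tasks, n_workers):
    """Chunk size that gives each worker roughly one large batch."""
    return max(1, math.ceil(n_tasks / n_workers))

def count_capitals_with_pool(emails, n_workers=2):
    emails = list(emails)
    chunksize = even_chunksize(len(emails), n_workers)
    # The context manager shuts the pool down once the work is done.
    with Pool(n_workers) as pool:
        # Tasks are sent to the workers in batches of `chunksize`
        # instead of one at a time.
        return pool.map(count_capital_letters, emails, chunksize=chunksize)
```

Calling `count_capitals_with_pool(emails["RawText"])` should then behave much like dividing the emails between two processes upfront, while still returning a single list. On platforms that start processes by launching a fresh interpreter, make the call from under an `if __name__ == "__main__":` guard.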

One thing to be very aware of when using threads or processes is the potential for deadlocks. Deadlocks happen when two threads or processes (we'll simplify by referring to both as "processes") both require a lock that the other process has before proceeding. An example is a hostage situation:
* A criminal has a hostage.
* The police have the jewels the criminal demanded in exchange for the hostage.
* The criminal won't release the hostage until he gets the jewels.
* The police won't give the criminal the jewels until they get the hostage.

This results in a deadlock -- neither side will release what they have until they first receive something from the other side. This results in both parties waiting forever.<br>

A similar situation can arise when two processes want to access shared resources in a different order:

```python
import multiprocessing

# Note: db_write(), db_read(), and data are placeholders for real database code.

db_lock = multiprocessing.Lock()
stdout_lock = multiprocessing.Lock()

def write_to_database(data, db_lock, stdout_lock):
    stdout_lock.acquire()
    print("Writing {} lines of data to the database.".format(len(data)))
    db_lock.acquire()
    db_write(data)
    db_lock.release()
    stdout_lock.release()


def read_from_database(db_lock, stdout_lock):
    db_lock.acquire()
    data = db_read()

    stdout_lock.acquire()
    print("Read {} lines of data from the database.".format(len(data)))
    stdout_lock.release()
    db_lock.release()

p1 = multiprocessing.Process(target=read_from_database, args=(db_lock, stdout_lock))
p2 = multiprocessing.Process(target=write_to_database, args=(data, db_lock, stdout_lock))


p1.start()
p2.start()
```

In the above example, `p1` immediately grabs the `db_lock`, and `p2` immediately grabs the `stdout_lock`. Both then try to acquire the other lock without releasing their existing lock. This causes them to wait forever:

![p1-p2-deadlock](../img/8.png)

When you use locks and multiple resources, it's worth printing out the status of each lock if you notice your program hanging -- this will help you debug.
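A common way to avoid this kind of deadlock is to make every process acquire the locks in the same order. The sketch below uses threads and an in-memory list (`database`, a hypothetical stand-in for a real database) so that it is self-contained; the same ordering idea applies to `multiprocessing.Lock`:

```python
import threading

db_lock = threading.Lock()
stdout_lock = threading.Lock()

# Hypothetical in-memory stand-in for a real database.
database = []

def write_to_database(data):
    # Always take db_lock first, then stdout_lock.
    with db_lock:
        database.extend(data)
        with stdout_lock:
            print("Wrote {} lines of data to the database.".format(len(data)))

def read_from_database():
    # Same order as the writer: db_lock first, then stdout_lock.
    with db_lock:
        data = list(database)
        with stdout_lock:
            print("Read {} lines of data from the database.".format(len(data)))

writer = threading.Thread(target=write_to_database, args=(["a", "b", "c"],))
reader = threading.Thread(target=read_from_database)

writer.start()
reader.start()

# A timeout on join() keeps the program from hanging if something goes wrong.
writer.join(timeout=5)
reader.join(timeout=5)
```

Because neither function ever holds `stdout_lock` while waiting for `db_lock`, the circular wait from the earlier example can't occur. Using `with` also guarantees that each lock is released even if the code inside raises an exception.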


## Next Steps

In this mission, we learned how processes can help you overcome some of the limitations of threads.<br>

Threads are best in situations where:
* Your task is I/O bound.
* You need a lightweight way to run tasks concurrently.
* You want to share memory, and don't want the complexity of dealing with IPC.

Processes are best in situations where:
* Your task is CPU bound.
* Your task takes long enough that it's worth the additional overhead.

Both can make your programs significantly faster when used correctly. If you want to read more, check out these resources:
* [multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html)
* [A look at Python's internals](https://tech.blog.aknin.name/2010/04/02/pythons-innards-introduction/)
* [Wikipedia article on processes](https://en.wikipedia.org/wiki/Process)
* [CPython on Wikipedia](https://en.wikipedia.org/wiki/CPython)

In the next mission, we'll look more closely at how to use parallelization and threading to speed up data science work.