### Parallel processing

This is probably the hardest tutorial of the day. I have done the best I can to make it as clear and simple as possible. It took me a very long time to understand parallel processing, and I hope this tutorial helps you understand it in a much shorter time than I did.

Parallel processing is highly important in bioinformatics and is often taken for granted. You might think you know how to run things in parallel because you add '-t 20' to the options of your command, but that doesn't help you grasp what's going on under the hood.

"You never truly understand the benefits of parallel processing until you can no longer do it."

The example here is heavily influenced by a fantastic Stack Overflow [thread](http://stackoverflow.com/questions/14533458/python-threading-multiple-bash-subprocesses).

An important example of parallel processing in nanopore sequencing is extracting fastq information from a set of fast5 files. This can take up to a day...

In our example today we will generate 200 commands that each take between 1 and 9 seconds to complete. We will run them in five parallel streams, which should take roughly one fifth of the time a single stream would need.
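
To see where the "roughly one fifth" estimate comes from, here is a small simulation that launches no real processes. It is only a sketch: `simulate` is a hypothetical helper, the durations are drawn with `random.randrange(1, 10)`, and it assumes a free stream picks up the next waiting command immediately.

```python
import heapq
import random


def simulate(durations, streams):
    """Return the total time needed when `streams` commands can run at once."""
    # Each heap entry is the time at which one stream becomes free.
    free_at = [0] * streams
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # The earliest free stream takes the next command.
        heapq.heappush(free_at, start + duration)
    return max(free_at)  # Finished when the last stream is done.


random.seed(1)
durations = [random.randrange(1, 10) for _ in range(200)]

serial_time = simulate(durations, 1)    # One stream: simply the sum of all durations.
parallel_time = simulate(durations, 5)  # Five streams sharing the same queue.

print("One stream:   %d seconds" % serial_time)
print("Five streams: %d seconds" % parallel_time)
```

The five-stream time can never be lower than one fifth of the single-stream time, and it is usually slightly higher because the streams do not all finish at the same moment.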

In [76]:
# Import the libraries we will need
import subprocess  # The library that will actually be talking to the shell and
                   # telling it what to run and when.
from itertools import islice  # Important tool that lets us take the first few
                              # items from an iterator, leaving the rest for later.
import random  # This will determine how long each process will take.
random.seed(1)  # Feel free to change this, but useful in the notebook so the author can explain
                # the output even if the output is 'random'

In [77]:
# Set the number of threads
threads = 5  # Of the 200 commands, five will be running at any one time.

# We need to have 5 separate output files to stop each running command from
# overwriting the work of a simultaneous command.
output_files = ["output.file.%d" % i for i in range(0, threads)]  # output.file.0 to output.file.4

file_handlers = [None]*threads  # Generates a list of five None placeholders.

# This assigns the file handler for each file.
for index, output_file in enumerate(output_files):
    file_handlers[index] = open(output_file, 'w')

for handler in file_handlers:  # Print the file handler so we know what they look like.
    print handler

<open file 'output.file.0', mode 'w' at 0x0000000003B6EE40>
<open file 'output.file.1', mode 'w' at 0x0000000003BDE420>
<open file 'output.file.2', mode 'w' at 0x0000000003BDE5D0>
<open file 'output.file.3', mode 'w' at 0x0000000003BDE660>
<open file 'output.file.4', mode 'w' at 0x0000000003CE5C00>


Ugh! It looks kind of ugly, but we can see that each of our output files is in an 'open' state
and that we are 'writing' to them (rather than reading or appending).

Now we will make a list of random numbers and a list of associated commands.
These commands will sleep for a 'random' number of seconds, then wake up and print which of the 200 commands they were, and how long they slept for.

In [78]:
random_number_list = [random.randrange(1,10) for i in range(0,200)]
commands = ["sleep %d && echo Command number - %d. Slept for %d." % (j, i, j)
            for i, j in enumerate(random_number_list)]
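
To make the format string easier to read, here is the same pattern applied to three made-up durations. The names `example_durations` and `example_commands` are hypothetical and exist only for this illustration. In each command, the index `i` is the command number and `j` is the number of seconds to sleep.

```python
example_durations = [3, 7, 1]  # Made-up sleep times, not the real random list.

example_commands = ["sleep %d && echo Command number - %d. Slept for %d." % (j, i, j)
                    for i, j in enumerate(example_durations)]

for command in example_commands:
    print(command)
# sleep 3 && echo Command number - 0. Slept for 3.
# sleep 7 && echo Command number - 1. Slept for 7.
# sleep 1 && echo Command number - 2. Slept for 1.
```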

Now we're going to make some processes using the subprocess module. The subprocess module can talk to the terminal and boss it around, then pull in the terminal's output.
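
Before running 200 commands, it may help to watch a single one. The sketch below assumes a Unix-like shell where `sleep` and `echo` are available; the command string and the variable names are made up for illustration. It shows the two `Popen` methods this tutorial relies on: `poll()`, which returns `None` while the command is still running, and `communicate()`, which waits for the command to finish and hands back its output.

```python
import subprocess

# Launch one command. Popen returns straight away; it does not wait for the command.
process = subprocess.Popen("sleep 1 && echo hello from the shell", shell=True,
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)

status_while_running = process.poll()   # None, because the command is still sleeping.

stdout, stderr = process.communicate()  # Wait for the command and collect its output.
status_when_done = process.poll()       # Now the exit code: 0 means success.

print(status_while_running)
print(status_when_done)
print(stdout)
```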

In [79]:
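# Talk to the shell. Note these commands won't run just yet: this is a generator
# expression, so each command only starts when it is pulled from the generator.
processes = (subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            for cmd in commands)

# We use islice to pull the first five processes from the generator, which starts them.
# The other 195 commands stay in the generator until a slot becomes free.
running_processes = list(islice(processes, threads))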

while running_processes:
    for i, process in enumerate(running_processes):
        if process.poll() is not None:  # Means that the process is complete!
            stdout, stderr = process.communicate()  # Get the output of the completed process.
            file_handlers[i].write(str(stdout) + "\n")  # Write the output to the handler for this slot.
            running_processes[i] = next(processes, None)  # Start the next command in the list.
            if running_processes[i] is None:  # No more commands waiting to be processed.
                del running_processes[i]  # Not a valid process, so remove it from the list.
                break
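
The loop above only works because the generator is lazy: a command is launched at the moment it is pulled out, not when the generator is defined. The sketch below shows that behaviour without starting any real processes. `launch` is a hypothetical stand-in for `subprocess.Popen` that simply records which jobs have been started.

```python
from itertools import islice

started = []  # Records which jobs have been "launched" so far.


def launch(job):
    """Hypothetical stand-in for subprocess.Popen: just note that the job started."""
    started.append(job)
    return job


jobs = (launch(n) for n in range(10))  # Nothing has been launched yet.
print(started)  # []

first_three = list(islice(jobs, 3))  # Pulls, and therefore launches, only three jobs.
print(started)  # [0, 1, 2]

next_job = next(jobs, None)  # A free slot pulls one more job.
print(started)  # [0, 1, 2, 3]

remaining = list(jobs)  # Drain the rest of the generator.
print(next(jobs, None))  # None: this is the signal that no commands are left.
```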

Before we can see anything we need to close the file handlers.

In [80]:
# Closing a file handler flushes everything buffered in it to the file.
for handler in file_handlers:
    handler.close()
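
As an aside, when a file is only needed for a short block of code, a `with` statement closes the handler for you. This is a minimal sketch, not part of the pipeline above: the scratch file path and the line written to it are made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "output.file.example")

with open(path, "w") as handler:
    handler.write("Command number - 0. Slept for 3.\n")
# Leaving the with-block closed the handler, which flushed the text to disk.

with open(path) as handler:
    contents = handler.read()

print(contents)
```

In the parallel example the handlers have to stay open for the whole run, which is why they are opened and closed by hand there.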

In [81]:
# Now let's have a look at the first few lines of each file.
number_of_lines = 3
for output_file in output_files:
    with open(output_file) as output_handler:
        head = list(islice(output_handler, number_of_lines))
    print "### " + output_file + " ###"
    print head

### output.file.0 ###
['\n', '\n', '\n']
### output.file.1 ###
['\n', '\n', '\n']
### output.file.2 ###
['\n', '\n', '\n']
### output.file.3 ###
['\n', '\n', '\n']
### output.file.4 ###
['\n', '\n', '\n']


Notice how the first line of each file is in order. Now look at which of these first commands was the fastest to finish: its stream picked up the next command. It is therefore highly likely that each of these output files has a different 