# System Calls With the subprocess Module

In this section we will cover how to use the **subprocess** module to interact with other software, making system calls, collecting data, and automating your entire analysis pipeline.

In bioinformatics we will often want to use other people's software, whether that's Python packages like Biopython or packages written in other languages, like the highly efficient C. Rather than reinvent the wheel, we use many specialized packages to quickly and easily perform tasks such as alignment, SNP calling, or phylogenetics. Even simple Unix tools like **wc** can be useful to count the number of genes in a GFF file or the length of a FASTA sequence.

Most bioinformatics tools, like Unix tools, are run through the command line. Given that you might want to repeat an analysis using some of these tools on, for example, hundreds of genes, manually entering the commands for your entire pipeline is not only boring and error-prone but a waste of your time.

Here we will teach you to automate your pipeline by using Python to run these command line tools for you.

---
## Advantages to Automation

**Reproducibility**

Perhaps the greatest advantage to a scientist in automating analysis is that the analysis can be reproduced exactly. Your exact methods are laid out in your Python script, where you and others can scrutinize, repeat, and modify them.

**Less Tedium**

From printing out the name of every gene expressed over a certain level, to BLASTing those genes against the NCBI database, to sorting and counting the resulting hits, scripting saves you a huge amount of tedious labor. Nobody wants to type, or even copy and paste, hundreds of BLAST queries.

**Consistency**

As well as being incredibly mind-numbing, manually running bioinformatics tools is dangerous. What if you accidentally type 'Gasterosteus_aculeatus_CA_SNPs.vcf' instead of 'Gasterosteus_aculeatus_AC_SNPs.vcf', accidentally substituting your California population for your Atlantic Coast population? Or 'clean_reads.py expensive_dataset.fq > expensive_dataset.fq' instead of 'clean_reads.py expensive_dataset.fq > expensive_dataset.clean.fq'? There are *thousands* of ways you can accidentally screw up your analysis to either ruin your day or produce erroneous results.

Automation reduces the risk of stupid typos and other accidents. You won't forget to include mydata.part.14.bam in the analysis when you run **results = [analyse(data) for data in mydata]**.

**Parallelization**

Modern computers, even budget laptops, now have multiple processor cores, which means you can run several analyses at once, or even hundreds if you have access to a supercomputing cluster! Python provides a number of tools to help you manage these processes and make the most of parallel computing.

---

## subprocess.check_output()
The easiest way to make system calls is with the __check_output()__ function in the __subprocess__ module.

In [None]:
import subprocess as sp
 
output = sp.check_output('ls', shell=True)

print output
type(output)

Here, we used __subprocess.check_output()__ to run the command **ls**, and captured the output in the *output* variable.

You'll notice that **check_output()** takes one mandatory argument: the command you want to run, written as you would type it into your terminal shell. We will leave the __shell__ keyword argument as __True__ for this tutorial. Without it, the process is created directly by the operating system rather than by a shell, so shell syntax such as ">" and "|" is not interpreted, and a command string containing spaces is treated as the name of a single program, which results in an error. Because the shell is what splits a command on spaces, when calling __check_output()__ without __shell=True__ the first argument should be a list of the command line arguments that would be separated by spaces on the terminal. So **ls -a -1 /home/james** would become **['ls', '-a', '-1', '/home/james']**.
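
As a minimal sketch of the list form, the example below uses the standard library's `shlex.split()` to show how a command string breaks into arguments, then runs a list without a shell. The path `.` stands in for `/home/james`, which may not exist on your machine, and `universal_newlines=True` is added so the output comes back as text on both Python 2.7 and Python 3.

```python
import shlex
import subprocess as sp

# shlex.split() breaks a command string into arguments the way a shell would.
args = shlex.split('ls -a -1 /home/james')
print(args)

# Run the list form without a shell. The path '.' is an assumption standing
# in for /home/james, so that the example runs anywhere.
listing = sp.check_output(['ls', '-a', '-1', '.'], universal_newlines=True)
print(listing)
```

Note that no `shell=True` is passed here, so characters like ">" or "|" inside the list would be handed to **ls** as plain arguments rather than interpreted.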

## Processing output a line at a time...

As you saw above, the variable __output__ contains a multi-line string. Well, that's all fine and good, but what if you want to loop over some output and take action on each line?

In [None]:
import subprocess as sp
 
output = sp.check_output('ls', shell=True)
output_list = output.rstrip().split('\n')
print output_list
for cur_line in output_list:
    print "Here's a line:",cur_line
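
A possible variation on the loop above, sketched with made-up input: here **printf** stands in for any command that prints several lines, and the gene names are invented for illustration. The string method `splitlines()` does the work of `rstrip().split('\n')` in one step, and `universal_newlines=True` asks for text rather than bytes, which matters on Python 3.

```python
import subprocess as sp

# printf stands in for any command that prints several lines (assumed input).
command = r"printf 'geneA\ngeneB\ngeneC\n'"
output = sp.check_output(command, shell=True, universal_newlines=True)

# splitlines() drops the line endings and leaves no empty final entry.
gene_names = output.splitlines()
for number, name in enumerate(gene_names, 1):
    print(str(number) + ": " + name)
```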

### Redirecting Output to Files
Since we are using __shell=True__, you can redirect the output of a command to a file exactly as you would on the terminal.

To demonstrate this, we will be asking Python to do something very meta: run another Python script! Here's the script we will be running, which should be saved in your 5.2 directory as 'test_output.py'.

In [None]:
#!/usr/bin/env python
print 'this is a test'
print 'this is only a test'

Remember that we can redirect the output of a command to a file with __>__.

In [None]:
import subprocess as sp

command = 'python test_output.py > out.txt'
output = sp.check_output(command, shell=True)
print output

This time there is no output to print, since we redirected it. Check 'out.txt' to make sure the output went where you expected it.
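
If you would rather check the result from Python than open the file by hand, something like the following sketch could work. It uses **echo** instead of 'test_output.py' so that it is self-contained, and it runs in a temporary directory (an assumption made here so that no existing 'out.txt' is overwritten).

```python
import os
import subprocess as sp
import tempfile

# Work in a throwaway directory so no existing out.txt is overwritten.
work_dir = tempfile.mkdtemp()

command = "echo 'this is a test' > out.txt"
output = sp.check_output(command, shell=True, cwd=work_dir,
                         universal_newlines=True)
print(repr(output))  # the captured output is empty; it went to the file

# Read the file back to confirm the redirect worked.
with open(os.path.join(work_dir, 'out.txt')) as handle:
    contents = handle.read()
print(contents)
```

The `cwd` keyword tells **check_output()** which directory to run the command in, so the relative path 'out.txt' ends up inside `work_dir`.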

### Chaining Commands With Pipes
We often want to do something else with the output of a program, either parsing it and reformatting it, performing a second step in the analysis, or turning it into a figure. To do this, we will *pipe* the output of one command directly into another command.

A *pipe* is used to send the output of one program into the input of another. We learned in the first lecture that this is done on the Unix command line with the **|** character. For example, this script should count the number of files in the current directory whose names contain the word "yeast".

In [None]:
%%bash
echo "What directory am I in?"
pwd
echo
echo "List the files, long form:"
ls -l
echo
echo "now, only yeasts...."
ls -1 | grep 'yeast'
echo
echo "Now, count the yeasts..."
ls -1 | grep 'yeast' | wc -l

In [None]:
import subprocess as sp

command = "ls -1 | grep 'yeast' | wc -l"
print sp.check_output(command, shell=True)
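
The count comes back as text, so it usually needs converting before you can do arithmetic with it. The sketch below runs the same pipeline in a temporary directory holding three empty files whose names are invented for illustration, then turns the result into an integer.

```python
import os
import subprocess as sp
import tempfile

# Create a throwaway directory with three empty, made-up files.
work_dir = tempfile.mkdtemp()
for file_name in ['yeast_1.fa', 'yeast_2.fa', 'human.fa']:
    open(os.path.join(work_dir, file_name), 'w').close()

command = "ls -1 | grep 'yeast' | wc -l"
output = sp.check_output(command, shell=True, cwd=work_dir,
                         universal_newlines=True)

# wc prints text (possibly padded with spaces), so strip and convert it.
yeast_count = int(output.strip())
print(yeast_count)
```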

## Another way to run

Calling programs directly from Python is a good way to run them. But sometimes it's nice to have a check: if I'm about to move a whole bunch of files (scary!), it can be nice to print the commands instead of running them directly.

This lets you inspect the generated commands before running them and, in the case of a group of moves, try one first. If it looks good, cut and paste all of them to the command line....

At the end of this lesson, we'll see a better way to do the above.

In [None]:
import subprocess as sp

# Another way to run: print a "wc" command for each file instead of running it
output = sp.check_output('ls', shell=True)
output_list = output.rstrip().split('\n')
for cur_line in output_list:
    print "wc",cur_line
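
One possible variation on this idea, not necessarily the approach this lesson returns to later, is to wrap the choice in a small helper. The function name `run` and its `dry_run` flag are hypothetical names chosen for this sketch, and the file names are made up.

```python
import subprocess as sp

def run(command, dry_run=True):
    """Print the command when dry_run is True; otherwise run it and return its output."""
    if dry_run:
        print(command)
        return None
    return sp.check_output(command, shell=True, universal_newlines=True)

file_names = ['a.txt', 'b.txt']  # made-up names for illustration
for name in file_names:
    run('wc ' + name)  # dry run: only prints the command

# Once the printed commands look right, switch the flag to really run one.
greeting = run('echo hello', dry_run=False)
print(greeting)
```

Defaulting `dry_run` to `True` means a command is only executed when you ask for it explicitly, which is the safer choice for scary operations like moving files.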

# Handling errors

In [None]:
# Ice cream fun! 
def favorite_ice_cream(n):
    ice_creams = [
        "chocolate",
        "vanilla",
        "strawberry"
    ]
    print(ice_creams[n])

In [None]:
favorite_ice_cream(2)

In [None]:
favorite_ice_cream(3)

Well, that didn't work: the list only has positions 0, 1, and 2, so asking for position 3 raises an IndexError. Now that we've defined __favorite_ice_cream()__, let's try to run it safely....

In [None]:
print "Let's ask about our favorite ice cream!"
try:
    favorite_ice_cream(3)
except:
    print "Woah, nelly! That was a baaaad idea!"
print "But we will carry on."

In [None]:
print "Let's ask about our favorite ice cream!\n"
try:
    favorite_ice_cream(3)
except Exception as e:
    #print "we got this kind of error:", type(e)
    print "oops, we got this error: '" + str(e) +"'\n"
print "But we will carry on."
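
Catching every kind of exception can hide bugs you did not expect, so it is often safer to name the error you are prepared for. Here is a minimal sketch that catches only IndexError. Note two assumptions that differ from the cells above: this version of `favorite_ice_cream()` returns the flavor instead of printing it, and `safe_favorite` is a hypothetical helper name.

```python
def favorite_ice_cream(n):
    ice_creams = ["chocolate", "vanilla", "strawberry"]
    return ice_creams[n]  # returns rather than prints, unlike the cell above

def safe_favorite(n):
    """Return the flavor at position n, or None if there is no such position."""
    try:
        return favorite_ice_cream(n)
    except IndexError as e:
        # Only an out-of-range position is handled; other errors still surface.
        print("No flavor number " + str(n) + ": " + str(e))
        return None

print(safe_favorite(2))
print(safe_favorite(3))
```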

## Assert and the truth

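Let's play with "assert", a statement that raises an error (an AssertionError) when the condition it is given is false. That seems silly; why bother? It's about 'preconditions': things that must be true before the rest of the code can be trusted....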

In [None]:
pi = 3.14159
def area(radius):
    # Pi had better be defined to at least 5 digits....
    assert (pi == 3.14159)
    return  pi * radius * radius

In [None]:
print area(100)

In [None]:
pi = 3.14
print area(100)

In [None]:
def favorite_ice_cream(n):
    
    ice_creams = [
        "chocolate",
        "vanilla",
        "strawberry"
    ]
    assert (n < len(ice_creams))
    print(ice_creams[n])

In [None]:
favorite_ice_cream(3)
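
An assert can also carry a message, which makes the failure easier to understand. The sketch below is one way this might look. The lower bound `0 <= n` is an addition made here, since the original check lets negative numbers through (Python treats `ice_creams[-1]` as the last item), and this version returns the flavor rather than printing it.

```python
def favorite_ice_cream(n):
    ice_creams = ["chocolate", "vanilla", "strawberry"]
    # Precondition: n must be a valid position in the list.
    assert 0 <= n < len(ice_creams), "no ice cream number " + str(n)
    return ice_creams[n]

try:
    favorite_ice_cream(3)
    message = None
except AssertionError as e:
    # The text after the comma in the assert becomes the error message.
    message = str(e)
print(message)
```

Keep in mind that Python skips assert statements entirely when it is started with the `-O` option, so they are best used to catch programming mistakes rather than to validate user input.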

## Bringing it all together - what happens when an external program blows up?

Let's run a version of the ice cream program above - this should be nothing new...

In [3]:
import subprocess as sp

output = sp.check_output('./good_ice_cream.py', shell=True)
print output

vanilla



In [7]:
import subprocess as sp

output = sp.check_output('./bad_ice_cream.py', shell=True)
print output

CalledProcessError: Command './bad_ice_cream.py' returned non-zero exit status 1

In [6]:
import subprocess as sp
try:
    output = sp.check_output('./bad_ice_cream.py', shell=True)
except Exception as e:
    output = "Well, that didn't work..."
    print "We got error: '" + str(e) + "'"
print output

We got error: 'Command './bad_ice_cream.py' returned non-zero exit status 1'
Well, that didn't work...
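
The exception raised here is a `subprocess.CalledProcessError`, and it carries details about what went wrong. As a sketch, the command below stands in for 'bad_ice_cream.py' (whose contents are not shown in this lesson): it prints something and then exits with status 3. Catching the specific exception lets us read the exit status and any output produced before the failure.

```python
import subprocess as sp

# A stand-in for a failing program: print something, then exit with status 3.
command = "echo 'partial output'; exit 3"
try:
    output = sp.check_output(command, shell=True, universal_newlines=True)
    exit_status = 0
except sp.CalledProcessError as e:
    exit_status = e.returncode  # the non-zero exit status
    output = e.output           # whatever was printed before the failure
print(exit_status)
print(output)
```

Checking `returncode` can be useful when a tool uses different exit statuses to signal different problems.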


# Writing and running programs outside of the notebook...

* SublimeText
* PyCharm
* nano

## Paste "another way to run" and execute it with bash...

In [None]:
#!/usr/bin/env python

import subprocess as sp

# Another way to run: print a "wc" command for each file instead of running it
output = sp.check_output('ls', shell=True)
output_list = output.rstrip().split('\n')
for cur_line in output_list:
    print "wc",cur_line