### Subprocesses

One of the biggest strengths of Python is that it can be used as a *glue* language. <br>
It can 'glue' together a series of programs into a flexible and highly extensible pipeline.

### Why subprocesses
One of the most common, yet complicated, tasks that most programming languages need to do is creating new processes. <br>
This could be as simple as seeing what files are present in the current working directory (`ls`) or as complicated as creating a program workflow that *pipes* output from one program into another program's input. <br/><br/>
Many such tasks are easily taken care of through the use of Python libraries and modules (`import`) that *wrap* the programs into Python code, effectively creating Application Programming Interfaces (API). <br/><br/>
However, there are many use cases that require the user to make calls to the terminal from ***within*** a Python program.

#### Operating System Conundrum

First, we need to address the following issue. As many in this class have found out, Python can be installed on most operating systems, but doing the same thing in one operating system (Unix) may not always yield the same results in another (Windows).<br/><br/>
The very first step to making a program **"OS-agnostic"** is through the use of the `os` module.

In [1]:
import os

https://docs.python.org/3/library/os.html

In [3]:
# dir(os)

In [4]:
for elem in dir(os):
    if "error" in elem:
        print(elem)

error
strerror


In [5]:
# The name of the operating system dependent module imported. 
# The following names have currently been registered: 'posix', 'nt', 'java'
# Portable Operating System Interface -  IEEE standard designed to facilitate application portability
# (Windows) New Technology - a 32-bit operating system that supports preemptive multitasking
# 
os.name

'posix'

In [8]:
# Returns information identifying the current operating system. The return value is an object with five attributes:
# - sysname - operating system name
# - nodename - name of machine on network (implementation-defined)
# - release - operating system release
# - version - operating system version
# - machine - hardware identifier

os.uname() # does not work for windows

posix.uname_result(sysname='Darwin', nodename='0587368551.wireless.umich.net', release='18.7.0', version='Darwin Kernel Version 18.7.0: Tue Jan 12 22:04:47 PST 2021; root:xnu-4903.278.56~1/RELEASE_X86_64', machine='x86_64')

In [9]:
import platform # works for Windows
platform.uname()

uname_result(system='Darwin', node='0587368551.wireless.umich.net', release='18.7.0', version='Darwin Kernel Version 18.7.0: Tue Jan 12 22:04:47 PST 2021; root:xnu-4903.278.56~1/RELEASE_X86_64', machine='x86_64', processor='i386')

In [10]:
import sys

# https://docs.python.org/3/library/sys.html
# This string contains a platform identifier that can be used to append platform-specific components
# to sys.path, for instance.
    
sys.platform

'darwin'
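
The identifiers above can drive a simple branch when a command differs between systems. The following is only a minimal sketch under that assumption; `listing_command` is a hypothetical helper name introduced here for illustration, not something provided by the standard library.

```python
import platform


def listing_command():
    """Pick a directory-listing command for the current OS (hypothetical helper)."""
    # platform.system() returns e.g. 'Windows', 'Linux', or 'Darwin'
    if platform.system() == "Windows":
        # dir is built into cmd.exe, so it has to be launched through cmd
        return ["cmd", "/c", "dir"]
    return ["ls", "-alh"]


cmd = listing_command()
print(cmd)
```

Returning the command as a list keeps it ready to hand to the `subprocess` functions discussed later in this session.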

In [11]:
# A list of strings that specifies the search path for modules. 

sys.path

['/Users/mitrea/Documents/CLASSES/FALL_2021/BIOINF 575 FA 2021/Session 26 - subprocesses',
 '/opt/anaconda3/lib/python38.zip',
 '/opt/anaconda3/lib/python3.8',
 '/opt/anaconda3/lib/python3.8/lib-dynload',
 '',
 '/opt/anaconda3/lib/python3.8/site-packages',
 '/opt/anaconda3/lib/python3.8/site-packages/aeosa',
 '/opt/anaconda3/lib/python3.8/site-packages/IPython/extensions',
 '/Users/mitrea/.ipython']

In [13]:
# A mapping object representing the string environment.

os.environ['HOME']

'/Users/mitrea'

In [12]:
os.environ

environ{'TERM_PROGRAM': 'Apple_Terminal',
        'SHELL': '/bin/bash',
        'TERM': 'xterm-color',
        'TMPDIR': '/var/folders/dg/8l7ql9hs6j71k502y3_6r3jh0000gp/T/',
        'CONDA_SHLVL': '2',
        'Apple_PubSub_Socket_Render': '/private/tmp/com.apple.launchd.U6kIWXnzqk/Render',
        'CONDA_PROMPT_MODIFIER': '(base) ',
        'TERM_PROGRAM_VERSION': '421.2',
        'GSETTINGS_SCHEMA_DIR_CONDA_BACKUP': '',
        'TERM_SESSION_ID': 'C26DDA13-545D-4362-9ECB-296FB7FF2197',
        'LC_ALL': 'en_US.UTF-8',
        'USER': 'mitrea',
        'CONDA_EXE': '/opt/anaconda3/bin/conda',
        'SSH_AUTH_SOCK': '/private/tmp/com.apple.launchd.aO5rl7YB7Y/Listeners',
        'MKL_INTERFACE_LAYER': 'LP64,GNU',
        '_CE_CONDA': '',
        'CONDA_PREFIX_1': '/Users/mitrea/anaconda3',
        'CONDA_ROOT': '/opt/anaconda3',
        'PATH': '/opt/anaconda3/bin:/Users/mitrea/anaconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin',
        'GSETTINGS_SCH

In [14]:
#Return the value of the environment variable key if it exists, 
#or default if it doesn’t. key, default and the result are str.

os.getenv("HOME")

'/Users/mitrea'
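
The `default` argument mentioned in the comment is useful when a variable may not be defined. A small sketch, assuming that the variable name `BIOINF_DEMO_UNSET_VARIABLE` (made up for this example) is not set on your machine:

```python
import os

# Variable name assumed not to exist (hypothetical, for illustration only)
name = "BIOINF_DEMO_UNSET_VARIABLE"

missing = os.getenv(name)                # None when the variable is not defined
fallback = os.getenv(name, "not set")    # the supplied default is returned instead

print(missing)
print(fallback)

# Indexing os.environ directly raises KeyError for a missing key
try:
    os.environ[name]
except KeyError:
    print("KeyError raised by os.environ[...]")
```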

In [15]:
os.getenv("PATH")

'/opt/anaconda3/bin:/Users/mitrea/anaconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin'

In [16]:
os.getenv("PATH").split(os.pathsep)  # os.pathsep is ':' on POSIX and ';' on Windows

['/opt/anaconda3/bin',
 '/Users/mitrea/anaconda3/condabin',
 '/usr/local/bin',
 '/usr/bin',
 '/bin',
 '/usr/sbin',
 '/sbin',
 '/Library/TeX/texbin']

In [17]:
# Returns the list of directories that will be searched for a named executable,
#similar to a shell, when launching a process. 
# env, when specified, should be an environment variable dictionary to lookup the PATH in. 
# By default, when env is None, environ is used.

os.get_exec_path()

['/opt/anaconda3/bin',
 '/Users/mitrea/anaconda3/condabin',
 '/usr/local/bin',
 '/usr/bin',
 '/bin',
 '/usr/sbin',
 '/sbin',
 '/Library/TeX/texbin']

The `os` module wraps OS-specific operations into a set of standardized commands. <br>
For instance, the Linux end-of-line (EOL) character is a `\n`, but `\r\n` in Windows. <br>
In Python, we can just use the following:

In [18]:
# EOL - for the current (detected) environment

'''
The string used to separate (or, rather, terminate) lines on the current platform. 
This may be a single character, such as '\n' for POSIX, or multiple characters, 
for example, '\r\n' for Windows. 
Do not use os.linesep as a line terminator when writing files opened in text mode (the default); 
use a single '\n' instead, on all platforms.
'''

os.linesep

'\n'
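
The advice in the docstring above can be checked directly. The sketch below writes `'\n'` in text mode and then reads the raw bytes back; the file name `lines.txt` is arbitrary and the file lives in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines.txt")
    # Text mode (the default): '\n' is translated to the platform line ending
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("first\nsecond\n")
    # Binary mode shows the bytes that were actually written to disk
    with open(path, "rb") as handle:
        raw = handle.read()

expected = ("first" + os.linesep + "second" + os.linesep).encode("utf-8")
print(raw == expected)
```

Writing `os.linesep` explicitly in text mode on Windows would double the carriage return, which is why a plain `'\n'` is recommended.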

As another example, in a Linux environment the following command lists the contents of a given directory:
```
ls -alh 
```

In Windows, the equivalent is as follows:
```
dir
```

Python allows users to use a single command, regardless of the OS:

In [19]:
# List directory contents

os.listdir("demoCM")

['testing.txt', '.ipynb_checkpoints', 'test_script.sh']

However, the biggest issue for creating an OS-agnostic program is ***paths***: <br/>
Windows: `"C:\\Users\\MDS\\Documents"`<br/>
Linux: `/mnt/c/Users/MDS/Documents/`<br/><br/>
Enter Python:

In [20]:
# path joining from pwd
pwd = os.getcwd()
os.path.join(pwd,"test.py")

'/Users/mitrea/Documents/CLASSES/FALL_2021/BIOINF 575 FA 2021/Session 26 - subprocesses/test.py'
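
As a further sketch, `os.path.join` can be compared with `pathlib`, the object-oriented alternative in the standard library. The directory names `results` and `run1` are hypothetical and nothing is created on disk.

```python
import os
from pathlib import Path

# os.path.join inserts the separator of the current OS (os.sep)
joined = os.path.join("results", "run1", "test.py")
print(joined)

# pathlib builds the same path with the / operator
script = Path("results") / "run1" / "test.py"
print(script.name)      # file name component
print(script.suffix)    # file extension
print(str(script) == joined)
```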

### `subprocess`

If you search for how to run shell commands without specifying Python 3.x, you will likely get an answer that includes `popen`, `popen2`, or `popen3`. These were once the most common ways to *open* a new *p*rocess. The `subprocess` module (added in Python 2.4) consolidated them, and since Python 3.5 its `run` function is the recommended way to launch a process.

In [21]:
# Import and alias
import subprocess as sp

In [22]:
dir(sp)

['CalledProcessError',
 'CompletedProcess',
 'DEVNULL',
 'PIPE',
 'Popen',
 'STDOUT',
 'SubprocessError',
 'TimeoutExpired',
 '_PIPE_BUF',
 '_PopenSelector',
 '_USE_POSIX_SPAWN',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_active',
 '_args_from_interpreter_flags',
 '_cleanup',
 '_mswindows',
 '_optim_args_from_interpreter_flags',
 '_posixsubprocess',
 '_time',
 '_use_posix_spawn',
 'builtins',
 'call',
 'check_call',
 'check_output',
 'contextlib',
 'errno',
 'getoutput',
 'getstatusoutput',
 'io',
 'list2cmdline',
 'os',
 'run',
 'select',
 'selectors',
 'signal',
 'sys',
 'threading',
 'time',

#### `check_output`

In [24]:
# help(sp.check_output)

In [25]:
# check_output returns a bytestring by default, so I set encoding to convert it to strings.
# [command, command line arguments]
# change from bytes to string using encoding

sp.check_output(["echo","test"],encoding='utf_8')

'test\n'

In [26]:
! type python

python is /opt/anaconda3/bin/python


In [29]:
! chmod u+x test.py

In [32]:
sp.check_output([os.path.join(pwd,"test.py"),"[1,2,3]"],encoding='utf_8')

"This is a python script\nname variable is __main__\nNumber of arguments: 2\nArgument List: ['/Users/mitrea/Documents/CLASSES/FALL_2021/BIOINF 575 FA 2021/Session 26 - subprocesses/test.py', '[1,2,3]']\nthe array is [1 2 3]\nThe test variable value is 10\n"

The examples above are trivial: they demonstrate just capturing the *output* (stdout) of a program.

However, while the `check_output` function is still in the `subprocess` module, it can easily be converted into a more specific and/or flexible `run` function call.
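
A minimal sketch of that conversion follows. It assumes the running Python interpreter (`sys.executable`) as the child program instead of `echo`, purely so the example does not depend on the operating system.

```python
import subprocess as sp
import sys

# The child program: the current interpreter printing one word
command = [sys.executable, "-c", "print('test')"]

via_check_output = sp.check_output(command, encoding="utf_8")

via_run = sp.run(
    command,
    stdout=sp.PIPE,     # capture stdout
    check=True,         # raise CalledProcessError on a non-zero exit status
    encoding="utf_8",   # decode bytes to str
).stdout

print(via_check_output == via_run)
```

The `run` form also gives access to the return code and, if requested, stderr through the returned object.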

#### `run`

In [33]:
help(sp.run)

Help on function run in module subprocess:

run(*popenargs, input=None, capture_output=False, timeout=None, check=False, **kwargs)
    Run command with arguments and return a CompletedProcess instance.
    
    The returned instance will have attributes args, returncode, stdout and
    stderr. By default, stdout and stderr are not captured, and those attributes
    will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them.
    
    If check is True and the exit code was non-zero, it raises a
    CalledProcessError. The CalledProcessError object will have the return code
    in the returncode attribute, and output & stderr attributes if those streams
    were captured.
    
    If timeout is given, and the process takes too long, a TimeoutExpired
    exception will be raised.
    
    There is an optional argument "input", allowing you to
    pass bytes or a string to the subprocess's stdin.  If you use this argument
    you may not also use the Popen constructor's "std

In [34]:
sub = sp.run(
    [
        'echo',             # The command we want to run
        'test'              # Arguments for the command
    ],
    encoding='utf_8',       # Decode the output bytes to str
    stdout=sp.PIPE,         # Where to send the output
    check=True              # Whether to raise an error if the process fails
)  
sub

CompletedProcess(args=['echo', 'test'], returncode=0, stdout='test\n')

In [35]:
dir(sub)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'args',
 'check_returncode',
 'returncode',
 'stderr',
 'stdout']

In [36]:
print(sub.stdout)

test



The main utility of `check_output` was to capture the output (stdout) of a program. <br>
By using the `stdout=subprocess.PIPE` argument, the output can easily be captured, along with its return code. <br>
A return code signifies the program's exit status: 0 for success, any other value for failure.

In [37]:
sub.returncode

0

With our `run` code above, our program ran to completion, exiting with status 0. The next example shows a different status.

In [38]:
sp.run(
        'exit 1',      # Command & arguments
        shell = True   # Run from the shell
        )


CompletedProcess(args='exit 1', returncode=1)

However, if the `check=True` argument is used, `run` will raise a `CalledProcessError` if your program exits with anything other than 0. This is helpful for detecting a pipeline failure, and exiting or correcting before attempting to continue computation.

In [39]:
sp.run(
        'exit 1',      # Command & arguments
        shell = True,  # Run from the shell
        check = True   # Check exit status
    )

CalledProcessError: Command 'exit 1' returned non-zero exit status 1.

In [40]:
sub = sp.run(
        'exit 1',      # Command & arguments
        shell = True,  # Run from the shell
        # check = True   # Check exit status
    )
if sub.returncode != 0:
    print(f"Exit code {sub.returncode}. Expected 0 when there is no error.")

Exit code 1. Expected 0 when there is no error.
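
When `check=True` is used, the failure can be handled with `try`/`except` rather than stopping the program. A hedged sketch, assuming a child process that exits with status 3 (an arbitrary value chosen for illustration):

```python
import subprocess as sp
import sys

# Child process that exits with status 3
command = [sys.executable, "-c", "import sys; sys.exit(3)"]

try:
    sp.run(command, check=True)
    status = 0
except sp.CalledProcessError as err:
    # err.returncode holds the exit status, err.cmd the command that failed
    status = err.returncode
    print(f"Command failed with exit status {status}")
```

This keeps the decision about what to do next (retry, clean up, or stop the pipeline) in one place.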


#### Syntax

Syntax when using `run`:<br/>
1. A list of arguments: `subprocess.run(['echo', 'test', ...], ...)` 
2. A string and `shell`: `subprocess.run('exit 1', shell = True, ...)`

The preferred way of using `run` is the first way. <br>
This preference is mainly due to security purposes (to prevent shell injection attacks). <br>
It also allows the module to take care of any required escaping and quoting of arguments for a pseudo-OS-agnostic approach. 
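
The security point can be illustrated with a small sketch. It assumes a string from an untrusted source (`user_input`, hypothetical) that contains shell metacharacters. In list form no shell is involved, so the string reaches the child as one literal argument.

```python
import subprocess as sp
import sys

# Pretend this string came from an untrusted source (hypothetical input)
user_input = "hello; echo injected"

# The child simply prints its first command line argument
child_code = "import sys; print(sys.argv[1])"

result = sp.run(
    [sys.executable, "-c", child_code, user_input],  # list form, no shell involved
    stdout=sp.PIPE,
    encoding="utf_8",
    check=True,
)
print(result.stdout)
```

Had the same text been pasted into a command string run with `shell = True`, the shell would have treated `;` as a command separator.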

There are some guidelines though:
1. Sequence (list) of arguments is generally preferred
2. A str is appropriate if the user is just calling a program with no arguments
3. The user should use a str to pass arguments if `shell` is `True`

Your next question should be, "What is `shell`?"

`shell` is just your terminal/command prompt. This is the environment in which you call `ls` or `dir`. It is also where users can define variables. More importantly, this is where your *environment variables*, such as `PATH`, are set.<br/><br/>
By using `shell = True`, the user can now use shell-based environment variable expansion from within a Python program.

In [41]:
sp.run(
        'echo $PATH',            # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )      # Look at the output


CompletedProcess(args='echo $PATH', returncode=0, stdout='/opt/anaconda3/bin:/Users/mitrea/anaconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin\n')

In [49]:
p1 = sp.run(
        'sleep 5',               # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p1)
p2 = sp.run(
        'echo done',             # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p2)

CompletedProcess(args='sleep 5', returncode=0, stdout='')
CompletedProcess(args='echo done', returncode=0, stdout='done\n')


In [50]:
dir(p1)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'args',
 'check_returncode',
 'returncode',
 'stderr',
 'stdout']

For the most part, you shouldn't need to use `shell` simply because Python has modules in the standard library that can do most of the shell commands. For example, `mkdir` can be done with `os.mkdir()`, and `$PATH` can be retrieved using `os.getenv("PATH")` or `os.get_exec_path()` as shown above.
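
A few such standard-library equivalents are sketched below. The directory name `demo_dir` is hypothetical and is created inside a temporary directory, so nothing is left behind.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    new_dir = os.path.join(tmp, "demo_dir")
    os.mkdir(new_dir)                      # like `mkdir demo_dir`
    created = os.path.isdir(new_dir)
    contents = os.listdir(tmp)             # like `ls` / `dir`

# Like `which python` / `where python`: returns a path string or None
python_path = shutil.which("python3") or shutil.which("python")

print(created, contents, python_path)
```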

#### Blocking vs Non-blocking

The last topic of this lecture is "blocking". This is computer science lingo/jargon for whether or not a program ***waits*** until something is complete before moving on. Think of this like a really bad website that takes forever to load because it is waiting until it has rendered all its images first, versus the website that sets the formatting and text while it works on the images.

1. `subprocess.run()` is blocking (it waits until the process is complete)
2. `subprocess.Popen()` is non-blocking (it will run the command, then move on)

***Most*** use cases can be handled through the use of `run()`.<br> 
`run()` is just a *wrapped* version of `Popen()` that simplifies use. <br>
However, `Popen()` allows the user a more flexible control of the subprocess call. <br>
`Popen()` can be used in a similar way to `run()` (with more optional parameters).

An example use case for `Popen()` is if the user has some intermediate data that needs to get processed, but the output of that data doesn't necessarily affect the rest of the pipeline.
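
The difference can be observed with a short timing sketch. It assumes a child process that sleeps for one second as a stand-in for a slow program.

```python
import subprocess as sp
import sys
import time

# Child that sleeps for one second
command = [sys.executable, "-c", "import time; time.sleep(1)"]

start = time.monotonic()
proc = sp.Popen(command)                 # returns as soon as the child is started
launch_time = time.monotonic() - start

still_running = proc.poll() is None      # poll() returns None while the child runs
exit_code = proc.wait()                  # block here until the child finishes
total_time = time.monotonic() - start

print(f"launched after {launch_time:.2f}s, finished after {total_time:.2f}s")
```

Between the `Popen` call and `wait()`, the parent program is free to do other work.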

#### `Popen`

In [43]:
p1 = sp.Popen(
        'sleep 5',               # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p1)
p2 = sp.Popen(
        'echo done',             # Command
        shell = True,            # Use the shell
        stdout=sp.PIPE,          # Where to send it
        encoding='utf_8'         # Convert from bytes to string
    )
print(p2)
print("processes ran")

print(p1.stdout.read())
print(p2.stdout.read())
print("processes completed")



<subprocess.Popen object at 0x11212bc40>
<subprocess.Popen object at 0x112093550>
processes ran

done

processes completed


In [44]:
# Use context manager to handle process while it is running,
# and gracefully close it
with sp.Popen(
    [
        'echo',         # Command
        'here we are'       # Command line arguments
    ],
    encoding='utf_8', # Convert from byte to string
    stdout=sp.PIPE    # Where to send it
) as proc:            # Enclose and alias the context manager
    print(
        proc.stdout.read() # Look at the output
    )

here we are



In [45]:
for elem in dir(proc):
    if not elem.startswith('_'):
        print(elem)

args
communicate
encoding
errors
kill
pid
poll
returncode
send_signal
stderr
stdin
stdout
terminate
text_mode
universal_newlines
wait


#### ***NOTE***: From here on out, there might be different commands used for **Linux** / **MacOS** or **Windows**

Add the following text to a new file `test_pipe.txt` 
```
testing
a
subprocess
pipe
```

In [None]:
# another way to add the text to the file
#test_pipe.txt - a file to be used to demonstrate pipe of cat and sort 
!echo testing > test_pipe.txt
!echo a >> test_pipe.txt
!echo subprocess >> test_pipe.txt
!echo pipe >> test_pipe.txt


In [46]:
# start the first process - cat - reading the file content

# mac OS
p1 = sp.Popen(['cat','test_pipe.txt'], stdout=sp.PIPE, encoding='utf_8')

# windows OS
# windows OS (type is built into cmd.exe, so it needs shell=True)
# p1 = sp.Popen(['type','test_pipe.txt'], shell=True, stdout=sp.PIPE, encoding='utf_8')

print(p1.stdout.read())

testing
a
subprocess
pipe



In [47]:
# add the second process and connect the pipe: 
# for p2 we use stdin=p1.stdout

# mac OS
p1 = sp.Popen(['cat','test_pipe.txt'], stdout=sp.PIPE, encoding='utf_8')

# windows OS
# windows OS (type is built into cmd.exe, so it needs shell=True)
# p1 = sp.Popen(['type','test_pipe.txt'], shell=True, stdout=sp.PIPE, encoding='utf_8')


p2 = sp.Popen(['sort'], stdin=p1.stdout, stdout=sp.PIPE, encoding='utf_8')
p1.stdout.close()  # Allow p1 to receive a SIGPIPE if p2 exits
output = p2.communicate()[0]
print(output)


a
pipe
subprocess
testing
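
When the first stage only supplies text, the `input` argument of `run` (described in the help text earlier) can feed the second stage directly. This is a sketch under the assumption that a small Python child stands in for `sort`, so it runs the same way on any OS.

```python
import subprocess as sp
import sys

text = "testing\na\nsubprocess\npipe\n"

# Child that sorts the lines it receives on stdin (stands in for `sort`)
child_code = "import sys; sys.stdout.write(''.join(sorted(sys.stdin)))"

result = sp.run(
    [sys.executable, "-c", child_code],
    input=text,          # sent to the child's stdin
    stdout=sp.PIPE,      # capture the sorted lines
    encoding="utf_8",
    check=True,
)
print(result.stdout)
```

Chaining two `Popen` objects remains the better fit when the first stage is itself a long-running program that produces output gradually.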



`Popen` does not block, so it can start background processes, much like appending `&` to a command in a shell. <br>
`Popen` also offers more functionality than `run`.

In [48]:
sub_popen = sp.Popen(
    [
        'echo',          # Command
        'test',        # Command line arguments
    ],
    encoding='utf_8',  # Convert from byte to string
    stdout=sp.PIPE     # Where to send it
)
for j in dir(sub_popen):
    if not j.startswith('_'):
        print(j)


args
communicate
encoding
errors
kill
pid
poll
returncode
send_signal
stderr
stdin
stdout
terminate
text_mode
universal_newlines
wait


In [51]:
sub_popen.kill()       # Close the process

Example creating child process.<br>
https://pymotw.com/3/subprocess/

A collection of `Popen` examples: <br>
https://www.programcreek.com/python/example/50/subprocess.Popen


#### Exercise 
Write a bash script that takes a file as an argument and returns the lines that contain the letter p.   
Run that script using python code.




In [52]:
# the script is in the ex1.sh file

import os
import sys
import subprocess as sp

sub = sp.run(
    [
        'bash',             # The command we want to run
        'ex1.sh',              # Arguments for the command
        'test_pipe.txt'
    ],
    encoding='utf_8',       # Decode the output bytes to str
    stdout=sp.PIPE,         # Where to send the output
    check=True              # Whether to raise an error if the process fails
)  
sub

CompletedProcess(args=['bash', 'ex1.sh', 'test_pipe.txt'], returncode=0, stdout='')

The content of the bash script is:   
ex1.sh

```bash 
cat "$1" | grep p > res_sh.txt
echo done >> res_sh.txt
```

#### Exercise -  only if you have R installed
Write an R script that takes the file `test_R.txt` as an argument and returns the sum of the matrix from the file.
Run that script using python code.

```
rN	val1	val2
r1	1	2
r2	3	4
```

In [53]:
# the script is in the ex2.R file
# the data is in the test_R.txt file 

import os
import sys
import subprocess as sp

sub = sp.run(
    [
        'Rscript',             # The command we want to run
        '--vanilla',
        'ex2.R',              # Arguments for the command
        'test_R.txt'
    ],
    encoding='utf_8',       # Decode the output bytes to str
    stdout=sp.PIPE,         # Where to send the output
    check=True              # Whether to raise an error if the process fails
)  
sub

CompletedProcess(args=['Rscript', '--vanilla', 'ex2.R', 'test_R.txt'], returncode=0, stdout='[1] 10\n')

The content of the R script is:   
ex2.R

```R 
args = commandArgs(trailingOnly=TRUE)
df = read.table(args[1], header = TRUE, row.names = 1, sep = "\t")
print(sum(df))
```