# Python Workflows

### R. Burke Squires
NIAID Bioinformatics and Computational Biosciences Branch (BCBB)

---

### Good

- Unix pipelining of programs

### Better

- Jupyter Notebook and Biopython

### Best

- Python workflow package, validation, logging

---

## Good Workflows

### Unix Pipeline / Script Pipelines

Scripts, written in Unix shell or Python, can be seen as the most basic form of pipeline framework:

- Variables and conditional logic can be used to build flexible pipelines
- Scripts tend to be __quite brittle__ and lack robustness

- Scripts lack support for two key features necessary for the efficient processing of data:
    - Dependencies
    - Reentrancy 
    
__Dependencies:__ Upstream files (or tasks) that downstream transformation steps require as input. When a dependency is updated, associated downstream files should be updated as well.

__Reentrancy:__ Ability of a program to continue where it left off if interrupted, obviating the need to restart from the beginning of a process.
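
As a rough illustration of these two ideas, the sketch below uses file modification times, in the spirit of `make`: a step is rerun only when its output is missing or older than one of its inputs. The names `is_up_to_date` and `run_step` are hypothetical helpers invented for this example, not part of any workflow package.

```python
from pathlib import Path
from typing import Callable, Sequence


def is_up_to_date(target: Path, dependencies: Sequence[Path]) -> bool:
    """Return True if target exists and is at least as new as every dependency."""
    if not target.exists():
        return False
    target_mtime = target.stat().st_mtime
    return all(dep.stat().st_mtime <= target_mtime for dep in dependencies)


def run_step(target: Path, dependencies: Sequence[Path], action: Callable[[], object]) -> bool:
    """Run action only when target is missing or stale; return True if it ran."""
    if is_up_to_date(target, dependencies):
        # Reentrancy: work that already finished is skipped on a rerun.
        return False
    # Dependencies: a missing or outdated target triggers the step again.
    action()
    return True
```

A script built from such steps can be rerun after an interruption and will skip every step whose output is already current. Plain shell scripts do not do this on their own, which is part of what dedicated workflow tools add.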

### Unix Pipeline Example

In [None]:
ls

In [None]:
ls data

In [None]:
!head data/gapminder_five_year_dirty.txt

In [None]:
!wc -l data/gapminder_five_year_dirty.txt

In [None]:
%%bash
tail -n 1000 data/gapminder_five_year_dirty.txt  > gapminder_tail.txt

In [None]:
%%bash
head -n 500 gapminder_tail.txt > gapminder_middle.txt

Select the last 1000 lines of the file:

    !tail -n 1000 data/gapminder_five_year_dirty.txt

"Pipe" the output of the `tail` command into the `head` command and keep the first 500 lines:

    ... | head -n 500

Use the greater-than symbol (`>`) to redirect the output from `head` into a file:

    > gapminder_middle.txt

In [None]:
%%bash
tail -n 1000 data/gapminder_five_year_dirty.txt | head -n 500  > data/gapminder_middle2.txt
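
For comparison, the same two-step selection can be sketched in plain Python. This is only one possible translation, not a replacement for the shell pipeline. `middle_slice` is a hypothetical helper name, and the files are read as bytes so that any unusual characters in the "dirty" data pass through unchanged.

```python
from collections import deque
from itertools import islice


def middle_slice(src_path, dest_path, tail_count=1000, head_count=500):
    """Write the first head_count of the last tail_count lines of src_path.

    Returns the number of lines written to dest_path.
    """
    with open(src_path, "rb") as src:
        # A bounded deque keeps only the most recent lines, like `tail -n`.
        last_lines = deque(src, maxlen=tail_count)
    with open(dest_path, "wb") as dest:
        # islice stops after head_count items, like `head -n`.
        dest.writelines(islice(last_lines, head_count))
    return min(head_count, len(last_lines))
```

Calling `middle_slice("data/gapminder_five_year_dirty.txt", "data/gapminder_middle2.txt")` would mirror the cell above. Unlike the shell version, the two stages here run one after the other rather than concurrently.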

### Windows PowerShell Pipelining
- https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_pipelines?view=powershell-7
- https://docs.microsoft.com/en-us/powershell/scripting/learn/understanding-the-powershell-pipeline?view=powershell-7

---

## Better Workflows

A better approach is to build the workflow in a Jupyter notebook, which can be shared and even converted to, and run as, a Python script. We will take a look at the 02_better.ipynb notebook.

### Jupyter Notebook & Run All

One way to run the workflow, once it has been written and debugged, is to open the notebook and select 'Cell' -> 'Run all'. This runs every cell in the notebook from top to bottom, in essence running the entire workflow.

Open the [better notebook](02_better.ipynb) and select 'Run all'.

In [None]:
%%bash
jupyter nbconvert '02_better.ipynb' --to script

Now let's run the script:

In [None]:
%%bash
ipython 02_better.py
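
The conversion and the 'Run all' step can also be driven from Python rather than from the notebook menu. The sketch below only assembles `jupyter nbconvert` command lines and hands them to `subprocess`. It assumes that `jupyter` and `nbconvert` are installed and on the PATH, and the function names `nbconvert_command` and `run_notebook` are hypothetical.

```python
import subprocess


def nbconvert_command(notebook, execute=False):
    """Build the jupyter nbconvert command line for a notebook."""
    if execute:
        # Run every cell in order and save the executed copy as a new notebook.
        return ["jupyter", "nbconvert", "--to", "notebook", "--execute", notebook]
    # Export the notebook's code cells to a plain script.
    return ["jupyter", "nbconvert", "--to", "script", notebook]


def run_notebook(notebook):
    """Execute the notebook top to bottom; raises CalledProcessError on failure."""
    subprocess.run(nbconvert_command(notebook, execute=True), check=True)
```

Because `check=True` turns a failing run into an exception, a calling script stops at the first broken notebook instead of silently continuing.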

Schloss microbiome notebook:

- http://www.nature.com/nature/journal/v509/n7500/full/nature13178.html
- http://nbviewer.ipython.org/gist/pschloss/9815766/notebook.ipynb

---

## Best Workflows

- [snakemake intro](./03_snakemake_intro.ipynb)
- https://slides.com/johanneskoester/snakemake-tutorial
- [snakemake short tutorial](./04_snakemake_short_tutorial.ipynb)

In [None]:
# clean up
!rm data/gapminder_middle2.txt
!rm gapminder_tail.txt
!rm gapminder_middle.txt
!rm 02_better.py