# Scientific workflows

The examples for the tutorial are in the `materials/07_scientific_workflows` directory.

To run the examples, change to this directory and run either `doit` (you will need to install the [doit](http://doit.readthedocs.org) library) or `sh run_pipeline.sh`.

## Example:

* distribution of pairwise correlations

## Outline

* what is a scientific workflow
* points to remember:
   - programming in general is an iterative process
   - keep it simple, stupid
   - repetition is the root of all evil
   - program to interfaces
* three components of a scientific workflow:
   - data generation (simulation, pre-processing)
   - statistics
   - plots
* two stages of data analysis:
   - exploratory analysis on a single dataset (or a few)
   - batch analysis
* writing simple command line tools
   - argparse
   - avoid changes in API
* saving data to files:
   - csv
   - compare pickle and npz
* folder structure
   - separate folders for figures, results, data
* writing meta-scripts in bash/python
* running batch analyses
* using automatic builders (doit)
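
For the "compare pickle and npz" point, here is a minimal sketch that stores the same dictionary of arrays in both formats. The function names and the file names `results.npz` and `results.pkl` are made up for illustration. An `.npz` archive holds named NumPy arrays only, while a pickle can hold arbitrary Python objects but is Python-specific and should not be loaded from untrusted sources.

```python
import pickle
from pathlib import Path

import numpy as np


def save_both(results, folder):
    """Save a dict of arrays as .npz and as .pkl; return both paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    npz_path = folder / "results.npz"   # hypothetical file name
    pkl_path = folder / "results.pkl"   # hypothetical file name
    np.savez(npz_path, **results)       # one named array per dict key
    with open(pkl_path, "wb") as f:
        pickle.dump(results, f)         # stores the dict object itself
    return npz_path, pkl_path


def load_npz(path):
    """Read every array of an .npz archive back into a dict."""
    with np.load(path) as archive:
        return {name: archive[name] for name in archive.files}


def load_pickle(path):
    """Read a pickled object; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Both loaders return a dictionary with the same keys, so the rest of the pipeline does not need to know which format was used.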

# Plan:

1) Single cell analysis

   * write a simple script to read data, calculate correlation coefficients, and plot their histogram
   * use argparse to specify the input filename from the command line
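
A minimal sketch of such a script follows. It is not the tutorial's actual code: the data layout (a comma-separated text file with one row per recorded unit and one column per sample), the `--bins` option, and all function names are assumptions made for illustration.

```python
import argparse

import numpy as np


def pairwise_correlations(data):
    """Return the correlation coefficient of every distinct pair of rows."""
    corr = np.corrcoef(data)                  # full symmetric matrix
    upper = np.triu_indices_from(corr, k=1)   # skip diagonal and duplicates
    return corr[upper]


def plot_histogram(coeffs, bins=20):
    """Show a histogram of the coefficients."""
    # Imported here so the analysis part can be used without matplotlib.
    import matplotlib.pyplot as plt
    plt.hist(coeffs, bins=bins)
    plt.xlabel("correlation coefficient")
    plt.ylabel("count")
    plt.show()


def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Histogram of pairwise correlation coefficients")
    parser.add_argument("filename", help="input data file (assumed CSV)")
    parser.add_argument("--bins", type=int, default=20,
                        help="number of histogram bins")
    args = parser.parse_args(argv)
    data = np.loadtxt(args.filename, delimiter=",")
    plot_histogram(pairwise_correlations(data), bins=args.bins)


# In a script file, finish with:
# if __name__ == "__main__":
#     main()
```

Saved as, say, `correlations.py`, it would be called as `python correlations.py data.csv --bins 30`.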

2) Building first workflow

   * make separate folders for figures and results
   * allow writing correlation coefficients to a file
   * separate plotting and analysis
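
One way to separate the two stages is to let them communicate only through a file in the results folder. The sketch below assumes the same CSV layout as before (one row per unit); the function names and file paths are hypothetical.

```python
from pathlib import Path

import numpy as np


def run_analysis(data_file, result_file):
    """Compute pairwise correlations and write them to a text file."""
    data = np.loadtxt(data_file, delimiter=",")
    corr = np.corrcoef(data)
    coeffs = corr[np.triu_indices_from(corr, k=1)]
    result_file = Path(result_file)
    result_file.parent.mkdir(parents=True, exist_ok=True)  # e.g. results/
    np.savetxt(result_file, coeffs)
    return result_file


def run_plotting(result_file, figure_file, bins=20):
    """Read stored coefficients and save their histogram as an image."""
    import matplotlib
    matplotlib.use("Agg")                    # render to file, no display
    import matplotlib.pyplot as plt
    coeffs = np.loadtxt(result_file, ndmin=1)
    figure_file = Path(figure_file)
    figure_file.parent.mkdir(parents=True, exist_ok=True)  # e.g. figures/
    fig, ax = plt.subplots()
    ax.hist(coeffs, bins=bins)
    ax.set_xlabel("correlation coefficient")
    fig.savefig(figure_file)
    plt.close(fig)
    return figure_file
```

With this split, the figure can be restyled and regenerated from `results/` without repeating the analysis.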

3) Batch processing

   * write a data merge script
   * write a Python script using `subprocess` that runs the correlation analysis on all its input files
   * allow plotting histograms from more than one file
   * write a bash script to run the entire pipeline
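
The `subprocess`-based batch script could look roughly like the sketch below. It assumes an analysis script that takes an input file and an output file as its two positional arguments; that interface, the `_corr.txt` suffix, and the function names are illustrative assumptions.

```python
import subprocess
import sys
from pathlib import Path


def build_command(script, data_file, results_dir):
    """Build the command line for one run of the analysis script."""
    data_file = Path(data_file)
    out_file = Path(results_dir) / (data_file.stem + "_corr.txt")
    # sys.executable is the interpreter that runs this batch script.
    return [sys.executable, str(script), str(data_file), str(out_file)]


def run_batch(script, data_files, results_dir):
    """Run the analysis script once per data file, each in a new process."""
    Path(results_dir).mkdir(parents=True, exist_ok=True)
    outputs = []
    for data_file in sorted(Path(f) for f in data_files):
        cmd = build_command(script, data_file, results_dir)
        subprocess.run(cmd, check=True)   # raises if the script fails
        outputs.append(Path(cmd[-1]))
    return outputs
```

A call such as `run_batch("correlations.py", Path("data").glob("*.csv"), "results")` would then process every CSV file in a hypothetical `data` folder.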

4) Automating

   * write `dodo.py`, which runs the batch analysis on V1 data; define the input files by hand
   * add an extra input file to the dependency list and show that the task is automatically re-executed
   * add a plotting task based on V1 data only
   * partially implement the batch run in the dodo file
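
A `dodo.py` along these lines might look as follows. The `actions`, `file_dep`, and `targets` keys and the `task_` prefix are doit conventions; the data file names, the script names `correlations.py` and `plot_histogram.py`, and their command-line interfaces are placeholders, not the tutorial's actual files.

```python
from pathlib import Path

# Input files listed by hand (hypothetical names).
DATA_FILES = ["data/v1_cell1.csv", "data/v1_cell2.csv"]
FIGURE = "figures/v1_histogram.png"


def result_for(data_file):
    """Map a data file to the result file produced from it."""
    return "results/%s_corr.txt" % Path(data_file).stem


def task_correlations():
    """One sub-task per data file; re-run only when a dependency changes."""
    for data_file in DATA_FILES:
        result = result_for(data_file)
        yield {
            "name": Path(data_file).stem,
            "actions": ["python correlations.py %s %s" % (data_file, result)],
            "file_dep": [data_file, "correlations.py"],
            "targets": [result],
        }


def task_plot():
    """Plot a histogram from all result files of the V1 data."""
    results = [result_for(f) for f in DATA_FILES]
    command = "python plot_histogram.py --output %s %s" % (
        FIGURE, " ".join(results))
    return {
        "actions": [command],
        "file_dep": results + ["plot_histogram.py"],
        "targets": [FIGURE],
    }
```

Adding another path to `DATA_FILES` creates a new sub-task and changes the dependencies of the plotting task, so running `doit` again executes only what is out of date.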

# Questions and answers:

* In batch analysis can I import function rather than call external python scripts with `subprocess`?

  Both solutions are possible. The advantage of calling external scripts is that you can also combine programs written in other languages (such as C, Matlab, or Java). In addition, it requires only a few changes to your existing script for exploratory data analysis. Within the script you can use the full power of Python, including global variables, and you do not have to clean up after each call: every run starts in a fresh process, so no data spills between independent runs of the analysis.
  
  Importing and calling a function in the batch script can potentially save some time required to initialize external libraries etc.
  
  In some cases it is not possible to clean up the interpreter after a run (for example, in simulations involving a neuron simulator). In this case one has to resort to the first mechanism, calling external scripts.
  
  
* Will this approach produce too many script files?

  Yes, this is a danger. Make sure that scripts are not too simple but are still reusable. Remember that nothing replaces your common sense. The workflow I presented is just a set of recipes that I found useful in my work. You can pick and mix the ones that you find useful.
  
  There are also other approaches to building workflows, each with its own pros and cons:
  
    * `luigi`
    * `joblib`
    * `mdp` -- modular data processing