# Bioinformatics tutorial

In this tutorial we'll show how to use bulker to set up a controlled environment for a bioinformatics workflow. I assume you've already gone through the [install and configure](install.md) instructions. Before we begin, we need to define a bash function so that activation works inside the Jupyter notebook used for this tutorial:

In [1]:
bulker-activate() {
  eval "$(bulker activate -e "$@")"
}

This makes the activate command adjust the current shell instead of starting a new one. That is required in this Jupyter notebook, and you may want to add the function to your `.bashrc` anyway because it's more convenient in everyday use. (You can read more about this in [tips](tips.md).)
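To see why the `eval` wrapper matters, here is a minimal sketch that doesn't use bulker at all. `fake_activate` and `demo_activate` are hypothetical names I made up for illustration; I'm assuming only that the wrapped command prints shell code to stdout, which is what the wrapper above relies on.

```shell
# Hypothetical stand-in for a command that prints shell code to stdout.
# It is not part of bulker; it only illustrates the eval pattern.
fake_activate() {
  printf 'export DEMO_CRATE=%s\n' "$1"
}

# Calling the command directly only prints the code; nothing is set.
unset DEMO_CRATE
fake_activate databio/peppro
echo "without eval: ${DEMO_CRATE:-unset}"

# Wrapping it in eval applies the printed code to the current shell.
demo_activate() {
  eval "$(fake_activate "$@")"
}
demo_activate databio/peppro
echo "with eval: ${DEMO_CRATE:-unset}"
```

The variable is only visible after the `eval` call, which is the same reason the wrapper function is needed in a notebook, where each cell has to keep using the same shell.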

With that out of the way, make sure you've initialized a bulker config file:

In [2]:
rm -f "bulker_config.yaml"
export BULKERCFG="bulker_config.yaml"
bulker init -c "$BULKERCFG"

Guessing container engine is docker.
Wrote new configuration file: bulker_config.yaml



## Loading the peppro crate

We've produced a crate for our [peppro pipeline](http://peppro.databio.org), which processes nascent RNA-seq data (from PRO-seq or GRO-seq experiments). Let's load the peppro crate, which is available on the bulker registry:


In [3]:
bulker load databio/peppro

Bulker config: bulker_config.yaml
Got URL: http://hub.bulker.io/databio/peppro.yaml
Executable template: /home/nsheff/code/bulker/docs_jupyter/templates/docker_executable.jinja2
Loading manifest: 'peppro'. Activate with 'bulker activate peppro'.
Commands available: samtools, bowtie2, seqkit, fastp, seqtk, preseq, fastq_pair, wigToBigWig, bigWigCat, fastqc, pigz, cutadapt, Rscript


This crate offers several common bioinformatics tools, like `samtools` and `bowtie2`. It now also appears in your list of available crates:

In [4]:
bulker list

Bulker config: bulker_config.yaml
Available crates:
databio/peppro:default -- /home/nsheff/bulker_crates/databio/peppro/default



## Run commands in the peppro crate

Now we've loaded the crate, which means there's a folder somewhere on your computer with a bunch of executables. Bulker provides two ways to activate this crate conveniently, depending on your use case: `bulker activate`, and `bulker run`.
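To make the "folder with a bunch of executables" idea concrete, here is a rough sketch of such a folder built by hand. It is not bulker's actual template: the wrapper only prints the container command it would run, so the sketch works without docker. The image name `example/bowtie2-image` is made up, and putting the folder first on `PATH` is my assumption about what activation amounts to.

```shell
# Create a throwaway "crate" folder (the location is arbitrary for this sketch).
crate_dir="$(mktemp -d)"

# One wrapper script per command. A real wrapper would call the container
# engine; this one only prints the command so the sketch runs without docker.
# "example/bowtie2-image" is a made-up image name.
cat > "$crate_dir/bowtie2" <<'EOF'
#!/bin/sh
echo docker run --rm example/bowtie2-image bowtie2 "$@"
EOF
chmod +x "$crate_dir/bowtie2"

# Assumption: activating a crate boils down to putting its folder first on PATH.
PATH="$crate_dir:$PATH"
wrapper_output="$(bowtie2 --version)"
echo "$wrapper_output"
```

Because the wrapper has the same name as the tool, anything that calls `bowtie2` picks it up without knowing a container is involved.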

Let's start with a `bulker run` command, which is how we could execute an ad-hoc command. If you try to run `bowtie2` natively, it doesn't work, because I don't have it installed natively on this system:


In [5]:
bowtie2

The program 'bowtie2' is currently not installed. You can install it by typing:
sudo apt install bowtie2


: 127
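If you'd rather check for a native install without triggering an error message, a small sketch like the following works in any POSIX shell. The tool name is a placeholder I chose so that it won't exist on your system; the `127` shown above matches the status a shell returns when a command can't be found.

```shell
# A name chosen to be very unlikely to exist on any system.
missing_tool="no_such_tool_for_this_demo"

# command -v succeeds only if the name resolves to something runnable.
if command -v "$missing_tool" >/dev/null 2>&1; then
  tool_state="installed"
else
  tool_state="missing"
fi
echo "$missing_tool is $tool_state"

# Running a name that cannot be resolved makes the shell return status 127.
missing_status=0
"$missing_tool" 2>/dev/null || missing_status=$?
echo "exit status: $missing_status"
```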

But we can run it inside the crate:

In [6]:
bulker run databio/peppro bowtie2 --version

Bulker config: bulker_config.yaml
Activating crate: databio/peppro

/usr/local/bin/bowtie2-align-s version 2.3.5
64-bit
Built on default-9ec70d2b-0cae-4ef9-8b8a-41ee9710f4ef
Mon Apr  1 13:45:20 UTC 2019
Compiler: gcc version 7.3.0 (crosstool-NG 1.23.0.449-a04d0)
Options: -O3 -m64 -msse2 -funroll-loops -g3 -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/usr/local/include -fdebug-prefix-map=/opt/conda/conda-bld/bowtie2_1554125862557/work=/usr/local/src/conda/bowtie2-2.3.5 -fdebug-prefix-map=/usr/local=/usr/local/src/conda-prefix -std=c++98 -DPOPCNT_CAPABILITY -DWITH_TBB -DNO_SPINLOCK -DWITH_QUEUELOCK=1

If we need to run more than one command (or, say, an entire workflow), then it's much simpler to use `bulker activate` than `bulker run`. Here, I'm using the hyphenated version I created at the beginning of this tutorial:

In [7]:
bulker-activate databio/peppro

Bulker config: bulker_config.yaml
Activating bulker crate: databio/peppro



Now we can run any crate commands as if they were native:

In [8]:
bowtie2 --version

/usr/local/bin/bowtie2-align-s version 2.3.5
64-bit
Built on default-9ec70d2b-0cae-4ef9-8b8a-41ee9710f4ef
Mon Apr  1 13:45:20 UTC 2019
Compiler: gcc version 7.3.0 (crosstool-NG 1.23.0.449-a04d0)
Options: -O3 -m64 -msse2 -funroll-loops -g3 -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -pipe -I/usr/local/include -fdebug-prefix-map=/opt/conda/conda-bld/bowtie2_1554125862557/work=/usr/local/src/conda/bowtie2-2.3.5 -fdebug-prefix-map=/usr/local=/usr/local/src/conda-prefix -std=c++98 -DPOPCNT_CAPABILITY -DWITH_TBB -DNO_SPINLOCK -DWITH_QUEUELOCK=1

In [9]:
samtools --version

samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.


In [10]:
fastp --version

fastp 0.20.0


In [11]:
cutadapt --version

2.4
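Instead of checking tools one at a time, you could confirm in one step that every crate command resolves before starting a longer workflow. This is only a sketch: `check_commands` is a helper name I made up, and the command list is copied from the "Commands available" line that `bulker load` printed earlier.

```shell
# Commands copied from the "Commands available" line printed by bulker load.
required="samtools bowtie2 seqkit fastp seqtk preseq fastq_pair wigToBigWig bigWigCat fastqc pigz cutadapt Rscript"

# Hypothetical helper: report every name that does not resolve on PATH.
check_commands() {
  missing=""
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      missing="$missing $cmd"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing commands:$missing"
    return 1
  fi
  echo "All commands found."
}

# Word splitting of $required is intentional here.
check_commands $required || echo "Activate the crate before running the workflow."
```

Run it once after activating the crate; if anything is listed as missing, the crate probably isn't active in the current shell.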


## Running a pipeline

Now that we've proven we can run each of these commands, let's put them all together and run a whole pipeline. First, we'll clone our pipeline from GitHub:

In [1]:
git clone http://github.com/databio/peppro

Cloning into 'peppro'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 1689 (delta 13), reused 26 (delta 11), pack-reused 1656
Receiving objects: 100% (1689/1689), 2.61 MiB | 0 bytes/s, done.
Resolving deltas: 100% (989/989), done.
Checking connectivity... done.


Now run the pipeline on the bundled example data:

In [13]:
cd peppro

In [None]:
./pipelines/peppro.py \
  --sample-name test \
  --genome rCRSd \
  --input examples/data/test_r1.fq.gz \
  --single-or-paired single \
  -O $HOME/peppro_example/

This tutorial is not quite complete, but you get the idea. Running the pipeline requires a few other things that are not yet containerized: refgenie, the necessary assets, and some R packages. I'm working on finishing this tutorial soon, but I hope you've gotten a feel for how easy bulker makes it to load an environment with all the commands a workflow needs.


## Conclusion

That's basically it. If you're a workflow developer, all you need to do is [write your own manifest](manifest.md) and distribute it with your workflow; in 3 lines of code, users will be able to run your workflow in modular containers, with the container engine of their choice.

