# Introduction to biobase

Bulker comes with a core manifest called `biobase` that includes more than 50 common bioinformatics tools. Biobase is useful for everyday interactive analysis, for basic pipelines, and as a starting point for more complicated manifests. In this tutorial we'll show how to activate biobase, use it for some interactive RNA-seq analysis, and run a pipeline. We assume you've already gone through the [install and configure](install.md) instructions.


## Loading the biobase crate

Let's load the biobase crate, which is available on the bulker registry:


In [None]:
export BULKERCFG="biobase_example/bulker_config.yaml"
rm -f "$BULKERCFG"
bulker init -c $BULKERCFG

In [2]:
bulker load biobase -f $HOME/code/hub.bulker.io/bulker/biobase.yaml

Bulker config: /home/nsheff/code/bulker/docs_jupyter/biobase_example/bulker_config.yaml
Importing crate 'bulker/alpine:default' from '/home/nsheff/bulker_crates/bulker/alpine/default'.
Importing crate 'bulker/coreutils:default' from '/home/nsheff/bulker_crates/bulker/coreutils/default'.
Loading manifest: 'bulker/biobase:default'. Activate with 'bulker activate bulker/biobase:default'.
Commands available: aws, ascp, bamtools, bedClip, bedCommonRegions, bedGraphToBigWig, bedIntersect, bedItemOverlapCount, bedops, bedPileUps, bedToBigBed, bedtools, bigWigAverageOverBed, bigWigCat, bigWigSummary, bismark, bissnp, blast, bowtie2, bowtie, bsmap, bwa, cellranger, cufflinks, curl, cutadapt, faSplit, fastq-dump, fastqc, gatk, gt, hisat2, homer, kallisto, khmer, liftOver, macs2, mashmap, picard, pigz, prefetch, R, repeatmasker, rg, Rscript, sambamba, salmon, samtools, segway, seqkit, seqtk, skewer, samblaster, STAR, tabix, trim_galore, trimmomatic, vep, wigToBigWig


You can see this crate offers many common bioinformatics tools, like `samtools` and `bowtie2`. The crate now also appears in your list of available crates:

In [3]:
bulker list

Available crates:
bulker/biobase:default -- /home/nsheff/bulker_crates/bulker/biobase/default



Now, to use a bulker crate inside a Jupyter notebook, you have to activate the crate before starting the notebook. This is a limitation of Jupyter: it expects each command to run in a subprocess and then exit, whereas bulker spawns a new shell. You therefore have to start your bulker environment outside of the notebook.
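
The limitation comes down to ordinary process behavior: environment changes made inside a child shell disappear when that shell exits. The sketch below illustrates this with a plain `sh -c` child. The directory `/tmp/hypothetical_crate_bin` is made up, and treating activation as a `PATH` change is an assumption for illustration, not a description of bulker's internals.

```shell
# Remember the parent's PATH so we can compare afterwards.
parent_path_before="$PATH"

# The child shell prepends a made-up directory to its own PATH and prints it.
child_path=$(sh -c 'PATH="/tmp/hypothetical_crate_bin:$PATH"; printf "%s" "$PATH"')

# Back in the parent, PATH is unchanged: the child's edit did not leak out.
parent_path_after="$PATH"

echo "first PATH entry seen by child: ${child_path%%:*}"
if [ "$parent_path_before" = "$parent_path_after" ]; then
  echo "parent PATH unchanged"
fi
```

A notebook kernel that runs each cell in this way never sees the environment of a shell spawned inside a cell, which is why the activated shell has to be the one that launches the notebook.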

Starting up the crate is as simple as running:

```
bulker activate biobase
```

If you're using Jupyter, you can then start your notebook with `jupyter notebook ...`; otherwise, just continue in the shell. Now we can run any crate commands as if they were native. First, you can inspect the current environment to see what commands are available.

In [None]:
bulker inspect

Any of these commands can be run as if they are native, but they will actually be running in containers.
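
To make the idea of a containerized command concrete, here is a minimal sketch of what a wrapper around a container engine could look like. The helper `container_cmd` is hypothetical and only prints the `docker run` invocation instead of executing it; the executables that bulker actually generates may use different flags and a different engine.

```shell
# Hypothetical helper: print the container invocation for a command instead of running it.
# The image name is supplied by the caller; nothing here is read from a bulker manifest.
container_cmd() {
  image="$1"
  shift
  # Mount the current directory at the same path and use it as the working directory,
  # so relative file paths mean the same thing inside and outside the container.
  printf 'docker run --rm -v %s:%s -w %s %s' "$PWD" "$PWD" "$PWD" "$image"
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

# Show what a containerized 'kallisto version' call might expand to.
container_cmd quay.io/biocontainers/kallisto:0.42.4--2 kallisto version
```

Printing the command rather than running it keeps the sketch safe to try on a machine without Docker installed.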

In [None]:
samtools --version

In [14]:
kallisto --version


Unable to find image 'quay.io/biocontainers/kallisto:0.42.4--2' locally
0.42.4--2: Pulling from biocontainers/kallisto

95caeb02: Already exists
c00e8b61: Already exists
de50789a: Already exists
8b9f3d2a: Already exists
99a2256f: Already exists
336f2e44: Already exists
95caeb02: Already exists
Digest: sha256:8dd5ec4cb7e26fd425e0787a85a3f7e1c0cb567bbb57d8bc4a80406df294216a
Status: Downloaded newer image for quay.io/biocontainers/kallisto:0.42.4--2
Error: invalid command --version
kallisto 0.42.4

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index 
    quant         Runs the quantification algorithm 
    h5dump        Converts HDF5-formatted results to plaintext
    version       Prints version information

Running kallisto <CMD> without arguments prints usage information for <CMD>




In [10]:
cutadapt --version


Command 'cutadapt' not found, but can be installed with:

sudo apt install cutadapt




## Running a pipeline

Now that we've proven we can run each of these commands, let's put them all together and run a whole pipeline. Since the environment we've just activated has all those commands available as if they were installed natively, all we have to do is run a bunch of commands in succession and they will automatically run in individual containers. 
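
Before launching a long run, it can help to confirm that the commands the pipeline calls actually resolve in the current shell. The function below is a small sketch using the POSIX `command -v` builtin; the name `missing_commands` is made up for this example, and the tool names are picked from the biobase command list rather than from the pipeline's own requirements.

```shell
# Report which of the given commands cannot be found on PATH.
# Prints one missing name per line and returns non-zero if any are missing.
missing_commands() {
  status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      status=1
    fi
  done
  return "$status"
}

# Adjust this list to the tools your pipeline actually calls.
missing_commands samtools bowtie2 cutadapt || echo "Some commands are missing; is the crate activated?"
```

A "command not found" message for a tool that the crate provides usually suggests the shell in use is not the one in which the crate was activated.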

**Keep in mind: the bulker environment contains all the commands that the pipeline will run, but the pipeline code itself is distinct. Therefore, we must retrieve the pipeline code and any requirements of the workflow system before we can run it within the bulker environment**.

First, we'll clone our pipeline from github:

In [17]:
git clone https://github.com/databio/peppro peppro_example/peppro

Cloning into 'peppro_example/peppro'...
remote: Enumerating objects: 58, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 1714 (delta 28), reused 41 (delta 18), pack-reused 1656
Receiving objects: 100% (1714/1714), 2.62 MiB | 0 bytes/s, done.
Resolving deltas: 100% (1004/1004), done.
Checking connectivity... done.


We'll also need a few other things required by the pipeline. First, we have to install the workflow system used by the pipeline. Remember, the workflow system itself runs natively; it's the individual commands that run in containers. This workflow uses [pypiper](http://pypiper.databio.org), which can be installed from PyPI (the package is named `piper`):

In [12]:
pip install --user piper

You should consider upgrading via the 'pip install --upgrade pip' command.


This pipeline also requires reference genome assets that are managed by [refgenie](http://refgenie.databio.org). If you don't already have an initialized refgenie config, you can easily initialize one like this:

In [18]:
pip install --user refgenie
export REFGENIE="peppro_example/refgenie.yaml"
refgenie init -c $REFGENIE -s http://staging.refgenomes.databio.org
refgenie pull -g hs38d1 fasta bowtie2_index

You should consider upgrading via the 'pip install --upgrade pip' command.
Initializing refgenie genome configuration
Wrote new refgenie genome configuration file: peppro_example/refgenie.yaml
'hs38d1/fasta:default' archive size: 1.7MB
Downloading URL: http://staging.refgenomes.databio.org/v2/asset/hs38d1/fasta/archive
hs38d1/fasta:default: 1.74MB [00:00, 28.8MB/s]
Download complete: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta__default.tgz
Extracting asset tarball and saving to: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/fasta/default
Default tag for 'hs38d1/fasta' set to: default
'hs38d1/bowtie2_index:default' archive size: 8.8MB
Downloading URL: http://staging.refgenomes.databio.org/v2/asset/hs38d1/bowtie2_index/archive
hs38d1/bowtie2_index:default: 9.22MB [00:00, 21.1MB/s]                          
Download complete: /home/nsheff/code/bulker/docs_jupyter/peppro_example/hs38d1/bowtie2_index__default.tgz
Extracting asset tarball and saving to
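
If you might already have a refgenie config, a small guard can avoid re-initializing it. This is a sketch only: `needs_refgenie_init` is a hypothetical helper that checks whether the path is set and points to a non-empty file, and it does not verify that the file is a valid refgenie config.

```shell
# Return success (0) when no usable config file exists yet, i.e. initialization is still needed.
needs_refgenie_init() {
  config_path="$1"
  # True if the path is empty, or the file is missing or has zero size.
  [ -z "$config_path" ] || [ ! -s "$config_path" ]
}

if needs_refgenie_init "${REFGENIE:-}"; then
  echo "No refgenie config found; run 'refgenie init' first."
else
  echo "Reusing existing refgenie config: $REFGENIE"
fi
```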

Now run the pipeline on the example data included in the repository:

In [19]:
./peppro_example/peppro/pipelines/peppro.py \
  --sample-name test \
  --genome hs38d1 \
  --input peppro_example/peppro/examples/data/test_r1.fq.gz \
  --single-or-paired single \
  -O peppro_example/output/

### Pipeline run code and environment:

*              Command:  `./peppro_example/peppro/pipelines/peppro.py --sample-name test --genome hs38d1 --input peppro_example/peppro/examples/data/test_r1.fq.gz --single-or-paired single -O peppro_example/output/`
*         Compute host:  puma
*          Working dir:  /home/nsheff/code/bulker/docs_jupyter
*            Outfolder:  /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/
*  Pipeline started at:   (10-21 11:29:03) elapsed: 0.0 _TIME_

### Version log:

*       Python version:  2.7.12
*          Pypiper dir:  `/home/nsheff/.local/lib/python2.7/site-packages/pypiper`
*      Pypiper version:  0.12.0
*         Pipeline dir:  `/home/nsheff/code/bulker/docs_jupyter/peppro_example/peppro/pipelines`
*     Pipeline version:  0.8.1
*        Pipeline hash:  05095a2cc78a2e210916f215dd0828940b894a6f
*      Pipeline branch:  * dev
*        Pipeline date:  2019-10-21 11:21:23 -0400

### Arguments passed to pipeline:

*           `TSS_na


> `Total_efficiency`	3.32	PEPPRO	_RES_

> `Read_depth`	1.26	PEPPRO	_RES_

### Compress all unmapped read files (10-21 11:29:11) elapsed: 4.0 _TIME_

Target to produce: `/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam.bai`  

> `samtools index /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam` (21800)
<pre>
</pre>
Command completed. Elapsed time: 0:00:01. Running peak memory: 0.057GB.  
  PID: 21800;	Command: samtools;	Return code: 0;	Memory used: 0.019GB


> `samtools idxstats /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_temp.bam | grep -we 'chrM' -we 'chrMT' -we 'M' -we 'MT' -we 'rCRSd' -we 'rCRSd_3k'| cut -f 3`

> `samtools stats /home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_sort.bam | grep '^SN' | cut -f 2- | grep 'maximum length:' | cut -f 2-`

### Calculate NRF, PBC1, and PBC2 (10-21 11:29:13) elapsed: 2.0 _TIME_


<pre>
Registering input file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/aligned_hs38d1/test_minus.bam'
Temporary files will be stored in: 'tmp_test_minus_cuttrace_ljeMBw'
Processing with 1 cores...
Reduce step (merge files)...
Merging 124 files into output file: '/home/nsheff/code/bulker/docs_jupyter/peppro_example/output/test/signal_hs38d1/test_minus_body_0-mer.bw'
</pre>
Command completed. Elapsed time: 0:01:23. Running peak memory: 0.068GB.  
  PID: 31872;	Command: ./peppro_example/peppro/tools/bamSitesToWig.py;	Return code: 0;	Memory used: 0.057GB

Starting cleanup: 5 files; 4 conditional files for cleanup

Cleaning up flagged intermediate files. . .

Cleaning up conditional list. . .

### Pipeline completed. Epilogue
*        Elapsed time (this run):  0:03:01
*  Total elapsed time (all runs):  0:02:59
*         Peak memory (this run):  0.0675 GB
*        Pipeline completed time: 2019-10-21 11:32:04


The pipeline has now completed successfully and we can explore the results:

In [20]:
tree peppro_example/output/

peppro_example/output/
└── test
    ├── aligned_hs38d1
    │   ├── test_fail_qc.bam
    │   ├── test_minus.bam
    │   ├── test_minus.bam.bai
    │   ├── test_plus.bam
    │   ├── test_plus.bam.bai
    │   ├── test_sort.bam
    │   ├── test_sort.bam.bai
    │   └── test_unmap.bam
    ├── cutadapt
    │   └── test_cutadapt.txt
    ├── fastq
    ├── fastqc
    ├── objects.tsv
    ├── PEPPRO_cleanup.sh
    ├── PEPPRO_commands.sh
    ├── PEPPRO_completed.flag
    ├── PEPPRO_log.md
    ├── PEPPRO_profile.tsv
    ├── QC_hs38d1
    │   └── test_bamQC.tsv
    ├── raw
    │   └── test.fastq.gz -> /home/nsheff/code/bulker/docs_jupyter/peppro_example/peppro/examples/data/test_r1.fq.gz
    ├── signal_hs38d1
    │   └── test_minus_body_0-mer.bw
    └── stats.tsv

8 directories, 19 files


We've successfully run a complete pipeline without having to install any of the software that runs the commands in the workflow. We're also able to interactively explore the environment that ran the workflow.
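
As one way to explore the results, the sketch below pulls a single value out of `stats.tsv`. It assumes the file is tab-separated with the statistic name in the first column and its value in the second, which is inferred from the result lines in the pipeline log and not confirmed here; `get_stat` is a made-up helper name.

```shell
# Look up one statistic by name in a tab-separated stats file.
# Assumed layout: name <TAB> value <TAB> pipeline name ...
get_stat() {
  name="$1"
  file="$2"
  awk -F '\t' -v key="$name" '$1 == key { print $2 }' "$file"
}

# Only query the file if the pipeline run produced it.
stats_file="peppro_example/output/test/stats.tsv"
if [ -f "$stats_file" ]; then
  get_stat Read_depth "$stats_file"
fi
```

Matching on the whole first field, rather than using `grep`, avoids accidental hits on statistics whose names merely contain the search string.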


## Conclusion

That's basically it. If you're a workflow developer, all you need to do is [write your own manifest](manifest.md) and distribute it with your workflow. In 3 lines of code, users will then be able to run your workflow in modular containers, with the container engine of their choice.

