# Building a Bioinformatics Workflow with Jupyter & Anaconda

## Installing Bioinformatics Software with `bioconda`

Burke Squires

https://github.com/burkesquires

---

![bioconda.png](attachment:bioconda.png)

## Setting up `bioconda`

[Bioconda](https://bioconda.github.io/) is "a channel for the conda package manager specializing in bioinformatics software. Bioconda consists of:

- a repository of recipes hosted on GitHub
- a build system that turns these recipes into conda packages
- a repository of more than 3000 bioinformatics packages ready to use with conda install
- Over 250 contributors that add, modify, update and maintain the recipes

(__Note__: The NIH's own [Ryan Dale](https://github.com/daler) is a founding member of the `bioconda` project.)

The conda package manager makes installing software a vastly more streamlined process. Conda is a combination of other package managers you may have encountered, such as pip, CPAN, CRAN, Bioconductor, apt-get, and homebrew. Conda is both language- and OS-agnostic, and can be used to install C/C++, Fortran, Go, R, Python, Java etc programs on Linux, Mac OSX, and Windows.

Conda allows separation of packages into repositories, or channels. The main defaults channel has a large number of common packages. Users can add additional channels from which to install software packages not available in the defaults channel. Bioconda is one such channel specializing in bioinformatics software.

Bioconda has been acknowledged by NATURE in their [technology blog](http://blogs.nature.com/naturejobs/2017/11/03/techblog-bioconda-promises-to-ease-bioinformatics-software-installation-woes/).

Each package added to Bioconda also has a corresponding Docker BioContainer automatically created and uploaded to Quay.io."


See the available packages [here](https://bioconda.github.io/recipes.html).

### 1. Install conda

Bioconda requires the conda package manager to be installed. If you have an Anaconda Python installation, you already have it. Otherwise, the best way to install it is with the Miniconda package. The Python 3 version is recommended.

__Note__: If you have installed the Anaconda distribution you have already taken care of this step!

### 2. Set up channels

After installing conda you will need to add the bioconda channel as well as the other channels bioconda depends on. It is important to add them in this order so that the priority is set correctly (that is, bioconda is highest priority).

The conda-forge channel contains many general-purpose packages not already found in the defaults channel. The r channel, which the commands in this step do not add, was only ever included for backward compatibility. It is not mandatory, but without it, packages compiled against R 3.3.1 might not work.

    !~/anaconda3/bin/conda config --add channels defaults

    !~/anaconda3/bin/conda config --add channels conda-forge

    !~/anaconda3/bin/conda config --add channels bioconda
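
Why the order matters can be sketched in a few lines of Python. `conda config --add` places the named channel at the top of the channel list, so the channel added last ends up with the highest priority. The helper `add_channel` below is a hypothetical stand-in written for illustration, not part of conda, and the sketch assumes the channel list starts out empty:

```python
def add_channel(channels, name):
    """Mimic `conda config --add channels NAME` (illustrative helper, not a conda API).

    The named channel is placed at the top of the list; if it was already
    present, it is moved to the top rather than duplicated.
    """
    return [name] + [c for c in channels if c != name]


# Assumption: start from an empty channel list.
channels = []
for name in ["defaults", "conda-forge", "bioconda"]:
    channels = add_channel(channels, name)

# Highest priority first.
print(channels)  # ['bioconda', 'conda-forge', 'defaults']
```

The real list can be inspected with `conda config --show channels`, which should show the channels in the same top-to-bottom priority order.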

### 3. Install packages

[Browse the packages](https://bioconda.github.io/recipes.html) to see what’s available.

Bioconda is now enabled, so any packages on the bioconda channel can be installed into the current conda environment:

    conda install bwa

Or a new environment can be created:

    conda create -n aligners bwa bowtie hisat star

---

We will install three software packages: `bwa`, `samtools`, and `bcftools`.

In [1]:

    !conda install -y bwa

Output:

    Solving environment: done

    ## Package Plan ##

      environment location: /Users/squiresrb/anaconda3/envs/py36_r_env

      added / updated specs:
        - bwa


    The following NEW packages will be INSTALLED:

        bwa:  0.7.17-ha92aebf_3 bioconda
        perl: 5.26.2-h16c6ff1_0 conda-forge

    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done


In [2]:

    !conda install -y samtools

Output:

    Solving environment: done

    ## Package Plan ##

      environment location: /Users/squiresrb/anaconda3/envs/py36_r_env

      added / updated specs:
        - samtools


    The following packages will be downloaded:

        package                    |            build
        ---------------------------|-----------------
        samtools-1.9               |       h46bd0b3_0         525 KB  bioconda
        llvm-meta-6.0.1            |                0           2 KB  conda-forge
        libpng-1.6.34              |       ha92aebf_2         321 KB  conda-forge
        libedit-3.1.20170329       |                0         155 KB  conda-forge
        ncurses-5.9                |               10         1.1 MB  conda-forge
        readline-7.0               |                0         383 KB  conda-forge
        clangdev-6.0.1             |        default_1        87.7 MB  conda-forge
        python-3.6.5               |       hc167b69_0        15.4 MB
        pcre-8.39                  |                0         268 KB  conda-forge
        r-base-3

In [4]:

    !conda install -y bcftools

Output:

    Solving environment: done

    # All requested packages already installed.



---

__Note__: Remember to restart the kernel in a notebook if you have made changes using conda.
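
One quick way to confirm that the three tools are visible to the notebook is to look them up on the `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `find_tools` is hypothetical, and the result depends on which conda environment the kernel was started from:

```python
import shutil


def find_tools(names):
    """Map each tool name to its executable path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


found = find_tools(["bwa", "samtools", "bcftools"])
for name, path in found.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a tool is reported as not found even though `conda install` succeeded, the kernel is probably running in a different environment from the one the package was installed into.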

## `biopython`

While we are not going to be using it today, another excellent Python bioinformatics resource is the [biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html) project!

"The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology."


![biopython_logo.svg](attachment:biopython_logo.svg)


"What can I find in the Biopython package

The main Biopython releases have lots of functionality, including:

- The ability to parse bioinformatics files into Python utilizable data structures, including support for the following formats:
    - Blast output – both from standalone and WWW Blast
    - Clustalw
    - FASTA
    - GenBank
    - PubMed and Medline
    - ExPASy files, like Enzyme and Prosite
    - SCOP, including ‘dom’ and ‘lin’ files
    - UniGene
    - SwissProt
- Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
- Code to deal with popular on-line bioinformatics destinations such as:
    - NCBI – Blast, Entrez and PubMed services
    - ExPASy – Swiss-Prot and Prosite entries, as well as Prosite searches
- Interfaces to common bioinformatics programs such as:
    - Standalone Blast from NCBI
    - Clustalw alignment program
    - EMBOSS command line tools
- A standard sequence class that deals with sequences, ids on sequences, and sequence features.
- Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
- Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
- Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
- Code making it easy to split up parallelizable tasks into separate processes.
- GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
- Extensive documentation and help with using the modules, including this file, on-line wiki documentation, the web site, and the mailing list.
- Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects."

With `conda`, installing Biopython is as easy as `conda install biopython`.