In this chapter, we will cover the following recipes:
- Installing the required software with Anaconda
- Installing the required software with Docker
- Interfacing with R via rpy2
- Performing R magic with IPython

## Installing the required software with Anaconda

https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/Welcome.ipynb

### Getting ready

Python can run on top of several environments:
   - The JVM (via Jython)
   - .NET (via IronPython)
   - The standard (CPython) implementation, which is what Anaconda provides. The JVM and .NET versions exist mostly to interact with the native libraries of those platforms.

##### Python 2 or 3?
1. For phylogenetics, you will probably need Python 2, because most existing Python libraries in that area do not support version 3.
2. In the short term, Python 2 is generally better supported, but (apart from phylogenetics) Python 3 is well covered for computational biology.
3. In the long run, Python 3 is the place to be.

Both are supported here: use version 2.7 or 3.4.
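
Because some recipes depend on which series you picked, a notebook can check the running interpreter up front. The following is a minimal sketch: the helper name `is_supported` is hypothetical, and the 2.7 and 3.4 thresholds are simply the versions named above.

```python
import sys


def is_supported(version_info=None):
    """Return True for Python 2.7, or for Python 3.4 and later (assumed thresholds)."""
    if version_info is None:
        version_info = sys.version_info
    major, minor = version_info[0], version_info[1]
    if major == 2:
        return minor == 7
    return major == 3 and minor >= 4


if __name__ == "__main__":
    print("Running Python %d.%d" % (sys.version_info[0], sys.version_info[1]))
    print("Covered by these notes: %s" % is_supported())
```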

##### Operating system
- Most heavy-duty analysis (next-generation sequencing data analysis, complex machine learning) will be done on Linux, probably on a Linux cluster.
- On Windows or Mac OS X, consider modern virtualization software (VirtualBox or Docker).
- On Windows, install the 32-bit version.

The software developed for this book is available at https://github.com/tiagoantao/bioinf-python.

You will also need development compilers and libraries (all free):
- On Ubuntu, install the build-essential package (**apt-get install build-essential**).
- On Mac, install Xcode (https://developer.apple.com/xcode/).

List of the most important Python software

Name | Usage | URL | Purpose
---- | ----- | --- | ------
IPython | General | http://ipython.org/ | General
NumPy | General | http://www.numpy.org/ | Numerical Python
SciPy | General | http://scipy.org/ | Scientific computing
matplotlib | General | http://matplotlib.org/ | Visualization
Biopython | General | http://biopython.org/wiki/Main_Page | Bioinformatics
PyVCF | NGS | http://pyvcf.readthedocs.org/en/latest/ | VCF processing
PySAM | NGS | http://pysam.readthedocs.org/en/latest/ | SAM/BAM processing
simuPOP | Population Genetics | http://simupop.sourceforge.net/ | Genetics Simulation
DendroPy | Phylogenetics | http://pythonhosted.org/DendroPy/ | Phylogenetics
scikit-learn | General | http://scikit-learn.org/stable/ | Machine learning
PyMOL | Proteomics | http://pymol.org/ | Molecular visualization
rpy2 | R integration | http://rpy.sourceforge.net/ | R interface
pygraphviz | General | http://pygraphviz.github.io/ | Graph library
Reportlab | General | http://reportlab.com/ | Visualization
seaborn | General | http://web.stanford.edu/~mwaskom/software/seaborn/ | Visualization/Stats
Cython | Big Data | http://cython.org/ | High performance
Numba | Big Data | http://numba.pydata.org/ | High performance

Other libraries worth a look: Blaze (data analysis) and Bokeh (visualization).


### How to install

1. Download the Anaconda distribution from http://continuum.io/downloads.
Choose Python version 2 or 3.

Accept all the installation defaults, but make sure that the conda binaries are in your PATH (open a new terminal window so that the PATH is updated).

Be careful with your PYTHONPATH and any existing Python libraries, which can conflict with the ones Anaconda installs.
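
Both points can be checked from Python itself. The sketch below uses only the standard library (`shutil.which` needs Python 3.3 or later); the function name `check_environment` and the shape of its return value are illustrative choices, not part of Anaconda.

```python
import os
import shutil


def check_environment(environ=None):
    """Report where conda resolves on PATH and which directories PYTHONPATH adds."""
    if environ is None:
        environ = os.environ
    # None means no conda executable was found on the given PATH
    conda_path = shutil.which("conda", path=environ.get("PATH", ""))
    # Split PYTHONPATH on the platform separator, ignoring empty entries
    extra_dirs = [d for d in environ.get("PYTHONPATH", "").split(os.pathsep) if d]
    return {"conda": conda_path, "pythonpath": extra_dirs}


if __name__ == "__main__":
    report = check_environment()
    print("conda found at: %s" % report["conda"])
    print("PYTHONPATH entries: %s" % report["pythonpath"])
```

If `conda` is reported as `None`, the PATH has probably not been updated yet. A non-empty PYTHONPATH list shows which directories could shadow the libraries in the conda environment.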

2. Create a conda environment with Biopython. For Python 2.7:

**conda create -n bioinformatics biopython=1.65 python=2.7**

Or, for Python 3.4:

**conda create -n bioinformatics biopython=1.65 python=3.4**

3. Activate the environment:

**source activate bioinformatics**

4. Install the core packages:

**conda install scipy matplotlib ipython-notebook binstar pip**

**conda install pandas cython numba scikit-learn seaborn**

5. Install pygraphviz using pip (it is not available on conda):

**pip install pygraphviz**

6. Install the Python bioinformatics packages other than Biopython:

**conda install -c https://conda.binstar.org/bcbio pysam**

**conda install -c https://conda.binstar.org/simupop simuPOP**

**pip install pyvcf**

**pip install dendropy**

7. Install R:
- Download it from the R website at http://www.r-project.org/, or
- On a recent Debian/Ubuntu Linux distribution, run:

**apt-get install r-bioc-biobase r-cran-ggplot2**

This installs Bioconductor, the main R suite for bioinformatics, and ggplot2, a popular plotting library in R. It also installs R itself as a dependency.

8. If you are not on Debian/Ubuntu Linux, do not have root access, or prefer to install in your home directory, download and install R manually and then run the following commands in R:

**source("http://bioconductor.org/biocLite.R")**

**biocLite()**

**install.packages("ggplot2")**

**install.packages("gridExtra")**

9. Install rpy2, the R-to-Python bridge. Back at the command line, with the conda bioinformatics environment active, run:

**pip install rpy2**

10. Other notes:

- For Python 3, use pip3 instead of pip.
- Consider using virtualenv (http://docs.python-guide.org/en/latest/dev/virtualenvs/).

**Reference: IPython Notebook usage: http://nbviewer.ipython.org/gist/irobii/014b8aa3574090a0d04a**
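
Once the packages are installed, it can help to confirm that they import inside the environment. The sketch below tries each import and lists the failures. The `MODULES` list and the `find_missing` helper are illustrative. Note that the import name sometimes differs from the package name (Biopython is imported as `Bio`, PyVCF as `vcf`, scikit-learn as `sklearn`).

```python
import importlib

# Import names for the packages installed in the steps above.
MODULES = ["Bio", "numpy", "scipy", "matplotlib", "pandas", "sklearn",
           "seaborn", "pysam", "vcf", "dendropy", "simuPOP", "pygraphviz",
           "rpy2"]


def find_missing(module_names):
    """Return the names in module_names that fail to import."""
    missing = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception:
            # Usually ImportError; a package with a broken native
            # dependency may raise something else, so catch broadly.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(MODULES)
    if missing:
        print("Could not import: " + ", ".join(missing))
    else:
        print("All modules imported successfully.")
```

Run it with the bioinformatics environment active, so that the check uses the same interpreter the notebooks will use.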

## Installing the required software with Docker

- Docker is the most widely used framework for operating system-level virtualization.
- This technology gives you an independent container: lighter than a virtual machine, but one that still compartmentalizes software.
- Processes are mostly isolated from one another, so each container feels like a virtual machine.

### How to install
0. Install Docker:
   - Get the latest version from https://www.docker.com/.
   - On Windows or Mac, use boot2docker (http://boot2docker.io/).

1. Build the image with one of the following commands, on the Linux shell or in boot2docker. For Python 2:

**docker build -t bio https://raw.githubusercontent.com/tiagoantao/bioinf-python/master/docker/2/Dockerfile**

or, for Python 3:

**docker build -t bio https://raw.githubusercontent.com/tiagoantao/bioinf-python/master/docker/3/Dockerfile**

On Linux, you will need either root privileges or membership of the docker Unix group.

2. Run the container:

**docker run -ti -p 9875:9875 -v YOUR_DIRECTORY:/data bio**

3. Replace YOUR_DIRECTORY with a directory on your operating system. Its contents will be visible inside the container at /data, and changes made in /data will appear in YOUR_DIRECTORY.

The **-p 9875:9875** option exposes the container's TCP port 9875 on port 9875 of the host computer.
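
To make the roles of the `-p` and `-v` options concrete, the sketch below assembles the same command as an argument list that could be handed to `subprocess`. The helper `docker_run_args` and its parameter names are hypothetical; it uses only the flags that appear in the command above.

```python
import os


def docker_run_args(host_dir, port=9875, image="bio"):
    """Build the argument list for the docker run command shown above."""
    # Use an absolute path so the value is treated as a host directory
    host_dir = os.path.abspath(host_dir)
    return [
        "docker", "run", "-ti",
        "-p", "%d:%d" % (port, port),   # host port : container port
        "-v", "%s:/data" % host_dir,    # host directory : mount point in the container
        image,
    ]


if __name__ == "__main__":
    # Print the command rather than running it; pass the list to
    # subprocess.call(...) to actually start the container.
    print(" ".join(docker_run_args(".")))
```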

4. If you are using boot2docker, the final configuration step is to run the following on the command line of your host operating system:

**VBoxManage controlvm boot2docker-vm natpf1 "name,tcp,127.0.0.1,9875,,9875"**

On Windows, the VBoxManage binary will probably be in C:\Program Files\Oracle\VirtualBox.

On a native Docker installation, this step is not needed.

5. Point your browser at http://localhost:9875; you should see the IPython Notebook server. Choose the Welcome notebook to start!
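
If the page does not load, a first diagnostic is to check whether anything is listening on the forwarded port. The sketch below is one way to do that with the standard library; `port_is_open` is a hypothetical helper, and a successful connection only shows that some service accepts connections on that port, not that it is the notebook server.

```python
import socket


def port_is_open(host="localhost", port=9875, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
    conn.close()
    return True


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:9875")
    else:
        print("Nothing reachable on localhost:9875; check the -p option or port forwarding")
```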

See also: a paper on arXiv that introduces Docker with a focus on reproducible research, at http://arxiv.org/abs/1410.0846.

## Interfacing with R via rpy2

#### Getting ready

Download the metadata file for the 1000 Genomes sequence index: go to https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/Datasets.ipynb and download the sequence.index file.

If you are using notebooks, open the 00_Intro/Interfacing_R.ipynb notebook and just execute the wget command at the top.
This file has information about all the FASTQ files in the project (we will use data from the Human 1000 Genomes Project in the chapters to come). It includes the FASTQ file, the sample ID, the population of origin, and important per-lane statistics, such as the number of reads and the number of DNA bases read.
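
Before moving to R, it can be useful to look at the file from plain Python. The sketch below assumes the file is tab-separated with a header row, and that the columns are named `FASTQ_FILE`, `SAMPLE_NAME`, `POPULATION`, `READ_COUNT` and `BASE_COUNT`. These names are assumptions to check against the real header, and the inline rows are made-up placeholders, not project data.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; the column names are assumptions
# about the sequence.index header.
SAMPLE = (
    "FASTQ_FILE\tSAMPLE_NAME\tPOPULATION\tREAD_COUNT\tBASE_COUNT\n"
    "a_1.fastq.gz\tS1\tPOP_A\t100\t7600\n"
    "a_2.fastq.gz\tS1\tPOP_A\t50\t3800\n"
    "b_1.fastq.gz\tS2\tPOP_B\t10\t1000\n"
)


def bases_per_population(handle):
    """Sum BASE_COUNT per POPULATION from a tab-separated index file."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        totals[row["POPULATION"]] += int(row["BASE_COUNT"])
    return dict(totals)


if __name__ == "__main__":
    print(bases_per_population(io.StringIO(SAMPLE)))
```

For the real file, pass `open("sequence.index")` instead of the in-memory sample. Rows with missing or non-numeric counts would need extra handling that this sketch leaves out.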

#### How to do it...

See Interfacing_R.ipynb

## Performing R magic with IPython

#### Getting ready and How to do it...

See R_magic.ipynb