# Dockerizing R code modules and converting to FireCloud Workflows

* Basic Docker documentation can be found at https://docs.docker.com/. 
* Use the `r-base` image as the starting image for all R programs. `r-base` documentation can be found at https://hub.docker.com/_/r-base/. 
    + `r-base` can be used in interactive mode (try running this in a terminal):
    <p>`$ docker run -ti --rm r-base`</p>
    + in batch/script mode (the `-v` option mounts a local directory into the container):
    <p>`$ docker run -ti --rm -v "$PWD":/home/docker -w /home/docker -u docker r-base R CMD check .`</p>
    + or by invoking a shell in the container and then running `R` or `Rscript`:
    <p>`$ docker run -ti --rm r-base /usr/bin/bash`</p>
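
The batch-mode invocation can be wrapped in a small helper so that the mount and user options are not retyped each time. This is a minimal sketch, not part of the original pipeline: the function name `run_r_script`, the `DRY_RUN` variable and the file name `analysis.R` are hypothetical, and it assumes the script sits in the current directory. The `-ti` flags are left out because a batch run does not need a terminal.

```shell
# Hypothetical helper: run an R script from the current directory inside r-base.
# When DRY_RUN=1, the docker command is printed instead of executed.
run_r_script() {
    script="$1"
    # Build the command in the positional parameters so quoting is preserved
    set -- docker run --rm -v "$PWD":/home/docker -w /home/docker -u docker r-base Rscript "$script"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Show the command that would be run for analysis.R (placeholder file name)
DRY_RUN=1
run_r_script analysis.R
```

Setting `DRY_RUN=0` (or leaving it unset) runs the container for real.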
    

### Step 0: Create a docker image with R-utilities
* Create a directory `r-util` and add a `Dockerfile` to it that specifies the new image and the associated R code. Include in this directory all code and files that need to be incorporated into the image. R libraries can be installed automatically using the code shown below (also see https://blog.jessfraz.com/post/r-containers-for-data-science/).

In [None]:
%%bash
# on the Mac with flynn://prot_proteomics mounted
cd /Volumes/prot_proteomics/Projects/PGDAC/docker
mkdir r-util
cd r-util
cp -r /Volumes/prot_proteomics/LabMembers/manidr/R-utilities .
cat > Dockerfile <<EOF

FROM r-base
MAINTAINER manidr@broadinstitute.org

# external dependencies for installing pacman library in R
# with pacman, any missing libraries will be automatically added
RUN apt-get update
RUN apt-get -t unstable install -y libssl-dev
RUN apt-get -t unstable install -y libcurl4-openssl-dev

# install packages
RUN echo 'install.packages(c( \
    "MASS", \
    "MethComp", \
    "NMF", \
    "PerformanceAnalytics", \
    "RColorBrewer", \
    "RankAggreg", \
    "RobustRankAggreg", \
    "bpca", \
    "caret", \
    "e1071", \
    "fastcluster", \
    "ggplot2", \
    "glmnet", \
    "gplots", \
    "lattice", \
    "limma", \
    "lme4", \
    "maptools", \
    "mclust", \
    "misc3d", \
    "mixtools", \
    "nlme", \
    "pacman", \
    "pamr", \
    "parmigene", \
    "psych", \
    "randomForest", \
    "reshape", \
    "rgl", \
    "rhdf5", \
    "samr", \
    "scales", \
    "scatterplot3d", \
    "smacof", \
    "sn", \
    "tensor", \
    "tools", \
    "verification" \
), repos="http://cran.us.r-project.org", dependencies=TRUE)' > /tmp/packages.R  \\
   && Rscript /tmp/packages.R
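# Bioconductor libraries (limma and rhdf5 are distributed through Bioconductor, not CRAN);
# recent R releases use BiocManager in place of the retired biocLite script
RUN echo 'install.packages("BiocManager", repos="http://cran.us.r-project.org"); BiocManager::install(c("Biobase", "graph", "Rgraphviz", "impute", "rhdf5", "limma"), ask=FALSE)' > /tmp/biocpkgs.R \\
   && Rscript /tmp/biocpkgs.R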

COPY R-utilities/ /prot/proteomics/Projects/R-utilities

EOF

* After creating the Dockerfile, `build` the image and then `run` a container from it (use the same image name and tag in both commands):
<p>`$ docker build --rm -t broadcptac/r-util:1 .`</p>
<p>`$ docker run -ti --rm -v $(pwd)/data:/home/user/data broadcptac/r-util:1 Rscript <your-script>.R`</p>
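
Once the image is built, it can be worth confirming that the R packages actually load inside it, since a failed `install.packages` call only produces a warning during the build. The sketch below generates an R one-liner for that purpose. The function name `r_check_expr` is hypothetical, and the package names passed to it are only examples taken from the list above.

```shell
# Hypothetical helper: print an R expression that stops with an error
# if any of the named packages cannot be loaded.
r_check_expr() {
    pkgs=""
    for p in "$@"; do
        # Append each name as a quoted, comma-separated R string
        pkgs="${pkgs:+$pkgs, }\"$p\""
    done
    printf 'stopifnot(sapply(c(%s), requireNamespace, quietly = TRUE))' "$pkgs"
}

# Print the expression for a few packages
r_check_expr MASS pacman limma
echo

# To run the check inside the image built above (requires docker and the image):
#   docker run --rm broadcptac/r-util:1 Rscript -e "$(r_check_expr MASS pacman limma)"
```

`Rscript` exits with a non-zero status when `stopifnot` fails, so the `docker run` line can be used directly in a build script.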

### Step 1: Create a docker image with PGDAC basic pipeline
* Create a directory `pgdac-basic`, add a `Dockerfile` to it, copy the associated R code, and build the docker image.

In [None]:
%%bash
# on the Mac with flynn://prot_proteomics mounted
cd /Volumes/prot_proteomics/Projects/PGDAC/docker
mkdir pgdac-basic
cd pgdac-basic
cp -r ../../src .
cat > Dockerfile <<EOF

FROM broadcptac/r-util:1
MAINTAINER manidr@broadinstitute.org

COPY src /prot/proteomics/Projects/PGDAC/src

EOF

# tag with an underscore so the name matches the pgdac_basic repository pushed in Step 3
docker build --rm -t broadcptac/pgdac_basic:1 .

### Step 2: Setup FireCloud/WDL/Cromwell environment and test workflow
See the README.md and Wiki at https://github.com/broadinstitute/gdac-firecloud for documentation. The gdac-firecloud repository has a make system that automates workflow creation and testing. See https://github.com/broadinstitute/gdac-firecloud/wiki/Using-Make and https://github.com/broadinstitute/gdac-firecloud/wiki/Adding-Tasks-and-Workflows-to-Firecloud.

In [None]:
%%bash

# use gdac-firecloud to create and test a workflow
cd /Volumes/prot_proteomics/Projects/PGDAC/
use .git-2.11.0-with-svn
git clone https://github.com/broadinstitute/gdac-firecloud.git   # GitHub no longer serves the git:// protocol

# create a workflow for the PGDAC pipeline
cd gdac-firecloud/workflows
make template FLOW=pgdac_basic
cp ../../wdl/workflows/pgdac_basic/pgdac_basic.wdl pgdac_basic/.   # copy previously created workflow description
cd pgdac_basic
make validate
make inputs   # edit tests/inputs.json to set parameters for workflow
make run  # if errors, try: java -Xmx4096m -jar ../../bin/cromwell.jar run pgdac_basic.wdl tests/inputs.json
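
The commands in this step rely on `git`, `make` and `java` being on the path. A short preflight check, sketched below, reports everything that is missing in one pass before the make targets are run. The function name `require_cmds` is hypothetical and is not part of gdac-firecloud.

```shell
# Hypothetical preflight check: report every missing command and
# return non-zero if at least one is absent.
require_cmds() {
    missing=0
    for c in "$@"; do
        if ! command -v "$c" >/dev/null 2>&1; then
            echo "missing required command: $c" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical use before `make validate` and `make run`:
#   require_cmds git make java || exit 1
```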

### Step 3: Upload docker image to docker hub
FireCloud retrieves docker images from Docker Hub. To use an image in a FireCloud workflow, upload (push) it to Docker Hub:
* Create a Docker Hub account (username: manidr)
* Click on "Organizations" and create a new broadcptac organization on Docker Hub under manidr's account
* Under the broadcptac organization, create a pgdac_basic repository
* Also upload the r-util docker image (so that it can be retrieved later if it is lost on the local computer)

In [None]:
%%bash

docker login -u manidr
docker push broadcptac/r-util:1
docker push broadcptac/pgdac_basic:1
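
`docker push` only succeeds if the reference matches the tag given to `docker build` exactly, including hyphens versus underscores. Deriving both from one function avoids that kind of mismatch. This is a sketch under stated assumptions: `image_ref` and `build_and_push` are hypothetical names, and the build directory in the usage comment is a placeholder.

```shell
ORG=broadcptac   # Docker Hub organization created in the steps above
TAG=1            # image tag used throughout this document

# Print the full image reference for a repository name
image_ref() {
    printf '%s/%s:%s' "$ORG" "$1" "$TAG"
}

# Build the image in directory $2 and push it under the same reference
build_and_push() {
    ref=$(image_ref "$1")
    docker build --rm -t "$ref" "$2" && docker push "$ref"
}

# Usage (requires docker and a prior `docker login`):
#   build_and_push pgdac_basic path/to/pgdac-basic
```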

### Step 4: Upload workflow code to FireCloud
One approach is to run the broadinstitute/firecloud-cli docker image and upload the WDL workflow from inside it. An alternative is to configure Makefile.my in gdac-firecloud and then run `make push_wdl`. Also see the workshop materials in the PGDAC/firecloud directory for additional documentation.

In [None]:
%%bash
cd /Volumes/prot_proteomics/Projects/PGDAC/gdac-firecloud/workflows/pgdac_basic

docker run --rm -it -v "$HOME"/.config:/.config -v "$PWD":/working broadinstitute/firecloud-cli bash
# in the running docker, authenticate google cloud
gcloud auth login
# then upload wdl to firecloud using CLI, to the broadcptac workspace
# if the workspace specified by -s does not exist, it will be created
firecloud -u https://api.firecloud.org/api -m push -s broadcptac -n pgdac_basic -t Workflow -y "PGDAC basic pipeline" pgdac_basic.wdl


### Step 5: Create FireCloud workflow and run it
Use the directions in https://github.com/broadinstitute/gdac-firecloud/wiki/Adding-Tasks-and-Workflows-to-Firecloud#5-create-and-execute-a-workspace-method-configuration and additional FireCloud documentation to create a new workflow in FireCloud, set parameters, upload data and run the analysis:
* Create a new workspace in FireCloud (Test workflow is nci-manidr-broadinstitute-org/broad-cptac-pgdac-pipeline-test)
* Import the pgdac_basic method into the workspace
* Upload input data files to the bucket associated with the workspace (using the GUI or gsutil). Ensure that the correct user is selected; otherwise the bucket will not show its contents and a permission denied error is reported.
* Use the data tab to import the PGDAC/firecloud/pgdac_basic/participant.tsv file
* In the Method Configurations tab, select pgdac_basic and edit each of the input parameter fields to point to the appropriate column in the participant.tsv file (`this.<colname>`)
* Finally, click Launch Analysis in the Method Configurations tab to run the pipeline.
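
For orientation, the sketch below writes a minimal participant load file of the kind imported in the data tab. FireCloud identifies the entity type from the first column header, which for participants is `entity:participant_id`. Everything else here is an assumption for illustration: the column name `input_file`, the participant id, the bucket path, and the helper names `make_participant_tsv` and `check_tsv`. The real PGDAC/firecloud/pgdac_basic/participant.tsv may use different columns.

```shell
# Write a minimal participant load file to the path given as $1.
# The second column name, the id, and the gs:// path are placeholders.
make_participant_tsv() {
    printf 'entity:participant_id\tinput_file\n' > "$1"
    printf 'participant-01\tgs://YOUR-WORKSPACE-BUCKET/data/input-01.txt\n' >> "$1"
}

# Succeed only if every row has the same number of tab-separated fields.
check_tsv() {
    awk -F '\t' 'NR == 1 { n = NF } NF != n { bad = 1 } END { exit bad }' "$1"
}

tsv="${TMPDIR:-/tmp}/participant.tsv"
make_participant_tsv "$tsv" && check_tsv "$tsv" && echo "wrote $tsv"
```

With a file like this, the method configuration input would be set to `this.input_file`, following the `this.<colname>` pattern described above.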