# Running Parallel Jobs on JupyterHub in R
Author: Zach Schira

[JupyterHub](https://jupyterhub.readthedocs.io/en/latest/) offers a multi-user environment for running Jupyter Notebooks. Research Computing provides access to a JupyterHub environment with parallel processing support. This tutorial demonstrates how to use the R [parallel](https://stat.ethz.ch/R-manual/R-devel/library/parallel/doc/parallel.pdf) package to run simple parallel jobs within the R kernel on JupyterHub.

## Objectives
- Connect to a remote cluster for parallel processing
- Use the parallel package to run jobs

## Dependencies
- parallel

## Using parallel
First, you must create a cluster of worker processes that will do your parallel processing. This example uses only the cores available on the machine where R is running, but if you run your job through JupyterHub, you can use any of the RC resources described in the [JupyterHub user guide](https://www.rc.colorado.edu/support/user-guide/jupyterhub.html).

In [1]:
library(parallel)
num_cores <- detectCores()
cl <- makeCluster(num_cores)
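
On a shared system you may not want to claim every core that `detectCores` reports, since it counts the cores on the machine rather than the cores allocated to you. The following is a minimal sketch, assuming you want to leave one core free and fall back to a single worker when the count is unavailable; the names `available` and `num_workers` are illustrative only.

```r
library(parallel)

# detectCores() can return NA on platforms where the count is unknown
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the notebook itself, but always keep at least one worker
num_workers <- max(1L, available - 1L)

cl <- makeCluster(num_workers)
worker_count <- length(cl)  # a cluster object has one element per worker
stopCluster(cl)
```

Checking `length(cl)` is a quick way to confirm how many workers were actually started.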

The parallel package contains functions that mirror the base R apply family. For example, `parSapply` is the parallel counterpart of [sapply](http://www.inside-r.org/r-doc/base/sapply). The following example calculates the square of each number from 1 to 28 in parallel.

In [2]:
parSapply(cl, 1:28, function(base)
    base^2)
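
To see how the parallel functions line up with their serial counterparts, the sketch below runs the same squaring function through `sapply`, `parSapply`, and `parLapply`. It assumes a small two-worker cluster and a helper named `square`, both chosen only for illustration.

```r
library(parallel)

cl <- makeCluster(2)  # two local workers, an arbitrary choice for this sketch

square <- function(base) base^2

serial_result   <- sapply(1:28, square)         # runs in the main session
parallel_result <- parSapply(cl, 1:28, square)  # same call shape, run on the workers
list_result     <- parLapply(cl, 1:28, square)  # like lapply: returns a list

stopCluster(cl)

# The serial and parallel vectors should agree element by element
isTRUE(all.equal(serial_result, parallel_result))
```

`parSapply` simplifies its result to a vector the way `sapply` does, while `parLapply` always returns a list.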

This same basic approach can be used for more complicated functions from external libraries, but a few extra steps are needed to ensure everything runs properly. The `clusterEvalQ` function is used to load external libraries on each worker, and `clusterExport` copies variables that you have defined outside of your parallel call to the workers. Note that if you change a variable after you call `clusterExport`, that change will not be reflected in your parallel computations unless you export the variable again. `invisible` is used on the first line of code in this example to hide the output, because `clusterEvalQ` returns the list of loaded libraries from every worker.

In [3]:
#include stats library for the sd function
invisible(clusterEvalQ(cl, library(stats)))
#list containing 4 vectors of 100 randomly generated numbers between 0-100
x <- list(runif(100,0,100), runif(100,0,100), runif(100,0,100), runif(100,0,100))
clusterExport(cl, "x")
parSapply(cl, x, sd)
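
The snapshot behavior of `clusterExport` can be checked directly. The following is a small sketch, assuming a two-worker cluster and a hypothetical variable named `scale_factor`; it reads the value back from the workers with `clusterEvalQ` before and after a second export.

```r
library(parallel)

cl <- makeCluster(2)  # two local workers, chosen only for illustration

scale_factor <- 10    # hypothetical variable used for this demonstration
clusterExport(cl, "scale_factor", envir = environment())

# Change the variable in the main session only
scale_factor <- 1000

# Ask each worker what value it holds: still the exported copy
stale <- unlist(clusterEvalQ(cl, scale_factor))

# Export again to push the new value to the workers
clusterExport(cl, "scale_factor", envir = environment())
fresh <- unlist(clusterEvalQ(cl, scale_factor))

stopCluster(cl)
```

Here `stale` holds the value from the first export and `fresh` holds the updated one, so re-exporting after every change keeps the workers in sync.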

You can also use functions that you have defined yourself. Export them to the workers with `clusterExport`, just as you would a variable. This example defines a function called `calc_avg`, then uses that function in `parSapply`.

In [4]:
calc_avg <- function(vec) {
    avg <- sum(vec)/length(vec)
    avg
}
clusterExport(cl, "calc_avg")
parSapply(cl, x, calc_avg)
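
Exporting matters most when one of your functions calls another, because the workers need both definitions. The sketch below assumes a hypothetical wrapper named `center` that relies on `calc_avg`, and exports the two together.

```r
library(parallel)

cl <- makeCluster(2)  # two local workers, chosen only for illustration

calc_avg <- function(vec) sum(vec) / length(vec)

# Hypothetical wrapper that depends on calc_avg
center <- function(vec) vec - calc_avg(vec)

# Export both functions by name so the workers can resolve the inner call
clusterExport(cl, c("calc_avg", "center"), envir = environment())

inputs <- list(c(1, 2, 3), c(10, 20, 30))
centered <- parLapply(cl, inputs, center)

stopCluster(cl)
```

Each element of `centered` is the corresponding input vector with its mean subtracted.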

Once you have finished your work, free the resources you have been using by shutting down the workers with the `stopCluster` function.

In [5]:
stopCluster(cl)
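
If a parallel call fails partway through, the workers can be left running. One way to guard against that, sketched here with a hypothetical wrapper function `run_squares`, is to register `stopCluster` with `on.exit` so the cluster is released whether the function returns normally or stops with an error.

```r
library(parallel)

square <- function(v) v^2

# Hypothetical wrapper: creates a cluster, uses it, and always cleans up
run_squares <- function(values, workers = 2) {
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl), add = TRUE)  # runs on normal exit and on error
  parSapply(cl, values, square)
}

squares <- run_squares(1:5)
```

Keeping cluster creation and cleanup inside one function means callers never have to remember to call `stopCluster` themselves.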