segue.Rmd

`ro cache=FALSE, comment=NA or`

# Segue = MapReduce + Hadoop + Amazon EC2 + R

`Segue` is an R package by JD Long which will let you run your script across hundreds of computers in the Amazon compute cloud.  
This parallel execution uses Amazon's Elastic Map Reduce to work across many clusters, but that's all behind the scenes.  


## Install

JD hosts the project on [GoogleCode](http://code.google.com/p/segue/), but since we're focused on git in this class I've [created a copy on github](https://github.com/cboettig/segue).  Clone and install the copy using:

```
git clone https://cboettig@github.com/cboettig/segue.git
cd segue
R CMD INSTALL .
```

Or if you have the [`devtools` package](https://github.com/hadley/devtools) installed,

```
library(devtools)
install_github("segue", "cboettig")
```

You'll also need to setup a cloud computing account with Amazon if you haven't done so yet.  Instructions for this are below, but let's first take a quick look at how easy this makes it to launch an Amazon cloud instance and run some code.  If you're already familiar with creating Amazon instances, logging into them, and running R, this will actually seem even more impressive:


## A simple R example

(All R code here is in knitr, so it is actually run and generates the output you see).  
Let's illustrate this with a little likelihood routine.  Create some data, a likelihood function, and two possible parameter values we want to try out.  


``` {r } 
X <- rnorm(20, 2, 5)
loglik <- function(pars){
      sum( dnorm(X, mean=pars[1], sd=pars[2], log=TRUE) )
}
pars <- list(c(mu=1,sigma=2), c(mu=2,sigma=3))
````

Let's make sure method works locally before we go testing this on the cloud.  

``` {r } 
local <- lapply(list(1,2), function(i){
               loglik(pars[[i]])
})
````

Load the library and establish our login credentials.  

``` {r }
library(segue)
source("~/.Ramazon_keys")
setCredentials(getOption("amazon_key"), getOption("amazon_secret"))
````


Create a cluster on Amazon -- this will start charge your account by the hour.  
Note that we have to specify in this call the R objects we want to load onto the Amazon computers, along with any packages we might need 


``` {r eval = FALSE} 
myCluster <- createCluster(numInstances=2, 
                           cranPackages=c("sde"), 
                           rObjectsOnNodes=list(X=X,pars=pars,loglik=loglik))
````

The "Elastic Map Reduce" version of the `lapply` function works in almost same way as the standard `lapply`:

``` {r eval=FALSE}
cloud <- emrlapply(myCluster, as.list(1,2), function(i){
               loglik(pars[[i]])
})
 
stopCluster(myCluster)
````

The final command stop the cluster to make sure we're not being billed after the our task is done.  That's all there is to it.  Let's compare the results:

``` {r }
local
#cloud
````


## Configure Amazon
If you've already set up an Amazon EC2 account, the easiest thing to do is do store your Amazon key and Amazon secret key in a secure R script on your computer which stores these as `options`.