# Bring Your Own R Algorithm
_**Create a Docker container for training R algorithms and hosting R models**_

---

## Contents

1. [Background](#Background)
1. [Preparation](#Preparation)
1. [Code](#Code)
  1. [Fit](#Fit)
  1. [Serve](#Serve)
  1. [Dockerfile](#Dockerfile)
  1. [Publish](#Publish)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)
1. [Predict](#Predict)
1. [Extensions](#Extensions)

---
## Background

R is a popular open source statistical programming language, with a lengthy history in Data Science and Machine Learning.  The breadth of algorithms available as R packages is impressive, which fuels a growing community of users.  The R kernel can be installed into Amazon SageMaker Notebooks, and Docker containers which use R can be used to take advantage of Amazon SageMaker's flexible training and hosting functionality.  This notebook illustrates a simple use case for creating an R container and then using it to train and host a model.  To take advantage of boto3, we'll use Python within the notebook, but the same steps could be done entirely in R by invoking command line tools.

---
## Preparation

Let's start by defining the region, bucket, and prefix information we'll use.

In [None]:
import os
import boto3
import time
import json

os.environ['AWS_DEFAULT_REGION'] = 'us-west-2'
role = boto3.client('iam').list_instance_profiles()['InstanceProfiles'][0]['Roles'][0]['Arn']

bucket = '<your_s3_bucket_here>'
prefix = 'r_bring_your_own'

---
## Code

For this example, we'll need 4 supporting code files.

### Fit

`mars.R` creates functions to fit and serve our model.  The algorithm we've chosen to use is [Multivariate Adaptive Regression Splines (MARS)](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines).  This is a suitable example as it's a unique and powerful algorithm, but it's not as broadly used as the Amazon SageMaker algorithms, and it isn't available in Python's scikit-learn library.  R's repository of packages is filled with algorithms that share these same criteria.

_The top of the code is devoted to setup: bringing in the libraries we'll need and setting up the file paths as detailed in the Amazon SageMaker documentation on bringing your own container._

```R
# Bring in library that contains multivariate adaptive regression splines (MARS)
library(mda)

# Bring in library that allows parsing of JSON training parameters
library(jsonlite)

# Bring in library for prediction server
library(plumber)


# Setup parameters
# Container directories
prefix <- '/opt/ml'
input_path <- paste(prefix, 'input/data', sep='/')
output_path <- paste(prefix, 'output', sep='/')
model_path <- paste(prefix, 'model', sep='/')
param_path <- paste(prefix, 'input/config/hyperparameters.json', sep='/')

# Channel holding training data
channel_name = 'train'
training_path <- paste(input_path, channel_name, sep='/')
```

_Next, we define a train function that actually fits the model to the data.  For the most part this is idiomatic R, with a bit of maneuvering up front to take in parameters from a JSON file, and at the end to output a success indicator._

```R
# Setup training function
train <- function() {

    # Read in hyperparameters
    training_params <- read_json(param_path)

    target <- training_params$target

    # Use the supplied degree when present, otherwise fall back to 2
    if (!is.null(training_params$degree)) {
        degree <- as.numeric(training_params$degree)}
    else {
        degree <- 2}

    # Bring in data
    training_files = list.files(path=training_path, full.names=TRUE)
    training_data = do.call(rbind, lapply(training_files, read.csv))
    
    # Convert to model matrix
    training_X <- model.matrix(~., training_data[, colnames(training_data) != target])

    # Save factor levels for scoring
    factor_levels <- lapply(training_data[, sapply(training_data, is.factor), drop=FALSE],
                            function(x) {levels(x)})
    
    # Run multivariate adaptive regression splines algorithm
    model <- mars(x=training_X, y=training_data[, target], degree=degree)
    
    # Generate outputs
    mars_model <- model[!(names(model) %in% c('x', 'residuals', 'fitted.values'))]
    attributes(mars_model)$class <- 'mars'
    save(mars_model, factor_levels, file=paste(model_path, 'mars_model.RData', sep='/'))
    print(summary(mars_model))

    write.csv(model$fitted.values, paste(output_path, 'data/fitted_values.csv', sep='/'), row.names=FALSE)
    write('success', file=paste(output_path, 'success', sep='/'))}
```
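
The hyperparameter handling at the top of `train` can be sketched in Python, the notebook's driver language, to make the defaulting rule explicit.  This is only an illustration of the logic, not part of the container: the helper name `load_hyperparameters` is hypothetical, and it assumes values arrive as strings, as they do in the training request used in this notebook.

```python
import json


def load_hyperparameters(raw_json, default_degree=2):
    """Parse a hyperparameters.json payload into typed training settings.

    Hypothetical helper that mirrors the R logic: `target` is required,
    `degree` is optional and falls back to a default.
    """
    params = json.loads(raw_json)
    # The target column has no sensible default, so fail fast when it is missing.
    if "target" not in params:
        raise ValueError("hyperparameter 'target' is required")
    # Values are assumed to be strings, so convert the degree explicitly.
    if params.get("degree") is not None:
        degree = int(params["degree"])
    else:
        degree = default_degree
    return {"target": params["target"], "degree": degree}


print(load_hyperparameters('{"target": "Sepal.Length", "degree": "2"}'))
print(load_hyperparameters('{"target": "Sepal.Length"}'))
```

The second call omits `degree`, so the default of 2 is used, which is the behavior the R `train` function is meant to have.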

_Then, we set up the serving function (which is really just a short wrapper around our plumber.R file that we'll discuss [next](#Serve))._

```R
# Setup scoring function
serve <- function() {
    app <- plumb(paste(prefix, 'plumber.R', sep='/'))
    app$run(host='0.0.0.0', port=8080)}
```

_Finally, a bit of logic to determine if, based on the options passed to calling this script, we are using the container for training or hosting._

```R
# Run at start-up
args <- commandArgs()
if (any(grepl('train', args))) {
    train()}
if (any(grepl('serve', args))) {
    serve()}
```

### Serve
`plumber.R` uses the [plumber](https://www.rplumber.io/) package to create a lightweight HTTP server for processing requests in hosting.  Note the specific syntax, and see the plumber help docs for additional detail on more specialized use cases.

Per the Amazon SageMaker documentation, our service needs to accept GET requests to `/ping` and POST requests to `/invocations`.  plumber specifies these routes with custom comments, followed by functions that take specific arguments.

Here invocations does most of the work, ingesting our trained model, handling the HTTP request body, and producing a CSV output of predictions.

```R
# plumber.R


#' Ping to show server is there
#' @get /ping
function() {
    list(status='200', code='200')}


#' Parse the CSV sent in the request body and return predictions
#' @param req The http request sent
#' @post /invocations
function(req) {

    # Setup locations
    prefix <- '/opt/ml'
    model_path <- paste(prefix, 'model', sep='/')

    # Bring in model file and factor levels
    load(paste(model_path, 'mars_model.RData', sep='/'))

    # Read in data
    conn <- textConnection(gsub('\\\\n', '\n', req$postBody))
    data <- read.csv(conn)
    close(conn)

    # Convert input to model matrix
    scoring_X <- model.matrix(~., data, xlev=factor_levels)

    # Return prediction
    return(paste(predict(mars_model, scoring_X, row.names=FALSE), collapse=','))}
```
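
To see why the factor levels are saved at training time and passed back in through `xlev`, here is a minimal Python sketch of indicator encoding.  It is an analogy rather than a port of `model.matrix`: the helper `indicator_row` is hypothetical, and it assumes the first level acts as the baseline, as in R's default treatment contrasts.

```python
def indicator_row(value, levels):
    """Encode one categorical value as 0/1 indicators.

    The first entry of `levels` is treated as the baseline, so the result
    has one column per remaining level.
    """
    if value not in levels:
        raise ValueError("unseen level: {!r}".format(value))
    # The width depends only on `levels`, never on the data being scored.
    return [1 if value == level else 0 for level in levels[1:]]


# Levels saved at training time (the three iris species).
species_levels = ["setosa", "versicolor", "virginica"]

# A scoring batch that happens to contain a single species.
batch = ["setosa", "setosa"]

# With the saved levels, every row still gets both indicator columns.
print([indicator_row(v, species_levels) for v in batch])  # [[0, 0], [0, 0]]

# With levels inferred from the batch alone, the columns disappear.
print([indicator_row(v, sorted(set(batch))) for v in batch])  # [[], []]
```

In this sketch, encoding against the batch's own levels produces rows of the wrong width for the trained model, which is the situation the saved `factor_levels` are there to prevent.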

### Dockerfile

Smaller containers are preferred for Amazon SageMaker, so this container is kept minimal.  It simply starts with Ubuntu, installs R, mda, and plumber libraries, then adds `mars.R` and `plumber.R`, and finally runs `mars.R` when the entrypoint is launched.

```Dockerfile
FROM ubuntu:16.04

MAINTAINER David Arpin <arpin@amazon.com>

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    r-base \
    r-base-dev \
    ca-certificates

RUN R -e "install.packages(c('mda', 'plumber'), repos='https://cloud.r-project.org')"

COPY mars.R /opt/ml/mars.R
COPY plumber.R /opt/ml/plumber.R

ENTRYPOINT ["/usr/bin/Rscript", "/opt/ml/mars.R", "--no-save"]
```

### Publish
Now, to publish this container to ECR, we run the `build_and_push.sh` script using `source build_and_push.sh rmars` from the terminal.  This code sets up permissions and naming with minimal effort.  We'll skip the details for the sake of brevity.
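
Once pushed, the image is addressed by an ECR URI built from the account ID, region, repository, and tag.  The sketch below shows that composition; the helper name `ecr_image_uri` is hypothetical, and the account ID is the one that appears in the training request later in this notebook.  For your own account, the ID could be looked up with `boto3.client('sts').get_caller_identity()['Account']`.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Compose the registry path ECR uses for a pushed image (hypothetical helper)."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account_id, region, repository, tag)


# Reproduces the image URI used for training and hosting in this notebook.
print(ecr_image_uri("345362745630", "us-west-2", "rmars"))
# 345362745630.dkr.ecr.us-west-2.amazonaws.com/rmars:latest
```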

---
## Data
For this illustrative example, we'll use the simple `iris` dataset which can be brought into R using:

```R
data(iris)
write.csv(iris, file='iris.csv', row.names=FALSE)
```

Then let's copy the data to S3.

In [None]:
train_file = 'iris.csv'
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', train_file)).upload_file(train_file)
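
One caveat about the upload above: `os.path.join` uses the local path separator, which would put backslashes in the object key on Windows.  A minimal sketch of a portable alternative follows; the helper names `s3_key` and `s3_uri` and the bucket `my-bucket` are hypothetical.

```python
import posixpath


def s3_key(*parts):
    """Join key segments with forward slashes, independent of the local OS."""
    return posixpath.join(*parts)


def s3_uri(bucket, *parts):
    """Build an s3:// URI for a bucket and key segments."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))


print(s3_key("r_bring_your_own", "train", "iris.csv"))
# r_bring_your_own/train/iris.csv
print(s3_uri("my-bucket", "r_bring_your_own", "train"))
# s3://my-bucket/r_bring_your_own/train
```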

_Note: Although we could, we'll avoid doing any preliminary transformations on the data, instead choosing to do those transformations inside the container.  This is not typically the best practice for model efficiency, but provides some benefits in terms of flexibility._

---
## Train

Now, let's set up the information needed to train a Multivariate Adaptive Regression Splines (MARS) model on iris data.  In this case, we'll predict `Sepal.Length` rather than the more typical classification of `Species` to show how factors might be included in a model and limit the case to regression.

- Specify the role to use
- Give the training job a name
- Point the algorithm to the container we created
- Specify training instance resources (in this case our algorithm is only single-threaded so stick to 1 instance)
- Point to the S3 location of our input data and the `train` channel expected by our algorithm
- Point to the S3 location for output
- Provide hyperparameters (keeping it simple)
- Set a maximum run time

In [None]:
r_job = 'r-byo-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

print("Training job", r_job)

r_training_params = {
    "RoleArn": role,
    "TrainingJobName": r_job,
    "AlgorithmSpecification": {
        "TrainingImage": "345362745630.dkr.ecr.us-west-2.amazonaws.com/rmars:latest",
        "TrainingInputMode": "File"
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 10
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}/train".format(bucket, prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)
    },
    "HyperParameters": {
        "target": "Sepal.Length",
        "degree": "2"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 60 * 60
    }
}

Now let's kick off our training job on Amazon SageMaker, using the parameters we just created.  Because training is serverless, we don't have to wait for our job to finish to continue, but for this case, let's set up a waiter so we can monitor the status of our training.

In [None]:
%%time

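# SageMaker client (named `im` throughout this notebook)
im = boto3.client('sagemaker')
im.create_training_job(**r_training_params)

status = im.describe_training_job(TrainingJobName=r_job)['TrainingJobStatus']
print(status)
im.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=r_job)

# Refresh the status after waiting; the value read above predates job completion
status = im.describe_training_job(TrainingJobName=r_job)['TrainingJobStatus']
print(status)
if status == 'Failed':
    message = im.describe_training_job(TrainingJobName=r_job)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
    raise Exception('Training job failed')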
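
If you would rather poll than rely on a built-in waiter, the idea can be sketched as a small loop.  This is an illustrative alternative, not what the cell above does: the function `wait_for_terminal_status` is hypothetical, the status source is injected so the loop can be exercised without a real job, and the terminal statuses listed are assumed to be `Completed`, `Failed`, and `Stopped`.

```python
import time


def wait_for_terminal_status(get_status,
                             terminal=("Completed", "Failed", "Stopped"),
                             delay_seconds=30,
                             max_attempts=120,
                             sleep=time.sleep):
    """Call get_status() until it returns a terminal status, then return it."""
    for _ in range(max_attempts):
        status = get_status()
        if status in terminal:
            return status
        sleep(delay_seconds)
    raise TimeoutError("job did not reach a terminal status in time")


# Demonstration with a stand-in status source instead of a real training job.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda seconds: None))
# Completed
```

Against a real job, `get_status` could be `lambda: im.describe_training_job(TrainingJobName=r_job)['TrainingJobStatus']`, and the returned status can be checked for `'Failed'` exactly as in the cell above.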

---
## Host

Hosting the model we just trained takes three steps in Amazon SageMaker.  First, we define the model we want to host, pointing the service to the model artifact our training job just wrote to S3.

In [None]:
r_hosting_container = {
    'Image': "345362745630.dkr.ecr.us-west-2.amazonaws.com/rmars:latest",
    'ModelDataUrl': im.describe_training_job(TrainingJobName=r_job)['ModelArtifacts']['S3ModelArtifacts']
}

create_model_response = im.create_model(
    ModelName=r_job,
    ExecutionRoleArn=role,
    PrimaryContainer=r_hosting_container)

print(create_model_response['ModelArn'])

Next, let's create an endpoint configuration, passing in the model we just registered.  In this case, we'll start with a single ml.c4.xlarge instance.

In [None]:
r_endpoint_config = 'r-endpoint-config-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
print(r_endpoint_config)
create_endpoint_config_response = im.create_endpoint_config(
    EndpointConfigName=r_endpoint_config,
    ProductionVariants=[{
        'InstanceType': 'ml.c4.xlarge',
        'InitialInstanceCount': 1,
        'ModelName': r_job,
        'VariantName': 'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

Finally, we'll create the endpoint using our endpoint configuration from the last step.

In [None]:
%%time

r_endpoint = 'r-endpoint-' + time.strftime("%Y%m%d%H%M", time.gmtime())
print(r_endpoint)
create_endpoint_response = im.create_endpoint(
    EndpointName=r_endpoint,
    EndpointConfigName=r_endpoint_config)
print(create_endpoint_response['EndpointArn'])

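resp = im.describe_endpoint(EndpointName=r_endpoint)
status = resp['EndpointStatus']
print("Status: " + status)

im.get_waiter('endpoint_in_service').wait(EndpointName=r_endpoint)

# Describe the endpoint again so the status reflects the state after waiting
resp = im.describe_endpoint(EndpointName=r_endpoint)
status = resp['EndpointStatus']
print("Arn: " + resp['EndpointArn'])
print("Status: " + status)

if status != 'InService':
    raise Exception('Endpoint creation did not succeed')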

---
## Predict
To confirm our endpoint is working properly, let's try to invoke it.

In [None]:
runtime = boto3.Session().client(service_name='sagemaker-runtime')

# TODO Get this payload from an actual file
payload = 'Sepal.Width,Petal.Length,Petal.Width,Species\n3.5,1.4,0.2,setosa\n3,1.4,0.2,setosa\n3.2,1.3,0.2,setosa\n3.1,1.5,0.2,setosa\n6,1.4,0.2,setosa\n3.9,1.7,0.4,setosa'
response = runtime.invoke_endpoint(EndpointName=r_endpoint,
                                   ContentType='text/csv',
                                   Body=payload)

result = json.loads(response['Body'].read().decode())
result
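
The serving function joins its predictions into one comma-separated string, so the decoded result still has to be split into numbers.  A minimal sketch follows, assuming the response body is either a JSON string or a one-element JSON array wrapping that string; the helper `parse_predictions` is hypothetical and the numbers are made-up placeholders, not real model output.

```python
import json


def parse_predictions(body_text):
    """Turn the endpoint's response body into a list of floats."""
    decoded = json.loads(body_text)
    # Unwrap a JSON array of strings into a single comma-separated string.
    if isinstance(decoded, list):
        decoded = ",".join(str(item) for item in decoded)
    return [float(value) for value in str(decoded).split(",")]


# Placeholder body in the assumed format.
print(parse_predictions('["5.02,4.61,4.74"]'))
# [5.02, 4.61, 4.74]
```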

_Note: The payload we're passing in the request is a CSV string with a header record, followed by multiple records separated by newlines.  It also contains text columns, which the serving code converts to the set of indicator variables needed for our model predictions.  Again, this is not a best practice for highly optimized code; however, it showcases the flexibility of bringing your own algorithm._
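
Rather than hard-coding the payload, it could be derived from CSV text such as the contents of `iris.csv`.  The sketch below drops the target column and keeps the header record, matching the shape of the payload above; the helper `build_payload` and its `max_rows` parameter are hypothetical, and the sample rows are the first two iris records.

```python
import csv
import io


def build_payload(csv_text, target, max_rows=None):
    """Return CSV text with the target column removed and the header kept."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    keep = [i for i, name in enumerate(header) if name != target]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header[i] for i in keep])

    written = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        if max_rows is not None and written >= max_rows:
            break
        writer.writerow([row[i] for i in keep])
        written += 1
    # Drop the final newline so the payload ends with the last record.
    return out.getvalue().rstrip("\n")


sample = (
    "Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3,1.4,0.2,setosa\n"
)
print(build_payload(sample, target="Sepal.Length"))
```

Reading the file with `open('iris.csv').read()` and passing the text to this function would give a string usable as `Body` in `invoke_endpoint`.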

---
## Extensions

This notebook showcases a straightforward example to train and host an R algorithm in Amazon SageMaker.  As mentioned previously, this notebook could also be written in R.  We could even train the algorithm entirely within a notebook and then simply use the serving portion of the container to host our model.

Other extensions could include setting up the R algorithm to train in parallel.  Although R is not the easiest language to build distributed applications on top of, this is possible.  In addition, running multiple versions of training simultaneously would allow for parallelized grid (or random) search for optimal hyperparameter settings.  This would more fully realize the benefits of serverless training.
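
As a rough illustration of the grid search idea, the sketch below expands a grid of hyperparameter values into one training request per combination.  The function `grid_training_params`, the job prefix `r-byo-grid`, and the trimmed-down `base` dictionary are all hypothetical; a real run would start from the full `r_training_params` and submit each result as a separate training job.

```python
import copy
import itertools


def grid_training_params(base_params, grid, job_prefix):
    """Build one training-request dict per combination of grid values."""
    names = sorted(grid)
    jobs = []
    for index, values in enumerate(itertools.product(*(grid[n] for n in names))):
        params = copy.deepcopy(base_params)  # keep the base request untouched
        params["TrainingJobName"] = "{}-{}".format(job_prefix, index)
        # Hyperparameter values are sent as strings, as in the training request.
        params["HyperParameters"].update({n: str(v) for n, v in zip(names, values)})
        jobs.append(params)
    return jobs


# Trimmed-down stand-in for r_training_params.
base = {"TrainingJobName": "", "HyperParameters": {"target": "Sepal.Length"}}
for job in grid_training_params(base, {"degree": [1, 2, 3]}, "r-byo-grid"):
    print(job["TrainingJobName"], job["HyperParameters"])
```

Because the jobs are independent, they can be started back to back without waiting for each one to finish.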