# Building an R-MARS Docker Image
We will start our MLOps journey by creating a Docker image for training an R [Multivariate Adaptive Regression Splines](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines) (MARS) model.

After we create the Dockerfile and test it locally, we'll send it to our first pipeline, which will build the image and make it available in Amazon ECR.

## First create the training code

`mars.R` defines functions to fit and serve our model. The algorithm we've chosen is [Multivariate Adaptive Regression Splines](https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines). It is a suitable example because it is a powerful algorithm that is less widely used than the Amazon SageMaker built-in algorithms and isn't available in Python's scikit-learn library. R's package repository is filled with algorithms that meet these same criteria.

In [1]:
%%writefile mars.R
# Bring in library that contains multivariate adaptive regression splines (MARS)
library(mda)

# Bring in library that allows parsing of JSON training parameters
library(jsonlite)

# Bring in library for prediction server
library(plumber)


# Setup parameters
# Container directories
prefix <- '/opt/ml'
input_path <- paste(prefix, 'input/data', sep='/')
output_path <- paste(prefix, 'output', sep='/')
model_path <- paste(prefix, 'model', sep='/')
param_path <- paste(prefix, 'input/config/hyperparameters.json', sep='/')

# Channel holding training data
channel_name = 'train'
training_path <- paste(input_path, channel_name, sep='/')


# Setup training function
train <- function() {

    # Read in hyperparameters
    training_params <- read_json(param_path)

    target <- training_params$target

    if (!is.null(training_params$degree)) {
        degree <- as.numeric(training_params$degree)}
    else {
        degree <- 2}

    # Bring in data
    training_files = list.files(path=training_path, full.names=TRUE)
    training_data = do.call(rbind, lapply(training_files, read.csv))
    
    # Convert to model matrix
    training_X <- model.matrix(~., training_data[, colnames(training_data) != target])

    # Save factor levels for scoring
    factor_levels <- lapply(training_data[, sapply(training_data, is.factor), drop=FALSE],
                            function(x) {levels(x)})
    
    # Run multivariate adaptive regression splines algorithm
    model <- mars(x=training_X, y=training_data[, target], degree=degree)
    
    # Generate outputs
    mars_model <- model[!(names(model) %in% c('x', 'residuals', 'fitted.values'))]
    attributes(mars_model)$class <- 'mars'
    save(mars_model, factor_levels, file=paste(model_path, 'mars_model.RData', sep='/'))
    print(summary(mars_model))

    write.csv(model$fitted.values, paste(output_path, 'data/fitted_values.csv', sep='/'), row.names=FALSE)
    write('success', file=paste(output_path, 'success', sep='/'))}


# Setup scoring function
serve <- function() {
    app <- plumb(paste(prefix, 'plumber.R', sep='/'))
    app$run(host='0.0.0.0', port=8080)}


# Run at start-up
args <- commandArgs()
if (any(grepl('train', args))) {
    train()}
if (any(grepl('serve', args))) {
    serve()}

Writing mars.R


## Next, create the Serve function

`plumber.R` uses the [plumber](https://www.rplumber.io/) package to create a lightweight HTTP server for processing requests in hosting. Note the specific syntax, and see the plumber help docs for additional detail on more specialized use cases.

Per the Amazon SageMaker documentation, our service needs to accept GET requests to `/ping` and POST requests to `/invocations`. plumber specifies these routes with special `#'` comments, followed by functions that take specific arguments.

Here `/invocations` does most of the work: it loads our trained model, parses the HTTP request body as CSV, and returns the predictions as a comma-separated string.

In [50]:
%%writefile plumber.R
#' Ping to show server is there
#' @get /ping
function() {
    return('')}


#' Parse input and return prediction from model
#' @param req The http request sent
#' @post /invocations
function(req) {

    # Setup locations
    prefix <- '/opt/ml'
    model_path <- paste(prefix, 'model', sep='/')

    # Bring in model file and factor levels
    load(paste(model_path, 'mars_model.RData', sep='/'))
        
    # Read in data
    conn <- textConnection(gsub('\\\\n', '\n', req$postBody))
    data <- read.csv(conn)
    close(conn)

    # Convert input to model matrix
    scoring_X <- model.matrix(~., data, xlev=factor_levels)

    # Return prediction
    return(paste(predict(mars_model, scoring_X, row.names=FALSE), collapse=','))}

Overwriting plumber.R
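
The `/invocations` handler above reads the request body as CSV with a header row (without the target column) and returns the predictions joined by commas. A minimal Python client sketch, assuming the container is serving locally on port 8080. The helper names `build_csv_payload`, `parse_predictions`, and `invoke` are hypothetical, and the response parsing assumes plumber's default JSON serializer wraps the returned string in a one-element array:

```python
import csv
import io
import json
import urllib.request


def build_csv_payload(columns, rows):
    """Serialize feature rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def parse_predictions(body):
    """Turn the response body into a list of floats.

    Assumption: the body is either a JSON array holding one
    comma-separated string, or the bare comma-separated string.
    """
    try:
        decoded = json.loads(body)
    except json.JSONDecodeError:
        decoded = body
    if isinstance(decoded, list):
        decoded = ",".join(str(item) for item in decoded)
    return [float(value) for value in str(decoded).split(",") if value.strip()]


def invoke(payload, url="http://localhost:8080/invocations"):
    """POST the CSV payload to the local model server and parse the reply."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/csv"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_predictions(response.read().decode("utf-8"))


# Example payload: one iris row without the target column (Sepal.Length)
payload = build_csv_payload(
    ["Sepal.Width", "Petal.Length", "Petal.Width", "Species"],
    [[3.5, 1.4, 0.2, "setosa"]],
)
# predictions = invoke(payload)  # requires the serving container to be running
```

The call to `invoke` is left commented out because it only works while the serving container is up.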


## Now create a Dockerfile

Create the Dockerfile that references the `mars.R` and `plumber.R` files we have created.

Smaller containers are preferred for Amazon SageMaker because they lead to faster spin-up times for training and endpoint creation, so this container is kept minimal. It starts from Ubuntu, installs R and the mda and plumber libraries, adds `mars.R` and `plumber.R`, and runs `mars.R` as the entrypoint.

In [51]:
%%writefile Dockerfile
FROM ubuntu:16.04

MAINTAINER Amazon SageMaker Examples <amazon-sagemaker-examples@amazon.com>

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    r-base \
    r-base-dev \
    ca-certificates

RUN R -e "install.packages(c('mda', 'plumber'), repos='https://cloud.r-project.org')"

COPY mars.R /opt/ml/mars.R
COPY plumber.R /opt/ml/plumber.R

ENTRYPOINT ["/usr/bin/Rscript", "/opt/ml/mars.R", "--no-save"]

Overwriting Dockerfile
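
Docker appends any arguments given after the image name to the `ENTRYPOINT`, so `docker run <image> train` ends up running `Rscript /opt/ml/mars.R --no-save train`, and `mars.R` then looks for `train` or `serve` in its arguments. The following Python sketch only illustrates that dispatch logic. The function names are hypothetical, and the real `commandArgs()` vector in R also contains the R executable and its own flags, which this sketch ignores:

```python
# The ENTRYPOINT from the Dockerfile above
ENTRYPOINT = ["/usr/bin/Rscript", "/opt/ml/mars.R", "--no-save"]


def container_command(run_args):
    """Docker appends the arguments after the image name to the ENTRYPOINT."""
    return ENTRYPOINT + list(run_args)


def selected_modes(command):
    """Mimic mars.R: a mode runs if any argument contains its name."""
    return [
        mode
        for mode in ("train", "serve")
        if any(mode in argument for argument in command)
    ]


print(container_command(["train"]))
print(selected_modes(container_command(["train"])))
```

Because the match is a substring test, any argument that merely contains `train` or `serve` would also trigger the corresponding mode.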


## Finally, let's create the buildspec
This file will be used by CodeBuild to build our image and push it to ECR.

In [52]:
%%writefile buildspec.yml
version: 0.2

phases:
  install:
    runtime-versions:
      docker: 18

  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION)
  build:
    commands:
      - echo Build started on `date`
      - echo Building the Docker image...
      - docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
      - docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG

  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing the Docker image...
      - echo docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
      - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
      - echo $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG > image.url
      - echo Done
artifacts:
  files:
    - image.url
  name: image_url
  discard-paths: yes

Overwriting buildspec.yml
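
The buildspec tags and pushes the image under a URI assembled from four environment variables, then writes that URI to `image.url`. As a rough illustration of the naming scheme, here is a Python sketch. The function name is hypothetical and the account ID is a placeholder, not a real account:

```python
def ecr_image_uri(account_id, region, repo_name, image_tag):
    """Compose the URI that the buildspec tags, pushes, and writes to image.url."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repo_name}:{image_tag}"


# Placeholder account ID, used for illustration only
print(ecr_image_uri("123456789012", "us-east-2", "sagemaker-rmars", "test"))
# prints 123456789012.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rmars:test
```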


### Building the image locally, first

In [53]:
!docker build -f Dockerfile -t sagemaker-rmars:1.0 .

Sending build context to Docker daemon  105.5kB
Step 1/7 : FROM ubuntu:16.04
 ---> 5e13f8dd4c1a
Step 2/7 : MAINTAINER Amazon SageMaker Examples <amazon-sagemaker-examples@amazon.com>
 ---> Using cache
 ---> b92d1fc0d694
Step 3/7 : RUN apt-get -y update && apt-get install -y --no-install-recommends     wget     r-base     r-base-dev     ca-certificates
 ---> Using cache
 ---> 41f82a5014d8
Step 4/7 : RUN R -e "install.packages(c('mda', 'plumber'), repos='https://cloud.r-project.org')"
 ---> Using cache
 ---> 59c5760cc0dc
Step 5/7 : COPY mars.R /opt/ml/mars.R
 ---> Using cache
 ---> 911320145c53
Step 6/7 : COPY plumber.R /opt/ml/plumber.R
 ---> ca08c32708f7
Step 7/7 : ENTRYPOINT ["/usr/bin/Rscript", "/opt/ml/mars.R", "--no-save"]
 ---> Running in 2ab53452453b
Removing intermediate container 2ab53452453b
 ---> 12d8a001520d
Successfully built 12d8a001520d
Successfully tagged sagemaker-rmars:1.0


# Let's run some tests locally
## First, let's define some hyperparameters for the algorithm

In [54]:
hyperparameters = {
    "degree": 2,
    "target": "Sepal.Length"
}

In [55]:
import json
!mkdir -p input/config

# SageMaker passes every hyperparameter to the container as a string
hyperparameters = {key: str(value) for key, value in hyperparameters.items()}
with open('input/config/hyperparameters.json', 'w') as f:
    json.dump(hyperparameters, f)
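
The values are written as strings because SageMaker delivers hyperparameters to a training container as strings, which is why `mars.R` calls `as.numeric` on `degree` and falls back to 2 when it is missing. A small Python sketch of that round trip, with hypothetical helper names:

```python
import json


def to_sagemaker_hyperparameters(params):
    """Stringify every value, as SageMaker does before writing the JSON file."""
    return {key: str(value) for key, value in params.items()}


def degree_from(hyperparameters, default=2):
    """Mirror mars.R: parse 'degree' as a number, defaulting to 2 if absent."""
    raw = hyperparameters.get("degree")
    return default if raw is None else float(raw)


encoded = to_sagemaker_hyperparameters({"degree": 2, "target": "Sepal.Length"})
print(json.dumps(encoded))
# prints {"degree": "2", "target": "Sepal.Length"}
print(degree_from(encoded))
# prints 2.0
```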

## Then, let's prepare a dataset


In [56]:
!mkdir -p input/data/train
!cp iris.csv input/data/train
!head input/data/train/iris.csv

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa


## Then, let's test the training process

In [59]:
!rm -Rf model
!mkdir -p model
!rm -Rf output
!mkdir -p output/data

In [60]:
print("Training ...")
!docker run --rm --name 'my_model' \
    -v "$PWD/model:/opt/ml/model" \
    -v "$PWD/input:/opt/ml/input" \
    -v "$PWD/output:/opt/ml/output" sagemaker-rmars:1.0 train

Training ...
Loading required package: class
Loaded mda 0.4-10

Loading required package: methods
replacing previous import by 'Rcpp::evalCpp' when loading 'later' 
               Length Class  Mode   
call             4    -none- call   
all.terms       18    -none- numeric
selected.terms   6    -none- numeric
penalty          1    -none- numeric
degree           1    -none- numeric
nk               1    -none- numeric
thresh           1    -none- numeric
gcv              1    -none- numeric
factor         126    -none- numeric
cuts           126    -none- numeric
lenb             1    -none- numeric
coefficients     6    -none- numeric
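
The `train` function in `mars.R` writes three files: the model in `model/`, the fitted values in `output/data/`, and a `success` marker in `output/`. With the volume mounts used above, they should appear under the current directory. A quick Python check, sketched with a hypothetical helper name:

```python
from pathlib import Path

# Files that train() in mars.R writes, relative to the notebook directory
EXPECTED_ARTIFACTS = (
    "model/mars_model.RData",
    "output/data/fitted_values.csv",
    "output/success",
)


def missing_artifacts(base_dir="."):
    """Return the expected training outputs that are not present as files."""
    base = Path(base_dir)
    return [rel for rel in EXPECTED_ARTIFACTS if not (base / rel).is_file()]


print(missing_artifacts())
```

An empty list means the training run produced everything the serving step needs.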


## This is the serving test. It simulates an endpoint exposed by SageMaker

After you execute the next cell, this Jupyter notebook will block until the cell is stopped, because the container keeps running. A web service will be exposed on port 8080.

In [61]:
!docker run --rm --name 'my_rmars' \
    -p 8080:8080 \
    -v "$PWD/model:/opt/ml/model" \
    -v "$PWD/input:/opt/ml/input" sagemaker-rmars:1.0 serve

Loading required package: class
Loaded mda 0.4-10

Loading required package: methods
replacing previous import by 'Rcpp::evalCpp' when loading 'later' 
Starting server to listen on port 8080
  the condition has length > 1 and only the first element will be used
  the condition has length > 1 and only the first element will be used
^C


> While the above cell is running, click here [TEST NOTEBOOK](02_Testing%20our%20local%20model%20server.ipynb) to run some tests.

> After you finish the tests, press **STOP**

### Before we push our code to the repo, let's check the build process

In [62]:
import boto3

sts_client = boto3.client("sts")
session = boto3.session.Session()

account_id = sts_client.get_caller_identity()["Account"]
region = session.region_name
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()

repo_name='sagemaker-rmars'
image_tag='test'

In [63]:
!mkdir -p tests
!cp mars.R plumber.R Dockerfile buildspec.yml tests/
with open('tests/vars.env', 'w') as f:
    f.write("AWS_ACCOUNT_ID=%s\n" % account_id)
    f.write("IMAGE_TAG=%s\n" % image_tag)
    f.write("IMAGE_REPO_NAME=%s\n" % repo_name)
    f.write("AWS_DEFAULT_REGION=%s\n" % region)
    f.write("AWS_ACCESS_KEY_ID=%s\n" % credentials.access_key)
    f.write("AWS_SECRET_ACCESS_KEY=%s\n" % credentials.secret_key)
    f.write("AWS_SESSION_TOKEN=%s\n" % credentials.token)

# Show only the non-secret entries; never print credentials into notebook output
!grep -v -E 'KEY|TOKEN' tests/vars.env

AWS_ACCOUNT_ID=607585071374
IMAGE_TAG=test
IMAGE_REPO_NAME=sagemaker-rmars
AWS_DEFAULT_REGION=us-east-2

In [None]:
%%time

!/tmp/aws-codebuild/local_builds/codebuild_build.sh \
    -a "$PWD/tests/output" \
    -s "$PWD/tests" \
    -i "samirsouza/aws-codebuild-standard:2.0" \
    -e "$PWD/tests/vars.env"

## Ok, now it's time to push everything to the repo

In [None]:
%%bash

cd ../../../mlops-workshop-images/sagemaker-rmars
cp $OLDPWD/buildspec.yml $OLDPWD/mars.R $OLDPWD/plumber.R $OLDPWD/Dockerfile .

git add --all
git commit -a -m " - files for building an iris model image"
git push

### Ok, now open the AWS console in another tab and go to the CodePipeline console to see the status of our build pipeline