# BYO Container Example:  lightGBM

In this notebook we'll examine how to bring your own (BYO) container to Amazon SageMaker.  This is an option for algorithms and frameworks that Amazon SageMaker does not directly support as either (1) built-in algorithms or (2) prebuilt Amazon SageMaker containers (such as the ones for TensorFlow, PyTorch, Apache MXNet, Scikit-learn, and XGBoost).  As an example, we'll containerize the popular lightGBM gradient boosting framework, which is not supported off the shelf in Amazon SageMaker, and apply it to a public dataset from UCI's Machine Learning Repository.  The dataset, which relates to predicting purchase intent by online shoppers, is at https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset.  

We'll perform the following steps:

- Obtain the dataset.
- Build a Docker image for lightGBM to be run as a container in SageMaker Processing.
- Preprocess the data with that image in SageMaker Processing.
- Build a separate Docker image for lightGBM for training models with SageMaker hosted training.
- Train a lightGBM model with that separate Docker image in SageMaker hosted training.
- Evaluate the model / do batch scoring in SageMaker Processing with the same container used for preprocessing.
- Deploy the model to a real time SageMaker endpoint using Ezsmdeploy, a Python package that provides many conveniences and automation.


## Setup and obtain dataset

We'll begin with some imports that will be useful throughout the notebook, and set up some objects and variables we'll need.

In [None]:
import boto3
import sys
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
session = sagemaker.Session()
s3_output = session.default_bucket()
s3_prefix = 'lightGBM-BYO'

Next we'll download the dataset.

In [None]:
!mkdir -p raw
!wget -P ./raw https://archive.ics.uci.edu/ml/machine-learning-databases/00468/online_shoppers_intention.csv

Let's inspect the data briefly now, just to confirm it was properly downloaded.

In [None]:
import pandas as pd

df = pd.read_csv('./raw/online_shoppers_intention.csv')
df.head()

The target we'd like to predict is the `Revenue` column, which is `True` if an online purchase transaction was completed.  As you might expect, a relatively small number of transactions are actually completed, resulting in a class imbalance that we can handle in various ways with lightGBM.

In [None]:
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt

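# pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x=df['Revenue'])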
plt.ylim(0,12000)
plt.xlabel('Transactions Completed', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.text(x=-.175, y=11000 ,s='10,422', fontsize=16)
plt.text(x=.875, y=2500, s='1,908', fontsize=16)
plt.show()
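
The two counts in the plot give a quick sense of how skewed the target is. As a rough sketch, the snippet below derives the positive-class share and the negative-to-positive ratio from those numbers. That ratio is one common starting value for lightGBM's `scale_pos_weight` parameter, which is an alternative to the `is_unbalance` flag used in the training script later in this notebook. The function name is illustrative only.

```python
def imbalance_summary(n_negative, n_positive):
    """Return (positive share, negative-to-positive ratio) for a binary target."""
    total = n_negative + n_positive
    positive_share = n_positive / total
    neg_pos_ratio = n_negative / n_positive
    return positive_share, neg_pos_ratio


# Counts taken from the plot above: 10,422 False and 1,908 True.
share, ratio = imbalance_summary(10422, 1908)
print('Positive share: {:.1%}'.format(share))
print('Negative-to-positive ratio: {:.2f}'.format(ratio))
```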

Since EDA and feature engineering are not the focus of this example, we'll now move on to upload the raw data to S3 so it can be accessed by SageMaker.

In [None]:
rawdata_s3_prefix = '{}/raw'.format(s3_prefix)
raw_s3 = session.upload_data(path='./raw/', key_prefix=rawdata_s3_prefix)
print(raw_s3)

## Docker image for data preprocessing and model evaluation

Before any further steps can be completed for SageMaker Processing with lightGBM, we need to build a Docker image.  We'll build one image first, and use that same image for multiple purposes:

- Preprocessing data; and
- Evaluating the model (batch scoring).

A separate, but very similar, Docker image will be used for training below.  

To begin, we'll create a new directory for Docker-related files and write a Dockerfile.

In [None]:
!mkdir -p docker-proc-evaluate

A simple Dockerfile can be used to build the container.  Of particular note are the following statements in the Dockerfile:
- FROM statement:  this sets the parent image.  There are many choices; considerations include size (smaller may be better), "up-to-dateness", stability, and security.  The chosen image is based on a slim version of Debian 10 ("Buster").  
- RUN statements:  used here primarily to install dependencies.  Only a few are required.  Note that libgomp1 is a library used by lightgbm, but is not included in this version of Debian.
- ENTRYPOINT statement:  specifies the command used to run the scripts that will be included in the container by SageMaker.  In our case, they are ordinary Python 3 scripts so the command is simply `python3`.

In [None]:
%%writefile docker-proc-evaluate/Dockerfile

FROM python:3.7-slim-buster
RUN apt -y update && apt install -y --no-install-recommends \
    libgomp1 \
    && apt clean    
RUN pip3 install lightgbm numpy pandas scikit-learn
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

This block of code builds the image using various Docker commands, creates an Amazon Elastic Container Registry (Amazon ECR) repository, and pushes the image to Amazon ECR.

In [None]:
import boto3

region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'lightgbm-byo-proc-eval'
tag = ':latest'
uri_suffix = 'amazonaws.com'
processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker-proc-evaluate
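# Authenticate Docker to the registry (get-login-password replaces the get-login command removed in AWS CLI v2)
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.{uri_suffix}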
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri
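
The image URI assembled in the cell above follows a fixed pattern: account ID, region, domain suffix, repository name, and tag. As a small sketch, the hypothetical helper below builds the same string in one place, which can help when more than one repository is involved. The account ID and region in the example are placeholders.

```python
def ecr_image_uri(account_id, region, repository, tag='latest',
                  uri_suffix='amazonaws.com'):
    """Assemble an ECR image URI: <account>.dkr.ecr.<region>.<suffix>/<repo>:<tag>."""
    registry = '{}.dkr.ecr.{}.{}'.format(account_id, region, uri_suffix)
    return '{}/{}:{}'.format(registry, repository, tag)


# Placeholder account ID and region, for illustration only.
example_uri = ecr_image_uri('123456789012', 'us-east-1', 'lightgbm-byo-proc-eval')
print(example_uri)
```

The `uri_suffix` argument is kept separate because some partitions, such as the China Regions, use a different domain suffix (`amazonaws.com.cn`).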

Your Docker image has all required dependencies, and enables you to run your own preprocessing, feature engineering, and model evaluation scripts all within the same container in a robust and repeatable way. 

To integrate the image with SageMaker, simply reference it in the SageMaker Python SDK's `ScriptProcessor` class, which lets you execute a command to run your own script inside a container based on this image.

In [None]:
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.c5.xlarge')

## Preprocess data with SageMaker Processing

Some preprocessing should be performed on this dataset before training.  For example, uninformative columns should be dropped, the categorical `VisitorType` column one-hot encoded, and the data split into train and test sets.  Below is a preprocessing script.  It is an ordinary Python script with very little specific to SageMaker.  To work with SageMaker Processing, the script must read the input data from a specified directory and save the preprocessed data to specified directories so SageMaker can automatically upload it to S3 at the end of the job.  

In [None]:
%%writefile preprocessing.py

import glob
import numpy as np
import os
import pandas as pd
from sklearn.model_selection import train_test_split


if __name__=='__main__':
    
    input_file = glob.glob('{}/*.csv'.format('/opt/ml/processing/input'))
    print('\nINPUT FILE: \n{}\n'.format(input_file))   
    df = pd.read_csv(input_file[0])
    
    # minor preprocessing (drop some uninformative columns etc.)
    print('Preprocessing the dataset . . . .')   
    df_clean = df.drop(['Month','Browser','OperatingSystems','Region','TrafficType','Weekend'], axis=1)
    visitor_encoded = pd.get_dummies(df_clean['VisitorType'], prefix='Visitor_Type', drop_first = True)
    df_clean_merged = pd.concat([df_clean, visitor_encoded], axis=1).drop(['VisitorType'], axis=1)
    X = df_clean_merged.drop('Revenue', axis=1)
    y = df_clean_merged['Revenue']
    
    # split the preprocessed data with stratified sampling for class imbalance
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=2, test_size=.2)

    # save to container directory for uploading to S3
    print('Saving the preprocessed dataset . . . .')   
    train_data_output_path = os.path.join('/opt/ml/processing/train', 'x_train.npy')
    np.save(train_data_output_path, X_train.to_numpy())
    train_labels_output_path = os.path.join('/opt/ml/processing/train', 'y_train.npy')
    np.save(train_labels_output_path, y_train.to_numpy())    
    test_data_output_path = os.path.join('/opt/ml/processing/test', 'x_test.npy')
    np.save(test_data_output_path, X_test.to_numpy())
    test_labels_output_path = os.path.join('/opt/ml/processing/test', 'y_test.npy')
    np.save(test_labels_output_path, y_test.to_numpy())   
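
The `stratify=y` argument in the script matters because of the class imbalance. As a simplified illustration of the idea (not scikit-learn's actual implementation), the sketch below splits each class separately so that the train and test sets keep roughly the same share of positives. The function name and the toy labels are made up for this example.

```python
import random


def stratified_split_indices(labels, test_size=0.2, seed=2):
    """Split row indices into (train, test) while preserving class proportions."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the test share.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target with a similar kind of imbalance: 85 negatives, 15 positives.
labels = [False] * 85 + [True] * 15
train_idx, test_idx = stratified_split_indices(labels)
print('train positives:', sum(labels[i] for i in train_idx), 'of', len(train_idx))
print('test positives: ', sum(labels[i] for i in test_idx), 'of', len(test_idx))
```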

Now the `ScriptProcessor` object created above can be used to run this `preprocessing.py` script. As mentioned above, the primary requirements are specifying input and output directories.  Here, there are two outputs because the transformed train and test data are sent to different folders in S3.   

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from time import gmtime, strftime 

processing_job_name = "lightgbm-byo-process-{}".format(strftime("%d-%H-%M-%S", gmtime()))
output_destination = 's3://{}/{}/data'.format(s3_output, s3_prefix)

script_processor.run(code='preprocessing.py',
                      job_name=processing_job_name,
                      inputs=[ProcessingInput(
                        source=raw_s3,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train',
                                                destination='{}/train'.format(output_destination),
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test',
                                                destination='{}/test'.format(output_destination),
                                                source='/opt/ml/processing/test')])

preprocessing_job_description = script_processor.jobs[-1].describe()

After the job is complete, it is easy to look up the location of the output in S3.  The code below retrieves the S3 URLs of the locations of the transformed train and test data.  These will be used as inputs to further jobs below.  

In [None]:
output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'train':
        preprocessed_training_data = output['S3Output']['S3Uri']
        print(preprocessed_training_data)
    if output['OutputName'] == 'test':
        preprocessed_test_data = output['S3Output']['S3Uri']
        print(preprocessed_test_data)
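
The same lookup can be written as a small helper that returns every output location keyed by name. This is only a sketch: the helper name is hypothetical, and the sample dictionary is a hand-written stand-in that mirrors just the keys used in the cell above (`ProcessingOutputConfig`, `Outputs`, `OutputName`, `S3Output`, `S3Uri`), with a placeholder bucket name.

```python
def output_uris_by_name(job_description):
    """Map each processing output name to its S3 URI."""
    outputs = job_description['ProcessingOutputConfig']['Outputs']
    return {o['OutputName']: o['S3Output']['S3Uri'] for o in outputs}


# Hand-written stand-in for the dictionary returned by describe();
# the bucket name is a placeholder.
sample_description = {
    'ProcessingOutputConfig': {
        'Outputs': [
            {'OutputName': 'train',
             'S3Output': {'S3Uri': 's3://example-bucket/lightGBM-BYO/data/train'}},
            {'OutputName': 'test',
             'S3Output': {'S3Uri': 's3://example-bucket/lightGBM-BYO/data/test'}},
        ]
    }
}

uris = output_uris_by_name(sample_description)
print(uris['train'])
print(uris['test'])
```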

We also can download the preprocessed test data for later use.  

In [None]:
session.download_data(path='.', bucket=s3_output, key_prefix=s3_prefix+'/data/test/x_test.npy')

## Train a model with lightGBM

There are multiple ways to train a model in SageMaker.  One of the simplest is to reuse the container from above and do the training within SageMaker Processing itself.  This is possible because the `ScriptProcessor` object we instantiated above can run an arbitrary Python script as long as we specify the input and output locations in S3.  

An alternative is to use SageMaker hosted training.  Like SageMaker Processing, SageMaker hosted training spins up a right-sized, transient cluster for your job and then shuts it down when the job is done.  This lets you do most of your work in lower-cost notebooks and incur full-scale training costs only when you need them.  SageMaker hosted training offers several advantages over SageMaker Processing for training, including easy integration with SageMaker Debugger, SageMaker Experiments, SageMaker Search, Managed Spot Training, and Automatic Model Tuning, as well as options for multiple file sources/channels with automated data shuffling and sharding.  

To use SageMaker hosted training, we'll create another simple Docker image.  We'll create another directory first.

In [None]:
!mkdir -p docker-train

The Dockerfile for training is similar to the first one, with a few key differences:  
- The parent image is from another ML framework's Docker image that bundles a bunch of necessary low-level build tools for the sagemaker-containers package (see next bullet point).  Another parent with those tools could be substituted.
- There is one additional Python package:  sagemaker-containers, which integrates the container with SageMaker hosted training.
- An environment variable indicating which Python module is the entry point for training.

Note that you do NOT need to include the training script in the Docker image.  The sagemaker-containers package allows you to pass in a training script from an Amazon S3 location dynamically each time you start a training job, so you can reuse the same Docker image without rebuilding it for code changes.

In [None]:
%%writefile docker-train/Dockerfile

FROM tensorflow/tensorflow:2.0.0a0
RUN apt -y update && apt install -y --no-install-recommends \
    libgomp1 \
    && apt clean    
RUN pip install lightgbm numpy pandas sagemaker-containers scikit-learn
ENV SAGEMAKER_PROGRAM train.py
ENV PYTHONUNBUFFERED=TRUE

Now we'll create a separate ECR repository for the training images, build the new training image, and push it.

In [None]:
ecr_repository_train = 'lightgbm-byo-train'
uri_suffix = 'amazonaws.com'
train_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository_train + tag)

# Create ECR repository and push docker image
!docker build -t $ecr_repository_train docker-train
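# Authenticate Docker to the registry (get-login-password replaces the get-login command removed in AWS CLI v2)
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.{uri_suffix}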
!aws ecr create-repository --repository-name $ecr_repository_train
!docker tag {ecr_repository_train + tag} $train_repository_uri
!docker push $train_repository_uri

Below is the training script.  Again, it is very similar to a Python script you would use outside SageMaker.  The main SageMaker-specific requirement is that it parse command-line arguments: one for the training data location, which defaults to the `SM_CHANNEL_TRAIN` environment variable, and one for each hyperparameter, such as the learning rate.  

In [None]:
%%writefile docker-train/train.py

import argparse
import glob
import lightgbm as lgb
import numpy as np
import os


if __name__=='__main__':
    
    # extract the training data directory (inside the container) and hyperparameter values
    parser = argparse.ArgumentParser()
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--num_leaves', type=int, default=28)
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--learning_rate', type=float, default=0.1)
    args = parser.parse_args()
    
    print('Loading data from {}\n'.format(args.train))
    input_files = glob.glob('{}/*.npy'.format(args.train))
    print('\nINPUT FILE LIST: \n{}\n'.format(input_files)) 
    for file in input_files:
        if 'x_' in file:
            x_train = np.load(file)
        else:
            y_train = np.load(file)
            
    print('\nx_train shape: \n{}\n'.format(x_train.shape))
    print('\ny_train shape: \n{}\n'.format(y_train.shape))
    train_data = lgb.Dataset(x_train, label=y_train)
    print('Training model with hyperparameters:\n\t num_leaves: {}\n\t max_depth: {}\n\t learning_rate: {}\n'
          .format(args.num_leaves, args.max_depth, args.learning_rate))
    parameters = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'is_unbalance': 'true',
        'boosting': 'gbdt',
        'num_leaves': args.num_leaves,
        'max_depth': args.max_depth,
        'learning_rate': args.learning_rate,
        'verbose': 1,
        'is_training_metric': True
    }
    num_round = 10
    bst = lgb.train(parameters, train_data, num_round)
    
    print('Saving model . . . .')
    bst.save_model('/opt/ml/model/online_shoppers_model.txt')
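
The argument parsing at the top of the script is easy to check in isolation. The sketch below assumes that hyperparameters reach the script as `--name value` command-line arguments and that the training channel directory is exposed through the `SM_CHANNEL_TRAIN` environment variable, as the script above expects. The helper name `parse_training_args` and the fallback path `./data/train` are illustrative and not part of the original script.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Parse the same arguments as train.py, with a fallback for local runs."""
    parser = argparse.ArgumentParser()
    # os.environ.get avoids a KeyError when the variable is not set,
    # for example when running outside a SageMaker training container.
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', './data/train'))
    parser.add_argument('--num_leaves', type=int, default=28)
    parser.add_argument('--max_depth', type=int, default=5)
    parser.add_argument('--learning_rate', type=float, default=0.1)
    return parser.parse_args(argv)


# Simulate a command line that overrides two of the three hyperparameters.
args = parse_training_args(['--num_leaves', '32', '--learning_rate', '0.08'])
print(args.num_leaves, args.max_depth, args.learning_rate)
```

Passing an explicit argument list to `parse_args` keeps the check independent of how the notebook kernel itself was started.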

The training script must be packaged as a .tar.gz file and uploaded to S3 for access by SageMaker.  This step must be repeated every time the script is modified, but avoids having to rebuild the Docker image for code changes:  you can just reuse the same Docker image with any lightGBM training script.

In [None]:
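import os
import tarfile
import tempfile

def create_tar_file(source_files, target=None):
    # write to the requested target, or to a temporary file if none is given
    if target:
        filename = target
    else:
        _, filename = tempfile.mkstemp()

    with tarfile.open(filename, mode="w:gz") as t:
        for sf in source_files:
            # store each file at the archive root, without its directory
            t.add(sf, arcname=os.path.basename(sf))
    return filename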

In [None]:
create_tar_file(["docker-train/train.py"], "sourcedir.tar.gz")
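
Because `arcname=os.path.basename(sf)` strips the directory, the script ends up at the root of the archive rather than under `docker-train/`. Before uploading, it can be worth listing the archive's members to confirm this. The sketch below does so on a throwaway archive built in a temporary directory; the script contents are placeholders.

```python
import os
import tarfile
import tempfile

# Build a throwaway archive so the example is self-contained.
workdir = tempfile.mkdtemp()
script_path = os.path.join(workdir, 'train.py')
with open(script_path, 'w') as f:
    f.write("print('placeholder training script')\n")

archive_path = os.path.join(workdir, 'sourcedir.tar.gz')
with tarfile.open(archive_path, mode='w:gz') as t:
    # arcname drops the directory so the script sits at the archive root.
    t.add(script_path, arcname=os.path.basename(script_path))

# Re-open the archive read-only and list what it contains.
with tarfile.open(archive_path, mode='r:gz') as t:
    member_names = t.getnames()
print(member_names)
```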

In [None]:
sources = session.upload_data('sourcedir.tar.gz', s3_output, s3_prefix + '/code')
print(sources)

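With our training script in Amazon S3, we can now set up an Amazon SageMaker Estimator object to represent the actual training job.  Like the ScriptProcessor object, the Estimator takes the Docker image, the instance type, and the instance count as parameters.  It also takes a JSON-encoded dictionary of hyperparameters for training.  As written, `train_instance_type='local'` runs the job in local mode against the locally built `lightgbm-byo-train` image, which is a quick way to test the container; to run on SageMaker-managed instances instead, pass `train_repository_uri` as the image and an ML instance type such as `ml.m5.xlarge`.  The `fit` method invocation starts the training job.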

In [None]:
from sagemaker.estimator import Estimator
import json

def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}

hyperparameters = json_encode_hyperparameters({
    "sagemaker_program": "train.py",
    "sagemaker_submit_directory": sources,
    'num_leaves': 32,
    'max_depth': 3,
    'learning_rate': 0.08})

estimator = Estimator(image_name='lightgbm-byo-train',
                      role=role,
                      train_instance_count=1,
                      train_instance_type='local',
                      hyperparameters=hyperparameters)

estimator.fit({'train': preprocessed_training_data})
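
It may help to see what the JSON encoding step actually produces. In the sketch below, which repeats the helper from the cell above with a shortened set of hyperparameters, every value becomes a string, and values that were already strings gain an extra pair of quotes. The `decoded` dictionary shows that `json.loads` restores the original types.

```python
import json


def json_encode_hyperparameters(hyperparameters):
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}


encoded = json_encode_hyperparameters({
    'sagemaker_program': 'train.py',
    'num_leaves': 32,
    'learning_rate': 0.08})

# Every value is now a string; string inputs gain an extra pair of quotes.
for name, value in encoded.items():
    print(name, '->', repr(value))

# json.loads reverses the encoding and restores the original types.
decoded = {k: json.loads(v) for k, v in encoded.items()}
```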

We can easily download the trained model, whether for further use inside of Amazon SageMaker or anywhere else.

In [None]:
!aws s3 cp {estimator.model_data} ./model/model.tar.gz
!tar -xvzf ./model/model.tar.gz -C ./model

We'll upload the unzipped version of the model back to Amazon S3 for use by SageMaker Processing in model evaluation / batch scoring. 

In [None]:
s3_model = session.upload_data('./model/online_shoppers_model.txt', s3_output, s3_prefix + '/model')
print(s3_model)

## Evaluate the model / batch scoring

Next we can reuse the Docker image from data preprocessing for model evaluation, or batch scoring.  Below is the evaluation script.  This time the main SageMaker-specific requirement is specifying an input directory.   

In [None]:
%%writefile evaluation.py

import glob
import lightgbm as lgb
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


if __name__=='__main__':
    
    print('Loading data . . . .')
    input_files = glob.glob('{}/*.npy'.format('/opt/ml/processing/input'))
    print('\nINPUT FILE LIST: \n{}\n'.format(input_files)) 
    for file in input_files:
        if 'x_' in file:
            x_test = np.load(file)
        else:
            y_test = np.load(file)
            
    print('\nx_test shape: \n{}\n'.format(x_test.shape))
    print('\ny_test shape: \n{}\n'.format(y_test.shape))
 
    print('Loading model . . . .\n')    
    model_path = '/opt/ml/processing/model/'
    bst_loaded = lgb.Booster(model_file=model_path+'online_shoppers_model.txt')
    y_pred = bst_loaded.predict(x_test)
    
    print('Evaluating model . . . .\n')    
    acc = accuracy_score(y_test.astype(int), y_pred.round(0).astype(int))
    auc = roc_auc_score(y_test, y_pred)
    print('Accuracy:  {:.2f}'.format(acc))
    print('AUC Score: {:.2f}'.format(auc))

We'll also reuse the `ScriptProcessor` object we instantiated above, this time for the evaluation script.  Instead of having two outputs, as in the preprocessing job, there are two inputs:  one for the input data, and another for the model artifact to be used in the evaluation.  At the end of the job, we'll log the accuracy and AUC score metrics.  We also could have stored evaluation results to a file, or even saved visualization graphics, and asked SageMaker Processing to upload those to S3 at the end of the job.

In [None]:
processing_job_name = "lightgbm-byo-eval-{}".format(strftime("%d-%H-%M-%S", gmtime()))
output_destination = 's3://{}/{}/eval'.format(s3_output, s3_prefix)

script_processor.run(code='evaluation.py',
                      job_name=processing_job_name,
                      inputs=[ProcessingInput(
                                source=preprocessed_test_data,
                                destination='/opt/ml/processing/input'),
                             ProcessingInput(
                                source=s3_model,
                                destination='/opt/ml/processing/model')],
                      outputs=[ProcessingOutput(output_name='eval',
                                                destination=output_destination,
                                                source='/opt/ml/processing/eval')])

eval_job_description = script_processor.jobs[-1].describe()
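
As mentioned above, evaluation results could also be written to a file for SageMaker Processing to upload. A minimal sketch of that idea follows: a helper that the evaluation script could call to write the two metrics as JSON into the directory named as the `source` of the `eval` output. The helper name, the file name `evaluation.json`, and the metric values are hypothetical; the demo writes to a temporary directory instead of `/opt/ml/processing/eval`.

```python
import json
import os
import tempfile


def write_metrics(accuracy, auc, output_dir):
    """Write evaluation metrics as JSON and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, 'evaluation.json')
    with open(path, 'w') as f:
        json.dump({'accuracy': accuracy, 'auc': auc}, f, indent=2)
    return path


# Placeholder values; inside a job, output_dir would be '/opt/ml/processing/eval'.
metrics_path = write_metrics(0.5, 0.5, tempfile.mkdtemp())
with open(metrics_path) as f:
    saved = json.load(f)
print(saved)
```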

## Deploy the model to SageMaker

There are several ways to deploy models within Amazon SageMaker.  For example, for offline batch use cases, it is possible to use either SageMaker Processing or SageMaker Batch Transform (which has some extra conveniences for very large scale jobs).  For real time prediction use cases, SageMaker hosted endpoints are applicable.  These offer many advantages including built-in options for A/B testing, autoscaling, and integration with SageMaker Model Monitor to detect data drift and other issues.

In this example, we'll deploy the lightGBM model to an Amazon SageMaker hosted endpoint.  Again, there are multiple options for doing this, including using objects provided by the SageMaker Python SDK (such as the Estimator from above), or the AWS SDK for Python (boto3).  

However, for convenience we'll use Ezsmdeploy, https://pypi.org/project/ezsmdeploy/.  Ezsmdeploy provides several conveniences, such as automatically choosing an instance based on model size or on a budget, load testing endpoints through an intuitive API, and more.  It is an especially convenient way to deploy a model when the model is:

- a preexisting model that you trained outside Amazon SageMaker, or
- trained within Amazon SageMaker but is simply an ordinary artifact, and has not yet been wrapped in a SageMaker Model object.  This is the case if you train models within SageMaker Processing.

First we need to install Ezsmdeploy.

In [None]:
!{sys.executable} -m pip install ezsmdeploy

### Create a model script and test locally

Next, we need to supply Ezsmdeploy with a model script that contains `load_model()` and `predict()` functions.  The first function is self-explanatory.  The second receives the payload as bytes when the model is deployed in Amazon SageMaker; to simplify local testing, it also accepts a NumPy array directly.  

In [None]:
%%writefile modelscript_lightgbm.py

import lightgbm as lgb
import numpy as np
import os

NUM_FEATURES = 12

# return loaded model
def load_model(modelpath):
    
    print('Model path:  {}'.format(modelpath))
    model = lgb.Booster(model_file=os.path.join(modelpath,'online_shoppers_model.txt'))
    return model


# return prediction based on loaded model (from the step above) and an input payload
def predict(model, payload):
    
    print('Type of payload:  {}'.format(type(payload)))
    
    try:
        
        # locally, payload may come in as an np.ndarray
        if type(payload)==np.ndarray:
            out = model.predict(payload)
            
        # in remote / container based deployment, payload comes in as a stream of bytes
        else:
            data = np.frombuffer(payload, dtype=np.float64)
            data = data.reshape((data.size // NUM_FEATURES, NUM_FEATURES))
            out = model.predict(data)
                
    except Exception as e:
        out = 'EXCEPTION: {}'.format(str(e))
        
    return out if type(out) is str else out.tobytes() 

We can test the `modelscript_lightgbm.py` script locally to make sure it is working correctly.  The output should be an array of floats representing prediction probabilities.

In [None]:
from modelscript_lightgbm import *
import numpy as np

x_test = np.load('./x_test.npy')
print(x_test.shape)

x_bytes = x_test.tobytes()

model = load_model('./model') 
result = predict(model, x_bytes)
print(type(result))
print(np.frombuffer(result, dtype=np.float64))
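
The byte-level contract between client and model script is worth spelling out: the payload is a flat run of 8-byte floats, and the row structure is recovered only from `NUM_FEATURES`. The sketch below imitates this with the standard library's `array` module instead of NumPy, on the assumption that both sides use native-endian 64-bit floats in row-major order (which is what `tobytes()` produces for a C-contiguous float64 array). The helper names and the toy values are illustrative.

```python
from array import array

NUM_FEATURES = 12


def encode_rows(rows):
    """Flatten rows of floats into raw bytes, 8 bytes per value."""
    flat = array('d')
    for row in rows:
        flat.extend(row)
    return flat.tobytes()


def decode_rows(payload, num_features=NUM_FEATURES):
    """Rebuild rows from raw bytes; fails if the size does not fit."""
    flat = array('d')
    flat.frombytes(payload)
    if len(flat) % num_features != 0:
        raise ValueError('payload does not divide into rows of {} values'
                         .format(num_features))
    return [list(flat[i:i + num_features])
            for i in range(0, len(flat), num_features)]


# Two toy rows of 12 features each.
rows = [[float(r * NUM_FEATURES + c) for c in range(NUM_FEATURES)] for r in range(2)]
payload = encode_rows(rows)
print(len(payload), 'bytes')
print(decode_rows(payload) == rows)
```

One consequence of this format is that the client must send float64 data with exactly `NUM_FEATURES` values per row; an array of another dtype or width would decode to wrong numbers or fail to reshape.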

Just in case there are other inference containers running in local mode, we'll stop existing containers to avoid conflict.

In [None]:
!docker container stop $(docker container ls -aq) >/dev/null

Now we can try a local deployment in a container:

In [None]:
import ezsmdeploy

ez = ezsmdeploy.Deploy(model='./model',
                       script='modelscript_lightgbm.py',
                       requirements=['numpy','joblib','lightgbm'],
                       instance_type = 'local',
                       wait = True)

We can test a payload against the container running locally:

In [None]:
out = ez.predictor.predict(x_test.tobytes())

The result comes back as bytes, which can be examined after decoding.  

In [None]:
print(np.frombuffer(out, dtype=np.float64))

### Deploy in Amazon SageMaker

Now that we have confirmed that everything is working locally, we can deploy to an Amazon SageMaker endpoint for real-time predictions served by SageMaker-managed hardware, with support for autoscaling, blue/green deployments, and more.  

The `Deploy` call is very similar to the local one above.  The main difference is that we no longer specify `instance_type = 'local'`.  Instead, Ezsmdeploy chooses an instance based on the total size of the model (or of multiple models passed in), taking into account the multiple workers per endpoint and, optionally, a “budget” that selects `instance_type` based on a maximum acceptable cost per hour.  For details, see https://pypi.org/project/ezsmdeploy/#other-features.  

In [None]:
ezonsm = ezsmdeploy.Deploy(model='./model',
                           script='modelscript_lightgbm.py',
                           requirements=['numpy','joblib','lightgbm'],
                           wait = True)

Similarly to the local test, we can now test a payload against the container running on a SageMaker-managed endpoint.  The code is the same:

In [None]:
out_from_sm = ezonsm.predictor.predict(x_test.tobytes())

Finally, we can examine the result returned by the Amazon SageMaker endpoint.  It should be the same as the result returned by local testing.

In [None]:
print(np.frombuffer(out_from_sm, dtype=np.float64))

To avoid charges for unneeded resources, be sure to delete the Amazon SageMaker endpoint you just created after you are finished with this example.

In [None]:
ezonsm.predictor.delete_endpoint()