# Better Performance with MKL BLAS on MXNet 1.6 Deep Learning Containers

Inference speed is often one of the most crucial factors in deciding whether to deploy a model to a production environment. Small increases in latency can be costly, so every bit of performance you can gain helps reduce overall costs. Because of our obsession with customers, the MXNet team has built a solution that improves inference latency with the MXNet 1.6 Deep Learning Containers (DLCs). In this post, we discuss the improvement made to the MXNet 1.6.0 DLC to make use of highly optimized matrix operators. The enhancement comes from compiling MXNet with a dependency on Intel MKL BLAS instead of the default, oneDNN. As you will see, this can reduce latency by up to 30%, making it a worthwhile option for customers to adopt in their production environments. To describe and show the enhancement in more detail, we first use a BERT sentiment model and a sample of tweets to give some context on performing inference on Amazon SageMaker. Then, we discuss the differences between the MKL BLAS and oneDNN libraries. Lastly, we show how to use the enhanced container in your environment so that you can take advantage of the performance boost.

**Note: MXNet 1.7+ Deep Learning Containers include this enhancement by default, so this solution applies to customers who want to stay on MXNet 1.6.0.**

## Inference on Amazon SageMaker

Amazon SageMaker makes it easy to deploy, host, and maintain models, and choosing which framework or deep learning container to use is just a matter of setting a parameter. For the remainder of this post, we use a BERT sentiment model and a set of tweets to give examples and to show the performance difference between the MKL BLAS and oneDNN libraries. But first, here is a brief example of how to perform inference in SageMaker.

### Setup

First, we load the tweets that we will send to the model, install the required packages, and define a few variables that are needed to perform operations in SageMaker.

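```python
import pandas as pd

# Load the tweets CSV (no header row, Latin-1 encoded) and name its six columns.
tweets = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding="ISO-8859-1", names=["target", "ids", "date", "flag", "user", "text"])
```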

```python
!pip install gluonnlp --quiet
!pip install bert --quiet
!pip install mxnet --upgrade --quiet
```

```python
!pip install gluoncv --quiet
```

```python
from sagemaker import get_execution_role, utils
from sagemaker.session import Session
from sagemaker.mxnet.model import MXNetModel
import boto3
import numpy as np
import os
import pandas as pd
import tempfile
import urllib.request

# SageMaker session used for S3 uploads and endpoint deployments.
sagemaker_session = Session()

# S3 bucket for saving code and model artifacts.
# Feel free to specify a different bucket here if you wish.
bucket = sagemaker_session.default_bucket()

# IAM execution role that gives SageMaker access to resources in your AWS account.
# We can use the SageMaker Python SDK to get the role from our notebook environment.
role = get_execution_role()
region = boto3.Session().region_name
```

### Setup Data 

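The test data is the ``text`` column of the ``tweets`` DataFrame loaded above. Each inference request sends the first 100 tweets to the endpoint as a list of strings.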

### Define Utility Functions
These functions download the BERT model, deploy it to a SageMaker endpoint, and measure the latency of repeated inference requests.

```python
def deploy_bert(sagemaker_session, ecr_image, instance_type, framework_version, role):
    """Download the sample BERT model, upload it to S3, and deploy it to an endpoint."""
    # Download the model archive into a temporary directory.
    tmpdir = tempfile.mkdtemp()
    tmpfile = os.path.join(tmpdir, 'bert_sst.tar.gz')
    urllib.request.urlretrieve('https://aws-dlc-sample-models.s3.amazonaws.com/bert_sst/bert_sst.tar.gz', tmpfile)

    # Upload the archive to the session's default S3 bucket.
    prefix = 'bert-model'
    model_data = sagemaker_session.upload_data(path=tmpfile, key_prefix=prefix)

    # Entry-point script (must exist locally) that serves requests inside the container.
    script = "bert_inference.py"
    model = MXNetModel(model_data,
                       role,
                       script,
                       image_uri=ecr_image,
                       py_version="py3",
                       framework_version=framework_version,
                       sagemaker_session=sagemaker_session)

    endpoint_name = utils.unique_name_from_base('bert')
    print("Deploying...")
    predictor = model.deploy(1, instance_type, endpoint_name=endpoint_name)

    # Example inference call:
    # output = predictor.predict(["Positive sentiment", "Negative sentiment"])
    # assert [1, 0] == output

    return predictor


def inference(data, predictor, loops=10):
    """Send `data` to `predictor` `loops` times and collect the latency the endpoint reports."""
    payload = list(data)
    times = []
    for _ in range(loops):
        output = predictor.predict(payload)
        # The endpoint's response includes a 'time' entry for each request.
        times.append(output['time'])

    print("Avg:", np.mean(times), "Std:", np.std(times), "with {} loops".format(loops))
    return times
```
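The ``time`` values collected by ``inference`` come from the endpoint's response, so they are presumably measured inside the container by ``bert_inference.py`` (not shown) and may not include network or serialization overhead. As a complementary sketch, the hypothetical helper ``time_calls`` below measures wall-clock latency on the client side with ``time.perf_counter``; in the notebook you would pass ``predictor.predict`` as ``fn``.

```python
import time
from typing import Any, Callable, List


def time_calls(fn: Callable[[Any], Any], payload: Any, loops: int = 10) -> List[float]:
    """Call fn(payload) `loops` times and return each wall-clock latency in milliseconds.

    Hypothetical helper: with SageMaker you would pass predictor.predict as fn,
    so the measurement includes network and serialization overhead.
    """
    latencies_ms = []
    for _ in range(loops):
        start = time.perf_counter()
        fn(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


# Usage with a stand-in function instead of a real endpoint.
def fake_predict(texts):
    return [len(t) % 2 for t in texts]


samples = time_calls(fake_predict, ["Positive sentiment", "Negative sentiment"], loops=5)
print(len(samples), all(ms >= 0 for ms in samples))
```

Comparing client-side and server-reported numbers for the same endpoint shows how much of the latency is spent outside the model itself.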

### Create an Inference Endpoint

We use the ``MXNetModel`` object to load the model data and deploy it, which returns an ``MXNetPredictor``. This creates a SageMaker **Endpoint**, a hosted prediction service that we can use to perform inference.

The arguments to the ``deploy`` function set the number and type of instances used for the endpoint. Here, we deploy the model to a single ``ml.m4.xlarge`` instance. If you don't set the ``image_uri`` parameter, ``MXNetModel`` uses the default deep learning container image for the given ``framework_version`` and ``py_version``; in this post, we set it explicitly to choose between the oneDNN and MKL BLAS builds.

## MKL BLAS vs oneDNN

MKL BLAS and oneDNN are libraries of math routines used to perform mathematical operations on data. You can think of them as low-level building blocks that frameworks call to do the actual computation. MXNet relies on these libraries for operations such as dot products and other computationally expensive operators. MKL BLAS is Intel's implementation of BLAS, with operators that are highly optimized for CPUs. Operators such as ``dot`` are much faster in MKL BLAS than in the default library, oneDNN. To take advantage of this speed boost, the MXNet team packaged MXNet 1.6 with Intel's MKL BLAS library in a deep learning container that is available for use. In the diagram below, you can see the changes made between the different math libraries and MXNet versions.

### Performance

To see the performance difference, we now show how to deploy the same BERT model with the oneDNN-based and the MKL BLAS-based deep learning containers using SageMaker.
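Before comparing latencies, it can help to confirm which math libraries a given MXNet build was compiled with. MXNet 1.5 and later expose compile-time feature flags through ``mxnet.runtime.Features``, which you could query from inside the container (for example, in the inference script or an interactive shell). The sketch below assumes the flag names ``BLAS_MKL``, ``BLAS_OPEN``, and ``MKLDNN`` (the build flag for oneDNN in MXNet 1.x); ``math_backends`` is a hypothetical helper, and on the MKL BLAS build we would expect ``BLAS_MKL`` to be ``True``.

```python
def math_backends(enabled):
    """Pick out the BLAS and oneDNN (MKLDNN) build flags from a name -> bool mapping."""
    names = ("BLAS_MKL", "BLAS_OPEN", "MKLDNN")
    return {name: bool(enabled.get(name, False)) for name in names}


try:
    # mxnet.runtime.Features maps each compile-time feature name to a Feature object.
    from mxnet.runtime import Features
    flags = {name: feature.enabled for name, feature in Features().items()}
    print(math_backends(flags))
except (ImportError, OSError):
    print("MXNet is not available in this environment")
```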

#### oneDNN 

Notice that ``ecr_image`` is set to a container URI and passed to ``MXNetModel`` as ``image_uri``. This is the deep learning container URI, and the ``v3.7`` suffix in its tag specifies a version of the MXNet 1.6 container that is compiled with oneDNN.

```python
ecr_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.6.0-cpu-py36-ubuntu16.04-v3.7"
onednn_predictor = deploy_bert(sagemaker_session, ecr_image, 'ml.m4.xlarge', "1.6.0", role)
```

```python
times = inference(tweets.text[:100], onednn_predictor)
```

```
Avg: 55.409968686103824 Std: 0.12656422301418305 with 10 loops

[55.78212642669678,
 55.4226975440979,
 55.36242318153381,
 55.377811670303345,
 55.33304786682129,
 55.36034035682678,
 55.37596607208252,
 55.37017631530762,
 55.3291597366333,
 55.38593769073486]
```
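The average alone hides how individual requests behave. In this run, the first request (55.78) is noticeably slower than the rest, likely a warm-up effect. A minimal standard-library sketch, using a hypothetical ``summarize`` helper, reports the median and maximum as well and can optionally skip leading warm-up requests. ``statistics.pstdev`` computes the population standard deviation, which matches ``np.std``'s default.

```python
import statistics

# Per-request latencies reported by the oneDNN endpoint in the run above.
onednn_times = [55.78212642669678, 55.4226975440979, 55.36242318153381,
                55.377811670303345, 55.33304786682129, 55.36034035682678,
                55.37596607208252, 55.37017631530762, 55.3291597366333,
                55.38593769073486]


def summarize(times, warmup=0):
    """Return mean, population std, median, and max, skipping `warmup` leading calls."""
    kept = times[warmup:]
    return {
        "mean": statistics.mean(kept),
        "std": statistics.pstdev(kept),
        "median": statistics.median(kept),
        "max": max(kept),
    }


print(summarize(onednn_times))
print(summarize(onednn_times, warmup=1))
```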

#### MKL BLAS

To use the deep learning container built with MKL BLAS, change the version suffix in the image URI tag from ``v3.7`` to ``v3.8``.
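Because the two containers differ only in the tag suffix, a small hypothetical helper can build both URIs from a region and a suffix. It assumes that the account ID ``763104351884`` used in this post also hosts the image in your region, which holds for many but not all AWS Regions, so check the list of available DLC images for your region before relying on it.

```python
# Account ID taken from the image URIs used in this post (assumed valid for your region).
DLC_ACCOUNT = "763104351884"


def mxnet_inference_image(region, tag_suffix):
    """Build an MXNet 1.6.0 CPU inference DLC URI for `region` with a suffix such as 'v3.7'."""
    repo = "{}.dkr.ecr.{}.amazonaws.com/mxnet-inference".format(DLC_ACCOUNT, region)
    return "{}:1.6.0-cpu-py36-ubuntu16.04-{}".format(repo, tag_suffix)


onednn_image = mxnet_inference_image("us-east-1", "v3.7")
mkl_image = mxnet_inference_image("us-east-1", "v3.8")
print(mkl_image)
```

In the notebook, you could pass the ``region`` variable from the setup cell instead of hardcoding ``"us-east-1"``.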

```python
ecr_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.6.0-cpu-py36-ubuntu16.04-v3.8"
mkl_predictor = deploy_bert(sagemaker_session, ecr_image, 'ml.m4.xlarge', "1.6.0", role)
```

```python
times = inference(tweets.text[:100], mkl_predictor)
```

## (Optional) Delete the Endpoints

After you have finished with this example, remember to delete the prediction endpoints to release the instances associated with them.

```python
# Delete both endpoints created in this post.
onednn_predictor.delete_endpoint()
mkl_predictor.delete_endpoint()
```
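If one ``delete_endpoint`` call fails (for example, because that endpoint was never created), the cell above stops before deleting the remaining endpoints. A hedged sketch of a more forgiving cleanup, using a hypothetical ``delete_all_endpoints`` helper, collects the errors and keeps going.

```python
def delete_all_endpoints(predictors):
    """Call delete_endpoint() on every predictor, continuing past failures.

    Returns the exceptions raised, so that one failed deletion does not
    leave the remaining endpoints running.
    """
    errors = []
    for predictor in predictors:
        try:
            predictor.delete_endpoint()
        except Exception as exc:  # keep going so the other endpoints are still deleted
            errors.append(exc)
    return errors


# Usage with stand-in objects; in the notebook you would pass
# [onednn_predictor, mkl_predictor].
class FakePredictor:
    def __init__(self, fail=False):
        self.fail = fail
        self.deleted = False

    def delete_endpoint(self):
        if self.fail:
            raise RuntimeError("endpoint not found")
        self.deleted = True


fakes = [FakePredictor(), FakePredictor(fail=True), FakePredictor()]
errors = delete_all_endpoints(fakes)
print(len(errors), [p.deleted for p in fakes])
```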

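## BERT

This section repeats the benchmark with a second set of helper functions and reports the average latency for both containers.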

```python
def test_bert(sagemaker_session, ecr_image, instance_type, framework_version, role):
    """Download the sample BERT model, upload it to S3, and deploy it to an endpoint."""
    # Download the model archive into a temporary directory.
    tmpdir = tempfile.mkdtemp()
    tmpfile = os.path.join(tmpdir, 'bert_sst.tar.gz')
    urllib.request.urlretrieve('https://aws-dlc-sample-models.s3.amazonaws.com/bert_sst/bert_sst.tar.gz', tmpfile)

    prefix = 'bert-model'
    model_data = sagemaker_session.upload_data(path=tmpfile, key_prefix=prefix)

    script = "bert_inference.py"
    model = MXNetModel(model_data,
                       role,
                       script,
                       image_uri=ecr_image,
                       py_version="py3",
                       framework_version=framework_version,
                       sagemaker_session=sagemaker_session)

    endpoint_name = utils.unique_name_from_base('bert')
    print("Deploying...")
    predictor = model.deploy(1, instance_type, endpoint_name=endpoint_name)

    # Example inference call:
    # output = predictor.predict(["Positive sentiment", "Negative sentiment"])
    # assert [1, 0] == output

    return predictor

# Note: this redefines deploy_bert from the earlier cell with a different
# signature (model_data comes first), so call it with keyword arguments.
def deploy_bert(model_data, sagemaker_session, ecr_image, instance_type, framework_version, role, script="bert_inference.py"):
    """Deploy already-uploaded model data (an S3 URI) to a new endpoint and return its predictor."""
    model = MXNetModel(model_data,
                       role,
                       script,
                       image_uri=ecr_image,
                       py_version="py3",
                       framework_version=framework_version,
                       sagemaker_session=sagemaker_session)

    endpoint_name = utils.unique_name_from_base('test-mxnet-gluonnlp')
    print("Deploying...")
    predictor = model.deploy(1, instance_type, endpoint_name=endpoint_name)

    print("\nEndpoint Name: {}".format(endpoint_name))

    return predictor

def deploy_model(model_data, endpoint_name, sagemaker_session, ecr_image, instance_type, framework_version, role, script="bert_inference.py"):
    """Deploy model data to a new endpoint whose name starts with `endpoint_name`."""
    model = MXNetModel(model_data,
                       role,
                       script,
                       image_uri=ecr_image,
                       py_version="py3",
                       framework_version=framework_version,
                       sagemaker_session=sagemaker_session)

    # Append a unique suffix so repeated deployments don't collide.
    endpoint_name = utils.unique_name_from_base(endpoint_name)
    print("Deploying...")
    predictor = model.deploy(1, instance_type, endpoint_name=endpoint_name)

    print("\nEndpoint Name: {}".format(endpoint_name))

    return predictor
```

```python
ecr_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.6.0-cpu-py36-ubuntu16.04-v3.7"
onednn_predictor = test_bert(sagemaker_session, ecr_image, 'ml.m4.xlarge', "1.6.0", role)
```

```python
times = []
for i in range(10):
    output = onednn_predictor.predict(list(tweets.text[:100]))
    times.append(output['time'])
```

```python
print("Avg:", np.mean(times))
print("Std:", np.std(times))
```

```
Avg: 55.713221716880795
Std: 0.20049065918782286
```


```python
ecr_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-inference:1.6.0-cpu-py36-ubuntu16.04-v3.8"
mkl_predictor = test_bert(sagemaker_session, ecr_image, 'ml.m4.xlarge', "1.6.0", role)
```

```python
times = []
for i in range(10):
    output = mkl_predictor.predict(list(tweets.text[:100]))
    times.append(output["time"])
```

```python
print("Avg:", np.mean(times))
print("Std:", np.std(times))
```

```
Avg: 48.21942739486694
Std: 0.34405162161675223
```
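Using the averages from these two runs, the relative latency reduction can be computed directly. In this particular run, MKL BLAS lowered the average latency by about 13.5%, within the "up to 30%" figure quoted earlier; the gain you see will likely depend on the model, payload, and instance type. ``latency_reduction`` is a hypothetical helper for this arithmetic.

```python
def latency_reduction(baseline, candidate):
    """Fractional latency reduction of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


# Average latencies from the BERT runs above (oneDNN v3.7 vs. MKL BLAS v3.8).
onednn_avg = 55.713221716880795
mkl_avg = 48.21942739486694
print("Latency reduction: {:.1%}".format(latency_reduction(onednn_avg, mkl_avg)))
```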
