# BentoML Example: TensorFlow GPU Serving

BentoML makes moving trained ML models to production easy:

- Package models trained with any ML framework and reproduce them for model serving in production
- Deploy anywhere for online API serving or offline batch serving
- High-performance API model server with adaptive micro-batching support
- Central hub for managing models and deployment processes via Web UI and APIs
- Modular and flexible design that adapts to your infrastructure

BentoML is a framework for serving, managing, and deploying machine learning models. It aims to bridge the gap between Data Science and DevOps, and to enable teams to deliver prediction services in a fast, repeatable, and scalable way. Before reading this example project, be sure to check out the Getting started guide to learn about the basic concepts in BentoML.

This notebook demonstrates how to serve a TensorFlow 2 model with BentoML and build a Docker image with GPU support. Please refer to the [GPU Serving guide](https://docs.bentoml.org/en/latest/guides/gpu_serving.html) for more information.

In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
!pip install -q bentoml tensorflow==2.5.0

We are building a sentiment analysis classifier with the IMDB dataset, retrieved from [Stanford](https://ai.stanford.edu/~amaas/data/sentiment/).

In [3]:
import os
import json
import string

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


from bentoml import BentoService, api, artifacts, env
from bentoml.adapters import JsonInput
from bentoml.frameworks.keras import KerasModelArtifact
from bentoml.service.artifacts.common import PickleArtifact

import tensorflow as tf
from tensorflow import config
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer, text_to_word_sequence
from tensorflow.keras.models import model_from_json
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from tensorflow.keras.layers import Activation, Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import RMSprop

In [4]:
tf.config.list_physical_devices("GPU")

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

## Preprocessing Data

In [5]:
def preprocess(s):
    return strip_punctuation(remove_br(s.lower()))


def strip_punctuation(s):
    for c in string.punctuation + "’":
        s = s.replace(c, "")
    return s


def remove_br(s):
    return s.replace("<br /><br />", "")
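
To make the effect of these helpers concrete, here is a small self-contained run on two made-up review snippets (the sample strings are illustrative and do not come from the dataset). It also shows a side effect worth knowing: because `<br /><br />` is replaced with an empty string, words that touch the tag on both sides get fused together.

```python
import string


def remove_br(s):
    # Drop the doubled HTML line breaks found in the raw reviews.
    return s.replace("<br /><br />", "")


def strip_punctuation(s):
    # Remove ASCII punctuation plus the typographic apostrophe.
    for c in string.punctuation + "’":
        s = s.replace(c, "")
    return s


def preprocess(s):
    return strip_punctuation(remove_br(s.lower()))


# Whitespace next to the tag keeps the words apart.
print(preprocess("It’s GREAT! <br /><br />Loved it."))   # its great loved it

# No whitespace around the tag: "movie" and "loved" are fused.
print(preprocess("Great movie!<br /><br />Loved it."))    # great movieloved it
```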

We will build a custom IMDB data loader using `sklearn.preprocessing.LabelEncoder` and `tf.keras.preprocessing.text.Tokenizer`.

In [6]:
class IMDB:
    def __init__(self, max_seq_len, vocab_size):
        self.MAX_SEQ_LEN = max_seq_len
        self.VOCAB_SIZE = vocab_size

        print('Loading IMDB dataset')
        df = pd.read_csv('data/imdb.csv', names=["X", "Y"], skiprows=1)
        # print(df.head())

        # cast X to str and preprocess
        df['X'] = df.X.apply(str)
        df['X'] = df.X.apply(preprocess)

        X = df.X
        Y = df.Y

        # encode labels
        label_encoder = LabelEncoder()
        Y = label_encoder.fit_transform(Y)
        Y = Y.reshape(-1, 1)

        # 85/15 train/test split
        self.X_train, self.X_test, self.Y_train, self.Y_test = train_test_split(
            X, Y, test_size=0.15
        )

        self.tokenizer = Tokenizer(num_words=self.VOCAB_SIZE, oov_token="<OOV>")
        self.tokenizer.fit_on_texts(self.X_train)

        self.tokenize()
        self.pad()

    def tokenize(self):
        self.X_train = self.tokenizer.texts_to_sequences(self.X_train)
        self.X_test = self.tokenizer.texts_to_sequences(self.X_test)

    def pad(self):
        self.X_train = pad_sequences(self.X_train, maxlen=self.MAX_SEQ_LEN, padding="post")
        self.X_test = pad_sequences(self.X_test, maxlen=self.MAX_SEQ_LEN, padding="post")
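
The `pad` step relies on `pad_sequences(..., maxlen=..., padding="post")`. The sketch below is a pure-Python illustration of that behavior for a single sequence, not the Keras implementation; the helper name `pad_post` is hypothetical. It assumes the Keras default `truncating="pre"`, which the class above does not override, so sequences longer than `maxlen` keep their last tokens.

```python
def pad_post(sequence, maxlen, value=0):
    """Illustrate pad_sequences(..., maxlen=maxlen, padding="post") for one sequence."""
    # With the default truncating="pre", tokens are dropped from the front,
    # so only the last `maxlen` tokens of a long sequence survive.
    truncated = sequence[-maxlen:]
    # padding="post" appends the fill value after the real tokens.
    return truncated + [value] * (maxlen - len(truncated))


print(pad_post([7, 3, 9], maxlen=5))           # [7, 3, 9, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

For long reviews this means the model sees the end of the text rather than the beginning.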

## Define our Model

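We will build a simple RNN with LSTM layers using `tf.keras.models.Sequential`. Note that the code below stacks two unidirectional `LSTM` layers rather than wrapping them in `tf.keras.layers.Bidirectional`; the figure illustrates the bidirectional variant.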

Image source: [Papers with Code](https://paperswithcode.com/method/bilstm#)

![bidirectional-lstm](./bidirectional-lstm.png)

In [7]:
# TF RNN model.
def RNN(max_seq_len, vocab_size):
    model = Sequential()
    model.add(Embedding(vocab_size, 64, input_length=max_seq_len))
    model.add(LSTM(64, return_sequences=True))
    model.add(Dropout(0.5))
    model.add(LSTM(64))
    model.add(Dense(256, name='fc1'))
    model.add(Dropout(0.5))
    model.add(Dense(1, name='out'))
    model.add(Activation('sigmoid'))
    return model

## Preparing Hyperparameters

In [8]:
# CONSTANT
VOCAB_SIZE = 5000
MAX_SEQ_LEN = 100
SEED = 1234

## Train and save our model locally

In [9]:
gpu = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpu[0], True)  # gpu name: /GPU:0

model = RNN(MAX_SEQ_LEN, VOCAB_SIZE)
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])

imdb = IMDB(MAX_SEQ_LEN, VOCAB_SIZE)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 64)           320000    
_________________________________________________________________
lstm (LSTM)                  (None, 100, 64)           33024     
_________________________________________________________________
dropout (Dropout)            (None, 100, 64)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                33024     
_________________________________________________________________
fc1 (Dense)                  (None, 256)               16640     
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
out (Dense)                  (None, 1)                 2
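
The parameter counts for the embedding, LSTM, and `fc1` rows in the summary can be reproduced by hand. The sketch below uses the standard formulas for these layer types (the helper names are hypothetical), with the sizes taken from `RNN(MAX_SEQ_LEN, VOCAB_SIZE)`.

```python
def embedding_params(vocab_size, dim):
    # One `dim`-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


print(embedding_params(5000, 64))  # 320000  (embedding)
print(lstm_params(64, 64))         # 33024   (lstm and lstm_1)
print(dense_params(64, 256))       # 16640   (fc1)
```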

In [10]:
os.makedirs("model", exist_ok=True)

with tf.device("/GPU:0"):
    # Model Training
    model.fit(
        imdb.X_train,
        imdb.Y_train,
        batch_size=512,
        epochs=10,
        validation_split=0.2,
        callbacks=[EarlyStopping(patience=2, verbose=1)],
    )

    # Run model on test set
    accr = model.evaluate(imdb.X_test, imdb.Y_test)
    print(
        'Test set\n  Loss: {:0.4f}\n  Accuracy: {:0.2f}'.format(
            accr[0], accr[1] * 100
        )
    )

    # save the full model (architecture and weights) as HDF5
    model.save("model/weights.h5")
    print("Saved model to disk")

    # save model as JSON
    model_json = model.to_json()
    with open("model/model.json", "w") as file:
        file.write(model_json)

    # save tokenizer as JSON
    tokenizer_json = imdb.tokenizer.to_json()
    with open("model/tokenizer.json", 'w', encoding='utf-8') as file:
        file.write(json.dumps(tokenizer_json, ensure_ascii=True))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 00007: early stopping
Test set
  Loss: 0.3949
  Accuracy: 84.07
Saved model to disk


## Defining our BentoService

Please refer to our [GPU Serving guide](https://docs.bentoml.org/en/latest/guides/gpu_serving.html) to set up your environment correctly.

We will use the Docker image provided by *BentoML*, `bentoml/model-server:0.12.1-py38-gpu`, as the base for our CUDA-enabled image.

In [11]:
%%writefile bento_svc.py

import string

from bentoml import BentoService, api, artifacts, env
from bentoml.adapters import JsonInput
from bentoml.frameworks.keras import KerasModelArtifact
from bentoml.service.artifacts.common import PickleArtifact
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import text_to_word_sequence

def preprocess(s):
    return strip_punctuation(remove_br(s.lower()))


def strip_punctuation(s):
    for c in string.punctuation + "’":
        s = s.replace(c, "")
    return s


def remove_br(s):
    return s.replace("<br /><br />", "")

@env(requirements_txt_file="./requirements.txt", docker_base_image="bentoml/model-server:0.12.1-py38-gpu")
@artifacts([KerasModelArtifact('model'), PickleArtifact('tokenizer')])
class TensorflowService(BentoService):
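    def word_to_index(self, word):
        # Tokenizer(num_words=5000) only keeps indices below 5000, and the
        # Embedding layer accepts indices 0..4999, so 5000 itself must map to <OOV>.
        if word in self.artifacts.tokenizer and self.artifacts.tokenizer[word] < 5000:
            return self.artifacts.tokenizer[word]
        else:
            return self.artifacts.tokenizer["<OOV>"]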

    def preprocessing(self, text_str):
        proc = text_to_word_sequence(preprocess(text_str))
        tokens = list(map(self.word_to_index, proc))
        return tokens

    @api(input=JsonInput())
    def predict(self, parsed_json):
        raw = self.preprocessing(parsed_json['text'])
        input_data = [raw[: n + 1] for n in range(len(raw))]
        input_data = pad_sequences(input_data, maxlen=100, padding="post")
        return self.artifacts.model.predict(input_data)

Overwriting bento_svc.py
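
The token lookup and the prefix expansion inside `predict` can be traced without TensorFlow. The following is a minimal sketch with a made-up word index (in the service it comes from the packed `tokenizer` artifact); indices at or above the vocabulary size fall back to `<OOV>`, which mirrors how `Tokenizer(num_words=...)` treats them.

```python
VOCAB_SIZE = 5000  # same vocabulary size as the notebook

# Hypothetical word index, for illustration only.
word_index = {"<OOV>": 1, "the": 2, "movie": 3, "great": 4, "zyzzyva": 7321}


def word_to_index(word):
    idx = word_index.get(word)
    # Unknown words and words outside the vocabulary both map to <OOV>.
    if idx is not None and idx < VOCAB_SIZE:
        return idx
    return word_index["<OOV>"]


tokens = [word_to_index(w) for w in "the movie was zyzzyva great".split()]
print(tokens)  # [2, 3, 1, 1, 4]

# Same expansion as in `predict`: one growing prefix per token.
prefixes = [tokens[: n + 1] for n in range(len(tokens))]
print(prefixes)  # [[2], [2, 3], [2, 3, 1], [2, 3, 1, 1], [2, 3, 1, 1, 4]]
```

Because every prefix becomes its own padded row, the model is asked for one score per prefix, and the last row is the one that covers the full text.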


## Pack our BentoService

In [12]:
from bento_svc import TensorflowService

gpu = config.experimental.list_physical_devices('GPU')
config.experimental.set_memory_growth(gpu[0], True)

def load_tokenizer():
    with open('model/tokenizer.json', 'r') as f:
        data = json.load(f)
        tokenizer = tokenizer_from_json(data)
        j = tokenizer.get_config()['word_index']
        return json.loads(j)


def load_model():
    # load the architecture from JSON, then restore the trained weights
    with open('model/model.json', 'r') as json_file:
        loaded_model_json = json_file.read()
    model = model_from_json(loaded_model_json)
    model.load_weights("model/weights.h5")
    return model


model = load_model()
tokenizer = load_tokenizer()

bento_svc = TensorflowService()
bento_svc.pack('model', model)
bento_svc.pack('tokenizer', tokenizer)

saved_path = bento_svc.save()

[2021-06-04 12:20:08,875] INFO - Using user specified docker base image: `bentoml/model-server:0.12.1-py38-gpu`, usermust make sure that the base image either has Python 3.8 or conda installed.
[2021-06-04 12:20:16,042] INFO - Detected non-PyPI-released BentoML installed, copying local BentoML modulefiles to target saved bundle path..


no previously-included directories found matching 'e2e_tests'
no previously-included directories found matching 'tests'
no previously-included directories found matching 'benchmark'


UPDATING BentoML-0.12.1+53.g9d8b599/bentoml/_version.py
set BentoML-0.12.1+53.g9d8b599/bentoml/_version.py to '0.12.1+53.g9d8b599'
[2021-06-04 12:20:24,200] INFO - BentoService bundle 'TensorflowService:20210604122013_B189A7' saved to: /home/aarnphm/bentoml/repository/TensorflowService/20210604122013_B189A7
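
The tokenizer round trip above goes through several layers of JSON encoding, which is why `load_tokenizer` decodes more than once. The sketch below reproduces that layering with the standard library only. The `tokenizer_json` value is a hypothetical stand-in that assumes the shape of what `Tokenizer.to_json()` returns: a JSON string whose `word_index` entry is itself a JSON-encoded string.

```python
import json

# Hypothetical stand-in for Tokenizer.to_json(): a JSON *string*, with
# "word_index" stored as a nested JSON-encoded string.
tokenizer_json = json.dumps(
    {
        "class_name": "Tokenizer",
        "config": {"word_index": json.dumps({"<OOV>": 1, "the": 2})},
    }
)

# Saving: json.dumps on a string adds one more layer of quoting.
on_disk = json.dumps(tokenizer_json, ensure_ascii=True)

# Loading: decoding once returns the original JSON string, not a dict.
restored = json.loads(on_disk)
assert isinstance(restored, str) and restored == tokenizer_json

# Two further decoding steps recover the plain word -> index mapping.
word_index = json.loads(json.loads(restored)["config"]["word_index"])
print(word_index)  # {'<OOV>': 1, 'the': 2}
```

In the notebook, `tokenizer_from_json` and `get_config()` take care of the middle step, and the final `json.loads` yields the dictionary that is packed as the `tokenizer` artifact.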


## REST API Model Serving

To start a REST API model server with the BentoService saved above, use the `serve` command:

In [13]:
!bentoml serve TensorflowService:latest

[2021-06-04 12:20:25,979] INFO - Getting latest version TensorflowService:20210604122013_B189A7
[2021-06-04 12:20:25,988] INFO - Starting BentoML API proxy in development mode..
[2021-06-04 12:20:25,990] INFO - Starting BentoML API server in development mode..
[2021-06-04 12:20:26,149] INFO - Your system nofile limit is 4096, which means each instance of microbatch service is able to hold this number of connections at same time. You can increase the number of file descriptors for the server process, or launch more microbatch instances to accept more concurrent connection.
(Press CTRL+C to quit)
2021-06-04 12:20:26.380507: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
[2021-06-04 12:20:28,087] INFO - Using user specified docker base image: `bentoml/model-server:0.12.1-py38-gpu`, usermust make sure that the base image either has Python 3.8 or conda installed.
2021-06-04 12:20:28.099245: I tensorflow/stream_executor/p

 * Serving Flask app 'TensorflowService' (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
INFO:werkzeug: * Running on http://127.0.0.1:54691/ (Press CTRL+C to quit)
INFO:werkzeug:127.0.0.1 - - [04/Jun/2021 12:20:35] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [04/Jun/2021 12:20:35] "GET /static_content/main.css HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [04/Jun/2021 12:20:35] "GET /static_content/swagger-ui.css HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [04/Jun/2021 12:20:35] "GET /static_content/readme.css HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [04/Jun/2021 12:20:35] "GET /static_content/marked.min.js HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [04/Jun/2021 12:20:35] "[36mGET /static_content/swagger-ui-bundle.js HTTP/1.1[0m" 304 -
INFO:werkzeug:127.0.0.1 - - [04/Jun/2021 12:20:36] "GET /docs.json HTTP/1.1" 200 -
2021-06-04 12:20:56.380680: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None o

Check whether the `BentoService` is running on the GPU:

In [14]:
!nvidia-smi

Fri Jun  4 12:21:07 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   77C    P2    31W /  N/A |    891MiB /  6078MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

If you are running this notebook from Google Colab, start the dev server with the `--run-with-ngrok` option to reach the API through a public endpoint managed by [ngrok](https://ngrok.com/):

In [None]:
!bentoml serve TensorflowService:latest --run-with-ngrok

## Containerize our model server with Docker

One common way of distributing this model API server for production deployment is via Docker containers, and BentoML provides a convenient way to do that.

Note that Docker is not available in Google Colab. You will need to download and run this notebook locally to try out containerization with Docker.

If you already have Docker configured, simply run the following command to produce a Docker image serving the TensorflowService prediction service created above:

In [15]:
!bentoml containerize TensorflowService:latest -t tensorflow-service-gpu:latest

[2021-06-04 12:21:09,517] INFO - Getting latest version TensorflowService:20210604122013_B189A7
Found Bento: /home/aarnphm/bentoml/repository/TensorflowService/20210604122013_B189A7
Containerizing TensorflowService:20210604122013_B189A7 with local YataiService and docker daemon from local environment\^C
 

In [None]:
!docker run --gpus all --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools -p 5000:5000 tensorflow-service-gpu

[2021-06-04 05:21:16,787] INFO - Starting BentoML proxy in production mode..
[2021-06-04 05:21:16,788] INFO - Starting BentoML API server in production mode..
[2021-06-04 05:21:16,803] INFO - Running micro batch service on :5000
[2021-06-04 05:21:16 +0000] [19] [INFO] Starting gunicorn 20.1.0
[2021-06-04 05:21:16 +0000] [19] [INFO] Listening at: http://0.0.0.0:53393 (19)
[2021-06-04 05:21:16 +0000] [19] [INFO] Using worker: sync
[2021-06-04 05:21:16 +0000] [20] [INFO] Booting worker with pid: 20
[2021-06-04 05:21:16 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2021-06-04 05:21:16 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
[2021-06-04 05:21:16 +0000] [1] [INFO] Using worker: aiohttp.worker.GunicornWebWorker
[2021-06-04 05:21:16 +0000] [21] [INFO] Booting worker with pid: 21
[2021-06-04 05:21:16,974] INFO - Your system nofile limit is 1048576, which means each instance of microbatch service is able to hold this number of connections at same time. You can increase the number o

2021-06-04 05:21:34.021454: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-06-04 05:21:34.039973: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2200660000 Hz
2021-06-04 05:21:34.761591: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-04 05:21:35.535064: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8200
2021-06-04 05:21:35.735112: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-04 05:21:36.770853: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
[2021-06-04 05:21:36,780] INFO - {'service_name': 'TensorflowService', 'service_version': '20210604120626_2294FE', 'api': 'predict', 'task': {'data': '{"text":"I love you so much"}', 'task_id
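
The last log line shows the service handling a JSON payload of the form `{"text": ...}`. A minimal client sketch using only the standard library could look like the following. The base URL `http://localhost:5000` (from the `-p 5000:5000` mapping) and the `/predict` route (from the API function name) are assumptions, and `build_predict_request` is a hypothetical helper; adjust them to your setup.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:5000"):
    # JsonInput expects a JSON body; the service reads the "text" field.
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",  # assumed route, named after the API function
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("I love you so much")
print(request.get_method(), request.full_url)

# Uncomment while the server is running to send the request:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```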

## Deployment Options

If you are on a small team with limited engineering or DevOps resources, try out automated deployment with the BentoML CLI, which currently supports AWS Lambda, AWS SageMaker, and Azure Functions:
- [AWS Lambda Deployment Guide](https://docs.bentoml.org/en/latest/deployment/aws_lambda.html)
- [AWS SageMaker Deployment Guide](https://docs.bentoml.org/en/latest/deployment/aws_sagemaker.html)
- [Azure Functions Deployment Guide](https://docs.bentoml.org/en/latest/deployment/azure_functions.html)

If the cloud platform you are working with is not on the list above, try out these step-by-step guides on manually deploying a BentoML-packaged model to cloud platforms:
- [AWS ECS Deployment](https://docs.bentoml.org/en/latest/deployment/aws_ecs.html)
- [Google Cloud Run Deployment](https://docs.bentoml.org/en/latest/deployment/google_cloud_run.html)
- [Azure container instance Deployment](https://docs.bentoml.org/en/latest/deployment/azure_container_instance.html)
- [Heroku Deployment](https://docs.bentoml.org/en/latest/deployment/heroku.html)

Lastly, if you have a DevOps or ML Engineering team operating a Kubernetes or OpenShift cluster, use the following guides as references for implementing your deployment strategy:
- [Kubernetes Deployment](https://docs.bentoml.org/en/latest/deployment/kubernetes.html)
- [Knative Deployment](https://docs.bentoml.org/en/latest/deployment/knative.html)
- [Kubeflow Deployment](https://docs.bentoml.org/en/latest/deployment/kubeflow.html)
- [KFServing Deployment](https://docs.bentoml.org/en/latest/deployment/kfserving.html)
- [Clipper.ai Deployment Guide](https://docs.bentoml.org/en/latest/deployment/clipper.html)