# BentoML Example: TensorFlow 2.0 (Echo model)

[BentoML](http://bentoml.ai) is an open source platform for machine learning model serving and deployment. 

This notebook demonstrates how to use BentoML to turn a TensorFlow model into a Docker image containing a REST API server that serves the model, how to use an ML service built with BentoML as a CLI tool, and how to distribute it as a PyPI package.


In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
import numpy as np
print(tf.__version__)

import os
import time
import requests
import json

2.2.0


In [2]:
class EchoModel(tf.keras.Model):
    def call(self, x):
        return tf.multiply(x, 1)

custom_model = EchoModel()
custom_model.compile(optimizer='sgd',
              loss="mean_squared_error",
              metrics=['accuracy'])

test_input =  tf.constant(np.zeros([1, 2, 2]))
test_output = tf.constant(np.zeros([1, 2, 2]))

custom_model.fit(test_input, test_output, epochs=1)  # required: fitting once generates the signature automatically

# test
custom_model(tf.constant(np.ones([4, 2, 3]), dtype=tf.float32))



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.



<tf.Tensor: shape=(4, 2, 3), dtype=float32, numpy=
array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]], dtype=float32)>
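
The message in the output above looks like the tail of Keras' autocast warning, which is typically shown when a float64 tensor is passed to a layer whose dtype is float32. `np.zeros` returns float64 by default, so the training tensors built from it are float64. A minimal sketch, assuming the only goal is to avoid the cast, creates the arrays as float32 up front:

```python
import numpy as np

# np.zeros defaults to float64; Keras would cast this down to the layer dtype.
default_zeros = np.zeros([1, 2, 2])

# Requesting float32 explicitly yields arrays that already match the layer dtype.
float32_zeros = np.zeros([1, 2, 2], dtype=np.float32)

print(default_zeros.dtype, float32_zeros.dtype)
```

Passing `float32_zeros` to `tf.constant` in place of the default arrays should give float32 tensors, so no cast is needed.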

In [3]:
test_tensor = tf.constant(np.zeros([2,4,1]), dtype=tf.float32)
custom_model(test_tensor)

<tf.Tensor: shape=(2, 4, 1), dtype=float32, numpy=
array([[[0.],
        [0.],
        [0.],
        [0.]],

       [[0.],
        [0.],
        [0.],
        [0.]]], dtype=float32)>

# Create BentoService with BentoML


In [8]:
%%writefile tensorflow_echo.py

import bentoml
import tensorflow as tf
import numpy as np

from bentoml.frameworks.tensorflow import TensorflowSavedModelArtifact
from bentoml.adapters import TfTensorInput


@bentoml.env(pip_dependencies=['tensorflow', 'numpy', 'scikit-learn'])
@bentoml.artifacts([TensorflowSavedModelArtifact('model')])
class EchoServicer(bentoml.BentoService):
    @bentoml.api(input=TfTensorInput(), batch=True)
    def predict(self, tensor):
        outputs = self.artifacts.model(tensor)
        return outputs


Overwriting tensorflow_echo.py


In [10]:
# save model
from tensorflow_echo import EchoServicer
bento_svc = EchoServicer()
bento_svc.pack("model", custom_model)
saved_path = bento_svc.save()

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: /tmp/bentoml-temp-1qhxa2k2/EchoServicer/artifacts/model_saved_model/assets
[2020-09-23 02:11:33,631] INFO - Detected non-PyPI-released BentoML installed, copying local BentoML modulefiles to target saved bundle path..


  "Distutils was imported before Setuptools. This usage is discouraged "
no previously-included directories found matching 'e2e_tests'
no previously-included directories found matching 'tests'
no previously-included directories found matching 'benchmark'


UPDATING BentoML-0.9.0rc0+7.g8af1c8b/bentoml/_version.py
set BentoML-0.9.0rc0+7.g8af1c8b/bentoml/_version.py to '0.9.0.pre+7.g8af1c8b'
[2020-09-23 02:11:34,436] INFO - BentoService bundle 'EchoServicer:20200923021113_51B9BE' saved to: /home/bentoml/bentoml/repository/EchoServicer/20200923021113_51B9BE


**Test packed BentoML service**

In [11]:
bento_svc.predict([1, 2, 3])

<tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>

# Use BentoService with BentoML CLI

**`bentoml get` retrieves the service and all of its versions**

In [12]:
!bentoml get EchoServicer

^C


When a version (or `latest`) is specified, `bentoml get` displays the metadata and additional information for that version.

In [13]:
!bentoml get EchoServicer:latest

[2020-09-23 02:12:06,352] INFO - Getting latest version EchoServicer:20200923021113_51B9BE
{
  "name": "EchoServicer",
  "version": "20200923021113_51B9BE",
  "uri": {
    "type": "LOCAL",
    "uri": "/home/bentoml/bentoml/repository/EchoServicer/20200923021113_51B9BE"
  },
  "bentoServiceMetadata": {
    "name": "EchoServicer",
    "version": "20200923021113_51B9BE",
    "createdAt": "2020-09-22T18:11:34.411857Z",
    "env": {
      "condaEnv": "name: bentoml-default-conda-env\nchannels:\n- conda-forge\n- defaults\ndependencies:\n- pip\n",
      "pythonVersion": "3.6.10",
      "dockerBaseImage": "bentoml/model-server:0.9.0.pre-py36",
      "pipPackages": [
        "bentoml==0.9.0.pre",
        "tensorflow==2.2.0",
        "numpy==1.19.1",
        "scikit-learn==0.22.2.post1"
      ]
    },
    "artifacts": [
      {
        "name": "model",
        "artifactType": "TensorflowSavedModelArtifact"
      }
    ],
    "apis": [
      {
        "name": "predict",
        "inputType": 

Making a prediction from the CLI is simple: use the `bentoml run` command to quickly get a prediction result.

In [1]:
!bentoml run EchoServicer:latest predict --input='{"instances": [[1, 2]]}'

[2020-09-23 02:13:50,798] INFO - Getting latest version EchoServicer:20200923021113_51B9BE
2020-09-23 02:13:53.681812: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-09-23 02:13:53.696190: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-23 02:13:53.696582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 computeCapability: 6.1
coreClock: 1.6705GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s
2020-09-23 02:13:53.696783: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-09-23 02:13:53.698106: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic
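
The `--input` value has to be valid JSON, and quoting it by hand becomes error-prone as payloads grow. The sketch below assumes the command is launched from Python rather than typed into a shell; `build_run_command` is a hypothetical helper, and the CLI arguments are only the ones shown in the cell above.

```python
import json
import shlex


def build_run_command(bento_tag, api_name, instances):
    """Build the argument list for `bentoml run` with a JSON-encoded payload."""
    payload = json.dumps({"instances": instances})
    return ["bentoml", "run", bento_tag, api_name, f"--input={payload}"]


command = build_run_command("EchoServicer:latest", "predict", [[1, 2]])

# shlex.join renders a shell-safe string, useful for logging or copy-pasting.
print(shlex.join(command))

# With BentoML installed, subprocess.run(command, check=True) would execute it.
```

Passing the arguments as a list avoids a second layer of shell quoting around the JSON string.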

#### Run REST API server locally

In [2]:
!bentoml serve EchoServicer:latest

[2020-07-28 16:03:41,001] INFO - Getting latest version EchoServicer:20200728160149_E7E0E9
[2020-07-28 16:03:41,002] INFO - Starting BentoML API server in development mode..
2020-07-28 16:03:43.931423: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-28 16:03:43.944089: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-28 16:03:43.944466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 computeCapability: 6.1
coreClock: 1.6705GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s
2020-07-28 16:03:43.944640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-28 16:03:43.946157: I tensor

### Send prediction request to REST API server

*Run the following command in a terminal to make an HTTP request to the API server*
```bash
curl -i \
--header "Content-Type: application/json" \
--request POST \
--data '{"instances": [[1, 2]]}' \
localhost:5000/predict
```

In [3]:
import requests
import json
headers = {"content-type": "application/json"}
data = json.dumps(
    {"instances": [[1, 2, 2, 3], [2, 3, 3, 4]]}
)
print('Data: {} ... {}'.format(data[:50], data[len(data)-52:]))
json_response = requests.post('http://127.0.0.1:5000/predict', data=data, headers=headers)
print(json_response)
print(json_response.text)

Data: {"instances": [[1, 2, 2, 3], [2, 3, 3, 4]]} ... , 3, 4]]}
<Response [200]>
[[1.0, 2.0, 2.0, 3.0], [2.0, 3.0, 3.0, 4.0]]
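
The request above has no timeout and does not check the HTTP status. As a rough sketch using only the standard library, and assuming the dev server is listening on `127.0.0.1:5000` as in this notebook, the call can be split into building the request and sending it. `build_predict_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_predict_request(instances, url="http://127.0.0.1:5000/predict"):
    """Create a POST request carrying the {"instances": ...} JSON payload."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10.0):
    """Send the request; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request([[1, 2, 2, 3], [2, 3, 3, 4]])
print(request.get_method(), request.full_url)

# send(request) returns the decoded predictions once `bentoml serve` is running.
```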


# "pip install" a BentoService bundle

A saved BentoML bundle can be installed directly with `pip install $SAVED_PATH` and then used as a regular Python package.

In [4]:
!pip install -q {saved_path}

In [5]:
import EchoServicer

pip_installed_svc = EchoServicer.load()

In [11]:
pip_installed_svc.predict(test_tensor)

<tf.Tensor: shape=(2, 4, 1), dtype=float32, numpy=
array([[[0.],
        [0.],
        [0.],
        [0.]],

       [[0.],
        [0.],
        [0.],
        [0.]]], dtype=float32)>

## CLI access

`pip install $SAVED_PATH` also installs a CLI tool for accessing the BentoML service.

In [12]:
!EchoServicer --help

Usage: EchoServicer [OPTIONS] COMMAND [ARGS]...

  BentoML CLI tool

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  containerize        Containerizes given Bento into a ready-to-use Docker
                      image

  info                List APIs
  install-completion  Install shell command completion
  open-api-spec       Display OpenAPI/Swagger JSON specs
  run                 Run API function
  serve               Start local dev API server
  serve-gunicorn      Start production API server


### Print model service information:

In [13]:
!EchoServicer info

{
  "name": "EchoServicer",
  "version": "20200728160149_E7E0E9",
  "created_at": "2020-07-28T08:01:59.060883Z",
  "env": {
    "conda_env": "name: bentoml-EchoServicer\nchannels:\n- defaults\ndependencies:\n- python=3.6.10\n- pip\n",
    "pip_dependencies": "tensorflow\nbentoml==0.8.3\nnumpy\nscikit-learn",
    "python_version": "3.6.10",
    "docker_base_image": "bentoml/model-server:0.8.3"
  },
  "artifacts": [
    {
      "name": "model",
      "artifact_type": "TensorflowSavedModelArtifact"
    }
  ],
  "apis": [
    {
      "name": "predict",
      "input_type": "TfTensorInput",
      "docs": "BentoService inference API 'predict', input: 'TfTensorInput', output: 'DefaultOutput'",
      "input_config": {
        "method": "predict",
        "is_batch_input": true
      },
      "output_config": {
        "cors": "*"
      },
      "output_type": "DefaultOutput",
      "mb_max_latency": 10000,
      "mb_max_batch_size": 2000
    }
  ]
}
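
The `info` output is plain JSON, so it can be inspected programmatically. The following is a sketch only: it assumes the command output has been captured into a string and may still carry terminal color codes, and the helper names `parse_service_info` and `summarize_apis` are hypothetical. The sample is abbreviated but uses the same field names as the output above.

```python
import json
import re

# Matches ANSI color sequences such as ESC[39m (an assumption about captured output).
ANSI_COLOR = re.compile(r"\x1b\[[0-9;]*m")


def parse_service_info(raw_text):
    """Strip terminal color codes and parse the remaining JSON document."""
    return json.loads(ANSI_COLOR.sub("", raw_text))


def summarize_apis(info):
    """Return (name, input type, output type) for every API in the parsed info."""
    return [(api["name"], api["input_type"], api["output_type"]) for api in info["apis"]]


# Abbreviated sample wrapped in color codes, as a terminal might emit it.
sample = (
    "\x1b[39m"
    '{"name": "EchoServicer", "apis": [{"name": "predict", '
    '"input_type": "TfTensorInput", "output_type": "DefaultOutput"}]}'
    "\x1b[0m"
)

info = parse_service_info(sample)
print(summarize_apis(info))
```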


### Run the 'predict' API with JSON data:

In [1]:
!EchoServicer run predict --input='{"instances": [[1, 2]]}'

2020-07-28 16:28:36.115351: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-28 16:28:36.128965: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-28 16:28:36.129383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 computeCapability: 6.1
coreClock: 1.6705GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s
2020-07-28 16:28:36.129541: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-07-28 16:28:36.130790: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-07-28 16:28:36.132114: I tensorflow/stream_executor/platform/

# Deploy BentoService as REST API server to the cloud


BentoML supports deployment to multiple cloud provider services, such as AWS Lambda, AWS SageMaker, and Google Cloud Run. The full list and deployment guides are on the documentation site at https://docs.bentoml.org/en/latest/deployment/index.html

For this demo, we are going to deploy to AWS SageMaker.

In [49]:
!bentoml sagemaker deploy tf2-echo -b EchoServicer:latest --api-name predict

Deploying Sagemaker deployment -[2020-02-24 14:16:43,609] INFO - Step 1/11 : FROM continuumio/miniconda3:4.7.12
[2020-02-24 14:16:43,610] INFO - 

[2020-02-24 14:16:43,611] INFO -  ---> 406f2b43ea59

[2020-02-24 14:16:43,611] INFO - Step 2/11 : EXPOSE 8080
[2020-02-24 14:16:43,611] INFO - 

[2020-02-24 14:16:43,611] INFO -  ---> Using cache

[2020-02-24 14:16:43,611] INFO -  ---> 58636f0540f4

[2020-02-24 14:16:43,612] INFO - Step 3/11 : RUN set -x      && apt-get update      && apt-get install --no-install-recommends --no-install-suggests -y libpq-dev build-essential     && apt-get install -y nginx      && rm -rf /var/lib/apt/lists/*
[2020-02-24 14:16:43,612] INFO - 

[2020-02-24 14:16:43,612] INFO -  ---> Using cache

[2020-02-24 14:16:43,612] INFO -  ---> 70d334258584

[2020-02-24 14:16:43,612] INFO - Step 4/11 : RUN conda install pip numpy scipy       && pip install gunicorn gevent
[2020-02-24 14:16:43,612] INFO - 

[2020-02-24 14:16:43,612] INFO -  ---> Using cache

[2020-02-24 14

In [50]:
!bentoml sagemaker get tf2-echo

{
  "namespace": "bobo",
  "name": "tf2-echo",
  "spec": {
    "bentoName": "EchoServicer",
    "bentoVersion": "20200224141541_D891E3",
    "operator": "AWS_SAGEMAKER",
    "sagemakerOperatorConfig": {
      "region": "us-west-2",
      "instanceType": "ml.m4.xlarge",
      "instanceCount": 1,
      "apiName": "predict"
    }
  },
  "state": {
    "state": "RUNNING",
    "infoJson": {
      "EndpointName": "bobo-tf2-echo",
      "EndpointArn": "arn:aws:sagemaker:us-west-2:192023623294:endpoint/bobo-tf2-echo",
      "EndpointConfigName": "bobo-tf2-echo-EchoServicer-20200224141541-D891E3",
      "ProductionVariants": [
        {
          "VariantName": "bobo-tf2-echo-EchoServicer-20200224141541-D891E3",
          "DeployedImages": [
            {
              "SpecifiedImage": "192023623294.dkr.ecr.us-west-2.amazonaws.com/echoservicer-sagemaker:20200224141541_D891E3",
              "ResolvedImage": "192023623294.dkr.ecr.us-west-2.amazonaws.com/echoservicer-sagemaker@sha256:5bb688

In [51]:
!aws sagemaker-runtime invoke-endpoint --endpoint-name bobo-tf2-echo --content-type 'application/json' \
--body '{"instances": [[1, 2]]}' \
output.json && cat output.json

{
    "ContentType": "application/json",
    "InvokedProductionVariant": "bobo-tf2-echo-EchoServicer-20200224141541-D891E3"
}
[[1, 2]]
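
The same endpoint can also be invoked from Python. This is a sketch, assuming `boto3` is installed and AWS credentials are configured; the endpoint name and region are the ones shown in the deployment output above, and `encode_instances` and `invoke_echo_endpoint` are hypothetical helper names.

```python
import json


def encode_instances(instances):
    """Encode the payload in the same {"instances": ...} format used by the CLI call."""
    return json.dumps({"instances": instances}).encode("utf-8")


def invoke_echo_endpoint(endpoint_name, instances, region_name="us-west-2"):
    """Call the SageMaker runtime endpoint and decode its JSON response body."""
    import boto3  # imported lazily so encode_instances works without boto3

    client = boto3.client("sagemaker-runtime", region_name=region_name)
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=encode_instances(instances),
    )
    return json.loads(response["Body"].read())


print(encode_instances([[1, 2]]))

# invoke_echo_endpoint("bobo-tf2-echo", [[1, 2]]) would call the live endpoint
# while the deployment exists.
```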

In [52]:
!bentoml sagemaker delete tf2-echo

Successfully deleted AWS Sagemaker deployment "tf2-echo"


Additional: Serve with tf-serving
----
BentoML's TensorFlow input adapter and artifact follow the request format of the TensorFlow Serving REST API, so the same JSON payload can be sent to either server.  
To install tensorflow-serving, see: https://www.tensorflow.org/tfx/serving/setup
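
In the TensorFlow Serving REST API, the `v1` path segment is the API version, while a specific model version is selected with an optional `/versions/<n>` segment. The helper below is a small sketch of that URL layout; `predict_url` is a hypothetical name, and the default host and port are assumptions matching the values used in this notebook.

```python
def predict_url(model_name, host="127.0.0.1", port=5001, model_version=None):
    """Build a TF Serving predict URL; omit model_version to hit the latest version."""
    base = f"http://{host}:{port}/v1/models/{model_name}"
    if model_version is not None:
        # Pin a specific servable version instead of the latest one.
        base += f"/versions/{model_version}"
    return f"{base}:predict"


print(predict_url("echo_model"))
print(predict_url("echo_model", model_version=2))
```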


In [28]:
TMP_MODEL_DIR = "/tmp/test-echo-model"
TMP_MODEL_VERSION = "1"
TMP_MODEL_DIR_V = f"{TMP_MODEL_DIR}/{TMP_MODEL_VERSION}"
MODEL_NAME = "echo_model"

tf.saved_model.save(custom_model, TMP_MODEL_DIR_V)
!tensorflow_model_server --rest_api_port=5001 --model_name={MODEL_NAME} --model_base_path={TMP_MODEL_DIR}

INFO:tensorflow:Assets written to: /tmp/test-echo-model/2/assets
2019-12-20 12:03:01.458521: I tensorflow_serving/model_servers/server.cc:85] Building single TensorFlow model file config:  model_name: echo_model model_base_path: /tmp/test-echo-model
2019-12-20 12:03:01.458658: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
2019-12-20 12:03:01.458673: I tensorflow_serving/model_servers/server_core.cc:573]  (Re-)adding model: echo_model
2019-12-20 12:03:01.559267: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: echo_model version: 2}
2019-12-20 12:03:01.559323: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: echo_model version: 2}
2019-12-20 12:03:01.559349: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: echo_model version: 2}
2019-12-20 12:03:01.559384: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading

In [32]:
import requests
import json

TMP_MODEL_DIR = "/tmp/test-echo-model"
TMP_MODEL_VERSION = "1"
TMP_MODEL_DIR_V = f"{TMP_MODEL_DIR}/{TMP_MODEL_VERSION}"
MODEL_NAME = "echo_model"
headers = {"content-type": "application/json"}
data = json.dumps(
    {"instances": [[1, 2, 2, 3], [2, 3, 3, 4]]}
)
print('Data: {} ... {}'.format(data[:50], data[len(data)-52:]))
# "v1" is the TF Serving REST API version, not the model version
json_response = requests.post(f'http://127.0.0.1:5001/v1/models/{MODEL_NAME}:predict',
                              data=data, headers=headers)
print(json_response)
print(json_response.text)


Data: {"instances": [[1, 2, 2, 3], [2, 3, 3, 4]]} ... , 3, 4]]}
<Response [200]>
{
    "predictions": [[1.0, 2.0, 2.0, 3.0], [2.0, 3.0, 3.0, 4.0]
    ]
}
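
The two servers return the same values in slightly different envelopes: the BentoML response earlier in this notebook is a bare JSON list, while TensorFlow Serving wraps the list in a `"predictions"` key. A minimal sketch of a client-side helper that accepts either shape (`extract_predictions` is a hypothetical name):

```python
import json


def extract_predictions(response_text):
    """Return the prediction list from either response envelope."""
    decoded = json.loads(response_text)
    if isinstance(decoded, dict) and "predictions" in decoded:
        # TensorFlow Serving style: {"predictions": [...]}
        return decoded["predictions"]
    # BentoML style in this notebook: the bare list
    return decoded


bentoml_text = "[[1.0, 2.0, 2.0, 3.0], [2.0, 3.0, 3.0, 4.0]]"
tf_serving_text = '{"predictions": [[1.0, 2.0, 2.0, 3.0], [2.0, 3.0, 3.0, 4.0]]}'

print(extract_predictions(bentoml_text) == extract_predictions(tf_serving_text))
```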
