# BentoML Example: Keras Toxic Comment Classification


[BentoML](http://bentoml.ai) is an open source platform for machine learning model serving and deployment. 

This notebook demonstrates how to use BentoML to turn a Keras model into a Docker image containing a REST API server that serves the model, how to use an ML service built with BentoML as a CLI tool, and how to distribute it as a PyPI package.


This notebook is based on: https://www.kaggle.com/sarvajna/keras-sequential-model-lb-0-052


In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
!pip install bentoml
!pip install keras==2.3.1 kaggle tensorflow==1.14.0 scikit-learn



In [3]:
import bentoml
import numpy as np
import pandas as pd
from keras.preprocessing import text, sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from sklearn.model_selection import train_test_split

In [4]:
list_of_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
max_features = 20000
max_text_length = 400
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
batch_size = 32
epochs = 2

## Prepare Dataset

Download the data from Kaggle at https://www.kaggle.com/sarvajna/keras-sequential-model-lb-0-052/data

If you are running this notebook in Google Colab, fill in your Kaggle credentials below to download the training dataset from Kaggle:

In [12]:
%%bash

export KAGGLE_USERNAME=
export KAGGLE_KEY=

if [ ! -f ./train.csv.zip ]; then
    kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
    unzip jigsaw-toxic-comment-classification-challenge.zip
    unzip train.csv.zip
    unzip sample_submission.csv.zip
    unzip test.csv.zip
    unzip test_labels.csv.zip
fi

In [13]:
train_df = pd.read_csv('./train.csv')

print(train_df.head())

                 id                                       comment_text  toxic  \
0  0000997932d777bf  Explanation\nWhy the edits made under my usern...      0   
1  000103f0d9cfb60f  D'aww! He matches this background colour I'm s...      0   
2  000113f07ec002fd  Hey man, I'm really not trying to edit war. It...      0   
3  0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...      0   
4  0001d958c54c6e35  You, sir, are my hero. Any chance you remember...      0   

   severe_toxic  obscene  threat  insult  identity_hate  
0             0        0       0       0              0  
1             0        0       0       0              0  
2             0        0       0       0              0  
3             0        0       0       0              0  
4             0        0       0       0              0  


In [14]:
x = train_df['comment_text'].values
print(x)

["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"
 "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)"
 "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info."
 ...
 'Spitzer \n\nUmm, theres no actual article for prostitution ring.  - Crunch Captain.'
 'And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it.'
 '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of community goes ""you have bad ideas"" go away, instead o

In [15]:
y = train_df[list_of_classes].values
print(y)

[[0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 ...
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]


In [16]:
x_tokenizer = text.Tokenizer(num_words=max_features)
print(x_tokenizer)
x_tokenizer.fit_on_texts(list(x))
print(x_tokenizer)
x_tokenized = x_tokenizer.texts_to_sequences(x)  # a list of integer sequences (list of lists), not a NumPy array
# pad_sequences: transform a list of num_samples sequences (lists of scalars) into a 2D NumPy array of shape (num_samples, max_text_length)
x_train_val = sequence.pad_sequences(x_tokenized, maxlen=max_text_length)

<keras_preprocessing.text.Tokenizer object at 0x7f1381625ac8>
<keras_preprocessing.text.Tokenizer object at 0x7f1381625ac8>
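
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does with its default settings, which pad and truncate at the front of each sequence. The helper name `pad_pre` and the toy sequences are illustrative assumptions, not part of the notebook.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # front truncation: keep the tail of the sequence
        padded.append([value] * (maxlen - len(seq)) + seq)  # front padding
    return padded

# Two toy token-id sequences: one shorter and one longer than maxlen.
example = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(example)
```

With `maxlen=max_text_length`, every comment therefore becomes a row of exactly 400 token ids, which is the fixed input length the embedding layer expects.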


In [17]:
x_train, x_val, y_train, y_val = train_test_split(x_train_val, y, test_size=0.1, random_state=1)

In [18]:
print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=max_text_length))
model.add(Dropout(0.2))

# we add a Conv1D layer, which will learn `filters`
# word-group filters of size kernel_size:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto a 6-unit output layer (one unit per class) and squash it with a sigmoid:
model.add(Dense(6))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

Build model...
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 50)           1000000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_2 (Dropout)          (None, 250)               0         
___________
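
The parameter counts in the summary above can be checked by hand from the hyperparameters. The sketch below redoes that arithmetic. The variable names ending in `_params` are illustrative, and the final 6-unit layer's count does not appear in the truncated summary, so it is derived here from the same dense-layer formula rather than read from the output.

```python
# Hyperparameters copied from the notebook's configuration cell.
max_features = 20000
max_text_length = 400
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
num_classes = 6

# Embedding: one vector of size embedding_dims per vocabulary index.
embedding_params = max_features * embedding_dims

# Conv1D: each filter spans kernel_size positions across all embedding channels, plus one bias.
conv_params = kernel_size * embedding_dims * filters + filters

# 'valid' padding with stride 1 shortens the sequence by kernel_size - 1.
conv_output_length = max_text_length - kernel_size + 1

# Dense layers: weights plus one bias per output unit.
hidden_params = filters * hidden_dims + hidden_dims
output_params = hidden_dims * num_classes + num_classes  # not shown in the truncated summary

print(embedding_params, conv_params, conv_output_length, hidden_params, output_params)
```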

In [19]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_val, y_val))


Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.callbacks.History at 0x7f1384ab7ef0>
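
The sample counts in the training log follow from the `test_size=0.1` split. A small sketch of that arithmetic, assuming scikit-learn rounds the held-out size up to the next whole sample and gives the remainder to training; the total row count is inferred from the log rather than read from the CSV.

```python
import math

n_samples = 143613 + 15958  # total rows implied by the training log
test_size = 0.1

n_val = math.ceil(test_size * n_samples)  # held-out rows, rounded up
n_train = n_samples - n_val               # everything else is used for training

print(n_train, n_val)
```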

In [20]:
test_df = pd.read_csv('./test.csv')

In [21]:
x_test = test_df['comment_text'].values

In [22]:
x_test_tokenized = x_tokenizer.texts_to_sequences(x_test)
x_testing = sequence.pad_sequences(x_test_tokenized, maxlen=max_text_length)

In [23]:
y_testing = model.predict(x_testing, verbose=1)



In [24]:
sample_submission = pd.read_csv("./sample_submission.csv")
sample_submission[list_of_classes] = y_testing
sample_submission.to_csv("toxic_comment_classification.csv", index=False)

## Create BentoService for model serving

In [25]:
%%writefile toxic_comment_classifier.py

from bentoml import api, artifacts, env, BentoService
from bentoml.artifact import PickleArtifact, KerasModelArtifact
from bentoml.adapters import DataframeInput

from keras.preprocessing import text, sequence
import numpy as np

list_of_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
max_text_length = 400

@env(pip_dependencies=['tensorflow==1.14.0', 'keras==2.3.1', 'pandas', 'numpy'])
@artifacts([PickleArtifact('x_tokenizer'), KerasModelArtifact('model')])
class ToxicCommentClassification(BentoService):
    
    def tokenize_df(self, df):
        comments = df['comment_text'].values
        tokenized = self.artifacts.x_tokenizer.texts_to_sequences(comments)        
        input_data = sequence.pad_sequences(tokenized, maxlen=max_text_length)
        return input_data
    
    @api(input=DataframeInput())
    def predict(self, df):
        input_data = self.tokenize_df(df)
        prediction = self.artifacts.model.predict(input_data)
        result = []
        for i in prediction:
            result.append(list_of_classes[np.argmax(i)])
        return result

Overwriting toxic_comment_classifier.py
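
Note that `predict` above returns only the single highest-scoring class via `np.argmax`, so every comment receives a label even when all six sigmoid scores are low. Because the model is trained as a multi-label classifier, one possible alternative is to return every class whose score passes a cutoff. The sketch below is not part of the original service: the helper name `decode_labels`, the 0.5 threshold, and the score values are all assumptions made up for illustration.

```python
list_of_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def decode_labels(prediction, threshold=0.5):
    """Return, for each row of scores, the list of classes scoring at or above `threshold`."""
    return [
        [label for label, score in zip(list_of_classes, row) if score >= threshold]
        for row in prediction
    ]

# Made-up sigmoid outputs: a benign comment and a clearly toxic one.
scores = [
    [0.02, 0.00, 0.01, 0.00, 0.01, 0.00],
    [0.97, 0.20, 0.88, 0.03, 0.76, 0.05],
]
decoded = decode_labels(scores)
print(decoded)
```

With this decoding a benign comment maps to an empty list instead of being forced into the `toxic` class. The threshold would need tuning on the validation set.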


## Save BentoService to file archive

In [26]:
# 1) import the custom BentoService defined above
from toxic_comment_classifier import ToxicCommentClassification

# 2) `pack` it with required artifacts
svc = ToxicCommentClassification()
svc.pack('x_tokenizer', x_tokenizer)
svc.pack('model', model)

# 3) save your BentoService
saved_path = svc.save()


[2020-08-04 15:56:14,668] INFO - Detect BentoML installed in development model, copying local BentoML module file to target saved bundle path
running sdist
running egg_info
writing BentoML.egg-info/PKG-INFO
writing dependency_links to BentoML.egg-info/dependency_links.txt
writing entry points to BentoML.egg-info/entry_points.txt
writing requirements to BentoML.egg-info/requires.txt
writing top-level names to BentoML.egg-info/top_level.txt
reading manifest file 'BentoML.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'


no previously-included directories found matching 'e2e_tests'
no previously-included directories found matching 'tests'
no previously-included directories found matching 'benchmark'


writing manifest file 'BentoML.egg-info/SOURCES.txt'
running check
creating BentoML-0.8.3+49.gdcc2e8b
creating BentoML-0.8.3+49.gdcc2e8b/BentoML.egg-info
creating BentoML-0.8.3+49.gdcc2e8b/bentoml
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/adapters
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/artifact
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/cli
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/clipper
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/configuration
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/configuration/__pycache__
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/handlers
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/marshal
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/saved_bundle
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/server
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/utils
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/yatai
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/yatai/client
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/yatai/deployment
creating BentoML-0.8.3+49.gdcc2e8b/bentoml/yatai/dep

## Load BentoService from archive

In [27]:
sample_test = test_df.iloc[40:42]
print(sample_test)
bento_service = bentoml.load(saved_path)

print(bento_service.predict(sample_test))

                  id                                       comment_text
40  0011cefc680993ba                      REDIRECT Talk:Mi Vida Eres Tú
41  0011ef6aa33d42e6  " \n I'm not convinced that he was blind. Wher...

['toxic', 'toxic']


In [None]:
!bentoml get ToxicCommentClassification

In [29]:
!bentoml get ToxicCommentClassification:latest 

[2020-08-04 16:08:02,608] INFO - Getting latest version ToxicCommentClassification:20200804155557_9AD242
{
  "name": "ToxicCommentClassification",
  "version": "20200804155557_9AD242",
  "uri": {
    "type": "LOCAL",
    "uri": "/home/bentoml/bentoml/repository/ToxicCommentClassification/20200804155557_9AD242"
  },
  "bentoServiceMetadata": {
    "name": "ToxicCommentClassification",
    "version": "20200804155557_9AD242",
    "createdAt": "2020-08-04T07:56:15.887774Z",
    "env": {
      "condaEnv": "name: bentoml-ToxicCommentClassification\nchannels:\n- defaults\ndependencies:\n- python=3.6.10\n- pip\n",
      "pipDependencies": "tensorflow==1.14.0\npandas\nbentoml==0.8.3\nkeras==2.3.1\nnumpy",
      "pythonVersion": "3.6.10",
      "dockerBaseImage": "bentoml/model-server:0.8.3"
    },
    "artifacts": [
      {
        "name": "x_tokenizer",
        "artifactType": "PickleArtifact"
      },
      {
        "name": "model",
        "artifactType": "KerasModelArtifact"
      }
 

In [30]:
!bentoml info ToxicCommentClassification:latest

[2020-08-04 16:08:14,398] INFO - Getting latest version ToxicCommentClassification:20200804155557_9AD242
{
  "name": "ToxicCommentClassification",
  "version": "20200804155557_9AD242",
  "created_at": "2020-08-04T07:56:14.442752Z",
  "env": {
    "conda_env": "name: bentoml-ToxicCommentClassification\nchannels:\n- defaults\ndependencies:\n- python=3.6.10\n- pip\n",
    "pip_dependencies": "tensorflow==1.14.0\npandas\nbentoml==0.8.3\nkeras==2.3.1\nnumpy",
    "python_version": "3.6.10",
    "docker_base_image": "bentoml/model-server:0.8.3"
  },
  "artifacts": [
    {
      "name": "x_tokenizer",
      "artifact_type": "PickleArtifact"
    },
    {
      "name": "model",
      "artifact_type": "KerasModelArtifact"
    }
  ],
  "apis": [
    {
      "name": "predict",
      "input_type": "DataframeInput",
      "docs": "BentoService inference API 'predict', input: 'DataframeInput', output: 'DefaultOutput'",
      "input_config": {
        "orient": null,
        "typ": "frame",
     

In [1]:
!bentoml run ToxicCommentClassification:latest predict --input '[{"comment_text": "bad terrible"}]'

[2020-08-04 16:16:37,636] INFO - Getting latest version ToxicCommentClassification:20200804155557_9AD242
Using TensorFlow backend.
2020-08-04 16:16:39.974511: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-08-04 16:16:39.995019: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-04 16:16:39.995436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
2020-08-04 16:16:39.995656: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-08-04 16:16:39.996963: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-0
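
The `--input` value in the command above is a JSON array with one object per row, keyed by the `comment_text` column that `tokenize_df` reads. As a minimal sketch, the same payload can be built from Python rather than typed by hand; the `records` and `payload` names are illustrative.

```python
import json

# One dict per comment; the key must match the column the service expects.
records = [{"comment_text": "bad terrible"}]

# Serialize to the JSON string passed to `--input` or sent as a request body.
payload = json.dumps(records)
print(payload)
```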

## Use BentoService as PyPI package

In [33]:
!pip install  --quiet {saved_path}

In [None]:
import ToxicCommentClassification

svc = ToxicCommentClassification.load()
result = svc.predict(sample_test)
result

## Deploy BentoService as REST API server to the cloud


BentoML supports deployment to multiple cloud services, such as AWS Lambda, AWS SageMaker, and Google Cloud Run. You can find the full list and deployment guides on the documentation site at https://docs.bentoml.org/en/latest/deployment/index.html

For this project, we are going to deploy to AWS SageMaker.

**Use `bentoml sagemaker deploy` to deploy the BentoService to AWS SageMaker**

In [None]:
!bentoml sagemaker deploy keras-toxic -b ToxicCommentClassification:latest \
    --api-name predict --verbose

`bentoml sagemaker list` displays all SageMaker deployments

In [None]:
!bentoml sagemaker list

`bentoml sagemaker get` retrieves the latest status of a SageMaker deployment

In [None]:
!bentoml sagemaker get keras-toxic

Validate and test the SageMaker deployment with sample data

In [None]:
!aws sagemaker-runtime invoke-endpoint --endpoint-name bobo-keras-toxic \
--body '[{"comment_text": "bad terrible"}]' --content-type application/json output.json && cat output.json

`bentoml sagemaker delete` removes the SageMaker deployment and related resources

In [None]:
!bentoml sagemaker delete keras-toxic