# Hands-on: Deploying Question Answering with BERT

Pre-trained language representations have been shown to improve many downstream NLP tasks, such as question answering and natural language inference. Devlin et al. proposed BERT [1] (Bidirectional Encoder Representations from Transformers), which fine-tunes deep bidirectional representations on a wide range of tasks with minimal task-specific parameters and obtains state-of-the-art results.

After training the QA model with BERT (see the previous notebook, "QA_Training.ipynb"), let us load a trained model and run inference on the SQuAD dataset.

### A quick overview: an example from the SQuAD dataset looks like this:

    (2, 
    '56be4db0acb8001400a502ee', 
    'Where did Super Bowl 50 take place?', 

    'Super Bowl 50 was an American football game to determine the champion of the National 
    Football League (NFL) for the 2015 season. The American Football Conference (AFC) 
    champion Denver Broncos defeated the National Football Conference (NFC) champion 
    Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played 
    on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, 
    California. As this was the 50th Super Bowl, the league emphasized the "golden 
    anniversary" with various gold-themed initiatives, as well as temporarily suspending 
    the tradition of naming each Super Bowl game with Roman numerals (under which the 
    game would have been known as "Super Bowl L"), so that the logo could prominently 
    feature the Arabic numerals 50.', 

    ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium 
    in the San Francisco Bay Area at Santa Clara, California."], 

    [403, 355, 355])
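
To make the layout of such a record concrete, here is a minimal sketch that labels the six positions and checks that each start offset is a character index into the context. The field names and the shortened context are assumptions made for illustration; they are not part of the dataset API.

```python
from collections import namedtuple

# Hypothetical labels for the six positions of the tuple shown above.
SquadRecord = namedtuple(
    "SquadRecord",
    ["index", "question_id", "question", "context", "answers", "answer_starts"],
)

# A shortened, made-up context; the offsets are computed here, not copied from SQuAD.
context = "The game was played at Levi's Stadium in Santa Clara, California."
answers = ["Santa Clara, California", "Levi's Stadium"]

record = SquadRecord(
    index=0,
    question_id="example-id",
    question="Where was the game played?",
    context=context,
    answers=answers,
    answer_starts=[context.index(a) for a in answers],
)

for answer, start in zip(record.answers, record.answer_starts):
    # Each start offset points at the first character of its answer text.
    assert record.context[start:start + len(answer)] == answer
    print(start, answer)
```

In the real record above, the second and third answers share the offset 355 because both begin at "Levi's Stadium".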

## Deploy on SageMaker

1. Preparing functions for inference 
2. Saving the model parameters
3. Building a docker container with dependencies installed
4. Launching a serving endpoint with SageMaker SDK

### 1. Preparing functions for inference

We need two functions:
1. ```model_fn``` to load model parameters
2. ```transform_fn``` to run model inference given an input
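
Before looking at the real implementation, the division of labour between the two functions can be sketched with stand-ins. This is only an illustration under the assumption that the loader runs once at startup and the transform runs once per request with a JSON string as input; `toy_model_fn`, `toy_transform_fn`, and the echo "prediction" are hypothetical and do not call BERT.

```python
import json


def toy_model_fn(model_dir):
    # Stand-in loader: a real loader would read parameters from model_dir.
    return {"model_dir": model_dir}


def toy_transform_fn(model, input_data, input_content_type=None, output_content_type=None):
    # The request body arrives as a string; tuples become lists after JSON decoding.
    examples = json.loads(input_data)
    # Placeholder "prediction": echo each question with the length of its context.
    body = {str(i): [question, len(context)] for i, (question, context) in enumerate(examples)}
    return json.dumps(body), output_content_type


model = toy_model_fn("/tmp/model")                   # once, when the service starts
payload = json.dumps((("Q1?", "some context"),))     # what a JSON-serializing client sends
response, content_type = toy_transform_fn(model, payload, "application/json", "application/json")
print(response)
```

The same shape (load once, then decode, predict, and encode per request) is what the functions in the next cell follow.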

In [1]:
# %%writefile code/serve.py
import collections, json, logging, warnings
import multiprocessing as mp
from functools import partial

import gluonnlp as nlp
import mxnet as mx
from mxnet.gluon import Block, nn
from bert.data.qa import preprocess_dataset, SQuADTransform
import bert_qa_evaluate


def model_fn(params_path, model_dir = ""):
    """
    Load the gluon model. Called once when hosting service starts.
    
    Parameters
    ----------
        params_path: filename of trained BERT model weights, 
            e.g., params_path = "bert_qa-7eb11865.params"
        model_dir: The directory where model files are stored

    Returns
    -------
        net: a Gluon model,
        vocab: the BERT vocabulary,
        transform: a SQuADTransform
    """
    bert_model, vocab = nlp.model.get_model('bert_12_768_12',
                                        dataset_name='book_corpus_wiki_en_uncased',
                                        use_classifier=False,
                                        use_decoder=False,
                                        use_pooler=False,
                                        pretrained=False)
    net = bert_qa_evaluate.BertForQA(bert_model)
    if len(model_dir) > 0:
        params_path = model_dir + "/" +params_path
    net.load_parameters(params_path, ctx=mx.cpu())
    
    tokenizer = nlp.data.BERTTokenizer(vocab,  lower=True)
    transform = SQuADTransform(tokenizer, is_pad=False, is_training=False, do_lookup=False)
    return net, vocab, transform


def transform_fn(model, input_data, input_content_type=None, output_content_type=None):
    """
    Transform a request using the Gluon model. Called once per request.
    
    Parameters
    ----------
        model: The Gluon model from model_fn()
        input_data: The input data, will be a list(tuples) here
            Example:
            ## (example_id, [question, content], ques_cont_token_types, valid_length, _, _)

        input_content_type: The request content type, assume json
        output_content_type: The (desired) response content type, assume json
    
    Returns
    -------
        response payload and output content type.
    """
    net, vocab, squadTransform = model
    data = json.loads(input_data)
    test_examples_tuples = bert_qa_evaluate._test_example_transform(data)
    test_dataset = mx.gluon.data.SimpleDataset(test_examples_tuples)
    all_results = bert_qa_evaluate.get_all_results(net, vocab, squadTransform, test_dataset, ctx=mx.cpu())
    all_predictions = collections.defaultdict(list)
    data_transform = test_dataset.transform(squadTransform._transform)
    for features in data_transform:
        f_id = features[0].example_id
        results = all_results[f_id]
        prediction, nbest = bert_qa_evaluate.predict(
            features=features,
            results=results,
            tokenizer=nlp.data.BERTBasicTokenizer(vocab))        
        nbest_prediction = []
        # Keep at most the three best candidates, formatted as "probability \t text".
        for entry in nbest[:3]:
            nbest_prediction.append('%.2f%% \t %s' % (entry[1] * 100, entry[0]))
        all_predictions[f_id] = nbest_prediction
    response_body = json.dumps(all_predictions)
    return response_body, output_content_type


### 2. Saving the model parameters

We are going to package the BERT model parameters, the vocabulary file, and all the inference files (```code/serve.py```, ```bert/data/qa.py```, ```bert_qa_evaluate.py```) into a ```model.tar.gz``` file. (Note that ```serve.py``` is the "entry_point" that SageMaker uses for inference, and it needs to be under the ```code/``` directory.)

In [2]:
import tarfile
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("code/serve.py")
    tar.add("bert/data/qa.py")
    tar.add("bert_qa_evaluate.py")
    tar.add("bert_qa-7eb11865.params")
    tar.add("vocab.json")
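
A quick way to confirm the archive layout is to list its members after writing it. The sketch below does this in a temporary directory with placeholder files, so it runs without the real parameters; the file contents are dummies and only the paths mirror the cell above.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Placeholder files that mimic the layout used above (contents are dummies).
    names = ["code/serve.py", "bert/data/qa.py", "bert_qa_evaluate.py"]
    for name in names:
        path = os.path.join(workdir, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write("# placeholder\n")

    archive = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for name in names:
            # arcname keeps the paths relative, so serve.py stays under code/.
            tar.add(os.path.join(workdir, name), arcname=name)

    # Re-open the archive and list what was actually stored.
    with tarfile.open(archive, "r:gz") as tar:
        members = tar.getnames()

print(members)
assert "code/serve.py" in members
```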

### 3. Building a docker container with dependencies installed

Let's prepare a Docker container with all the dependencies required for model inference. Here we build a Docker image based on the SageMaker MXNet inference container. You can find the list of all available inference containers at https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html

Here we use local mode for demonstration purposes. To deploy on actual instances, you need to log in to the Amazon Elastic Container Registry (ECR) service and push the container image to ECR.

```
docker build -t $YOUR_ECR_DOCKER_TAG . -f Dockerfile
$(aws ecr get-login --no-include-email --region $YOUR_REGION)
docker push $YOUR_ECR_DOCKER_TAG
```
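
An image pushed to ECR is addressed by a registry URI of the same shape as the base image in the Dockerfile's `FROM` line. The helper below is a small sketch of that naming scheme; the function name, account ID, and repository name are made up for illustration.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    # <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Hypothetical account and repository; substitute your own values.
uri = ecr_image_uri("123456789012", "us-west-2", "my-docker", "inference")
print(uri)
```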

In [3]:
%%writefile Dockerfile

ARG REGION
FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/mxnet-inference:1.6.0-gpu-py3

RUN pip install --upgrade --user --pre 'mxnet-mkl' 'https://github.com/dmlc/gluon-nlp/tarball/v0.9.x'
RUN pip list | grep mxnet

Overwriting Dockerfile


In [4]:
!export REGION=$(wget -qO- http://169.254.169.254/latest/meta-data/placement/availability-zone) &&\
 docker build --no-cache --build-arg REGION=${REGION::-1} -t my-docker:inference . -f Dockerfile

Sending build context to Docker daemon  844.4MB
Step 1/4 : ARG REGION
Step 2/4 : FROM 763104351884.dkr.ecr.$REGION.amazonaws.com/mxnet-inference:1.6.0-gpu-py3
 ---> f9c75f96f647
Step 3/4 : RUN pip install --upgrade --user --pre 'mxnet-mkl' 'https://github.com/dmlc/gluon-nlp/tarball/v0.9.x'
 ---> Running in 517c633ca1c2
Collecting https://github.com/dmlc/gluon-nlp/tarball/v0.9.x
  Downloading https://github.com/dmlc/gluon-nlp/tarball/v0.9.x
Collecting mxnet-mkl
  Downloading https://files.pythonhosted.org/packages/64/72/c5566aabde6ee0bda1f09d026603169a717dbd9f26f6be85ee2b4ed2cf03/mxnet_mkl-1.6.0b20191025-py2.py3-none-manylinux1_x86_64.whl (64.9MB)
Installing collected packages: mxnet-mkl, gluonnlp
    Running setup.py install for gluonnlp: started
    Running setup.py install for gluonnlp: finished with status 'done'
Successfully installed gluonnlp-0.9.0.dev0 mxnet-mkl-1.6.0b20191025
Removing intermediate container 517c633ca1c2
 ---> b4b705e287b7
Step 4/4 : RUN pip list | grep mxnet
 --
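
The build command above derives the region from the instance's availability zone with `${REGION::-1}`, which drops the last character. The same idea in Python, as a sketch that assumes the zone name is the region name followed by a single letter (the function name is hypothetical):

```python
def region_from_az(availability_zone: str) -> str:
    # Mirrors ${REGION::-1}: drop the trailing zone letter.
    return availability_zone[:-1]


print(region_from_az("us-west-2a"))
```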

### 4. Launching a serving endpoint with SageMaker SDK

We create an MXNet model that can be deployed later by specifying the Docker image and the entry point for the inference code. If ```serve.py``` does not work, use ```dummy_hosting_module.py``` for debugging.

In [5]:
import sagemaker
from sagemaker.mxnet.model import MXNetModel
sagemaker_model = MXNetModel(model_data='file:///home/ec2-user/SageMaker/ako2020-bert/tutorial/model.tar.gz',
                             image='my-docker:inference', # docker images
                             role=sagemaker.get_execution_role(), 
                             py_version='py3',            # python version
                             entry_point='serve.py',
                             source_dir='.')

We use 'local' mode to test our deployment code, where inference happens on the current instance.
When you are ready to deploy the model on a real instance, change the `instance_type` argument to a value such as `ml.c4.xlarge`, `ml.c5.2xlarge`, or `ml.p2.xlarge`.

**The following line starts the Docker container.**
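
One way to keep the switch between local testing and a hosted instance in a single place is a small helper that builds the keyword arguments for `deploy`. This is only a sketch; `deploy_kwargs` is a hypothetical name, and the default instance type is just the example mentioned above.

```python
def deploy_kwargs(use_local: bool = True, instance_type: str = "ml.c4.xlarge") -> dict:
    # 'local' runs the container on the current machine; otherwise use a hosted instance type.
    return {
        "initial_instance_count": 1,
        "instance_type": "local" if use_local else instance_type,
    }


print(deploy_kwargs())
print(deploy_kwargs(use_local=False))
# Intended use (not executed here): sagemaker_model.deploy(**deploy_kwargs(use_local=False))
```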

In [6]:
predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='local')

Attaching to tmpmil27owm_algo-1-k003q_1
algo-1-k003q_1  | 2020-01-15 07:35:17,400 [INFO ] main com.amazonaws.ml.mms.ModelServer -
algo-1-k003q_1  | MMS Home: /usr/local/lib/python3.6/site-packages
algo-1-k003q_1  | Current directory: /
algo-1-k003q_1  | Temp directory: /home/model-server/tmp
algo-1-k003q_1  | Number of GPUs: 0
algo-1-k003q_1  | Number of CPUs: 8
algo-1-k003q_1  | Max heap size: 13646 M
algo-1-k003q_1  | Python executable: /usr/local/bin/python3.6
algo-1-k003q_1  | Config file: /etc/sagemaker-mms.properties
algo-1-k003q_1  | Inference address: http://0.0.0.0:8080
algo-1-k003q_1  | Management address: http://0.0.0.0:8080
algo-1-k003q_1  | Model Store: /.sagemaker/mms/models
algo-1-k003q_1  | Initial Models: ALL
algo-1-k003q_1  | Log dir: /logs
algo-1-k003q_1  | Metrics dir: /logs
algo-1-k003q_1  | Netty threads: 0
algo-1-k0

algo-1-k003q_1  | 2020-01-15 07:35:20,812 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3029
algo-1-k003q_1  | 2020-01-15 07:35:20,821 [INFO ] W-9004-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3042
algo-1-k003q_1  | 2020-01-15 07:35:20,838 [INFO ] W-9006-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3035
algo-1-k003q_1  | 2020-01-15 07:35:20,843 [INFO ] W-9003-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3060
algo-1-k003q_1  | 2020-01-15 07:35:20,849 [INFO ] W-9002-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3057
algo-1-k003q_1  | 2020-01-15 07:35:20,879 [INFO ] W-9001-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3088
algo-1-k003q_1  | 2020-01-15 07:35:20,910 [INFO ] W-9005-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3120
algo-1-k003q_1

Now let us submit an inference request. Here we simply take two examples from the SQuAD dataset and pass them to our predictor by calling ```predictor.predict```.

In [7]:
## test
my_test_example_0 = ('Which NFL team represented the AFC at Super Bowl 50?',
 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.')

my_test_example_1 = ('Where did Super Bowl 50 take place?',
 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.')

my_test_examples = (my_test_example_0, my_test_example_1)

# mymodel = model_fn(params_path = "bert_qa-7eb11865.params")
# transform_fn(mymodel, my_test_examples)
output = predictor.predict(my_test_examples)

algo-1-k003q_1  | 2020-01-15 07:35:22,535 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Vocab file is not found. Downloading.
algo-1-k003q_1  | 2020-01-15 07:35:22,536 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Downloading /root/.mxnet/models/1579073722.5359378book_corpus_wiki_en_uncased-a6607397.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/book_corpus_wiki_en_uncased-a6607397.zip...
algo-1-k003q_1  | 2020-01-15 07:35:23,904 [INFO ] W-9007-model-stdout com.amazonaws.ml.mms.wlm.WorkerLifeCycle - Done! Transform dataset costs 0.56 seconds.
algo-1-k003q_1  | 2020-01-15 07:35:25,676 [INFO ] W-9007-model com.amazonaws.ml.mms.wlm.WorkerThread - Backend response time: 3175
algo-1-k003q_1  | 2020-01-15 07:35:25,677 [INFO ] W-9007-model ACCESS_LOG - /172.18.0.1:34338 "POST /invocations HTTP/1.1" 200 3179


In [8]:
print("\nPrediction output: \n\n")

for k in output.keys():
    print('{}\n\n'.format(output[k]))


Prediction output: 


['99.36% \t Denver Broncos', '0.23% \t The American Football Conference (AFC) champion Denver Broncos', '0.20% \t Broncos']


["25.86% \t Levi's Stadium in the San Francisco Bay Area at Santa Clara, California", "23.11% \t Levi's Stadium", '17.88% \t San Francisco Bay Area at Santa Clara, California']
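
Each prediction is returned as a string of the form `'<probability>% \t <answer>'`. If the probabilities are needed as numbers, the strings can be split back apart. The sketch below assumes exactly that format; `parse_prediction` is a hypothetical helper, and the sample list is copied from the first output above.

```python
def parse_prediction(line: str):
    # Split "99.36% \t Denver Broncos" into (0.9936, "Denver Broncos").
    prob_text, answer = line.split("\t", 1)
    return float(prob_text.strip().rstrip("%")) / 100.0, answer.strip()


sample = [
    '99.36% \t Denver Broncos',
    '0.23% \t The American Football Conference (AFC) champion Denver Broncos',
    '0.20% \t Broncos',
]

parsed = [parse_prediction(s) for s in sample]
# Pick the candidate with the highest probability.
best_prob, best_answer = max(parsed, key=lambda item: item[0])
print(best_answer, best_prob)
```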




### Clean Up

Remove the endpoint after we are done. 

In [9]:
predictor.delete_endpoint()

Gracefully stopping... (press Ctrl+C again to force)
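
To make sure the endpoint is removed even when a prediction raises an error, the cleanup call can be wrapped in a context manager. The following is a sketch of that pattern only: `temporary_endpoint` and `FakePredictor` are hypothetical, and the fake object stands in for the real predictor so the example runs without SageMaker.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(predictor):
    # Hand the predictor to the caller and always delete the endpoint afterwards.
    try:
        yield predictor
    finally:
        predictor.delete_endpoint()


class FakePredictor:
    """Stand-in with the two methods used in this notebook."""

    def __init__(self):
        self.deleted = False

    def predict(self, data):
        return {"echo": data}

    def delete_endpoint(self):
        self.deleted = True


fake = FakePredictor()
with temporary_endpoint(fake) as p:
    result = p.predict(["question", "context"])

print(result, fake.deleted)
```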
