# Inferencing Iris XGBoost PMML Model using AI-Serving

PMML stands for Predictive Model Markup Language. It is the de facto standard for representing classic machine learning models. With PMML, it is easy to develop a model on one system using one application and deploy it on another system using another application.

In this tutorial, we will use PMML to show how to deploy the famous Iris classifier using AI-Serving.

## Prerequisites to run the notebook
#### 1. Train and export your model to PMML.

In this example, we're going to use XGBoost to train our classifier. Once trained, we will convert our model to PMML that will be deployed in later steps. You can train, convert, and validate your model by simply running this notebook [`Training an Iris classifier using XGBoost`](IrisXGBoost.ipynb). Once your model is validated, you can deploy it.

#### 2. Download the AI-Serving image.

Pull the latest Docker image of AI-Serving. Refer to [Docker Containers for AI-Serving](https://github.com/autodeployai/ai-serving/tree/master/dockerfiles) for more Docker images.

```bash
docker pull autodeployai/ai-serving
```

#### 3. Start the docker image.

Run a Docker container of AI-Serving. Port `9090` serves the HTTP endpoint and port `9091` serves gRPC. If you see an error like `Bind for 0.0.0.0:9090 failed: port is already allocated`, replace the first part of the mapping with an unused port, as in `-p ${NEW_PORT}:9090`, and run the command again. Remember that the new port must then be used in the URL of the HTTP endpoint.

```bash
docker run --rm -it -v $(pwd):/opt/ai-serving -p 9090:9090 -p 9091:9091 autodeployai/ai-serving
```
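
Before choosing a replacement for `9090`, it can help to check which host ports are actually free. The following is a minimal sketch that uses only Python's standard `socket` module. `is_port_free` and `first_free_port` are hypothetical helper names, not part of AI-Serving or Docker, and the result only reflects the moment the check runs, so another process could still claim the port before the container starts.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if `port` can be bound on `host` right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
        return True


def first_free_port(start=9090, attempts=20, host="0.0.0.0"):
    """Return the first free port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if is_port_free(port, host):
            return port
    return None
```

The value returned by `first_free_port()` could then be used as the host side of the `-p` mapping and as `port` in the base URL defined later.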

## Additional information about two python files
In the current directory, there are two Python files, `onnx_ml_pb2.py` and `ai_serving_pb2.py`. They were generated by compiling the [two proto files](https://github.com/autodeployai/ai-serving/tree/master/src/main/protobuf) with [protoc](https://developers.google.com/protocol-buffers/docs/pythontutorial), using a command such as the following:

```bash
protoc -I=$SRC_DIR --python_out=. ai-serving.proto onnx-ml.proto
```

## Import dependent libraries
Import some dependent libraries that we are going to need to run the Iris XGBoost model.

In [None]:
import numpy as np
import requests
from pprint import pprint

import onnx_ml_pb2
import ai_serving_pb2

## Define the base HTTP URL
Change the port number `9090` to the appropriate port number if you changed it when starting the AI-Serving Docker container.

In [None]:
port = 9090
base_url = 'http://localhost:' + str(port)

## Test the server availability
Use the specific endpoint `http://host:port/up` to test whether the server has been initialized and is ready to accept requests. The `OK` message indicates it's already available.

In [None]:
test_url = base_url + '/up'
response = requests.get(test_url)
print('The status of the server: ', response.text)
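
A freshly started container may need a moment before `/up` responds. The sketch below polls the endpoint until it answers or a deadline passes. `wait_until_up` is a hypothetical helper, and treating HTTP 200 as "ready" is an assumption (the cell above only inspects the `OK` text). The `get` parameter exists only so a stub can be injected; by default `requests.get` is used.

```python
import time


def wait_until_up(base_url, timeout=60.0, interval=1.0, get=None):
    """Poll `<base_url>/up` until it returns HTTP 200 or `timeout` seconds pass."""
    if get is None:
        import requests  # imported lazily so a stub can replace it
        get = requests.get
    deadline = time.monotonic() + timeout
    while True:
        try:
            response = get(base_url + '/up', timeout=5)
            if response.status_code == 200:  # assumption: 200 means ready
                return True
        except OSError:
            # requests' exceptions derive from IOError/OSError,
            # so a refused connection lands here and we retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_until_up(base_url, timeout=30)` could be called once before the deployment step.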

## Deploy the PMML model into AI-Serving
First, we need to deploy the PMML model `xgb-iris.pmml` into AI-Serving, which can serve multiple models or multiple versions for a named model at once.

You must specify a correct content type for PMML models when constructing the HTTP request that deploys the model. The candidates are:
 * application/xml
 * text/xml

In [None]:
# The specified servable name
model_name = 'iris'
deployment_url = base_url + '/v1/models/' + model_name

# The specified content type for the model:
headers = {'Content-Type': 'application/xml'}

model_path = 'xgb-iris.pmml'
with open(model_path, 'rb') as file:
    deployment_response = requests.put(deployment_url, headers=headers, data=file)

# The response is a JSON object containing the specified servable name and the deployed model version
deployment_response_info = deployment_response.json()
print('The deployment response: ', deployment_response_info)
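
The deployment cell assumes the request succeeds. As a hedged sketch, the same steps can be wrapped in a function that rejects content types other than the two listed above and raises on an HTTP error before the JSON body is read. `deploy_pmml` and `PMML_CONTENT_TYPES` are hypothetical names, and the `put` parameter exists only so a stub can be injected; by default `requests.put` is used.

```python
PMML_CONTENT_TYPES = ('application/xml', 'text/xml')


def deploy_pmml(base_url, model_name, model_path,
                content_type='application/xml', put=None):
    """Upload a PMML file and return the parsed JSON deployment response."""
    if content_type not in PMML_CONTENT_TYPES:
        raise ValueError('Unsupported PMML content type: ' + content_type)
    if put is None:
        import requests  # imported lazily so a stub can replace it
        put = requests.put
    url = base_url + '/v1/models/' + model_name
    with open(model_path, 'rb') as file:
        response = put(url, headers={'Content-Type': content_type}, data=file)
    # Turn an HTTP 4xx/5xx answer into an exception instead of parsing it.
    response.raise_for_status()
    return response.json()
```

With the variables defined above, `deploy_pmml(base_url, model_name, model_path)` would return the same dictionary as `deployment_response.json()`.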

## Retrieve metadata of the deployed model
The metadata contains the model inputs and outputs, which are needed when constructing an input request and consuming an output response.

In [None]:
model_version = deployment_response_info['version']
metadata_url = base_url + '/v1/models/' + model_name + '/versions/' + str(model_version)
metadata_response = requests.get(metadata_url)

print('The model metadata response:\n')
pprint(metadata_response.json())

## HTTP request formats for the AI-Serving
A request to AI-Serving can use one of two formats: JSON or binary. The HTTP header Content-Type tells the server which format to handle, so it is required for all requests. The binary payload has lower latency, especially for large tensor values with ONNX models, while the JSON format is easier for humans to read.

- Content-Type: application/json. The request body must be a JSON object formatted as described [here](https://github.com/autodeployai/ai-serving#4-predict-api).


- Content-Type: application/octet-stream, application/vnd.google.protobuf or application/x-protobuf. The request body must be the protobuf message PredictRequest. Besides the common scalar values, it can carry standard TensorProto values directly.
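
The two formats map onto two different ways of calling `requests.post`. The sketch below picks the keyword arguments from the payload type: a plain `dict` is sent as JSON, and anything with a `SerializeToString` method (as generated protobuf messages have) is sent as binary. `request_kwargs` and `BINARY_CONTENT_TYPES` are hypothetical names introduced for illustration.

```python
BINARY_CONTENT_TYPES = (
    'application/octet-stream',
    'application/vnd.google.protobuf',
    'application/x-protobuf',
)


def request_kwargs(payload, binary_content_type='application/x-protobuf'):
    """Return keyword arguments for requests.post that match the payload type."""
    if binary_content_type not in BINARY_CONTENT_TYPES:
        raise ValueError('Unsupported binary content type: ' + binary_content_type)
    if isinstance(payload, dict):
        # requests serializes the dict and sets Content-Type: application/json.
        return {'json': payload}
    if hasattr(payload, 'SerializeToString'):
        return {
            'headers': {'Content-Type': binary_content_type},
            'data': payload.SerializeToString(),
        }
    raise TypeError('Expected a dict or a protobuf message')
```

A call would then look like `requests.post(url, **request_kwargs(payload))` for either kind of payload.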

## Construct JSON requests for the AI-Serving
We will create two JSON objects: one uses the `records` format and contains a single record, the other uses the `split` format and contains two records.

In [None]:
# Create a JSON object using `records` that contains a single record.
request_json_recoreds = {
    'X': [{
        'sepal length (cm)': 5.7,
        'sepal width (cm)': 4.4,
        'petal length (cm)': 1.5,
        'petal width (cm)': 0.4
    }]
}

# Create a JSON object using `split` that contains two records, with a filter so that
# only the output `predicted_species` is returned.
request_json_split = {
    'X': {
        'columns': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
        'data': [[5.7, 4.4, 1.5, 0.4], [6.4, 2.8, 5.6, 2.1]]
    },
    'filter': ['predicted_species']
}
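
The `records` and `split` layouts carry the same information, so one can be derived from the other. The helpers below are a small sketch of that conversion in plain Python; `records_to_split` and `split_to_records` are hypothetical names, and the column order of the `split` form is taken from the first record.

```python
def records_to_split(records):
    """Convert a list of dicts into {'columns': [...], 'data': [[...], ...]}."""
    if not records:
        return {'columns': [], 'data': []}
    columns = list(records[0])
    data = [[record[column] for column in columns] for record in records]
    return {'columns': columns, 'data': data}


def split_to_records(split):
    """Convert {'columns': [...], 'data': [[...], ...]} into a list of dicts."""
    return [dict(zip(split['columns'], row)) for row in split['data']]


# The two flowers used in this notebook, written as records.
rows = [
    {'sepal length (cm)': 5.7, 'sepal width (cm)': 4.4,
     'petal length (cm)': 1.5, 'petal width (cm)': 0.4},
    {'sepal length (cm)': 6.4, 'sepal width (cm)': 2.8,
     'petal length (cm)': 5.6, 'petal width (cm)': 2.1},
]
request_json = {'X': records_to_split(rows), 'filter': ['predicted_species']}
```

Here `request_json` has the same content as `request_json_split` above.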

## Make the HTTP requests with JSON data to the AI-Serving
Make predictions using AI-Serving. The content type of requests with JSON data must be `application/json`.

In [None]:
# When version is omitted, the latest version is used.
prediction_url = base_url + '/v1/models/' + model_name

# requests sets Content-Type: application/json automatically when `json` is used instead of `data`
prediction_json_response_records = requests.post(prediction_url, json=request_json_recoreds)
prediction_json_response_split = requests.post(prediction_url, json=request_json_split)
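
The cell above relies on the latest version being used when none is given. To pin a specific version instead, a URL can be built as in the sketch below. `prediction_url_for` is a hypothetical helper, and it assumes that prediction accepts the same `/versions/<version>` path segment that this notebook uses when retrieving metadata.

```python
def prediction_url_for(base_url, model_name, version=None):
    """Build the prediction URL, optionally pinned to one model version."""
    url = base_url + '/v1/models/' + model_name
    if version is not None:
        # Assumption: same path layout as the metadata endpoint.
        url += '/versions/' + str(version)
    return url
```

For example, `requests.post(prediction_url_for(base_url, model_name, model_version), json=request_json_split)` would target the version returned by the deployment step.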

## Consume the HTTP response with JSON data from the AI-Serving
Having received the results from the server, we parse the returned JSON text to make sense of them.

**NOTE: The data format of the output response is always the same as the input request.**

In [None]:
print('JSON prediction response of the request using `records`:')
pprint(prediction_json_response_records.json())

print('\n----------------------------------------------------------------------------\n')

# Only the prediction column `predicted_species` is expected.
print('JSON prediction response of the request using `split` with a filter:')
pprint(prediction_json_response_split.json())
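
Because the response mirrors the request layout, downstream code has to handle both shapes. The sketch below normalizes either one into a list of dictionaries. `result_rows` is a hypothetical helper, and the top-level key name `result` is an assumption about the response body; check it against the output printed above and pass a different `key` if needed.

```python
def result_rows(response_json, key='result'):
    """Return the prediction results as a list of dicts, one per input record."""
    payload = response_json[key]  # assumption: results live under this key
    if isinstance(payload, dict):
        # `split` layout: column names plus one list of values per record.
        return [dict(zip(payload['columns'], row)) for row in payload['data']]
    # `records` layout: already a list of dicts.
    return list(payload)
```

With this, `result_rows(prediction_json_response_split.json())` and `result_rows(prediction_json_response_records.json())` could be consumed by the same loop.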

## Construct binary requests for the AI-Serving
We will create two instances of PredictRequest: one uses the `records` format and contains a single record, the other uses the `split` format and contains two records.

In [None]:
from ai_serving_pb2 import RecordSpec, Record, PredictRequest, ListValue, Value

# Create a PredictRequest whose RecordSpec uses `records` and contains a single record.
request_message_records = PredictRequest(X=RecordSpec(
    records=[Record(fields={
        'sepal length (cm)': Value(number_value=5.7),
        'sepal width (cm)': Value(number_value=4.4),
        'petal length (cm)': Value(number_value=1.5),
        'petal width (cm)': Value(number_value=0.4),
    })]))

# Create a PredictRequest whose RecordSpec uses `split` and contains two records, with a filter so that
# only the output `predicted_species` is returned.
request_message_split = PredictRequest(
    X=RecordSpec(
        columns = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
        data = [
            ListValue(values=[Value(number_value=5.7), Value(number_value=4.4), Value(number_value=1.5), Value(number_value=0.4)]),
            ListValue(values=[Value(number_value=6.4), Value(number_value=2.8), Value(number_value=5.6), Value(number_value=2.1)])]),
    filter=['predicted_species'])
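
Writing out every `Value(number_value=...)` by hand becomes tedious for more than a few rows. The sketch below builds the same `split` message from plain Python lists. `build_split_request` is a hypothetical helper that takes the generated module (`ai_serving_pb2`) as its first argument and uses only the constructors shown above. It assumes every cell is numeric, as in the Iris data.

```python
def build_split_request(pb, columns, rows, output_filter=()):
    """Build a `split`-style PredictRequest from numeric rows.

    `pb` is the generated protobuf module, e.g. `ai_serving_pb2`.
    """
    data = [
        pb.ListValue(values=[pb.Value(number_value=float(cell)) for cell in row])
        for row in rows
    ]
    return pb.PredictRequest(
        X=pb.RecordSpec(columns=list(columns), data=data),
        filter=list(output_filter),
    )
```

For instance, `build_split_request(ai_serving_pb2, columns, [[5.7, 4.4, 1.5, 0.4], [6.4, 2.8, 5.6, 2.1]], ['predicted_species'])`, with `columns` holding the four feature names, would produce a message equivalent to `request_message_split`.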

## Make the HTTP requests with binary data to the AI-Serving
Make predictions using AI-Serving. The content type of requests with binary data must be one of the three binary content types listed above.

In [None]:
headers = {'Content-Type': 'application/x-protobuf'}

# When version is omitted, the latest version is used.
prediction_url = base_url + '/v1/models/' + model_name

# Make prediction for the `records` request message.
prediction_response_records = requests.post(prediction_url, 
                                           headers=headers, 
                                           data=request_message_records.SerializeToString())

# Make prediction for the `split` request message.
prediction_response_split = requests.post(prediction_url, 
                                           headers=headers, 
                                           data=request_message_split.SerializeToString())

## Consume the HTTP response with binary data from the AI-Serving
Having received the results from the server, we parse the serialized message to make sense of them.

**NOTE: The data format of the output response is always the same as the input request.**

In [None]:
# Parse the response message from the `records` request.
response_message = ai_serving_pb2.PredictResponse()
response_message.ParseFromString(prediction_response_records.content)
print('Binary prediction response of the request using `records`:')
print(response_message)

print('\n----------------------------------------------------------------------------\n')

# Parse the response message from the `split` request.
response_message = ai_serving_pb2.PredictResponse()
response_message.ParseFromString(prediction_response_split.content)
print('Binary prediction response of the request using `split` with a filter:')
print(response_message)
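
If the server rejects a request, the body is unlikely to be a serialized PredictResponse, so parsing it may fail or produce a misleading message. A hedged sketch of a safer pattern is to check the HTTP status first and only then parse. `parse_binary_response` is a hypothetical helper; it relies on `raise_for_status()` from `requests` and `ParseFromString()` from the generated protobuf classes.

```python
def parse_binary_response(response, message_cls):
    """Check the HTTP status, then parse the body into a new `message_cls`."""
    # Raises requests.HTTPError for 4xx/5xx answers.
    response.raise_for_status()
    message = message_cls()
    message.ParseFromString(response.content)
    return message
```

For example, `parse_binary_response(prediction_response_split, ai_serving_pb2.PredictResponse)` would replace the two parsing lines in the cell above.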