# Amazon SageMaker Batch Transform: Associate prediction results with their corresponding input records
_**Use a SageMaker XGBoost model to predict the trip fare for each record in a batch file**_

_**It also shows in detail how to use the input/output joining and filtering features of Batch Transform**_

---

## Setup

Let's start by specifying:

* The SageMaker role ARN used to give training and batch transform access to your data. The snippet below uses the same role as your SageMaker notebook instance. Otherwise, specify the full ARN of a role with the AmazonSageMakerFullAccess policy attached.
* The S3 bucket that you want to use for training and storing model objects.

In [None]:
import os
import boto3
import sagemaker
from sagemaker import get_execution_role
from time import gmtime, strftime

role = get_execution_role()

region = boto3.Session().region_name

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-xgboost-tripfare'

In [12]:
!python -m pip install -Uq pip
!python -m pip install -q awswrangler sagemaker==2.59.5 boto3==1.18.51



In [6]:
%store
%store -r

Stored variables and their in-db values:
model_url                     -> 's3://sagemaker-us-east-1-631450739534/sagemaker/D
test_path                     -> 's3://sagemaker-us-east-1-631450739534/sagemaker/D
train_path                    -> 's3://sagemaker-us-east-1-631450739534/sagemaker/D
training_job_name             -> 'DEMO-xgboost-tripfare-train-2021-09-30-03-08-58-4
validation_path               -> 's3://sagemaker-us-east-1-631450739534/sagemaker/D


##  XGBoost Bring Your Own Model

Amazon SageMaker includes functionality to support a hosted notebook environment, distributed, serverless training, and real-time hosting. We think it works best when all three of these services are used together, but they can also be used independently. Some use cases may only require hosting, for example when the model was trained before Amazon SageMaker existed or in a different service.

This section shows how to use a pre-existing trained XGBoost model with the Amazon SageMaker XGBoost Algorithm container to quickly create a hosted endpoint for that model. 

In [8]:
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='1.3-1')

In [9]:
%%time

model_file_name = "DEMO-byo-xgboost-model"
model_name = model_file_name + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_data = model_url
print(model_data)
sm_client = boto3.client("sagemaker")


primary_container = {
    "Image": container,
    "ModelDataUrl": model_data,
}

create_model_response2 = sm_client.create_model(
    ModelName=model_name, ExecutionRoleArn=role, PrimaryContainer=primary_container
)

print(create_model_response2["ModelArn"])

s3://sagemaker-us-east-1-631450739534/sagemaker/DEMO-xgboost-tripfare/model/DEMO-xgboost-tripfare-train-2021-09-30-03-08-58-428/output/model.tar.gz
arn:aws:sagemaker:us-east-1:631450739534:model/demo-byo-xgboost-model2021-09-30-06-08-20
CPU times: user 62 ms, sys: 11.2 ms, total: 73.2 ms
Wall time: 637 ms


In [10]:
endpoint_config_name = "DEMO-XGBoostEndpointConfig-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.m4.xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

DEMO-XGBoostEndpointConfig-2021-09-30-06-08-49
Endpoint Config Arn: arn:aws:sagemaker:us-east-1:631450739534:endpoint-config/demo-xgboostendpointconfig-2021-09-30-06-08-49


In [11]:
%%time
import time

endpoint_name = "BYOM-XGBoostEndpoint-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response["EndpointArn"])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)


BYOM-XGBoostEndpoint-2021-09-30-06-09-09
arn:aws:sagemaker:us-east-1:631450739534:endpoint/byom-xgboostendpoint-2021-09-30-06-09-09
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:631450739534:endpoint/byom-xgboostendpoint-2021-09-30-06-09-09
Status: InService
CPU times: user 130 ms, sys: 29.2 ms, total: 159 ms
Wall time: 10min 1s


In [14]:
import awswrangler as wr
test_df = wr.s3.read_csv(
        path=test_path, dataset=True, nrows=5, header=None
    )

In [29]:
import io
import csv

runtime_client = boto3.client("runtime.sagemaker")

data = test_df.iloc[:,1:].to_numpy()

results = []
csv_buffer = io.StringIO()
csv_writer = csv.writer(csv_buffer, delimiter=",")
for record in data:
    csv_writer.writerow(record)

response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="text/csv", Body=csv_buffer.getvalue()
    )
print("Predicted trip fares: {}.".format(response["Body"].read().decode("ascii")))


Predicted trip fares: 18.545743942260742,6.25778341293335,19.946535110473633,26.39051055908203,7.497164726257324,17.29636573791504,11.487503051757812,35.032718658447266,18.201818466186523,8.741583824157715.


## Batch Transform


In SageMaker Batch Transform, we introduced three new attributes: __input_filter__, __join_source__ and __output_filter__. In the cells below, we use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off Batch Transform jobs using different configurations of these three attributes. Please refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to learn more about how to use them.


#### 1. Create a transform job with the default configurations
Let's first skip these 3 new attributes and inspect the inference results. We'll use it as a baseline to compare to the results with data processing.

In [30]:
from sagemaker.model import Model
xgb_model = Model(
    image_uri=container,
    model_data=model_url,
    role=role,
    name=model_file_name + strftime("%Y-%m-%d-%H-%M-%S", gmtime()),
    sagemaker_session=sagemaker_session,
)
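
The cell above only builds the `Model` object. A baseline job that uses none of the three attributes might then look like the sketch below. `run_baseline_transform` and `features_only_path` are hypothetical names, and the sketch assumes the input file contains only feature columns, because without an __input_filter__ every column is sent to the model.

```python
def run_baseline_transform(model, features_only_path, instance_type="ml.m4.xlarge"):
    """Run a Batch Transform job with the default data processing settings.

    `model` is expected to behave like a sagemaker.model.Model, and
    `features_only_path` (hypothetical) is an S3 URI of CSV data that holds
    only the feature columns the model was trained on.
    """
    transformer = model.transformer(instance_count=1, instance_type=instance_type)
    # No input_filter / join_source / output_filter: the output holds predictions only.
    transformer.transform(
        features_only_path,
        content_type="text/csv",
        split_type="Line",
    )
    transformer.wait()
    return transformer


# Example call (assumption: xgb_model is the Model created above):
# baseline_transformer = run_baseline_transform(xgb_model, features_only_path)
```

The output of such a job contains only the predictions, one per input line, which makes it hard to tell which prediction belongs to which record.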

#### 2. Join the input and the prediction results 
Now, let's associate the prediction results with their corresponding input records. We can also use __input_filter__ to exclude the ID column, so there's no need to keep a separate file in S3.

* Set __input_filter__ to "$[1:]": this excludes column 0 (the 'ID') before the records are sent for inference and keeps everything from column 1 to the last column (all the features or predictors).
* Set __join_source__ to "Input": this joins the input data with the inference results.
* Leave __output_filter__ at its default ('$'), so that the joined input and inference results will be saved as output.
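
To make the effect of these settings concrete, here is a minimal local sketch of what happens to a single CSV record. It does not call SageMaker; `simulate_join` is a hypothetical helper, and the sample record and prediction are made-up values used only for illustration.

```python
import csv
import io


def parse_csv_line(line):
    """Split one CSV line into a list of string fields."""
    return next(csv.reader(io.StringIO(line)))


def simulate_join(line, prediction, keep_id_and_prediction=False):
    """Mimic input_filter="$[1:]" and join_source="Input" for one CSV record.

    When keep_id_and_prediction is True, also mimic output_filter="$[0,-1]".
    Returns (fields sent to the model, fields written to the output).
    """
    fields = parse_csv_line(line)
    model_input = fields[1:]             # input_filter="$[1:]": drop column 0
    joined = fields + [str(prediction)]  # join_source="Input": full input + prediction
    if keep_id_and_prediction:           # output_filter="$[0,-1]"
        joined = [joined[0], joined[-1]]
    return model_input, joined


# Made-up record: an ID followed by three feature values.
model_input, joined = simulate_join("42,1.5,3,0", 18.5)
print(model_input)  # ['1.5', '3', '0']
print(joined)       # ['42', '1.5', '3', '0', '18.5']
```

Note that the join uses the original record, including the ID column that the input filter hid from the model.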

In [34]:
xgb_transformer = xgb_model.transformer(instance_count=2, instance_type="ml.m4.xlarge")

# content_type / accept and split_type / assemble_with are required to use IO joining feature
xgb_transformer.assemble_with = "Line"
xgb_transformer.accept = "text/csv"

# start a transform job
xgb_transformer.transform(test_path, 
                         content_type="text/csv", 
                         split_type="Line",
                         input_filter="$[1:]",
                         join_source="Input",
                        )
xgb_transformer.wait()

Using already existing model: DEMO-byo-xgboost-model2021-09-30-06-47-15


..................................[2021-09-30:06:57:24:INFO] No GPUs detected (normal if no gpus installed)
[2021-09-30:06:57:24:INFO] No GPUs detected (normal if no gpus installed)
[2021-09-30:06:57:24:INFO] nginx config:
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;

    keepalive_timeout 3;

    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }


Let's inspect the output of the Batch Transform job in S3. It should show each input record with its original columns, followed by the corresponding predicted trip fare.

In [None]:
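# Batch Transform writes one "<input file name>.out" object per input file under output_path
output_df = wr.s3.read_csv(path=xgb_transformer.output_path, path_suffix=".out", header=None)
output_df.head()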

Additionally, you can update the output filter to keep only the ID and the prediction results. For example, you can change __output_filter__ to "$[0,-1]", indicating that the output should keep only column 0 and the last column (the inference result, i.e. the predicted trip fare).
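
A job with this output filter could be started as in the sketch below. `transform_keep_id_and_prediction` is a hypothetical helper name; it assumes a transformer that already has `assemble_with = "Line"` and `accept = "text/csv"` set, as in the join example.

```python
def transform_keep_id_and_prediction(transformer, input_path):
    """Start a transform job whose output keeps only the ID and the prediction.

    `transformer` is expected to behave like a sagemaker.transformer.Transformer.
    """
    transformer.transform(
        input_path,
        content_type="text/csv",
        split_type="Line",
        input_filter="$[1:]",      # send every column except column 0 to the model
        join_source="Input",       # append the prediction to the input record
        output_filter="$[0,-1]",   # keep column 0 and the last column only
    )
    transformer.wait()
    return transformer.output_path


# Example call (assumption: xgb_transformer and test_path are defined as above):
# output_path = transform_keep_id_and_prediction(xgb_transformer, test_path)
```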

In summary, we can use the three newly introduced attributes, __input_filter__, __join_source__ and __output_filter__, to:
1. Filter or select useful features from the input dataset, e.g. exclude ID columns.
2. Associate the prediction results with their corresponding input records.
3. Filter the original or joined results before saving them to S3, e.g. keep only the ID and prediction columns.