# Amazon SageMaker Batch Transform

## Background
The purpose of this notebook is to use an XGBoost model in SageMaker and UCI's breast cancer diagnostic data set to illustrate how to run batch inference and how to use the Batch Transform I/O join feature. UCI's breast cancer diagnostic data set is available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. We use this data set to build a predictive model of whether a breast mass image indicates a benign or malignant tumor.



---

## Setup

Let's start by specifying:

* The SageMaker role ARN used to give training and batch transform access to your data. The snippet below uses the same role as your SageMaker notebook instance. Otherwise, specify the full ARN of a role with the AmazonSageMakerFullAccess policy attached.
* The S3 bucket that you want to use for training and storing model objects.

In [7]:
!python -m pip install --upgrade pip --quiet
!pip install -U awscli --quiet
!pip install sagemaker --upgrade


In [8]:
import os
import boto3
import sagemaker
from time import gmtime, strftime
from datetime import datetime

boto_session = boto3.session.Session()
sm_session = sagemaker.session.Session()
sm_role = sagemaker.get_execution_role()
region = boto_session.region_name
s3_bucket = sm_session.default_bucket()
bucket_prefix = "DEMO-breast-cancer-prediction-xgboost-highlevel"
resource_name = "BatchInferenceDemo-{}-{}"

print(f"Will use bucket '{s3_bucket}' for storing all resources related to this notebook")
print(f"Using Role: {sm_role}")

Will use bucket 'sagemaker-us-east-1-620171311143' for storing all resources related to this notebook
Using Role: arn:aws:iam::620171311143:role/mod-6297809195fe4845-SageMakerExecutionRole-1X0TNEFMV0U32


---
## Data sources

> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

> Breast Cancer Wisconsin (Diagnostic) Data Set [https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)].

> _Also see:_ Breast Cancer Wisconsin (Diagnostic) Data Set [https://www.kaggle.com/uciml/breast-cancer-wisconsin-data].

## Data preparation


Let's download the data set, attach column names, save it locally as data.csv, and take a look at it.

In [9]:
import pandas as pd
import numpy as np

s3 = boto3.client("s3")

filename = "wdbc.csv"
s3.download_file("sagemaker-sample-files", "datasets/tabular/breast_cancer/wdbc.csv", filename)
data = pd.read_csv(filename, header=None)

# specify columns extracted from wdbc.names
data.columns = [
    "id",
    "diagnosis",
    "radius_mean",
    "texture_mean",
    "perimeter_mean",
    "area_mean",
    "smoothness_mean",
    "compactness_mean",
    "concavity_mean",
    "concave points_mean",
    "symmetry_mean",
    "fractal_dimension_mean",
    "radius_se",
    "texture_se",
    "perimeter_se",
    "area_se",
    "smoothness_se",
    "compactness_se",
    "concavity_se",
    "concave points_se",
    "symmetry_se",
    "fractal_dimension_se",
    "radius_worst",
    "texture_worst",
    "perimeter_worst",
    "area_worst",
    "smoothness_worst",
    "compactness_worst",
    "concavity_worst",
    "concave points_worst",
    "symmetry_worst",
    "fractal_dimension_worst",
]

# save the data
data.to_csv("data.csv", sep=",", index=False)

data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
391,903483,B,8.734,16.84,55.27,234.3,0.1039,0.07428,0.0,0.0,...,10.17,22.8,64.01,317.0,0.146,0.131,0.0,0.0,0.2445,0.08865
465,9113239,B,13.24,20.13,86.87,542.9,0.08284,0.1223,0.101,0.02833,...,15.44,25.5,115.0,733.5,0.1201,0.5646,0.6556,0.1357,0.2845,0.1249
58,857810,B,13.05,19.31,82.61,527.2,0.0806,0.03789,0.000692,0.004167,...,14.23,22.25,90.24,624.1,0.1021,0.06191,0.001845,0.01111,0.2439,0.06289
463,911320501,B,11.6,18.36,73.88,412.7,0.08508,0.05855,0.03367,0.01777,...,12.77,24.02,82.68,495.1,0.1342,0.1808,0.186,0.08288,0.321,0.07863
530,91858,B,11.75,17.56,75.89,422.9,0.1073,0.09713,0.05282,0.0444,...,13.5,27.98,88.52,552.3,0.1349,0.1854,0.1366,0.101,0.2478,0.07757
360,901034302,B,12.54,18.07,79.42,491.9,0.07436,0.0265,0.001194,0.005449,...,13.72,20.98,86.82,585.7,0.09293,0.04327,0.003581,0.01635,0.2233,0.05521
397,90401602,B,12.8,17.46,83.05,508.3,0.08044,0.08895,0.0739,0.04083,...,13.74,21.06,90.72,591.0,0.09534,0.1812,0.1901,0.08296,0.1988,0.07053
35,854253,M,16.74,21.59,110.1,869.5,0.0961,0.1336,0.1348,0.06018,...,20.01,29.02,133.5,1229.0,0.1563,0.3835,0.5409,0.1813,0.4863,0.08633


#### Key observations:
* The data has 569 observations and 32 columns.
* The first field is the 'id' attribute that we will want to drop before batch inference and add to the final inference output next to the probability of malignancy.
* Second field, 'diagnosis', is an indicator of the actual diagnosis ('M' = Malignant; 'B' = Benign).
* There are 30 other numeric features that we will use for training and inferencing.

Let's replace the M/B diagnosis with a 1/0 boolean value. 

In [10]:
data["diagnosis"] = (data["diagnosis"] == "M").astype(int)
data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
325,89511502,0,12.67,17.3,81.25,489.9,0.1028,0.07664,0.03193,0.02107,...,13.71,21.1,88.7,574.4,0.1384,0.1212,0.102,0.05602,0.2688,0.06888
548,923169,0,9.683,19.34,61.05,285.7,0.08491,0.0503,0.02337,0.009615,...,10.93,25.59,69.1,364.2,0.1199,0.09546,0.0935,0.03846,0.2552,0.0792
25,852631,1,17.14,16.4,116.0,912.7,0.1186,0.2276,0.2229,0.1401,...,22.25,21.4,152.4,1461.0,0.1545,0.3949,0.3853,0.255,0.4066,0.1059
9,84501001,1,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,...,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075
230,881972,1,17.05,19.08,113.4,895.0,0.1141,0.1572,0.191,0.109,...,19.59,24.89,133.5,1189.0,0.1703,0.3934,0.5018,0.2543,0.3109,0.09061
144,869254,0,10.75,14.97,68.26,355.3,0.07793,0.05139,0.02251,0.007875,...,11.95,20.72,77.79,441.2,0.1076,0.1223,0.09755,0.03413,0.23,0.06769
495,914333,0,14.87,20.21,96.12,680.9,0.09587,0.08345,0.06824,0.04951,...,16.01,28.48,103.9,783.6,0.1216,0.1388,0.17,0.1017,0.2369,0.06599
406,905189,0,16.14,14.86,104.3,800.0,0.09495,0.08501,0.055,0.04528,...,17.71,19.58,115.9,947.9,0.1206,0.1722,0.231,0.1129,0.2778,0.07012


Let's split the data and set 10% aside for our batch inference job. We drop the 'id' field from the training and validation sets because 'id' is not a training feature. For the batch set, however, we keep 'id': we'll filter it out before running inference so that the input features match those of the training set, and then join it back onto the inference results. We do drop the 'diagnosis' attribute from the batch set, since that is what we are trying to predict.

Let's upload those data sets to S3.

In [13]:
rand_split = np.random.rand(len(data))
batch_list = rand_split >= 0.9
data_batch = data[batch_list].drop(["diagnosis"], axis=1)
data_batch_noID = data_batch.drop(["id"], axis=1)

batch_file = "batch_data.csv"
data_batch.to_csv(batch_file, index=False, header=False)
sm_session.upload_data(batch_file, key_prefix="{}/batch".format(bucket_prefix))

batch_file_noID = "batch_data_noID.csv"
data_batch_noID.to_csv(batch_file_noID, index=False, header=False)
sm_session.upload_data(batch_file_noID, key_prefix="{}/batch".format(bucket_prefix))

's3://sagemaker-us-east-1-620171311143/DEMO-breast-cancer-prediction-xgboost-highlevel/batch/batch_data_noID.csv'
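
The cell above draws a fresh random mask on every run, so the batch set changes between runs. If a repeatable split is wanted, a seeded variant could look like the sketch below. The helper name `batch_mask` and the seed value are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def batch_mask(n_rows: int, batch_fraction: float = 0.1, seed: int = 42) -> np.ndarray:
    """Return a boolean mask that selects roughly `batch_fraction` of the rows.

    The seed is a hypothetical choice; any fixed integer gives a repeatable split.
    """
    rng = np.random.default_rng(seed)
    # Rows whose uniform draw lands in the top `batch_fraction` go to the batch set.
    return rng.random(n_rows) >= 1.0 - batch_fraction


mask = batch_mask(569)  # 569 rows, as in the data set above
print(int(mask.sum()), "rows held out for batch inference")
```

With this mask, the batch set would be built as `data[batch_mask(len(data))].drop(["diagnosis"], axis=1)`.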

## Create a SageMaker Model

Specify the location of the pre-trained model stored in Amazon S3. This example uses a pre-trained XGBoost model archive, model.tar.gz, which the cell below uploads from the local model/ directory. The full Amazon S3 URI is stored in the string variable model_url.

In [14]:
model_s3_key = f"{bucket_prefix}/model.tar.gz"
model_url = f"s3://{s3_bucket}/{model_s3_key}"
print(f"Uploading Model to {model_url}")

with open("model/model.tar.gz", "rb") as model_file:
    boto_session.resource("s3").Bucket(s3_bucket).Object(model_s3_key).upload_fileobj(model_file)

Uploading Model to s3://sagemaker-us-east-1-620171311143/DEMO-breast-cancer-prediction-xgboost-highlevel/model.tar.gz


Specify a primary container. For the primary container, you specify the Docker image that contains inference code, artifacts (from prior training), and a custom environment map that the inference code uses when you deploy the model for predictions. In this example, we specify an XGBoost built-in algorithm container image.

In [15]:
from sagemaker import image_uris

# Specify an AWS container image and region as desired
container = image_uris.retrieve(region=region, framework="xgboost", version="0.90-1")

Create a SageMaker Model by specifying the name, the role (the ARN of the IAM role that Amazon SageMaker can assume to access model artifacts and Docker images for deployment), the image_uri of the XGBoost built-in algorithm container image, and the model_data location of the model archive in S3.

In [16]:
from sagemaker.model import Model
from sagemaker.predictor import Predictor

model_name = resource_name.format("Model", datetime.now().strftime("%Y-%m-%d-%H-%M-%S"))

model_predictor = Model(
    name=model_name,
    image_uri=container,
    model_data=model_url,
    role=sm_role,
    predictor_cls=Predictor,
)
model_name

'BatchInferenceDemo-Model-2022-08-13-00-14-46'

In [17]:
# Deploy the model to a real-time endpoint
instance_count = 1
instance_type = "ml.m5.4xlarge"

predictor = model_predictor.deploy(
    instance_type=instance_type,
    initial_instance_count=instance_count,
)


-----!

---

## Batch Transform

SageMaker Batch Transform offers 3 new attributes for data processing: __input_filter__, __join_source__ and __output_filter__. In the cells below, we use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off several Batch Transform jobs using different configurations of these 3 new attributes. Please refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to learn more about how to use them.
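
To build intuition for the three attributes, the sketch below mimics their effect locally on CSV records, with no SageMaker call involved. It assumes the filters `$[1:]` (drop the first column) and `$[0,-1]` (keep the first and last columns) and joins with the input. The function names, the sample rows, and the stand-in model are all made up for illustration.

```python
from typing import Callable, List, Sequence


def simulate_io_join(
    rows: Sequence[Sequence[str]],
    predict: Callable[[Sequence[str]], str],
) -> List[List[str]]:
    """Mimic locally what input_filter="$[1:]", join_source="Input" and
    output_filter="$[0,-1]" do to each CSV record."""
    results = []
    for row in rows:
        features = row[1:]                 # input_filter "$[1:]": drop the ID column
        prediction = predict(features)     # the model only sees the features
        joined = list(row) + [prediction]  # join_source "Input": append output to input
        results.append([joined[0], joined[-1]])  # output_filter "$[0,-1]": ID and prediction
    return results


def fake_predict(features: Sequence[str]) -> str:
    """Hypothetical stand-in for the model: flag records whose first feature exceeds 15."""
    return "1" if float(features[0]) > 15 else "0"


# Made-up records: an ID followed by two feature values.
rows = [["1001", "18.25", "19.98"], ["1002", "13.0", "21.82"]]
print(simulate_io_join(rows, fake_predict))

# Keyword arguments that would request the same processing from a real job,
# for example: sm_transformer.transform(input_location, **join_kwargs)
join_kwargs = {
    "content_type": "text/csv",
    "split_type": "Line",
    "input_filter": "$[1:]",
    "join_source": "Input",
    "output_filter": "$[0,-1]",
}
```

The local function is only a mental model. In a real job the filtering and joining happen inside Batch Transform, and the input would be the file that still contains the 'id' column.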




#### 1. Create a transform job with the default configurations
Let's first skip these 3 new attributes and inspect the inference results. We'll use it as a baseline to compare to the results with data processing.

In [21]:
# %%time

from sagemaker.transformer import Transformer

sm_transformer = Transformer(
    model_name=model_name,
    instance_count=1,
    instance_type="ml.m4.xlarge",
)

# Start the Batch Transform job

input_location = "s3://{}/{}/batch/{}".format(
    s3_bucket, bucket_prefix, batch_file_noID
)  # use input data without ID column

sm_transformer.transform(input_location, content_type="text/csv", split_type="Line")


.................................
[2022-08-13 00:24:22 +0000] [14] [INFO] Starting gunicorn 19.10.0
[2022-08-13 00:24:22 +0000] [14] [INFO] Listening at: unix:/tmp/gunicorn.sock (14)
[2022-08-13 00:24:22 +0000] [14] [INFO] Using worker: gevent
[2022-08-13 00:24:22 +0000] [21] [INFO] Booting worker with pid: 21
[2022-08-13 00:24:22 +0000] [22] [INFO] Booting worker with pid: 22
[2022-08-13 00:24:22 +0000] [26] [INFO] Booting worker with pid: 26
[2022-08-13 00:24:22 +0000] [30] [INFO] Booting worker with pid: 30

Let's inspect the output of the Batch Transform job in S3. It should show the list of probabilities of tumors being malignant. We first define a helper that downloads an output file from S3 and loads it into a DataFrame.

In [23]:
import re


def get_csv_output_from_s3(s3uri, batch_file):
    """Download the '<batch_file>.out' result under s3uri and load it as a DataFrame."""
    file_name = "{}.out".format(batch_file)
    match = re.match(r"s3://([^/]+)/(.*)", "{}/{}".format(s3uri, file_name))
    output_bucket, output_key = match.group(1), match.group(2)
    s3.download_file(output_bucket, output_key, file_name)
    return pd.read_csv(file_name, sep=",", header=None)
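
The helper above splits the S3 URI with a regular expression. As an alternative sketch, the same split can be done with the standard library's `urllib.parse`, which also makes it easy to reject malformed input. The function name `split_s3_uri` and the example bucket and key are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # The path keeps its leading slash, which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket and key, for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/transform-output/batch_data_noID.csv.out")
print(bucket, key)
```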

Now load the output of the baseline job. Because no join was configured, each row holds only the predicted probability of malignancy, in the same order as the input rows and without the original feature columns.

In [24]:
output_df = get_csv_output_from_s3(sm_transformer.output_path, batch_file_noID)
output_df.head(8)

Unnamed: 0,0
0,0.987773
1,0.992002
2,0.990161
3,0.987554
4,0.006414
5,0.545305
6,0.900308
7,0.013795
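
The model returns probabilities rather than class labels. To turn them into the 1/0 encoding used for 'diagnosis' earlier, a cutoff has to be chosen. The sketch below assumes a threshold of 0.5, which is an illustrative default and not something the notebook prescribes. The helper name `to_labels` is hypothetical.

```python
from typing import Iterable, List


def to_labels(probabilities: Iterable[float], threshold: float = 0.5) -> List[int]:
    """Map each malignancy probability to 1 (malignant) or 0 (benign)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# The eight probabilities shown in the output above.
probs = [0.987773, 0.992002, 0.990161, 0.987554, 0.006414, 0.545305, 0.900308, 0.013795]
print(to_labels(probs))  # [1, 1, 1, 1, 0, 1, 1, 0]
```

A borderline value such as 0.545305 shows why the cutoff matters: in a diagnostic setting a lower threshold may be preferred, to reduce missed malignant cases at the cost of more false alarms.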
