# Amazon SageMaker Batch Transform: Associate prediction results with their corresponding input records
_**Use SageMaker's XGBoost to train a binary classification model, then predict whether each tumor in a batch file is malignant**_

_**It also shows in detail how to use the input/output joining and filtering features of Batch Transform**_

---
## Contents

1. [Background](#Background)
2. [Setup](#Setup)
3. [Data Preparation](#Data-Preparation)
  1. [Key Observations](#Key-Observations)
4. [Training Job and Model Creation](#Training-Job-and-Model-Creation)
5. [Batch Transform](#Batch-Transform)
  1. [Create a transform job with the default configurations](#Create-a-transform-job-with-the-default-configurations)
  2. [Join the input and the prediction results](#Join-the-input-and-the-prediction-results)
  3. [Update the output filter to keep only ID and prediction results](#Update-the-output-filter-to-keep-only-ID-and-prediction-results)


## Background
The purpose of this notebook is to train a model with SageMaker's XGBoost on UCI's breast cancer diagnostic data set, in order to illustrate how to run batch inference and how to use the Batch Transform I/O join feature. UCI's breast cancer diagnostic data set is available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The goal is to use this data set to build a predictive model of whether a breast mass image indicates a benign or malignant tumor.


---

## Setup

Let's start by specifying:

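* The SageMaker role ARN used to give training and batch transform access to your data. The snippet below reads the role from the workspace configuration (the `eks-nodegroup-role-arn` entry). On a SageMaker notebook instance you can instead call `sagemaker.get_execution_role()` (commented out below) to reuse the notebook's role. Otherwise, specify the full ARN of a role with the AmazonSageMakerFullAccess policy attached.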
* The S3 bucket that you want to use for training and storing model objects.

In [2]:
import os
import boto3
import sagemaker
from datamaker_sdk.common import get_workspace, get_demo

workspace = get_workspace()
print(workspace)
#role = sagemaker.get_execution_role()
role = workspace['eks-nodegroup-role-arn']
print(role)
sess = sagemaker.Session()
bucket=sess.default_bucket()
prefix = 'sagemaker/breast-cancer-prediction-xgboost' # place to upload training files within the bucket

{'base-image-address': '198245574422.dkr.ecr.us-west-2.amazonaws.com/datamaker-dev-env-jupyter-user', 'bootstrap-s3-prefix': 'teams/lake-creator/bootstrap/', 'ecs-cluster-name': 'datamaker-dev-env-lake-creator-cluster', 'ecs-container-runner-arn': 'arn:aws:states:us-west-2:198245574422:stateMachine:datamaker-dev-env-lake-creator-ecs-container-runner', 'ecs-task-definition-arn': 'arn:aws:ecs:us-west-2:198245574422:task-definition/datamaker-dev-env-lake-creator-task-definition:18', 'efs-id': 'fs-46242543', 'eks-container-runner-arn': 'arn:aws:states:us-west-2:198245574422:stateMachine:datamaker-dev-env-lake-creator-eks-container-runner', 'eks-nodegroup-role-arn': 'arn:aws:iam::198245574422:role/datamaker-dev-env-lake-creator-role', 'final-image-address': '198245574422.dkr.ecr.us-west-2.amazonaws.com/datamaker-dev-env-lake-creator', 'grant-sudo': False, 'image': None, 'instance-type': 'm5.4xlarge', 'jupyter-url': 'aeeff621e69504f32915abeb04d55740-523956185.us-west-2.elb.amazonaws.com', 'j

---
## Data Preparation

Data sources:

* https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
* https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Let's load the data, save a local copy named data.csv, and take a look at it.

In [3]:
import pandas as pd
import numpy as np

#data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header = None)
data = pd.read_csv('./wdbc.data', header = None)


# specify columns extracted from wdbc.names
data.columns = ["id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
                "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
                "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
                "concave points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
                "concave points_worst","symmetry_worst","fractal_dimension_worst"] 

# save the data
data.to_csv("data.csv", sep=',', index=False)

data.sample(8)


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
279,8911834,B,13.85,15.18,88.99,587.4,0.09516,0.07688,0.04479,0.03711,...,14.98,21.74,98.37,670.0,0.1185,0.1724,0.1456,0.09993,0.2955,0.06912
534,919537,B,10.96,17.62,70.79,365.6,0.09687,0.09752,0.05263,0.02788,...,11.62,26.51,76.43,407.5,0.1428,0.251,0.2123,0.09861,0.2289,0.08278
297,892189,M,11.76,18.14,75.0,431.1,0.09968,0.05914,0.02685,0.03515,...,13.36,23.39,85.1,553.6,0.1137,0.07974,0.0612,0.0716,0.1978,0.06915
538,921092,B,7.729,25.49,47.98,178.8,0.08098,0.04878,0.0,0.0,...,9.077,30.92,57.17,248.0,0.1256,0.0834,0.0,0.0,0.3058,0.09938
506,91544001,B,12.22,20.04,79.47,453.1,0.1096,0.1152,0.08175,0.02166,...,13.16,24.17,85.13,515.3,0.1402,0.2315,0.3535,0.08088,0.2709,0.08839
147,86973701,B,14.95,18.77,97.84,689.5,0.08138,0.1167,0.0905,0.03562,...,16.25,25.47,107.1,809.7,0.0997,0.2521,0.25,0.08405,0.2852,0.09218
267,8910499,B,13.59,21.84,87.16,561.0,0.07956,0.08259,0.04072,0.02142,...,14.8,30.04,97.66,661.5,0.1005,0.173,0.1453,0.06189,0.2446,0.07024
110,864033,B,9.777,16.99,62.5,290.2,0.1037,0.08404,0.04334,0.01778,...,11.05,21.47,71.68,367.0,0.1467,0.1765,0.13,0.05334,0.2533,0.08468


#### Key Observations
* The data has 569 observations and 32 columns.
* The first field is the 'id' attribute. We will drop it before batch inference and add it back to the final inference output, next to the probability of malignancy.
* The second field, 'diagnosis', indicates the actual diagnosis ('M' = Malignant; 'B' = Benign).
* The other 30 fields are numeric features that we will use for training and inference.

Let's replace the M/B diagnosis with a 1/0 boolean value. 

In [4]:
data['diagnosis'] = (data['diagnosis'] == "M").astype(int)  # 1 = malignant, 0 = benign
data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
185,874158,0,10.08,15.11,63.76,317.5,0.09267,0.04695,0.001597,0.002404,...,11.87,21.18,75.39,437.0,0.1521,0.1019,0.00692,0.01042,0.2933,0.07697
344,89864002,0,11.71,15.45,75.03,420.3,0.115,0.07281,0.04006,0.0325,...,13.06,18.16,84.16,516.4,0.146,0.1115,0.1087,0.07864,0.2765,0.07806
65,859283,1,14.78,23.94,97.4,668.3,0.1172,0.1479,0.1267,0.09029,...,17.31,33.39,114.6,925.1,0.1648,0.3416,0.3024,0.1614,0.3321,0.08911
257,886776,1,15.32,17.27,103.2,713.3,0.1335,0.2284,0.2448,0.1242,...,17.73,22.66,119.8,928.8,0.1765,0.4503,0.4429,0.2229,0.3258,0.1191
504,915186,0,9.268,12.87,61.49,248.7,0.1634,0.2239,0.0973,0.05252,...,10.28,16.38,69.05,300.2,0.1902,0.3441,0.2099,0.1025,0.3038,0.1252
511,915664,0,14.81,14.7,94.66,680.7,0.08472,0.05016,0.03416,0.02541,...,15.61,17.58,101.7,760.2,0.1139,0.1011,0.1101,0.07955,0.2334,0.06142
95,86208,1,20.26,23.03,132.4,1264.0,0.09078,0.1313,0.1465,0.08683,...,24.22,31.59,156.1,1750.0,0.119,0.3539,0.4098,0.1573,0.3689,0.08368
205,879523,1,15.12,16.68,98.78,716.6,0.08876,0.09588,0.0755,0.04079,...,17.77,20.24,117.7,989.5,0.1491,0.3331,0.3327,0.1252,0.3415,0.0974


Let's split the data as follows: roughly 80% for training, 10% for validation, and 10% set aside for our batch inference job. We drop the 'id' field from the training and validation sets because 'id' is not a training feature. For the batch set, however, we keep the 'id' field: we'll filter it out before running inference so that the input features match those of the training set, and ultimately we'll join it with the inference results. We do drop the 'diagnosis' attribute from the batch set, since this is what we are trying to predict.

In [5]:
#data split in three sets, training, validation and batch inference
rand_split = np.random.rand(len(data))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
batch_list = rand_split >= 0.9

data_train = data[train_list].drop(['id'],axis=1)
data_val = data[val_list].drop(['id'],axis=1)
data_batch = data[batch_list].drop(['diagnosis'],axis=1)
data_batch_noID = data_batch.drop(['id'],axis=1)
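
The split above draws fresh random numbers on every run, so the three sets change from run to run. As a minimal sketch of the same threshold logic with a fixed seed, the plain-Python stand-in below builds the three masks and checks that every row lands in exactly one set. The helper name `split_masks` and the seed value are assumptions for illustration. In the notebook itself, the analogous change would be to seed NumPy before calling `np.random.rand`.

```python
import random


def split_masks(n_rows, seed=42):
    """Return boolean train / validation / batch masks for an 80/10/10 split."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split repeatable
    draws = [rng.random() for _ in range(n_rows)]
    train = [d < 0.8 for d in draws]
    val = [0.8 <= d < 0.9 for d in draws]
    batch = [d >= 0.9 for d in draws]
    return train, val, batch


train_mask, val_mask, batch_mask = split_masks(569)

# Each row belongs to exactly one of the three sets.
assert all(t + v + b == 1 for t, v, b in zip(train_mask, val_mask, batch_mask))
print(sum(train_mask), sum(val_mask), sum(batch_mask))
```

Because the thresholds are applied to a single draw per row, the three masks never overlap, whatever the seed.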


Let's upload those data sets to S3.

In [6]:
train_file = 'train_data.csv'
data_train.to_csv(train_file,index=False,header=False)
sess.upload_data(train_file, key_prefix='{}/train'.format(prefix))

validation_file = 'validation_data.csv'
data_val.to_csv(validation_file,index=False,header=False)
sess.upload_data(validation_file, key_prefix='{}/validation'.format(prefix))

batch_file = 'batch_data.csv'
data_batch.to_csv(batch_file,index=False,header=False)
sess.upload_data(batch_file, key_prefix='{}/batch'.format(prefix))
    
batch_file_noID = 'batch_data_noID.csv'
data_batch_noID.to_csv(batch_file_noID,index=False,header=False)
sess.upload_data(batch_file_noID, key_prefix='{}/batch'.format(prefix))   


's3://sagemaker-us-west-2-198245574422/sagemaker/breast-cancer-prediction-xgboost/batch/batch_data_noID.csv'
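
The URI printed above suggests how each upload is laid out: bucket, then the key prefix, then the local file name. The sketch below only reproduces that string composition so the later input locations are easier to follow. `expected_s3_uri` is a hypothetical helper, not part of the SageMaker SDK, and it makes no call to S3.

```python
import os


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the URI we expect for one uploaded file: bucket / key_prefix / file name."""
    return "s3://{}/{}/{}".format(bucket, key_prefix, os.path.basename(local_path))


example_prefix = "sagemaker/breast-cancer-prediction-xgboost"
example_uri = expected_s3_uri(
    "sagemaker-us-west-2-198245574422",   # default bucket shown in the output above
    "{}/batch".format(example_prefix),
    "batch_data_noID.csv",
)
print(example_uri)
```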

---

## Training Job and Model Creation

The cell below uses the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off the training job using both our training set and validation set. Note that the objective is set to 'binary:logistic', which trains a model to output a probability between 0 and 1 (here, the probability of a tumor being malignant).
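
To make the 'binary:logistic' objective concrete: the model's raw additive score is passed through the logistic (sigmoid) function, which squeezes any real number into the interval (0, 1). The sketch below shows only that final mapping. The helper name `logistic` and the sample scores are illustrative assumptions, not values produced by this training job.

```python
import math


def logistic(margin):
    """Map a raw additive score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


# Negative scores map below 0.5, positive scores above 0.5.
for margin in (-4.0, 0.0, 4.0):
    print(margin, round(logistic(margin), 4))
```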

In [7]:
%%time
from time import gmtime, strftime
from sagemaker.amazon.amazon_estimator import get_image_uri


job_name = 'xgb-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = 's3://{}/{}/output/{}'.format(bucket, prefix, job_name)
image = get_image_uri(boto3.Session().region_name, 'xgboost')

sm_estimator = sagemaker.estimator.Estimator(image,
                                             role,
                                             train_instance_count=1,
                                             train_instance_type='ml.m5.4xlarge',
                                             train_volume_size=50,
                                             input_mode='File',
                                             output_path=output_location,
                                             sagemaker_session=sess)

sm_estimator.set_hyperparameters(objective="binary:logistic",
                                 max_depth=5,
                                 eta=0.2,
                                 gamma=4,
                                 min_child_weight=6,
                                 subsample=0.8,
                                 silent=0,
                                 num_round=100)

train_data = sagemaker.session.s3_input('s3://{}/{}/train'.format(bucket, prefix), distribution='FullyReplicated', 
                                        content_type='text/csv', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input('s3://{}/{}/validation'.format(bucket, prefix), distribution='FullyReplicated', 
                                             content_type='text/csv', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}


# Start training by calling the fit method in the estimator
sm_estimator.fit(inputs=data_channels, logs=True)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
There is a more up to date SageMaker XGBoost image. To use the newer image, please set 'repo_version'='1.0-1'. For example:
	get_image_uri(region, 'xgboost', '1.0-1').
Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-12-14 19:00:36 Starting - Starting the training job...
2020-12-14 19:00:39 Starting - Launching requested ML instances......
2020-12-14 19:01:50 Starting - Preparing the instances for training...
2020-12-14 19:02:46 Downloading - Downloading input data...
2020-12-14 19:03:18 Training - Training image download completed. Training in progress.
2020-12-14 19:03:18 Uploading - Uploading generated training model
Arguments: train
[2020-12-14:19:03:13:INFO] Running standalone xgboost training.
[2020-12-14:19:03:13:INFO] File size need to be processed in the node: 0.13mb. Available memory size in the node: 54796.09mb
[2020-12-14:19:03:13:INFO] Determined delimiter of CSV input is ','
[19:03:13] S3DistributionType set as FullyReplicated
[19:03:13] 445x30 matrix with 13350 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-12-14:19:03:13:INFO] Determined delimiter of CSV input is ','
[19

---

## Batch Transform

In SageMaker Batch Transform, we introduced 3 new attributes: __input_filter__, __join_source__ and __output_filter__. In the cells below, we use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off several Batch Transform jobs using different configurations of these 3 attributes. Please refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to learn more about how to use them.
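
For orientation, these three SDK arguments correspond to the `DataProcessing` block of the underlying `CreateTransformJob` API request. The sketch below only builds that block as a plain dictionary for the three configurations used in this notebook and does not call any AWS service. The helper name `data_processing` is hypothetical, and the defaults shown (`$`, `None`, `$`) are our understanding of the API documentation, so check them against the page linked above.

```python
def data_processing(input_filter="$", join_source="None", output_filter="$"):
    """Build the DataProcessing block of a CreateTransformJob request (illustrative helper)."""
    return {
        "InputFilter": input_filter,    # which part of each record is sent to the model
        "JoinSource": join_source,      # "Input" joins each record with its prediction
        "OutputFilter": output_filter,  # which part of the (joined) record is saved
    }


baseline = data_processing()
joined = data_processing(input_filter="$[1:]", join_source="Input")
id_and_prediction = data_processing(
    input_filter="$[1:]", join_source="Input", output_filter="$[0,-1]"
)

for name, config in [
    ("baseline", baseline),
    ("joined", joined),
    ("id_and_prediction", id_and_prediction),
]:
    print(name, config)
```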




#### 1. Create a transform job with the default configurations
Let's first skip these 3 new attributes and inspect the inference results. We'll use it as a baseline to compare to the results with data processing.

In [8]:
%%time

sm_transformer = sm_estimator.transformer(1, 'ml.m4.xlarge')

# start a transform job
input_location = 's3://{}/{}/batch/{}'.format(bucket, prefix, batch_file_noID) # use input data without ID column
sm_transformer.transform(input_location, split_type='Line')
sm_transformer.wait()

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


.............................
2020-12-14T19:09:03.343:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
Arguments: serve
[2020-12-14 19:09:03 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-12-14 19:09:03 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-12-14 19:09:03 +0000] [1] [INFO] Using worker: gevent
[2020-12-14 19:09:03 +0000] [36] [INFO] Booting worker with pid: 36
[2020-12-14:19:09:03:INFO] Model loaded successfully for worker : 36
[2020-12-14 19:09:03 +0000] [37] [INFO] Booting worker with pid: 37
[2020-12-14:19:09:03:INFO] Sniff delimiter as ','
[2020-12-14:19:09:03:INFO] Determined delimiter of CSV input is ','
[2020-12-14:19:09:03:INFO] Model loaded successfully for worker : 37
[2020-12-14 19:09:03 +0000] [38] [INFO] Booting worker with pid: 38
[2020-12-14 19:09:03 +0000] [39] [INFO] Booting worker with pid: 3

Let's inspect the output of the Batch Transform job in S3. It should show the list of probabilities of tumors being malignant.

In [9]:
import json
import io
from urllib.parse import urlparse

def get_csv_output_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:]
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
    return obj.get()["Body"].read().decode('utf-8')    
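
The helper above relies on `urlparse` to split an S3 URI into a bucket name and a key prefix before appending the file name. The sketch below isolates that step so it can be checked without touching S3. The URI `s3://example-bucket/transform-output` and the function name `split_output_location` are placeholders for illustration.

```python
from urllib.parse import urlparse


def split_output_location(s3uri, file_name):
    """Return (bucket, key) for file_name stored under the prefix given by s3uri."""
    parsed = urlparse(s3uri)
    bucket_name = parsed.netloc      # the part between "s3://" and the next "/"
    key_prefix = parsed.path[1:]     # drop the leading "/" from the path
    return bucket_name, "{}/{}".format(key_prefix, file_name)


example_bucket, example_key = split_output_location(
    "s3://example-bucket/transform-output", "batch_data.csv.out"
)
print(example_bucket, example_key)
```

Note that a URI ending in a trailing slash would produce a double slash in the key with this approach, so the output path is assumed to have no trailing slash.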



In [10]:
output = get_csv_output_from_s3(sm_transformer.output_path, '{}.out'.format(batch_file_noID))
output_df = pd.read_csv(io.StringIO(output), sep=",", header=None)
output_df.head(8)

Unnamed: 0,0
0,0.59349
1,0.983232
2,0.990054
3,0.965993
4,0.068566
5,0.844889
6,0.011774
7,0.225861


#### 2. Join the input and the prediction results 
Now, let's associate the prediction results with their corresponding input records. We can also use the __input_filter__ to exclude the ID column, so there's no need to keep a separate file without IDs in S3.

* Set __input_filter__ to "$[1:]": this excludes column 0 (the 'ID') before inference and keeps everything from column 1 to the last column (all the features or predictors).

* Set __join_source__ to "Input": this joins the input data with the inference results.

* Leave __output_filter__ at its default ('$'), indicating that the joined input and inference results will be saved as output.
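
To illustrate what these settings do to a single CSV record, here is a small plain-Python emulation. It is a sketch of the observable behavior only, not of how Batch Transform is implemented. The record values, the `stand_in_model` function and its fixed return value are all hypothetical.

```python
def apply_input_filter(record):
    """Emulate input_filter "$[1:]": keep every column except column 0 (the ID)."""
    return record[1:]


def join_with_input(record, prediction):
    """Emulate join_source "Input" for CSV: append the prediction to the original record."""
    return record + [prediction]


def stand_in_model(features):
    """Placeholder for the deployed model; always returns the same made-up score."""
    return "0.5"


record = ["1001", "12.45", "15.70", "82.57"]   # hypothetical: ID followed by three features
features = apply_input_filter(record)          # what the model actually receives
joined = join_with_input(record, stand_in_model(features))

print(features)
print(joined)
```

The model never sees the ID, yet the ID is still present in the joined record because the join uses the original, unfiltered input.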

In [11]:
# content_type / accept and split_type / assemble_with are required to use IO joining feature
sm_transformer.assemble_with = 'Line'
sm_transformer.accept = 'text/csv'

# start a transform job
input_location = 's3://{}/{}/batch/{}'.format(bucket, prefix, batch_file) # use input data with ID column because InputFilter will filter it out
sm_transformer.transform(input_location, split_type='Line', content_type='text/csv', input_filter='$[1:]', join_source='Input')
sm_transformer.wait()



...............................
2020-12-14T19:14:35.167:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
Arguments: serve
[2020-12-14 19:14:35 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-12-14 19:14:35 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-12-14 19:14:35 +0000] [1] [INFO] Using worker: gevent
[2020-12-14 19:14:35 +0000] [36] [INFO] Booting worker with pid: 36
[2020-12-14 19:14:35 +0000] [37] [INFO] Booting worker with pid: 37
[2020-12-14 19:14:35 +0000] [38] [INFO] Booting worker with pid: 38
[2020-12-14:19:14:35:INFO] Model loaded successfully for worker : 36
[2020-12-14:19:14:35:INFO] Model loaded successfully for worker : 37
[2020-12-14:19:14:35:INFO] Model loaded successfully for worker : 38
[2020-12-14 19:14:35 +0000] [39] [INFO] Booting worker with pid: 39
[2020-12-14:19:14:35:INFO] Sniff delimiter a

Let's inspect the output of the Batch Transform job in S3. It should show the list of tumors identified by their original feature columns and their corresponding probabilities of being malignant.

In [12]:
output = get_csv_output_from_s3(sm_transformer.output_path, '{}.out'.format(batch_file))
output_df = pd.read_csv(io.StringIO(output), sep=",", header=None)
output_df.head(8)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,843786,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,0.59349
1,84610002,15.78,17.89,103.6,781.0,0.0971,0.1292,0.09954,0.06606,0.1842,...,27.28,136.5,1299.0,0.1396,0.5609,0.3965,0.181,0.3792,0.1048,0.983232
2,848406,14.68,20.13,94.74,684.5,0.09867,0.072,0.07395,0.05259,0.1586,...,30.88,123.4,1138.0,0.1464,0.1871,0.2914,0.1609,0.3029,0.08216,0.990054
3,8511133,15.34,14.26,102.5,704.4,0.1073,0.2135,0.2077,0.09756,0.2521,...,19.08,125.1,980.9,0.139,0.5954,0.6305,0.2393,0.4667,0.09946,0.965993
4,855167,13.44,21.58,86.18,563.0,0.08162,0.06031,0.0311,0.02031,0.1784,...,30.25,102.5,787.9,0.1094,0.2043,0.2085,0.1112,0.2994,0.07146,0.068566
5,85638502,13.17,21.81,85.42,531.5,0.09714,0.1047,0.08259,0.05252,0.1746,...,29.89,105.5,740.7,0.1503,0.3904,0.3728,0.1607,0.3693,0.09618,0.844889
6,857155,12.05,14.63,78.04,449.3,0.1031,0.09092,0.06592,0.02749,0.1675,...,20.7,89.88,582.6,0.1494,0.2156,0.305,0.06548,0.2747,0.08301,0.011774
7,857156,13.49,22.3,86.91,561.0,0.08752,0.07698,0.04751,0.03384,0.1809,...,31.82,99.0,698.8,0.1162,0.1711,0.2282,0.1282,0.2871,0.06917,0.225861


#### 3. Update the output filter to keep only ID and prediction results
Let's change __output_filter__ to "$[0,-1]", indicating that when presenting the output, we only want to keep column 0 (the 'ID') and the last column (the inference result, i.e. the probability of a given tumor being malignant).
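
As a rough picture of what this filter keeps, the sketch below applies the same "first and last column" selection to a few joined CSV lines in plain Python. The input lines are hypothetical, and `keep_first_and_last` is an illustrative helper rather than anything provided by SageMaker.

```python
import csv
import io


def keep_first_and_last(joined_csv):
    """Emulate output_filter "$[0,-1]" on CSV text: keep column 0 and the last column of each line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(joined_csv)):
        writer.writerow([row[0], row[-1]])
    return out.getvalue()


# Hypothetical joined records: ID, three features, prediction
joined_csv = "1001,12.45,15.70,82.57,0.91\n1002,9.78,16.99,62.50,0.07\n"
filtered_csv = keep_first_and_last(joined_csv)
print(filtered_csv)
```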

In [13]:
# start another transform job
sm_transformer.transform(input_location, split_type='Line', content_type='text/csv', input_filter='$[1:]', join_source='Input', output_filter='$[0,-1]')
sm_transformer.wait()



.................................
2020-12-14T19:20:11.448:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
Arguments: serve
[2020-12-14 19:20:11 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-12-14 19:20:11 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-12-14 19:20:11 +0000] [1] [INFO] Using worker: gevent
[2020-12-14 19:20:11 +0000] [37] [INFO] Booting worker with pid: 37
[2020-12-14 19:20:11 +0000] [38] [INFO] Booting worker with pid: 38
[2020-12-14 19:20:11 +0000] [39] [INFO] Booting worker with pid: 39
[2020-12-14:19:20:11:INFO] Model loaded successfully for worker : 37
[2020-12-14 19:20:11 +0000] [40] [INFO] Booting worker with pid: 40
[2020-12-14:19:20:11:INFO] Model loaded successfully for worker : 38
[2020-12-14:19:20:11:INFO] Model loaded successfully for worker : 39
[2020-12-14:19:20:11:INFO] Sniff delimiter

Now, let's inspect the output of the Batch Transform job in S3 again. It should show 2 columns: the ID and the corresponding probability of being malignant.

In [14]:
output = get_csv_output_from_s3(sm_transformer.output_path, '{}.out'.format(batch_file))
output_df = pd.read_csv(io.StringIO(output), sep=",", header=None)
output_df.head(8)

Unnamed: 0,0,1
0,843786,0.59349
1,84610002,0.983232
2,848406,0.990054
3,8511133,0.965993
4,855167,0.068566
5,85638502,0.844889
6,857155,0.011774
7,857156,0.225861
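
If a yes/no call is needed rather than a probability, one common follow-up is to apply a cut-off to the second column. The sketch below does this for the eight rows shown above. The 0.5 threshold is an assumption chosen for illustration (the notebook does not prescribe one), and `label` is a hypothetical helper.

```python
THRESHOLD = 0.5  # assumed cut-off; tune it to the cost of false negatives in your use case

# (ID, probability of malignancy) pairs copied from the output above
predictions = [
    (843786, 0.59349),
    (84610002, 0.983232),
    (848406, 0.990054),
    (8511133, 0.965993),
    (855167, 0.068566),
    (85638502, 0.844889),
    (857155, 0.011774),
    (857156, 0.225861),
]


def label(probability, threshold=THRESHOLD):
    """Return 'M' (malignant) at or above the threshold, otherwise 'B' (benign)."""
    return "M" if probability >= threshold else "B"


labels = {tumor_id: label(p) for tumor_id, p in predictions}
for tumor_id, diagnosis in labels.items():
    print(tumor_id, diagnosis)
```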


In summary, we can use the 3 newly introduced attributes (__input_filter__, __join_source__ and __output_filter__) to:
1. Filter / select useful features from the input dataset, e.g. exclude ID columns.
2. Associate the prediction results with their corresponding input records.
3. Filter the original or joined results before saving to S3, e.g. keep only the ID and probability columns.