# Module 3: Batch Scoring using a pre-trained XGBoost model
**This notebook uses the feature store to prepare a test dataset for batch scoring, then scores it with the XGBoost model trained in the model training notebook.**

**Note:** Please set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Prepare test data](#Prepare-test-data-for-batch-transform)
1. [Batch Transform](#Batch-Transform)

## Background

After a model is trained, SageMaker batch transform is a good fit when the goal is to generate predictions on a large dataset and minimizing latency isn't a concern. Functionally, batch transform uses the same mechanics as real-time hosting to generate predictions: it requires a web server that accepts HTTP POST requests containing a single observation, or a mini-batch, at a time. However, unlike real-time hosted endpoints, which have persistent hardware (instances stay running until you shut them down), batch transform clusters are torn down when the job completes.

In this example, we walk through the steps to prepare the batch test dataset from the feature store using a processing job, and then perform a batch transform on the test data stored in Amazon S3.
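
To make the "mini-batch at a time" idea concrete, here is a minimal sketch of grouping CSV records into request payloads that stay under a size limit. It is only an illustration of the concept, not SageMaker's actual implementation; the function name `mini_batches` and the `max_bytes` parameter are hypothetical.

```python
from typing import Iterable, Iterator, List


def mini_batches(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Group CSV lines into batches whose encoded size stays within max_bytes."""
    batch: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        # Start a new batch when adding this line would exceed the limit
        if batch and size + line_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += line_size
    if batch:
        yield batch


# Hypothetical records; each batch would be sent as one HTTP POST body
records = ["1,0.5,3", "2,0.1,7", "3,0.9,4", "4,0.2,8"]
batches = list(mini_batches(records, max_bytes=16))
```

With these inputs, each record takes 8 bytes including its newline, so two records fit in each 16-byte batch.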

## Setup


In [None]:
import logging
import sys
from urllib.parse import urlparse
from io import StringIO

import boto3
import pandas as pd
import sagemaker
from sagemaker import get_execution_role
from sagemaker.dataset_definition.inputs import (
    AthenaDatasetDefinition,
    DatasetDefinition,
)
from sagemaker.model import Model
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.utils import name_from_base

sys.path.append("..")
from utilities import Utils

In [None]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

#### Essentials

In [None]:
sagemaker_execution_role = get_execution_role()
logger.info(f"Role = {sagemaker_execution_role}")
session = boto3.Session()
sagemaker_session = sagemaker.Session()
sagemaker_client = session.client(service_name="sagemaker")


default_bucket = sagemaker_session.default_bucket()
prefix = "sagemaker-featurestore-workshop"

### Prepare test data for batch transform 
<!-- job using processing job with *AthenaDatasetDefinition* -->
We create the test dataset for our batch transform job using the [*AthenaDatasetDefinition*](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.dataset_definition.inputs.AthenaDatasetDefinition) API.
We follow the steps below to prepare the test dataset for the batch transform job:
1. Generate the list of feature names that we would like to read from the offline feature store by passing a list of feature groups and a list of features to exclude to the *generate_fsets* function.
2. Construct an Athena query to read the data from the offline feature store, and run a SageMaker processing job to convert the data to 'text/csv'.

#### Generate the list of features needed from feature store.

We use the boto3 `sagemaker_client` to call the `DescribeFeatureGroup` action, which describes a FeatureGroup. The response includes the creation time, the FeatureGroup name, the unique identifier for each FeatureGroup, and more. For details of the response syntax, please refer to the [documentation here](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeFeatureGroup.html#API_DescribeFeatureGroup_ResponseSyntax).

In [None]:
# Retrieve FG names
%store -r customers_feature_group_name
%store -r products_feature_group_name
%store -r orders_feature_group_name

customers_fg = sagemaker_client.describe_feature_group(
    FeatureGroupName=customers_feature_group_name
)
products_fg = sagemaker_client.describe_feature_group(
    FeatureGroupName=products_feature_group_name
)
orders_fg = sagemaker_client.describe_feature_group(
    FeatureGroupName=orders_feature_group_name
)

database_name = customers_fg["OfflineStoreConfig"]["DataCatalogConfig"]["Database"]
catalog = customers_fg["OfflineStoreConfig"]["DataCatalogConfig"]["Catalog"]

customers_table = customers_fg["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]
products_table = products_fg["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]
orders_table = orders_fg["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]
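
The nested lookups above depend on the shape of the `DescribeFeatureGroup` response. As a minimal sketch, the helper below extracts the same three values from a response-shaped dictionary and fails with a clear message when a feature group has no offline store. The helper name `offline_table_info` and all values in the example dictionary are hypothetical; real responses contain many more fields.

```python
from typing import Dict, Tuple


def offline_table_info(feature_group_description: Dict) -> Tuple[str, str, str]:
    """Return (catalog, database, table) from a DescribeFeatureGroup response."""
    offline_config = feature_group_description.get("OfflineStoreConfig")
    if offline_config is None:
        name = feature_group_description.get("FeatureGroupName", "<unknown>")
        raise ValueError(f"Feature group {name} has no offline store")
    catalog_config = offline_config["DataCatalogConfig"]
    return (
        catalog_config["Catalog"],
        catalog_config["Database"],
        catalog_config["TableName"],
    )


# Hypothetical response fragment used only for illustration
example_description = {
    "FeatureGroupName": "example-customers",
    "RecordIdentifierFeatureName": "customer_id",
    "EventTimeFeatureName": "event_time",
    "OfflineStoreConfig": {
        "DataCatalogConfig": {
            "TableName": "example_customers_table",
            "Catalog": "AwsDataCatalog",
            "Database": "sagemaker_featurestore",
        }
    },
}

catalog_name, db_name, table_name = offline_table_info(example_description)
```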

In [None]:
exclude_fsets = [
    "customer_id",
    "product_id",
    "order_id",
    "event_time",
    "purchase_amount",
    "n_days_since_last_purchase",
]

In [None]:
def generate_fsets(fg_list, exclude_fsets=None):
    _fg_lst = []
    for _fg in fg_list:
        _fg_tmp = pd.DataFrame(
            Utils.describe_feature_group(_fg["FeatureGroupName"])["FeatureDefinitions"]
        )
        if exclude_fsets:
            _fg_tmp = _fg_tmp[~_fg_tmp.FeatureName.isin(exclude_fsets)]

        _fg_lst.append(_fg_tmp)
    return pd.concat(_fg_lst, ignore_index=True)

In [None]:
fsets_df = generate_fsets([orders_fg, customers_fg, products_fg], exclude_fsets)
features_names = fsets_df.FeatureName.tolist()
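
For readers who want to see the filtering logic in isolation, here is a dependency-free sketch of what `generate_fsets` does for a single feature group: keep every feature name that is not in the exclude list, in the original order. The function name `select_feature_names` and the `example_feature` entry are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def select_feature_names(
    feature_definitions: Iterable[Dict[str, str]],
    exclude: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return feature names in order, skipping any listed in exclude."""
    excluded = set(exclude or [])
    return [
        definition["FeatureName"]
        for definition in feature_definitions
        if definition["FeatureName"] not in excluded
    ]


# Hypothetical feature definitions shaped like the "FeatureDefinitions" field
example_definitions = [
    {"FeatureName": "order_id", "FeatureType": "String"},
    {"FeatureName": "example_feature", "FeatureType": "Integral"},
    {"FeatureName": "event_time", "FeatureType": "Fractional"},
]
example_names = select_feature_names(example_definitions, ["order_id", "event_time"])
```

Order matters here: the feature groups are concatenated in the order they are passed, and the resulting column positions are what the batch transform filters rely on later.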

#### Run a SageMaker Processing Job to generate test set for batch job

[SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) is a tool for analyzing data and evaluating machine learning models. With Processing, you can use a simplified, managed experience on SageMaker to run your data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation. In short, SageMaker takes your custom script, copies the data from Amazon S3, and then pulls a processing container to execute the script, which performs the data processing and any other actions needed.

When creating a [Processing job](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateProcessingJob.html), you specify the `ProcessingInputs` parameter, which tells the SageMaker service where to get the input data. If the data is already available on S3, we can use `S3Input` to define the inputs for the processing job. In our example, however, the data is stored in the offline Feature Store, so we use [DatasetDefinition](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DatasetDefinition.html), which supports data sources that can be queried via Athena and Redshift. We use the [AthenaDatasetDefinition](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AthenaDatasetDefinition.html) option: it executes a SQL query and writes the resulting dataset to S3, where it becomes available as input to the processing job.


We start by creating an Athena query to get the test data from the feature store. Note that the first column is the order ID (the record identifier of the orders feature group) and the second column is the target value. The query should take only the latest version of each record, including when a record has multiple write times for the same event_time.

In [None]:
batch_transform_columns_string = ",\n    ".join(f'"{c}"' for c in features_names)

customer_uid = customers_fg["RecordIdentifierFeatureName"]
product_uid = products_fg["RecordIdentifierFeatureName"]
order_uid = orders_fg["RecordIdentifierFeatureName"]

customer_et = customers_fg["EventTimeFeatureName"]
product_et = products_fg["EventTimeFeatureName"]
order_et = orders_fg["EventTimeFeatureName"]


query_string = f"""WITH customer_table AS (
    SELECT *,
        dense_rank() OVER (
            PARTITION BY "{customer_uid}"
            ORDER BY "{customer_et}" DESC,
                "api_invocation_time" DESC,
                "write_time" DESC
        ) AS "rank"
    FROM "{customers_table}"
    WHERE NOT "is_deleted"
),
product_table AS (
    SELECT *,
        dense_rank() OVER (
            PARTITION BY "{product_uid}"
            ORDER BY "{product_et}" DESC,
                "api_invocation_time" DESC,
                "write_time" DESC
        ) AS "rank"
    FROM "{products_table}"
    WHERE NOT "is_deleted"
),
order_table AS (
    SELECT *,
        dense_rank() OVER (
            PARTITION BY "{order_uid}"
            ORDER BY "{order_et}" DESC,
                "api_invocation_time" DESC,
                "write_time" DESC
        ) AS "rank"
    FROM "{orders_table}"
    WHERE NOT "is_deleted"
)
SELECT DISTINCT
    "{order_uid}",
    {batch_transform_columns_string}
FROM customer_table,
    product_table,
    order_table
WHERE order_table."customer_id" = customer_table."customer_id"
    AND order_table."product_id" = product_table."product_id"
    AND customer_table."rank" = 1
    AND product_table."rank" = 1
    AND order_table."rank" = 1
"""
print(query_string)
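
The "latest version" logic in each CTE above can be mimicked in plain Python. The sketch below is only an approximation under simplifying assumptions: timestamps are plain integers, the `age` column is hypothetical, and ties keep the first row seen, whereas `dense_rank()` would give tied rows the same rank and the query relies on `SELECT DISTINCT` to collapse duplicates.

```python
from typing import Dict, Iterable, List


def latest_records(rows: Iterable[Dict], id_key: str, event_time_key: str) -> List[Dict]:
    """Keep, per record ID, the non-deleted row with the latest timestamps."""
    latest: Dict = {}
    for row in rows:
        # Mirrors WHERE NOT "is_deleted": deleted rows are dropped before ranking
        if row.get("is_deleted"):
            continue
        # Mirrors ORDER BY event time, api_invocation_time, write_time (all DESC)
        sort_key = (row[event_time_key], row["api_invocation_time"], row["write_time"])
        current = latest.get(row[id_key])
        if current is None or sort_key > current[0]:
            latest[row[id_key]] = (sort_key, row)
    return [row for _, row in latest.values()]


# Hypothetical rows: C1 was written twice for the same event_time, C2 was deleted
example_rows = [
    {"customer_id": "C1", "event_time": 100, "api_invocation_time": 1,
     "write_time": 1, "is_deleted": False, "age": 30},
    {"customer_id": "C1", "event_time": 100, "api_invocation_time": 2,
     "write_time": 2, "is_deleted": False, "age": 31},
    {"customer_id": "C2", "event_time": 90, "api_invocation_time": 1,
     "write_time": 1, "is_deleted": True, "age": 40},
]
latest = latest_records(example_rows, "customer_id", "event_time")
```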

In [None]:
create_batchdata_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=sagemaker_execution_role,
    instance_type="ml.m5.xlarge",
    instance_count=2,
    base_job_name=f"{prefix}-batch",
    sagemaker_session=sagemaker_session,
)

In [None]:
athena_data_path = "/opt/ml/processing/athena"
athena_output_s3_uri = f"s3://{default_bucket}/{prefix}/athena/data/"
data_sources = [
    ProcessingInput(
        input_name="athena_dataset",
        dataset_definition=DatasetDefinition(
            local_path=athena_data_path,
            data_distribution_type="ShardedByS3Key",
            athena_dataset_definition=AthenaDatasetDefinition(
                catalog=catalog,
                database=database_name,
                query_string=query_string,
                output_s3_uri=athena_output_s3_uri,
                output_format="PARQUET",
            ),
        ),
    )
]

The following processing script reads the Athena query outputs (Parquet files) and saves them as CSV files, which SageMaker Batch Transform jobs can use directly.

In [None]:
%%writefile create_batchdata.py
import argparse
import uuid
from pathlib import Path

import pandas as pd

# Parse argument variables passed via the CreateDataset processing step
parser = argparse.ArgumentParser()
parser.add_argument("--athena-data", type=str)
args = parser.parse_args()

# Read all Parquet files produced by the Athena query
dataset = pd.read_parquet(args.athena_data, engine="pyarrow")

# Write the dataset as a headerless CSV file to the output path
dataset_output_path = Path("/opt/ml/processing/output/dataset")
dataset.to_csv(
    dataset_output_path / f"dataset-{uuid.uuid4()}.csv", index=False, header=False
)
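
The script writes a headerless CSV file with a random UUID in its name. One plausible reason, which is an assumption rather than something the notebook states, is that the processing job runs on two instances that each write to the same output location, so unique names prevent one file from overwriting another. A standard-library sketch of the same pattern, using a hypothetical helper `write_headerless_csv` and made-up rows:

```python
import csv
import tempfile
import uuid
from pathlib import Path
from typing import Iterable, Sequence


def write_headerless_csv(rows: Iterable[Sequence], output_dir: Path) -> Path:
    """Write rows to a uniquely named CSV file without a header row."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_file = output_dir / f"dataset-{uuid.uuid4()}.csv"
    with output_file.open("w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)
    return output_file


with tempfile.TemporaryDirectory() as tmp_dir:
    example_rows = [("O1", 1, 0.5), ("O2", 0, 0.7)]
    first = write_headerless_csv(example_rows, Path(tmp_dir))
    second = write_headerless_csv(example_rows, Path(tmp_dir))
    first_content = first.read_text()
    distinct_names = first.name != second.name
```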

In [None]:
destination_s3_path = f"s3://{default_bucket}/{prefix}/{name_from_base('batch')}"
create_batchdata_processor.run(
    code="create_batchdata.py",
    arguments=[
        "--athena-data",
        athena_data_path,
    ],
    inputs=data_sources,
    outputs=[
        ProcessingOutput(
            output_name="batch_transform_data",
            source="/opt/ml/processing/output/dataset",
            destination=destination_s3_path,
        )
    ],
)

## Batch Transform

SageMaker Batch Transform provides three attributes for associating inputs with predictions: __input_filter__, __join_source__ and __output_filter__. In the cells below, we use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off a Batch Transform job that uses these attributes. Please refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to learn more about how to use them.

#### 1. Create a model based on the pre-trained model artifacts on S3
Let's first create a model based on the training job from the previous notebook. We can use the `describe_training_job` boto3 API call to get the model data URI and the container used for the training job. Please note that the pre-built [XGBoost container](https://github.com/aws/sagemaker-xgboost-container) uses the same container image for training and hosting. However, if you are using other framework containers, such as TensorFlow or PyTorch, the training and inference containers are different. For more details about the available deep learning containers, please refer to the [GitHub page](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

In [None]:
%store -r training_jobName

training_job_info = sagemaker_client.describe_training_job(
    TrainingJobName=training_jobName
)
xgb_model_data = training_job_info["ModelArtifacts"]["S3ModelArtifacts"]
container_uri = training_job_info['AlgorithmSpecification']['TrainingImage']

xgb_model = Model(
    image_uri=container_uri,
    model_data=xgb_model_data,
    role=sagemaker_execution_role,
    name=name_from_base("fs-workshop-xgboost-model"),
    sagemaker_session=sagemaker_session,
)

#### 2. Join the input and the prediction results 
Now, let's associate the prediction results with their corresponding input records. We can also use the __input_filter__ to exclude the order ID column easily and there's no need to have a separate file in S3.

* Set __input_filter__ to "$[2:]": indicates that we are excluding column 0 (the 'order_id') and column 1 (the target value) before processing the inferences, and keeping everything from column 2 to the last column (all the features or predictors)  
  
  
* Set __join_source__ to "Input": indicates our desire to join the input data with the inference results  

* Leave __output_filter__ at its default ('$'), indicating that the joined input and inference results will be saved as output.
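
The combined effect of the three settings above on a single CSV record can be sketched in plain Python. This is an illustration only, not how the service is implemented: `filter_and_join` and `fake_predict` are hypothetical names, the record is made up, and the naive comma split ignores quoted fields.

```python
from typing import Callable, List


def filter_and_join(
    line: str, predict: Callable[[List[str]], str], start: int = 2
) -> str:
    """Emulate input_filter="$[2:]", join_source="Input", output_filter="$"."""
    columns = line.split(",")
    model_input = columns[start:]  # input_filter: drop the ID and target columns
    prediction = predict(model_input)  # stands in for the model container
    return ",".join(columns + [prediction])  # join: append prediction to the input


# Hypothetical record: order ID, target, then three feature values
example_line = "O-1001,1,0.25,3,7"
seen_by_model: List[List[str]] = []


def fake_predict(features: List[str]) -> str:
    """Record what the model receives and return a made-up score."""
    seen_by_model.append(features)
    return "0.87"


joined = filter_and_join(example_line, fake_predict)
```

The model only ever sees the feature columns, while the saved output keeps the order ID and target alongside the appended prediction.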

In [None]:
xgb_transformer = xgb_model.transformer(instance_count=1, instance_type="ml.m5.xlarge")

# content_type / accept and split_type / assemble_with are required to use the input/output joining feature
xgb_transformer.assemble_with = "Line"
xgb_transformer.accept = "text/csv"

# start a transform job
xgb_transformer.transform(
    destination_s3_path,
    content_type="text/csv",
    split_type="Line",
    input_filter="$[2:]",
    join_source="Input",
)
xgb_transformer.wait()

Let's inspect the output of the Batch Transform job in S3. 

In [None]:
s3 = boto3.resource("s3")


def list_s3_files(s3uri):
    parsed_url = urlparse(s3uri)
    bucket = s3.Bucket(parsed_url.netloc)
    prefix = parsed_url.path[1:]
    return [
        dict(bucket_name=k.bucket_name, key=k.key)
        for k in bucket.objects.filter(Prefix=prefix)
    ]
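
The helper above relies on `urlparse` treating the bucket as the network location and the key prefix as the path. A small standalone sketch of that split, with a hypothetical function name and bucket, adds a check for malformed URIs:

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split an S3 URI into (bucket, key prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    # The path starts with "/", which is not part of the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, key_prefix = split_s3_uri("s3://example-bucket/some/prefix/")
```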

In [None]:
output_file_list = list_s3_files(xgb_transformer.output_path)
output_file_list

In [None]:
s3_obj = s3.Object(**output_file_list[0])
body = s3_obj.get()['Body']
csv_string = body.read().decode('utf-8')

pd.read_csv(StringIO(csv_string), 
            header=None)
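
Because the input was joined with the inference results, each output row should start with the order ID and end with the prediction. As a minimal sketch, assuming that layout, the snippet below pairs each order ID with its score using only the standard library. The rows in `example_output` are made up for illustration and are not results from this job.

```python
import csv
from io import StringIO

# Hypothetical joined output: order ID, target, features..., prediction
example_output = "O-1001,1,0.25,3,7,0.87\nO-1002,0,0.10,1,2,0.12\n"

# Map each order ID (first column) to its prediction (last column)
scores = {
    row[0]: float(row[-1])
    for row in csv.reader(StringIO(example_output))
}
```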