# Module 3: Model Training
**This notebook uses the feature set extracted by `module-2` to create an XGBoost-based machine learning model for binary classification.**

**Note:** Please set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Load transformed feature set](#Load-transformed-feature-set)
1. [Split data](#Split-data)
1. [Train a model using SageMaker built-in XgBoost algorithm](#Train-a-model-using-SageMaker-built-in-XgBoost-algorithm)
1. [Batch Scoring using the trained XGBoost model](#Batch-Scoring-using-the-trained-XGBoost-model)
1. [Real time inference using the deployed endpoint](#Real-time-inference-using-the-deployed-endpoint)

# Background

In this notebook, we demonstrate how to use the feature set derived in `Module-2` to create a machine learning model that predicts, based on historical records, whether a customer will reorder a product. Since this is a supervised binary classification problem, we will use XGBoost, a SageMaker built-in algorithm, to build the classifier. Once the model is trained, we will also deploy it as a SageMaker endpoint for real-time inference.

# Setup

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.inputs import TrainingInput
from sagemaker.predictor import Predictor
from sagemaker import get_execution_role
import pandas as pd
import numpy as np
import sagemaker
import logging
import boto3
import json
import os
import sys
sys.path.append('..')
from utilities import Utils

In [None]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

#### Essentials

In [None]:
sagemaker_execution_role = get_execution_role()
logger.info(f'Role = {sagemaker_execution_role}')
session = boto3.Session()
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker-featurestore-workshop'
s3 = session.resource('s3')

# Load transformed feature set

In [None]:
df = pd.read_csv('../data/train/transformed.csv')
df.head(5)

In [None]:
df.shape

Move the `is_reordered` column to the first position, since our training algorithm `XGBoost` expects the target column to be the first column.

In [None]:
first_column = df.pop('is_reordered')
df.insert(0, 'is_reordered', first_column)
df.head()

# Split data

We will first shuffle the whole dataset with `df.sample(frac=1, random_state=123)` and then split it into the following parts:

* 70% - train set,
* 20% - validation set,
* 10% - test set

**Note:** In the code below, the two values passed to `np.split` are the row positions at which the shuffled dataset is cut. The first cut (0.7 = 70% of the rows) ends the train set and the second cut (0.9 = 90%) ends the validation set, so validation gets 0.9 - 0.7 = 0.2 = 20% and the remaining 1 - 0.9 = 0.1 = 10% becomes the test set.

In [None]:
train_df, validation_df, test_df = np.split(df.sample(frac=1, random_state=123), [int(.7*len(df)), int(.9*len(df))])
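
To make the cut points concrete, here is a minimal sketch on a toy 10-row DataFrame. The helper `split_by_cut_points` and the toy column values are hypothetical and not part of this workshop. It slices with `iloc` rather than `np.split`, which should produce the same three parts.

```python
import pandas as pd


def split_by_cut_points(frame, cut_points=(0.7, 0.9), seed=123):
    """Shuffle `frame`, then slice it at the given fractional cut points."""
    shuffled = frame.sample(frac=1, random_state=seed)
    first_cut = int(cut_points[0] * len(shuffled))   # end of the train set
    second_cut = int(cut_points[1] * len(shuffled))  # end of the validation set
    return (
        shuffled.iloc[:first_cut],
        shuffled.iloc[first_cut:second_cut],
        shuffled.iloc[second_cut:],
    )


# Toy data standing in for the transformed feature set (hypothetical values)
toy_df = pd.DataFrame({"is_reordered": [0, 1] * 5, "feature": range(10)})
toy_train, toy_validation, toy_test = split_by_cut_points(toy_df)
print(len(toy_train), len(toy_validation), len(toy_test))
```

With 10 rows, the three parts contain 7, 2 and 1 rows respectively, and every row lands in exactly one part.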

In [None]:
train_df.shape

In [None]:
validation_df.shape

In [None]:
test_df.shape

Save the split datasets locally

In [None]:
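# SageMaker's built-in XGBoost expects CSV training data without a header row
train_df.to_csv('../data/train/train.csv', index=False, header=False)
validation_df.to_csv('../data/validation/validation.csv', index=False, header=False)
# The test set is not passed to the training job, so it keeps its header
test_df.to_csv('../data/test/test.csv', index=False)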

Copy the datasets from local storage to S3

In [None]:
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('../data/train/train.csv')
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('../data/validation/validation.csv')
s3.Bucket(default_bucket).Object(os.path.join(prefix, 'test/test.csv')).upload_file('../data/test/test.csv')

Create pointers to the uploaded files

In [None]:
train_set_location = 's3://{}/{}/train/'.format(default_bucket, prefix)
validation_set_location = 's3://{}/{}/validation/'.format(default_bucket, prefix)
test_set_location = 's3://{}/{}/test/'.format(default_bucket, prefix)

In [None]:
train_set_pointer = TrainingInput(s3_data=train_set_location, content_type='csv')
validation_set_pointer = TrainingInput(s3_data=validation_set_location, content_type='csv')
test_set_pointer = TrainingInput(s3_data=test_set_location, content_type='csv')

In [None]:
print(json.dumps(train_set_pointer.__dict__, indent=2))

# Train a model using SageMaker built-in XgBoost algorithm

In [None]:
container_uri = sagemaker.image_uris.retrieve(region=session.region_name, 
                                              framework='xgboost', 
                                              version='1.0-1', 
                                              image_scope='training')

In [None]:
xgb = sagemaker.estimator.Estimator(image_uri=container_uri,
                                    role=sagemaker_execution_role, 
                                    instance_count=2, 
                                    instance_type='ml.m5.xlarge',
                                    output_path='s3://{}/{}/model-artifacts'.format(default_bucket, prefix),
                                    sagemaker_session=sagemaker_session,
                                    base_job_name='reorder-classifier')

xgb.set_hyperparameters(objective='binary:logistic',
                        num_round=100)

In [None]:
xgb.fit({'train': train_set_pointer, 'validation': validation_set_pointer})

# Batch Scoring using the trained XGBoost model

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.dataset_definition.inputs import (
    AthenaDatasetDefinition,
    DatasetDefinition,
)

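region = sagemaker_session.boto_region_name
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
# Runtime client for the feature store; the session below needs it
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = sagemaker.Session(boto_session=boto_session, 
                                          sagemaker_client=sagemaker_client, 
                                          sagemaker_featurestore_runtime_client=featurestore_runtime)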

#### Option 1: Prepare test data for the batch transform job using a processing job with *AthenaDatasetDefinition*
This approach first builds the list of feature names to read from the offline feature store. We pass the feature groups as a list, together with a list of features to exclude, to the *generate_fsets* function. The result is the list of features that the model expects at inference time.

In [None]:
# Retrieve feature group names
%store -r customers_feature_group_name
%store -r products_feature_group_name
%store -r orders_feature_group_name

customers_fg = sagemaker_client.describe_feature_group(FeatureGroupName=customers_feature_group_name)  
products_fg = sagemaker_client.describe_feature_group(FeatureGroupName=products_feature_group_name)
orders_fg = sagemaker_client.describe_feature_group(FeatureGroupName=orders_feature_group_name)

database_name = customers_fg["OfflineStoreConfig"]["DataCatalogConfig"]["Database"]
catalog = customers_fg["OfflineStoreConfig"]["DataCatalogConfig"]["Catalog"]

customers_table = customers_fg["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]
products_table = products_fg["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]
orders_table = orders_fg["OfflineStoreConfig"]["DataCatalogConfig"]["TableName"]

In [None]:
exclude_fsets = ['customer_id', 'product_id', 'order_id', 'event_time', 'purchase_amount', 'n_days_since_last_purchase']
target_fname = 'is_reordered'

In [None]:
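def generate_fsets(fg_list, exclude_fsets=None, target_fname=None):
    # Collect the feature definitions of every feature group, dropping excluded features.
    # Note: target_fname is accepted but not used by this function.
    _fg_lst = []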
    for _fg in fg_list:
        _fg_tmp = pd.DataFrame(Utils.describe_feature_group(_fg['FeatureGroupName'])['FeatureDefinitions'])
        if exclude_fsets:
            _fg_tmp = _fg_tmp[~_fg_tmp.FeatureName.isin(exclude_fsets)]
            
        _fg_lst.append(_fg_tmp)
    return pd.concat(_fg_lst, ignore_index=True)

In [None]:
fsets_df = generate_fsets([orders_fg, customers_fg, products_fg], exclude_fsets)
features_names = fsets_df.FeatureName.tolist()
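
The filtering step inside `generate_fsets` can be tried without any AWS calls. The sketch below assumes a hand-written list shaped like the `FeatureDefinitions` entry of a `DescribeFeatureGroup` response. The feature names and the exclusion list are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical feature definitions, shaped like the 'FeatureDefinitions'
# list in a DescribeFeatureGroup response
toy_definitions = [
    {"FeatureName": "order_id", "FeatureType": "String"},
    {"FeatureName": "is_reordered", "FeatureType": "Integral"},
    {"FeatureName": "event_time", "FeatureType": "Fractional"},
    {"FeatureName": "n_items", "FeatureType": "Integral"},
]
toy_exclude = ["order_id", "event_time"]  # hypothetical exclusion list

toy_fg = pd.DataFrame(toy_definitions)
# Keep only the rows whose FeatureName is not in the exclusion list
toy_kept = toy_fg[~toy_fg.FeatureName.isin(toy_exclude)]
toy_feature_names = toy_kept.FeatureName.tolist()
print(toy_feature_names)
```

A feature stays in the result unless it appears in the exclusion list, and the original order of the definitions is preserved.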

In [None]:

batch_transform_columns_string = ",".join(f'"{c}"' for c in features_names)
batch_transform_columns_string

In [None]:
query_string = f'SELECT {batch_transform_columns_string} FROM "{customers_table}", "{products_table}", "{orders_table}" ' \
               f'WHERE ("{orders_table}"."customer_id" = "{customers_table}"."customer_id") ' \
               f'AND ("{orders_table}"."product_id" = "{products_table}"."product_id")'
query_string
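
The query above joins the three tables implicitly through its `WHERE` clause. As an optional alternative, the same inner join could be written with explicit `JOIN ... ON` clauses. The sketch below is only an illustration: `build_join_query` is a hypothetical helper, and the table and column names passed to it are placeholders rather than the real offline store tables.

```python
def build_join_query(columns, orders, customers, products):
    """Build an explicit-JOIN version of the batch data query."""
    column_list = ",".join(f'"{c}"' for c in columns)
    return (
        f'SELECT {column_list} FROM "{orders}" '
        f'JOIN "{customers}" ON "{orders}"."customer_id" = "{customers}"."customer_id" '
        f'JOIN "{products}" ON "{orders}"."product_id" = "{products}"."product_id"'
    )


# Placeholder names (hypothetical), used only to show the resulting SQL text
toy_query = build_join_query(
    ["is_reordered", "n_items"], "orders_tbl", "customers_tbl", "products_tbl"
)
print(toy_query)
```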

In [None]:
create_batchdata_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=sagemaker_execution_role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name=f"{prefix}-batch",
    sagemaker_session=sagemaker_session,
)


In [None]:
athena_data_path = "/opt/ml/processing/athena"
data_sources = []
athena_output_s3_uri = f"s3://{default_bucket}/{prefix}/athena/data/"
data_sources.append(
    ProcessingInput(
        input_name="athena_dataset",
        dataset_definition=DatasetDefinition(
            local_path=athena_data_path,
            data_distribution_type="FullyReplicated",
            athena_dataset_definition=AthenaDatasetDefinition(
                catalog=catalog,
                database=database_name,
                query_string=query_string,
                output_s3_uri=athena_output_s3_uri,
                output_format="PARQUET",
            ),
        ),
    )
)

In [None]:
%%writefile create_batchdata.py
import argparse
from pathlib import Path

import pandas as pd

# Parse the arguments passed in by the processing job
parser = argparse.ArgumentParser()
parser.add_argument("--athena-data", type=str)
args = parser.parse_args()

# Read the Parquet files produced by the Athena query
dataset = pd.read_parquet(args.athena_data, engine="pyarrow")

# Write the dataset as a header-less CSV to the processing output path
dataset_output_path = Path("/opt/ml/processing/output/dataset")
dataset_output_path.mkdir(parents=True, exist_ok=True)
dataset.to_csv(dataset_output_path / "dataset.csv", index=False, header=False)

In [None]:
destination_s3_path = f"s3://{default_bucket}/{prefix}/batch"
create_batchdata_processor.run(
    code='create_batchdata.py',
    arguments=[
        "--athena-data",
        athena_data_path,
    ],
    inputs=data_sources,
    outputs=[
        ProcessingOutput(
            output_name="batch_transform_data",
            source="/opt/ml/processing/output/dataset",
            destination=destination_s3_path,
        )
    ],
        
)

#### Option 2: Use the feature sets utility function to extract features from the offline store

In [None]:
## TODO

## Batch Transform


SageMaker Batch Transform provides three attributes for associating input records with predictions: __input_filter__, __join_source__ and __output_filter__. In the cells below, we use the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to start a Batch Transform job that uses these attributes. Please refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to learn more about how to use them.

#### 1. Create a model from the trained model artifacts in S3
Let's first create a model from the artifacts of the training job completed in the training section above.

In [None]:
from sagemaker.model import Model
from time import gmtime, strftime
xgb_model = Model(
    image_uri=container_uri,
    model_data=xgb.model_data,
    role=sagemaker_execution_role,
    name="fs-workshop-xgboost-model-" + strftime("%Y-%m-%d-%H-%M", gmtime()),
    sagemaker_session=sagemaker_session,
)

#### 2. Join the input and the prediction results 
Now, let's associate the prediction results with their corresponding input records. We can also use the __input_filter__ to exclude the ID column easily and there's no need to have a separate file in S3.

* Set __input_filter__ to "$[1:]": indicates that we are excluding column 0 (the 'ID') before processing the inferences and keeping everything from column 1 to the last column (all the features or predictors)  
  
  
* Set __join_source__ to "Input": indicates our desire to join the input data with the inference results  

* Leave __output_filter__ at its default ('$'), indicating that the joined input and inference results will be saved as output.
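
The effect of these three settings on a single CSV record can be mimicked in plain Python. This is only an illustration of the idea, not how the service is implemented: `emulate_batch_join` and `fake_predict` are hypothetical names, and the record values are made up.

```python
seen_inputs = []  # records what the stand-in model actually receives


def fake_predict(model_input):
    # Stand-in for the trained model (hypothetical): always returns 0.5
    seen_inputs.append(model_input)
    return "0.5"


def emulate_batch_join(csv_line, predict, first_kept_column=1):
    """Mimic input_filter="$[1:]" with join_source="Input" for one CSV record."""
    columns = csv_line.split(",")
    # input_filter: drop column 0 before the record reaches the model
    model_input = ",".join(columns[first_kept_column:])
    prediction = predict(model_input)
    # join_source="Input": append the prediction to the original, unfiltered record
    # output_filter="$": keep the whole joined record
    return f"{csv_line},{prediction}"


joined = emulate_batch_join("1001,3,0.25,7", fake_predict)
print(joined)
```

The model only sees `3,0.25,7`, while the saved record still starts with the excluded first column and ends with the prediction.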

In [None]:
xgb_transformer = xgb_model.transformer(instance_count=1, instance_type="ml.m5.xlarge")

# content_type / accept and split_type / assemble_with are required to use IO joining feature
xgb_transformer.assemble_with = "Line"
xgb_transformer.accept = "text/csv"

# start a transform job
xgb_transformer.transform(destination_s3_path, 
                         content_type="text/csv", 
                         split_type="Line",
                         input_filter="$[1:]",
                         join_source="Input",
                        )
xgb_transformer.wait()

Let's inspect the output of the Batch Transform job in S3. Each line should contain an input record with its original feature columns, followed by the model's prediction for that record.

In [None]:
import json
import io
from urllib.parse import urlparse


def get_csv_output_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:]
    s3 = boto3.resource("s3")
    obj = s3.Object(bucket_name, "{}/{}".format(prefix, file_name))
    return obj.get()["Body"].read().decode("utf-8")

In [None]:
output_df = get_csv_output_from_s3(xgb_transformer.output_path, 'dataset.csv.out')
output_df.split('\n')[0]
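
Because `get_csv_output_from_s3` returns the file contents as one string, a natural next step is to load them into a DataFrame. The sketch below assumes made-up output lines in which the prediction is the last column. With the `binary:logistic` objective the prediction is a probability, and the 0.5 cut-off used to turn it into a label is an assumption rather than part of this workshop.

```python
import io

import pandas as pd

# Hypothetical transform output: original columns followed by the prediction
toy_output = "0,3,0.25,7,0.81\n1,5,0.10,2,0.12\n"

# The output file has no header row, so columns are numbered 0..n
toy_results = pd.read_csv(io.StringIO(toy_output), header=None)
prediction_column = toy_results.columns[-1]  # the model output is appended last

# Assumed threshold of 0.5 to convert the probability into a class label
toy_results["predicted_label"] = (toy_results[prediction_column] >= 0.5).astype(int)
print(toy_results)
```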