## Title: Building Customised Amazon SageMaker XGBoost
Author: Yiran Jing
- Use your own algorithms or models with Amazon SageMaker
- Automatically process raw data before making predictions

Date: 02-07-2019

- Follows on from **[AWS_BUILTIN_MODEL_DEPLOYMENT](https://github.com/YiranJing/BigDataAnalysis/blob/master/AWS_SageMaker_CustomerChurn/notebook/AmazonSageMaker/AWS_BUILTIN_MODEL_DEPLOYMENT.ipynb)**
- Data cleaning and feature engineering details are in **[Churn_Example](https://github.com/YiranJing/BigDataAnalysis/blob/master/AWS_SageMaker_CustomerChurn/notebook/ChurnDataAnalysis/Churn_Example.ipynb)**



#### Dataset
- Raw dataset: Telco-Customer-Churn.csv
- Cleaned and transformed by a UDF

Please note that an XGBoost model trained through the scikit-learn interface is compatible with the SageMaker XGBoost container, whereas other gradient-boosted tree models (such as one trained in SparkML) are not.

In [134]:
%%time
import pandas as pd
import os
import boto3
import re
import json
import sagemaker
import numpy as np
from sagemaker import get_execution_role
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sagemaker.amazon.amazon_estimator import get_image_uri
from scipy import stats
import xgboost as xgb
import sklearn as sk 
import warnings
warnings.filterwarnings('ignore')

## import the UDF that processes the raw data
from clean_transformation_churn import get_train_validation_test_data

CPU times: user 55 µs, sys: 7 µs, total: 62 µs
Wall time: 66.3 µs


### Set up Amazon SageMaker role

In [135]:
region = boto3.Session().region_name
role = get_execution_role()
sagemaker_session = sagemaker.Session()

bucket = 'taysolsdev'
prefix = 'datasets/churn'

#### Install XGBoost
Note that for a conda-based installation, you will need to switch the notebook kernel to a conda environment with Python 3.
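
Before switching kernels, it can help to check whether `xgboost` is already importable from the active one. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration and is not part of the original notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current kernel can locate the given top-level module."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("xgboost"):
    # Typical install commands (run from a terminal or a notebook cell):
    #   pip install xgboost
    #   conda install -c conda-forge xgboost
    print("xgboost is not available in this kernel")
```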

### Fetch the dataset
Unlike the standard (built-in) AWS model, here we need to split the target y from the features X before training the model.

We can use pandas to read in the data. Do not forget to set **header=None**.
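
To see why `header=None` matters, here is a small sketch on made-up, header-less CSV text (the values are invented for illustration). Without the flag, pandas treats the first data row as column names and silently drops it from the data.

```python
import io

import pandas as pd

# Two rows of made-up data with no header line; the first column is the target.
csv_text = "1,0.5,3\n0,1.2,7\n"

with_header_none = pd.read_csv(io.StringIO(csv_text), header=None)
default_header = pd.read_csv(io.StringIO(csv_text))

print(with_header_none.shape)  # both rows are kept as data
print(default_header.shape)    # the first row was consumed as column names
```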

In [136]:
data_path = 's3://taysolsdev/datasets/Telco-Customer-Churn.csv'

train_set, valid_set, test_set, batch_input = get_train_validation_test_data(data_path)

The lmbda is: 0.25614406206807805


In [133]:
# the batch dataset used for prediction must not contain the target column
batch_output = 's3://{}/{}/batch/batch-inference'.format(bucket, prefix) # specify the location of batch output

### Train the XGBClassifier
Note that when training a scikit-learn style model for AWS, we need a validation dataset for model training, and **Y is the first column** (this is what the `iloc` slicing in the next cell assumes).

In [None]:
# split X and Y for the datasets
train_y = train_set.iloc[:,0] # 70% 
train_X = train_set.iloc[:,1:]

valid_y = valid_set.iloc[:,0]  # 20%
valid_X = valid_set.iloc[:,1:]

test_y = test_set.iloc[:,0]  # 10%
test_X = test_set.iloc[:,1:]

# Set up the XGBoost model
bt = xgb.XGBClassifier( max_depth=3,
                        verbosity=1,
                        random_state=960428, # fixed seed for reproducibility
                        gamma=0,
                        subsample=1,
                        reg_lambda=1,
                        silent=0, # silent must be an integer, cannot be None
                        colsample_bytree=1,
                        min_child_weight=1,
                        learning_rate=0.02,
                        tree_method='hist',
                        n_estimators=200,
                        class_weight='balanced', # not a native XGBoost parameter; scale_pos_weight is the usual imbalance control
                        objective='binary:logistic') # binary classification


bt.fit(train_X, train_y, # Train it to our data
       eval_set=[(valid_X, valid_y)]) 
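
The UDF `get_train_validation_test_data` is not shown in this notebook. As a rough sketch of how a 70% / 20% / 10% split with the target in the first column might be produced, the following uses a made-up frame; the function name `split_70_20_10`, the column names, and the seed are all hypothetical and may differ from what the UDF actually does.

```python
import numpy as np
import pandas as pd


def split_70_20_10(df: pd.DataFrame, seed: int = 0):
    """Shuffle the rows, then cut them into 70% / 20% / 10% parts."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = len(shuffled) * 7 // 10
    n_valid = len(shuffled) * 2 // 10
    train = shuffled.iloc[:n_train]
    valid = shuffled.iloc[n_train:n_train + n_valid]
    test = shuffled.iloc[n_train + n_valid:]
    return train, valid, test


# Made-up frame: target in column 0, two features after it.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Churn": rng.integers(0, 2, size=100),
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
})

train_demo, valid_demo, test_demo = split_70_20_10(demo)

# Same slicing as in the cell above: first column is y, the rest is X.
y_demo, X_demo = train_demo.iloc[:, 0], train_demo.iloc[:, 1:]
```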


### Save the trained model file
- Note that the model file name must satisfy the regular expression pattern:
`^[a-zA-Z0-9](-*[a-zA-Z0-9])*`
- The model file also needs to be packed as a gzipped tar archive (`.tar.gz`).
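
The naming rule can be checked locally before uploading anything. This is a small sketch with Python's `re` module; the helper name `is_valid_model_name` is made up for illustration, and matching the whole string with `fullmatch` is an assumption about how the pattern is meant to be applied.

```python
import re

# Pattern quoted in the note above; fullmatch anchors it at both ends.
MODEL_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def is_valid_model_name(name: str) -> bool:
    """Return True if the whole name matches the naming pattern."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_model_name("DEMO-customised-xgboost-model"))  # True
print(is_valid_model_name("demo_model"))                     # False: underscore
```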


In [111]:
model_file_name = "DEMO-customised-xgboost-model"
bt._Booster.save_model(model_file_name)

In [112]:
!tar czvf model.tar.gz $model_file_name

DEMO-customised-xgboost-model
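
If shell commands are not available, the same archive can be built with Python's standard `tarfile` module. The sketch below works in a temporary directory with placeholder bytes standing in for the saved booster, so it does not touch the real model file.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the saved booster file; the bytes are placeholder content.
    model_path = os.path.join(workdir, "DEMO-customised-xgboost-model")
    with open(model_path, "wb") as f:
        f.write(b"placeholder model bytes")

    archive_path = os.path.join(workdir, "model.tar.gz")
    # "w:gz" writes a gzip-compressed tar, like the c and z flags of tar.
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the file at the archive root, without the temp dir path.
        tar.add(model_path, arcname=os.path.basename(model_path))

    # Read the archive back to confirm what it contains.
    with tarfile.open(archive_path, "r:gz") as tar:
        archive_members = tar.getnames()

print(archive_members)
```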


### Upload the pre-trained model to S3

In [113]:
key = os.path.join(prefix, model_file_name, 'model.tar.gz')
# open in binary mode and close the file once the upload finishes
with open("model.tar.gz", 'rb') as fObj:
    boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(fObj)
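
One portability note: `os.path.join` uses the host operating system's separator, which is `/` on a Linux notebook instance but `\` on Windows, while S3 keys always use `/`. A hedged alternative is `posixpath.join`, sketched here with the same prefix and file name as above (the `_demo` variable names are only for illustration).

```python
import posixpath

prefix_demo = "datasets/churn"
model_file_name_demo = "DEMO-customised-xgboost-model"

# posixpath.join always uses "/", regardless of the host operating system.
key_demo = posixpath.join(prefix_demo, model_file_name_demo, "model.tar.gz")
print(key_demo)
```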

### Import model container
This involves creating a SageMaker model from the model file previously uploaded to S3.

In [114]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

### Load customised model artifacts into SageMaker


In [115]:
# URI where a pre-trained model is stored
model_url = 's3://{}/{}'.format(bucket,key)
model_url

's3://taysolsdev/datasets/churn/DEMO-customised-xgboost-model/model.tar.gz'

### Train customised model

In [None]:
#  The session object that manages interactions with Amazon SageMaker APIs and any other AWS service that the training job uses.
sess = sagemaker.Session()

# model_uri: URI of our pre-trained model 
customised_xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    model_uri = model_url, ## important
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

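# Most hyperparameters were already set when fitting the pre-trained model
customised_xgb.set_hyperparameters(num_round=50 # the number of boosting rounds (only used in the console version of XGBoost)
                        )

# Start model training.
# Note: s3_input_train and s3_input_validation are not defined in this notebook;
# they are assumed to be the S3 input channels created in AWS_BUILTIN_MODEL_DEPLOYMENT.
customised_xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}, logs=True)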

### Deploy Model with Batch Transform

In [117]:
# creates a transformer object from the trained model
transformer = customised_xgb.transformer(
                          instance_count=1,
                          instance_type='ml.m4.xlarge',
                          output_path=batch_output)

# calls that object's transform method to create a transform job
transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')

transformer.wait()

...........................................!
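
For a single input file, Batch Transform writes its result under the output path using the input file name with `.out` appended. The sketch below derives that location; the helper name `batch_output_uri` and the input URI are assumptions for illustration, since the real `batch_input` location comes from the UDF and is not shown here.

```python
import posixpath


def batch_output_uri(output_path: str, input_uri: str) -> str:
    """Join the output prefix with the input file name plus '.out'."""
    input_name = posixpath.basename(input_uri)
    return output_path.rstrip("/") + "/" + input_name + ".out"


# The input URI below is an assumed example; only its file name matters here.
derived_uri = batch_output_uri(
    "s3://taysolsdev/datasets/churn/batch/batch-inference",
    "s3://example-bucket/some/prefix/test_data_Batch.csv",
)
print(derived_uri)
```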


### Validate Model Deployed with Batch Transform
The following steps are the same as for the standard AWS model.

In [118]:
# batch output based on the test data
batch_output = 's3://taysolsdev/datasets/churn/batch/batch-inference/test_data_Batch.csv.out'
batch_output = pd.read_csv(batch_output, header=None, encoding = "ISO-8859-1") # the output file has no header row
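
The batch output holds one predicted probability per row, and the evaluation later in this notebook turns these into labels by rounding at 0.5. One detail worth knowing is that `np.round` rounds exact halves to the nearest even value, so a probability of exactly 0.5 becomes 0. A small sketch on made-up probabilities compares rounding with an explicit threshold:

```python
import numpy as np

# Made-up probabilities, including the boundary value 0.5.
probs = np.array([0.12, 0.50, 0.51, 0.97])

rounded = np.round(probs).astype(int)     # 0.5 rounds to the even neighbour, 0
thresholded = (probs >= 0.5).astype(int)  # 0.5 counts as the positive class

print(rounded)
print(thresholded)
```

The two approaches differ only on values that are exactly 0.5, so the choice rarely changes the scores, but the explicit comparison makes the rule visible and lets the threshold be tuned.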


In [119]:
def get_score(y_true,y_pred):
    f1 = metrics.f1_score(y_true, y_pred)
    precision = metrics.precision_score(y_true, y_pred)
    recall = metrics.recall_score(y_true, y_pred)
    accuracy = metrics.accuracy_score(y_true, y_pred)
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
    return precision, recall, f1, accuracy, tn, fp, fn, tp

In [126]:
pred_y = np.round(batch_output) # threshold is 0.5


#get scores
temp_precision, temp_recall, temp_f1, temp_accuracy, tn, fp, fn, tp = get_score(test_y, pred_y)
output = [temp_precision,temp_recall,temp_f1,temp_accuracy,tp, fp, tn, fn]
output = pd.Series(output, index=['precision', 'recall', 'f1', 'accuracy', 'tp', 'fp', 'tn', 'fn']) 
print(output[['accuracy', 'tp', 'fp', 'tn', 'fn']])

from sklearn.metrics import classification_report
print(classification_report(test_y, pred_y))

accuracy      0.765292
tp           94.000000
fp           52.000000
tn          444.000000
fn          113.000000
dtype: float64
              precision    recall  f1-score   support

           0       0.80      0.90      0.84       496
           1       0.64      0.45      0.53       207

   micro avg       0.77      0.77      0.77       703
   macro avg       0.72      0.67      0.69       703
weighted avg       0.75      0.77      0.75       703
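
As a sanity check, the accuracy and the class 1 row of the report can be reproduced by hand from the four confusion-matrix counts printed above. This sketch uses plain arithmetic only:

```python
# Counts copied from the output above.
tp, fp, tn, fn = 94, 52, 444, 113

total = tp + fp + tn + fn       # number of test rows
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of the predicted churners, the share that churned
recall = tp / (tp + fn)         # of the actual churners, the share that was caught
f1 = 2 * precision * recall / (precision + recall)

print(total, round(accuracy, 6), round(precision, 2), round(recall, 2), round(f1, 2))
```

The low recall for class 1 means that more than half of the actual churners are missed at the 0.5 threshold.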



### Clean up