### Soteria - Decisions: Risk Profile Engine - Multi Classification Model
***
##### __Project: Soteria__
##### __Module: Decisions: Risk Profile Engine (RPE)__
##### __Program: Soteria-Decisions-RiskProfileEngine-Model.ipynb__
##### __Written by: Devarayan Subbu__

_Prerequisite (AWS access): To run this Python notebook, you need the necessary access privileges to your AWS environment, in particular to Amazon SageMaker and S3, and an existing SageMaker notebook instance with the necessary IAM roles. Refer to the AWS documentation for further help with AWS services (SageMaker, S3, IAM, etc.) or with creating notebook instances._
***

This __model is based on XGBoost__ (eXtreme Gradient Boosting), an implementation of the gradient boosted trees algorithm. Gradient boosting is a _supervised learning algorithm_ that attempts to accurately _predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models_.
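
To make the "ensemble of weaker models" idea concrete, here is a minimal sketch of gradient boosting for a toy 1-D regression problem with squared loss. It is not how XGBoost is implemented; the helper names (`fit_stump`, `boost`, `predict`, `mse`) and the data are made up for illustration. Each round fits a one-split stump to the current residuals and adds a scaled copy of it to the ensemble.

```python
# Toy gradient boosting with decision stumps (illustration only, not XGBoost itself).

def fit_stump(xs, residuals):
    """Find the single split that best fits the residuals (squared error)."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must leave points on both sides
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]  # (threshold, left_value, right_value)

def predict(base, stumps, eta, x):
    """Base prediction plus the shrunken contribution of every stump."""
    total = base
    for t, lv, rv in stumps:
        total += eta * (lv if x <= t else rv)
    return total

def boost(xs, ys, n_rounds=20, eta=0.3):
    """Repeatedly fit a stump to what the current ensemble still gets wrong."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - predict(base, stumps, eta, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

def mse(base, stumps, eta, xs, ys):
    return sum((y - predict(base, stumps, eta, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
base, stumps = boost(xs, ys, n_rounds=20, eta=0.3)
mse_start = mse(base, [], 0.3, xs, ys)    # error of the constant (mean) model
mse_end = mse(base, stumps, 0.3, xs, ys)  # error after 20 boosting rounds
print(mse_start, mse_end)
```

The `eta` argument plays the same role as the `eta` hyperparameter tuned later in this notebook: it shrinks each new model's contribution so that no single weak learner dominates.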

Our requirement is to __predict__ the __risk category__ based on specific features. In our case, these features are __temperature, spo2, travel_history, positive_contact, symptoms_none, dry_cough, shortness_of_breath, chest_pain_or_pressure, confusion_or_problems_thinking, bluish_lips_or_face, sore_throat, fatigue, aches_and_pain, loss_of_appetite_or_smell, headache, stuffy_or_runny_nose, vomiting, diarrhea, sneezing, overall_health_status__. Given these input features, the model predicts the risk category (low, medium, or high).

#### Install XGBoost and import necessary libraries...

In [None]:
!pip install xgboost==1.0

In [None]:
import numpy as np
import pandas as pd
import sagemaker
import boto3
import itertools
import os
import time
import matplotlib.pyplot as plt
import xgboost as xgb
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.amazon.amazon_estimator import get_image_uri
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import plot_tree
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from pprint import pprint

#### Set the region, client, session, role, bucket prefix and bucket to store the datasets, and evaluation metric

In [None]:
region = boto3.Session().region_name    
smclient = boto3.Session().client('sagemaker')
session=sagemaker.Session()
objective_metric_name = 'validation:mlogloss'
role = sagemaker.get_execution_role()
bucket='soteria-decisions-risk-engine'
prefix = 'SOTERIA-SM-XGB-HT/RPE-V0.53'

#### Set the train, validation, and test split percentages and also read the source data

In [None]:
TRAIN_PERCENTAGE=73
VALIDATION_PERCENTAGE=25
TEST_PERCENTAGE=2
sFile='soteria_ss_rtw_health_based_risk_class.csv'
trnFile='soteria_ss_rtw_health_based_risk_class_trn.csv'
valFile='soteria_ss_rtw_health_based_risk_class_val.csv'
tstFile='soteria_ss_rtw_health_based_risk_class_tst.csv'
cListFile='ss_rtw_profile_health_based_risk_class_clist.txt'
c=['encoded_risk_category', 'temperature', 'spo2', 'travel_history', 'positive_contact', 'symptoms_none', 'dry_cough', 'shortness_of_breath', 'chest_pain_or_pressure', 'confusion_or_problems_thinking', 'bluish_lips_or_face', 'sore_throat', 'fatigue', 'aches_and_pain', 'loss_of_appetite_or_smell', 'headache', 'stuffy_or_runny_nose', 'vomiting', 'diarrhea', 'sneezing', 'overall_health_status']
dfAll=pd.read_csv(sFile)

In [None]:
LOW=0
MEDIUM=1
HIGH=2
labels=[LOW, MEDIUM, HIGH]
risk_categories=['0 - Low', '1 - Medium', '2 - High']
lEnc=preprocessing.LabelEncoder()
lEnc.fit(risk_categories)
lEnc.classes_
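
As a rough picture of what the `LabelEncoder` above does, the following plain-Python sketch reproduces its mapping: the unique labels are sorted and each label is encoded as its position in that sorted list. The names `encode`, `decode`, and `sample` are illustrative only.

```python
risk_categories = ['0 - Low', '1 - Medium', '2 - High']

# LabelEncoder keeps the sorted unique labels and encodes each label as its index.
classes = sorted(set(risk_categories))
encode = {label: code for code, label in enumerate(classes)}
decode = {code: label for label, code in encode.items()}

sample = ['2 - High', '0 - Low', '1 - Medium', '0 - Low']
encoded = [encode[label] for label in sample]
print(encoded)                       # [2, 0, 1, 0]
print([decode[c] for c in encoded])  # back to the original strings
```

Because the category strings start with 0, 1, and 2, the sorted order matches the LOW, MEDIUM, and HIGH constants defined in the same cell.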

#### Perform a quick check to see...
* how many observations have been read and how many features each of those observations have
* the break up of the number of observations under each category

In [None]:
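# value_counts() orders by frequency, so reindex to a fixed Low/Medium/High order before printing
ta=dfAll['risk_category'].value_counts().reindex(risk_categories, fill_value=0)
print('Total Rows in dataframe dfAll: {:,}\nTotal Columns in dataframe dfAll: {:,}\n'.format(dfAll.shape[0], dfAll.shape[1]))
print('Observation composition as read...\nLow risk: {:,}\nMedium risk: {:,}\nHigh risk: {:,}'.format (ta.iloc[0], ta.iloc[1], ta.iloc[2]))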

In [None]:
dfAll['encoded_risk_category']=lEnc.transform(dfAll['risk_category'])

#### Check for duplicates... and if present, remove them

In [None]:
print ('Before checking for duplicates...\nTotal Rows: {:,}\nTotal Columns: {:,}\n\nDuplicates: {}\n'.format(dfAll.shape[0], dfAll.shape[1], dfAll.duplicated().any()))
dfFull=dfAll.drop_duplicates(keep='first')
print ('After checking for duplicates...\nTotal Rows: {:,}\nTotal Columns: {:,}'.format(dfFull.shape[0], dfFull.shape[1]))

In [None]:
tf=dfFull['risk_category'].value_counts()

In [None]:
def split_data(sdf, TRAIN_PERCENTAGE, VALIDATION_PERCENTAGE, TEST_PERCENTAGE):
    trn_set, v1=train_test_split(sdf, test_size=(100-TRAIN_PERCENTAGE)/100, stratify=sdf['encoded_risk_category'])
    val_set, tst_set=train_test_split(v1, test_size=(TEST_PERCENTAGE/(VALIDATION_PERCENTAGE+TEST_PERCENTAGE)), stratify=v1['encoded_risk_category'])
    return (trn_set, val_set, tst_set)
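
The two-stage split in `split_data` can be checked with a little arithmetic. The sketch below uses exact fractions to show that holding out 27% first and then taking 2/27 of that hold-out as the test set yields the intended 73/25/2 split. The variable names are illustrative.

```python
from fractions import Fraction

TRAIN_PERCENTAGE, VALIDATION_PERCENTAGE, TEST_PERCENTAGE = 73, 25, 2

# Stage 1: hold out everything that is not training data.
holdout = Fraction(100 - TRAIN_PERCENTAGE, 100)
# Stage 2: carve the test set out of the hold-out portion.
test_within_holdout = Fraction(TEST_PERCENTAGE, VALIDATION_PERCENTAGE + TEST_PERCENTAGE)

train_share = 1 - holdout
test_share = holdout * test_within_holdout
validation_share = holdout - test_share

print(train_share, validation_share, test_share)  # 73/100 1/4 1/50
```

Note that `train_test_split` rounds to whole rows, so the actual set sizes can differ from these shares by an observation or so.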

#### Split the data into training, validation, and test sets as per the split ratio... and verify that the observations are split accordingly

In [None]:
trn_set, val_set, tst_set=split_data(dfFull, TRAIN_PERCENTAGE, VALIDATION_PERCENTAGE, TEST_PERCENTAGE)

#### Review how the spread of the data is...
* The total number of observations
* The number of observations in the training, validation, and test sets
* The overall composition of Low, Medium, and High risk observations
* The training set composition of Low, Medium, and High risk observations
* The validation set composition of Low, Medium, and High risk observations
* The test set composition of Low, Medium, and High risk observations

In [None]:
print ('Total observations: {:,}'.format(dfFull.shape[0]))
print ('Total observations in train set: {:,}'.format(trn_set.shape[0]))
print ('Total observations in validation set: {:,}'.format(val_set.shape[0]))
print ('Total observations in test set: {:,}\n'.format(tst_set.shape[0]))
# value_counts() orders by frequency, so reindex each count to a fixed Low/Medium/High order
allc=tf.reindex(risk_categories, fill_value=0)
trnc=trn_set['risk_category'].value_counts().reindex(risk_categories, fill_value=0)
valc=val_set['risk_category'].value_counts().reindex(risk_categories, fill_value=0)
tstc=tst_set['risk_category'].value_counts().reindex(risk_categories, fill_value=0)
print('Overall observation composition:\nLow risk: {:,}\nMedium risk: {:,}\nHigh risk: {:,}\n'.format (*allc))
print('Training set composition:\nLow risk: {:,}\nMedium risk: {:,}\nHigh risk: {:,}\n'.format (*trnc))
print('Validation set composition:\nLow risk: {:,}\nMedium risk: {:,}\nHigh risk: {:,}\n'.format (*valc))
print('Test set composition:\nLow risk: {:,}\nMedium risk: {:,}\nHigh risk: {:,}'.format (*tstc))

#### Create the training, validation, and test set files and upload to S3 bucket

In [None]:
trn_set[:].to_csv(trnFile, index=False, header=False, columns=c)
val_set[:].to_csv(valFile, index=False, header=False, columns=c)
tst_set[:].to_csv(tstFile, index=False, header=False, columns=c)
with open(cListFile, 'w') as f:
    f.write(','.join(c))

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/'+trnFile)).upload_file(trnFile)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/'+valFile)).upload_file(valFile)
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test/'+tstFile)).upload_file(tstFile)
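
The uploads above place each file in its own folder under the prefix, and the training channels later point at those folders. The sketch below shows the resulting layout; `s3_uri` is a hypothetical helper, not part of boto3 or the SageMaker SDK. It uses `posixpath.join` because S3 keys always use forward slashes, whereas `os.path.join` would produce backslashes on Windows.

```python
import posixpath

def s3_uri(bucket, prefix, channel, filename=None):
    """Build the S3 URI for a channel folder, or for a file inside it."""
    parts = [prefix, channel] + ([filename] if filename else [])
    return 's3://{}/{}'.format(bucket, posixpath.join(*parts))

bucket = 'soteria-decisions-risk-engine'
prefix = 'SOTERIA-SM-XGB-HT/RPE-V0.53'

train_file_uri = s3_uri(bucket, prefix, 'train', 'soteria_ss_rtw_health_based_risk_class_trn.csv')
train_channel_uri = s3_uri(bucket, prefix, 'train')
print(train_file_uri)
print(train_channel_uri)
```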

#### Get the XGBoost container image

In [None]:
container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='1.0-1')

#### Create an instance of the sagemaker.estimator.Estimator class and specify the necessary parameters where,
* role is the IAM role that Amazon SageMaker can assume to perform tasks on our behalf
* train_instance_count is the number of ML compute instances to use for model training... for our purpose, we will use a single training instance
* train_instance_type is the type of ML compute instances to use for model training... for our purpose, we will use the ml.m4.xlarge instance type

#### Set the fixed hyperparameters and define the ranges for the hyperparameters to be tuned...

In [None]:
classifier = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)

classifier.set_hyperparameters(num_class=3,
                        num_round=15000,
                        objective='multi:softmax',
                        early_stopping_rounds=10
                       )

hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                         'alpha': ContinuousParameter(0,2),
                        'min_child_weight': ContinuousParameter(1, 10),
                        'max_depth': IntegerParameter(1, 6)}

tuner = HyperparameterTuner(classifier,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type='Minimize',
                            base_tuning_job_name='XGB-HT-RPE',
                            max_jobs=20,
                            max_parallel_jobs=3)
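
The tuner minimizes `validation:mlogloss`, the multiclass log loss on the validation set. As a rough illustration of what that metric measures, the sketch below computes it in plain Python as the average negative log of the probability assigned to the true class. The function name, the clipping constant `eps`, and the probability values are assumptions made up for this example.

```python
import math

def mlogloss(y_true, probabilities, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        p = min(max(row[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)

y_true = [0, 1, 2]  # Low, Medium, High
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

loss_confident = mlogloss(y_true, confident)
loss_uniform = mlogloss(y_true, uniform)
print(loss_confident, loss_uniform)
```

A model that guesses all three classes with equal probability scores ln(3), about 1.0986, so any useful model should land well below that value.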

#### Create the train and validation channels and start the training job

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation}, include_cls_metadata=False)

#### Check the progress of the job...

In [None]:
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']
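
Instead of re-running the status cell by hand, the check can be wrapped in a polling loop. The sketch below is one possible approach; `wait_for_status` is a hypothetical helper, and it is demonstrated with a fake status source so that it runs without AWS access.

```python
import time

def wait_for_status(get_status, in_progress=('InProgress',), poll_seconds=60, max_polls=120):
    """Call get_status() until it returns a status that is no longer in progress."""
    for _ in range(max_polls):
        status = get_status()
        if status not in in_progress:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('status still in progress after {} polls'.format(max_polls))

# Demo with a fake status source standing in for the SageMaker API call.
fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)  # Completed
```

In this notebook, `get_status` could be a lambda that wraps the `describe_hyper_parameter_tuning_job` call shown above and returns its `'HyperParameterTuningJobStatus'` field.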

#### In our case, the <a id="reference_to_tuning_job">tuning job name</a> is _XGB-HT-RPE-200712-1419_
Your tuning job name will be different. Make a note of it when you run your tuning job (you will need it for a [later step](#section_tuning_job)) and wait until all training jobs complete.

In [None]:
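# Name of the tuning job started above (or paste the name you noted, e.g. 'XGB-HT-RPE-200712-1419')
tuning_job_name = tuner.latest_tuning_job.job_name
tuning_job_result = smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)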

status = tuning_job_result['HyperParameterTuningJobStatus']
if status != 'Completed':
    print('Reminder: the tuning job has not been completed.')
    
job_count = tuning_job_result['TrainingJobStatusCounters']['Completed']
print("%d training jobs have completed" % job_count)
    
is_minimize = (tuning_job_result['HyperParameterTuningJobConfig']['HyperParameterTuningJobObjective']['Type'] != 'Maximize')
objective_name = tuning_job_result['HyperParameterTuningJobConfig']['HyperParameterTuningJobObjective']['MetricName']

#### Check out the details of the best model that has been found so far...

In [None]:
if tuning_job_result.get('BestTrainingJob',None):
    print("Best model found so far:")
    pprint(tuning_job_result['BestTrainingJob'])
else:
    print("No training jobs have reported results yet.")

In [None]:
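# Use a separate name so the HyperparameterTuner object created earlier is not overwritten
tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner_analytics.dataframe()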

if len(full_df) > 0:
    df = full_df[full_df['FinalObjectiveValue'] > -float('inf')]
    if len(df) > 0:
        df = df.sort_values('FinalObjectiveValue', ascending=is_minimize)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest":min(df['FinalObjectiveValue']),"highest": max(df['FinalObjectiveValue'])})
        pd.set_option('display.max_colwidth', -1)  # Don't truncate TrainingJobName        
    else:
        print("No training jobs have reported valid results yet.")
        
df
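
The filtering and sorting in the cell above can be pictured with plain Python. In the sketch below, the job names and objective values are made up for illustration. Jobs without a valid objective are dropped, and the rest are ranked so that the best job comes first, whether the objective is minimized or maximized.

```python
# Made-up records standing in for rows of the analytics dataframe.
jobs = [
    {'TrainingJobName': 'job-001', 'FinalObjectiveValue': 0.412},
    {'TrainingJobName': 'job-002', 'FinalObjectiveValue': 0.057},
    {'TrainingJobName': 'job-003', 'FinalObjectiveValue': float('-inf')},  # no valid metric
    {'TrainingJobName': 'job-004', 'FinalObjectiveValue': 0.139},
]
is_minimize = True  # mlogloss is a "lower is better" metric

# Keep only jobs with a finite objective, then rank best-first.
valid = [j for j in jobs if j['FinalObjectiveValue'] > -float('inf')]
ranked = sorted(valid, key=lambda j: j['FinalObjectiveValue'], reverse=not is_minimize)
best = ranked[0]
print(best['TrainingJobName'])  # job-002
```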

#### Wait until the tuning job completes... once it does, we should have the best performing model, ready to deploy
* Let's get the best training job from the list of training jobs completed for the given tuning job and deploy it
  - As indicated earlier, in our case, the _tuning job name_ is _XGB-HT-RPE-200712-1419_
  - Set the value of HP_TUNING_JOB_NAME below to the name of your <a id="section_tuning_job">tuning job</a> that you noted in the [earlier step](#reference_to_tuning_job)
* In our case, the _best training job_ was found to be _XGB-HT-RPE-200712-1419-016-33cf1420_
* Make a note of the best training job in your case
* Check out the details of the best training job

In [None]:
import boto3
sm = boto3.client("sagemaker")
HP_TUNING_JOB_NAME = 'XGB-HT-RPE-200712-1419'

In [None]:
# Describe the tuning job once and reuse the response
tuningJob=sm.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=HP_TUNING_JOB_NAME)
tuningJobStatus=tuningJob['HyperParameterTuningJobStatus']
bestTrainingJobName=tuningJob['BestTrainingJob']['TrainingJobName']
tunedhps=tuningJob['BestTrainingJob']['TunedHyperParameters']
# MaxNumberOfTrainingJobs is the configured limit, not the count of jobs actually run
totalTrainingJobs=tuningJob['HyperParameterTuningJobConfig']['ResourceLimits']['MaxNumberOfTrainingJobs']
print ('Tuning job name: {}\nTuning job status: {}\n\nMaximum training jobs: {}\nBest training job: {}\n\nTuned hyperparameters for best job:'.format (HP_TUNING_JOB_NAME, tuningJobStatus, totalTrainingJobs, bestTrainingJobName))
for hp in tunedhps:
    print(' - {}: {:}'.format(hp, tunedhps[hp]))

In [None]:
bestjobdetails=sm.describe_training_job(TrainingJobName=bestTrainingJobName)
trainingImage=bestjobdetails['AlgorithmSpecification']['TrainingImage']
modelPath=bestjobdetails['OutputDataConfig']['S3OutputPath']
modelDataURL=bestjobdetails['ModelArtifacts']['S3ModelArtifacts']
jobStatus=bestjobdetails['TrainingJobStatus']
hps=bestjobdetails['HyperParameters']
instanceType=bestjobdetails['ResourceConfig']['InstanceType']
instanceCount=bestjobdetails['ResourceConfig']['InstanceCount']
roleArn=bestjobdetails['RoleArn']

In [None]:
print('Training job name: {}\nTraining job status: {}\nTraining image name: {}\nModel path: {}\nModel data URL: {}\nHyperparameters:'.format(bestTrainingJobName, jobStatus, trainingImage, modelPath, modelDataURL))
for hp in hps:
    print('- {}: {:}'.format(hp, hps[hp]))
print('Instance type: {}\nInstance count: {}'.format(instanceType, instanceCount))

#### Create a deployable model by identifying the location of model artifacts and the Docker image that contains the inference code

In [None]:
modelName = bestTrainingJobName + '-model'

primary_container = {
    'Image': trainingImage,
    'ModelDataUrl': modelDataURL
}

createModelResponse = sm.create_model(
    ModelName=modelName,
    ExecutionRoleArn=roleArn,
    PrimaryContainer=primary_container)

print('Model name: {}\nModel data: {}\nModel Arn: {}'.format(modelName, modelDataURL, createModelResponse['ModelArn']))

#### Create an Amazon SageMaker endpoint configuration by specifying the ML compute instances that you want to deploy your model to

In [None]:
endpointConfigName = bestTrainingJobName + '-epc'
createEndpointConfigResponse = sm.create_endpoint_config(EndpointConfigName = endpointConfigName,
                                                            ProductionVariants=[{'InstanceType':'ml.m4.xlarge',
                                                                                 'InitialVariantWeight':1,
                                                                                 'InitialInstanceCount':1,
                                                                                 'ModelName':modelName,
                                                                                 'VariantName':'AllTraffic'}])
print('Endpoint config name: {}\nEndpoint Config Arn: {}'.format(endpointConfigName, createEndpointConfigResponse['EndpointConfigArn']))

#### Create the model endpoint... it would take a few minutes... so, wait...
* you should see a _status_ of ___Creating___ while the endpoint is being created
* once the endpoint creation process is complete, you should see the _status_ as ___InService___

In [None]:
%%time
import time
endpointName = bestTrainingJobName + '-ep'

createEndpointResponse = sm.create_endpoint(EndpointName=endpointName,
                                            EndpointConfigName=endpointConfigName)
print('Endpoint name: {}\nEndpoint Arn: {}'.format(endpointName, createEndpointResponse['EndpointArn']))

response = sm.describe_endpoint(EndpointName=endpointName)
status = response['EndpointStatus']
print("Status: " + status)

while status=='Creating':
    time.sleep(60)
    response = sm.describe_endpoint(EndpointName=endpointName)
    status = response['EndpointStatus']
    print("Status: " + status)

print("Arn: " + response['EndpointArn'])
print("Status: " + status)

#### The status of ___InService___ from the above step indicates that the model endpoint has been successfully deployed for consumption
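
Once the endpoint is in service, it can be called through the SageMaker runtime client. The following is a minimal sketch under several assumptions: the endpoint accepts one CSV row of the 20 features in training order (without the label), and, with the `multi:softmax` objective, it returns the predicted class index as text. The helper names and the sample values are hypothetical, and how each feature is encoded depends on your source data.

```python
FEATURES = ['temperature', 'spo2', 'travel_history', 'positive_contact', 'symptoms_none',
            'dry_cough', 'shortness_of_breath', 'chest_pain_or_pressure',
            'confusion_or_problems_thinking', 'bluish_lips_or_face', 'sore_throat',
            'fatigue', 'aches_and_pain', 'loss_of_appetite_or_smell', 'headache',
            'stuffy_or_runny_nose', 'vomiting', 'diarrhea', 'sneezing',
            'overall_health_status']

def build_csv_row(observation, feature_order=FEATURES):
    """Serialize one observation as a CSV row in the training column order."""
    return ','.join(str(observation[name]) for name in feature_order)

def parse_prediction(body_text, categories=('0 - Low', '1 - Medium', '2 - High')):
    """Map the returned class index (assumed to look like '2.0') to its label."""
    return categories[int(float(body_text.strip()))]

def predict_risk(endpoint_name, observation):
    """Send one observation to the endpoint (requires AWS access; not called here)."""
    import boto3  # imported lazily so the helpers above work without AWS access
    runtime = boto3.client('sagemaker-runtime')
    response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                       ContentType='text/csv',
                                       Body=build_csv_row(observation))
    return parse_prediction(response['Body'].read().decode('utf-8'))

# Made-up observation: every flag set to 0, with illustrative vital signs.
sample = dict.fromkeys(FEATURES, 0)
sample.update(temperature=98.6, spo2=97)
row = build_csv_row(sample)
print(row)
```

With a live endpoint, `predict_risk(endpointName, sample)` would return one of the three risk category labels. Remember to delete the endpoint when you no longer need it, since it incurs charges while in service.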