In [None]:
!pip install sagemaker-experiments

In [None]:
import sagemaker
import boto3
import numpy as np
import pandas as pd
import os
from sagemaker import get_execution_role
from datetime import datetime

# Get default bucket
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/DEMO-xgboost-dm'

# Get SageMaker Execution Role
role = get_execution_role()
region = boto3.Session().region_name

# SageMaker Session
sess = sagemaker.session.Session()

---

## Training

To train a model in SageMaker, you create a training job. The training job includes the following information:

* The Amazon Elastic Container Registry path where the training code is stored.

* The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you've stored the training data.

* The compute resources that you want SageMaker to use for model training. Compute resources are ML compute instances that are managed by SageMaker.

* The URL of the S3 bucket where you want to store the output of the job.
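As a rough sketch, the four pieces of information above map onto a request shaped like the one below. The field names follow the low-level `CreateTrainingJob` request. Every value (job name, image URI, role ARN, S3 paths, volume size, runtime limit) is a hypothetical placeholder, and `build_training_request` is an illustrative helper, not part of the SageMaker SDK. This notebook uses the higher-level `Estimator` instead of building this dictionary by hand.

```python
def build_training_request(job_name, image_uri, role_arn, train_s3_uri, output_s3_uri):
    """Assemble a dictionary shaped like a CreateTrainingJob request (illustration only)."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        # 1. Where the training code (container image) lives in Amazon ECR
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        # 2. Where the training data lives in Amazon S3
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "ContentType": "csv",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        # 3. The compute resources to train on (placeholder sizes)
        "ResourceConfig": {
            "InstanceType": "ml.m4.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        # 4. Where to store the output of the job
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# All arguments below are made-up placeholders.
request = build_training_request(
    job_name="demo-xgboost-job",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<image>:<tag>",
    role_arn="arn:aws:iam::<account>:role/<role-name>",
    train_s3_uri="s3://<bucket>/<prefix>/train",
    output_s3_uri="s3://<bucket>/<prefix>/output",
)
```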

SageMaker built-in algorithms require the least effort to get started, and they scale well when the dataset is large and significant resources are needed to train and deploy the model. For this use case, we will use the built-in XGBoost algorithm in SageMaker.

`xgboost` is an extremely popular, open-source package for gradient boosted trees.  It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions.  Let's start with a simple `xgboost` model, trained using Amazon SageMaker's managed, distributed training framework.

In [None]:
container = sagemaker.image_uris.retrieve(region=region, framework='xgboost', version='latest')

In [None]:
s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')
s3_input_test ='s3://{}/{}/test/test.csv'.format(bucket, prefix)

## Create an Experiment

To keep track of the parameters and metrics that correspond to the training job, we create an Experiment and add the training job to a Trial within that Experiment.

Experiments are organized as follows:
```
Experiment
    Trial
        Trial Component 1
        Trial Component 2
        ...
```     
In this notebook, each run of the training job corresponds to a Trial Component, and we group these into Trials that represent each iterative experiment we run.
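To make the hierarchy concrete, here is a small sketch that models it with plain dataclasses. The class names (`ExperimentSketch`, `TrialSketch`, `TrialComponentSketch`) and the example names are hypothetical and exist only for illustration. They are not the `smexperiments` API, which this notebook uses in the cells that follow.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrialComponentSketch:
    display_name: str  # e.g. one run of the training job


@dataclass
class TrialSketch:
    name: str
    components: List[TrialComponentSketch] = field(default_factory=list)


@dataclass
class ExperimentSketch:
    name: str
    trials: List[TrialSketch] = field(default_factory=list)


def outline(experiment):
    """Render the hierarchy as an indented outline, one level per nesting depth."""
    lines = [experiment.name]
    for toy_trial in experiment.trials:
        lines.append("    " + toy_trial.name)
        for component in toy_trial.components:
            lines.append("        " + component.display_name)
    return "\n".join(lines)


# Made-up names: one experiment, one trial, one training run inside it.
toy_experiment = ExperimentSketch(name="demo-experiment")
first_trial = TrialSketch(name="demo-trial-1")
first_trial.components.append(TrialComponentSketch(display_name="XGB-Training"))
toy_experiment.trials.append(first_trial)

print(outline(toy_experiment))
```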

In [None]:
current_time = datetime.now().strftime("%d-%m-%Y-%H-%M-%S")

### Create the Experiment

In [None]:
from smexperiments.experiment import Experiment

sm = boto3.client('sagemaker')
xgboost_experiment = Experiment.create(experiment_name=f'xgboost-banking-dataset-experiment-{current_time}')

### Create the Trial

In [None]:
trial = xgboost_experiment.create_trial(trial_name=f'trial-{current_time}')

An estimator is a high-level interface for SageMaker training. We will create an estimator object by supplying the required parameters, such as the IAM role, the compute instance count and type, and the S3 output path.

We also supply hyperparameters for the algorithm and then call its `fit()` method to start training the model.

In [None]:
xgb = sagemaker.estimator.Estimator(
    container,
    role, 
    instance_count=1, 
    instance_type='ml.m4.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sess
)

xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    silent=0,
    objective='binary:logistic',
    num_round=100
)

xgb.fit(
    inputs = {
        'train': s3_input_train, 
        'validation': s3_input_validation
    },
    experiment_config = {
        "ExperimentName": xgboost_experiment.experiment_name,
        "TrialName": trial.trial_name,
        "TrialComponentDisplayName": "XGB-Training"
    }
) 

---

## Hosting
Hosting the trained model allows us to make inferences against it. The code below deploys our trained model to a real-time endpoint.

In [None]:
xgb_predictor = xgb.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.m4.xlarge'
)

---

## Evaluation
Let us evaluate our model against the test dataset.

Our data is stored as NumPy arrays in the memory of our notebook instance.  To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the resulting CSV.

*Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.*
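To illustrate what the CSV payload looks like, here is a minimal sketch that turns a small NumPy batch into a CSV string. The helper `to_csv_payload` and the toy values are hypothetical. In this notebook the SDK's `CSVSerializer` does the serialization for us, and its exact output formatting may differ from this sketch.

```python
import io

import numpy as np


def to_csv_payload(batch):
    """Serialize a 2-D array as CSV text: one row per line, no header, no target column."""
    buffer = io.StringIO()
    np.savetxt(buffer, batch, delimiter=",", fmt="%g")
    return buffer.getvalue().rstrip("\n")


# Two made-up feature rows; the target variable is deliberately absent.
toy_batch = np.array([[35, 1, 0], [52, 0, 1]])
payload = to_csv_payload(toy_batch)
print(payload)
```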

In [None]:
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

The helper method below allows us to pass in our test data and make predictions against it. The following steps are performed in this helper method. 
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array
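The steps above can be tried without a live endpoint. The sketch below uses a hypothetical `FakePredictor` that returns one made-up score per row as CSV bytes, so the batching and decoding logic can be checked locally. The names `FakePredictor`, `predict_in_batches`, and the toy data are assumptions for illustration. The real helper used in this notebook appears in the next cells.

```python
import numpy as np


class FakePredictor:
    """Hypothetical stand-in for the endpoint: returns one score per row as CSV bytes."""

    def __init__(self):
        self.batch_sizes = []  # remember how many rows each call received

    def predict(self, batch):
        self.batch_sizes.append(batch.shape[0])
        scores = 1.0 / (1.0 + np.exp(-batch.sum(axis=1)))  # made-up scoring rule
        return ",".join(f"{score:.6f}" for score in scores).encode("utf-8")


def predict_in_batches(data, predictor, rows=500):
    """Split `data` into mini-batches, score each one, and collect the results."""
    n_batches = max(1, int(np.ceil(data.shape[0] / rows)))
    chunks = []
    for batch in np.array_split(data, n_batches):
        # Each response is a CSV string of scores for that mini-batch
        chunks.append(predictor.predict(batch).decode("utf-8"))
    # Join the per-batch CSV strings and parse them into a float array
    return np.array(",".join(chunks).split(","), dtype=float)


toy_features = np.zeros((1200, 3))  # toy data: 1200 rows of zeros
fake = FakePredictor()
toy_scores = predict_in_batches(toy_features, fake, rows=500)
```

With 1,200 rows and `rows=500`, the data is split into three mini-batches, and the result contains one score per input row.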

In [None]:
test_data = pd.read_csv(s3_input_test)
test_data

In [None]:
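def predict(data, predictor, rows=500):
    # Split the data into mini-batches of roughly `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Invoke the endpoint and append its CSV response to the running string
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    # Drop the leading comma and parse the CSV string into a NumPy array
    return np.fromstring(predictions[1:], sep=',')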

predictions = predict(test_data.drop(['y_no', 'y_yes'], axis=1).to_numpy(), xgb_predictor)

A confusion matrix is a table that is often used to describe the performance of a classification model. Below, we build a confusion matrix to compare our predictions against the actual values.

In [None]:
pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

Our model predicted that 144 of the nearly 4000 customers would subscribe, and 92 of them actually did.  Another 344 customers subscribed even though we did not predict that they would.  This is less than desirable, but the model can (and should) be tuned to improve this.

_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._
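The counts quoted above can be summarized as precision and recall. This is a small worked calculation using only the numbers from the text (144 predicted subscribers, 92 of them correct, 344 missed subscribers). Your own run may produce different counts.

```python
# Counts taken from the text above; substitute the values from your own run.
predicted_positives = 144  # customers we predicted would subscribe
true_positives = 92        # of those, customers who actually subscribed
false_negatives = 344      # subscribers we failed to predict

false_positives = predicted_positives - true_positives

# Precision: of the customers we flagged, what fraction subscribed?
precision = true_positives / predicted_positives
# Recall: of the customers who subscribed, what fraction did we flag?
recall = true_positives / (true_positives + false_negatives)

print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Under these counts, roughly 64% of the flagged customers subscribed, but only about 21% of all subscribers were flagged, which is why the result is described as less than desirable.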

### (Optional) Clean-up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
xgb_predictor.delete_endpoint(delete_endpoint_config=True)