# Amazon SageMaker with XGBoost and Hyperparameter Tuning for Direct Marketing predictions 
_**Supervised Learning with Gradient Boosted Trees: A Binary Prediction Problem With Unbalanced Classes**_

---

## Contents

1. [Objective](#Objective)
1. [Background](#Background)
1. [Environment Preparation](#Environment-preparation)
1. [Data Downloading](#Data-downloading-and-exploration)
1. [Data Transformation](#Data-Transformation)
1. [SageMaker: Training](#Training)
1. [SageMaker: Deploying and evaluating model](#Deploying-and-evaluating-model)
1. [SageMaker: Hyperparameter Optimization (HPO)](#Hyperparameter-Optimization-(HPO))
1. [Conclusions](#Conclusions)
1. [Releasing cloud resources](#Releasing-cloud-resources)


---

## Objective
The goal of this workshop is to serve as a **Minimum Viable Example of SageMaker**, showing you how to run a **basic ML training job** and **Hyperparameter Optimization (HPO)** on AWS. An in-depth Data Science approach is out of scope for this workshop. We hope that you can use it as a starting point and adapt it to your future projects. 

---

## Background (problem description and approach)

- **Direct marketing**: contacting potential new customers via mail, email, phone call etc. 
- **Challenge**: A) too many potential customers. B) limited resources of the approacher (time, money etc.).
- **Problem: Which potential customers have the highest chance of becoming actual customers**? (so as to focus the effort only on them). 
- **Our setting**: A bank that wants to predict *whether a customer will enroll for a term deposit, after one or more phone calls*.
- **Our approach**: Build a ML model to do this prediction, from readily available information e.g. demographics, past interactions etc. (features).
- **Our tools**: We will be using the **XGBoost** algorithm in AWS **SageMaker**, followed by **Hyperparameter Optimization (HPO)** to produce the best model.



---

## Environment preparation

SageMaker requires some minimal setup at the beginning. This setup is standard and you can use it for any of your future projects.  
Things to specify:
- The **S3 bucket** and **prefix** that you want to use for training and model data. **This should be within the same region as SageMaker training**!
- The **IAM role** used to give training access to your data. See SageMaker documentation for how to create these.

In [None]:
import numpy as np  # For matrix operations and numerical processing
import pandas as pd  # For munging tabular data
import time
import os
from util.ml_reporting_tools import generate_classification_report  # helper function for classification reports

# setting up SageMaker parameters
import sagemaker
import boto3

sgmk_region = boto3.Session().region_name    
sgmk_client = boto3.Session().client("sagemaker")
sgmk_role = sagemaker.get_execution_role()
sgmk_bucket = sagemaker.Session().default_bucket()  # a default bucket has been created for this session
sgmk_prefix = "sagemaker/xgboost-hpo"


---

## Data downloading and exploration
Let's start by downloading the [direct marketing dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing) from UCI's ML Repository.  
We can run shell commands from Jupyter using the following code:

In [None]:
# (Running shell commands from Jupyter)
!wget -P data/ -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o data/bank-additional.zip -d data/


Now let's read this into a Pandas data frame and take a look.

In [None]:
df_data = pd.read_csv("./data/bank-additional/bank-additional-full.csv", sep=";")
df_data.head()  # show part of the dataframe


_**Specifics on each of the features:**_

*Demographics:*
* `age`: Customer's age (numeric)
* `job`: Type of job (categorical: 'admin.', 'services', ...)
* `marital`: Marital status (categorical: 'married', 'single', ...)
* `education`: Level of education (categorical: 'basic.4y', 'high.school', ...)

*Past customer events:*
* `default`: Has credit in default? (categorical: 'no', 'unknown', ...)
* `housing`: Has housing loan? (categorical: 'no', 'yes', ...)
* `loan`: Has personal loan? (categorical: 'no', 'yes', ...)

*Past direct marketing contacts:*
* `contact`: Contact communication type (categorical: 'cellular', 'telephone', ...)
* `month`: Last contact month of year (categorical: 'may', 'nov', ...)
* `day_of_week`: Last contact day of the week (categorical: 'mon', 'fri', ...)
* `duration`: Last contact duration, in seconds (numeric). Important note: If duration = 0 then `y` = 'no'.
 
*Campaign information:*
* `campaign`: Number of contacts performed during this campaign and for this client (numeric, includes last contact)
* `pdays`: Number of days that passed by after the client was last contacted from a previous campaign (numeric)
* `previous`: Number of contacts performed before this campaign and for this client (numeric)
* `poutcome`: Outcome of the previous marketing campaign (categorical: 'nonexistent','success', ...)

*External environment factors:*
* `emp.var.rate`: Employment variation rate - quarterly indicator (numeric)
* `cons.price.idx`: Consumer price index - monthly indicator (numeric)
* `cons.conf.idx`: Consumer confidence index - monthly indicator (numeric)
* `euribor3m`: Euribor 3 month rate - daily indicator (numeric)
* `nr.employed`: Number of employees - quarterly indicator (numeric)

*Target variable* **(the one we want to eventually predict):**
* `y`: Has the client subscribed to a term deposit? (binary: 'yes','no')

---

## Data Transformation
Cleaning up data is part of nearly every ML project. Several common steps include:

* **Handling missing values**: In our case there are no missing values.
* **Handling weird/outlier values**: There are some values in the dataset that may require manipulation.
* **Converting categorical to numeric**: There are a lot of categorical variables in our dataset. We need to address this.
* **Oddly distributed data**: We will be using XGBoost, which is a non-linear method, and is minimally affected by the data distribution.
* **Remove unnecessary data**: There are lots of columns representing general economic features that may not be available during inference time.

To summarise, we need to A) address some weird values, B) convert the categorical variables to numeric and C) remove unnecessary data:

1. Many records have the value of "999" for `pdays`. It is very likely a 'magic' number representing that *no contact was made before*. Considering that, we will create a new column called "no_previous_contact" and set it to "1" when `pdays` is 999 and "0" otherwise.

2. In the `job` column, there is more than one category for people who don't work, e.g., "student", "retired", and "unemployed". It is very likely that the decision to enroll in a term deposit depends a lot on whether the customer is working or not. As such, we generate a new column that shows whether the customer is working, based on the `job` column.

3. We will remove the economic features and `duration` from our data as they would need to be forecasted with high precision to be used as features during inference time.

4. We convert categorical variables to numeric using *one hot encoding*.

In [None]:
# Indicator variable to capture when pdays takes a value of 999
df_data["no_previous_contact"] = np.where(df_data["pdays"] == 999, 1, 0)

# Indicator for individuals not actively employed
df_data["not_working"] = np.where(df_data["job"].isin(["student", "retired", "unemployed"]), 1, 0)

# remove unnecessary data
df_model_data = df_data.drop(
    ["duration", 
    "emp.var.rate", 
    "cons.price.idx", 
    "cons.conf.idx", 
    "euribor3m", 
    "nr.employed"], 
    axis=1,
)

df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

df_model_data.head()  # Show part of the new transformed dataframe (which will be used for training)
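
To see what these transformations do on a small scale, here is a minimal sketch on a toy data frame. The rows are made-up values, not records from the bank dataset, and `dtype=int` is an optional choice made here so that the indicator columns show up as 0/1 rather than booleans.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values (not taken from the real dataset)
toy = pd.DataFrame(
    {
        "pdays": [999, 3, 999],
        "job": ["student", "admin.", "retired"],
        "y": ["no", "yes", "no"],
    }
)

# 1 when pdays holds the sentinel value 999, 0 otherwise
toy["no_previous_contact"] = np.where(toy["pdays"] == 999, 1, 0)

# 1 for the job categories treated as "not working" in this workshop
toy["not_working"] = toy["job"].isin(["student", "retired", "unemployed"]).astype(int)

# One-hot encode the text columns; each category becomes its own 0/1 column
toy_encoded = pd.get_dummies(toy, dtype=int)

print(sorted(toy_encoded.columns))
```

Note how the binary target `y` ends up as two complementary columns, `y_no` and `y_yes`, which is why only `y_yes` is kept as the label later on.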


---

## Training

Before initializing training, there are some things that need to be done:
1. Shuffle and split the dataset. 
2. Convert the dataset to the right format the SageMaker algorithm expects (e.g. CSV).
3. Copy the dataset to S3 in order to be accessed by SageMaker during training. 
4. Create s3_inputs that our training function can use as a pointer to the files in S3.
5. Specify the ECR container location for SageMaker's implementation of XGBoost.

We will shuffle and split the dataset into **Training (70%)**, **Validation (20%)**, and **Test (10%)**. We will use the Training and Validation splits during the training phase, while the 'holdout' Test split will be used to evaluate the model performance after it is deployed to production.  

Amazon SageMaker's XGBoost algorithm expects data in the **libSVM** or **CSV** formats. For the CSV format, the following specifications should be met:
- The first column must be the target variable.
- No headers should be included.
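
The split proportions and the CSV layout can be illustrated with a small standard-library sketch. The records and the helper `to_headerless_csv` are hypothetical and only meant to show the idea: cut points at 70% and 90% of a shuffled list, and rows written with the target first and no header.

```python
import csv
import io
import random

# Hypothetical toy records: (target, feature_1, feature_2)
rows = [(i % 2, i, i * 10) for i in range(10)]

# Shuffle with a fixed seed so the split is reproducible
random.Random(1729).shuffle(rows)

# Cut points at 70% and 90% of the data give a 70/20/10 split
n = len(rows)
train_end, validation_end = int(0.7 * n), int(0.9 * n)
train = rows[:train_end]
validation = rows[train_end:validation_end]
test = rows[validation_end:]


def to_headerless_csv(records):
    """Write records as CSV text: target in the first column, no header row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(records)
    return buffer.getvalue()


train_csv = to_headerless_csv(train)
print(len(train), len(validation), len(test))
print(train_csv)
```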

In [None]:
# shuffle and split the dataset
train_data, validation_data, test_data = np.split(
    df_model_data.sample(frac=1, random_state=1729), 
    [int(0.7 * len(df_model_data)), int(0.9*len(df_model_data))],
) 

# create CSV files for Train / Validation / Test
# XGBoost expects a CSV file with no headers, with the 1st column being the ground truth
# We are preparing such a CSV file in the following lines
pd.concat([train_data["y_yes"], train_data.drop(["y_no", "y_yes"], axis=1)], axis=1).to_csv("data/train.csv", index=False, header=False)
pd.concat([validation_data["y_yes"], validation_data.drop(["y_no", "y_yes"], axis=1)], axis=1).to_csv("data/validation.csv", index=False, header=False)
pd.concat([test_data["y_yes"], test_data.drop(["y_no", "y_yes"], axis=1)], axis=1).to_csv("data/test.csv", index=False, header=False)

# copy CSV files to S3 for SageMaker training (training files should reside in S3)
boto3.Session().resource("s3").Bucket(sgmk_bucket).Object(os.path.join(sgmk_prefix, "train.csv")).upload_file("data/train.csv")
boto3.Session().resource("s3").Bucket(sgmk_bucket).Object(os.path.join(sgmk_prefix, "validation.csv")).upload_file("data/validation.csv")

# create s3_inputs channels (objects pointing to the S3 locations)
s3_input_train = sagemaker.s3_input(s3_data="s3://{}/{}/train".format(sgmk_bucket, sgmk_prefix), content_type="csv")
s3_input_validation = sagemaker.s3_input(s3_data="s3://{}/{}/validation".format(sgmk_bucket, sgmk_prefix), content_type="csv")


### Specify algorithm container image

In [None]:
# specify object of the xgboost container image
from sagemaker.amazon.amazon_estimator import get_image_uri
xgb_container_image = get_image_uri(sgmk_region, "xgboost", repo_version="latest")


### A small competition: try to predict the best values for 4 hyper-parameters!
SageMaker's XGBoost includes 38 parameters. You can find more information about them [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).
For simplicity, we set only 7 of them in this workshop, and you will choose the values for 4 of those.

**Please select values for the 4 hyperparameters (by replacing the "?") based on the provided ranges.** Later we will see which model performed best and compare it with the one from the Hyperparameter Optimization step.

In [None]:
sess = sagemaker.Session()  # initiate a SageMaker session

# instantiate an XGBoost estimator object
xgb_estimator = sagemaker.estimator.Estimator(
    image_name=xgb_container_image,           # XGBoost algorithm container
    role=sgmk_role,                      # IAM role to be used
    train_instance_type="ml.m4.xlarge",  # type of training instance
    train_instance_count=1,              # number of instances to be used
    output_path="s3://{}/{}/output".format(sgmk_bucket, sgmk_prefix),
    sagemaker_session=sess,
    train_use_spot_instances=True,       # Use spot instances to reduce cost
    train_max_run=20*60,                 # Maximum allowed active runtime
    train_max_wait=30*60,                # Maximum clock time (including spot delays)
)

# scale_pos_weight is a parameter that controls the relative weights of the classes.
# Because the data set is so highly skewed, we set this parameter according to the ratio (y_no/y_yes)
scale_pos_weight = np.count_nonzero(train_data["y_yes"].values==0) / np.count_nonzero(train_data["y_yes"].values)

# define its hyperparameters
xgb_estimator.set_hyperparameters(
    num_round=?,     # int: [1,300]
    max_depth=?,     # int: [1,10]
    alpha=?,         # float: [0,5]
    eta=?,           # float: [0,1]
    silent=0,
    objective="binary:logistic",
    scale_pos_weight=scale_pos_weight,
)

xgb_estimator.fit({"train": s3_input_train, "validation": s3_input_validation}, wait=True)  # start a training (fitting) job
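
As a worked example of the `scale_pos_weight` ratio computed above (negatives divided by positives), consider the following sketch. The labels are made up for illustration and are far less skewed than the real dataset.

```python
# Hypothetical labels: 1 = enrolled, 0 = did not enroll
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

negatives = sum(1 for label in labels if label == 0)
positives = sum(1 for label in labels if label == 1)

# Ratio of negative to positive examples
scale_pos_weight = negatives / positives
print(scale_pos_weight)  # 4.0
```

With 8 negatives and 2 positives, each positive example is weighted 4 times as much as a negative one, which counteracts the class imbalance during training.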


---

## Deploying and evaluating model

### Deployment
Now that we've trained the xgboost algorithm on our data, deploying the model (hosting it behind a real-time endpoint) is just one line of code!

*Attention! This may take up to 10 minutes, depending on the AWS instance you select*.

In [None]:
xgb_predictor = xgb_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")


### Evaluation

First we need to determine how we pass data into and receive data from our endpoint. Our data is currently stored as a NumPy array in the memory of our notebook instance. To send it in an HTTP POST request, we will serialize it as a CSV string, and then decode the CSV response that the endpoint returns.  
Note: For inference with CSV format, SageMaker XGBoost requires that the data **does NOT include the target variable.**
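
The round trip can be sketched without an endpoint. In the snippet below the feature rows and the raw response are hypothetical stand-ins; a real endpoint response may be formatted differently, so treat this only as an illustration of the serialize/decode idea.

```python
# Hypothetical feature rows (target column already removed)
feature_rows = [[35, 1, 0], [52, 0, 1]]

# Request body: one line per record, comma-separated values
payload = "\n".join(",".join(str(value) for value in row) for row in feature_rows)

# Hypothetical raw response: comma-separated scores as bytes
raw_response = b"0.12,0.87"

# Decode bytes to text, then parse each score as a float
scores = [float(token) for token in raw_response.decode("utf-8").split(",")]

print(payload)
print(scores)
```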

In [None]:
# Converting strings for HTTP POST requests on inference
from sagemaker.predictor import csv_serializer

def predict_prob(predictor, data):
    # predictor settings
    predictor.content_type = "text/csv"
    predictor.serializer = csv_serializer
    return np.fromstring(predictor.predict(data).decode("utf-8"), sep=",")  # convert back to numpy 


# getting the predicted probabilities 
predictions = predict_prob(xgb_predictor, test_data.drop(["y_no", "y_yes"], axis=1).values)

print(predictions)


These numbers are the **predicted probabilities** (in the interval [0,1]) of a potential customer enrolling for a term deposit. 
- Values close to 0: the person is unlikely to enroll.
- Values close to 1: the person is likely to enroll (which makes them a good candidate for direct marketing).
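
To turn probabilities into yes/no decisions, a decision threshold is applied. The sketch below uses made-up probabilities and labels, and assumes that a probability equal to the threshold counts as a positive prediction; the workshop's reporting helper may treat that boundary differently.

```python
# Hypothetical predicted probabilities and true labels
probabilities = [0.05, 0.40, 0.55, 0.90]
actual = [0, 1, 0, 1]
decision_threshold = 0.5

# Predict class 1 when the probability reaches the threshold (assumed convention)
predicted = [1 if p >= decision_threshold else 0 for p in probabilities]

# Count the four cells of the confusion matrix
pairs = list(zip(actual, predicted))
true_positives = sum(1 for a, p in pairs if a == 1 and p == 1)
false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)
false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)
true_negatives = sum(1 for a, p in pairs if a == 0 and p == 0)

print(predicted)
print(true_positives, false_positives, false_negatives, true_negatives)
```

Lowering the threshold flags more customers as likely to enroll, which typically raises both true positives and false positives.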

Now we will generate a **comprehensive model report**, using the following function. 

In [None]:
generate_classification_report(
    y_actual=test_data["y_yes"].values, 
    y_predict_proba=predictions, 
    decision_threshold=0.5,
    class_names_list=["Did not enroll","Enrolled"],
    model_info="XGBoost SageMaker inbuilt"
)


---

## Hyperparameter Optimization (HPO)
*Note, with the default setting below, the hyperparameter tuning job can take up to 30 minutes to complete.*

We will use SageMaker Hyperparameter Optimization (HPO) to automate the search process. Specifically, we **specify a range**, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameters that we plan to tune.  

We will tune 4 hyperparameters in this example:
* **num_round**: Number of boosting rounds (trees) to build. More rounds let the model fit the training data more closely, at the cost of longer training and a higher risk of overfitting. 
* **max_depth**: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. 
* **alpha**: L1 regularization term on weights. Increasing this value makes models more conservative. 
* **eta**: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative. 

Another commonly tuned hyperparameter, which is not part of the ranges below, is **min_child_weight**: the minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression tasks, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. 

SageMaker hyperparameter tuning will automatically launch **multiple training jobs** with different hyperparameter settings, evaluate results of those training jobs based on a predefined "objective metric", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will specify the maximum number of HPO tries (`max_jobs`) and how many of these can happen in parallel (`max_parallel_jobs`).

Tip: `max_parallel_jobs` creates a **trade-off between performance and speed** (better hyperparameter values vs how long it takes to find these values). If `max_parallel_jobs` is large, then HPO is faster, but the discovered values may not be optimal. A smaller `max_parallel_jobs` increases the chance of finding optimal values, but HPO will take more time to finish.
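
A rough way to reason about this trade-off is to count how many sequential batches of jobs the tuner gets. The sketch below is a simplified model and an assumption made for illustration: it pretends jobs run in lockstep batches, whereas in practice a new job can start as soon as another one finishes.

```python
import math

max_jobs = 20
max_parallel_jobs = 4

# Jobs in the same batch start without seeing each other's results, so the
# number of sequential batches bounds how often the search can learn from earlier jobs
sequential_batches = math.ceil(max_jobs / max_parallel_jobs)
print(sequential_batches)  # 5

# Compare a fully sequential, a mixed, and a fully parallel setup
for parallel in (1, 4, 20):
    print(parallel, math.ceil(max_jobs / parallel))
```

Under this simplification, running all 20 jobs in parallel leaves a single batch, so later choices cannot benefit from earlier results at all, while one job at a time gives 20 opportunities to learn but takes the longest.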

Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using built-in XGBoost algorithm here, it emits two predefined metrics: **validation:auc** and **train:auc**, and we elected to monitor *validation:auc* as you can see below. In this case, we only need to specify the metric name and do not need to provide regex.  

If you bring your own algorithm, your algorithm emits metrics by itself. In that case, you'll need to add a MetricDefinition object here to define the format of those metrics through regex, so that SageMaker knows how to extract those metrics from your CloudWatch logs.
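
As a minimal sketch of what such a definition could look like, the snippet below pairs a metric name with a regex and checks it against a sample log line. The log format and the regex are hypothetical, chosen for a made-up custom training script; only the `Name`/`Regex` structure of the entries follows the SageMaker convention.

```python
import re

# Hypothetical metric definition for a custom training script
metric_definitions = [
    {"Name": "validation:auc", "Regex": r"validation-auc=([0-9\.]+)"},
]

# Hypothetical log line, as it might appear in the training job logs
log_line = "[epoch 12] train-auc=0.8012 validation-auc=0.7642"

# The first capture group of the regex is the metric value
match = re.search(metric_definitions[0]["Regex"], log_line)
extracted = float(match.group(1))
print(extracted)
```

Testing the regex locally against a few real log lines before launching a tuning job is a cheap way to avoid a job that never reports its objective metric.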

For more information, please refer to the [SageMaker HPO documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html).

In [None]:
# import required HPO objects
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# set up hyperparameter ranges
ranges = {
    "num_round": IntegerParameter(1, 300),
    "max_depth": IntegerParameter(1, 10),
    "alpha": ContinuousParameter(0, 5),
    "eta": ContinuousParameter(0, 1)
}

# set up the objective metric
objective = "validation:auc"

# instantiate a HPO object
tuner = HyperparameterTuner(
    estimator=xgb_estimator,          # the SageMaker estimator object
    objective_metric_name=objective,  # the objective metric to be used for HPO
    hyperparameter_ranges=ranges,     # the range of hyperparameters
    max_jobs=20,                      # total number of HPO jobs
    max_parallel_jobs=4,              # how many HPO jobs can run in parallel
    strategy="Bayesian",              # the internal optimization strategy of HPO
    objective_type="Maximize"         # maximize or minimize the objective metric
)  


### Launch HPO
Now we can launch a hyperparameter tuning job by calling the *fit()* function. After the job is created, we can go to the SageMaker console to track its progress until it completes.

In [None]:
# start HPO
tuner.fit({"train": s3_input_train, "validation": s3_input_validation}, include_cls_metadata=False)


**Important notice**: HPO jobs are expected to take quite long to finish and as such, **they do not wait by default** (the cell will appear as 'done' while the job is still running in the cloud). All subsequent cells relying on the HPO output cannot run until the job has finished. To check whether the HPO job has finished (so we can proceed with the subsequent code), we can run the following polling script:

In [None]:
# wait, until HPO is finished
hpo_state = "InProgress"

while hpo_state == "InProgress":
    hpo_state = sgmk_client.describe_hyper_parameter_tuning_job(
                HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)["HyperParameterTuningJobStatus"]
    print("-", end="")
    time.sleep(60)  # poll once every 1 min

print("\nHPO state:", hpo_state)
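
The same polling idea can be wrapped in a reusable helper with a timeout, so that a stuck job does not block the notebook forever. This is a sketch under assumptions: `wait_for_status` is a hypothetical helper, and the status source is simulated here rather than calling AWS.

```python
import time


def wait_for_status(get_status, in_progress=("InProgress",), poll_seconds=60, timeout_seconds=3600):
    """Poll get_status() until it leaves the in-progress states or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in in_progress:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job still in progress after {} seconds".format(timeout_seconds))
        time.sleep(poll_seconds)


# Stand-in for the describe call: reports "InProgress" twice, then "Completed"
fake_states = iter(["InProgress", "InProgress", "Completed"])
final_state = wait_for_status(lambda: next(fake_states), poll_seconds=0)
print(final_state)
```

In this notebook, `get_status` could be a small function that wraps the `describe_hyper_parameter_tuning_job` call shown above and returns its `HyperParameterTuningJobStatus` field.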


### Deploy and test optimized model
Deploying the best model is simply one line of code:

In [None]:
# deploy the best model from HPO
best_model_predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.large")


Once deployed, we can now evaluate the performance of the best model.

In [None]:
# getting the predicted probabilities of the best model
predictions = predict_prob(best_model_predictor, test_data.drop(["y_no", "y_yes"], axis=1).values)
print(predictions)

# generate report for the best model
generate_classification_report(
    y_actual=test_data["y_yes"].values, 
    y_predict_proba=predictions, 
    decision_threshold=0.5,
    class_names_list=["Did not enroll","Enrolled"],
    model_info="XGBoost SageMaker inbuilt + HPO"
)

---

## Conclusions

The optimized HPO model achieves an AUC of approximately 0.773.
Depending on the number of tries, HPO can give a better performing model, compared to simply trying different hyperparameters (by trial and error).  
You can learn more in-depth details about HPO [here](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html).
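
For readers less familiar with the AUC metric quoted above, it can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it by brute force on made-up labels and scores; `pairwise_auc` is a hypothetical helper for illustration, not the function used by SageMaker or by the workshop's reporting tools.

```python
def pairwise_auc(actual, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(actual, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of the decision threshold.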

---

## Releasing cloud resources

It is generally good practice to deactivate all endpoints that are not in use.  
Please uncomment the following lines and run the cell in order to deactivate the 2 endpoints that were created earlier. 

In [None]:
# xgb_predictor.delete_endpoint(delete_endpoint_config=True)
# best_model_predictor.delete_endpoint(delete_endpoint_config=True)
