# Customer Churn with Customised XGBoost and Batch Transform
_**Using SageMaker built-in XGBoost and Sklearn Gradient Boosted Trees to Predict Mobile Customer Departure from a Batch Transformer**_

---

---

## Contents 

1. [Background](#Background)
1. [Batch Transform](#Batch-Transform)
1. [Setup](#Setup)
1. [Data](#Data)
  1. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    1. [Variables Distribution](#Varibles-Distribution)
    1. [Correlation between Features](#Correlation-between-Features)
    1. [Variable Selection](#Variable-Selection)
1. [Train](#Train)
  1. [Built-in XGBoost](#Build-in-XGboost)
  1. [Customised XGBoost](#Customised-XGboost)
1. [Batch Prediction](#Batch-Prediction)
1. [Evaluate](#Evaluate)
1. [Extensions](#Extensions)
  1. [Call existing batch job in sagemaker](#Call-existing-batch-job-in-sagemaker)

---

## Background

_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_

Losing customers is costly for any business.  Identifying unhappy customers early on gives you a chance to offer them incentives to stay.  This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction. ML models rarely give perfect predictions though, so this notebook is also about how to incorporate the relative costs of prediction mistakes when determining the financial outcome of using ML.

We use an example of churn that is familiar to all of us–leaving a mobile phone operator.  Seems like I can always find fault with my provider du jour! And if my provider knows that I’m thinking of leaving, it can offer timely incentives–I can always use a phone upgrade or perhaps have a new feature activated–and I might just stick around. Incentives are often much more cost effective than losing and reacquiring a customer.

---

## Batch Transform
The [xgboost_customer_churn](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb) notebook uses a real-time endpoint for prediction. Batch transform uses the same mechanics as real-time hosting to generate predictions. However, unlike real-time hosted endpoints, which have persistent hardware (instances stay running until you shut them down), batch transform clusters are torn down when the job completes.

The main advantage of batch transform over real-time hosting is that it handles large datasets more conveniently, and we do not need to worry about the running time. For example, an AWS Lambda function can start a batch transform job on a very large dataset. By contrast, scoring the same dataset against a real-time endpoint from inside the Lambda function is limited by Lambda's maximum run time of 15 minutes. See [Call existing batch job in sagemaker](#Call-existing-batch-job-in-sagemaker).

---

## Setup

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto regexp with the appropriate full IAM role ARN string(s).

In [None]:
bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/DEMO-xgboost-churn-batch-transform'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

- **Python data science libraries** (pandas, NumPy, scikit-learn, and others, used just as in any Jupyter notebook or Python script) for:
  1. Data cleaning and feature engineering
  2. Customised XGBoost model training
  3. Model validation
- **SageMaker Python SDK**, used for:
  1. Built-in XGBoost model training
  2. Creating the batch transform job
  3. Calling a batch transform job to predict on new data (when an existing model can be reused)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
import os
import sys
import time
import json
from sklearn import metrics
from IPython.display import display
from time import strftime, gmtime
import sagemaker # SageMaker Python SDK
from sagemaker.predictor import csv_serializer

---
## Data

Mobile operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct an ML model of one mobile operator’s churn using a process called training. After training the model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have the model predict whether this customer is going to churn. Of course, we expect the model to make mistakes–after all, predicting the future is tricky business! But I’ll also show how to deal with prediction errors.

The dataset we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It is attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.  Let's download and read that dataset in now:

In [None]:
!wget http://dataminingconsultant.com/DKD2e_data_sets.zip
!unzip -o DKD2e_data_sets.zip

In [None]:
churn = pd.read_csv('./Data sets/churn.txt')
pd.set_option('display.max_columns', 500) # to ensure that the dataframe show all columns 
churn.head()

#### Data overview
First, we need to understand the size of the raw data, check for missing values, and learn the meaning of each column.

In [None]:
print ("Rows     : " ,churn.shape[0])
print ("Columns  : " ,churn.shape[1])
print ("\nFeatures : \n" ,churn.columns.tolist())
print ("\nMissing values :  ", churn.isnull().sum().values.sum())
print ("\nUnique values :  \n",churn.nunique())

There are no missing values in the dataset. By modern standards, it’s a relatively small dataset, with only 3,333 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes are:

- `State`: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ
- `Account Length`: the number of days that this account has been active
- `Area Code`: the three-digit area code of the corresponding customer’s phone number
- `Phone`: the remaining seven-digit phone number
- `Int’l Plan`: whether the customer has an international calling plan: yes/no
- `VMail Plan`: whether the customer has a voice mail feature: yes/no
- `VMail Message`: presumably the average number of voice mail messages per month
- `Day Mins`: the total number of calling minutes used during the day
- `Day Calls`: the total number of calls placed during the day
- `Day Charge`: the billed cost of daytime calls
- `Eve Mins`, `Eve Calls`, `Eve Charge`: the minutes used, number of calls, and billed cost for calls placed during the evening
- `Night Mins`, `Night Calls`, `Night Charge`: the minutes used, number of calls, and billed cost for calls placed during nighttime
- `Intl Mins`, `Intl Calls`, `Intl Charge`: the minutes used, number of calls, and billed cost for international calls
- `CustServ Calls`: the number of calls placed to Customer Service
- `Churn?`: whether the customer left the service: true/false

The last attribute, `Churn?`, is known as the target attribute–the attribute that we want the ML model to predict.  Because the target attribute is binary, our model will be performing binary prediction, also known as binary classification.


Let's begin cleaning and exploring the data:

#### Clean data
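- Drop useless column: `Phone` takes on too many unique values to be of any practical use.  It's possible parsing out the prefix could have some value, but without more context on how these are allocated, we should avoid using it.
- Treat `Area Code` as categorical: the code is a label rather than a quantity, so we cast the column to `object` and it is handled like the other categorical features.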

In [None]:
churn = churn.drop('Phone', axis=1) 
churn['Area Code'] = churn['Area Code'].astype(object)

In [None]:
### Check duplicated rows
print(len(churn)) # initial 
churn = churn.drop_duplicates()
print(len(churn)) # after removing duplicates

### Exploratory Data Analysis
Produce plots and summaries to get ideas about the features and modelling.

#### Target variable
Only 14% of customers churned, so there is some class imbalance, but nothing extreme.

In [None]:
churn["Churn?"].value_counts()
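One consequence of class imbalance is worth keeping in mind for the evaluation later: a model that always predicts "no churn" already reaches a high accuracy. The following is a minimal sketch on hypothetical toy labels (not the real dataset); the label strings `True.` and `False.` follow the values of the `Churn?` column.

```python
from collections import Counter

# Toy labels only (hypothetical): 6 non-churners and 1 churner.
labels = ["False."] * 6 + ["True."]

counts = Counter(labels)
churn_rate = counts["True."] / len(labels)

# Accuracy of a trivial "model" that always predicts the majority class.
baseline_accuracy = max(counts.values()) / len(labels)

print(f"churn rate: {churn_rate:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any accuracy figure reported for the trained model should be read against this baseline, which is why precision and recall are also computed later.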

In [None]:
# Use the plain apostrophe so that "Int'l Plan" matches the column name in the dataset
categorial_variable = ['State', 'Area Code', "Int'l Plan", 'VMail Plan']

continuous_variable = ['Account Length', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 
                       'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge',
                      'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls']


#### Categorical variables Distribution
The relationship between each of the categorical features and our target variable.

In [None]:
colours=['#1F77B4', '#FF7F0E']

for column in churn.columns[:-1]: # the last column is target (churn)
    if column in categorial_variable:
        table=pd.crosstab(churn["Churn?"],churn[column])
        print(table)
        print("--------------------------------------------------")
        fig, ax = plt.subplots()
        (table.T).plot(kind='bar', alpha=0.9, color=colours, ax=ax)
        ax.set_xlabel(column)
        ax.set_ylabel('Churn?')
        plt.tight_layout()
        sns.despine()
        plt.show()

- `VMail Plan`: relevant to customer churn, since the Yes and No classes differ. The plots show that customers with a voice mail plan are less likely to churn than customers without one.
- `State`: not very relevant to customer churn, since churn appears to be quite evenly distributed across states.
- `Area Code`: relevant to customer churn. The plots show that customers in area code 415 are more likely to churn than customers in other areas.

#### Continuous variables Distribution
The relationship between each of the continuous features and our target variable.

In [None]:
for column in churn.select_dtypes(exclude=['object']).columns:
    print(column)
    hist = churn[[column, 'Churn?']].hist(by='Churn?', bins=30)
    plt.show()

Interestingly we see that churners appear:
- Fairly evenly distributed geographically
- More likely to have an international plan
- Less likely to have a voicemail plan
- To exhibit some bimodality in daily minutes (either higher or lower than the average for non-churners)
- To have a larger number of customer service calls (which makes sense as we'd expect customers who experience lots of problems may be more likely to churn)

In addition, we see that churners take on very similar distributions for features like `Day Mins` and `Day Charge`.  That's not surprising as we'd expect minutes spent talking to correlate with charges.  Let's dig deeper into the relationships between our features.

#### Correlation between Features
The purpose of this section is to check whether any continuous variables carry very similar information. Collinearity (a condition in which some of the independent variables are highly correlated) can be a problem in statistical modelling, and can also worsen the performance of ML models.

If some columns are highly correlated, we need to consider combining them or deleting some of them.

In [None]:
## Correlation Matrix among continuous variables
f,ax = plt.subplots(figsize=(15, 15))
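# Restrict to numeric columns: correlation is undefined for the object (categorical) columns
sns.heatmap(churn.select_dtypes(include='number').corr(), annot=True, linewidths=.5, fmt= '.4f',ax=ax)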

From the correlation matrix above, we can see several features that essentially have 100% correlation with one another.
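To see why a pair such as minutes and charge ends up with a correlation of essentially 1, here is a small dependency-free sketch. The data is made up, and the flat per-minute rate of 0.17 is an assumption for illustration only; the point is that any feature that is a fixed multiple of another is perfectly correlated with it.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not rows from the dataset.
day_mins = [110.0, 265.1, 161.6, 243.4, 299.4]
day_charge = [0.17 * m for m in day_mins]  # assumed flat rate per minute
day_calls = [110, 123, 114, 71, 113]

r_charge = pearson(day_mins, day_charge)
r_calls = pearson(day_mins, day_calls)

print(f"Day Mins vs Day Charge: {r_charge:.4f}")
print(f"Day Mins vs Day Calls:  {r_calls:.4f}")
```

A charge column built this way adds no information beyond the minutes column, which is the motivation for dropping one feature of each such pair.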

#### Variable Selection
From the correlation plot above, we already see several features that essentially have 100% correlation with one another. Let's remove one feature from each of the highly correlated pairs: `Day Charge` from the pair with `Day Mins`, `Eve Charge` from the pair with `Eve Mins`, `Night Charge` from the pair with `Night Mins`, and `Intl Charge` from the pair with `Intl Mins`:

In [None]:
churn = churn.drop(['Day Charge', 'Eve Charge', 'Night Charge', 'Intl Charge'], axis=1)

#### Feature transformation for continuous variable (Optional)
Data transformation is not important for tree-based ML models such as XGBoost, but it is useful for statistical models such as logistic regression. The reasoning is that statistical modelling tends to work better with more symmetric distributions, such as the normal or uniform distribution.

Let's plot the distributions of the remaining continuous variables:

In [None]:
pd.plotting.scatter_matrix(churn, figsize=(15, 15))
plt.show()

From the plot above we can see that `Intl Calls`, `CustServ Calls` and `VMail Message` have right-tailed distributions, so we would need a log or Box-Cox transformation to make them more symmetrically distributed if we were using statistical modelling.
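As a rough illustration of what such a transformation does, the sketch below measures the skewness of a right-tailed variable before and after `log1p`. The values are hypothetical call counts, not taken from the dataset, and `log1p` (rather than a plain log) is chosen here only because the counts include zeros.

```python
import math

def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-tailed counts (most customers make few calls, a few make many).
calls = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4, 9]
log_calls = [math.log1p(c) for c in calls]  # log(1 + x) is defined at zero

skew_before = skewness(calls)
skew_after = skewness(log_calls)

print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness closer to zero means a more symmetric distribution. For the XGBoost models trained below this step is not applied.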

Now that we've cleaned up our dataset, let's determine which algorithm to use. As mentioned above, there appear to be some variables where both high and low (but not intermediate) values are predictive of churn. In order to accommodate this in an algorithm like linear regression, we'd need to generate polynomial (or bucketed) terms. Instead, let's attempt to model this problem using gradient boosted trees. Amazon SageMaker provides an XGBoost container that we can use to train in a managed, distributed setting, and then host as a real-time prediction endpoint. XGBoost uses gradient boosted trees which naturally account for non-linear relationships between features and the target variable, as well as accommodating complex interactions between features.

Amazon SageMaker XGBoost can train on data in either CSV or LibSVM format. For this example, we'll stick with CSV. It should:

- Have the target variable in the first column
- Not have a header row

But first, let's convert our categorical features into numeric features.

In [None]:
model_data = pd.get_dummies(churn)
model_data = pd.concat([model_data['Churn?_True.'], model_data.drop(['Churn?_False.', 'Churn?_True.'], axis=1)], axis=1)
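To make the expected file layout concrete, here is a hand-rolled sketch of the same idea on two made-up records: the target goes first, categorical values become 0/1 indicator columns, and no header row is written. The records and the `encode` helper are hypothetical and only mirror what `pd.get_dummies` and the column reordering do above.

```python
import csv
import io

# Hypothetical toy records, not rows from the dataset.
records = [
    {"Churn?": "False.", "VMail Plan": "yes", "Day Mins": 265.1},
    {"Churn?": "True.", "VMail Plan": "no", "Day Mins": 129.1},
]

def encode(record):
    """Return one CSV row: target first, then numeric and one-hot features."""
    target = 1 if record["Churn?"] == "True." else 0
    vmail_no = 1 if record["VMail Plan"] == "no" else 0
    vmail_yes = 1 if record["VMail Plan"] == "yes" else 0
    return [target, record["Day Mins"], vmail_no, vmail_yes]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(encode(r) for r in records)  # note: no header row is written
csv_text = buffer.getvalue()

print(csv_text)
```

Because there is no header, the column order is the only thing that tells the algorithm which value is the target, so the training, validation, and batch files must all use the same feature order.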

#### Split dataset
And now let's split the data into training, validation, and test sets. This will help prevent us from overfitting the model, and allow us to test the model's accuracy on data it hasn't already seen.

In [None]:
train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
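The two cut points passed to `np.split` give a 70% / 20% / 10% split. The sketch below reproduces only the arithmetic with the standard library, assuming all 3,333 records remain after cleaning; it does not reproduce the exact rows that the pandas shuffle selects.

```python
import random

n_rows = 3333  # record count stated in the data overview (assumes no rows were dropped)
indices = list(range(n_rows))
random.Random(1729).shuffle(indices)  # fixed seed, so the split is repeatable

train_end = int(0.7 * n_rows)       # first cut point
validation_end = int(0.9 * n_rows)  # second cut point

train_idx = indices[:train_end]
validation_idx = indices[train_end:validation_end]
test_idx = indices[validation_end:]

print(len(train_idx), len(validation_idx), len(test_idx))
```

Every row lands in exactly one of the three sets, and the test set is the part that is never shown to the model during training.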

Now we'll upload these files to S3.

In [None]:
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

---
## Train
### Built-in XGBoost model Training
Moving onto training, first we'll need to specify the locations of the XGBoost algorithm containers.

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(boto3.Session().region_name, 'xgboost')

Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree within the algorithm can be built.  Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting.  There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls sampling of the training data.  This technique can help reduce overfitting, but setting it too low can also starve the model of data.
- `num_round` controls the number of boosting rounds.  This is essentially the subsequent models that are trained using the residuals of previous iterations.  Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` controls how aggressive each round of boosting is.  It is the step size (learning rate) applied to each new tree, so smaller values lead to more conservative boosting.
- `gamma` controls how aggressively trees are grown.  Larger values lead to more conservative models.
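- `objective` specifies the learning task.  `binary:logistic` performs logistic regression for binary classification and outputs a probability.
- `base_job_name` is not a hyperparameter but an estimator argument: a user-defined prefix for the training job name.

More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).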
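The interplay between `eta` and `num_round` can be sketched with a deliberately idealized model: assume every tree fits the current residual exactly, so that each round removes a fraction `eta` of whatever error is left. Real trees do not fit residuals exactly, so this is only an intuition aid, and the function name below is hypothetical.

```python
def remaining_residual(eta, num_round):
    """Fraction of the initial residual left after num_round boosting rounds,
    under the idealized assumption that each tree fits the residual exactly."""
    residual = 1.0
    for _ in range(num_round):
        residual -= eta * residual  # only a fraction eta of the correction is applied
    return residual

for eta in (0.05, 0.2, 0.5):
    print(f"eta={eta}: {remaining_residual(eta, num_round=10):.4f} of the residual left after 10 rounds")
```

A small `eta` takes cautious steps and therefore needs more rounds to reach the same fit, which is why the two hyperparameters are usually tuned together.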

In [None]:

model_name = 'DEMO-xgboost-churn' # user defined model name

In [None]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    base_job_name = model_name,
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

### Customised XGBoost model Training
Use your own algorithm (here, the scikit-learn-style API of the `xgboost` package) with Amazon SageMaker.

In this case, I set the hyperparameters to be the same as for the built-in XGBoost. However, if you compare the predictions of the built-in and customised models, you will find that they differ slightly.

In [None]:
# install the xgboost library
! pip install xgboost

In [None]:
#### Split features and target
train_y = train_data.iloc[:,0] 
train_X = train_data.iloc[:,1:]

valid_y = validation_data.iloc[:,0]  
valid_X = validation_data.iloc[:,1:]

In [None]:
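import xgboost # the xgboost package; XGBClassifier is its scikit-learn-style API

# Set up the XGBoost model with the same hyperparameters as the built-in model
customised_xgb = xgboost.XGBClassifier(max_depth=5,
                        gamma=4,
                        learning_rate=0.2,   # called eta in the built-in algorithm
                        subsample=0.8,
                        min_child_weight=6,
                        n_estimators=100,    # called num_round in the built-in algorithm
                        objective='binary:logistic') # binary classification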


customised_xgb.fit(train_X, train_y, # Train it to our data
       eval_set=[(valid_X, valid_y)])

#### Save the trained model file
- The model file name must satisfy the regular expression pattern `^[a-zA-Z0-9](-*[a-zA-Z0-9])*`.
- The model file also needs to be packaged as a gzip-compressed tar archive (`.tar.gz`).
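A quick way to check a candidate name against that pattern before saving is sketched below. The helper name `is_valid_name` is hypothetical, and using `fullmatch` (so that the whole name has to conform) is an assumption about how the rule is meant to be applied.

```python
import re

# Pattern quoted above: alphanumeric characters, with hyphens allowed only between them.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")

def is_valid_name(name):
    """Return True if the whole name conforms to the pattern."""
    return NAME_PATTERN.fullmatch(name) is not None

for candidate in ("DEMO-customised-xgboost-churn", "demo_model", "-demo", "demo-"):
    print(candidate, is_valid_name(candidate))
```

Underscores, leading hyphens, and trailing hyphens are all rejected, which is why the model names in this notebook use hyphens only.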

In [None]:
model_file_name = "DEMO-customised-xgboost-churn" # user-defined model name
customised_xgb.get_booster().save_model(model_file_name) # public accessor instead of the private _Booster attribute

In [None]:
!tar czvf model.tar.gz $model_file_name
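If the `tar` command is not available, the same archive can be built with Python's standard `tarfile` module. This is a sketch that uses a placeholder file in a temporary directory in place of the real saved model; the helper name `make_model_archive` is hypothetical.

```python
import os
import tarfile
import tempfile

def make_model_archive(model_path, archive_path):
    """Write model_path into a gzip-compressed tar archive, stored at the archive root."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_path, arcname=os.path.basename(model_path))
    return archive_path

# Demonstration with a placeholder file standing in for the saved model.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "DEMO-customised-xgboost-churn")
    with open(model_path, "wb") as f:
        f.write(b"placeholder, not a real model")
    archive = make_model_archive(model_path, os.path.join(tmp, "model.tar.gz"))
    with tarfile.open(archive, "r:gz") as tar:
        archive_members = tar.getnames()

print(archive_members)
```

Passing `arcname` keeps the model file at the top level of the archive instead of embedding the local directory path.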

Now, upload the pre-trained model to S3

In [None]:
key = os.path.join(prefix, model_file_name, 'model.tar.gz')
with open("model.tar.gz", 'rb') as fObj: # the file is closed once the upload finishes
    boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(fObj)

Load the customised model artifacts into SageMaker

In [None]:
# URI where a pre-trained model is stored
model_url = 's3://{}/{}'.format(bucket,key)
model_url

In [None]:
customised_model_name = 'DEMO-customised-xgboost-churn' # user defined model name

In [None]:
## Train customised model
sess = sagemaker.Session()

# model_uri: URI of our pre-trained model 
customised_xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    model_uri = model_url, ## load our customised model
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    base_job_name = customised_model_name,
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=sess)

# Most hyperparameters were already set when the pre-trained model was trained
customised_xgb.set_hyperparameters(num_round=50 #The number of rounds for boosting (only used in the console version of XGBoost)
                        )

# start model training
customised_xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}, logs=True)

## Batch Prediction
Batch Transform manages all necessary compute resources, including launching instances to deploy endpoints and deleting them afterward.
- `batch_input`: the input dataset used for prediction (here, the test dataset). It must not contain the target column and must be stored in an S3 bucket.
- `batch_output`: the S3 path where the batch output is written.

In this example, I use the built-in model for batch prediction; the steps are the same for the customised model.

In [None]:
test_data_batch = test_data.iloc[:, 1:] # delete the target column
test_data_batch.to_csv('test_data_Batch.csv',header=False, index = False)
# upload to S3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'batch/test_data_Batch.csv')).upload_file('test_data_Batch.csv')


In [None]:
s3_batch_input = 's3://{}/{}/batch/test_data_Batch.csv'.format(bucket,prefix) # test data used for prediction
s3_batch_output = 's3://{}/{}/batch/batch-inference'.format(bucket, prefix) # specify the location of batch output

### Create Batch job and make batch predictions

In [None]:
# creates a transformer object from the trained model
transformer = xgb.transformer(
                          instance_count=1,
                          instance_type='ml.m4.xlarge',
                          output_path=s3_batch_output)

# calls that object's transform method to create a transform job
transformer.transform(data=s3_batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')

transformer.wait()

## Evaluate
We have made batch predictions, and the full predicted output has been written to a `.csv.out` file in the `s3_batch_output` folder.

Now we can read in the batch output and compare it with the actual labels in the test data for model evaluation. In this example, I use the built-in model for evaluation; the steps are the same for the customised model.

In [None]:
batch_output = 's3://{}/{}/batch/batch-inference/test_data_Batch.csv.out'.format(bucket,prefix)
batch_output = pd.read_csv(batch_output, header=None, encoding = "ISO-8859-1") # header = none
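The output location hard-coded above follows from how batch transform names its results: the output object carries the input file name with `.out` appended, under the output path. A small sketch of that rule for a single input file is shown below; the helper name and the bucket and prefix values are placeholders, not real resources.

```python
import posixpath

def transform_output_uri(input_uri, output_prefix):
    """Output URI for one input object: '<output_prefix>/<input file name>.out'."""
    file_name = posixpath.basename(input_uri)
    return posixpath.join(output_prefix, file_name + ".out")

# Placeholder bucket and prefix, for illustration only.
example_input = "s3://my-bucket/my-prefix/batch/test_data_Batch.csv"
example_output_prefix = "s3://my-bucket/my-prefix/batch/batch-inference"

example_output = transform_output_uri(example_input, example_output_prefix)
print(example_output)
```

Deriving the path this way avoids having to edit a second string whenever the input file name changes.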

_Note, due to randomized elements of the algorithm, your results may differ slightly._

An important point here is that because of the `np.round()` function in the next cell we are using a simple threshold (or cutoff) of 0.5.  Our predictions from `xgboost` come out as continuous values between 0 and 1 and we force them into the binary classes that we began with.  However, because a customer that churns is expected to cost the company more than proactively trying to retain a customer who we think might churn, we should consider adjusting this cutoff.  That will almost certainly increase the number of false positives, but it can also be expected to increase the number of true positives and reduce the number of false negatives.

We first evaluate the thresholded predictions. Then, to get a rough intuition for the cutoff, we look at the continuous values of our predictions.

In [None]:
pred_y = np.round(batch_output) # threshold is 0.5

In [None]:
def get_score(y_true,y_pred):
    f1 = metrics.f1_score(y_true, y_pred)
    precision = metrics.precision_score(y_true, y_pred)
    recall = metrics.recall_score(y_true, y_pred)
    accuracy = metrics.accuracy_score(y_true, y_pred)
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
    return precision, recall, f1, accuracy, tn, fp, fn, tp

In [None]:
# actual labels: the first column of the test set holds the target
test_y = test_data.iloc[:, 0]

# get scores
temp_precision, temp_recall, temp_f1, temp_accuracy, tn, fp, fn, tp = get_score(test_y, pred_y)
output = [temp_precision,temp_recall,temp_f1,temp_accuracy,tp, fp, tn, fn]
output = pd.Series(output, index=['precision', 'recall', 'f1', 'accuracy', 'tp', 'fp', 'tn', 'fn']) 
print(output[['accuracy', 'tp', 'fp', 'tn', 'fn']])

from sklearn.metrics import classification_report
print(classification_report(test_y, pred_y))

Of the 48 churners, we've correctly predicted 39 of them (true positives). And, we incorrectly predicted 4 customers would churn who then ended up not doing so (false positives).  There are also 9 customers who ended up churning, that we predicted would not (false negatives).
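The headline metrics follow directly from these counts. As a worked example, using only the three numbers quoted above (39 true positives, 4 false positives, 9 false negatives):

```python
tp, fp, fn = 39, 4, 9  # counts quoted in the paragraph above

precision = tp / (tp + fp)  # of the customers flagged as churners, how many really churned
recall = tp / (tp + fn)     # of the real churners, how many were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
```

Your own run may produce different counts, in which case the same formulas apply to the values returned by `get_score`.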

In [None]:
plt.hist(batch_output[0]) # raw predicted probabilities, before rounding
plt.show()

The continuous valued predictions coming from our model tend to skew toward 0 or 1, but there is sufficient mass between 0.1 and 0.9 that adjusting the cutoff should indeed shift a number of customers' predictions. 
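One way to choose the cutoff is to attach a cost to each kind of outcome and pick the cutoff with the lowest total. The sketch below does this on made-up scores and labels. The two cost figures are assumptions for illustration only, and the model is simplified so that every flagged customer receives an incentive and every missed churner is lost.

```python
# Hypothetical costs (assumptions, not figures from a real operator).
COST_FALSE_NEGATIVE = 500  # a churner we failed to flag
COST_INCENTIVE = 100       # retention offer sent to every flagged customer

def total_cost(probabilities, labels, cutoff):
    """Total cost of flagging every customer whose score exceeds the cutoff."""
    cost = 0
    for p, churned in zip(probabilities, labels):
        if p > cutoff:
            cost += COST_INCENTIVE       # true or false positive: we pay for the offer
        elif churned:
            cost += COST_FALSE_NEGATIVE  # false negative: the customer leaves
    return cost

# Toy scores and true labels (1 = churned); these are not model output.
probabilities = [0.05, 0.10, 0.30, 0.40, 0.55, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1]

costs = {cutoff: total_cost(probabilities, labels, cutoff) for cutoff in (0.2, 0.5, 0.7)}
best_cutoff = min(costs, key=costs.get)

print(costs)
print("cheapest cutoff:", best_cutoff)
```

With a missed churner costing more than an incentive, the cheapest cutoff in this toy example is below 0.5, which matches the reasoning given earlier for lowering the threshold.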

## Extensions
### Call existing batch job in sagemaker

When we have already trained a model and want to reuse it to predict on new data, we can call the existing model and create a new batch transform job.
- `batch_input`: any new, cleaned dataset for which we want predictions from the existing model (in this example, I reuse the test data for the demo)
- `batch_output`: the S3 location of the batch output
- `ModelName`: the name of the existing model. You can find it in `Amazon SageMaker dashboard` >- `Inference` >- `Models`. For example, DEMO-xgboost-churn-2019-08-08-03-30-05-889
- `TransformJobName`: a user-defined name for the new batch transform job

In [None]:
from time import gmtime, strftime
batch_input = 's3://{}/{}/batch/test_data_Batch.csv'.format(bucket,prefix) # test data used for prediction
batch_output = 's3://{}/{}/batch/batch-inference/test_data_Batch.csv.out'.format(bucket,prefix)
Modelname = '<your_model_name_here>' # the model name we already have
transformJobName = 'DEMO-xgboost-churn-call-batch'+ strftime("%Y-%m-%d-%H-%M-%S", gmtime())

In [None]:
client = boto3.client('sagemaker')

In [None]:
create_batch = client.create_transform_job(
    TransformJobName=transformJobName,
    ModelName=Modelname,
    MaxConcurrentTransforms=0,
    MaxPayloadInMB=6,
    BatchStrategy='MultiRecord',
    TransformInput={
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': batch_input 
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None',
        'SplitType': 'Line'
    },
    TransformOutput={
        'S3OutputPath': batch_output,
        'AssembleWith': 'Line'
    },
    TransformResources={
        'InstanceType': 'ml.m4.xlarge',
        'InstanceCount': 1
    }
    )

Now go to `Amazon SageMaker dashboard` >- `Inference` >- `Batch transform jobs`, where the new batch job is shown as `InProgress`.

After the batch job status changes to `Completed`, you can go to the batch output folder to download the new predictions.
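Instead of watching the dashboard, the job status can also be polled from code with `describe_transform_job`. The following is a minimal sketch: the helper name, polling interval, and polling limit are assumptions, and the client is passed in as an argument so that the function can be tried without touching AWS.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_transform_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll describe_transform_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        response = client.describe_transform_job(TransformJobName=job_name)
        status = response["TransformJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")

# Usage in this notebook (commented out because it calls AWS):
# status = wait_for_transform_job(client, transformJobName)
# print(status)
```

Returning the final status rather than raising on `Failed` leaves it to the caller to decide how to react, for example by inspecting the failure reason in the job description.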

### (Optional) Clean-up

If you're ready to be done with this notebook, please go to `Amazon SageMaker dashboard` >- `Inference` >- `Models` and delete the corresponding model.