# Customer Churn Prediction with XGBoost
<p>Using Gradient Boosted Trees to Predict Mobile Customer Departure</p>

## Background

_This notebook has been adapted from an [AWS blog post](https://aws.amazon.com/blogs/ai/predicting-customer-churn-with-amazon-machine-learning/)_

Losing customers is costly for any business.  Identifying unhappy customers early on gives you a chance to offer them incentives to stay.  This notebook describes using machine learning (ML) for the automated identification of unhappy customers, also known as customer churn prediction.

## Setup

### Imports

In [28]:
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker import get_execution_role
import numpy as np  # used by predict() and the confusion matrix below
import pandas as pd
import sagemaker
import boto3
import re

In [29]:
sagemaker.__version__

'1.67.1.post0'

### S3 Bucket + Session + Role

In [30]:
bucket = 'arunprsh-sagemaker'
prefix = 'xgboost-churn'
session = sagemaker.Session()
role = get_execution_role()
role

'arn:aws:iam::892313895307:role/service-role/AmazonSageMaker-ExecutionRole-20200722T144616'

### Container

In [31]:
container_image_uri = get_image_uri(boto3.Session().region_name, 'xgboost', '1.0-1')
container_image_uri

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.


'683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3'

### S3 Pointers 

In [32]:
s3_train_data_path = 's3://{}/{}/train'.format(bucket, prefix)
s3_train_pointer = sagemaker.s3_input(s3_data=s3_train_data_path, content_type='csv')
s3_validation_data_path = 's3://{}/{}/validation'.format(bucket, prefix)
s3_validation_pointer = sagemaker.s3_input(s3_data=s3_validation_data_path, content_type='csv')
s3_test_data_path = 's3://{}/{}/test'.format(bucket, prefix)
s3_test_pointer = sagemaker.s3_input(s3_data=s3_test_data_path, content_type='csv')

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


In [33]:
s3_train_pointer.__dict__

{'config': {'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
    'S3Uri': 's3://arunprsh-sagemaker/xgboost-churn/train',
    'S3DataDistributionType': 'FullyReplicated'}},
  'ContentType': 'csv'}}
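
The printed `config` is the channel description that the training job receives. As a rough illustration, a hypothetical helper (`s3_channel_config`, not part of the SageMaker SDK) that builds the same dictionary by hand might look like this:

```python
def s3_channel_config(s3_uri, content_type="csv", distribution="FullyReplicated"):
    """Build a channel description shaped like the `config` dict printed above.

    Hypothetical helper for illustration only; it is not part of the SageMaker SDK.
    """
    return {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": distribution,
            }
        },
        "ContentType": content_type,
    }


train_config = s3_channel_config("s3://arunprsh-sagemaker/xgboost-churn/train")
print(train_config["DataSource"]["S3DataSource"]["S3Uri"])
```

`S3Prefix` means every object under the prefix is used as input, and `FullyReplicated` means each training instance receives a full copy of the data.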

#### Check Data

In [40]:
s3_train_data_path

's3://arunprsh-sagemaker/xgboost-churn/train'

In [39]:
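# Note: the file has no header row, so without header=None pandas uses the
# first record as column names (visible in the output below).
train_df = pd.read_csv('/'.join([s3_train_data_path, 'train.csv']), sep=',')
train_df.head()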

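|   | 0 | 106 | 0.1 | 274.4 | 120 | 198.6 | 82 | 160.8 | 62 | 6.0 | ... | 0.49 | 0.50 | 0.51 | 0.52 | 0.53 | 1.2 | 1.3 | 0.54 | 1.4 | 0.55 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 28 | 0 | 187.8 | 94 | 248.6 | 86 | 208.8 | 124 | 10.6 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 148 | 0 | 279.3 | 104 | 201.6 | 87 | 280.8 | 99 | 7.9 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 0 | 132 | 0 | 191.9 | 107 | 206.9 | 127 | 272.0 | 88 | 12.6 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 3 | 0 | 92 | 29 | 155.4 | 110 | 188.5 | 104 | 254.9 | 118 | 8.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 131 | 25 | 192.7 | 85 | 225.9 | 105 | 254.2 | 59 | 10.9 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |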


## Train

Now we can specify the type and number of training instances we'd like to use, as well as our XGBoost hyperparameters.  A few key hyperparameters are:
- `max_depth` controls how deep each tree can be built.  Deeper trees can fit the data better, but are more computationally expensive and more prone to overfitting.  There is typically a trade-off to explore between a large number of shallow trees and a smaller number of deeper trees.
- `subsample` controls the fraction of the training data sampled in each boosting round.  This can help reduce overfitting, but setting it too low can starve the model of data.
- `num_round` controls the number of boosting rounds.  Each round trains a new tree on the residuals left by the previous rounds.  More rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
- `eta` is the step size (learning rate) applied to each round of boosting.  Smaller values lead to more conservative boosting.
- `gamma` is the minimum loss reduction required to split a node further.  Larger values lead to more conservative models.

More detail on XGBoost's hyperparameters can be found on their GitHub [page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md).
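
To make the role of `eta` concrete, here is a toy sketch, not XGBoost itself: each boosting round fits the simplest possible learner (the mean of the current residuals) and adds it to the prediction scaled by `eta`. The function name and the data are made up for illustration.

```python
def boost_constant(targets, eta, num_round):
    """Toy boosting with a constant base learner under squared error.

    Each round fits the mean of the current residuals and adds it to the
    running prediction, scaled by eta. Illustrative only; XGBoost fits trees.
    """
    prediction = 0.0
    history = []
    for _ in range(num_round):
        residuals = [t - prediction for t in targets]
        step = sum(residuals) / len(residuals)  # best constant fit to the residuals
        prediction += eta * step                # shrink the step by eta
        history.append(prediction)
    return history


targets = [1.0, 2.0, 3.0]  # made-up data with mean 2.0
slow = boost_constant(targets, eta=0.2, num_round=5)
fast = boost_constant(targets, eta=1.0, num_round=5)
print(slow)
print(fast)
```

With `eta=1.0` the full correction is applied in the first round, while with `eta=0.2` the gap to the target mean shrinks by a factor of 0.8 per round. This is why smaller values of `eta` usually call for a larger `num_round`.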

In [43]:
xgb = sagemaker.estimator.Estimator(container_image_uri,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    sagemaker_session=session)

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_train_pointer, 'validation': s3_validation_pointer})



2020-07-23 01:24:22 Starting - Starting the training job...
2020-07-23 01:24:26 Starting - Launching requested ML instances.........
2020-07-23 01:26:08 Starting - Preparing the instances for training......
2020-07-23 01:27:09 Downloading - Downloading input data......
2020-07-23 01:28:22 Training - Training image download completed. Training in progress...
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
[01:28:25] 2333x69 matrix with 160977 entries

In [45]:
xgb.__dict__

{'image_name': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.0-1-cpu-py3',
 'hyperparam_dict': {'max_depth': 5,
  'eta': 0.2,
  'gamma': 4,
  'min_child_weight': 6,
  'subsample': 0.8,
  'silent': 0,
  'objective': 'binary:logistic',
  'num_round': 100},
 'role': 'arn:aws:iam::892313895307:role/service-role/AmazonSageMaker-ExecutionRole-20200722T144616',
 'train_instance_count': 1,
 'train_instance_type': 'ml.m4.xlarge',
 'train_volume_size': 30,
 'train_volume_kms_key': None,
 'train_max_run': 86400,
 'input_mode': 'File',
 'tags': None,
 'metric_definitions': None,
 'model_uri': None,
 'model_channel_name': 'model',
 'code_uri': None,
 'code_channel_name': 'code',
 'sagemaker_session': <sagemaker.session.Session at 0x7f61144949e8>,
 'base_job_name': None,
 '_current_job_name': 'sagemaker-xgboost-2020-07-23-01-24-22-116',
 'output_path': 's3://arunprsh-sagemaker/xgboost-churn/output',
 'output_kms_key': None,
 'latest_training_job': <sagemaker.estimator._TrainingJo

## Host

Now that we've trained a model, let's deploy it to a hosted endpoint.

In [46]:
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')



---------------!

In [49]:
xgb_predictor.endpoint

'sagemaker-xgboost-2020-07-23-01-24-22-116'

In [50]:
xgb_predictor.__dict__

{'endpoint': 'sagemaker-xgboost-2020-07-23-01-24-22-116',
 'sagemaker_session': <sagemaker.session.Session at 0x7f61144949e8>,
 'serializer': None,
 'deserializer': None,
 'content_type': None,
 'accept': None,
 '_endpoint_config_name': 'sagemaker-xgboost-2020-07-23-01-24-22-116',
 '_model_names': <map at 0x7f6113b36390>}

## Evaluate

Now that we have a hosted endpoint running, we can make real-time predictions from our model by sending an HTTP POST request.  But first, we need to set up serialization so that our test data, held as NumPy arrays, can be passed as CSV to the model behind the endpoint.

In [52]:
from sagemaker.predictor import csv_serializer

In [53]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
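
For reference, the request that the predictor sends can also be issued directly. The sketch below builds a CSV payload by hand and, assuming `boto3` is installed and AWS credentials are configured, posts it with the `sagemaker-runtime` client's `invoke_endpoint` call. The helper names `rows_to_csv` and `invoke_csv` are hypothetical.

```python
def rows_to_csv(rows):
    """Serialize rows of numbers as CSV text, one line per row, no header."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def invoke_csv(endpoint_name, rows):
    """POST a CSV payload to a SageMaker endpoint and return the raw response text.

    Assumes boto3 is installed and AWS credentials and region are configured.
    """
    import boto3  # imported lazily so rows_to_csv works without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=rows_to_csv(rows),
    )
    return response["Body"].read().decode("utf-8")


# Two shortened, made-up rows, only to show the payload format
payload = rows_to_csv([[132, 25, 113.2], [112, 17, 183.2]])
print(payload)
```

The two short rows are only for display. A real request needs every feature column in each row (the training log reports 69), without the label.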

Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows
1. Convert those mini-batches to CSV string payloads
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array

In [54]:
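def predict(data, rows=500):
    # Split the rows into nearly equal mini-batches of at most `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # The endpoint returns comma-separated scores as bytes; append them to one CSV string
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    # Drop the leading comma and parse the scores into a NumPy array
    return np.fromstring(predictions[1:], sep=',')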
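
The batch count in `predict` is `int(n / rows + 1)`, so the batches are usually smaller than `rows`. The pure-Python sketch below mirrors how `np.array_split` divides `n` rows into that many nearly equal parts; `batch_sizes` is a hypothetical helper, not part of NumPy.

```python
def batch_sizes(n_rows, rows=500):
    """Return the mini-batch sizes that predict() would produce for n_rows rows.

    Mirrors np.array_split: the first (n_rows % sections) batches get one
    extra row. Hypothetical helper for illustration.
    """
    sections = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, sections)
    return [base + 1] * extra + [base] * (sections - extra)


print(batch_sizes(333))    # [333]
print(batch_sizes(1000))   # [334, 333, 333]
```

A test set of a few hundred rows therefore goes to the endpoint in a single request, while 1,000 rows are sent as three requests rather than two.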

In [55]:
# As with train.csv, the file has no header row, so the first record becomes the column names
test_df = pd.read_csv('/'.join([s3_test_data_path, 'test.csv']), sep=',')
test_df.head()

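|   | 0 | 186 | 0.1 | 137.8 | 97 | 187.7 | 118 | 146.4 | 85 | 8.7 | ... | 0.49 | 0.50 | 0.51 | 0.52 | 0.53 | 1.2 | 1.3 | 0.54 | 1.4 | 0.55 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 132 | 25 | 113.2 | 96 | 269.9 | 107 | 229.1 | 87 | 7.1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 112 | 17 | 183.2 | 95 | 252.8 | 125 | 156.7 | 95 | 9.7 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 91 | 24 | 93.5 | 112 | 183.4 | 128 | 240.7 | 133 | 9.9 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 3 | 0 | 22 | 0 | 110.3 | 107 | 166.5 | 93 | 202.3 | 96 | 9.5 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 0 | 102 | 0 | 186.8 | 92 | 173.7 | 123 | 250.9 | 131 | 9.7 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |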


In [58]:
predictions = predict(test_df.to_numpy()[:, 1:])

In [59]:
predictions

array([0.01076382, 0.00815068, 0.17197835, 0.00727295, 0.02683159,
       0.77646941, 0.01884228, 0.15261421, 0.02097434, 0.01072737,
       0.04403256, 0.01101517, 0.02866043, 0.01557919, 0.01282453,
       0.02490271, 0.03327551, 0.21406855, 0.02446287, 0.00662337,
       0.69447809, 0.02028675, 0.02458457, 0.0297976 , 0.11261535,
       0.01323374, 0.98956352, 0.01778056, 0.08464119, 0.01221189,
       0.03980241, 0.01717943, 0.02953824, 0.01349098, 0.74248719,
       0.03136658, 0.00886904, 0.02141312, 0.04439792, 0.00824427,
       0.07754114, 0.01717059, 0.00951407, 0.02062972, 0.8589865 ,
       0.04142113, 0.01364505, 0.02368555, 0.11331131, 0.01155735,
       0.03511472, 0.04872791, 0.031212  , 0.43253064, 0.01630785,
       0.01716655, 0.52175391, 0.01216088, 0.03639773, 0.37913007,
       0.01785273, 0.13224347, 0.04007643, 0.00741498, 0.00938071,
       0.01384279, 0.00928919, 0.04003248, 0.03315024, 0.01156359,
       0.34496883, 0.01788814, 0.02180807, 0.03097621, 0.76915

In [60]:
THRESHOLD = 0.3
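
Setting the cutoff below 0.5 flags more customers as likely churners. A small sketch using a handful of scores copied from the predictions above shows the effect; `to_labels` is a hypothetical helper that applies the same `>` comparison as the cell that follows.

```python
def to_labels(scores, threshold):
    """Label a score as churn (1) when it exceeds the threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]


# A few scores taken from the predictions array above
scores = [0.01076382, 0.77646941, 0.43253064, 0.52175391, 0.37913007, 0.34496883]

print(to_labels(scores, 0.3))
print(to_labels(scores, 0.5))
```

Three of these six scores fall between 0.3 and 0.5, so they count as churn at the lower cutoff but not at 0.5.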

In [61]:
pd.crosstab(index=test_df.iloc[:, 0], columns=np.where(predictions > THRESHOLD, 1, 0))

| 0 \ col_0 | 0 | 1 |
|---|---|---|
| 0 | 278 | 7 |
| 1 | 7 | 41 |
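
Reading the table with actual labels on the rows and thresholded predictions on the columns, the four counts can be turned into summary metrics. This is plain arithmetic on the numbers above; the variable names are chosen for illustration.

```python
# Counts from the confusion matrix above (rows: actual, columns: predicted)
tn, fp = 278, 7
fn, tp = 7, 41

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted churners who actually churned
recall = tp / (tp + fn)      # share of actual churners that were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Of the 48 customers who churned, 41 were flagged, at the cost of 7 false alarms among the 285 who stayed.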
